Deploy LeanXcale using Kubernetes
1. Introduction
When deploying LeanXcale with Kubernetes, you distribute the database across different PODs and enable high availability (HA) through data and process replication, so that the database keeps working if one node fails.
All the PODs are created from the same Docker image, but each one must be parametrized according to the function it performs in the cluster.
If you are setting up an HA environment, you will need at least three Kubernetes hosts. Two would be enough to distribute the PODs (so if one host fails, the other keeps all functions working), but a third server is needed to guarantee that the ZooKeeper PODs can always reach a majority during leader election and so preserve consistency under network partitions.
The following functions are considered:
2. ZooKeeper
ZooKeeper is the configuration master for all the components. It handles leader election and keeps the heartbeat for all the components in the system.
For these reasons, the ZooKeeper POD must be a StatefulSet and, in High Availability deployments, it must be configured with at least 3 replicas so that leader election always has a majority.
This POD has to be configured with INSTANCES set to ZK so that it starts ZooKeeper. The other important part of this configuration is the ZK environment variable. It has to be set to the list of all ZooKeeper servers in the cluster and must be configured in every POD, because it is the reference used to fetch the configuration and for the heartbeat. Note also that this list must use host names that resolve according to your network configuration.
env:
- name: INSTANCES
  value: "ZK"
- name: ZK
  value: "zk-0.zk-hs,zk-1.zk-hs,zk-2.zk-hs"
A complete configuration for ZooKeeper PODs in HA follows:
apiVersion: v1
kind: Service
metadata:
  name: zk-hs
  labels:
    app: zk
spec:
  ports:
  - port: 2888
    name: server
  - port: 3888
    name: leader-election
  clusterIP: None
  selector:
    app: zk
---
apiVersion: v1
kind: Service
metadata:
  name: zk-cs
  labels:
    app: zk
spec:
  ports:
  - port: 2181
    name: client
  selector:
    app: zk
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zk
spec:
  serviceName: zk-hs
  replicas: 3
  selector:
    matchLabels:
      app: zk
  updateStrategy:
    type: RollingUpdate
  podManagementPolicy: OrderedReady
  template:
    metadata:
      labels:
        app: zk
    spec:
      terminationGracePeriodSeconds: 10
      securityContext:
        fsGroup: 1000
      containers:
      - name: zk
        image: docker.leanxcale.com/lx-docker-hc:1.7a
        env:
        - name: INSTANCES
          value: "ZK"
        - name: MEM
          value: "10"
        - name: ZK
          value: "zk-0.zk-hs,zk-1.zk-hs,zk-2.zk-hs"
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - python3 /lx/LX-BIN/scripts/lxManageNode.py check ZK
          timeoutSeconds: 5
          periodSeconds: 10
        volumeMounts:
        - name: local-pvc-zk
          mountPath: "/lx/LX-DATA"
  volumeClaimTemplates:
  - metadata:
      name: local-pvc-zk
      namespace: default
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 500Mi
      storageClassName: "local-storage"
3. MtM
This is a StatefulSet POD that provides the master timestamping and configuration functions. The commit sequencer is in charge of distributing commit timestamps to the Local Transactional Managers. The Snapshot Server provides the freshest coherent snapshot on which new transactions can be started. The configuration manager handles system configuration and deployment information, and it also monitors the other components, persisting this state in the ZooKeeper component described in the previous section.
If you are using HA, you need two replicas of this POD.
Data replication is defined in terms of mirroring. When you define a data partition, it is replicated as many times as indicated by the MIRROR_SIZE variable. This has an important implication: if you set MIRROR_SIZE to 2, the number of KiVi datastores must always be a multiple of 2. You can grow the cluster by increasing the number of replicas, but you have to add them two at a time, that is, one new mirroring group at a time.
The configuration parameters specific to this POD are:
- name: INSTANCES
  value: "MtM"
- name: MMMEM
  value: "1"
- name: MIRROR_SIZE
  value: "2"
- name: ZK
  value: "zk-0.zk-hs,zk-1.zk-hs,zk-2.zk-hs"
MMMEM states that 1 GiB of memory will be allocated to this component.
The full POD configuration with HA is:
---
apiVersion: v1
kind: Service
metadata:
  name: mtm-service
  labels:
    app: mtm
spec:
  ports:
  - port: 10500
    name: lxconsole
    protocol: TCP
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mtm
spec:
  serviceName: mtm-service
  replicas: 2
  selector:
    matchLabels:
      app: mtm
  updateStrategy:
    type: RollingUpdate
  podManagementPolicy: OrderedReady
  template:
    metadata:
      labels:
        app: mtm
    spec:
      terminationGracePeriodSeconds: 10
      securityContext:
        fsGroup: 1000
      containers:
      - name: mtm
        image: docker.leanxcale.com/lx-docker-hc:1.7a
        env:
        - name: INSTANCES
          value: "MtM"
        - name: MEM
          value: "10"
        - name: MMMEM
          value: "1"
        - name: MIRROR_SIZE
          value: "2"
        - name: ZK
          value: "zk-0.zk-hs,zk-1.zk-hs,zk-2.zk-hs"
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - python3 /lx/LX-BIN/scripts/lxManageNode.py check MtM
          initialDelaySeconds: 5
          timeoutSeconds: 5
          periodSeconds: 30
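If you want to check an MtM replica by hand, a simple sketch is to run the same script the liveness probe uses through kubectl exec (shown here for the first replica):
# Run the health-check script used by the liveness probe inside mtm-0
kubectl exec mtm-0 -- python3 /lx/LX-BIN/scripts/lxManageNode.py check MtM
# A non-zero exit code means the probe would fail and the POD would be restarted
echo $?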
4. KVMS
KVMS is the metadata server for LeanXcale’s key/value datastores. This is also a StatefulSet POD.
In case you’re using HA, you need to start at least 2 replicas that will connect in a master/slave configuration:
env:
- name: INSTANCES
  value: "KVMS"
- name: KVMS
  value: "kvms-0.ks kvms-1.ks"
- name: ZK
  value: "zk-0.zk-hs,zk-1.zk-hs,zk-2.zk-hs"
An important note is that the value of the KVMS variable has to contain the list of KVMS PODs named according to the Kubernetes hostname.subdomain convention for PODs (see https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-hostname-and-subdomain-fields):
---
apiVersion: v1
kind: Service
metadata:
  name: ks
  labels:
    app: kvms
spec:
  ports:
  - name: kvms
    port: 14400
    protocol: TCP
  clusterIP: None
  selector:
    app: kvms
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kvms
spec:
  serviceName: ks
  replicas: 2
  selector:
    matchLabels:
      app: kvms
  updateStrategy:
    type: RollingUpdate
  podManagementPolicy: OrderedReady
  template:
    metadata:
      labels:
        app: kvms
    spec:
      terminationGracePeriodSeconds: 10
      securityContext:
        fsGroup: 1000
      containers:
      - name: kvms
        image: docker.leanxcale.com/lx-docker-hc:1.7a
        env:
        - name: INSTANCES
          value: "KVMS"
        - name: KVMS
          value: "kvms-0.ks kvms-1.ks"
        - name: MEM
          value: "10"
        - name: ZK
          value: "zk-0.zk-hs,zk-1.zk-hs,zk-2.zk-hs"
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - python3 /lx/LX-BIN/scripts/lxManageNode.py check KVMS
          initialDelaySeconds: 10
          timeoutSeconds: 5
          periodSeconds: 30
        volumeMounts:
        - name: local-pvc-kvms
          mountPath: /lx/LX-DATA
  volumeClaimTemplates:
  - metadata:
      name: local-pvc-kvms
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 500Mi
      storageClassName: local-storage
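To confirm that the names listed in the KVMS variable resolve as expected through the headless service ks, you can query the cluster DNS from a temporary POD. This is only a sketch; the busybox image and the dns-check name are arbitrary choices:
# Resolve the per-POD DNS record created by the headless service "ks"
kubectl run dns-check --rm -it --restart=Never --image=busybox -- nslookup kvms-0.ks.default.svc.cluster.local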
5. KVDS
These are the key-value datastores. Each datastore persists a part of the data, and you can have as many datastore PODs as you need to manage your data. The more datastores, the faster you can process your data, but you should also be careful with data partitioning and distribution to take full advantage of the resources.
In HA configurations with mirroring, the number of replicas has to be a multiple of the mirror size, so the system scales up while keeping complete mirror groups.
The amount of memory used by each KVDS process is defined by the KVDSMEM environment variable, in GiB. Usually one KVDS handles between 2 GiB and 16 GiB. The right amount of memory and the number of KVDS will depend on the size of your dataset and the workload that LeanXcale will manage.
These are the specific parameters for this POD:
env:
- name: MEM
  value: "10"
- name: INSTANCES
  value: "KVDS"
- name: KVDSMEM
  value: "2"
- name: ZK
  value: "zk-0.zk-hs,zk-1.zk-hs,zk-2.zk-hs"
- name: KVMS
  value: "kvms-0.ks.default.svc.cluster.local kvms-1.ks.default.svc.cluster.local"
- name: KB_KVDS_SERVICE
  value: "kds"
MEM is the total amount of memory for the POD. It is not a hard limit, but it must be higher than KVDSMEM. Note that KVDS PODs need to address the KVMS, so they need the fully qualified network name of the KVMS.
KB_KVDS_SERVICE is the name of the Kubernetes service that governs the KVDS PODs (kds in the example below) and is also needed for networking purposes. If it is not set correctly, the KVDS may not register properly and other components may not be able to connect to them.
A complete configuration for KVDS PODs in HA follows (only 2 replicas are shown):
---
apiVersion: v1
kind: Service
metadata:
  name: kds
  labels:
    app: kvds
spec:
  ports:
  - name: kvds
    port: 9992
    protocol: TCP
  clusterIP: None
  selector:
    app: kvds
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kvds
spec:
  serviceName: kds
  replicas: 2
  selector:
    matchLabels:
      app: kvds
  updateStrategy:
    type: RollingUpdate
  podManagementPolicy: OrderedReady
  template:
    metadata:
      labels:
        app: kvds
    spec:
      terminationGracePeriodSeconds: 10
      securityContext:
        fsGroup: 1000
      containers:
      - name: kvds
        image: docker.leanxcale.com/lx-docker-hc:1.7a
        #imagePullPolicy: Always
        env:
        - name: INSTANCES
          value: "KVDS"
        - name: MEM
          value: "10"
        - name: KVDSMEM
          value: "2"
        - name: ZK
          value: "zk-0.zk-hs,zk-1.zk-hs,zk-2.zk-hs"
        - name: KVMS
          value: "kvms-0.ks.default.svc.cluster.local kvms-1.ks.default.svc.cluster.local"
        - name: KB_KVDS_SERVICE
          value: "kds"
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - python3 /lx/LX-BIN/scripts/lxManageNode.py check KVDS
          initialDelaySeconds: 12
          periodSeconds: 30
        volumeMounts:
        - name: local-pvc-ds
          mountPath: /lx/LX-DATA
  volumeClaimTemplates:
  - metadata:
      name: local-pvc-ds
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 500Mi
      storageClassName: "local-storage"
6. CflM
This is the Conflict Manager POD. Conflict management can be scaled out to as many components as needed, so for a large transactional system you can run many Conflict Managers.
If using HA, you need at least two.
The configuration parameters specific to this POD are:
- name: INSTANCES
  value: "CflM"
- name: CFLICTMEM
  value: "2"
- name: ZK
  value: "zk-0.zk-hs,zk-1.zk-hs,zk-2.zk-hs"
CFLICTMEM states that 2 GiB of memory will be allocated to each of these components.
The full POD configuration with HA enabled is:
---
apiVersion: v1
kind: Service
metadata:
  name: cflm-service
  labels:
    app: cflm
spec:
  ports:
  - port: 13100
    name: conflict
    protocol: TCP
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cflm
spec:
  serviceName: cflm-service
  replicas: 2
  selector:
    matchLabels:
      app: cflm
  updateStrategy:
    type: RollingUpdate
  podManagementPolicy: OrderedReady
  template:
    metadata:
      labels:
        app: cflm
    spec:
      terminationGracePeriodSeconds: 10
      securityContext:
        fsGroup: 1000
      containers:
      - name: cflm
        image: docker.leanxcale.com/lx-docker-hc:1.7a
        env:
        - name: INSTANCES
          value: "CflM"
        - name: MEM
          value: "10"
        - name: CFLICTMEM
          value: "2"
        - name: ZK
          value: "zk-0.zk-hs,zk-1.zk-hs,zk-2.zk-hs"
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - python3 /lx/LX-BIN/scripts/lxManageNode.py check CflM
          initialDelaySeconds: 10
          timeoutSeconds: 5
          periodSeconds: 30
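Under a heavy transactional workload you can add Conflict Managers by scaling the StatefulSet. A sketch (the final replica count of 4 is just an example and depends on your workload):
# Grow the number of Conflict Managers from 2 to 4
kubectl scale statefulset/cflm --replicas=4
kubectl rollout status statefulset/cflm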
7. Query Engine Deployment
This process executes the queries against the database, so it is the point where clients connect to get results. Thanks to Kubernetes, it is very easy to create as many replicas as you need as the client load on the system increases.
The memory used by the Query Engine process is defined by the QEMEM environment variable.
You can also define a service to expose the Query Engine port outside the Kubernetes cluster (in the example, nodePort 31529).
---
apiVersion: v1
kind: Service
metadata:
  name: qe
  labels:
    app: qe
spec:
  type: NodePort
  selector:
    app: qe
  ports:
  - port: 1529
    targetPort: 1529
    nodePort: 31529
    name: query
    protocol: TCP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qe
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qe
  template:
    metadata:
      labels:
        app: qe
    spec:
      terminationGracePeriodSeconds: 10
      securityContext:
        fsGroup: 1000
      containers:
      - name: qe
        image: docker.leanxcale.com/lx-docker-hc:1.7a
        ports:
        - containerPort: 1529
        env:
        - name: INSTANCES
          value: "QE"
        - name: MEM
          value: "10"
        - name: QEMEM
          value: "1"
        - name: ZK
          value: "zk-0.zk-hs,zk-1.zk-hs,zk-2.zk-hs"
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - python3 /lx/LX-BIN/scripts/lxManageNode.py check QE
          initialDelaySeconds: 15
          periodSeconds: 30
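Clients outside the cluster connect to the Query Engine through the NodePort (31529 in the example) on any Kubernetes host. A sketch of how to find the port and the node addresses with kubectl:
# The nodePort exposed by the qe service (31529 in this example)
kubectl get service qe -o jsonpath='{.spec.ports[0].nodePort}'
# External or internal IPs of the Kubernetes nodes where that port is reachable
kubectl get nodes -o wide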
8. Loggers
These processes save the state of the database, so they need to be configured as stateful. There are 2 kinds of loggers:
- Loggers for timestamping: timestamping logging is very light, but it allows the system to continue in case of crashes or problems. For HA you need 2 of these loggers.
- Loggers for transactions: you can instantiate as many PODs as required by the workload. Under a heavy transactional workload, you can scale out and distribute logging to avoid any bottleneck in logging transactions.
---
apiVersion: v1
kind: Service
metadata:
  name: loggercssrv
  labels:
    app: logcms
spec:
  ports:
  - name: logger-cms
    port: 13400
    protocol: TCP
  clusterIP: None
  selector:
    app: logcms
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: logcms
spec:
  serviceName: loggercssrv
  replicas: 2
  selector:
    matchLabels:
      app: logcms
  updateStrategy:
    type: RollingUpdate
  podManagementPolicy: OrderedReady
  template:
    metadata:
      labels:
        app: logcms
    spec:
      terminationGracePeriodSeconds: 10
      securityContext:
        fsGroup: 1000
      containers:
      - name: logcms
        image: docker.leanxcale.com/lx-docker-hc:1.7a
        env:
        - name: INSTANCES
          value: "LgCmS"
        - name: MEM
          value: "10"
        - name: LOGMEM
          value: "1"
        - name: ZK
          value: "zk-0.zk-hs,zk-1.zk-hs,zk-2.zk-hs"
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - python3 /lx/LX-BIN/scripts/lxManageNode.py check LgCmS
          initialDelaySeconds: 5
          timeoutSeconds: 5
          periodSeconds: 30
        volumeMounts:
        - name: local-pvc-loggercs
          mountPath: /lx/LX-DATA
  volumeClaimTemplates:
  - metadata:
      name: local-pvc-loggercs
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 500Mi
      storageClassName: local-storage
---
apiVersion: v1
kind: Service
metadata:
  name: loggersrv
  labels:
    app: logtxn
spec:
  ports:
  - name: logger-ltm
    port: 13420
    protocol: TCP
  clusterIP: None
  selector:
    app: logtxn
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: logtxn
spec:
  serviceName: loggersrv
  replicas: 2
  selector:
    matchLabels:
      app: logtxn
  updateStrategy:
    type: RollingUpdate
  podManagementPolicy: OrderedReady
  template:
    metadata:
      labels:
        app: logtxn
    spec:
      terminationGracePeriodSeconds: 10
      securityContext:
        fsGroup: 1000
      containers:
      - name: logtxn
        image: docker.leanxcale.com/lx-docker-hc:1.7a
        env:
        - name: INSTANCES
          value: "LgLTM"
        - name: MEM
          value: "10"
        - name: LOGMEM
          value: "1"
        - name: ZK
          value: "zk-0.zk-hs,zk-1.zk-hs,zk-2.zk-hs"
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - python3 /lx/LX-BIN/scripts/lxManageNode.py check LgLTM
          initialDelaySeconds: 10
          timeoutSeconds: 5
          periodSeconds: 30
        volumeMounts:
        - name: local-pvc-logger
          mountPath: /lx/LX-DATA
  volumeClaimTemplates:
  - metadata:
      name: local-pvc-logger
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 500Mi
      storageClassName: local-storage
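Since both logger StatefulSets persist their state through volume claim templates, it is worth checking that every claim was bound to a Persistent Volume once the PODs start. A sketch (the claim names follow the usual <template>-<pod> Kubernetes naming, e.g. local-pvc-loggercs-logcms-0 and local-pvc-logger-logtxn-0):
# Every logger POD should show a Bound claim here
kubectl get pvc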
9. Persistent Volumes
There are Persistent Volume Claim templates pre-defined inside the StatefulSet definitions, but the Kubernetes cluster administrator needs to define the Persistent Volumes based on the topology of the cluster. The example below defines one of them using the Filesystem volumeMode.
9.1. Persistent Volume
As an example of a Persistent Volume that satisfies the Persistent Volume Claim templates:
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-lx-node-1a
spec:
  capacity:
    storage: 500Mi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /DATA_LX/DOC_KUBERNETES/A
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - ip-172-31-63-60
Consider that you will need a persistent volume for each pod that has a persistent volume claim.
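With the HA replica counts used in this document (3 ZK, 2 KVMS, 2 KVDS, 2 logcms and 2 logtxn PODs), that means 11 Persistent Volumes like the one above, each typically pinned to the right host through its nodeAffinity section. A sketch of how to check that every claim found a volume:
# All volumes should end up Bound; Available volumes are still waiting for a claim
kubectl get pv
kubectl get pvc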
9.2. Storage Size Guidelines
The storage capacities in the examples throughout this document are not representative of the real sizes you may need. For real environments:
- The size needed for ZK pods is relatively small because they only hold the ZooKeeper configuration; 10 GiB should be more than enough.
- The size for logcms pods, the loggers for timestamping, is also small; 10 GiB should be more than enough.
- The size for logtxn pods depends on the amount of transactions and on other activities such as backup policies. Transaction logs usually rotate through several files, and you may need more files to keep incremental backups. To start, you may configure 100 GiB for these pods.
- KVMS pods only store the metadata, so 10 GiB would suffice.
- The size for KVDS pods directly depends on the size of your dataset.
10. Scale UP
You can scale up the components according to the rules in each section. Usually you will scale up KVDS, Query Engine and, less frequently, Conflict Managers and Loggers.
In the following example, you can see how you would scale up the number of Query Engines just by updating the Deployment configuration:
kubectl scale deployment/qe --replicas=4
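For StatefulSet components the command is analogous. For example, to grow the KVDS layer while keeping complete mirror groups (MIRROR_SIZE of 2 in this document), you would go from 2 to 4 replicas:
# Add one full mirroring group (2 more KVDS) so replicas stay a multiple of MIRROR_SIZE
kubectl scale statefulset/kvds --replicas=4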