Deploy LeanXcale using Kubernetes

1. Introduction

To deploy LeanXcale with Kubernetes, you configure the database across different PODs to provide High Availability (HA) through data and process replication, so that if one node fails the database keeps working.

All the PODs are created from the same Docker image, but each one must be parametrized according to its function in the cluster.

If you are setting up an HA environment, you will need at least three Kubernetes hosts. Two are enough to distribute the PODs (so that if one host fails, the other keeps all functions running), but the third server is needed to guarantee that the ZooKeeper PODs can hold a leader election and preserve consistency under network split partitions.
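
As a quick check before deploying, you can confirm that the cluster has at least three hosts in Ready state:

kubectl get nodes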

The following functions are considered:

2. ZooKeeper

ZooKeeper is the configuration master for all the components. It allows for leader election and keeps the heartbeat for all the components in the system.

For these reasons, the ZooKeeper POD must be a StatefulSet POD and, in High Availability deployments, it must be configured with at least 3 replicas so that leader election always has a majority.

This POD has to be configured with INSTANCES set to ZK to start ZooKeeper. The other important part of this configuration is the ZK environment variable. It has to be set to the list of all ZooKeeper servers in the cluster and has to be configured in all the PODs, because it is the reference used to fetch the configuration and to report heartbeats. Note also that the list has to use host names that resolve according to your network configuration.

  env:
  - name: INSTANCES
    value: "ZK"
  - name: ZK
    value: "zk-0.zk-hs,zk-1.zk-hs,zk-2.zk-hs"

A complete configuration for ZooKeeper PODs in HA follows:

apiVersion: v1
kind: Service
metadata:
  name: zk-hs
  labels:
    app: zk
spec:
  ports:
  - port: 2888
    name: server
  - port: 3888
    name: leader-election
  clusterIP: None
  selector:
    app: zk
---
apiVersion: v1
kind: Service
metadata:
  name: zk-cs
  labels:
    app: zk
spec:
  ports:
  - port: 2181
    name: client
  selector:
    app: zk
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zk
spec:
  serviceName: zk-hs
  replicas: 3
  selector:
    matchLabels:
      app: zk
  updateStrategy:
    type: RollingUpdate
  podManagementPolicy: OrderedReady
  template:
    metadata:
      labels:
        app: zk
    spec:
      terminationGracePeriodSeconds: 10
      securityContext:
        fsGroup: 1000
      containers:
        - name: zk
          image: docker.leanxcale.com/lx-docker-hc:1.7a
          env:
            - name: INSTANCES
              value: "ZK"
            - name: MEM
              value: "10"
            - name: ZK
              value: "zk-0.zk-hs,zk-1.zk-hs,zk-2.zk-hs"
          livenessProbe:
            exec:
              command:
                - /bin/sh
                - -c
                - python3 /lx/LX-BIN/scripts/lxManageNode.py check ZK
            timeoutSeconds: 5
            periodSeconds: 10
          volumeMounts:
          - name: local-pvc-zk
            mountPath: "/lx/LX-DATA"

  volumeClaimTemplates:
  - metadata:
      name: local-pvc-zk
      namespace: default
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 500Mi
      storageClassName: "local-storage"
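
Assuming you save the manifests above in a file (the name zk.yaml below is only illustrative), you can create the ZooKeeper ensemble and watch the three replicas come up in order:

kubectl apply -f zk.yaml
kubectl get pods -l app=zk -w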

3. MtM

This is a StatefulSet POD that performs the master timestamping and configuration functions. The commit sequencer is in charge of distributing commit timestamps to the Local Transactional Managers. The Snapshot Server provides the freshest coherent snapshot on which new transactions can be started. The configuration manager handles system configuration and deployment information, and it also monitors the other components. It does this by persisting its state in the ZooKeeper component described in the previous section.

If you’ll be using HA, you’ll need to have two replicas of this POD.

Data replication is defined in terms of mirroring. When you define a data partition, the partition is replicated as many times as defined by the MIRROR_SIZE variable. This has an important implication: if you define MIRROR_SIZE as 2, the number of KiVi datastores must always be a multiple of 2. You can grow the cluster by increasing the number of replicas, but you have to add them 2 at a time, that is, one new mirroring group at a time.

The configuration parameters specific to this POD are:

            - name: INSTANCES
              value: "MtM"
            - name: MMMEM
              value: "1"
            - name: MIRROR_SIZE
              value: "2"Key/Value
            - name: ZK
              value: "zk-0.zk-hs,zk-1.zk-hs,zk-2.zk-hs"

MMMEM states that you will be allocating 1GiB of memory for this component.

The full POD configuration with HA is:

---
apiVersion: v1
kind: Service
metadata:
  name: mtm-service
  labels:
    app: mtm
spec:
  ports:
  - port: 10500
    name: lxconsole
    protocol: TCP
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mtm
spec:
  serviceName: mtm-service
  replicas: 2
  selector:
    matchLabels:
      app: mtm
  updateStrategy:
    type: RollingUpdate
  podManagementPolicy: OrderedReady
  template:
    metadata:
      labels:
        app: mtm
    spec:
      terminationGracePeriodSeconds: 10
      securityContext:
        fsGroup: 1000
      containers:
        - name: mtm
          image: docker.leanxcale.com/lx-docker-hc:1.7a
          env:
            - name: INSTANCES
              value: "MtM"
            - name: MEM
              value: "10"
            - name: MMMEM
              value: "1"
            - name: MIRROR_SIZE
              value: "2"
            - name: ZK
              value: "zk-0.zk-hs,zk-1.zk-hs,zk-2.zk-hs"
          livenessProbe:
            exec:
              command:
                - /bin/sh
                - -c
                - python3 /lx/LX-BIN/scripts/lxManageNode.py check MtM
            initialDelaySeconds: 5
            timeoutSeconds: 5
            periodSeconds: 30
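
The same command used by the liveness probe can also be run by hand to check the health of an MtM replica (mtm-0 is the name Kubernetes gives to the first replica of the mtm StatefulSet):

kubectl exec mtm-0 -- python3 /lx/LX-BIN/scripts/lxManageNode.py check MtM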

4. KVMS

KVMS is the metadata server for LeanXcale’s key/value datastores. This is also a StatefulSet POD.

In case you’re using HA, you need to start at least 2 replicas that will connect in a master/slave configuration:

  env:
    - name: INSTANCES
      value: "KVMS"
    - name: KVMS
      value: "kvms-0.ks kvms-1.ks"
    - name: ZK
      value: "zk-0.zk-hs,zk-1.zk-hs,zk-2.zk-hs"

An important note: the value of the KVMS variable has to contain the list of KVMS PODs according to the Kubernetes hostname and subdomain resolution names (https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-hostname-and-subdomain-fields):

---
apiVersion: v1
kind: Service
metadata:
  name: ks
  labels:
    app: kvms
spec:
  ports:
  - name: kvms
    port: 14400
    protocol: TCP
  clusterIP: None
  selector:
    app: kvms
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kvms
spec:
  serviceName: ks
  replicas: 2
  selector:
    matchLabels:
      app: kvms
  updateStrategy:
    type: RollingUpdate
  podManagementPolicy: OrderedReady
  template:
    metadata:
      labels:
        app: kvms
    spec:
      terminationGracePeriodSeconds: 10
      securityContext:
        fsGroup: 1000
      containers:
        - name: kvms
          image: docker.leanxcale.com/lx-docker-hc:1.7a
          env:
            - name: INSTANCES
              value: "KVMS"
            - name: KVMS
              value: "kvms-0.ks kvms-1.ks"
            - name: MEM
              value: "10"
            - name: ZK
              value: "zk-0.zk-hs,zk-1.zk-hs,zk-2.zk-hs"
          livenessProbe:
            exec:
              command:
                - /bin/sh
                - -c
                - python3 /lx/LX-BIN/scripts/lxManageNode.py check KVMS
            initialDelaySeconds: 10
            timeoutSeconds: 5
            periodSeconds: 30
          volumeMounts:
          - name: local-pvc-kvms
            mountPath: /lx/LX-DATA
  volumeClaimTemplates:
  - metadata:
      name: local-pvc-kvms
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 500Mi
      storageClassName: local-storage
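
Because ks is a headless Service, each KVMS POD gets a stable DNS name such as kvms-0.ks.default.svc.cluster.local. If you want to verify that these names resolve, one option is a throwaway pod with standard DNS tools (busybox here is only an example image; the LeanXcale image itself may not ship nslookup):

kubectl run dns-test --rm -it --image=busybox --restart=Never -- nslookup kvms-0.ks.default.svc.cluster.local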

5. KVDS

These are the key-value datastores. Each datastore will persist a part of the data and you can have as many datastore PODs as you need to manage your data. The more datastores, the faster you will be able to manage your data, but you should also be careful about data partitioning and distribution to take advantage of the resources.

In HA configurations with mirroring, the number of replicas has to be a multiple of the mirror size, so the system scales up while keeping complete mirror groups.

The amount of memory used by the KVDS process is defined by the KVDSMEM environment variable, in GiB. Usually one KVDS handles between 2 GiB and 16 GiB. The right amount of memory and the number of KVDS will depend on the size of your dataset and the workload that LeanXcale will manage.

These are the specific parameters for this POD:

        env:
        - name: MEM
          value: "10"
        - name: INSTANCES
          value: "KVDS"
        - name: KVDSMEM
          value: "2"
        - name: ZK
          value: "zk-0.zk-hs,zk-1.zk-hs,zk-2.zk-hs"
        - name: KVMS
          value: "kvms-0.ks.default.svc.cluster.local kvms-1.ks.default.svc.cluster.local"
        - name: KB_KVDS_SERVICE
          value: "kds"

MEM is the total memory allocated for the POD. It is not a hard limit, but it must be higher than KVDSMEM. Note that KVDS PODs need to address the KVMS, so they need the fully qualified network name of the KVMS.

KB_KVDS_SERVICE is the name of the Kubernetes service for the KVDS PODs and is also needed for networking purposes. If it is not set correctly, the KVDS may not be registered properly and other components may not be able to connect to them.
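
Note that the manifests in this document do not set Kubernetes resource requests or limits; MEM is only an environment variable read by the LeanXcale image. If you also want the Kubernetes scheduler to account for that memory, you may optionally add a resources block to the container spec. The sketch below is an assumption, not part of the reference configuration, and simply mirrors the MEM value used in the examples:

          resources:
            requests:
              memory: "10Gi"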

A complete configuration for KVDS PODs in HA follows (though only 2 are considered):

---
apiVersion: v1
kind: Service
metadata:
  name: kds
  labels:
    app: kvds
spec:
  ports:
  - name: kvds
    port: 9992
    protocol: TCP
  clusterIP: None
  selector:
    app: kvds
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kvds
spec:
  serviceName: kds
  replicas: 2
  selector:
    matchLabels:
      app: kvds
  updateStrategy:
    type: RollingUpdate
  podManagementPolicy: OrderedReady
  template:
    metadata:
      labels:
        app: kvds
    spec:
      terminationGracePeriodSeconds: 10
      securityContext:
        fsGroup: 1000
      containers:
        - name: kvds
          image: docker.leanxcale.com/lx-docker-hc:1.7a
          #imagePullPolicy: Always
          env:
            - name: INSTANCES
              value: "KVDS"
            - name: MEM
              value: "10"
            - name: KVDSMEM
              value: "2"
            - name: ZK
              value: "zk-0.zk-hs,zk-1.zk-hs,zk-2.zk-hs"
            - name: KVMS
              value: "kvms-0.ks.default.svc.cluster.local kvms-1.ks.default.svc.cluster.local"
            - name: KB_KVDS_SERVICE
              value: "kds"
          livenessProbe:
            exec:
              command:
                - /bin/sh
                - -c
                - python3 /lx/LX-BIN/scripts/lxManageNode.py check KVDS
            initialDelaySeconds: 12
            periodSeconds: 30
          volumeMounts:
          - name: local-pvc-ds
            mountPath: /lx/LX-DATA

  volumeClaimTemplates:
  - metadata:
      name: local-pvc-ds
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 500Mi
      storageClassName: "local-storage"

6. CflM

This is the Conflict Manager POD. Conflict Management can be scaled out to as many components as needed, so for a large transactional system you can run many Conflict Managers.

If using HA, you need at least two.

The configuration parameters specific to this POD are:

            - name: INSTANCES
              value: "CflM"
            - name: CFLICTMEM
              value: "2"
            - name: ZK
              value: "zk-0.zk-hs,zk-1.zk-hs,zk-2.zk-hs"

CFLICTMEM states that you will be allocating 2GiB of memory for each of these components.

The full POD configuration with HA enabled is:

---
apiVersion: v1
kind: Service
metadata:
  name: cflm-service
  labels:
    app: cflm
spec:
  ports:
  - port: 13100
    name: conflict
    protocol: TCP
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cflm
spec:
  serviceName: cflm-service
  replicas: 2
  selector:
    matchLabels:
      app: cflm
  updateStrategy:
    type: RollingUpdate
  podManagementPolicy: OrderedReady
  template:
    metadata:
      labels:
        app: cflm
    spec:
      terminationGracePeriodSeconds: 10
      securityContext:
        fsGroup: 1000
      containers:
        - name: cflm
          image: docker.leanxcale.com/lx-docker-hc:1.7a
          env:
            - name: INSTANCES
              value: "CflM"
            - name: MEM
              value: "10"
            - name: CFLICTMEM
              value: "2"
            - name: ZK
              value: "zk-0.zk-hs,zk-1.zk-hs,zk-2.zk-hs"
          livenessProbe:
            exec:
              command:
                - /bin/sh
                - -c
                - python3 /lx/LX-BIN/scripts/lxManageNode.py check CflM
            initialDelaySeconds: 10
            timeoutSeconds: 5
            periodSeconds: 30
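
If the transactional load grows, you can add Conflict Managers by scaling the StatefulSet defined above (4 replicas is only an example figure):

kubectl scale statefulset/cflm --replicas=4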

7. Query Engine Deployment

This process executes queries against the database, so it is the point where clients connect to get their results. Thanks to Kubernetes, it is easy to create as many replicas as you need as the client load on the system increases.

The memory used by the Query Engine process is defined by the QEMEM environment variable.

You can also define the Service that exposes the Query Engine port outside the Kubernetes cluster (in the example, through nodePort 31529).

---
apiVersion: v1
kind: Service
metadata:
  name: qe
  labels:
    app: qe
spec:
  type: NodePort
  selector:
    app: qe
  ports:
  - port: 1529
    targetPort: 1529
    nodePort: 31529
    name: query
    protocol: TCP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qe
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qe
  template:
    metadata:
      labels:
        app: qe
    spec:
      terminationGracePeriodSeconds: 10
      securityContext:
        fsGroup: 1000
      containers:
        - name: qe
          image: docker.leanxcale.com/lx-docker-hc:1.7a
          ports:
            - containerPort: 1529
          env:
            - name: INSTANCES
              value: "QE"
            - name: MEM
              value: "10"
            - name: QEMEM
              value: "1"
            - name: ZK
              value: "zk-0.zk-hs,zk-1.zk-hs,zk-2.zk-hs"
          livenessProbe:
            exec:
              command:
                - /bin/sh
                - -c
                - python3 /lx/LX-BIN/scripts/lxManageNode.py check QE
            initialDelaySeconds: 15
            periodSeconds: 30
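
With the NodePort Service above, clients outside the Kubernetes cluster reach the Query Engine through port 31529 on any cluster node. You can confirm the exposed port and then point your client driver at <node-IP>:31529 (the exact connection string depends on the LeanXcale driver you use):

kubectl get svc qe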

8. Loggers

These processes save the state of the database, so they need to be configured as StatefulSets. There are 2 kinds of loggers:

  • Loggers for timestamping: timestamping logging is very light, but it allows the system to continue in case of crashes or other problems. In HA you need 2 of these loggers.

  • Loggers for transactions: you can instantiate as many PODs as required by the workload. Under heavy transactional workload, you can scale out and distribute logging to avoid any bottleneck in logging transactions.

---
apiVersion: v1
kind: Service
metadata:
  name: loggercssrv
  labels:
    app: logcms
spec:
  ports:
  - name: logger-cms
    port: 13400
    protocol: TCP
  clusterIP: None
  selector:
    app: logcms
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: logcms
spec:
  serviceName: loggercssrv
  replicas: 2
  selector:
    matchLabels:
      app: logcms
  updateStrategy:
    type: RollingUpdate
  podManagementPolicy: OrderedReady
  template:
    metadata:
      labels:
        app: logcms
    spec:
      terminationGracePeriodSeconds: 10
      securityContext:
        fsGroup: 1000
      containers:
        - name: logcms
          image: docker.leanxcale.com/lx-docker-hc:1.7a
          env:
            - name: INSTANCES
              value: "LgCmS"
            - name: MEM
              value: "10"
            - name: LOGMEM
              value: "1"
            - name: ZK
              value: "zk-0.zk-hs,zk-1.zk-hs,zk-2.zk-hs"
          livenessProbe:
            exec:
              command:
                - /bin/sh
                - -c
                - python3 /lx/LX-BIN/scripts/lxManageNode.py check LgCmS
            initialDelaySeconds: 5
            timeoutSeconds: 5
            periodSeconds: 30
          volumeMounts:
          - name: local-pvc-loggercs
            mountPath: /lx/LX-DATA

  volumeClaimTemplates:
  - metadata:
      name: local-pvc-loggercs
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 500Mi
      storageClassName: local-storage

---
apiVersion: v1
kind: Service
metadata:
  name: loggersrv
  labels:
    app: logtxn
spec:
  ports:
  - name: logger-ltm
    port: 13420
    protocol: TCP
  clusterIP: None
  selector:
    app: logtxn
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: logtxn
spec:
  serviceName: loggersrv
  replicas: 2
  selector:
    matchLabels:
      app: logtxn
  updateStrategy:
    type: RollingUpdate
  podManagementPolicy: OrderedReady
  template:
    metadata:
      labels:
        app: logtxn
    spec:
      terminationGracePeriodSeconds: 10
      securityContext:
        fsGroup: 1000
      containers:
        - name: logtxn
          image: docker.leanxcale.com/lx-docker-hc:1.7a
          env:
            - name: INSTANCES
              value: "LgLTM"
            - name: MEM
              value: "10"
            - name: LOGMEM
              value: "1"
            - name: ZK
              value: "zk-0.zk-hs,zk-1.zk-hs,zk-2.zk-hs"
          livenessProbe:
            exec:
              command:
                - /bin/sh
                - -c
                - python3 /lx/LX-BIN/scripts/lxManageNode.py check LgLTM
            initialDelaySeconds: 10
            timeoutSeconds: 5
            periodSeconds: 30
          volumeMounts:
          - name: local-pvc-logger
            mountPath: /lx/LX-DATA

  volumeClaimTemplates:
  - metadata:
      name: local-pvc-logger
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 500Mi
      storageClassName: local-storage

9. Persistent Volumes

There are Persistent Volume Claim Templates pre-defined inside the StatefulSet definitions, but the Kubernetes cluster administrator needs to define the Persistent Volumes based on the topology of the cluster. Below, one of them is defined using the Filesystem volumeMode:

9.1. Persistent Volume

As an example of a Persistent Volume that satisfies the Persistent Volume Claim Templates:

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-lx-node-1a
spec:
  capacity:
    storage: 500Mi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /DATA_LX/DOC_KUBERNETES/A
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - ip-172-31-63-60

Consider that you will need a persistent volume for each pod that has a persistent volume claim.
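
Each replica of a StatefulSet with a volumeClaimTemplate gets its own Persistent Volume Claim, named after the template and the pod (for example, local-pvc-zk-zk-0 for the first ZooKeeper replica), so you need one matching Persistent Volume per claim. You can check that every claim gets bound:

kubectl get pvc
kubectl get pv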

9.2. Storage Size Guidelines

The storage capacities in the examples throughout this document are not representative of the real sizes you may need. For real environments:

  • The size needed for ZK pods is relatively small because they only hold the ZooKeeper configuration. 10 GiB should be more than enough.

  • The size for logcms pods, which are the loggers for timestamping, is also small, and 10 GiB should be more than enough.

  • The size for logtxn pods depends on the amount of transactions and on other activities such as backup policies. Transaction logs usually rotate through several files, and you may need more files to keep incremental backups. To start, you may configure 100 GiB for these pods (see the sketch after this list).

  • KVMS pods only store the metadata and 10 GiB would suffice.

  • The size for KVDS pods directly depends on the size of your dataset.
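
To apply these guidelines, change the storage request in the corresponding volumeClaimTemplate (and provide Persistent Volumes of at least that size). For instance, a sketch of the logtxn volumeClaimTemplates with the 100 GiB suggested above would be:

  volumeClaimTemplates:
  - metadata:
      name: local-pvc-logger
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 100Gi
      storageClassName: local-storage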

10. Scale UP

You can scale up the components according to the rules in each section. Usually you will scale up KVDS, Query Engine and, less frequently, Conflict Managers and Loggers.

In the following example, you can see how to scale up the number of Query Engines just by changing the number of replicas in the Deployment:

kubectl scale deployment/qe --replicas=4
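
For components in a mirrored configuration, remember to keep the replica count a multiple of MIRROR_SIZE. For example, with MIRROR_SIZE set to 2 you would grow the KVDS layer from 2 to 4 replicas (one new mirroring group) with:

kubectl scale statefulset/kvds --replicas=4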