LeanXcale is a distributed database. It can scale transactions from 1 node to thousands. This is achieved by following a number of basic principles, having a scalable architecture, and through intelligent implementation of the various database concepts.

1. Principles

The basic principles for LeanXcale transaction management are the following:

  • Separation of commit from the visibility of committed data

  • Proactive pre-assignment of commit timestamps to committing transactions

  • Detection and resolution of conflicts before commit

  • Transactions can commit in parallel because:

    • They do not conflict

    • Their commit timestamps are already assigned, which determines their serialization order

    • Visibility is regulated separately to guarantee that reads see fully consistent states

ACID transactions can scale because their processing is decomposed into different components that can be distributed and scaled independently.
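
The separation between commit and visibility can be illustrated with a minimal sketch. It is not LeanXcale code: the class name `SnapshotTracker` and its methods are hypothetical, and it assumes that the visible snapshot is the highest commit timestamp below which every pre-assigned timestamp has already completed its commit.

```python
class SnapshotTracker:
    """Hypothetical sketch: commits finish in any order, visibility advances in order."""

    def __init__(self):
        self.next_ts = 1      # next commit timestamp to hand out
        self.done = set()     # timestamps whose commit has completed
        self.snapshot = 0     # highest timestamp visible to new readers

    def assign_commit_ts(self):
        # Pre-assignment: the timestamp fixes the serialization order up front.
        ts = self.next_ts
        self.next_ts += 1
        return ts

    def report_committed(self, ts):
        # Visibility only advances over a gap-free prefix of completed commits
        # (an assumption of this sketch), so readers never see a partial state.
        self.done.add(ts)
        while self.snapshot + 1 in self.done:
            self.snapshot += 1
            self.done.remove(self.snapshot)
        return self.snapshot


tracker = SnapshotTracker()
t1, t2, t3 = (tracker.assign_commit_ts() for _ in range(3))
after_t3 = tracker.report_committed(t3)  # t1 and t2 are still pending
after_t1 = tracker.report_committed(t1)
after_t2 = tracker.report_committed(t2)
print(after_t3, after_t1, after_t2)  # prints "0 1 3"
```

In this sketch the third transaction commits first, yet its changes become visible only once the two earlier timestamps have also completed.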

2. Components


LeanXcale is made up of the following components:

  • Query engine

    • OLAP Workers

    • JDBC Service

    • Parser

    • Optimizer

    • Plan Executor

    • Local Transaction Manager

  • Transaction Manager

    • Commit Sequencer

    • Snapshot Server

    • Config Manager

    • Conflict Managers

    • Loggers

  • KiVi Data Store

    • KiVi Metadata Server

  • ZooKeeper

    • Distributed configuration management

2.1. Query Engine

The Query Engine transforms SQL queries into a query plan: a tree of executable algebraic query operators over the data store. Data flows through the operators until the plan produces the result requested by the original SQL statement.
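
As a rough illustration of a plan as a tree of operators, the following sketch chains three toy operators built from Python generators. The operator names (`scan`, `select`, `project`) and the sample table are assumptions made for this example and do not reflect the actual LeanXcale operators.

```python
def scan(rows):
    # Leaf operator: reads rows from the (here in-memory) data store.
    for row in rows:
        yield row


def select(child, predicate):
    # Filter operator: keeps only the rows that satisfy the predicate.
    for row in child:
        if predicate(row):
            yield row


def project(child, columns):
    # Projection operator: keeps only the requested columns.
    for row in child:
        yield {c: row[c] for c in columns}


orders = [
    {"id": 1, "customer": "ana", "total": 40},
    {"id": 2, "customer": "bob", "total": 15},
    {"id": 3, "customer": "ana", "total": 25},
]

# Plan for: SELECT id, total FROM orders WHERE customer = 'ana'
plan = project(select(scan(orders), lambda r: r["customer"] == "ana"), ["id", "total"])
result = list(plan)
print(result)
```

Each operator pulls rows from its child, so rows move up the tree one at a time until the root produces the final result.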

2.2. JDBC

The JDBC API provides standardized connection and query execution for the query engine.

2.2.1. Parser

The SQL parser builds an execution plan from the queries.

2.2.2. Optimizer

The SQL optimizer transforms the execution plan for efficiency.

2.2.3. Plan Executor

The plan executor is an SQL query engine that orchestrates the execution of the query plan.

2.2.4. OLAP Workers

For analytical queries, the query engine can parallelize the query operators by leveraging the OLAP workers in the different Query Engine components. This way, analytical queries run in parallel and results are returned faster and more efficiently.
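
The idea of parallelizing an analytical operator can be sketched as a two-phase aggregation: each worker computes a partial result over its share of the data, and the partial results are then combined. This is only an illustration; the function names are hypothetical and a thread pool stands in for the OLAP workers.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum_count(partition):
    # Work done by one worker: a partial aggregate over its partition.
    return sum(partition), len(partition)


def parallel_average(partitions, workers=4):
    # Phase 1: compute the partial aggregates concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum_count, partitions))
    # Phase 2: combine the partial aggregates into the final result.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


partitions = [[10, 20, 30], [40, 50], [60]]
avg = parallel_average(partitions)
print(avg)  # prints 35.0
```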

2.2.5. Local Transaction Manager

The local transaction manager is in charge of the life cycle of transactions and of interfacing between the client side of the storage engine and the other transaction manager components.

2.3. Transaction Management

The transaction manager is in charge of guaranteeing the coherence of the operational data in the event of failures and concurrent accesses. It provides a transaction abstraction that lets an application bracket a set of data operations and request that they be executed atomically.

The transaction manager layer is composed of several components.

2.3.1. Loggers

The loggers are components that log all transaction updates (called writesets) to persistent storage to guarantee the durability of transactions. The logger subsystem is not built from a single logger process; loggers can be distributed across different processes to achieve better performance.

Each logger takes care of a fraction of the log records. Loggers log in parallel and are uncoordinated. There can be as many loggers as needed to provide the IO bandwidth required to sustain the rate of updates.

Loggers can also be replicated.

2.3.2. Conflict Managers

Conflict Managers are in charge of detecting write-write conflicts among concurrent transactions. When such conflicts are detected, the Local Transaction Managers (LTMs) can abort the transactions that cannot enforce write isolation.

Each conflict manager takes care of a set of keys, and there can be as many conflict managers as needed. They scale in the same way as hash-based key-value data stores.
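
A minimal sketch of key-partitioned write-write conflict detection follows. It is an assumption-laden illustration, not the LeanXcale protocol: the `ConflictManager` class, the `route` function, and the rule "a write conflicts if the key was last written with a commit timestamp later than the writer's start timestamp" are all hypothetical, and aborts and in-flight writers are ignored.

```python
import zlib


class ConflictManager:
    """Hypothetical manager responsible for a subset of the keys."""

    def __init__(self):
        self.last_commit_ts = {}  # key -> commit timestamp of the latest writer

    def check_and_record(self, key, start_ts, commit_ts):
        # Conflict: a concurrent transaction wrote this key after we started.
        if self.last_commit_ts.get(key, 0) > start_ts:
            return False
        self.last_commit_ts[key] = commit_ts
        return True


def route(key, managers):
    # Stable hash, so every node picks the same manager for a given key.
    return managers[zlib.crc32(key.encode("utf-8")) % len(managers)]


managers = [ConflictManager() for _ in range(3)]


def try_write(key, start_ts, commit_ts):
    return route(key, managers).check_and_record(key, start_ts, commit_ts)


first = try_write("account:42", start_ts=5, commit_ts=8)
concurrent = try_write("account:42", start_ts=6, commit_ts=9)
later = try_write("account:42", start_ts=8, commit_ts=10)
print(first, concurrent, later)  # prints "True False True"
```

Because each key is always routed to the same manager, adding managers spreads the keys, and therefore the conflict checks, over more processes.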

2.3.3. Commit Sequencer

The commit sequencer is in charge of distributing commit timestamps to the Local Transaction Managers.

2.3.4. Snapshot Server

The Snapshot Server provides the freshest coherent snapshot on which new transactions can be started.

2.3.5. Configuration Manager

The configuration manager handles system configuration and deployment information. It also monitors the other components.

2.4. DataStore

The data store is a fully ACID, highly efficient relational key-value data store, which also supports columnar storage.

2.4.1. KiVi Data Stores

KiVi data stores support local secondary indexes and perform operations that take advantage of the SIMD vector instructions of modern CPUs. They are designed to get the most out of current multi-core and NUMA architectures, and implement a novel data structure that combines the advantages of B+ Trees for range queries and of LSM-Trees for random updates and inserts.

2.4.2. KiVi Metadata Server

This is the component that keeps the metadata and coordinates the different KiVi Data Stores.

2.5. ZooKeeper

Apache ZooKeeper is a centralized replicated service for maintaining configuration information and providing distributed synchronization and group services.

LeanXcale data management uses ZooKeeper for coordination among components and to keep configuration information and the health status of each of them.

3. High Availability

The components of LeanXcale can be configured to provide a highly available deployment. There are a few requirements that the layout of the components should meet to provide a highly available set-up.

  1. ZooKeeper has to be installed on at least three different nodes, and always on an odd number of machines. ZooKeeper is used for leader election and consensus, so in case of a network partition, the components on the side that holds the ZooKeeper majority will remain active.

  2. There should be at least two instances of the rest of the components.

    Most components are active-active and the workload will be distributed among them. In case of crash, the workload will be transferred to the remaining components, so depending on the sizing, the performance may be degraded until the components are recovered.

  3. The Loggers must be installed on a different machine from the components whose transactional information they store. This way, if one machine crashes, the "on-the-fly" activity of the components on that machine can be recovered from the loggers running on another machine.

    In fact, Loggers can be configured in groups so as to allow the replication of transaction logging events.
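
The ZooKeeper requirement in item 1 follows from majority arithmetic. The sketch below uses hypothetical helper names and only the standard majority-quorum rule; it shows why an even number of machines adds no extra fault tolerance.

```python
def quorum_size(n):
    # A majority of n instances.
    return n // 2 + 1


def tolerated_failures(n):
    # Instances that can fail while a majority is still reachable.
    return n - quorum_size(n)


def has_quorum(partition_size, n):
    # After a network split, only the side holding a majority stays active.
    return partition_size >= quorum_size(n)


table = {n: (quorum_size(n), tolerated_failures(n)) for n in range(3, 7)}
print(table)  # prints {3: (2, 1), 4: (3, 1), 5: (3, 2), 6: (4, 2)}
```

Going from 3 to 4 instances still tolerates only one failure, and a 2-2 split of 4 instances leaves neither side with a majority, which is why odd ensemble sizes are preferred.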

3.1. Fault Tolerance in each Component

High availability means that requests can still be served when a component fails. The following describes how each component can be configured so that requests keep being served after a failure.

  • ZooKeeper: As mentioned above, ZooKeeper provides consensus and leader election, so there need to be at least 3 instances running on different machines. All three instances hold the same information. If one of the instances dies, the other 2 will keep working and the database will keep working with no impact. When the failed instance restarts, all three instances will be realigned.

  • Configuration Manager: There have to be 2 instances of the Configuration Manager. However, the Configuration Manager runs Active-Passive because it has a purely operational function: handling reconfiguration and health monitoring. In fact, the database could work with no Configuration Manager as long as there were no reconfiguration or other events. If the active Configuration Manager dies, the passive one becomes active. The failure is detected through a ZooKeeper heartbeat.

  • Snapshot Server: The Snapshot Server can run several instances in master-slave mode. The master Snapshot Server provides and keeps the snapshot information for the rest of the components, and keeps the slaves informed. If the master fails, another Snapshot Server is selected and keeps providing all the information needed.

  • Commit Sequencer: As with the Snapshot Server, the Commit Sequencer works in master-slaves mode, so the two components behave almost identically on failure.

  • Conflict Manager: There can be as many Conflict Managers as needed. Conflicts are distributed by ranges of keys, so depending on the key, a conflict will be managed by a specific Conflict Manager. This way, the workload is distributed among all running Conflict Managers. Since conflicts are distributed but not replicated, if a Conflict Manager fails, all transactions with conflicts managed by that Conflict Manager will be aborted. Then, the Configuration Manager will reconfigure the rest of the Conflict Managers and redistribute the ranges of keys. After reconfiguration, the database will keep working as if there had been no failure. Having two instances of the Conflict Manager on different machines will suffice, but more Conflict Managers can be configured to deal with a high transactional workload.

  • Transactional Loggers: As mentioned in the previous section, there can be as many transactional loggers as needed, but the minimum for high availability is two instances on different machines. All components running transactions are connected to at least one transactional logger. The Configuration Manager, unless there is no other option, will connect each component to a logger running on a different machine. That way, if a machine and its local disk crash, the on-the-fly committed transactions can be recovered from the loggers on the other machines. When one logger stops, the Configuration Manager reconfigures the components that were connected to it and orders them to connect to another logger. Once reconfigured, transactions will keep being logged, though the logger and the component may then be on the same machine. When a new logger is started, the Configuration Manager will redistribute the connections again so that each component and its logger run on different machines.

  • Query Engine: There can be as many Query Engines as needed to deal with the workload. SQL sessions are balanced across the Query Engines. If a Query Engine fails, the SQL sessions running on it will be aborted if they have any pending operation, and the sessions will be switched to another Query Engine. Having 2 Query Engine instances on different machines is enough to achieve high availability.

  • KVMS: The KiVi datastore metadata server works in a master-slaves configuration. The master propagates all changes to all slaves. If a slave fails, the master won’t allow any changes to metadata until the slave is up again. If the master fails, one of the slaves is promoted to master. The KVMS instances run on different machines. The database can work with no KVMS as long as there are no metadata requests, but with this configuration the odds of having no KVMS available to act as master are really slim.

  • KVDS: The KiVi datastore server manages the data and its persistence. It is a very complex component because it not only manages data IO but can also evaluate complex filters and aggregates, so that only the minimal information goes to the Query Engine. Each KVDS is a shared-nothing component and works with a local set of files (unless you use a storage array). Every deployment has a number of KVDS instances, and the data is distributed among them. On failure, the data that the failed KVDS was serving won’t be available, but the data managed by the remaining KVDS instances will be. So there will always be a KVDS to serve requests, though not for all the data; this is where replication comes into play.
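
The logger placement rule described for the Transactional Loggers can be sketched as follows. The function `assign_logger`, the logger names, and the machine names are hypothetical; the sketch only captures the stated preference for a logger on a different machine, with a same-machine logger as a fallback.

```python
def assign_logger(component_machine, loggers):
    """Pick a logger for a component; `loggers` maps logger name -> machine."""
    # Prefer a logger on another machine, so that a machine crash does not
    # take the component and its log down together.
    remote = sorted(name for name, m in loggers.items() if m != component_machine)
    if remote:
        return remote[0]
    # Fallback: no other option, so use a logger on the same machine.
    local = sorted(loggers)
    return local[0] if local else None


loggers = {"logger-a": "m1", "logger-b": "m2"}
before = assign_logger("m1", loggers)   # remote logger available
del loggers["logger-b"]                 # the remote logger stops
during = assign_logger("m1", loggers)   # same-machine fallback
loggers["logger-c"] = "m3"              # a new logger is started
after = assign_logger("m1", loggers)    # back to a remote logger
print(before, during, after)  # prints "logger-b logger-a logger-c"
```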

3.2. Replication

To ensure that not only the components but also the data remain available, you need replication. You can run KVDS in replication mode. That way you can guarantee that a number of KVDS instances manage the same piece of information.

LeanXcale uses its innovative transaction management for replication. This means that a table partition and its replicas are modified as if they were part of one transaction. No extra mechanism is needed, which makes replication very efficient. Of course, the number of KVDS instances needed is multiplied by the replication factor.

The default replication scheme is ROWA (Read One, Write All). Thus, reads go to just one of the KVDS instances holding the information, while writes go to all the replicas so that all of them stay aligned. By default, all reads go to the same KVDS, so you can take advantage of a hot cache. However, very hot partitions or tables can be replicated in such a way that reads are balanced across all the replicas.

In case of failure of one of the replicas, the request will be forwarded to another replica, and the failing replica will have to be resynchronized upon restart.
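
The read and write paths of a read-one, write-all scheme can be sketched in a few lines. The classes `Replica` and `RowaPartition` are hypothetical, writes are applied without any transactional coordination, and resynchronization of a restarted replica is left out.

```python
class Replica:
    """Hypothetical stand-in for one KVDS holding a copy of a partition."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}


class RowaPartition:
    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, key, value):
        # Write all: every live replica receives the update.
        live = [r for r in self.replicas if r.alive]
        if not live:
            raise RuntimeError("no replica available")
        for r in live:
            r.data[key] = value

    def read(self, key):
        # Read one: always the first live replica, which keeps its cache hot;
        # if it has failed, the request is forwarded to the next one.
        for r in self.replicas:
            if r.alive:
                return r.name, r.data.get(key)
        raise RuntimeError("no replica available")


partition = RowaPartition([Replica("kvds-1"), Replica("kvds-2")])
partition.write("k", "v1")
normal_read = partition.read("k")
partition.replicas[0].alive = False     # kvds-1 fails
failover_read = partition.read("k")
print(normal_read, failover_read)  # prints "('kvds-1', 'v1') ('kvds-2', 'v1')"
```

Any write accepted while a replica is down is missing from that replica, which is why a failed replica has to be resynchronized before it serves reads again.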

3.3. Conclusion for High Availability and Replication

LeanXcale supports mechanisms for high availability and replication. These mechanisms are not limited to a fixed factor, so configurations with a higher replication factor or a higher number of high availability instances can be set up.

LeanXcale provides default configurations for high availability and replication. The default configuration for high availability creates at least 2 instances of each component (except for ZooKeeper, which has 3) and starts 2 replicas of the data, managed by different KVDS instances. With 2 replicas distributed on different machines and high availability with a factor of 2 instances, you can guarantee that the database will be resilient to the complete failure of one machine. If you want a deployment with stronger fault tolerance, you can increase the number of replicas and the number of high availability instances.