Architecture

LeanXcale is a distributed database that can scale transaction processing from a single node to thousands of nodes. This is achieved by following a number of basic principles, by having a scalable architecture, and through an intelligent implementation of the various database concepts.

1. Principles

The basic principles for LeanXcale transaction management are the following:

  • Separation of commit from the visibility of committed data

  • Proactive pre-assignment of commit timestamps to committing transactions

  • Detection and resolution of conflicts before commit

  • Transactions can commit in parallel because:

    • They do not conflict with each other

    • Their commit timestamps are already assigned, which determines their serialization order

    • Visibility is regulated separately to guarantee that only fully consistent states are read

ACID transactions can scale because they are decomposed into different components that can be distributed and scaled independently.
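The interplay between pre-assigned commit timestamps and separately regulated visibility can be illustrated with a minimal sketch. The class and method names here are hypothetical, chosen only to mirror the roles described above, and are not the actual LeanXcale API:

```python
class CommitSequencer:
    """Proactively pre-assigns monotonically increasing commit timestamps."""
    def __init__(self):
        self.ts = 0

    def next_commit_ts(self):
        self.ts += 1
        return self.ts

class SnapshotServer:
    """Advances visibility only up to the largest gap-free prefix of
    committed timestamps, so readers always see a consistent state."""
    def __init__(self):
        self.visible = 0   # freshest fully consistent snapshot
        self.done = set()  # commits finished out of order

    def report_committed(self, ts):
        self.done.add(ts)
        while self.visible + 1 in self.done:
            self.done.remove(self.visible + 1)
            self.visible += 1

    def snapshot(self):
        return self.visible

seq, snap = CommitSequencer(), SnapshotServer()
t1, t2, t3 = (seq.next_commit_ts() for _ in range(3))  # pre-assigned: 1, 2, 3

snap.report_committed(t2)    # t2 commits first...
assert snap.snapshot() == 0  # ...but stays invisible while t1 is still committing
snap.report_committed(t1)
assert snap.snapshot() == 2  # t1 and t2 become visible together
snap.report_committed(t3)
assert snap.snapshot() == 3
```

Note how t2 commits in parallel with t1, yet visibility advances only when no earlier timestamp is outstanding: commit and visibility are decoupled.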

2. Components


LeanXcale is made up of the following components:

  • Query engine

    • OLAP Workers

    • JDBC Service

    • Parser

    • Optimizer

    • Plan Executor

    • Local Transaction Manager

  • Transaction Manager

    • Commit Sequencer

    • Snapshot Server

    • Config Manager

    • Conflict Managers

    • Loggers

  • KiVi Data Store

    • KiVi Metadata Server

  • Zookeeper

    • Distributed configuration management

2.1. Query Engine

The Query Engine transforms SQL queries into a query plan: a tree of executable algebraic query operators over the data store that moves data across operators until finally producing the overall result requested by the original SQL statement.
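The idea of a plan as an operator tree can be sketched in a few lines. The operators below (a scan, a filter, and a projection) are illustrative simplifications, not the engine's real operator set:

```python
# A hypothetical three-operator plan for:
#   SELECT name FROM users WHERE age > 30
rows = [{"name": "ana", "age": 25}, {"name": "bo", "age": 41}]

def scan(table):             # leaf operator: produces rows from the data store
    yield from table

def filter_(child, pred):    # passes along rows that satisfy the predicate
    return (r for r in child if pred(r))

def project(child, cols):    # keeps only the requested columns
    return ({c: r[c] for c in cols} for r in child)

# Data flows leaf-to-root through the operator tree:
plan = project(filter_(scan(rows), lambda r: r["age"] > 30), ["name"])
assert list(plan) == [{"name": "bo"}]
```

Because each operator consumes its child lazily, rows stream through the tree rather than being materialized between steps, which is the same pipelining principle a real query plan relies on.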

2.2. JDBC

The JDBC API provides standardized connection and query execution for the query engine.

2.2.1. Parser

The SQL parser parses each SQL statement and builds an initial execution plan from it.

2.2.2. Optimizer

The SQL optimizer transforms the execution plan to make it more efficient.

2.2.3. Plan Executor

The plan executor is an SQL query engine that orchestrates the execution of the query plan.

2.2.4. OLAP Workers

For analytical queries, the query engine can parallelize the query operators by leveraging the OLAP workers in the different Query Engine components. This way, analytical queries are parallelized and results are obtained faster and more efficiently.
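The essence of this parallelization is partial aggregation: each worker aggregates its own partition of the data, and the partial results are merged at the end. A minimal sketch, with hypothetical partitioning:

```python
from concurrent.futures import ThreadPoolExecutor

# A hypothetical SUM(amount) pushed down to OLAP workers: each worker
# computes a partial sum over its data partition, then the partials merge.
partitions = [[1, 2, 3], [10, 20], [100]]

def partial_sum(partition):
    return sum(partition)           # work done close to the data

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(partial_sum, partitions))

assert partials == [6, 30, 100]
assert sum(partials) == 136         # merged final result
```

The same split-then-merge pattern applies to other decomposable aggregates (COUNT, MIN, MAX, AVG via sum and count), which is what makes analytical operators amenable to this kind of parallelism.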

2.2.5. Local Transaction Manager

The local transaction manager is in charge of the life cycle of transactions and of interfacing between the client side of the storage engine and the other transaction manager components.

2.3. Transaction Management

The transaction manager is in charge of guaranteeing the coherence of the operational data in the presence of failures and concurrent accesses. It provides transactions as an abstraction that makes it possible to bracket a set of data operations and request that they execute atomically.

The transaction manager layer is composed of several components.

2.3.1. Loggers

The loggers are components that log all transaction updates (called writesets) to persistent storage to guarantee durability of transactions. The logger subsystem is not built from a single logger process; instead, loggers can be distributed across different processes to achieve better performance.

Each logger takes care of a fraction of the log records. Loggers log in parallel and are uncoordinated. There can be as many loggers as needed to provide the necessary IO bandwidth to log the rate of updates.

Loggers can also be replicated.
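The partitioned, uncoordinated nature of the logger subsystem can be sketched as follows. The hashing scheme and function names are assumptions for illustration, not the actual implementation:

```python
import hashlib

NUM_LOGGERS = 3
logs = [[] for _ in range(NUM_LOGGERS)]  # one independent log per logger process

def logger_for(txn_id):
    # Hash the transaction id so writesets spread evenly across loggers,
    # with no coordination needed between them.
    h = int(hashlib.sha256(str(txn_id).encode()).hexdigest(), 16)
    return h % NUM_LOGGERS

def log_writeset(txn_id, writeset):
    logs[logger_for(txn_id)].append((txn_id, writeset))

for txn in range(100):
    log_writeset(txn, {"key%d" % txn: txn})

assert sum(len(log) for log in logs) == 100  # every writeset logged exactly once
assert all(logs)                             # load is spread across all loggers
```

Because each logger owns a disjoint fraction of the log records, adding loggers adds I/O bandwidth linearly, which is why the subsystem can keep up with an arbitrary update rate.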

2.3.2. Conflict Managers

Conflict Managers are in charge of detecting write-write conflicts among concurrent transactions. By detecting these conflicts, the Local Transaction Managers (LTMs) can abort transactions that cannot enforce write isolation.

Each conflict manager takes care of a set of keys, and there can be as many conflict managers as needed. They scale in the same way as hashing-based key-value data stores.
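A minimal sketch of this scheme, with hypothetical names and a first-writer-wins policy assumed for illustration:

```python
class ConflictManager:
    """Handles one fraction of the key space; records which active
    transaction currently owns each written key."""
    def __init__(self):
        self.writers = {}

    def check_write(self, txn, key):
        owner = self.writers.setdefault(key, txn)
        return owner == txn   # False => write-write conflict: abort txn

# Keys are hashed to managers, so the scheme scales like a hashed KV store.
managers = [ConflictManager() for _ in range(4)]

def manager_for(key):
    return managers[hash(key) % len(managers)]

assert manager_for("acct:7").check_write("T1", "acct:7")      # T1 claims the key
assert not manager_for("acct:7").check_write("T2", "acct:7")  # conflict: abort T2
assert manager_for("acct:9").check_write("T2", "acct:9")      # disjoint key is fine
```

Since conflicts for different keys land on different managers, conflict detection itself never becomes a centralized bottleneck.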

2.3.3. Commit Sequencer

The commit sequencer is in charge of distributing commit timestamps to the Local Transaction Managers.

2.3.4. Snapshot Server

The Snapshot Server provides the freshest coherent snapshot on which new transactions can be started.

2.3.5. Configuration Manager

The configuration manager handles system configuration and deployment information. It also monitors the other components.

2.4. DataStore

The data store is a fully ACID, highly efficient relational key-value data store that also supports columnar storage.

2.4.1. KiVi Data Stores

KiVi data stores support secondary local indexes and perform operations taking advantage of modern CPUs' vector (SIMD) instructions. They are designed to get the most out of current multi-core and NUMA architectures, and implement a novel data structure that combines the advantages of B+ trees for range queries with those of LSM trees for random updates and inserts.

2.4.2. KiVi Metadata Server

This component keeps the metadata information and coordinates the different KiVi Data Stores.

2.5. ZooKeeper

Apache ZooKeeper is a centralized, replicated service for maintaining configuration information and providing distributed synchronization and group services.

LeanXcale data management uses ZooKeeper for coordination among components and to keep configuration information and the health status of each of them.

3. High Availability

The components of LeanXcale can be configured to provide a highly-available deployment. A few requirements must first be met around the layout of these components to ensure high availability in your setup.

  1. ZooKeeper must be installed on at least three nodes, and always on an odd number of machines. ZooKeeper is used for leader election and consensus, so, in the case of a network partition, the components on the side with the ZooKeeper majority remain active.

  2. At least two instances should exist for the remainder of the components.

    Most components are Active-Active, and the workload is distributed among them. In the case of a crash, the workload is transferred to the remaining components. So, depending on the sizing, the performance may be degraded until the components are recovered.

  3. The Loggers must be installed on a separate machine from the component for which they are storing the transactional information. With this approach, if one machine crashes, then the "on-the-fly" activity from any component on that machine can be recovered from its logger running on another machine. These Loggers can be configured in groups to enable replication of the transaction logging events.
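The first requirement, that ZooKeeper run on an odd number of machines, follows directly from majority-quorum arithmetic, which a few lines can make concrete:

```python
# After a network partition, only the side holding a strict majority of the
# ZooKeeper ensemble stays active; the minority side stops serving.
def has_quorum(side_size, ensemble_size):
    return side_size > ensemble_size // 2

# With 3 nodes, a 2/1 split leaves the 2-node side active:
assert has_quorum(2, 3) and not has_quorum(1, 3)

# With 4 nodes, a 2/2 split leaves NO side active, and a single extra
# failure is still fatal -- 4 nodes tolerate no more failures than 3 do,
# which is why odd ensemble sizes are recommended:
assert not has_quorum(2, 4)
assert has_quorum(3, 4) and has_quorum(2, 3)
```

In general, an ensemble of 2f+1 nodes tolerates f failures, so an even-sized ensemble spends an extra machine without gaining any fault tolerance.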

3.1. Fault Tolerance in each Component

High availability means that some component can serve requests even when another component fails. This capability is the result of configuring each component so that requests can still be served when any failure occurs.

  • ZooKeeper: As mentioned above, ZooKeeper plays the role of consensus and leader election, so at least three instances must be running on different machines. All three instances maintain the same information. If one instance dies, then the other two continue working, and the database continues functioning without any impact. After the restart of the failed instance, all three instances are realigned.

  • Configuration Manager: Two instances of the Configuration Manager must be available. However, the Configuration Manager runs as Active-Passive because it handles the operational functions of reconfiguration and health monitoring. The database could still work without a Configuration Manager as long as no reconfiguration or other events occur. If the active Configuration Manager dies, which is detected through a ZooKeeper heartbeat, then the passive one becomes active.

  • Snapshot Server: The Snapshot Server can run several instances in master-slave mode. The master Snapshot Server provides and maintains the snapshot information for all other components and keeps the slaves up to date. If the master fails, another Snapshot Server is elected and continues providing the needed information.

  • Commit Sequencer: As with the Snapshot Server, the Commit Sequencer works in master-slave mode, so it behaves almost identically on failure.

  • Conflict Manager: There may be as many Conflict Managers installed as needed. Conflicts are distributed in ranges of keys. So, depending on the key, each conflict is managed by a specific Conflict Manager. With this approach, the workload is distributed among all running Conflict Managers. Because conflicts tend to be distributed and not replicated, if a Conflict Manager fails, then all transactions that are managing conflicts on that Conflict Manager are aborted. Then, the Configuration Manager reconfigures the remainder of the Conflict Managers and redistributes the ranges of keys. After reconfiguration, the database continues working as if no failure occurred. Two instances of the Conflict Manager on different machines are sufficient, but additional Conflict Managers can be configured to handle higher transactional workloads.

  • Transactional Loggers: As mentioned above, as many transactional loggers as needed may be installed, but the minimum for high availability is two instances on different machines. All components running transactions are connected to at least one transactional logger. The Configuration Manager, unless no other option is available, connects each component to a logger running on a different machine. With this approach, if a machine and its local disk crash, then on-the-fly committed transactions can be recovered from the loggers on the other machine. When one logger stops, the Configuration Manager reconfigures the connected components to connect to another logger. After reconfiguration, transactions continue being logged. However, a component and its logger may end up on the same machine, so when a new logger is started, the Configuration Manager redistributes them to run on separate machines.

  • Query Engine: There can be as many Query Engines as needed to handle the workload, and the SQL Sessions are balanced across these Query Engines. If a Query Engine fails, then any SQL Session with a pending operation is aborted and switched to another Query Engine. By having two Query Engine instances available on different machines, you can achieve high availability.

  • KVMS: The KiVi Datastore Metadata Server works in a master-slave configuration. The master propagates all changes to all its slaves. If a slave fails, then the master does not allow any changes in the metadata until the slave runs again. If the master fails, then one of the slaves is promoted to master. KVMS instances run on different machines, and the database can still function without the KVMS as long as no metadata requests occur. With this configuration, the chance of being left without a working master KVMS is low.

  • KVDS: The KiVi Datastore Server manages the data and its persistence. It is very complex because it handles data I/O as well as advanced filters and aggregates to ensure that minimal information flows into the Query Engine. Each KVDS is a shared-nothing component and works with a local set of files (unless you use a storage cabinet). Every deployment includes multiple KVDS components, across which the data is distributed. On failure, the data that one KVDS was serving is no longer available, but the data managed by the remaining KVDS components remains available. So, while some KVDS is always available to serve requests, this is the scenario where replication becomes necessary, as described in the following section.

3.2. Replication

Replication is necessary to ensure that components and data are available. KVDS can be run in a replication mode to guarantee that multiple KVDS components manage the same information.

LeanXcale leverages its innovative transaction management for replication, meaning that one partition table and its replicas are modified as if they are part of a single transaction. No additional mechanism is needed, so this replication approach is very efficient. Of course, you must multiply the number of KVDS components needed by the replication factor desired.

The default replication schema is ROWA (Read One Write All), with which reads go to only one of the KVDS instances that has the information, while writes go to all the replicas, keeping them aligned. By default, all reads go to the same KVDS so that you can leverage a hot cache. However, very hot partitions or tables can be replicated in an alternate way where reads are balanced across all replicas. In case of failure of one replica, the request is forwarded to another replica, and the failing replica must then be resynchronized after a restart.
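The ROWA routing described above can be sketched in a few lines. The replica layout and failover policy are simplified assumptions for illustration:

```python
# A minimal ROWA (Read One Write All) sketch: writes go to every replica,
# reads go to a single one (the same one by default, to keep its cache hot).
replicas = [{}, {}, {}]   # three hypothetical KVDS replicas of one partition

def write(key, value):
    for replica in replicas:          # write all: replicas stay aligned
        replica[key] = value

def read(key, preferred=0):
    # Try the preferred replica first; fail over to any other if needed.
    for i in (preferred, *range(len(replicas))):
        if key in replicas[i]:
            return replicas[i][key]

write("k", 42)
assert all(r["k"] == 42 for r in replicas)  # all replicas hold the value
replicas[0].clear()                          # replica 0 fails
assert read("k") == 42                       # read is served by another replica
```

Because writes already reach every replica inside the same transaction, a failed-over read sees exactly the same committed value, which is why no extra synchronization mechanism is required.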

3.3. Conclusion for High Availability and Replication

LeanXcale supports powerful mechanisms for high availability and replication. These mechanisms are not constrained to any particular factor, so configurations with a higher replication factor or more high-availability instances can be set up.

LeanXcale provides default configurations for high availability and replication. These default configurations instantiate at least two instances of each component (except for ZooKeeper, which requires three) and start two replicas of the data, managed by different KVDS components. By maintaining two replicas distributed across different machines and running two instances of each component, the database is guaranteed to be resilient to the complete failure of a single machine. If a deployment with stronger fault tolerance is required, then the number of replicas can be increased, as can the number of high-availability instances.