
Introduction

When inserting data into a distributed database such as LeanXcale, adherence to specific practices is crucial for optimizing performance. In this guide we will go through a progression from a rudimentary and inefficient method to the most effective approach, highlighting the drawbacks of each method and how subsequent strategies overcome them.

We will use an open dataset with real data from Lending Club (available on Kaggle: https://www.kaggle.com/wordsforthewise/lending-club). The dataset is in a file, `loans.csv`, which we ingest several times with different timestamps to reach a volume appropriate for this training. The file can be found in the resources section of the jar.

Let us first take a look at the structure of the code and the auxiliary classes. Each approach does three things: it creates the table, ingests the data, and cleans up the database. Since all approaches share this structure, we use a class hierarchy rooted in an abstract class, `Try_abstract`, which defines the methods that every approach has and also provides the code that is reused across approaches, such as cleaning up the database. Each approach is presented as a new concrete class that inherits from `Try_abstract` and contains only the changes introduced by that approach. The first approach is in the class `Try1_NaiveApproach`, and it falls into most of the common pitfalls of data ingestion. After each approach we identify a problem and suggest a solution, which is then implemented in the next approach, and so on until we reach the optimal way to ingest data.
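To make the hierarchy concrete, here is a minimal sketch of how such a structure could look. Only the class names `Try_abstract` and `Try1_NaiveApproach` come from this guide. The method names (`createTable`, `ingestData`, `cleanUp`, `run`), the demo class, and the in-memory list that stands in for real database calls are assumptions made for illustration, and the actual code in the repository may be organized differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: method names and the in-memory "steps" list are
// assumptions for illustration, not the actual code from the repository.
abstract class Try_abstract {
    // Records the order in which the steps run, standing in for database calls.
    protected final List<String> steps = new ArrayList<>();

    // Steps that each approach implements in its own way.
    protected abstract void createTable();
    protected abstract void ingestData();

    // Shared code that every approach reuses unchanged.
    protected void cleanUp() {
        steps.add("cleanUp");
    }

    // Fixed sequence common to all approaches.
    public final List<String> run() {
        createTable();
        ingestData();
        cleanUp();
        return steps;
    }
}

// A concrete approach only supplies the parts that differ.
class Try1_NaiveApproach extends Try_abstract {
    @Override
    protected void createTable() {
        steps.add("createTable");
    }

    @Override
    protected void ingestData() {
        steps.add("ingestData");
    }
}

class IngestionSketchDemo {
    public static void main(String[] args) {
        // Prints: [createTable, ingestData, cleanUp]
        System.out.println(new Try1_NaiveApproach().run());
    }
}
```

With this layout, a later approach would only need to extend `Try_abstract` and override the step it changes, while the shared sequence and the cleanup code stay in one place.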

There are also a number of auxiliary classes that support one or more of the approaches but are not the focus of this guide, including a class to read the CSV file, a utils class with a variety of helper methods, and an avl class implementing an AVL tree that is used in some of the approaches.

The full code for all approaches and auxiliary code is available at:

https://gitlab.com/leanxcale_public/highdataingestion