Naive Approach
Let us look at the code of the naive approach. First, let us look at the structure of each new approach. Basically, it is a class where we will redefine these two methods or only one of them: starting with the create table as can be seen below:
public class Try1_NaiveApproach extends Try_abstract {
public void createTable() throws SQLException {
public void ingestData() {
The new class inherits from the abstract class Try_abstract
and redefines
the methods createTable
and ingestData
.
The method createTable
states how to create the table, and the ingestData
method
has the main loop to ingest all the rows.
When the createTable
does not change, we will not repeat it and only show the changes
in the ingestData
method.
Here, we will use the following createTable
method that will be reused across a
number of approaches:
public void createTable() throws SQLException {
String SQL_CREATE_TABLE_HASH =
"CREATE TABLE " + TABLE_NAME + " (" +
COLS +
"PRIMARY KEY (client_id,loan_id)" +
")" +
" PARTITION BY HASH (client_id)" +
" DISTRIBUTE BY HASH";
utils.createTable(SQL_CREATE_TABLE_HASH);
}
As it can be seen the CREATE TABLE defines the columns of the table (COLS
is a string with
the definition of all columns in the table) and the primary key.
Additionally, it has PARTITION BY and DISTRIBUTE BY clauses.
For now, we will ignore these clauses and we will talk about them later.
Let us now take a look to the ingestData method that has the main loop ingesting the data:
As it can be seen it creates a reader for the CSV, LoansReader
.
The method read
will return a row or NULL
if the file reach the end.
Problem:
The main problem with this approach is that you run the client application on your laptop and the database server on a remote cloud host.

The latency between your laptop and the database server is very high (say 100ms that is around the latency between Madrid and New York), which means that you can insert a row at the frequency allowed by the network latency, i.e. in a second there are 1000ms, each round trip from the laptop to the server takes 200ms, so you can do a maximum of 1000/200=5 round trips per second. This means that even if the insertion is instantaneous, you can only insert at most 5 rows per second.

Solution:
The solution is to run the application in the same data center as the database server.