Lxld User’s Guide

This document provides a guide to the Lxld loader.

1. Loader Usage

The loader is a program to load:

  • CSV files

  • Synthetic loads, generated as configured by a tuple description file.

To run the loader, use its JAR file, which has a name like lxld.v2.0.jar. It needs lxjdbcdriver.jar in the CLASSPATH so it can use the leanXcale JDBC driver.
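
For example, assuming the main class is called Lxld (as the usage line below suggests) and both JAR files are in the current directory, a direct invocation could look like this:

unix$ java -cp lxld.v2.0.jar:lxjdbcdriver.jar Lxld -h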

A more convenient way to run the loader is to download its package from the distribution, usually a file with a name like lxld.v2.0.tgz, and extract it.

This creates a file bin/lxld that can be used to run the loader, and a lib directory containing the JARs. The lxld command can be called from any directory, like any other program. For example:

unix$ tar zxf lxld.v2.0.tgz
unix$ bin/lxld -h
usage: Lxld [-nhvq] [-i n] [-Dflags] [-b batch] [-C ncache] [-c count]
		[-r nprocs] [-p nprocs|-s nprocs] cfgfile [csvfile...]

Common options are

  • -n: Perform a dry run and do not load anything.

  • -v: Makes the program verbose (reports each row as it is loaded).

  • -q: Makes the program quiet.

  • -Dflags: Set the given debug flags.

  • -b: Set the load batch size (defaults to 1000).

  • -C: Set the loading cache size (in rows; tuples are sorted in the cache before being loaded). There is one cache per server, or a single one if per-server loading is not used. The default size is 5MiB.

  • -c: Set a limit on the count of rows loaded/generated.

  • -p: Set the total number of loading processes (defaults to 10). Use either this or -s.

  • -s: Set the number of per-server loading processes (defaults to 2). Use either this or -p.

  • -i: Do not load, but compute n split intervals for the table loaded.

  • -r: When loading CSV files, use n reader processes per CSV file for large CSVs (defaults to 5).

The configuration file is the only mandatory argument. When the configuration is not used to generate synthetic loads, the CSV file(s) to load follow as arguments.

CSV files given are processed concurrently. For files larger than a few MiB, multiple reading processes are created. To use a single one (and preserve the loading order), use -r 1.

Rows retrieved for loading are added to an in-memory cache, where they are sorted before loading to improve locality.

When the cache gets full, the loader processes load its contents, while a separate cache is filled up for further loading. If the filling cache gets full while the other one is still being loaded, no further rows are taken until the ongoing load finishes.

To disable caching, set a cache size of 0 with flag -C.

By default, the loader uses 2 per-server processes to perform the load, unless flag -p asks for n global loader processes instead.

Actual load happens one batch at a time (1000 rows by default). Option -b can be used to set the batch size.
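
For example, to load with batches of 5000 rows (dat.cfg and dat1.csv are placeholder names):

unix$ lxld -b 5000 dat.cfg dat1.csv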

The JDBC user, password, URL, database, schema, and table names are set using the configuration file.

The default user is lxadmin when not given and, if no password is given, the environment variable LXPASS is used to retrieve the password. The password must be set in one way or another.

Loading stats are always reported every 5 seconds unless flag -q (quiet) is given. Flag -v makes the program (very) verbose, to report each row as sent for loading.

Stats report the rows committed during the load. Rows sit in the cache before being loaded, so statistics are updated only as caches are flushed.

Errors related to configuration or table errors cause the loader to stop and abort. This usually happens before loading any data.

Errors while loading data cause the loader to print the error and the offending row, and then continue trying to load the rest of the data.

If there was any error at all, the loader exits with failure.
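
Since the exit status reflects this, scripts can check it; for example:

unix$ lxld lxld.cfg data.csv || echo 'load failed'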

2. Usage examples

In the following examples, we use the loader to feed synthetic loads. To load a given CSV file, e.g., data.csv, simply add the CSV file name(s) to the argument list.

By default, the loader uses a caching loader that relies on two loader processes (per server).

unix$ lxld  lxld.cfg
lxld: loader TblCLd[REGTBL] cache 134217728
now: 5s: 31700 rows 6334 tps	total: 5s: 31700 rows 6334 tps
now: 5s: 38000 rows 7589 tps	total: 10s: 69700 rows 6962 tps
...
total: 10s: 200000 rows 6600 tps

The loader prints (unless quiet, -q) the stats for rows loaded in the last 5 seconds and in total for the whole load.

As noted above, we used synthesized rows. To load CSV files instead, we would run something like the following, supplying the CSV file names:

unix$ lxld  dat.cfg dat1.csv dat2.csv
lxld: loader MTCLd[RMI_COBROS_V2] cache 5242880
loading CSV[RMI_CALCULOS-RMI-RMI_COBROS_V2.csv:0:159753317]...
loading CSV[RMI_CALCULOS-RMI-RMI_COBROS_V2.csv:159753317:319506634]...
loading CSV[RMI_CALCULOS-RMI-RMI_COBROS_V2.csv:479259951:639013268]...
loading CSV[RMI_CALCULOS-RMI-RMI_COBROS_V2.csv:319506634:479259951]...
loading CSV[RMI_CALCULOS-RMI-RMI_COBROS_V2.csv:798766585:0]...
loading CSV[RMI_CALCULOS-RMI-RMI_COBROS_V2.csv:639013268:798766585]...
now: 5s: 0 rows 0 tps	total: 5s: 0 rows 0 tps
now: 5s: 0 rows 0 tps	total: 10s: 0 rows 0 tps
now: 5s: 0 rows 0 tps	total: 15s: 0 rows 0 tps
now: 5s: 0 rows 0 tps	total: 20s: 0 rows 0 tps
now: 5s: 502000 rows 100400 tps	total: 25s: 502000 rows 20067 tps
now: 5s: 511000 rows 102179 tps	total: 30s: 1013000 rows 33748 tps

As we used caching (the default), no rows were loaded until a whole cache was full and started to be flushed to the server, while another cache was being filled up.

All CSV files given are loaded concurrently (but share the loaders and cache). Large CSV files are split into parts loaded concurrently. By default, 5 parts are made for each large CSV, unless flag -r says otherwise. For example, with -r 1, no CSV file is split for loading.
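
For example, to read a large CSV with a single reader process, preserving its order (big.csv is a placeholder name):

unix$ lxld -r 1 dat.cfg big.csv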

To see more about the load, flag -D lists debug information, including the loaders created.

unix$ lxld -D lxld.cfg
TblCLd[REGTBL]: tbl loader
TblCLd[REGTBL]: regions...
TblCLd[REGTBL]: loaders...
Ld[ds102.0]: LX loader
Ld[ds102.1]: LX loader
MTCLd[ds102]: MTC loader
Ld[ds103.0]: LX loader
Ld[ds103.1]: LX loader
MTCLd[ds103]: MTC loader
Ld[ds100.0]: LX loader
Ld[ds100.1]: LX loader
MTCLd[ds100]: MTC loader
Ld[ds101.0]: LX loader
Ld[ds101.1]: LX loader
MTCLd[ds101]: MTC loader
lxld: loader TblCLd[REGTBL] cache 134217728
now: 5s: 31700 rows 6334 tps	total: 5s: 31700 rows 6334 tps
now: 5s: 38000 rows 7589 tps	total: 10s: 69700 rows 6962 tps
...
total: 10s: 200000 rows 6600 tps

Here, there is a table-based caching loader (a per-server loader with caching), which creates an MTC loader with two underlying loader processes for each server. The same effect can be achieved with flag -s 2.

We can ask for a flat set of loader processes instead of using a per-server loading structure. Here we ask for 4 loader processes. We leave the debug flag -D set so you can see the underlying loading structure.

unix$ lxld -D -p 4 lxld.cfg
MTCLd[REGTBL]: MTC loader
Ld[REGTBL.0]: LX loader
Ld[REGTBL.1]: LX loader
Ld[REGTBL.2]: LX loader
Ld[REGTBL.3]: LX loader
lxld: loader MTCLd[REGTBL] cache 134217728
now: 5s: 31700 rows 6334 tps	total: 5s: 31700 rows 6334 tps
now: 5s: 38000 rows 7589 tps	total: 10s: 69700 rows 6962 tps

Or, we can ask for 4 per-server processes instead:

unix$ lxld -D -s 4 lxld.cfg
TblCLd[REGTBL]: tbl loader
TblCLd[REGTBL]: regions...
TblCLd[REGTBL]: loaders...
Ld[ds102.0]: LX loader
Ld[ds102.1]: LX loader
Ld[ds102.2]: LX loader
Ld[ds102.3]: LX loader
MTCLd[ds102]: MTC loader
Ld[ds103.0]: LX loader
Ld[ds103.1]: LX loader
Ld[ds103.2]: LX loader
Ld[ds103.3]: LX loader
...

To disable caching, or set caching to a desired size (in rows), use -C.

unix$ lxld  -C 0  lxld.cfg
lxld: loader TblLd[REGTBL] cache 0
...

This can also be done when using a flat set of loader processes (and not per-server loaders).

unix$ lxld -D -C 0 -p2  lxld.cfg
MTLd[REGTBL]: MT loader
Ld[REGTBL.0]: LX loader
Ld[REGTBL.1]: LX loader
lxld: loader TblLd[REGTBL] cache 0
...

It is possible to compute splits for CSVs to load. In this example, we take 1000 samples and build 5 splits from them:

unix$ lxld -C 1000 -i 5  lxld.cfg file1.csv
lxld: loader Xample[xample] cache 1000
# splits:
split: Supplier#000000004,15
split: Supplier#000000062,19
split: Supplier#000003235,17
split: Supplier#000003297,17
split: Supplier#000006454,15

3. Field Types

Field types can be specified in the configuration, but note that JDBC casts field types on its own in any case.

Known types are:

  • bool: Boolean values.

  • int: Integer values (Java Long values).

  • float: Floating point values (Java Double values).

  • dec: Decimal values (Java BigDecimal values).

  • str: String values.

  • bytes: Raw-memory values (Java byte[] values).

  • date: Date values (Java SQL Date values).

  • time: Time values (Java SQL Time values).

  • ts: Timestamp values (Java SQL Timestamp values).

These types are used also when generating synthetic loads. That is, expressions that define how to compute a field value have one of these types.

Extra types known are

  • byte: Byte values.

  • int16: 16-bit integer values (Java Short values).

  • int32: 32-bit integer values (Java Integer values).

  • int64: 64-bit integer values (Java Long values).

When these types are used in synthetic loads, expressions compute the field value using one of the known types and, finally, the value is converted to the specific extra type used.
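
For example (a sketch; the field name is made up, using the fld syntax described in the next section), a field declared with one of the extra types has its generated value converted to that type:

# rand(100) computes an int (a Java Long); the result is converted to int16
fld 'qty' int16 : rand(100)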

4. Configuration File

The file consists of lines that set load parameters, define sets of values for synthetic loads, or define fields to be loaded.

A line may be continued with indented lines if it becomes too long to fit in a single line.

The character # may be used to add comments to the file.
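
For example (a sketch; the set name and values are made up), a long definition may be continued on an indented line, and # starts a comment:

# a set of names, continued on a second (indented) line
def names: 'john', 'mary',
	'peter', 'paula'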

A line setting a parameter has the format

set parameter value

A line defining a set of values (useful for synthetic loads) has the format

def name : expr, expr, ..., expr

where name is the name for the set, and the given expressions are the values in the set.

A line defining a field has the format

fld name type : expr

or

fld name type

when no expressions to generate synthetic loads are given.

In some cases, we want to generate synthetic field values and add them to the fields read from a CSV. In such cases, a line can define an additional field using the syntax

add name type: expr

similar to the line defining a field. The difference is that a field named name is added to each row read from the CSV.

Such additional fields should be defined after the other fields, as a convention.
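
For example (a sketch; the loadts field name is made up), a configuration could append a load timestamp to each row read from the CSV, using the predefined now value described later:

fld 'id' int32
fld 'name' str
add 'loadts' ts : now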

When computing intervals (flag -i), the configuration file should specify which fields are keys, because no DB table will be used to perform the computation. A line of the form

key name type

or

key name type : expr

is similar to a field definition, but sets the field as a key field. The key order is the definition order, unless a key position is given. To specify key positions (counting from 0), use lines like:

key N name type

or

key N name type : expr

where N is the position of the field in the key (0 is the first key field, 1 is the second one, etc.).

In general, it is a good idea to define expressions for fields in the configuration file, so the same file can be used both for loading actual CSV files and for feeding synthetic loads to the system under test.

4.1. Parameters

To set a parameter, use a line with the format

set parameter value

The value can be a literal or a quoted literal (to make sure special characters, operators, and blanks are taken as part of the value).

This parameter must be set:

set table 'TABLENAME'

where TABLENAME specifies the table to load.

These parameters might be set too:

set db 'DBNAME'
set schema 'SCHEMANAME'
set user 'USERNAME'

where DBNAME, SCHEMANAME, and USERNAME specify the database, schema, and user to use for the load.

When not set, they default to db, APP, and lxadmin.

The password can be set using

set pass 'PASSWD'

where PASSWD is the password. When no password is set, the loader retrieves its value from the LXPASS environment variable.

A password is needed unless it is a dry run.
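
For example (a sketch; the password shown is made up), the password can be supplied through the environment:

unix$ LXPASS=mysecret bin/lxld lxld.cfg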

The default URL to locate the service is jdbc:leanxcale://localhost:14420/db (where db is replaced by the database name). It can be changed with a line like:

set url 'jdbc:leanxcale://localhost:14420/db'

4.2. Fields

A field definition looks like

fld 'NAME' TYPE : EXPR

or

fld 'NAME' TYPE

The loaded rows are expected to have one field per field defined, in the order they are defined by the configuration file. The table may have more fields or may have them in a different order.

The field name is given by NAME, and the type is specified using TYPE.

For example:

fld 'fld4' int32
fld 'fld2' str
fld 'fld1' int64
fld 'fld0' int64
fld 'fld3' int64

defines fields known by the table as fld4, fld2, fld1, fld0, and fld3, giving the type for each one. The order in the input rows is implied by the order of the definitions.

This is a full configuration example for loading a CSV for a table of persons, with an integer field for the person ID and a string field for the person name.

set db db
set schema app
set table PERSONS
fld 'id' int32
fld 'name' str

It is possible to specify that particular fields are keys. This should be done when computing intervals, because no DB table is used in that case. For example:

fld 'fld4' int32
fld 'fld2' str
key 'fld1' int64
fld 'fld0' int64
key 'fld3' int64

is similar to an example above, but makes the key be (fld1, fld3).

In this other case

fld 'fld4' int32
fld 'fld2' str
key 1 'fld1' int64
fld 'fld0' int64
key 0 'fld3' int64

the key becomes (fld3, fld1) instead.

4.3. CSV parameters

CSV loading can be adjusted by using parameters shown in this example:

# set the field separator to blanks
set fldsep	''
# set the field separator to the value (one char only)
set fldsep	'|'
# skip a header line in the CSV files
set skiphdr	yes
# limit the loading to the given number of rows
set count 100000
# set the comment characters to any of the ones in the given value
set comment '#|;'
# set the string used for null field values (otherwise use empty fields)
set nullstr '#null'
# set the input encoding to the named one (valid java charset names only).
# defaults to UTF-8
set encoding 'ISO-8859-1'
# set the character used to separate decimal digits in numbers (default '.')
set decsep ','
# make field/record separator take precedence over quotes in string fields
set usesep

If the field separator is set to the empty string, it means blanks. Otherwise, the first character in the given value is used as the field separator.

When decsep is used, dots or commas separating thousands are removed. For example, with set decsep ',' a value like 1.234,56 is read as 1234.56.

4.4. Value Sets for Synthetic Loads

A field definition may include how to generate field values for synthetic loads. In this case, the configuration can still be used to load CSV files, but, when no CSV file names are given to the loader, it generates field values using the given definitions.

When doing so, it is useful to define named sets of values. This is done with lines of the format

def values: val1, val2, ..., valn

where values is the name given to the set (use any identifier desired) and the list of values describes the actual values in the set.

The values in the set must be constant values, but basic arithmetic is tolerated, and explicit type names can be used as functions to indicate the type for a value.

When no explicit type is given, a value that looks like a floating point value is considered either a float or a dec, depending on the number of digits. When it fits in a double, it becomes a float.

In the same way, a value that looks like an integral value becomes either an int or a dec depending on the number of digits.

The same goes for date, time, ts, and bool values.

For date values, months go from 1 to 12, and year values less than 100 are considered years of this century (e.g., date(13, 12, 19) means 2013-12-19).

Otherwise, a value is considered a string unless an explicit type is given.

For example, these are valid definitions:

def values: 1, 3, 5, 7
def decimals: dec(1), 3, 5, 7
def moredecimals: dec(3), dec(5), dec(7)
def otherdecimals: 432432432.23423432432566, 43.45
def floats: 1, 3.4, 5, 7

In the second one, specifying a decimal for the first value forces all remaining values to become decimals, because decimal is more generic than int.

Number types, listed from more general to more specific, are dec, float, and int.

Here are more definitions

# all of them use bytes
def values: bytes("hi there"), "and there"
# but these are strings
def strs: "hi", "there", "and there"
# the epoch date, and one for the time 43243434265423
def dates: date(0), date(43243434265423)

4.5. Field Expressions for Synthetic Loads

To generate synthetic loads, define fields with an expression for the field value. The format is

fld 'name' type: expr

where the field number is implied by the order of the definitions, and the field value is defined by the expression given. The position of the fields determines the tuple loaded, and the field names are used to decide which table field to load.

This is a full example:

def values: 1, 3, 5, 7
def decimals: dec(1), 3, 5, 7
fld fld1: pick(values)
fld fld2: rand(decimals) + 3.2
fld fld3: 1000 * row + rand(10, 20, 30)

In this case, we specify that we want to create tuples with three fields:

  • the first field takes values from the values value set, in round-robin. The first tuple would use 1, the second tuple would use 3, and so on.

  • the second field takes values at random from the decimals value set, but adds 3.2 to each value taken.

  • the third field multiplies the row number by 1000 and adds a random value from the given set of values.

As we gave no type for each of the fields, the type implied by the expression is used.

5. Synthetic Load Expressions

Expressions can use the following predefined names:

  • row: the row number (0, 1, etc.).

  • $N: the N-th field defined. It must be a field defined before this one.

  • today: the current date.

  • now: the current time.
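
For example (a sketch; field names are made up), these names can be combined with the functions listed below:

# a date advancing one day per row generated
fld 'd1' date : addday(today, row)
# the same value as the first field defined ($0)
fld 'd2' date : $0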

Expressions can use the following arithmetic operators:

  • +: adds two values or concatenates strings or bytes.

  • -: subtracts values (or changes the sign).

  • *: multiplies values.

  • /: divides values.

  • %: gets the remainder of dividing the given values.

  • &: performs the bit-and of the given values.

  • |: performs the bit-or of the given values.

  • ~: performs the bit-not of the given value.

  • <: left is less than right.

  • <=: left is less than or equal to right.

  • >: left is greater than right.

  • >=: left is greater than or equal to right.

  • ==: left is equal to right.

  • !=: left is not equal to right.

  • &&: boolean AND.

  • ||: boolean OR.

  • !: boolean NOT.

Evaluation order is:

  • Unary minus is performed first, then

  • bit operations, then

  • multiply, divide, and remainder, then

  • addition and subtraction, then

  • compare operations, then

  • boolean NOT, then

  • boolean AND, then

  • boolean OR

Use parentheses when this order is not the desired one.
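
For example (a sketch; the field name is made up), the remainder is computed before the comparison, so this expression needs no parentheses:

# true for even rows: evaluated as (row % 2) == 0
fld 'even' bool : row % 2 == 0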

Expressions can use the following functions:

  • pick(name): picks one value, in round-robin, from the named value set.

  • pick(val1, val2, ..., valn): picks one value, in round-robin, from the given values.

  • rand(name): picks one value at random from the named value set.

  • rand(num): a random number from 0 to num. It is an integer if num is an integer, or floating point if num is a floating point value.

  • rand(val1, val2, ..., valn): picks one value at random from the given values.

  • last($N): the value for the N-th field defined as found in the last tuple generated.

  • int(val): the value as an int value.

  • float(val): the value as a float value.

  • str(val): the value as a str value.

  • bytes(val): the value as a bytes value.

  • dec(val): the value as a dec value.

  • date(val): the value as a date value.

  • date(year, month, day): the values as a date value.

  • time(val): the value as a time value.

  • ts(val): the value as a ts value.

  • year(val): the year from value.

  • mon(val): the month from value.

  • day(val): the day from value.

  • hour(val): the hour from value.

  • min(val): the minutes from value.

  • sec(val): the seconds from value.

  • len(val): the length of the string value.

  • substr(str, n1, n2): the substring str[n1:n2].

  • repl(val, rexp, str): the string replacing in val the first match of the regular expression rexp by str.

  • replall(val, rexp, str): the string replacing in val all matches of the regular expression rexp by str.

  • addyear(val, n): the date/time/timestamp resulting from adding n years to val.

  • addmon(val, n): the date/time/timestamp resulting from adding n months to val.

  • addday(val, n): the date/time/timestamp resulting from adding n days to val.

  • addmin(val, n): the date/time/timestamp resulting from adding n minutes to val.

  • addsec(val, n): the date/time/timestamp resulting from adding n seconds to val.

  • addmsec(val, n): the date/time/timestamp resulting from adding n milliseconds to val.

Note that values given to functions can be expressions and not just literal values.
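
For example (a sketch; the field name is made up), a function argument can itself be an expression:

# the bound for rand grows with the row number
fld 'n' int : rand(10 * (row + 1))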

6. Synthetic Load Examples

We include some random examples and the first rows they generate.

Running with the config

fld: rand(10)

yields

1
3
7

Running with the config

def words: a, b, c
fld: rand(words)

yields

b
a
b

Running with the config

fld: rand('a', 'b')

yields

a
a
a

Running with the config

def words: a, b
fld: pick(words)

yields

a
b
a

Running with the config

fld: now

yields

2023-11-20-12:56:41.696
2023-11-20-12:56:41.696
2023-11-20-12:56:41.696

Running with the config

fld: row
fld: row + int(3.54453453458923442543534)

yields

0,3
1,4
2,5

Running with the config

fld: row + 1.42342343895439293423

yields

1.42342343895439293423
2.42342343895439293423
3.42342343895439293423

Running with the config

fld: addday(date(13, 12, 19), 7*row)

yields

2013-12-19
2013-12-26
2014-01-02

Running with the config

fld: repl('aaabbbaa', '[a]*', 'x') + 'Z'*row

yields

xbbbaa
xbbbaaZ
xbbbaaZZ

Running with the config

def vals: float(3+-4), 4324.53, 3, 4
def words: hola, que, tal, (foo * 4)
fld: 3 + row
fld: pick(words)
fld: last($1) * 2
fld: 'xxxx' * row

yields

3,hola,"",""
4,que,holahola,xxxx
5,tal,queque,xxxxxxxx

Running with the config

fld: row > 1 || row == 0
fld: row

yields

[true, 0]
[false, 1]
[true, 2]

Running with the config

# define a set of floats
def vals: float(3+-4), 4324.53, 3, 4
# define a set of strings
def words: hola, que, tal, (foo * 4)
# define a field value as the row number + 3
fld: 3 + row
# define another field as one of the words from words, in round-robin.
fld: pick(words)
# another field as the last value of 2nd field (a string), twice
fld: last($1) * 2
# another field as a string repeated as many times as the row number
fld: 'xxxx' * row

yields

[3, hola, , ]
[4, que, holahola, xxxx]
[5, tal, queque, xxxxxxxx]
[6, foofoofoofoo, taltal, xxxxxxxxxxxx]
[7, hola, foofoofoofoofoofoofoofoo, xxxxxxxxxxxxxxxx]
[8, que, holahola, xxxxxxxxxxxxxxxxxxxx]
[9, tal, queque, xxxxxxxxxxxxxxxxxxxxxxxx]
[10, foofoofoofoo, taltal, xxxxxxxxxxxxxxxxxxxxxxxxxxxx]
[11, hola, foofoofoofoofoofoofoofoo, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx]
[12, que, holahola, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx]