Monitoring and Alerting System

This document provides a brief overview of the LeanXcale Monitoring and Alerting System, as well as a practical guide to creating new alerting and recording rules in Prometheus.

1. Architecture

LeanXcale monitoring is built on two open-source solutions: Prometheus and Grafana.

The following image shows the architecture of components:

overview

The LeanXcale Monitoring and Alerting System consists of the following elements:

  • Prometheus: Centralizes and stores metrics in a time series database.

  • Alert Manager: Triggers alarms based on the information stored in Prometheus (it is also part of the Prometheus software).

  • Exporters: Run on the monitored hosts and export metric sets to feed Prometheus:

    • Mastermind exporter: Publishes metrics from the following Lx components: Snapshot Server, Commit Sequencer and Config Manager.

    • node-exporter: Comes by default with the Prometheus installation and exports a wide range of machine-performance metrics.

    • metricsExporter.sh: Exports metrics about Lx components at two levels:

      • Lx component Java code running inside the JVM.

      • Lx component processes at OS level (I/O, memory, …)

In addition, it is possible to feed Prometheus with additional metrics. This only requires key-value files placed in the /tmp/scrap path. These files are commonly used to export metrics produced by specific processes. For example, this is the file generated by the TPCC benchmarking tests:

appuser@5c050e45b4d0:/tmp/scrap$ cat escada.prom
tpccavglatency 101
tpmCs 28910
aborttpmCs 124
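For instance, any wrapper script can drop its own metrics file in the same way. A minimal sketch, assuming the /tmp/scrap convention described above (the metric names below are invented for the example):

```shell
# Sketch: publish custom metrics via the /tmp/scrap key-value file convention.
# Metric names are invented for the example.
SCRAP_DIR=/tmp/scrap
mkdir -p "$SCRAP_DIR"

# Write to a temp file first and move it into place, so a half-written
# file is never picked up by the scraper.
tmp=$(mktemp)
{
  echo "myjob_avg_latency_ms 101"
  echo "myjob_rows_processed 28910"
} > "$tmp"
mv "$tmp" "$SCRAP_DIR/myjob.prom"
cat "$SCRAP_DIR/myjob.prom"
```

The temp-file-plus-move step is just a defensive pattern: mv on the same filesystem is atomic, so the file is always seen either whole or not at all.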

Monitoring is started when the cluster is started and the central components (Prometheus and Grafana) are deployed in the metadata server.

2. Starting Monitor

You just need to run the following script:

./lx/monitor/monitor.sh

It starts the Prometheus server, the Alertmanager, Grafana and node_exporter.

You also need to run another script to expose the LeanXcale metrics on port 9089:

./lx/metrics/metricsExporter.sh

It calls a Python script that collects all the metrics.
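The real exporter delegates to a Python collector, but the underlying pattern is simple: sample an OS-level figure for a process and emit it as a line in the Prometheus text format. An illustrative sketch only (NOT the real metricsExporter.sh; the metric name is invented):

```shell
# Illustrative sketch of the exporter pattern: sample an OS-level figure
# for a process and emit it in Prometheus text exposition format.
pid=$$                                    # sample this shell itself
rss_kb=$(ps -o rss= -p "$pid" | tr -d ' ')
# One sample line: metric_name{labels} value  (metric name invented)
metric_line="lx_process_resident_memory_kilobytes{pid=\"$pid\"} $rss_kb"
echo "$metric_line"
```

A real exporter would emit one such line per process and metric, and serve them over HTTP (here, on port 9089) for Prometheus to scrape.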

3. Monitoring dashboard

To access the monitoring dashboard, open the following URL in your browser:

http://{Metadata server hostname}:3000/

The following images show an example of a working monitoring dashboard:

grafana cluster metrics

grafana cluster metrics 2

4. Taking a look at Prometheus configuration

The Prometheus config file is /lx/monitor/prometheus-package/prometheus.yml:

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "alert.rules.yml"
  - "basic_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'node-exporter'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['172.17.0.2:9101','172.17.0.2:9089'] #NODE_EXPORTER#

  - job_name: 'mastermind'
    static_configs:
    - targets: ['localhost:4242'] #MASTERMIND_EXPORTER#

All these parameters can also be checked in the Prometheus console (http://172.17.0.2:9091), under the Targets and Configuration menu options.

4.1. Prometheus rules

As you can see in the config file above, rules are included in the YAML files that Prometheus finds in the rule_files tag values. All the rules defined in all files can be checked in the Prometheus console (http://172.17.0.2:9091/rules).

Prometheus supports two types of rules, which may be configured and then evaluated at regular intervals (remember that their results are stored in Prometheus as time series data).

4.1.1. Recording rules

Recording rules allow you to precompute frequently needed or computationally expensive expressions and save their results as a new set of time series. Querying the precomputed result is often much faster than executing the original expression every time it is needed, which can help improve Grafana dashboard performance.

It’s important to remark that there is no firing action for these rules; they are just calculated and stored in Prometheus in order to be queried or collected for dashboards in Grafana.

At this moment, there are a few recording rules in the LeanXcale installation; you can find them in:

/lx/monitor/prometheus-package/basic_rules.yml

Note that rules are distributed in groups whose name should be unique in the file.

The typical syntax for a recording rule is composed of two mandatory tags (record and expr). An optional tag is interval.

Let’s take an example from this file:

groups:
  - name: cluster metrics
    interval: 20s # How often rules in the group are evaluated (overrides global config)
    rules:
        - record: instance:cpu_busy:avg_rate1m # Rule name: There is no mandatory way in which you must name the rules. But *level:metric:operations* is recommended.
          expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[1m])) * 100) # expression to be evaluated to add a new value point to the time series

The rule in the example calculates an average rate of CPU usage based on the metric called node_cpu_seconds_total. Let’s see how this metric is exposed at (http://172.17.0.2:9101/metrics):

# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1276.51
node_cpu_seconds_total{cpu="0",mode="iowait"} 1.72
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 3.02
node_cpu_seconds_total{cpu="0",mode="softirq"} 44.63
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 17.02
node_cpu_seconds_total{cpu="0",mode="user"} 42.66
node_cpu_seconds_total{cpu="1",mode="idle"} 1273.47

This metric has two labels, cpu and mode. These labels can be accessed from the expression to select the specific series being evaluated.

To know more about Querying and Expression Syntax in Prometheus, please check (https://prometheus.io/docs/prometheus/latest/querying/basics/).
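To make the label matching concrete, here is a small shell sketch (using awk; the sample lines are copied from the node_cpu_seconds_total output above) that selects one labeled series and prints its value:

```shell
# Sketch: select one labeled series from Prometheus text exposition format.
# The sample lines are copied from the node_cpu_seconds_total output above.
metrics='node_cpu_seconds_total{cpu="0",mode="idle"} 1276.51
node_cpu_seconds_total{cpu="1",mode="idle"} 1273.47'

# Match on the label set and print the sample value (second field).
idle_cpu0=$(printf '%s\n' "$metrics" | awk '/cpu="0",mode="idle"/ {print $2}')
echo "$idle_cpu0"    # prints 1276.51
```

This is essentially what the selector node_cpu_seconds_total{mode="idle"} does inside a PromQL expression, except that PromQL operates over the stored time series rather than a text dump.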

5. Creating a New Alert

If you want to create a new alert:

  1. If the alert is to be generated based on Prometheus host metrics, you just need to know the metric and the expression to be evaluated. Go to the Alerting rules section.

  2. You may want to create an alert based on a precomputed operation over a metric. To read about that, check the Alerting rules based on recording rules section.

  3. If you want to send an alert directly to the Alertmanager, have a look at the LeanXcale standard format for manual alerts, and consider the two options described there: Bash scripts and the Java Log4j appender.

  4. Consider that alerts can also be configured to override the global configuration, setting a time delay to wait from the moment the expression condition is fulfilled until the effective firing of the alert. See the for tag in the Alerting rules section.

5.1. Alerting rules

Alerting rules allow you to define alert conditions based on Prometheus expression language expressions and to send notifications about firing alerts to an external service.

At this moment, alerting rules are defined in the following file:

/lx/monitor/prometheus-package/alert.rules.yml

The syntax is quite similar to that of recording rules, although there are some new tags.

groups:
- name: example
  rules:
  - alert: HighErrorRate # Alert name
    expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5 # Expression that sets the condition for the alert to be fired
    for: 10m # The alert must remain active through every evaluation for this duration before firing (explained below)
    labels: # Set of additional labels to be attached to the alert.
      severity: page
    annotations: # specifies a set of informational labels that can be used to store longer additional information such as alert descriptions or runbook links
      summary: "High request latency on {{ $labels.instance }}"
      description: "{{ $labels.instance }} has a median request latency above 0.5s (current value: {{ $value }}s)"

In labels and annotations, two variables can be used:

  • $labels: Holds the label key/value pairs of an alert instance.

  • $value: Holds the evaluated value of an alert instance (that is, the value that fired the alert).

If you set a time period in the for tag, Prometheus waits for that duration between first encountering a new expression output vector element and counting the alert as firing for that element. In the example above, Prometheus checks that the alert continues to be active during each evaluation for 10 minutes before firing it.

IMPORTANT!!!

It’s not recommended to use the $value variable in the labels tag. The reason is that the labels would then include a label called value, which will usually vary from one evaluation to the next. Since the labels of an alert define its identity, each evaluation would see a brand new alert and treat the previous one as no longer firing, so the for condition would never be satisfied.

It should be used in the annotations tag instead. Let’s see two examples:

# INCORRECT: THIS ALERT MAY NEVER BE FIRED
groups:
- name: example
  rules:
  - alert: ExampleAlertIncorrect
    expr: metric > 10
    for: 5m
    labels:
      severity: page
      value: "{{ $value }}" ######## DON'T INCLUDE $value HERE!
    annotations:
      summary: "Instance {{ $labels.instance }}'s metric is too high"

  # CORRECT ALERT
  - alert: ExampleAlert
    expr: metric > 10
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }}'s metric is {{ $value }}"

5.2. Alerting rules based on recording rules

Sometimes it can be useful in LeanXcale procedures to set alerts based on complex precomputed metrics. Let’s see an example:

groups:
  - name: recording_rules
    interval: 5s
    rules:
      - record: node_exporter:node_filesystem_free:fs_used_percents
        expr: 100 - 100 * ( node_filesystem_free{mountpoint="/"} / node_filesystem_size{mountpoint="/"} )

  - name: alerting_rules
    rules:
      - alert: DiskSpace10%Free
        expr: node_exporter:node_filesystem_free:fs_used_percents >= 90
        # Note that previous expression evaluates the metric defined in the recording rule.
        labels:
          severity: moderate
        annotations:
          summary: "Instance {{ $labels.instance }} is low on disk space"
          description: "{{ $labels.instance }} has only {{ $value }}% free."
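To see what the recording-rule expression computes, here is the same arithmetic, 100 - 100 * (free / size), worked through in shell with made-up figures (awk handles the floating-point math):

```shell
# Worked example of 100 - 100 * (free / size) with made-up figures.
free_bytes=5368709120       # 5 GiB free (assumed)
size_bytes=107374182400     # 100 GiB total (assumed)
used_pct=$(awk -v f="$free_bytes" -v s="$size_bytes" \
  'BEGIN { print 100 - 100 * (f / s) }')
echo "$used_pct"            # prints 95, so the >= 90 condition would fire
```

With 5% of the disk free, the recorded metric evaluates to 95, which satisfies the >= 90 condition of the DiskSpace10%Free alert above.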

6. Testing rules file syntax

Prometheus provides a tool to check if your rule files are syntactically correct after modifying them.

In the LeanXcale installation, this tool is in:

/lx/monitor/prometheus-package/promtool

Usage example:

./promtool check rules alert.rules.yml

Checking alert.rules.yml
  SUCCESS: 5 rules found

7. Sending manually an alert to Alertmanager

7.1. Manual Alerts format

A manual alert contains the following tags:

  • status: The alert’s activation state (firing, resolved)

  • labels:

    • alertname: Unique ID for the alert.

    • alertType: Represents the alert type; the idea is to classify alerts based on the possible anomalous situations that can occur in the different LeanXcale components.

    • severity: Severity level (critical, warning)

    • component: LeanXcale component that sends the alert to the Alertmanager (KVDS, LgLTM, QE, ZK, LgSnS, LgCmS, MtM, CflM, KVMS).

    • serviceIp: IP/Hostname:Port where the component that fires the alert is running.

  • annotations:

    • summary: A descriptive message of the problem.

An example of an alert to be fired would look like this:

[{
  "status": "firing",
  "labels": {
    "alertname": "1234567890",
    "alertType": "alertType1",
    "severity": "warning",
    "component": "KVDS",
    "serviceIp": "172.17.0.2:9993"},
  "annotations": {
    "summary": "Alarm message"}
}]
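A shell sketch of how such a payload could be assembled from variables (the POST to the Alertmanager is shown commented out, since it needs a running Alertmanager; the URL is an assumption for a local instance):

```shell
# Sketch: build the manual-alert JSON payload shown above from shell variables.
STATUS="firing"; ALERTNAME="1234567890"; ALERTTYPE="alertType1"
SEVERITY="warning"; COMPONENT="KVDS"; SERVICEIP="172.17.0.2:9993"
SUMMARY="Alarm message"

payload=$(printf '[{"status":"%s","labels":{"alertname":"%s","alertType":"%s","severity":"%s","component":"%s","serviceIp":"%s"},"annotations":{"summary":"%s"}}]' \
  "$STATUS" "$ALERTNAME" "$ALERTTYPE" "$SEVERITY" "$COMPONENT" "$SERVICEIP" "$SUMMARY")
echo "$payload"

# To deliver it to a running Alertmanager (not executed here):
# curl -s -XPOST -d "$payload" http://localhost:9093/api/v1/alerts
```

This is essentially what the sendAlert.sh script described in the next section automates for you.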

7.2. Alerts from Bash scripts

It’s possible to send an alert to the Alertmanager using the sendAlert.sh script included in the LeanXcale installation. You can find it in:

/lx/monitor/sendAlert.sh

This script looks up, in the Prometheus config file, the host and port where the Alertmanager is running.

 ./sendAlert.sh <firing|resolved>  # Alert status
                <warning|critical> # Alert severity
                <AlertId>          # Alert unique ID
                <AlertType>        # Alert Type (indicates an anomalous situation)
                <Component>        # Lx Component that fires alarm
                <ServiceIP>        # hostname/IP:Port
                "<Message>"

Let’s see how we would have fired the alert from the example above:

./sendAlert.sh firing warning 1234567890 alertType1 KVDS "172.17.0.2:9993" "Alarm message"

There is a script in the LeanXcale installation that can help us check whether this is really working (you should have started the LeanXcale Monitor previously; see the first section). It takes one optional argument that indicates how much of the alert information you want to see:

  • simple (default): Just the alert name, start datetime and summary message.

  • extended: All the alert’s tags in tabular format.

  • json: All the alert’s tags (even internal ones) in JSON format.

/lx/monitor/alarm_console.sh [simple|extended|json]

The previously fired alert will be shown like this:

Alertname   Starts At                Summary
1234567890  2019-05-29 09:15:36 UTC  Alarm message

And we can also see the alert in the Alertmanager console.

To resolve the alert, just run the script again with the resolved status:

./sendAlert.sh resolved warning 1234567890 alertType1 KVDS "172.17.0.2:9993" "Alarm message"

7.3. Alerts from Java (Log4j appender)

The lx-dependencies project (https://gitlab.lsdupm.ovh/lx/lx-dependencies) configures a common Log4j appender for Java code. It allows you to send an alert to the Alertmanager.

This appender can be found in:

./lx-dependencies/TM/common/src/test/resources/log4j2.properties
appender.http.type=Http
appender.http.name=http
appender.http.layout.type=PatternLayout
appender.http.layout.pattern=%m
appender.http.url=http://172.17.0.2:9093/api/v1/alerts

...

#AlertManager
logger.ALERTMANAGER.name=ALERTMANAGER
logger.ALERTMANAGER.level=INFO
logger.ALERTMANAGER.appenderRefs=alertmanager
logger.ALERTMANAGER.appenderRef.alertmanager.ref=async

In this project, an AlertSender.java class is defined. It contains a static method that allows you to send a JSON alert message:

public static void sendAlert(Severity severity, String idAlarm, Status status, String message);

You can find a usage example in:

./lx-dependencies/TM/common/src/test/java/com/leanxcale/alert/TestAlertSender.java

8. Querying the Alertmanager

It’s also possible to get alert information from the Alertmanager in an easy way using amtool. Let’s say we have the following alerts active:

Labels                                                                                                         Annotations              Starts At                Ends At                  Generator URL
alertType="alertType1" alertname="1234567890" component="KVDS" serviceIp="172.17.0.2:9993" severity="warning"  summary="Alarm message"  2019-05-29 09:25:29 UTC  2019-05-29 09:30:29 UTC
alertType="alertType1" alertname="1234567891" component="KVDS" serviceIp="172.17.0.2:9993" severity="warning"  summary="Alarm message"  2019-05-29 09:25:35 UTC  2019-05-29 09:30:35 UTC
alertType="alertType1" alertname="1234567892" component="KVDS" serviceIp="172.17.0.2:9993" severity="warning"  summary="Alarm message"  2019-05-29 09:25:43 UTC  2019-05-29 09:30:43 UTC
alertType="alertType2" alertname="1234567893" component="KVDS" serviceIp="172.17.0.2:9993" severity="warning"  summary="Alarm message"  2019-05-29 09:26:05 UTC  2019-05-29 09:31:05 UTC
alertType="alertType2" alertname="1234567894" component="KVDS" serviceIp="172.17.0.2:9993" severity="warning"  summary="Alarm message"  2019-05-29 09:26:10 UTC  2019-05-29 09:31:10 UTC

In the first example, we ask the Alertmanager for the alert with alertname=1234567890 and, in the second one, for all the alerts with alertType=alertType2:

appuser@5c050e45b4d0:/lx/monitor$ alertmanager-package/amtool --alertmanager.url=http://172.17.0.2:9093 alert query -o extended alertname="1234567890"
Labels                                                                                                         Annotations              Starts At                Ends At                  Generator URL
alertType="alertType1" alertname="1234567890" component="KVDS" serviceIp="172.17.0.2:9993" severity="warning"  summary="Alarm message"  2019-05-29 09:25:29 UTC  2019-05-29 09:30:29 UTC
appuser@5c050e45b4d0:/lx/monitor$ alertmanager-package/amtool --alertmanager.url=http://172.17.0.2:9093 alert query -o extended alertType="alertType2"
Labels                                                                                                         Annotations              Starts At                Ends At                  Generator URL
alertType="alertType2" alertname="1234567893" component="KVDS" serviceIp="172.17.0.2:9993" severity="warning"  summary="Alarm message"  2019-05-29 09:26:05 UTC  2019-05-29 09:31:05 UTC
alertType="alertType2" alertname="1234567894" component="KVDS" serviceIp="172.17.0.2:9993" severity="warning"  summary="Alarm message"  2019-05-29 09:26:10 UTC  2019-05-29 09:31:10 UTC

9. Alerts

This is the detailed list of the events that are centralized in the alert system:

9.1. Query Engine events

Each entry is listed as Alert Identifier [Type]: Description.

  • START_SERVER [warning(info)]: Indicates a server has been started. It shows the server address as IP:PORT.

  • START_SERVER_LTM [warning(info)]: Indicates the LTM is started and the QE is connected to a Zookeeper instance.

  • START_SERVER_ERROR [critical]: There has been an error while starting the server and/or the LTM.

  • STOP_SERVER [warning(info)]: Indicates the server has been stopped.

  • SESSION_CONSISTENCY_ERROR [warning]: Error when setting session consistency at the LTM.

  • DIRTY_METADATA [warning]: The metadata has been updated but the QE couldn’t read the new changes.

  • CANNOT_PLAN [warning]: The QE couldn’t come up with an execution plan for the query.

  • FORCE_CLOSE_CONNECTION [warning(info)]: A connection was explicitly closed by a user.

  • FORCE_ROLLBACK [warning(info)]: A connection was explicitly rolled back by a user.

  • FORCED_ROLLBACK_FAILED [warning]: Forced rollback failed.

  • FORCE_CANCEL_TRANSACTION [warning(info)]: Forced transaction cancellation. The transaction was not associated with any connection.

  • KV_DIAL_BEFORE [critical]: QE is not connected to the DS.

  • KV_NOT_NOW [warning]: DS error: not now.

  • KV_ABORT [warning(info)]: DS error: aborted by user.

  • KV_ADDR [warning]: DS error: bad address.

  • KV_ARG [warning]: DS error: bad argument.

  • KV_BAD [warning]: DS error: corrupt.

  • KV_AUTH [warning]: DS error: auth failed.

  • KV_BUG [warning]: DS error: not implemented.

  • KV_CHANGED [warning]: DS error: resource removed or format changed.

  • KV_CLOSED [warning]: DS error: stream closed.

  • KV_CTL [warning]: DS error: bad control request.

  • KV_DISKIO [critical]: DS error: disk I/O error.

  • KV_EOF [warning]: DS error: premature EOF.

  • KV_FMT [warning]: DS error: bad format.

  • KV_FULL [critical]: DS error: resource full.

  • KV_HALT [critical]: DS error: system is halting.

  • KV_INTR [warning]: DS error: interrupted.

  • KV_IO [critical]: DS error: I/O error.

  • KV_JAVA [warning]: DS error: Java error.

  • KV_LOW [critical]: DS error: low on resources.

  • KV_LTM [warning]: DS error: LTM error.

  • KV_REC [warning]: DS error: Rec error.

  • KV_LOG [warning]: DS error: Log error.

  • KV_MAN [critical]: DS error: please, read the manual.

  • KV_NOAUTH [warning]: DS error: auth disabled.

  • KV_NOBLKS [critical]: DS error: no more mem blocks.

  • KV_NOBLOB [warning(info)]: DS error: no such blob. The QE might not be handling blobs correctly.

  • KV_NOIDX [warning(info)]: DS error: no such index. The QE might not be handling indexes correctly.

  • KV_NOMETA [critical]: DS error: no metadata.

  • KV_NOREG [warning]: DS error: no such region.

  • KV_NOTOP [warning]: DS error: no such tid operation.

  • KV_NOSERVER [critical]: DS error: no server.

  • KV_NOTAVAIL [critical]: DS error: not available.

  • KV_NOTID [warning]: DS error: no such tid.

  • KV_NOTUPLE [warning]: DS error: no such tuple.

  • KV_NOTUPLES [warning]: DS error: no tuples.

  • KV_NOUSR [warning(info)]: DS error: no such user.

  • KV_OUT [critical]: DS error: out of resources.

  • KV_PERM [warning(info)]: DS error: permission denied.

  • KV_PROTO [warning]: DS error: protocol error.

  • KV_RDONLY [warning]: DS error: read only.

  • KV_RECOVER [warning]: DS error: system is recovering.

  • KV_RECOVERED [warning]: DS error: system recovered.

  • KV_SERVER [critical]: DS error: server.

  • KV_TOOLARGE [warning]: DS error: too large for me.

  • KV_TOOMANY [warning]: DS error: too many for me.

  • KV_TOUT [warning]: DS error: timed out.

  • KV_REPL [warning]: DS error: replica error.

9.2. Health Monitor events

Each entry is listed as Alert Identifier [Type]: Description.

  • COMPONENT_FAILURE [warning]: Failure from the indicated service.

  • TIMEOUT_TOBEREGISTERED [warning]: Waiting to register the indicated service.

  • CANNOT_REGISTER [critical]: Cannot register the indicated service. Restart it.

9.3. Configuration events

Each entry is listed as Alert Identifier [Type]: Description.

  • COMPONENT_FAILURE [warning]: The indicated service is down.

  • HAPROXY_NO_CONNECTION [warning]: No connection with the HA proxy while stopping.

  • TIMEOUT_BUCKET_UNASSIGNED [critical]: The bucket reconfiguration couldn’t be done.

  • TIMEOUT_TOBEREGISTERED [warning]: The indicated service still has dependencies to solve.

  • CANNOT_REGISTER [critical]

  • CONSOLE_SERVER_DOWN [warning]: Couldn’t start the console server.

  • RESTART_COMPONENT [warning]: Couldn’t restart the indicated service.

  • SETTING_EPOCH [warning]: Waiting until STS > RSts.

  • RECOVERY_FAILURE [warning]: The indicated service has not been recovered.

  • RECOVERY_TIMEOUT [warning]: The indicated service is being recovered.

  • HOTBACKUP [warning]: Pending recovery from hotbackup.

9.4. Transaction Log events

Each entry is listed as Alert Identifier [Type]: Description.

  • LOGDIR_ERROR [critical]: Cannot create the folder, or the path is not a folder.

  • LOGDIR_ALLOCATOR_ERROR [critical]: Error managing disk in the logger.

  • LOGGER_FILE_ERROR [warning]: I/O file error.

  • LOGNET_CONNECTION_FAILED [critical]: Cannot dial the logger.

  • LOGSRV_CONNECTION_ERROR [critical]: Network error.

  • KVDS_RECOVERY_FAILED [critical]: The kvds instance couldn’t be recovered from the logger.

  • FLUSH_FAILED [critical]: Unexpected exception while flushing to the logger.