Monitoring and Alerting System
This document provides a brief analysis of the LeanXcale Monitoring and Alerting System, as well as a practical guide to creating new alerting and recording rules in Prometheus.
1. Architecture
LeanXcale monitoring is built on two open-source solutions: Prometheus and Grafana.
The following image shows the architecture of components:
The LeanXcale Monitoring and Alerting System consists of the following elements:
- Prometheus: Centralizes and stores metrics in a time series database.
- Alertmanager: Triggers alarms based on the information stored in Prometheus (it is also part of the Prometheus software).
- Exporters: Run on the monitored hosts and export metric sets to feed Prometheus:
  - Mastermind exporter: Publishes metrics from the following Lx components: Snapshot Server, Commit Sequencer and Config Manager.
  - node-exporter: Comes by default with the Prometheus installation and exports a large set of metrics about machine performance.
  - metricsExporter.sh: Exports metrics about Lx components at two levels:
    - Lx component Java code running inside the JVM.
    - Lx component processes at OS level (I/O, memory, …).
In addition, it is possible to feed Prometheus with additional metrics. It only requires key-value files placed in the /tmp/scrap path. These files are commonly used to export metrics produced by specific processes. For example, this is the file generated in the TPCC benchmarking tests:
appuser@5c050e45b4d0:/tmp/scrap$ cat escada.prom
tpccavglatency 101
tpmCs 28910
aborttpmCs 124
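As a minimal sketch of how a process could publish such a metric (assuming /tmp/scrap is the scraped directory described above, and using a hypothetical metric name batch_duration_seconds and file name mybatch.prom), a script could do something like this:

#!/bin/bash
# Minimal sketch: publish a custom key-value metric for Prometheus to pick up.
# Assumes /tmp/scrap is the scraped directory; the metric and file names are
# hypothetical examples.
DURATION=42
# Write to a temporary file first and then move it into place, so the
# scraper never reads a half-written file.
echo "batch_duration_seconds ${DURATION}" > /tmp/scrap/mybatch.prom.$$
mv /tmp/scrap/mybatch.prom.$$ /tmp/scrap/mybatch.prom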
Monitoring is started when the cluster is started, and the central components (Prometheus and Grafana) are deployed on the metadata server.
2. Starting the Monitor
You just need to run the following script:
./lx/monitor/monitor.sh
It starts the Prometheus server, the Alertmanager, Grafana and node_exporter.
You also need to run another script to expose the LeanXcale metrics on port 9089:
./lx/metrics/metricsExporter.sh
It calls a Python script that will collect all the metrics.
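To verify that the exporters are up, you can query their endpoints directly. This is just a quick check sketch, assuming you run it on the monitored host itself and that the ports used in this installation (9089 for metricsExporter.sh and 9101 for node_exporter) apply:

# Print the first lines exposed by each exporter
curl -s http://localhost:9089/metrics | head
curl -s http://localhost:9101/metrics | head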
3. Monitoring dashboard
To access the monitoring dashboard, open the following URL in your browser:
http://{Metadata server hostname}:3000/
The following image shows an example of a working monitoring dashboard:
4. Taking a look at the Prometheus configuration
The Prometheus config file is /lx/monitor/prometheus-package/prometheus.yml:
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "alert.rules.yml"
  - "basic_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'node-exporter'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['172.17.0.2:9101','172.17.0.2:9089'] #NODE_EXPORTER#
  - job_name: 'mastermind'
    static_configs:
    - targets: ['localhost:4242'] #MASTERMIND_EXPORTER#
All these parameters can also be checked in the Prometheus console (http://172.17.0.2:9091), under the "Targets" and "Configuration" menu options.
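The same information can be retrieved from the command line through the Prometheus HTTP API (a sketch, assuming the API is reachable on the same host and port as the console):

# List the active scrape targets and the loaded configuration
curl -s http://172.17.0.2:9091/api/v1/targets
curl -s http://172.17.0.2:9091/api/v1/status/config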
4.1. Prometheus rules
As you can see in the previous config file, rules are included in the YAML files that Prometheus finds in the rule_files tag values. All the rules defined in these files can be checked in the Prometheus console (http://172.17.0.2:9091/rules).
Prometheus supports two types of rules, which may be configured and are then evaluated at regular intervals (remember that their results are stored in Prometheus as time series data).
4.1.1. Recording rules
Recording rules allow you to precompute frequently needed or computationally expensive expressions and save their result as a new set of time series. Querying the precomputed result is then often much faster than executing the original expression every time it is needed. This can help to improve the performance of Grafana dashboards.
It is important to note that there is no firing action for these rules; they are just calculated and stored in Prometheus in order to be queried or collected by dashboards in Grafana.
At this moment, there are a few recording rules in the LeanXcale installation; you can find them in:
/lx/monitor/prometheus-package/basic_rules.yml
Note that rules are organized in groups, whose names must be unique within the file.
A typical recording rule is composed of two mandatory tags (record and expr). An optional tag is interval.
Let’s take an example from this file:
groups:
  - name: cluster metrics
    interval: 20s # How often rules in the group are evaluated (overrides global config)
    rules:
      - record: instance:cpu_busy:avg_rate1m # Rule name: there is no mandatory way in which you must name the rules, but *level:metric:operations* is recommended.
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[1m])) * 100) # Expression to be evaluated to add a new value point to the time series
The rule in this example calculates an average rate of CPU usage based on the metric node_cpu_seconds_total. Let's see how this metric is exposed at http://172.17.0.2:9101/metrics:
# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1276.51
node_cpu_seconds_total{cpu="0",mode="iowait"} 1.72
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 3.02
node_cpu_seconds_total{cpu="0",mode="softirq"} 44.63
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 17.02
node_cpu_seconds_total{cpu="0",mode="user"} 42.66
node_cpu_seconds_total{cpu="1",mode="idle"} 1273.47
This metric has two labels, cpu and mode. These labels can be accessed from the expression to select the specific series being evaluated.
To learn more about querying and expression syntax in Prometheus, please check https://prometheus.io/docs/prometheus/latest/querying/basics/.
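As a quick sketch (assuming the Prometheus HTTP API is reachable on port 9091, as in the console URLs above), the series produced by the recording rule can be queried directly by its name:

# Query the precomputed series created by the recording rule
curl -s 'http://172.17.0.2:9091/api/v1/query?query=instance:cpu_busy:avg_rate1m'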
5. Creating a New Alert
If you want to create a new alert:
- If the alert is to be generated based on Prometheus host metrics, you just need to know the metric and the expression to be evaluated. Go to the Alerting rules section (5.1).
- If you want to create an alert based on a precomputed operation over a metric, check Alerting rules based on recording rules (5.2).
- If you want to send an alert directly to Alertmanager, have a look at the LeanXcale standard format for manual alerts (7.1), and consider the two available options: Bash scripts (7.2) or the Java Log4j appender (7.3).
- Consider that alerts can also be configured to override the global configuration and set a time delay between the moment the expression condition is fulfilled and the effective firing of the alert. Check the for tag in the Alerting rules section (5.1).
5.1. Alerting rules
Alerting rules allow you to define alert conditions based on Prometheus expression language expressions and to send notifications about firing alerts to an external service.
At this moment, alerting rules are defined in the following file:
/lx/monitor/prometheus-package/alert.rules.yml
The syntax is quite similar to that of recording rules, although there are some new tags.
groups:
  - name: example
    rules:
      - alert: HighErrorRate # Alert name
        expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5 # Expression that sets the condition for the alert to be fired
        for: 10m # Causes Prometheus to wait for a certain duration between first encountering a new expression output vector element and counting an alert as firing for this element. In this case, Prometheus will check that the alert continues to be active during each evaluation for 10 minutes before firing the alert.
        labels: # Set of additional labels to be attached to the alert.
          severity: page
        annotations: # Specifies a set of informational labels that can be used to store longer additional information such as alert descriptions or runbook links.
          summary: "High request latency on {{ $labels.instance }}"
          description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
In labels and annotations, there are two variables that can be used:
- $labels: Holds the label key/value pairs of an alert instance.
- $value: Holds the evaluated value of an alert instance (that is, the value that fires the alert).
If you set a time period in the for tag, it causes Prometheus to wait for that duration between first encountering a new expression output vector element and counting an alert as firing for this element. In the example above, Prometheus will check that the alert continues to be active during each evaluation for 10 minutes before firing the alert.
IMPORTANT!!!
It is not recommended to use the $value variable in the labels tag. The reason is that the labels would then include a label called value, which will usually vary from one evaluation to the next. The effect of this is that each evaluation will see a brand new alert and treat the previous one as no longer firing. This is because the labels of an alert define its identity, and thus the for condition will never be satisfied.
$value should be used in the annotations tag instead; let's see two examples:
# INCORRECT: THIS ALERT MAY NEVER BE FIRED
groups:
  - name: example
    rules:
      - alert: ExampleAlertIncorrect
        expr: metric > 10
        for: 5m
        labels:
          severity: page
          value: "{{ $value }}" ######## DON'T INCLUDE $value HERE!
        annotations:
          summary: "Instance {{ $labels.instance }}'s metric is too high"

      # CORRECT ALERT
      - alert: ExampleAlert
        expr: metric > 10
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }}'s metric is {{ $value }}"
5.2. Alerting rules based on recording rules
Sometimes it can be useful in LeanXcale procedures to set alerts based on complex precomputed metrics. Let's see an example:
groups:
  - name: recording_rules
    interval: 5s
    rules:
      - record: node_exporter:node_filesystem_free:fs_used_percents
        expr: 100 - 100 * ( node_filesystem_free{mountpoint="/"} / node_filesystem_size{mountpoint="/"} )

  - name: alerting_rules
    rules:
      - alert: DiskSpace10%Free
        expr: node_exporter:node_filesystem_free:fs_used_percents >= 90
        # Note that the previous expression evaluates the metric defined in the recording rule.
        labels:
          severity: moderate
        annotations:
          summary: "Instance {{ $labels.instance }} is low on disk space"
          description: "{{ $labels.instance }} has only {{ $value }}% free."
6. Testing rules file syntax
Prometheus provides a tool to check if your rule files are syntactically correct after modifying them.
In the LeanXcale installation, this tool is in:
/lx/monitor/prometheus-package/promtool
Usage example:
./promtool check rules alert.rules.yml
Checking alert.rules.yml
SUCCESS: 5 rules found
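promtool accepts several rule files in one call, so both files referenced in prometheus.yml can be checked together (assuming they are in the current directory):

./promtool check rules alert.rules.yml basic_rules.yml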
7. Sending an alert manually to Alertmanager
7.1. Manual Alerts format
A manual alert contains the following tags:
- status: Alert's activation state (firing, resolved).
- labels:
  - alertname: Unique ID for the alert.
  - alertType: This field represents the type; the idea is to set a classification for alerts based on the possible anomalous situations that could happen in different LeanXcale components.
  - severity: Severity level (critical, warning).
  - component: LeanXcale component that sends the alert to Alertmanager (KVDS, LgLTM, QE, ZK, LgSnS, LgCmS, MtM, CflM, KVMS).
  - serviceIp: IP/Hostname:Port where the component that fires the alert is running.
- annotations:
  - summary: A descriptive message of the problem.
An example of an alert to be fired would look like this:
[{
  "status": "firing",
  "labels": {
    "alertname": "1234567890",
    "alertType": "alertType1",
    "severity": "warning",
    "component": "KVDS",
    "serviceIp": "172.17.0.2:9993"
  },
  "annotations": {
    "summary": "Alarm message"
  }
}]
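Such a payload can be posted directly to the Alertmanager HTTP API with curl. This is a sketch of what the helper script described in the next section does, assuming Alertmanager listens on localhost:9093 as configured in prometheus.yml:

# Post the example alert to Alertmanager's v1 alerts endpoint
curl -XPOST -H "Content-Type: application/json" \
  http://localhost:9093/api/v1/alerts \
  -d '[{"status": "firing",
        "labels": {"alertname": "1234567890", "alertType": "alertType1",
                   "severity": "warning", "component": "KVDS",
                   "serviceIp": "172.17.0.2:9993"},
        "annotations": {"summary": "Alarm message"}}]'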
7.2. Alerts from Bash scripts
It is possible to send an alert to Alertmanager using the sendAlert.sh script included in the LeanXcale installation; you can find it in:
/lx/monitor/sendAlert.sh
This script looks up the host and port where the Alertmanager is running in the Prometheus config file.
./sendAlert.sh <firing|resolved> # Alert status
<warning|critical> # Alert severity
<AlertId> # Alert unique ID
<AlertType> # Alert Type (indicates an anomalous situation)
<Component> # Lx Component that fires alarm
<ServiceIP> # hostname/IP:Port
"<Message>"
Let's see how we would fire the example alert:
./sendAlert.sh firing warning 1234567890 alertType1 KVDS "172.17.0.2:9993" "Alarm message"
There is a script in the LeanXcale installation that can help us check whether this is really working (you should have started the LeanXcale Monitor previously; see section 2, Starting the Monitor). It takes one optional argument that indicates how much of the alert information you want to see:
- simple (default): Just the alertname, start datetime and summary message.
- extended: All the alert's tags in tabular format.
- json: All the alert's tags (even internal ones) in JSON format.
/lx/monitor/alarm_console.sh [simple|extended|json]
The previously fired alert will be shown like this:
Alertname Starts At Summary
1234567890 2019-05-29 09:15:36 UTC Alarm message
We can also see the alert in the Alertmanager console.
To resolve the alert, just run the script again with the resolved status:
./sendAlert.sh resolved warning 1234567890 alertType1 KVDS "172.17.0.2:9993" "Alarm message"
7.3. Alerts from Java (Log4j appender)
In the lx-dependencies project (https://gitlab.leanxcale.com/lx/lx-dependencies), a common Log4j appender for Java code is configured. It allows you to send an alert to Alertmanager.
This appender can be found in:
./lx-dependencies/TM/common/src/test/resources/log4j2.properties
appender.http.type=Http
appender.http.name=http
appender.http.layout.type=PatternLayout
appender.http.layout.pattern=%m
appender.http.url=http://172.17.0.2:9093/api/v1/alerts
...
#AlertManager
logger.ALERTMANAGER.name=ALERTMANAGER
logger.ALERTMANAGER.level=INFO
logger.ALERTMANAGER.appenderRefs=alertmanager
logger.ALERTMANAGER.appenderRef.alertmanager.ref=async
In this project, an AlertSender.java class is defined. It contains a static method that allows you to send a JSON alert message:
public static void sendAlert(Severity severity, String idAlarm, Status status, String message);
You can find a usage example in:
./lx-dependencies/TM/common/src/test/java/com/leanxcale/alert/TestAlertSender.java
8. Querying the Alertmanager
It is also possible to get alert information from Alertmanager in an easy way using amtool. Let's say we have the following alerts active:
Labels  Annotations  Starts At  Ends At  Generator URL
alertType="alertType1" alertname="1234567890" component="KVDS" serviceIp="172.17.0.2:9993" severity="warning" summary="Alarm message" 2019-05-29 09:25:29 UTC 2019-05-29 09:30:29 UTC
alertType="alertType1" alertname="1234567891" component="KVDS" serviceIp="172.17.0.2:9993" severity="warning" summary="Alarm message" 2019-05-29 09:25:35 UTC 2019-05-29 09:30:35 UTC
alertType="alertType1" alertname="1234567892" component="KVDS" serviceIp="172.17.0.2:9993" severity="warning" summary="Alarm message" 2019-05-29 09:25:43 UTC 2019-05-29 09:30:43 UTC
alertType="alertType2" alertname="1234567893" component="KVDS" serviceIp="172.17.0.2:9993" severity="warning" summary="Alarm message" 2019-05-29 09:26:05 UTC 2019-05-29 09:31:05 UTC
alertType="alertType2" alertname="1234567894" component="KVDS" serviceIp="172.17.0.2:9993" severity="warning" summary="Alarm message" 2019-05-29 09:26:10 UTC 2019-05-29 09:31:10 UTC
In the first example, we ask the Alertmanager for the alert with alertname=1234567890 and in the second one, for all the alerts with alertType=alertType2:
appuser@5c050e45b4d0:/lx/monitor$ alertmanager-package/amtool --alertmanager.url=http://172.17.0.2:9093 alert query -o extended alertname="1234567890"
Labels Annotations Starts At Ends At Generator URL
alertType="alertType1" alertname="1234567890" component="KVDS" serviceIp="172.17.0.2:9993" severity="warning" summary="Alarm message" 2019-05-29 09:25:29 UTC 2019-05-29 09:30:29 UTC
appuser@5c050e45b4d0:/lx/monitor$ alertmanager-package/amtool --alertmanager.url=http://172.17.0.2:9093 alert query -o extended alertType="alertType2"
Labels Annotations Starts At Ends At Generator URL
alertType="alertType2" alertname="1234567893" component="KVDS" serviceIp="172.17.0.2:9993" severity="warning" summary="Alarm message" 2019-05-29 09:26:05 UTC 2019-05-29 09:31:05 UTC
alertType="alertType2" alertname="1234567894" component="KVDS" serviceIp="172.17.0.2:9993" severity="warning" summary="Alarm message" 2019-05-29 09:26:10 UTC 2019-05-29 09:31:10 UTC
9. Alerts
This is the detailed list of the events that are centralized in the alert system:
9.1. Query Engine events
Alert Identifier | Type | Description
---|---|---
START_SERVER | warning(info) | Indicates a server has been started. It shows the server address as IP:PORT
START_SERVER_LTM | warning(info) | Indicates the LTM is started and the QE is connected to a Zookeeper instance
START_SERVER_ERROR | critical | There has been an error while starting the server and/or the LTM
STOP_SERVER | warning(info) | Indicates the server has been stopped
SESSION_CONSISTENCY_ERROR | warning | Error when setting session consistency at LTM
DIRTY_METADATA | warning | The metadata has been updated but the QE couldn’t read the new changes
CANNOT_PLAN | warning | The QE couldn’t come up with an execution plan for the query
FORCE_CLOSE_CONNECTION | warning(info) | A connection was explicitly closed by a user
FORCE_ROLLBACK | warning(info) | A connection was explicitly rolled back by a user
FORCED_ROLLBACK_FAILED | warning | Forced rollback failed
FORCE_CANCEL_TRANSACTION | warning(info) | Forced transaction cancellation. The transaction was not associated with any connection
KV_DIAL_BEFORE | critical | QE is not connected to the DS
KV_NOT_NOW | warning | DS error: not now
KV_ABORT | warning(info) | DS error: aborted by user
KV_ADDR | warning | DS error: bad address
KV_ARG | warning | DS error: bad argument
KV_BAD | warning | DS error: corrupt
KV_AUTH | warning | DS error: auth failed
KV_BUG | warning | DS error: not implemented
KV_CHANGED | warning | DS error: resource removed or format changed
KV_CLOSED | warning | DS error: stream closed
KV_CTL | warning | DS error: bad control request
KV_DISKIO | critical | DS error: disk i/o error
KV_EOF | warning | DS error: premature EOF
KV_FMT | warning | DS error: bad format
KV_FULL | critical | DS error: resource full
KV_HALT | critical | DS error: system is halting
KV_INTR | warning | DS error: interrupted
KV_IO | critical | DS error: i/o error
KV_JAVA | warning | DS error: java error
KV_LOW | critical | DS error: low on resources
KV_LTM | warning | DS error: LTM error
KV_REC | warning | DS error: Rec error
KV_LOG | warning | DS error: Log error
KV_MAN | critical | DS error: please, read the manual
KV_NOAUTH | warning | DS error: auth disabled
KV_NOBLKS | critical | DS error: no more mem blocks
KV_NOBLOB | warning(info) | DS error: no such blob. QE might not be handling blobs correctly
KV_NOIDX | warning(info) | DS error: no such index. QE might not be handling indexes correctly
KV_NOMETA | critical | DS error: no metadata
KV_NOREG | warning | DS error: no such region
KV_NOTOP | warning | DS error: no such tid operation
KV_NOSERVER | critical | DS error: no server
KV_NOTAVAIL | critical | DS error: not available
KV_NOTID | warning | DS error: no such tid
KV_NOTUPLE | warning | DS error: no such tuple
KV_NOTUPLES | warning | DS error: no tuples
KV_NOUSR | warning(info) | DS error: no such user
KV_OUT | critical | DS error: out of resources
KV_PERM | warning(info) | DS error: permission denied
KV_PROTO | warning | DS error: protocol error
KV_RDONLY | warning | DS error: read only
KV_RECOVER | warning | DS error: system is recovering
KV_RECOVERED | warning | DS error: system recovered
KV_SERVER | critical | DS error: server
KV_TOOLARGE | warning | DS error: too large for me
KV_TOOMANY | warning | DS error: too many for me
KV_TOUT | warning | DS error: timed out
KV_REPL | warning | DS error: replica error
9.2. Health Monitor events
Alert Identifier | Type | Description
---|---|---
COMPONENT_FAILURE | warning | Failure from the indicated service
TIMEOUT_TOBEREGISTERED | warning | Waiting to register the indicated service
CANNOT_REGISTER | critical | Cannot register the indicated service. Restart it
9.3. Configuration events
Alert Identifier | Type | Description
---|---|---
COMPONENT_FAILURE | warning | The indicated service is down
HAPROXY_NO_CONNECTION | warning | No connection with HA proxy while stopping
TIMEOUT_BUCKET_UNASSIGNED | critical | The bucket reconfiguration couldn’t be done
TIMEOUT_TOBEREGISTERED | warning | The indicated service still has dependencies to solve
CANNOT_REGISTER | critical |
CONSOLE_SERVER_DOWN | warning | Couldn’t start the console server
RESTART_COMPONENT | warning | Couldn’t restart the indicated service
SETTING_EPOCH | warning | Waiting until STS > RSts
RECOVERY_FAILURE | warning | The indicated service has not been recovered
RECOVERY_TIMEOUT | warning | The indicated service is being recovered
HOTBACKUP | warning | Pending recovery from hotbackup
9.4. Transaction Log events
Alert Identifier | Type | Description
---|---|---
LOGDIR_ERROR | critical | Cannot create folder or is not a folder
LOGDIR_ALLOCATOR_ERROR | critical | Error managing disk in logger
LOGGER_FILE_ERROR | warning | IO file error
LOGNET_CONNECTION_FAILED | critical | Cannot dial logger
LOGSRV_CONNECTION_ERROR | critical | Network error
KVDS_RECOVERY_FAILED | critical | The kvds instance couldn’t be recovered from logger
FLUSH_FAILED | critical | Unexpected exception flushing to logger