How to Monitor ArangoDB using collectd, Prometheus and Grafana
Information on how to set up a monitoring system for ArangoDB (standalone or cluster)
Introduction
ArangoDB provides several statistics via HTTP/JSON APIs. Such statistics can be used to monitor ArangoDB, when collected, stored and then visualized.
In this article we present an ArangoDB monitoring approach that uses, under Linux, the tools collectd, Prometheus and Grafana. We start with an overview of how to install and configure the needed tools. Then we walk you through the steps required to get some data through the pipeline and visualize it. A more complete example follows. Finally, we provide an example of monitoring the health of an ArangoDB Cluster.
Required Software Tools and Components
The following is the list of tools used in this setup:
- ArangoDB
- collectd
- Prometheus
- Grafana
The data flow between the above tools is as follows:
- collectd gathers data from ArangoDB using its plugin curl_json
- Prometheus fetches the data from collectd, which exposes it via its plugin write_prometheus (available since collectd v. 5.7)
- Grafana queries Prometheus to visualize the data
Installing the software
We assume you already installed ArangoDB.
For this setup to work, you will need at least one instance of collectd. Please use version 5.7 or higher, so that the required write_prometheus plugin is included. You may prefer to install collectd on every server in your setup, as it can feed lots of valuable information about those systems into your Prometheus database, such as CPU, memory or disk usage, which complements the data from ArangoDB nicely. However, a single installation suffices to get the information provided by ArangoDB, and you may want to start with that.
Finally, you need to install Prometheus and Grafana.
Basic configuration
In the following examples, we use these names for the different installations:
- coordinator.arangodb.local for one ArangoDB Coordinator
- collectd.local for your collectd instance
- prometheus.local for your Prometheus instance
These may also be installed on the same machine. Just replace the names used here with the actual names (or plain IP addresses) of your installations.
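These names are only placeholders. If you are experimenting on a single machine and want the example hostnames to resolve as written, one optional convenience (an assumption of this article, not a requirement of the setup) is to map them in /etc/hosts:

```
127.0.0.1   coordinator.arangodb.local
127.0.0.1   collectd.local
127.0.0.1   prometheus.local
```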
collectd
Assuming you are using a default collectd installation, /etc/collectd/collectd.conf should already contain the following lines to include additional *.conf files from the directory /etc/collectd/collectd.conf.d:
<Include "/etc/collectd/collectd.conf.d">
Filter "*.conf"
</Include>
You may want to set/add a line to specify the time interval in seconds after which collectd fetches another set of data:
Interval 60
However, this can also be set for each plugin separately.
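As a sketch of the per-plugin variant: collectd accepts an Interval option inside a <LoadPlugin> block (the plugin name and the 10-second value below are arbitrary choices for illustration; if you use this form, drop the plain LoadPlugin line for that plugin to avoid a double-load warning):

```
<LoadPlugin curl_json>
  Interval 10
</LoadPlugin>
```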
Now add the following file to configure the write_prometheus plugin:
/etc/collectd/collectd.conf.d/write_prometheus.conf
LoadPlugin write_prometheus
<Plugin "write_prometheus">
  Port "9103"
</Plugin>
After (re)starting collectd, the Prometheus interface should already be available. To check that it works, open the address http://collectd.local:9103/metrics in your browser. Do not forget to replace collectd.local with your actual collectd server. You should see something like this:
# HELP collectd_df_df_complex write_prometheus plugin: 'df' Type: 'df_complex', Dstype: 'gauge', Dsname: 'value'
# TYPE collectd_df_df_complex gauge
collectd_df_df_complex{df="etc-hostname",type="free",instance="3c77f4c05a29"} 377251528704 1518599082748
collectd_df_df_complex{df="etc-hostname",type="reserved",instance="3c77f4c05a29"} 23305961472 1518599082748
collectd_df_df_complex{df="etc-hostname",type="used",instance="3c77f4c05a29"} 57782255616 1518599082748
...
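Each non-comment line of this output follows the Prometheus text exposition format: the metric name with its labels, the value, and a timestamp in milliseconds. Purely as an illustration of that layout (collectd and Prometheus do this parsing for you), here is one sample line from above pulled apart with awk:

```shell
# One sample line from the write_prometheus endpoint (values taken from the example above)
LINE='collectd_df_df_complex{df="etc-hostname",type="free",instance="3c77f4c05a29"} 377251528704 1518599082748'
# The labels contain no spaces, so the line splits into three fields:
# name{labels}, value, timestamp-in-ms
VALUE=$(echo "$LINE" | awk '{print $(NF-1)}')
STAMP=$(echo "$LINE" | awk '{print $NF}')
echo "value=$VALUE timestamp=$STAMP"
# → value=377251528704 timestamp=1518599082748
```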
Now we are ready to connect collectd to Prometheus.
Prometheus
A minimal working configuration file looks like this:
/etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
        - 'collectd.local:9103'
In case you already have a configuration file, you only need to add the line - 'collectd.local:9103' to an existing job node, or add a job of your own. You may also add multiple targets here if you chose to install multiple collectd instances. Later you will be able to tell the metrics of the targets apart, as Prometheus enriches your time series with the labels instance="collectd.local:9103" and job="node".
You may also want to configure how often Prometheus fetches data from collectd (taking into account also the Interval setting of collectd):
/etc/prometheus/prometheus.yml
global:
  scrape_interval: 60s
The default setting for scrape_interval is 1m. More information can be found in the Prometheus documentation on configuration.
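Putting the two fragments together, a complete minimal /etc/prometheus/prometheus.yml for this setup could look like the sketch below (adjust the target to your actual collectd host):

```yaml
global:
  scrape_interval: 60s

scrape_configs:
  - job_name: node
    static_configs:
      - targets:
        - 'collectd.local:9103'
```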
After (re)starting Prometheus, visit http://prometheus.local:9090/targets in your browser. There should be a table node containing your endpoint, and its State should be UP: this means Prometheus is already scraping data from your collectd instance. It may take a minute (depending on the scrape_interval you have used) until the status changes from UNKNOWN to UP.
Prometheus is now set up.
Grafana
After logging into your Grafana installation, you should arrive at the Home Dashboard , where there is a link to Create your first data source. Alternatively, navigate to Configuration → Data sources and from there to Add data source.
Fill out the field Name for your Prometheus data source (choose freely). You probably want to check the box Default to set it as your default data source. As Type, choose Prometheus.
Add your Prometheus server under HTTP → URL: http://prometheus.local:9090.
Finally, click on Save & Test. If everything is configured correctly, you should get the message Data source is working.
Step-by-step example: Adding data to the pipeline
In this example, we add two metrics to our setup:
- The total physical memory in the ArangoDB Cluster (the sum of the physical memory of all Coordinators)
- The total resident set size, i.e. the amount of memory used by the ArangoDB instances
Other metrics can be added the same way.
Initial configuration of collectd / curl_json
This step has to be done only once. You can extend the configuration later as needed.
Add a config file for the curl_json collectd plugin:
/etc/collectd/collectd.conf.d/curl_json.conf
LoadPlugin curl_json
TypesDB "/etc/collectd/arangodb_types.db"
<Plugin curl_json>
  # Interval 60
  <URL "http://coordinator.arangodb.local:8529/_admin/aardvark/statistics/coordshort">
    # Instance "arango_coordshort"
    # Set your authentication to Aardvark here:
    User "root"
    Password ""
    # IMPORTANT: Add <Key> blocks here! The configuration file will not be valid
    # until there is at least one <Key> block.
  </URL>
</Plugin>
Optionally, you may override the Interval setting, specifying how often (in seconds) curl_json should fetch data from ArangoDB. Please note that a very low setting may generate load and therefore reduce the performance of the database.
Also optionally, you may add an Instance parameter. If you set it, for example to arango_coordshort, the label curl_json="arango_coordshort" will be added to all metrics configured in the <URL> block. Otherwise, the label curl_json="default" will be used.
You have to set User and Password to the credentials you use to log in to http://coordinator.arangodb.local:8529/.
Also, please create the file /etc/collectd/arangodb_types.db. It may initially be empty.
Getting data from ArangoDB to collectd with curl_json
The URL http://coordinator.arangodb.local:8529/_admin/aardvark/statistics/coordshort may be visited with a browser to get an overview of the available data. The response looks something like this:
http://coordinator.arangodb.local:8529/_admin/aardvark/statistics/coordshort
{
  "enabled": true,
  "data": {
    "http": { ... },
    "times": [ ... ],
    "physicalMemory": 101083078656,
    "residentSizeCurrent": 818298880,
    ...
  }
}
So the data we’re looking for is available under data/physicalMemory and data/residentSizeCurrent, respectively. These need to be added to the curl_json configuration above.
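For illustration only, here is a quick way to pull one of these values out of a saved copy of the response with standard shell tools. The inline JSON is a hand-truncated sample with the numbers from the example above; in the actual setup, curl_json does this extraction for you:

```shell
# Hand-truncated sample of the coordshort response (values from the example above)
JSON='{"enabled":true,"data":{"physicalMemory":101083078656,"residentSizeCurrent":818298880}}'
# Extract data/physicalMemory; a crude sed pattern is enough for this flat sample
MEM=$(echo "$JSON" | sed -n 's/.*"physicalMemory":\([0-9]*\).*/\1/p')
echo "$MEM"
# → 101083078656
```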
First we add two new types:
/etc/collectd/arangodb_types.db
coordshort_physicalMemory value:GAUGE:U:U
coordshort_residentSizeCurrent value:GAUGE:U:U
Using these types, curl_json will use the names collectd_curl_json_coordshort_physicalMemory and collectd_curl_json_coordshort_residentSizeCurrent for the metrics. You may choose your own names for the types. If you just use builtin types (e.g. gauge) instead, all data will be fed into the same metric (e.g. collectd_curl_json_coordshort_gauge) and can only be discerned using labels.
Now replace the lines

# IMPORTANT: Add <Key> blocks here! The configuration file will not be valid
# until there is at least one <Key> block.

in your <URL> block with:
/etc/collectd/collectd.conf.d/curl_json.conf
<Key "data/physicalMemory">
  Type "coordshort_physicalMemory"
</Key>
<Key "data/residentSizeCurrent">
  Type "coordshort_residentSizeCurrent"
</Key>
The Key is the path to the data in the JSON document above, while the Type is the one we added to /etc/collectd/arangodb_types.db.
After a restart of collectd and a minute (or whatever Interval is configured) of waiting, lines similar to the following should appear in the endpoint of write_prometheus:
http://collectd.local:9103/metrics
# HELP collectd_curl_json_coordshort_physicalMemory write_prometheus plugin: 'curl_json' Type: 'coordshort_physicalMemory', Dstype: 'gauge', Dsname: 'value'
# TYPE collectd_curl_json_coordshort_physicalMemory gauge
collectd_curl_json_coordshort_physicalMemory{curl_json="default",type="data-physicalMemory",instance="3c77f4c05a29"} 101083078656 1518609151473
# HELP collectd_curl_json_coordshort_residentSizeCurrent write_prometheus plugin: 'curl_json' Type: 'coordshort_residentSizeCurrent', Dstype: 'gauge', Dsname: 'value'
# TYPE collectd_curl_json_coordshort_residentSizeCurrent gauge
collectd_curl_json_coordshort_residentSizeCurrent{curl_json="default",type="data-residentSizeCurrent",instance="3c77f4c05a29"} 810180608 1518609151473
A minute or so later (depending on scrape_interval), the first values should arrive in Prometheus. This can be checked by executing, for example, the query collectd_curl_json_coordshort_physicalMemory in the Prometheus GUI under Graph. It should yield some results in either the Console or the Graph tab. If the message No datapoints found. appears, the metrics weren’t scraped (yet).
Creating a graph in Grafana
Now that the metrics on physical memory and resident set size, named collectd_curl_json_coordshort_physicalMemory and collectd_curl_json_coordshort_residentSizeCurrent, respectively, have arrived in Prometheus, graphs to visualize them can be added in Grafana.
First, create a new dashboard (unless you created one already): either click on Create your first dashboard on Grafana’s Home Dashboard, or navigate to Create → Dashboard. You have to save all changes made to a dashboard explicitly, either by pressing Ctrl+S, or by clicking on the floppy disk symbol in the upper right.
Then, a New panel dialog should be open. You can add more panels to the dashboard with the Add panel button in the upper right. Select the Graph visualization.
Navigate to Panel title and Edit.
In the General tab, you can set the panel’s Title; e.g. ArangoDB cluster: total memory. In the Metrics tab, set query A to collectd_curl_json_coordshort_physicalMemory and set the Legend format to Physical memory. Now add another query B, set it to collectd_curl_json_coordshort_residentSizeCurrent and its Legend format to Resident set size. Switch to the Axes tab, and set Left Y’s Unit to Data (IEC) → bytes. Close the panel by clicking on the X to the right.
If you are satisfied with the result, do not forget to save the dashboard!
More complete configurations
Add the following lines to
/etc/collectd/arangodb_types.db
coordshort_physicalMemory value:GAUGE:U:U
coordshort_residentSizeCurrent value:GAUGE:U:U
coordshort_clientConnectionsCurrent value:GAUGE:U:U
coordshort_bytesSentPerSecond value:GAUGE:U:U
coordshort_bytesReceivedPerSecond value:GAUGE:U:U
coordshort_avgRequestTime value:GAUGE:U:U
coordshort_http_requestsPerSecond value:GAUGE:U:U
coordshort_http_optionsPerSecond value:GAUGE:U:U
coordshort_http_putsPerSecond value:GAUGE:U:U
coordshort_http_headsPerSecond value:GAUGE:U:U
coordshort_http_postsPerSecond value:GAUGE:U:U
coordshort_http_getsPerSecond value:GAUGE:U:U
coordshort_http_deletesPerSecond value:GAUGE:U:U
coordshort_http_othersPerSecond value:GAUGE:U:U
coordshort_http_patchesPerSecond value:GAUGE:U:U
and the following lines in the <URL> block in
/etc/collectd/collectd.conf.d/curl_json.conf
<Key "data/physicalMemory">
  Type "coordshort_physicalMemory"
</Key>
<Key "data/residentSizeCurrent">
  Type "coordshort_residentSizeCurrent"
</Key>
<Key "data/clientConnectionsCurrent">
  Type "coordshort_clientConnectionsCurrent"
</Key>
<Key "data/bytesSentPerSecond/0">
  Type "coordshort_bytesSentPerSecond"
</Key>
<Key "data/bytesReceivedPerSecond/0">
  Type "coordshort_bytesReceivedPerSecond"
</Key>
<Key "data/avgRequestTime/0">
  Type "coordshort_avgRequestTime"
</Key>
<Key "data/http/optionsPerSecond/0">
  Instance "OPTION"
  Type "coordshort_http_requestsPerSecond"
</Key>
<Key "data/http/putsPerSecond/0">
  Instance "PUT"
  Type "coordshort_http_requestsPerSecond"
</Key>
<Key "data/http/headsPerSecond/0">
  Instance "HEAD"
  Type "coordshort_http_requestsPerSecond"
</Key>
<Key "data/http/postsPerSecond/0">
  Instance "POST"
  Type "coordshort_http_requestsPerSecond"
</Key>
<Key "data/http/getsPerSecond/0">
  Instance "GET"
  Type "coordshort_http_requestsPerSecond"
</Key>
<Key "data/http/deletesPerSecond/0">
  Instance "DELETE"
  Type "coordshort_http_requestsPerSecond"
</Key>
<Key "data/http/othersPerSecond/0">
  Instance "other"
  Type "coordshort_http_requestsPerSecond"
</Key>
<Key "data/http/patchesPerSecond/0">
  Instance "PATCH"
  Type "coordshort_http_requestsPerSecond"
</Key>
Then restart collectd.
Grafana dashboard
In the Grafana GUI, navigate to Create → Import and paste the following JSON to get a dashboard with some cluster graphs. You only need to select your data source to configure it. The dashboard was created with Grafana 4.6.3, the current stable version at the time of writing. If there are problems importing it, check your version first.
Adding ArangoDB Cluster Health info to collectd/Prometheus/Grafana
To perform this step we assume you already have a working setup of ArangoDB, collectd, Prometheus and Grafana (see previous sections).
The Cluster Health information, which is used to show the number of Coordinators and DBServers on the Dashboard of the ArangoDB Web Interface, is available as JSON via HTTP, but it is not suitable for direct consumption with the curl_json plugin in collectd. However, it is possible to get around this limitation using the exec plugin and a small script.
Requirements
The packages curl and jq need to be installed on your system.
Adding and configuring the plugin in collectd
Create the following bash script:
/etc/collectd/arango_cluster_health.plugin.bash
#!/bin/bash
HOSTNAME="${COLLECTD_HOSTNAME:-$(hostname -f)}"
INTERVAL="${COLLECTD_INTERVAL:-60}"
ARANGO_HEALTH_URL="$1"
ARANGO_USER="$2"
ARANGO_PASSWORD="$3"

# Bail out if the required tools are missing
if ! which curl jq > /dev/null
then
  exit 1
fi

while sleep "$INTERVAL"
do
  JSON="$(curl -s -u "$ARANGO_USER:$ARANGO_PASSWORD" "$ARANGO_HEALTH_URL")"
  if [ $? -ne 0 ]
  then
    continue
  fi
  TOTAL_COORDINATORS="$(jq '.Health | map(select(.Role == "Coordinator")) | length' <<<"$JSON")"
  GOOD_COORDINATORS="$(jq '.Health | map(select(.Role == "Coordinator" and .Status == "GOOD")) | length' <<<"$JSON")"
  TOTAL_DBSERVERS="$(jq '.Health | map(select(.Role == "DBServer")) | length' <<<"$JSON")"
  GOOD_DBSERVERS="$(jq '.Health | map(select(.Role == "DBServer" and .Status == "GOOD")) | length' <<<"$JSON")"
  cat <<COLLECTD
PUTVAL "$HOSTNAME/exec-arangodb/health_coordinatorsTotal" interval=$INTERVAL N:$TOTAL_COORDINATORS
PUTVAL "$HOSTNAME/exec-arangodb/health_coordinatorsGood" interval=$INTERVAL N:$GOOD_COORDINATORS
PUTVAL "$HOSTNAME/exec-arangodb/health_dbserversTotal" interval=$INTERVAL N:$TOTAL_DBSERVERS
PUTVAL "$HOSTNAME/exec-arangodb/health_dbserversGood" interval=$INTERVAL N:$GOOD_DBSERVERS
COLLECTD
done
Make the script above executable:
$ chmod +x /etc/collectd/arango_cluster_health.plugin.bash
Add the following types to the types database:
/etc/collectd/arangodb_types.db
health_coordinatorsTotal value:GAUGE:U:U
health_coordinatorsGood value:GAUGE:U:U
health_dbserversTotal value:GAUGE:U:U
health_dbserversGood value:GAUGE:U:U
Register it with the exec plugin by creating this file:
/etc/collectd/collectd.conf.d/exec.conf
LoadPlugin exec
<Plugin exec>
  Exec "nobody:nogroup" "/etc/collectd/arango_cluster_health.plugin.bash" "http://coordinator.arangodb.local:8529/_admin/cluster/health"
</Plugin>
The address coordinator.arangodb.local:8529 needs to be set to a Coordinator of the Cluster to monitor. If needed, username and password can be provided in the URL for HTTP basic auth, i.e. replace http://coordinator.arangodb.local:8529 with http://USERNAME:PASSWORD@coordinator.arangodb.local:8529. Note that the password can then be read by other users on the same system using ps. User and group (nobody and nogroup) can be chosen freely, as long as they have permission to execute the script /etc/collectd/arango_cluster_health.plugin.bash.
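To see what the script's jq filters count, you can replay the logic on a sample response by hand. The inline JSON below is a hypothetical, heavily truncated /_admin/cluster/health document; grep is used here only so the illustration works without jq, and it depends on this exact key order, whereas the script's jq filters parse the JSON properly:

```shell
# Hypothetical, truncated sample of the /_admin/cluster/health response
JSON='{"Health":{"CRDN-1":{"Role":"Coordinator","Status":"GOOD"},"CRDN-2":{"Role":"Coordinator","Status":"FAILED"},"PRMR-1":{"Role":"DBServer","Status":"GOOD"}}}'
# Count all coordinators, then only the healthy ones
TOTAL_COORDINATORS=$(echo "$JSON" | grep -o '"Role":"Coordinator"' | wc -l)
GOOD_COORDINATORS=$(echo "$JSON" | grep -o '"Role":"Coordinator","Status":"GOOD"' | wc -l)
echo "$GOOD_COORDINATORS of $TOTAL_COORDINATORS coordinators are GOOD"
```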
Adding useful dashboards
The following JSON documents can be added to the rows array of the Grafana dashboard example shared above.
You can alternatively add them manually, by adding panels of type Singlestat. Add one each for the total number of Coordinators and DBServers, using the metrics collectd_exec_health_coordinatorsTotal and collectd_exec_health_dbserversTotal, respectively. Go to the Options tab, and under Value, set Stat to Current. Then, add one each for the number of faulty Coordinators and DBServers. As queries, use collectd_exec_health_coordinatorsTotal - collectd_exec_health_coordinatorsGood and collectd_exec_health_dbserversTotal - collectd_exec_health_dbserversGood, respectively.
Under Options, also set Stat to Current. Check the box Coloring → Background, set Thresholds to 1,1, and choose an all-clear color (e.g. green) as the first color and a warning color (e.g. red) as the second and third. That way, as soon as one server goes down, the panel turns red.
The following is a screenshot of a possible Grafana dashboard: