How to Monitor ArangoDB using collectd, Prometheus and Grafana
Information on how to set up a monitoring system for ArangoDB (standalone or cluster)
Introduction
ArangoDB provides several statistics via HTTP/JSON APIs. Such statistics can be used to monitor ArangoDB, when collected, stored and then visualized.
In this article we present an ArangoDB monitoring approach that uses, under Linux, the tools collectd, Prometheus and Grafana. We start with an overview of how to install and configure the needed tools. Then we walk you through the steps required to get some data through the pipeline and visualize it. A more complete example follows. Finally, we provide an example of monitoring the health of an ArangoDB Cluster.
Required Software Tools and Components
The following is the list of tools used in this setup:
- ArangoDB
- collectd
- Prometheus
- Grafana
The data flow between the above tools is as follows:
- collectd gathers data from ArangoDB using its plugin curl_json
- Prometheus fetches the data from collectd, which exposes it via its plugin write_prometheus (available since collectd v. 5.7)
- Grafana queries Prometheus to visualize the data
Installing the software
We assume you already installed ArangoDB.
For this setup to work, you will need at least one instance of collectd. Please use version 5.7 or higher, so that the required write_prometheus plugin is included. You may prefer to install collectd on every server in your setup, as it can feed lots of valuable information about those systems into your Prometheus database, such as CPU, memory or disk usage, which complements the data from ArangoDB nicely. However, a single installation suffices to get the information provided by ArangoDB, and you may want to start with that.
Finally, you need to install Prometheus and Grafana.
Basic configuration
In the following examples, we use these names for the different installations:
- coordinator.arangodb.local for one ArangoDB Coordinator
- collectd.local for your collectd instance
- prometheus.local for your Prometheus instance
These may also be installed on the same machine. Just replace the names used here with the actual names (or plain IP addresses) of your installations.
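These names are only placeholders. If you are experimenting on a single machine and want the example hostnames to resolve as written, one optional convenience (an assumption of this article, not a requirement of the setup) is to map them in /etc/hosts:

```
127.0.0.1   coordinator.arangodb.local
127.0.0.1   collectd.local
127.0.0.1   prometheus.local
```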
collectd
Assuming you are using a default collectd installation, /etc/collectd/collectd.conf should already contain the following lines to include additional *.conf files from the directory /etc/collectd/collectd.conf.d:
<Include "/etc/collectd/collectd.conf.d">
Filter "*.conf"
</Include>
You may want to set/add a line to specify the time interval in seconds after which collectd fetches another set of data:
Interval 60
However, this can also be set for each plugin separately.
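As a sketch of the per-plugin variant: collectd accepts an Interval option inside a <LoadPlugin> block (the plugin name and the 10-second value below are arbitrary choices for illustration; if you use this form, drop the plain LoadPlugin line for that plugin to avoid a double-load warning):

```
<LoadPlugin curl_json>
  Interval 10
</LoadPlugin>
```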
Now add the following file to configure the write_prometheus plugin:
/etc/collectd/collectd.conf.d/write_prometheus.conf
LoadPlugin write_prometheus
<Plugin "write_prometheus">
  Port "9103"
</Plugin>
After (re)starting collectd, the Prometheus interface should already be available. To check that it works, open the address http://collectd.local:9103/metrics in your browser. Do not forget to replace collectd.local with your actual collectd server. You should see something like this:
# HELP collectd_df_df_complex write_prometheus plugin: 'df' Type: 'df_complex', Dstype: 'gauge', Dsname: 'value'
# TYPE collectd_df_df_complex gauge
collectd_df_df_complex{df="etc-hostname",type="free",instance="3c77f4c05a29"} 377251528704 1518599082748
collectd_df_df_complex{df="etc-hostname",type="reserved",instance="3c77f4c05a29"} 23305961472 1518599082748
collectd_df_df_complex{df="etc-hostname",type="used",instance="3c77f4c05a29"} 57782255616 1518599082748
...
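Each non-comment line of this output follows the Prometheus text exposition format: the metric name with its labels, the value, and a timestamp in milliseconds. Purely as an illustration of that layout (collectd and Prometheus do this parsing for you), here is one sample line from above pulled apart with awk:

```shell
# One sample line from the write_prometheus endpoint (values taken from the example above)
LINE='collectd_df_df_complex{df="etc-hostname",type="free",instance="3c77f4c05a29"} 377251528704 1518599082748'
# The labels contain no spaces, so the line splits into three fields:
# name{labels}, value, timestamp-in-ms
VALUE=$(echo "$LINE" | awk '{print $(NF-1)}')
STAMP=$(echo "$LINE" | awk '{print $NF}')
echo "value=$VALUE timestamp=$STAMP"
# → value=377251528704 timestamp=1518599082748
```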
Now we are ready to connect collectd to Prometheus.
Prometheus
A minimal working configuration file looks like this:
/etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
        - 'collectd.local:9103'
In case you already have a configuration file, you only need to add the line - 'collectd.local:9103' to an existing job node, or add a job of your own. You may also add multiple targets here if you chose to install multiple collectd instances. Later you will be able to tell the metrics of the targets apart, as Prometheus enriches your time series with the labels instance="collectd.local:9103" and job="node".
You may also want to configure how often Prometheus fetches data from collectd (taking into account also the Interval setting of collectd):
/etc/prometheus/prometheus.yml
global:
  scrape_interval: 60s
The default setting for scrape_interval is 1m. More information can be found in the Prometheus documentation on configuration.
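Putting the two fragments together, a complete minimal /etc/prometheus/prometheus.yml for this setup could look like the sketch below (adjust the target to your actual collectd host):

```yaml
global:
  scrape_interval: 60s

scrape_configs:
  - job_name: node
    static_configs:
      - targets:
        - 'collectd.local:9103'
```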
After (re)starting Prometheus, visit http://prometheus.local:9090/targets in your browser. There should be a table node containing your endpoint, and its State should be UP: this means Prometheus is already scraping data from your collectd instance. It may take a minute (depending on the scrape_interval you have used) until the status changes from UNKNOWN to UP.
Prometheus is now set up.
Grafana
After logging into your Grafana installation, you should arrive at the Home Dashboard , where there is a link to Create your first data source. Alternatively, navigate to Configuration → Data sources and from there to Add data source.
Fill out the field Name for your Prometheus data source (choose freely). You probably want to check the box Default to set it as your default data source. As Type, choose Prometheus.
Add your Prometheus server under HTTP → URL: http://prometheus.local:9090.
Finally, click on Save & Test. If everything is configured correctly, you should get the message Data source is working.
Step-by-step example: Adding data to the pipeline
In this example, we add two metrics to our setup:
- The total physical memory in the ArangoDB Cluster (the sum of the physical memory of all Coordinators)
- The total resident set size, i.e. the amount of memory used by the ArangoDB instances
Other metrics can be added the same way.
Initial configuration of collectd / curl_json
This step has to be done only once. You can extend the configuration later as needed.
Add a config file for the curl_json collectd plugin:
/etc/collectd/collectd.conf.d/curl_json.conf
LoadPlugin curl_json
TypesDB "/etc/collectd/arangodb_types.db"
<Plugin curl_json>
  # Interval 60
  <URL "http://coordinator.arangodb.local:8529/_admin/aardvark/statistics/coordshort">
    # Instance "arango_coordshort"
    # Set your authentication to Aardvark here:
    User "root"
    Password ""
    # IMPORTANT: Add <Key> blocks here! The configuration file will not be valid
    # until there is at least one <Key> block.
  </URL>
</Plugin>
Optionally, you may override the Interval setting, specifying how often (in seconds) curl_json should fetch data from ArangoDB. Please note that a very low setting may generate load and therefore reduce the performance of the database.
Also optionally, you may add an Instance parameter. If you set it, for example to arango_coordshort, the label curl_json="arango_coordshort" will be added to all metrics configured in the <URL> block. Otherwise, the label curl_json="default" will be used.
You have to set User and Password to the credentials you use to log in to http://coordinator.arangodb.local:8529/.
Also, please create the file /etc/collectd/arangodb_types.db. It may initially be empty.
Getting data from ArangoDB to collectd with curl_json
The URL http://coordinator.arangodb.local:8529/_admin/aardvark/statistics/coordshort may be visited with a browser to get an overview of the available data. The response looks something like this:
http://coordinator.arangodb.local:8529/_admin/aardvark/statistics/coordshort
{
  "enabled": true,
  "data": {
    "http": { ... },
    "times": [ ... ],
    "physicalMemory": 101083078656,
    "residentSizeCurrent": 818298880,
    ...
  }
}
So the data we’re looking for is available under data/physicalMemory and data/residentSizeCurrent, respectively. These need to be added to the curl_json configuration above.
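For illustration only, here is a quick way to pull one of these values out of a saved copy of the response with standard shell tools. The inline JSON is a hand-truncated sample with the numbers from the example above; in the actual setup, curl_json does this extraction for you:

```shell
# Hand-truncated sample of the coordshort response (values from the example above)
JSON='{"enabled":true,"data":{"physicalMemory":101083078656,"residentSizeCurrent":818298880}}'
# Extract data/physicalMemory; a crude sed pattern is enough for this flat sample
MEM=$(echo "$JSON" | sed -n 's/.*"physicalMemory":\([0-9]*\).*/\1/p')
echo "$MEM"
# → 101083078656
```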
First we add two new types:
/etc/collectd/arangodb_types.db
coordshort_physicalMemory value:GAUGE:U:U
coordshort_residentSizeCurrent value:GAUGE:U:U
Using these types, curl_json will use the names collectd_curl_json_coordshort_physicalMemory and collectd_curl_json_coordshort_residentSizeCurrent for the metrics. You may choose your own names for the types. If you just use builtin types (e.g. gauge) instead, all data will be fed into the same metric (e.g. collectd_curl_json_coordshort_gauge) and can only be discerned using labels.
Now replace the lines

# IMPORTANT: Add <Key> blocks here! The configuration file will not be valid
# until there is at least one <Key> block.

in your <URL> block with:
/etc/collectd/collectd.conf.d/curl_json.conf
<Key "data/physicalMemory">
  Type "coordshort_physicalMemory"
</Key>
<Key "data/residentSizeCurrent">
  Type "coordshort_residentSizeCurrent"
</Key>
The Key is the path to the data in the JSON document above, while the Type is the one we added to /etc/collectd/arangodb_types.db.
After a restart of collectd and a minute (or whatever Interval is configured) of waiting, lines similar to the following should appear in the endpoint of write_prometheus:
http://collectd.local:9103/metrics
# HELP collectd_curl_json_coordshort_physicalMemory write_prometheus plugin: 'curl_json' Type: 'coordshort_physicalMemory', Dstype: 'gauge', Dsname: 'value'
# TYPE collectd_curl_json_coordshort_physicalMemory gauge
collectd_curl_json_coordshort_physicalMemory{curl_json="default",type="data-physicalMemory",instance="3c77f4c05a29"} 101083078656 1518609151473
# HELP collectd_curl_json_coordshort_residentSizeCurrent write_prometheus plugin: 'curl_json' Type: 'coordshort_residentSizeCurrent', Dstype: 'gauge', Dsname: 'value'
# TYPE collectd_curl_json_coordshort_residentSizeCurrent gauge
collectd_curl_json_coordshort_residentSizeCurrent{curl_json="default",type="data-residentSizeCurrent",instance="3c77f4c05a29"} 810180608 1518609151473
A minute or so later (depending on scrape_interval), the first values should arrive in Prometheus. This can be checked by executing, for example, the query collectd_curl_json_coordshort_physicalMemory in the Prometheus GUI under Graph. It should yield some results in either the Console or the Graph tab. If the message No datapoints found. appears, the metrics weren’t scraped (yet).
Creating a graph in Grafana
Now that the metrics on physical memory and resident set size, named collectd_curl_json_coordshort_physicalMemory and collectd_curl_json_coordshort_residentSizeCurrent, respectively, have arrived in Prometheus, graphs to visualize them can be added in Grafana.
First, create a new dashboard (unless you created one already): either click on Create your first dashboard on Grafana’s Home Dashboard, or navigate to Create → Dashboard. You have to save all changes made to a dashboard explicitly, either by pressing Ctrl+S, or by clicking on the floppy disk symbol in the upper right.
Then, a New panel dialog should be open. You can add more panels to the dashboard with the Add panel button in the upper right. Select the Graph visualization.
Navigate to Panel title and Edit.
In the General tab, you can set the panel’s Title; e.g. ArangoDB cluster: total memory. In the Metrics tab, set query A to collectd_curl_json_coordshort_physicalMemory and set the Legend format to Physical memory. Now add another query B, set it to collectd_curl_json_coordshort_residentSizeCurrent and its Legend format to Resident set size. Switch to the Axes tab, and set Left Y’s Unit to Data (IEC) → bytes. Close the panel by clicking on the X to the right.
If you are satisfied with the result, do not forget to save the dashboard!
More complete configurations
Add the following lines to
/etc/collectd/arangodb_types.db
coordshort_physicalMemory value:GAUGE:U:U
coordshort_residentSizeCurrent value:GAUGE:U:U
coordshort_clientConnectionsCurrent value:GAUGE:U:U
coordshort_bytesSentPerSecond value:GAUGE:U:U
coordshort_bytesReceivedPerSecond value:GAUGE:U:U
coordshort_avgRequestTime value:GAUGE:U:U
coordshort_http_requestsPerSecond value:GAUGE:U:U
coordshort_http_optionsPerSecond value:GAUGE:U:U
coordshort_http_putsPerSecond value:GAUGE:U:U
coordshort_http_headsPerSecond value:GAUGE:U:U
coordshort_http_postsPerSecond value:GAUGE:U:U
coordshort_http_getsPerSecond value:GAUGE:U:U
coordshort_http_deletesPerSecond value:GAUGE:U:U
coordshort_http_othersPerSecond value:GAUGE:U:U
coordshort_http_patchesPerSecond value:GAUGE:U:U
and the following lines in the <URL> block in
/etc/collectd/collectd.conf.d/curl_json.conf
<Key "data/physicalMemory">
  Type "coordshort_physicalMemory"
</Key>
<Key "data/residentSizeCurrent">
  Type "coordshort_residentSizeCurrent"
</Key>
<Key "data/clientConnectionsCurrent">
  Type "coordshort_clientConnectionsCurrent"
</Key>
<Key "data/bytesSentPerSecond/0">
  Type "coordshort_bytesSentPerSecond"
</Key>
<Key "data/bytesReceivedPerSecond/0">
  Type "coordshort_bytesReceivedPerSecond"
</Key>
<Key "data/avgRequestTime/0">
  Type "coordshort_avgRequestTime"
</Key>
<Key "data/http/optionsPerSecond/0">
  Instance "OPTION"
  Type "coordshort_http_requestsPerSecond"
</Key>
<Key "data/http/putsPerSecond/0">
  Instance "PUT"
  Type "coordshort_http_requestsPerSecond"
</Key>
<Key "data/http/headsPerSecond/0">
  Instance "HEAD"
  Type "coordshort_http_requestsPerSecond"
</Key>
<Key "data/http/postsPerSecond/0">
  Instance "POST"
  Type "coordshort_http_requestsPerSecond"
</Key>
<Key "data/http/getsPerSecond/0">
  Instance "GET"
  Type "coordshort_http_requestsPerSecond"
</Key>
<Key "data/http/deletesPerSecond/0">
  Instance "DELETE"
  Type "coordshort_http_requestsPerSecond"
</Key>
<Key "data/http/othersPerSecond/0">
  Instance "other"
  Type "coordshort_http_requestsPerSecond"
</Key>
<Key "data/http/patchesPerSecond/0">
  Instance "PATCH"
  Type "coordshort_http_requestsPerSecond"
</Key>
Then restart collectd.
Grafana dashboard
In the Grafana GUI, navigate to Create → Import and paste the following JSON to get a dashboard with some cluster graphs. You only need to select your data source to configure it. The dashboard was created with Grafana 4.6.3, the current stable version at the time of writing. If there are problems importing it, check your version first.
Adding ArangoDB Cluster Health info to collectd/Prometheus/Grafana
To perform this step we assume you already have a working setup of ArangoDB, collectd, Prometheus and Grafana (see previous sections).
The Cluster Health information, which is used to show the number of Coordinators and DBServers on the Dashboard of the ArangoDB Web Interface, is available as JSON via HTTP, but it is not suitable for direct consumption with the curl_json plugin in collectd. However, it is possible to get around this limitation using the exec plugin and a small script.
Requirements
The packages curl and jq need to be installed on your system.
Adding and configuring the plugin in collectd
Create the following bash script:
/etc/collectd/arango_cluster_health.plugin.bash
#!/bin/bash
HOSTNAME="${COLLECTD_HOSTNAME:-$(hostname -f)}"
INTERVAL="${COLLECTD_INTERVAL:-60}"
ARANGO_HEALTH_URL="$1"
ARANGO_USER="$2"
ARANGO_PASSWORD="$3"

# Bail out if the required tools are missing
if ! which curl jq > /dev/null
then
  exit 1
fi

while sleep "$INTERVAL"
do
  JSON="$(curl -s -u "$ARANGO_USER:$ARANGO_PASSWORD" "$ARANGO_HEALTH_URL")"
  if [ $? -ne 0 ]
  then
    continue
  fi
  TOTAL_COORDINATORS="$(jq '.Health | map(select(.Role == "Coordinator")) | length' <<<"$JSON")"
  GOOD_COORDINATORS="$(jq '.Health | map(select(.Role == "Coordinator" and .Status == "GOOD")) | length' <<<"$JSON")"
  TOTAL_DBSERVERS="$(jq '.Health | map(select(.Role == "DBServer")) | length' <<<"$JSON")"
  GOOD_DBSERVERS="$(jq '.Health | map(select(.Role == "DBServer" and .Status == "GOOD")) | length' <<<"$JSON")"
  cat <<COLLECTD
PUTVAL "$HOSTNAME/exec-arangodb/health_coordinatorsTotal" interval=$INTERVAL N:$TOTAL_COORDINATORS
PUTVAL "$HOSTNAME/exec-arangodb/health_coordinatorsGood" interval=$INTERVAL N:$GOOD_COORDINATORS
PUTVAL "$HOSTNAME/exec-arangodb/health_dbserversTotal" interval=$INTERVAL N:$TOTAL_DBSERVERS
PUTVAL "$HOSTNAME/exec-arangodb/health_dbserversGood" interval=$INTERVAL N:$GOOD_DBSERVERS
COLLECTD
done
Make the script above executable:
$ chmod +x /etc/collectd/arango_cluster_health.plugin.bash
Add the following types to the types database:
/etc/collectd/arangodb_types.db
health_coordinatorsTotal value:GAUGE:U:U
health_coordinatorsGood value:GAUGE:U:U
health_dbserversTotal value:GAUGE:U:U
health_dbserversGood value:GAUGE:U:U
Register it with the exec plugin by creating this file:
/etc/collectd/collectd.conf.d/exec.conf
LoadPlugin exec
<Plugin exec>
  Exec "nobody:nogroup" "/etc/collectd/arango_cluster_health.plugin.bash" "http://coordinator.arangodb.local:8529/_admin/cluster/health"
</Plugin>
The address coordinator.arangodb.local:8529 needs to be set to a Coordinator of the Cluster to monitor. If needed, username and password can be provided in the URL for HTTP basic auth, i.e. replace http://coordinator.arangodb.local:8529 with http://USERNAME:PASSWORD@coordinator.arangodb.local:8529. Note that the password can then be read by other users on the same system using ps. User and group (nobody and nogroup) can be chosen freely, as long as they have permission to execute the script /etc/collectd/arango_cluster_health.plugin.bash.
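To see what the script's jq filters count, you can replay the logic on a sample response by hand. The inline JSON below is a hypothetical, heavily truncated /_admin/cluster/health document; grep is used here only so the illustration works without jq, and it depends on this exact key order, whereas the script's jq filters parse the JSON properly:

```shell
# Hypothetical, truncated sample of the /_admin/cluster/health response
JSON='{"Health":{"CRDN-1":{"Role":"Coordinator","Status":"GOOD"},"CRDN-2":{"Role":"Coordinator","Status":"FAILED"},"PRMR-1":{"Role":"DBServer","Status":"GOOD"}}}'
# Count all coordinators, then only the healthy ones
TOTAL_COORDINATORS=$(echo "$JSON" | grep -o '"Role":"Coordinator"' | wc -l)
GOOD_COORDINATORS=$(echo "$JSON" | grep -o '"Role":"Coordinator","Status":"GOOD"' | wc -l)
echo "$GOOD_COORDINATORS of $TOTAL_COORDINATORS coordinators are GOOD"
```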
Adding useful dashboards
The following JSON documents can be added to the rows array of the Grafana dashboard example shared above.
You can alternatively add them manually, by adding panels of type Singlestat. Add one each for the total number of Coordinators and DBServers, using the metrics collectd_exec_health_coordinatorsTotal and collectd_exec_health_dbserversTotal, respectively. Go to the Options tab, and under Value, set Stat to Current. Then, add one each for the number of faulty Coordinators and DBServers. As queries, use collectd_exec_health_coordinatorsTotal - collectd_exec_health_coordinatorsGood and collectd_exec_health_dbserversTotal - collectd_exec_health_dbserversGood, respectively.
Under Options, also set Stat to Current. Check the box Coloring → Background, set Thresholds to 1,1, and choose an all-clear color (e.g. green) as the first color and a warning color (e.g. red) as the second and third. That way, as soon as one server goes down, the panel turns red.
The following is a screenshot of a possible Grafana dashboard: