
Performance analysis using pyArango Part I

This is Part I of the Performance analysis using pyArango blog series. Please refer here for Part II (cluster) and Part III (measuring system capacity).

Usually, your application will consist of a set of queries on ArangoDB for one scenario (i.e. displaying your user's account information, etc.). When you want to make your application scale, you fire requests at it and see how it behaves. Depending on internal processes, the execution times of these scenarios vary a bit.

We will take intervals of 10 seconds and graph the values we get for each interval:

  • average – the sum of all times measured during the interval, divided by their count
  • minimum – the fastest request
  • maximum – the slowest request
  • the time within which “most” (i.e. 95%) of your users can expect an answer – this is called the 95th percentile
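
To make these terms concrete, here is a minimal sketch (plain Python, the function name is ours) of how such statistics could be computed from the reply times collected during one interval:

    # minimal sketch: statistics for one 10-second interval of reply times (ms)
    def interval_stats(samples):
        ordered = sorted(samples)
        return {
            'average': sum(ordered) / float(len(ordered)),
            'minimum': ordered[0],
            'maximum': ordered[-1],
            # 95th percentile: the value 95% of the samples stay at or below
            '95-percentile': ordered[max(0, int(len(ordered) * 0.95) - 1)],
        }

    print(interval_stats([12, 15, 11, 14, 13, 90, 12, 16, 14, 13]))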


We want to push the installation hard, yet not waste too many resources on these statistics. However, we want them in real time, so we can align them with other performance figures, e.g.:

  • the number of open file handles
  • the number of running threads
  • the CPU resources used

Our monitoring solution

The statsd pattern evolved out of the demand for exactly this kind of metrics, and it has meanwhile been implemented in many monitoring solutions. The one we pick here is collectd, since collectd can also aggregate the resource usage of a process.

Since later on we will need to monitor per-process file descriptors, we patch collectd with this pull request (or use a version newer than 5.7.2 once it's available).

In a permanent monitoring solution, one would e.g. want to configure graphite to collect the values. To limit the scope of this article, we use the collectd rrd backend and inspect our metrics using KCollectd, a simple KDE program available in most current Linux distributions.
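
For reference, writing the values to rrd files only takes a few lines of collectd configuration – a sketch, assuming the Debian-style default data directory (adjust the path to your distribution):

        LoadPlugin rrdtool
        <Plugin rrdtool>
          DataDir "/var/lib/collectd/rrd"
        </Plugin>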

Monitoring ArangoDB CPU Usage with collectd

We match the process by regular expression using this collectd configuration snippet:

 

        <Plugin processes>
          CollectFileDescriptor true
          CollectContextSwitch true
          ProcessMatch "adb_single" ".*bin/arangod"
        </Plugin>

Integrating statsd in our setup

Statsd (as mentioned above) aims to add client instrumentation to your program. Since we only want statistics, we don't need 100% accuracy: we don't mind if some of the measured values get lost, as long as we gain the advantage of not holding up the program that reports them.

For that reason, statsd packets are sent via UDP in a fire-and-forget fashion from the client to the server.
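
Since collectd ships a statsd plugin, it can act as that server itself. A configuration sketch, assuming the default statsd port 8125 and the 95% percentile we want to graph:

        LoadPlugin statsd
        <Plugin statsd>
          Host "127.0.0.1"
          Port "8125"
          TimerPercentile 95.0
          TimerLower true
          TimerUpper true
        </Plugin>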

We add the statsd client to pyArango so it can give us these metrics on a per-query basis.
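
pyArango does the reporting for us, but the wire format is simple enough to show what travels over the network: a timing metric is a single UDP datagram of the form `name:value|ms`. A hand-rolled sketch (the metric name is made up for illustration):

    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # fire and forget: no reply is expected, a lost packet is just a lost sample
    sock.sendto(b'myapp.some_query:42|ms', ('127.0.0.1', 8125))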

This implementation identifies AQL queries by hashing the query string – therefore you need to use bind variables in order to get useful values.
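
The reason is easy to see: with values inlined into the query string, every iteration produces a different string and thus a different hash, scattering the measurements across countless one-off metrics. A quick illustration (pyArango's actual hashing scheme may differ):

    import hashlib

    def query_id(aql):
        return hashlib.md5(aql.encode('utf-8')).hexdigest()[:6]

    # with a bind variable: one stable identifier across all iterations
    print(query_id('FOR doc IN test FILTER doc.count == @i - 1 RETURN doc'))
    # with inlined values: a new identifier on every iteration
    print(query_id('FOR doc IN test FILTER doc.count == 41 RETURN doc'))
    print(query_id('FOR doc IN test FILTER doc.count == 42 RETURN doc'))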

Getting in touch with “real” numbers

Here’s a little python script creating some load on the DB so we can measure some nice values:


    #!/usr/bin/python
    import statsd
    from pyArango.connection import *
    from pyArango.collection import *

    # statsd client sending its metrics to the collectd statsd plugin
    statsdc = statsd.StatsClient('127.0.0.1', 8125)

    # add reportFileName to also log the query identifiers to a file:
    # conn = Connection(username="root", password="", statsdClient = statsdc, reportFileName = "/tmp/abc")
    conn = Connection(username="root", password="", statsdClient = statsdc)

    db = conn["_system"]

    if not db.hasCollection('test'):
        testCol = db.createCollection('Collection', name='test')
    else:
        testCol = db.collections['test']
        testCol.truncate()

    i = 0
    while i < 100000:
        i += 1
        # insert one document, then query it back by its counter value
        testCol.createDocument({'foo': 'bar', 'count': i}).save()
        aql = '''
        FOR doc IN test FILTER doc.count == @i - 1 RETURN doc
        '''
        db.AQLQuery(aql, rawResults=True, batchSize=1, count=True, bindVars={'i': i})
        print('.')

Learning the identifier of our query

On a first run, we limit the loop so we can identify our queries in the `reportFileName` file `/tmp/abc`:

[467d53]:
    FOR doc IN test FILTER doc.count == @i - 1 RETURN doc

Inspecting the results

We run the test for a few minutes and monitor it with KCollectd. We choose several metrics to display by dragging them from the tree on the left into a graph area:

[Figure: performance analysis – test without index]

Under the `statsd` branch we find the values our AQL query emits; we create one graph from them.

To see the resource usage of the ArangoDB process we add another graph, and a third one showing the number of file descriptors ArangoDB is using.

From inspecting the graphs, we see the query taking up to 60 ms, which is not that bad, but its runtime behavior degrades over time – which is rather undesirable.

The blue line shows the 95th percentile, which is way below the brown line showing the maximum reply time. This indicates that there are individual replies that are much worse than the percentile – a behavior one wouldn't like to see at all.

All of the above usually indicates a full collection scan. Since we only query one document, it may be found earlier or later during the scan, which is what makes the values bounce.

Taking a look at our code, we see that this aligns with the growing number of documents we fill into the `test` collection.

Thus we take a closer look at the monitored query using explain:

Query string:
     FOR doc IN test FILTER doc.count == @i - 1 RETURN doc

Execution plan:
 Id   NodeType                    Est.   Comment
  1   SingletonNode                  1   * ROOT
  2   EnumerateCollectionNode   100000     - FOR doc IN test   /* full collection scan */
  3   CalculationNode           100000       - LET #1 = (doc.`count` == 0)   /* simple expression */   /* collections used: doc : test */
  4   FilterNode                100000       - FILTER #1
  5   ReturnNode                100000       - RETURN doc

Indexes used:
 none

Optimization rules applied:
 none

AHA! No index is used – this is why performance worsens as the collection fills up.
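
To inspect a plan like this without leaving Python, pyArango's database object offers `explainAQLQuery` – a minimal sketch, with the result being the raw explain response as a dict:

    # ask the optimizer for the execution plan without running the query
    aql = 'FOR doc IN test FILTER doc.count == @i - 1 RETURN doc'
    plan = db.explainAQLQuery(aql, bindVars={'i': 1})
    for node in plan['plan']['nodes']:
        print(node['type'])  # EnumerateCollectionNode means a full collection scan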

Fix it…

So we create an index in the setup phase of our test code: `testCol.ensureHashIndex(['count'], sparse=False)`
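
In context, the setup phase of our script from above becomes (only the index line is new):

    if not db.hasCollection('test'):
        testCol = db.createCollection('Collection', name='test')
    else:
        testCol = db.collections['test']
        testCol.truncate()

    # a hash index on 'count' lets the FILTER use an index scan
    testCol.ensureHashIndex(['count'], sparse=False)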

Once it's there, explain shows us it's going to be used:

Query string:
     FOR doc IN test FILTER doc.count == @i - 1 RETURN doc

Execution plan:
 Id   NodeType        Est.   Comment
  1   SingletonNode      1   * ROOT
  6   IndexNode          1     - FOR doc IN test   /* hash index scan */
  5   ReturnNode         1       - RETURN doc

Indexes used:
 By   Type   Collection   Unique   Sparse   Selectivity   Fields        Ranges
  6   hash   test         false    false       100.00 %   [ `count` ]   (doc.`count` == 0)

Optimization rules applied:
 Id   RuleName
  1   use-indexes
  2   remove-filter-covered-by-index
  3   remove-unnecessary-calculations-2

Re-run test and analyze

We re-run the test, and the values we measure show much better results:

[Figure: performance analysis – test with index]

While the first run wasn't finished after 7 minutes, the second run completes within 4 minutes. Since the `Thread count` numbers weren't in the same range as the CPU time values at all, we moved them over to the file handle count graph.

Remember that in the first run the worst reply times of the query were around 60 ms. Now, at the start, the `truncate` operation gives us slightly slower values, but during the regular test run the maximum values are around 5 ms – which is perfectly acceptable.

We can also see that the resource usage of the ArangoDB process is way better – 100k jiffies compared to 900k jiffies in the first test run.

We also see that the system CPU time is a third of the total CPU time, which means we do a lot of I/O (writing new documents to disk) while only mildly loading the machine with userspace calculations.

In the first run, the system CPU usage was next to unmeasurable.


What’s next

In the next articles, we're going to take a look at how to measure and instrument transactions and what to look at on clusters (Part II). We will also implement a user scenario to measure the possible capacity of a given system (Part III). Stay tuned!


Willi Goesgens

Willi is part of the ArangoDB core team. He primarily works on performance tuning and the release process of ArangoDB, and helps the community with his broad experience.
