Using The New ArangoDB Geo Index Cursor via AQL
This tutorial will show you how to import OpenStreetMap data into an ArangoDB instance and execute efficient geo queries on your database.
Requirements:
- an ArangoDB 3.2 (or later) installation
- Python 3 and its modules lxml and pyArango
- bunzip2, bash
The tutorial is split into three parts:
- data acquisition and import
- creating the index
- querying ArangoDB with geo index
Import
We have chosen to search for restaurants near our headquarters in Cologne. This will give us some new ideas where to have lunch and yields easily verifiable results.
The import.sh script downloads an osm file and extracts it using bunzip2. The extracted file is then imported into a running arangod instance by running places_to_eat.py with import as argument. places_to_eat.py makes use of lxml, which allows event-based xml parsing. This lets us deal with huge osm xml files. Finally, the pyArango Python driver is used to connect to the database and store the extracted information about restaurants, such as location (latitude/longitude) and name.
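To illustrate the event-based extraction step, here is a minimal sketch. It uses the standard library's iterparse, which offers the same event-driven model as lxml, and a tiny inline XML snippet standing in for the downloaded osm file; the snippet's data and the extract_restaurants helper are illustrative, not the tutorial's actual script.

```python
import io
import xml.etree.ElementTree as ET

# Tiny OSM-like snippet standing in for the real (huge) downloaded file.
OSM_XML = b"""<?xml version="1.0"?>
<osm>
  <node id="1" lat="50.931856" lon="6.9400906">
    <tag k="amenity" v="restaurant"/>
    <tag k="name" v="El Gaucho"/>
  </node>
  <node id="2" lat="50.94" lon="6.95">
    <tag k="amenity" v="bench"/>
  </node>
</osm>
"""

def extract_restaurants(source):
    """Yield dicts for <node> elements tagged amenity=restaurant.

    Event-based parsing frees each element after use, so arbitrarily
    large osm files can be processed with roughly constant memory.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "node":
            tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
            if tags.get("amenity") == "restaurant":
                yield {
                    "name": tags.get("name"),
                    "lat": float(elem.get("lat")),  # store as float, not string
                    "lon": float(elem.get("lon")),
                }
            elem.clear()  # release the element to keep memory flat

places = list(extract_restaurants(io.BytesIO(OSM_XML)))
print(places)
```

Each resulting dict would then be stored as one document via pyArango.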
Index Creation
Now that the data is imported, we create a geo index so we can run efficient geo queries. This can be done with the following command:
// create index
arangod> db.places_to_eat.ensureIndex({ type: "geo", fields: [ "lat", "lon" ] })
{
  "constraint" : false,
  "fields" : [
    "lat",
    "lon"
  ],
  "id" : "places_to_eat/9167",
  "ignoreNull" : true,
  "sparse" : true,
  "type" : "geo2",
  "unique" : false,
  "isNewlyCreated" : true
}
This command will create a geo index on the fields lat and lon. You need to make sure that the data stored for latitude and longitude is given in degrees and as a floating point type. Providing location values as strings is not supported.
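Since osm attribute values arrive as strings, it can help to coerce the coordinates explicitly before inserting documents. A hedged sketch (the normalize_coordinates helper is illustrative, not part of the tutorial's script):

```python
def normalize_coordinates(doc):
    """Coerce coordinate fields to floats so the geo index can use them.

    Raises ValueError if a coordinate cannot be parsed as a number,
    which is preferable to silently storing an unindexable string.
    """
    doc["lat"] = float(doc["lat"])
    doc["lon"] = float(doc["lon"])
    return doc

doc = normalize_coordinates({"name": "El Gaucho", "lat": "50.931856", "lon": "6.9400906"})
print(doc)
```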
Now you can verify that the index has been created with:
// inspect index
arangod> db.places_to_eat.getIndexes()
[
  {
    "fields" : [
      "_key"
    ],
    "id" : "places_to_eat/0",
    "selectivityEstimate" : 1,
    "sparse" : false,
    "type" : "primary",
    "unique" : true
  },
  {
    "constraint" : false,
    "fields" : [
      "lat",
      "lon"
    ],
    "id" : "places_to_eat/9167",
    "ignoreNull" : true,
    "sparse" : true,
    "type" : "geo2",
    "unique" : false
  }
]
Using The Index
Now with data and index in place, we are ready to explore which restaurants are near us. We do this by writing a query that should look very familiar to AQL users:
$ arangosh
// define query: hq location, restaurant location
arangod> q = "FOR d IN places_to_eat SORT distance(50.9316394,6.9398916,d.lat,d.lon) ASC LIMIT 1 RETURN d"
This query iterates over the places_to_eat collection and sorts the documents by distance to our headquarters, located at geo-coordinate (50.9316394, 6.9398916). Finally, we limit the number of results to 1. The new distance function that we have used in this query takes two pairs of geo-coordinates: one represents a fixed location, the other the locations of the documents in the collection (accessed via d.lat/d.lon).
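The distance function computes the great-circle (haversine) distance between the two coordinate pairs, in meters. A plain-Python sketch of that computation, using our headquarters and a nearby point as example inputs:

```python
import math

def distance(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in meters between two
    (latitude, longitude) pairs given in degrees."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Headquarters vs. a point one block away: a few tens of meters.
d = distance(50.9316394, 6.9398916, 50.931856, 6.9400906)
print(d)
```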
Now let us use the db._explain()
function to see what is going on in the database:
// explain query
arangod> db._explain(q)
Query string:
 FOR d IN places_to_eat SORT distance(50.9316394,6.9398916,d.lat,d.lon) ASC LIMIT 1 RETURN d

Execution plan:
 Id   NodeType        Est.   Comment
  1   SingletonNode      1   * ROOT
  7   IndexNode       4443     - FOR d IN places_to_eat   /* geo2 index scan */
  5   LimitNode          1       - LIMIT 0, 1
  6   ReturnNode         1       - RETURN d

Indexes used:
 By   Type   Collection      Unique   Sparse   Selectivity   Fields             Ranges
  7   geo2   places_to_eat   false    true     n/a           [ `lat`, `lon` ]   NEAR(d, 50.9316394, 6.9398916)

Optimization rules applied:
 Id   RuleName
  1   geo-index-optimizer
  2   remove-unnecessary-calculations-2
When the optimizer discovers the distance function, it replaces the enumerate collection node and the sort node with an index node. The index node provides the documents in sorted order, so we only need to inspect as many elements as required by the LIMIT statement. The second point will be elaborated on shortly. Let us inspect the query's result first:
arangod> db._query(q).toArray()
[
  {
    "_key" : "8995",
    "_id" : "places_to_eat/8995",
    "_rev" : "_VS53pTm---",
    "cuisine" : "argentinian",
    "amenity" : "restaurant",
    "lat" : 50.931856,
    "lon" : 6.9400906,
    "name" : "El Gaucho - Original argentinisches Restaurant & Steakhaus"
  }
]
There is El Gaucho, the restaurant with the best steaks in town, almost right below our office!
Now let's understand why we need to inspect fewer documents when using the geo index. Assume we want to query restaurants that are not too far away. So let's tweak our query to search within a certain area:
arangod> q2="FOR d IN places_to_eat FILTER distance(50.9316394,6.9398916,d.lat,d.lon) < 1000 RETURN d"
This query makes the advantage very obvious if you compare the optimized and non-optimized versions of the execution plan. So let us first take a look at an execution plan that does not utilise the geo index:
arangod> db._explain(q2, {}, { 'optimizer' : { 'rules' : ["-geo-index-optimizer"] }})
Query string:
 FOR d IN places_to_eat FILTER distance(50.9316394,6.9398916,d.lat,d.lon) < 1000 RETURN d

Execution plan:
 Id   NodeType                  Est.   Comment
  1   SingletonNode                1   * ROOT
  2   EnumerateCollectionNode   4443     - FOR d IN places_to_eat   /* full collection scan */
  3   CalculationNode           4443       - LET #1 = (DISTANCE(50.9316394, 6.9398916, d.`lat`, d.`lon`) < 1000)   /* simple expression */   /* collections used: d : places_to_eat */
  4   FilterNode                4443       - FILTER #1
  5   ReturnNode                4443       - RETURN d

Indexes used:
 none

Optimization rules applied:
 none
As you can see, we need to iterate over the full collection because we retrieve the documents in an arbitrary order. Now let us inspect the optimized plan:
arangod> db._explain(q2)
Query string:
 FOR d IN places_to_eat FILTER distance(50.9316394,6.9398916,d.lat,d.lon) < 1000 RETURN d

Execution plan:
 Id   NodeType     Est.   Comment
  1   SingletonNode   1   * ROOT
  6   IndexNode    4443     - FOR d IN places_to_eat   /* geo2 index scan */
  5   ReturnNode   4443       - RETURN d

Indexes used:
 By   Type   Collection      Unique   Sparse   Selectivity   Fields             Ranges
  6   geo2   places_to_eat   false    true     n/a           [ `lat`, `lon` ]   WITHIN(d, 50.9316394, 6.9398916, 1000, false)

Optimization rules applied:
 Id   RuleName
  1   geo-index-optimizer
  2   remove-unnecessary-calculations-2
The optimized plan does not iterate over the full collection, and it does not need a FilterNode. All the information is contained in the IndexNode, and only documents within the given radius are considered.
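The intuition can be captured with a toy model (not ArangoDB's actual geo index internals): the index hands back documents ordered by distance to the query point, so a radius query can stop at the first document that falls outside the radius instead of checking every document in the collection.

```python
# Hypothetical distances (in meters) of documents to the query point,
# as a geo index would deliver them: in ascending order.
docs_by_distance = [28.0, 310.5, 742.1, 1204.9, 5300.0]

def within(sorted_distances, radius):
    """Early-terminating scan over distance-sorted results."""
    hits = []
    inspected = 0
    for d in sorted_distances:
        inspected += 1
        if d >= radius:
            break  # everything after this point is farther away
        hits.append(d)
    return hits, inspected

hits, inspected = within(docs_by_distance, 1000)
print(hits, inspected)  # 3 hits found after inspecting only 4 of 5 documents
```

A full collection scan, by contrast, would have to evaluate the distance expression for all 5 documents before filtering.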
Conclusion
Using a geo index with the new distance function will shorten the execution time of queries. This is especially true for queries that utilise sort and filter conditions, as shown in this article. The improvement is achieved because the optimizer adjusts the plan so that:
- only relevant documents are inspected
- the number of nodes in the plan is reduced
We hope you enjoy the new functionality and give us some feedback so we can further improve your experience.