Using The New ArangoDB Geo Index Cursor via AQL
This tutorial will show you how to import OpenStreetMap data into an ArangoDB instance and execute efficient geo queries on your database.
Requirements:
- an ArangoDB 3.2 (or later) installation
- Python 3 and its modules lxml and pyArango
- bunzip2, bash
The tutorial is split into three parts:
- data acquisition and import
- creating the index
- querying ArangoDB with geo index
Import
We have chosen to search for restaurants near our headquarters in Cologne. This will give us some new ideas where to have lunch and yields easily verifiable results.
The import.sh script downloads an osm file and extracts it using bunzip2. The extracted file is then imported into a running arangod instance by running places_to_eat.py with import as argument. places_to_eat.py makes use of lxml, which allows event-based xml parsing. This lets us deal with huge osm xml files. Finally, the pyArango Python driver is used to connect to the database and store the extracted information about restaurants, such as location (latitude/longitude) and name.
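To illustrate the event-based extraction step, here is a minimal sketch. It uses the standard library's iterparse, which offers the same event-driven model as lxml, and a tiny inline XML snippet standing in for the downloaded osm file; the snippet's data and the extract_restaurants helper are illustrative, not the tutorial's actual script.

```python
import io
import xml.etree.ElementTree as ET

# Tiny OSM-like snippet standing in for the real (huge) downloaded file.
OSM_XML = b"""<?xml version="1.0"?>
<osm>
  <node id="1" lat="50.931856" lon="6.9400906">
    <tag k="amenity" v="restaurant"/>
    <tag k="name" v="El Gaucho"/>
  </node>
  <node id="2" lat="50.94" lon="6.95">
    <tag k="amenity" v="bench"/>
  </node>
</osm>
"""

def extract_restaurants(source):
    """Yield dicts for <node> elements tagged amenity=restaurant.

    Event-based parsing frees each element after use, so arbitrarily
    large osm files can be processed with roughly constant memory.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "node":
            tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
            if tags.get("amenity") == "restaurant":
                yield {
                    "name": tags.get("name"),
                    "lat": float(elem.get("lat")),  # store as float, not string
                    "lon": float(elem.get("lon")),
                }
            elem.clear()  # release the element to keep memory flat

places = list(extract_restaurants(io.BytesIO(OSM_XML)))
print(places)
```

Each resulting dict would then be stored as one document via pyArango.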
Index Creation
Now that the data is imported, we create a geo index so we can run efficient geo queries. This can be done with the following command:
// create index
arangod> db.places_to_eat.ensureIndex({ type: "geo", fields: [ "lat", "lon" ] })
{
  "constraint" : false,
  "fields" : [
    "lat",
    "lon"
  ],
  "id" : "places_to_eat/9167",
  "ignoreNull" : true,
  "sparse" : true,
  "type" : "geo2",
  "unique" : false,
  "isNewlyCreated" : true
}
This command will create a geo index on the fields lat and lon. You need to make sure that the data stored for latitude and longitude is given in degrees and as a floating point type. Providing location values as strings is not supported.
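Since osm attribute values arrive as strings, it can help to coerce the coordinates explicitly before inserting documents. A hedged sketch (the normalize_coordinates helper is illustrative, not part of the tutorial's script):

```python
def normalize_coordinates(doc):
    """Coerce coordinate fields to floats so the geo index can use them.

    Raises ValueError if a coordinate cannot be parsed as a number,
    which is preferable to silently storing an unindexable string.
    """
    doc["lat"] = float(doc["lat"])
    doc["lon"] = float(doc["lon"])
    return doc

doc = normalize_coordinates({"name": "El Gaucho", "lat": "50.931856", "lon": "6.9400906"})
print(doc)
```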
Now you can verify that the index has been created with:
// inspect index
arangod> db.places_to_eat.getIndexes()
[
  {
    "fields" : [
      "_key"
    ],
    "id" : "places_to_eat/0",
    "selectivityEstimate" : 1,
    "sparse" : false,
    "type" : "primary",
    "unique" : true
  },
  {
    "constraint" : false,
    "fields" : [
      "lat",
      "lon"
    ],
    "id" : "places_to_eat/9167",
    "ignoreNull" : true,
    "sparse" : true,
    "type" : "geo2",
    "unique" : false
  }
]
Using The Index
Now with data and index in place, we are ready to explore which restaurants are near us. We do this by writing a query that should look very familiar to AQL users:
$ arangosh
// define query: hq location, restaurant location
arangod> q = "FOR d IN places_to_eat SORT distance(50.9316394,6.9398916,d.lat,d.lon) ASC LIMIT 1 RETURN d"
This query iterates over the places_to_eat collection and sorts the documents by distance to our headquarters, located at geo-coordinate (50.9316394, 6.9398916). Finally, we limit the number of results to 1. The new distance function that we have used in this query takes two pairs of geo-coordinates: one represents a fixed location, the other the locations of the documents in the collection (accessed via d.lat/d.lon).
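The distance function computes the great-circle (haversine) distance between the two coordinate pairs, in meters. A plain-Python sketch of that computation, using our headquarters and a nearby point as example inputs:

```python
import math

def distance(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in meters between two
    (latitude, longitude) pairs given in degrees."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Headquarters vs. a point one block away: a few tens of meters.
d = distance(50.9316394, 6.9398916, 50.931856, 6.9400906)
print(d)
```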
Now let us use the db._explain()
function to see what is going on in the database:
// explain query
arangod> db._explain(q)
Query string:
 FOR d IN places_to_eat SORT distance(50.9316394,6.9398916,d.lat,d.lon) ASC LIMIT 1 RETURN d

Execution plan:
 Id   NodeType        Est.   Comment
  1   SingletonNode      1   * ROOT
  7   IndexNode       4443     - FOR d IN places_to_eat   /* geo2 index scan */
  5   LimitNode          1       - LIMIT 0, 1
  6   ReturnNode         1       - RETURN d

Indexes used:
 By   Type   Collection      Unique   Sparse   Selectivity   Fields             Ranges
  7   geo2   places_to_eat   false    true     n/a           [ `lat`, `lon` ]   NEAR(d, 50.9316394, 6.9398916)

Optimization rules applied:
 Id   RuleName
  1   geo-index-optimizer
  2   remove-unnecessary-calculations-2
When the optimizer discovers the distance function, it replaces the enumerate collection node and the sort node with an index node. The index node provides the documents in sorted order, so we only need to inspect as many elements as required by the LIMIT statement. The second point will be elaborated on shortly. Let us inspect the query's result first:
arangod> db._query(q).toArray()
[
  {
    "_key" : "8995",
    "_id" : "places_to_eat/8995",
    "_rev" : "_VS53pTm---",
    "cuisine" : "argentinian",
    "amenity" : "restaurant",
    "lat" : 50.931856,
    "lon" : 6.9400906,
    "name" : "El Gaucho - Original argentinisches Restaurant & Steakhaus"
  }
]
There is El Gaucho, the restaurant with the best steaks in town, almost right below our office!
Now let's understand why we need to inspect fewer documents when using the geo index. Assume we want to query restaurants that are not too far away. So let's tweak our query to search within a certain area:
arangod> q2="FOR d IN places_to_eat FILTER distance(50.9316394,6.9398916,d.lat,d.lon) < 1000 RETURN d"
This query makes the advantage very obvious if you compare the optimized and non-optimized versions of the execution plan. So let us first take a look at an execution plan that does not utilise the geo index:
arangod> db._explain(q2, {}, { 'optimizer' : { 'rules' : ["-geo-index-optimizer"] }})
Query string:
 FOR d IN places_to_eat FILTER distance(50.9316394,6.9398916,d.lat,d.lon) < 1000 RETURN d

Execution plan:
 Id   NodeType                  Est.   Comment
  1   SingletonNode                1   * ROOT
  2   EnumerateCollectionNode   4443     - FOR d IN places_to_eat   /* full collection scan */
  3   CalculationNode           4443       - LET #1 = (DISTANCE(50.9316394, 6.9398916, d.`lat`, d.`lon`) < 1000)   /* simple expression */   /* collections used: d : places_to_eat */
  4   FilterNode                4443       - FILTER #1
  5   ReturnNode                4443       - RETURN d

Indexes used:
 none

Optimization rules applied:
 none
As you can see, we need to iterate over the full collection because we retrieve the documents in an arbitrary order. Now let us inspect the optimized plan:
arangod> db._explain(q2)
Query string:
 FOR d IN places_to_eat FILTER distance(50.9316394,6.9398916,d.lat,d.lon) < 1000 RETURN d

Execution plan:
 Id   NodeType     Est.   Comment
  1   SingletonNode   1   * ROOT
  6   IndexNode    4443     - FOR d IN places_to_eat   /* geo2 index scan */
  5   ReturnNode   4443       - RETURN d

Indexes used:
 By   Type   Collection      Unique   Sparse   Selectivity   Fields             Ranges
  6   geo2   places_to_eat   false    true     n/a           [ `lat`, `lon` ]   WITHIN(d, 50.9316394, 6.9398916, 1000, false)

Optimization rules applied:
 Id   RuleName
  1   geo-index-optimizer
  2   remove-unnecessary-calculations-2
The optimized plan does not iterate over the full collection, and it does not need a FilterNode. All the information is contained in the IndexNode, and only documents within the given radius are considered.
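The intuition can be captured with a toy model (not ArangoDB's actual geo index internals): the index hands back documents ordered by distance to the query point, so a radius query can stop at the first document that falls outside the radius instead of checking every document in the collection.

```python
# Hypothetical distances (in meters) of documents to the query point,
# as a geo index would deliver them: in ascending order.
docs_by_distance = [28.0, 310.5, 742.1, 1204.9, 5300.0]

def within(sorted_distances, radius):
    """Early-terminating scan over distance-sorted results."""
    hits = []
    inspected = 0
    for d in sorted_distances:
        inspected += 1
        if d >= radius:
            break  # everything after this point is farther away
        hits.append(d)
    return hits, inspected

hits, inspected = within(docs_by_distance, 1000)
print(hits, inspected)  # 3 hits found after inspecting only 4 of 5 documents
```

A full collection scan, by contrast, would have to evaluate the distance expression for all 5 documents before filtering.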
Conclusion
Using a geo index with the new distance function will shorten the execution time of queries. This is especially true for queries that utilise sort and filter conditions, as shown in this article. The improvement is achieved because the optimizer adjusts the plan so that:
- only relevant documents are inspected
- the number of nodes in the plan is reduced
We hope you enjoy the new functionality and give us some feedback so we can further improve your experience.