Pattern Matching
This is an add-on to our Graph Course and uses the same example dataset of airports and flights. We will construct a complex pattern matching query step by step to find the best flights between two airports with the lowest total travel time. For instructions how to import and work with the dataset, as well as an introduction to graphs and graph queries please check out the course!
Why not a Shortest Path Query?
In a shortest path query, it is not possible to apply filters because of the algorithm used under the hood to efficiently determine one shortest path between two vertices. It can optionally take a weight attribute into account to choose one shortest path over the other, but that’s about it. However, you can use a shortest path query to determine the minimal length for such a path.
RETURN LENGTH( FOR v IN OUTBOUND SHORTEST_PATH 'airports/BIS' TO 'airports/JFK' flights RETURN v )
This length minus 1 can be used as traversal depth in a separate query. If the result of above query is 3, then use a depth of 2. You start the traversal from the same start vertex and add a filter to match the same end vertex (also see Multiple Path Search). You may add additional filters to refine the search. If you traverse a graph to find paths fulfilling complex conditions then we call this kind of search pattern matching.
Excursion – Storing Results in Variables
Results of simple expressions as well as of entire subqueries can be stored in variables. To declare a variable, use the LET keyword followed by the variable name, an equal sign and the expression. The code must be in parentheses if the expression is a subquery.
LET h = FLOOR(f.DepTime / 100) LET m = f.DepTime % 100 LET formatted = CONCAT(h, ':', m) RETURN { hours: h, minutes: m, formatted }
The time format is like 1245. Hours and minutes can be separated with some math: divide by 100 and round off the decimal places for the hours part and modulo 100 for the division remainder to get the minutes part.
The calculated values are stored in two different variables h and m. These variables are then used to create another variable formatted like „12:45“ and all three are returned as result.
RETURN { myVariable }
is a short form of RETURN { myVariable: myVariable }
if you wondered.
In below example, the hours and minutes of the departure time are pre-calculated and stored in the variables h and m. They are used further down in the RETURN statement to create an ISO timestamp (DATE_ISO8601() documentation) together with the date attributes of the data – ignoring time zones and daylight saving time; we will later use the included UTC timestamps for time calculations:
FOR f IN flights FILTER f._from == 'airports/BIS' LIMIT 100 LET h = FLOOR(f.DepTime / 100) LET m = f.DepTime % 100 RETURN { year: f.Year, month: f.Month, day: f.Day, time: f.DepTime, iso: DATE_ISO8601(f.Year, f.Month, f.Day, h, m) }
Your results in Table view should look like this:
Finding the Best Flight Step by Step
The real-world question we try to answer is:
What are the best connections between the airports Bismarck Municipal (BIS) and John F. Kennedy International (JFK) determined by the lowest total travel time?
Answering this question is a bit more complex so let’s go through it step-by-step and extend the query each time.
Step 1 – Flights from BIS to JFK
We just want to get from BIS to JFK – we don’t care about time, day or month yet.
Note: We already know that 2 steps is the shortest path, that’s why IN 2 OUTBOUND
is used as minimal and maximal traversal depth.
FOR v, e, p IN 2 OUTBOUND 'airports/BIS' flights FILTER v._id == 'airports/JFK' LIMIT 5 RETURN p
The result shows a graph but when switching to JSON view you will see 5 different flight plans from BIS → JFK
Step 2 – Only Flights on January First
We make sure that we fly on the same date on each segment, here on New Year (Month = 1, Day = 1).
FOR v, e, p IN 2 OUTBOUND 'airports/BIS' flights FILTER v._id == 'airports/JFK' FILTER p.edges[*].Month ALL == 1 FILTER p.edges[*].Day ALL == 1 LIMIT 5 RETURN p
The result now shows two possibilities. In this case we could also fly via MSP (Minneapolis).
By the way: ALL ==
is an array comparison operator.
Step 3 – Calculate the Flight Time
We make sure that we pick those flights which have the lowest total flight time. For that, we calculate the difference between the departure time at start airport and the arrival time at the target airport in minutes with the DATE_DIFF() function. We use the result to sort in ascending order. Finally, we return the top 5 flights as well as the flight time.
FOR v, e, p IN 2 OUTBOUND 'airports/BIS' flights FILTER v._id == 'airports/JFK' FILTER p.edges[*].Month ALL == 1 FILTER p.edges[*].Day ALL == 1 LET flightTime = DATE_DIFF(p.edges[0].DepTimeUTC, p.edges[1].ArrTimeUTC, 'i') SORT flightTime ASC LIMIT 5 RETURN { flight: p, time: flightTime }
Have a look at the results. You will see that some flight times are negative.
We need to make sure that departure and arrival time don’t overlap.
Step 4 – No Overlapping Flights
Let’s put in the final part of the query to get the best flights to JFK. In this case we assume that we need 20 minutes to get our connecting flight to JFK. By adding another filter that ensures that the next plane does not take off before we landed with the previous one, plus time for the transit, we won’t see negative flight times anymore and get viable flight connections:
FOR v, e, p IN 2 OUTBOUND 'airports/BIS' flights FILTER v._id == 'airports/JFK' FILTER p.edges[*].Month ALL == 1 FILTER p.edges[*].Day ALL == 1 FILTER DATE_ADD(p.edges[0].ArrTimeUTC, 20, 'minutes') < p.edges[1].DepTimeUTC LET flightTime = DATE_DIFF(p.edges[0].DepTimeUTC, p.edges[1].ArrTimeUTC, 'i') SORT flightTime ASC LIMIT 5 RETURN { flight: p, time: flightTime }
Have a look at the results and you will see, that our best flight would not be via Denver (DEN) – which a shortest path query may have suggested – but via Minneapolis (MSP).
Optimization with Vertex-Centric Indexes
There are quite a lot of edges the traverser has to inspect and follow if we run our query against the example data. We can improve the query performance significantly, if we add a vertex-centric index to traverse edges of the relevant day only:
- Click on COLLECTIONS in the ArangoDB WebUI
- Open the flights collection
- Click on the Indexes tab
- Click the green button in the action column to add a new index
- Set Type to Hash Index
- Type
_from,Month,Day
into Fields - Leave the index options Unique and Sparse unticked
- Click the green Create button
When the index is ready, go back to the query editor and re-run the final version of our query. You should see a faster query execution time. Click on Explain and look below the Execution plan to see that the new index is being used:
Indexes used: By Type Collection Unique Sparse Selectivity Fields Ranges 2 hash flights false false 1.47 % [ `_from`, `Month`, `Day` ] base OUTBOUND
Without the vertex-centric index, all outgoing edges of the departure airport need to be followed using the standard edge index, then checked if they meet our conditions (on a certain day, arriving at desired destination, with a viable transit).
The extra index allows the traverser to quickly lookup the outgoing edges of the departure airport (_from
attribute) for a certain day (Month
, Day
attributes), which eliminates the need to fetch and filter out all edges with flights on different days. It reduces the number of edges that need checking with a cheap index lookup and saves quite some time.
See our documentation about vertex-centric indexes for more details.
Conclusion
We constructed a complex pattern matching query to find the best flights between two particular airports. From here on, you can experiment further with the query on your own. For example, try other airports or conditions, possibly using some of the other attributes included in the dataset. All attributes are described in the Graph Course.
Can you come up with cool pattern matching queries that you want to share? Or do you still have questions? Let us know, we are happy to help!