Time traveling with graph databases

July 3 2018,/General

Graph databases are often used to analyze relations within highly interconnected datasets. Social networks, recommendation engines, corporate hierarchies, fraud detection or querying a bill of materials are common use cases. But these datasets change over time and you as a developer or data scientist may want to time travel and analyze these changes.

While ArangoDB may not come with built-in support for managing the revision history of graph data, we’ll show in this article how to manage it in a performant manner for some general classes of graphs. Best of all, this won’t require any groundbreaking new ideas. We’ll simply borrow a few tools and tricks from the persistent data structure literature and adapt them for good performance within ArangoDB. We hope that this will help enable new ways to use everyone’s favorite avocado-fueled datastore, and power some useful applications. (Not familiar with a graph database? Take the free course)

New to multi-model and graphs? Check out our free ArangoDB Graph Course.

First, some motivation

Before we dive in, you might be asking yourself “why would I even want to preserve revision history for graph data?” That’s a fair question.

One common reason is that you want to explicitly manage some history for your data. Maybe you want to track document revisions like Wikipedia. Or you are storing some information about a network that changes and you want to seamlessly visualize how it evolves over time. Or you have a user-facing editor and you’d like to support undo/redo functionality. Or maybe you want to mimic file directory management with simple live rollback mechanisms.

Another common reason would be to provide immutability semantics for your data. This can in some cases provide a certain sort of referential integrity and simplify your application logic. (For an accessible take on this idea, check out ImmutableJS, or look into some of the arguments for functional programming.)

Interested in trying out ArangoDB? Fire up your cluster in just a few clicks with ArangoDB ArangoGraph: the Cloud Service for ArangoDB. Start your free 14-day trial here.

Let’s be upfront about the limitations

First of all, if you are familiar with persistent data structures, we will primarily be discussing mechanisms for partial persistence, rather than the more complicated full or confluent persistence. What this means is that we’ll be talking about applications where you can read any version of the data, but you can only make updates to the current version. To some extent, you can make modifications to an old version of the data, but it will require erasing the existing future from that point. (Think of it this way: you can hit “undo” a few times, and edit from that version, but once you make a change, you can no longer hit “redo.”)

Second, these techniques will not all work for arbitrary graph data. We will in some cases be limiting ourselves to directed acyclic graphs (DAGs, from here on out). In simple terms, all edges have a direction, and if there exists a path from a vertex u to a vertex v following edges only in their forward direction, then there cannot also be a forward path from v which circles around back to u. For more information on DAGs, the Wikipedia article on the subject has a more precise definition and a nice discussion about some of the properties and uses of DAGs. It is fine if your application contains some additional, non-DAG data within the graph, just so long as we are only concerned with storing revision data for the DAG portion. In this case, it will probably be useful to keep any edges which may introduce cycles in separate edge collections for which we explicitly do not store revision history.

Thirdly, in some cases, we will limit ourselves further to applications where we can decompose the DAG into logically grouped components which are themselves induced trees. Ignoring those mathy terms, this means that for each group of vertices that makes sense to lump together as a component in the application, each vertex has at most one incoming edge from another vertex in the group. Another colloquial way to put it would be that no vertex has multiple parents within its own group. In the figure below, we a DAG with three components, T₁, T₂, and T₃, each of which is an induced tree.

induced tree components

Figure 1 – Induced tree components

Finally, whenever you see ∞ (the infinity symbol), this is merely a placeholder for some arbitrarily large number, typically the largest integer which can be stored in whatever representation we are using. Inside an ArangoDB document attribute, for instance, this number would be 2⁵³-1. Typing that out everywhere would make the presentation a bit cluttered and tedious, so I’m opting to use ∞.

Now, with all that out of the way, let’s dive in!

A simple, powerful primitive for time-traveling

For a moment, let’s ignore how the data in the diagram below was generated, and simply examine the structure and how we might traverse it.

dag time slicing

Figure 2 – DAG time-slicing

As you may notice, each edge in our graph G is labeled with an integer range of the form (a, b). We use this to denote the time at which each edge was created and the time at which it expired. In our normal JSON format, this could be modeled as {created: a, expired: b}. This is all the data it takes to enable time-slicing. Given the point in time t from which we wish to query our graph G, we can easily filter to include only the edges which were valid at time t. Using the bind parameter @queryTime with value t in the query below, we should get all the valid paths.

FOR vertex, edge, path
  IN 1..10 OUTBOUND 'vertices/A' GRAPH 'G'
  FILTER path.edges[*].created ALL <= @queryTime AND path.edges[*].expired ALL > @queryTime
  RETURN path

Suppose we execute this query at time 0. We should be able to reach [A, B, C, D, E]. At time 3, we should be able to reach [A, C, D, E, F, G, H].

What do the above timestamps encode, in terms of graph revisions? Effectively, we created the graph at time 0 with vertices [A, B, C, D, E], and edges [(A, B), (A, C), (B, D), (C, D), (C, E)] each with timestamp range (0, ∞). Then, at time 1, we expired the edge (A, B) by changing the timestamp range to (0, 1), and inserted F with an edge (A, F) and timestamps (1, ∞). Then, at time 2 we inserted G and H, as well as edges [(A, G), (G, D), (G, E), (E, H)]) with timestamps (2, ∞). At time 3, we added the edge (D, H) with timestamps (3, ∞). At time 4 we recreated the edge (A, B) with timestamps (4, ∞). At time 7 we expired the edge (D, H) by changing the timestamps to (3, 7), and finally, at time 8 we expired the edge (E, H) by changing the timestamps to (2, 8).

Examining the sequence of operations, you may notice that we initially use a placeholder of ∞ for an expired timestamp, which eventually gets overwritten to “delete” an object. Other than that, we never change data once it is written. If we want to change the attributes of an edge, we can expire the original and create a new copy with the desired changes and appropriate timestamps. This behavior, often called copy-on-write, is a powerful technique that we will explore more fully in the next sections.

To summarize, we’ve used simple timestamp ranges to allow time-slicing in queries, which gives a surprisingly convincing impression of time traveling. We can query our data as it is now. We can query it as if it were yesterday at 12:01 PM. We can query it as of exactly the time when a particular change was made.

Isn’t that neat?

Let’s track some simple changes

All right. We figured out the basics of graphy time travel above, and we got a tiny introduction to copy-on-write behavior. We’re going to explore the latter more fully.

Internally, at a very low level, ArangoDB already uses copy-on-write semantics. Whenever you update or replace a document, you get back a copy with a new _rev value. We have created an entirely separate copy of your document on-disk, and the old copy will eventually be garbage collected. This is common practice in database software, as it allows us to keep the old copy around as long as there are any ongoing transactions which might need it. What we are going to do next is mimic some of this behavior in user-space.

From here on out, we will assume that not just edges, but also vertices will contain timestamps in the form {created: a, expired: b}. We saw in the previous section that it’s simple to add an edge or vertex to our persistent structure by creating a new edge or vertex respectively in ArangoDB and setting its
created attribute to the current time and expired to a large placeholder value. Similarly, it is simple to “delete” it at time t by changing expired from the placeholder value to t. But what if we want to change the attributes of an edge or vertex?

If you guessed “copy-on-write,” then you’re following along! If you are familiar with persistent data structures, you may have heard of fat vertices or edges. For typical in-RAM data structures, it’s often possible to simply store all versions of a vertex’s or edge’s data directly in the vertex or edge respectively. Since ArangoDB uses copy-on-write internally, it will be more efficient to take a slightly different approach.

To modify the attributes of an edge, we simply expire the existing edge and create a new edge with the same source and target vertices, the desired attributes, and the appropriate timestamps. This takes a constant number of primitive graph operations, so it is quite efficient. See the diagram below for an example.

fat edge

Figure 3 – Fat edges

To modify the attributes of a vertex, we need to do a little more work. First, as in the case of an edge modification, we will expire the old vertex and create a new vertex with the desired attributes and appropriate timestamps. Next, we must find all currently valid edges incident to the old vertex. We must now expire them, and create copies which point to or from the new vertex as appropriate. See the diagram below for an example.

fat vertex

Figure 4 – Fat vertices

Note that in order to keep multiple versions of the same edge or vertex associated with each other, you’ll have to include some persistent identifier attribute, e.g. id. When you create a copy of the entity, simply keep this attribute the same and let ArangoDB auto-generate new _key and _rev values. For fast access, you’ll want to create an index over your persistent identifier field.

How bad is all this copying?

For the fat edges, the copying isn’t really bad at all. This is a single edge write, and ArangoDB would be copying the full edge document internally anyway. So we pay no penalty in time. We do, however, explicitly create a second copy which cannot be gargage-collected, thus taking up extra space.

The same applies for the fat vertices. The vertex copy is effectively free, though it does take extra space in the long run. What we have to be careful about with the vertices is that copying a vertex also seems to require copying all live edges. That means we are no longer limited to a constant number of graph operations! What about “supernodes” with thousands of edges? Uh-oh!

What now?

Logical abstractions and proxy vertices to the rescue

We go back to our inspiration, fat vertices, and realize that we can create a simple abstraction. Remember, for in-RAM persistent data structures, it is possible to play some tricks and store all versions of the vertex in one place. We can mimic this by wrapping up all versions of our vertex, along with a few additional vertices and edges for bookkeeping, inside a single logical unit called a logical vertex.

Each real ArangoDB vertex inside the logical grouping will have the same persistent ID. We will have one special in-proxy vertex, and one special out-proxy vertex. All incoming edges destined for the logical vertex will go to the in-proxy. All outgoing edges originating from the logical vertex will originate from the out-proxy. Internally, each version of vertex document data will be stored in a real vertex as before, but we will no longer have many edges connecting each one to other parts of the graph. Instead, each data vertex will have a single incoming edge connecting it to the in-proxy, and one outgoing edge connecting it to the out-proxy. These two edges will be marked with timestamps identical to those on the data vertex they connect to the proxies. See the figure below for a transformed version of the previous example diagram.

logical proxy transformation

Figure 5 – Logical proxy transformation

In this version, making a change to the vertex data becomes simpler. Instead of dealing with all the incident edges, we can just focus on the data! To modify the vertex document data, simply expire the old version’s vertex and incident proxy-connecting edges, and create a new copy and proxy-connecting edges with the appropriate timestamps and desired data. Now we have it down to a constant number of graph operations and constant additional space per modification, just like for edges! See the figure below for a simple example of how to modify vertex data using the new logical vertex method.

logical vertex modification

Figure 6 – Logical vertex modification

Simple history: accomplished!

Putting together the fat vertex and edge techniques along with the logical vertex abstraction and proxy vertex trick, we now have simple revision history tracking for graphs with constant time and space overhead for each modification. Mission: accomplished!

“But, wait,” you say, “didn’t you say something about DAGs and induced trees?” You’re right, I did. The techniques we’ve discussed so far work for (at least most) general graphs. But the keyword in this section heading is “simple.” These techniques allow us to perform a typical “save” operation, making the changes “in-place.” What if we want instead to do something like “save as”, making the changes in a new copy or branch?

An obvious approach would be to simply create a copy of the entire graph structure and relabel it in some appropriate way to distinguish from the original. Hopefully, this sounds like a bad idea—that’s a lot of copying!

This is where we start to make some trade-offs. By introducing limitations on the graph structure, we can reduce the necessary copying and keep such logical copying and branching operations performant just using some additional localized, copy-on-write-based techniques.

Handling branching changes

Keep in mind now that our goal is to create a new branch such that any future changes in the new branch are hidden from traversals of the existing structure. Our primary traversal technique which will enable this is still the same basic time-slicing primitive that we introduced at the beginning of this article. In order to hide these changes, we need to ensure that we cannot traverse an edge which is valid in the old branch and end up at a vertex generated by changes in the new branch. To do this, we will utilize the path-copying technique to create distinctly new copies of the affected vertices. Effectively, we will be lazily creating a deep copy of the portions of the graph that change but avoiding copying anything that does not change.

If we allow arbitrary structure in the graph and traversals, then it becomes unclear what portions of the graph actually logically change and thus need to be copied when a modification is made. Let’s start by examining a simple example of how path-copying works in a basic tree structure. In the diagram below, vertex superscript denotes that the vertex belongs to a separate branch, e.g. A² belongs to branch 2.

Figure 7 – Basic path copying

Suppose we have the initial tree on the left, and we have created a new branch. Instead of copying the whole tree, we will just use path-copying when necessary. If we want to make a change to vertex E after the branch, we will create a new copy E², as well as new copies of any ancestors. In this case, we must create new copies of C and A, labeled C² and A² respectively. We create identical copies of these nodes, including the timestamps, as they remain logically unchanged. As for the edges, we copy any path-internal edges leading to non-modified vertices so that they point to the new copies. We copy any edges leaving the path such that they point to the original vertices. Finally, for the modified vertex E, we create a copy of the edge from it’s path parent to the original vertex and expire it. Then we create a new edge to the copy E² that is created at the present time. Note that we also copy over any edges to descendants without modification.

What we have now is the original, unmodified tree rooted at A which can still be read in the original branch, as well as a new version of the tree in the new branch rooted at A². The key insight here is that since the edges in the tree all go from parent to child, and there are no edges that go back up, we can implement a modification by only changing the ancestors in the tree. Or put another way, a traverser positioned at D cannot see C or C², and therefore does not need to care about the changes. This allows us to avoid copying any vertices other than the set of direct ancestors of the vertices we wish to modify.

This is precisely why we listed the limitation that our graphs must be DAGs. If the graph has cycles, it becomes somewhat unclear how to select the set of vertices which must change. One way to define the set of ancestors is via a query like the following:

FOR vertex
  IN 1..∞ INBOUND 'vertices/C' GRAPH 'G'
  OPTIONS { bfs: true, uniqueVertices: global}
  RETURN vertex

With cycles, this becomes messy, and will probably require application-specific logic to fix. If you consider this a bit more closely, you may realize now why we want to require not only that the whole graph be a DAG, but that each logical component of the graph be an induced tree. In an application, it will likely be the case that we may have many different logical components for which we want to branch separately, but they may share common dependencies, each with their own individual branching history. If each vertex can have arbitrarily many parents and we must branch the graph as a whole, then the set of ancestors we must consider can have arbitrary fan-out, and grow very large even with small depth.

By contrast, suppose we only allow branching at the component level, that each component is an induced tree, and that we only allow a component to link to another component via target’s root vertex (not arbitrary internal vertices). Then each vertex has exactly one parent of interest within a given branch at a given point in time and the number of vertices we must copy is proportional to our modified vertex’s depth. (We may need to copy arbitrarily many edges, but in most cases, the cost will be somewhat proportional to the number of nodes, and therefore not too bad.)

Note that this does not restrict the application from “editing across component boundaries.” If components A and B share a common dependency on component C, and we wish to make a change to C which updates B, but not A, we can: first create a new branch for C which contains the desired changes. Then link to the new version from B (either in a new branch of B or in the existing branch using an in-place update depending on the application policy).

The Graph course will take you all the way from zero knowledge about ArangoDB to advanced graph query techniques in AQL – the ArangoDB query language.
Take the course

All together now…

Now, finally, let’s walk through a full-fledged example, and see how component-based branching via path-copying fits together with our logical vertex transformation and proxy vertex optimization for in-place modifications.

Figure 8 – All together: Original

In the diagram above, we have three components T₁ (comprised of [A, B, C, D, E]), T₂ (comprised of [X, Y]), and T₃ (comprised of [Q, R, S]). Initially, each starts with a single, original branch, denoted b₁. To refer to the root of branch j of component i, we maintain a simple version tag document (T_i, b_j) outside the main graph structure.

Figure 9 – All together: After branching T₁

The first change we make is to create a new branch (T₁, b₂) and modify node E. We need only copy the path to the root node A, so we make copies E², C², and A². The new root of the tree in (T₁, b₂) is A². Note that when we copy a logical vertex for modification, we copy the vertex’s entire internal revision history. (Some optimizations could be made to reduce this copying; however, this makes the traversal more complicated, so we will not delve into it in this article.)

Figure 10 – All together: After branching T₃

The next change we make is to update S, but only with respect to B‘s view of T₃. To do this, we will create a new branch (T₃, b₂). We will copy the path from S to its component root, Q, introducing new nodes S² and Q². Now, since T₁ is already branched, we probably don’t want to make the update to B in-place. Instead, we will create a copy B¹
, as well as a copy of A called A¹, and update our root label to point (T₁, b₁) to point at A¹. Note that A and B are no longer live in any branch, and are now expired along with their incident edges.

Figure 11 – All together: after updating T₂

Finally, we update Y in T₂ to use the new branch (T₃, b₂) for its T₃ dependency. Since T₂ is not branched, we can safely make this modification in-place using our previously established methods.

The cost of path-copying is proportional to the length of the path from the component root to the vertex being modified and the number of outgoing edges along this path. In bad cases this could be significant, but for many applications this is probably just fine. It is certainly much better than the naive approach of simply copying an entire component up-front in order to perform a branch operation.

In conclusion…

This article has already become quite long. Let’s get this wrapped up, shall we?

Ripe for tailoring and tweaking

The techniques discussed in this article have been quite general. If you have an application which would benefit from tracking graph revision history, I hope this has served as a helpful starting place. You can certainly tailor these techniques to your own application, evaluating some of the trade-offs and tweaking until you settle on the version that makes the most sense for your workloads.

A perfect fit for Foxx

This also seems as good a time as any to mention our Foxx microservice framework. Rather than exposing all the nitty-gritty details of proxy nodes and time-slicing, you could easily wrap your graph in a simple Foxx API that handles the transformations for you. Your service need only expose the persistent identifier for each graph entity, and can take an optional timestamp to read historical data (using the current time if none is specified).

Talk to us

We in the ArangoDB team love to hear about new and fun things that people are doing with our database, so tell us if you’re using revision history for something cool! If you’re having trouble figuring out how some of this works, drop by our community slack. And, of course, if you need some in-depth help with data modeling, feel free to inquire about our consulting services.

Credit where credit is due

While I have already linked in a few places above to a Wikipedia article on persistent data structures, the relevant information from that article mostly derives from a 1986 paper by Driscoll, Sarnak, Sleator, and Tarjan titled “Making Data Structures Persistent“. That paper introduced the fat vertex and path-copying techniques for in-RAM structures that were discussed and adapted in this article. At the time of publication, the ideas were quite novel and, due to their power and elegance, have since become fairly standard techniques in data structure design. I also feel I should disclose that one of the authors, Bob Tarjan, was my Ph.D. advisor, so I may be a bit biased to favor this work over potential alternative approaches.

Download Graph Course

Jan Steemann

After more than 30 years of playing around with 8 bit computers, assembler and scripting languages, Jan decided to move on to work in database engineering. Jan is now a senior C/C++ developer with the ArangoDB core team, being there from version 0.1. He is mostly working on performance optimization, storage engines and the querying functionality. He also wrote most of AQL (ArangoDB’s query language).

July 3 2018,Jan Steemann

1 Comments

Aditya Mukhopadhyay on February 3 2020, at 1:02 pm

I have implemented a fully functional historical graph database on top of ArangoDB adapting some of the concepts mentioned here, but otherwise very different in its overall design and interface. Maybe some readers could find it useful:

https://github.com/CivicGraph/CivicGraph

Reply

Fireside Chat – Powering GenAI: The Critical Foundations for Scale. Watch Now