AWS Neptune: A New Vertex in the Graph World — But Where’s the Edge?
At AWS re:Invent just a few days ago, Andy Jassy, the CEO of AWS, unveiled the company’s newest database offering: AWS Neptune. It’s a fully managed graph database capable of storing RDF and property graphs. It gives developers access to data via SPARQL or the Java-based TinkerPop Gremlin. As versatile and as good as this may sound, one has to wonder whether another graph database will solve a key problem in modern application development and give Amazon an edge over its competition.
Neptune Confirms Value of Graph Databases
AWS entering the graph sphere is basically a good thing. It validates that graph databases are useful. They continue to gain more and more traction in many industries. Analyzing relationships and finding patterns in highly connected data is becoming more important — and fun.
Furthermore, it puts more pressure on graph database vendors to devise fully managed services and stay ahead regarding features. Otherwise, they will become irrelevant. Overall, we can expect even better products in the graph space, which is a very good thing.
Before saying more about Neptune and how good it may or may not be, let’s take a moment to consider graph databases and related queries. To appreciate Neptune, you need to understand the advantages and power of graphs, and what makes them attractive.
The Attraction of Graphs
Graph queries tend to be less complex compared to nested JOIN operations in relational databases. As a result, they’re easier to write, understand and maintain. Below is a one-to-one example in which a client of ArangoDB made a direct comparison of a graph traversal in ArangoDB Query Language (AQL) to a highly optimized stored procedure in SQL:
Example 1, ArangoDB Graph Query (Traversal)
FOR v, e, p IN 1..50 INBOUND 'pmconfig/176489032' pm_content RETURN p.vertices
The `FOR` statement above initializes the query with three variables: `v` (the current vertex), `e` (the edge leading to it), and `p` (the path traversed so far). The `IN 1..50 INBOUND` clause defines the minimum and maximum depth of the search (here 1 to 50) and its direction (here `INBOUND`, but `OUTBOUND` or `ANY` are also possible). The traversal starts at the vertex `pmconfig/176489032` and follows edges in the `pm_content` collection.
This example is fairly succinct. If you’re experienced with relational databases and are not familiar with ArangoDB or similar databases, it may seem confusing. Before you draw that conclusion, look at the example below. It contains the same query, written as a highly optimized stored procedure with nested joins in SQL, to be executed on a relational database.
Example 2, Stored Procedure Containing Same Query
BEGIN
  DECLARE count INT;
  DECLARE newCount INT;
  DECLARE depth INT;
  SELECT COUNT(*) INTO newCount FROM pmWork.tmpConfs;
  CREATE TEMPORARY TABLE pmWork.tmpLoopConfs(
    cid INT UNIQUE KEY NOT NULL, parent INT NOT NULL, level INT NOT NULL
  ) ENGINE=INNODB;
  INSERT pmWork.tmpLoopConfs SELECT * FROM pmWork.tmpConfs;
  SET count=0;
  SET depth=0;
  WHILE (newCount!=count) DO
    SET count=newCount;
    SET depth=depth+1;
    CREATE TEMPORARY TABLE pmWork.tmpNewConfs(
      cid INT UNIQUE KEY NOT NULL, parent INT NOT NULL, level INT NOT NULL
    ) ENGINE=INNODB;
    INSERT IGNORE pmWork.tmpNewConfs
      SELECT pmConfProperty.refId, confId, depth
      FROM pmConfigs, pmConfProperty, pmWork.tmpLoopConfs
      WHERE confId=pmConfigs.id
        && (ptype='foreign' || type IN('composite','privateComp'))
        && confId=cid;
    INSERT IGNORE pmWork.tmpConfs SELECT * FROM pmWork.tmpNewConfs;
    DROP TABLE pmWork.tmpLoopConfs;
    ALTER TABLE pmWork.tmpNewConfs RENAME pmWork.tmpLoopConfs;
    SELECT COUNT(*) INTO newCount FROM pmWork.tmpConfs;
  END WHILE;
  DROP TABLE pmWork.tmpLoopConfs;
END
As you can see, this is much more complicated and much lengthier. The SQL version is so elaborate because it has to emulate a traversal of unknown depth: it loops, joining one more level into temporary tables on each pass, until no new rows are found. Graph queries are simpler for developers to construct and, in many cases, much more powerful and versatile. They are especially important for gaining insights into highly connected data (recommendation engines, fraud detection, hierarchy management and more). Still, what can Neptune offer that other database systems cannot?
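For readers more at home in general-purpose code, the traversal that both examples express can be sketched in a few lines of Python. This is only an illustration of the idea, not how ArangoDB executes AQL: the function name `inbound_paths`, the sample edge list, and the rule that an edge is never reused within one path are assumptions made for the sketch.

```python
from collections import defaultdict


def inbound_paths(edges, start, min_depth=1, max_depth=50):
    """Yield vertex paths found by following edges backwards from `start`.

    `edges` is a list of (source, target) pairs. Illustrative sketch only.
    """
    # Index edges by their target so each inbound step is one dict look-up.
    inbound = defaultdict(list)
    for source, target in edges:
        inbound[target].append(source)

    def walk(path, used_edges):
        depth = len(path) - 1
        if depth >= min_depth:
            yield list(path)  # copy, because `path` is mutated below
        if depth == max_depth:
            return
        current = path[-1]
        for source in inbound[current]:
            edge = (source, current)
            if edge in used_edges:  # assumption: no edge twice per path
                continue
            path.append(source)
            used_edges.add(edge)
            yield from walk(path, used_edges)
            used_edges.remove(edge)
            path.pop()

    yield from walk([start], set())


# Hypothetical sample data: b points at a, c and d point at b, e points at d.
edges = [("b", "a"), ("c", "b"), ("d", "b"), ("e", "d")]
paths = list(inbound_paths(edges, "a"))
```

The depth range is just two parameters here, which is the part the stored procedure has to emulate with a loop and temporary tables.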
Is Neptune Needed?
There are already many other databases which provide reliable, performant and scalable graph solutions. ArangoDB leads the native multi-model sphere by providing multiple data models, including graphs, in one database core and one query language. For AWS, creating and offering Neptune seems, at a minimum, to satisfy a need to say that they also have a graph database.
During his AWS re:Invent keynote speech, Jassy said that Neptune will fill a gap in how developers want to access data. “The landscape of how people use databases today is really different than what’s been the case over the last number of years,” he said. “You don’t use relational databases for every application. That ship has sailed.” He also added “Modern companies who use modern technology are not only going to use multiple types of databases in all their applications, but many are going to use multiple types of databases in a single application.”
Again, the gap is already being filled by others like ArangoDB, but we appreciate that AWS wants to participate. The last prediction, about multiple databases, may not always be correct. True, developers need and want to use different access patterns for their data. This doesn’t necessarily mean different databases. Another single model database like Neptune doesn’t address the need to use different data models in an application or even an entire company. With native multi-model, you only have to use one technology; if you use Neptune, you end up with multiple databases.
Jassy continued by saying, “So what people really want is … a fully managed graph database that’s fast, that’s reliable, that’s scalable, that doesn’t force them into one-size-fits-all choices.” While we don’t disagree with the fully managed, fast, reliable and scalable part of this statement, Neptune will force developers into a one-size-fits-all approach. Or it will at least push them into a world in which consistency over different storage technologies becomes expensive.
The best access strategy to data depends on your needs within your application. While graphs are great, they are not always the most efficient access pattern. The same is true for JOINs in relational databases—and in document-oriented solutions. A native combination of these models, if efficiently combined in the same database core and query language, might be a better choice for investing in new storage technologies.
If you can predict the depth of a search, then a JOIN operation is likely to be faster than a graph traversal. But if you don’t know the depth of a search or want to search in a depth range (as in Example 1 above), a graph traversal will very likely be more efficient. This is because each step of a traversal retrieves the next part of the path through a hash index look-up, which has constant latency.
Therefore, single model databases like Neptune have a downside. They serve only part of an application’s requirements efficiently and the rest less so. A native multi-model database, however, can avoid this downside and is therefore a real innovation over yet another single model database.
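The point about hash index look-ups can be illustrated with a small Python sketch. It is a toy model, not a benchmark: the function name `traverse_counting_lookups` and the sample data are assumptions, and a Python dict stands in for a hash index. The sketch counts one look-up per visited vertex and shows that the count depends on the part of the graph that is reached, not on how many unrelated edges are stored.

```python
from collections import defaultdict


def traverse_counting_lookups(edges, start, min_depth, max_depth):
    """Breadth-first traversal over a depth range, counting index look-ups."""
    index = defaultdict(list)  # stand-in for a hash index: vertex -> neighbours
    for source, target in edges:
        index[source].append(target)

    lookups = 0
    found = []
    frontier = [start]
    seen = {start}
    for depth in range(1, max_depth + 1):
        next_frontier = []
        for vertex in frontier:
            lookups += 1  # one constant-time look-up per visited vertex
            for neighbour in index.get(vertex, []):
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        if depth >= min_depth:
            found.extend(next_frontier)
        if not next_frontier:  # nothing left to expand, stop early
            break
        frontier = next_frontier
    return found, lookups


# Hypothetical data: a short chain plus many edges unrelated to the query.
chain = [("a", "b"), ("b", "c"), ("c", "d")]
noise = [(f"x{i}", f"y{i}") for i in range(1000)]

small = traverse_counting_lookups(chain, "a", 2, 10)
large = traverse_counting_lookups(chain + noise, "a", 2, 10)
```

Both calls visit the same vertices with the same number of look-ups, even though the second graph stores a thousand extra edges.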
Next Single Model vs. Native Multi-Model
We at ArangoDB feel that another single model database won’t help developers to solve challenges in modern application development, or keep up with shortening life-cycles of their applications. What developers want today is a performant and flexible storage solution that removes complexities and lets them focus instead on creating the business logic.
The need for multiple access patterns on data might mean that as much as a third of an application would benefit from graphs, or that the application will benefit by the same amount using JOINs, depending on the situation. Regardless, would that need and those benefits justify the risk and complexity of integrating an additional technology? Alternatively, you could just use a native multi-model database and rewrite the queries to leverage the benefits of graph traversals and joins. That’s much simpler and easier, and it is achievable with ArangoDB today rather than with something new that’s still in its early releases.
There is a core problem that comes with the need for multiple data models and related access patterns in applications: without one technology serving more than one data model, developers have to learn and master multiple technologies (e.g., setup, configuration, maintenance). They then also have to learn the different query languages.
If Jassy and all of the multi-model database vendors are right that modern application development needs multiple data models, how do you guarantee data consistency among different storage technologies? Even a fully managed service doesn’t help to achieve this. Instead, you may have to solve it in the application layer — if this is even possible in a performant manner. An ACID native multi-model database definitely has an edge here.
Calling it Scalable Doesn’t Necessarily Mean it Scales
Randall Hunt, Senior Technical Evangelist at AWS, wrote in a blog post that Neptune stores the graph attributes as key/value pairs. While this might be a necessary step toward easier scalability for Neptune, it raises a performance question.
When traversing a graph or searching for patterns in graph data, you often want to apply filter conditions on attributes of vertices and edges. Storing a graph as key/value pairs can lead to unneeded overhead: you first have to retrieve the key/value pairs containing the attributes of a vertex or edge. You then have to check whether the filter applies. If you store the vertices and edges as JSON documents and leverage indexes on attributes, look-ups can be significantly faster.
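The difference between the two layouts can be sketched in Python. This is a simplified model and not a description of how Neptune or ArangoDB store data internally: the names `KV_STORE`, `REGION_INDEX`, `filter_key_value` and `filter_indexed`, as well as the sample vertices, are hypothetical.

```python
from collections import defaultdict

# Hypothetical vertex documents used only for this sketch.
VERTICES = {
    "v1": {"region": "eu", "kind": "customer"},
    "v2": {"region": "us", "kind": "customer"},
    "v3": {"region": "eu", "kind": "supplier"},
}

# Layout A: every attribute is stored as its own key/value pair.
KV_STORE = {
    (vertex_id, attribute): value
    for vertex_id, document in VERTICES.items()
    for attribute, value in document.items()
}


def filter_key_value(candidates, attribute, wanted):
    """Fetch the attribute pair for each candidate, then test the filter."""
    fetches = 0
    matches = []
    for vertex_id in candidates:
        fetches += 1  # one retrieval per candidate before the filter can run
        if KV_STORE.get((vertex_id, attribute)) == wanted:
            matches.append(vertex_id)
    return matches, fetches


# Layout B: whole documents plus a secondary index on one attribute.
REGION_INDEX = defaultdict(set)
for vertex_id, document in VERTICES.items():
    REGION_INDEX[document["region"]].add(vertex_id)


def filter_indexed(candidates, wanted):
    """A single index look-up yields every vertex that satisfies the filter."""
    matching = REGION_INDEX.get(wanted, set())
    return [vertex_id for vertex_id in candidates if vertex_id in matching]


kv_result = filter_key_value(["v1", "v2", "v3"], "region", "eu")
index_result = filter_indexed(["v1", "v2", "v3"], "eu")
```

Both functions return the same vertices; the first needs one retrieval per candidate, while the second consults the index once and then only checks membership.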
Many developers want the option to scale-out with their database to a cluster. Some want this for high availability, and some because they need the computing power and additional storage. Jassy said several times in his presentation that it’s super easy to scale on AWS with Neptune. He said that moving from one to many more instances is done with a click and it’s something developers want.
Yes, many projects need a scalable solution, but scaling with highly connected data and performing complex queries is definitely a non-trivial task for a database, especially if you want to stay performant.
The diagram below shows a fairly randomly sharded graph. When a graph is sharded across a cluster, the edges and vertices needed during a traversal do not necessarily reside on the same machine. The orange and red line shows a potential graph traversal. The red parts of that line mark network hops.
The problem visualized in the diagram above is that the traversal includes six network hops, each of which adds network latency. Compared to in-memory computations, even a good network is still as much as three hundred times slower. It’s more than questionable whether this kind of sharding and traversing will provide acceptable performance for most use cases.
We didn’t find anything in the current resources provided by AWS about how Neptune will solve this graph-inherent problem. We will probably have to wait for benchmarks involving very large sharded graphs and deep graph traversals, because these are the hard problems for a distributed graph database.
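How network hops arise can be sketched in Python. The sketch assumes a simple hash-based placement of vertices on shards; the helper names `hash_shard` and `count_network_hops`, the path, and the placements are hypothetical and do not describe any particular product.

```python
import zlib


def hash_shard(vertex_id, shard_count):
    """Deterministic stand-in for hash-based (effectively random) placement."""
    return zlib.crc32(vertex_id.encode("utf-8")) % shard_count


def count_network_hops(path, placement):
    """Count the steps of a path whose two endpoints live on different shards."""
    return sum(
        1
        for here, there in zip(path, path[1:])
        if placement[here] != placement[there]
    )


# Hypothetical traversal path through four vertices.
path = ["a", "b", "c", "d"]

# A hand-written placement, so the result is easy to check by eye:
# a -> b crosses shards, b -> c stays local, c -> d crosses shards.
fixed_placement = {"a": 0, "b": 1, "c": 1, "d": 2}
hops_fixed = count_network_hops(path, fixed_placement)

# The same path with vertices spread over three shards by hash.
hashed_placement = {vertex: hash_shard(vertex, 3) for vertex in path}
hops_hashed = count_network_hops(path, hashed_placement)
```

With the hand-written placement, two of the three steps cross a shard boundary. With hash placement the number of hops depends on where the hash happens to put each vertex, which is exactly why it is hard to keep a deep traversal local.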
As you might have guessed, the ArangoDB team already provides a feature which enables performance close to a single instance for sharded graphs. The SmartGraphs feature enables sharding by a definable shard_key (e.g., region, customer) and thereby directs queries to the correct shard to be processed locally on the database servers.
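The general idea of sharding by a key can be sketched in Python. This shows only the principle, under stated assumptions, and is not ArangoDB’s SmartGraphs implementation: the helpers `shard_for` and `network_hops`, the attribute name `region`, and the sample documents are all hypothetical.

```python
import zlib


def shard_for(document, shard_key, shard_count):
    """Place a document by the value of its shard key (hypothetical helper)."""
    value = str(document[shard_key]).encode("utf-8")
    return zlib.crc32(value) % shard_count


def network_hops(path, shard_key, shard_count):
    """Count steps along a path that cross from one shard to another."""
    shards = [shard_for(document, shard_key, shard_count) for document in path]
    return sum(1 for here, there in zip(shards, shards[1:]) if here != there)


# Hypothetical path through three customers that share a region.
customers = [
    {"id": "c1", "region": "emea"},
    {"id": "c2", "region": "emea"},
    {"id": "c3", "region": "emea"},
]

hops_by_region = network_hops(customers, "region", 4)
```

Because every document with the same `region` value lands on the same shard, a traversal that stays within one region needs no network hops in this model. Sharding the same documents by `id` gives no such guarantee.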
It’s good that AWS, like Microsoft, has joined the graph pack. Having a fully-managed service certainly gives AWS an advantage over other vendors which do not provide such services yet.
One key question, though, will be price. Traditionally, AWS is known to be quite expensive compared with hosting a database yourself. There are so many factors which can impact the cost: instance, storage, transfer IN and transfer OUT, etc. It’s difficult to calculate what you will pay at the end of each month, especially if you scale vertically or horizontally. We sincerely doubt that the cost will be reasonable for small to mid-size use cases. The question to consider will be whether the maintenance effort you save justifies the additional costs.
Seamless scalability is nice, especially when you work with growing datasets. However, scaling with highly connected data is not that easy for a distributed database. Even the multiple read replicas proposed for Neptune don’t solve the problem of network hops during graph traversals on sharded graph data. They may just raise the costs of your setup.
In conclusion, we appreciate what Amazon is trying to do by developing and offering Neptune, even if other similar database systems are already available. But we think there are better choices: in particular, ArangoDB. We realize that a native multi-model database like ArangoDB is definitely not the final solution when it comes to data storage. Still, we think it’s the better choice and heading in the right direction.