Transforming a Graph to a SmartGraph: The Smartifier
This tutorial focuses on transforming an existing graph dataset into a SmartGraph for Enterprise-level scaling.
It picks up where the Pregel Community Detection Tutorial ended.
The dataset
In the other tutorial, we applied a community detection algorithm to the Pokec social network published by SNAP. This algorithm generated an additional attribute, community, on each of the vertices. The attribute is numeric, and all vertices with the same community value form a community. By definition, vertices have more edges to members of their own community and fewer edges to members of other communities.
If you followed that tutorial, you now have a running ArangoDB cluster containing the dataset; you can reuse this cluster and skip the next paragraph. If you did not follow the preceding tutorial, you can download a dump of the data here.
After you have downloaded and extracted the data, you have to set up an ArangoDB cluster, most easily with the ArangoDBStarter. Afterwards you need to import the dataset using the arangorestore tool. Open your favorite command prompt and navigate to the folder where you extracted the dataset. You should see a folder pokec-with-communities/ in that directory.
You can then restore the data into your cluster with the following command (you may need to adjust the endpoint so that it points to a coordinator):
arangorestore --input-directory pokec-with-communities --server.endpoint tcp://localhost:8530
You should now have two collections: one named profiles (~1.6 million entries) and one named relations (~30.5 million entries).
Exporting the data from ArangoDB
The Smartifier is an offline process designed to transform large datasets into the SmartGraph format; running it online would put too much load on a running database. Therefore we first need to export the dataset from our cluster.
The tool we use for this task is arangoexport, which exports collections to disk in several data formats. The Smartifier we want to apply to this dataset only supports csv and jsonl, so export the data in one of those formats.
Let us export the data in jsonl for this tutorial with the following command:
arangoexport --type jsonl --collection profiles --collection relations --output-directory export --server.endpoint tcp://localhost:8530
After this command has succeeded, you will end up with two jsonl files in the export directory.
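Before running the Smartifier, it can be worth confirming that the export looks as expected. The following Python sketch is not part of the ArangoDB tooling. It assumes that each line of the exported files is one JSON document, that vertices carry _key and community, and that edges carry _from and _to in the form profiles/<key>. The helper name community_stats and the inline sample data are hypothetical.

```python
import io
import json


def community_stats(vertex_lines, edge_lines):
    """Return (vertices missing `community`, intra-community edges, other edges).

    Edges whose endpoints have no known community are counted as "other".
    """
    community_of = {}
    missing = 0
    for line in vertex_lines:
        if not line.strip():
            continue
        doc = json.loads(line)
        if "community" in doc:
            community_of[doc["_key"]] = doc["community"]
        else:
            missing += 1

    intra = other = 0
    for line in edge_lines:
        if not line.strip():
            continue
        doc = json.loads(line)
        # Assumption: _from/_to look like "profiles/<key>"; keep only the key part.
        src = doc["_from"].split("/", 1)[-1]
        dst = doc["_to"].split("/", 1)[-1]
        if (
            src in community_of
            and dst in community_of
            and community_of[src] == community_of[dst]
        ):
            intra += 1
        else:
            other += 1
    return missing, intra, other


# Tiny in-memory stand-in for export/profiles.jsonl and export/relations.jsonl.
sample_vertices = io.StringIO(
    '{"_key": "1", "community": 7}\n'
    '{"_key": "2", "community": 7}\n'
    '{"_key": "3", "community": 9}\n'
)
sample_edges = io.StringIO(
    '{"_from": "profiles/1", "_to": "profiles/2"}\n'
    '{"_from": "profiles/2", "_to": "profiles/3"}\n'
)
print(community_stats(sample_vertices, sample_edges))
```

To run the check on the real export, pass open file handles for export/profiles.jsonl and export/relations.jsonl instead of the sample data. A non-zero first value would indicate vertices without a community attribute.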
The Smartifier
Finally we have all we need to transform the dataset. The Smartifier is available in ArangoDB's graphutils repository or can be executed directly with Docker. For simplicity, let us stick to the Docker container for this tutorial.
Start the Docker container with the following command (please replace /var/tmp/export with the path to your export directory):
docker run -it -v /var/tmp/export:/export neunhoef/graphutils bash
You will now be at a command prompt inside the Docker container, and the export is available in /export.
If you preferred to build the Smartifier yourself, just navigate with your command prompt to the directory that contains the export folder and omit the leading / before export in all following commands.
The purpose of the Smartifier is to read in a vertex file and an edge file. It scans the vertices for the Smart-Attribute (the attribute we want to shard the graph by) and reorganizes vertices and edges according to this attribute, such that both honor the planned sharding. As we have used an existing ArangoDB dataset, the files will contain collection names.
If you need to import the data into collections with different names than the ones you started with, you can add a flag to remove those names on the go.
In this tutorial we are going to do exactly this, so we will use the --removeCollectionName true flag. All files are rewritten in place, so we will not get any direct output, but the files in export will look different.
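To make the idea of the rewrite more concrete, here is a small Python sketch of the kind of transformation described above. It is an illustration only and not the Smartifier's actual implementation. It assumes the usual SmartGraph convention that a vertex key becomes <smart value>:<original key>, and it assumes that removing the collection name means dropping the profiles/ prefix from _from and _to. The function names smartify_vertex and smartify_edge are hypothetical.

```python
def smartify_vertex(doc, smart_attr):
    """Return a copy of the vertex whose _key is prefixed with the smart value."""
    out = dict(doc)
    out["_key"] = f'{doc[smart_attr]}:{doc["_key"]}'
    return out


def smartify_edge(doc, smart_value_of, vertex_coll, remove_collection_name):
    """Return a copy of the edge with _from/_to pointing at the prefixed keys.

    Only endpoints in `vertex_coll` are rewritten, since an edge file may
    reference vertices from other collections as well.
    """
    out = dict(doc)
    for field in ("_from", "_to"):
        coll, _, key = doc[field].partition("/")
        if coll == vertex_coll and key in smart_value_of:
            key = f"{smart_value_of[key]}:{key}"
        # Assumption: the flag simply drops the "<collection>/" prefix.
        out[field] = key if remove_collection_name else f"{coll}/{key}"
    return out


# Hypothetical sample documents, shaped like the exported jsonl lines.
vertices = [{"_key": "1", "community": 7}, {"_key": "3", "community": 9}]
smart_value_of = {v["_key"]: v["community"] for v in vertices}
edge = {"_from": "profiles/1", "_to": "profiles/3"}

print(smartify_vertex(vertices[0], "community"))
print(smartify_edge(edge, smart_value_of, "profiles", True))
```

Because the smart value is part of every key, both endpoints of an edge reveal which shard they belong to without a lookup, which is what allows the data to be placed according to the planned sharding.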
So now let’s finally run it.
We need to execute it on the files named profiles.jsonl (vertices) and relations.jsonl (edges).
The original vertex collection name is profiles. This name is required because an edge file can reference vertices from several collections, and we only want to update the profiles vertices here. Finally, we need to use the attribute community to shard the graph. Altogether, we run the following command:
smartifier --type jsonl --removeCollectionName true /export/profiles.jsonl profiles /export/relations.jsonl community
This will take a while: we are modifying about 32 million lines of JSON, so please be patient.
Once this command has succeeded, the files are ready for SmartGraph import.
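A quick spot check of the rewritten vertex file can catch a failed or partial run before the import. The Python sketch below is an assumption-laden example rather than an official verification step: it assumes that every rewritten vertex key starts with <community value>:, following the usual SmartGraph key convention. The helper name find_unprefixed_keys is hypothetical.

```python
import io
import json


def find_unprefixed_keys(vertex_lines, smart_attr, limit=5):
    """Return up to `limit` vertex keys that do not start with "<smart value>:"."""
    offenders = []
    for line in vertex_lines:
        if not line.strip():
            continue
        doc = json.loads(line)
        expected_prefix = f"{doc.get(smart_attr)}:"
        if not str(doc.get("_key", "")).startswith(expected_prefix):
            offenders.append(doc.get("_key"))
            if len(offenders) >= limit:
                break
    return offenders


# In-memory stand-in for a rewritten profiles.jsonl; the second key is not prefixed.
sample = io.StringIO(
    '{"_key": "7:1", "community": 7}\n'
    '{"_key": "3", "community": 9}\n'
)
print(find_unprefixed_keys(sample, "community"))
```

For the real data, pass an open file handle for the rewritten profiles.jsonl instead of the sample. An empty list means that no offending key was found.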
More information on how to import the data and use SmartGraphs can be found in the SmartGraph tutorial.