Using Cytoscape with ArangoDB
In this tutorial, we visualize the data of a graph stored in ArangoDB to get a human-readable overview.
Such an overview often helps to build a general understanding of data that was not artificially created, or of a third-party dataset that we did not design ourselves.
The dataset
In this tutorial we work with a third-party dataset designed by Marius Bäsler for his master's thesis.[1]
His goal is to find the origins of parasitism with the help of GloBI's interaction database.
This data dump can be downloaded here.
The dataset describes several organisms that live in either a symbiotic or a parasitic relation to one another.
In order to import the dataset we can just restore it into a running ArangoDB with arangorestore:
arangorestore --input-directory /path/to/extracted/dump
After this command has succeeded, you will end up with two collections:
nodes_otl_sub: a document collection containing species, genera and families.
edges_otl_sub: an edge collection, where each edge defines a relation between the nodes.
Now we have the dataset in ArangoDB and are ready to go.
Data Normalization
The goal is to export the data in xgmml format, which is readable by Cytoscape, the tool we want to use to visualize the data.
Unfortunately, this format requires that all vertex attributes are of type string.
So we need to normalize our dataset first and convert all attributes of the vertices to strings.
Furthermore, each document needs to have an identical set of attributes, which this step also takes care of.
NOTE: This step requires some computation and does not scale well to larger datasets. If you are in this situation and need some guidance, please contact us on Slack; we can help you out there.
In order to do this normalization we are going to execute the following AQL:
LET attrs = (
  FOR node IN nodes_otl_sub
    FOR x IN ATTRIBUTES(node, true)
      RETURN DISTINCT x
)
FOR node IN nodes_otl_sub
  LET newNode = ZIP(attrs, (
    FOR attr IN attrs
      RETURN TO_STRING(node[attr])
  ))
  UPDATE node WITH newNode IN nodes_otl_sub
In the first step, this AQL query collects a distinct list of the attributes available in the dataset.
In the second step, it iterates over all nodes.
For each node it creates a new node in which every attribute is replaced with the TO_STRING variant of its value.
Note: if an attribute is not set on a node, the empty string is saved for it.
Finally, the query updates the document in the collection with the new node.
So after this query has succeeded, all vertices have all attributes, and all of them are of type string.
Now we are ready to go for the export.
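To make the effect of the normalization query easier to follow, here is a minimal Python sketch that applies the same two steps to a small in-memory list of documents. It is only an illustration and does not talk to ArangoDB: the sample documents and their values are made up, and the to_string helper is a rough stand-in for AQL's TO_STRING, not an exact reimplementation.

```python
import json


def to_string(value):
    """Rough Python approximation of AQL's TO_STRING (assumption, not exact)."""
    if value is None:
        return ""  # unset or null attributes end up as the empty string
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, str):
        return value
    if isinstance(value, (list, dict)):
        return json.dumps(value)
    return str(value)


def normalize(nodes):
    # Step 1: collect the distinct attribute names over all nodes, skipping
    # system attributes (keys starting with "_"), like ATTRIBUTES(node, true).
    attrs = sorted({key for node in nodes for key in node if not key.startswith("_")})
    # Step 2: merge a stringified copy of every attribute into each node,
    # similar to what UPDATE ... WITH newNode does.
    return [
        {**node, **{attr: to_string(node.get(attr)) for attr in attrs}}
        for node in nodes
    ]


# Hypothetical sample documents, not taken from the real dataset.
nodes = [
    {"_key": "1", "name": "organism A", "rank": "species", "parasite": False},
    {"_key": "2", "name": "organism B", "otl_id": 12345},
]
normalized = normalize(nodes)
```

After running this, both sample documents carry the same attribute names, every non-system value is a string, and attributes that were missing on a document are present as empty strings.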
Exporting the data
To visualize the data we need it in xgmml format.
In order to transform the dataset into this format, we use the arangoexport tool.
$> arangoexport --help
Usage: arangoexport []

Section 'global options' (Global configuration)
  --collection              restrict to collection name (can be specified multiple times) (default: )
  --configuration           the configuration file or 'none' (default: "")
  --fields                  comma separated list of fileds to export into a csv file (default: "")
  --graph-name              name of a graph to export (default: "")
  --output-directory        output directory (default: "/home/mchacki/devel/export")
  --overwrite               overwrite data in output directory (default: false)
  --progress                show progress (default: true)
  --type                    type of export. possible values: "csv", "json", "jsonl", "xgmml", "xml" (default: "json")
  --version                 reports the version and exits (default: false)
  --xgmml-label-attribute   specify document attribute that will be the xgmml label (default: "label")
  --xgmml-label-only        export only xgmml label (default: false)
This tool natively supports the xgmml format, so it is rather straightforward to use.
For this export, we need to name the collections we want to export, in our case nodes_otl_sub and edges_otl_sub.
We also need to specify xgmml as the export type.
For easier visualization we give the graph the name otl.
Finally, xgmml allows defining one attribute as the label.
We select the name attribute for this tutorial.
So in total, our call will look like this:
$> arangoexport --collection nodes_otl_sub --collection edges_otl_sub --type xgmml --graph-name otl --xgmml-label-attribute name
It produces the following output:
Connected to ArangoDB 'http+tcp://127.0.0.1:8529', version 3.2.0, database: '_system', username: 'root'
# Export graph with collections nodes_otl_sub, edges_otl_sub as 'otl'
# Exporting collection 'nodes_otl_sub'...
# Exporting collection 'edges_otl_sub'...
Processed 2 collection(s), wrote 128432121 byte(s), 176 HTTP request(s)
After this export has succeeded, you will have a directory named export containing a file named otl.xgmml.
This file is the xgmml representation of our dataset.
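Before loading a file of this size into Cytoscape, it can be useful to check that the export looks plausible. The following Python sketch counts the node and edge elements in an XGMML file using only the standard library. It assumes the file follows the usual XGMML layout with node and edge elements inside a graph element; the function name count_xgmml and the example path are illustrative choices, not part of arangoexport.

```python
import xml.etree.ElementTree as ET


def count_xgmml(path):
    """Count <node> and <edge> elements in an XGMML file, ignoring namespaces."""
    counts = {"node": 0, "edge": 0}
    # iterparse streams the file instead of loading it at once,
    # which matters for an export of more than 100 MB.
    for _event, elem in ET.iterparse(path, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop a "{namespace}" prefix if present
        if tag in counts:
            counts[tag] += 1
            elem.clear()  # release the element's children to keep memory low
    return counts


# Example call (the path is an assumption based on the default output directory):
# print(count_xgmml("export/otl.xgmml"))
```

If the numbers roughly match the document counts of nodes_otl_sub and edges_otl_sub, the export is most likely complete.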
Data Visualization
In order to visualize and analyze the dataset please download Cytoscape. For details of this product please refer to their website. For this tutorial we are just going to use it as a visualization tool.
Cytoscape: import xgmml file
Cytoscape: apply organic layout
Cytoscape: graph overview
Cytoscape: part of the graph zoomed in
Feel free to explore the graph yourself.
[1] The present graph is part of Marius Bäsler’s master thesis (Bäsler 2017 – https://github.com/majuss/globi-parasites). He’s trying to find the origins of parasitism with the help of the OpenTreeOfLife (Hinchliff et al. 2014 – doi: 10.1073/pnas.1423041112) and GlobalBioticInteractions (Poelen et al., 2014 – https://doi.org/10.1016/j.ecoinf.2014.08.005).