KIT (Karlsruhe Institute of Technology): Managing heterogeneous metadata with ArangoDB

"The Karlsruhe Institute of Technology is a public research university and one of the largest research and education institutions in Germany."

by Ajinkya Prabhune, Hasebullah Ansari, Anil Keshav

What is your use case?

The principal focus of our use case is the management of heterogeneous metadata generated during the execution of experiments by scientific communities.
The scientific community that we currently aim to address is the localisation microscopy (nanoscopy) scientific community.

What is localisation microscopy?
Localisation microscopy [3] is a novel imaging technique that aims to bridge the resolution gap between the traditional light microscope, which generates images at a resolution of ~200 nm, and the electron microscope at ~10 nm. One such embodiment adopted by the localisation microscopy group is Spectral Precision Distance Microscopy (SPDM). A typical nanoscopy experiment consists of multiple steps that are systematically defined in a workflow language and executed by a workflow engine. During the execution of the workflow, each step produces data and its associated metadata.

The metadata produced in each step is different and needs to be handled separately. We manage two categories of metadata:

(1) Metadata describing the details of the experiments and the context in which the experiment was undertaken (Descriptive Metadata),
(2) The scientific workflow details and the run-time events (Provenance Metadata).

The necessary extractors, transformers and loaders for handling this metadata are implemented in the Nanoscopy Open Reference Data Repository (NORDR) [1]. Moreover, the NORDR is a central location where all the large nanoscopy experiment datasets are stored for long-term archival. A sample result image generated by the nanoscopy workflow is shown in the figure above.

What problem did ArangoDB solve for you?

As the metadata produced by the nanoscopy experiments is heterogeneous, the descriptive metadata defined by the nanoscopy community is persisted in the document store of ArangoDB. The workflow execution plan and the corresponding run-time traces are modelled in ProvONE, an extension of the W3C PROV standard, and stored as graphs in the graph store of ArangoDB; see the figure below [2].
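To make the split concrete, here is a minimal sketch of how the two kinds of metadata might be shaped. It is an illustration, not the NORDR schema: the collection name `prov_nodes`, the keys and all field values are hypothetical. Only the `_key`, `_from` and `_to` attributes follow ArangoDB's document and edge conventions, and `used` and `wasGeneratedBy` are relation names from W3C PROV. The traversal runs in memory so the example is self-contained; in a real deployment it would be expressed as a graph traversal in the database.

```python
from collections import deque

# Descriptive metadata: one schema-free, JSON-style document per data set.
# All values are made up for illustration.
descriptive_doc = {
    "_key": "dataset-001",
    "technique": "SPDM",
    "sample": {"labels": ["dye-A", "dye-B"]},
}

# Provenance metadata: vertices plus edge documents that link them.
vertices = {
    "prov_nodes/raw": {"type": "Data"},
    "prov_nodes/exec1": {"type": "Execution", "step": "localisation"},
    "prov_nodes/positions": {"type": "Data"},
    "prov_nodes/exec2": {"type": "Execution", "step": "rendering"},
    "prov_nodes/image": {"type": "Data"},
}
edges = [
    {"_from": "prov_nodes/exec1", "_to": "prov_nodes/raw", "relation": "used"},
    {"_from": "prov_nodes/positions", "_to": "prov_nodes/exec1", "relation": "wasGeneratedBy"},
    {"_from": "prov_nodes/exec2", "_to": "prov_nodes/positions", "relation": "used"},
    {"_from": "prov_nodes/image", "_to": "prov_nodes/exec2", "relation": "wasGeneratedBy"},
]


def lineage(start, edges):
    """Breadth-first walk along outbound edges; returns ancestors in visit order."""
    outbound = {}
    for edge in edges:
        outbound.setdefault(edge["_from"], []).append(edge["_to"])
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in outbound.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


# Everything the final image was derived from, nearest ancestor first.
print(lineage("prov_nodes/image", edges))
```

Keeping each relation as its own edge document is what lets a lineage question ("which raw data produced this image?") be answered by following edges rather than by joining documents.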

Instead of integrating ad hoc tools and infrastructure to enable large-scale harvesting of metadata, we implemented the standard OAI-PMH protocol specification across the three data models (stores) of ArangoDB. In the key-value store, we maintain various associations among the metadata for a given data set stored in the NORDR.
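As a rough illustration of what an OAI-PMH harvesting endpoint has to return, the sketch below renders one stored metadata document as a `GetRecord` response in the `oai_dc` (Dublin Core) format. It is not the authors' implementation: the shape of the stored document, the identifier, the base URL and the dates are all placeholder assumptions, and the `xsi:schemaLocation` attributes that a complete response carries are left out for brevity. Only the element names and namespace URIs come from the OAI-PMH 2.0 and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET

OAI = "http://www.openarchives.org/OAI/2.0/"
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"
for prefix, uri in (("oai", OAI), ("oai_dc", OAI_DC), ("dc", DC)):
    ET.register_namespace(prefix, uri)


def q(ns, tag):
    """Build a namespace-qualified tag in ElementTree's {uri}tag notation."""
    return f"{{{ns}}}{tag}"


def get_record_response(doc, base_url, response_date):
    """Serialise one stored document as an OAI-PMH GetRecord response."""
    root = ET.Element(q(OAI, "OAI-PMH"))
    ET.SubElement(root, q(OAI, "responseDate")).text = response_date
    request = ET.SubElement(root, q(OAI, "request"), {
        "verb": "GetRecord",
        "identifier": doc["identifier"],
        "metadataPrefix": "oai_dc",
    })
    request.text = base_url
    record = ET.SubElement(ET.SubElement(root, q(OAI, "GetRecord")), q(OAI, "record"))
    header = ET.SubElement(record, q(OAI, "header"))
    ET.SubElement(header, q(OAI, "identifier")).text = doc["identifier"]
    ET.SubElement(header, q(OAI, "datestamp")).text = doc["datestamp"]
    dc = ET.SubElement(ET.SubElement(record, q(OAI, "metadata")), q(OAI_DC, "dc"))
    for field, values in doc["dc"].items():
        for value in values:
            ET.SubElement(dc, q(DC, field)).text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical stored document; every value is a placeholder.
stored = {
    "identifier": "oai:example.org:dataset-001",
    "datestamp": "2016-01-01",
    "dc": {"title": ["Hypothetical SPDM data set"], "type": ["Dataset"]},
}
xml_text = get_record_response(stored, "https://example.org/oai", "2016-01-01T00:00:00Z")
print(xml_text)
```

Because the response is generated from the stored document rather than kept as a separate copy, the same record can be served in other metadata formats by swapping the part that fills the `metadata` element.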

Using a single multi-model database allowed us to homogeneously design and implement various services for handling our metadata. The straightforward sharding configuration enabled us to scale our metadata framework across multiple shards easily. We have currently reached around 90 GB of descriptive metadata on four ArangoDB shards, and we expect the database size to increase several-fold within the coming months.

What is your experience with using ArangoDB in production?

The best part of using ArangoDB is how simple it is to use its multi-model data store (key-value, document and graph) through one database system. Moreover, the native AQL query language allows us to define and execute custom user queries easily. Some of the promising features of ArangoDB are its easy-to-set-up cluster system (Coordinator and DBServers), graph traversal, full-text search and more.
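For readers who have not seen AQL, the sketch below shows what two such custom queries could look like: a filter over a document collection and a traversal over a named graph. The collection name, the graph name `provenance`, the attribute names and the start vertex are hypothetical. The code only builds the query text and its bind parameters and does not talk to a server; the resulting pair would be handed to whichever ArangoDB driver is in use.

```python
def documents_by_technique(collection, technique, limit=10):
    """AQL filter over a document collection, using bind parameters."""
    query = (
        "FOR doc IN @@collection "
        "FILTER doc.technique == @technique "
        "SORT doc._key "
        "LIMIT @limit "
        "RETURN doc"
    )
    # A collection bind parameter is written @@name in the query
    # and keyed as "@name" in the bind variables.
    bind_vars = {"@collection": collection, "technique": technique, "limit": limit}
    return query, bind_vars


def provenance_ancestors(start_vertex):
    """AQL traversal: follow outbound edges up to three steps in a named graph."""
    query = (
        "FOR v, e IN 1..3 OUTBOUND @start GRAPH 'provenance' "
        "RETURN { vertex: v._id, relation: e.relation }"
    )
    return query, {"start": start_vertex}


query, bind_vars = documents_by_technique("descriptive_metadata", "SPDM")
print(query)
print(bind_vars)
```

Passing values as bind parameters rather than formatting them into the query string keeps user-supplied input out of the query text.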

In ArangoDB version 2.8 we faced difficulties in maintaining the parent-child relation when indexing documents, which was necessary for our use case because the document structure (metadata schema) changes abruptly. For example, we have a very complex metadata schema (model) that consists of nested arrays; indexing the innermost child in a nested array was not possible, which limited our use of the full-text feature.
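The shape of the problem can be seen in a small example. The sketch below walks a document with arrays nested inside arrays and collects every innermost string together with its path; copying those strings into a single top-level attribute is one generic workaround for an index that cannot reach them. This is an assumption for illustration only: the text does not say whether the authors did this, and the document layout and the `search_text` field name are hypothetical.

```python
def leaf_strings(value, path=""):
    """Yield (path, text) for every string nested anywhere inside value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from leaf_strings(child, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        for child in value:
            yield from leaf_strings(child, f"{path}[*]")
    elif isinstance(value, str):
        yield path, value


def with_search_text(doc):
    """Return a shallow copy of doc with all nested strings joined in one field."""
    flat = dict(doc)
    flat["search_text"] = " ".join(text for _, text in leaf_strings(doc))
    return flat


# Hypothetical metadata document with two levels of nested arrays.
doc = {
    "experiment": {
        "channels": [
            {"name": "ch1", "filters": [{"label": "bandpass"}, {"label": "longpass"}]},
            {"name": "ch2", "filters": [{"label": "notch"}]},
        ]
    }
}

for path, text in leaf_strings(doc):
    print(path, "=", text)
print(with_search_text(doc)["search_text"])
```

The trade-off is that the flattened field loses the parent-child relation: a match tells you which document contains a term, not which nested element it came from.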

We would also like to suggest improving the web-based graph visualisation: with larger graphs, the rendering performance starts to deteriorate. Performance-wise, ArangoDB is adequately good given its 3-in-1 datastore capability. We will continue extending our metadata framework with ArangoDB as its primary database system. Any new features of the metadata management framework will be designed with ArangoDB in mind.

To our knowledge, we are the first to implement the OAI-PMH protocol on a NoSQL database, and on ArangoDB in particular.

We also have the first accepted publications on provenance management in the ArangoDB graph data store. We are aware of communities that have used Neo4j. However, we achieved an end-to-end solution for all types of metadata using ArangoDB. Our aim in using ArangoDB (multi-model) was to show other scientific communities the benefits of using ArangoDB as a comprehensive database for handling their complex, metadata-specific requirements.

Big thanks to Ajinkya Prabhune, Hasebullah Ansari and Anil Keshav for investing their time to write this article!

References
[1] Prabhune, Ajinkya, Rainer Stotzka, Thomas Jejkal, Volker Hartmann, Margund Bach, Eberhard Schmitt, Michael Hausmann, and Juergen Hesser. “An optimized generic client service API for managing large datasets within a data repository.” In Big Data Computing Service and Applications (BigDataService), 2015 IEEE First International Conference on, pp. 44-51. IEEE, 2015.

[2] Prabhune, Ajinkya, Aaron Zweig, Rainer Stotzka, Michael Gertz, and Juergen Hesser. “Prov2ONE: An Algorithm for Automatically Constructing ProvONE Provenance Graphs.” In International Provenance and Annotation Workshop, pp. 204-208. Springer International Publishing, 2016.

[3] Cremer, Christoph. “Optics far beyond the diffraction limit.” In Springer handbook of lasers and optics, pp. 1359-1397. Springer Berlin Heidelberg, 2012.
