IRIC (Institute for Research in Immunology and Cancer): ArangoDB for Genomes and Graphs

"Cancer is the #1 health problem in Canada. For this reason, IRIC has built an outstanding team of men and women united around a common mission: to have a significant impact on the treatment of cancer. The core values of IRIC and its researchers are excellence and integrity in teaching and research, creativity, innovation, collaboration, and collegiality."

by Tariq Daouda, PhD Candidate in Bioinformatics, IRIC

What is your use case?

At IRIC (Institute for Research in Immunology and Cancer, Montreal, Canada) I use ArangoDB to manage my whole dataset. I have a fairly complex graph containing the whole human reference genome, some molecules, and some 50k short sequences from patients. It’s not a huge graph, in the sense that it’s perfectly manageable on a single machine, but it nonetheless contains about 1M edges representing genomic interactions between more than 145k molecules for several patients.

Which problem did ArangoDB solve for you?

I discovered ArangoDB while looking for a database for a personal web project. After doing some research I concluded that ArangoDB was the most sensible choice because:

  • It is written in C++.
  • It is distributed under the Apache 2.0 licence, with no additional costs for going into production.
  • It is multi-paradigm.
  • AQL is eloquent and the transition from SQL is easy.
  • Its API is very well documented.

The Python driver at that time did not support all the graph functionality I needed, so I started working on my own driver, pyArango, developing it further as my web project grew. The more I got acquainted with ArangoDB, the more I liked it.

When you start working on a research project, you have only a vague idea of what to store and how to store it. Because of this uncertainty, schemas can quickly become a hindrance, while schemaless pure document databases can force you to implement pretty complicated software layers on top of them. That’s exactly what I had: a complicated dataset management system relying heavily on Python scripts.

After several weeks of using and experimenting with ArangoDB, it became clear that my research project could benefit from it. Because ArangoDB is multi-paradigm, I can interact with the same data from different perspectives, and a few lines of AQL can do the job of a complicated Python script, and do it much faster.
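To make that comparison concrete, here is a minimal sketch, not the actual code from the project. The collection names `molecules` and `interactions`, the document keys, and the helper `neighbours_within` are all hypothetical. The Python function walks an in-memory edge list by hand; the AQL string below it expresses roughly the same traversal in two lines.

```python
from collections import deque

# Hypothetical edge documents, shaped like ArangoDB edges (_from / _to).
EDGES = [
    {"_from": "molecules/m1", "_to": "molecules/m2"},
    {"_from": "molecules/m2", "_to": "molecules/m3"},
    {"_from": "molecules/m3", "_to": "molecules/m4"},
    {"_from": "molecules/m1", "_to": "molecules/m5"},
]


def neighbours_within(edges, start, max_depth):
    """Breadth-first search: vertices reachable from `start`
    in 1..max_depth outbound hops, in discovery order."""
    # Build an adjacency map once so each lookup is cheap.
    adjacency = {}
    for edge in edges:
        adjacency.setdefault(edge["_from"], []).append(edge["_to"])

    seen = {start}
    found = []
    queue = deque([(start, 0)])
    while queue:
        vertex, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for target in adjacency.get(vertex, []):
            if target not in seen:
                seen.add(target)
                found.append(target)
                queue.append((target, depth + 1))
    return found


# Roughly the same traversal in AQL, assuming an edge collection
# named `interactions` (hypothetical). @start is a bind parameter.
TRAVERSAL_AQL = """
FOR v IN 1..2 OUTBOUND @start interactions
  RETURN DISTINCT v._id
"""

print(neighbours_within(EDGES, "molecules/m1", 2))
```

In the hand-written version the adjacency map, the visited set, and the depth bookkeeping all have to be maintained and tested; in the AQL version the database handles them.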

I therefore chose to replace my bulky, complicated system with a cleaner, leaner one built on top of ArangoDB.

What is your experience in production?

Great! The first time I had to update 95k documents, ArangoDB did it instantaneously. I was pretty impressed.
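A bulk update of this kind can be written as a single AQL statement. The sketch below only builds the query text and its bind variables; the collection name `sequences`, the `patient` attribute, and the helper `build_bulk_update` are hypothetical and not taken from the project. The resulting pair would then be handed to whichever driver is in use.

```python
def build_bulk_update(collection, patient_id, changes):
    """Return an AQL UPDATE statement and its bind variables.

    Every document in `collection` whose (hypothetical) `patient`
    attribute equals `patient_id` is merged with `changes`.
    """
    query = (
        "FOR doc IN @@collection\n"
        "  FILTER doc.patient == @patient\n"
        "  UPDATE doc WITH @changes IN @@collection"
    )
    bind_vars = {
        "@collection": collection,  # collection bind parameter (@@collection)
        "patient": patient_id,
        "changes": changes,
    }
    return query, bind_vars


query, bind_vars = build_bulk_update(
    "sequences", "patient_42", {"reviewed": True}
)
print(query)
```

Using bind parameters rather than string formatting keeps the values out of the query text, so the same statement can be reused for each patient.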

It took me about a week to move my dataset to ArangoDB, but I definitely do not regret that choice. Everything is now much more maintainable. A few lines of AQL are more eloquent than tens of lines of Python code, and everything runs much faster. As a result I worry much less about bugs, and I can run many more experiments in shorter periods of time.

Big thanks to Tariq Daouda (@tariqdaouda) for investing time to write this use case!

Also using ArangoDB? Write a few lines – post it to your blog or send it to us and we’ll publish it here.