Meet Pl@ntNet: One of the Largest Citizen Science Observatories in the World
Pl@ntNet is a citizen science platform for the production, aggregation, and dissemination of botanical observations, supported by four French research organizations: CIRAD, INRAE, Inria and IRD. It is based on an IT, web, and mobile infrastructure (iOS and Android) that enables the identification of plants using AI algorithms. Every day, 500,000 users – from teachers and students, to farmers and park rangers – take pictures of plants using the app in order to discover the flora in their region, and in turn contribute to ecology and biodiversity research worldwide.
Available in 24 languages, Pl@ntNet collects tens of millions of plant observations every year. The organization celebrated its 10th birthday in May 2020.
The Pl@ntNet iOS app.
The Challenge: Scaling Complex Queries When Data Volumes Double Every Year
The goal of Pl@ntNet is not only to provide a free image-based plant identification tool to the public, but also to help collect as much geolocated plant data and as many images as possible for research purposes. The first version of the Pl@ntNet app was built using a popular document-oriented database to store the millions of documents created through its mobile and web apps. However, as the app’s usage started doubling every year, so did the database volume, and Pl@ntNet began running into the limited expressiveness of complex queries: it took the team more than a week to build views for data retrieval.
Pl@ntNet usage doubles every year, which made it increasingly hard for queries to scale across such high data volumes.
“While we had excellent performance for write and key-based reads, we experienced inefficiencies browsing data in many different ways – and disk usage was very high,” shares Mathias Chouet, backend developer and database engineer at CIRAD. “As the need to analyze data for research purposes increased, these limitations became a real issue.”
The Solution: A Document-oriented Database with SQL-like Queries
Pl@ntNet hired an engineer to conduct a three-month comparative study of high-end databases that could remain consistent and performant while:
- Sustaining tens of writes per second
- Handling hundreds of millions of documents and dynamically querying them
- Replicating itself on other servers
- Providing SQL-like queries and indexes with a flexible storage model
The engineer evaluated ArangoDB, CouchDB 2, HBase, MongoDB, and PostgreSQL. Unsatisfied with the querying capabilities of CouchDB, the indexes of MongoDB, and the deployment and storage models of HBase, Pl@ntNet narrowed the field to ArangoDB and PostgreSQL.
“We were comparing PostgreSQL and ArangoDB on a quite complex query, involving three collections and tens of millions of documents,” recalls Chouet. “As ArangoDB’s data model is much more flexible, we were expecting the querying time to be much higher, but found that it was exactly the same. Then we looked at disk usage and found that it was just a little higher — this was quite convincing.”
In addition to its passing the performance tests, Pl@ntNet ultimately selected ArangoDB because it is document-oriented with SQL-like queries, and for its replication capabilities.
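To give a feel for what "document-oriented with SQL-like queries" can look like, here is a minimal sketch of an AQL join across three collections, wrapped in a small Python helper. The collection names (`observations`, `species`, `users`), the attribute names, and the helper `build_top_families_query` are all hypothetical; this is not Pl@ntNet's schema, nor the query used in the comparison described above.

```python
from typing import Any, Dict, Tuple

# Hypothetical schema: every collection and attribute name below is
# illustrative only and not taken from Pl@ntNet's database.
AQL_TOP_FAMILIES = """
FOR obs IN observations
  FILTER obs.country == @country
  FOR sp IN species
    FILTER sp._key == obs.species_key
    FOR u IN users
      FILTER u._key == obs.user_key AND u.active == true
      COLLECT family = sp.family WITH COUNT INTO total
      SORT total DESC
      LIMIT @limit
      RETURN { family: family, observations: total }
"""


def build_top_families_query(country: str, limit: int = 10) -> Tuple[str, Dict[str, Any]]:
    """Return the AQL text and its bind variables for one country."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Values travel as bind parameters instead of being pasted into the query.
    bind_vars = {"country": country, "limit": limit}
    return AQL_TOP_FAMILIES.strip(), bind_vars


query, bind_vars = build_top_families_query("FR", limit=5)
print(query)
print(bind_vars)
```

With a driver such as python-arango, the pair could then be passed to `db.aql.execute(query, bind_vars=bind_vars)`. Nested `FOR` loops with `FILTER` conditions play the role of SQL joins, while the documents themselves stay schema-free.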
The Implementation: 250 Million Documents Across a Single Database
Pl@ntNet stores 14 document collections in a single database, totaling around 250 million documents that include user accounts, geolocated plant observations, taxonomic data, push messages, and geographical data. Because app usage has doubled every year since 2015, the team expects to reach several hundred million documents in the next three years.
Pl@ntNet replicates ArangoDB to two machines: a small one for backup, and a larger one for data mining, where the team issues complex, memory-intensive AQL queries.
The Results: Fast, Complex Queries at Scale and a 95% Reduction in Disk Usage
With ArangoDB, Pl@ntNet was able to reduce its database size on disk by 95%, from 2 TB to 100 GB. AQL, ArangoDB’s query language, provided the performant, dynamic queries the team needed to analyze its data. ArangoDB also saves Pl@ntNet time, both in engineering effort and in computing time for big queries. Lastly, since ArangoDB was deployed on its production servers, the team has not had to do any unplanned maintenance.
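As a quick check of the storage figure, the short sketch below recomputes the reduction from the two sizes quoted above. It assumes decimal units (1 TB = 1,000 GB), and the function name is purely illustrative.

```python
def reduction_percent(before_gb: float, after_gb: float) -> float:
    """Percentage by which disk usage shrank between two measurements."""
    return 100.0 * (before_gb - after_gb) / before_gb


# Figures from the case study, assuming decimal units (1 TB = 1,000 GB).
before_gb = 2000.0  # 2 TB with the previous database
after_gb = 100.0    # 100 GB with ArangoDB

pct = reduction_percent(before_gb, after_gb)
ratio = before_gb / after_gb

print(f"Disk usage reduced by {pct:.0f}%")
print(f"Same data in 1/{ratio:.0f} of the space")
```

Put differently, a 95% reduction means the same data now fits in one twentieth of the original disk space.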
“Pl@ntNet is a wonderful scientific, human, educational, and technological adventure,” shares Chouet. “Every day, thousands of students learn to discover or inventory the flora of their region, their schoolyard or the surrounding natural areas. Tens of millions of plant observations are collected every year and analysed by ecology and biodiversity researchers. Keeping Pl@ntNet and its database working is mandatory to go on collecting data and feed the biodiversity studies depending on this data.”
Pl@ntNet is a non-profit research and educational initiative. Its applications are free, accessible to all, and without advertising. It therefore relies on financial contributions to maintain its IT infrastructure. If you would like to support Pl@ntNet, you can do so by making a donation.