ArangoSearch:
Full-text search engine including similarity ranking capabilities
ArangoSearch is a C++-based full-text search engine with similarity ranking capabilities, natively integrated into ArangoDB.
ArangoSearch allows users to combine two information retrieval techniques: boolean and generalized ranking retrieval. Search results “approved” by the boolean model can be ranked by relevance to the respective query using the Vector Space Model in conjunction with BM25 or TFIDF weighting schemes.
ArangoSearch is a first-class citizen in ArangoDB. With the debut release of ArangoSearch (ArangoDB 3.4) the following capabilities are supported:
- Complex Searches with Boolean Operators
- Relevance-Based Matching
- Phrase and Prefix Matching
- Relevance Tuning at Query Time
- Full combinability of search queries with all supported data models & access patterns
- Scalability
Over the upcoming releases, the dedicated ArangoSearch team will complement the supported feature set.
Learn the new search capabilities with the detailed ArangoSearch tutorial
The VIEW concept
ArangoSearch uses a special type of materialized view to provide full-text search across multiple collections. Within the definition of a view of type arangosearch, the user specifies entire collections or individual fields to be covered by an inverted index with one or more general text analyzers. The view concept is currently exclusive to ArangoSearch; more general views (SQL-like views, materialized views) may be introduced in later versions of ArangoDB.
The view concept is key to ArangoSearch. Developers can create an arbitrary number of individually configured views. A single ArangoSearch view may contain documents from different collections, making it possible to perform complex federated searches even over the whole graph.
Data is connected to each view via ArangoSearch links. Each link defines the following settings:
- which fields have to be indexed (or all)
- which analyzer(s) have to be applied to a field
- how deep hierarchical JSON documents have to be processed
- how lists/arrays have to be indexed in terms of individual position tracking
For each link one or more of the 12 analyzers can be defined which tokenize the values into case-sensitive word stems. The current release supports most common languages like English, Chinese, German, Spanish, French, Swedish, Russian and many more.
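As an illustration, the link settings listed above could be written down as view properties along the following lines. This is a sketch only: the collection name `imdb_vertices` and the field names are hypothetical, and the property names (`includeAllFields`, `fields`, `analyzers`, `trackListPositions`) are given as recalled from the ArangoSearch view documentation, so check them against the release you run.

```python
import json

# Hypothetical link definition for a view of type "arangosearch".
# Collection and field names are made up for illustration.
view_properties = {
    "links": {
        "imdb_vertices": {
            "includeAllFields": False,    # index only the fields listed below
            "trackListPositions": False,  # do not track individual array positions
            "fields": {
                "description": {"analyzers": ["text_en"]},
                "title": {"analyzers": ["text_en", "identity"]},
            },
        }
    }
}

# Serialize to the JSON form a view definition would be sent in.
print(json.dumps(view_properties, indent=2))
```

Nesting further objects under `fields` is how deeper levels of a hierarchical document would be addressed in this sketch.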
Part of every view, and the heart of ArangoSearch, is an inverted index: the ArangoSearch Index. Its structure and management approach are inspired by the well-known search engine Lucene and allow for efficient data access.
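To make the idea of an inverted index concrete, here is a minimal sketch in Python. It is not the ArangoSearch Index itself: plain lower-casing and whitespace splitting stand in for a real analyzer, and the sample documents are invented.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lower-cased token to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            postings[token].add(doc_id)
    return {token: sorted(ids) for token, ids in postings.items()}

# Invented sample documents.
docs = {
    1: "alien invasion of the galaxy",
    2: "a documentary about the galaxy",
    3: "action movie",
}
inverted = build_inverted_index(docs)
print(inverted["galaxy"])
```

A lookup by term then returns the matching document ids directly, without scanning every document.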
Similarity & Ranking in Full-text Search Engine ArangoSearch
One of the important advantages of ArangoSearch is the ability to score and sort the query result set by document relevance, so that the most relevant documents are returned before less relevant ones. This also makes it possible to limit the result set to the N documents that best match the filter conditions.
For similarity ranking, ArangoSearch uses a Vector Space Model which calculates the term weight for each term via scorer algorithms.
The current view implementation exposes the following scorers (case sensitive):
- BM25 – a frequency-based scorer based on the BM25 algorithm
- TFIDF – a frequency-based scorer based on the TFIDF algorithm
Both scorers analyze the term frequency (the number of times a term occurs in a document) and the inverse document frequency (a measure of a term's “weight” across documents).
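The following sketch shows how term frequency and inverse document frequency combine into a weight. It uses textbook formulations of TF-IDF and BM25, with the commonly used parameters `k1 = 1.2` and `b = 0.75` as assumptions; the exact formulas and defaults inside ArangoSearch may differ.

```python
import math

def tfidf(tf, df, n_docs):
    """Classic TF-IDF weight: term frequency times log inverse document frequency."""
    return tf * math.log(n_docs / df)

def bm25(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 term weight with length normalisation (one common IDF variant)."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm

# A term found in 1 of 10 documents versus a term found in 5 of 10.
rare = bm25(tf=2, df=1, n_docs=10, doc_len=100, avg_doc_len=100)
common = bm25(tf=2, df=5, n_docs=10, doc_len=100, avg_doc_len=100)
print(rare > common)
```

In both schemes a term that appears in few documents weighs more than one that appears in many. BM25 additionally dampens repeated occurrences and normalises by document length.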
Example query from the hands-on tutorial
FOR d IN v_imdb
  SEARCH ANALYZER(d.description IN TOKENS('amazing action world alien sci-fi science documental galaxy', 'text_en'), 'text_en')
  SORT BM25(d) DESC
  LIMIT 10
  RETURN d
ArangoSearch for Cluster Usage
ArangoSearch is a distributed search engine capable of handling datasets sharded over multiple machines.
Following the general cluster architecture of ArangoDB, ArangoSearch queries are sent to the Coordinator, which plans, optimizes and executes them. The Coordinator sends each part of an ArangoSearch query to the DB-Server responsible for it, where that part is processed locally. This general architecture for distributed query processing in ArangoDB allows ArangoSearch queries to be executed efficiently as well.
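The scatter-and-gather pattern described here can be sketched as follows. This is a simplified illustration, not ArangoDB's actual query execution: each shard is assumed to return its local hits already sorted by score, and the function name `merge_top_k` and the sample data are hypothetical.

```python
import heapq
from itertools import islice

def merge_top_k(shard_results, k):
    """Merge per-shard hit lists, each sorted by score descending, into a global top k."""
    merged = heapq.merge(*shard_results, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, k))

# Invented (score, document id) pairs as two shards might return them.
shard_a = [(9.1, "doc/1"), (4.2, "doc/7")]
shard_b = [(7.5, "doc/3"), (6.0, "doc/4"), (1.1, "doc/9")]
top = merge_top_k([shard_a, shard_b], 3)
print(top)
```

Because each shard sorts locally, the merging side only has to interleave sorted streams and can stop as soon as it has k results.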
Limitations in the current ArangoSearch version
ArangoDB 3.4 includes the first production-ready release of ArangoSearch. Not all features on our roadmap are implemented yet, but the Core Team is continuing to extend and optimize the capabilities of text search and ranking for ArangoDB.
Eventually read committed
To speed up indexing, an ArangoSearch view processes the modification requests coming from its ArangoSearch links in batches. From time to time, an asynchronous job commits the accumulated data, creating new index segments. Data becomes visible right after the commit, so in terms of transaction isolation an ArangoSearch view is at the eventually read committed level.
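A toy model of this behaviour, assuming nothing about the real implementation beyond what is described above: writes accumulate in a pending batch and only become searchable once a commit turns the batch into a segment. The class name `BatchedIndex` is hypothetical, and the commit is called by hand here rather than by an asynchronous job.

```python
class BatchedIndex:
    """Toy index where writes become searchable only after commit()."""

    def __init__(self):
        self._pending = []    # modifications accumulated since the last commit
        self._segments = []   # committed, immutable segments visible to readers

    def insert(self, doc):
        self._pending.append(doc)

    def commit(self):
        """Turn the accumulated batch into a new segment."""
        if self._pending:
            self._segments.append(tuple(self._pending))
            self._pending = []

    def search(self, term):
        """Return committed documents containing term as a substring."""
        return [doc for seg in self._segments for doc in seg if term in doc]

idx = BatchedIndex()
idx.insert("alien galaxy")
before = idx.search("alien")  # the write is not yet visible
idx.commit()
after = idx.search("alien")   # visible after the commit
print(before, after)
```

A query issued between a write and the next commit therefore does not see that write, which is what "eventually" refers to.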
To learn more about the limitations of the current version, visit the release notes in the docs.