Basic querying
Let’s try to find some movies:
db._query("RETURN LENGTH(FOR d IN v_imdb SEARCH d.type == 'Movie' RETURN d)")
The query above reports 12862 matching documents.
We can specify search criteria to get only drama movies:
db._query("FOR d IN v_imdb SEARCH d.type == 'Movie' AND d.genre == 'Drama' RETURN d")
The query above returns 2690 documents.
Or even get drama movies that are relatively short, between 10 and 50 minutes:
db._query("FOR d IN v_imdb SEARCH d.type == 'Movie' AND d.genre == 'Drama' AND d.runtime IN 10..50 RETURN d")
The query above returns only 35 documents.
SEARCH vs FILTER
A curious reader may have noticed that the examples above use the SEARCH statement instead of the familiar FILTER. There is a subtle but fundamental difference between the two keywords.
In the context of an ArangoSearch view, FILTER always means post-processing: it doesn’t use any indices, but rather performs a full scan over the result produced by the corresponding view expression. The SEARCH keyword, in contrast, guarantees the use of ArangoSearch indices, which allows the most efficient execution plan.
A SEARCH expression has the following noticeable differences compared to FILTER expressions:
- A SEARCH expression, in contrast to FILTER, is meant to be treated as a part of the FOR operation, not as an individual statement
- There may be only one SEARCH expression related to a corresponding ArangoSearch view expression
- Though an ArangoSearch view contains documents having at least 1 indexed field, a SEARCH expression works only with indexed attributes (see Link properties), non-indexed attributes are treated as non-existent
- Data source specific AQL extensions (e.g. PHRASE, STARTS_WITH, BOOST) are allowed only within the context of a SEARCH expression
It’s recommended to prefer a SEARCH expression over a FILTER expression whenever you can, and that is almost always possible.
The following query gives you an example of how SEARCH and FILTER expressions can be combined. It will return all drama movies within a random duration interval:
db._query("FOR d IN v_imdb SEARCH d.type == 'Movie' AND d.genre == 'Drama' FILTER d.runtime IN RAND()*10..RAND()*100 RETURN d")
To learn more about SEARCH expressions and their fundamental differences compared to FILTER expressions, please check our AQL documentation.
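To make the difference more tangible, here is a minimal sketch in plain JavaScript. It is only a toy model under stated assumptions, not how ArangoSearch is implemented: the sample documents, the inverted index layout and the helper names (buildIndex, searchAll, postFilter) are all hypothetical. The "SEARCH" step answers equality conditions from an index, while the "FILTER" step scans whatever the previous step produced.

```javascript
// Toy data standing in for documents covered by a view (hypothetical values).
const movies = [
  { title: "A", type: "Movie", genre: "Drama", runtime: 45 },
  { title: "B", type: "Movie", genre: "Comedy", runtime: 95 },
  { title: "C", type: "Movie", genre: "Drama", runtime: 120 },
  { title: "D", type: "Person", genre: null, runtime: null },
];

// Build a tiny inverted index: "attribute=value" -> set of document positions.
function buildIndex(documents, attributes) {
  const index = new Map();
  documents.forEach((doc, pos) => {
    for (const attr of attributes) {
      const key = `${attr}=${doc[attr]}`;
      if (!index.has(key)) index.set(key, new Set());
      index.get(key).add(pos);
    }
  });
  return index;
}

// "SEARCH"-like step: intersect posting lists without reading non-matching documents.
// Expects at least one [attribute, value] condition.
function searchAll(index, conditions) {
  const lists = conditions.map(([attr, value]) => index.get(`${attr}=${value}`) || new Set());
  const [first, ...rest] = lists;
  return [...first].filter((pos) => rest.every((list) => list.has(pos)));
}

// "FILTER"-like step: a plain scan over the result of the previous step.
function postFilter(documents, positions, predicate) {
  return positions.map((pos) => documents[pos]).filter(predicate);
}

const index = buildIndex(movies, ["type", "genre"]);
const dramas = searchAll(index, [["type", "Movie"], ["genre", "Drama"]]);
const shortDramas = postFilter(movies, dramas, (d) => d.runtime >= 10 && d.runtime <= 50);
console.log(shortDramas.map((d) => d.title));
```

The indexed step narrows four documents down to two without looking at the others, and only those two are scanned by the post-filter. That is the reason to push as many conditions as possible into SEARCH.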
Phrase search
With a phrase search one can search for documents containing an exact sentence or phrase rather than comparing a set of keywords in random order. In ArangoSearch, phrase search is available via the PHRASE function.
The following example gives me documents that contain the phrase “Star wars” in the description:
db._query("FOR d IN v_imdb SEARCH PHRASE(d.description, 'Star wars', 'text_en') RETURN d")
Proximity search
Proximity searching is a way to search for two or more words that occur within a certain number of words from each other.
The PHRASE function described above allows performing complex proximity queries as well.
In the next example I’m looking for the following word sequence in the description of a movie: in <any word> galaxy
db._query("FOR d IN v_imdb SEARCH PHRASE(d.description, 'in', 1, 'galaxy', 'text_en') RETURN d")
The following snippet gathers results of several parameterized proximity searches:
var res = [];
for (var i = 1; i <= 5; ++i) {
  res.push(
    db._query("FOR d IN v_imdb SEARCH PHRASE(d.description, 'in', @prox, 'galaxy', 'text_en') RETURN d", { prox: i }).toArray()
  );
}
With the PHRASE function you can perform really complex proximity searches, e.g. story <any word> <any word> man <any word> <any word> undercover:
db._query("FOR d IN v_imdb SEARCH PHRASE(d.description, 'story', 2, 'man', 2, 'undercover', 'text_en') RETURN d")
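If the meaning of the numeric arguments is not obvious, the following plain JavaScript sketch mimics it. This is an illustration only: tokenize is a crude stand-in for a real text analyzer (no stemming or language handling), and phraseMatches is a hypothetical helper, not an ArangoDB API. A number between two words says how many arbitrary words must sit between them.

```javascript
// Split text into lower-cased word tokens (a crude stand-in for a text analyzer).
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// parts alternates words and skip counts, e.g. ["in", 1, "galaxy"].
function phraseMatches(text, parts) {
  const tokens = tokenize(text);
  // Translate parts into words with positions relative to the first word.
  const expected = [];
  let offset = 0;
  for (const part of parts) {
    if (typeof part === "number") {
      offset += part; // skip this many arbitrary words
    } else {
      expected.push({ word: part.toLowerCase(), at: offset });
      offset += 1;
    }
  }
  // Try every start position and require all words at their exact offsets.
  for (let start = 0; start < tokens.length; ++start) {
    if (expected.every(({ word, at }) => tokens[start + at] === word)) return true;
  }
  return false;
}

console.log(phraseMatches("A long time ago in a galaxy far, far away", ["in", 1, "galaxy"]));
console.log(phraseMatches("Lost in galaxy nine", ["in", 1, "galaxy"]));
```

The first call matches because exactly one word separates "in" and "galaxy"; the second does not, since the two words are adjacent.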
Min match
The MIN_MATCH function is another very interesting and useful ArangoSearch extension. Essentially, it is something in between disjunction and conjunction. It accepts multiple SEARCH expressions and a numeric value denoting the minimum number of search expressions to be satisfied. The function matches documents where at least the specified number of search expressions are true.
In the following example we’re looking for 2 episodes of the “Star Wars” saga, namely “Phantom Menace” and “New Hope”:
db._query("FOR d IN v_imdb SEARCH MIN_MATCH(PHRASE(d.description, 'Phantom Menace', 'text_en'), PHRASE(d.description, 'Star Wars', 'text_en'), PHRASE(d.description, 'New Hope', 'text_en'), 2) RETURN d")
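The "in between disjunction and conjunction" idea can be sketched in a few lines of plain JavaScript. Everything here is an assumption made for illustration: minMatch and containsPhrase are hypothetical helpers, the phrase test is a naive substring check, and the descriptions are made up rather than taken from the IMDB dataset.

```javascript
// Crude phrase test: case-insensitive substring match (no analyzer involved).
function containsPhrase(phrase) {
  const needle = phrase.toLowerCase();
  return (description) => description.toLowerCase().includes(needle);
}

// True when at least minCount of the given predicates hold for the value.
function minMatch(value, predicates, minCount) {
  let satisfied = 0;
  for (const predicate of predicates) {
    if (predicate(value)) satisfied += 1;
  }
  return satisfied >= minCount;
}

const conditions = [
  containsPhrase("Phantom Menace"),
  containsPhrase("Star Wars"),
  containsPhrase("New Hope"),
];

// Hypothetical descriptions, made up for this example.
const sampleDescriptions = [
  "Star Wars: Episode I, The Phantom Menace",
  "Star Wars: Episode IV, A New Hope",
  "Star Wars Holiday Special",
  "A new hope for the village",
];

const matched = sampleDescriptions.filter((text) => minMatch(text, conditions, 2));
console.log(matched);
```

With a minimum of 1 the function behaves like a disjunction (OR), with a minimum equal to the number of conditions it behaves like a conjunction (AND), and anything in between gives the "at least N" behavior used above.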
Analyzer
In the last example we used the PHRASE function 3 times with the same analyzer. The query above can be simplified with the help of the special ANALYZER function:
db._query("FOR d IN v_imdb SEARCH ANALYZER(MIN_MATCH(PHRASE(d.description, 'Phantom Menace'), PHRASE(d.description, 'Star Wars'), PHRASE(d.description, 'New Hope'), 2), 'text_en') RETURN d")
The query above will produce the same output as the previous one.
The ANALYZER function does nothing but override the analyzer in the context of a SEARCH expression with the specified one, making it available for search functions. By default, every SEARCH function that accepts an analyzer as a parameter has the identity analyzer in its context.
Ranking in ArangoSearch
At the beginning of the tutorial, we described a naive ranking similarity measure. Of course, it doesn’t work well in real-world use cases. Though understanding how ranking works requires some advanced knowledge, in this section we’re going to give you an overview of the ranking model ArangoSearch uses for measuring document relevance.
Ranking, in general, is the process of evaluating a document’s rank/score according to specified search criteria. For text processing, ArangoSearch uses the Vector Space Model.
It postulates the following: in the space formed by the terms of the query, the document vectors that are closer to the query vector are more relevant.
In practice, the closeness is expressed as the cosine of the angle between two vectors, namely cosine similarity.
In other words, we treat documents and search criteria as vectors and evaluate the cosine of the angle between them. Then we can sort documents by the evaluated cosine score and finally retrieve the documents that are closer (more relevant) to the specified query than the others.
Let’s consider the following example:
Assume we have 3 documents in a collection:
{ "text" : "quick dog" } // doc1
{ "text" : "brown fox" } // doc2
{ "text" : "quick brown fox" } // doc3
Now imagine we have a query:
FOR d in myView SEARCH d.text IN ["quick", "brown"] RETURN d
For this query, let’s imagine a two-dimensional space formed by the words quick and brown.
Then let’s place every document in that space:
From the plot above it’s clear that doc3 is the most relevant to the given query.
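The same conclusion can be checked numerically. The sketch below is plain JavaScript and makes one simplifying assumption: every vector component is just 1 if the word is present and 0 otherwise, whereas real scorers use statistical weights. The helper names toVector and cosine are made up for this example.

```javascript
// The two dimensions of our space are the query terms.
const terms = ["quick", "brown"];

// Component is 1 when the term occurs in the text, 0 otherwise (simplifying assumption).
function toVector(text) {
  const words = text.split(/\s+/);
  return terms.map((term) => (words.includes(term) ? 1 : 0));
}

// Cosine of the angle between two vectors of equal length.
function cosine(a, b) {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  const denominator = norm(a) * norm(b);
  return denominator === 0 ? 0 : dot / denominator;
}

const queryVector = toVector("quick brown");
const texts = { doc1: "quick dog", doc2: "brown fox", doc3: "quick brown fox" };

const scores = Object.entries(texts)
  .map(([name, text]) => ({ name, score: cosine(toVector(text), queryVector) }))
  .sort((x, y) => y.score - x.score);

for (const { name, score } of scores) {
  console.log(name, score.toFixed(2));
}
```

Here doc3 points in the same direction as the query, so its cosine is about 1, while doc1 and doc2 each share only one dimension with the query and score about 0.71.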
Ranking in AQL
In order to compute the cosine, we first have to evaluate the vector components. There are a number of probabilistic/statistical weighting models, but for now ArangoSearch provides two of probably the most famous schemes: BM25 and TFIDF.
Under the hood both models rely on 2 main components:
- Term frequency (TF) – in the simplest case defined as the number of times that term t occurs in document d.
- Inverse document frequency (IDF) – a measure of how much information the word provides, i.e. whether the term is common or rare across all documents.
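To get a feeling for these two components, here is a small plain JavaScript sketch that reuses the three toy documents from the previous section. It assumes one common textbook formulation, IDF = ln(N / df), where N is the number of documents and df is the number of documents containing the term. The exact formulas inside ArangoSearch may differ, and the function names are made up for this example.

```javascript
// The toy collection from the ranking example, already split into words.
const corpus = ["quick dog", "brown fox", "quick brown fox"].map((text) => text.split(" "));

// TF: how many times the term occurs in one document.
function termFrequency(term, docTokens) {
  return docTokens.filter((token) => token === term).length;
}

// IDF (textbook variant): ln(N / df); 0 when no document contains the term.
function inverseDocumentFrequency(term, allDocs) {
  const containing = allDocs.filter((docTokens) => docTokens.includes(term)).length;
  return containing === 0 ? 0 : Math.log(allDocs.length / containing);
}

// TF-IDF weight of a term within one document of the collection.
function tfidf(term, docTokens, allDocs) {
  return termFrequency(term, docTokens) * inverseDocumentFrequency(term, allDocs);
}

for (const term of ["quick", "brown", "dog"]) {
  console.log(term, inverseDocumentFrequency(term, corpus).toFixed(3));
}
console.log("dog in doc1:", tfidf("dog", corpus[0], corpus).toFixed(3));
```

The word dog appears in only one of the three documents, so it gets a higher IDF than quick or brown, which each appear in two. Rare words carry more information.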
In AQL both weighting methods are exposed via corresponding functions: BM25 and TFIDF. Note that both functions accept the corresponding AQL loop variable as the first argument. In 3.4 these functions have the following limitation: it is not possible to use the evaluated score value in generic AQL expressions (e.g. filters or calculations); the functions must be used in a SORT statement related to an ArangoSearch view.
Let’s imagine that we’re extremely interested in a nice sci-fi movie. I want to find something that matches the following set of keywords: “amazing, action, world, alien, sci-fi, science, documental, galaxy”:
db._query("FOR d IN v_imdb SEARCH ANALYZER(d.description IN TOKENS('amazing action world alien sci-fi science documental galaxy', 'text_en'), 'text_en') LIMIT 10 RETURN d")
Without any ranking I got “The Prime of Miss Jean Brodie” at the very beginning of the result set. That is definitely not what I wanted to watch. You may get different results, though: the order is non-deterministic since we haven’t specified any sorting condition. Let’s sort by the BM25 score:
db._query("FOR d IN v_imdb SEARCH ANALYZER(d.description IN TOKENS('amazing action world alien sci-fi science documental galaxy', 'text_en'), 'text_en') SORT BM25(d) LIMIT 10 RETURN d")
After inspecting the result set, you’ll notice that first place is taken by the movie “Religulous”. Again, that is not particularly relevant to our original query. The reason is the following: in the query above, we didn’t explicitly specify the sorting order. The default is ASC, which means that we implicitly asked ArangoDB to show us the least relevant results first. Let’s fix it by explicitly stating our intention to get the most relevant docs at the beginning of the result set:
db._query("FOR d IN v_imdb SEARCH ANALYZER(d.description IN TOKENS('amazing action world alien sci-fi science documental galaxy', 'text_en'), 'text_en') SORT BM25(d) DESC LIMIT 10 RETURN d")
And now the most relevant document is “AVPR: Aliens vs. Predator – Requiem”. Yes, that’s definitely better.
Query time relevance tuning
Another crucial point of ArangoSearch is the ability to fine-tune document scores evaluated by relevance models at query time. That functionality is exposed in AQL via the BOOST function. Similarly to the ANALYZER function, the BOOST function accepts any other valid SEARCH expression and a boost factor. It does nothing but imbue the provided boost value into the SEARCH expression context, making it available for scorer functions. The documents that match the boosted part of the search expression will get higher scores.
Let’s continue playing with sci-fi movies. The results we got in the previous section are already good enough, but what if we want something more relevant to outer space, a movie that can show us the beauty of distant interstellar galaxies? For the sake of our interest in sci-fi movies, we can ask ArangoSearch to prefer “galaxy” over the other keywords:
db._query("FOR d IN v_imdb SEARCH ANALYZER(d.description IN TOKENS('amazing action world alien sci-fi science documental', 'text_en') || BOOST(d.description IN TOKENS('galaxy', 'text_en'), 5), 'text_en') SORT BM25(d) DESC LIMIT 10 RETURN d")
And the winner is “Star Trek Collection”, a super relevant result indeed.
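Why does a boost change the winner? The plain JavaScript sketch below shows the idea under a strong simplification: a document’s score is assumed to be the sum of the weights of the matched clauses, and a boost simply multiplies the weight of its clause. The weights, the sample token lists and the helper scoreDocument are all hypothetical; the real scorers compute clause weights statistically.

```javascript
// Score = sum over matched clauses of (weight * boost); boost defaults to 1.
function scoreDocument(tokens, clauses) {
  return clauses.reduce(
    (total, { term, weight, boost = 1 }) =>
      tokens.includes(term) ? total + weight * boost : total,
    0
  );
}

// Hypothetical analyzed descriptions of two movies.
const alienMovie = ["alien", "action", "world"];
const spaceMovie = ["galaxy", "science"];

// Every keyword gets the same made-up base weight of 1.
const keywords = ["amazing", "action", "world", "alien", "science", "galaxy"];
const plainClauses = keywords.map((term) => ({ term, weight: 1 }));
const boostedClauses = keywords.map((term) =>
  term === "galaxy" ? { term, weight: 1, boost: 5 } : { term, weight: 1 }
);

console.log("plain:  ", scoreDocument(alienMovie, plainClauses), scoreDocument(spaceMovie, plainClauses));
console.log("boosted:", scoreDocument(alienMovie, boostedClauses), scoreDocument(spaceMovie, boostedClauses));
```

Without the boost the movie matching three keywords wins; once “galaxy” counts five times as much, the movie mentioning it moves to the top even though it matches fewer keywords.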
For information retrieval experts ArangoSearch provides an easy and straightforward way of fine-tuning weighting schemes at query time, e.g. BM25 accepts free coefficients as the parameters and may be turned into BM15:
db._query("FOR d IN v_imdb SEARCH ANALYZER(d.description IN TOKENS('amazing action world alien sci-fi science documental', 'text_en') || BOOST(d.description IN TOKENS('galaxy', 'text_en'), 5), 'text_en') SORT BM25(d, 1.2, 0) DESC LIMIT 10 RETURN d")
With BM15 we have a new champion: “The Ice Pirates” movie.
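To see what setting the last coefficient to 0 does, here is a plain JavaScript sketch of the textbook Okapi BM25 score for a single term. It is an illustration under assumptions: the function bm25TermScore, its parameter names and the sample numbers are made up, and the implementation inside ArangoSearch may differ in details. The coefficient k controls term frequency saturation and b controls how strongly the document length is taken into account.

```javascript
// Textbook BM25 contribution of one term to the score of one document.
function bm25TermScore({ tf, idf, docLength, avgDocLength, k = 1.2, b = 0.75 }) {
  // With b = 0 this factor is always 1, so document length no longer matters.
  const lengthNorm = 1 - b + b * (docLength / avgDocLength);
  return (idf * (tf * (k + 1))) / (tf + k * lengthNorm);
}

// Two hypothetical documents with the same term frequency but different lengths.
const shortDoc = { tf: 2, idf: 1, docLength: 50, avgDocLength: 100 };
const longDoc = { tf: 2, idf: 1, docLength: 200, avgDocLength: 100 };

const bm25Short = bm25TermScore({ ...shortDoc, k: 1.2, b: 0.75 });
const bm25Long = bm25TermScore({ ...longDoc, k: 1.2, b: 0.75 });
const bm15Short = bm25TermScore({ ...shortDoc, k: 1.2, b: 0 });
const bm15Long = bm25TermScore({ ...longDoc, k: 1.2, b: 0 });

console.log("b = 0.75:", bm25Short.toFixed(3), bm25Long.toFixed(3));
console.log("b = 0:   ", bm15Short.toFixed(3), bm15Long.toFixed(3));
```

With b = 0.75 the shorter document scores higher for the same number of occurrences, while with b = 0 both documents get the same score. That change in length normalization is why the ranking can shift when switching to BM15.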