ArangoSearch Improvements in ArangoDB 3.7
Disclaimer: These functionalities are currently available only in the Alpha 2 release of the upcoming ArangoDB 3.7. You can download the Alpha 2 preview here.
ArangoDB 3.7 will come with many new features and improvements for ArangoSearch, ArangoDB’s integrated search and ranking engine.
This tutorial will detail those features with examples available via an interactive demo based on Repl.it at the bottom of this page.
The new features include:
- Fuzzy filter support
- Stored Values in ArangoSearch Views
- A new LIKE operator
- Enhanced PHRASE functionality
To make it more fun, we will use the IMDB dataset, which contains data about movies such as their titles and descriptions.
Stored Values in ArangoSearch Views
Typically, when performing queries, you FILTER on some criteria to locate a document and then return either the entire document or a few of its attributes. This is already quite fast in ArangoDB, and ArangoSearch Views can now make it faster still. Views can now store selected fields of documents. A query that only needs those fields can return them directly from the view, without loading the entire document into memory from the storage engine and inspecting it. To add stored values to a view, assign the attributes you would like to store to the storedValues attribute in the view properties, like so:
db._createView("imdb_with_stored_values", "arangosearch", { storedValues: [ "title" ] })
Running the following query against our IMDB dataset:
- Performs faster than the same query on a view without stored values
- Applies late document materialization, thanks to the SORT + LIMIT combination
FOR d IN imdb_with_stored_values SEARCH d.type == 'Movie' SORT bm25(d), d.title LIMIT 100 RETURN d.title
This is a trivial example: we retrieve all of the movies (not actors) in the dataset and sort them by score and movie title. Even with this simple example, execution can be several times faster than without stored values. The Repl has two examples: the first query uses a view without stored values, while the second uses a view that has been configured to store the title attribute. You can inspect the code and AQL queries in the Repl editor at the bottom of this page and click the respective buttons to see the execution stats for each query.
Values are now also stored implicitly when you set a primarySort. Storing groups of values brings further performance gains: the query optimizer prefers a stored group of values over single stored values if the group covers the query more completely. For everything else that comes with stored values, please refer to the documentation.
You can also run these queries in a sandbox WebUI running on Oasis by going here:
WebUI: https://tutorials.arangodb.cloud:8529/
Username: ArangoUser
Password: arango37
LIKE Support
Before the release of 3.7, LIKE was already available as an AQL function but could not be used as part of a SEARCH expression. LIKE matches strings against a pattern that contains wildcards; our example uses the percent symbol (%). Wildcard searches over large datasets can noticeably hurt performance. With the addition of LIKE support in ArangoSearch, these searches can now take full advantage of the ArangoSearch indexes. This offers a nice performance increase and fits in naturally with other ArangoSearch keywords and functions. There are two syntax options for LIKE: it can be used as a function, as demonstrated in the docs, or as a keyword, as shown below.
For example, this query searches our IMDB dataset for titles that begin with ‘Star Wars’. The first wildcard (%) allows anything to follow ‘Star Wars’, but the rest of the title must contain the letter ‘V’, since the Star Wars movies are numbered with Roman numerals. The second wildcard (%) after the ‘V’ allows anything to follow it as well.
FOR d IN imdb_norm SEARCH ANALYZER(d.title LIKE "Star Wars%V%", "normAnalyzer") RETURN d.title
The results of this query include:
"Star Wars: Episode IV - A New Hope", "Star Wars: Episode VI - Return of the Jedi", "Star Wars: Episode V: The Empire Strikes Back"
Note that Episode IV shows up even though the V comes after the I. The first wildcard matches any characters before the V, so this title still qualifies.
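To see why Episode IV qualifies, the wildcard semantics can be sketched outside the database. The Python snippet below is only an illustration, not how ArangoSearch evaluates LIKE. It assumes that % matches any run of characters (including none) and _ matches exactly one character, it ignores escaping, and the helper names like_to_regex and like are made up for this example.

```python
import re


def like_to_regex(pattern):
    """Translate a LIKE-style pattern into a regular expression.

    Assumption for this sketch: % matches any run of characters
    (including none) and _ matches exactly one character.
    Escaping a literal % or _ is not handled.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def like(value, pattern):
    """Return True if the whole value matches the pattern (case-sensitive)."""
    return like_to_regex(pattern).fullmatch(value) is not None


titles = [
    "Star Wars: Episode IV - A New Hope",
    "Star Wars: Episode I - The Phantom Menace",
    "Star Wars: Episode VI - Return of the Jedi",
    "Star Wars: Episode V: The Empire Strikes Back",
    "Star Wars: The Clone Wars",
    "The Star Wars Holiday Special",
]

# Keep only the titles that start with "Star Wars" and contain a "V" later on.
matches = [t for t in titles if like(t, "Star Wars%V%")]
for title in matches:
    print(title)
```

In this sketch the first % absorbs ": Episode I" before the V of "IV", which is why that title passes the same test as Episodes V and VI.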
You can inspect the code and AQL query that is running in the editor and click the respective buttons to see the execution stats for each query, along with the results of the query below the execution stats.
The addition of wildcard support allows for flexible queries that still return relevant results. You can use wildcards even without the LIKE keyword, as we do later on in the PHRASE query example.
Fuzzy Filter Support
Perhaps the most substantial addition among the ArangoSearch improvements in the 3.7 release is the ability to perform fuzzy searches. Support for this comes with three new functions:
NGRAM_MATCH()
NGRAM_SIMILARITY()
NGRAM_POSITIONAL_SIMILARITY()
These functions are available as part of an ArangoSearch SEARCH expression and in non-view AQL queries. This means that even if you aren’t using ArangoSearch Views in your query, you can take advantage of them. Note that NGRAM_MATCH will be far more performant when used with views, thanks to the search-optimized indexes, while the similarity functions do not use the indexes of views.
If you are not already familiar with fuzzy search, it returns approximate matches for the search terms, taking into account that the terms may be misspelled or may not match a phrase exactly. If you would like to learn more about this, read here about Approximate String Matching (Fuzzy search).
Fuzzy search is used in many applications and is a powerful tool for search engines. It accepts that users may not know the exact terms to search for, or may make typing errors, and finds results that align with the intent of the search. In the example below we use NGRAM_MATCH to find a movie title even though we have misspelled Star by leaving out the ‘a’.
FOR d IN imdb_with_stored_values SEARCH NGRAM_MATCH(d.title, "Str War", 0.7, "bigram") SORT BM25(d) DESC RETURN d.title
In this query we set up a typical FOR loop and iterate over the previously created view with stored values (stored values are not required for fuzzy search). Then we use the NGRAM_MATCH search function to search the titles of the movies in our view for values similar to our search phrase. The 0.7 is the threshold: how much ‘fuzziness’ or wiggle room we want to give the search.
The threshold indicates how far from the supplied phrase the results are allowed to stray. It must be between 0 and 1, and the closer it is to 1, the more closely the results must match the phrase. The last argument is the analyzer we are using, which deserves a little further explanation.
This ngram analyzer was configured with a min and max of 2, which means it looks at words 2 letters at a time. This is useful for determining the longest common sequence and context. The idea behind n-gram matching is to search for similar words, not necessarily exact matches. One of the simplest ways of calculating similarity between two words is to calculate the longest common sequence (LCS) of letters: the longer the LCS, the more similar the words. However, this approach has one big disadvantage, the absence of context. For example, the words ‘connection’ and ‘fonetica’ have a long LCS (o-n-e-t-i) but very different meanings. To add some context, ngram sequences are used.
Each word is split into a series of letter groups and these groups are then matched. If we take the same words but calculate similarity based on 3-grams (an ngram analyzer with a min and max of 3), we get a better similarity measure: con-onn-nne-nec-ect-cti-tio-ion vs. fon-one-net-eti-tic-ica gives a shorter LCS (zero matches). To remove the effect of length differences, we normalize the LCS length by word length. The result is a rating between 0 (no match at all) and 1 (fully matched). ArangoSearch makes this ngram-based rating available through the NGRAM_MATCH function.
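The idea described above can be tried out with a few lines of Python. This is a simplified sketch of the description, not the ArangoSearch implementation. The function names are made up for this example, and dividing by the number of n-grams in the target word is an assumption about how the normalization works.

```python
def ngrams(word, n):
    """Split a word into overlapping groups of n letters."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences.

    Works on strings (letter by letter) and on lists of n-grams.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def ngram_similarity(candidate, target, n):
    """Rating between 0 and 1: common n-gram sequence length over target n-gram count.

    Assumption for this sketch: the LCS is normalized by the number
    of n-grams in the target word.
    """
    target_grams = ngrams(target, n)
    if not target_grams:
        return 0.0
    return lcs_length(ngrams(candidate, n), target_grams) / len(target_grams)


# Letter-level LCS ignores context: o-n-e-t-i is shared.
print(lcs_length("connection", "fonetica"))

# With 3-grams the two words no longer share anything.
print(ngrams("connection", 3))
print(ngrams("fonetica", 3))
print(ngram_similarity("connection", "fonetica", 3))

# With 2-grams, "star" still shares "st" with the misspelled "str".
print(ngram_similarity("star", "str", 2))
```

A threshold such as the 0.7 used in the query would then be compared against this kind of rating, with higher thresholds admitting fewer, closer matches.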
This functionality is why we are still able to get relevant results even with misspelled words:
[ "Star Wars: Episode IV - A New Hope", "Star Wars: Episode I - The Phantom Menace", "Star Wars: Episode VI - Return of the Jedi", "Star Wars: Episode II - Attack of the Clones", "Star Wars: Episode III: Revenge of the Sith", "Star Wars Collection", "Star Wars: Episode V: The Empire Strikes Back", "A Nightmare on Elm Street 3: Dream Warriors", "Star Wars: The Clone Wars", "The Star Wars Holiday Special", "Star Wars: Revelations" ]
SIMILARITY
NGRAM_MATCH offers the full package of options for performing fuzzy search. Alongside it come two more functions:
NGRAM_SIMILARITY
NGRAM_POSITIONAL_SIMILARITY
These two functions are similar to NGRAM_MATCH, but they return the actual similarity value and cannot take advantage of the indexes of views. They perform the same kind of matching as NGRAM_MATCH but serve a different goal: returning similarity values for use in other calculations or scoring.
PHRASE Enhancements
Now that you have read about the improvements and additions to ArangoSearch, the PHRASE improvements help tie some of them together. The purpose of the updated PHRASE is to allow for a more natural way to combine ArangoSearch features in your AQL queries.
PHRASE can now be expressed as an arbitrary number of phrase parts. This means you can build a phrase expression using as many or as few ‘parts’ as needed, which improves readability in queries and allows for more flexibility.
The following query gives a first look at how multiple components can be combined within a single PHRASE call. It includes a path, term, skip token, wildcard, and analyzer all in one tidy PHRASE:
FOR d IN imdb_with_stored_values SEARCH PHRASE(d.title, ["Star Wars", 1, {WILDCARD: "%v%"} ], "text_en") RETURN d.title
This retrieves the expected results:
[ "Star Wars: Episode IV - A New Hope", "Star Wars: Episode VI - Return of the Jedi", "Star Wars: Episode V: The Empire Strikes Back" ]
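How the phrase parts line up against a title can be illustrated with a small Python sketch. It is a rough model for illustration only and does not reproduce the ArangoSearch implementation or the text_en analyzer: the tokenizer below simply lowercases and splits into words (no stemming), a number is assumed to skip that many tokens, and the names tokenize, wildcard_to_regex, and phrase_match are hypothetical.

```python
import re


def tokenize(text):
    """Crude stand-in for a text analyzer: lowercase word tokens, no stemming."""
    return re.findall(r"[a-z0-9]+", text.lower())


def wildcard_to_regex(pattern):
    """Assumed semantics: % matches any run of characters, _ exactly one."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def phrase_match(text, parts):
    """Check whether the phrase parts occur in order at some position in text.

    A string part contributes one or more exact tokens, an int skips that
    many tokens, and a {"WILDCARD": pattern} dict matches a single token.
    """
    tokens = tokenize(text)
    steps = []  # (offset from phrase start, predicate on a token)
    position = 0
    for part in parts:
        if isinstance(part, int):
            position += part
        elif isinstance(part, dict):
            regex = wildcard_to_regex(part["WILDCARD"])
            steps.append((position, lambda tok, r=regex: r.fullmatch(tok) is not None))
            position += 1
        else:
            for word in tokenize(part):
                steps.append((position, lambda tok, w=word: tok == w))
                position += 1
    # Try every start position that leaves room for the whole phrase.
    for start in range(len(tokens) - position + 1):
        if all(test(tokens[start + offset]) for offset, test in steps):
            return True
    return False


phrase = ["Star Wars", 1, {"WILDCARD": "%v%"}]
for title in [
    "Star Wars: Episode IV - A New Hope",
    "Star Wars: Episode I - The Phantom Menace",
    "Star Wars: Revelations",
]:
    print(title, phrase_match(title, phrase))
```

In this model the skip token steps over "episode", so the wildcard is tested against the Roman numeral only. That is why a title like "Star Wars: Revelations" does not match even though it contains a "v".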
The PHRASE enhancements also come with the ability to evaluate terms with nested arrays. With the upcoming 3.7 Alpha 3 release, Levenshtein matching will bring fuzzy search to PHRASE as well.
Try it out now!
These queries, along with a sandbox for trying your own read-only queries, are available below in the Repl.it editor. You can also explore the full IMDB dataset and the ArangoSearch Views, and run all of these queries in the ArangoDB WebUI for free, right now, by visiting this link and using these credentials:
WebUI: https://tutorials.arangodb.cloud:8529/
Username: ArangoUser
Password: arango37