
State of the Art Preprocessing and Filtering with ArangoSearch

Estimated reading time: 10 minutes

Just in case you haven't heard about ArangoSearch yet, it is a high-performance full-text search engine integrated into ArangoDB (i.e., connected with the other data models and AQL). Feel free to check out ArangoSearch – Full-text search engine including similarity ranking capabilities for more details.

In ArangoDB 3.7 the ArangoSearch team added Fuzzy Search support (see the comprehensive article Fuzzy search by Andrey Abramov). With Fuzzy Search, data preprocessing and filtering become even more important, and in the upcoming ArangoDB 3.8 release the ArangoSearch efforts are focused on improving exactly this part. In this post I'm going to uncover some of the new features we are proud to present.

We have added the ability to use AQL queries directly in ArangoSearch analyzers. This makes it possible to create "calculated" fields for ArangoSearch Views and to use most AQL features to filter and preprocess your data.

The second feature is analyzer chaining. Before ArangoDB 3.8 there was no way to combine several analyzers. With this new feature you can, for example, add case conversion in front of an ngram analyzer, removing the need for separate "fixed-case" fields.

A brief history of analyzers in ArangoDB

From the very beginning of ArangoSearch (i.e., ArangoDB 3.4), data-processing capabilities have been exposed through an entity named Analyzer. In a nutshell, an analyzer defines a transformation algorithm and its parameters. Back then we offered a predefined set of analyzers: analyzers for processing numeric, boolean, and null data, and analyzers providing text tokenization and normalization for several languages. These are called built-in analyzers and are always available.

The fixed parameters of built-in analyzers limited the use cases for ArangoSearch. Thus we introduced custom analyzers for textual data in ArangoDB 3.5. This opened up a wide range of possible data transformations, as from that moment users could configure text data processing algorithms themselves. In the documentation you can find the list of available analyzer types.

In ArangoDB 3.6 we added more features to the existing analyzer types to support autocomplete scenarios: anchoring markers and UTF-8 support for the `ngram` analyzer, and `edgeNgram` support for the `text` analyzer.

While we were working hard to provide more and more data processing algorithms, we had missed one important opportunity: why not allow users to define their own processing and filtering algorithms? And as we already have a super powerful query language, AQL, why not unleash its power in ArangoSearch data processing? So we decided to do it the AQL way.

Do it the AQL way

AQL has many useful functions for textual data processing. Our first intention was to expose these functions through a new analyzer type, `calculation`. This looked promising: the user just specifies a function name and parameters, the data is passed to the function via a parameter with a fixed name, the analyzer internally calls the function implementation, and the result is used for indexing. With one piece of code we would add all existing and future AQL functions to the ArangoSearch data processing toolbox.

But why use only AQL functions and not the whole of AQL? So we made another step forward and decided to allow running a full AQL query to filter and process data for indexing (and renamed the analyzer type to `aql`).

Of course, running an AQL query inside an analyzer required a bit more work. First of all, we need a database to run the query in, and the user running the analyzer must be allowed to run a query in that database. The obvious choice was the current database: if we are indexing data from some collection, we clearly have access to that collection and consequently to the database (so at the very least we could run a query there). But what if we are executing a query like the following?

RETURN TOKENS("Some text", "someAnalyzer")

Here, at the analyzer level, we don't have any clue which database we are using. Of course we could extract the current database from the user context and somehow pass this information to the analyzer interface.

But that would lead us to another pitfall: the recovery process. In short, ArangoDB has a WAL-based recovery procedure. After a system disaster the WAL is replayed and the database tries to restore as much data as possible, and the ArangoSearch index is part of this process: if the WAL contains data modifications for indexed collections, these modifications are reflected in the ArangoSearch index. To properly add data to the ArangoSearch index we need analyzers to be runnable during recovery, but at the same time the database is not queryable during recovery. This rules out using the current database for running aql analyzer queries. A 'sandbox' database was created to circumvent this limitation. All instances of aql analyzers now internally use this sandbox database. It has no collections, no users, no other data of any kind; it doesn't even have a storage representation. So it is very lightweight and always ready to run a query, even during recovery of other databases. And as the sandbox contains no data, it never needs to be recovered itself.

From here, only a few things needed to be added. In contrast with regular AQL query execution, here we know for sure that the query will not change between executions: it is set at the moment of analyzer creation and cannot be altered. This makes it possible to parse the query, build its AST, and instantiate an execution plan only once. Using the analyzer then just means binding a new value into the prepared query and running it, which gave us a significant performance improvement.

One more option deserves to be mentioned. By default, if the aql analyzer encounters null it emits an empty string token, but 'keepNull' can make it discard nulls instead. This can be used to filter data in the index: just set keepNull = false and return null whenever you want to exclude a value from the index. One could even run an AQL loop and return nulls and strings accordingly.
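
For illustration, here is a minimal sketch of such a filtering analyzer (the analyzer name `non_empty` and the query itself are just hypothetical examples):

var analyzers = require("@arangodb/analyzers");
// Hypothetical analyzer: index the trimmed input, but return null for blank
// strings, so that (with keepNull set to false) nothing gets indexed for them.
analyzers.save(
  "non_empty", "aql",
  { queryString: "RETURN LENGTH(TRIM(@param)) > 0 ? TRIM(@param) : null",
    keepNull: false });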

Of course this description wouldn't be complete without the limitations we've put in place. As the aql analyzer uses the sandbox database, no data access or modification queries are allowed (e.g. INSERT/UPDATE/REMOVE statements, or collection/view enumerations).

The second limitation concerns the allowed AQL functions. The `aql` analyzer needs to be runnable on DBServers (in a cluster deployment), as the indexing process takes place there. That's why we have forbidden anything V8-related and limited the functions to those that can run on a DBServer. So you cannot use user-defined functions or functions like DOCUMENT.

Another limitation concerns analyzer-dependent functions. As mentioned above, the analyzer should be able to run during recovery, and there is no guarantee that custom analyzers are fully loaded at that point. So functions referencing other analyzers (e.g. TOKENS) are forbidden.

And last but not least, the memory limit. The analyzer is run for each query involved in indexing, and there could be many queries running simultaneously, so memory consumption of every single analyzer instance must be limited. This is controlled by the `memoryLimit` property. It defaults to 1 MB and can be raised up to 32 MB.
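
For example, raising the limit could look like the following sketch (the analyzer name `heavy_calc` and the query are hypothetical; the value is given in bytes, with 33554432 corresponding to the 32 MB maximum mentioned above):

analyzers.save(
  "heavy_calc", "aql",
  { queryString: "RETURN UPPER(@param)",
    memoryLimit: 33554432 }); // 32 MB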

As a result, we added to ArangoSearch a new analyzer with the ability to use all AQL calculation features, and even loops, to process and filter text data, with only reasonable limitations. While the aql analyzer provides a rich set of options, it is still best to use a dedicated analyzer for certain transformations like text stemming. But what if we need to combine features of several analyzers into one? For example, we may want to apply an `aql` transformation to already stemmed values. Here comes the other new feature in ArangoDB 3.8.

Let’s make a pipeline

Chaining several analyzers into one has been a long-running story for ArangoSearch. For historical reasons, ArangoSearch allowed applying several analyzers to one field of a document, but the output of each analyzer was stored in the index separately. So there was no way for one analyzer to process the output of another. There were several workarounds, like adding an additional field to the document whose value was produced by the `TOKENS` function (or any other AQL function) and then defining another analyzer for that particular field. But that was a very verbose solution.

In ArangoDB 3.8 we are planning to add a new analyzer type, `pipeline`. The idea is simple: it is a container for several custom analyzers, where the output of one analyzer is directly connected to the input of the next, and the output of the last analyzer in the pipeline is the output of the whole analyzer.

The tricky part was calculating token positions within the initial document correctly. As several analyzers in the pipeline could be tokenizers (i.e., produce more than one output for a single input), taking the token position from the last or the first analyzer in the pipeline was not an option. We have developed a new algorithm for calculating positions that reflects all position moves in the pipeline.
In general, the whole pipeline now acts as a single analyzer from the user's perspective. If all pipeline members hold position, the pipeline holds position as well. If one or more members change positions, the pipeline moves accordingly, taking into account the real "moves" and "resets" caused by upstream pipeline member moves. So it will be possible to use it with the `PHRASE` and `NGRAM_MATCH` ArangoSearch functions.

Putting it all in action

So let's see how all of these features work. We will use the imdb dataset for demonstration. As an example task, let's build a search system that finds actors whose names "sound like" our search pattern. To compute the "sound" code we will use the well-known 'soundex' algorithm, which translates arbitrary text strings into a short code. For example, the name 'John' is encoded as 'J500' and 'Jack' as 'J200'. These codes can then be indexed and searched by ArangoSearch.
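
You can check these codes directly in arangosh with the SOUNDEX AQL function:

db._query("RETURN [SOUNDEX('John'), SOUNDEX('Jack')]");

which should return [ [ "J500", "J200" ] ].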

To begin with, we will create a soundex analyzer based on the `SOUNDEX` function (all examples below are expected to be run with arangosh and assume the imdb sample dataset is loaded in the current database).

var analyzers = require("@arangodb/analyzers");
analyzers.save(
  "soundex", "aql", 
  { queryString:'RETURN SOUNDEX(@param)', batchSize:1 });

As you may note, the analyzer parameters are quite straightforward: queryString, where @param denotes the analyzer input, and batchSize, which is the size of the "block" used for internal AQL processing. As we expect only one output "row" here, we can save some memory and set the batch size to 1. The default value is 10, and for regular AQL queries this value is 1000.

Let’s see how it works:

db._query("RETURN TOKENS('test', 'soundex')");

Now let's search for actors whose names sound similar. To achieve this we will combine the soundex analyzer with the LEVENSHTEIN_MATCH function to add some 'fuzziness'.

First, we index all names with the soundex analyzer:

db._createView("soundex_name", "arangosearch", { 
  links: { 
    imdb_vertices: { 
      fields: { name: { analyzers: ['soundex'] } }, 
      includeAllFields:false }
    }
});

Now it's time to search for someone like "John":

db._query("
  FOR d IN soundex_name 
  SEARCH ANALYZER(LEVENSHTEIN_MATCH(d.name, SOUNDEX('John'),1), 
                  'soundex') 
  SORT BM25(d) DESC 
  LIMIT 10 
  RETURN d");

This will give us names like “Johnny White”, “John Hall”, “John Wood” and even “Ju-wan On”.

That already looks promising. But if we look at the names in imdb_vertices, it is easy to see that most of them consist of several parts, and we need to "soundex" each part individually rather than mix all parts into one value. To separate the parts we will use a delimiter analyzer and combine it with soundex via a pipeline.

analyzers.save(
  "soundex_pipe", 
  "pipeline", 
  { 
    pipeline: [
      { type:"delimiter", properties: { delimiter:" " } },
      { type:"aql", properties: { queryString:'RETURN SOUNDEX(@param)', batchSize:1 } }
    ]
  }, 
  ["norm","frequency", "position"]);

As you can see, the 'pipeline' analyzer is as simple as an enumeration of the requested analyzers in an array. We need another view to index the data with the new analyzer:

db._createView(
  "soundex_pipe", 
  "arangosearch", { 
    links: { 
      imdb_vertices: {
        fields: { name: {analyzers: ['soundex_pipe'] } }, 
        includeAllFields:false }
    }
});
db._query("
  FOR d IN soundex_pipe 
  SEARCH ANALYZER(d.name == SOUNDEX('John'), 'soundex_pipe') 
  SORT BM25(d) DESC 
  LIMIT 10 
  RETURN {name:d.name}");

That will give us all names containing something sounding like “John”.

But what if we need names that are not "sounding" like "John" but written like "John"? That is a task for fuzzy search. It was already described earlier; here I just want to cover one particular case: case-insensitive n-gram matching (we've heard many requests for it in the community Slack). Here is how it works.

Let's create an analyzer for n-gramming. It will be a regular 'ngram' analyzer, but with normalization added in front of it:

analyzers.save(
  'bigram', 
  'pipeline', 
  { 
    pipeline: [ 
      { type:'norm',  properties: { locale:'en', case:'lower'} }, 
      { type:'ngram', properties: { min:2, max:2, preserveOriginal: false, streamType:'utf8' } }
    ]
  }, 
  ["frequency","norm","position"]);
db._createView(
  "ngram_view", 
  "arangosearch", { 
    links: { 
      imdb_vertices: {
        fields: { name: { analyzers: ['bigram'] } }, 
        includeAllFields: false }
    }
});
db._query("
  FOR d IN ngram_view 
  SEARCH NGRAM_MATCH(d.name, 'travolta', 0.75, 'bigram') 
  SORT BM25(d) DESC 
  LIMIT 10 
  RETURN d.name");

To check that we actually got case-insensitive matching, we can raise the threshold from 0.75 to 1 and still see "Travolta" in the output, despite the different "T" and "t".

 db._query("
  FOR d IN ngram_view 
  SEARCH NGRAM_MATCH(d.name, 'travolta', 1, 'bigram') 
  SORT BM25(d) DESC 
  LIMIT 10 
  RETURN d.name");

To illustrate running AQL loops with the `aql` analyzer, we will do custom delimiting with filtering.

analyzers.save(
  "delim", 
  "aql", 
  { 
    queryString: "FOR str IN SPLIT(@param, '-') FILTER !STARTS_WITH(str, 'foo') RETURN str", 
    batchSize:100
  }
);

And then we will run the following query:

db._query("RETURN TOKENS('fooA-booB-fooC', 'delim')");

And the result will be:

[
  [
    "booB"
  ]
]

As you can see, the input is split by the SPLIT function and the resulting array is enumerated, filtering out tokens that start with 'foo'.

Further improvements

For now an analyzer can only be applied to a particular field of an indexed document. It might be good to support document-wise filtering and data processing, where @param would denote not a field but the whole document; one could then add new fields, compute a value based on several other fields, or filter out the whole document from the index by means of the keepNull option.

Another improvement is related to data types. As one may note, for now custom analyzers operate only on strings. But index performance could benefit if we allowed data type conversion (e.g. converting a string date-time representation into a timestamp via the 'aql' analyzer and storing/searching it as a numeric value).

I hope this post provides useful insight into the upcoming features of the new ArangoDB 3.8 release. Feel free to leave comments below or ping me on the ArangoDB Community Slack (@Dronplane.arangodb).



Andrei Lobov

Andrei is a C++ developer on the ArangoDB core team with 10+ years of experience in multithreaded automation systems. Currently he is developing the ArangoSearch engine. Andrei is passionate about building reliable, fault-tolerant software.
