Present and Future of ArangoDB Fulltext Index

January 11, 2018 / General

The ArangoDB fulltext index allows you to search for text in arbitrary strings. It is a great way to implement things like autocompletion, product searches, or many other use cases that need some form of fulltext search.
The fulltext index is suitable for you if your use case can be broken down to:

Full matches of words
Prefix matches of words
You do not need a “ranking” of the matching documents

Usage Example

Using the fulltext index is fairly straightforward. You create the index on an existing emails collection and insert a few documents:

// arangosh; assumes a collection named "emails" already exists
var emails = db.emails;
emails.ensureIndex({ type: "fulltext", fields: [ "text" ], minLength: 3 });
emails.insert({ text: "banana apple" });
emails.insert({ text: "banana mango" });
emails.insert({ text: "banana avocado" });

Search for documents

FOR mail IN FULLTEXT(emails, "text", "banana,-apple")
    RETURN mail

This returns documents that contain “banana” but not “apple”. Other word combinations are possible; for more information, check the Fulltext documentation page.
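To make the matching rules more tangible, here is a toy in-memory model in plain JavaScript. It is only a sketch and not how ArangoDB implements the index: the `docs` array and the `words`, `matches` and `search` helpers are made up for illustration, and the `prefix:` qualifier follows our reading of the FULLTEXT query syntax, so please check the documentation before relying on it.

```javascript
// Toy in-memory model of fulltext matching; NOT ArangoDB's implementation.
const docs = [
  { text: "banana apple" },
  { text: "banana mango" },
  { text: "banana avocado" },
];

// Split a text into lowercase words (whitespace tokenization only).
function words(text) {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

// Check one document against a query such as "banana,-apple" or "prefix:avo".
// Every comma-separated term must hold: plain terms must be present,
// terms starting with "-" must be absent.
function matches(doc, query) {
  const docWords = words(doc.text);
  return query.split(",").every((rawTerm) => {
    let term = rawTerm.trim().toLowerCase();
    const exclude = term.startsWith("-");
    if (exclude) term = term.slice(1);
    const isPrefix = term.startsWith("prefix:");
    if (isPrefix) term = term.slice("prefix:".length);
    const found = docWords.some((w) =>
      isPrefix ? w.startsWith(term) : w === term
    );
    return exclude ? !found : found;
  });
}

function search(query) {
  return docs.filter((doc) => matches(doc, query)).map((doc) => doc.text);
}

console.log(search("banana,-apple")); // texts with "banana" but without "apple"
console.log(search("prefix:avo"));    // texts with a word starting with "avo"
```

Note that the result is an unordered set of matches. There is no score that says one document matches "better" than another, which is what the missing ranking mentioned above refers to.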


Dealing with Human Language

The current fulltext index has a number of shortcomings when it comes to dealing with arbitrary languages, substring search, and especially non-Latin languages like Chinese, Japanese or Arabic. The following is a list of tasks which the current fulltext index does not yet perform out of the box, but which might be relevant for your use case:

1. Normalizing Words

Removing diacritics
Reducing words to their word stems
Matching synonymous words

Removing diacritics means stripping marks like ^, ° or ` from characters, e.g. turning “ç” into “c”, so that “Curaçao” also matches “Curacao”.
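Until the index does this for you, you could normalize the text in your application before inserting and before querying. A minimal sketch in plain JavaScript, assuming Unicode normalization is good enough for your data (`removeDiacritics` is a name we made up for this example):

```javascript
// Decompose each character into base letter + combining marks (NFD),
// then drop the combining marks (Unicode category M).
function removeDiacritics(word) {
  return word.normalize("NFD").replace(/\p{M}/gu, "");
}

console.log(removeDiacritics("Curaçao")); // prints "Curacao"
```

The same function has to be applied to both the stored text and the search term, otherwise the two will not meet.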

Word stems are the “common” forms of words, as opposed to word inflections. For example, English builds plurals by adding an -s (house / houses), and verbs have inflected forms such as pay, paid, paying. A lot of the time we also want to match inflected forms of a word, which is possible by indexing only the word stem.
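The idea can be sketched with a deliberately naive stemmer in plain JavaScript. This is an illustration only: `toyStem` and its two suffix rules are invented for this example, and real stemmers (for instance the Porter or Snowball families) use far more rules per language.

```javascript
// Irregular forms cannot be derived by rules and need a lookup table.
const irregularForms = new Map([["paid", "pay"]]);

// Naive English-only stemmer: handles a plural "-s" and an "-ing" ending.
function toyStem(word) {
  const w = word.toLowerCase();
  if (irregularForms.has(w)) return irregularForms.get(w);
  if (w.endsWith("ing") && w.length > 5) return w.slice(0, -3);
  if (w.endsWith("s") && !w.endsWith("ss") && w.length > 3) return w.slice(0, -1);
  return w;
}

// Index the stem instead of the word, and stem the search term the same way.
console.log(["house", "houses"].map(toyStem));       // both become "house"
console.log(["pay", "paid", "paying"].map(toyStem)); // all become "pay"
```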

Synonymous words should match as well, e.g. a search for “quick” should also match “fast”.
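One simple way to get there is to map every word of a synonym group to a single canonical word, both at indexing time and at query time. A rough sketch in plain JavaScript (the `synonymGroups` table and the `canonicalize` helper are hypothetical, and picking the first word of a group as the canonical one is an arbitrary choice):

```javascript
// Each inner array is one group of words that should match each other.
const synonymGroups = [["quick", "fast"]];

// Map every word of a group to the group's first word.
const canonicalWord = new Map();
for (const group of synonymGroups) {
  for (const word of group) {
    canonicalWord.set(word, group[0]);
  }
}

function canonicalize(word) {
  const w = word.toLowerCase();
  return canonicalWord.has(w) ? canonicalWord.get(w) : w;
}

console.log(canonicalize("fast") === canonicalize("quick")); // prints true
```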

2. Removing Stop-Words

Commonly used words such as “the” and “and” in English, or “und”, “am” and “an” in German, are not relevant for a lot of use cases. Removing these words before indexing can improve the perceived quality of search results.
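If you need this today, you could filter the words in your application before handing the text to the database. A small sketch in plain JavaScript, assuming whitespace-separated English text (the `stopWords` list and the `indexableWords` helper are made up for this example; real stop-word lists are much longer and language specific):

```javascript
// Tiny example list of English stop words.
const stopWords = new Set(["the", "and"]);

// Lowercase the text, split it on whitespace and drop the stop words.
function indexableWords(text) {
  return text
    .toLowerCase()
    .split(/\s+/)
    .filter((word) => word.length > 0 && !stopWords.has(word));
}

console.log(indexableWords("The banana and the mango")); // [ 'banana', 'mango' ]
```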

3. Identifying Word Boundaries

In non-Latin languages such as Chinese, a word may consist of one or more characters, and words are not separated by spaces. For example, the Chinese word “公共汽車” means bus, while “汽車” on its own means car.
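Splitting such text on whitespace therefore does not produce words. One classic approach is dictionary-based segmentation with a "longest match first" rule, sketched below in plain JavaScript. The two-entry `dictionary` and the `segment` function are hypothetical and only there to show the principle; real segmenters rely on large dictionaries and statistical models.

```javascript
// Hypothetical mini dictionary built from the two words in the example above.
const dictionary = new Set(["公共汽車", "汽車"]);
const maxWordLength = 4; // longest dictionary entry, in characters

// Forward maximum matching: at each position take the longest dictionary
// word that fits, falling back to a single character if nothing matches.
function segment(text) {
  const chars = Array.from(text);
  const result = [];
  let i = 0;
  while (i < chars.length) {
    let matched = chars[i];
    const longest = Math.min(maxWordLength, chars.length - i);
    for (let len = longest; len > 1; len--) {
      const candidate = chars.slice(i, i + len).join("");
      if (dictionary.has(candidate)) {
        matched = candidate;
        break;
      }
    }
    result.push(matched);
    i += Array.from(matched).length;
  }
  return result;
}

console.log(segment("公共汽車")); // one word (bus), not "公共" + "汽車"
console.log(segment("汽車"));     // one word (car)
```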

What’s next?

For the next release of ArangoDB, we are planning a new feature which will offer advanced text analysis functionality (and more). It will allow you to match by similarity and to sort result sets according to several different scoring algorithms.

This will allow you to index and analyze human language input, as well as data from other scientific domains, via customizable analyzers. It will, for example, be possible to store DNA sequences in ArangoDB and later search for sub-sequences by similarity.

For more information check out this section of the ArangoDB docs.
[Editor's note: The current working title of the upcoming feature is IResearch. We are thinking of another name for it, so the name might change.]

Jan Steemann

After more than 30 years of playing around with 8-bit computers, assembler and scripting languages, Jan decided to move on to work in database engineering. Jan is now a senior C/C++ developer with the ArangoDB core team and has been there since version 0.1. He mostly works on performance optimization, storage engines and the querying functionality. He also wrote most of AQL (ArangoDB’s query language).

January 11 2018,Jan Steemann

4 Comments

Bramandityo Prabowo on January 23 2018, at 11:43 am

i wait this feature to mature enough to beat elasticsearch.

- ArangoDB database on January 24 2018, at 4:12 pm
  
  We will release ArangoSearch very soon and it would be of great help for us, if you take it for a spin and let us know, if something is missing. We hope to cover already a lot of use cases with the initial release of ArangoSearch but of course user feedback will round the edges faster 🙂
  
  - David Marko on January 26 2018, at 3:50 pm
    
    What is ArangoSearch ?
    
  - yehosef on January 28 2018, at 11:47 am
    
    I’d be interested to check it out when it’s ready – let me know – myname at gmail
    
