Present and Future of ArangoDB Fulltext Index
The ArangoDB Fulltext index allows you to search for text in arbitrary strings. It is a great way to implement things like autocompletion, product searches or many other use-cases which need some form of fulltext search.
The Fulltext Index is suitable for you if your use-case can be broken down to:
- Full matches of words
- Prefix matches of words
- No “ranking” of the matching documents
Usage Example
Using the fulltext index is fairly straightforward. You create the index on an existing emails collection and insert a few documents:
emails.ensureIndex({ type: "fulltext", fields: [ "text" ], minLength: 3 })
emails.insert({ text: "banana apple" })
emails.insert({ text: "banana mango" })
emails.insert({ text: "banana avocado" })
Search for documents
FOR mail IN FULLTEXT(emails, "text", "banana,-apple")
RETURN mail
This will return documents containing “banana”, but not “apple”. Other word combinations are possible. For more information, check the Fulltext documentation page.
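To make the query semantics more tangible, here is a minimal plain JavaScript sketch that mimics what a query string like "banana,-apple" asks for. It is not how ArangoDB implements the index. The `tokenize`, `matches` and `search` helpers are hypothetical names, the in-memory array stands in for the collection, and only the `-` exclusion and the `prefix:` qualifier are modelled.

```javascript
// Hypothetical in-memory stand-in for the emails collection above.
const emails = [
  { text: "banana apple" },
  { text: "banana mango" },
  { text: "banana avocado" },
];

// Split a text into lowercase words (a simplification of real tokenization).
function tokenize(text) {
  return text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
}

// Check one text against a comma-separated query.
// Every term must hold: "-term" excludes, "prefix:term" matches word prefixes.
function matches(text, query) {
  const words = tokenize(text);
  return query.split(",").every((rawTerm) => {
    let term = rawTerm.trim().toLowerCase();
    const negated = term.startsWith("-");
    if (negated) term = term.slice(1);
    const isPrefix = term.startsWith("prefix:");
    if (isPrefix) term = term.slice("prefix:".length);
    const found = words.some((w) => (isPrefix ? w.startsWith(term) : w === term));
    return negated ? !found : found;
  });
}

// Return all documents whose "text" attribute satisfies the query.
function search(docs, query) {
  return docs.filter((doc) => matches(doc.text, query));
}

console.log(search(emails, "banana,-apple").map((d) => d.text));
// [ 'banana mango', 'banana avocado' ]
console.log(search(emails, "prefix:avo").map((d) => d.text));
// [ 'banana avocado' ]
```

Note that the sketch returns matches in insertion order, which mirrors the point made earlier: there is no ranking of the matching documents.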
Dealing with Human Language
The current fulltext index has a number of shortcomings when it comes to dealing with arbitrary languages, substring search and, especially, non-Latin languages like Chinese, Japanese or Arabic. The following is a list of tasks which the current fulltext index does not yet perform out of the box, but which might be relevant for your use-case:
1. Normalizing Words
- Removing diacritics
- Reducing words to their word stems
- Matching synonymous words
Removing diacritics means stripping marks like ^, ° or ` from letters, e.g. turning “ç” into “c” such that “Curaçao” will also match “Curacao”.
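As a rough illustration of this step, diacritics can be stripped in plain JavaScript by decomposing each character with Unicode normalization and dropping the combining marks. This is a sketch of preprocessing you could do yourself before storing text, not something the fulltext index does for you. `removeDiacritics` is a hypothetical helper name.

```javascript
// Decompose characters (NFD), then remove all combining marks (\p{M}).
function removeDiacritics(word) {
  return word.normalize("NFD").replace(/\p{M}/gu, "");
}

console.log(removeDiacritics("Curaçao")); // Curacao
```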
Word stems are the “common” forms of words as opposed to word inflections. For example, you build plurals in English by adding an -s (house / houses). Similarly, verbs have inflected forms like pay, paid, paying. A lot of the time we also want to match inflected forms of a word, which is possible by indexing only the word stem.
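The idea can be sketched with a deliberately naive stemmer. The two suffix rules and the tiny table of irregular forms below are assumptions chosen only to cover the examples in this post. Real stemming algorithms are considerably more elaborate, and `naiveStem` is a hypothetical helper, not an ArangoDB function.

```javascript
// Irregular forms that no suffix rule can handle (illustrative only).
const IRREGULAR = new Map([["paid", "pay"]]);

// Reduce a word to a crude stem: look up irregular forms, then strip "-ing" or "-s".
function naiveStem(word) {
  const w = word.toLowerCase();
  if (IRREGULAR.has(w)) return IRREGULAR.get(w);
  if (w.length > 5 && w.endsWith("ing")) return w.slice(0, -3);
  if (w.length > 3 && w.endsWith("s") && !w.endsWith("ss")) return w.slice(0, -1);
  return w;
}

console.log(["house", "houses"].map(naiveStem));       // [ 'house', 'house' ]
console.log(["pay", "paid", "paying"].map(naiveStem)); // [ 'pay', 'pay', 'pay' ]
```

If both the indexed text and the search terms go through the same function, “houses” and “house” end up as the same index entry.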
Words should also be matched against their synonyms, e.g. a search for “quick” should also match “fast”.
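One simple way to approximate this, assuming you maintain the synonym list yourself, is to map every synonym to a single canonical word before indexing and before searching. The `SYNONYMS` table and the `canonicalize` helper are hypothetical.

```javascript
// Each synonym points to one canonical word (illustrative entries only).
const SYNONYMS = new Map([["fast", "quick"]]);

// Replace a word by its canonical synonym, if one is known.
function canonicalize(word) {
  const w = word.toLowerCase();
  return SYNONYMS.has(w) ? SYNONYMS.get(w) : w;
}

console.log(canonicalize("fast") === canonicalize("quick")); // true
```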
2. Removing Stop-Words
Commonly used words such as “the” and “and” in English or “und”, “am” and “an” in German are not relevant for a lot of use-cases. Removing these words before indexing can improve the perceived quality of search results.
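A minimal sketch of stop-word removal, assuming a hand-written list that contains only the words mentioned above. In practice you would keep one list per language, since “am” and “an” are also ordinary English words. `removeStopWords` is a hypothetical helper.

```javascript
// Illustrative stop-word list taken from the examples in the text.
const STOP_WORDS = new Set(["the", "and", "und", "am", "an"]);

// Lowercase the text, split on whitespace and drop all stop words.
function removeStopWords(text) {
  return text
    .toLowerCase()
    .split(/\s+/)
    .filter((word) => word.length > 0 && !STOP_WORDS.has(word));
}

console.log(removeStopWords("The banana and the mango")); // [ 'banana', 'mango' ]
```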
3. Identifying Word Boundaries
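In non-Latin languages such as Chinese, words may consist of one or more characters, and text is written without spaces between words, so splitting on whitespace does not find the word boundaries. For example, in Chinese the word “公共汽車” means bus, but “汽車” on its own means car.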
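One classic approach to this problem is greedy longest-match segmentation against a dictionary. The sketch below assumes a tiny dictionary that holds only the two words from the example. Production analyzers use far larger dictionaries or statistical models, and `segment` is a hypothetical helper.

```javascript
// Illustrative dictionary with the two words from the example above.
const DICTIONARY = new Set(["公共汽車", "汽車"]);

// Greedy longest-match: at each position take the longest dictionary word,
// falling back to a single character if nothing matches.
function segment(text, dictionary) {
  const maxLen = Math.max(...[...dictionary].map((w) => [...w].length));
  const chars = [...text];
  const words = [];
  let i = 0;
  while (i < chars.length) {
    let matched = chars[i];
    for (let len = Math.min(maxLen, chars.length - i); len > 1; len--) {
      const candidate = chars.slice(i, i + len).join("");
      if (dictionary.has(candidate)) {
        matched = candidate;
        break;
      }
    }
    words.push(matched);
    i += [...matched].length;
  }
  return words;
}

console.log(segment("公共汽車", DICTIONARY));     // [ '公共汽車' ]
console.log(segment("汽車公共汽車", DICTIONARY)); // [ '汽車', '公共汽車' ]
```

Preferring the longest match is what keeps “公共汽車” (bus) from being indexed as “汽車” (car) plus two leftover characters.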
What’s next?
For the next release of ArangoDB, we are planning a new feature which will offer advanced text analysis functionality (and more). It will let you match documents by similarity and sort result sets according to several different scoring algorithms.
This will allow you to index and analyze human language input, as well as data from other scientific domains via customizable analyzers: It will, for example, be possible to store DNA sequences in ArangoDB and later search for sub-sequences by similarity.
For more information check out this section of the ArangoDB docs.
[Editor's note: The current working title of the upcoming feature is IResearch. We are thinking of another name for it, so the name might change.]
4 Comments
i wait this feature to mature enough to beat elasticsearch.
We will release ArangoSearch very soon and it would be of great help for us, if you take it for a spin and let us know, if something is missing. We hope to cover already a lot of use cases with the initial release of ArangoSearch but of course user feedback will round the edges faster 🙂
What is ArangoSearch ?
I’d be interested to check it out when it’s ready – let me know – myname at gmail