Word Embeddings in ArangoDB

June 25 2021 / General

Estimated reading time: 12 minutes

This post will dive into the world of Natural Language Processing by using word embeddings to search movie descriptions in ArangoDB.

In this post we:

Discuss the background of word embeddings
Introduce the current state-of-the-art models for embedding text
Apply a model to produce embeddings of movie descriptions in an IMDb dataset
Perform similarity search in ArangoDB using these embeddings
Show you how to query the movie description embeddings in ArangoDB with custom search terms
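
To make the similarity-search step concrete, here is a minimal sketch in plain Python. It assumes cosine similarity as the measure (the "numerator" in the query fragment quoted in the comments points that way, but that is an assumption), and the movie titles and three-dimensional vectors are made-up toy data, not values from the IMDb dataset or from a real embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def top_k(query_emb, docs, k=2):
    """Return the k (title, score) pairs most similar to query_emb."""
    scored = [(title, cosine_similarity(query_emb, emb))
              for title, emb in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy embeddings; real sentence embeddings have hundreds of dimensions.
toy_movies = {
    "Space Adventure": [0.9, 0.1, 0.0],
    "Courtroom Drama": [0.1, 0.9, 0.2],
    "Alien Invasion": [0.8, 0.2, 0.1],
}

if __name__ == "__main__":
    for title, score in top_k([1.0, 0.0, 0.0], toy_movies):
        print(f"{title}: {score:.3f}")
```

In the actual workflow the vectors would come from the embedding model and the scoring would run inside the database; this sketch only shows the ranking logic.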

Check it out on GitHub

Last updated: 23/08/2023 15:05:13

Continue Reading

ArangoML Part 4: Detecting Covariate Shift in Datasets

ArangoML Part 3: Bootstrapping and Bias Variance

ArangoML Part 2: Basic Arangopipe Workflow

Alex Geenen

Alex is a Machine Learning Ecosystem Engineer at ArangoDB. He is passionate about the practical application of new developments in the fast-moving field of Machine Learning.

June 25 2021, Alex Geenen

2 Comments

Fabio Mencoboni on July 2 2021, at 2:24 pm

Very cool tutorial, thanks for sharing. I am really excited about using ArangoDB with semantic queries, and this is a great overview. A couple of questions:
* If I understand correctly, this approach uses the DistilBERT model in Python to calculate embeddings for documents, which are then stored in ArangoDB.
* I have seen ArangoSearch used elsewhere, which I think did tokenization and embedding directly in the database. Do I understand the difference between these approaches correctly?
* The query uses the expression below to calculate the dot product of the query embedding and the document embedding. This implies a slower single-threaded approach, though if ArangoDB calculates this value for multiple documents concurrently under the hood, it would still benefit from multi-core processors. Any thoughts or comments on performance?
LET numerator = (SUM(
  FOR i IN RANGE(0, 767)
    RETURN TO_NUMBER(NTH(descr_emb, i)) * TO_NUMBER(NTH(v.word_emb, i))
))
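
For readers who think in Python more readily than AQL, the fragment above can be mirrored as the sketch below. The names `descr_emb` and `word_emb` come from the query; the function name `numerator` and the constant `EMBEDDING_DIM` are illustrative only. AQL's `RANGE(0, 767)` is inclusive, so the loop covers 768 positions, which matches the hidden size of the base DistilBERT model. Unlike `TO_NUMBER`, Python's `float()` raises an error on non-numeric input rather than returning 0.

```python
EMBEDDING_DIM = 768  # RANGE(0, 767) in AQL is inclusive, so 768 positions

def numerator(descr_emb, word_emb):
    """Dot product of two embeddings, element by element, as in the AQL fragment."""
    return sum(
        float(descr_emb[i]) * float(word_emb[i])
        for i in range(EMBEDDING_DIM)
    )

if __name__ == "__main__":
    # Toy vectors with a constant value in every position, for illustration only.
    print(numerator([1.0] * EMBEDDING_DIM, [0.5] * EMBEDDING_DIM))
```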

- Alex Geenen on July 6 2021, at 1:44 pm
  
  Hi Fabio,
  
  If I understand correctly, this approach uses the DistilBERT model in Python to calculate embeddings for documents, which are then stored in ArangoDB.
  
  Yes, that’s correct!
  
  I have seen ArangoSearch used elsewhere, which I think did tokenization and embedding directly in the database. Do I understand the difference between these approaches correctly?
  
  Yes, ArangoSearch allows you to perform tokenization and full-text search directly in the database. At this point, it doesn’t directly support word embeddings, which is the gap this tutorial fills. ArangoSearch does support vector space models such as BM25 and TF-IDF for scoring search results. Please see here if you want to learn more about them.
  
  The query uses the expression below to calculate the dot product of the query embedding and the document embedding. This implies a slower single-threaded approach, though if ArangoDB calculates this value for multiple documents concurrently under the hood, it would still benefit from multi-core processors. Any thoughts or comments on performance?
  
  Great question! The answer is that it depends. If you’re querying a single server, it will use a sequential scan (so a single thread). If you’re querying a collection on a cluster, and the collection is sharded across different servers, then there will be concurrency at the database server level, but within each server process the shard will still be scanned sequentially.
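
The behaviour described in this reply can be pictured with a small Python sketch: each shard is scanned sequentially, while separate shards are handled concurrently and their results merged. This is only a structural illustration, not how ArangoDB is implemented. The function names, the `title` and `word_emb` fields in the toy documents, and the use of a thread pool to stand in for separate database servers are all assumptions. Python threads also do not give true CPU parallelism for this kind of arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def scan_shard(shard, query_emb):
    """Sequentially score every document in one shard (one server's share)."""
    return [(doc["title"], dot(query_emb, doc["word_emb"])) for doc in shard]

def scan_cluster(shards, query_emb):
    """Scan shards concurrently, then merge and sort the per-shard results."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        per_shard = list(pool.map(lambda s: scan_shard(s, query_emb), shards))
    merged = [hit for hits in per_shard for hit in hits]
    return sorted(merged, key=lambda hit: hit[1], reverse=True)

if __name__ == "__main__":
    # Two hypothetical shards holding toy two-dimensional embeddings.
    shards = [
        [{"title": "A", "word_emb": [1, 0]}],
        [{"title": "B", "word_emb": [0, 1]}, {"title": "C", "word_emb": [1, 1]}],
    ]
    print(scan_cluster(shards, [1, 2]))
```

A single-server deployment corresponds to calling `scan_shard` once over the whole collection.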
  
