Global Relay: Modeling contextual relevance with search views and graph traversals
ArangoDB powers Global Relay’s intuitive directory search within its collaboration messaging suite
Results:
- Intuitive directory search to help individuals seamlessly find who they want to communicate with
- Easily applied contextual relevancy through the use of graph edge traversals
- Simplified development with graph and search in the same data store
The Scenario: A people directory with intuitive search
Global Relay is a leading provider of compliant electronic communications archiving, messaging, supervision, information governance, and eDiscovery. Its 20,000+ customers in 90 countries include highly regulated organizations and other corporate firms in sectors such as financial services, insurance, technology, energy (oil and gas), legal, government, and healthcare.
Global Relay’s collaboration suite allows organizations to communicate internally and externally while ensuring compliance, privacy, and security. One of the modules in this suite is instant messaging (IM), which provides individuals with voice and IM capabilities via mobile and desktop.
When it comes to IM, before someone can send a message to someone else, they need to be able to find the right person. To help its users find the person they wish to connect with, Global Relay wanted to offer intuitive directory search functionality. The goal of Global Relay’s intuitive directory search is to enable its users to find the person they want to talk to in a single search, ideally in the top five search results.
While this sounds simple, it’s challenging to achieve. As Phil Persad, senior architect at Global Relay, explains, “People are not documents, and document rating systems found in most search engines are often not great when you use them with people.” For example, search engines tend to use a combination of term frequency and inverse document frequency when computing relevance scores. With term frequency, a document is deemed more relevant the more often a search term appears in it. Inverse document frequency looks across the whole collection instead: words that appear in few documents are weighted more heavily than words that appear in many, which is beneficial for multi-term OR searches.
“Personal contact information typically is a short document with a relatively small number of fields, such as first name, last name, job title, and company,” elaborates Persad. “Because of this, term frequency isn’t applicable, and inverse document frequency rarely offers any useful scoring.”
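Persad’s point can be illustrated with a minimal Python sketch. The contact records below are invented for illustration, and the scoring function is a textbook TF-IDF rather than the formula used by any particular search engine:

```python
import math
from collections import Counter

# Hypothetical contact records flattened to text; not Global Relay data.
contacts = [
    "devin smith account manager acme",
    "maria lopez software developer acme",
    "john chan senior developer initech",
    "priya patel developer advocate initech",
]


def tf_idf(term, doc, corpus):
    """Textbook TF-IDF: raw term count times log(N / document frequency)."""
    tf = Counter(doc.split())[term]
    df = sum(1 for d in corpus if term in d.split())
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf


# Highest count of any single term inside any single record.
max_tf = max(max(Counter(d.split()).values()) for d in contacts)

# Score every record for the term "developer".
dev_scores = [tf_idf("developer", d, contacts) for d in contacts]
```

In this toy data no term occurs more than once per record, so term frequency cannot separate one match from another, and every record containing "developer" receives the same score.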
The Requirements: Contextual relevancy tuning that scales
Global Relay’s main requirements for its intuitive directory search capabilities were:
Field-based relevancy. When searching for people, not all fields are equal. Finding a match in a name is more relevant than finding a match in a job title or company field. “If you search for the term ‘Dev’, a person with the name Devin is likely more relevant than everyone with the word ‘developer’ in their job title,” Persad details.
Contextual relevancy. Common names that also match against company names or job titles – like the ‘Dev’ example above – can return a large number of results if only document-based ranking is used. Contextual relevancy factors in pre-existing relationships between individuals by adding weight to the people who are relevant to the user performing the search. For example, a user’s manager will have more weight than a coworker in a different part of the organization.
Case-insensitive analyzer that disables stemming. This capability prevents letter casing from affecting search results and avoids breaking a word down to its ‘stem’ by removing common suffixes such as -s, -es, -ing, or -ed. “With stemming, the name ‘James’ would stem down to ‘Jam’ because it ends in -es,” Persad clarifies. “And if someone searched ‘Jame’, they won’t find the stemmed version of James. We wanted to disable stemming on fields that represent names, titles, or companies.”
Late document materialization. By filtering data in a search view as much as possible before looking up a source document, Global Relay was looking to improve query performance substantially.
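The stemming problem from the analyzer requirement above can be sketched in a few lines of Python. The `naive_stem` function is a deliberately crude suffix stripper written only to reproduce the ‘James’ example; it is not the algorithm used by ArangoSearch or any real stemmer, and the prefix matching is an assumption that mirrors a type-ahead directory search:

```python
def naive_stem(token):
    """Toy suffix stripper for illustration only; real stemmers are more careful."""
    token = token.lower()
    for suffix in ("ing", "es", "ed", "s"):
        # Keep at least three characters so very short words are left alone.
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def normalize(token):
    """Case-insensitive, no stemming: lowercase only."""
    return token.lower()


def prefix_match(query, indexed_token):
    """True if the indexed token starts with the query string."""
    return indexed_token.startswith(query)


stemmed_index = naive_stem("James")    # what a stemming analyzer would store
plain_index = normalize("James")       # what a lowercase-only analyzer would store

found_with_stemming = prefix_match(normalize("Jame"), stemmed_index)
found_without_stemming = prefix_match(normalize("Jame"), plain_index)
```

With the toy stemmer the indexed token no longer starts with the query, so the partial name misses; with lowercase-only normalization it matches.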
Global Relay tried Elasticsearch, but things got complicated when it came to contextual relevancy. Elasticsearch has a parent/child document structure. In Global Relay’s case, a conversation would be a parent document, and an individual a child document. Applying contextual relevancy involves mapping between the two documents, resulting in the ID of the child document being a concatenation of the parent document, the conversation ID, and the child. Explains Persad, “That means if an individual is in a hundred different conversations, that participant object shows up a hundred different times. If any field relating to that person changes, such as their name, job title, or company, we would have to find that person 100 different times inside of 100 different conversations and update them in each place.”
Elasticsearch’s approach would have resulted in a combinatorial explosion of mappings, which simply wouldn’t scale. So, Global Relay needed to find a more scalable solution.
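The update cost Persad describes can be shown with a small Python sketch. It models the denormalized layout as a plain dictionary keyed by a conversation ID joined to a participant ID; the key format, field names, and sample person are hypothetical, and no Elasticsearch API is used:

```python
def build_index(num_conversations, person):
    """One copy of the participant per conversation, keyed by 'conversation:participant'."""
    return {
        f"conv-{i}:{person['id']}": dict(person)
        for i in range(num_conversations)
    }


def rename_everywhere(index, person_id, new_name):
    """Update every copy of the participant and report how many writes were needed."""
    writes = 0
    for doc in index.values():
        if doc["id"] == person_id:
            doc["name"] = new_name
            writes += 1
    return writes


# Hypothetical participant who appears in 100 conversations.
index = build_index(100, {"id": "p1", "name": "James Lee", "title": "Analyst"})
writes = rename_everywhere(index, "p1", "James Li")
```

A single name change touches one copy per conversation, so the write count grows with the number of conversations a person belongs to.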
Why ArangoDB: The ability to combine graph with search
Due to the relationships Global Relay needed to incorporate into its intuitive directory search, it decided to try the ArangoDB graph database. By prioritizing the connections between data points, graph databases make it easier to identify relationships between people, places, things, and events. ArangoDB also has a built-in full-text search and ranking engine, ArangoSearch.
“Using ArangoDB to combine graph and search capabilities within the same use case, we’ve dramatically simplified development.”
– Phil Persad, Global Relay
Global Relay’s number one challenge was how to scalably incorporate contextual relevance into its intuitive directory search. Persad was pleased to find that through an edge traversal, Global Relay could easily look for an individual within other documents as follows:
    FOR doc IN ParticipantSearch
      SEARCH ANALYZER(
        (STARTS_WITH(doc.company, TOKENS("bob", "text_en")) OR BOOST(STARTS_WITH(doc.name, TOKENS("bob", "text_en")), 1.2))
        AND
        (STARTS_WITH(doc.company, TOKENS("Acme", "text_en")) OR BOOST(STARTS_WITH(doc.name, TOKENS("Acme", "text_en")), 1.2)),
        "text_en") // Search for all matching participants
      FOR conv IN 1..1 INBOUND doc has_participants // Traverse the graph to find participants sharing a context with the matches
        FOR participant IN 1..1 OUTBOUND conv has_participants
          // Filter for the user who is performing the search to find participants who share a context with them
          FILTER participant._key == "1"
          LET score = BM25(doc)
          SORT score DESC
          RETURN { doc, score }
Because ArangoDB uses edges to define relationships, Global Relay represents a participant just once across its entire data set. With this setup, when a participant’s name or job title changes, Global Relay can apply the update with a single write, without propagating it across every instance of the participant. For example, to track which people have participated in a conversation, each conversation can be linked to its participants through has_participants edges, the edge collection traversed in the query above.
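As a rough in-memory analogy (plain Python, not ArangoDB itself), the sketch below stores each participant once and records conversation membership as edges. The sample people and conversation IDs are invented; only the `has_participants` name and the conversation-to-participant direction are taken from the query above:

```python
# Each participant is stored exactly once, keyed like a document _key.
participants = {
    "1": {"name": "Alice Wong", "title": "Analyst"},
    "2": {"name": "Bob Acme", "title": "Trader"},
    "3": {"name": "Bob Stone", "title": "Developer"},
}

# has_participants edges point from a conversation to a participant.
has_participants = [
    ("conv-a", "1"), ("conv-a", "2"),
    ("conv-b", "2"), ("conv-b", "3"),
]


def conversations_of(key):
    """All conversations that have an edge to the given participant."""
    return {conv for conv, part in has_participants if part == key}


def shares_context(user_key, other_key):
    """True if both participants appear in at least one common conversation."""
    return bool(conversations_of(user_key) & conversations_of(other_key))


def rename(key, new_name):
    """A single write; every edge keeps pointing at the same record."""
    participants[key]["name"] = new_name


# Participants who share a conversation with the searching user (key "1").
in_context = [k for k in participants if k != "1" and shares_context("1", k)]

rename("2", "Robert Acme")
```

The rename touches one record no matter how many conversations reference it, and the shared-context check is a two-hop walk over the edges, which is the same shape as the traversal in the query.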
Global Relay could also leverage the BOOST functionality in ArangoSearch to ensure relevant fields like first name are weighted more heavily than others. In addition, it could sort documents using BM25 instead of term frequency-inverse document frequency, which is less relevant to its use case. ArangoSearch also provided Global Relay with case-insensitive search and late document materialization to optimize its queries.
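The effect of boosting one field over another can be sketched in Python as follows. This is a simplified stand-in, not BM25 and not ArangoSearch’s scoring: each field that matches by prefix contributes a fixed weight, with the name field given the 1.2 boost used in the query above. The sample people and the base weight of 1.0 are assumptions:

```python
NAME_BOOST = 1.2   # boost factor taken from the query above
BASE_WEIGHT = 1.0  # assumed weight for unboosted fields

# Hypothetical directory entries.
people = [
    {"name": "Maria Lopez", "title": "Software Developer", "company": "Acme"},
    {"name": "Devin Smith", "title": "Account Manager", "company": "Acme"},
    {"name": "John Chan", "title": "Sales Lead", "company": "Initech"},
]


def field_matches(query, value):
    """Case-insensitive prefix match against any word in the field."""
    q = query.lower()
    return any(word.startswith(q) for word in value.lower().split())


def field_score(query, person):
    """Sum a fixed weight per matching field, boosting matches in the name."""
    score = 0.0
    if field_matches(query, person["name"]):
        score += NAME_BOOST
    for field in ("title", "company"):
        if field_matches(query, person[field]):
            score += BASE_WEIGHT
    return score


scored = [(field_score("Dev", p), p["name"]) for p in people]
ranked = [name for score, name in sorted(scored, reverse=True) if score > 0]
```

Searching for "Dev" ranks the person named Devin above the person whose job title contains "Developer", and drops the entry with no matching field.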
The Implementation: Context modeled in graph and search views
As illustrated below, Global Relay’s instant messaging clients send traffic to its core instant messaging servers. As events happen, such as users joining and leaving conversations or having voice calls, the IM cluster ships them onto a Kafka event bus for consumption by Global Relay’s search cluster. The search cluster uses this data and enriches it with data from the user repository to build a model of contexts and their participants – all of which is stored in ArangoDB. From there, IM servers can proxy user queries to ArangoDB when performing directory searches.
The Results: Simplified development and faster go-to-market
By using ArangoDB to power its intuitive directory search, Global Relay was able to:
- Combine graph and search capabilities within the same use case;
- Dramatically simplify development by having both data models within the same data store;
- Avoid needing to design distributed transactions to save data across multiple data stores;
- Ultimately get products and features to market faster.
For more details, you can watch Phil’s presentation from ArangoDB Summit 2022: