Complex Document Search
ArangoDB helps Cognitiv+ dramatically improve the customer experience while supporting multiple use cases
- Optimized discovery for customers by enabling Cognitiv+ developers to segregate specific data within collections
- Built flexible architecture to support ML and front-end development
- Accelerated complex queries
- Consolidated and simplified architecture for scalability and easy maintenance
- Created targeted views and reports for customers, further improving their experience
The Scenario: Disconnected ML and web development
Located in London, Cognitiv+ streamlines the legal review process by helping companies and their employees in real estate, financial services, insurance, procurement, legal services, and construction navigate and understand their documents and make better decisions. Getting contracts right is a high-stakes process for customers: a typical Fortune 1000 company handles more than 40,000 ongoing contracts, and industry estimates put the cost of inefficient contracting anywhere from 5 to 40 percent of the total contract value.
To help customers avoid these risks, Cognitiv+ uses machine learning (ML) to automatically review and classify documents, extract the information professionals need, and present it in reports. This approach is easier, cheaper, faster, and more efficient than a manual review. Although it may sound simple, all the document data is unstructured, and most of it consists of legal and contractual documents that often run to hundreds of pages.
Cognitiv+ customer workflow.
Initially, the company’s data scientists built ML models that, while comprehensively categorizing almost every sentence in the lengthy documents, presented customers with too much extraneous information, resulting in a less-than-optimal experience.
The challenge was amplified because two teams worked separately: one built the web app for user interactions in Node.js and AngularJS, while the other handled ML development in Python. The teams worked autonomously, with little cooperation or shared understanding.
Adding to the challenge, a monolithic backend was becoming a single point of failure that wouldn’t scale. The single database for the initial product was a NoSQL document-oriented system that worked fine for storing information but limited how Cognitiv+ could extract and present it. Beyond documents, there was almost no way to control constantly changing elements such as users, organizations, projects, and roles, or to manage the complex relationships among them.
The document-centric data model had to be improved. For instance, assigning roles took up too much space, and obtaining results required too many queries. The Cognitiv+ team also found that the key to analyzing documents and presenting the right results to customers, without irrelevant information or false positive results, was to get more granular with the data.
They realized they needed to break documents into a hierarchical structure of zones and sections, dividing them by, for example, titles, contents, paragraphs, and sentences. Documents also had to be categorized at several levels: the entire document as well as the sections, paragraphs, or text snippets within it. Only then would they be passed to a specific ML model for processing.
How data is logically structured in Cognitiv+.
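The hierarchy described above can be sketched as vertices and edges. This is a minimal illustration and not the Cognitiv+ implementation: the collection names `nodes` and `contains`, the `---` section separator, and the naive sentence splitter are all assumptions made for the example.

```python
import re

VERTEX_COLLECTION = "nodes"    # hypothetical vertex collection name
EDGE_COLLECTION = "contains"   # hypothetical edge collection the edges would be stored in


def split_document(doc_key, text):
    """Return (vertices, edges) for a document -> section -> paragraph -> sentence tree."""
    vertices = [{"_key": doc_key, "level": "document"}]
    edges = []

    def add_child(parent_key, child_key, level, content):
        # Each child becomes a vertex plus one parent-to-child edge.
        vertices.append({"_key": child_key, "level": level, "text": content})
        edges.append({
            "_from": f"{VERTEX_COLLECTION}/{parent_key}",
            "_to": f"{VERTEX_COLLECTION}/{child_key}",
        })

    # Assumed input conventions: a line containing only '---' separates sections,
    # and blank lines separate paragraphs.
    sections = [s.strip() for s in re.split(r"^---$", text, flags=re.MULTILINE) if s.strip()]
    for i, section in enumerate(sections):
        sec_key = f"{doc_key}-s{i}"
        add_child(doc_key, sec_key, "section", section)
        paragraphs = [p.strip() for p in re.split(r"\n\s*\n", section) if p.strip()]
        for j, paragraph in enumerate(paragraphs):
            par_key = f"{sec_key}-p{j}"
            add_child(sec_key, par_key, "paragraph", paragraph)
            # Naive sentence split: '.', '!' or '?' followed by whitespace.
            sentences = [t for t in re.split(r"(?<=[.!?])\s+", paragraph) if t]
            for k, sentence in enumerate(sentences):
                add_child(par_key, f"{par_key}-t{k}", "sentence", sentence)
    return vertices, edges


sample = (
    "Term. This lease runs for five years.\n\n"
    "Rent is due monthly.\n"
    "---\n"
    "Termination. Either party may terminate with notice."
)
vertices, edges = split_document("lease1", sample)
```

Because every node records its level, a classifier for a given granularity (say, sentences only) could be handed just the vertices at that level, which is the kind of targeting the paragraph above describes.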
The Requirements: Scalable, maintainable, versatile, compliant
Cognitiv+ rebuilt its system and chose a new database with several priorities in mind. Specifically, it had to:
- Scale to accommodate company growth
- Be easily maintained to save hassle and cost versus the previous system
- Be versatile to support multiple use cases
- Store JSON-based documents and the links between them
- Support collections and enable developers to segregate specific data and control access to it
- Help customers and internal staff explore, filter, and query documents with ease
- Supply targeted views via text and other searches to find specific items in legal documents
- Offer permission-based access and encryption to support SOC 2 compliance
The team evaluated multiple options for a new database engine, including document and relational databases. Document databases did not support the new data model, with its various types of relationships. Relational databases, with their fixed schemas, required too much configuration and too many ongoing changes to support the many hierarchical structures in the company’s data.
They also looked at graph databases, but most presented challenges. “The only one that had a document model was ArangoDB,” says Cognitiv+ Director of Technology Delio D’Anna. “While Neo4j has a nice query language, it has many limitations. And even if you use Neo4j Enterprise Edition, you still have many limitations, and you don’t have document support. Options like Dgraph also had weaknesses and didn’t support how our developers worked.”
Why ArangoDB: Intuitive query language, multiple databases in one
D’Anna and his team considered using a mix of databases but discovered that ArangoDB met all their needs in one database. It has multi-model capabilities, makes it easy to traverse graphs, and offers excellent performance to get data quickly to ML models and back to customers. ArangoDB also provides granular, permission-based access to collections without many queries, making it easier to secure sensitive customer data. The Arango Query Language (AQL) was another deciding factor for the Cognitiv+ team.
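To illustrate how a graph traversal can replace several lookups, here is a minimal sketch of resolving a user's roles in a single query. The edge collection `has_role`, the keys, and the role names are hypothetical, the AQL string is shown only for illustration, and the Python function emulates the same traversal on in-memory data so the example runs without a database.

```python
# AQL shown for illustration only; it is not executed in this sketch.
# `has_role` is a hypothetical edge collection linking users to projects.
ROLE_QUERY = """
FOR project, edge IN 1..1 OUTBOUND @user has_role
  FILTER edge.role IN @roles
  RETURN { project: project._key, role: edge.role }
"""

# In-memory stand-in for the edge collection: one edge per role assignment.
HAS_ROLE = [
    {"_from": "users/alice", "_to": "projects/lease-review", "role": "reviewer"},
    {"_from": "users/alice", "_to": "projects/procurement", "role": "admin"},
    {"_from": "users/bob", "_to": "projects/lease-review", "role": "viewer"},
]


def projects_for(user_id, roles, edges=HAS_ROLE):
    """Emulate the traversal: follow outbound role edges from one user."""
    return [
        {"project": edge["_to"].split("/", 1)[1], "role": edge["role"]}
        for edge in edges
        if edge["_from"] == user_id and edge["role"] in roles
    ]


result = projects_for("users/alice", ["admin", "reviewer"])
```

Storing each assignment as an edge keeps role data out of the user and project documents, which is one plausible way to address the space and query-count problems of a purely document-centric model.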
“ArangoDB has a very nice query language. Even our junior developers can quickly create complex queries. In addition, the ability to create multiple databases in one, easily traverse graphs, provide permission-based access, and use encryption for SOC 2 compliance made ArangoDB the easy choice.”
- Delio D’Anna, Cognitiv+
The Implementation: ArangoDB serves synchronous and asynchronous services, feeding ML training models
D’Anna and his team developed an architecture with a front-end and an API gateway with Redis handling authorization. ArangoDB runs the remaining services — authentication, user management, and asynchronous interactions via NATS — all with controlled access to specific collections within the database.
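The idea of giving each service controlled access to specific collections can be sketched as a simple access map. The service and collection names below are hypothetical and not taken from the Cognitiv+ system; the level strings mirror ArangoDB's collection access levels (`rw`, `ro`, `none`).

```python
# Hypothetical mapping of services to the collections they may touch.
SERVICE_ACCESS = {
    "auth-service": {"users": "rw", "sessions": "rw"},
    "user-management": {"users": "rw", "organizations": "rw", "roles": "rw"},
    "reporting": {"documents": "ro", "clauses": "ro"},
}

# Order the access levels so they can be compared.
LEVELS = {"none": 0, "ro": 1, "rw": 2}


def can(service, collection, needed):
    """Return True if `service` holds at least the `needed` level on `collection`."""
    granted = SERVICE_ACCESS.get(service, {}).get(collection, "none")
    return LEVELS[granted] >= LEVELS[needed]


reporting_can_read = can("reporting", "clauses", "ro")
reporting_can_write = can("reporting", "clauses", "rw")
```

Anything not listed defaults to `none`, so a new collection stays inaccessible to a service until access is granted explicitly.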
They are now experimenting with ArangoML to analyze information better, extract pertinent data, and store ML experiments. ArangoDB is also used to obtain targeted data quickly and feed it to the Ray framework to produce ML training materials. Once models are trained, the team uses GraphML to make sense of them and traverse extracted insights.
Reporting is also becoming more sophisticated with ArangoDB in place. For example, the team can target specific data, such as the risks of accepting a contract, helping customers make smarter decisions.
Below is what Cognitiv+ built for searching and analyzing legal documents.
Cognitiv+ application architecture.
The Results: Granular data access, improved customer experience, flawless performance
Granular data access for improved customer experience: Instead of receiving extraneous data, customers only see specific clauses or sections they need to review in contracts, saving them time and improving their experience using Cognitiv+. They also receive reports with detailed information, such as the risk of accepting a contract.
No training required, thanks to an intuitive query language: With AQL, even less experienced developers can execute complex queries.
Responsive search: ArangoDB makes exploring, filtering, and querying documents simple. Customers get targeted views via the Cognitiv+ web UI or can search to find specific items in legal documents.
Security: Permission-based access and encryption safeguard customers’ sensitive legal materials and have enabled SOC 2 compliance.
Scalability: ArangoDB scales up or down to meet demand. It is also simple to maintain, reducing costs and improving efficiency.
Exceptional performance: Internal Cognitiv+ users and customers quickly receive precise query results.
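As an illustration of the targeted text search mentioned under "Responsive search," the sketch below builds a phrase query over clauses. It assumes an ArangoSearch view, which the case study does not confirm Cognitiv+ uses; the view name `clauses_view` and the `text` and `document` attributes are hypothetical.

```python
# AQL for a phrase search over a hypothetical ArangoSearch view.
# "text_en" is ArangoDB's built-in English text analyzer.
SEARCH_QUERY = """
FOR clause IN clauses_view
  SEARCH ANALYZER(PHRASE(clause.text, @phrase), "text_en")
  FILTER clause.document == @document
  SORT BM25(clause) DESC
  LIMIT @limit
  RETURN { key: clause._key, text: clause.text }
"""


def build_clause_search(phrase, document_key, limit=10):
    """Validate the inputs and return (query, bind_vars) for the search above."""
    phrase = phrase.strip()
    if not phrase:
        raise ValueError("search phrase must not be empty")
    if limit < 1:
        raise ValueError("limit must be at least 1")
    bind_vars = {"phrase": phrase, "document": document_key, "limit": limit}
    return SEARCH_QUERY, bind_vars


query, bind_vars = build_clause_search("  termination for convenience ", "lease1", limit=5)
```

With a driver such as python-arango, the pair could be passed to `db.aql.execute(query, bind_vars=bind_vars)`. Passing user input through bind variables rather than string formatting keeps it out of the query text.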