This article is the first in a series of technical deep dives where we focus on the engineering challenges of building a platform like PreSeries. Today, we explain how we drastically improved the speed of search queries performed by PreSeries’ users.
One of the problems that we were facing was the low performance we were getting from our Mongo replica set when we tried to query big tables (tables with more than 30 millions of documents) using multiple conditions.
It only got worse when we tried to run searches using keywords or free text. Every time that we wanted to use a new search field, we were forced to create multiple indexes that considerably increased the volume of the database. And there is a limitation on the number of indexes that you can create on a mongo collection (close to 72), making it almost impossible to manage all the possible scenarios we want to address.
Eventually, the solution we adopted was inspired by looking at how the DataStax™ Enterprise Platform™ works. We were inspired by DataStax’s use of Solr as a search backend to complement the power of Cassandra databases, thus allowing the creation of Solr indexes in Cassandra tables.
If you take a look at their architecture, you can see that they use Apache Cassandra as its core database engine in conjunction with a Solr service to manipulate sophisticated search queries without compromising the service performance. That makes sense, because managing queries in Cassandra is much harder than in Mongo. Sure, you have access to secondary indexes in Cassandra to query fields that would normally not be queryable, but at the expense of performance, which would be severely affected.
What is Solr?
Apache Solr™ is a highly reliable, scalable, and fault-tolerant distributed indexing solution with the latest search and index technology. Solr is a search engine at heart, but it is much more than that. It is a NoSQL database with transactional support. It is a document database that offers SQL support and executes it in a distributed manner. Solr enables powerful full-text search and near real-time indexing.
At this point, we knew we wanted to bring the Solr impressive features to the PreSeries Datamarts in order to improve the performance of our queries, but how? Our main concern was how to maintain and synchronize the collections from our Datamarts with the Solr service automatically, and how to do so as quick as possible. Eventually, we found the solution in the mongo-connector project.
In the following figure I am representing the Solr architecture, formed by a cluster of nodes. We can see how Solr splits the data into shards, distributes the shards between the nodes based on the number of shards per node we have configured, and maintain replicas of these shards to assure reliability using a replication factor that can be established by core (SolrCloud’s index).
Figure 1- Architecture of a normal Apache Solr™ deployment
What is mongo-connector and how it works?
The mongo-connector is a Python utility that creates a pipeline from a MongoDB cluster (replica set) to one or more target systems, and one of the available targeted systems is, happily, Solr. Mongo-connector is modular, it is based on a core library and independent managers that know how to talk with the different target systems. In our case, the manager required by us was the Sorl-Doc-Manager library.
Below, is the architecture of the solution with all the processes between Mongo and Solr:
Figure 2- Architecture of the MongoDB to Solr synchronization
With mongo-connector, we can synchronize one or multiple mongo collections into one unique Solr core. The question that arises is the following: how can we differentiate between the data of different collections if we put everything in the same core/index? The answer is easy, the library adds a special field named ‘ns’ (namespace) that will contain the name of the original mongo collection. This field will become a discriminator field that will be able to use to filter results of the original collection we are interested in.
In the following figure we can see how to synchronize three MongoDB collections that are holding the insights generated for the companies we are tracking, into one unique Solr collection. Each of these original collections maintains a view of the companies, applying different periodicities: weekly, monthly, and yearly.
Figure 3- Synchronized data from MongoDB to Solr
Let’s breakdown this chart and highlight hte main parts:
- On the left we see the structure of the mongo collections that will be synchronized with Solr. We can see different type of fields that goes from strings, to decimals, or arrays of strings.
- In the middle, we can see the syntax of the command we should run in order to start the synchronization process. This process is responsible for checking the mongo OptLog collection in order to detect all the changes in the mongo database, and apply those changes into the Solr collection as soon as they have been detected.
- What is the meaning of the optlog-ts parameter? the process will maintain updated the informed file (companies_collections.log) with the timestamp of the latest mongo change indexed into Solr. The sync process will use the value saved in this file as the start point for future synchronizations.
- The auto-commit-interval parameter has a big impact in the performance of the entire process. It will directly affects the performance of index updates. This parameter informs Solr about how often we need to flush the changes into the disk. In our case, our ETLs do changes that affect millions of documents every time they are executed. Therefore, this actions will provoke millions of operations in Solr, which implies a lot of changes in the index.
- The -n parameter allows us to inform about the collections in mongo that we want to synchronize into the Solr core/index. In our case, we are talking about three different collections.
- And finally, we inform about the core/index in Solr (companies_data) that will hold all the data that will come from mongo.
- On the right-side of the image we describe the schema of the destination collection in Solr. We can identify 3 new fields in this new collection that are: _TS, NS and ID.
- _TS: this field is automatically generated by mongo-connector, and it maintains the date at which the original document was created or updated in mongo.
- NS: this field maintains the reference to the original collection name in mongo. In this case, this field could contains one of the following values: [“preseries_db.companies_data_weekly“, “preseries_db.companies_data_monthly“, “preseries_db.companies_data_yearly“]
- ID: this field maintains the original mongo _ID of each document.
In Solr, we can conduct queries of different types:
- Advances queries, using multiple clauses and comparators. Fuzzy searches.
- Nested queries. Nest an arbitrary query type inside another query type.
- Faceting, that give us our category counts, among other things.
- Range Faceting, that give us the ability to divide a numeric fields into a categories of ranges. It allows us discretize numerical fields and analyze the range counts, among other things.
- Boosting, that allows us to score higher some fields over others.
- TF-IDF and BM25, Solr uses the TF-IDF as its scoring algorithm. TF-IDF for “term frequency versus the inverse document frequency.” It returns how frequently a term occurs in your field or document versus how frequently that term occurs overall in your collection. This algorithm has some issues, issues mitigated by using the BM25 algorithm. An algorithm that smoothes this process, effectively letting documents reach a saturation point, after which the impact of additional occurrences are mitigated.
- Grouping and Aggregations
- Obtain statistics about the fields: quantity/count, missing, max, min, average, number of distinct values, list of distinct values, the more frequent values.
These are some examples of the kind of queries we are able to do:
Figure 4- Field statistics query
Figure 5- Faceting query
Figure 6- Range Faceting query with complex filtering
Figure 7- Pivot Faceting or Decision Trees
Figure 8- Aggregations
Issues faced along the road
Not everything has been quick and easy.
The main issues we have encountered were related to the process of maintaining both of the backends synchronized once the initial synchronization was made.
After the initial synchronization, that is made using bulk operations to speed up the process, the process becomes very slow. It happens because mongo-connector is processing changes one by one. Every time a change is detected, the connector loads the data from mongo, transforms the mongo-document into a solr-document making use of the schema published in Solr schema, and finally it sends the operation to Solr. No bulk operations were done!
Another issue was that “updates” were very slow, taking days to finish. The Solr Manager was doing a two-step process by document, the first step was to load the data from the solr-backend to check the existence of the document, and the second step was to send back to Solr the updated version of the document. In this scenario, bulk updates aren’t possible.
The solution we found to that was to clone the original mongo-connector project from github, and make some changes in the implementation.
These are the changes we made in the code:
- We do bulk inserts and updates when possible. We load and process as many changes as possible and we send them to Solr in a bulk.
- The first step in an update is to request the current version of the document in Solr, to apply later the changes in the loaded doc. Now, we are requesting multiple docs at the same time, instead of one by one. This is a bulk read. We reduce in thousands the number of queries we need to send to MongoDB.
- We use a high auto-commit-interval, around 20 minutes or so, to avoid flushing changes every time a document (or a bulk of documents) is sent to Solr. We need to take into account that we make millions of changes in mongo every time we execute an ETL, so this has big impact.
The results are stunning! As we can see in the next image, we went from response times measured in seconds in MongoDB to respond times measured in milliseconds in Solr, both accessing collections with more than 50 million of documents.
Our API is now also able to solve almost all the queries in a second. We have reduced considerably our response times, and this is allowing us to serve many more requests, in considerably less time, and using less resources.
We have been able also to reduce considerably the size of our MongoDB replica set., removing most of the indexes that we were needing to solve queries with multiple criteria.
Results of the performance tests:
Some explanation about the queries:
- By ID: a simple query where we look for the data of an specific company by its ID in an specific snapshot (we generate snapshots every week about the details of the companies).
- By Keywords: we look for the companies that contained the terms “Machine Learning” or “Software” in any of the company textual fields.
- By Complex Criteria: we look for the companies that matched the following criteria:
- Founded after 1st January 2010.
- In “running” or “acquired” status
- With a score greater than 40%
- Organizations that are only companies (we have also schools and groups indexed in the database)
- By Keywords + Complex Criteria: the two previous queries together.
Want to build your very own startup deal sourcing & assessment platform with PreSeries? Get in touch here!