Jixee was a supporter of Elasticsearch long before its rebranding to Elastic in early 2015. We use it for logs as well as to power Jixee’s search functionality. Elastic has released many features in its latest iterations, but to our dismay, it also deprecated a long-running method for synchronizing data with Elastic: Rivers. Until then, Rivers had been the method Jixee used to synchronize data from our MongoDB cluster into our Elastic cluster. This raised an important question for Jixee: now that Rivers are deprecated, what do we do moving forward? In this post, I’ll cover our thought process in answering that question, the options we considered, and what we decided to use to replace Rivers. If you’re interested in a hands-on tutorial about mongo-connector and Elasticsearch, check out my other post on the topic, How to Use mongo-connector with Elasticsearch.
The End of The River
Why did Elastic remove Rivers support? According to this article written by Shay Banon, cluster stability while using Rivers was the main reason for deprecating the functionality. River support was made possible by installing the River plugin on the Elastic side, and while this is very convenient, it requires Elastic to manage the River process. Doing this adds overhead to Elastic and can cause a slew of other issues, which are described in Shay’s article. While the decision was upsetting at first, I can appreciate it, since it allows Elastic to focus on what it does best: indexing large amounts of data for search.
The deprecation of Rivers essentially told us that our Elastic clusters now need to ingest data through an external process. For Jixee, this meant one of two things:
- Create our own ingestion process.
- Or, find a tool that already exists which provides the same functionality as Rivers.
After much consideration, and a prototype of the first option, we opted for the second and chose mongo-connector.
Why We Chose mongo-connector
In a way, this seems like a no-brainer. Mongo-connector replaces Rivers, so why even consider other options? Let’s go through our thought process.
Writing Our Own Solution
In order to write this ingestion process ourselves, there were important points to consider:
- Initial import/sync process
This is actually the simplest step in this solution. It takes all of our objects from MongoDB and adds them to Elastic. If a document already exists in Elastic, it is simply overwritten with whatever is in MongoDB. So far, so good.
- Real-time updating of Elastic
Once we’ve initially seeded our Elastic indexes, we have to make sure that whenever Jixee’s data is added, updated, or deleted in MongoDB, the same change is applied to our Elastic cluster. Generally speaking, this is a straightforward process, but what happens if a MongoDB write succeeds and an Elastic write fails? The data in MongoDB and Elastic will be out of sync. Our solution to this issue was to use a job queue, but this added a lot more complexity to our code and gave us yet another service to maintain in the already growing number of services that make up Jixee’s infrastructure.
- Monitoring Data Ingestion Success / Scheduled syncing
In our testing, steps 1 and 2 together did not always keep Elastic up to date with real-time data. We needed to run a process that filled in the gaps. This consisted of running the tool we made in step 1 at a scheduled interval. It works, but in my opinion, it is a very clunky solution. Given how often we would have to run this process, it put enough load on the MongoDB cluster that it needed to run against a MongoDB replica designated just for this purpose. In addition, the mere fact that we would have to run it at all meant that, at times, we might have data missing from Elastic. This falls short of our goal of keeping user data up to date in real time, and it would create a bad experience for our users.
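To make the complexity concrete, here is a minimal sketch of the dual-write-plus-job-queue idea from step 2. It is not the prototype we built: the names `FlakyIndex` and `DualWriter` are hypothetical, a plain dict stands in for the MongoDB collection, and an in-memory object stands in for the Elastic index so the example runs on its own.

```python
from collections import deque


class FlakyIndex:
    """In-memory stand-in for a search index (hypothetical, for illustration).

    Writes raise ConnectionError while `available` is False.
    """

    def __init__(self):
        self.docs = {}
        self.available = True

    def apply(self, op, doc_id, doc=None):
        if not self.available:
            raise ConnectionError("index unavailable")
        if op == "delete":
            self.docs.pop(doc_id, None)
        else:  # "upsert"
            self.docs[doc_id] = doc


class DualWriter:
    """Writes to the primary store first, then mirrors the change to the index.

    Failed index writes are parked in a retry queue and replayed later.
    """

    def __init__(self, primary, index):
        self.primary = primary      # dict standing in for a MongoDB collection
        self.index = index
        self.retry_queue = deque()  # stand-in for a real job queue service

    def save(self, doc_id, doc):
        self.primary[doc_id] = doc
        self._mirror("upsert", doc_id, doc)

    def delete(self, doc_id):
        self.primary.pop(doc_id, None)
        self._mirror("delete", doc_id)

    def _mirror(self, op, doc_id, doc=None):
        # If older operations are still queued, queue this one too so that
        # a stale retry can never overwrite a newer write.
        if self.retry_queue:
            self.retry_queue.append((op, doc_id, doc))
            return
        try:
            self.index.apply(op, doc_id, doc)
        except ConnectionError:
            self.retry_queue.append((op, doc_id, doc))

    def drain(self):
        """Replay queued operations in order; stop at the first failure.

        Returns the number of operations still waiting.
        """
        while self.retry_queue:
            op, doc_id, doc = self.retry_queue[0]
            try:
                self.index.apply(op, doc_id, doc)
            except ConnectionError:
                break
            self.retry_queue.popleft()
        return len(self.retry_queue)
```

Even in this toy form, the application code has to know about ordering, retries, and when to call `drain()`, which is the kind of extra surface area that made us hesitant about this route.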
From the mongo-connector GitHub page: “mongo-connector creates a pipeline from a MongoDB cluster to one or more target systems, such as Solr, Elasticsearch, or another MongoDB cluster. It synchronizes data in MongoDB to the target then tails the MongoDB oplog, keeping up with operations in MongoDB in real-time.” Here are some points we considered while deciding this was the right tool for us.
- MongoDB Replica Set Oplog
Since we already utilize MongoDB Replica Sets in our infrastructure, we satisfied a major requirement for mongo-connector.
- No Modifications to Our Code
This is obviously a huge plus. We were able to implement mongo-connector without making any changes to our code. This point alone weighed heavily in our decision.
- New Service to Add to Jixee’s Infrastructure
Yes, as stated above, this can be a downside. More services translate to more complexity and maintenance, but mongo-connector has proven to be an easy service to implement and maintain, so it didn’t dissuade us from taking this path.
- Easy to Install and Implement
Installing mongo-connector is made easy by its helpful documentation, and because Python is already a part of the Jixee stack, it was a great fit.
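For readers unfamiliar with the "tails the oplog" idea from the quote above, the following is a simplified sketch of what replaying oplog entries against a target looks like. It is not mongo-connector's actual code. The entry shape (`op`, `ns`, `o`, `o2`) follows the general layout of MongoDB oplog documents, but the integer `ts` values, the dict used as a target, and the function names are assumptions made for illustration; real oplog timestamps are BSON timestamps, and update formats vary between MongoDB versions.

```python
def apply_oplog_entry(entry, target):
    """Apply one simplified oplog entry to `target`.

    `target` is a dict keyed by (namespace, _id), standing in for a search index.
    """
    op, ns = entry["op"], entry["ns"]
    if op == "i":  # insert: "o" holds the full document
        doc = entry["o"]
        target[(ns, doc["_id"])] = dict(doc)
    elif op == "u":  # update: "o2" identifies the document, "o" holds the change
        key = (ns, entry["o2"]["_id"])
        change = entry["o"]
        if "$set" in change or "$unset" in change:
            doc = target.setdefault(key, {"_id": key[1]})
            doc.update(change.get("$set", {}))
            for field in change.get("$unset", {}):
                doc.pop(field, None)
        else:  # whole-document replacement
            target[key] = dict(change, _id=key[1])
    elif op == "d":  # delete: "o" holds the _id
        target.pop((ns, entry["o"]["_id"]), None)
    # Other entry types (no-ops, commands) are ignored in this sketch.


def replay(oplog, target, last_ts=0):
    """Apply every entry newer than `last_ts`; return the new checkpoint."""
    for entry in oplog:
        if entry["ts"] <= last_ts:
            continue
        apply_oplog_entry(entry, target)
        last_ts = entry["ts"]
    return last_ts
```

The returned checkpoint is the important part of the design: because the oplog is an ordered record of every write, a process that remembers where it stopped can resume from that point instead of re-importing everything, and the application code never has to know the target exists.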
Since implementing mongo-connector, we’ve discovered one drawback, which we solved by monitoring a couple of metrics. If the mongo-connector process ever stalls or fails, it is important to know when it happened so that action can be taken. In our case, the process might stall if the load on the Elastic cluster was too high. Our solution was simple. We wrote a monitoring script that compared the number of objects in our collections on MongoDB with the number in Elastic. If the difference reached a certain threshold, we restarted the mongo-connector process. We’ve since added more power to our Elastic cluster, and this problem hasn’t surfaced again.
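The check itself can be very small. Below is a minimal sketch of the comparison logic, not our actual script: the function names, the relative-drift formula, and the threshold value are all assumptions. The counts are passed in as plain numbers so the example runs on its own; in a real deployment they might come from something like pymongo's `count_documents({})` and the Elasticsearch client's `count` API, and `restart` would be whatever restarts the mongo-connector service in your environment.

```python
def drift_ratio(source_count, target_count):
    """Relative difference between the two counts, measured against the source."""
    if source_count == 0:
        return 0.0 if target_count == 0 else 1.0
    return abs(source_count - target_count) / source_count


def check_sync(counts, threshold, restart):
    """Compare per-collection counts and trigger a restart if any drift too far.

    counts:    {collection_name: (mongo_count, elastic_count)}
    threshold: maximum tolerated drift ratio (an assumed tuning knob)
    restart:   zero-argument callable that restarts the sync process
    Returns the names of the collections that exceeded the threshold.
    """
    drifted = [
        name
        for name, (src, dst) in counts.items()
        if drift_ratio(src, dst) > threshold
    ]
    if drifted:
        restart()
    return drifted
```

A small tolerance matters here because the two counts are read at slightly different moments, so a busy collection will rarely match exactly even when the sync is healthy.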
While it is fun and challenging to build solutions in house, being in the tech industry for a while has afforded us the wisdom to know that reinventing the wheel is not always the best option. In Jixee’s case, adding complexity to our codebase proved to be more trouble than it was worth, and mongo-connector allowed us to replace Elastic Rivers with minimal pain.