[Editor’s Note: This post is part of a series on this blog called Jixee Hotfix. It features real problems that our engineering team encounters on a weekly basis and the solutions they come up with to fix them. Posts are written by the engineers who encountered the problems. This post was written by our Sr. Engineer, John Sykes.]
This week at Jixee, I revisited our search capabilities in order to improve on what we currently have. One of the key focuses was a core aspect of Jixee: tasks. Our search system previously took only whole words and compared them against the titles of tasks. Anyone who uses Jixee knows there are several aspects to a task, and searching only the title is a narrow way of delivering content. We also decided to index task attachments, to help sort through all of the uploaded documents that accumulate over time. Lastly, the biggest challenge of the week was dealing with the rivers plugin and searching by partial words via nGrams.
Our first improvement consisted of adding additional task fields to our search system. This was faster than originally expected. At the time, we were mostly depending on Elasticsearch’s default analyzers for search strings and indexes, which was enough to cover the extra fields for this step. Our query consisted of the following:
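The original query isn’t reproduced here, but a multi_match query of the shape described below would look roughly like this — the exact field list is an assumption, since only the title and comments fields are named in the post:

```json
{
  "query": {
    "multi_match": {
      "query": "user search terms",
      "type": "most_fields",
      "fields": ["title^2", "description", "comments"]
    }
  }
}
```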
What this allows us to do is a general search across multiple fields. With a type of “most_fields”, we get scores based on how many fields had some form of match. The title^2 is the syntax for field weights, giving title matches a higher score when a match is found. We have also added the ability to search a task’s comments. So if you remember part of a comment that had key information, you can now search for it and pull up the task associated with it.
I became aware of Elasticsearch’s (the company is now called Elastic) plan to deprecate rivers in future releases. Rivers were a quick way of connecting external data sources and automatically syncing them with stored indexes, ensuring new data is immediately searchable. The reason to move away from them was to have a solution that is more scalable in the long run: as things scale up, so do the memory costs under heavy river indexing and activity. While we did update our database and search engines, we are still currently using rivers. We decided that since we filter search results on a per-account basis, the search sample size is greatly reduced compared to searching the entire dataset. There are also only a few collections tied to rivers, so we have plenty of room before additional scaling solutions need to be implemented.
The biggest challenge I faced this week was using search analyzers with nGrams. These store a set number of variations of each word in the index. For example, the word house could be split up into h, ho, hou, hous, house, with other variations depending on settings. This lets incomplete search phrases be compared against the indexed fragments. The biggest wall I faced was getting this to work with our rivers plugin. Ultimately, we came up with this type of query:
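Our exact query isn’t shown here, but a query_string search with analyze_wildcard enabled — the approach described below — looks roughly like this (the search term and field names are illustrative):

```json
{
  "query": {
    "query_string": {
      "query": "*hous*",
      "analyze_wildcard": true,
      "fields": ["title^2", "description"]
    }
  }
}
```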
This approach does not use nGrams, but rather a wildcard-based approach. The analyze_wildcard option allows us to analyze the search term while giving us the freedom to enter partial terms that match various parts of the original word. Given our test speeds and typical account sample sizes, this is a viable alternative for now.
*Note: If you ever decide to wildcard a field, be sure to set that field to not_analyzed in the mapping. This stores the value as a single exact token instead of running it through an analyzer, saving index space and potentially improving performance.
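As a sketch, in the Elasticsearch 1.x mapping syntax current at the time of writing, a not_analyzed field looks like this (the task type and attachment_filename field are hypothetical names, not from our actual mappings):

```json
{
  "mappings": {
    "task": {
      "properties": {
        "attachment_filename": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
```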