March 18, 2012

Elasticsearch Space-savers

After setting up ElasticSearch, you'll be faced with the task of optimizing your index configuration for speed and for size. There are millions of documents in our index, and, for performance, it’s important that the whole index be kept in memory. As a result, index size matters a lot. We spent a while tweaking ElasticSearch to minimize its index size, and it turns out that there’s a decent amount of low-hanging fruit.

_source

For example, ElasticSearch stores the original data along with the indexed data for each document and returns the full document with each search result. We only want ElasticSearch queries to assemble a set of document IDs to be fetched from the database, so for us this overhead is unnecessary. Cutting this data shrank our search index fourfold.

:_source => { :enabled => false }
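As a rough sketch of where that setting sits and what it implies for the calling code: the index body below disables _source for one type, and a small helper pulls the IDs out of a search response so the records can be loaded from the database. The type name 'contact', the 'email' field, the helper name document_ids, and the sample response are all made up for illustration; only the :_source line comes from our configuration.

```ruby
require 'json'

# Index body with _source disabled. The type name 'contact' and the
# 'email' field are hypothetical, chosen only for illustration.
INDEX_BODY = {
  :mappings => {
    'contact' => {
      :_source    => { :enabled => false },
      :properties => {
        'email' => { :type => 'string' }
      }
    }
  }
}

# Pull just the document IDs out of a parsed search response.
def document_ids(response)
  response.fetch('hits', {}).fetch('hits', []).map { |hit| hit['_id'] }
end

# A hand-written response in the general shape of a search result.
# With _source disabled, the hits carry no '_source' key.
sample_response = JSON.parse(<<-EOS)
  { "hits": { "total": 2, "hits": [
    { "_index": "contacts", "_type": "contact", "_id": "17", "_score": 1.0 },
    { "_index": "contacts", "_type": "contact", "_id": "42", "_score": 0.8 }
  ] } }
EOS

puts JSON.pretty_generate(INDEX_BODY)   # JSON to send when creating the index
puts document_ids(sample_response).inspect
```

The trade-off is that the index can no longer hand back the document itself, so anything you want to display has to come from the primary data store.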

_all

In addition, ElasticSearch stores an aggregated _all field on each document, which contains the analyzed output from all of the other fields in the document. This doesn’t add any new information, and its purpose is just to simplify the query interface.

We don’t need full-text search on every field (user IDs, for example, are only used to partition the index). Setting a flag to exclude such fields from the _all field, and preventing them from being analyzed, saved us another 4-5GB for ~2 million documents.

'id' => { :type => 'string', :include_in_all => false }
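A minimal sketch of how both halves of that change might look together in a properties hash, assuming the string mapping options of this ElasticSearch generation (:index => 'not_analyzed' alongside :include_in_all). The helper name exact_match_field and the 'user_id' and 'name' fields are hypothetical.

```ruby
# Mapping entry for a field that is only ever matched exactly:
# it is kept out of _all and is not run through an analyzer.
def exact_match_field
  { :type => 'string', :index => 'not_analyzed', :include_in_all => false }
end

# Hypothetical field list for illustration.
properties = {
  'id'      => exact_match_field,
  'user_id' => exact_match_field,
  'name'    => { :type => 'string' }  # default: analyzed and copied into _all
}

properties.each do |field, options|
  puts "#{field}: #{options.inspect}"
end
```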

Multi-field types

Other problems were a little trickier. How do we minimize the size of the email address index given that we’d like to be able to perform both prefix and substring searches on them?

Email addresses may be very long and overflow our ngram tokenizer (which maxes out at fifteen characters), so we decided to construct a tiered multi-field index for certain fields in order to accommodate all of the search use cases. We don't know if this is a best practice, but it seems to work pretty well for us.
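To make the fifteen-character limit concrete, here is a small plain-Ruby imitation of what an ngram tokenizer and an edge-ngram (prefix) tokenizer emit. It is not ElasticSearch code and not our production configuration: the helper names, the sample address, and the min_gram value of 2 are assumptions for illustration, and only the cap of 15 comes from our setup.

```ruby
MAX_GRAM = 15  # the cap mentioned above
MIN_GRAM = 2   # assumed for this sketch; not stated in the post

# Every substring of `text` whose length is between min_gram and max_gram.
def ngrams(text, min_gram = MIN_GRAM, max_gram = MAX_GRAM)
  grams = []
  (0...text.length).each do |start|
    (min_gram..max_gram).each do |len|
      break if start + len > text.length
      grams << text[start, len]
    end
  end
  grams.uniq
end

# Only the substrings anchored at the start of `text` (prefixes).
def edge_ngrams(text, min_gram = MIN_GRAM, max_gram = MAX_GRAM)
  upper = [max_gram, text.length].min
  (min_gram..upper).map { |len| text[0, len] }
end

email = 'bartholomew.featherstonehaugh@example.com'  # made-up address
substring_tokens = ngrams(email)
prefix_tokens    = edge_ngrams(email)

puts "substring tokens: #{substring_tokens.size}"
puts "prefix tokens:    #{prefix_tokens.size}"
puts "longest token:    #{substring_tokens.map(&:length).max} characters"
puts "full address indexed as a token? #{substring_tokens.include?(email)}"
```

Two things fall out of this. The substring token set is far larger than the prefix set, which is the index-size cost of substring search. And no token is longer than fifteen characters, so a query term longer than that (such as the full address) has nothing to match in the ngram field, which is why a single ngram-analyzed field cannot cover every use case.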