Elasticsearch: Archiving Indexes on a Budget - Jixee | Task and Communication Hub For Developers

If you’ve been to an Elasticsearch meetup, you’ve likely had a conversation about Elastic cluster sizes, data retention policies, and archive strategies. In those conversations, you have probably noticed that most of the companies represented have big budgets to build a proper Elastic cluster. There are big companies with budgets, and then there are the rest of us: the startups who try to use the coolest technologies without the biggest of budgets.

I thought it might be helpful to share my index archiving process for those of us who don’t have the largest of budgets to throw at our Elastic implementations.

Our Use Case

We have a single Elastic server that we use as part of our ELK stack. It indexes web server and application logs. Our ingestion rate is currently 11GB of logs a day. We keep 2 weeks of logs available for querying, and archive a year’s worth in case we need to go back in time to look at logs. Our server specs are important for our archive process, especially disk storage. We are using a single 8G instance with 500G of disk space. Some of you are probably aghast at how small an implementation that is, and also wondering how the heck we store a year’s worth of logs with that little disk space. Well, the short answer is that we don’t store it on our Elastic server. Let me take you through our archive process.

Nope to Curator

It’s worth mentioning that for those companies who have a bigger budget, the ideal way to deal with data retention policies is to use Curator. Curator is a great tool that allows you to manage your indexes and snapshots with ease. Using it effectively requires a snapshot repository backend that has nearly endless, or at least a LOT of storage capacity. Examples of those backends are S3, Azure storage, a Hadoop cluster, or a mount point with a large amount of disk space. These repository backends are supported by Elastic with the use of freely available plugins, or in the case of a mount point, by default. We didn’t have the budget to support connecting our Elastic instance to a big snapshot storage repository, so we came up with another way.

Cheap Storage

We came up with a way to store our index snapshots in cheap storage. One year of our current index size clocks in at roughly 3TB, which can generate a decently sized monthly bill on well-known storage services. Some cheap storage options we considered:

– Amazon Glacier**

– Amazon S3

– Google Cloud Storage

– Large external hard drives

– Tape drives

** One thing to note about Amazon Glacier is that it is by far the cheapest cloud storage solution, but it does come with its quirks. One such quirk is that if you are scripting your archive steps, sending large files to Glacier can take a long, long, long time. I found it a bit cumbersome to work with in an automated fashion, so we opted for an affordable storage service with an SFTP interface, and it works great for us.

Homegrown

The type of cheap storage you end up going with will determine your archive process. These steps are easily modified to suit your specific needs. Let’s get to the meat of this post, and describe the steps we need to take. For the commands below, let’s assume our Elastic instance is at localhost:9200.

1) First our Elastic instance needs a Snapshot Repository Backend. We’re going to create one directly on the root partition of our server in the dir /es_snapshots. We’re going to call the repository “log_archive”:

curl -XPUT 'http://localhost:9200/_snapshot/log_archive' -d '{ "type": "fs", "settings": { "location":"/es_snapshots", "compress":true}}'
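One thing that can trip up this step: Elasticsearch versions that enforce the path.repo setting (1.6 and later, as far as I know) refuse to register an fs repository unless its location is whitelisted in elasticsearch.yml. Here is a rough sketch of a pre-flight check. The helper name repo_path_allowed and the config path in the example are assumptions for illustration, and it is only a plain text match, so it will not understand multi-line YAML lists or whitelisted parent directories.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the given directory appears on a
# path.repo line of the given elasticsearch.yml (simple text match only).
repo_path_allowed() {
  local config_file="$1" repo_dir="$2"
  grep -E '^[[:space:]]*path\.repo[[:space:]]*:' "$config_file" | grep -qF "$repo_dir"
}

# Example (the config location is an assumption; adjust for your install):
# repo_path_allowed /etc/elasticsearch/elasticsearch.yml /es_snapshots \
#   || echo 'add path.repo: ["/es_snapshots"] to elasticsearch.yml and restart'
```

The same applies to the temporary directory used for restores later on: if path.repo is enforced, that location needs to be listed too.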

2) Our data retention policy states that we only keep 14 days’ worth of indexes for live querying, so now that we have a snapshot repository backend, we snapshot indexes older than 14 days into it. Snapshotting a single index looks like the following. Note the wait_for_completion flag, which tells the command to wait until the snapshot is done; this is useful for scripting:

curl -XPUT 'http://localhost:9200/_snapshot/log_archive/logstash-2016.01.01?wait_for_completion=true' -d '{ "indices":"logstash-2016.01.01", "ignore_unavailable":true, "include_global_state": false }'
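If you script this step, the index name has to be computed from the date. A minimal sketch, assuming GNU date and daily indexes named logstash-YYYY.MM.DD; the function names index_days_ago and snapshot_index are made up for this example.

```shell
#!/usr/bin/env bash
# Assumes GNU date (for -d) and daily indexes named logstash-YYYY.MM.DD.
ES_URL="http://localhost:9200"
REPO="log_archive"

# Print the name of the daily index from N days ago.
index_days_ago() {
  date -d "$1 days ago" +logstash-%Y.%m.%d
}

# Snapshot one index into $REPO, naming the snapshot after the index.
# Prints the JSON response from Elasticsearch.
snapshot_index() {
  local index="$1"
  # The Content-Type header is required by Elasticsearch 6.0 and later;
  # older versions simply ignore it.
  curl -s -XPUT "$ES_URL/_snapshot/$REPO/$index?wait_for_completion=true" \
    -H 'Content-Type: application/json' \
    -d "{ \"indices\": \"$index\", \"ignore_unavailable\": true, \"include_global_state\": false }"
}

# Example: snapshot the index that just fell out of the 14-day window.
# snapshot_index "$(index_days_ago 15)"
```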

3) Delete the index entirely from Elastic once the snapshot has successfully finished:

curl -XDELETE http://localhost:9200/logstash-2016.01.01
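Since the delete is irreversible, a script should probably check the snapshot response before running it. With wait_for_completion=true the response body includes a "state" field, so one hedged way to guard the delete is sketched below. The function names are hypothetical, and the check is a plain text match rather than real JSON parsing.

```shell
#!/usr/bin/env bash
ES_URL="http://localhost:9200"

# Succeeds only if the snapshot response JSON reports state SUCCESS.
snapshot_succeeded() {
  printf '%s' "$1" | grep -Eq '"state"[[:space:]]*:[[:space:]]*"SUCCESS"'
}

# Delete the index only when its snapshot response looks good.
delete_if_snapshotted() {
  local index="$1" response="$2"
  if snapshot_succeeded "$response"; then
    curl -s -XDELETE "$ES_URL/$index"
  else
    echo "snapshot of $index did not report SUCCESS; keeping index" >&2
    return 1
  fi
}

# Example: pass in the body returned by the snapshot request from step 2.
# delete_if_snapshotted logstash-2016.01.01 "$response"
```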

4) As your Elastic index collection grows, repeat steps 2 and 3 until the snapshot repository backend contains 7 snapshots. For our current index size, this totals around 80G. Once there are 7 snapshots, we tar up the snapshot repository directory with bzip2*:

tar cjf /tmp/es_snaphots.2016.01.01-2016-01.07.tar.bz2 /es_snapshots
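To decide when a batch is full, a script can ask Elasticsearch how many snapshots the repository currently holds. The sketch below counts them from the _snapshot/log_archive/_all response; count_snapshots and BATCH_SIZE are names invented for this example, and the counting is a crude text match (a JSON parser such as jq would be sturdier).

```shell
#!/usr/bin/env bash
ES_URL="http://localhost:9200"
REPO="log_archive"
BATCH_SIZE=7

# Count snapshots in a "_snapshot/<repo>/_all" JSON response by counting
# "snapshot":"<name>" fields.
count_snapshots() {
  printf '%s' "$1" | grep -o '"snapshot"[[:space:]]*:[[:space:]]*"[^"]*"' | wc -l | tr -d ' '
}

# Example:
# listing="$(curl -s "$ES_URL/_snapshot/$REPO/_all")"
# if [ "$(count_snapshots "$listing")" -ge "$BATCH_SIZE" ]; then
#   tar cjf /tmp/es_snaphots.2016.01.01-2016-01.07.tar.bz2 /es_snapshots
# fi
```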

*A little note on compression. In our snapshot repository settings above, we state that we want compression, but we can get a much better compression ratio if we compress the directory further using any of the popular compression algorithms. I chose bzip2 because it was the best trade-off between disk space saved and the time it took to compress and uncompress. In my testing, 7z offered the best disk space savings, but just wasn’t practical in terms of the length of time it took to compress and uncompress, and gzip was the fastest, yet didn’t save much disk space. This can be tuned to whatever works best for your process.
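If you want to repeat that comparison on your own data, something like the following could be a starting point. It is only a sketch: compare_compressors is a hypothetical helper, xz stands in for 7z here because it is more commonly installed, and tools that are missing are skipped.

```shell
#!/usr/bin/env bash
# Compress a sample file with each available tool and print the
# compressed size in bytes, one "tool size" pair per line.
compare_compressors() {
  local sample="$1" tool size
  for tool in gzip bzip2 xz; do
    if command -v "$tool" >/dev/null 2>&1; then
      size="$("$tool" -c "$sample" | wc -c | tr -d ' ')"
      echo "$tool $size"
    fi
  done
}

# Example: tar one snapshot batch without compression, then compare.
# tar cf /tmp/sample.tar /es_snapshots
# time compare_compressors /tmp/sample.tar
```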

5) Once our snapshot repository directory is converted into a tarball, we send it to our cheap storage solution. This can be done in many different ways, so I’ll leave it to you to decide which cheap storage solution works best for your use case. In our case, it’s a simple SCP command:

scp /tmp/es_snaphots.2016.01.01-2016-01.07.tar.bz2 [email protected]:/elastic/archive/

6) Once your storage has received the archive, remove the local copy:

rm -f /tmp/es_snaphots.2016.01.01-2016-01.07.tar.bz2
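Before running that rm in a script, it may be worth confirming the upload really arrived intact, for example by comparing checksums. A minimal sketch follows, assuming sha256sum is available locally. The helper names are hypothetical, and the remote half of the example assumes your storage lets you run commands over ssh, which an SFTP-only service may not (in that case you would have to download the file again or compare sizes).

```shell
#!/usr/bin/env bash
# Print only the SHA-256 hex digest of a local file.
local_sha256() {
  sha256sum "$1" | cut -d' ' -f1
}

# Succeeds if the local file's digest equals the digest string passed in.
checksum_matches() {
  [ "$(local_sha256 "$1")" = "$2" ]
}

# Example (REMOTE is a placeholder for your own user@host):
# archive=es_snaphots.2016.01.01-2016-01.07.tar.bz2
# remote_sum="$(ssh "$REMOTE" sha256sum "/elastic/archive/$archive" | cut -d' ' -f1)"
# checksum_matches "/tmp/$archive" "$remote_sum" && rm -f "/tmp/$archive"
```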

7) Now comes the magical/tricky/clunky part. In this step we remove the snapshot repository from Elastic. Doing this will leave the data untouched, and only remove the reference from Elastic:

curl -XDELETE http://localhost:9200/_snapshot/log_archive

8) Now that the repository is removed from Elastic, we want to clear out the already archived snapshots from the directory location:

rm -fr /es_snapshots/*

9) Now we re-add the repository:

curl -XPUT 'http://localhost:9200/_snapshot/log_archive' -d '{ "type": "fs", "settings": { "location":"/es_snapshots", "compress":true}}'

10) Now we have a clean snapshot repository, ready for the next batch.
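Steps 7 through 9 are destructive, so when scripting them it can help to rehearse with a dry run first. The sketch below wraps them in one function; run, reset_repo and the DRY_RUN variable are names invented for this example, and it uses find instead of a shell glob so that hidden files are removed as well.

```shell
#!/usr/bin/env bash
ES_URL="http://localhost:9200"
REPO="log_archive"
REPO_DIR="/es_snapshots"
DRY_RUN="${DRY_RUN:-1}"   # defaults to printing commands; set DRY_RUN=0 to execute

# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Steps 7-9: unregister the repository, empty its directory, register it again.
reset_repo() {
  run curl -s -XDELETE "$ES_URL/_snapshot/$REPO"
  # ${REPO_DIR:?} aborts if REPO_DIR is empty or unset, so find never runs against /.
  run find "${REPO_DIR:?}" -mindepth 1 -delete
  run curl -s -XPUT "$ES_URL/_snapshot/$REPO" \
    -H 'Content-Type: application/json' \
    -d "{ \"type\": \"fs\", \"settings\": { \"location\": \"$REPO_DIR\", \"compress\": true } }"
}

# Example: print the three commands without touching anything.
# reset_repo
```

Only call this after the tarball has been uploaded and verified, since the find step deletes every snapshot file in the directory.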

11) If we ever have to restore one of these archived indexes into our Elastic instance, we uncompress it into a location:

cd /tmp; tar xjf es_snaphots.2016.01.01-2016-01.07.tar.bz2

***Make sure you don’t uncompress this into your current /es_snapshots directory. This should be a temporary location.

Then create a new snapshot repo and point it at the new uncompressed dir:

curl -XPUT 'http://localhost:9200/_snapshot/log_archive_restore' -d '{ "type": "fs", "settings": { "location":"/tmp/es_snapshots", "compress":true}}'

And restore the index you need:

curl -XPOST http://localhost:9200/_snapshot/log_archive_restore/logstash-2016.01.05/_restore -d '{ "ignore_unavailable": "true","include_global_state": false}'

An important note about restoring: if you look at the documentation, you’ll notice you can include the names of multiple indices in a restore. In our case, each snapshot only ever contains one index, so restoring a snapshot will only ever restore a single index into Elastic.
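Because each snapshot is named after the index it holds, listing the snapshots in the temporary repository shows which indexes an archive can restore. A rough sketch of that lookup, plus the cleanup afterwards, is below. snapshot_names is a hypothetical helper that pulls names out of the _snapshot/log_archive_restore/_all response with a plain text match.

```shell
#!/usr/bin/env bash
ES_URL="http://localhost:9200"
RESTORE_REPO="log_archive_restore"

# Extract snapshot names, one per line, from a "_snapshot/<repo>/_all"
# JSON response.
snapshot_names() {
  printf '%s' "$1" \
    | grep -o '"snapshot"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | sed 's/.*"\([^"]*\)"$/\1/'
}

# Example: see what the extracted archive contains.
# snapshot_names "$(curl -s "$ES_URL/_snapshot/$RESTORE_REPO/_all")"
#
# Once the restore has finished, unregister the temporary repository
# and remove the extracted files:
# curl -s -XDELETE "$ES_URL/_snapshot/$RESTORE_REPO"
# rm -rf /tmp/es_snapshots
```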

Why Remove Then Re-Add the Snapshot Repo

So at this point you may be scratching your head, wondering why we remove and then re-add the snapshot repository for archiving. The thinking here is this:

– We needed to figure out a way to decrease disk space required for our snapshots, while also giving us the ability to store them in ‘chunks’ or ‘batches’ on a cheap storage solution of our choice.

– The fact that we have “batches” of log archives allows for easier recovery of individual indexes. Also, the fact that we are only storing one index per snapshot allows us more visibility as to which snapshots include which indices, simply by doing a list of directory contents.

– This method allows us to control the size of our ‘batches’, based on our requirements.

While this process may seem a bit unorthodox, it has allowed us to save costs on the storage of our index archive as well as enjoy the benefits of using Elasticsearch.

WaySmarterThanYou

I’ve been looking for exactly this – THANK YOU!!! I’ve been testing graylog and it seems great, but I couldn’t figure out how to archive without buying the $$$ version. Snapshots of indices were not working as expected and elasticsearch had limited docs on it. Going to try this out over the next week or so – I’ll let you know how it works for us!!! Thanks again!!!
- WaySmarterThanYou
  
  Took a bit of tweaking on some commands for my installation – but overall spot on. Thanks again.
Cryptoanarchist

Definitely check out Curator 5 as it can perform snapshot deletes as well as index deletes after it does your snapshots for you. It can now also use the reindex api to manage roll up work for you too.
