Intro to Elasticsearch’s Awesome Aggregations

Callum Gavin | Tech Stuff

We’ve become big fans of Elasticsearch since we added it to the Raygun backend stack, as it accepts the huge volumes of data you guys throw at us with minimal fuss, then allows us to perform powerful queries on your behalf. This gives you the insight you need to fix bugs quickly. Elasticsearch has been a core part of our infrastructure since the pre-1.0 days, and the API has developed notably since then.

I’m particularly fond of the new Elasticsearch Aggregations API which replaces the previous Facets implementation (which is now deprecated). Both allow you to perform analytical operations on a subset of your data, but the key advantage of Aggregations is that it allows nesting – by building up a tree of aggregations, at each level you can access data on the current scope as you desire. At a basic level, this allows you to perform SQL-like operations, but provides many more analytical and statistical results from your data.

Keep reading for four reasons why Elasticsearch and its Aggregations features can make your life easier as a developer when dealing with datasets in a NoSQL context.

Searching

Let’s consider our problem domain to be storing and searching a set of newspaper articles. We’ll start by assuming there’s an Elasticsearch instance set up locally with a mapping created and some documents indexed against it – and there is an index for each day. A trivial search query for this would look like this:

POST http://localhost:9200/20141114/article/_search

That will return the number of document hits in that index. Passing in a defined size will return that many hits, each with its internal ID, the index it came from, and its mapping type. Say we’ve also stored some additional fields with the Mapping API. Those stored fields represent properties of the indexed documents, so for articles these could be ‘headline’, ‘author’, ‘dateWritten’ and so on. We can pull this data out using various features of the Aggregations API, which we’ll now take a look at.
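To make the size parameter concrete, here is a minimal Python sketch that builds a search body and pulls the ID, index and mapping type out of each hit. It sends nothing over the wire: the sample response is hand-written for illustration (the IDs are made up), and the helper names are my own rather than part of any client library.

```python
import json

# URL from the example above; this sketch never actually calls it.
SEARCH_URL = "http://localhost:9200/20141114/article/_search"

def build_search_body(size):
    """Return a match-all search body asking for at most `size` hits."""
    return {"size": size, "query": {"match_all": {}}}

def summarise_hits(response):
    """Extract (id, index, mapping type) for each hit in a search response."""
    return [(h["_id"], h["_index"], h["_type"]) for h in response["hits"]["hits"]]

# Hand-written sample response; the IDs are invented for illustration.
sample_response = {
    "hits": {
        "total": 2,
        "hits": [
            {"_index": "20141114", "_type": "article", "_id": "1"},
            {"_index": "20141114", "_type": "article", "_id": "2"},
        ],
    }
}

print(json.dumps(build_search_body(2), indent=2))
for doc_id, index, doc_type in summarise_hits(sample_response):
    print(doc_id, index, doc_type)
```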

Filtering

Wanting to return only a subset of documents is a really common querying operation. With Aggregations, it looks like this:
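A minimal sketch of such a request body, built in Python so it can be printed as JSON and POSTed to the _search URL from earlier. The ‘ArticlesAfterTime’ label and the ‘dateWritten’ field come from this post; the exact cut-off date (midnight on 1 November 2014, guessed from the index name) and the choice of "gte" are assumptions for illustration.

```python
import json

def articles_after(date_iso):
    """Build a search body with a filter aggregation on 'dateWritten'."""
    return {
        # size 0: return aggregation results without the individual hits
        "size": 0,
        "aggs": {
            # 'ArticlesAfterTime' is a label we choose; it reappears in the response
            "ArticlesAfterTime": {
                "filter": {
                    "range": {"dateWritten": {"gte": date_iso}}
                }
            }
        },
    }

# Assumed cut-off date, for illustration only.
body = articles_after("2014-11-01T00:00:00")
print(json.dumps(body, indent=2))
```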

At the top level of the response there will be an “aggregations” object, which will contain an “ArticlesAfterTime” object beneath it. This is scoped to only the documents which have their ‘dateWritten’ field at a later date than the one specified above, and includes a doc_count for them. You’ll also notice we’ve set a top-level ‘size’ of 0, which leaves the individual hits out of the response so that only the aggregation results come back. On a Terms aggregation in the 1.x releases, a ‘size’ of 0 instead asks for every bucket, so you may also need to set this on each inner aggregation too.

Here’s the most awesome part of Aggregations – we can now nest additional aggregations beneath the one above, which will be scoped to only include the results from that level.

So now we can do stuff with the articles written after the start of Nov 1st. How about grouping?

Grouping

Grouping results by a certain field is another really powerful and common operation. For our articles, say we want to group them by a ‘type’ field. In SQL dialects this might look something like ‘SELECT type, COUNT(*) FROM foo GROUP BY type’. In Elasticsearch land, we can write it like this:
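A minimal sketch of the nested request, again as a Python dict that serialises to the JSON body. ‘ArticlesAfterTime’, ‘GroupByType’ and the ‘type’ field are the names used in this post; the cut-off date is an assumed value.

```python
import json

def grouped_by_type(date_iso):
    """Filter aggregation with a Terms aggregation nested beneath it."""
    return {
        "size": 0,
        "aggs": {
            "ArticlesAfterTime": {
                "filter": {"range": {"dateWritten": {"gte": date_iso}}},
                # Nested "aggs": only sees documents that passed the filter above
                "aggs": {
                    "GroupByType": {"terms": {"field": "type"}}
                },
            }
        },
    }

# Assumed cut-off date, for illustration only.
body = grouped_by_type("2014-11-01T00:00:00")
print(json.dumps(body, indent=2))
```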

Note we’ve nested a new “aggs” object beneath ‘ArticlesAfterTime’ (which is a label we define, so it can be referred to later). This is a Terms aggregation, and performs the grouping by the field we specify (‘type’). Say we’ve got a bunch of types like ‘sports’, ‘news’, ‘editorials’ – the response will look something like this:
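A sketch of the shape you can expect, written as a Python dict together with a small helper that walks the buckets. The counts are made up purely for illustration, and counts_by_type is a hypothetical helper name, not part of any Elasticsearch client.

```python
# Hand-written sample response; every count below is invented.
sample_response = {
    "aggregations": {
        "ArticlesAfterTime": {
            "doc_count": 42,
            "GroupByType": {
                "buckets": [
                    {"key": "news", "doc_count": 20},
                    {"key": "sports", "doc_count": 15},
                    {"key": "editorials", "doc_count": 7},
                ]
            },
        }
    }
}

def counts_by_type(response):
    """Map each bucket key to its doc_count."""
    buckets = response["aggregations"]["ArticlesAfterTime"]["GroupByType"]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}

for article_type, count in counts_by_type(sample_response).items():
    print(article_type, count)
```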

As you can see, the buckets can be iterated, giving you the doc count for each.

Distinct counts on a field

Terms aggregations are great in that they give you a count of documents for each value in a certain context. But say you want a count of the unique values themselves. If you want to count up the occurrences of a field on documents, the Value Count aggregation will produce the total. If you want to do a DISTINCT count, however, look to the Cardinality aggregation. Say we want to count the unique words across the articles’ ‘blurb’ fields, which would look something like this (previous nested query omitted for clarity, but this would go beneath GroupByType):
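A minimal sketch of that fragment as a Python dict. The labels ‘BlurbValueCount’ and ‘UniqueBlurbTerms’ are names I’ve made up for illustration; "value_count" and "cardinality" are the aggregation types themselves.

```python
import json

def blurb_counts():
    """Fragment intended to be merged into the GroupByType aggregation."""
    return {
        "aggs": {
            # Total number of 'blurb' values seen in each bucket
            "BlurbValueCount": {"value_count": {"field": "blurb"}},
            # Approximate number of distinct 'blurb' values in each bucket
            "UniqueBlurbTerms": {"cardinality": {"field": "blurb"}},
        }
    }

# Attach the fragment next to the "terms" definition of GroupByType.
group_by_type = {"terms": {"field": "type"}}
group_by_type.update(blurb_counts())
print(json.dumps({"GroupByType": group_by_type}, indent=2))
```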

This generates an approximate count, which can be tuned by setting the precision_threshold parameter (check the docs). This allows you to trade memory for accuracy; internally it uses the HyperLogLog algorithm, which my colleague Alex has detailed.
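As a rough sketch, the threshold sits alongside the field inside the cardinality definition. The value of 1000 below is an arbitrary example rather than a recommendation, and ‘UniqueBlurbTerms’ is a made-up label.

```python
import json

def unique_blurb_terms(precision_threshold=None):
    """Cardinality aggregation on 'blurb', optionally with a precision threshold."""
    cardinality = {"field": "blurb"}
    if precision_threshold is not None:
        # Higher values cost more memory but keep counts accurate for longer
        cardinality["precision_threshold"] = precision_threshold
    return {"UniqueBlurbTerms": {"cardinality": cardinality}}

# 1000 is an arbitrary illustrative value.
print(json.dumps(unique_blurb_terms(1000), indent=2))
```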

Found this Elasticsearch post informative?

Raygun is big on helping developers write the best code they can. If you haven’t already checked out our world-class error tracking solution, grab your free trial here.
