Using the Druid GroupBy V2 engine for real-time analytics

There’s been a lot of interest in alternative database solutions recently, following the initial generation of NoSQL databases a few years ago. While building our Real User Monitoring product, Raygun Pulse, we initially turned to PostgreSQL for its rich set of built-in analytics functions on top of the full power of an RDBMS and SQL (joins and all). Further down the track, however, despite opting for multi-tenancy, Postgres wasn’t scaling for the relatively complex queries Raygun Pulse’s business logic needed. Having seen similar stories from others, including teams who had invested in other technologies such as the Hadoop ecosystem or key-value stores, we began researching more purpose-built analytics databases. Druid is the one we settled on, as it offers several properties one wants in an analytics database.

Introduction to Druid for analytics

Firstly, Druid claims sub-second OLAP queries, real-time streaming ingestion, and the ability to scale to the petabyte range. I can vouch for the first two. It achieves this by dropping the full ad-hoc queries and joins that SQL offers, and instead moving grouping and rollups to ingestion time: you specify the desired string dimension and numeric metric columns up front, along with the ‘aggregators’ you’ll need at query time. In exchange, Druid provides ‘slice ‘n dice’ queries that return in ~1s, grouping on as many dimensions as the user requires and selecting whatever metrics they desire, with the chosen aggregators. In this way it’s ideally tuned for exploratory analytics on datasets over time, and more importantly it removes several bottlenecks that appear once you need to scale beyond a single machine. It uses ZooKeeper to operate as a distributed database, enabling large cluster sizes (known to run up to many thousands of cores).
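
To illustrate what “declared up front” means, here’s a minimal sketch of the relevant parts of an ingestion spec’s dataSchema. The ‘pageviews’ datasource and the dimension/metric names are made up for this example:

"dataSchema": {
  "dataSource": "pageviews",
  "parser": {
    "type": "string",
    "parseSpec": {
      "format": "json",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["country", "browser"] }
    }
  },
  "metricsSpec": [
    { "type": "count", "name": "viewcount" },
    { "type": "doubleSum", "name": "loadtimeSum", "fieldName": "loadtime" }
  ]
}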

The core queries that Druid has surfaced for the last few years have been Timeseries queries – that is, one or many metrics plotted over time, grouped by some interval – and TopN, which groups on one dimension, e.g. ‘country’, ordered by one metric, e.g. ‘loadtime’ or ‘viewcount’. TopN offers guaranteed accuracy for the first 900 results, descending, which covers the common use case. The ‘N’ can be tuned higher, but results become inaccurate beyond that point.
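
For example, a TopN query for the ten countries with the highest view counts over November might look like the following (the datasource and column names are hypothetical, carried over from the sketch above):

{
  "queryType": "topN",
  "dataSource": "pageviews",
  "intervals": ["2016-11-01/2016-12-01"],
  "granularity": "all",
  "dimension": "country",
  "metric": "viewcount",
  "threshold": 10,
  "aggregations": [
    { "type": "longSum", "name": "viewcount", "fieldName": "viewcount" }
  ]
}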

Druid also offers a richer query for that use case: GroupBy. GroupBy supports grouping on arbitrary dimensions and ordering by as many metrics as you want, in whatever order. As you’d expect, it’s the equivalent of SQL’s GROUP BY. It also features nested inner GroupBys and a HAVING clause for filtering results once they’re grouped, so it’s much more flexible. The trade-off is that, up until now, there’s been a maximum (raw) row count it can consider, and queries that hit that limit fail or cause an OOM.
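
Here’s a sketch of a comparable GroupBy query, grouping on two dimensions and filtering the grouped results with a HAVING clause (again with hypothetical names):

{
  "queryType": "groupBy",
  "dataSource": "pageviews",
  "intervals": ["2016-11-01/2016-12-01"],
  "granularity": "day",
  "dimensions": ["country", "browser"],
  "aggregations": [
    { "type": "longSum", "name": "viewcount", "fieldName": "viewcount" }
  ],
  "having": { "type": "greaterThan", "aggregation": "viewcount", "value": 100 },
  "limitSpec": {
    "type": "default",
    "limit": 50,
    "columns": [{ "dimension": "viewcount", "direction": "descending" }]
  }
}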

The new Druid GroupBy V2 engine

Fortunately, a new version of Druid has been released, v0.9.2. This contains a completely rewritten GroupBy engine that removes the above limitations and promises a large speed-up over GroupBy V1. The drawback of V1 was that it was an on-heap implementation and didn’t have the same sub-second guarantees the mainstream TopN and Timeseries query types had. While it provided a richer querying model than plain TopN, it also degraded faster, so it could realistically only be used for internal offline analytics.

The new Druid GroupBy V2 engine removes those limitations with a new off-heap implementation. This brings GroupBy queries performance similar to what TopN/Timeseries enjoy, allowing GroupBy to also be used for user-facing exploratory analytics (OLAP-style). Multi-dimension grouping queries can now be performed with Druid in production for user-facing applications, on scaled datasets (Druid is known to run on clusters with 100+ PB of data), with interactive query response times. This goes beyond the data sizes, response times, horizontal scalability and developer experience that current Hadoop/Cassandra/NoSQL or more traditional RDBMS systems can provide, while surfacing a querying model that is just as rich with the new grouping logic.

By default Druid will continue to use the previous V1 implementation, but V2 is expected to become the default in a future release. In the next section I’ll look at what’s required for existing Druid installations to use the new engine with GroupBy queries: a couple of config changes, along with a new option set on the context object for each GroupBy query. At the end of the post I’ll give an overview of our experience working with Druid, and what to look out for when spinning up a Druid cluster of your own.

Using the Druid GroupBy V2 engine

Firstly, it pays to refer to the documentation for the GroupBy V2 engine. As it states, to use the engine you’ll need to set a new property in your Druid service configs, druid.processing.numMergeBuffers, which must be a positive integer. This applies at least to Historical nodes, Broker node(s) and MiddleManagers, so you can set it in common.runtime.properties (the MiddleManagers need less RAM, though). This allocates fixed-size memory buffers for the grouping number crunching; that number is multiplied by druid.processing.buffer.sizeBytes, so you will need that much RAM available in the JVM process (set in jvm.config for the preceding services). The total calculation is:

druid.processing.buffer.sizeBytes * (druid.processing.numMergeBuffers + druid.processing.numThreads + 1)

The result of that must be less than or equal to -XX:MaxDirectMemorySize in your jvm.config for the above services. If there isn’t enough RAM the process won’t start (it will log the numbers in question to stdout). You will also see the buffers being allocated during startup. You can begin testing by setting numMergeBuffers to 1, increasing it to around 10+ (depending on how much RAM you have and how often you’ll be performing these queries).
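
To make that arithmetic concrete, here’s a hypothetical set of values (example numbers only – tune these to your own hardware and query load):

# common.runtime.properties (example values)
druid.processing.buffer.sizeBytes=536870912
druid.processing.numThreads=7
druid.processing.numMergeBuffers=2

# 536870912 * (2 + 7 + 1) = ~5 GB, so jvm.config needs at least:
-XX:MaxDirectMemorySize=5g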

The second step is to enable the V2 engine in the queries, by setting it on the context object:

"context": {
  "groupByStrategy": "v2"
}

When the above is set on a "queryType": "groupBy" query, it will use the new engine – that’s it! You can also set this as the default for all GroupBy queries by setting druid.query.groupBy.defaultStrategy to “v2”, without needing to set the query context.
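
That cluster-wide default is a one-line addition to your runtime properties:

druid.query.groupBy.defaultStrategy=v2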

There is another important property: maxOnDiskStorage, which can be set in either the context or the configs. It controls what happens if the merging buffer runs out of memory during a query. By default it is set to 0, which means large queries will fail if enough memory isn’t available. If it is set to a positive integer, disk spilling is enabled; however, even SSDs are much too slow here, and Druid’s interactive analytics guarantees can only really be met when all segments are in memory and all calculations are performed there. You’ll see a large linear slow-down if a lot of disk access is needed during the GroupBy merge phase. Thus, if possible, keep maxOnDiskStorage at 0 and increase numMergeBuffers and/or processing.buffer.sizeBytes until queries over the largest time range your users/business logic require return successfully.
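
If you do want a safety valve rather than failed queries, it can be set per query on the context alongside the strategy; the 10 GB cap below is purely illustrative:

"context": {
  "groupByStrategy": "v2",
  "maxOnDiskStorage": 10000000000
}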

Caveats and speed-ups

After enabling the above you should notice that GroupBy queries are usable. As the docs note, TopN queries are still preferable and have the edge on performance, and they should be used if your business requirements can be fulfilled by them. If they can’t, however, V2 GroupBys are now an option, especially if you can fit them into memory. If you can’t, you may need a larger cluster with more memory available, to limit the number of aggregations/post-aggregations used within the GroupBy queries, or to limit the interval time range considered.

A key point to note with the various queries is that they tend to degrade linearly, by some factor. For instance, TopN and Timeseries are efficient and will mostly hover around the sub-second/second mark when using long/double aggregations and post-aggregations. When using more exotic aggregations, such as the Approximate Histogram or those from the DataSketches extension, the linear degradation may extend to many seconds, depending on segment size, segment count, available memory, and whether the nodes need to hit the disk. This is highly specific to the particular cluster setup and data, so benchmarking and adjusting in production is a key requirement of running a performant Druid cluster.
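
For reference, these heavier aggregators slot into a query’s aggregation lists like any other. A sketch, assuming the DataSketches and Approximate Histogram extensions are loaded and using hypothetical column names:

"aggregations": [
  { "type": "thetaSketch", "name": "unique_users", "fieldName": "user_id" },
  { "type": "approxHistogramFold", "name": "loadtime_histogram", "fieldName": "loadtime_histogram" }
],
"postAggregations": [
  { "type": "quantile", "name": "loadtime_p95", "fieldName": "loadtime_histogram", "probability": 0.95 }
]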

Generally though, we saw the advertised 2x-5x speedup with the GroupBy V2 engine vs V1 – towards the lower end for some particularly heavy queries. Fortunately the majority were able to be refactored into TopN queries, which showed an additional speedup and met our interactive real-time requirements.

Other 0.9.2 highlights

Full release notes for 0.9.2 are available here. Noteworthy ones are:

  • New stats aggregators are available to calculate variance and standard deviation. These live in the druid-stats core extension, and good documentation for them is available here. Calculation of variance/std. dev is available at ingestion time for numeric columns, and uses Apache Hive’s implementation. This is a particularly nice feature to have, as it rounds out the median/quantile aggregators available in the Approximate Histogram/DataSketches extensions (see the sketch after this list).
  • Speed-ups for DataSketches by 80%, and for HyperUniques by 20-30% – nice wins for retention analysis queries, and the all-important cardinality queries, respectively.
  • The ability to filter on long metrics, for example the required time metric (column), inside queries. Dimensions already have rich query-time filtering, for the use case of removing massive amounts of rows predicated on the value of a dimension for each row. This is now being extended to metrics, which previously weren’t filterable. In future this should be available for double-type metrics, allowing a subset of values to be queried.
  • Ability to disable rollup – I haven’t looked into this too much, but it addresses the use case where input rows shouldn’t be aggregated at ingestion, so each input row produces one output row when queried. This is available as a flag on granularitySpec.
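
As promised above, here’s a minimal sketch of the new stats aggregators in a query, assuming the druid-stats extension is loaded and a hypothetical ‘loadtime’ column:

"aggregations": [
  { "type": "variance", "name": "loadtime_variance", "fieldName": "loadtime" }
],
"postAggregations": [
  { "type": "stddev", "name": "loadtime_stddev", "fieldName": "loadtime_variance" }
]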

Performance tips for running and optimizing a Druid cluster

As mentioned above, having enough memory on the compute nodes (those running Historicals/Realtimes/Brokers), and a large enough cluster to memory-map all queryable segments, is critical to sub-second query performance. This is part of Druid’s current design: while in its early days it was a purely in-memory database, it now memory-maps the segments where data is stored. This allows scaling beyond what physical memory is available in the cluster, for the case where the data requirements are large. On the other hand, if the dataset can fit in memory across the available boxes in the cluster, Druid behaves like an in-memory database and is highly performant.

Thus Druid offers tunable knobs to choose how much data you want, or can have, in memory, depending on your data volumes and requirements. The main knob is druid.server.maxSize, which is documented in the Historical config docs and the Performance FAQ. This setting is quite critical to good performance when querying the data on your Historical nodes. The key thing is to set it to less than the maximum amount of physical memory available on the server. If it is set greater than the available memory, more segments will be assigned than can fit in memory, and the Historical node will attempt to load them from disk, causing massive amounts of paging and queries that return very slowly, e.g. 10-30s+.

The calculation for druid.server.maxSize should take into account the resources available to the JVM for the Historical service: total memory - (heap size + direct memory + JVM overhead). Using the sample Production Cluster config, which uses r3.8xlarges for the Historical nodes, heap size should be set to 8g, direct memory set per the calculation under ‘Using the Druid GroupBy V2 engine’ above, and a small amount reserved for JVM overhead. The remaining physical memory is the amount left for segments, and you’ll need that much memory available across all your Historical boxes for all the segments you want available for querying (as shown by the cluster size in the Coordinator Console).
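
As a rough worked example (illustrative numbers only – here assuming 1 GB processing buffers, 31 processing threads and 2 merge buffers on a 244 GB r3.8xlarge):

  total RAM (r3.8xlarge)               ≈ 244 GB
  heap (-Xmx)                          =   8 GB
  direct memory: 1 GB * (31 + 2 + 1)   =  34 GB
  JVM overhead (rough allowance)       ≈   2 GB
  leaves for druid.server.maxSize      ≈ 200 GB (~214748364800 bytes)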

About the Druid project

Druid was conceived at and spun out of Metamarkets, who run it for analytics in the adtech space. The project itself is open sourced under the Apache license, and has received, and continues to receive, frequent releases and fixes. Like many growing open source database projects there’s a lot of interest in it, and the project is currently working to scale the merging of PRs for fixes and new features, taking inspiration from existing successful open source governance models. The maintainers also don’t mind adopting known-good implementations of logic and algorithms from other open source projects, which increases code quality. I believe the core is solid and worth investigating if you need an analytics solution that will scale horizontally up to reasonably large ingestion, storage and query workloads, while offering tunable performance guarantees and reasonable costs running on cloud VMs. The comparisons with alternative DB solutions on the landing page are, to my mind, quite accurate and offer a good overview of what Druid can provide.

Trade-offs and pitfalls to note

Like all databases it isn’t perfect, and there are trade-offs to be made. While it’s reasonably easy to get a toy single-machine setup running with the sample Wikipedia dataset, the procedure to stand up a production cluster isn’t trivial (vs. something like ElasticSearch, for instance), and tuning ingestion and querying can involve a chunk of time until you figure out where the knobs are. Depending on the complexity of your model you may also spend a while getting your ingestion spec configured just so. Ingestion in particular benefits from having a preprocessing pipeline, as you may need to merge or enrich your datastream with other sources; having message queue infrastructure available, or some of the frameworks from the Hadoop ecosystem, makes this easier. Tuning the JVM memory configs and runtime configs can also be slightly obscure until you understand your data and query loads, and the trade-offs you wish to make regarding the hardware you have available vs. how much cash you want to spend.

The tooling (consoles) and APIs available for managing the cluster/segments are good; however, dealing with configuration and deployments is a bit manual and may need extra tooling built to smooth your workflow. Not much is prescribed by the release .tar.gz, and you’ll need to add your configs as well as any extensions that are required. This is so that deployment can be auto-scaled, for instance with AWS, but I wouldn’t advise setting that up immediately. Fortunately Druid’s documentation is excellent and quite clear (compared to how ElasticSearch’s docs used to be – although it has been a while since I’ve been deep in those). You will need to read and understand a significant chunk of it, but the info is available, supplemented by the Druid Google Groups. The production cluster config may also be overkill for your ingestion and querying needs – definitely experiment with hardware here.

There is a bunch of interesting work continuing in the project, including a SQL querying layer, which may increase adoption. I don’t mind the JSON querying API as it stands today, though – it is pretty well thought out and logical, and there are also client libraries available for various languages.

Give Druid a go

To sum up, if you’re doing analytics and you’re new to Druid, give it a look – check out the Quickstart by batch ingesting the sample Wikipedia dataset. If you’ve got a cluster up and running, upgrade your cluster and see if the new Druid GroupBy V2 engine queries work for you, as there may be a significant speed-up waiting.

Thanks for reading!

Understand the root cause of errors and crashes in your software applications in seconds with Raygun Crash Reporting. Take a free 14 day trial and see what Raygun can do for your software team, or book a demo and get all your questions answered personally by an experienced member of our team here.