How our tiger team reduced SQL query latency by 300% using automation

By Matt Fleming | Posted Oct 10, 2019

This is a guest post by Matt Fleming of Code Blueprint. Matt is a Senior Performance Engineer at SUSE.

Some development problems are too complex, some timelines too tight, and some projects too greenfield for established teams to tackle. When you need to create a new team of developers for an ambitious project, the venerable cross-functional or tiger team provides the perfect model for bringing a ragtag crew together to achieve a shared goal.

I had the honor of leading such a team at SUSE. The team’s objective was to optimize the performance of in-memory databases running in a virtualized environment. We were given a test suite with over 500 SQL performance tests and challenged to make sure the majority of results were within 12% of the bare-metal score. When we started out, virtualized workload performance was over 50% worse than bare metal. Within six months, we’d achieved our 12% target by carefully tuning and optimizing our system configurations.
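To make the target concrete, here is a minimal sketch of how such a goal could be checked. It assumes each test's score is an elapsed time in seconds (lower is better), which the project description does not confirm; the test names, timings, and function names are all made up for illustration.

```python
def relative_slowdown(bare_metal_s, virtualized_s):
    """Fractional slowdown of the virtualized run relative to bare metal.

    Assumes both scores are elapsed times, so a lower number is better.
    """
    return (virtualized_s - bare_metal_s) / bare_metal_s


def fraction_within_target(results, target=0.12):
    """Share of tests whose slowdown is at or below the target.

    `results` maps a test name to (bare_metal_seconds, virtualized_seconds).
    """
    within = sum(
        1 for bare, virt in results.values()
        if relative_slowdown(bare, virt) <= target
    )
    return within / len(results)


# Hypothetical timings for three tests, not real project data.
results = {
    "q001": (1.00, 1.08),  # 8% slower: within target
    "q002": (2.00, 2.20),  # 10% slower: within target
    "q003": (0.50, 0.80),  # 60% slower: misses target
}

share = fraction_within_target(results)
print(f"{share:.0%} of tests are within 12% of bare metal")
```

With a check like this, "the majority of results" becomes a single number the whole team can watch.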

Improving SQL performance

Though the team was made up of experts from a variety of areas, the real secret to the project’s success was using automation and monitoring to make sure everyone was analyzing performance the same way.

That’s a lesson we had to learn the hard way. When we started out, each of us ran tests on our own machines using a common set of testing scripts. The scripts took care of running the tests, but not configuring the machine, monitoring performance metrics, or comparing results. Problems quickly sprang up when we shared our tunings and optimizations with the rest of the team: no one could reproduce anyone else’s results, and the familiar line “it works on my machine” didn’t help.

This happened time after time until eventually we agreed to take a step back and implement automation for everything from configuring the machine to figuring out whether performance improved or not. Implementing automation resulted in a step-change in the project’s progress and in our team’s morale.

Below are the lessons we learned on our way to becoming a high-performance team.

1. Automation codifies processes

Though each individual contributor in a tiger team knows their stuff, quickly establishing processes and best practices is the key to making progress in the same direction. Using automation for deploying test machines, running workloads, and comparing results ensures that everyone is solving the same problem.

If you need to bring in additional team members after the team has been formed, having automation makes the onboarding smoother because they can focus on bringing their unique perspective and experience to solving the immediate problem and not, for example, figuring out the best way to use the 95th or 99th percentile when comparing latency benchmark results.
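As an illustration of the kind of decision automation can settle once for everyone, the sketch below fixes a single percentile definition (nearest rank) and uses it for every tail-latency comparison. The function names and sample data are hypothetical, and nearest rank is just one reasonable choice; the point is that the team picks one and encodes it.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def compare_tail_latency(baseline_ms, candidate_ms, percentiles=(95, 99)):
    """Return {p: (baseline, candidate, relative_change)} for each percentile.

    A negative relative change means the candidate has lower tail latency.
    """
    report = {}
    for p in percentiles:
        base = percentile(baseline_ms, p)
        cand = percentile(candidate_ms, p)
        report[p] = (base, cand, (cand - base) / base)
    return report


# Made-up latencies: the candidate is uniformly 10% faster.
baseline = list(range(1, 101))            # 1 ms .. 100 ms
candidate = [x * 0.9 for x in baseline]

report = compare_tail_latency(baseline, candidate)
for p, (base, cand, change) in report.items():
    print(f"p{p}: {base:.1f} ms -> {cand:.1f} ms ({change:+.1%})")
```

A new team member can run the same comparison on day one without having to rediscover which percentile method the team agreed on.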

2. Computers are better at analyzing numbers

Humans are notoriously bad at manually comparing performance results. By automating the comparison, not only do you save your team from this very taxing job but you also benefit from more accurate results. During the very early stages of our project, we chased optimizations that didn’t actually affect the performance of the tests, something we only discovered once we automated the way we compared results.

A secondary and more emotional benefit is that it lowers the barrier to experimentation. Lots of optimization and tuning work is part experience, part intuition, and part flat-out guesswork. If your team is free to experiment, preferably with quick feedback, then that will help them reject unworkable ideas sooner and get to their target faster.
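One way an automated comparison can stop a team from chasing phantom optimizations is to treat any difference smaller than the run-to-run noise as no change at all. The sketch below is a simple heuristic under that assumption, not a rigorous statistical test and not necessarily the method our tooling used; the function name, the `noise_factor` parameter, and the timings are hypothetical.

```python
import statistics


def classify_change(baseline_runs, candidate_runs, noise_factor=2.0):
    """Classify a tuning as 'improved', 'regressed', or 'no measurable change'.

    Each argument is a list of elapsed times (lower is better) from repeated
    runs of the same test; each list needs at least two runs so a standard
    deviation can be computed.
    """
    delta = statistics.mean(candidate_runs) - statistics.mean(baseline_runs)
    # Use the noisier of the two configurations as the noise estimate.
    noise = max(statistics.stdev(baseline_runs),
                statistics.stdev(candidate_runs))
    if abs(delta) <= noise_factor * noise:
        return "no measurable change"
    return "improved" if delta < 0 else "regressed"


# Made-up timings in seconds.
baseline = [10.1, 9.9, 10.0, 10.2, 9.8]
tweak_a = [9.9, 10.0, 9.8, 10.1, 9.7]   # 1% faster on average, inside the noise
tweak_b = [9.0, 9.1, 8.9, 9.2, 8.8]     # 10% faster, well outside the noise

print("tweak A:", classify_change(baseline, tweak_a))
print("tweak B:", classify_change(baseline, tweak_b))
```

Because the verdict comes back in seconds, an idea that lands inside the noise can be dropped quickly and without argument.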

3. Use monitoring to catch anomalies

One of the unique things about performance testing is that you’ll often see one-off problems and anomalies that you’re later unable to reproduce. A classic example of this is unusual latency spikes. When you’re running an optimization project, many things are changing rapidly and it’s not always possible to manually track which change led to a particular problem.

With automated performance monitoring, you’ll actually be able to identify when these conditions occur and capture some of the system characteristics at the same time. Debugging these issues then becomes a team-wide task since everyone can share the same data.
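A minimal sketch of the idea, assuming latency samples arrive as a simple sequence: flag any sample that is several times the median of the preceding window, and record a system snapshot at that moment. The window size, the threshold factor, and the `system_snapshot` helper are illustrative assumptions; a real monitor would capture far more than the load average.

```python
import os
import statistics
import time
from collections import deque


def system_snapshot():
    """Hypothetical helper: capture a little system state at spike time."""
    # os.getloadavg() only exists on Unix-like systems.
    load = os.getloadavg() if hasattr(os, "getloadavg") else None
    return {"time": time.time(), "loadavg": load}


def detect_spikes(latencies_ms, window=5, factor=3.0, snapshot=system_snapshot):
    """Return one record per sample exceeding factor x the rolling median."""
    recent = deque(maxlen=window)  # the last `window` samples seen
    spikes = []
    for index, value in enumerate(latencies_ms):
        if len(recent) == window and value > factor * statistics.median(recent):
            spikes.append(
                {"index": index, "latency_ms": value, "system": snapshot()}
            )
        # The median is robust, so a lone spike in the window barely moves it.
        recent.append(value)
    return spikes


# Made-up latency trace with a single spike at position 6.
trace = [10, 11, 10, 12, 11, 10, 95, 11, 10, 12]
spikes = detect_spikes(trace)
for spike in spikes:
    print(f"sample {spike['index']}: {spike['latency_ms']} ms", spike["system"])
```

Saving the spike records somewhere shared means anyone on the team can investigate an anomaly, even one that never shows up again.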

4. Automation helps you continuously improve

By codifying your decisions and processes with automation, it’s easier to improve them over time. Improvements can come from less intrusive ways to monitor your software or by capturing a more comprehensive set of metrics. Sometimes the project objectives can shift and you need to watch subsystems and numbers you were previously ignoring.

All of this is simpler if you’ve built a strong foundation by first using automation as a way to record your current best practices.

5. Historical data is a goldmine

It’s often useful to retrospectively look at past data when coming up with new performance optimizations because you can see what happened the last time you all pinned tasks to processors or whether using two disks in parallel improved transaction latency. Reading data from past experiments, configurations, and investigations can sometimes spark new ideas.

And when your team does well, other people will want to know what you did. Automated error monitoring makes it easy to save your data for posterity and show the progress you made throughout the lifetime of the project. You can use this to write up team reports and summaries of the project for your stakeholders and show not just that you’ve achieved the project goals, but also how far you’ve come along the way.

Building great teams

There are numerous things that go into making a successful team, and the tooling you use, automated or not, is only a small part of it. Team member personalities, positive feedback, and clear communication of objectives are all important in building great teams.

But automation can remove the burden of manually running tasks and the chance for inconsistencies to creep into processes and system configurations. It provides ready answers when you need to know the current status (are you doing better or worse than before?) and how far you’ve come since the start of the project.

Automation can’t make a bad team good, but it can clear the way and help a good team focus on becoming a great one.