Auto-scaling and self-defensive services in Golang

By Jason Fauchelle

How I implemented an auto-scaling and self-defensive service in Golang

The Raygun service is made up of many moving parts, each specialized for a particular task. One of these processes is written in Golang and is responsible for desymbolicating iOS crash reports. You don’t need to know what that means, but in short, it takes native iOS crash reports, looks up the relevant dSYM files, and processes them together to produce human readable stack traces.


The operation of the dsym-worker is simple. It receives jobs via a single consumer attached to a Redis queue. It grabs a job from the queue, performs the job, acknowledges the queue and repeats. We have a single dsym-worker running on a single machine, which has usually been enough to process the job load at a reasonable rate. There are a few things that can and have happened with this simple setup which require on-call maintenance:

  • Increased load. Every now and then, usually during the weekend when perhaps people use their iOS devices more, the number of iOS crash reports coming in could be too much for a single worker processing one job at a time. If this happens, more dsym-worker processes need to be manually started to handle the load. Each process that is started attaches a consumer to the job queue. The Golang Redis queue library it uses then distributes jobs to each consumer so that multiple jobs can be done at the same time.
  • Unresponsiveness. That is to say that the process is still running, but isn’t doing any work. In general, this can occur due to infinite loops or deadlocks. This is particularly bad as our process monitor sees that it is still running, and so it is only when the queue reaches a threshold that alerts are raised. If this happens, the process needs to be manually killed, and a new one started. (Or perhaps many, to catch up on the work load.)
  • Termination. The process crashes and shuts down entirely. This has never happened to the dsym-worker, but is always a possibility as the code is updated. If this happens, a monitor alerts that the process has died, and it needs to be manually started up again.

Needing to deal with these in the middle of the night is not good for whoever is on call, and sometimes it isn’t so good for the person responsible for the code either.

These things can and should be automated, and so I set out to do so.


So overall, we need auto-scaling to handle variable/increased amounts of load, the ability to detect and replace unresponsive workers in some way, and the ability to detect and restart dead processes. Time to come up with a plan of attack.

Single worker strategy

My first idea was extremely simple. The Golang Redis queue library we use, as you may expect, has the ability to attach multiple consumers to the queue from within a single process. By attaching multiple consumers, more work can be done at once, which should help with implementing the auto-scaling. Furthermore, if each consumer keeps track of when it last completed a job, it can be regularly checked to see whether it has been too long since it did any work. This could be used to implement simple detection of unresponsive consumers. At the time, I wasn’t focused on the dead worker detection, and so started looking into the feasibility of this plan so far.

It didn’t take long to discover that this strategy was not going to cut it – not in Golang at least. Each consumer is managed in a goroutine within the Golang Redis queue library. If an unresponsive consumer is detected then we need to kill it off, but it turns out that one does not simply kill off a goroutine (oh, I should mention, I’m quite new to Golang). For a goroutine to end, it should generally complete its work, or be told to break out of a loop using a channel or some other mechanism. If a consumer is stuck in an infinite loop though, as far as I can tell, there isn’t a way to command the goroutine to close down. If there is, it’s bound to mean modifying the Golang Redis queue library. This strategy is getting more complicated than it’s worth, so let’s try something else.

Master worker strategy

My next idea was to write a whole new program that spawns and manages several workers. Each worker can still just attach a single consumer to the queue, but more processes running means more work being done at once. Golang certainly has the ability to start up and shut down child processes, so that helps a lot with auto scaling. There are various ways for separate processes to communicate with each other, so the workers can tell the master process when they last completed a job. If the master process sees that a worker hasn’t done any work for too long, then we get both unresponsive and death detection – more on those later.

Hivemind strategy

Another approach that I briefly thought of is more of a hivemind setup. A single worker process could have both the logic to process a single job at a time, as well as spawning and managing other worker processes. If an unresponsive or dead process is detected, one of the other running processes could assume responsibility for starting up a new one. Collectively they could make sure there was always a good number of processes running to handle the load. I did not look into this at all, so have no idea how sane this is. It could be an interesting exercise though.

In the end, I went with the master process approach. The following is how I tackled each challenge.

Auto scaling

The master process starts by spinning up a goroutine that regularly determines the number of processes that should be running to handle the load. This main goroutine then starts or stops worker processes to match this number. The calculation of the desired worker count is very simple. It considers both the current length of the queue, as well as the rate at which the queue count is changing. The longer the queue is, or the faster jobs are being queued, the more workers that should be spawned. Here is a simplified look at the main goroutine:

The master process needs to keep track of the child processes that it starts. This is to help with auto-scaling, and to regularly check the health of each child. I found that the easiest way to do this was to use a map with integer keys, and instances of a Worker struct as the values. The length of this process-map is used to determine the next key to use when adding a worker, and also which key to delete when removing a worker.

Golang provides an os/exec package for high level process management. The master process uses this package to spawn new worker processes. The master and worker processes are deployed to the same folder, so “./dsym-worker” can be used to start the workers up. However, this does not work when running the master process from a Launch Daemon. The first line in the StartWorker function below is how you can get the working directory of the running process. With this we can create the full path of the worker executable to run it up reliably. Once the process is running, we create an object from the Worker struct and store it in the process-map.

Determining the desired worker count for the job load, and then starting/stopping workers to meet that number is all there really is to auto-scaling in this case. I’ll cover how we stop workers further down.

Inter process communication in Golang

As mentioned previously, my simple approach for detecting an unresponsive worker is for each worker to report the time at which it last completed a job. If the master process finds that a worker has not done a job for too long, it is considered unresponsive and replaced with a new worker. To implement this, the worker processes need to communicate in some way with the master process to relay the current time whenever they complete a job. There are many ways that this could be achieved:

  • Read and write to a file
  • Set up a local queue system such as Redis or RabbitMQ
  • Use the Golang rpc package
  • Transfer gobbed data through a local network connection
  • Utilize shared memory
  • Set up named pipes

In our case, all that we are communicating is just timestamps, not important customer data, so most of these are a bit overkill. I went with what I thought was the easiest solution – communicating through the standard out pipe of the worker processes.

After starting up a new process via exec.Command as described previously, the standard out pipe of the process can be obtained through:

Once we have the standard out pipe, we can run up a goroutine to concurrently listen to it. Within the goroutine, I’ve gone with using a scanner to read from the pipe as seen here:

Code after the scanner.Text() call will be executed every time a line of text is written to the standard out pipe from the worker process.

Unresponsiveness detection

Now that inter process communication is in place, we can use it to implement the detection of unresponsive worker processes. I updated our existing worker to print out the current time using the Golang fmt package upon completing a job. This gets picked up by the scanner where we parse the time using the same format it was printed in. The time object is then set to the LastJob field of the relevant Worker object that we keep track of.

Back in the main goroutine that is regularly iterating the process-map, we can now compare the current time with the LastJob time of each worker. If this is too long, we kill off the process and start a new one.

Killing a process can be done by calling the Kill function of the Process object, which is available on the Command object we got when we spawned the process. Another thing we need to do is delete the Worker object from the process-map.

After killing off the misbehaving worker, a new one can be started by calling the StartWorker function listed further above. The new worker gets referenced in the process-map with the same key as the worker that was just killed – thus completing the worker replacement logic.

Termination detection

Technically, the detection and resolution of unresponsive processes also covers the case of processes that terminate unexpectedly as well. The dead processes won’t be able to report that they are doing jobs, and so eventually they’ll be considered unresponsive and be replaced. It would be nice though if we could detect this earlier.

Attempt 1

When we start a process, we get the pid assigned to it. The Golang os package has a function called FindProcess which returns a Process object for a given pid. So that’s great: if we just regularly check whether FindProcess returns a process for each pid we keep track of, then we’ll know if a particular worker has died, right? No. FindProcess will always return something even if no process exists for a given pid… Let’s try something else.

Attempt 2

Using a terminal, if you type “kill -s 0 {pid}”, then a signal will not be sent to the process, but error checking will still be performed. If there is no process for the given pid, then an error will occur. This can be implemented with Golang, so I tried it out. Unfortunately running this in Golang does not produce any error. Similarly, sending a signal of 0 to a non existent process also doesn’t indicate the existence of a process.

Final solution

Fortunately, we have actually already written a mechanism that will allow us to detect dead processes. Remember the scanner being used to listen to the worker processes? Well, the standard out pipe object that the scanner is listening to is a type called ReadCloser. As the name suggests, it can be closed, which happens to occur if the worker process at the other end stops in any way. If the pipe closes, the scanner stops listening and breaks out of the scanner loop. So right there after the scanner loop we have a point in code where we know a worker process has stopped.

All we need to do now is determine if the worker shut down as a result of the normal operation of the master process (e.g. killing unresponsive workers, or down-scaling for decreased load), or if it terminated unexpectedly. Before the master processes kills/stops a worker for any reason, it deletes it from the process-map. So, if the worker process is still in the process-map after the scanner stops, then it has not shut down at the hands of the master process. If that is the case, start up a new one to take its place.

The functionality of this can easily be tested by using the kill command in a terminal to snipe a worker process. The Mac Activity Monitor will show that a new one replaces it almost instantly.

Graceful shut down

When I first prototyped the auto-scaling behaviour with Golang, I was calling the KillWorker function listed further above to kill off processes when not so many were needed. If a worker is currently processing a job when it is killed off in some way, what happens to the job? Well, until the job has been completed, it sits safely in an unacked queue in Redis for a particular worker. Only when the worker acknowledges the queue that the job has been completed will it disappear. The master process regularly checks for dead Redis connections, and moves any unacked jobs for them back to the ready queue. This is all managed by the Golang Redis queue library we’re using.

This means that when a worker process terminates unexpectedly, no jobs are lost. It also means that killing off processes manually works totally fine. However, it feels kinda rude, and means processing those jobs is delayed. A better solution is to implement graceful shut down – that is, to allow a worker process to finish the job it is currently processing, and then naturally exit.

Step 1 – master process tells worker to stop

To start off, we need a way for the master process to tell a particular worker process to begin graceful shut down. I’ve read that a common way of doing this is to send an OS signal such as ‘interrupt’ to the worker process, and then have the worker handle those signals to perform graceful shut down. For now though, I preferred to leave the OS signals to their default behaviours, and instead have the master process send “stop” through the standard in pipe of a worker process.

Step 2 – worker begins graceful shut down

When a worker process receives a “stop” message, it uses the Golang Redis queue library to stop consuming, and sets a boolean field to indicate that it’s ready for graceful shut down. Another boolean field is used to keep track of whether or not a job is currently in progress. The program is kept alive as long as one of these boolean fields is true. If they are both false, then it means it has no jobs to process, is marked for graceful shut down, and so the program is allowed to terminate naturally.

Step 3 – worker tells master that it’s done

In the master process, we need to stop keeping track of any shut down workers by deleting them from the process-map. We could do this after sending the worker a “stop” message, but what would happen if the last job happened to cause the worker to get stuck in an unexpected infinite loop? To clean this up better, when a worker process has finished its last job and is able to shut down, it prints out a “stop” message. Just like the timestamps, this message gets picked up in the scanner we set up previously. When the master sees this message, it’s fine to stop keeping track of that worker, so delete it from the process-map.

Who watches the watcher?

At this point, the dsym-worker can now auto-scale to handle increased load, and has self-defensive mechanisms against unexpected termination and unresponsiveness. But what about the master process itself? It is a lot simpler than the worker processes, but is still at risk of crashing, especially as this is my first attempt at this kind of process set up. If it goes down, we’re right back to where we started with on-call goodness. May as well set up a mechanism to automate restarting the master process too.

There are a few ways to make sure that the master process is restarted upon failure. One way could be to use the ‘KeepAlive’ option in a Launch Daemon config. Another option could be to write a script that checks for the existence of the process, and starts it if not found. Such a script could be run every 5 minutes or so from a Cron job.

What I ended up doing was to create yet another Golang program which initially starts the master process, and then restarts it if termination is detected. This is achieved using the same technique that the master process uses. Overall it is very small and simple, with nothing that I know of to go wrong. So far it’s holding up well. Another thing that would help is if I actually fix any issues that could cause the master process to crash… but I digress.

Orphaned workers

If the master process is shut down via an interrupt signal, it is automatically handled in a way that all the child processes get shut down too. This is really handy for deploying a new version, as only the top level process needs to be told to shut down. If the master process outright crashes though, then it’s a different story. All the worker processes hang around and keep processing work, but with nothing to supervise them. When a new master process is started, more workers are spawned, which could cause a problem if this keeps happening without any clean up.

This is an important situation to handle, so here is my simple solution against this. The os package in Golang provides a method called Getppid(). This takes no parameters, so you can simply call it any time and get the pid of the parent process. If the parent dies, the child is orphaned and the function will return 1 – the pid of init. So within a worker process, we can easily detect if it becomes orphaned. When a worker first starts it can obtain and remember the pid of its initial parent. Then, regularly call Getppid, and compare the result to the initial parent pid. If the parent pid has changed, the worker has been orphaned, so commence graceful shut down.

Finishing words

And that’s about it. I hope you found this look into how I implemented an auto-scaling and self-defensive service in Golang at least a little bit interesting and not overly ridiculous. So far it’s all working really well, and I have some ideas to make it even more robust. If you know of any existing techniques, processes or libraries that I could have used instead, then I’d love to hear them in the comments below.

All the processes that make up our dsym-worker of course use Raygun to report any errors or panics that occur. This has made tracking down and fixing issues a breeze. Sign up for a free trial of Raygun if you also need error reporting in your services or applications.


29 Comments on “Auto-scaling and self-defensive services in Golang”

    1. Jason Fauchelle

      Thanks Andy,

      If I had more time, the Hivemind strategy would be an interesting challenge. Thanks for your interest.

  1. Mark

    My 2 cents, I would try to isolate the logic into separate applications to make it dirt easy. 🙂

    What I would do:
    * Replace Redis with Beanstalkd
    * Drop the goroutine-idea, let a small application (worker) get a job from beanstalkd,
    process it and have a small additional goroutine in it that does os.Exit if processing is too slow
    * Write a small program you start from cron to add/remove workers depending on N-jobs/Xminutes
    (you could read this from stats in beanstalkd)

    * If the worker is too slow the job is offered to another worker
    * You can kill workers with 0 impact (assuming your worker writes its output atomic)
    * You can ‘bury’/’delay’ specific jobs in edge-cases and see this in the beanstalkd stats (useful for monitoring)
    * No ‘parent’ process that could die and lose all workers
    * All job logic is done by beanstalkd, so this means less code and proven code in production for this task

    Regarding “who watches the watchers” you could let your Puppet/Ansible/Chef run check if cron.d/beanstalkd is running and if not start it so processing is only slowed down in case cron stops running for whatever reason.

    And to lower issues like SPoF run this on at least 2 servers where the clients could connect to a domain (DNS-entry having round-robin or loadbalancer that sends 50/50)

    1. Jason Fauchelle

      Hi Mark, Thanks for detailing another approach to this, I appreciate it. A lot of good tips here.

    1. Jason Fauchelle

      Thanks Justin, Good spotting, I slightly modified that code for the blog post without checking. I should be more careful when it comes to pointers.

  2. Mathew

    From the looks of it, each worker process is now dedicating 100% of its cpu cycles towards its task. If the Hivemind works, you have to split some of the cycles towards management of workers. Question is, are you willing to take such an overhead? Bearing in mind the overhead is multiplied for each worker vs the 1 watchdog process approach.

    1. Jason Fauchelle

      Thanks Mathew, Certainly something to think about if I get to try the Hivemind approach. In the case of our dSYM worker, yes I would be willing to take the overhead.

  3. Hans Klunder

    Nice article.

    Since you already use Redis as your bus, you could also use key expiry to detect the state of your workers (and the master ;-))

    Eg. The worker:
    – on start, registers its node+pid in redis and sets a “heartbeat” key using key expiry
    – refreshes the heartbeat key periodically
    – would also watch a “graceful stop” key and act upon it.

    The master:
    – starts new workers
    – sets graceful stop keys for those workers that should stop
    – kills all workers whose heartbeat has expired

    Now you can stop and start the master anytime and everything will be running smoothly.
    So why not run the master periodically from cron, this way you don’t need to worry about long running processes or dying masters. If the new master finds another master running at start it kills it off as it has been running way too long (the king is dead, long live the king ;-)).
    As a bonus workers can run on multiple servers as long as you have a way for the master to start and kill servers there.

    1. Jason Fauchelle

      Hi Hans, Thanks for detailing this alternate approach to communicating between the processes.

  4. PunKeel


    Great article!

    If I understand what you did there, it scales using only one “dedicated” server. Would it be hard to implement the same scaling feature using multiple servers, or even with Cloud Instances ?


    1. Jason Fauchelle

      Thanks. Yes, all this is running on a single server. The solution would be quite different if scaling using multiple machines. Typically this would be done with a load-balancer.

  5. Hans Klunder

    My first comment seems to have been lost so I’ll try again

    Since you are already using redis you could also use redis EXPIRE to create a heartbeat
    – registers itself with node ID and pid in redis
    – updates the expiry on a “heart beat” key periodically in redis
    – watches for a “graceful shut” key in its name in redis
    – deregisters itself from redis on graceful shut

    The master periodically:
    – starts new workers and sets graceful shut keys
    – checks for workers with expired heartbeats and kills them off
    – cleans up registration for killed workers

    As this decouples the master from the workers, the master can now run from cron avoiding long running processes. On startup the master checks if the previous master is still running, if so it kills it first.

  6. Rob

    If this is all running on a single machine, why scale down the number of workers? Surely there’s an optimal number that maximises throughput that you can just leave it at?

    1. Jason Fauchelle

      Hi Rob, You are right, I could have just gone with a fixed worker count rather than implementing auto-scaling. My unresponsiveness detection may need to be revised though – during periods of less work to do, it could appear that some workers are not doing work for too long. Would need to also consider if each worker actually had work to do which would be easy enough.

  7. Michael Schurter

    While we don’t currently autoscale, we built Metafora* to support that sort of behavior. The scaler would be a Metafora task capable of inspecting the cluster’s resources and making changes accordingly.

    Our codebase is 100% Go, so we chose to avoid having to deal with OS processes for tasks and focus on writing handlers in a crash safe way, so if a handler deadlocks we can kill -9 the process and everything will just naturally rebalance.


  8. Mark Mandel

    Depending on if you wanted to go down this path, but a lot of these issues could be solved with Kubernetes…

    It can be used to make sure that your containers (which run your processes) are always up, if they go down, restart them. You could scale up your worker containers based on your queue metrics based on a simple API, which would be a Job/its own container process.

    Basically I feel it would remove a lot of the plumbing that you are building just to keep things alive.

    1. Jason Fauchelle

      Hi Mark, Thanks for your feedback, I appreciate it. I will certainly look into Kubernetes.

  9. Oleksii

    Hi, thanks for the article.
    I just wanted to ask: did you try the context package?
    It helps to manage goroutines, set timeouts, or cancel them when you need.
    Also, if you decided not to use goroutines and instead handle workers in processes, why did you choose Golang?


    1. Jason Fauchelle

      Thanks Oleksii, a great question. The process was written in Golang a few years ago, mainly due to depending on a couple of packages that we’ve only found available in Golang which would be quite a task to rewrite in another language. It’s still possible that we’ll switch to another language for this process at some stage, though it’s very low priority at present. I have not yet looked into the Context package, I’ll check it out.

  10. Pingback: TechNewsLetter Vol:13 | Devops Enthusiast

  11. Tom

    Really great read! I love seeing how other people handle these scenarios. I’ve been in a similar situation, but relied upon AWS auto-scaling groups and ECS Tasks to do this (SQS and Lambda also involved). Unfortunately none of which uses Golang =(

  12. Artyom

    You may find code at useful. The main goal of the project is the same, but some approaches differ; it also uses arbitrary programs as workers, so it’s not limited to “print time to stdout” worker types. Worker hang detection can be handled by cpu usage soft/hard ulimit set in worker wrapper script. This also takes into account overall server load, etc. It’s been used in production for over a year now with dozens of queues across multiple hosts, running hundreds of workers at peak hours.
