What's a highly available Chef Server?

To be honest, the title is a bit of clickbait. chef.io already describes how to create and operate Chef clusters. However, clusters of that size and complexity can be over the top and require additional effort, monitoring, and possibly dedicated engineers.

That being said, the solution I will show you here is different. Chef Server is rather reliable; we have had no issues with it so far. Because of that, we focus on making backup and restore easy, as described in this article.

If a Chef Server fails, we can throw the instance away and bring up another one, but this happens very rarely. A far more likely scenario is that you need to apply patches or bugfixes, or upgrade the operating system, long before an instance fails. This is what we focus on at RevDB.

In this article, I will describe how to configure your Chef Server instance to be replaced only when the replacement is ready to serve production, hence giving you a highly available Chef Server. We have found this method a lot easier, cheaper and more than reliable enough to serve our needs.

Configuring the EC2 instance

In this previous post, I explained how to configure and create a Chef Server, and in this one, how to back up and restore it. Today, we will add a few additional parameters to the instance.


For EC2 instances configured by Terraform, a local-exec provisioner invokes a local executable after the resource is created. More on this here. Note that the executable runs on the machine running Terraform, not on the instance itself. This gives us the power to run all sorts of checks against the EC2 instance that is meant to replace our current Chef Server, for example, to make sure it has restored from backup and is up.

The added twist here is that Terraform waits for this instance to come up and for the local-exec provisioner to finish before swapping it for the existing one. That’s the create_before_destroy rule. More on this functionality is available here.

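  provisioner "local-exec" {
    interpreter = ["python", "-c"]
    command     = data.template_file.wait_provisioned.rendered
    environment = {
      # Inside the instance's own resource block, newer Terraform versions
      # may require self.public_ip here instead of the full resource address.
      AWS_HOSTNAME = aws_instance.chef-server.public_ip
    }
  }

  lifecycle {
    create_before_destroy = true
  }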
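
To see what this provisioner amounts to, here is a rough Python sketch of how its three arguments combine: the interpreter list becomes the command prefix, the rendered script is passed as the final argument, and the environment map is layered on top of the process environment. It is an approximation for illustration, not Terraform's implementation; run_local_exec and the sample IP address are made up for the example.

```python
import os
import subprocess
import sys


def run_local_exec(interpreter, command, environment):
    """Approximate a local-exec provisioner: run `interpreter + [command]`
    on the local machine with extra environment variables set."""
    env = dict(os.environ)      # inherit the caller's environment
    env.update(environment)     # then layer the provisioner's variables on top
    result = subprocess.run(
        interpreter + [command],
        env=env,
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout


if __name__ == "__main__":
    # Stand-in for the rendered wait script: it only echoes the variable.
    script = "from os import environ; print(environ['AWS_HOSTNAME'])"
    code, out = run_local_exec(
        [sys.executable, "-c"],             # plays the role of ["python", "-c"]
        script,
        {"AWS_HOSTNAME": "203.0.113.10"},   # hypothetical address
    )
    print(code, out.strip())
```

A non-zero return code is what matters to Terraform: it marks the provisioner, and therefore the new instance, as failed.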

Waiting for it to come online

If you are following along, your question at this point is probably: what’s in the provisioner?

First of all, to make sure the file is properly deployed, we use the template mechanism referenced above, which takes the following addition to the Terraform files:

data "template_file" "wait_provisioned" {
  template = file("${path.module}/wait_provisioned.py")
}

With this, we have added a simple but clever script to the module’s root:

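import socket
import sys
import time
from os import environ


def main(hostname, service_name="echo", port=None, timeout=3600):
    # Resolve the port from the service name unless one is given explicitly.
    port = port or socket.getservbyname(service_name, "tcp")

    retry_time = 3
    end_time = time.time() + timeout
    while time.time() < end_time:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.connect((hostname, port))
            return 0  # the service accepted a TCP connection
        except (TimeoutError, ConnectionRefusedError) as err:
            # Not up yet: report the error and retry after a short pause.
            print(err, file=sys.stderr)
            time.sleep(retry_time)
        finally:
            s.close()
    return 1  # timed out; the non-zero exit code fails the provisioner


if __name__ == '__main__':
    try:
        # AWS_HOSTNAME comes from the provisioner's environment block.
        # The original call is not shown; the default arguments are assumed.
        sys.exit(main(environ["AWS_HOSTNAME"]))
    except KeyboardInterrupt:
        sys.exit(1)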
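
To try the waiting logic without launching an instance, you can point it at a throwaway listener on localhost. The sketch below is self-contained: wait_for_port is a condensed variant of the loop above, not the script itself, and its short timeouts are assumptions chosen so that a failed check returns quickly.

```python
import socket
import time


def wait_for_port(hostname, port, timeout=5.0, retry_time=0.2):
    """Return 0 once a TCP connection to hostname:port succeeds, 1 on timeout."""
    end_time = time.time() + timeout
    while time.time() < end_time:
        try:
            # create_connection opens the socket and connects in one call
            with socket.create_connection((hostname, port), timeout=retry_time):
                return 0
        except OSError:
            # covers refused connections and timeouts alike
            time.sleep(retry_time)
    return 1


if __name__ == "__main__":
    # Listen on an ephemeral local port to stand in for the new server.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    open_port = listener.getsockname()[1]
    print(wait_for_port("127.0.0.1", open_port))   # listener is up

    listener.close()
    # Nothing listens on the port any more, so this check runs into the timeout.
    print(wait_for_port("127.0.0.1", open_port, timeout=1.0))
```

Keep in mind that a successful TCP connection only shows that something is listening on the port. It does not prove that Chef Server has finished restoring, so choose the port of a service that starts late in your setup.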

How to use your new powers

If you are like me when I first reviewed this solution, built by our very own Aleks, your brain should be on fire. This is just brilliant when you combine it with the other two posts.

So what can you do? If you remember the first post in this series, we specify the AMI for the Chef Server:

resource "aws_instance" "chef-server" {
  ami           = "ami-02eac2c0129f6376b" # AMI ids are region specific
  # ... remaining arguments as in the first post
}

Change the AMI, run terraform apply, and watch the new instance come online before the old one is destroyed. And that’s our version of a highly available Chef Server.
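
The ordering that makes this safe can be captured in a toy model. The sketch below is not how Terraform is implemented; replace_instance and its callbacks are hypothetical names that only illustrate the sequence: create the replacement, run the provisioner, and destroy the old instance only if the provisioner succeeded.

```python
def replace_instance(old, create, provision, destroy):
    """Toy model of a create_before_destroy replacement.

    create()       -> returns the new instance
    provision(new) -> returns an exit code, 0 meaning "ready"
    destroy(inst)  -> removes an instance
    Returns the instance that is serving afterwards.
    """
    new = create()
    if provision(new) != 0:
        # The replacement never became ready: the old instance is left
        # untouched and keeps serving. Cleaning up the failed replacement
        # is out of scope for this sketch.
        return old
    destroy(old)
    return new


if __name__ == "__main__":
    events = []
    serving = replace_instance(
        "chef-old",
        create=lambda: events.append("create chef-new") or "chef-new",
        provision=lambda inst: events.append(f"wait for {inst}") or 0,
        destroy=lambda inst: events.append(f"destroy {inst}"),
    )
    print(serving, events)
```

The point of the model is the failure branch: if the wait script exits with a non-zero code, the old Chef Server is never destroyed.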


