Thursday, August 23, 2018

Chef Server Backups

Recently, I ran across an issue where one of the Chef servers for our lab in AWS was powered down. I started the EC2 instance and tried to connect so I could take a backup and migrate it to a new server, but I was unable to reach it. This caused our automated development machine builds to fail, and of course there was no backup. It was alarming because the person who set up the server with its roles and cookbooks had left the company, and there was no documentation for how it was configured. We had some of the roles and cookbooks in GitHub, but it turned out that not all of them were there. I will go into detail later on how I managed to get a backup of the database.

We have a few Chef servers, one for each environment (Lab, Staging, and Production), and none of them were being backed up, not even to themselves. So I took on a project to start a backup process, beginning with the chef-server-ctl backup command that ships with the Chef server. This started as a simple cron job running as root to run it nightly at 2 AM, backing up to the local drive and removing older files that are no longer needed. The script was as follows:

#!/bin/bash
#--Back up server

sudo /usr/bin/chef-server-ctl backup --yes

#--Remove old files
ls -1 /var/opt/chef-backup/*.tgz | sort -r | tail -n +6 | xargs rm > /dev/null 2>&1

With a simple crontab entry that would log the output to a file:

0 2 * * * /home/chefServer/chefServerBackup.sh > /var/log/chefServerBackup.log 2>&1

This was the first iteration of the job; the next step was to get the backups off the Chef server. To accomplish this, I created a local user on the server called chefBackup and generated ssh keys so the script could scp the archive to another machine without a password.
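The key setup looked roughly like the following sketch. The user name chefBackup and host server01 come from the post; KEY_DIR is an assumed variable, and on the real server it would be /home/chefBackup/.ssh (it defaults to a scratch directory here so the sketch is safe to run anywhere):

```shell
#!/bin/bash
set -e
#--Sketch: generate the passwordless key pair used by the backup job.
#--KEY_DIR is an assumption; on the Chef server this would be the
#--chefBackup user's ~/.ssh directory.
KEY_DIR="${KEY_DIR:-$(mktemp -d)}"

mkdir -p "$KEY_DIR"
chmod 700 "$KEY_DIR"

#--Empty passphrase (-N "") so the cron job can scp without prompting
ssh-keygen -t ed25519 -N "" -f "$KEY_DIR/id_ed25519" -q

#--Then authorize the key on the destination host (server01 from the post);
#--this one-time step does prompt for the remote password:
#  ssh-copy-id -i "$KEY_DIR/id_ed25519.pub" chefBackup@server01
echo "key pair written to $KEY_DIR"
```

Once the public key is in authorized_keys on server01, scp runs non-interactively under cron.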

#!/bin/bash
#--Back up server
sudo /usr/bin/chef-server-ctl backup --yes
#--Get last file
BACKUP="$(ls -dtr1 /var/opt/chef-backup/* | tail -1)"
#--Copy to ChefDK server
scp "$BACKUP" chefBackup@server01:"$BACKUP"
#--Old-file removal stays in root's crontab

Notice that the removal of the old files remains in root's crontab. This was because I didn't want to give the chefBackup user too many permissions, and it didn't have the access needed to run the rm command on those files. The removal crontab entry looks like this:

0 3 * * * ls -1 /var/opt/chef-backup/*.tgz | sort -r | tail -n +6 | xargs rm > /dev/null 2>&1
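That one-liner can also be wrapped in a small script with the retention count pulled out as a variable, which makes it easier to adjust per server. This is a sketch; BACKUP_DIR and KEEP are assumed names, with defaults matching the paths above:

```shell
#!/bin/bash
#--Sketch: parameterized retention for the backup archives.
#--BACKUP_DIR and KEEP are assumptions; defaults match the post.
BACKUP_DIR="${BACKUP_DIR:-/var/opt/chef-backup}"
KEEP="${KEEP:-5}"

#--List archives newest first, skip the $KEEP we keep, delete the rest.
#--xargs -r avoids running rm with no arguments when nothing is old enough.
ls -1t "$BACKUP_DIR"/*.tgz 2>/dev/null | tail -n +$((KEEP + 1)) | xargs -r rm
```

Using ls -1t (sort by modification time) instead of sorting on file names gives the same result here, since the backup archives are written in chronological order.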

Now the backups are no longer only on the Chef server itself; they are also pushed to another server. Something catastrophic would have to happen to lose both of these servers before you are left with no backups at all (which I probably just jinxed myself by saying). This is better than nothing for the moment. The next step is to push these off to an S3 bucket or some sort of shared storage that is not on-premises, or at least is in a different data center, which means more updates coming. Hope this helps. The one thing I will stress is that it is good to have backups, but foolish to have backups that are never tested. So be sure to regularly test the backups to make sure they will work in the event of an actual disaster.
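A basic test can be as simple as verifying the newest archive is readable and then restoring it onto a scratch Chef server. Here is a sketch; the paths match the post, and the restore line is commented out because it should only ever run on a throwaway instance:

```shell
#!/bin/bash
#--Sketch: verify the newest backup archive before trusting it, then
#--restore it on a throwaway Chef server (never production).
BACKUP_DIR="${BACKUP_DIR:-/var/opt/chef-backup}"
BACKUP="$(ls -1t "$BACKUP_DIR"/*.tgz 2>/dev/null | head -1)"

if [ -n "$BACKUP" ]; then
    #--A corrupt or truncated archive fails here, before any restore attempt
    tar -tzf "$BACKUP" > /dev/null || { echo "corrupt archive: $BACKUP"; exit 1; }

    #--Replay the archive into the running server; uncomment on the
    #--scratch server only:
    #  sudo /usr/bin/chef-server-ctl restore "$BACKUP"
    echo "archive verified: $BACKUP"
else
    echo "no archives found in $BACKUP_DIR"
fi
```

Even the tar check alone catches the most common failure mode, a truncated archive from a job that died mid-backup, but nothing replaces periodically running a full restore.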