I recently wrote about Understanding Riak Clusters and designing a backup strategy. One of our customers has a 5-node Riak cluster running on AWS EC2, and we had to create a backup job for it. If you are running Riak Enterprise Edition, the best way to back up is to run a full-sync replication every day to a node in a different datacenter. Since we are not running the Enterprise Edition, we decided to go with file-system-level backups of each node. Because the cluster runs on Amazon EC2, the EBS snapshots feature comes in handy, and it is faster than rsync or archiving.
The script iterates through a list of nodes and does the following:
Makes an SSH connection to the node using Fabric
Stops the Riak service by running the riak stop command. Since our storage backend is LevelDB and not Bitcask, stopping the service is necessary before initiating a snapshot.
Takes snapshots of all EBS volumes attached to the instance
Starts the Riak service after the snapshot using the riak start command
Checks that all the primary vnodes are up and running using the riak-admin transfers command. If they aren’t, you’ll generally see text like this: does not have \d+ primary partitions running
Checks whether there are handoffs pending and waits until they are done before moving on to the next node. When a node is down in a Riak cluster, vnodes on the other live nodes temporarily take responsibility for some of its data and, once the node is back online, return the data to the original owner. This is called a hinted handoff. We need to make sure that there are no pending handoff transfers before moving on to the other nodes.
Makes sure that the Riak KV service is up.
Moves to the next node and starts again from step #1. At the end, it sends out an email with the status.
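The two checks on the riak-admin transfers output can be sketched as a small Python helper. This is only an illustration, not the script itself. The first pattern is the one quoted above. The second ("waiting to handoff N partitions") is an assumption about how pending handoffs are reported and should be verified against your Riak version. The get_output callable is hypothetical: it stands for whatever runs riak-admin transfers on the node and returns its text.

```python
import re
import time

# Pattern quoted in the steps above.
PRIMARIES_DOWN = re.compile(r"does not have (\d+) primary partitions running")
# Assumed wording for pending hinted handoffs; verify against your Riak version.
HANDOFF_PENDING = re.compile(r"waiting to handoff (\d+) partitions")


def transfers_settled(output):
    """Return True when the output shows no missing primaries and no pending handoffs."""
    return not (PRIMARIES_DOWN.search(output) or HANDOFF_PENDING.search(output))


def wait_until_settled(get_output, interval=30, max_checks=120, sleep=time.sleep):
    """Poll get_output() until transfers settle; return False if they never do.

    get_output is a hypothetical callable returning the text printed by
    `riak-admin transfers` on the node that was just restarted.
    """
    for _ in range(max_checks):
        if transfers_settled(get_output()):
            return True
        sleep(interval)
    return False
```

With Fabric 1.x, get_output could be something like lambda: sudo("riak-admin transfers"), since sudo() returns the command's output as a string. Returning False on timeout lets the caller stop the run and report the problem in the status email instead of taking down the next node while handoffs are still in flight.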
Considerations before running this script:
You need the Fabric and boto Python libraries
Fabric executes remote sudo commands for stopping and starting Riak. You need to edit the sudoers file and change requiretty to !requiretty. The requiretty setting apparently provides no additional security benefit and can be removed.
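As an illustration of what the boto dependency is used for, the snapshot step can be sketched roughly like this. It is a minimal sketch, not the script itself. It assumes boto version 2 and an EC2 connection object passed in by the caller. The function name, the label argument and the description format are made up for the example.

```python
def snapshot_instance_volumes(conn, instance_id, label):
    """Snapshot every EBS volume attached to instance_id and return the snapshot ids.

    conn is expected to be a boto (version 2) EC2 connection, for example
    the result of boto.ec2.connect_to_region("us-east-1").
    """
    # Ask EC2 only for the volumes attached to this instance.
    volumes = conn.get_all_volumes(filters={"attachment.instance-id": instance_id})
    snapshot_ids = []
    for volume in volumes:
        # Hypothetical description format, so snapshots are easy to find later.
        description = "%s %s %s" % (label, instance_id, volume.id)
        snapshot = conn.create_snapshot(volume.id, description)
        snapshot_ids.append(snapshot.id)
    return snapshot_ids
```

create_snapshot returns as soon as the snapshot has been started, and the point-in-time image is fixed at that moment, which is why Riak only needs to stay stopped for the duration of the call rather than until the snapshot completes.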
Below is the script. You may also download it from here: