Sunday 13 September 2015

Saving Images to AWS S3 Scriptomagically

Whilst I've been messing around creating boot images, I've run into the problem of needing to archive off some large images for later use.  Now that I've finally got access to a high-bandwidth internet link, I can back stuff up to Amazon's AWS S3 cloud in a reasonably timely fashion.

s3cmd does a great job of interfacing with AWS from a Linux CLI, but it is designed to deal with pre-created files, not data that is generated on the fly.  When you're talking about multi-gigabyte files, it isn't always an option to make a local archive file before pushing it to the remote storage location.

I'm used to combining pigz, dd and ssh to copy files like this, and wanted to achieve something similar with S3, but there don't seem to be many guides to doing it.  I have, however, made it work on my Debian-based distro relatively easily.
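For reference, this is the sort of pipeline I mean - a rough sketch only, with the device, user and host names purely hypothetical:

# stream a block device through pigz and over ssh to another machine
# (the device path, user and host here are placeholders - adjust to suit)
dd if=/dev/sdb bs=4M | pigz | ssh backupuser@backuphost 'cat > /backups/sdb.img.gz'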

Tooling


This is the tooling I combined:

s3cmd

You need a recent version of s3cmd to make this work - v1.5.5 or above is apparently what supports the stdin/stdout streaming you'll need.
As of writing, this can be obtained from the s3tools git repository at https://github.com/s3tools/s3cmd
You'll need git and some Python bits and pieces, but building was straightforward in my case.

Before you start, make sure you set up s3cmd using the command s3cmd --configure
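In case it's useful, this is roughly how I'd fetch and prepare it - a sketch only, assuming you're happy to run s3cmd straight out of the git checkout rather than installing it system-wide:

# grab the current s3cmd source and run it in place
git clone https://github.com/s3tools/s3cmd.git
cd s3cmd
./s3cmd --version      # confirm you've got a build new enough for stdin/stdout support
./s3cmd --configure    # walks you through entering your AWS access and secret keys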

pigz

I use pigz, although you can use gzip to achieve the same thing.  For those that don't know, pigz is a multi-threaded implementation of gzip - it offers much better performance than gzip on modern multi-core systems.
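On a Debian-based system it's a one-liner to install, and a quick round trip through a pipe (purely illustrative) confirms it behaves just like gzip:

sudo apt-get install pigz
# compress and decompress a stream through a pipe to check it's working
echo "hello pigz" | pigz | pigz -d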

tar

tar is on pretty much every Linux system, and deals with folder contents in a way that gzip/pigz can't - it bundles a whole directory tree into a single stream that the compressor can then work on.
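As a trivial local example of that pairing before we bring S3 into it (the paths here are just placeholders):

# bundle a directory into one stream and compress it with pigz
tar cf - /tmp/source/directory | pigz > /tmp/backup.tar.gz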

Usage

The command I built is as follows:

tar cvf - --use-compress-prog=pigz /tmp/source/directory/* --recursion --exclude='*.vswp' --exclude='*.log' --exclude='*.hlog' | /path/to/s3cmd put - s3://bucket.location/BackupFolder/BackupFile.tar.gz --verbose --rr

I think it's pretty self-explanatory, but I'll run through the command anyway...

tar cvf - = tar: create an archive, verbose output, write it to a file; the - stands for stdout in tar parlance
--use-compress-prog=pigz = compress the archive with pigz; you can probably swap this for any compression app which supports stdout
/tmp/source/directory/* = the directory or mount point where your source files are coming from
--recursion = recurse through the directories to pick up all the files
--exclude='*.vswp' --exclude='*.log' --exclude='*.hlog' = exclude various file types (in this instance, I was backing up a broken VMFS storage array)
| = pipes the output into the next app
/path/to/s3cmd = the directory where s3cmd resides - in my instance, the git repository version I'd installed
put = send to S3; put works with a single file name
- = read the file from stdin (i.e. the pipe) rather than from disk
s3://bucket.location/BackupFolder/BackupFile.tar.gz = the S3 bucket and path where you want the output stored
--verbose = verbose output for debugging and status tracking
--rr = reduced redundancy storage - less expensive than full redundancy, and you can include/exclude this based on your needs.
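For completeness, pulling the archive back is just the pipe in reverse - a sketch assuming your s3cmd build can also stream a download to stdout, and using the same (placeholder) bucket and paths as above:

# stream the object back from S3, decompress it and unpack it
/path/to/s3cmd get s3://bucket.location/BackupFolder/BackupFile.tar.gz - | pigz -d | tar xvf -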

The biggest problem with this is that you don't really get an idea of how long a backup will take.  s3cmd splits the upload into chunks, but because it's reading from stdin it doesn't know the total size, so you can't tell how many chunks there will be until the process has completed.  I average around 6 MB/s, but a multi-gigabyte file can still take several hours to upload.  Whilst I didn't time it exactly, a 70GB file, compressed to 10GB, took around 90 minutes to send to S3.
You may want to leave your backup running in a screen session.
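If you do, something along these lines works (the session name is arbitrary):

screen -S s3backup     # start a named screen session, then run the backup command inside it
# detach with Ctrl-A then D, and reattach later to check progress
screen -r s3backup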