Introduction to encrypted, incremental backup


There are a few approaches to encrypted incremental backup, assuming the constraints are (1) we don’t trust the remote storage provider not to snoop through our data, so it must be encrypted client-side; and (2) bandwidth is expensive, so we want to minimise the data sent to the remote storage provider - which means only ever sending the changes to our data.

Take duplicity[1] for instance. In order to work with untrusted remote storage sites, all data is encrypted and the encryption key is never exposed to the remote storage provider. So the storage is

encrypt(v1) + encrypt(diff(v1, v2)) + encrypt(diff(v2, v3)) + ...

which works fine, except there’s no way for the server to consolidate the incremental parts back into the full file[2] unless we decrypt the data directly on the storage provider, violating the first constraint (see rdiff-backup[3] for a reverse-incremental merge system).

Otherwise, the stored data will grow without bound - until another full file is uploaded, at which point you can prune the old chain at your leisure once retention policies are met. This works fine, but it’s suboptimal with respect to the second constraint - performing a full re-upload of a large backup set is certainly inconvenient.
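To make the chain concrete, here is a minimal Python sketch of the duplicity-style layout described above. It is not duplicity's actual format or algorithm: `toy_crypt` is a made-up, insecure XOR stream cipher, and `diff` / `patch` are a naive fixed-block delta invented for this example. The point it illustrates is that only the key holder can replay the chain; the storage side just sees opaque blobs it cannot merge.

```python
import hashlib
import json

BLOCK = 4  # tiny block size so the example is easy to follow


def toy_crypt(key: bytes, nonce: int, data: bytes) -> bytes:
    """Toy XOR stream cipher (NOT secure); the same call encrypts and decrypts."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        seed = key + nonce.to_bytes(8, "big") + counter.to_bytes(8, "big")
        stream += hashlib.sha256(seed).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def diff(old: bytes, new: bytes) -> bytes:
    """Naive block diff: record the new length plus every block that changed."""
    changed = {}
    for off in range(0, len(new), BLOCK):
        if new[off:off + BLOCK] != old[off:off + BLOCK]:
            changed[str(off)] = new[off:off + BLOCK].hex()
    return json.dumps({"length": len(new), "blocks": changed}).encode()


def patch(old: bytes, delta: bytes) -> bytes:
    """Apply a delta produced by diff() to the previous version."""
    info = json.loads(delta)
    out = bytearray(old[:info["length"]].ljust(info["length"], b"\0"))
    for off, hexdata in info["blocks"].items():
        chunk = bytes.fromhex(hexdata)
        out[int(off):int(off) + len(chunk)] = chunk
    return bytes(out)


def backup_chain(key: bytes, versions: list) -> list:
    """What the client uploads: encrypt(v1), encrypt(diff(v1, v2)), ..."""
    chain = [toy_crypt(key, 0, versions[0])]
    for i in range(1, len(versions)):
        chain.append(toy_crypt(key, i, diff(versions[i - 1], versions[i])))
    return chain


def restore(key: bytes, chain: list) -> bytes:
    """Replay the whole chain client-side; every link must be decrypted first."""
    data = toy_crypt(key, 0, chain[0])
    for i, blob in enumerate(chain[1:], start=1):
        data = patch(data, toy_crypt(key, i, blob))
    return data


key = b"secret"
versions = [b"hello world", b"hello World!", b"help"]
chain = backup_chain(key, versions)
restored = restore(key, chain)
```

Note that `restore` has to walk every link, which is why the chain can only be shortened by uploading a fresh full copy.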

Can we meet both constraints and still allow server-side incremental consolidation? There’s a remaining alternative: store the data instead in the form

encrypt(v1) + diff(encrypt(v1), encrypt(v2)) + diff(encrypt(v2), encrypt(v3)) + ...

which works well - diffs can be merged without decrypting the data - until you realise that encrypt(v1) and encrypt(v2) might produce entirely different files with very little in common. If v2 changed a byte half-way through the file, then with a chaining mode such as CBC all ciphertext from that point onward would appear totally different (except in the rare case of, say, using ECB-mode encryption and making an edit whose length is an exact multiple of the cipher’s block size).
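The propagation effect is easy to see in a small Python sketch. The cipher here is a stand-in, not a real one: `toy_cfb_encrypt` uses SHA-256 in place of a block cipher in a CFB-style chain, purely to show how one changed plaintext byte alters every later ciphertext block. The key, IV and block size are arbitrary choices for the example.

```python
import hashlib

BLOCK = 16  # toy block size in bytes


def toy_cfb_encrypt(key: bytes, iv: bytes, data: bytes) -> bytes:
    """CFB-style chaining: each ciphertext block feeds into the next one.
    SHA-256 stands in for the block cipher; this is NOT secure."""
    out = bytearray()
    prev = iv
    for off in range(0, len(data), BLOCK):
        chunk = data[off:off + BLOCK]
        pad = hashlib.sha256(key + prev).digest()[:len(chunk)]
        prev = bytes(a ^ b for a, b in zip(chunk, pad))
        out += prev
    return bytes(out)


def changed_blocks(a: bytes, b: bytes) -> list:
    """Indices of BLOCK-sized blocks that differ between two equal-length buffers."""
    return [off // BLOCK for off in range(0, len(a), BLOCK)
            if a[off:off + BLOCK] != b[off:off + BLOCK]]


key, iv = b"k" * 16, b"\0" * 16
v1 = bytes(160)            # ten blocks of zeros
v2 = bytearray(v1)
v2[80] ^= 1                # flip one bit at the start of block 5
c1 = toy_cfb_encrypt(key, iv, v1)
c2 = toy_cfb_encrypt(key, iv, bytes(v2))
print(changed_blocks(c1, c2))  # [5, 6, 7, 8, 9]
```

Even with the same key and IV reused for both versions, everything from the edited block to the end of the file differs, so diff(encrypt(v1), encrypt(v2)) is roughly as large as the tail of the file. With a fresh IV per version, every block would differ.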

So is there a workaround to this workaround?

Solution 1 - FS-based

Instead of encrypting and backing up files / tarballs directly, we can create a loopback filesystem and back it up as a whole, restarting the cipher at intervals that are in an integer ratio to the filesystem sector size. As data writes are confined to filesystem sectors, this ensures that small changes to the data result in only small changes to the encrypted filesystem.
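As a rough illustration of restarting the cipher per sector, the Python sketch below encrypts each 512-byte sector independently. Everything in it is an assumption made for the example: the SHA-256-based CFB-style "cipher" is a toy, and deriving a per-sector IV from the sector index is just one plausible choice, not how any particular disk encryption product works.

```python
import hashlib

SECTOR = 512   # bytes per sector; the cipher restarts at every sector boundary
BLOCK = 16     # toy cipher block size


def encrypt_sector(key: bytes, index: int, sector: bytes) -> bytes:
    """Encrypt one sector with chaining that never crosses the sector boundary.
    The IV derived from the sector index is an assumption of this sketch."""
    prev = hashlib.sha256(key + index.to_bytes(8, "big")).digest()[:BLOCK]
    out = bytearray()
    for off in range(0, len(sector), BLOCK):
        chunk = sector[off:off + BLOCK]
        pad = hashlib.sha256(key + prev).digest()[:len(chunk)]
        prev = bytes(a ^ b for a, b in zip(chunk, pad))
        out += prev
    return bytes(out)


def encrypt_image(key: bytes, image: bytes) -> bytes:
    """Encrypt a whole filesystem image sector by sector."""
    return b"".join(
        encrypt_sector(key, off // SECTOR, image[off:off + SECTOR])
        for off in range(0, len(image), SECTOR)
    )


def changed_sectors(a: bytes, b: bytes) -> list:
    """Indices of sectors that differ between two equal-length images."""
    return [off // SECTOR for off in range(0, len(a), SECTOR)
            if a[off:off + SECTOR] != b[off:off + SECTOR]]


key = b"k" * 16
image1 = bytes(8 * SECTOR)     # eight empty sectors
image2 = bytearray(image1)
image2[2000] ^= 1              # byte 2000 lives in sector 3
c1 = encrypt_image(key, image1)
c2 = encrypt_image(key, bytes(image2))
print(changed_sectors(c1, c2))  # [3]
```

Because the chain restarts at each sector, a one-byte write dirties a single encrypted sector, and a diff of the two encrypted images stays small.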

Instead of creating a loopback filesystem, you might instead use Truecrypt, and back up the container without encryption; Truecrypt restarts its cipher every 512 bytes regardless of the selected cipher or filesystem sector size[4] in order to minimise changes to the encrypted container.

The downside, of course, is that you have a fixed-size container to back up, even if it is mostly empty, and it becomes easy to hit the limits of the container if care is not taken. For most filesystems, though, it’s possible to shrink or grow a partition without causing significant writes (at most to the filesystem metadata, such as the MFT on NTFS). Truecrypt does not generally support resizing containers, but extcv[5] can be used if the container filesystem is NTFS.

The other downside is that, unless you can bring your whole way of computing in line with this approach, it usually necessitates spooling a copy of the data and performing a local backup from your everyday filesystem into the container. But presumably a local copy of your backup data is a good part of your backup strategy anyway (for fast restores, etc).

I guess it would be possible to combine this approach with new backup software that rebuilt the container on the fly every time a backup was initiated, provided the MFT of the generated container was preserved. It would take some reasonably close integration with the filesystem driver.

Solution 2 - Rolling Checksum

See the followup article.


  1. http://duplicity.nongnu.org/ “Encrypted bandwidth-efficient backup using the rsync algorithm.”

  2. Another goal of duplicity is to support “dumb storage” providers like Amazon S3 and random FTP hosts, where even if there was a way to perform the consolidation, you can’t run programs on the storage-side anyway.

  3. http://rdiff-backup.nongnu.org/

  4. http://www.truecrypt.org/docs/modes-of-operation “The size of each data unit is always 512 bytes (regardless of the sector size).”

  5. http://sourceforge.net/projects/extcv/ “Expand a TrueCrypt volume on the fly without reformatting.”