This article covers the design of an ideal file-based backup solution.
DESIRED PROPERTIES
- Untrusted server
- Incremental forever
METHOD
- Split the file into blocks using the rsync algorithm, with deterministic split points. This makes it very likely that a one-byte addition in the middle of a file won't cause a re-upload of every following section. Use a fairly large block interval (e.g. 100 KB) to minimise the size of the metadata (see the chunking sketch below)
- Encrypt each block with its own SHA-1 hash as the key, and reference it by the SHA-1 hash of the encrypted block (see the encryption sketch below)
- Send the blocks to offsite storage, which can incref and deduplicate based on the encrypted hashes, and cannot decrypt them because the original SHA-1 is never disclosed to it (see the reference-counting sketch below)
- Client software maintains a local database of which files refer to which encrypted blocks, plus the list of known decryption keys. The database is encrypted with a password, kept locally, and mirrored offsite (see the schema sketch below)
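To make the first step concrete, here is a minimal Python sketch of content-defined chunking. It uses a simple gear-style rolling hash rather than the actual rsync rolling checksum, and every name and constant in it (chunk_stream, the GEAR table, the size limits) is an illustrative choice, not something specified above:

    import random

    random.seed(0x5EED)                      # fixed seed => boundaries are deterministic everywhere
    GEAR = [random.getrandbits(32) for _ in range(256)]

    AVG_BLOCK = 100 * 1024                   # target ~100 KB average block, as suggested above
    MIN_BLOCK, MAX_BLOCK = 16 * 1024, 1024 * 1024

    def chunk_stream(stream):
        """Yield content-defined blocks from a binary stream in a single pass."""
        buf, h = bytearray(), 0
        while True:
            data = stream.read(64 * 1024)
            if not data:
                break
            for b in data:
                buf.append(b)
                # Gear-style rolling hash: the value depends only on the last ~32
                # bytes, so cut points are local to the content and an insertion
                # earlier in the file does not shift later boundaries.
                h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
                if (len(buf) >= MIN_BLOCK and h % AVG_BLOCK == 0) or len(buf) >= MAX_BLOCK:
                    yield bytes(buf)
                    buf, h = bytearray(), 0
        if buf:
            yield bytes(buf)

Because the cut decision depends only on the most recent bytes, inserting one byte early in a file changes only the block containing it and, at worst, its immediate neighbour.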
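The encryption step is convergent encryption. A sketch, assuming the third-party Python cryptography package and substituting SHA-256 / AES-256-GCM for the SHA-1 mentioned above (see BIKESHEDDING); the function names and the nonce derivation are illustrative:

    import hashlib
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    def encrypt_block(plaintext: bytes):
        """Convergent encryption: the key is the block's own hash, so identical
        plaintext blocks always produce identical ciphertext (=> dedupable)."""
        key = hashlib.sha256(plaintext).digest()               # decryption key, stored only in the client DB
        nonce = hashlib.sha256(b"nonce" + key).digest()[:12]   # deterministic nonce; safe only because the key is per-content
        ciphertext = AESGCM(key).encrypt(nonce, plaintext, None)
        block_id = hashlib.sha256(ciphertext).hexdigest()      # the only identifier the server ever sees
        return block_id, key, ciphertext

    def decrypt_block(key: bytes, ciphertext: bytes) -> bytes:
        nonce = hashlib.sha256(b"nonce" + key).digest()[:12]
        return AESGCM(key).decrypt(nonce, ciphertext, None)

Identical plaintext blocks always map to identical ciphertext, which is what lets the server deduplicate without being able to read anything.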
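On the server side, deduplication then reduces to reference counting keyed by the encrypted-block hash. A toy in-memory sketch (a real service would persist this, authenticate users, and verify the deletion token; everything here is hypothetical):

    from typing import Optional

    class BlockStore:
        """Toy server-side store: it sees only encrypted blocks and their hashes,
        never the content keys, so it can deduplicate but not decrypt."""

        def __init__(self):
            self.blobs = {}   # block_id -> encrypted bytes
            self.refs = {}    # block_id -> set of user_ids holding a reference

        def store(self, user_id: str, block_id: str, ciphertext: Optional[bytes]) -> bool:
            """Return True if the server now holds the block for this user."""
            if block_id in self.blobs:
                self.refs[block_id].add(user_id)      # deduplicate: just incref
                return True
            if ciphertext is None:                    # caller only asked "do you have it?"
                return False
            self.blobs[block_id] = ciphertext
            self.refs[block_id] = {user_id}
            return True

        def decref(self, user_id: str, block_id: str, deletion_token: str) -> None:
            # A real server would verify deletion_token before honouring this.
            # Only a user who holds a reference can drop it, so one tenant
            # cannot delete blocks another tenant still needs.
            users = self.refs.get(block_id, set())
            users.discard(user_id)
            if not users:
                self.blobs.pop(block_id, None)
                self.refs.pop(block_id, None)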
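The local metadata database can be very small. One possible SQLite layout, purely illustrative and reused by the later sketches in this article; the file backing it is what gets password-encrypted and mirrored offsite:

    import sqlite3

    def init_manifest(path="manifest.db"):
        """Create the client-side manifest: which files map to which encrypted
        blocks, plus the decryption key for each block."""
        db = sqlite3.connect(path)
        db.executescript("""
            CREATE TABLE IF NOT EXISTS blocks (
                block_id TEXT PRIMARY KEY,   -- hash of the encrypted block (what the server knows)
                key      BLOB NOT NULL,      -- hash of the plaintext (what the server must never know)
                size     INTEGER NOT NULL
            );
            CREATE TABLE IF NOT EXISTS files (
                file_id  INTEGER PRIMARY KEY,
                path     TEXT NOT NULL,
                mtime    REAL,
                version  INTEGER NOT NULL
            );
            CREATE TABLE IF NOT EXISTS file_blocks (
                file_id  INTEGER REFERENCES files(file_id),
                seq      INTEGER NOT NULL,   -- block order within the file
                block_id TEXT REFERENCES blocks(block_id),
                PRIMARY KEY (file_id, seq)
            );
        """)
        return db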
BIKESHEDDING
Replace SHA-1 above with the latest fast, cryptographically secure hash algorithm (BLAKE1 was a SHA-3 finalist, and BLAKE2 outperforms MD5)
SPECIFIC TASKS
Pruning
- Server can keep track of when a block was incref’d, but not when it is no longer part of a current working set, so
- It must be the client’s responsibility to decref blocks
- Server has to keep track of which users hold a reference to which blocks, to prevent abuse (one user must not be able to free blocks another user still references)
- Must have a separate deletion key (which the server authorises against) and encryption key (which the server must never know)
- Deletion key must never be autosaved, to prevent abuse (see the prune sketch below)
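A client-side prune, written against the illustrative schema and toy server above, might look like the following; the deletion-token handling is a sketch of the policy in this list, not a defined protocol, and keep_versions is an invented parameter:

    import getpass

    def prune_old_versions(db, server, user_id, keep_versions=2):
        """Drop references to blocks that only old file versions still use.
        The deletion token is typed in each time and never written to disk."""
        deletion_token = getpass.getpass("Deletion key (never saved): ")

        # Blocks still referenced by the versions we are keeping
        live = {row[0] for row in db.execute("""
            SELECT DISTINCT fb.block_id
            FROM file_blocks fb JOIN files f USING (file_id)
            WHERE f.version > (SELECT MAX(version) FROM files) - ?""", (keep_versions,))}

        # Every block the client currently knows about
        known = {row[0] for row in db.execute("SELECT block_id FROM blocks")}

        for block_id in known - live:
            server.decref(user_id, block_id, deletion_token)
            db.execute("DELETE FROM blocks WHERE block_id = ?", (block_id,))
        # (Deleting the stale files/file_blocks rows is omitted for brevity.)
        db.commit()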
Auditing
- Client can use the local database to determine where storage is being used, without trusting the remote storage
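Against the illustrative schema above, such an audit is a single local query; the remote storage is never consulted:

    def usage_by_path(db, top=20):
        """Report which paths account for the most backed-up data, straight
        from the local manifest (blocks shared between files are counted once
        per referencing file)."""
        return db.execute("""
            SELECT f.path, SUM(b.size) AS bytes
            FROM files f
            JOIN file_blocks fb USING (file_id)
            JOIN blocks b USING (block_id)
            GROUP BY f.path
            ORDER BY bytes DESC
            LIMIT ?""", (top,)).fetchall()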
Restore
- Client can request any mirrored copy of the encrypted profile (the offsite database), decrypt it locally, request the blocks for the latest (or any) version, and reconstruct the files locally using the keys in the database
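A single-file restore, reusing the illustrative schema and the decrypt_block sketch from earlier; server.fetch is a hypothetical remote call:

    def restore_file(db, server, path, out_path):
        """Rebuild one file: look up its blocks in the local manifest, fetch the
        encrypted blobs by their encrypted hash, decrypt with the stored keys."""
        rows = db.execute("""
            SELECT fb.seq, b.block_id, b.key
            FROM files f
            JOIN file_blocks fb USING (file_id)
            JOIN blocks b USING (block_id)
            WHERE f.path = ?
              AND f.version = (SELECT MAX(version) FROM files WHERE path = ?)
            ORDER BY fb.seq""", (path, path)).fetchall()

        with open(out_path, "wb") as out:
            for _seq, block_id, key in rows:
                ciphertext = server.fetch(block_id)         # hypothetical remote call
                out.write(decrypt_block(key, ciphertext))   # from the encryption sketch above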
Efficiency
- Block check/store requests can be batched to reduce the number of round-trip network requests (see the sketch after this list)
- After an incomplete upload, the client can submit the full list of known blocks (by encrypted hash) from its local database and compare it against the full list of blocks the server holds for that account
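Both points reduce to one batched question: which of these encrypted hashes does the server already hold? A sketch, with server.have_blocks as a hypothetical batched API call:

    def missing_blocks(db, server, batch=1000):
        """One round trip per `batch` hashes: ask the server which of our known
        encrypted-block hashes it already holds, then upload only the rest.
        Also doubles as recovery after an interrupted upload."""
        local = [row[0] for row in db.execute("SELECT block_id FROM blocks")]
        missing = []
        for i in range(0, len(local), batch):
            chunk = local[i:i + batch]
            present = set(server.have_blocks(chunk))   # hypothetical batched API call
            missing.extend(h for h in chunk if h not in present)
        return missing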
Streams
- Single-pass and does not require knowing the full data length in advance, so it can back up a piped stream such as another program's stdout without spool files
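Reusing the earlier sketches, a stream backup is just the normal upload loop fed from stdin; nothing needs to know the total length in advance:

    import sys

    def backup_stdin(db, server, user_id):
        """Single pass over an unbounded stream: chunk, encrypt and upload as we
        go, so e.g. `pg_dump | backup` needs no spool file and no known length.
        (Recording the stream's own manifest rows is omitted for brevity.)"""
        for block in chunk_stream(sys.stdin.buffer):
            block_id, key, ciphertext = encrypt_block(block)
            if not server.store(user_id, block_id, None):    # cheap existence check / incref
                server.store(user_id, block_id, ciphertext)  # upload only blocks the server lacks
            db.execute("INSERT OR IGNORE INTO blocks VALUES (?, ?, ?)",
                       (block_id, key, len(block)))
        db.commit()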
FUNDAMENTAL COMPARISON OF FILE- AND BLOCK- BASED SYSTEMS
FILE-BASED
- Intensive rolling delta allows you to absolutely minimise the data upload when a file is modified
- Consistent file snapshot, but not guaranteed to be crash consistent
BLOCK-BASED
- Incremental backups do not require intensive search
- First-boot and out-of-sync states require a very intensive full reupload or full diff
- Not necessarily file-consistent
- File modifications literally rewrite every block in use, requiring a full file reupload
- Files not necessarily consistent, but the whole system is crash consistent
- Can use NT APIs to determine file extents on disk, allowing blocks of unwanted files to be excluded (a restore would set these files to all zeros)