Design of backup software

This article covers the design of an ideal file-based backup solution.

DESIRED PROPERTIES

  • Untrusted server: the storage provider must never be able to read the backed-up data
  • Incremental forever: after the initial upload, only changed data is ever transferred

METHOD

  1. Split each file into blocks using the rsync rolling-checksum algorithm with deterministic split points. This makes it very likely that a 1-byte addition in the middle of a file won’t cause a reupload of all the following sections. Use a block interval that is quite large (e.g. 100 KB) to minimise the size of the metadata (see the chunker sketch after this list).

  2. Encrypt each block using its own SHA1 hash as the key; the block is then referenced by the SHA1 hash of the encrypted block — i.e. convergent encryption (see the sketch after this list).

  3. Send the blocks to offsite storage, which can incref and deduplicate them based on the encrypted hashes alone, and cannot decrypt them because the original SHA1 is never revealed (see the store sketch after this list).

  4. The client software maintains a local database of which files refer to which encrypted blocks, together with the list of known decryption keys. The database is encrypted with a password, kept locally, and mirrored offsite (see the schema sketch after this list).
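
Step 1 could look like the following content-defined chunker. This is a minimal sketch assuming a simple polynomial rolling hash; production tools typically use a Rabin fingerprint or buzhash, but the cut-point logic is the same. WINDOW, PRIME, MOD, and MASK are illustrative constants, with MASK tuned for roughly 128 KB average blocks.

    WINDOW = 64            # bytes covered by the rolling hash
    PRIME = 257
    MOD = 1 << 32
    MASK = (1 << 17) - 1   # cut when the low 17 bits are zero: ~128 KB average
    MIN_BLOCK = 16 * 1024  # avoid pathologically small blocks

    def split_blocks(data: bytes):
        """Yield content-defined blocks of data."""
        p_w = pow(PRIME, WINDOW, MOD)
        h, start = 0, 0
        for i in range(len(data)):
            h = (h * PRIME + data[i]) % MOD
            if i >= WINDOW:
                # drop the byte that just left the window
                h = (h - data[i - WINDOW] * p_w) % MOD
            if (h & MASK) == 0 and i + 1 - start >= MIN_BLOCK:
                yield data[start:i + 1]
                start = i + 1
        if start < len(data):
            yield data[start:]

Because cut points depend only on nearby bytes, an insertion changes the block containing the edit and at most its neighbour; later blocks keep their boundaries and therefore their hashes.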
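
The convergent encryption of step 2 could be sketched as below, assuming the third-party "cryptography" package and substituting SHA-256 per the bikeshedding note. The fixed CTR nonce is safe here only because each key encrypts exactly one plaintext.

    import hashlib
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    def encrypt_block(block: bytes):
        key = hashlib.sha256(block).digest()  # decryption key: hash of the plaintext
        enc = Cipher(algorithms.AES(key), modes.CTR(b"\x00" * 16)).encryptor()
        ciphertext = enc.update(block) + enc.finalize()
        block_id = hashlib.sha256(ciphertext).hexdigest()  # server-visible reference
        return block_id, key, ciphertext

Identical plaintext blocks always produce identical ciphertext, so the server can deduplicate by block_id without ever learning the key.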
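
The server side of step 3 needs little more than a content-addressed store with reference counts. A hypothetical in-memory sketch (a real server would persist this and authenticate users):

    class BlockStore:
        def __init__(self):
            self.blocks = {}  # block_id -> ciphertext
            self.refs = {}    # block_id -> set of users holding a reference

        def put(self, user, block_id, ciphertext=None):
            """Incref a block; returns False if the client must upload the bytes."""
            if block_id not in self.blocks:
                if ciphertext is None:
                    return False
                self.blocks[block_id] = ciphertext
            self.refs.setdefault(block_id, set()).add(user)
            return True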
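
For step 4, the local index can be as small as two tables. A sketch assuming SQLite; the database file itself would be encrypted with the user's password before being mirrored offsite:

    import sqlite3

    db = sqlite3.connect("backup-index.db")  # hypothetical filename
    db.executescript("""
        CREATE TABLE IF NOT EXISTS files (
            path  TEXT PRIMARY KEY,
            mtime INTEGER
        );
        CREATE TABLE IF NOT EXISTS file_blocks (
            path     TEXT,
            seq      INTEGER,  -- block order within the file
            block_id TEXT,     -- hash of the encrypted block (server reference)
            key      BLOB,     -- hash of the plaintext (decryption key)
            size     INTEGER,  -- plaintext block size, for auditing
            PRIMARY KEY (path, seq)
        );
    """)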

BIKESHEDDING

Replace SHA1 above with the latest fast, cryptographically secure hash algorithm (BLAKE was a SHA-3 finalist, and BLAKE2 outperforms MD5).

SPECIFIC TASKS

Pruning

  • Server can keep track of when a block was incref’d, but not of when it stops being part of a current working set, so
  • It must be the client’s responsibility to decref blocks
  • Server has to keep track of which users have a pointer to which blocks, to prevent abuse (see the sketch below)
  • Must have a separate deletion key (whose use the server must authorise) distinct from the encryption key (which the server must never know)
  • Deletion key must never be autosaved, to prevent abuse
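
Extending the hypothetical BlockStore sketched under METHOD, pruning could look like this; the per-user reference tracking is what stops one user from decref’ing blocks they never stored:

    # A decref method continuing the BlockStore sketch from METHOD step 3.
    def decref(self, user, block_id):
        holders = self.refs.get(block_id, set())
        if user not in holders:
            raise PermissionError("user holds no reference to this block")
        holders.discard(user)
        if not holders:  # no user references the block any more: safe to delete
            self.blocks.pop(block_id, None)
            del self.refs[block_id]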

Auditing

  • Client can use the local database to determine where storage space is being used, without trusting remote storage (see the query below)
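
With the index sketched under METHOD (including the illustrative size column), this is a single local query; no server round trip is needed:

    # Per-file space usage, computed entirely from the local index.
    for path, used in db.execute(
        "SELECT path, SUM(size) FROM file_blocks GROUP BY path ORDER BY 2 DESC"
    ):
        print(f"{used:>12}  {path}")

Note this counts a shared block once per file that uses it; a deduplicated total would need a DISTINCT pass over block_id.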

Restore

  • Client can request its (or any) encrypted profile, decrypt it locally, submit block requests for the latest (or any) version, and reconstruct the files locally using the keys in the database (see the sketch below)
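
A sketch of per-file reconstruction, assuming a hypothetical fetch_block(block_id) server call and the AES-CTR scheme sketched under METHOD:

    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    def restore_file(db, path, fetch_block):
        out = bytearray()
        rows = db.execute(
            "SELECT block_id, key FROM file_blocks WHERE path = ? ORDER BY seq",
            (path,),
        )
        for block_id, key in rows:
            dec = Cipher(algorithms.AES(key), modes.CTR(b"\x00" * 16)).decryptor()
            out += dec.update(fetch_block(block_id)) + dec.finalize()
        return bytes(out)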

Efficiency

  • Block check/store requests can be batched to reduce the number of round-trip network requests (see the sketch below)
  • After an incomplete upload, the client can submit the full list of blocks known to its local database (by encrypted hash) and compare it with the full list of blocks the server has associated with that client
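
A sketch of batched negotiation; have_blocks and put_block are hypothetical server calls, where have_blocks returns the subset of the submitted hashes the server already stores:

    def upload_blocks(blocks, have_blocks, put_block, batch=1000):
        """blocks maps block_id -> ciphertext; uploads only what's missing."""
        ids = list(blocks)
        for i in range(0, len(ids), batch):
            group = ids[i:i + batch]
            present = have_blocks(group)   # one round trip per batch
            for block_id in group:
                if block_id in present:
                    continue               # already stored server-side; no upload needed
                put_block(block_id, blocks[block_id])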

Streams

  • The pipeline is single-pass and does not require knowing the full data length in advance, so it can back up a stream such as another program’s stdout without spool files (see the sketch below)
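
A streaming variant of the earlier chunker sketch, reusing its WINDOW/PRIME/MOD/MASK constants; reading byte-by-byte is unoptimised, but it shows that one pass with bounded state suffices:

    import sys

    def stream_blocks(fp, min_block=16 * 1024):
        """Yield content-defined blocks from a stream of unknown length."""
        p_w = pow(PRIME, WINDOW, MOD)
        h, buf = 0, bytearray()
        while byte := fp.read(1):
            buf.append(byte[0])
            h = (h * PRIME + byte[0]) % MOD
            if len(buf) > WINDOW:
                # drop the byte that just left the rolling window
                h = (h - buf[-WINDOW - 1] * p_w) % MOD
            if (h & MASK) == 0 and len(buf) >= min_block:
                yield bytes(buf)
                h, buf = 0, bytearray()
        if buf:
            yield bytes(buf)

    # e.g. back up a pipe: for block in stream_blocks(sys.stdin.buffer): ...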

FUNDAMENTAL COMPARISON OF FILE- AND BLOCK-BASED SYSTEMS

FILE-BASED

  • Intensive rolling delta allows you to absolutely minimise the data upload when a file is modified
  • Consistent file snapshot, but not guaranteed to be crash consistent

BLOCK-BASED

  • Incremental backups do not require an intensive search for changes
  • First-boot and unsynced states require a very intensive full reupload or full diff
  • File modifications literally rewrite every block in use, requiring a full file reupload
  • Files are not necessarily consistent, but the whole system is crash consistent
  • Can use NT APIs to determine file extents on disk, allowing blocks of unwanted files to be excluded (a restore would leave these files full of zeros)