This article covers the design of an ideal file-based backup solution.
DESIRED PROPERTIES
- Untrusted server
- Incremental forever
METHOD
- Split the file into blocks using the rsync algorithm, with deterministic split points. This makes it very likely that a one-byte addition in the middle of a file won't cause a re-upload of every following section. Use a fairly large block interval (e.g. 100 KB) to minimise the size of the metadata (see the chunking sketch below)
- Encrypt each block with its own SHA-1 hash as the key, and reference it by the SHA-1 hash of the encrypted block (see the encryption sketch below)
- Send the blocks to offsite storage, which can incref and deduplicate based on the encrypted hashes, and cannot decrypt them because the original SHA-1 is never disclosed to it (see the reference-counting sketch below)
- Client software maintains a local database of which files refer to which encrypted blocks, plus the list of known decryption keys. The database is encrypted with a password, kept locally, and mirrored offsite (see the schema sketch below)
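To make the first step concrete, here is a minimal Python sketch of content-defined chunking. It uses a simple gear-style rolling hash rather than the actual rsync rolling checksum, and every name and constant in it (chunk_stream, the GEAR table, the size limits) is an illustrative choice, not something specified above:

    import random

    random.seed(0x5EED)                      # fixed seed => boundaries are deterministic everywhere
    GEAR = [random.getrandbits(32) for _ in range(256)]

    AVG_BLOCK = 100 * 1024                   # target ~100 KB average block, as suggested above
    MIN_BLOCK, MAX_BLOCK = 16 * 1024, 1024 * 1024

    def chunk_stream(stream):
        """Yield content-defined blocks from a binary stream in a single pass."""
        buf, h = bytearray(), 0
        while True:
            data = stream.read(64 * 1024)
            if not data:
                break
            for b in data:
                buf.append(b)
                # Gear-style rolling hash: the value depends only on the last ~32
                # bytes, so cut points are local to the content and an insertion
                # earlier in the file does not shift later boundaries.
                h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
                if (len(buf) >= MIN_BLOCK and h % AVG_BLOCK == 0) or len(buf) >= MAX_BLOCK:
                    yield bytes(buf)
                    buf, h = bytearray(), 0
        if buf:
            yield bytes(buf)

Because the cut decision depends only on the most recent bytes, inserting one byte early in a file changes only the block containing it and, at worst, its immediate neighbour.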
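The encryption step is convergent encryption. A sketch, assuming the third-party Python cryptography package and substituting SHA-256 / AES-256-GCM for the SHA-1 mentioned above (see BIKESHEDDING); the function names and the nonce derivation are illustrative:

    import hashlib
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    def encrypt_block(plaintext: bytes):
        """Convergent encryption: the key is the block's own hash, so identical
        plaintext blocks always produce identical ciphertext (=> dedupable)."""
        key = hashlib.sha256(plaintext).digest()               # decryption key, stored only in the client DB
        nonce = hashlib.sha256(b"nonce" + key).digest()[:12]   # deterministic nonce; safe only because the key is per-content
        ciphertext = AESGCM(key).encrypt(nonce, plaintext, None)
        block_id = hashlib.sha256(ciphertext).hexdigest()      # the only identifier the server ever sees
        return block_id, key, ciphertext

    def decrypt_block(key: bytes, ciphertext: bytes) -> bytes:
        nonce = hashlib.sha256(b"nonce" + key).digest()[:12]
        return AESGCM(key).decrypt(nonce, ciphertext, None)

Identical plaintext blocks always map to identical ciphertext, which is what lets the server deduplicate without being able to read anything.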
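On the server side, deduplication then reduces to reference counting keyed by the encrypted-block hash. A toy in-memory sketch (a real service would persist this, authenticate users, and verify the deletion token; everything here is hypothetical):

    from typing import Optional

    class BlockStore:
        """Toy server-side store: it sees only encrypted blocks and their hashes,
        never the content keys, so it can deduplicate but not decrypt."""

        def __init__(self):
            self.blobs = {}   # block_id -> encrypted bytes
            self.refs = {}    # block_id -> set of user_ids holding a reference

        def store(self, user_id: str, block_id: str, ciphertext: Optional[bytes]) -> bool:
            """Return True if the server now holds the block for this user."""
            if block_id in self.blobs:
                self.refs[block_id].add(user_id)      # deduplicate: just incref
                return True
            if ciphertext is None:                    # caller only asked "do you have it?"
                return False
            self.blobs[block_id] = ciphertext
            self.refs[block_id] = {user_id}
            return True

        def decref(self, user_id: str, block_id: str, deletion_token: str) -> None:
            # A real server would verify deletion_token before honouring this.
            # Only a user who holds a reference can drop it, so one tenant
            # cannot delete blocks another tenant still needs.
            users = self.refs.get(block_id, set())
            users.discard(user_id)
            if not users:
                self.blobs.pop(block_id, None)
                self.refs.pop(block_id, None)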
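The local metadata database can be very small. One possible SQLite layout, purely illustrative and reused by the later sketches in this article; the file backing it is what gets password-encrypted and mirrored offsite:

    import sqlite3

    def init_manifest(path="manifest.db"):
        """Create the client-side manifest: which files map to which encrypted
        blocks, plus the decryption key for each block."""
        db = sqlite3.connect(path)
        db.executescript("""
            CREATE TABLE IF NOT EXISTS blocks (
                block_id TEXT PRIMARY KEY,   -- hash of the encrypted block (what the server knows)
                key      BLOB NOT NULL,      -- hash of the plaintext (what the server must never know)
                size     INTEGER NOT NULL
            );
            CREATE TABLE IF NOT EXISTS files (
                file_id  INTEGER PRIMARY KEY,
                path     TEXT NOT NULL,
                mtime    REAL,
                version  INTEGER NOT NULL
            );
            CREATE TABLE IF NOT EXISTS file_blocks (
                file_id  INTEGER REFERENCES files(file_id),
                seq      INTEGER NOT NULL,   -- block order within the file
                block_id TEXT REFERENCES blocks(block_id),
                PRIMARY KEY (file_id, seq)
            );
        """)
        return db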
BIKESHEDDING
Replace SHA-1 above with the latest fast, cryptographically secure hash algorithm (BLAKE1 was a SHA-3 finalist, and BLAKE2 outperforms MD5)
SPECIFIC TASKS
Pruning
- Server can keep track of when a block was incref’d, but not when it is no longer part of a current working set, so
- It must be the client’s responsibility to decref blocks
- Server has to keep track of which users hold a reference to which blocks, to prevent abuse (one user must not be able to free blocks another user still references)
- Must have a separate deletion key (which the server authorises against) and encryption key (which the server must never know)
- Deletion key must never be autosaved, to prevent abuse (see the prune sketch below)
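A client-side prune, written against the illustrative schema and toy server above, might look like the following; the deletion-token handling is a sketch of the policy in this list, not a defined protocol, and keep_versions is an invented parameter:

    import getpass

    def prune_old_versions(db, server, user_id, keep_versions=2):
        """Drop references to blocks that only old file versions still use.
        The deletion token is typed in each time and never written to disk."""
        deletion_token = getpass.getpass("Deletion key (never saved): ")

        # Blocks still referenced by the versions we are keeping
        live = {row[0] for row in db.execute("""
            SELECT DISTINCT fb.block_id
            FROM file_blocks fb JOIN files f USING (file_id)
            WHERE f.version > (SELECT MAX(version) FROM files) - ?""", (keep_versions,))}

        # Every block the client currently knows about
        known = {row[0] for row in db.execute("SELECT block_id FROM blocks")}

        for block_id in known - live:
            server.decref(user_id, block_id, deletion_token)
            db.execute("DELETE FROM blocks WHERE block_id = ?", (block_id,))
        # (Deleting the stale files/file_blocks rows is omitted for brevity.)
        db.commit()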
Auditing
- Client can use the local database to determine where storage is being used, without trusting the remote storage
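Against the illustrative schema above, such an audit is a single local query; the remote storage is never consulted:

    def usage_by_path(db, top=20):
        """Report which paths account for the most backed-up data, straight
        from the local manifest (blocks shared between files are counted once
        per referencing file)."""
        return db.execute("""
            SELECT f.path, SUM(b.size) AS bytes
            FROM files f
            JOIN file_blocks fb USING (file_id)
            JOIN blocks b USING (block_id)
            GROUP BY f.path
            ORDER BY bytes DESC
            LIMIT ?""", (top,)).fetchall()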
Restore
- Client can request any mirrored copy of the encrypted profile (the offsite database), decrypt it locally, request the blocks for the latest (or any) version, and reconstruct the files locally using the keys in the database
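A single-file restore, reusing the illustrative schema and the decrypt_block sketch from earlier; server.fetch is a hypothetical remote call:

    def restore_file(db, server, path, out_path):
        """Rebuild one file: look up its blocks in the local manifest, fetch the
        encrypted blobs by their encrypted hash, decrypt with the stored keys."""
        rows = db.execute("""
            SELECT fb.seq, b.block_id, b.key
            FROM files f
            JOIN file_blocks fb USING (file_id)
            JOIN blocks b USING (block_id)
            WHERE f.path = ?
              AND f.version = (SELECT MAX(version) FROM files WHERE path = ?)
            ORDER BY fb.seq""", (path, path)).fetchall()

        with open(out_path, "wb") as out:
            for _seq, block_id, key in rows:
                ciphertext = server.fetch(block_id)         # hypothetical remote call
                out.write(decrypt_block(key, ciphertext))   # from the encryption sketch above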
Efficiency
- Block check/store requests can be batched to reduce the number of round-trip network requests (see the sketch after this list)
- After an incomplete upload, the client can submit the full list of known blocks (by encrypted hash) from its local database and compare it against the full list of blocks the server holds for that account
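Both points reduce to one batched question: which of these encrypted hashes does the server already hold? A sketch, with server.have_blocks as a hypothetical batched API call:

    def missing_blocks(db, server, batch=1000):
        """One round trip per `batch` hashes: ask the server which of our known
        encrypted-block hashes it already holds, then upload only the rest.
        Also doubles as recovery after an interrupted upload."""
        local = [row[0] for row in db.execute("SELECT block_id FROM blocks")]
        missing = []
        for i in range(0, len(local), batch):
            chunk = local[i:i + batch]
            present = set(server.have_blocks(chunk))   # hypothetical batched API call
            missing.extend(h for h in chunk if h not in present)
        return missing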
Streams
- Single-pass and does not require knowing the full data length in advance, so it can back up a piped stream such as another program's stdout without spool files
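Reusing the earlier sketches, a stream backup is just the normal upload loop fed from stdin; nothing needs to know the total length in advance:

    import sys

    def backup_stdin(db, server, user_id):
        """Single pass over an unbounded stream: chunk, encrypt and upload as we
        go, so e.g. `pg_dump | backup` needs no spool file and no known length.
        (Recording the stream's own manifest rows is omitted for brevity.)"""
        for block in chunk_stream(sys.stdin.buffer):
            block_id, key, ciphertext = encrypt_block(block)
            if not server.store(user_id, block_id, None):    # cheap existence check / incref
                server.store(user_id, block_id, ciphertext)  # upload only blocks the server lacks
            db.execute("INSERT OR IGNORE INTO blocks VALUES (?, ?, ?)",
                       (block_id, key, len(block)))
        db.commit()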
FUNDAMENTAL COMPARISON OF FILE- AND BLOCK- BASED SYSTEMS
FILE-BASED
- Intensive rolling delta allows you to absolutely minimise the data upload when a file is modified
- Consistent file snapshot, but not guaranteed to be crash consistent
BLOCK-BASED
- Incremental backups do not require intensive search
- First-boot and out-of-sync states require a very intensive full reupload or full diff
- Not necessarily file-consistent
- File modifications literally rewrite every block in use, requiring a full file reupload
- Files not necessarily consistent, but the whole system is crash consistent
- Can use NT APIs to determine file extents on disk, allowing blocks of unwanted files to be excluded (a restore would set these files to all zeros)