When implementing a backup system, the client may need to ask the server whether it already stores a given block. There are several ways to implement this, each with different latency impacts:
- Client sends block hash to server
- Server replies Y/N
- High latency impact (one blocking round trip per block)
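A minimal sketch of this query-per-block flow (the `server` object and its `has_hash`/`store` methods are hypothetical placeholders for whatever transport is used):

```python
import hashlib

def backup_blocks(blocks, server):
    """Upload each block only if the server doesn't already have it.

    `server` is assumed to expose has_hash(digest) -> bool and
    store(digest, data); both names are illustrative, not a real API.
    """
    for data in blocks:
        digest = hashlib.sha256(data).digest()
        # One blocking round trip per block: on a high-latency link
        # this dominates total backup time for small blocks.
        if not server.has_hash(digest):
            server.store(digest, data)
```

Because the client must wait for each Y/N reply before deciding whether to send the block, throughput on a high-latency link is bounded by round trips rather than bandwidth.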
- Server sends client a list of all block hashes on server
- Client is then free to stream in new blocks as fast as it can
- No latency impact
- Sensitive data release?
- Large initial download?
For blocks averaging 4KB against a 1TB storage pool, using a 256-bit (32-byte) hash, that’s an 8GB download (infeasibly large). Increasing average block size to 64KB makes it a 512MB download - still infeasible even if the list is rsynced.
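A quick sanity check of these download sizes, assuming 32-byte (256-bit) hashes such as SHA-256:

```python
HASH_BYTES = 32  # a 256-bit hash, e.g. SHA-256

def hash_list_bytes(pool_bytes, avg_block_bytes):
    """Total size of a full list of block hashes for the pool."""
    n_blocks = pool_bytes // avg_block_bytes
    return n_blocks * HASH_BYTES

TIB = 2**40  # 1TB pool (binary units)
print(hash_list_bytes(TIB, 4 * 1024) / 2**30)   # 8.0  (GiB, 4KB blocks)
print(hash_list_bytes(TIB, 64 * 1024) / 2**20)  # 512.0 (MiB, 64KB blocks)
```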
- Server sends a prebuilt bloom filter of all block hashes on server
- Client can quickly stream new blocks that definitely don’t exist on the server, and only has to query for blocks that might exist on the server
- Since duplicate blocks are reasonably unlikely (dedup gives ~3% space improvement in a typical scenario), this gets most of the performance benefit without much sensitive data release
- Bloom filter size should be quite large to minimise false positives - but “quite large” depends on size of installation
For blocks averaging 4KB and a 5% false positive rate against a 1TB storage pool, we need a 200MB bloom filter. Increasing the average block size to 64KB makes it a 12MB download, probably feasible to perform on every backup or to cache for a while (client can update the cache themselves when submitting blocks!)
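The filter sizes above follow from the standard optimal bloom filter sizing formula, m = -n·ln(p)/(ln 2)² bits for n items at false positive rate p:

```python
import math

def bloom_filter_bytes(n_items, fp_rate):
    """Optimal bloom filter size in bytes for n_items at fp_rate."""
    bits = -n_items * math.log(fp_rate) / (math.log(2) ** 2)
    return math.ceil(bits / 8)

TIB = 2**40  # 1TB pool
print(bloom_filter_bytes(TIB // (4 * 1024), 0.05) / 2**20)   # ≈ 199 MiB (4KB blocks)
print(bloom_filter_bytes(TIB // (64 * 1024), 0.05) / 2**20)  # ≈ 12.5 MiB (64KB blocks)
```

At 5% false positives only one block in twenty triggers an extra round trip to the server, so the latency cost is amortised across the stream.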
- Client builds local knowledge of remotely stored blocks
- Client sends blocks to server regardless of whether server already has them
- Server checks for the block already existing but does not inform the client
- No sensitive data release
- Negatively impacts bandwidth
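A minimal sketch of the server side of this last option, where duplicates are silently discarded (the `DedupStore` class and in-memory dict are illustrative only; a real server would persist blocks to disk):

```python
import hashlib

class DedupStore:
    """Server-side store that deduplicates without telling the client."""

    def __init__(self):
        self.blocks = {}  # digest -> block data

    def store(self, data):
        digest = hashlib.sha256(data).digest()
        # Duplicates are dropped here with no feedback to the client,
        # so nothing about existing data leaks back - at the cost of
        # the client having already spent bandwidth sending the block.
        self.blocks.setdefault(digest, data)
```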