Block check latency

Posted on Jun 7, 2014

backup

When implementing a backup system, the client may need to ask the server whether it already has a certain block. There are several ways to implement this check, each with a different latency cost:

Option 1

Client sends block hash to server
Server replies Y/N
- High latency impact (one round trip per block)
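
To put a rough number on that, here is a minimal sketch of the waiting time when every block costs one full round trip. The 50 ms round-trip time and the 1 GiB of data are assumptions chosen for illustration, not measurements, and the function name is hypothetical.

```python
def serial_check_seconds(num_blocks: int, rtt_seconds: float) -> float:
    """Time spent waiting if each block check costs one full round trip."""
    return num_blocks * rtt_seconds


# Assumed figures: 1 GiB of data in 4 KiB blocks, 50 ms round-trip time.
blocks = (1 << 30) // (4 << 10)
wait = serial_check_seconds(blocks, 0.050)
print(blocks, "blocks,", round(wait / 3600, 1), "hours spent waiting")
```

Under those assumptions the client spends roughly 3.6 hours just waiting for Y/N replies, before any data is transferred.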

Option 2

Server sends client a list of all block hashes on server
Client is then free to stream in new blocks as fast as it can
- No latency impact
- Sensitive data release?
- Large initial download?

For blocks averaging 4KB against a 1TB storage pool, using a 256-bit hash sent as 64 hex characters (64 bytes), that’s a 16GB download (infeasibly large). Sending the raw 32-byte digests instead only halves it to 8GB. Increasing the average block size to 64KB makes it a 1GB download, which is still infeasible even if the hash list is rsynced.
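
The arithmetic behind those figures can be checked with a short sketch. It assumes binary units (1TB = 2^40 bytes, 4KB = 2^12 bytes) and a hypothetical helper name; the 32-byte rows show what a raw binary digest would cost instead of hex.

```python
KIB, GIB, TIB = 1 << 10, 1 << 30, 1 << 40


def hash_list_bytes(pool_bytes: int, block_bytes: int, hash_bytes: int) -> int:
    """Size of a full list of block hashes for a pool of the given size."""
    return (pool_bytes // block_bytes) * hash_bytes


for block_size in (4 * KIB, 64 * KIB):
    for hash_len in (64, 32):  # 64 = hex-encoded SHA-256, 32 = raw digest
        size = hash_list_bytes(TIB, block_size, hash_len)
        print(f"{block_size // KIB}KB blocks, {hash_len}-byte hashes: {size / GIB:g} GiB")
```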

Option 3

Server sends a prebuilt bloom filter of all block hashes on server
Client can immediately stream new blocks that definitely don’t exist on the server, and only has to ask the server about blocks that might exist
- Since duplicate blocks are reasonably unlikely (dedup gives ~3% space improvement in a typical scenario), this gets most of the performance benefit without much sensitive data release
- Bloom filter should be quite large to minimise false positives, but “quite large” depends on the size of the installation
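
The client side of this option could look something like the sketch below. It is a toy bloom filter, not a tuned implementation: the class, the `plan_upload` helper, the double-hashing scheme and the filter parameters are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: no false negatives, tunable false positives."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive all bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def plan_upload(block_hashes, server_filter):
    """Split hashes into those safe to stream and those needing a server query."""
    stream_now, must_ask = [], []
    for block_hash in block_hashes:
        if block_hash in server_filter:
            must_ask.append(block_hash)    # might exist: ask the server first
        else:
            stream_now.append(block_hash)  # definitely new: send without asking
    return stream_now, must_ask


# Usage: the server builds the filter, the client consults it locally.
server_filter = BloomFilter(num_bits=1 << 16, num_hashes=4)
known = hashlib.sha256(b"already stored").digest()
new = hashlib.sha256(b"brand new").digest()
server_filter.add(known)
stream_now, must_ask = plan_upload([known, new], server_filter)
```

Only the hashes in `must_ask` cost a round trip. After a successful upload the client can `add` the new hashes to its cached copy of the filter.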

For blocks averaging 4KB and a 5% false positive rate against a 1TB storage pool, we need a 200MB bloom filter. Increasing the average block size to 64KB makes it a 12MB download, which is probably feasible to perform on every backup or to cache for a while (the client can update the cache themselves when submitting blocks!).
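
These sizes follow from the standard formula for an optimally sized bloom filter, m = -n ln(p) / (ln 2)^2 bits for n items at false positive rate p. A quick sketch, assuming binary units for the pool and block sizes (the function names are hypothetical):

```python
import math

KIB, MIB, TIB = 1 << 10, 1 << 20, 1 << 40


def bloom_filter_bytes(num_items: int, false_positive_rate: float) -> float:
    """Optimal filter size in bytes: m = -n * ln(p) / (ln 2)^2 bits."""
    bits = -num_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    return bits / 8


def optimal_hash_count(false_positive_rate: float) -> int:
    """Optimal number of hash functions: k = -log2(p), rounded."""
    return round(-math.log2(false_positive_rate))


size_4k = bloom_filter_bytes(TIB // (4 * KIB), 0.05)
size_64k = bloom_filter_bytes(TIB // (64 * KIB), 0.05)
print(f"4KB blocks:  {size_4k / MIB:.1f} MiB")
print(f"64KB blocks: {size_64k / MIB:.1f} MiB")
print("hash functions:", optimal_hash_count(0.05))
```

That works out to about 6.2 bits per block at a 5% false positive rate, regardless of the hash length.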

Option 4

Client builds local knowledge of remotely stored blocks
Client sends blocks to server regardless of whether server already has them
Server checks whether the block already exists but does not inform the client
- No sensitive data release
- Negatively impacts bandwidth
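
The server side of Option 4 might be sketched as follows. The in-memory dictionary, the class name and the `"OK"` reply are stand-ins for illustration; a real server would write to disk and speak an actual protocol.

```python
import hashlib


class SilentDedupStore:
    """Hypothetical server-side store that deduplicates without telling the client."""

    def __init__(self):
        self._blocks = {}
        self.bytes_received = 0

    def put(self, block: bytes) -> str:
        self.bytes_received += len(block)  # bandwidth is spent either way
        key = hashlib.sha256(block).hexdigest()
        self._blocks.setdefault(key, block)  # keep only the first copy
        return "OK"  # same reply whether or not the block was already stored

    def stored_bytes(self) -> int:
        return sum(len(block) for block in self._blocks.values())


store = SilentDedupStore()
first_reply = store.put(b"x" * 4096)
second_reply = store.put(b"x" * 4096)  # duplicate: accepted, not stored twice
print(store.bytes_received, "bytes received,", store.stored_bytes(), "bytes stored")
```

Both uploads get an identical reply, so the reply itself says nothing about what the server already holds, while the duplicate still costs a full block of bandwidth.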