Block check latency

When implementing a backup system, the client may need to ask the server whether it already has a certain block. There are several ways to implement this check, each with a different latency cost:

Option 1

  • Client sends block hash to server
  • Server replies Y/N
    • High latency impact
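
To get a feel for why the latency impact is high, here is a rough sketch of the waiting time if the client checks one block at a time. The 50ms round-trip time and the 1GB backup size are illustrative assumptions, and the model assumes no pipelining or batching of requests.

```python
import math

def serial_check_seconds(data_bytes: int, block_bytes: int, rtt_seconds: float) -> float:
    """Time spent purely waiting on Y/N replies, one request in flight at a time."""
    num_blocks = math.ceil(data_bytes / block_bytes)
    return num_blocks * rtt_seconds

GIB = 2**30
KIB = 2**10
ASSUMED_RTT = 0.05  # hypothetical 50ms round trip

for block in (4 * KIB, 64 * KIB):
    wait = serial_check_seconds(1 * GIB, block, ASSUMED_RTT)
    print(f"{block // KIB}KB blocks: {wait / 60:.0f} minutes of waiting")
```

Under these assumptions the waiting time scales with the number of blocks rather than with the amount of data, which is why small blocks suffer most.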

Option 2

  • Server sends client a list of all block hashes on server
  • Client is then free to stream in new blocks as fast as it can
    • No latency impact
    • Sensitive data release?
    • Large initial download?

For blocks averaging 4KB against a 1TB storage pool, the server holds roughly 268 million blocks. A 256-bit hash is 32 bytes raw, or 64 bytes when hex-encoded, so the full list is an 8GB download in raw form or 16GB as hex (infeasibly large). Increasing average block size to 64KB makes it 512MB raw or 1GB as hex - still infeasible even if the hash list is rsynced.
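
The size of the full hash list can be checked with a few lines of arithmetic. This sketch assumes binary units (1TB = 2^40 bytes) and considers a 256-bit hash both as 32 raw bytes and as 64 hex characters; the function name is made up for illustration.

```python
def hash_list_bytes(pool_bytes: int, block_bytes: int, hash_bytes: int) -> int:
    """Size of a flat list holding one hash per stored block."""
    num_blocks = pool_bytes // block_bytes
    return num_blocks * hash_bytes

TIB = 2**40
GIB = 2**30
KIB = 2**10

for block in (4 * KIB, 64 * KIB):
    for hash_bytes in (32, 64):  # raw digest vs. hex-encoded digest
        size = hash_list_bytes(TIB, block, hash_bytes)
        print(f"{block // KIB}KB blocks, {hash_bytes}-byte hashes: {size / GIB:.1f} GiB")
```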

Option 3

  • Server sends a prebuilt bloom filter of all block hashes on server
  • Client can immediately stream new blocks that definitely don’t exist on server, and only has to ask about blocks that might exist on server
    • Since duplicate blocks are reasonably unlikely (dedup gives ~3% space improvement in a typical scenario), this gets most of the performance benefit without much sensitive data release
    • Bloom filter size should be quite large to minimise false positives - but “quite large” depends on size of installation
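
The client-side decision could look something like the sketch below. The `BloomFilter` class and `plan_upload` helper are hypothetical, written here only for illustration; a real system would need the client and server to agree on the exact filter layout and hash derivation.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter; bit positions come from double hashing a SHA-256 digest."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the stride is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def plan_upload(block_hashes, server_filter):
    """Split hashes into 'stream right away' and 'must ask the server first'."""
    stream_now, must_ask = [], []
    for h in block_hashes:
        # A miss is definite: the server cannot have the block.
        # A hit only means "maybe", so it still needs a round trip.
        (must_ask if h in server_filter else stream_now).append(h)
    return stream_now, must_ask


# Example: the server's filter contains one previously stored block.
server_filter = BloomFilter(num_bits=1024, num_hashes=4)
old_hash = hashlib.sha256(b"old block").digest()
new_hash = hashlib.sha256(b"new block").digest()
server_filter.add(old_hash)
stream_now, must_ask = plan_upload([old_hash, new_hash], server_filter)
```

Only the `must_ask` list pays the round-trip cost of Option 1; everything in `stream_now` can be sent without waiting.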

For blocks averaging 4KB and a 5% false positive rate against a 1TB storage pool, we need a 200MB bloom filter. Increasing the average block size to 64KB makes it a 12MB download, which is probably feasible to fetch on every backup or to cache for a while (the client can update its cached copy itself as it submits blocks!)
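
These figures follow from the standard sizing formula for an optimally configured Bloom filter, m = -n ln(p) / (ln 2)^2 bits for n items at false positive rate p. A small sketch, assuming binary units (1TB = 2^40 bytes) and a made-up function name:

```python
import math

def bloom_filter_bytes(num_items: int, false_positive_rate: float) -> int:
    """Bytes needed by an optimally sized Bloom filter: m = -n ln(p) / (ln 2)^2 bits."""
    bits = -num_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    return math.ceil(bits / 8)

TIB = 2**40
MIB = 2**20
KIB = 2**10

for block in (4 * KIB, 64 * KIB):
    num_blocks = TIB // block
    size = bloom_filter_bytes(num_blocks, 0.05)
    print(f"{block // KIB}KB blocks: {size / MIB:.1f} MiB")
```

At a 5% false positive rate this works out to a little over 6 bits per block, compared with 256 bits per block for the full hash list.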

Option 4

  • Client builds local knowledge of remotely stored blocks
  • Client sends blocks to server regardless of whether server already has them
  • Server checks for block already existing but does not inform client
    • No sensitive data release
    • Negatively impacts bandwidth
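
A minimal sketch of Option 4, assuming SHA-256 block hashes and an in-memory stand-in for the server. The `Server` and `Client` classes are hypothetical and exist only to show the information flow: the upload call returns nothing, so a client cannot learn whether a block was already stored.

```python
import hashlib

class Server:
    """In-memory stand-in for the remote block store."""

    def __init__(self) -> None:
        self.blocks = {}

    def put(self, data: bytes) -> None:
        # Deduplicate silently; returning None reveals nothing to the client.
        self.blocks.setdefault(hashlib.sha256(data).digest(), data)


class Client:
    def __init__(self, server: Server) -> None:
        self.server = server
        self.uploaded = set()  # local knowledge of blocks this client has sent
        self.bytes_sent = 0

    def backup(self, data: bytes) -> None:
        digest = hashlib.sha256(data).digest()
        if digest in self.uploaded:
            return  # this client already sent it, so skip without asking
        self.server.put(data)  # sent even if another client stored it earlier
        self.bytes_sent += len(data)
        self.uploaded.add(digest)


server = Server()
alice, bob = Client(server), Client(server)
block = b"shared block contents"
alice.backup(block)
alice.backup(block)  # skipped using local knowledge
bob.backup(block)    # uploaded again: the bandwidth cost of this option
print(len(server.blocks), alice.bytes_sent, bob.bytes_sent)
```

The second client pays the full upload cost for a block the server already holds, which is the bandwidth trade-off noted above.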