No filesystem - the Rust codebase directly handles the disk layout and schedulin...

jaysoncena · on March 15, 2016

This is really cool! Can you give more details on how blocks, replication, metadata, etc. are handled?

james_cowling · on March 15, 2016

Lazy answer but we'll blog about this in the next month.

At a high level it's variable-sized blocks packed into 1GB extents, which are then aggregated into volumes and erasure coded across a set of disks on different machines/racks/rows/etc. We also replicate cross-country in addition to this. Live writes are written into non-erasure-coded volumes and encoded in the background.
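
To make the packing step concrete, here is a minimal Rust sketch of appending variable-sized blocks into fixed-capacity extents. It is an illustration only, not the actual implementation: the `Extent` struct, the `pack` function, and the choice of 1 GiB (rather than 10^9 bytes) for "1GB" are all assumptions, and aggregation into volumes and erasure coding are left out.

```rust
/// Assumed extent capacity: 1 GiB. The thread says "1GB" without
/// specifying GB vs. GiB, so this constant is a guess.
const EXTENT_CAPACITY: u64 = 1 << 30;

/// Hypothetical in-memory bookkeeping for a single extent.
struct Extent {
    /// Bytes already occupied by packed blocks.
    used: u64,
    /// (offset, length) of each packed block, in append order.
    blocks: Vec<(u64, u64)>,
}

impl Extent {
    fn new() -> Self {
        Extent { used: 0, blocks: Vec::new() }
    }

    /// Append a block of `len` bytes if it fits.
    /// Returns the block's offset within the extent, or None if the extent is too full.
    fn append(&mut self, len: u64) -> Option<u64> {
        if len > EXTENT_CAPACITY - self.used {
            return None;
        }
        let offset = self.used;
        self.blocks.push((offset, len));
        self.used += len;
        Some(offset)
    }
}

/// Pack blocks in arrival order, opening a new extent whenever the
/// current (last) one cannot hold the next block.
fn pack(block_lens: &[u64]) -> Vec<Extent> {
    let mut extents = vec![Extent::new()];
    for &len in block_lens {
        assert!(len <= EXTENT_CAPACITY, "block larger than an extent");
        if extents.last_mut().unwrap().append(len).is_none() {
            let mut fresh = Extent::new();
            fresh.append(len).expect("block fits in an empty extent");
            extents.push(fresh);
        }
    }
    extents
}
```

This is append-only, first-come packing: a block that does not fit in the current extent starts a new one, and earlier extents are never revisited.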

The volume metadata on the disks contains enough information to be self-describing (as a safety precaution), but we also have a two-level index that maps blocks to volumes and volumes to disks.
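
A two-level index of the kind described above could be sketched roughly as follows. This is an assumption-laden illustration, not the real design: the `Index` type, its method names, and the integer ID types are hypothetical, and a production index would be persistent and distributed rather than two in-memory hash maps.

```rust
use std::collections::HashMap;

// Hypothetical identifier types; the thread does not describe the real ones.
type BlockId = u64;
type VolumeId = u32;
type DiskId = u32;

/// Two-level index: blocks map to volumes, volumes map to disks.
#[derive(Default)]
struct Index {
    block_to_volume: HashMap<BlockId, VolumeId>,
    volume_to_disks: HashMap<VolumeId, Vec<DiskId>>,
}

impl Index {
    /// Record which disks hold a volume (second level).
    fn place_volume(&mut self, volume: VolumeId, disks: Vec<DiskId>) {
        self.volume_to_disks.insert(volume, disks);
    }

    /// Record which volume a block lives in (first level).
    fn record_block(&mut self, block: BlockId, volume: VolumeId) {
        self.block_to_volume.insert(block, volume);
    }

    /// Resolve a block to the disks holding its volume.
    /// Returns None if either level has no entry.
    fn locate(&self, block: BlockId) -> Option<&[DiskId]> {
        let volume = self.block_to_volume.get(&block)?;
        self.volume_to_disks.get(volume).map(|disks| disks.as_slice())
    }
}
```

One consequence of splitting the index this way: moving a volume to different disks only touches the volume-to-disks level, and the per-block entries stay unchanged.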

More info to come later.

kdkeyser · on March 15, 2016

Are live writes replicated in real time, or are they locally staged (which is what Facebook does, I believe)?

Also, do you mimic the eventually consistent behaviour of AWS, or do you offer a stronger form of consistency?

jamwt · on March 15, 2016

Live writes are written out with 4x redundancy in the local zone, then asynchronously replicated out to the remote zone. Some time later, the data is erasure coded into a more space-efficient format, independently in each zone.
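
To show why the background re-encoding saves space, here is a rough Rust calculation of raw bytes per zone before and after. Everything specific in it is an assumption: "4x redundancy" is read as four full copies, and the erasure code parameters `k` (data fragments) and `m` (parity fragments) are hypothetical, since the thread does not state them. Padding and metadata overhead are ignored.

```rust
/// Raw bytes consumed when `logical` bytes are stored as `copies` full replicas.
/// Assumes "4x redundancy" means four complete copies.
fn replicated_bytes(logical: u64, copies: u64) -> u64 {
    logical * copies
}

/// Raw bytes consumed when `logical` bytes are split into `k` data fragments
/// plus `m` parity fragments of equal size. `k` and `m` are hypothetical
/// parameters; the thread does not give the real ones.
fn erasure_coded_bytes(logical: u64, k: u64, m: u64) -> u64 {
    // Each fragment holds ceil(logical / k) bytes.
    let fragment = (logical + k - 1) / k;
    fragment * (k + m)
}
```

For example, with a made-up 6+3 code, 1200 logical bytes take 1800 raw bytes (1.5x), compared with 4800 raw bytes (4x) under four-way replication.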