Hacker News

> If it were me in my ideal world, I’d have copies of everything stored in S3 because of its exceptional durability; I sincerely believe there is no safer place on the planet to save data. Then I’d have a series of scripts that would rehydrate all my databases and config from S3, reconfigure all my code, and fire up my applications. I’d test this script regularly; any more than a few weeks untested and I’d lose confidence that it’d work.
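The restore drill the quoted comment describes can be quite short. A minimal sketch, assuming boto3 and made-up bucket/key names; the selection logic is kept as a pure function so it can be exercised without AWS credentials:

```python
# Sketch of a periodic restore drill: find the newest backup in S3,
# pull it down, and restore it somewhere disposable. Bucket, prefix,
# and paths are hypothetical.
from datetime import datetime, timezone

def pick_latest(objects):
    """Pick the most recently modified object from an S3 listing."""
    if not objects:
        raise RuntimeError("no backups found -- the drill has already failed")
    return max(objects, key=lambda o: o["LastModified"])["Key"]

def restore_drill(bucket="example-backups", prefix="db/"):
    import boto3  # real run only; needs AWS credentials
    s3 = boto3.client("s3")
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix)
    key = pick_latest([o for p in pages for o in p.get("Contents", [])])
    s3.download_file(bucket, key, "/tmp/restore.sql")
    # Load /tmp/restore.sql into a scratch database and assert on
    # known-good rows here; the untested restore path is what rots.
```

Running `restore_drill()` on a schedule (and alerting when it fails) is the "test this script regularly" part.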

for the vast majority of applications, this advice buried at the end is all you really need to survive anything short of total planetary annihilation (at which point i'd worry more about customer "churn", to put it bluntly). "all" is a strong word of course; i've never worked anywhere that actually tested backup recovery with that kind of regularity.



I have written a few documents outlining some risks and possible mitigations against AWS problems.

It's not just "Russia nuked Northern Virginia". That's far from the most likely failure scenario, in my estimation. I consider "a billing problem causes Amazon to close our account", "a trusted employee with top-level access deletes everything in our account, right down to Glacier storage", or "an employee with access severely violates the AWS T&Cs and gets us booted off" to be far more realistic threats, while having effectively the same consequences as a couple of ICBMs with MIRVs targeted at every datacenter in NV...

In any of those cases, expecting to rehydrate from S3 isn't an option.

Of course, nobody wants to incur the costs associated with "How could we keep business continuity without AWS?" because it gets very expensive very quickly for anything non trivial.


> Of course, nobody wants to incur the costs associated with "How could we keep business continuity without AWS?" because it gets very expensive very quickly for anything non trivial.

I think that mostly depends on how much data you generate in a day. If sending backups out increases your bandwidth use by 5% then it's pretty easy to throw that into cold storage on google or azure or local or all three.
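If the daily delta really is that small, the fan-out itself is not much code. A sketch, assuming boto3; the endpoint URLs and bucket names are made up, and the point is that Cloudflare R2 and Backblaze B2 both speak the S3 API, so one client library covers "all three":

```python
# Push the same backup object to several S3-compatible cold stores.
# All endpoints and bucket names below are hypothetical.
from datetime import date

TARGETS = [
    {"name": "aws", "endpoint": None, "bucket": "example-backups"},
    {"name": "r2", "endpoint": "https://<account>.r2.cloudflarestorage.com", "bucket": "example-backups"},
    {"name": "b2", "endpoint": "https://s3.us-west-000.backblazeb2.com", "bucket": "example-backups"},
]

def backup_key(prefix, day):
    """Deterministic key naming, so every provider holds the same object."""
    return f"{prefix}/{day.isoformat()}.sql.gz"

def upload_everywhere(local_path, key, targets=TARGETS):
    """Push one file to every configured provider (needs per-target credentials)."""
    import boto3  # real run only
    for t in targets:
        client = boto3.client("s3", endpoint_url=t["endpoint"])
        client.upload_file(local_path, t["bucket"], key)
```

Each provider needs its own credentials configured, which is the tedious part, not the bandwidth.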


It's not just the data (that's an easy enough problem to solve).

It's the infrastructure.

Even if you've only got a reasonably simple platform (say, some redundant EC2 app servers behind a load balancer and a multi-AZ RDS database, with some S3 storage and a CloudFront distribution serving static assets), you've probably also got Route53 DNS hosting, ACM SSL certs, deployment AMIs, CloudWatch monitoring/alerting, and a bunch of other "incidental" but effectively proprietary AWS stuff, because it's there.

How do you get all that stood up "right now" in Azure or GCP or DigitalOcean or wherever, unless you've already put the time/effort into making that happen?

How many "single points of failure" are locked inside your AWS account? (For my stuff, Route53 is the thing that keeps me up at night occasionally. If we lost access to the domains registered/hosted in AWS, we'd need to pick new domain names and update all our apps...)


I guess it depends on how much you need "right now".

It doesn't take crazy amounts of effort to set up app servers, a load balancer, a database, and S3-compatible storage somewhere else.

If you had one person working on that two days a month, you could keep a warm fallback system ready to go. That does require keeping a map of what your cloud services are actually doing, which is a good idea anyway.


interesting threat vector. would the CEO having sole access to a Cloudflare R2 clone satisfy at least data durability? full continuity without AWS is probably impossible in most real-life scenarios.


If the CEO has sole access to it, how do the backups get there?


Having different read/write permissions is an important durability consideration. In this case the CEO has read privs (not write), and the archive utilities have write privs (not read).

There are other problems of course, CEO being hit by a bus being the most obvious.


> In this case the CEO has read privs (not write) and the archive utilities have write privs, not read.

Not sure you could really make this work.

The archive utilities would also need to be barred from overwriting or deleting backups. If that's automated, who configures it? And how do you do differential backups without read privileges?


The archive utility, when run as the backup user, cannot read the files themselves, but it can read file metadata or keep a database of backup state, so it knows where to send differential backups.

The archive utility, run as the recover user, can only read the file.
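Mapped onto AWS IAM, that split looks roughly like the following sketch (the bucket name is hypothetical, and the two policies are shown together here only for illustration). `s3:ListBucket` exposes key names, sizes, and timestamps, which is enough for the backup role to plan differential uploads without ever holding `s3:GetObject`:

```json
{
  "BackupRolePolicy": {
    "Version": "2012-10-17",
    "Statement": [
      {"Effect": "Allow", "Action": ["s3:PutObject"],
       "Resource": "arn:aws:s3:::example-backups/*"},
      {"Effect": "Allow", "Action": ["s3:ListBucket"],
       "Resource": "arn:aws:s3:::example-backups"},
      {"Effect": "Deny", "Action": ["s3:GetObject", "s3:DeleteObject"],
       "Resource": "arn:aws:s3:::example-backups/*"}
    ]
  },
  "RecoverRolePolicy": {
    "Version": "2012-10-17",
    "Statement": [
      {"Effect": "Allow", "Action": ["s3:GetObject", "s3:ListBucket"],
       "Resource": ["arn:aws:s3:::example-backups",
                    "arn:aws:s3:::example-backups/*"]}
    ]
  }
}
```

Pairing the explicit deny on `s3:DeleteObject` with bucket versioning or S3 Object Lock also limits how much damage a compromised backup credential can do to history that's already written.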


Ehhhh. I had a good reminder that S3 is only as good as its supporting interfaces a few years ago. I needed to create some test buckets with production data for dev work while I was getting a project off the ground. Sure, I could have made a new, separate read-only user just for the task and all that, but it was a simple task and I was crunched for time, so I logged into the web GUI with the account admin login, made the copy, ran my tests, and deleted it. No problem, I thought. I got a voicemail a couple of days later from their support team saying they had reason to believe they'd lost some of my data: apparently the GUI delete had a (quickly detected, fixed, and proactively addressed) bug that was deleting entries with the same relative key in every vaguely similarly named bucket, including my production bucket. Luckily it was the beginning of the transition and I still had local copies.

Anyone could validly argue that this is user error and sloppy ops work, but that's almost always the case in data loss events, be it unverified backups, abusing root, etc.

I still think there’s value in diversifying both in vendor and physical location.



