
With open-source SQL databases there is no "simply" when it comes to replication. Even today, MySQL replication is brittle, and master/slave inconsistencies are the rule rather than the exception. Slave crashes often cause replayed transactions because master.info and relay-log.info aren't written atomically. The replication landscape with PostgreSQL is varied and essentially a bag on the side. Last I counted there were more than 10 different ways of doing it, several of them involving trigger-based log shipping. It wasn't about open-source vs commercial, it was about scaling the reads.

The detail pages weren't pre-generated, they were based on read-only catalog data, which I think is entirely database appropriate. I imagine that complete re-generation of data is no longer done, but I'd be willing to bet that Berkeley DBs are still used in production somewhere.



A read-only database isn't a database in my book.

I agree that built-in replication can be difficult to administer even today, but you're being completely revisionist here. Replication wasn't introduced into MySQL until 2000. In 1997, you would by necessity have rolled your own replication system tailored to your needs (much simpler than solving the general-case problem). That's basically what you did anyway, but you solved it in the most trivial way possible: you 'replicated' by doing a complete database dump and re-distributing the entire DB. If you'd had a viable open-source relational database, you could have scaled the reads and got more developer productivity by distributing a SQL database (e.g. SQLite) rather than a key-value database (BDB).

I appreciate your standing up and giving a concrete example of NoSQL usage - nobody else has been brave enough to do so. But it seems that the reasons for it were highly specific to the time: there were no viable open-source databases, Amazon was just introducing the idea of customer reviews (i.e. pre Web 2.0) so data was primarily read-only, memory was comparatively expensive and memcached didn't exist, and you had a comparatively small product catalog where complete re-generation was an option. I don't think you can carry forward the optimizations you made in that framework into today's world.


See my reply to the grandparent.

I was actually responsible for that system, and for moving away from pushing BDBs to servers sometime in '00 or so.

As you said, these weren't really databases by any stretch of the imagination, simply snapshots, built for a very specific type of query (by ASIN, by time, reverse-ordered).
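For anyone who hasn't worked with this kind of store, here is a rough sketch of that access pattern. It is not the actual system: the names `build_snapshot` and `reviews_for` are made up, and a sorted in-memory list stands in for a B-tree file. The only real idea is the composite key: sort by (ASIN, negated timestamp) so that a plain forward scan from the ASIN prefix returns the newest entries first.

```python
import bisect


def build_snapshot(reviews):
    """Build a sorted, read-only snapshot from (asin, timestamp, text) tuples."""
    # Negating the timestamp means an ascending scan yields newest first.
    return sorted((asin, -ts, text) for asin, ts, text in reviews)


def reviews_for(snapshot, asin, limit=10):
    """Return up to `limit` (timestamp, text) pairs for `asin`, newest first."""
    # (asin,) sorts before every (asin, -ts, text) key, so this finds
    # the first entry for the ASIN, or the insertion point if there is none.
    start = bisect.bisect_left(snapshot, (asin,))
    out = []
    for key_asin, neg_ts, text in snapshot[start:start + limit]:
        if key_asin != asin:
            break  # ran past the last entry for this ASIN
        out.append((-neg_ts, text))
    return out


if __name__ == "__main__":
    snap = build_snapshot([
        ("B0001", 100, "old"),
        ("B0001", 300, "new"),
        ("B0002", 200, "other"),
    ])
    print(reviews_for(snap, "B0001"))
```

Nothing in there supports writes or any other kind of query, which is the point: the snapshot only answers the one question it was built for.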

Building the DBs was a pain in the ass, because they were so big that you had to do clean builds (instead of incrementals) fairly often to keep them from wasting space. There was also all sorts of voodoo magic going on to work around various BDB issues.

The system did eventually move to a service architecture (as all of AMZN did), for three main reasons:

1) pushing that much data to more and more servers was getting insane, even on their internal networks.

2) we wanted faster turnaround for new reviews

3) rebuilding the BDBs was becoming more and more cumbersome with scale

All that said, the original system did take us pretty darn far, both in scalability of traffic and scalability of data, farther than most websites will ever reach.

Fun times working there, you really get to work on some unique problems.



