I provide professional services optimizing on-prem and cloud Postgres databases. It always surprises me how readily companies treat cloud databases as fully automated and rely on them to auto-tune everything.
Well, they don't, and as soon as the product goes live and the amount of data grows even a little, performance degrades dramatically.
* Never count on cloud databases to replace a DBA's work.
* Don't take ORM optimizations for granted. They all suck at one point or another.
* If you can't afford an FTE, take short professional gigs from time to time to fix and boost performance.
I'd even go as far as to say that if you don't know how to write the SQL that does the same thing your ORM framework is doing, you're setting yourself up for a lot of pain later when you do something non-trivial.
Worse, ORMs tend to creep all through your code. Not only does that introduce all sorts of coupling and responsibility issues down the road; above all, it leads to unexpected issues.
Sure, it's nice being able to say `ThisPost.authors.sort(DESC, 'postCount').where('active = true').select('name', 'avatarUrl', 'postCount')` anywhere in your codebase, but you have now spread knowledge of how to sort and filter, what attributes there are, what their relations are, and so on, throughout your codebase.
Or, more on-topic: you have now spread knowledge about what can be fetched efficiently, and what will cause performance pain, throughout your code. Suddenly some view template, or a completely unrelated validation, can bring your database down.
Rails' ActiveRecord is especially cumbersome in this respect: it offers "sharp knives"[0] but hardly any tooling to protect yourself (or that new intern) from leaving all those sharp knives lying around. A `has_and_belongs_to_many :users` and a `.where(users: { status: 'active', comment_count: '>10' })` is just two lines. But they will haunt you forever and will probably bring your database down if you have any load.
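To make that concrete, here is roughly the SQL such a two-liner expands to (a sketch; the table and column names are hypothetical, inferred from the association above):

```sql
-- The innocuous ActiveRecord chain becomes a three-table join;
-- the join table keeps growing, and the comment_count filter is
-- typically unindexed, forcing a scan under load.
SELECT posts.*
FROM posts
JOIN posts_users ON posts_users.post_id = posts.id   -- HABTM join table
JOIN users       ON users.id = posts_users.user_id
WHERE users.status = 'active'
  AND users.comment_count > 10;
```

Anyone can write that chain from any view or callback, but only someone who can read this SQL sees why it hurts once there's real data behind it.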
This is why nowadays I prefer to use query builders (like knex - knexjs.org) instead of ORMs. You can optimise your queries without having huge blobs of raw SQL in the middle of your model code, and you can manually build models (or service objects, as I prefer) that are completely transparent.
Yeah, for me this is a classic "easy vs simple" example.
Cloud databases are a very easy option. But Postgres is 1.3 million lines of code, and on top of that you add the complexity of the cloud vendor's environment, choices, and custom code. It may be easy, but it's definitely not simple.
This is pretty basic PostgreSQL tuning (1). An oft-cited problem with PostgreSQL is that it uses a process per connection (session in PG lingo, I think). `work_mem` is the amount of memory a connection has to, you know, do work: sorting, DISTINCT, some joins.
Number of connections * work_mem is usually going to eat up the biggest chunk of your PostgreSQL's memory. The configuration is also blunter than it looks: you can give different values to different users (via `ALTER ROLE`) or to individual sessions, but `work_mem` is a per-sort/per-hash allowance rather than a cap, so a single query with several sorts can use several multiples of it, and there is no way to set an overall budget shared across connections.
Being able to set aside 200GB of memory to be used, as needed, by all connections (maybe with _some_ limits to prevent an accident), would solve a lot of problems (and introduce a bunch of them, I know).
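For reference, the knobs that do exist are per-role and per-session; a quick sketch using the usernames from above (standard Postgres commands, nothing cloud-specific):

```sql
-- Per-role defaults, picked up at session start:
ALTER ROLE myapp     SET work_mem = '10MB';
ALTER ROLE reporting SET work_mem = '100MB';

-- Per-session override before one expensive query:
SET work_mem = '512MB';
-- ... run the big sort/join ...
RESET work_mem;

-- Or scoped to a single transaction:
BEGIN;
SET LOCAL work_mem = '256MB';
-- ... expensive query ...
COMMIT;
```

What none of these give you is the shared 200GB pool described above: each setting is a per-operation allowance, and nothing stops every connection from claiming its full allowance at once.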
Since they're on RDS, I can't help but point out that I've seen DB queries on bare metal run orders of magnitude faster: a 10-minutes-to-1-second type of thing. I couldn't help but wonder if they'd even notice the disk-based sorting on a proper (yes, I said it) setup.
I worked on a MySQL-on-RDS setup once where the company was splurging (and being charged through the nose, ofc) on a backing EBS volume with 20k IOPS. Meanwhile nobody had bothered to update the default settings, which limited all background work to use no more than 100 IOPS. Needless to say, this DB had severe performance problems.
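The post doesn't name the setting, but this sounds like InnoDB's background I/O budget; presumably something along these lines (the variable names are real MySQL ones, the values purely illustrative):

```sql
-- See the current background I/O budget (old MySQL defaults were tiny):
SHOW VARIABLES LIKE 'innodb_io_capacity%';

-- Let background flushing actually use the provisioned IOPS
-- (on RDS you would set these via the parameter group instead):
SET GLOBAL innodb_io_capacity     = 10000;
SET GLOBAL innodb_io_capacity_max = 20000;
```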
Managed databases in the cloud are a bit like ORMs: they significantly reduce the amount of knowledge needed 99% of the time. Sadly, when the 1% hits, it also means that nobody on the team has built up the DBA skills required for the big hairy problem. I see this pattern repeated all the time when consulting for startups that have just had their first real traction, and now the magic database box no longer works as it did before.
DB tuning seems like a perfect business opportunity for specialist consulting. Doing a really good job requires deep expertise, but then it only needs to be repeated occasionally (if the operational profile or the data size changes significantly).
But I don't see this model much. I mostly hear of companies struggling to develop internal expertise or going without.
I did this for years. I was a server optimisation consultant. I'd always advise my clients that increasing the specs of the hardware would probably be cheaper in the short term than hiring me to go through the various configurations with a fine-toothed comb. I also promised a fixed-price report on what needed to be done, as well as a fixed-price quote for doing the actual work. I found that this level of transparency and honesty never failed, and the work was always approved. I'd also offer training for the internal team.
I've pretty much completely changed careers now, but I found that this model did work well. It involved a lot of constant research and occasionally a huge amount of stress, but I loved it!
It does exist (I'm one of them), and it's much more common on expensive databases like Microsoft SQL Server and Oracle. With free databases, it's relatively inexpensive to throw more hardware at the problem. When licensing costs $2,000 USD per CPU core or higher, there's much more incentive for companies to bring an expert in quickly to get hardware costs under control.
Quite so. Of course, in most countries all drivers will have had mandatory driving lessons before they got their license and the instructor will have told them about the handbrake. No such license requirement exists for databases and it shows.
> Sadly, when the 1% hits, it also means that nobody on the team has built up the DBA skills required for the big hairy problem. I see this pattern repeated all the time when consulting for startups that have just had their first real traction, and now the magic database box no longer works as it did before.
I call this the Kubernetes effect. 99% of the time, people think it's easy. Then the 1% hits and they have to completely rebuild it from the ground up. For us it was cluster certificate expiration and etcd corruption, and some other problems we couldn't ever pin down.
The other effect (which I'm sure has a name) is when they depend on it for a long time, but then a problem starts to occur, and because they can't figure out a way to solve it, they move to a completely different tech ("We had problems with updating our shards so we moved to a different database") and encounter a whole new problem and repeat the cycle.
An Oracle story: read queries can also write to disk, due to the way Oracle handles consistency. Information about which transactions last changed particular rows in a table is stored with the rows. A read query then looks up those transactions to see whether they were committed or not. If not, the read query skips the rows they touched (and looks up their previous versions). If they were committed, that transaction info gets cleared, so subsequent queries don't have to look it up again.
This is a simplified explanation, but the point is that the process executing a read query is often best positioned to do this maintenance work. The design trades a little latency for throughput.
This Oracle behavior is called "block cleanout", and it makes it unnecessary for a transaction to touch all affected blocks when it commits or rolls back - some future transaction will do it instead.
Oracle also has two init.ora parameters, SORT_AREA_SIZE and SORT_AREA_RETAINED_SIZE, that are similar to the WORK_MEM mentioned in the subject article.
As far as I know, SORT_AREA_SIZE is global, and differing values cannot be assigned to roles.
Postgres does something similar. Every time a query returns a row, the system must check whether that row version is visible to the current transaction. It can also detect that the row is completely dead. If an index is involved, then on the next row fetch Postgres will report to the index whether the last row was dead or not. The index can then mark that row reference as dead, which means a write during a SELECT statement.
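You can actually watch these reads-that-write with buffer accounting; a minimal sketch (the table is hypothetical, and the effect shows up because a freshly loaded table has no hint bits set yet):

```sql
-- Load a table; once the creating transaction commits, no page
-- yet records that those rows' inserter committed:
CREATE TABLE demo AS SELECT g AS id FROM generate_series(1, 1000000) g;

-- The first read-only scan sets hint bits as it checks visibility,
-- dirtying pages: look for the "dirtied" count in the BUFFERS line.
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM demo;
--   Buffers: shared hit=... read=... dirtied=...
```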
This is not a pg-specific thing; in SQL Server, for instance, this is referred to as "spilling to disk". Large sorts can spill to disk: the query planner may assume a sort will be done in RAM, based on whatever stats it has about the relevant indexes, only to have to spill later - so it picks a plan that turns out to be sub-optimal.
> Furthermore, the entire output is accumulated in temporary storage (which might be either in main memory or on disk, depending on various compile-time and run-time settings) which can mean that a lot of temporary storage is required to complete the query.
The database has analysed the query being run and decided that the quickest way to fulfil it is to take 27MB of data and sort it into a file on disc before reading it back in the correct order. This is caused by the work_mem setting in the database being set too low, preventing the database from even contemplating sorting the whole thing in memory.
The default work_mem setting for Postgres has historically been very low. It's fine for reading single rows from a single table using an index, but as soon as you put a few joins in there it's not adequate. Increasing this limit should be one of the first steps when setting up Postgres.
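A minimal sketch of diagnosing and fixing this (the table and figures are illustrative; the abridged comment lines show the shape of real EXPLAIN output):

```sql
-- With a low work_mem, a large sort spills to disk:
SET work_mem = '4MB';
EXPLAIN ANALYZE SELECT * FROM events ORDER BY created_at;
--   Sort Method: external merge  Disk: 27664kB

-- With enough memory, the same sort stays in RAM:
SET work_mem = '64MB';
EXPLAIN ANALYZE SELECT * FROM events ORDER BY created_at;
--   Sort Method: quicksort  Memory: 43517kB

-- To raise the server-wide default, then reload the config:
ALTER SYSTEM SET work_mem = '16MB';
SELECT pg_reload_conf();
```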