I provide professional services optimizing on-prem and cloud Postgres databases. It always surprises me how readily companies treat cloud databases as fully automated and rely on them to auto-tune everything.
Well, they don't, and as soon as the product goes live and the amount of data grows even a little, performance degrades dramatically.
* Never count on cloud databases to replace a DBA's work.
* Don't take ORM optimizations for granted. They all suck at one point or another.
* If you can't afford an FTE, take short professional gigs from time to time to fix and boost performance.
I'd even go as far as to say that if you don't know how to write the SQL that does the same thing your ORM framework is doing, you're setting yourself up for a lot of pain later when you do something non-trivial.
Worse, ORMs tend to creep all through your code. Not only does that introduce all sorts of coupling and responsibility issues down the road; above all, it leads to unexpected issues.
Sure, it's nice being able to say `ThisPost.authors.sort(DESC, 'postCount').where('active = true').select('name', 'avatarUrl', 'postCount')` anywhere in your codebase, but you have now spread knowledge of how to sort and filter, what attributes there are, what their relations are, and so on, throughout your codebase.
Or, more on-topic: you have now spread knowledge about what can be fetched efficiently, and what will cause performance pain, throughout your code. Suddenly some view template, or a completely unrelated validation, can bring your database down.
Rails' ActiveRecord is especially cumbersome in this respect: it offers "sharp knives"[0] but hardly any tooling to protect yourself (or that new intern) from leaving all those sharp knives lying around. A `has_and_belongs_to_many :users` and a `.where(users: { status: 'active', comment_count: '>10' })` is just two lines. But they will haunt you forever and will probably bring your database down if you have any load.
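To make that concrete, here is roughly the SQL such a two-liner expands to (a sketch; the table and column names are hypothetical, inferred from the association above):

```sql
-- The innocuous ActiveRecord chain becomes a three-table join;
-- the join table keeps growing, and the comment_count filter is
-- typically unindexed, forcing a scan under load.
SELECT posts.*
FROM posts
JOIN posts_users ON posts_users.post_id = posts.id   -- HABTM join table
JOIN users       ON users.id = posts_users.user_id
WHERE users.status = 'active'
  AND users.comment_count > 10;
```

Anyone can write that chain from any view or callback, but only someone who can read this SQL sees why it hurts once there's real data behind it.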
This is why nowadays I prefer to use query builders (like knex - knexjs.org) instead of ORMs. You can optimise your queries without having huge blobs of raw SQL in the middle of your model code, and you can manually build models (or service objects, as I prefer) that are completely transparent.
Yeah, for me this is a classic "easy vs simple" example.
Cloud databases are a very easy option. But Postgres is 1.3 million lines of code, and on top of that you add the complexity of the cloud vendor's environment, choices, and custom code. It may be easy, but it's definitely not simple.
This is pretty basic PostgreSQL tuning (1). An oft-cited problem with PostgreSQL is that it uses a process per connection (session in PG lingo, I think). `work_mem` is the amount of memory a connection has to, you know, do work: sorting, DISTINCT, some joins.
Number of connections * work_mem is usually going to eat up the biggest chunk of your PostgreSQL's memory. The configuration is also blunter than it looks: you can give different values to different users (via `ALTER ROLE`) or to individual sessions, but `work_mem` is a per-sort/per-hash allowance rather than a cap, so a single query with several sorts can use several multiples of it, and there is no way to set an overall budget shared across connections.
Being able to set aside 200GB of memory to be used, as needed, by all connections (maybe with _some_ limits to prevent an accident), would solve a lot of problems (and introduce a bunch of them, I know).
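For reference, the knobs that do exist are per-role and per-session; a quick sketch using the usernames from above (standard Postgres commands, nothing cloud-specific):

```sql
-- Per-role defaults, picked up at session start:
ALTER ROLE myapp     SET work_mem = '10MB';
ALTER ROLE reporting SET work_mem = '100MB';

-- Per-session override before one expensive query:
SET work_mem = '512MB';
-- ... run the big sort/join ...
RESET work_mem;

-- Or scoped to a single transaction:
BEGIN;
SET LOCAL work_mem = '256MB';
-- ... expensive query ...
COMMIT;
```

What none of these give you is the shared 200GB pool described above: each setting is a per-operation allowance, and nothing stops every connection from claiming its full allowance at once.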
Since they're on RDS, I can't help but point out that I've seen DB queries on bare metal run orders of magnitude faster: a 10-minutes-to-1-second type of thing. I couldn't help but wonder if they'd even notice the disk-based sorting on a proper (yes, I said it) setup.
I worked on a MySQL-on-RDS setup once where the company was splurging (and being charged through the nose, ofc) on a backing EBS volume with 20k IOPS. Meanwhile nobody had bothered to update the default settings, which limited all background work to use no more than 100 IOPS. Needless to say, this DB had severe performance problems.
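The post doesn't name the setting, but this sounds like InnoDB's background I/O budget; presumably something along these lines (the variable names are real MySQL ones, the values purely illustrative):

```sql
-- See the current background I/O budget (old MySQL defaults were tiny):
SHOW VARIABLES LIKE 'innodb_io_capacity%';

-- Let background flushing actually use the provisioned IOPS
-- (on RDS you would set these via the parameter group instead):
SET GLOBAL innodb_io_capacity     = 10000;
SET GLOBAL innodb_io_capacity_max = 20000;
```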
Managed databases in the cloud are a bit like ORMs: they significantly reduce the amount of knowledge needed 99% of the time. Sadly, when the 1% hits, it also means that nobody on the team has built up the DBA skills required for the big hairy problem. I see this pattern repeated all the time when consulting for startups that have just had their first real traction, and now the magic database box no longer works as it did before.
DB tuning seems like a perfect business opportunity for specialist consulting. Doing a really good job requires deep expertise, but then it only needs to be repeated occasionally (if the operational profile or the data size changes significantly).
But I don't see this model much. I mostly hear of companies struggling to develop internal expertise or going without.
I did this for years. I was a server optimisation consultant. I'd always advise my clients that increasing the specs of the hardware would probably be cheaper in the short term than hiring me to go through the various configurations with a fine-toothed comb. I also promised a fixed-price report on what needed to be done, as well as a fixed-price quote for doing the actual work. I found that this level of transparency and honesty never failed, and the work was always approved. I'd also offer training for the internal team.
I've pretty much completely changed careers now, but I found that this model did work well. It involved a lot of constant research and occasionally a huge amount of stress, but I loved it!
It does exist (I'm one of them), and it's much more common on expensive databases like Microsoft SQL Server and Oracle. With free databases, it's relatively inexpensive to throw more hardware at the problem. When licensing costs $2,000 USD per CPU core or higher, there's much more incentive for companies to bring an expert in quickly to get hardware costs under control.
Quite so. Of course, in most countries all drivers will have had mandatory driving lessons before they got their license and the instructor will have told them about the handbrake. No such license requirement exists for databases and it shows.
> Sadly, when the 1% hits, it also means that nobody on the team has built up the DBA skills required for the big hairy problem. I see this pattern repeated all the time when consulting for startups that have just had their first real traction, and now the magic database box no longer works as it did before.
I call this the Kubernetes effect. 99% of the time, people think it's easy. Then the 1% hits and they have to completely rebuild it from the ground up. For us it was cluster certificate expiration and etcd corruption, and some other problems we couldn't ever pin down.
The other effect (which I'm sure has a name) is when they depend on it for a long time, but then a problem starts to occur, and because they can't figure out a way to solve it, they move to a completely different tech ("We had problems with updating our shards so we moved to a different database") and encounter a whole new problem and repeat the cycle.
An Oracle story: read queries can also write to disk, due to the way Oracle handles consistency. Information about which transactions last changed particular rows in a table is stored with the rows. A read query then looks up those transactions to see whether they were committed or not. If not, the read query skips the rows they touched (and looks up their previous versions). If they were committed, that transaction info gets cleared, so subsequent queries don't have to look it up again.
This is a simplified explanation, but the point is that the process executing a read query is often best positioned to do this maintenance work. The design trades a little latency for throughput.
This Oracle behavior is called "block cleanout", and it makes it unnecessary for a transaction to touch all affected blocks when it commits or rolls back - some future transaction will do it instead.
Oracle also has two init.ora parameters, SORT_AREA_SIZE and SORT_AREA_RETAINED_SIZE, that are similar to the WORK_MEM mentioned in the subject article.
As far as I know, SORT_AREA_SIZE is global, and differing values cannot be assigned to roles.
Postgres does something similar. Every time a query returns a row, the system must check whether that row version is visible to the current transaction. It can also detect that the row is completely dead. If an index is involved, then on the next row fetch Postgres will report to the index whether the last row was dead or not. The index can then mark that row reference as dead, which means a write during a SELECT statement.
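You can actually watch these reads-that-write with buffer accounting; a minimal sketch (the table is hypothetical, and the effect shows up because a freshly loaded table has no hint bits set yet):

```sql
-- Load a table; once the creating transaction commits, no page
-- yet records that those rows' inserter committed:
CREATE TABLE demo AS SELECT g AS id FROM generate_series(1, 1000000) g;

-- The first read-only scan sets hint bits as it checks visibility,
-- dirtying pages: look for the "dirtied" count in the BUFFERS line.
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM demo;
--   Buffers: shared hit=... read=... dirtied=...
```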
This is not a pg-specific thing; in SQL Server, for instance, this is referred to as "spilling to disk". Large sorts can spill to disk: the query planner may assume a sort will be done in RAM, based on whatever stats it has about the relevant indexes, only to have to spill later - so it picks a plan that turns out to be sub-optimal.
> Furthermore, the entire output is accumulated in temporary storage (which might be either in main memory or on disk, depending on various compile-time and run-time settings) which can mean that a lot of temporary storage is required to complete the query.
The database has analysed the query being run and decided that the quickest way to fulfil it is to take 27MB of data and sort it into a file on disc before reading it back in the correct order. This is caused by the work_mem setting in the database being set too low, preventing the database from even contemplating sorting the whole thing in memory.
The default work_mem setting for Postgres has historically been very low. It's fine for reading single rows from a single table using an index, but as soon as you put a few joins in there it's not adequate. Increasing this limit should be one of the first steps when setting up Postgres.
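A minimal sketch of diagnosing and fixing this (the table and figures are illustrative; the abridged comment lines show the shape of real EXPLAIN output):

```sql
-- With a low work_mem, a large sort spills to disk:
SET work_mem = '4MB';
EXPLAIN ANALYZE SELECT * FROM events ORDER BY created_at;
--   Sort Method: external merge  Disk: 27664kB

-- With enough memory, the same sort stays in RAM:
SET work_mem = '64MB';
EXPLAIN ANALYZE SELECT * FROM events ORDER BY created_at;
--   Sort Method: quicksort  Memory: 43517kB

-- To raise the server-wide default, then reload the config:
ALTER SYSTEM SET work_mem = '16MB';
SELECT pg_reload_conf();
```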