I became intimately familiar with negative dentries while debugging a slow service deploy a few years ago.
A deploy that was normally very fast would sometimes hang for a few minutes during a phase where all it had to do was delete the old application directory and move the new one into place.
Turned out that the application was writing a bunch of tempfiles into the cwd and then immediately deleting them. Nothing ever touched that directory while the negative dentries accumulated for weeks or months. When someone finally deployed, the first rmdir that came along bore the cost of deleting all those negative dentries. It hung for seconds or minutes while the kernel essentially cleared out the entire dcache, deleting linked list elements one by one. It showed up in perf as being stuck inside shrink_dcache_parent.
This is actually easy to reproduce:
$ mkdir /tmp/foo
$ touch /tmp/nodelete
# create and delete 100k files
$ for i in $(seq 1 10); do bash -c 'for i in $(seq 1 10000); do rm $(mktemp /tmp/foo/XXXXXX); done' & done; wait
...
$ time rmdir /tmp/foo
rmdir: failed to remove '/tmp/foo': Directory not empty
rmdir /tmp/foo 0.00s user 0.02s system 91% cpu 0.024 total
$ time rmdir /tmp/foo
rmdir: failed to remove '/tmp/foo': Directory not empty
rmdir /tmp/foo 0.00s user 0.00s system 81% cpu 0.003 total
Both rmdirs fail, but the first one takes 24ms. If you create and delete more files, it takes longer and longer.
At some point we probably would've noticed the memory leak as well (I found an 18 GB slab on one host while this was happening) but the machines in question have huge amounts of ram.
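You can watch the accumulation as the repro runs: /proc/sys/fs/dentry-state exposes the dcache counters, and on newer kernels the fifth field reports negative dentries specifically (the exact field layout below is my reading of the docs; check Documentation/filesystems/ on your kernel before trusting it). A quick parser:

```python
def parse_dentry_state(text):
    """Parse /proc/sys/fs/dentry-state. Assumed field layout:
    nr_dentry, nr_unused, age_limit, want_pages, nr_negative, dummy
    (nr_negative only carries a real value on newer kernels)."""
    fields = [int(f) for f in text.split()]
    names = ["nr_dentry", "nr_unused", "age_limit",
             "want_pages", "nr_negative", "dummy"]
    return dict(zip(names, fields))

if __name__ == "__main__":
    try:
        with open("/proc/sys/fs/dentry-state") as f:
            print(parse_dentry_state(f.read()))
    except FileNotFoundError:
        print("no /proc/sys/fs/dentry-state here (not Linux?)")
```

Running it before and after the mktemp/rm loop makes the growth obvious without waiting for the slab to hit 18 GB.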
I worked around the issue by making the application reuse tempfile names.
My conclusion at the time was that it was not, strictly speaking, a bug. It seemed to be a sharp edge that was WAI.
Considering it again now, I do think it's essentially a bug, but it seems to be a known thing at this point. What I described is the same issue addressed by this unmerged patch: https://lkml.org/lkml/2017/9/18/739 (see discussion here: https://lwn.net/Articles/814535/). And it's mentioned in the article in this HN link:
> Those dentries still take up valuable memory, and they can create other problems (such as soft lockups) as well.
Even if it's just a performance anomaly, these are good reports for kernel developers to have. If nothing else, it helps expand developers' understanding of the sorts of workloads people have had problems with. For really complex systems, it can take a number of reports to spot the pattern, or, in the case of proposed fixes, enough pain points to justify the risk of making a change. A report like this takes 60 seconds to cut and paste into an email to a mailing list. Or use the kernel.org bugzilla, which is triaged by helpful volunteers. Every voice counts.
NXDOMAIN caching has been the bane of my existence recently and I have been meaning to survey how other cache implementations deal with the problem. Looks like everyone suffers from it. I've heard edgy teenagers say that existence is suffering, but I suppose caches teach us that non-existence is suffering too.
NXDOMAIN at least has a TTL so it will not grow forever. What problem are you referring to, not being able to invalidate a cached NXDOMAIN? Or something else?
Yeah, well, if the TTL in the SOA were being respected it wouldn't be a problem. Tomorrow we're going to run some experiments to find out which godforsaken layer of "smart" infrastructure is responsible.
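For reference, the rule those layers are supposed to follow is RFC 2308: a cached NXDOMAIN lives for the minimum of the SOA record's own TTL and its MINIMUM field. A trivial sketch of the computation, useful as a baseline when testing which layer is misbehaving (the function name is mine):

```python
def negative_ttl(soa_ttl, soa_minimum):
    """RFC 2308: negative answers are cached for
    min(TTL of the SOA record, SOA MINIMUM field)."""
    return min(soa_ttl, soa_minimum)

negative_ttl(86400, 300)  # -> 300
```

If a layer caches an NXDOMAIN longer than that, it's not respecting the standard.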
So much fun it is to have multiple servers with 1.5 TB of ram slowly fill up with negative cache dentries to the point that the server kernel finally decides that memory pressure is a thing and purges the negative entries all at once, which results in the server being locked up and unresponsive for about 3 minutes. Oh yeah, and no tunables to control the negative-dentry-specific behavior.
It wasn't even actual ram pressure - there was ~500 GB of unused ram but something like 800 GB or so of negative cache dentries.
So, it was over some threshold that the kernel decided it was time to start reclaiming all those entries and locked up the entire server while it was doing so. Madness.
My favourite interaction with negative dentries was with PHP's curl library, built with NSS for encryption. NSS would do some kind of filesystem performance check, intentionally opening lots of files that didn't exist. At one point, every server in a large autoscaling group was spending 10GB EACH just on the dentry cache. :-)
Aren't there already well-tested LRU cache algorithms out there? It seems like adapting one of those to the general problem of kernel cache management would be a good way to address the general problem described in the article.
LRU works by using a specific cache size. How big should that be? Will that number work for both small microcontrollers and giant servers? Is that per disk? Per directory? Per system? Per kernel?
Should that size expand and contract over time somehow?
The article gets into a little bit of that stuff, but they’re all real problems.
Sure, I get that there are still decisions to be made, but the discussion in the article makes it seem like nobody has even thought of using an LRU cache and then having a discussion about how to tune the cache size.
tbh this was my thought too. No need for background processing either, there are quite a few options for amortized-constant-time old-value deletion at lookup time. Or do that cleanup while checking the disk on a cache miss - if you have no misses, the cache is working fantastically and probably shouldn't be pruned.
Or is there some reason those aren't acceptable in-kernel, while apparently a lack of cleanup is fine? Maybe CPU use is too high...?
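For what it's worth, the mechanics are the easy part; the sizing questions above are the real fight. A toy bounded LRU (nothing like the kernel's actual dcache LRU lists, just the idea of amortizing cleanup into normal operations instead of one giant deferred purge):

```python
from collections import OrderedDict

class LRUCache:
    """Toy bounded LRU: each insert past capacity evicts the
    least-recently-used entry, so eviction cost is paid a little
    at a time rather than all at once under memory pressure."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as recently used
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the LRU entry
```

The three-minute lockup described upthread is exactly what this shape of eviction avoids, at the cost of having to answer "what's the capacity?"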
The article gives some examples, but here's another one: consider what happens when you run "git status" on a large repository. In order to determine the status of every file, git needs to check each subdirectory for a ".gitignore" file, even though the vast majority of the time there's only one at the root. All of those nonexistent files can become negative dentries.
In theory, git could do its own caching, or it could be refactored so as to move the .gitignore scanning inline with the main directory traversal. In practice, it doesn't do so (at least as of 2.30.2, which is the version I just tested). And I don't think it's reasonable to expect every program to contort itself to minimize the number of times it looks for nonexistent files, when that's something that can reasonably be delegated to the OS.
The only way git could cache this is with a file-system watcher.... and tbh my experience with git fswatchers has been utterly abysmal. They miss things / have phantom changes routinely, on the order of a couple times per week, so I just disable them outright now.
Without a truly reliable filesystem watching technique, git cannot cache this information. Any cached knowledge needs to be verified because it could have changed, which defeats the whole purpose of a cache here. It could change its strategy, to e.g. only check at repo root or require a list of "enabled" .gitignores, but it doesn't do that currently.
Oh, I just meant caching within the lifetime of a single git command's execution.
As it stands, when you run "git status", the git executable goes through each directory in your working copy, lists all of its directory entries, and then immediately afterward calls open() to see if a .gitignore file exists in that directory -- even if there wasn't one in the list it just read.
That's what I meant by saying that in principle, there's no reason git couldn't just cache the file's presence/absence on its own, just for those few microseconds (maybe "cache" was a poor choice of words). In practice, I can understand that the implementation complexity might not be worth it.
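Concretely, the idea would look something like this sketch (hypothetical code, not anything git actually does): since the traversal already lists each directory's entries, the presence of .gitignore can be read out of that listing instead of issuing a second open() that usually fails with ENOENT.

```python
import os

def scan(root):
    """Walk a tree, checking for .gitignore via the listing we
    already have rather than a separate open() per directory
    (each miss of which would mint a negative dentry)."""
    for dirpath, dirnames, filenames in os.walk(root):
        has_ignore = ".gitignore" in filenames  # no extra syscall
        yield dirpath, has_ignore
```

The membership test races against concurrent filesystem changes exactly as the thread below discusses, but the window is no worse than git's current list-then-open sequence.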
ah, lol. yeah, that's a good point, that scenario could definitely be smarter - I wasn't thinking about it listing the directory first. plus the time between list and "look again for .gitignore" is small enough that any inconsistencies could very reasonably be labeled an acceptable resolution to a race between git and other file system changes.
What happens when the .gitignore above you changes while you are in the midst of scanning a subdirectory? Is that not the same problem?
The problem is that git operations aren't transactional with respect to the filesystem, no? Sure, the window of uncertainty changes, but it's never zero.
But that's an inherent problem with a non-transactional file system.
If you remove .gitignore 1 microsecond before git open()s it, it's gone.
If you remove .gitignore 1 microsecond after git open()s it (even before it has a chance to read it), unix file system semantics mean you still get the contents.
There is always a race condition, you can play with the specifics, but cannot avoid it when multiple files referring to each other are edited independently.
Well, recursive configuration is a misfeature of git. No sane program should work that way. But regardless, if git scans every subdirectory, then the kernel should already have cached all the "positive" dentries, obviating the need for any negative ones. And that cache must be an order of magnitude larger than any negative cache.
Yeah, my immediate thought was of stuff like "ask nginx to give me this misspelled webpage" -> look up that file -> bam, dentry.
Expose practically anything on the internet that leads to a piece of user input causing a filesystem operation (which is going to be extremely common and often unavoidable), and "hundreds" isn't the concern. Millions to billions is.
What if your C compiler fills up your negative dentry cache during, say, a kernel compile, because it's looking for a huge number of include files that don't exist, since the C preprocessor uses a very simplistic algorithm to decide where it needs to look for include files? (This is one of the examples in the paragraph in the article that I referred to.)
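The arithmetic there is unforgiving: the preprocessor probes each include directory in order until the header is found, so every #include pays one failed lookup per directory searched before the hit. A rough model (the counting here is my simplification of the search, not real cpp code):

```python
def failed_probes(include_dirs, header_location):
    """Failed lookups for one #include: the preprocessor tries each
    include directory in order, missing in every directory before
    the one that actually holds the header."""
    return include_dirs.index(header_location)

# With, say, 10 include directories and headers mostly living in the
# last one, a translation unit with dozens of #includes generates
# hundreds of misses -- and each miss can leave a negative dentry.
```

Multiply that by every translation unit in a kernel build and the negative side of the cache fills up fast.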