Negative dentries, 20 years later (lwn.net)
63 points by bitcharmer on April 11, 2022 | 37 comments


I became intimately familiar with negative dentries while debugging a slow service deploy a few years ago.

A deploy that was normally very fast would sometimes hang for a few minutes during a phase where all it had to do was delete the old application directory and move the new one into place.

Turned out that the application was writing a bunch of tempfiles into the cwd and then immediately deleting them. Nothing ever touched that directory while the negative dentries accumulated for weeks or months. When someone finally deployed, the first rmdir that came along bore the cost of deleting all those negative dentries. It hung for seconds or minutes while the kernel essentially cleared out the entire dcache, deleting linked list elements one by one. It showed up in perf as being stuck inside shrink_dcache_parent.

This is actually easy to reproduce:

  $ mkdir /tmp/foo
  $ touch /tmp/foo/nodelete
  # create and delete 100k files
  $ for i in $(seq 1 10); do bash -c 'for i in $(seq 1 10000); do rm $(mktemp /tmp/foo/XXXXXX); done' &; done; wait
  ...
  $ time rmdir /tmp/foo
  rmdir: failed to remove '/tmp/foo': Directory not empty
  rmdir /tmp/foo  0.00s user 0.02s system 91% cpu 0.024 total
  $ time rmdir /tmp/foo
  rmdir: failed to remove '/tmp/foo': Directory not empty
  rmdir /tmp/foo  0.00s user 0.00s system 81% cpu 0.003 total
Both rmdirs fail, but the first one takes 24ms. If you create and delete more files, it takes longer and longer.

At some point we probably would've noticed the memory leak as well (I found an 18 GB slab on one host while this was happening) but the machines in question have huge amounts of ram.
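For what it's worth, the accumulation is visible from userspace without perf. On kernels from 5.0 onward, the fifth field of dentry-state is the negative-dentry count (older kernels report 0 there):

```shell
# Dump the dcache counters: nr_dentry, nr_unused, age_limit,
# want_pages, nr_negative, dummy (nr_negative needs Linux >= 5.0).
cat /proc/sys/fs/dentry-state
```

Watching that number climb while the tempfile churn runs makes the leak obvious long before the slab shows up in `free` or `slabtop`.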

I worked around the issue by making the application reuse tempfile names.


> Turned out that the application was writing a bunch of tempfiles into the cwd and then immediately deleting them.

> I worked around the issue by making the application reuse tempfile names.

Knowing nothing about the issue beyond what you've written here...

why not make the application create a directory for its tempfiles, and then remove that directory along with the tempfiles?
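Something along these lines, say (a sketch; the names are made up):

```shell
# Give the app a private scratch directory; removing it at exit takes
# the accumulated negative dentries with it, instead of leaving them
# pinned under a long-lived working directory for weeks.
scratch=$(mktemp -d)
trap 'rm -rf "$scratch"' EXIT

tmp=$(mktemp "$scratch/work.XXXXXX")  # tempfiles live under $scratch
echo "some data" > "$tmp"
rm -f "$tmp"
```

The shrink cost still has to be paid when the scratch directory goes away, but it's paid in small, frequent increments rather than one multi-minute stall at deploy time.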


additionally, there may be a security issue in reusing temporary filenames.


Yeah, I also moved the tempfiles to a more appropriate location at the same time.


So you didn't report the bug? Kernel developers absolutely LOVE those kinds of bugs, especially with such a trivial reproducer!


My conclusion at the time was that it was not, strictly speaking, a bug. It seemed to be a sharp edge that was WAI.

Considering it again now, I do think it's essentially a bug, but it seems to be a known thing at this point. What I described is the same issue addressed by this unmerged patch: https://lkml.org/lkml/2017/9/18/739 (see discussion here: https://lwn.net/Articles/814535/). And it's mentioned in the article in this HN link:

> Those dentries still take up valuable memory, and they can create other problems (such as soft lockups) as well.


Even if it's just a performance anomaly, these are good reports for kernel developers to have. If nothing else, it helps expand developers' understanding of the sorts of workloads people have had problems with. In the case of really complex systems, it can take a number of reports to spot the pattern, or in the case of proposed fixes, enough pain points to justify the risk of making a change. A report like this takes 60 seconds to cut and paste into an email to a mailing list. Or use the kernel.org bugzilla that is triaged by helpful volunteers. Every voice counts.


NXDOMAIN caching has been the bane of my existence recently and I have been meaning to survey how other cache implementations deal with the problem. Looks like everyone suffers from it. I've heard edgy teenagers say that existence is suffering, but I suppose caches teach us that non-existence is suffering too.


NXDOMAIN at least has a TTL so it will not grow forever. What problem are you referring to, not being able to invalidate a cached NXDOMAIN? Or something else?


Yeah, well, if the TTL in the SOA were being respected it wouldn't be a problem. Tomorrow we're going to run some experiments to find out which godforsaken layer of "smart" infrastructure is responsible.


So much fun it is to have multiple servers with 1.5 TB of RAM slowly fill up with negative cache dentries to the point that the server kernel finally decides that memory pressure is a thing and purges the negative entries all at once, which results in the server being locked up and unresponsive for ~3 minutes. Oh yeah, and no tunables to control the negative-dentry-specific behavior.
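Strictly true that nothing negative-dentry-specific exists; the closest generic knob is vm.vfs_cache_pressure, which only biases how aggressively the dentry/inode caches are reclaimed relative to the page cache:

```shell
# Read the current reclaim bias (default 100; values > 100 reclaim
# the dentry/inode caches more aggressively). It is a blunt
# instrument: it cannot distinguish negative from positive dentries.
cat /proc/sys/vm/vfs_cache_pressure
```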


In general the Linux kernel seems extremely badly behaved under RAM pressure; see how it reacts to swapping.


It wasn't even actual RAM pressure - there was ~500 GB of unused RAM but something like 800 GB or so of negative cache dentries.

So, it was over some threshold that the kernel decided it was time to start reclaiming all those entries and locked up the entire server while it was doing so. Madness.


My favourite interaction with negative dentries was with PHP's curl library, built with NSS for encryption. NSS would do some kind of filesystem performance check, intentionally opening lots of files that didn't exist. At one point, every server in a large autoscaling group was spending 10GB EACH just on the dentry cache. :-)

https://twitter.com/jdub/status/875570760361811973


Aren't there already well-tested LRU cache algorithms out there? It seems like adapting one of those to kernel cache management would be a good way to address the general problem described in the article.


LRU works by using a specific cache size. How big should that be? Will that number work for small microcontrollers and giant servers? Is that per disk? Per directory? Per system? Per kernel?

Should that size expand and contract over time somehow?

The article gets into a little bit of that stuff, but they’re all real problems.


Sure, I get that there are still decisions to be made, but the discussion in the article makes it seem like nobody has even thought of using an LRU cache and then having a discussion about how to tune the cache size.


There are a million configurable variables already; this can be another.

Not me, I'm watching Netflix :)


tbh this was my thought too. No need for background processing either, there are quite a few options for amortized-constant-time old-value deletion at lookup time. Or do that cleanup while checking the disk on a cache miss - if you have no misses, the cache is working fantastically and probably shouldn't be pruned.

Or is there some reason those aren't acceptable in-kernel, while apparently a lack of cleanup is fine? Maybe CPU use is too high...?


And how many entries will you keep in the LRU?


... hmmm, makes me want to write a script that looks up 5 files that aren't there in every directory in the system.


something like?

    find / -type d -exec sh -c 'for i in $(seq 5); do stat "$1/$(head -c 15 /dev/urandom | base32)" 2> /dev/null; done' sh {} \;


"It's hard to ... design a heuristic for an acceptable amount of negative dentries: it won't scale from small to large systems well"

Has an AIMD algorithm[1] been considered to dynamically adjust the limit?

[1] https://en.wikipedia.org/wiki/Additive_increase/multiplicati...


> "Repeated file-name lookups are common — consider...~/.nethackrc"

Lol, <3 LWN


How could a process accumulate hundreds of negative dentries during normal operation? If that happens something fishy is going on.


The article gives some examples, but here's another one: consider what happens when you run "git status" on a large repository. In order to determine the status of every file, git needs to check each subdirectory for a ".gitignore" file, even though the vast majority of the time there's only one at the root. All of those nonexistent files can become negative dentries.
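To make that concrete, here's a hypothetical sketch of the access pattern (not git's actual code): one failed .gitignore lookup per directory visited.

```shell
# Mimic git's per-directory probe: stat a (mostly absent) .gitignore
# in every subdirectory of a tree; each failed lookup can seed a
# negative dentry. Paths here are made up for illustration.
repo=$(mktemp -d)
mkdir -p "$repo"/src/{a,b,c} "$repo"/docs
find "$repo" -type d | while read -r d; do
    stat "$d/.gitignore" >/dev/null 2>&1 || true  # expected to miss
done
rm -rf "$repo"
```

On a repository with tens of thousands of directories, that's tens of thousands of negative dentries per `git status`.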

In theory, git could do its own caching, or it could be refactored so as to move the .gitignore scanning inline with the main directory traversal. In practice, it doesn't do so (at least as of 2.30.2, which is the version I just tested). And I don't think it's reasonable to expect every program to contort itself to minimize the number of times it looks for nonexistent files, when that's something that can reasonably be delegated to the OS.


The only way git could cache this is with a file-system watcher.... and tbh my experience with git fswatchers has been utterly abysmal. They miss things / have phantom changes routinely, on the order of a couple times per week, so I just disable them outright now.

Without a truly reliable filesystem watching technique, git cannot cache this information. Any cached knowledge needs to be verified because it could have changed, which defeats the whole purpose of a cache here. It could change its strategy, to e.g. only check at repo root or require a list of "enabled" .gitignores, but it doesn't do that currently.


Oh, I just meant caching within the lifetime of a single git command's execution.

As it stands, when you run "git status", the git executable goes through each directory in your working copy, lists all of its directory entries, and then immediately afterward calls open() to see if a .gitignore file exists in that directory -- even if there wasn't one in the list it just read.

That's what I meant by saying that in principle, there's no reason git couldn't just cache the file's presence/absence on its own, just for those few microseconds (maybe "cache" was a poor choice of words). In practice, I can understand that the implementation complexity might not be worth it.


ah, lol. yeah, that's a good point, that scenario could definitely be smarter - I wasn't thinking about it listing the directory first. plus the time between list and "look again for .gitignore" is small enough that any inconsistencies could very reasonably be labeled an acceptable resolution to a race between git and other file system changes.


Does not caching this really help, though?

What happens when the .gitignore above you changes while you are in the midst of scanning a subdirectory? Is that not the same problem?

The problem is that git operations aren't transactional with respect to the filesystem, no? Sure, the window of uncertainty changes, but it's never zero.


But that's an inherent problem with a non-transactional file system.

If you remove .gitignore 1 microsecond before git open()s it, it's gone.

If you remove .gitignore 1 microsecond after git open()s it (even before it has a chance to read it), unix file system semantics mean you still get the contents.

There is always a race condition, you can play with the specifics, but cannot avoid it when multiple files referring to each other are edited independently.


Well, recursive configuration is a misfeature of git. No sane program should work that way. But regardless, if git scans every sub directory then the kernel should have already cached all "positive" dentries obviating the need for any negative ones. And that cache must be an order of magnitude larger than any negative cache.


The third paragraph in the article addresses this. Most of the negative dentries a process generates are going to be invisible to the user.


Yeah, my immediate thought was of stuff like "ask nginx to give me this misspelled webpage" -> look up that file -> bam, dentry.

Expose practically anything on the internet that leads to a piece of user input causing a filesystem operation (which is going to be extremely common and often unavoidable), and "hundreds" isn't the concern. Millions to billions is.


That's what I meant by "fishy". If nginx fills up your negative dentry cache with millions to billions of misspelled urls, you are being DoSed.


What if your C compiler fills up your negative dentry cache during, say, a kernel compile, because it's looking for a huge number of include files that don't exist, since the C preprocessor uses a very simplistic algorithm to decide where it needs to look for include files? (This is one of the examples in the paragraph in the article that I referred to.)


Even in the worst case, that shouldn't amount to more than a few thousand files. Millions of missing include files is unrealistic.



