One time we observed a dramatic drop-off in performance from one of our services after a certain day that week. I looked at recent releases and saw it perfectly coincided with one.
I asked the engineer in question to investigate, but after looking he said, "It's nothing I could be doing."
So I sat with him and used git-bisect to prove to him it was his commit: he had added trace logging within a couple of tight loops in the hottest parts of the code base. I smiled.
"But it's trace. That's disabled in production. It can't be that," he said. But we had already proven it was that commit, and the only thing that changed was additional logging.
Long story short, the logging library was filtering calls by level just before actually writing, rather than as close as possible to the call site—a design bug, for sure.
I had him swap out the library everywhere it was being used.
Logging is never free. Either you add compile-time cost to remove it, or you pay at the very least a single if and/or method call at runtime.
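To make the call-site point concrete, here is a minimal Python sketch using the stdlib `logging` module (the counter is purely illustrative): even when a level is disabled, a naive call still evaluates its arguments, which is exactly the cost a call-site level check avoids.

```python
import logging

logger = logging.getLogger("svc")
logger.setLevel(logging.INFO)  # DEBUG/TRACE disabled, as in production

calls = {"n": 0}

def expensive_repr(items):
    # Stand-in for costly serialization work.
    calls["n"] += 1
    return ",".join(str(i) for i in items)

items = list(range(5))

# Naive call: expensive_repr() runs even though DEBUG is disabled,
# because arguments are evaluated before the library's level check.
logger.debug("items: %s", expensive_repr(items))

# Guarded call: the level check happens at the call site, so the
# expensive work is skipped entirely when DEBUG is off.
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("items: %s", expensive_repr(items))

print(calls["n"])  # only the naive call paid the cost
```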
It's not an easy problem to solve. The best I've come up with so far is using a decorator in conjunction with a DI framework that supports it. It's still only a wrapper around method calls, but it enables completely turning off logging in production, and turning it on when it is needed, without making a special build or wasting resources during normal execution.
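A rough Python sketch of that decorator idea (the names and the `build_service` factory are illustrative stand-ins for a DI container, not any particular framework):

```python
import functools

def traced(logger):
    """Wrap a method so each call and return value is logged."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            logger(f"call {fn.__name__} args={args} kwargs={kwargs}")
            result = fn(*args, **kwargs)
            logger(f"ret  {fn.__name__} -> {result!r}")
            return result
        return wrapper
    return decorate

class OrderService:
    def total(self, prices):
        return sum(prices)

def build_service(enable_trace, logger=print):
    # Stand-in for the DI container: apply the wrapper only when
    # tracing is on, so production pays no per-call logging cost.
    svc = OrderService()
    if enable_trace:
        svc.total = traced(logger)(svc.total)
    return svc

prod = build_service(enable_trace=False)   # no wrapper, no overhead
debug = build_service(enable_trace=True)   # every call is traced
debug.total([1, 2, 3])                     # prints call/return lines
print(prod.total([1, 2, 3]))
```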
Preprocessor-disabled logging (and generally compile-time disabled logging) is free at run-time.
The compile-time cost of stripping out preprocessor macros disabled via #ifdef is close enough to free that you'd have a really hard time measuring it: O(n) in the number of lines of logging code, with n smaller than the number of lines of actual code in any real scenario, and with a very small constant factor.
The "cost" there is in the visual appearance of the source code.
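The same stripping can be shown outside of C: CPython's compiler removes `if __debug__:` blocks entirely when run with `python -O`, which is the closest stdlib analogue to #ifdef-disabled logging (the `trace_lines` sink here is purely illustrative):

```python
trace_lines = []  # stand-in for a trace log sink

def hot_loop(data):
    total = 0
    for x in data:
        if __debug__:
            # CPython resolves __debug__ at compile time: under
            # `python -O` this whole block is stripped from the
            # bytecode, much like an #ifdef'd log line in C.
            trace_lines.append(f"x={x}")
        total += x
    return total

print(hot_loop(range(10)))
# In a normal run the trace fires; under `python -O` the block
# never exists, so len(trace_lines) stays 0.
print(len(trace_lines))
```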
I don't understand why more people don't use Solaris Zones, they seem to me to be the superior solution by far, and with work done by Joyent you now have modern LX-branded zones also. Is the lack of adoption mainly due to the fact that it's Solaris, and not Linux?
Solaris is not free software and the free software forks never gained much traction. Plus it is now also associated with Oracle which brings along a lot of extra distrust, particularly given how litigious they are.
Oracle used the excuse that its current OpenJDK license (GPLv2) is incompatible with the license used by Google's runtime (Apache 2). If Google had re-licensed its Java implementation under the GPL, some of the arguments used by Oracle's lawyers (code reuse and alleged patent violations) would have been void, and arguing about reuse of APIs would have been much harder.
Of course, this does not really matter, because the whole lawsuit is just an excuse for power games between corporations. Oracle's goal wasn't Java licensing; it was gaining some degree of control over the emerging Android ecosystem.
No, the copyright trial is about the "structure, sequence, and organization" of the APIs, not any literal copying anymore. Switching languages, but keeping the same class library still leaves them open.
I don't know anything about Solaris Zones, but I'm guessing that the FreeBSD jails wouldn't help with the problem in the article. It's still the same kernel running in the jail, so the fadvise calls are going to bottleneck in the same place.
Because the problem isn’t that there is one bug in a specific implementation, it’s that the whole model is fundamentally prone to this type of problem.
cgroups, Jails, and Zones all suffer from having to cover an immense surface area. Contrast with VMs, which only require managing a few, significantly simpler interfaces. There are definitely differences in quality, but they all use a similar approach.
OP says that fadvise is going to bottleneck in the same places on BSD that it does under Linux, with no supporting evidence. I agree that containers in general are a tough problem.
> Is the lack of adoption mainly due to the fact that it's Solaris, and not Linux?
I am guessing it is to avoid learning a whole new ecosystem: tools, environments, rules, package system, etc. It's just simpler to stick with Linux. But it all depends. Say my application shows a 60% performance improvement on Illumos; then I can see spending the time learning it and using it as a base. But it would really have to be a large benefit to justify switching OSes. Of course, how would I even bother benchmarking it to start with? I'd probably have to hear stories or mentions on HN and such first...
Last I used Illumos it had pretty poor hardware support (at least on entry- to mid-level hardware, which for obvious reasons must be a lower priority). I also ran into issues with their KVM port, mainly panics with various Linux kernels (kvm-clock, virt*, ...). A good experience in general though, especially the zones/ZFS/KVM combo.
At Joyent, we are working on bringing bhyve to SmartOS and illumos. Much attention is being paid to performance and stability. The plan of record is for this to eventually replace KVM.
And deploy it...where? AWS, Azure, GCP and virtually all other providers have first class Docker and Kubernetes support. You can barely even host a Solaris VM anywhere today.
Zones are great and they are often the right solution. There are limits. If you have the choice of putting 100 zones on one 4 socket system or 25 zones on each of four 1 socket systems, I'd probably recommend going with the four systems. Hot locks and inefficient algorithms happen.
For instance, the directory name lookup cache can greatly impact the performance of file operations. This is what makes it so that when you open /a/b/c/d/e/blah, the OS has a good chance of knowing how to open "blah" directly without first searching /a, /a/b, /a/b/c, /a/b/c/d, and /a/b/c/d/e. I don't have performance numbers handy (left at $job - 1 and $job - 2), but the default size is not great for a system that handles hundreds of thousands of files. If that sounds like a lot of files, count how many files are read as part of a reasonably large build. Then imagine there are dozens of them running concurrently. Or imagine that you are hosting git repos or a bunch of static web content.
The obvious answer is to just increase the size of the DNLC. The problem is that there are other parts of the system that behave poorly with a large DNLC. For instance, whenever anyone tries to unmount a file system, dnlc_purge_vfsp() is called. This walks the cache, looking for entries that are associated with the file system. Who cares, right? We hardly ever unmount file systems. Well, if you use the automounter (by default Solaris uses it for at least /home/*), every few minutes it is trying to unmount all of the automounted file systems. The purge happens before EBUSY can be detected so hot cached entries may be purged while adding contention on dnlc-related locks. What's worse, the automounter doesn't reset its inactivity timer when it hits EBUSY, causing more frequent attempts to unmount than you would otherwise expect. There are other DNLC bottlenecks on a Solaris NFS server when a client removes a file.
And then, as you get to larger systems with more NUMA effects, these types of operations become even more expensive.
Operating systems are hard. Making them scale to an infinite number of processes and processors is impossible. At a certain point, it becomes beneficial to use smaller hardware and/or add VMs into the mix.
It lives, but it's not exactly thriving. OpenIndiana struggled, OmniOS gave up (and got picked up again), and all that struggle makes me expect that Illumos will eventually fade out, at the latest when/if Joyent goes out of business. It's a shame, because I like SmartOS a lot, but it has nowhere near the momentum Linux has. I don't see who will write all the device drivers for the next generation of hardware coming out.
The point was not to complain about how bad Docker is, but rather to highlight that a lot of unexpected things may come from the fact that the kernel is shared.
"Logging too much" was just the reason posix_fadvise was being called so often, but this becomes a problem ONLY when the kernel is shared. With virtualization, everyone gets their own kernel (and thus their own fadvise), and the conflict doesn't happen.
A VM certainly does "call the underlying host kernel operation", it just does so indirectly: the guest userspace calls fadvise(), the kernel implementation of fadvise() asks the virtio disk driver to perform particular reads and writes, and the virtio driver asks the underlying host kernel disk driver to read/write individual disk sectors (without knowing that they are related to a specific file in the guest filesystem).
This specific bug was caused by putting high load on the kernel dentry cache, i.e. contention on a structure that lives in kernel memory. Guests normally don't share memory, so contending for it would be avoided.
Incidentally, there are situations when different guests can compete for the same memory: when the hypervisor uses so-called "memory deduplication" techniques. Which is why enabling that stuff on production systems may be a bad idea.
In the case under discussion, the DONTNEED fadvise would tell the guest kernel's page cache to discard the written data after the write. So it is useful even without the host kernel knowing about it.
Yeah. The whole article seems to come down to a bug in the logging library in use here, plus an area of contention in the kernel? (How many times is his application logging, to cause that much contention?) I'm not sure how Docker figures into it, aside from the fact that it easily lets one run multiple instances of an app on the same hardware; but stuff like supervisord will also do that?
I'm not entirely clear on why a logging library needs to call fadvise; a log file is, I presume, opened in append-only mode. Isn't "append" sufficient advice to the kernel? Also, fadvise needs byte ranges, and I have no idea what you'd pass for a log file…
Using O_APPEND does not imply that the kernel needs to purge the pages from the cache ASAP, does it? Removing pages from the cache may be an expensive operation in itself, so I presume it is avoided by default.
More importantly, if the disk cannot keep up, the log data is going to end up waiting in the page cache anyway (a typical case of bufferbloat). The Linux kernel does not have telepathic abilities to balance the needs of a crazy logger against the other applications in the system, so without resolving the underlying issue (bufferbloat), those writes would take up too much cache, potentially dragging down disk performance for other applications.
fadvise() may schedule quicker eviction, effectively acting as a syscall version of vm.dirty_ratio. Of course, that does not resolve the problem; it just moves it to a different layer. The real solution is either
1) blocking the apps until their logs are fully written (for example, by using O_DIRECT), or
2) showing those apps the middle finger and throwing away some of their logs (AFAIK, this is occasionally done by syslog).
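For reference, a sketch of what such a call looks like from Python (the temp file and chunk size are made up for illustration; `os.posix_fadvise` is POSIX/Linux, hence the `hasattr` guard):

```python
import os
import tempfile

# After flushing a chunk of log data, tell the kernel the written
# byte range won't be re-read, so its page-cache pages can be evicted.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    path = tmp.name

fd = os.open(path, os.O_WRONLY | os.O_APPEND)
chunk = b"log line\n" * 1024

offset = os.lseek(fd, 0, os.SEEK_END)   # start of this chunk
written = os.write(fd, chunk)
os.fsync(fd)                            # pages must be clean to be dropped

# The byte range is simply [offset, offset + written): one answer to
# the "what would you even pass for a log file?" question above.
if hasattr(os, "posix_fadvise"):        # absent on e.g. macOS
    os.posix_fadvise(fd, offset, written, os.POSIX_FADV_DONTNEED)

os.close(fd)
os.unlink(path)
print(written)
```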
A comment in the Medium article asked the same thing. The author's response:
> I agree that piping all logs to stdout would be the best solution in case of Dockerized microservices. It’s just that in our case we were porting an existing system, which a) already heavily relied on logging to files b) consisted of many microservices itself, which we couldn’t yet split into separate Docker containers but also couldn’t pipe all their logs to the same stdout.
Well, that's just an argument for using a pluggable logging system, as in Python/log4j, where you don't configure logging in your service/library and instead leave that up to the implementor. It also lets you log to a structured format like JSON and pipe that to a logging aggregator.
I never tried this, but I think it's similar to docker-sync, where you still have to modify your docker-compose files. It's not good for people who want a single consistent docker-compose file that works across varying distributions (i.e., an open-source project).
These options are part of the Docker API, and the underlying engine implements them as needed. On Linux hosts there is nothing to tweak, as bind mounts offer full consistency without compromising performance.
In Docker for Mac it tweaks the consistency settings for mounts that cross the VM barrier with osxfs.
So it is safe to define these in any compose file no matter where it runs; you just need to make sure that the app in the container can actually deal with the consistency level applied to the particular mount.
Newer versions of docker have been much better, but now it just eats up a ton of ram. The ram setting in prefs doesn’t seem to actually limit it at all.
Nice debugging story, but the conclusion was totally wrong! The author even knows this. Had they been logging at 3-4x the usual rate, they would have seen the same problem on bare metal too. Nothing to do with Docker or competing containers or whatever.
I think this is important because "your" implies that there is something someone can do to fix the problem. Omitting it makes the reader think that there's a general problem with the Docker ecosystem.