One time we observed a dramatic drop-off in performance from one of our services after a certain day that week. I looked at recent releases and saw it perfectly coincided with one.
I asked the engineer in question to investigate, but after looking he said, "It's nothing I could be doing."
So I sat with him and used git-bisect to prove to him it was his commit: he had added trace logging within a couple of tight loops in the hottest parts of the code base. I smiled.
"But it's trace. That's disabled in production. It can't be that," he said. But we had already proven it was that commit, and the only thing that changed was additional logging.
Long story short, the logging library was filtering calls by level just before actually writing, rather than as close as possible to the call site—a design bug, for sure.
I had him swap out the library everywhere it was being used.
Logging is never free. Either you add compile-time cost to remove it, or you pay at the very least a single if and/or method call at runtime.
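To make the call-site point concrete, here is a minimal Python sketch using the stdlib `logging` module (the counter is purely illustrative): even when a level is disabled, a naive call still evaluates its arguments, which is exactly the cost a call-site level check avoids.

```python
import logging

logger = logging.getLogger("svc")
logger.setLevel(logging.INFO)  # DEBUG/TRACE disabled, as in production

calls = {"n": 0}

def expensive_repr(items):
    # Stand-in for costly serialization work.
    calls["n"] += 1
    return ",".join(str(i) for i in items)

items = list(range(5))

# Naive call: expensive_repr() runs even though DEBUG is disabled,
# because arguments are evaluated before the library's level check.
logger.debug("items: %s", expensive_repr(items))

# Guarded call: the level check happens at the call site, so the
# expensive work is skipped entirely when DEBUG is off.
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("items: %s", expensive_repr(items))

print(calls["n"])  # only the naive call paid the cost
```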
It's not an easy problem to solve. The best I've come up with so far is using a decorator in conjunction with a DI framework that supports it. It's still only a wrapper around method calls, but it enables completely turning off logging in production, and turning it on when it is needed, without making a special build or wasting resources during normal execution.
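A rough Python sketch of that decorator idea (the names and the `build_service` factory are illustrative stand-ins for a DI container, not any particular framework):

```python
import functools

def traced(logger):
    """Wrap a method so each call and return value is logged."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            logger(f"call {fn.__name__} args={args} kwargs={kwargs}")
            result = fn(*args, **kwargs)
            logger(f"ret  {fn.__name__} -> {result!r}")
            return result
        return wrapper
    return decorate

class OrderService:
    def total(self, prices):
        return sum(prices)

def build_service(enable_trace, logger=print):
    # Stand-in for the DI container: apply the wrapper only when
    # tracing is on, so production pays no per-call logging cost.
    svc = OrderService()
    if enable_trace:
        svc.total = traced(logger)(svc.total)
    return svc

prod = build_service(enable_trace=False)   # no wrapper, no overhead
debug = build_service(enable_trace=True)   # every call is traced
debug.total([1, 2, 3])                     # prints call/return lines
print(prod.total([1, 2, 3]))
```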
Preprocessor-disabled logging (and generally compile-time disabled logging) is free at run-time.
The compile-time cost of stripping out preprocessor macros disabled via #ifdef is close enough to free that you'd have a really hard time measuring it: O(n) in the number of lines of logging code, with n smaller than the number of lines of actual code in any real scenario, and with a very small constant factor.
The "cost" there is in the visual appearance of the source code.
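The same stripping can be shown outside of C: CPython's compiler removes `if __debug__:` blocks entirely when run with `python -O`, which is the closest stdlib analogue to #ifdef-disabled logging (the `trace_lines` sink here is purely illustrative):

```python
trace_lines = []  # stand-in for a trace log sink

def hot_loop(data):
    total = 0
    for x in data:
        if __debug__:
            # CPython resolves __debug__ at compile time: under
            # `python -O` this whole block is stripped from the
            # bytecode, much like an #ifdef'd log line in C.
            trace_lines.append(f"x={x}")
        total += x
    return total

print(hot_loop(range(10)))
# In a normal run the trace fires; under `python -O` the block
# never exists, so len(trace_lines) stays 0.
print(len(trace_lines))
```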
I don't understand why more people don't use Solaris Zones, they seem to me to be the superior solution by far, and with work done by Joyent you now have modern LX-branded zones also. Is the lack of adoption mainly due to the fact that it's Solaris, and not Linux?
Solaris is not free software and the free software forks never gained much traction. Plus it is now also associated with Oracle which brings along a lot of extra distrust, particularly given how litigious they are.
Oracle used the excuse that its current OpenJDK license (GPLv2) is incompatible with the license used by Google's runtime (Apache 2). If Google had re-licensed its Java implementation under the GPL, some of the arguments used by Oracle's lawyers (code reuse and alleged patent violations) would have been void, and arguing about reuse of APIs would have been much harder.
Of course, this does not really matter, because the whole lawsuit is just an excuse for power games between corporations. Oracle's goal wasn't Java licensing; it was gaining some degree of control over the emerging Android ecosystem.
No, the copyright trial is about the "structure, sequence, and organization" of the APIs, not any literal copying anymore. Switching languages, but keeping the same class library still leaves them open.
I don't know anything about Solaris Zones, but I'm guessing that the FreeBSD jails wouldn't help with the problem in the article. It's still the same kernel running in the jail, so the fadvise calls are going to bottleneck in the same place.
Because the problem isn’t that there is one bug in a specific implementation, it’s that the whole model is fundamentally prone to this type of problem.
cgroups, Jails, and Zones all suffer from having to cover an immense surface area. Contrast with VMs, which only require managing a few, significantly simpler interfaces. There are definitely differences in quality, but they all use a similar approach.
OP says that fadvise is going to bottleneck in the same places on BSD that it does under Linux, with no supporting evidence. I agree that containers in general are a tough problem.
> Is the lack of adoption mainly due to the fact that it's Solaris, and not Linux?
I am guessing it is to avoid learning a whole new ecosystem: tools, environments, rules, package system, etc. It's just simpler to stick with Linux. But it all depends. Say my application shows a 60% performance improvement on Illumos; then I can see spending the time learning it and using it as a base. But it would really have to be a large benefit to justify switching OSes. Of course, how would I even bother benchmarking it to start with? I'd probably have to hear stories or mentions on HN and such first...
Last I used Illumos it had pretty poor hardware support (at least on entry- to mid-level hardware, which for obvious reasons must be a lower priority). I also ran into issues with their KVM port, mainly panics with various Linux kernels (kvm-clock, virt*, ...). A good experience in general though, especially the zones/ZFS/KVM combo.
At Joyent, we are working on bringing bhyve to SmartOS and illumos. Much attention is being paid to performance and stability. The plan of record is for this to eventually replace KVM.
And deploy it...where? AWS, Azure, GCP and virtually all other providers have first class Docker and Kubernetes support. You can barely even host a Solaris VM anywhere today.
Zones are great and they are often the right solution. There are limits. If you have the choice of putting 100 zones on one 4 socket system or 25 zones on each of four 1 socket systems, I'd probably recommend going with the four systems. Hot locks and inefficient algorithms happen.
For instance, the directory name lookup cache can greatly impact the performance of file operations. This is what makes it so that when you open /a/b/c/d/e/blah, the OS has a good chance of knowing how to open "blah" directly without first searching /a, /a/b, /a/b/c, /a/b/c/d, and /a/b/c/d/e. I don't have performance numbers handy (left at $job - 1 and $job - 2), but the default size is not great for a system that handles hundreds of thousands of files. If that sounds like a lot of files, count how many files are read as part of a reasonably large build. Then imagine there are dozens of them running concurrently. Or imagine that you are hosting git repos or a bunch of static web content.
The obvious answer is to just increase the size of the DNLC. The problem is that there are other parts of the system that behave poorly with a large DNLC. For instance, whenever anyone tries to unmount a file system, dnlc_purge_vfsp() is called. This walks the cache, looking for entries that are associated with the file system. Who cares, right? We hardly ever unmount file systems. Well, if you use the automounter (by default Solaris uses it for at least /home/*), every few minutes it is trying to unmount all of the automounted file systems. The purge happens before EBUSY can be detected so hot cached entries may be purged while adding contention on dnlc-related locks. What's worse, the automounter doesn't reset its inactivity timer when it hits EBUSY, causing more frequent attempts to unmount than you would otherwise expect. There are other DNLC bottlenecks on a Solaris NFS server when a client removes a file.
And then, as you get to larger systems with more NUMA effects, these types of operations become even more expensive.
Operating systems are hard. Making them scale to an infinite number of processes and processors is impossible. At a certain point, it becomes beneficial to use smaller hardware and/or add VMs into the mix.
It lives, but it's not exactly thriving. OpenIndiana struggled, OmniOS gave up (and got picked up again), and all that struggle makes me expect that Illumos will eventually fade out, at the latest when/if Joyent goes out of business. It's a shame, because I like SmartOS a lot, but it has nowhere near the momentum Linux has. I don't see who will write all the device drivers for the next generation of hardware coming out.
The point was not to complain about how bad Docker is, but rather to highlight that a lot of unexpected things may come from the fact that the kernel is shared.
"Logging too much" was just the reason posix_fadvise was being called so often, but this becomes a problem ONLY when the kernel is shared. With virtualization, everyone gets their own kernel (and thus their own fadvise), and the conflict doesn't happen.
A VM certainly does "call the underlying host kernel operation", it just does so indirectly: the guest userspace calls fadvise(), the kernel implementation of fadvise() asks the virtio disk driver to perform particular reads and writes, and the virtio driver asks the underlying host kernel disk driver to read/write individual disk sectors (without knowing that they are related to a specific file in the guest filesystem).
This specific bug was caused by putting high load on the kernel dentry cache, i.e. contention on a structure that lives in kernel memory. Guests normally don't share memory, so contending for it would be avoided.
Incidentally, there are situations when different guests can compete for the same memory: when the hypervisor uses so-called "memory deduplication" techniques. Which is why enabling that stuff on production systems may be a bad idea.
In the case under discussion, the DONTNEED fadvise would tell the guest kernel's page cache to discard the written data after the write. So it is useful even without the host kernel knowing about it.
Yeah. The whole article seems to come down to a bug in the logging library in use here, plus an area of contention in the kernel? (How many times is his application logging, to cause that much contention?) I'm not sure how Docker figures into it, aside from the fact that it easily lets one run multiple instances of an app on the same hardware; but stuff like supervisord will also do that?
I'm not entirely clear on why a logging library needs to call fadvise; a log file is, I presume, opened in append-only mode. Isn't "append" sufficient advice to the kernel? Also, fadvise needs byte ranges, and I have no idea what you'd pass for a log file…
Using O_APPEND does not imply that the kernel needs to purge the pages from the cache ASAP, does it? Removing pages from the cache may be an expensive operation in itself, so I presume it is avoided by default.
More importantly, if the disk cannot keep up, the log data is going to end up waiting in the page cache anyway (a typical case of bufferbloat). The Linux kernel does not have telepathic abilities to balance the needs of a crazy logger against the other applications in the system, so without resolving the underlying issue (bufferbloat), those writes would take up too much cache, potentially dragging down disk performance for other applications.
fadvise() may schedule quicker eviction, effectively acting as a syscall version of vm.dirty_ratio. Of course, that does not resolve the problem; it just moves it to a different layer. The real solution is either
1) blocking the apps until their logs are fully written (for example, by using O_DIRECT), or
2) showing those apps the middle finger and throwing away some of their logs (AFAIK, this is occasionally done by syslog).
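For reference, a sketch of what such a call looks like from Python (the temp file and chunk size are made up for illustration; `os.posix_fadvise` is POSIX/Linux, hence the `hasattr` guard):

```python
import os
import tempfile

# After flushing a chunk of log data, tell the kernel the written
# byte range won't be re-read, so its page-cache pages can be evicted.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    path = tmp.name

fd = os.open(path, os.O_WRONLY | os.O_APPEND)
chunk = b"log line\n" * 1024

offset = os.lseek(fd, 0, os.SEEK_END)   # start of this chunk
written = os.write(fd, chunk)
os.fsync(fd)                            # pages must be clean to be dropped

# The byte range is simply [offset, offset + written): one answer to
# the "what would you even pass for a log file?" question above.
if hasattr(os, "posix_fadvise"):        # absent on e.g. macOS
    os.posix_fadvise(fd, offset, written, os.POSIX_FADV_DONTNEED)

os.close(fd)
os.unlink(path)
print(written)
```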
A comment in the Medium article asked the same thing. The author's response:
> I agree that piping all logs to stdout would be the best solution in case of Dockerized microservices. It’s just that in our case we were porting an existing system, which a) already heavily relied on logging to files b) consisted of many microservices itself, which we couldn’t yet split into separate Docker containers but also couldn’t pipe all their logs to the same stdout.
Well, that's just an argument for using a pluggable logging system, as in Python/log4j, where you don't configure logging in your service/library and instead leave that up to the implementor. It also lets you log to a structured format like JSON and pipe that to a logging aggregator.
I never tried this, but I think it's similar to docker-sync, where you still have to modify your docker-compose files. It's not good for people who want a single consistent docker-compose file that works across varying distributions (i.e., an open-source project).
These options are part of the Docker API, and the underlying engine implements them as needed. On Linux hosts there is nothing to tweak, as bind mounts offer full consistency without compromising performance.
In Docker for Mac it tweaks the consistency settings for mounts that cross the VM barrier with osxfs.
So it is safe to define these in any compose file no matter where it runs; you just need to make sure that the app in the container can actually deal with the consistency level applied to the particular mount.
Newer versions of docker have been much better, but now it just eats up a ton of ram. The ram setting in prefs doesn’t seem to actually limit it at all.
Nice debugging story, but the conclusion was totally wrong! The author even knows this. Had they been logging at 3-4x the usual rate, they would have seen the same problem on bare metal too. Nothing to do with Docker or competing containers or whatever.
I think this is important because "your" implies that there is something someone can do to fix the problem. Omitting it makes the reader think that there's a general problem with the Docker ecosystem.