That's a good question, and the answer has to do with the cost of spawning OS threads/processes, the cost of context-switching among them, and how all that affects caches (cache hit/miss ratios dominate most workloads nowadays, more so than actual instruction execution).
A context switch costs on average around 30 microseconds of CPU overhead, but can reach as high as 1500 microseconds. (You can reduce that by pinning threads to cores, but the savings will likely be at most around 20% - maybe.)
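You can get a rough feel for that number yourself. A minimal sketch (Linux-only, assuming `os.fork` is available): two processes ping-pong a byte over a pair of pipes, which forces the scheduler to switch between them on every round trip. The result includes pipe read/write syscall cost, so treat it as an upper bound, not a precise measurement.

```python
import os
import time

ROUNDS = 10_000

r1, w1 = os.pipe()  # parent -> child
r2, w2 = os.pipe()  # child -> parent

pid = os.fork()
if pid == 0:
    # Child: echo every byte straight back to the parent.
    for _ in range(ROUNDS):
        os.read(r1, 1)
        os.write(w2, b"x")
    os._exit(0)

start = time.perf_counter()
for _ in range(ROUNDS):
    os.write(w1, b"x")
    os.read(r2, 1)
elapsed = time.perf_counter() - start
os.waitpid(pid, 0)

# Each round trip forces at least two context switches
# (parent -> child, then child -> parent).
per_switch_us = elapsed / (ROUNDS * 2) * 1e6
print(f"~{per_switch_us:.1f} us per switch (incl. syscall overhead)")
```

On a loaded machine, or with the two processes landing on different cores, the number will vary quite a bit between runs.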
Thankfully now, at least on Linux, AFAIK, a context switch doesn't result in a TLB flush - back in the day that was quite expensive.
Whenever the OS scheduler needs to context switch, it needs to save the old state (including registers), restore another state, and perform some other kinds of book-keeping. This is more expensive and convoluted than doing it yourself -- you also usually don't need to care about all possible registers, so you can reduce switching costs by e.g. not saving SSE registers.
Apps that create too many threads are constantly fighting for CPU time - you end up wasting a lot of CPU cycles just cycling between threads.
Another hidden cost that can severely impact server workloads is that, after your thread has been switched out, even once it becomes runnable again, it will need to wait in the kernel's run queue until a CPU core is available for it. Linux kernels are often compiled with HZ=100, which means that processes are given time slices of 10ms. If your thread has been switched out but becomes runnable almost immediately, and there are 2 threads before it in the run queue waiting for CPU time, your thread may have to wait up to 20ms in a worst-case scenario to get CPU time. Depending on the average length of the run queue (reflected in the system load average), and on how long threads typically run before getting switched out again, this can considerably affect performance.
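The arithmetic behind that worst case is simple - a back-of-the-envelope sketch using the numbers from the comment above (HZ and the queue depth are the assumptions here):

```python
# HZ=100 means the scheduler tick fires 100 times per second,
# so a full time slice is 1000/100 = 10 ms.
HZ = 100
slice_ms = 1000 / HZ

# Two runnable threads ahead of yours in the run queue can each
# burn a full slice before you get the CPU.
threads_ahead = 2
worst_case_wait_ms = threads_ahead * slice_ms
print(worst_case_wait_ms)  # 20.0
```

With a deeper run queue (higher load average) the same arithmetic scales linearly, which is why this bites hardest on oversubscribed machines.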
Also, kernels don't generally do a good job at respecting CPU affinity -- even on an idle system. You can control that yourself.
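Controlling it yourself is straightforward on Linux. A minimal sketch using `os.sched_setaffinity`, which is a thin wrapper over `sched_setaffinity(2)` (Linux-only; core numbers are illustrative):

```python
import os

# Pin the calling process (pid 0 means "self") to core 0, so the
# kernel cannot migrate it between cores and cool its caches.
os.sched_setaffinity(0, {0})

# Verify: the allowed-CPU set should now contain only core 0.
print(os.sched_getaffinity(0))
```

The same call works per-thread from C via `pthread_setaffinity_np`; pinning each worker thread to its own core is a common pattern in latency-sensitive servers.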
If you are going to represent ‘jobs’ or tasks as threads, and you need potentially 100s or 1000s of those, it just won’t scale - the overhead is too high.
If you control context-switching in your application (in practice that’s about saving and restoring a few registers) -- cooperative multitasking -- you get a lot more control over how that works, when it’s appropriate to switch, etc., and the overhead is far lower.
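A toy illustration of the idea (not real register switching - in C you'd use something like `ucontext` or hand-written assembly): in Python, generators already capture their own execution state, so `yield` acts as the explicit switch point and the "scheduler" is just a queue of resumable tasks. All names here are made up for the example.

```python
from collections import deque

def task(name, steps, log):
    """A cooperative task: it runs until it decides to yield."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # explicit, voluntary switch point

def run(tasks):
    """Round-robin scheduler: resume each runnable task in turn."""
    ready = deque(tasks)
    while ready:
        t = ready.popleft()
        try:
            next(t)          # resume until the task's next yield
            ready.append(t)  # still runnable: back of the queue
        except StopIteration:
            pass             # task finished, drop it

log = []
run([task("a", 2, log), task("b", 2, log)])
print(log)  # ['a:0', 'b:0', 'a:1', 'b:1']
```

The key property is that switches happen only where the task says so, which is exactly the control (and the danger, if a task never yields) that cooperative multitasking trades for its low overhead.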
Oh, and on average it’s 2.5-3x more expensive to context switch when using virtualisation.
>> If you are going to represent ‘jobs’ or tasks as threads, and you need potentially 100s or 1000s of those, it just won’t scale - the overhead is too high.
This seems to be an indictment against containerization (ie docker, kubernetes, etc). How can a single machine, even with containers, scale out if this is true? Do people advocating docker and similar technologies not understand or recognize that their containers should only utilize y threads (where y = (total threads per machine)/(number of containers hosted on that machine))? I personally have never heard this mentioned as a consideration when deploying docker, for example.
The advantages of containers are multidimensional. The benefit of running multiple processes on a single node is not solely to increase utilization of a given machine but to also gain advantages in deployment, etc.
Schedulers can take what's being mentioned here into account. You can say "I need at least 2 cores but would like up to 8" or "I need guaranteed 8 cores, evict lower priority containers if necessary" to ensure your containers aren't fighting over resources. You can also classify machines for certain types of workloads and then specify that certain containers should run on a particular class of node. Not all container platforms handle this, but Kubernetes does at least.
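In Kubernetes terms, "at least 2 cores but would like up to 8" maps onto resource requests and limits in the pod spec, and the node-classification part onto a node selector. A sketch (the names and labels are illustrative, the field structure is the Kubernetes API's):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker                      # illustrative name
spec:
  nodeSelector:
    workload-class: cpu-bound       # illustrative label on a class of nodes
  containers:
  - name: app
    image: example/app:latest       # illustrative image
    resources:
      requests:
        cpu: "2"                    # scheduler guarantees at least 2 cores
      limits:
        cpu: "8"                    # container may burst up to 8 cores
```

The scheduler only places the pod on a node with 2 unreserved cores, which is what keeps co-located containers from oversubscribing the machine in the way the parent comments describe.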
And as we add improved resource management and tuning to Kubernetes you gain those improvements without having to change your deployment pipeline - just like when a compiler gets better you get more performance for "free".