Coroutines in C, revisited (vicsydev.blogspot.com)
69 points by codr4life on Dec 17, 2016 | 36 comments


This is a C++ coroutines framework ( https://github.com/markpapadakis/coros-fibers ), but it's trivial to "port" it to C.

I've also written a few things about coroutines and fibers (e.g. https://medium.com/software-development-2/coroutines-and-fib... and https://medium.com/software-development-2/high-performance-s... ).

Coroutines based on stack manipulation (as opposed to simple FSMs) are extremely powerful and are particularly useful for implementing fair scheduling across runnable 'processes' and concurrent execution contexts. Fair scheduling in particular can be really important in the domain of distributed systems that need to process potentially long-running and/or expensive requests, which would otherwise stall processing of other, cheaper requests indefinitely. Combined with async I/O, this gives near-optimal concurrency and resource utilisation. Coroutines are often associated (literally and figuratively) with futures/promises. See the SeaStar framework for a great implementation of those ideas.


>> Coroutines based on stack manipulation (as opposed to simple FSMs) are extremely powerful and are particularly useful for implementing fair scheduling across runnable 'processes' and concurrent execution contexts.

So, isn't this what threads are for? This is not a snarky comment, genuine interest: isn't this what the OS scheduler and thread implementations are meant to do, and isn't implementing one's own threads and scheduling the epitome of 'reinventing the wheel'? Specifically, when going down the route of using a coroutine to essentially duplicate functionality provided by native threads, is there any performance advantage? My understanding is that, nowadays, OS threads and schedulers are very highly optimized for performance.

Other than being able to write concurrent code with yield() and resume(), is there really any reason to NOT use OS level primitives?


Operating system threads have two big disadvantages. You also design the code differently, since one scheduler is cooperative and the other is preemptive.

First, OS context switching is much more expensive in terms of overhead compared to lightweight contexts like fibers. A highly optimized user-controlled context-switching scheme can be barely more expensive than a function call. Context switching OS threads, even when highly optimized, is much more expensive than you might assume, microseconds compared to a few dozen clock cycles, which adds up if you need to switch contexts frequently. In many software systems, the computational cost of the OS context switch can be higher than the computational cost of the real work done during that time slice (i.e. >50% CPU overhead due to context switching) and you end up doing a lot of optimization work trying to avoid this situation, often by implementing something that looks suspiciously like user-space coroutines.

Second, and I would argue more importantly, the operating-system-provided thread scheduler, through no fault of its own, is often pretty terrible for concurrent/distributed/parallel systems. The operating system simply lacks the runtime context to optimally determine (1) when to switch contexts and (2) the right thread to switch to. In practice, this leads to a lot more context switching, much of it wasteful and all of it high-overhead, than would ever occur if you designed your own scheduler. The impact is not small at all; a poor schedule can seriously damage CPU cache locality.


Linux threads are impressively optimized for what they do. The way to beat them is to not do as much. Specifically, to not do a kernel transition. Many coroutine switch implementations are not much more expensive than function calls. For example https://news.ycombinator.com/item?id=13199581


That's a good question, and the answer has to do with, obviously, the cost of spawning OS threads/processes, context-switching among them, and how all that affects caches (the cache hit/miss ratio dominates most workloads nowadays, more so than actual CPU instruction execution). A context switch can cost on average 30µs of CPU overhead, but can reach as high as 1500µs. (You can reduce that by pinning threads to cores, but the savings will likely be at most 20% or so.) Thankfully now, at least on Linux, AFAIK, a context switch doesn't result in a TLB flush; back in the day that was quite expensive. Whenever the OS scheduler needs to context switch, it needs to save the old state (including registers), restore another state, and perform some other kind of bookkeeping. This is more expensive and convoluted than doing it yourself; you also usually don't need to care about all possible registers, so you can reduce switching costs by e.g. not saving the SSE registers. Apps that create too many threads are constantly fighting for CPU time; you will end up wasting a lot of CPU cycles practically cycling between threads.

Another hidden cost that can severely impact server workloads is that, after your thread has been switched out, even when it becomes runnable again, it will need to wait in the kernel's run queue until a CPU core is available for it. Linux kernels are often compiled with HZ=100, which means that processes are given time slices of 10ms. If your thread has been switched out but becomes runnable almost immediately, and there are 2 threads before it in the run queue waiting for CPU time, your thread may have to wait up to 20ms in a worst-case scenario to get CPU time. Depending on the average length of the run queue (reflected in the system load average) and how long threads typically run before getting switched out again, this can considerably affect performance.

Also, kernels don't generally do a good job at respecting CPU affinity -- even on an idle system. You can control that yourself.

If you are going to represent ‘jobs’ or tasks as threads, and you need potentially 100s or 1000s of those, it just won’t scale - the overhead is too high.

If you control context switching in your application (in practice that's about saving and restoring a few registers) -- cooperative multitasking -- you can have a lot more control over how that works, when it's appropriate to switch, etc., and the overhead is far lower.

Oh, and on average it's 2.5-3x more expensive to context switch when using virtualisation.


>> If you are going to represent ‘jobs’ or tasks as threads, and you need potentially 100s or 1000s of those, it just won’t scale - the overhead is too high.

This seems to be an indictment of containerization (i.e. Docker, Kubernetes, etc.). How can a single machine, even with containers, scale out if this is true? Do people advocating Docker and similar technologies not understand or recognize that their containers should only utilize y threads (where y = (total threads per machine) / (number of containers hosted on that machine))? I personally have never heard this mentioned as a consideration when deploying Docker, for example.


The advantages of containers are multidimensional. The benefit of running multiple processes on a single node is not solely to increase utilization of a given machine but to also gain advantages in deployment, etc.

Schedulers can take what's being mentioned here into account. You can say "I need at least 2 cores but would like up to 8" or "I need guaranteed 8 cores, evict lower priority containers if necessary" to ensure your containers aren't fighting over resources. You can also classify machines for certain types of workloads and then specify that certain containers should run on a particular class of node. Not all container platforms handle this, but Kubernetes does at least.


And as we add improved resource management and tuning to Kubernetes you gain those improvements without having to change your deployment pipeline - just like when a compiler gets better you get more performance for "free".


I just created a submission for libco, which I think is the best (fastest and most portable) library for coroutines in C. It's great, I wish it got more love. https://news.ycombinator.com/item?id=13199581


I'm curious, how is libco faster or more portable than the approach described in this post?


It's not, they're two separate things. Unfortunately, nobody can agree on terminology for this stuff. I try not to ever use the word 'coroutine' anymore because of this.

libco is what I call a cooperative threading library. It allocates a stack for each thread. This means that thread A can call six functions deep into helper functions, and suspend in the middle of a helper function, and then resume right back at that same point.

Thus, you program libco very similarly to how you do preemptive threading, only you don't ever need any locks, and you have to do your own scheduling (which can be as simple as "co_call and co_return", or "co_jump" in assembly terms.)

The CORO #define wrappers in the article are what I would call a very simple state machine. There is no thread, there is no stack, there is no user-space context switching going on. You only get the call/ret style behavior, and you can't ret from a subroutine called inside of it. It's far, far less useful.

However, when you can get away with it, CORO will be about ten times faster on average. libco is optimized as much as possible with pure assembler. On ARM, it's literally three instructions. On x86, it's ten. The entirety of the burden is because modern CPUs do not enjoy having their stack pointer changed. Go to older CPUs and the overhead drops substantially, but few people are coding on a 68000 anymore. This also means that libco is not ISO C. It requires an assembly implementation or a platform that lets you poke at jmpbuf from setjmp/longjmp to work. In practice, it'll work anywhere you'd reasonably want. x86, amd64, ppc, ppc64, arm, mips, sparc, alpha; Windows, macOS, Linux, *BSD, Haiku; etc have all been tested. And even if you find something crazy obscure, you only have to write about twenty lines of code to port it.

The magic of libco is that it's about 100 times faster than preemptive threading would be for the context switching overhead. So when you actually need subroutines, and when you don't want to be swamped in locks/semaphores/critical sections, it can eliminate giant swaths of nested state machines (switch tables) by keeping all that state inside each local stack frame, transparently and out of your codebase. Deeply nested state machines are a serious pain in the ass to reason about and maintain, trust me.

Also ... both of these methods have a serious flaw (as does preemptive threading): serialization. If you want to save the state of your program in the middle of working, and restore to that position later (think save states in an emulator) ... it'll be a bit tricky with libco. But since the state is hidden by the #define, it'll be absolutely impossible with CORO, unless you redesigned it to use external state, which would make it even clunkier to use.

In conclusion: both complement each other. I feel C++20 or whatever should add CORO-style stackless state machine wrappers to the language for convenience. But it should definitely leave stack-based cooperative threading out, because that can easily be done as a library. (It can include library functions for that purpose, but it doesn't need language-level primitives.)

There's a hundred cooperative threading libraries, by the way. My aim with libco was to make a library that has four functions (co_create, co_delete, co_active, co_switch) and zero types/structs. As simple and fast as possible, supplying as many optimized backends as possible. Then let people build on top of that. The downside is most people still choose to write their own libraries from scratch, so there's a lot of wheel reinventing going on.


> It's not, they're two separate things. Unfortunately, nobody can agree on terminology for this stuff.

There is standard terminology though. Coroutines allow arbitrarily nested calls back and forth. So-called "stackful" coroutines are the only possibility in C. Other languages perform a program translation into continuation-passing style to support stackless coroutines with the same semantics.

CORO falls under a class of abstractions known as "generators". You can think of it like a degenerate coroutine limited to a single call level, but I don't think conflating the terminology like this is useful.


You got it backwards though; all the name means is that the subroutine has multiple entry points for suspending and resuming execution, which can be used to implement non-preemptive threads, generators and other abstractions.


How does this work compare with libdill[1]?

[1] http://libdill.org/


At a glance this just looks like a short, simple coroutine implementation/hack using C macros, whereas libdill does a lot more to extend C syntax in a similar manner, but in a much more comprehensive way (I believe with the intention of matching some of Go's concurrency mechanisms). The author compares this to Duff's device, so the comparison seems apt.

I would pick libdill for any serious program and keep this on hand as a C snippet in Vim or $EDITOR.

[edit] I should say this could be used in a "serious" program too, but it's nice and short in a way that looks like a great snippet to me


It's simpler, to begin with. libdill mixes coroutines with scheduling, there's no reason to do that from my perspective.


I've recently made a similar protothread/coroutine library for embedded systems where local continuations can be either setjmp or goto or case labels. Nothing fancy, but perhaps someone will find it useful: https://github.com/zserge/pt


This is a rediscovery of contiki's protothreads:

http://contiki.sourceforge.net/docs/2.6/a01802.html


Which AFAIK is based on the publication of Simon Tatham:

http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html

Who directly references good old Duff's device. The title seems to acknowledge the existence of said article, but I didn't find any reference.


I know, I know. Why is it so important that someone else did something conceptually similar once upon a time? Can't you see the value in standing on their shoulders and trying to reach further in some direction? We're all in the same boat here; you can't own ideas. In this case I was aiming to simplify the API and usage while keeping the performance, and I feel I managed to take at least a few steps in that direction.


Like scandox, I did not mean to be dismissive, just to highlight that the technique has been used in the past, as a reference for discussion and evaluation.

It is hard to imagine how one could preserve local variables without knowledge of the precise compiler + architecture. You would also have to communicate to the compiler that these locals should not be rearranged nor merged in every context where the function is being used.

So the pragmatic solution thus far has been to keep track of variables in a separate block of memory, passed in with every invocation of a coroutine.


Preservation of stack data is done by comparing addresses of stack-allocated values and using memcpy to copy the space in between. That's the most straightforward solution I could come up with. I'm still working on a way to track the variables automagically, which is why the optimization level is set to 0 for now; it looks like I might end up having to add a third macro for preserved vars.


What if these variables contain addresses to other stack variables?


Undefined behavior. Any ideas on how to solve that?


Well it is useful for others to know the history and have other references. I am sure it is not intended to diminish your work.


Exactly that!

Sorry if my comment came across like an insult. Wasn't intended as such. Maybe it was just me hoping to find a direct improvement on Tatham's implementation, which I would call "quirky" at best.

Your implementation seems to be much more "usable" and understandable in comparison.

After reading through more of your library I must say, I like its style.


None of this is personal to me, I have no idea where the inspiration came from; I'm just trying to share the beauty I see in the concepts. So I get a bit frustrated with anything that steals focus from my mission, like history lessons. I don't really care who thought of what first. Sorry if that came across as hurt feelings, there is no conflict here.

I'm glad you like the humble beginnings of the library I've always been wishing for :)


This makes your work even more interesting. I don't know if you read the original mail [1] in which Tom Duff shared his device, but at the end he mentions that he had an idea for something like coroutines:

> It amazes me that after 10 years of writing C there are still little corners that I haven't explored fully. (Actually, I have another revolting way to use switches to implement interrupt driven state machines but it's too horrid to go into.)

He later replied to Simon Tatham that his coroutines actually were those "horrid" ideas.

Obviously you must have had some thoughts similar to those two. Kudos for that. You seem to have a deep understanding of the concepts that make up the C language (and how to twist them) ;).

[1]: http://www.lysator.liu.se/c/duffs-device.html


It's worth noting that Tatham's coroutines are very widely deployed in PuTTY.


I believe PuTTY is part of the reason why they exist. Every author is able to employ his own mad creations (speaking from experience here).

I have never seen his coroutines somewhere else, with the single exception of the above-mentioned Contiki / protothreads.


But where do we stop? In the end it's also based on K&R, and on BCPL, and...


Dude it's a forum. The medium is discursive.


Dude, I'm on a mission; and it's not a history lesson.


Those who ignore history are doomed to repeat it, as some other dude said. But he's dead, so... But I admire your zeal, truly.


But this has nothing to do with threads, even though I know that you can implement them using the same technique. And the API offered is simpler, without funky BEGIN/END macros.


That appears to be a variant of Simon Tatham's idea: http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html

Just in case, I've made C++ variant of it: https://www.codeproject.com/Tips/29524/Generators-in-C



