A Fast General-Purpose Lock-Free Queue for C++

amelius · on Nov 8, 2014

This is an interesting showcase of lock-free programming. Thank you for this!

However, I see the following concerns:

1. The code is so complicated that (imho) an actual formal proof would be needed to show that this actually works.

2. How portable is this? Will this work on other architectures than i86, which have different ordering guarantees?

3. This reduces the time for enqueueing and dequeueing by a factor of x, which is immense (near 100x or 1000x?), but we have to keep in mind that the code uses only a small fraction of total time spent by an application. Therefore, it would be nice to see that this reduction is worthwhile in a practical application.

4. The code for this queue is actually so large (2800 lines) that a large amount of cache memory is needed to keep this code in active memory, and keep it running as fast as it could.

Again, thanks, this will be interesting to study.

cameron314 · on Nov 8, 2014

1. It is complicated, and, of course, with any complex code comes the possibility of bugs (especially when threads are involved!). I'd love to have a formal proof myself, but it's not a simple undertaking -- the sheer number of possible codepaths at execution time would make an exhaustive check infeasible. (And expressing the entire thing as a formal model is beyond my means.) What I've done instead is to try to test individual subsystems in isolation (you can check out the tests if you're interested): There's a free-list, a thread-local storage map, an add-only lock-free list, two block mapping methods, and the core algorithm itself. Then I wrote a series of unit tests (that are really more like integration tests, I suppose) that test individual functions, and everything together (and boy did those tests ever catch bugs during development!). There's never too many tests. There's never too much model-checking. But I only have limited time :-) So in the meantime, I made an initial release when I thought it was stable and correct to the best of my knowledge, with a reasonable amount of testing that I feel is reasonably thorough. I'd gladly accept more tests contributed by somebody else.

2. It is meant to be 100% portable to any system that has a C++11 compiler. I was very careful about the orderings I wrote, so yes, it should work on CPUs with weaker memory models.

3. The factor depends on what part of the API is used, and with how many threads. I absolutely agree that a fast queue itself probably won't buy you much in terms of application performance. But if you're going to write a data structure, why not write a high-performance one? Much more fun, IMO, and certainly can't hurt. 4. Here I must disagree. The number of lines of code of a heavily templated data structure are entirely unrelated to the compiled size :-) In particular, if you use either only producer tokens or only no producer tokens, then a large portion of the code is unused and will be optimized out. Same for the bulk methods.

near · on Nov 8, 2014

Please don't be upset with me for asking ... but I am sincerely curious: I know that an MPMC lock-free queue is a vastly challenging thing (the best I could do lock-free myself was one producer thread, one consumer thread), but is this really the minimum level of complexity needed to implement this idea? There's a lot of comments (which is great), but even with that, the code just seems incredibly massive to me for what it is (over 2800 lines and 100KB, of course it has a lot of great commenting in there as well.)

If there really is no way to reduce the complexity (or at least without ruining the performance), then I understand. I know code size isn't a particularly important metric, I just find code that's been reduced to its raw essence a lot easier to understand and learn from than over-architected, over-abstracted code.

I ask because this is something I'm interested in understanding (I want to write a nice thread-safe allocator), but the code even with the comments and article is fairly overwhelming to me at the moment.

Either way, great job on making this! Really impressive performance metrics, too.

cameron314 · on Nov 8, 2014

You're certainly right that it's not the minimum amount of complexity. But it's not that hard to implement a simple MPMC queue (e.g. if you can assume DCAS on your platform) -- what's hard is making it fast. So part of the complexity derives from a design that calls for blocks of elements instead of a simlpe linked list, for example; but a lot of the speed comes from that design too.

The rest of the code size (and complexity) arises from the feature set (which I intended to be fairly complete): Users can provide tokens for better performance -- but don't have to. There's bulk methods. There's a move constructor and assignment operator (for the tokens too!). There's traits so that you can override the malloc/free that's used, and all the constants. There's memory debugging machinery (hidden in an #ifndef NEBUG). It all adds up, but not without reason, I think :-)

I will add a "how to read the code" section to the README later on when I have a minute. I think it would be instructive for those curious.

TillE · on Nov 8, 2014

It's not a proof of concept, it's an actual thing that's designed to be used. If you read the README, it's clear that there are a variety of nonessential features included which you may or may not want, but I see no evidence it's "over-architected."

kyberias · on Nov 8, 2014

The code has a huge amount of opportunities for re-factoring and cleanup. Some functions are massive.

Br, Robert C. Martin

Animats · on Nov 8, 2014

OK, now demonstrate that this doesn't require fence instructions to prevent problems on superscalar CPUs (anything bigger than a low-end ARM today).

cameron314 · on Nov 8, 2014

Of course it requires fences. It wouldn't be correct otherwise, and would probably not pass the tests even on x86, thanks to compiler re-ordering.

krig · on Nov 8, 2014

The code does have fence instructions. I have no idea if it is correct, I just took a look at the source code.

davidtgoldblatt · on Nov 8, 2014

This doesn't seem to correctly implement a queue to me?

Suppose thread A enqueues an item, signals somehow to thread B that thread B may proceed, and then thread B enqueues an item. There's no guarantee that thread A's item will be dequeued before thread B's.

Of course, non-linearizable doesn't necessarily mean useless (this is in fact an active research area), but it makes me think the performance comparisons aren't really fair; you're comparing genuinely linearizable queues with one that isn't bound by that performance-hampering requirement.

cameron314 · on Nov 8, 2014

I consider it a queue, despite its non-linearizability. I'll quote from my README where I addressed this concern:

One particular consequence of this design (which seems to be non-intuitive) is that if two producers enqueue at the same time, there is no defined ordering between the elements when they're later dequeued. Normally this is fine, because even with a fully linearizable queue there'd be a race between the producer threads and so you couldn't rely on the ordering anyway. However, if for some reason you do extra explicit synchronization between the two producer threads yourself, thus defining a total order between enqueue operations, you might expect that the elements would come out in the same total order, which is a guarantee my queue does not offer. At that point, though, there semantically aren't really two separate producers, but rather one that happens to be spread across two threads. In this case, you can still establish a total ordering with my queue by creating a single producer token, and using that from both threads to enqueue (taking care to synchronize access to the token, of course, but there was already extra syncrhonization involved anyway). I expect this use case to be fairly rare, though!

davidtgoldblatt · on Nov 8, 2014

This isn't really a great solution though, since it pushes the problem of ensuring FIFO ordering outside of the queue and onto the user; but being FIFO is entirely the point of the queue data structure. If your queue doesn't provide ordering across multiple threads, it can't meaningfully be called a multi-threaded queue, so it's not very fair to compare it to genuine multi-threaded queues. Unless I've misread the documentation, this is just a collection of SPMC queues, together with a read operation that checks all the queues for an item to dequeue, right? But there is no synchronization that occurs to tie any of these SPMC queues together. Even if you wrapped every single access to this data structure in a lock, for instance, you would not be able to ensure FIFO ordering of elements inserted into it.

I think the estimation that having inter-thread ordering requirements is uncommon is incorrect. Probably the most common use of MPMC queues is as a concurrency primitive in systems using a CSP-like abstraction, where such requirements are extremely common. But you can't implement, say, Go's channels in terms of this data structure.

cameron314 · on Nov 9, 2014

Yes, it is essentially an abstraction over a collection of SPMC queues (when using tokens to dequeue, it is actually a bit smart in that it tries to pair producers with consumers where possible).

If you wrapped every enqueue operation in a lock and used the same producer token, then you'd essentially have a single SPMC queue, and it would be fully FIFO. But obviously this defeats the purpose somewhat, so yes, it's not a good solution if you need that guarantee.

I honestly didn't consider external inter-thread ordering at all while I was designing the queue. I guess I was picturing more of a producer-consumer model, than a message passing system (where this dependence between messages across threads/producers is far more common, it seems).

I'd change this constraint if I could, but I can't. It's the fundamental assumption the very heart of the entire design. I'd have to start from scratch (which I'm not prepared to do, at this point). But I've changed the README to make this limitation crystal clear (you're not the only one with reservations with regards to this aspect).

I think it's still useful, perhaps not for all applications, but for many. Sorry if I came across as trying to purport it as something that it wasn't; that was not my intention at all.

krislee · on Nov 8, 2014

This idea is very similar to the Java Disruptor pattern (http://lmax-exchange.github.io/disruptor/). So is this an implementation of this pattern in C++ of sole independent creation of the author?

cameron314 · on Nov 8, 2014

Yes. I hadn't looked at the disruptor pattern until yesterday, and I still don't fully understand it. I designed this queue from scratch.

stinos · on Nov 8, 2014

May I ask why you choose to use inline on the non-template class-methods? Did it make any difference in the code produced? For the rest: this seems very well done. Love all the proper comments and the huge amounts of static_asserts.

robert_tweed · on Nov 8, 2014

Not the author, but since the whole thing is defined in a header, everything has to be declared inline to avoid violating the one definition rule.

Inline hasn't had anything to do with performance for years (compilers just do what they want anyway), but it's somewhat confusing that the meaning was changed.

near · on Nov 8, 2014

I found a cool trick around that recently. Use anonymous namespaces when you want to have "standard-library/boost"-like template header libraries. Anything inside of an anonymous namespace has no external linkage, and it can transparently use anything inside its parent namespace.

For instance:

    namespace nall { namespace {
      void foo() {
      }
    }}

You can include this in multiple compilation units, invoke it with nall::foo(), and have no issues with multiple symbols.

(and of course as rav said, as long as you put the function definition inside the class definition, that can also work. But my trick will work if you want the functions declared outside the class, or want non-class functions.)

T-zex · on Nov 8, 2014

Wouldn't that become a separate function per translation unit?

near · on Nov 8, 2014

I want to say that GCC will do the same thing with the inline trick. It has the option to inline the function (where there won't be any actual functions in the resulting binary), or declare it as a unit-unique function (where the actual function will be repeated in the binary, one for each object that includes our header and uses the function), or recognize it's the same function in each unit and merge it back to just one function for the entire binary (which seems even more likely to happen with LTO.)

Theoretically, the same thing should happen here. The only difference is you don't have to hint that you want the function inlined (and maybe you really don't), but you can still mark the function inline (or always_inline) if you want to ask GCC to please do that, and it doesn't ignore you.

However, I haven't actually analyzed the resulting binaries comparing the two approaches, so I can't say for sure if GCC can optimize the multiple functions to one in the former case (inline) but not the latter case (anonymous namespace). It would seem like an oversight if that were true.

rav · on Nov 8, 2014

> everything has to be declared inline to avoid violating the one definition rule.

Actually, methods defined inside the class definition are implicitly marked inline.

cameron314 · on Nov 8, 2014

It's a matter of personal coding style preference. That's all :-) I like to document which functions I'd prefer to be inlined.

It should make no difference in the output, since I believe all the methods are implicitly 'inline' anyway. And it certainly doesn't affect correctness.