
I write Clojure for food, and Common Lisp for fun. One reason for the latter is CL's speed -- a while back I compared a bit of (non-optimized) Clojure code I wrote for a blog post with a rough equivalent in CL, and was stunned that the Common Lisp code ran about 10x faster. This made me curious how fast it could be made if I really tried, and I was able to get nearly 30x more[1] by optimizing it.

Clojure is definitely fast enough for everything I've done professionally for six years. But Common Lisp, while having plenty of rough edges, intrigues on the basis of performance alone. (This is on SBCL -- I have yet to play with a commercial implementation.)

[1] http://johnj.com/from-elegance-to-speed.html



I have implemented a real algotrading framework in Common Lisp (SBCL) that connected directly to the Warsaw Stock Exchange.

The application was receiving binary messages from the exchange over multicast, rebuilding the state of the market, running various (simple) algorithms and responding with orders within 5 microseconds of the original message, at up to 10k messages per second.

With SBCL you can write a DSL and have the ability to fully control the resulting instructions (through vops). It is just a question of how dedicated you are to writing macros upon macros upon macros.

I used this to the fullest extent and the result was as good as any hand-optimized C/assembly.

For example, for the binary message parser we received a stack of complicated specifications for the messages in the form of XML files (see here if you are curious: https://www.gpw.pl/pub/GPW/files/WSE_CDE_XDP_Message_Specifi...), converted the XML to a DSL, and then, through the magic of macros, the DSL was converted to accessors that allowed optimal access to the fields. Optimal here means the assembly could not be improved upon any further.

Large parts of the application (especially any communication and device control, as it was done with full kernel bypass) were written in ANSI C, but the Foreign Function Interface makes integrating it into the rest of the application a breeze.

I write all of this because I frequently meet complete disbelief from people that Lisp can be used in production.

I personally think it is exactly the opposite. Lisps offer a fantastic development environment. The problem is that developers are mostly unable to use Lisp effectively for work, I guess due to too much power, too much freedom and a complete lack of direction on how to structure your application.


It is similar to how many still don't believe in GC-enabled systems languages.

If some big-shot corporation with weight in the industry doesn't shove it down devs' throats, then it isn't considered possible.

And regarding Lisp, most tutorials keep ignoring that arrays, structures, stack allocation, deterministic allocation,... are also part of Common Lisp.


> It is similar to how many still don't believe in GC-enabled systems languages.

Well... don't give my application as an example.

I worked around the GC problem by using CL as a kind of compiler for the application -- I built a high-level DSL that was then compiled to a binary, supplemented with small blocks written in ANSI C.

If you think about it, SBCL already does this, and so does the JIT in the JVM.

I will give you an example:

Before an order goes to the exchange, it needs to be validated against a set of rules. Like "Do we have enough money to run this order?" or "How much of its allotted limit does this algorithm still have?" or "Is volatility on this market within bounds for this algorithm to be used?"

The rules were written in a Common Lisp DSL; Common Lisp code then converted them to a decision tree, optimized the decision tree, and compiled the optimized version into a binary function. That function itself no longer had anything to do with Common Lisp; it could just as well have been written in ANSI C.

Then Common Lisp code wired these functions to form the application.

While most of the application code was actually Common Lisp (and some 20% ANSI C), if you were a market order and observed which instructions were handling you, none of them would actually be Common Lisp.

After the initial setup, the Common Lisp application was running on only one core, and the constructed binary took over all other cores, running without garbage collection on memory regions allocated outside of the Common Lisp system.

I hope this description makes sense... I have never, before or after, met an application that was built this way. As far as I know it is one of a kind.


You know, even beyond Common Lisp, Lisp literature is often surprisingly low level. The Lambda Papers deal with assembly too. People may be confused about Lisp being up high in its ivory tower, but the people in this subculture were extremely diligent/smart at both ends of the ladder.


Part of this is that when Lisps started, ALL other languages were very low level, and the hardware that the Lisps were running on could not afford to waste any memory or cycles.

So efficiency was an important consideration right from the start.

Other languages that most people use today grew up later (borrowing from Lisp), when resources weren't such a big deal.

I consider languages like PHP, Ruby and Python peak examples of mainstream languages that were created with barely any consideration for efficiency. This coincides with the 1990s and early 2000s, when we saw a dramatic increase in system resources, especially memory, without much need to use them efficiently.

Nowadays this has improved a little as we learn to run bigger loads, so newer mainstream languages (like Rust) and some older ones (like Java, C# or JavaScript) are putting more effort into efficiency.


Indeed, when someone points out "you cannot do that because...", I feel the urge to point them to what was happening on 1960s hardware and how much their watch outperforms a Connection Machine.


the memory capacity of society is astonishingly low


Christian Queinnec (author of Lisp in Small Pieces) vaguely mentions that all dynamic languages are Lisp in various clothing.


You sure of the 5us figure?

Working in the field, that seems overconfidently good, considering it's faster than most wire-to-wire SLAs of world-class FPGAs doing market data translation.


I am sure, 5us was measured on the switch, not on the server.

5us is actually quite slow.

This is from 2014, around the time I worked on this project:

https://stackoverflow.com/questions/17256040/how-fast-is-sta...

They cite "750 to 800 nanoseconds" wire-to-wire latency and say that their next platform is going to be even faster.

They were using an FPGA, and yes, FPGAs are faster, but not as much faster as many people would think.

First, market events come in isolation. There are no concurrent messages. WSE guarantees 1/10000th of a second (100 microseconds) between messages. So you have your entire machine dedicated to executing the order.

FPGAs are usually used to run multiple copies of the same net to speed up a simple problem, but that is not the case here.

Second, FPGAs are usually used as a shortcut to optimize the execution of the problem. With an FPGA you say "rather than trying to solve this problem with generic instructions that add a lot of delays, I will just design a dedicated net that is not bothered by the generic baggage".

So in generic assembly you may write a branch, and the branch predictor may go the wrong way, and that costs you. On an FPGA you design your net, so you just go straight to the point.

But it doesn't mean you can't design normal code to be fast. You just need to be aware of the actual cost of every single instruction on the critical path.

And third, FPGAs are clocked slower. What this means is that you have to do a lot per clock cycle on an FPGA just to be on par with an x86 core.

The application I worked on was not intended for HFT. 5us was an arbitrary goal we wanted to reach, knowing full well that it is way behind the HFT-ers.


This is fascinating. Did you come across any good resources to help learn how to optimise aggressively on SBCL, as you're describing?


This was many years ago and I don't remember the exact sources. It was more of a research project than a normal development project. I spent most of my time researching techniques needed to write that kind of application.

The largest influence were definitely LMAX articles and Disruptor pattern.

SBCL was comparatively easy. Basically, if anything caused problems I just moved it to C and called it using FFI. Think in terms of writing a C program but using pieces of assembly when C is not enough for some reason.

Even though a lot of small pieces were moved to C, the application still felt like Lisp. It just orchestrated a large library of utilities to do various things. I still had a fully functional REPL, for example.

The hard part of the project was the full kernel bypass. After the initial setup, the application stopped talking to the Linux kernel, except for one CPU core that was devoted to running OS threads, some necessary applications and some non-performance-critical threads of the algotrading framework (like the REPL, persistence, etc).

Every core except that one was completely owned by a single thread of the application and never did a context switch after the initial setup.


Thank you, that's very helpful. I've found the CEPL videos instructive in a similar area: https://www.youtube.com/user/CBaggers/videos

I hadn't appreciated that level of thread isolation was possible in SBCL, but of course it makes sense. Presumably you had some kind of instrumentation to let you know if a stray syscall had slipped in?


> Presumably you had some kind of instrumentation to let you know if a stray syscall had slipped in?

Don't know if you can call it instrumentation... I wrote an extremely hacky patch for the kernel to detect when any piece of the kernel was running on anything other than core 0 after a certain flag was set.

Remember, it is not just syscalls. Even something as simple as accessing memory can cause a switch to the kernel to resolve a TLB entry, if you don't set up your memory correctly to prevent this from happening.

Yeah... I know. Now a bunch of people will come and explain how this could be done the right way. I just didn't care at the time to invest more time than necessary to get this right.


Got it. That's rather lovely.


I assume that you used something like the isolcpus kernel parameter, then sched_setaffinity to put the process on the isolated CPU.

I'm very interested in the kernel bypass technique to talk to the hardware; I'm assuming it was the NIC's ring buffer... how was this achieved in userspace?


Hmmm, CL for composition of software components and most software, with anything needing low-level access in C. I could drink that Kool-Aid.


That is how Python got its spotlight, with "Python" libraries that are actually C.

And that is how plenty of polyglot devs can enjoy their favourite language.


Funnily enough, a lot of low-level coding is easier in CL than C (especially bit wrangling, but CL:DISASSEMBLE makes for easy introspection of performance-critical code too); it's usually interfacing with a C-centric OS that gets easier using FFI wrappers.


Others have made similar comments about comparing optimized CL to idiomatic Clojure being apples to oranges, but what they didn't address was the "idiomatic Clojure" part. Threading a sequence through a bunch of lazy operations is good for development time, but at this point I wouldn't consider it idiomatic for "production" code.

Two things in the implementation are performance killers: Laziness and boxed maths.

- Baseline: taking your original implementation and running it on a sequence of 1e6 elements I generated, I start off at 1.2 seconds.

- Naive transducer: needs a sliding-window transducer, which doesn't exist in core yet[0]: 470ms

- Throw away function composition, use peek and nth to access the vector: 54ms

- Map & filter -> keep: 49ms

- Vectors -> arrays: 29ms

I'd argue only the last step makes the code slightly less idiomatic. One might even say that aggressively using juxt, partial and apply is less idiomatic than simple lambdas.

You can see the implementation here[1].

[0] https://gist.github.com/nornagon/03b85fbc22b3613087f6

[1] https://gist.github.com/bsless/0d9863424a00abbd325327cff1ea0...

Edit: formatting


Your post is a lot of fun! I have a fondness for these kinds of deep dives. That being said, I feel like comparing the initial unoptimized Clojure code to highly optimized Common Lisp is kind of unfair. I wonder how fast the Clojure code could run if given the same care. Maybe I'll give that a try tonight!


That would be great! I agree it's not a fair comparison -- the post was meant more along the lines of, "how far can I push Common Lisp and learn some things?" rather than a strict comparison of performance of optimized code in each language. As I said, Clojure is fast enough for my purposes nearly all the time.


I wrote it up here: http://noahtheduke.github.io/posts/2021-10-02-from-elegance-... I'm not a great writer, but hopefully you enjoy it!

I also found a small bug: you'll want to use `(elt s (+ x 7))`, not `(+ x 8)`. `elt` is 0-indexed, so the first and last elements of an 8-element list will be 0 and 7.


You're going to be severely constrained by the fact that idiomatic Clojure code uses immutable data structures, which cannot theoretically be as fast as mutable ones in SBCL. Even with transients, the Clojure hash map trie is an order of magnitude slower than SBCL's hash table.


Of course. I don't mean to imply that optimized Clojure could match optimized Common Lisp, just that the disparity wouldn't be quite as stark. For truly speed-focused code on the JVM, you have to write speed-focused Java.


I also write Clojure for food, but also for fun. I found that a lot of Clojure is fast enough, and that speeding it up is relatively easy with some type hinting and perhaps dipping into mutable data structures or other tricks.

Relevant Stack Overflow question with a lot of simple suggestions to hopefully shore up the deficit compared to SBCL: https://stackoverflow.com/questions/14115980/clojure-perform...


I'm sorry to say it, but I'm not a fan of articles like this: you're not using the same data structures and algorithms, so what's the point?

To compare language compiler and runtime performance you should at least use similar data structures and algorithms.

> but, you can generate very fast (i.e., more energy efficient)

I'm actually not sure this is true. I was surprised the other day to find a study on this showing that the JVM was one of the most energy-efficient runtimes.

This was the link: https://thenewstack.io/which-programming-languages-use-the-l...

And what you can see is that faster execution time doesn't always mean more energy efficient. For example, you can look at Go and see that it beats a lot of things in execution time but loses to those same things in energy efficiency, like how Go was faster than Lisp yet less energy efficient than Lisp.


Quite a few Lisp implementations allow you to drop into inline assembly. That makes them theoretically pretty close to maximum efficiency.


I'm not sure in what way you're using the word "efficiency" here?

The research showed that the Go versions of the programs were on average faster in execution time and consumed less memory than the corresponding Lisp versions, yet the Lisp versions consumed less electricity and were thus more energy efficient.

So the interesting bit here is that better performance and lower memory consumption doesn't always mean more energy efficient.

Now, there is definitely a link between performance and energy use; the research did show that, for the most part, faster execution meant lower energy spent, but there were surprising variations within that range where it wasn't always clear-cut.


It would be cool if you updated the link to the Cookbook in your article from the old one (on SourceForge) to the newer one (on GitHub): https://lispcookbook.github.io/cl-cookbook/

Best,


I will, thank you!


Your website inspires me. I've been on a similar career trajectory to yours in some ways, although at the opposite end of the earth: starting in physics and moving away over time, although I'm only 33 now. I love that you're putting your art on display. It's very rare to find scientists pursuing art in such a direct way.


I would be interested in at least the performance of type hinted clojure code.


The closest translation of the code, having already dropped the laziness of the Clojure version, was 4x faster. The rest of the speedups came from rewriting the code!



