
For some reason, this made me curious how fast different languages write individual characters to a pipe:

PHP comes in at about 900KiB/s:

    php -r 'while (1) echo 1;' | pv > /dev/null
Python is roughly 70% faster, at about 1.5MiB/s:

    python3 -c 'while (1): print (1, end="")' | pv > /dev/null
Javascript is slowest at around 200KiB/s:

    node -e 'while (1) process.stdout.write("1");' | pv > /dev/null
What's also interesting is that node crashes after about a minute:

    FATAL ERROR: Ineffective mark-compacts
    near heap limit Allocation failed -
    JavaScript heap out of memory
All results from within a Debian 10 docker container with the default repo versions of PHP, Python and Node.

Update:

Checking with strace shows that Python buffers the output:

    strace python3 -c 'while (1): print (1, end="")' | pv > /dev/null
Outputs a series of:

    write(1, "11111111111111111111111111111111"..., 8193) = 8193
PHP and JS do not.

So the Python equivalent would be:

    python3 -c 'while (1): print (1, end="", flush=True)' | pv > /dev/null
Which makes it comparable to the speed of JS.

Interesting that PHP is over 4x faster than Python and JS.
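The buffering difference can be reproduced without pv or strace. Here's a minimal Python sketch (mine, not from the thread) that counts how many raw writes actually happen when 10,000 single characters go through an 8 KiB buffer:

```python
import io

# A sketch (not from the thread) of why buffered output wins: count how
# many times the underlying raw stream is written to when printing
# 10,000 single characters through an 8 KiB buffer.
class CountingSink(io.RawIOBase):
    def __init__(self):
        self.calls = 0
    def writable(self):
        return True
    def write(self, b):
        self.calls += 1
        return len(b)

out = io.TextIOWrapper(io.BufferedWriter(CountingSink(), buffer_size=8192))
for _ in range(10_000):
    print("1", end="", file=out)
out.flush()
print(out.buffer.raw.calls)  # a handful of writes, not 10,000
```

With flush=True on every print, the count becomes one raw write per character, which is exactly the difference the strace output shows.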



> Javascript is slowest at around 200KiB/s:

I get around 1.56MiB/s with that code. PHP gets 4.04MiB/s. Python gets 4.35MiB/s.

> What's also interesting is that node crashes after about a minute

I believe this is because `while(1)` runs so fast that there is no "idle" time for V8 to actually run GC. V8 is a strange beast, and this is just a guess of mine.

The following code shouldn't crash, give it a try:

    node -e 'function write() {process.stdout.write("1"); process.nextTick(write)} write()' | pv > /dev/null
It's slower for me though, giving me 1.18MiB/s.

More examples with Babashka and Clojure:

    bb -e "(while true (print \"1\"))" | pv > /dev/null
513KiB/s

    clj -e "(while true (print \"1\"))" | pv > /dev/null
3.02MiB/s

    clj -e "(require '[clojure.java.io :refer [copy]]) (while true (copy \"1\" *out*))" | pv > /dev/null
3.53MiB/s

    clj -e "(while true (.println System/out \"1\"))" | pv > /dev/null
5.06MiB/s

Versions: PHP 8.1.6, Python 3.10.4, NodeJS v18.3.0, Babashka v0.8.1, Clojure 1.11.1.1105


>> What's also interesting is that node crashes after about a minute

> I believe this is because `while(1)` runs so fast that there is no "idle" time for V8 to actually run GC. V8 is a strange beast, and this is just a guess of mine.

Not exactly: the GC is still running; it’s live memory that’s growing unbounded.

What’s going on here is that WritableStream is non-blocking; it has advisory backpressure, but if you ignore that it will do its best to accept writes anyway and keep them in a buffer until it can actually write them out. Since you’re not giving it any breathing room, that buffer just keeps growing until there’s no more memory left. `process.nextTick()` is presumably slowing things down enough on your system to give it a chance to drain the buffer. (I see there’s some discussion below about this changing by version; I’d guess that’s an artifact of other optimizations and such.)

To do this properly, you need to listen to the return value from `.write()` and, if it returns false, back off until the stream drains and there’s room in the buffer again.

Here’s the (not particularly optimized) function I use to do that:

  async function writestream(chunks, stream) {
      for await (const chunk of chunks) {
          if (!stream.write(chunk)) {
              // When write returns null, stream is starting to buffer and we need to wait for it to drain
              // (otherwise we'll run out of memory!)
              await new Promise(resolve => stream.once('drain', () => resolve()))
          }
      }
  }
I do wish Node made it more obvious what was going on in this situation; this is a very common mistake with streams and it’s easy to not notice until things suddenly go very wrong.

ETA: I should probably note that transform streams, `readable.pipe()`, `stream.pipeline()`, and the like all handle this stuff automatically. Here’s a one-liner, though it’s not especially fast:

  node -e 'const {Readable} = require("stream"); Readable.from(function*(){while(1) yield "1"}()).pipe(process.stdout)' | pv > /dev/null


Are there still no async write functions which handle this more easily than the old event-based mechanism? Waiting for drain also sounds like it might reduce throughput, since then there is 0 buffered data and the peer would be forced to pause reading. A "writable" event sounds more appropriate, but the Node docs don't mention one.


Your node version indeed did not crash. Tried for 2 minutes.

But using a longer string crashed after 23s here:

    node -e 'function write() {process.stdout.write("1111111111222222222233333333334444444444555555555566666666667777777777888888888899999999990000000000"); process.nextTick(write)} write()' | pv > /dev/null


Hm, strange. With the same out of memory error as before or a different one? Tried running that one for 2 minutes, no errors here, and memory stays constant.

Also, what NodeJS version are you on?


Yes, same error as before. Memory usage stays the same for a while, then starts to skyrocket shortly before it crashes.

node is v10.24.0. (Default from the Debian 10 repo)


Huh yeah, seems to be an old memory leak. Running it on v10.24.0 crashes for me too.

After some quick testing in a couple of versions, it seems like it got fixed in v11 at least (didn't test any minor/patch versions).

By the way, all versions up to NodeJS 12 (LTS) are "end of life" and should probably not be used if you're downloading 3rd-party dependencies, as there are a bunch of security fixes since then that are not being backported.


I used this exact issue today while pointing out how Debian support dates can be misleading as packages themselves aren’t always getting fixes: https://github.com/endoflife-date/endoflife.date/issues/763#...


> I believe this is because `while(1)` runs so fast that there is no "idle" time for V8 to actually run GC. V8 is a strange beast, and this is just a guess of mine.

Java has (or had) weird idiosyncrasies like this as well. It doesn't crash, but depending on the construct you can get performance degradation depending on how the VM inserts safepoints (points where the VM is in a knowable state and a thread can be safely paused for GC or whatever).

I don't know if this holds today, but I know there was a time when you basically wanted to avoid looping over long-typed variables, as those loops had different safepoint semantics. The details are a bit fuzzy to me right now.


If you ever need to write a random character to a pipe very fast, GNU coreutils has you covered with yes(1). It runs at about 6 GiB/s on my system:

  yes | pv > /dev/null
There's an article floating around [1] about how yes(1) is extremely optimized considering its original purpose. In case you're wondering, yes(1) is meant for commands that (repeatedly) ask whether to proceed, expecting a y/n input or something like that. Instead of repeatedly typing "y", you just run "yes | the_command".

Not sure about how yes(1) compares to the techniques presented in the linked post. Perhaps there's still room for improvement.

[1] Previous HN discussion: https://news.ycombinator.com/item?id=14542938


Faster still is

  pv < /dev/zero > /dev/null


Yes, but you don't have control over which character is written (/dev/zero only produces NUL bytes).

yes lets you specify which character to output. 'yes n', for example, outputs n.


Yes doesn't just let you choose a character. It lets you choose a string that will be repeated. So

    yes 123abc
will print

    123abc123abc123abc123abc123abc
and so on.


each time terminated by a newline, so:

  123abc
  123abc
  123abc
  ...


> It runs at about 6 GiB/s on my system...

Honest question: what are the practical use cases of this?

Repeatedly typing the 'y' character into a Linux pipe is surely not that common, especially at that bit rate. Also seems like the bottleneck would always be the consuming program...


Historically, you could have dirty filesystems after a reboot that "fsck" would ask an absurd number of questions about ("blah blah blah inode 1234567890 fix? (y/n)"). Unless you were in a very specific circumstance, you'd probably just answer "y" to them. It could easily ask thousands of questions though. So: "yes | fsck" was not uncommon.


> Historically

It's probably still common in installation scripts, like in Dockerfiles. `apt-get install` has the `-y` option, but it would be useful for all other programs that don't.


Just to clarify: I was applying "historically" to "fsck", not to the use of "yes" in general. I can't remember the last time I've had the need to use "yes | fsck"


> Honest question: what are the practical use cases of this?

It also allows you to script otherwise-interactive command line operations with the correct answer. Many command line tools nowadays provide specific options to suppress such queries, but there are still a couple of holdouts which might not.


> Repeatedly typing the 'y' character into a Linux pipe is surely not that common, especially at that bit rate.

At that rate no, but I definitely use it once in a while. For example, if I copy quite a few files and then get repeatedly asked whether I want to overwrite the destination (when it's already present). Sure, I could get my command back and use the proper flag to "cp" or whatever to overwrite, but it's usually much quicker to just get back the previous line, go to the beginning (C-a), then type "yes | " and be done with it.

Note that you can pass a parameter to "yes" and then it repeats what you passed instead of 'y'.


> especially at that bit rate. Also seems like the bottleneck would always be the consuming program...

It's not made to be fast; it's just fast by nature, because there's no other computation it needs to do than to just output the string.


It is optimized quite seriously. I remember there was a comparison of it with, I believe, a BSD version, where the latter was a thousand times more readable (although slower).


I'm getting ~3.10GiB/s with both GNU's and FreeBSD's. I do see that GNU's version has some optimizations, but their effectiveness isn't apparent when doing `yes | pv > /dev/null`.

However, my point was just that its performance was never a main point of it. Even without optimizations, it's still very fast, and I don't think whoever created it first was concerned with it having to be super fast, as long as it was faster than the prompts of whatever was downstream in the pipe.


It really is! It's been a few years since I saw the article on HN so I just reposted it: https://news.ycombinator.com/item?id=31619076


Yes can repeat any string, not just "y". It can be useful for basic load generation.


I've used it to test some db behavior with `yes 'insert ...;' | mysql ...`. Fastest insertions I could think of.


A major contributing factor is whether or not the language buffers output by default, and how big the buffer is. I don't think NodeJS buffers, whereas Python does. Here's some comparisons with Go (does not buffer by default):

- Node (no buffering): 1.2 MiB/s

- Go (no buffering): 2.4 MiB/s

- Python (8 KiB buffer): 2.7 MiB/s

- Go (8 KiB buffer): 218 MiB/s

Go program:

    f := bufio.NewWriterSize(os.Stdout, 8192)
    for {
       f.WriteRune('1')
    }


Not specifically addressed at you, but it's a bit amusing watching a younger generation of programmers rediscovering things like this, which seemed hugely important in like 1990 but largely don't matter that much to modern workflows with dedicated APIs or various shared memory or network protocols, as not much that is really performance-critical is typically piped back and forth anymore.

More than a few old backup or transfer scripts had extra dd or similar tools in the pipeline to create larger and semi-asynchronous buffers, or to re-size blocks on output to something handled better by the receiver, which was a big deal on high speed tape drives back in the day. I suspect most modern hardware devices have large enough static RAM and fast processors to make that mostly irrelevant.


In addition to buffering within the process, writes to a pipe land in a kernel pipe buffer (64 KiB by default on Linux), while stderr is conventionally left unbuffered at the stdio level.


I did the same test, but added a rust and bash version. My results:

Rust: 21.9MiB/s

Bash: 282KiB/s

PHP: 2.35MiB/s

Python: 2.30MiB/s

Node: 943KiB/s

In my case, node did not crash after about two minutes. I find it interesting that PHP and Python are comparable for me but not you, but I'm sure there's a plethora of reasons to explain that. I'm not surprised rust is vastly faster and bash vastly slower, I just thought it interesting to compare since I use those languages a lot.

Rust:

  fn main() {
      loop {
          print!("1");
      }
  }
Bash (no discernible difference between echo and printf):

  while :; do printf "1"; done | pv > /dev/null


For languages like C, C++, and Rust, the bottleneck is going to mainly be system calls. With a big buffer, on an old machine, I get about 1.5 GiB/s with C++. Writing 1 char at a time, I get less than 1 MiB/s.

    $ ./a.out 1000000 2000 | cat >/dev/null
    buffer size: 1000000, num syscalls: 2000, perf:1578.779593 MiB/s
    $ ./a.out 1 2000000 | cat >/dev/null
    buffer size: 1, num syscalls: 2000000, perf:0.832587 MiB/s
Code is:

    #include <cstddef>
    #include <random>
    #include <chrono>
    #include <cassert>
    #include <array>
    #include <cstdio>
    #include <unistd.h>
    #include <cstring>
    #include <cstdlib>

    int main(int argc, char **argv) {

        int rv;

        assert(argc == 3);
        const unsigned int n = std::atoi(argv[1]);
        char *buf = new char[n];
        std::memset(buf, '1', n);

        const unsigned int k = std::atoi(argv[2]);

        auto start = std::chrono::high_resolution_clock::now();
        for (size_t i = 0; i < k; i++) {
            rv = write(1, buf, n);
            assert(rv == int(n));
        }
        auto stop = std::chrono::high_resolution_clock::now();

        auto duration = stop - start;
        std::chrono::duration<double> secs = duration;

        std::fprintf(stderr, "buffer size: %d, num syscalls: %d, perf:%f MiB/s\n", n, k, (double(n)*k)/(1024*1024)/secs.count());
    }
EDIT: Also note that a big write to a pipe (bigger than PIPE_BUF) may require multiple syscalls on the read side.

EDIT 2: Also, it appears that the kernel is smart enough to not copy anything when it's clear that there is no need. When I don't go through cat, I get rates that are well above memory bandwidth, implying that it's not doing any actual work:

    $ ./a.out 1000000 1000 >/dev/null
    buffer size: 1000000, num syscalls: 1000, perf: 1827368.373827 MiB/s


I suspect (but am not sure) that the shell may be doing something clever for a stream redirection (>) and giving your program a STDOUT file descriptor directly to /dev/null.

I may be wrong, though. Check with lsof or similar.


There's no special "no work" detection needed. a.out is calling the write function for the null device, which just returns without doing anything. No pipes are involved.



Seems like it's buffering output, which Python also does. Python is much slower if you flush every write (I get 2.6 MiB/s default, 600 KiB/s with flush=True).

Interestingly, Go is very fast with a 8 KiB buffer (same as Python's), I get 218 MiB/s.
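That 8 KiB figure isn't a coincidence; it's CPython's default I/O block size, which you can check directly (a quick check of mine, not from the thread):

```python
import io

# CPython's buffered I/O defaults to 8 KiB blocks, matching the
# 8192-byte write() calls visible under strace elsewhere in the thread.
print(io.DEFAULT_BUFFER_SIZE)  # 8192 on stock CPython builds
```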


for the bash case, the cost of forking to write two chars is overwhelming compared to anything related to I/O.


Echo and printf are shell built-ins in bash[0]. Does it have to fork to execute them?

You could probably answer this by replacing printf with /bin/echo and comparing the results. I'm not in front of a Linux box, or I'd try.

[0] https://www.gnu.org/software/bash/manual/html_node/Bash-Buil...


> Echo and printf are shell built-ins in bash

Ah, yeah, good point, I am wrong.


There's no forking and it's writing one character.


with Rust you could also avoid using a lock on STDOUT and get it even faster!


Tested it, seems to about double the speed (from 22.3MiB/s to 47.6MiB/s).


> python3 -c 'while (1): print (1, end="")' | pv > /dev/null

Python actually buffers its writes, with print only flushing to stdout occasionally. You may want to try:

    python3 -c 'while (1): print (1, end="", flush=True)' | pv > /dev/null
which I find goes much slower (550KiB/s)


Luajit using print and io.write

  LuaJIT 2.1.0-beta3
Using print is about 17 MiB/s

  luajit -e "while true do print('x') end" | pv > /dev/null
Using io.write is about 111 MiB/s

  luajit -e "while true do io.write('x') end" | pv > /dev/null


"Javascript" is slowest probably because node pushes the writes to a thread instead of printing directly from the main process like PHP.

Python cheats, and it's still slow as heck even while cheating (it buffers the output into 8192-byte chunks instead of issuing 1-byte writes).

write(1, "1", 1) loop in C pushes 6.38MiB/s on my PC. :)


Why is it cheating to use a buffer? This is the behavior you would get in C if you used the C standard library (putc/fputc) instead of a system call (write).


Because it doesn't answer the question "how fast individual languages write individual characters to a pipe" if in fact some languages do not.

It's not language "cheating" of course. It's just OP "measuring the wrong thing".


If you want to compare apples to apples, you can switch Python to use unbuffered stdout/stderr (via `-u` or fcntl inside the script [0])

[0]: https://stackoverflow.com/a/881751/6001364
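Beyond `-u`, you can sidestep Python's io stack entirely with `os.write`, which issues one write(2) syscall per call. A minimal sketch of mine, writing into a local pipe so the bytes are visible:

```python
import os

# Fully unbuffered writes: os.write performs one write(2) syscall per
# call, with no userspace buffering -- the apples-to-apples equivalent
# of a C `write(1, "1", 1)` loop.
r, w = os.pipe()
for _ in range(5):
    os.write(w, b"1")   # five separate one-byte syscalls
os.close(w)
data = os.read(r, 100)  # collect what reached the pipe
os.close(r)
print(data)             # → b'11111'
```

Pointing os.write at fd 1 instead gives the truly unbuffered loop the thread is discussing.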


Adding a few results:

Using OP's code for following

    php 1.8 MB/s
    python 3.8 MB/s
    node 1.0 MB/s
Java print 1.3 MB/s

    echo 'class Code {public static void main(String[] args) {while (true){System.out.print("1");}}}' >Code.java; javac Code.java ; java Code | pv>/dev/null
Java with buffering 57.4 MB/s

    echo 'import java.io.*;class Code2 {public static void main(String[] args) throws IOException {BufferedWriter log = new BufferedWriter(new OutputStreamWriter(System.out));while(true){log.write("1");}}}' > Code2.java ; javac Code2.java ; java Code2 | pv >/dev/null


Java can get even much much faster: https://gist.github.com/justjanne/12306b797f4faa977436070ec0...

That manages about 7 GiB/s reusing the same buffer, or about 300 MiB/s with clearing and refilling the buffer every time

(the magic is in using java’s APIs for writing to files/sockets, which are designed for high performance, instead of using the APIs which are designed for writing to stdout)


Nice, that's pretty cool!


`process.stdout.write` is different from PHP's `echo` and Python's `print` in that it pushes a write onto an event queue without waiting for the result, which can end up filling the event queue with writes. Instead, you can consider `await`-ing the `write` so that each write completes before the next one is queued.

    node -e '
        const stdoutWrite = util.promisify(process.stdout.write).bind(process.stdout);
        (async () => {
            while (true) {
                await stdoutWrite("1");
            }
        })();
    ' | pv > /dev/null


I'm on a 2015 MB Air with two browsers running, probably a dozen tabs between them, three tabs in iTerm2, Outlook, Word, and Teams running.

Perl 5.18.0 gives me 3.5 MiB per second. Perl 5.28.3, 5.30.3, and 5.34.0 give 4 MiB per second.

    perl5.34.0 -e 'while (){ print 1 }' | pv > /dev/null
For Python 3.10.4, I get about 2.8 MiB/s as you have it written, but around 5 MiB/s (same for 3.9 but only 4 MiB/s for 3.8) with this. I also get 4.8 MiB/s with 2.7:

    python3 -c 'while (1): print (1)' | pv > /dev/null
If I make Perl behave like yes and print a character and a newline, it has a jump of its own. The following gives me 37.3 MiB per second.

    perl5.34.0 -e 'while (){ print "1\n" }' | pv > /dev/null
Interestingly, using Perl's say function (which is like a Println) slows it down significantly. This version is only 7.3 MiB/s.

    perl5.34.0 -E 'while (1) {say 1}' | pv > /dev/null
Go 1.18 has 940 KiB/s with fmt.Print and 1.5 MiB/s with fmt.Println for some comparison.

    package main

    import "fmt"

    func main() {
            for {
                    fmt.Println("1")
            }
    }

These are all macports builds.


For me:

Python3: 3 MiB/s

Node: 350 KiB/s

Lua: 12 MiB/s

  lua -e 'while true do io.write("1") end' | pv > /dev/null
Haskell: 5 MiB/s

  loop = do
    putStr "1"
    loop

  main = loop
Awk: 4.2 MiB/s

  yes | awk '{printf("1")}' | pv > /dev/null


Lua is an interesting one.

    while true do
      io.write "1"
    end
PUC-Rio 5.1: 25 MiB/s

PUC-Rio 5.4: 25 MiB/s

LuaJIT 2.1.0-beta3: 550 MiB/s <--- WOW

They all go slightly faster if you localize the reference to `io.write`

    local write = io.write
    while true do
      write "1"
    end


> They all go slightly faster if you localize the reference to `io.write`

No noticeable difference for LuaJIT, which makes sense, since JIT should figure it out without help.


And this, folks, is why you have immutable modules. If you know before runtime what something is, lookup is a lot faster.


Ah yes you're right. Basically no difference with LuaJIT.

5.1 and 5.4 show about ~8% improvement.


Haskell can be even simpler:

    main = putStr (repeat '1')
[Edit: as pointed out below, this is no longer the case!]

Strings are printed one character at a time in Haskell. This choice is justified by the unpredictability of the interaction between laziness and buffering; I'm uncertain it's the correct choice, but the proper response is to use Text where performance is relevant.


Wow, this does 160 MiB/s. That's a huge improvement! The output of strace looks completely different:

  poll([{fd=1, events=POLLOUT}], 1, 0)    = 1 ([{fd=1, revents=POLLOUT}])
  write(1, "11111111111111111111111111111111"..., 8192) = 8192
  poll([{fd=1, events=POLLOUT}], 1, 0)    = 1 ([{fd=1, revents=POLLOUT}])
  write(1, "11111111111111111111111111111111"..., 8192) = 8192
With the recursive code, it buffered the output in the same way but bugged the kernel a whole lot more in-between writes. Not exactly sure what is going on:

  poll([{fd=1, events=POLLOUT}], 1, 0)    = 1 ([{fd=1, revents=POLLOUT}])
  write(1, "11111111111111111111111111111111"..., 8192) = 8192
  rt_sigprocmask(SIG_BLOCK, [INT], [], 8) = 0
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=920390843}) = 0
  rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
  rt_sigprocmask(SIG_BLOCK, [INT], [], 8) = 0
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=920666397}) = 0
  ...
  rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
  poll([{fd=1, events=POLLOUT}], 1, 0)    = 1 ([{fd=1, revents=POLLOUT}])
  write(1, "11111111111111111111111111111111"..., 8192) = 8192


I'm honestly surprised either of them wind up buffered! That must be a change since I stopped paying as much attention to GHC.

I'm also not sure what's going on in the second case. IIRC, at some point historically, a sufficiently tight loop could cause trouble with handling SIGINT, so it might be related to some overaggressive workaround for that?


On my extremely old desktop PC (Phenom II 550) running an out-of-date OS (Slackware 14.2):

Bash:

    while :; do printf "1"; done  | ./pv > /dev/null
    [ 156KiB/s]
Python3 3.7.2:

    python3 -c 'while (1): print (1, end="")' | ./pv > /dev/null
    [1,02MiB/s]
Perl 5.22.2:

    perl -e 'while (true) {print 1}'  | ./pv > /dev/null
    [3,03MiB/s]
Node.js v12.22.1:

    node -e 'while (1) process.stdout.write("1");' | ./pv > /dev/null
    [ 482KiB/s]


Potential buffering issues aside, as others have pointed out the node.js example is performing asynchronous writes, unlike the other languages' examples (as far as I know).

To do a proper synchronous write, you'd do something like:

  node -e 'const { writeSync } = require("fs"); while (1) writeSync(1, "1");' | pv > /dev/null
That gets me ~1.1MB/s with node v18.1.0 and kernel 5.4.0.


You're testing a very specific operation, a loop, in each language to determine its speed; I'm not sure I'd generalize that. I wonder what it'd look like if you replaced the loop with static print statements thousands of characters long with line breaks, the sort of thing that compiler optimizations produce.


I find that NodeJS eventually runs out of memory and crashes in applications that do a large amount of data processing over a long time with few breaks, even when there are no memory leaks.

Edit: I've found this consistently while building multiple data processing applications over multiple years at multiple companies.


Perhaps different approaches to buffering?

I'm reminded of this StackOverflow question, Why is reading lines from stdin much slower in C++ than Python?

https://stackoverflow.com/q/9371238/


I'll tell you what's fun. I get 5MB/sec with Python, 1.3MB/sec with Node and.... 12.6MB/sec with Ruby! :-) (Added: Same speed as Node if I use $stdout.sync = true though..)


Python pushes 15MiB/s on my M1 Pro if you go down a level and use sys directly.

   python3 -c 'import sys
   while (1): sys.stdout.write("1")'| pv>/dev/null


That buffers the output, though. You can see it when you strace it.


    python3 -u -c 'import sys
      while (1): sys.stdout.write("1")'| pv>/dev/null
427KiB/s

    python3 -c 'import sys
      while (1): sys.stdout.write("1")'| pv>/dev/null
6.08MiB/s

Using python 3.9.7 on macOS Monterey.


Good point, but so does a standard print call. Calling flush() after each write does bring the perf down to 1.5MiB/s.


I was getting different results depending on when I run it. Took me a second to realize it was my processor frequency scaling.


What version of node are you using? It seems to run indefinitely on 14.19.3 that comes with Ubuntu 20.04.


Using `sys.stdout.write()` instead of `print()` gets ~8MiB/s on my machine.



