I get around 1.56MiB/s with that code. PHP gets 4.04MiB/s. Python gets 4.35MiB/s.
> What's also interesting is that node crashes after about a minute
I believe this is because `while(1)` runs so fast that there is no "idle" time for V8 to actually run GC. V8 is a strange beast, and this is just a guess of mine.
The following code shouldn't crash, give it a try:
>> What's also interesting is that node crashes after about a minute
> I believe this is because `while(1)` runs so fast that there is no "idle" time for V8 to actually run GC. V8 is a strange beast, and this is just a guess of mine.
Not exactly: the GC is still running; it’s live memory that’s growing unbounded.
What’s going on here is that WritableStream is non-blocking; it has advisory backpressure, but if you ignore that it will do its best to accept writes anyway and keep them in a buffer until it can actually write them out. Since you’re not giving it any breathing room, that buffer just keeps growing until there’s no more memory left. `process.nextTick()` is presumably slowing things down enough on your system to give it a chance to drain the buffer. (I see there’s some discussion below about this changing by version; I’d guess that’s an artifact of other optimizations and such.)
To do this properly, you need to listen to the return value from `.write()` and, if it returns false, back off until the stream drains and there’s room in the buffer again.
Here’s the (not particularly optimized) function I use to do that:
async function writestream(chunks, stream) {
  for await (const chunk of chunks) {
    if (!stream.write(chunk)) {
      // When write() returns false, the stream is starting to buffer and we
      // need to wait for it to drain (otherwise we'll run out of memory!)
      await new Promise(resolve => stream.once('drain', resolve))
    }
  }
}
I do wish Node made it more obvious what was going on in this situation; this is a very common mistake with streams and it’s easy to not notice until things suddenly go very wrong.
ETA: I should probably note that transform streams, `readable.pipe()`, `stream.pipeline()`, and the like all handle this stuff automatically. Here’s a one-liner, though it’s not especially fast:
Are there still no async write functions that handle this more easily than the old event-based mechanism? Waiting for drain also sounds like it might reduce throughput, since then there is 0 buffered data and the peer would be forced to pause reading. A "writable" event sounds more appropriate, but the Node docs don't mention one.
Hm, strange. With the same out of memory error as before or a different one? Tried running that one for 2 minutes, no errors here, and memory stays constant.
Huh, yeah, seems to be an old memory leak. Running it on v10.24.0 crashes for me too.
After some quick testing in a couple of versions, it seems like it got fixed in v11 at least (didn't test any minor/patch versions).
By the way, all versions up to NodeJS 12 (LTS) are "end of life" and should probably not be used if you're downloading 3rd party dependencies, as there are a bunch of security fixes since then that are not being backported.
> I believe this is because `while(1)` runs so fast that there is no "idle" time for V8 to actually run GC. V8 is a strange beast, and this is just a guess of mine.
Java has (had) weird idiosyncrasies like this as well. It doesn't crash, but depending on the construct you can get performance degradation based on how the language inserts safepoints (points where the VM is in a knowable state and a thread can be safely paused for GC or whatever).
I don't know if this holds today, but I know there was a time where you basically wanted to avoid looping over long-type variables, as they had different semantics. The details are a bit fuzzy to me right now.
If you ever need to write a random character to a pipe very fast, GNU coreutils has you covered with yes(1). It runs at about 6 GiB/s on my system:
yes | pv > /dev/null
There's an article floating around [1] about how yes(1) is extremely optimized considering its original purpose. In case you're wondering, yes(1) is meant for commands that (repeatedly) ask whether to proceed, expecting a y/n input or something like that. Instead of repeatedly typing "y", you just run "yes | the_command".
Not sure about how yes(1) compares to the techniques presented in the linked post. Perhaps there's still room for improvement.
Honest question: what are the practical use cases of this?
Repeatedly typing the 'y' character into a Linux pipe is surely not that common, especially at that bit rate. Also seems like the bottleneck would always be the consuming program...
Historically, you could have dirty filesystems after a reboot that "fsck" would ask an absurd number of questions about ("blah blah blah inode 1234567890 fix? (y/n)"). Unless you were in a very specific circumstance, you'd probably just answer "y" to them. It could easily ask thousands of questions though. So: "yes | fsck" was not uncommon.
It's probably still common in installation scripts, like in Dockerfiles. `apt-get install` has the `-y` option, but it would be useful for other programs that don't have one.
Just to clarify: I was applying "historically" to "fsck", not to the use of "yes" in general. I can't remember the last time I've had the need to use "yes | fsck"
> Honest question: what are the practical use cases of this?
It also allows you to script otherwise-interactive command line operations with the correct answer. Many command line tools nowadays provide specific options to override queries, but there are still a couple of holdouts that might not.
> Repeatedly typing the 'y' character into a Linux pipe is surely not that common, especially at that bit rate.
At that rate, no, but I definitely use it once in a while. For example, if I copy quite a few files and then get repeatedly asked whether I want to overwrite the destination (when it's already present). Sure, I could get my command back and use the proper flag to "cp" or whatever to overwrite, but it's usually much quicker to just get back the previous line, go to the beginning (C-a), type "yes | ", and be done with it.
Note that you can pass a parameter to "yes" and then it repeats what you passed instead of 'y'.
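A quick illustration of that:

```shell
# With an argument, yes repeats that string instead of "y":
yes n | head -n 3
# n
# n
# n
```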
It is optimized quite seriously. I remember there was a comparison of it with, I believe, a BSD version, where the latter was a thousand times more readable (although slower).
I'm getting ~3.10GiB/s with both GNU's and FreeBSD's. I do see that GNU's version has some optimizations, but their effectiveness isn't apparent when doing `yes | pv > /dev/null`.
However, my point was just that its performance was never a main point of it. Even without optimizations, it's still very fast, and I don't think whoever created it first was concerned with it having to be super fast, as long as it was faster than the prompts of whatever was downstream in the pipe.
A major contributing factor is whether or not the language buffers output by default, and how big the buffer is. I don't think NodeJS buffers, whereas Python does. Here are some comparisons with Go (which does not buffer by default):
- Node (no buffering): 1.2 MiB/s
- Go (no buffering): 2.4 MiB/s
- Python (8 KiB buffer): 2.7 MiB/s
- Go (8 KiB buffer): 218 MiB/s
Go program:
package main

import ("bufio"; "os")

func main() {
	// 8 KiB buffered writer: write() syscalls only happen when the buffer fills.
	f := bufio.NewWriterSize(os.Stdout, 8192)
	for {
		f.WriteRune('1')
	}
}
Not specifically directed at you, but it's a bit amusing watching a younger generation of programmers rediscover things like this, which seemed hugely important in like 1990 but largely don't matter that much to modern workflows with dedicated APIs or various shared memory or network protocols, as not much that is really performance-critical is typically piped back and forth anymore.
More than a few old backup or transfer scripts had extra dd or similar tools in the pipeline to create larger and semi-asynchronous buffers, or to re-size blocks on output to something handled better by the receiver, which was a big deal on high speed tape drives back in the day. I suspect most modern hardware devices have large enough static RAM and fast processors to make that mostly irrelevant.
I did the same test, but added a rust and bash version. My results:
Rust: 21.9MiB/s
Bash: 282KiB/s
PHP: 2.35MiB/s
Python: 2.30MiB/s
Node: 943KiB/s
In my case, node did not crash after about two minutes. I find it interesting that PHP and Python are comparable for me but not for you, though I'm sure there's a plethora of reasons to explain that. I'm not surprised rust is vastly faster and bash vastly slower; I just thought it interesting to compare since I use those languages a lot.
Rust:
fn main() {
    loop {
        print!("1");
    }
}
Bash (no discernible difference between echo and printf):
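(The snippet itself appears to have been eaten by the formatting; presumably something like the loop below. Wrapping it in a function with an optional bound is my addition so it can be sanity-checked; with no argument it runs forever, as in the benchmark.)

```shell
# Print '1' in a tight loop; printf and echo -n behave the same here.
ones() {
  i=0
  while [ -z "$1" ] || [ "$i" -lt "$1" ]; do
    printf 1
    i=$((i + 1))
  done
}

# Benchmark usage: ones | pv > /dev/null
```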
For languages like C, C++, and Rust, the bottleneck is going to mainly be system calls. With a big buffer, on an old machine, I get about 1.5 GiB/s with C++. Writing 1 char at a time, I get less than 1 MiB/s.
#include <cstddef>
#include <chrono>
#include <cassert>
#include <cstdio>
#include <cstring>
#include <cstdlib>
#include <unistd.h>

int main(int argc, char **argv) {
    int rv;
    assert(argc == 3);
    const unsigned int n = std::atoi(argv[1]);  // bytes per write()
    char *buf = new char[n];
    std::memset(buf, '1', n);
    const unsigned int k = std::atoi(argv[2]);  // number of write() calls
    auto start = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < k; i++) {
        rv = write(1, buf, n);
        assert(rv == int(n));
    }
    auto stop = std::chrono::high_resolution_clock::now();
    auto duration = stop - start;
    std::chrono::duration<double> secs = duration;
    std::fprintf(stderr, "buffer size: %u, num syscalls: %u, perf: %f MiB/s\n",
                 n, k, (double(n) * k) / (1024 * 1024) / secs.count());
    delete[] buf;
}
EDIT: Also note that a big write to a pipe (bigger than PIPE_BUF) may require multiple syscalls on the read side.
EDIT 2: Also, it appears that the kernel is smart enough to not copy anything when it's clear that there is no need. When I don't go through cat, I get rates that are well above memory bandwidth, implying that it's not doing any actual work:
I suspect (but am not sure) that the shell may be doing something clever for a stream redirection (>) and giving your program a STDOUT file descriptor directly to /dev/null.
I may be wrong, though. Check with lsof or similar.
There's no special "no work" detection needed. a.out is calling the write function for the null device, which just returns without doing anything. No pipes are involved.
Seems like it's buffering output, which Python also does. Python is much slower if you flush every write (I get 2.6 MiB/s default, 600 KiB/s with flush=True).
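For reference, the flushing variant in Python is just a keyword argument (wrapped in a helper function here so it's easy to poke at):

```python
def write_char(flush: bool) -> None:
    # With flush=False (the default), output sits in Python's ~8 KiB
    # stdout buffer; flush=True forces a write() syscall per call.
    print("1", end="", flush=flush)
```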
Interestingly, Go is very fast with a 8 KiB buffer (same as Python's), I get 218 MiB/s.
Why is it cheating to use a buffer? This is the behavior you would get in C if you used the C standard library (putc/fputc) instead of a system call (write).
That manages about 7 GiB/s reusing the same buffer, or about 300 MiB/s with clearing and refilling the buffer every time
(the magic is in using java’s APIs for writing to files/sockets, which are designed for high performance, instead of using the APIs which are designed for writing to stdout)
`process.stdout.write` is different from PHP's `echo` and Python's `print` in that it pushes a write to an event queue without waiting for the result, which can end up filling the event queue with writes. Instead, you can consider `await`-ing the `write` so that it completes before another `write` is pushed onto the queue.
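For what it's worth, `write()` doesn't return a promise, but it does take a completion callback, so you can wrap it yourself (a sketch; `writeAsync` is my name for it):

```javascript
// Resolve once the chunk has actually been handled (flushed or errored),
// rather than queueing blindly.
function writeAsync(stream, chunk) {
  return new Promise((resolve, reject) => {
    stream.write(chunk, (err) => (err ? reject(err) : resolve()));
  });
}
```

Awaiting this for every single byte serializes the writes, though, which costs throughput; the `drain`-based approach elsewhere in the thread keeps the buffer partially full instead.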
For Python 3.10.4, I get about 2.8 MiB/s as you have it written, but around 5 MiB/s (same for 3.9 but only 4 MiB/s for 3.8) with this. I also get 4.8 MiB/s with 2.7:
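The snippet got lost in the formatting here; the usual faster variant is `sys.stdout.write` in a bare loop, which is presumably what was meant. A sketch (the optional bound is my addition so it can be checked; `spam_ones()` with no argument reproduces the infinite benchmark):

```python
import sys

def spam_ones(n=None):
    # sys.stdout.write avoids print()'s per-call overhead; output is still
    # block-buffered when piped.
    i = 0
    while n is None or i < n:
        sys.stdout.write("1")
        i += 1
```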
[Edit: as pointed out below, this is no longer the case!]
Strings are printed one character at a time in Haskell. This choice is justified by the unpredictability of the interaction between laziness and buffering; I'm not sure it's the correct choice, but the proper response is to use Text where performance is relevant.
With the recursive code, it buffered the output in the same way but bugged the kernel a whole lot more in-between writes. Not exactly sure what is going on:
I'm honestly surprised either of them wind up buffered! That must be a change since I stopped paying as much attention to GHC.
I'm also not sure what's going on in the second case. IIRC, at some point historically, a sufficiently tight loop could cause trouble with handling SIGINT, so it might be related to some overaggressive workaround for that?
Potential buffering issues aside, as others have pointed out the node.js example is performing asynchronous writes, unlike the other languages' examples (as far as I know).
To do a proper synchronous write, you'd do something like:
You're testing a very specific operation, a loop, in each language to determine its speed; I'm not sure I'd generalize from that. I wonder what it'd look like if you replaced the loop with static print statements thousands of characters long with line breaks, the sort of thing that compiler optimizations produce.
I find that NodeJS eventually runs out of memory and crashes with applications that do a large amount of data processing over a long time with few breaks, even when there are no memory leaks.
Edit: I've found this consistently building multiple data processing applications over multiple years and multiple companies
I'll tell you what's fun. I get 5MB/sec with Python, 1.3MB/sec with Node and.... 12.6MB/sec with Ruby! :-) (Added: Same speed as Node if I use $stdout.sync = true though..)
PHP comes in at about 900KiB/s:
Python is about 50% faster at about 1.5MiB/s. Javascript is slowest at around 200KiB/s. What's also interesting is that node crashes after about a minute. All results from within a Debian 10 docker container with the default repo versions of PHP, Python and Node.

Update:
Checking with strace shows that Python buffers the output:
Outputs a series of:

PHP and JS do not. So the Python equivalent would be:

Which makes it comparable to the speed of JS. Interesting that PHP is over 4x faster than Python and JS.