> CPU performance has completely outstripped memory perf, so memory hierarchies ...

zenhack · on Feb 15, 2020

> Set a pointer to and address and write to it. Better yet, I can define a packed struct that maps to a peripheral, point it to its memory address from a data sheet, and have a nice human readable way of controlling it: MyPIECDevice.sample_rate = 2000.

Just make sure you don't forget `volatile` in the right places. A lot of codebases end up just using their own wrappers written in asm for this kind of thing, because the developers have learned (rightly or wrongly) not to trust the compiler.

To be clear, it's not that hard to get the memory layout semantics you want in C. But issues around concurrent access, when it is acceptable for the compiler to omit loads & stores, whether an assignment is guaranteed to be a single load/store or possibly be split up (affects both semantics in the case of mmio and also atomicity), are all subtle questions, the answers to which are not at all suggested by the form of the code; The language is very much designed with the assumptions that (1) memory is just storage, so it's not important to be super precise on how reads and writes actually get done (in fairness, the lack of optimization in the original compilers probably made this more straightforward), and (2) concurrent access isn't really that important (the standard was completely silent on the issue of concurrency until C11). If you care about these issues there's a lot of rules lawyering you have to do to be sure your code isn't going to break if the compiler is cleverer than you are. A modern take on C should be much more explicit about semantically meaningful memory access.

I think you can make a sensible argument that wrt hierarchies C is at least not a heck of a lot worse than the instruction set, so maybe I'm conceding that point -- though the instruction set hides a lot that's going on implicitly too. Some of this though I think is the ISA "coddling" C and C programs; in a legacy-free world it might make more sense to have an ISA let the programmer deal with issues around cache coherence. I could imagine some smartly designed system software using the cache in ways that can't be done right now (example: a copying garbage collector with thread-local nurseries that are (1) small enough to fit in cache (2) never evicted and (3) never synced to main memory, because they're thread-local anyway). Experimental ISA design is well outside my area of competency though, so it's possible I'm talking out of my ass. But the general sentiment that modern ISAs hide a lot from the systems programmer and that other directions might make sense is something that I've heard more knowledgeable people suggest as well.

madmax96 · on Feb 15, 2020

>If you care about these issues there's a lot of rules lawyering you have to do to be sure your code isn't going to break if the compiler is cleverer than you are.

>A modern take on C should be much more explicit about semantically meaningful memory access.

If you are working on concurrent code close to the hardware you’re going to either have to accept a less efficient language or engage in rule lawyering. Unfortunately, granting the compiler license to perform the most mundane optimizations interferes with concurrent structures. Fortunately, with C there are rules to lawyer with, and they actually are simple. No matter what, rules will always need learned.

dooglius · on Feb 15, 2020

I definitely agree with all your criticisms of the memory semantics in C, and I would love a language that fixed these flaws, but the "ideal" low-level language is still a lot closer to C than it is to anything else. I also think that C, being low-level, is much better poised to deal with experimental ISA designs than higher-level languages. For instance, one mechanism of manual cache control could be that you set bit 63 in a pointer to indicate that loads/stores from should place the corresponding cacheline in a high-priority. That's pretty trivial with a pointer in C, but a lot harder with say a C++ reference.

naasking · on Feb 15, 2020

> It's trivial to layout memory as you please, where you please, very directly, in C

It wasn't trivial before fixed width integral types, which is fairly recent in C terms (C99), and it's still far more complicated than it needs to be.

Furthermore, the fact that C is the defacto language of performance means that our hardware has been constrained by needing to run C programs well in order to compete.

Think of all the interesting innovation we could have had without such constraints. For instance, see how powerful and versatile GPUs have become because they didn't carry that legacy.

madmax96 · on Feb 15, 2020

>see how powerful and versatile GPUs have become because they didn't carry that legacy.

GPUs are the best example for why C is a good lower-level high-level language, seeing how CUDA is programmed in C/C++.

Do you have any examples of architectures that could exist if only they weren't constrained by legacy C?

naasking · on Feb 15, 2020

> GPUs are the best example for why C is a good lower-level high-level language, seeing how CUDA is programmed in C/C++.

CUDA is not C or C++. That you can program GPUs in a C/C++-like language does not entail that C/C++ is a natural form of expression for that architecture.

> Do you have any examples of architectures that could exist if only they weren't constrained by legacy C?

Turing tarpit means that every architecture could be realized, but that doesn't make it a particularly efficient or a natural fit for the hardware.

For instance, consider that every garbage collected language must distinguish pointers from integer types, but no such distinction exists in current hardware, and the bookkeeping required can incur significant performance and memory constraints (edit: C also makes this distinction but it doesn't enforce it).

Lisp machines and tagged hardware architectures do make such a distinction though, and so more naturally fit. With such distinctions, you could even have a hardware GC.

madmax96 · on Feb 15, 2020

>That you can program GPUs in a C/C++-like language does not entail that C/C++ is a natural form of expression for that architecture.

It's not a matter of what is/isn't a "natural form of expression." The point of C/C++ is to be high-level enough for humans to build their own abstractions over hardware. (sounds like an OS, right?) The success of the design of C/C++ is in that the creators had no knowledge of modern GPUs, yet GPUs can efficiently execute them with a little care from developers. We use other abstractions (e.g. SciPy on Tensorflow) because they are more appropriate to solve our problems, but they are built on C.

>Lisp machines and tagged hardware architectures do make such a distinction though, and so more naturally fit. With such distinctions, you could even have a hardware GC.

And why would that not be backwards-compatible with legacy C?

Particularly, I am rejecting the idea that C is somehow stunting hardware development - I see no evidence of this fact. I am also skeptical about the claim (although I will not reject it outright) that there is a language substantially better fit compared to C for low-level programming (e.g. embedded, kernel).

naasking · on Feb 16, 2020

> It's not a matter of what is/isn't a "natural form of expression." The point of C/C++ is to be high-level enough for humans to build their own abstractions over hardware.

Sure it matters. If primitives don't map naturally to the hardware, then you have to build a runtime to emulate those primitives, just like GC'd languages do.

> The success of the design of C/C++ is in that the creators had no knowledge of modern GPUs, yet GPUs can efficiently execute them with a little care from developers

You cannot run any arbitrary C program on a GPU. This fact is exactly why GPUs were able to innovate without legacy compatibility holding them back.

Only later were GPUs generalised to support more sophisticated programs, which then permitted a subset of C to execute efficiently.

The progress of GPUs proves exactly the opposite point that you are claiming. If C were so perfectly suited to any sort of hardware, then GPUs would have been able to run C programs right from the beginning, which is not true.

> And why would that not be backwards-compatible with legacy C?

That's not the point I'm making. Turing equivalence ensures that compatibility can be assured no matter what.

The actual point is that CPU innovations were tested against C benchmark suites to check whether innovations effectively improved performance, and some or many of those that failed to show meaningful improvements were discarded, despite the fact that they would have had other benefits (obviously not all of them, but enough). It's simply natural selection for CPU innovation.

It's incredibly naive to think that only hardware influences software and not vice versa. For instance, who would create a hardware architecture that didn't have pointers? It would simply never happen, because efficient C compatibility is too important.

The problem is that C was given a disproportionately heavy weighting in these decisions. For instance, a tagged memory architecture would show zero improvement on C benchmarks, but it would have been huge for the languages that now dominate the software industry.

> that there is a language substantially better fit compared to C for low-level programming (e.g. embedded, kernel).

The limitations of C are well known (poor bit fields and bit manipulation, poor support for alignment and padding, no modules, poor standard library, etc, etc.).

Zig addresses some of those issues. Ada has been better than C for a long time. A better language than all of these could definitely be designed given enough resources, eg. see the research effort "House" [1].

[1] http://programatica.cs.pdx.edu/House/

madmax96 · on Feb 16, 2020

>If primitives don't map naturally to the hardware, then you have to build a runtime to emulate those primitives, just like GC'd languages do.

That's only half the equation. Hardware cannot save you from semantics that are less efficient. To use your example: every GC'd language must have a runtime system track objects, whether that is implemented with or without hardware support. That system constitutes additional overhead -- either precious silicon is used delivering hardware support for GC or clock cycles are used emulating that support. Either way, you're losing performance. C/C++ have semantics that are easy to support, in contrast.

>You cannot run any arbitrary C program on a GPU.

Nor can you run any arbitrary C/C++ program written for Posix on Windows, or a program written for the x86 on a STM32, etc. You have always had to know your platform with C/C++. The point is that they are flexible enough to work very well on many platforms.

>This fact is exactly why GPUs were able to innovate without legacy compatibility holding them back.

GPUs have become a lucrative business precisely because they have begun exposing a C++ interface. Look at how the usage of graphics cards have changed in recenter years.

> If C were so perfectly suited to any sort of hardware, then GPUs would have been able to run C programs right from the beginning, which is not true.

No. GPUs _were not_ general purpose compute devices from the beginning, as you pointed out. You had GLSL, etc. but the interface exposed to programmers was not Turing-complete. From what I gather, GPUs have only had a Turing-complete interface since shader model 3.0, which first appeared in 2004. By 2007, you had nvcc. Today, C++ is very well supported by CUDA. You may as well be saying "You can't run C on a cardboard box, so it's obviously not well-suited to all hardware." Obviously, your hardware needs to expose a Turing-complete interface for a Turing-complete language to be able to run on it.

>The problem is that C was given a disproportionately heavy weighting in these decisions. For instance, a tagged memory architecture would show zero improvement on C benchmarks, but it would have been huge for the languages that now dominate the software industry.

At what cost? As I already pointed out, adding support for VHLLs at the hardware level means you are spending silicon space on that task => languages like C will be slower. Yes, a lot of software is written in JavaScript, Java, and Python, and these languages would benefit from that hardware support. But people using JavaScript, Java, and Python generally are relying on C services (memcached, redis, postgre, etc) to do their heavy-lifting anyway, which you just made slower.

>For instance, who would create a hardware architecture that didn't have pointers? It would simply never happen, because efficient C compatibility is too important.

No. It would never happen because the machine you just described would make a very bad general purpose computer.

>The limitations of C are well known

Yes, they are. But everything you listed isn't substantial. It's C, with a better standard library, standard support for controlling alignment/padding, and modules. That's not significantly different.

ori_b · on Feb 18, 2020

> CUDA is not C or C++.

Cuda is a C++ API. On modern hardware, it's programmed in purely standard C++.

nickitolas · on Feb 15, 2020

Give me an example of where C is allowed to optimize your data layout and/or locality. Afaik it is incredibly restrictive in this sense, because of how well defined it is. The less things it gives as guarantees with regards to layout the more wiggle room it would have, and languages like C cannot do some things that languages with a gc can do that can improve cache locality.

dooglius · on Feb 15, 2020

It's not, that's the point, the language/compiler cannot interfere with the programmer fine tuning data structures to suit the underlying architecture.

nickitolas · on Feb 15, 2020

I thought the discussion was around compiler optimizations? That's the point

dooglius · on Feb 15, 2020

I didn't read it as such. The point behind what I and the parent are saying is, the programmer is going to do much better at optimal memory layout than an optimizer can, and C allows manual control while languages which can mess with memory layout necessarily cannot.