>GPU programming without macros looks like then take a peek at this:
I'm not sure if you missed my examples, or if there is some other misunderstanding going on ;)
GPU code in Julia works 100% without macros, but you are free to use them, so people do ;) Also, you can pretty much write the GPU kernels like pseudocode; I'm not sure how much simpler it can get.
With your linked spiral code, I can't even really tell where the algorithm starts and where the setup code ends - which of course might easily be attributed to my unfamiliarity with Spiral ;)
I wrote an article some time ago about generic programming with Julia's gpu infrastructure:
Do you have any benchmarks for the softmax kernel? If that kernel has optimal performance, it would be quite interesting. If it's subpar, it looks much longer than a simple version.
> With your linked spiral code, I can't even really tell where the algorithm starts and where the setup code ends - which of course might easily be attributed to my unfamiliarity with Spiral ;)
Heh, I've noticed that the setup code tends to come out longer than the actual kernels. It is inside `kernel = cuda`. Spiral is indentation sensitive like Python and F#.
The stuff inside {} is just module creation, think of it like tuples with named fields.
> Do you have any benchmarks for the softmax kernel? If that kernel has optimal performance, it would be quite interesting. If it's subpar, it looks much longer than a simple version.
No, I've yet to actually benchmark it. It really depends on how good a job NVCC does with the generic sequential reduce kernel. I'll do an in-depth analysis when I am done with all the neural network work that I am doing currently, which might take a while.
You can in fact get loop unrolling with just standard functions in Spiral. I go into it in the context of this chapter. It is achieved by pattern matching over tuples and recursion.
> If you can make concrete examples of how things can be simpler than this, I'd be delighted to hear them :)
Yes, by replacing meta programming with intensional polymorphism and inlining guarantees. Also, reflection should be done using pattern matching. I think this last one could be taken entirely seriously, as it would not involve replacing the entire type system.
The claims in the article about inlining and specialization that make it sound like magic are what, in general, make me dubious about languages pretending to be speed kings. Yes, I am aware that GPU kernels do not require optimizers capable of having monads for breakfast, and that in the context of GPU programming where those claims were made they are probably true, but inlining is the sort of thing that matters more the more high level a language is. For very high level languages that desire speed, there isn't much choice but to make inlining guarantees a part of language semantics, despite the added burden that puts on the user.
Since we are still at it, I have a question I need to ask.
Recently I've been informed that Julia is capable of GCing GPU memory. If this is fully integrated that would be a major feature which is not possible in say .NET or Racket. I really wanted this in Spiral and could not get it in .NET.
By fully integrated, I mean much like for regular memory for which the GC takes note of the state of the system for when to do collection and defragmentation. If it can only make a thin wrapper with a finalizer (much like in .NET) around an unmanaged resource then it is not a big deal.
Is it fully integrated or is it a wrapper style memory management?
If it is the latter, then that is too bad, but I'd suggest to Julia devs that they work on making it fully integrated as it would be a really good feature. Obviously, I can't do it in Spiral as I would need to write my own VM and I have only so many years in my life.
As a matter of fact, it's what the compiler devs keep recommending. As a developer I must say that it's pretty relaxing to take a meta programming shortcut from time to time ;)
In general, we plan to offer a toolbox that employs these kinds of patterns for loop unrolling, tiling etc., to make it easier to write GPU code without meta programming and to perform certain optimizations/scheduling patterns based on a more trait-like system.
> Also reflection should be done using pattern matching
I feel like what Julia does is just more low level right now - And you get pretty far with just multiple dispatch! If we needed more than that, we could put the effort into extending the language in that direction. But I think your use cases must get pretty advanced until you need something more complex. Can you recommend a nice article that shows off beautiful pattern matching for reflection? I see how it sounds nice, but I can't really imagine right now how it would improve my life.
Concerning inlining, I do think we actually force codegen to always inline when compiling for the GPU. I don't remember 100% anymore, but there might have been some problems with that. But it's definitely the goal to have a flexible compiler that you can fine tune to specific domains like GPU programming. And if we don't get it as part of the compiler, we can definitely do things like that with https://github.com/jrevels/Cassette.jl/
Apart from that, I'm pretty happy that our GPU compilation infrastructure isn't making Julia less useful as a general purpose language ;)
>Is it fully integrated or is it a wrapper style memory management?
It's wrapper style. We're playing around with different kinds of hacks to make it less bad, but those are hacks you could do in any GC language. The good news is that the Julia compiler devs take GPU computing seriously and promised to work on a full integration that is aware of the GPU hardware.
> As a matter of fact, it's what the compiler devs keep recommending.
I definitely agree with them in this matter. Macros might have a role in language development, but they should not be a stand-in for compiler optimizations. For safety and speed, the type system should be there.
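To make "wrapper style" concrete, here is a minimal sketch of the pattern, assuming a made-up allocator: `fake_alloc`, `fake_free`, `LIVE_ALLOCATIONS` and `DeviceBuffer` are hypothetical names for illustration and are not taken from any Julia GPU package. Only `finalizer` and `finalize` are real Base functions.

```julia
# Hypothetical stand-ins for a driver's allocate/free calls. No real GPU API is used;
# a counter tracks how many "device" allocations are currently live.
const LIVE_ALLOCATIONS = Ref(0)
fake_alloc(nbytes::Int) = (LIVE_ALLOCATIONS[] += 1; UInt(nbytes))
fake_free(handle::UInt) = (LIVE_ALLOCATIONS[] -= 1; nothing)

# The host-side wrapper: a tiny object holding an opaque handle.
mutable struct DeviceBuffer
    handle::UInt
    nbytes::Int
    function DeviceBuffer(nbytes::Int)
        buf = new(fake_alloc(nbytes), nbytes)
        # Release the "device" memory whenever the GC collects the wrapper.
        finalizer(b -> fake_free(b.handle), buf)
        return buf
    end
end

buf = DeviceBuffer(1 << 30)   # stands for 1 GiB on the device
finalize(buf)                 # run the finalizer now instead of waiting for the GC
```

The limitation discussed above follows from this shape: the GC only sees the small wrapper object, not the gigabyte it stands for, so it has no reason to collect early when device memory runs low.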
Here is how the example you've shown could be done in Spiral. Maybe I should add express support for literal testing in pattern matching, but it is not a pattern that comes up too often by itself.
> I feel like what Julia does is just more low level right now - And you get pretty far with just multiple dispatch!
What you say is exactly right, as pattern matching compiles down to those low level operations, so there is no reason at all why those low level operations should be done by hand. Pattern matching does not depend on a particular type system or on whether the language is dynamic or static. There are only so many good ideas in programming languages and this is one of them.
Though there is some overlap between pattern matching and multiple dispatch, the roles are different. The purpose of multiple dispatch is extensibility, but the purpose of pattern matching is destructuring.
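A small sketch of the two roles, written in Julia since that is the language under discussion. All type and function names here (`Layer`, `Dense`, `Dropout`, `Scale`, `param_count`, `describe`) are made up for illustration, and Julia's built-in destructuring only covers tuples, so this shows the idea rather than full pattern matching.

```julia
# Multiple dispatch: behavior is extended by adding methods for new types.
abstract type Layer end

struct Dense <: Layer
    n_in::Int
    n_out::Int
end

struct Dropout <: Layer
    rate::Float64
end

param_count(l::Dense) = l.n_in * l.n_out + l.n_out   # weights plus biases
param_count(::Dropout) = 0

# Someone else can add a layer later without touching the code above.
struct Scale <: Layer
    n::Int
end
param_count(l::Scale) = l.n

# Destructuring: take a value apart and bind its pieces to names in one step.
function describe(shape::Tuple{Int,Tuple{Int,Int}})
    batch, (rows, cols) = shape
    return "batch=$batch rows=$rows cols=$cols"
end
```

Dispatch answers "which code runs for this type", while destructuring answers "what is inside this value"; the overlap appears when the shape of the data is itself encoded in the type.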
Here is quite a complex example of how it is used in action. I won't go into detail of this here, but you can see how I repeatedly match on the contents of a module at different times in order to get more generic functionality for the kernel.
The particular kernel shown here is the most complex one that exists in the library right now - I am yet to get to things like generic matrix multiplication and convolution, but I'll get there eventually.
> I see how it sounds nice, but I can't really imagine right now how it would improve my life.
Let me just say that it is really difficult to know ahead of time how a particular language feature would affect your programming life. I could have said the same thing about first class functions back in 2015. I am sure in the future there will be such features I can't even imagine right now.
>The purpose of multiple dispatch is extensibility, but the purpose of pattern matching is destructuring.
Agreed! ;) Just wanted to point out, that I was able to use multiple dispatch elegantly in the unroll example, where you are using pattern matching.
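For anyone reading without the linked example, unrolling via dispatch can be sketched roughly like this. This is not the original code; `unrolled_sum` is a hypothetical name chosen for illustration.

```julia
# Base case: the empty tuple has its own type, Tuple{}, so dispatch selects this method.
unrolled_sum(f, ::Tuple{}) = 0

# Recursive case: apply f to the head and recurse on the tail. A tuple's length is
# part of its type, so the recursion depth is known when the method is compiled.
unrolled_sum(f, xs::Tuple) = f(first(xs)) + unrolled_sum(f, Base.tail(xs))

total = unrolled_sum(x -> x^2, (1, 2, 3))
```

For small tuples the compiler can usually flatten this recursion into straight-line code, but whether it does is up to its heuristics rather than a language guarantee, which is the distinction being argued about in this thread.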
>pattern matching compiles down to those low level operations
So I guess that means Julia could indeed satisfy you, if we gave those low level operations some more syntactic sugar?
Have you seen: http://kmsquire.github.io/Match.jl/latest/ ?
>Let me just say that it is really difficult to know ahead of time how a particular language feature would affect your programming life
True story. I'll try to be more observant when writing code and see if I could actually solve things more elegantly with better pattern matching.
Btw, I'm writing an article about GPU programming in Julia - do you have a feature you'd like to have explained, or a killer feature you believe we can't do and that would impress you if we did?
I wrote an article some time ago about generic programming with Julia's gpu infrastructure:
https://techburst.io/writing-extendable-and-hardware-agnosti...
If you can make concrete examples of how things can be simpler than this, I'd be delighted to hear them :)
Here is another article about GPU programming in Julia, CUDA only (for no reason, actually) and with lots of macros: https://mikeinnes.github.io/2017/08/24/cudanative.html