Does anybody know how this is different from Spark? These Distributed Immutable Arrays sound suspiciously similar to Spark's Resilient Distributed Datasets. Is it just the choice of C++ as opposed to Scala that would make this more efficient?
Also, I wonder if and how they implemented the concept of lineage (unless these DIAs are not really very resilient)... I thought Spark relied on Scala's delayed evaluation to do that, though I may be mistaken.
Do you have any plans regarding how you'd implement resilience and the equivalent of Spark's concept of 'lineage', where you keep a history of how a given RDD was computed, and then you can recompute it if it gets lost?
I haven't looked into Spark in depth, but I believe that 'lineage' relies heavily on Scala's delayed evaluation and the underlying Java RMI facilities. Doing something similar in C++ may require a lot more effort and a significantly different set of tradeoffs regarding the performance model.
I'm not that directly involved in Thrill, so I can't really speak with authority. There aren't any concrete plans for fault tolerance, but it would certainly be an interesting topic to work on, partly because the existing solutions seem quite inefficient.
The JVM was obviously never designed for functional programming. I would like to see something like Thrill built in Haskell or OCaml, both of which generate efficient native code, support unboxed arrays and have metaprogramming/staging facilities that go beyond templates. GHC Haskell even has declarative rewrite rules for compile-time fusion.
I'm still not convinced C++ is necessary to outperform Spark, especially as high-level features like reference counting and lambdas are being used.
These high-level features don't have any performance impact:
Reference counting is used only on very large objects, or on things that aren't touched often, so it has no measurable impact on performance or memory use.
Lambdas don't have any performance impact either; you could just as well plug in any other functor. In fact, if you chain a map and a filter in Thrill, the two will be fused by the compiler (take an element, apply both operations, move on to the next element). That would not be possible with old-school function pointers.
I wasn't suggesting they have a performance impact for this particular project.
I'm asking why choose C++ for Thrill if you want essentially non-deterministic automatic memory management and lambdas. Performing e.g. map fusion as you describe is commonplace for functional-language compilers; for example, the `vector` library in Haskell has been doing this since 2008.
My hypothesis is that your high-performance implementation could be realised in safer (and, I would argue, more appropriate) functional languages.
I guess it comes down to familiarity and experience with the language and tooling, as well as predictability of performance. Achieving C++-like performance in Haskell seems possible from all I've seen, but also requires a lot of experience.
I'd love to see a similar project realised in a functional language, that would be quite exciting!