Does anybody know how this is different from Spark? These Distributed Immutable Arrays sound suspiciously similar to Spark's Resilient Distributed Datasets. Is it just the choice of C++ as opposed to Scala that would make this more efficient?
Also, I wonder if and how they implemented the concept of lineage (unless these DIAs are not really very resilient)... I thought Spark relied on Scala's delayed evaluation to do that, though I may be mistaken.
Do you have any plans regarding how you'd implement resilience and the equivalent of Spark's concept of 'lineage', where you keep a history of how a given RDD was computed, and then you can recompute it if it gets lost?
I haven't looked into Spark in depth, but I believe that 'lineage' relies heavily on Scala's delayed evaluation and the underlying Java RMI facilities. Doing something similar in C++ may require a lot more effort and a significantly different set of tradeoffs regarding the performance model.
I'm not that directly involved in Thrill, so I can't really speak with authority. There aren't any concrete plans for fault tolerance, but it would certainly be an interesting topic to work on, partly because the existing solutions seem quite inefficient.
The JVM was obviously never designed for functional programming. I would like to see something like Thrill built in Haskell or OCaml, both of which generate efficient native code, support unboxed arrays and have metaprogramming/staging facilities that go beyond templates. GHC Haskell even has declarative rewrite rules for compile-time fusion.
I'm still not convinced C++ is necessary to outperform Spark, especially as high-level features like reference counting and lambdas are being used.
These high-level features don't have any performance impact:
Reference counting is used only on very large objects, or on things that aren't touched often, so it has no measurable impact on performance or memory use.
Lambdas don't have any performance impact either; you could just as well plug in any other functor. In fact, if you chain a map and a filter in Thrill, the two will be fused by the compiler (take an element, apply both operations, move on to the next element). That would not be possible with old-school function pointers.
I wasn't suggesting they have a performance impact for this particular project.
I'm asking why choose C++ for Thrill if you want essentially non-deterministic automatic memory management and lambdas. Performing e.g. map fusion as you describe is commonplace for functional-language compilers; for example, the `vector` library in Haskell has been doing this since 2008.
My hypothesis is that your high-performance implementation could be realised in safer (and, I would argue, more appropriate) functional languages.
I guess it comes down to familiarity and experience with the language and tooling, as well as predictability of performance. Achieving C++-like performance in Haskell seems possible from all I've seen, but also requires a lot of experience.
I'd love to see a similar project realised in a functional language, that would be quite exciting!