The companion paper to Thrill, with more details on its architecture and some benchmarks and comparisons to Spark and Flink: https://arxiv.org/abs/1608.05634
Thx for the link to the paper. It is a useful read in its own right.
It provides an overview of other processing frameworks,
explains why C++ was chosen, and explains various bottlenecks and their effects.
I had worked with k-means before, and I'm happy to see it as part of the benchmarking, as it seems to be one of the more widely used approaches for unsupervised learning.
In my view, Thrill is similar to Python's Dask in its objectives of composability and integration into existing code.
There's also Blogel [0], a distributed graph-processing framework in C++ that runs significantly faster than its Java counterpart, Apache Giraph [1].
I have started to wonder whether big data developers really care about speed; the advantages of these Java systems start to fade when compared with their C++ counterparts.
I think with Thrill, there are two different skill levels to be distinguished:
- Using it to implement things should be fairly easy and doesn't require advanced knowledge of C++. Basically, you plug lambdas that do the processing into the provided operations, similar to Spark, but using C++ syntax. It might require some skill at parsing compiler errors, but altogether it shouldn't be too different from using Spark with Java/Scala.
- Extending Thrill requires familiarity with modern C++, possibly including advanced template tricks.
Since there isn't a whole lot of advanced functionality available for Thrill (yet), people with the latter skills would most likely be required at the moment. But in a world where the same libraries available for Spark were also available for Thrill or a similar C++ framework, that wouldn't be the case. Note that Thrill is currently quite experimental.
I guess it's a trade-off, but dismissing the potential for 10x runtime gains "because C++" seems too one-sided. That isn't to say that the C++ frameworks don't have a long way to go before they can rival Spark etc in ease of use and tooling, they do! But at least they point out the inefficiencies and potential for improvement in these existing systems.
Thrill and STXXL are both developed in the same group at KIT (I work there, too, but I'm not directly involved). Thrill also reuses some parts of STXXL, and does so completely transparently to the user - if memory doesn't suffice, it'll use the disk.
Does anybody know how this is different from Spark? These Distributed Immutable Arrays sound suspiciously similar to Spark's Resilient Distributed Datasets. Is it just the choice of C++ as opposed to Scala that would make this more efficient?
Also, I wonder if and how they implemented the concept of lineage (unless these DIAs are not really very resilient)... I thought Spark relied on Scala's delayed evaluation to do that, though I may be mistaken.
Do you have any plans regarding how you'd implement resilience and the equivalent of Spark's concept of 'lineage', where you keep a history of how a given RDD was computed, and then you can recompute it if it gets lost?
I haven't looked into Spark in depth, but I believe that 'lineage' relies heavily on Scala's delayed evaluation and the underlying Java RMI facilities. Doing something similar in C++ may require a lot more effort and a significantly different set of tradeoffs regarding the performance model.
I'm not that directly involved in Thrill, so I can't really speak with authority. There aren't any concrete plans on fault tolerance but it would certainly be an interesting topic to work on, partially because the existing solutions seem quite inefficient.
The JVM was obviously never designed for functional programming. I would like to see something like Thrill built in Haskell or OCaml, both of which generate efficient native code, support unboxed arrays and have metaprogramming/staging facilities that go beyond templates. GHC Haskell even has declarative rewrite rules for compile-time fusion.
I'm still not convinced C++ is necessary to outperform Spark, especially as high-level features like reference counting and lambdas are being used.
These high-level features don't have any performance impact:
Reference counting is used on very large objects or for things that aren't touched a lot, so there is no measurable impact on performance or memory use.
Lambdas don't have any performance impact. But you could just as well plug in any other functor. In fact, if you chain a map and a filter in Thrill, the two will be joined by the compiler (take element, apply both, proceed to next element). This would not be possible with old-school function pointers.
I wasn't suggesting they have a performance impact for this particular project.
I'm asking why choose C++ for Thrill if you want essentially non-deterministic automatic memory management and lambdas. Performing e.g. map fusion as you describe is commonplace for functional language compilers. For example, Haskell's vector library has been doing this since 2008.
My hypothesis is that your high-performance implementation could be realised using safer (and I would argue more appropriate) functional languages.
I guess it comes down to familiarity and experience with the language and tooling, as well as predictability of performance. Achieving C++-like performance in Haskell seems possible from all I've seen, but also requires a lot of experience.
I'd love to see a similar project realised in a functional language, that would be quite exciting!