The companion paper to Thrill, with more details on its architecture and some benchmarks and comparisons to Spark and Flink: https://arxiv.org/abs/1608.05634
Thx for the link to the paper. It is a useful read in its own right.
It provides an overview of other processing frameworks,
explains why C++ was chosen, and explains various bottlenecks and their effects.
I had worked with k-means before, and I'm happy to see it as part of the benchmarking, as it seems to be one of the more widely used approaches for unsupervised learning.
In my view, Thrill is similar to Python's Dask in its objectives of composability and integration into existing code.
There's also Blogel [0], a distributed graph-processing framework in C++ that runs significantly faster than its Java counterpart, Apache Giraph [1].
I have started to wonder whether big data developers really care about speed; the advantages of these Java systems start to fade when compared with their C++ counterparts.
I think with Thrill, there are two different skill levels to be distinguished:
- Using it to implement things should be fairly easy and doesn't require advanced knowledge of C++. Basically, you plug lambdas that do the processing into the provided operations, similar to Spark, but using C++ syntax. It might require some skill at parsing compiler errors, but altogether it shouldn't be too different from using Spark with Java/Scala.
- Extending Thrill requires familiarity with modern C++, possibly including advanced template tricks.
Since there isn't a whole lot of advanced functionality available for Thrill (yet), people with the latter skills would most likely be required at the moment. But in a world where the same libraries available for Spark were also available for Thrill or a similar C++ framework, that wouldn't be the case. Note that Thrill is currently quite experimental.
I guess it's a trade-off, but dismissing the potential for 10x runtime gains "because C++" seems too one-sided. That isn't to say that the C++ frameworks don't have a long way to go before they can rival Spark etc in ease of use and tooling, they do! But at least they point out the inefficiencies and potential for improvement in these existing systems.
Thrill and STXXL are both developed in the same group at KIT (I work there, too, but I'm not directly involved). Thrill also reuses some parts of STXXL, and does so completely transparently to the user - if memory doesn't suffice, it'll use the disk.
Does anybody know how this is different from Spark? These Distributed Immutable Arrays sound suspiciously similar to Spark's Resilient Distributed Datasets. Is it just the choice of C++ as opposed to Scala that would make this more efficient?
Also, I wonder if and how they implemented the concept of lineage (unless these DIAs are not really very resilient)... I thought Spark relied on Scala's delayed evaluation to do that, though I may be mistaken.
Do you have any plans regarding how you'd implement resilience and the equivalent of Spark's concept of 'lineage', where you keep a history of how a given RDD was computed, and then you can recompute it if it gets lost?
I haven't looked into Spark in depth, but I believe that 'lineage' relies heavily on Scala's delayed evaluation and the underlying Java RMI facilities. Doing something similar in C++ may require a lot more effort and a significantly different set of tradeoffs regarding the performance model.
I'm not that directly involved in Thrill, so I can't really speak with authority. There aren't any concrete plans on fault tolerance but it would certainly be an interesting topic to work on, partially because the existing solutions seem quite inefficient.
The JVM was obviously never designed for functional programming. I would like to see something like Thrill built in Haskell or OCaml, both of which generate efficient native code, support unboxed arrays and have metaprogramming/staging facilities that go beyond templates. GHC Haskell even has declarative rewrite rules for compile-time fusion.
I'm still not convinced C++ is necessary to outperform Spark, especially as high-level features like reference counting and lambdas are being used.
These high-level features don't have any performance impact:
Reference counting is used on very large objects or for things that aren't touched a lot, so there is no measurable impact on performance or memory use.
Lambdas don't have any performance impact. But you could just as well plug in any other functor. In fact, if you chain a map and a filter in Thrill, the two will be joined by the compiler (take element, apply both, proceed to next element). This would not be possible with old-school function pointers.
I wasn't suggesting they have a performance impact for this particular project.
I'm asking why choose C++ for Thrill if you want essentially non-deterministic automatic memory management and lambdas. Performing e.g. map fusion as you describe is commonplace for functional language compilers. For example, Haskell's vector library has been doing this since 2008.
My hypothesis is that your high-performance implementation could be realised using safer (and I would argue more appropriate) functional languages.
I guess it comes down to familiarity and experience with the language and tooling, as well as predictability of performance. Achieving C++-like performance in Haskell seems possible from all I've seen, but also requires a lot of experience.
I'd love to see a similar project realised in a functional language, that would be quite exciting!