There's also Blogel [0] which is a distributed graph processing framework in C++ and it runs significantly faster than its counterpart in Java, Apache Giraph [1].
I have started to wonder whether big data developers really care about speed; the advantages of these Java systems fade when compared with their C++ counterparts.
I think with Thrill, there are two different skill levels to be distinguished:
- Using it to implement things should be fairly easy and doesn't require advanced knowledge of C++. Basically, you plug lambdas that do the processing into the provided operations, similar to Spark but with C++ syntax. It may require some skill at parsing compiler errors, but altogether it shouldn't be too different from using Spark with Java/Scala.
- Extending Thrill requires familiarity with modern C++, possibly including advanced template tricks.
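To illustrate the first point, here is a rough sketch of that "plug in lambdas" style. This is not the actual Thrill API: Map and ReduceByKey below are toy sequential stand-ins over std::vector, written only to show what the user-facing code looks like when the framework supplies the plumbing and you supply the lambdas (word count, the usual example).

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy stand-ins for distributed collection operations. In a real
// framework these would run partitioned across workers; here they are
// plain sequential loops, just to show the user-facing style.
template <typename T, typename F>
auto Map(const std::vector<T>& in, F f) {
    std::vector<decltype(f(in[0]))> out;
    for (const auto& x : in) out.push_back(f(x));
    return out;
}

template <typename K, typename V, typename F>
std::map<K, V> ReduceByKey(const std::vector<std::pair<K, V>>& in, F reduce) {
    std::map<K, V> out;
    for (const auto& [k, v] : in) {
        auto it = out.find(k);
        if (it == out.end()) out.emplace(k, v);
        else it->second = reduce(it->second, v);
    }
    return out;
}

// Word count: all of the "business logic" lives in the two lambdas;
// everything else is the framework's job.
std::map<std::string, int> word_count(const std::vector<std::string>& words) {
    auto pairs = Map(words, [](const std::string& w) {
        return std::pair<std::string, int>{w, 1};
    });
    return ReduceByKey(pairs, [](int a, int b) { return a + b; });
}
```

The shape is essentially the same as a Spark word count with map and reduceByKey; the difference is C++ lambda syntax and templates instead of JVM closures.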
Since there isn't a whole lot of advanced stuff available for Thrill (yet), people with the latter skills are currently the ones most likely to be needed. But in a world where the same libraries available for Spark existed for Thrill or a similar C++ framework, that wouldn't be the case. Note that Thrill is still quite experimental.
I guess it's a trade-off, but dismissing the potential for 10x runtime gains "because C++" seems too one-sided. That isn't to say the C++ frameworks don't have a long way to go before they can rival Spark etc. in ease of use and tooling (they do!). But at least they point out the inefficiencies, and the potential for improvement, in these existing systems.
[0] - http://www.cse.cuhk.edu.hk/blogel/
[1] - http://www.cse.cuhk.edu.hk/blogel/papers/blogel.pdf