Blaze: Fast query execution engine for Apache Spark

whinvik · on Oct 20, 2023

Any comparisons with Databricks Spark? When we started experimenting with Spark, we initially used AWS EMR. But then the same code was way faster on Databricks than it was on EMR, which resulted in us ditching EMR.

AdamProut · on Oct 20, 2023

Databricks has kept their Photon[1][2] query engine for Spark closed source thus far. Unless EMR has made equivalent changes to the Spark runtime they use, Databricks should be much faster. Photon brings the standard vectorized execution techniques used in SQL data warehouses for many years to Spark.

[1] https://docs.databricks.com/en/clusters/photon.html

[2] https://dl.acm.org/doi/10.1145/3514221.3526054

whinvik · on Oct 21, 2023

I am a bit hazy about the exact details of how we did it since it's been some time, but we definitely did not use Photon as it was too expensive.

One of the issues was that we started experimenting with Delta Tables and EMR was horrible at leveraging them.

juliangamble · on Oct 20, 2023

It would be great to have a comparison to DataFrames and RDDs as well.

bllchmbrs · on Oct 20, 2023

DataFrames are just SQL. There will be no performance difference.

RDDs will be worse, so it shouldn't matter. No vectorization, no column processing, lots of serialization and de-serialization. They're basically always slower than DataFrames barring some strange use case.

esafak · on Oct 20, 2023

Got numbers?

zX41ZdbW · on Oct 21, 2023

Interesting, looks like it is just the DataFusion engine for Spark. There is a similar project: https://github.com/oap-project/gluten - it brings ClickHouse as an engine to Spark.

zhangyt26 · on Oct 23, 2023

Photon, Velox, and now this. Why would people use Spark in the first place other than for legacy application reasons?

elromulous · on Oct 20, 2023

For a split second, I thought Bazel[0] finally got externally renamed to its true name.

[0] https://en.m.wikipedia.org/wiki/Bazel_(software)

kristjansson · on Oct 20, 2023

Unfortunate name overlap with an under-loved PyData project: https://blaze.pydata.org

tjpnz · on Oct 21, 2023

And Google's version of Bazel.

laurentlb · on Oct 21, 2023

The public version was renamed Bazel because of name conflicts.

tomrod · on Oct 21, 2023

Same. I've always liked the blaze project.