Any comparisons with Databricks Spark? When we started experimenting with Spark, we initially used AWS EMR. But then the same code was way faster on Databricks than it was on EMR, which resulted in us ditching EMR.
Databricks has kept their Photon[1][2] query engine for Spark closed source thus far. Unless EMR has made equivalent changes to the Spark runtime they use, Databricks should be much faster. Photon brings to Spark the standard vectorized execution techniques that SQL data warehouses have used for many years.
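For anyone unfamiliar with what "vectorized execution" means here, a rough sketch in plain Python. This is not Photon's implementation (which is closed source); the table, column names, and function names below are made up for illustration. It only contrasts the two shapes of execution: evaluating a filter plus expression one row at a time, versus evaluating them over batches of contiguous typed columns.

```python
from array import array

# Hypothetical data: a tiny table with two numeric columns, price and qty.
rows = [
    {"price": 2.0, "qty": 3},
    {"price": 5.0, "qty": 1},
    {"price": 1.5, "qty": 4},
]


def total_row_at_a_time(rows):
    """Row-oriented: every value costs a lookup and a dynamic type dispatch."""
    total = 0.0
    for row in rows:
        if row["qty"] > 1:
            total += row["price"] * row["qty"]
    return total


# Columnar layout: each column is stored as one contiguous typed array.
prices = array("d", (r["price"] for r in rows))
qtys = array("q", (r["qty"] for r in rows))


def total_columnar(prices, qtys, batch_size=1024):
    """Column-oriented: run each operator over a whole batch before the next."""
    total = 0.0
    for start in range(0, len(prices), batch_size):
        p = prices[start:start + batch_size]
        q = qtys[start:start + batch_size]
        # Step 1: evaluate the filter for the batch into a selection vector.
        selected = [i for i, v in enumerate(q) if v > 1]
        # Step 2: evaluate the expression only at the selected positions.
        total += sum(p[i] * q[i] for i in selected)
    return total
```

In pure Python the second version is not actually faster. The point is the structure: in a native engine, the per-batch loops over typed columns become tight, cache-friendly loops the compiler can turn into SIMD instructions, and the per-row interpretation overhead is paid once per batch instead of once per value.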
DataFrames are just SQL under the hood: both go through the same optimizer and execution engine. There will be no performance difference.
RDDs will be worse, so it shouldn't matter. No vectorization, no column processing, lots of serialization and de-serialization. They're basically always slower than DataFrames barring some strange use case.
Interesting, looks like it is just a DataFusion engine for Spark. There is a similar project: https://github.com/oap-project/gluten - it brings ClickHouse as an engine to Spark.