DuckDB is terrific. I'm bullish on its potential for simplifying many big data p...

jetter · on Feb 11, 2023

I use Clickhouse to store close to 1TB of API analytics data (which would be 10TB in MongoDB, Clickhouse has insane compression ) and it's a wonderful and stable SQL-first alternative to DuckDB - which is a very exciting piece of software, but is indeed too young to embed into boring production. The last time I checked DuckDB npm package, it used callbacks instead of awaits..

antonycourtney · on Feb 12, 2023

I can understand how the older callback API for node.js might form a negative impression, but it's really not indicative of the maturity of the core db engine at all. And remember: the vast majority of users use the Python API. Even better news is that, as of a couple of months ago, there is now this package (which I wrote at MotherDuck and we have open sourced) which provides typed promise wrappers for the Duckdb API: https://www.npmjs.com/package/duckdb-async. This is an independent npm package for now, but was developed in close coordination with the DuckDb core team.

RobinL · on Feb 10, 2023

I'd love to hear any real world experiences of anyone who's tried to run jobs that would usually require a spark cluster on a single machine with loads of cores and memory.

How big can you go, and how does speed compare to Spark? (I'm guessing significantly faster from my experience using Duckdb on smaller machines)

wenc · on Feb 10, 2023

I have a single machine EC2 instance with 32 cores and 240GB memory and about 200 GB of partitioned Parquet files. I use DuckDB and Python with complex SQL (window functions, inequality joins, quantile functions etc) to extract data from this data.

Because it’s a single machine (no distributed cluster) DuckDB can heavily parallelize and vectorize. I don’t know if I can give you perf numbers but complex analytic queries over the entire dataset regularly finish in 1-2 mins (not scientific since I’m not telling what kinds of queries I’m running).

I’ve used Spark SQL and DuckDB overall is just more ergonomic, less boilerplate and is much faster since it is so lightweight.

Granted DuckDB can only process data on one machine (whereas Spark can scale up indefinitely by adding machines) but most data sets I work with fit on a single beefy machine.

Distributed computing — most of the time, you ain’t gonna need it.

It’s like StackOverflow: it serves 2B requests a month but only runs on a few on-prem servers. Most people think this is impossible but you can actually do a lot with very few machines if you’re smart about it. Same with data. Big data is overrated.

RobinL · on Feb 10, 2023

Thanks - very interesting

jakewins · on Feb 10, 2023

I used it as glue for a monthly job, last week, processing ~1B event records, juggling them a bit and pushing them as 2M parquet segment files to S3.

Mixed experience, would definitely not put it in a system that isn’t ok crashing frequently.

It segfaults on Alpine (argh, C++!), and force exits the whole NodeJS process when it gets unexpected HTTP responses from S3.

In an archive run of 2M pushes it’ll crash the process 4-5 times.

Overall still really, really like it, but learned to not trust it

tracker1 · on Feb 10, 2023

Node + AWS for data ingress has been pretty painful in my experience (mostly dynamo feeds from large csv (whois database)). In the end, rewrote in C# (core 2) and it was able to complete more reliably. I'm guessing that go and rust would also be better. I like node, really like JS, but I just think that maybe the AWS libraries aren't that great in the space. It would run for 3-5 hours, then just blow up unexpectedly, even with plenty of memory overhead, and not really bandwidth limited, with appropriate retries and slowdown for dynamo rejections.

If I never have to write ETL pipelines again, I won't be upset about it.

Master_Odin · on Feb 10, 2023

There was a bug that was recently fixed in node 16.17+ that was causing hard node processes crashes when doing stuff with S3 and I think had to do with receiving multiple packets at once or something.

SnowflakeOnIce · on Feb 10, 2023

My experience with a modest machine and a ~100GB dataset was that DuckDB was significantly easier to use and much faster (20x) than Ray Data. Have not compared directly with Spark.

There was no cluster to set up or administer using DuckDB.

adgjlsfhk1 · on Feb 11, 2023

spark is just a tool to let you take a computation that would run in an hour on your laptop if coded properly and send it to a server with 1000 cores where it runs in 2 hours.

benjaminwootton · on Feb 10, 2023

DuckDB is a relational OLAP store. If you want to do transformations on relational data using SQL then I think nowadays you would look at the modern data stack and do it with DBT.

If you have genuinely big and unstructured data then of course you need a cluster and would reach for Spark.

If you have smallish data then maybe DuckDB has a role because working with SQL is nicer than Pandas. But a lot of time you actually need the complexity of Pandas to do the transformation you need.

DuckDB is neat but I still can’t quite convince myself of a killer use case.

neeleshs · on Feb 10, 2023

I am not sure I understand the first comment very well. Are you saying that instead of duckdb use modern data stack? Because DBT and DuckDB don't seem to contradict, but can work together. FWIW, I think the only important breakthrough in the "modern data stack" is really dbt. The rest, nothing modern about it

benjaminwootton · on Feb 11, 2023

It’s more a comment on where the market is at rather than a recommendation. I’m not saying it is bad tech, but I don’t see it’s niche.

If you have a few billion records and you want to filter, join, aggregate them using SQL then DBT against an OLAP server solves that issue so well that it doesn’t leave much white space for DuckDB.

I mentioned modern data stack because when you have SaaS, low code, consumption based billing, open source etc then it treads even more on the DuckDB value prop. DuckDB would have been great if Oracle was my only choice, but when I have Snowflake and Clickhouse in the toolbox it is a tougher market for them to carve out a niche.

hobs · on Feb 11, 2023

Also fwiw its always lowercase - dbt.

89vision · on Feb 10, 2023

> working with SQL is nicer than Pandas

Really? I prefer working with dataframe apis. You get a nice sql-like paradigm plus all the control structures of the runtime.

camgunz · on Feb 11, 2023

Databases are just much, much faster than Pandas, and that's before you start factoring the extraction and loading of data. I treat Pandas as a last resort when I can't do something in SQL, generally this is something like integrating with external services or running recordlinkage.

RobinL · on Feb 11, 2023

If you're curious, I've written a FOSS record linkage library that executes everything as SQL. It supports multiple SQL backends including DuckDB and Spark for scale, and runs faster than most competitors because it's able to leverage the speed of these backends: https://github.com/moj-analytical-services/splink

camgunz · on Feb 11, 2023

Oh hot tip! Thank you! Love the blog btw

jammycrisp · on Feb 11, 2023

You might be interested in checking out Ibis (https://ibis-project.org/). It provides a dataframe-like API, abstracting over many common execution engines (duckdb, postgres, bigquery, spark, ...). Ibis wrapping duckdb has pretty much replaced pandas as my tool of choice for local data analysis. All the performance of duckdb with all the ergonomics of a dataframe API. (disclaimer: I contribute to Ibis for work).

je42 · on Feb 11, 2023

Interesting ! One newbie question, how does ibis differ from sqlalchemy ?

89vision · on Feb 13, 2023

sql alchemy is an orm, where ibis looks to be a dataframe api that is sort of a dsl over sql. It doesn't try to map relational domains to an object oriented paradigm like sql alchemy does

wenc · on Feb 10, 2023

SQL is much nicer for anything non-trivial. Pandas methods get unwieldy for complex aggregations.

Also Pandas methods are imperative so cannot be optimized. SQL is declarative so it can be optimized to the hilt and DuckDB is faster than Pandas in almost all cases, even on Pandas data frames themselves! (partly due to vectorization).

nerdponx · on Feb 10, 2023

As far as I can tell, DuckDB is an alternative to "data frame" libraries like Data.table, Polars, Pandas, etc. Is that the case? What makes DuckDB a better choice than, say, Polars?

corford · on Feb 10, 2023

This blog post offers a nice summary: https://motherduck.com/blog/six-reasons-duckdb-slaps/

nerdponx · on Feb 10, 2023

The blog post doesn't really make a comparison between DuckDB and data frame libraries. It mentions that the DuckDB Python bindings can interoperate with Pandas, but it doesn't really explain why you would use DuckDB instead of Pandas, or Polars (which is both faster and more portable than Pandas).

simonw · on Feb 10, 2023

Don't Polars and Pandas both require your entire data to fit in memory?

nerdponx · on Feb 10, 2023

Pandas doesn't. Polars I think has some lazy-loading capability, but it's not the default mode of operation and I don't think it supports all features. If DuckDB doesn't, then that's a big advantage.

jcheng · on Feb 11, 2023

I think you mean that Pandas does require your entire data to fit in memory? https://pandas.pydata.org/docs/user_guide/scale.html

nerdponx · on Feb 11, 2023

Yes, thank you, that was a typo.

corford · on Feb 11, 2023

Apparently duckdb paired with arrow lets you work performatively on bigger than memory parquet data (I haven't tried this though)

riezebos · on Feb 11, 2023

Arrow seems to be the common denominator for larger than ram. Polars is completely built on arrow, so the same advantages should apply.

https://github.com/pola-rs/polars#handles-larger-than-ram-da...

syntaxfree · on Feb 10, 2023

It’s a drop in alternative to SQLite that’s column-oriented/OLAP. I’ve been profiling entire projects in production switching between SQLite and duckdb (no clear conclusions yet)

nerdponx · on Feb 10, 2023

I suppose that leads to a broader question: when should you use an in-memory database, and when should you use a data frame library? The distinction between the two seems to be getting blurry (which maybe is a good thing).

IanCal · on Feb 10, 2023

Very blurry. The answer now is just "whichever is easier for the small part of the task right now". Since duckdb happily talks arrow, you can use pandas for part of it, quickly do some SQL where that is easier (with no data copying) then switch back to pandas for something. You don't really have to choose which one to use any more.

wenc · on Feb 10, 2023

Exactly. In my DuckDB workflow I use Pandas data frames and DuckDB queries interchangeably.

  import duckdb as db
  import pandas as pd 

  df = pd.read_excel(“z.xlsx”)
  df2 = db.query(“select * from df join ‘s3://bucket/a.parquet’ b on df.col b.col”).df()
  df3 = df2.col.apply(lambda x: x)

DuckDB can refer any Pandas data frame in the namespace as a SQL object. You can query across Parquet, CSV and Pandas data frames seamlessly.

Need to join Excel with Parquet with CSV? No problem. You can do it all within DuckDB.

brahbrah · on Feb 13, 2023

Pandas is in a separate category from all of these, including polars. If you were to say “pandas in long format only” then yes that would be correct, but the power of pandas comes in its ability to work in a long relational or wide ndarray style. Pandas was originally written to replace excel in financial/econometric modeling, not as a replacement for sql. Models written solely in the long relational style are near unmaintainable for constantly evolving models with hundreds of data sources and thousands of interactions being developed and tuned by teams of analysts and engineers.

For example, this is how some basic operations would look in pandas.

Bump prices in 2020 up $1:

    prices_df.loc['2020'] += 1

Add expected temperature offsets to base temperature forecast:

    temp_df + offset_df

Now imagine thousands of such operations, and you can see the necessity of pandas in models like this.

benjaminwootton · on Feb 10, 2023

SQL is easier and more natural to work with than Pandas.

brahbrah · on Feb 13, 2023

Sometimes yes and sometimes no. For example if you had price data and you wanted to bump prices up a dollar in 2020 how would you do that in sql?

In pandas it’s:

    prices_df.loc['2020'] += 1

If you had a temperature forecast and you wanted to add the expected temperature miss to them, how would you do that on sql?

In pandas it’s:

    temps_df + expected_miss_df

nojito · on Feb 11, 2023

How would you do df.T in sql?

wenc · on Feb 11, 2023

df.T is a special Pandas dataframe transpose on the dataframe index and the columns.

DuckDB produces Pandas dataframes, so you would just do df.T. No need to choose between one the other.

But to answer your original question, the SQL analogue to a transpose are PIVOT/UNPIVOT operations which are mathematically rotation operations on invariants (your dimensions). This makes them much more general than a transpose -- which are just rotation operations on the rows/cols. PIVOT/UNPIVOT work on non-square data and allow you to specify different types of aggregations. PIVOT/UNPIVOT keywords are not yet implemented in DuckDB but are on the roadmap if I'm not mistaken.

nojito · on Feb 13, 2023

>DuckDB produces Pandas dataframes, so you would just do df.T. No need to choose between one the other.

Transforming into a pandas df isn't zero copy.

UNPIVOT and PIVOT are quite verbose compared to df.T.

greggyb · on Feb 10, 2023

32 cores and 128GB RAM are now desktop-class specs. Latest generation commodity servers can supply you with hundreds of cores and TBs of RAM. Note: "commodity" != "cheap", at least not necessarily.

Ninja edit before anyone misconstrues this. I am not saying that the typical desktop has these specs. I am saying that the class of hardware that is most commonly run on desktops includes SKUs that can meet this spec. Desktop-class means the same motherboard socket and processor architecture.

krimpenrik · on Feb 10, 2023

Recently tried the GUI tool for ducks, forgot what's it called, something like 'Tab' and was quite disappointed. I feel duckdb needs a good tool like sqliteviewer to really take off.

antonycourtney · on Feb 12, 2023

I think you're referring to Tad (https://www.tadviewer.com), which I developed. Tad isn't "the GUI tool for DuckDb"; it's a desktop app that provides a pivot table based viewer for tabular data files (CSV, Parquet, and DuckDb/SQLite database files). It uses DuckDb as its engine, but pre-dates DuckDb and was developed independently. It's listed in the DuckDb docs along with several others tools that work with or use DuckDb. All that said, I'm sorry you found it disappointing, and would welcome any constructive feedback on what specifically you found lacking, either here or to tad-feedback@tadviewer.com.

krimpenrik · on Feb 13, 2023

Then I think I used it wrong, happens often. Thank you for your work. I was/am new to duckdb and since it was listed in the docs I assumed it was something like SQL viewer

0cf8612b2e1e · on Feb 10, 2023

DBeaver supports duckdb.

capkutay · on Feb 11, 2023

reminds me of this blog on streaming data to Parquet files and running queries on data in the native format.

https://pedram.substack.com/p/streaming-data-pipelines-with-...