Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

DuckDB is terrific. I'm bullish on its potential for simplifying many big data pipelines. Particularly, it's plausible that DuckDB + Parquet could be used on a large SMP machine (32+ cores and 128GB+ memory) to deal with data munging for 100s of gigabytes to several terabytes, all from SQL, without dealing with Hadoop, Spark, Ray, etc.

I have successfully used DuckDB like above for preparing an ML dataset from about 100GB of input.

DuckDB is undergoing rapid development these days. There have been format-breaking changes and bugs that could lose data. I would not yet trust DuckDB for long-term storage or archival purposes. Parquet is a better choice for that.



I use Clickhouse to store close to 1TB of API analytics data (which would be 10TB in MongoDB, Clickhouse has insane compression ) and it's a wonderful and stable SQL-first alternative to DuckDB - which is a very exciting piece of software, but is indeed too young to embed into boring production. The last time I checked DuckDB npm package, it used callbacks instead of awaits..


I can understand how the older callback API for node.js might form a negative impression, but it's really not indicative of the maturity of the core db engine at all. And remember: the vast majority of users use the Python API. Even better news is that, as of a couple of months ago, there is now this package (which I wrote at MotherDuck and we have open sourced) which provides typed promise wrappers for the Duckdb API: https://www.npmjs.com/package/duckdb-async. This is an independent npm package for now, but was developed in close coordination with the DuckDb core team.


I'd love to hear any real world experiences of anyone who's tried to run jobs that would usually require a spark cluster on a single machine with loads of cores and memory.

How big can you go, and how does speed compare to Spark? (I'm guessing significantly faster from my experience using Duckdb on smaller machines)


I have a single machine EC2 instance with 32 cores and 240GB memory and about 200 GB of partitioned Parquet files. I use DuckDB and Python with complex SQL (window functions, inequality joins, quantile functions etc) to extract data from this data.

Because it’s a single machine (no distributed cluster) DuckDB can heavily parallelize and vectorize. I don’t know if I can give you perf numbers but complex analytic queries over the entire dataset regularly finish in 1-2 mins (not scientific since I’m not telling what kinds of queries I’m running).

I’ve used Spark SQL and DuckDB overall is just more ergonomic, less boilerplate and is much faster since it is so lightweight.

Granted DuckDB can only process data on one machine (whereas Spark can scale up indefinitely by adding machines) but most data sets I work with fit on a single beefy machine.

Distributed computing — most of the time, you ain’t gonna need it.

It’s like StackOverflow: it serves 2B requests a month but only runs on a few on-prem servers. Most people think this is impossible but you can actually do a lot with very few machines if you’re smart about it. Same with data. Big data is overrated.


Thanks - very interesting


I used it as glue for a monthly job, last week, processing ~1B event records, juggling them a bit and pushing them as 2M parquet segment files to S3.

Mixed experience, would definitely not put it in a system that isn’t ok crashing frequently.

It segfaults on Alpine (argh, C++!), and force exits the whole NodeJS process when it gets unexpected HTTP responses from S3.

In an archive run of 2M pushes it’ll crash the process 4-5 times.

Overall still really, really like it, but learned to not trust it


Node + AWS for data ingress has been pretty painful in my experience (mostly dynamo feeds from large csv (whois database)). In the end, rewrote in C# (core 2) and it was able to complete more reliably. I'm guessing that go and rust would also be better. I like node, really like JS, but I just think that maybe the AWS libraries aren't that great in the space. It would run for 3-5 hours, then just blow up unexpectedly, even with plenty of memory overhead, and not really bandwidth limited, with appropriate retries and slowdown for dynamo rejections.

If I never have to write ETL pipelines again, I won't be upset about it.


There was a bug that was recently fixed in node 16.17+ that was causing hard node processes crashes when doing stuff with S3 and I think had to do with receiving multiple packets at once or something.


My experience with a modest machine and a ~100GB dataset was that DuckDB was significantly easier to use and much faster (20x) than Ray Data. Have not compared directly with Spark.

There was no cluster to set up or administer using DuckDB.


spark is just a tool to let you take a computation that would run in an hour on your laptop if coded properly and send it to a server with 1000 cores where it runs in 2 hours.


DuckDB is a relational OLAP store. If you want to do transformations on relational data using SQL then I think nowadays you would look at the modern data stack and do it with DBT.

If you have genuinely big and unstructured data then of course you need a cluster and would reach for Spark.

If you have smallish data then maybe DuckDB has a role because working with SQL is nicer than Pandas. But a lot of time you actually need the complexity of Pandas to do the transformation you need.

DuckDB is neat but I still can’t quite convince myself of a killer use case.


I am not sure I understand the first comment very well. Are you saying that instead of duckdb use modern data stack? Because DBT and DuckDB don't seem to contradict, but can work together. FWIW, I think the only important breakthrough in the "modern data stack" is really dbt. The rest, nothing modern about it


It’s more a comment on where the market is at rather than a recommendation. I’m not saying it is bad tech, but I don’t see it’s niche.

If you have a few billion records and you want to filter, join, aggregate them using SQL then DBT against an OLAP server solves that issue so well that it doesn’t leave much white space for DuckDB.

I mentioned modern data stack because when you have SaaS, low code, consumption based billing, open source etc then it treads even more on the DuckDB value prop. DuckDB would have been great if Oracle was my only choice, but when I have Snowflake and Clickhouse in the toolbox it is a tougher market for them to carve out a niche.


Also fwiw its always lowercase - dbt.


> working with SQL is nicer than Pandas

Really? I prefer working with dataframe apis. You get a nice sql-like paradigm plus all the control structures of the runtime.


Databases are just much, much faster than Pandas, and that's before you start factoring the extraction and loading of data. I treat Pandas as a last resort when I can't do something in SQL, generally this is something like integrating with external services or running recordlinkage.


If you're curious, I've written a FOSS record linkage library that executes everything as SQL. It supports multiple SQL backends including DuckDB and Spark for scale, and runs faster than most competitors because it's able to leverage the speed of these backends: https://github.com/moj-analytical-services/splink


Oh hot tip! Thank you! Love the blog btw


You might be interested in checking out Ibis (https://ibis-project.org/). It provides a dataframe-like API, abstracting over many common execution engines (duckdb, postgres, bigquery, spark, ...). Ibis wrapping duckdb has pretty much replaced pandas as my tool of choice for local data analysis. All the performance of duckdb with all the ergonomics of a dataframe API. (disclaimer: I contribute to Ibis for work).


Interesting ! One newbie question, how does ibis differ from sqlalchemy ?


sql alchemy is an orm, where ibis looks to be a dataframe api that is sort of a dsl over sql. It doesn't try to map relational domains to an object oriented paradigm like sql alchemy does


SQL is much nicer for anything non-trivial. Pandas methods get unwieldy for complex aggregations.

Also Pandas methods are imperative so cannot be optimized. SQL is declarative so it can be optimized to the hilt and DuckDB is faster than Pandas in almost all cases, even on Pandas data frames themselves! (partly due to vectorization).


As far as I can tell, DuckDB is an alternative to "data frame" libraries like Data.table, Polars, Pandas, etc. Is that the case? What makes DuckDB a better choice than, say, Polars?


This blog post offers a nice summary: https://motherduck.com/blog/six-reasons-duckdb-slaps/


The blog post doesn't really make a comparison between DuckDB and data frame libraries. It mentions that the DuckDB Python bindings can interoperate with Pandas, but it doesn't really explain why you would use DuckDB instead of Pandas, or Polars (which is both faster and more portable than Pandas).


Don't Polars and Pandas both require your entire data to fit in memory?


Pandas doesn't. Polars I think has some lazy-loading capability, but it's not the default mode of operation and I don't think it supports all features. If DuckDB doesn't, then that's a big advantage.


I think you mean that Pandas does require your entire data to fit in memory? https://pandas.pydata.org/docs/user_guide/scale.html


Yes, thank you, that was a typo.


Apparently duckdb paired with arrow lets you work performatively on bigger than memory parquet data (I haven't tried this though)


Arrow seems to be the common denominator for larger than ram. Polars is completely built on arrow, so the same advantages should apply.

https://github.com/pola-rs/polars#handles-larger-than-ram-da...


It’s a drop in alternative to SQLite that’s column-oriented/OLAP. I’ve been profiling entire projects in production switching between SQLite and duckdb (no clear conclusions yet)


I suppose that leads to a broader question: when should you use an in-memory database, and when should you use a data frame library? The distinction between the two seems to be getting blurry (which maybe is a good thing).


Very blurry. The answer now is just "whichever is easier for the small part of the task right now". Since duckdb happily talks arrow, you can use pandas for part of it, quickly do some SQL where that is easier (with no data copying) then switch back to pandas for something. You don't really have to choose which one to use any more.


Exactly. In my DuckDB workflow I use Pandas data frames and DuckDB queries interchangeably.

  import duckdb as db
  import pandas as pd 

  df = pd.read_excel(“z.xlsx”)
  df2 = db.query(“select * from df join ‘s3://bucket/a.parquet’ b on df.col b.col”).df()
  df3 = df2.col.apply(lambda x: x)

DuckDB can refer any Pandas data frame in the namespace as a SQL object. You can query across Parquet, CSV and Pandas data frames seamlessly.

Need to join Excel with Parquet with CSV? No problem. You can do it all within DuckDB.


Pandas is in a separate category from all of these, including polars. If you were to say “pandas in long format only” then yes that would be correct, but the power of pandas comes in its ability to work in a long relational or wide ndarray style. Pandas was originally written to replace excel in financial/econometric modeling, not as a replacement for sql. Models written solely in the long relational style are near unmaintainable for constantly evolving models with hundreds of data sources and thousands of interactions being developed and tuned by teams of analysts and engineers.

For example, this is how some basic operations would look in pandas.

Bump prices in 2020 up $1:

    prices_df.loc['2020'] += 1
Add expected temperature offsets to base temperature forecast:

    temp_df + offset_df
Now imagine thousands of such operations, and you can see the necessity of pandas in models like this.


SQL is easier and more natural to work with than Pandas.


Sometimes yes and sometimes no. For example if you had price data and you wanted to bump prices up a dollar in 2020 how would you do that in sql?

In pandas it’s:

    prices_df.loc['2020'] += 1
If you had a temperature forecast and you wanted to add the expected temperature miss to them, how would you do that on sql?

In pandas it’s:

    temps_df + expected_miss_df


How would you do df.T in sql?


df.T is a special Pandas dataframe transpose on the dataframe index and the columns.

DuckDB produces Pandas dataframes, so you would just do df.T. No need to choose between one the other.

But to answer your original question, the SQL analogue to a transpose are PIVOT/UNPIVOT operations which are mathematically rotation operations on invariants (your dimensions). This makes them much more general than a transpose -- which are just rotation operations on the rows/cols. PIVOT/UNPIVOT work on non-square data and allow you to specify different types of aggregations. PIVOT/UNPIVOT keywords are not yet implemented in DuckDB but are on the roadmap if I'm not mistaken.


>DuckDB produces Pandas dataframes, so you would just do df.T. No need to choose between one the other.

Transforming into a pandas df isn't zero copy.

UNPIVOT and PIVOT are quite verbose compared to df.T.


32 cores and 128GB RAM are now desktop-class specs. Latest generation commodity servers can supply you with hundreds of cores and TBs of RAM. Note: "commodity" != "cheap", at least not necessarily.

Ninja edit before anyone misconstrues this. I am not saying that the typical desktop has these specs. I am saying that the class of hardware that is most commonly run on desktops includes SKUs that can meet this spec. Desktop-class means the same motherboard socket and processor architecture.


Recently tried the GUI tool for ducks, forgot what's it called, something like 'Tab' and was quite disappointed. I feel duckdb needs a good tool like sqliteviewer to really take off.


I think you're referring to Tad (https://www.tadviewer.com), which I developed. Tad isn't "the GUI tool for DuckDb"; it's a desktop app that provides a pivot table based viewer for tabular data files (CSV, Parquet, and DuckDb/SQLite database files). It uses DuckDb as its engine, but pre-dates DuckDb and was developed independently. It's listed in the DuckDb docs along with several others tools that work with or use DuckDb. All that said, I'm sorry you found it disappointing, and would welcome any constructive feedback on what specifically you found lacking, either here or to tad-feedback@tadviewer.com.


Then I think I used it wrong, happens often. Thank you for your work. I was/am new to duckdb and since it was listed in the docs I assumed it was something like SQL viewer


DBeaver supports duckdb.


reminds me of this blog on streaming data to Parquet files and running queries on data in the native format.

https://pedram.substack.com/p/streaming-data-pipelines-with-...




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: