
I'm still not sure I "get" the use case for DuckDB. From what I understand, it's like a nifty in-memory SQL database, but why is that better than just running Postgres or Microsoft SQL Server locally, where your data structures and tables and stuff have a lot more permanence?

Like, my workflow is either I query an existing remote corporate DB and do my initial data munging there, or I get given a data dump that I either work on directly in Pandas, or add to a local DB and do a little more cleaning there. It's not at all clear to me how DuckDB would help.



DuckDB developer here. DuckDB is a regular RDBMS that has persistent ACID storage, but is tuned towards analytical workloads, i.e. read-heavy workloads with aggregates that require full scans of the data. Any data you write to tables is stored persistently on disk, and not all your data needs to fit in memory either.

Our tagline is “SQLite for analytics”, as DuckDB is an in-process database system similar to SQLite that is geared towards these types of workloads.

DuckDB has a flexible query engine, and also has support for directly running SQL queries (in parallel!) on top of Pandas [1] and Parquet [2] without requiring the data to be imported into the system.

[1] https://duckdb.org/2021/05/14/sql-on-pandas.html

[2] https://duckdb.org/2021/06/25/querying-parquet.html


Maybe this is a silly question: Why is the A/B choice between a row-major database and a column-major database, instead of between row-major tables and column-major tables within a flexible database?

What's stopping the other leading brands from implementing columnar storage, queries, and such with a COLUMN MAJOR table attribute?


Some databases do offer both, but it is much more involved than just changing the storage model: the entire query execution model needs to adapt to columnar execution. You can simulate a column store in a row-store database by splitting a table into a series of single-column tables, but the performance benefits you will capture are much smaller than in a system designed and optimized for column-store execution.
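A toy illustration of that "simulate a column store by splitting a table into single-column tables" idea, using stdlib SQLite as the row-store engine. Table and column names are made up; the point is only to show the shape of the workaround and its cost.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Row-store layout: one wide table.
con.execute("CREATE TABLE sales (rid INTEGER, region TEXT, amount REAL)")
# Simulated "columnar" layout: one narrow table per column, keyed by row id.
con.execute("CREATE TABLE sales_region (rid INTEGER, region TEXT)")
con.execute("CREATE TABLE sales_amount (rid INTEGER, amount REAL)")

data = [(1, "EU", 10.0), (2, "US", 20.0), (3, "EU", 5.0)]
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", data)
con.executemany("INSERT INTO sales_region VALUES (?, ?)", [(i, r) for i, r, _ in data])
con.executemany("INSERT INTO sales_amount VALUES (?, ?)", [(i, a) for i, _, a in data])

# An aggregate over one column touches only the narrow table...
total = con.execute("SELECT sum(amount) FROM sales_amount").fetchone()[0]
# ...but any multi-column query now pays for a join, and the executor
# still processes data one tuple at a time -- which is the comment's point.
eu_total = con.execute("""
    SELECT sum(a.amount)
    FROM sales_region r JOIN sales_amount a ON r.rid = a.rid
    WHERE r.region = 'EU'
""").fetchone()[0]
print(total, eu_total)  # 35.0 15.0
```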


SQL calculations on columnar data are quite different from those in row-based databases, so it's effectively a different database engine. A columnar data store also offers several additional advantages, because it usually employs a form of vocabulary compression. For instance, obtaining the distinct values of a field in a columnar DB is much faster because it's typically just the vocabulary of the field, so it doesn't even require a full table scan. Many other columnar computations, such as filtering or aggregation, can be done on compressed data without decompression.
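A minimal pure-Python sketch of the vocabulary (dictionary) compression the comment describes, with invented data:

```python
# A column of values, as a columnar engine might receive it.
values = ["EU", "US", "EU", "EU", "APAC", "US"]

# Encode: a vocabulary of distinct values plus small integer codes.
vocab = []
code_of = {}
codes = []
for v in values:
    if v not in code_of:
        code_of[v] = len(vocab)
        vocab.append(v)
    codes.append(code_of[v])

# DISTINCT is just the vocabulary -- no scan of the (long) codes array.
distinct = list(vocab)

# A filter like region = 'US' compares integer codes, never decompressing.
us_code = code_of["US"]
matches = sum(1 for c in codes if c == us_code)
print(distinct, matches)  # ['EU', 'US', 'APAC'] 2
```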


I work heavily with pandas and dask (when you want to use multiple cores), using parquet files for storage. We see a lot of benefit in selectively bringing duckdb into the mix. For instance, joins are extremely slow with both pandas and dask and require a lot of memory. That's a situation where using duckdb reduces the memory needs and speeds things up a lot.

And we may not want to upload the data into postgres or another database. We can just work with parquet files and run in-process queries.


Check out this post for some comparisons with Pandas.

https://duckdb.org/2021/05/14/sql-on-pandas.html

DuckDB is often faster than Pandas, and it can handle larger than memory data. Plus, if you already know SQL, you don't have to become a Pandas expert to be productive in Python data munging. Pandas is still good, but now you can mix and match with SQL!


DuckDB is columnar, so in theory a lot faster than Postgres or SQL Server for analytical workloads.

DuckDB is to Clickhouse, TimescaleDB, Redshift, etc as SQLite is to Postgres, MySQL, SQL Server.


Where do you get the idea that SQL Server does not support columnar storage? That claim is wrong.


I don’t think the OP said that SQL Server doesn’t support columnar, only that by analogy SQL Server is primarily a row store (which for most of its history was true).

Columnar technology in SQL Server only became usable in SQL 2016 (it existed in 2012 but was too restrictive — I know because I tried to use it).

In 2016 and above, you can either create a columnar index (non clustered column store) or convert an entire table into a columnar table (clustered column store). The technology is actually pretty impressive and I’ve used it in production where I have row stores coexisting with column stores within the same database.


Quite an interesting read about how TimescaleDB turns a row store (Postgres) into a column store.


The simple answer no one else seems to have mentioned: SQLite has quite a low limit on the number of columns it supports (2,000 by default), which is a problem for data analytics, which often prefers wide tables over long ones.


Not just in-memory. It's pretty convenient if you have a set of Parquet files with common schema. Fairly snappy and doesn't have to fit in memory.


I'm using duckdb for querying parquet files as well. It's an awesome tool, so nice to just "look into" parquet files with SQL.


Many enterprises are coming up with patterns where they replicate data from the database (say Redshift) into parquet files (a data lake?) and direct more traffic, including analytical workloads, onto the parquet files.

duckdb will be very useful here, instead of having to use Redshift Spectrum or whatever.


To me, the use case is really obvious: when you reached for SQLite but now want something with Moar Powah.

Now I've reduced it to a previously answered question: what's the use case for SQLite? ;) https://www.sqlite.org/whentouse.html

That being said, I don't see the point, and shudder at the idea of a web page's javascript doing anything which needs noticeable amounts of the CPU, but I'm a non-standard user...


I think local CPU is underutilized in these cloudy days. For many SaaS, the size of all your tenant data minus media is quite small. Most cloud apps have terrible experience with large tables, searching, and general perf. We trade this away for collaboration features.


I don't fully get the use case either, but it's in a different category from Postgres or Microsoft SQL Server because it runs in the browser and can be made part of your web app.


DuckDB-wasm is targeting the browser so it's not directly competing with Pandas (that's the job of native DuckDB).

It's targeting use cases where you want to push analytical computation away from servers into the client (browser).

Let me sketch two examples:

A) You have your data sitting in S3 and the user-specific data is in a browser-manageable area.

(E.g. this paper from Tableau research actually states dataset sizes that should fall into that category: https://research.tableau.com/sites/default/files/get_real.pd...)

In that scenario, you could eliminate your central server if your clients are smart enough.

B) You are talking about larger dataset sizes (GB) and want to explore them ad-hoc in your browser.

Uploading them is unrealistic and installing additional software is no longer ad-hoc enough.



