
DuckDB's storage format offers advantages similar to those of the Parquet storage format (e.g. individual columns can be read, partitions can be skipped, etc.), but it is different because DuckDB's format is designed to do more than Parquet files.

Parquet files are intended to store data from a single table, and they are intended to be write-once: you write the file and then never change it again. If you want to change anything in a Parquet file, you re-write the file.

DuckDB's storage format is intended to store an entire database (multiple tables, views, sequences, etc), and is intended to support ACID operations on those structures, such as insertions, updates, deletes, and even altering tables in the form of adding/removing columns or altering types of columns without rewriting the entire table or the entire file.

Tables are partitioned into row groups much like Parquet, but unlike Parquet the individual columns of those row groups are divided into fixed-size blocks so that individual columns can be fetched from disk. The fixed-size blocks ensure that the file will not suffer from fragmentation as the database is modified.

The storage is still a work in progress, and we are currently actively working on adding more support for compression and other goodies, as well as stabilizing the storage format so that we can maintain backwards compatibility between versions.



I should have clarified that I meant table storage specifically, to gain intuition about what happens when data is copied between DuckDB and Arrow/Parquet. It was much faster to just look at the code bridging the two and backtrack from there. Thanks!



