BetrFS: An in-kernel file system that uses Bε trees to organize on-disk storage (betrfs.org)
167 points by espeed on Oct 13, 2018 | 46 comments


This is really interesting...

... so I'm really a bit bewildered and sad by some of the engineering choices they made along the way, like requiring a modified kernel: https://github.com/oscarlab/betrfs/blob/master/README.md#com...

The reasons are detailed just above that link target, but are somewhat absurd IMHO: they modified the kernel's `struct task_struct` to pass error values, rather than fixing one of their libraries to handle its own error values. In return, nobody ever gets to use this without patching a kernel (which is, by the way, so ancient of a fork at this point that Forget It).

I know it's unfair for some random internet person to complain about the engineering choices of a project like this, and I'm sure I don't know the influences and tradeoffs first hand, etc., etc., but... ow. I would've tried this.

But not on a 3.11 kernel fork.

I hope someone can take all the awesome research here and steer it towards something slightly closer to a product.


All the reasons given boil down to “we didn’t want to mess with TokuDB”. The build process (requiring specific GCC versions, CMake, and a bunch of packages including Valgrind), the errno patch to task_struct, and the half dozen libc stub functions are all there because TokuDB expects userspace libc.

I don’t know whether the authors ever attempted to patch TokuDB itself (or streamline it down to the essentials for the kernel). Instead they appear to have just taken the entire userspace-designed library and hacked the kernel until it fit. It’s a halfway-decent strategy if your goal is to get the thing off the ground as quickly as possible, but obviously a real implementation would have to ship a modified TokuDB instead (it’s OSS, so it should be hackable!)

This is pretty common in academia, sadly. As an academic who has released some academic OSS code myself, I can say that often there’s just not enough time or motivation to fix a blob of code into a generally usable format. Often it’s released just so that we can say “hey we open-sourced it so other researchers can build on it/replicate our results”. This may be one of the reasons why academic ideas don’t make it out to the real world that quickly.


Note that some of the people involved are also the original authors of TokuDB [0], including Bradley Kuszmaul [1] , Michael Bender [2], and Martin Farach-Colton [3] who were the founders of Tokutek (acquired by Percona in 2015).

Previous discussion with the authors: A Comparison of Log-Structured Merge (LSM) and Fractal Tree Indexing https://news.ycombinator.com/item?id=9248298

[0] TokuDB Fractal Tree Index https://github.com/Tokutek/ft-index TokuDB Engine https://github.com/percona/tokudb-engine

[1] Bradley Kuszmaul http://people.csail.mit.edu/bradley/startups.html

[2] Michael Bender https://www3.cs.stonybrook.edu/~bender/

[3] Martin Farach-Colton https://www.cs.rutgers.edu/~farach/


Perhaps TokuDB was evolving rapidly at the time they were trying to incorporate it into this filesystem.

It seems like the choices are:

1. Fork TokuDB.

2. Fork the kernel.

If the goal is research, the first option slows down the research itself. The second only gets you stuck on an old kernel, which, while not great for end users, isn't as big a problem for a researcher as having their research slowed down.


TokuDB seems to be written in C++. It's neat that they managed to make that work in-kernel.

Also there are patent notices in the TokuDB files.

So the probability of it being in Linux at some point approaches zero. And with the patents they may even have made the whole technique unviable for any Linux fs (though I haven't looked at them in detail, obviously).


See Bradley's comment regarding licensing the fractal tree code as GPLv2 with a patent provision: https://news.ycombinator.com/item?id=18208209


This isn't completely fair, although most of it is.

Some of the reasons for modifying Linux are performance enhancements to the filesystem layer that only make sense with a Bε-tree filesystem (and are not possible without patching Linux). They cover this in the hour-long talk at MS research linked elsewhere in this thread.[1] (Yes, it's long, but it's a pretty good presentation.)

E.g., they describe modifying the page cache to write-through small modifications to file data, rather than dirtying the entire page and writing it back later (a form of write amplification).

[1]: https://www.youtube.com/watch?v=fBt5NuNsoII


> E.g., they describe modifying the page cache to write-through small modifications to file data, rather than dirtying the entire page and writing it back later (a form of write amplification).

Considering that all the storage on the market now has sectors at least as large as a 4k page, this isn't actually reducing write amplification. At most, in some cases it might save a tiny bit of bus traffic.


They batch small edits into a log. Many edits to one sector written.
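A toy sketch of that arithmetic (hypothetical numbers, nothing from the actual BetrFS code): appending small edits to a shared log lets many of them fit in one 4K block, whereas dirtying a full page per edit writes back a whole block each time.

```python
BLOCK = 4096  # bytes per block/sector, matching modern storage

def blocks_written_page_per_edit(edits):
    # page-cache behavior described above: each small edit dirties
    # (and eventually writes back) an entire page
    return len(edits)

def blocks_written_logged(edits):
    # edits appended to a shared log; blocks = total bytes, rounded up
    total_bytes = sum(len(e) for e in edits)
    return -(-total_bytes // BLOCK)  # ceiling division

edits = [b"x" * 100] * 50  # fifty 100-byte edits
print(blocks_written_page_per_edit(edits))  # 50 blocks
print(blocks_written_logged(edits))         # 2 blocks for 5000 bytes
```

Same 5000 bytes of user data either way; the log just writes them in 2 blocks instead of 50.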


It makes sense, V1 is the hack (speed) and if it gets traction then you push for a V2 with different constraints and more market information (leverage).


Feels more like an academic project to prove a point, than an actual usable file system.


I don't think any reasonable academic wants anyone to actually use the prototype of their file system with irreplaceable (i.e. personal) data. Perhaps requiring a patched kernel is this project's way of making sure that that doesn't happen.


But also, you have to build TokuDB, using gcc-4.7 specifically, and then

> We import TokuDB as a binary blob, and overwrite TokuDB symbols using symbols from these files.

Pretty weird ...


I wouldn't actually apply such an invasive patch on any system I ever cared about anyway, but I wonder if it would work on the RHEL/CentOS kernel, which has a supposed version of 3.10?


It has very little to do with mainline 3.10; for instance, it includes XFS v5, which was introduced in kernel 3.16.


Hence "supposed version"; it's my impression that they backported a non-trivial amount from as far forward as the 4.x series.


Michael Bender, one of the people behind this, gave an excellent invited talk on B^epsilon trees and the possibilities that write-optimised data structures introduce (especially in data base systems) at IPDPS this year. Unfortunately it wasn’t recorded as far as I’m aware, but the slides are available at http://ipdps.org/ipdps2018/bender-ipdps2018-wods.pdf. A more formal introduction to B^epsilon trees is http://supertech.csail.mit.edu/papers/BenderFaJa15.pdf
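Since the talk wasn't recorded, here's the one-paragraph version as a toy sketch (purely illustrative Python, nothing from TokuDB or BetrFS): a Bε-tree interior node spends part of its space on a buffer of pending messages; writes land in the buffer, and when it fills, the messages are flushed down to the children in a batch, which is what turns many small random writes into a few large ones.

```python
BUFFER_SIZE = 4  # messages an interior node holds before flushing

class Node:
    def __init__(self, leaf=True):
        self.leaf = leaf
        self.buffer = []     # pending (key, value) messages (interior)
        self.data = {}       # applied key/value pairs (leaf)
        self.pivot = None    # split key (interior)
        self.children = []   # [left, right] (interior)

    def upsert(self, key, value):
        if self.leaf:
            self.data[key] = value
            return
        self.buffer.append((key, value))
        if len(self.buffer) >= BUFFER_SIZE:
            self.flush()

    def flush(self):
        # push buffered messages down to the matching child in one batch
        for key, value in self.buffer:
            child = self.children[0] if key < self.pivot else self.children[1]
            child.upsert(key, value)
        self.buffer = []

    def query(self, key):
        if self.leaf:
            return self.data.get(key)
        # messages still in the buffer shadow older values below
        for k, v in reversed(self.buffer):
            if k == key:
                return v
        child = self.children[0] if key < self.pivot else self.children[1]
        return child.query(key)

# two leaves under one buffering root
root = Node(leaf=False)
root.pivot = "m"
root.children = [Node(), Node()]

for k in ["a", "z", "b", "q"]:  # 4th upsert triggers a batched flush
    root.upsert(k, k.upper())

print(root.query("a"))  # "A" (flushed to the left leaf)
print(root.query("q"))  # "Q"
```

The ε in the name is the knob that splits node space between pivots and buffer; a real implementation also cascades flushes down multiple levels and handles node splits, which this sketch skips entirely.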


Here's the talk Bradley Kuszmaul [1] gave to MIT 6.172 in 2010...

How TokuDB Fractal Tree Indexes Work https://www.youtube.com/watch?v=9Rb85cOXTKU&t=202s

[1] https://people.csail.mit.edu/bradley/


Talk given by Rob Johnson [1] at MSR a few years back...

BetrFS: A Right-Optimized Write-Optimized File System https://www.youtube.com/watch?v=fBt5NuNsoII

[1] http://www3.cs.stonybrook.edu/~rob/


That's a great talk, thanks!


The website doesn't seem to mention that several of the papers on the filesystem won best-paper awards at major conferences. The paper, Optimizing Every Operation in a Write-Optimized File System, in particular, won best-paper award at FAST '16.


Btrfs vs betrfs... This is going to cause so much confusion

Ah, but the project isn't new, so I guess it's not a new problem


We're slowly converging toward correct spelling though.


I had guessed they replaced btrfs's B-trees with Bε-trees. Anyway, I don't think there is much cause for concern -- this seems like academic/corporate abandonware, more a proof of concept than anything else.


BtrFS stands for B-tree File System. BetrFS stands for B^{\epsilon}-tree File System. They are both named after their primary data structure.


AAAAAAAH

I always wondered why on earth they named a filesystem after Soviet armored personnel carriers. But no, it's "betterfs" without the vowels. Got it.


Pronounce btrfs as “butter fs” and betrfs as “bee-turr fs” or something, then there is no confusion ;)

But yeah I agree, betrfs and btrfs are way too similar names.


I can’t believe it’s not butrfs.


You sound a bit bitrfs.


I think it's intended to be pronounced as "better fs"


What's the state of this file system? Is it in the Linux kernel? In some BSDs? Both the main page and the FAQ talk about "the kernel" without saying which kernel it is.

How reliable is it? Are there file system checkers for it? Does it support snapshots?


There's a comment upthread that it works on a very much patched Linux kernel, so it certainly isn't upstreamed.


What's the story with patent/license on fractal tree by tokutek (now percona)? Can you use it for personal use only? Ie. you can't use it at work without a license? What about Bε-trees - are there patent free implementations?


Tokutek licensed the fractal tree under GPLv2 with an explicit patent license to make clear that anyone could use the fractal tree code. I don't know what Percona did after the acquisition.


Any idea what year the patent clock times out?


2027


Sorry, I have to say this is a poor and unpersuasive case for the architecture.

Performance increases are everything, yet unless I’m missing it there is no way to know what the improvements are.

For example, for the data in the chart here (http://www.betrfs.org/faq.html), is enough information provided to reproduce the results, including hardware and configuration? If not, the results are utterly meaningless.

Forward looking, it would strain credibility if you hadn’t tried to get a sense of the real world gains on Intel XPoint storage. I would speculate it will not be many years before there are no new green field deployments of storage that spins around in a circle.


First off, why would you use a research file system on enterprise hardware? Secondly, if you read just a little bit further you see that they are testing on spinning rust (HDDs). Since they mention threads, I would hope they are testing on something that supports hardware parallelism, so probably some relatively decent modern hardware. Mind you, this is all speculation, though it's the conclusion you could draw by reading the FAQ carefully.


If you describe results of an experiment, you need to provide the detail or pointer to how to reproduce it.

Benchmarking on enterprise hardware is completely relevant because that is the technology that is going to become dominant over the next few years; if the performance gains do not show up on that type of technology, they may not be significant.

In some cases enterprise hardware is much different from what would be used in other scenarios or at massive scale, similar to how Facebook doesn't go down and buy enterprise servers to run its data centers. In this case, however, the new generation of memory/storage hybrids will not be that different whether it's in your laptop or in a server, size, features, and scalability notwithstanding, of course.

If I read an FAQ answer that doesn't have an asterisk or a pointer to the full information, it's not my responsibility as the reader to go hunting around for details. That is the job of the author of the paper or website. You don't get to brag without putting the details out there.


I wonder how this compares to bcachefs: https://bcachefs.org


A filesystem using complex data structures, with no mention of reliability? It seems like a huge omission. Personally I think simplicity (and reliability, which usually accompanies it) is the most important for a filesystem --- it doesn't matter how fast it is, if it is prone to data loss from bugs or whatever else.


They do all this work, and then choose possibly the most confusing name they could?


Is it mandatory to implement this in the kernel? Is this because linux is a monolith?


No. FUSE literally stands for "Filesystem in Userspace".


Sorry the title of the article is in-kernel file system. Did I miss something?


You asked if it was mandatory to implement this in the kernel because Linux is a monolith. I pointed out that FUSE exists, and allows filesystems to be in userspace. I answered your question. As to some of the silly decisions made by this project? I can't speak to that.



