BetrFS: An in-kernel file system that uses Bε trees to organize on-disk storage (betrfs.org)
167 points by espeed on Oct 13, 2018 | 46 comments


This is really interesting...

... so I'm really a bit bewildered and sad by some of the engineering choices they made along the way, like requiring a modified kernel: https://github.com/oscarlab/betrfs/blob/master/README.md#com...

The reasons are detailed just above that link target, but are somewhat absurd IMHO: they modified the kernel's `struct task_struct` to pass error values, rather than fixing one of their libraries to handle its own error values. In return, nobody ever gets to use this without patching a kernel (which is, by the way, so ancient of a fork at this point that Forget It).

I know it's unfair for some random internet person to complain about the engineering choices of a project like this, and I'm sure I don't know the influences and tradeoffs first hand, etc., etc., but... ow. I would've tried this.

But not on a 3.11 kernel fork.

I hope someone can take all the awesome research here and steer it towards something slightly closer to a product.


All the reasons given boil down to “we didn’t want to mess with TokuDB”. The build process (requiring specific GCC versions, CMake, and a bunch of packages including Valgrind), the errno patch to task_struct, and the half dozen libc stub functions are all there because TokuDB expects userspace libc.

I don’t know whether the authors ever attempted to patch TokuDB itself (or streamline it down to the essentials for the kernel). Instead they appear to have just taken the entire userspace-designed library and hacked the kernel until it fit. It’s a halfway-decent strategy if your goal is to get the thing off the ground as quickly as possible, but obviously a real implementation would have to ship a modified TokuDB instead (it’s OSS, so it should be hackable!)

This is pretty common in academia, sadly. As an academic who has released some academic OSS code myself, I can say that often there’s just not enough time or motivation to fix a blob of code into a generally usable format. Often it’s released just so that we can say “hey we open-sourced it so other researchers can build on it/replicate our results”. This may be one of the reasons why academic ideas don’t make it out to the real world that quickly.


Note that some of the people involved are also the original authors of TokuDB [0], including Bradley Kuszmaul [1] , Michael Bender [2], and Martin Farach-Colton [3] who were the founders of Tokutek (acquired by Percona in 2015).

Previous discussion with the authors: A Comparison of Log-Structured Merge (LSM) and Fractal Tree Indexing https://news.ycombinator.com/item?id=9248298

[0] TokuDB Fractal Tree Index https://github.com/Tokutek/ft-index TokuDB Engine https://github.com/percona/tokudb-engine

[1] Bradley Kuszmaul http://people.csail.mit.edu/bradley/startups.html

[2] Michael Bender https://www3.cs.stonybrook.edu/~bender/

[3] Martin Farach-Colton https://www.cs.rutgers.edu/~farach/


Perhaps TokuDB was evolving rapidly at the time they were trying to incorporate it into this filesystem.

It seems like the choices are:

1. Fork TokuDB.

2. Fork the kernel.

If the goal is research, the first option slows down the research itself. The second only gets you stuck on an old kernel, which, while not great for end users, isn't as big a problem for a researcher as having their research slowed down.


TokuDB seems to be written in C++. It's neat that they managed to make that work in-kernel.

Also there are patent notices in the TokuDB files.

So the probability of it being in Linux at some point approaches zero. And with the patents they may even have made the whole technique unviable for any Linux fs (though I haven't looked at them in detail, obviously).


See Bradley's comment regarding licensing the fractal tree code as GPLv2 with a patent provision: https://news.ycombinator.com/item?id=18208209


This isn't completely fair, although most of it is.

Some of the reasons for modifying Linux are performance enhancements to the filesystem layer that only make sense with a Bε-tree filesystem (and are not possible without patching Linux). They cover this in the hour-long talk at MS research linked elsewhere in this thread.[1] (Yes, it's long, but it's a pretty good presentation.)

E.g., they describe modifying the page cache to write-through small modifications to file data, rather than dirtying the entire page and writing it back later (a form of write amplification).

[1]: https://www.youtube.com/watch?v=fBt5NuNsoII


> E.g., they describe modifying the page cache to write-through small modifications to file data, rather than dirtying the entire page and writing it back later (a form of write amplification).

Considering that all the storage on the market now has sectors at least as large as a 4k page, this isn't actually reducing write amplification. At most, in some cases it might save a tiny bit of bus traffic.


They batch small edits into a log. Many edits to one sector written.
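A toy sketch of that arithmetic (hypothetical numbers, nothing from the actual BetrFS code): appending small edits to a shared log lets many of them fit in one 4K block, whereas dirtying a full page per edit writes back a whole block each time.

```python
BLOCK = 4096  # bytes per block/sector, matching modern storage

def blocks_written_page_per_edit(edits):
    # page-cache behavior described above: each small edit dirties
    # (and eventually writes back) an entire page
    return len(edits)

def blocks_written_logged(edits):
    # edits appended to a shared log; blocks = total bytes, rounded up
    total_bytes = sum(len(e) for e in edits)
    return -(-total_bytes // BLOCK)  # ceiling division

edits = [b"x" * 100] * 50  # fifty 100-byte edits
print(blocks_written_page_per_edit(edits))  # 50 blocks
print(blocks_written_logged(edits))         # 2 blocks for 5000 bytes
```

Same 5000 bytes of user data either way; the log just writes them in 2 blocks instead of 50.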


It makes sense, V1 is the hack (speed) and if it gets traction then you push for a V2 with different constraints and more market information (leverage).


Feels more like an academic project to prove a point, than an actual usable file system.


I don't think any reasonable academic wants anyone to actually use the prototype of their file system with irreplaceable (i.e. personal) data. Perhaps requiring a patched kernel is this project's way of making sure that that doesn't happen.


But also, you have to build TokuDB, using gcc-4.7 specifically, and then

> We import TokuDB as a binary blob, and overwrite TokuDB symbols using symbols from these files.

Pretty weird ...


I wouldn't actually apply such an invasive patch on any system I ever cared about anyway, but I wonder if it would work on the RHEL/CentOS kernel, which has a supposed version of 3.10?


It has very little to do with mainline 3.10; for instance, it includes XFS v5, which was introduced in kernel 3.16.


Hence "supposed version"; it's my impression that they backported a non-trivial amount from as far forward as the 4.x series.


Michael Bender, one of the people behind this, gave an excellent invited talk on B^epsilon trees and the possibilities that write-optimised data structures introduce (especially in data base systems) at IPDPS this year. Unfortunately it wasn’t recorded as far as I’m aware, but the slides are available at http://ipdps.org/ipdps2018/bender-ipdps2018-wods.pdf. A more formal introduction to B^epsilon trees is http://supertech.csail.mit.edu/papers/BenderFaJa15.pdf
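Since the talk wasn't recorded, here's the one-paragraph version as a toy sketch (purely illustrative Python, nothing from TokuDB or BetrFS): a Bε-tree interior node spends part of its space on a buffer of pending messages; writes land in the buffer, and when it fills, the messages are flushed down to the children in a batch, which is what turns many small random writes into a few large ones.

```python
BUFFER_SIZE = 4  # messages an interior node holds before flushing

class Node:
    def __init__(self, leaf=True):
        self.leaf = leaf
        self.buffer = []     # pending (key, value) messages (interior)
        self.data = {}       # applied key/value pairs (leaf)
        self.pivot = None    # split key (interior)
        self.children = []   # [left, right] (interior)

    def upsert(self, key, value):
        if self.leaf:
            self.data[key] = value
            return
        self.buffer.append((key, value))
        if len(self.buffer) >= BUFFER_SIZE:
            self.flush()

    def flush(self):
        # push buffered messages down to the matching child in one batch
        for key, value in self.buffer:
            child = self.children[0] if key < self.pivot else self.children[1]
            child.upsert(key, value)
        self.buffer = []

    def query(self, key):
        if self.leaf:
            return self.data.get(key)
        # messages still in the buffer shadow older values below
        for k, v in reversed(self.buffer):
            if k == key:
                return v
        child = self.children[0] if key < self.pivot else self.children[1]
        return child.query(key)

# two leaves under one buffering root
root = Node(leaf=False)
root.pivot = "m"
root.children = [Node(), Node()]

for k in ["a", "z", "b", "q"]:  # 4th upsert triggers a batched flush
    root.upsert(k, k.upper())

print(root.query("a"))  # "A" (flushed to the left leaf)
print(root.query("q"))  # "Q"
```

The ε in the name is the knob that splits node space between pivots and buffer; a real implementation also cascades flushes down multiple levels and handles node splits, which this sketch skips entirely.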


Here's the talk Bradley Kuszmaul [1] gave to MIT 6.172 in 2010...

How TokuDB Fractal Tree Indexes Work https://www.youtube.com/watch?v=9Rb85cOXTKU&t=202s

[1] https://people.csail.mit.edu/bradley/


Talk given by Rob Johnson [1] at MSR a few years back...

BetrFS: A Right-Optimized Write-Optimized File System https://www.youtube.com/watch?v=fBt5NuNsoII

[1] http://www3.cs.stonybrook.edu/~rob/


That's a great talk, thanks!


The website doesn't seem to mention that several of the papers on the filesystem won best-paper awards at major conferences. The paper, Optimizing Every Operation in a Write-Optimized File System, in particular, won best-paper award at FAST '16.


Btrfs vs betrfs... This is going to cause so much confusion

Ah, but the project isn't new, so I guess it's not a new problem


We're slowly converging toward correct spelling though.


I had guessed they replaced btrfs's B-trees with Bε-trees. Anyway, I don't think there is much cause for concern -- this seems like academic/corporate abandonware, more a proof of concept than anything else.


BtrFS stands for B-tree File System. BetrFS stands for B^{\epsilon}-tree File System. They are both named after their primary data structure.


AAAAAAAH

I always wondered why on earth they named a filesystem after Soviet armored personnel carriers. But no, it's "betterfs" without the vowels. Got it.


Pronounce btrfs as “butter fs” and betrfs as “bee-turr fs” or something, then there is no confusion ;)

But yeah I agree, betrfs and btrfs are way too similar names.


I can’t believe it’s not butrfs.


You sound a bit bitrfs.


I think it's intended to be pronounced as "better fs"


What's the state of this file system? Is it in the Linux kernel? In some BSDs? Both the main page and the FAQ talk about "the kernel" without saying which kernel it is.

How reliable is it? Are there file system checkers for it? Does it support snapshots?


There's a comment upthread that it works on a very much patched Linux kernel, so it certainly isn't upstreamed.


What's the story with patent/license on fractal tree by tokutek (now percona)? Can you use it for personal use only? Ie. you can't use it at work without a license? What about Bε-trees - are there patent free implementations?


Tokutek licensed the fractal tree under GPLv2 with an explicit patent license to make clear that anyone could use the fractal tree code. I don't know what Percona did after the acquisition.


Any idea what year the patent clock times out?


2027


Sorry, I have to say this is a poor and unpersuasive case for the architecture.

Performance increases are everything, yet unless I’m missing it there is no way to know what the improvements are.

For example, for the data in the chart here (http://www.betrfs.org/faq.html), is enough information provided to reproduce the results, including hardware and configuration? If not, the results are utterly meaningless.

Forward looking, it would strain credibility if you hadn’t tried to get a sense of the real world gains on Intel XPoint storage. I would speculate it will not be many years before there are no new green field deployments of storage that spins around in a circle.


First off, why would you use a research file system on enterprise hardware? Secondly, if you read just a little bit further you see that they are testing on spinning rust (HDDs). Since they mention threads, I would hope they are testing on something that supports hardware parallelism, so probably some relatively decent modern hardware. Mind you, this is all speculation, though it's the conclusion you could draw by reading the FAQ carefully.


If you describe results of an experiment, you need to provide the detail or pointer to how to reproduce it.

Benchmarking on enterprise hardware is completely relevant because that is the technology that is going to become dominant over the next few years; if the performance gains do not show up on that type of technology, they may not be significant.

In some cases enterprise hardware is much different from what would be used in other scenarios or at massive scale, similar to how Facebook doesn't go down and buy enterprise servers to run its data centers. In this case, however, the new generation of memory/storage hybrids will not be that different whether it's in your laptop or in a server, size, features, and scalability notwithstanding, of course.

If I read an FAQ answer that doesn't have an asterisk or a pointer to the full information, it's not my responsibility as the reader to go hunting around for details. That is the job of the author of the paper or website. You don't get to brag without putting the details out there.


I wonder how this compares to bcachefs: https://bcachefs.org


A filesystem using complex data structures, with no mention of reliability? It seems like a huge omission. Personally I think simplicity (and reliability, which usually accompanies it) is the most important for a filesystem --- it doesn't matter how fast it is, if it is prone to data loss from bugs or whatever else.


They do all this work, and then choose possibly the most confusing name they could?


Is it mandatory to implement this in the kernel? Is this because linux is a monolith?


No. FUSE literally stands for "Filesystem in Userspace".


Sorry the title of the article is in-kernel file system. Did I miss something?


You asked if it was mandatory to implement this in the kernel because Linux is a monolith. I pointed out that FUSE exists, and allows filesystems to be in userspace. I answered your question. As to some of the silly decisions made by this project? I can't speak to that.



