Ah I see, we are going with “well technically it stores something therefore it i...

nieve · on May 24, 2020

If memory serves, the original EToys.com code treated the filesystem as a tree-structured database using atomic operations (though no transactions). It worked just fine; then the rewrite on an RDBMS, which should have been more stable and faster, resulted in the famous meltdowns. Admittedly this is cheating a bit, since you can name folders and files with semi-arbitrary or internally structured string keys. By 1997 standards, pure disk access without having to walk the filesystem hierarchy was blazingly fast compared to many of the databases I was using.

[Source: I was friends with the guy who wrote it as well as other EToys employees. God that was a trainwreck.]

__Joker · on May 24, 2020

Interesting. Is there a blog post discussing this in detail? If not, would you be kind enough to go into more detail?

nieve · on May 24, 2020

I don't think anyone posted about their particular system, but the technique is not unknown now. If you google "filesystem as a database" there are some relevant hits. One super simple version, probably not ideal but at least balanced, uses a hash of some primary key (like a customer row ID) as the file index, then partitions the items into directories based on successive parts of the hash, creating either all permutations at each level or only the populated ones. For example, an item key that hashes to a32c4214585e9cb7a55474133a5fc986 would be located somewhere like this:

  a32c/4214/585e/9cb7/a554/74133a5fc986
    a32c/
      4214/
        585e/
          9cb7/
            a554/
              74133a5fc986

The advantage of this kind of structure is that you never need to manually scan a directory, since you know exactly what path you're trying to open. You still incur the OS lookup time for the inode-equivalent in the directory entry, but a deeper hierarchy keeps that faster. You can trade off time to traverse the hierarchy versus number of entries in the final directories by adjusting the length of the hash chunk you use at each level. Two characters per chunk will put vastly fewer entries at a given level, but vastly increase your directory depth.

Basically, if you're manually scanning the hierarchy for anything but a consistency check or garbage collection, you've already lost.
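
To make the layout concrete, here is a minimal Python sketch of the key-to-path mapping described above. It is not the eToys code; the function names, the use of MD5, and the default of five 4-character levels are assumptions chosen only to reproduce the example path.

```python
import hashlib
from pathlib import Path


def path_for_digest(root, digest, chunk=4, levels=5):
    """Split a hex digest into `levels` directories of `chunk` characters each.

    Whatever is left of the digest becomes the file name.
    (Hypothetical helper; the parameter names are illustrative.)
    """
    if chunk * levels >= len(digest):
        raise ValueError("digest too short for this chunk/levels combination")
    parts = [digest[i * chunk:(i + 1) * chunk] for i in range(levels)]
    rest = digest[levels * chunk:]
    return Path(root).joinpath(*parts, rest)


def path_for_key(root, key, chunk=4, levels=5):
    """Hash an item key and map it to its on-disk location.

    MD5 is assumed here only because it yields a 32-character hex digest
    like the one in the example; any well-distributed hash would do.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return path_for_digest(root, digest, chunk, levels)


# The digest from the example maps to the path shown above.
print(path_for_digest("data", "a32c4214585e9cb7a55474133a5fc986").as_posix())
```

Shrinking `chunk` bounds each directory at fewer entries (at most 256 for two hex characters versus 65,536 for four), while raising `levels` spreads the same number of items over a deeper tree.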

andreareina · on May 24, 2020

That's how git stores its objects:

    18:35 $ tree .git/objects/
    .git/objects/
    ├── 02
    │   └── 9581d0c8ecb87cf1771afc0b4c2f1d9f7bfa82
    ├── 3b
    │   └── 97b950623230bd218cef6aebd983eb826b2078
    (...)
    ├── info
    └── pack
        ├── pack-b1fe2364423805afb6b1c03be0811c93b19dedc9.idx
        └── pack-b1fe2364423805afb6b1c03be0811c93b19dedc9.pack

    10 directories, 10 files
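
As a rough illustration of that layout, the sketch below computes where git would place a loose blob object in a SHA-1 repository: the object ID is the SHA-1 of a `blob <size>` header, a NUL byte, and the content, and the first two hex characters become the directory name. The function name is made up for this example, and it only computes the path; the file git actually writes there is zlib-compressed.

```python
import hashlib


def git_blob_object_path(content: bytes) -> str:
    """Return the loose-object path for a blob in a SHA-1 repository.

    Hypothetical helper for illustration; it does not write anything.
    """
    header = b"blob %d\x00" % len(content)
    oid = hashlib.sha1(header + content).hexdigest()
    # One level of fan-out: first two hex characters, then the remaining 38.
    return f".git/objects/{oid[:2]}/{oid[2:]}"


print(git_blob_object_path(b""))
```

Git uses only a single two-character level, which caps the fan-out at 256 directories; objects that have been packed live under `pack/` instead.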

nieve · on May 24, 2020

One important note: make sure you carefully consider using atomic renames and such for manipulating the files! Overwrite in place is a great way to end up with a corrupted item if something goes desperately wrong and you're not protected by COW or data journaling.
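
A minimal sketch of the write-then-rename pattern in Python, assuming a POSIX-like filesystem where a rename within one directory is atomic. The `atomic_write` name and the `.tmp-` prefix are illustrative, not taken from any system mentioned in this thread.

```python
import os
import tempfile


def atomic_write(path, data: bytes):
    """Replace `path` with `data` so readers see the old or the new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem) as the
    # target, otherwise the final rename cannot be atomic.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        os.replace(tmp, path)  # atomically swaps the new file into place
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
```

This sketch does not fsync the containing directory, so after a power loss the rename itself may not have been persisted; the item would then still hold its old, intact contents.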

zaphar · on May 24, 2020

Usually you write these sorts of things as append only with an optional garbage collect. You get a minimal sort of Atomicity with that.
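
For comparison, here is a small sketch of the append-only approach with an optional compaction pass. The JSON-lines record format and the function names are assumptions made for illustration, and it assumes a single writer.

```python
import json
import os


def append_record(log_path, key, value):
    """Append one key/value record; existing bytes are never modified."""
    line = json.dumps({"k": key, "v": value}) + "\n"
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(line)
        f.flush()
        os.fsync(f.fileno())


def load(log_path):
    """Replay the log; later records for a key win over earlier ones."""
    state = {}
    try:
        with open(log_path, encoding="utf-8") as f:
            for line in f:
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    break  # torn write at the tail: stop replaying here
                state[rec["k"]] = rec["v"]
    except FileNotFoundError:
        pass
    return state


def compact(log_path):
    """Garbage collect: rewrite the log with only the latest value per key."""
    state = load(log_path)
    tmp = log_path + ".compact"
    with open(tmp, "w", encoding="utf-8") as f:
        for key, value in state.items():
            f.write(json.dumps({"k": key, "v": value}) + "\n")
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, log_path)  # swap the compacted log into place
```

The minimal atomicity comes from the replay: a record either parses completely or is treated as the torn end of the log and ignored.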

akamaozu · on May 24, 2020

I was thinking of doing something similar as a lightweight embedded datastore: apply structure to the file system like you would a redis key.

Would love to talk to anyone on the EToys team or anyone who has done something similar.

I'm @akamaozu on twitter.

nieve · on May 24, 2020

Unfortunately eToys imploded a couple of years later (2001) and there were only a few people involved at that stage, so it's possible none of them are in the industry anymore. You might start by looking at email servers; I believe there are a few that use a deeply nested directory hierarchy for much the same reasons. IIRC Apple also does something similar with the interior of the sparsebundles used in Time Machine backups, but I don't know if any of that code is open source.

m0zg · on May 23, 2020

You laugh, but I bet Excel produces orders of magnitude more real "business intelligence" than all other "BI" tools combined.

GuB-42 · on May 24, 2020

Here is an anecdote.

I had to work on a tool that shows what's wrong with an assembly line: missing parts, delays, etc... So that management can take corrective action. Typical "BI" stuff but in a more industrial setting.

The company went all out on new technologies. Web front-end, responsive design, "big data", distributed computing, etc... My job was to use PySpark to extract indicators from a variety of data sources. Nothing complex, but the development environment was so terrible it turned the most simple task into a challenge.

One day, the project manager (sorry, "scrum master") came in, opened an Excel sheet, imported the data sets, and in about 5 minutes, showed me what I had to do. It took me several days to implement...

So basically, my manager with Excel was hundreds of times more efficient than I was with all that shiny new technology.

That experience made me respect Excel and people who know how to use it a lot more, and modern stacks a lot less.

I am fully aware that Excel is not always the right tool for the job, and that modern stacks have a place. For example, Excel does not scale, but there are cases where you don't need scalability. An assembly line isn't going to start processing 100x more parts anytime soon, and one that does will be very different. There are physical limits.

kqr · on May 24, 2020

I think you drew the right conclusion from your experience, but I also want to point out that building the first prototype is always anywhere from one to three orders of magnitude easier than building the actual product.

The devil is in the details, and software is nothing but details. The product owner at the company I work for likens it (somewhat illogically, but it works) to constructing walls. You can either pick whatever stones you have lying around, and then you'll spend a lot of time trying to fit them together and you'll have a hell of a time trying to repair the wall when a section breaks. Or you can build it from perfectly rectangular bricks, and it will be easy to make it taller one layer at a time.

Using whatever rocks you have lying around is like building a prototype in Excel. Carefully crafting layers of abstraction using proper software engineering procedures means taking the time to make those rectangular bricks before building the wall. The end result is more predictable when life happens to the wall.

yomly · on May 24, 2020

Well, in these situations, the implicit ask of your company (I've been there myself) is basically to rebuild Excel, but to trade some of its power and flexibility for safety and to move the risk of error away from front-end users (i.e. onto the back-end developers).

Unfortunately, which specific features of Excel are acceptable to remove is unknown until you have already way over-invested in the project.

The best I've seen this done is having Excel as a client for your data store, where read access is straightforward and writes can be done via CSV upload (with heavy validation and maybe history rollback).
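
As a sketch of what the "heavy validation" step might look like, the Python below checks a whole upload before accepting any of it. The column names `part_id` and `quantity` and the specific rules are hypothetical; real uploads would have their own schema.

```python
import csv
import io

# Hypothetical schema for the upload; not taken from any real system.
REQUIRED_COLUMNS = ("part_id", "quantity")


def validate_csv(text):
    """Return (rows, errors); rows is empty unless every line is valid."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [], ["missing columns: " + ", ".join(missing)]

    rows, errors = [], []
    for row in reader:
        line = reader.line_num  # physical line of the row just read
        part_id = (row["part_id"] or "").strip()
        raw_qty = (row["quantity"] or "").strip()
        if not part_id:
            errors.append(f"line {line}: empty part_id")
        try:
            quantity = int(raw_qty)
            if quantity < 0:
                raise ValueError
        except ValueError:
            errors.append(f"line {line}: quantity must be a non-negative integer")
            continue
        if part_id:
            rows.append({"part_id": part_id, "quantity": quantity})

    # All or nothing: a single bad line rejects the whole upload.
    return ([] if errors else rows), errors
```

Rejecting the whole file on any error keeps the data store from ending up half-updated, and the per-line messages give the spreadsheet user something concrete to fix.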

That way the business can self-service every permutation of dashboard/report they need, and only when a very specific use case arises do you need to start putting engineering effort behind it.

I suppose you can also supplement the Excel workflow with a pared-down CRUD interface for the inevitable employee allergic to Excel.

tjalfi · on May 24, 2020

I posted elsewhere[0] in this thread about my employer's successful practice of replacing shared spreadsheets with web applications.

Here is another option that we use instead of CSV import.

Our applications support custom reports and custom fields.

Users can define new reports and run them on demand.

They can also define custom field types with validation, data entry support, etc.

This combination provides some of the extensibility of Excel while retaining the advantages of an application.
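
One way to picture a user-defined field type with validation is the Python sketch below. It is purely illustrative and says nothing about how the applications described here are built; the `CustomField` class and the `torque_nm` field are invented for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class CustomField:
    """A user-defined field: a name, a parser, and a validation rule."""
    name: str
    parse: Callable[[str], Any]                      # converts raw text, may raise ValueError
    check: Callable[[Any], bool] = lambda value: True  # extra rule on the parsed value

    def validate(self, raw: str) -> Any:
        value = self.parse(raw)
        if not self.check(value):
            raise ValueError(f"{self.name}: value {value!r} rejected")
        return value


# Hypothetical field a user might define: a torque reading in newton-metres.
torque = CustomField("torque_nm", float, lambda v: 0 < v <= 500)
print(torque.validate("12.5"))
```

Storing the parser and the rule alongside the field name lets the application validate input for fields it did not know about at build time.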

Edited for wording changes.

[0] https://news.ycombinator.com/item?id=23292374

jtdev · on May 24, 2020

...And orders of magnitude more wasted time and capital due to inaccurate and isolated data.

tjalfi · on May 24, 2020

People use what they know to solve the problems they have.

You can complain about their solution or see it as an opportunity.

I posted elsewhere[0] in this thread about my employer's practice of replacing shared spreadsheets with web applications.

This approach works quite well for us and I would encourage you to consider it as an option.

[0] https://news.ycombinator.com/item?id=23292374

capsulecorp · on May 24, 2020

You bet, but I'd really love to see data that supports that.

goatinaboat · on May 23, 2020

> well technically it stores something therefore it is database joke

Confluent, the company behind Kafka, are 100% serious about Kafka being a database. It is however a far better database than MongoDB.

tjalfi · on May 24, 2020

Excel can be an excellent source of new line-of-business applications.

Many of my employer's applications started out as a shared spreadsheet or Access database.

Our development team worked with the users and built a web application to solve the same problem.

This approach has a lot of advantages:

* The market exists and has an incumbent. There's a lower risk of a write-off.

* The users are open to process changes. You still have to migrate people off of the spreadsheet, though.

* It's easy to add value with reporting, error checking, concurrent access, and access control.

* You can import the existing data to make the transition easier. This will require a lot of data cleaning.

Edited to add the following text from another post.

You can cover most of the requirements with a set of fixed fields.

The last 10% to 20% of the use cases require custom reports and custom fields.

Users should be able to define their own reports and run them without your involvement.

They should also be able to define custom field types with validation, data entry support, etc.

If your web application has these two features and other advantages then you should be able to replace Excel.