C++ STL uses trees because the generic requirement for them to work is a < opera...

Sesse__ · on July 22, 2022

> Ironically, due to caches, sorting and then using algorithms that rely on order tend to be superior than most hashing implementations

This doesn't match my experience at all. C++ trees are not cache-friendly; they're pointer-chasing (and there's no arena implementation in the STL). Second, any sort of ordering structure (be it through a tree or through sorting + binary search) is notoriously prone to branch mispredictions. They look great in a microbenchmark if you search for the same element all the time and the trees are small, but in a practical application, it can be incredibly painful. Pretty much by design, every choice is 50–50 and unpredictable.

std::unordered_map from most STLs isn't a fantastic hash table, but in my experience it absolutely destroys std::map (and is also comfortably ahead of something like std::lower_bound on a sorted array). All of this is assuming large-ish N, of course; for small N (e.g. below 30, usually), the best by far is a simple unsorted array and just going through it linearly.

sicp-enjoyer · on July 22, 2022

> C++ trees are not cache-friendly

Agreed, they should use a B-tree to get cache locality and easy generics, but there is legacy code there.

I was referring to the performance of algorithms. For example `std::unique`, `std::lower_bounds`, etc. Many of these use sorted lists, whereas most other languages' standard libraries utilize hashing for these.

> is also comfortably ahead of something like std::lower_bound on a sorted array

I would be interested to learn more about when that's the case. But, it's also not very flexible. You can put an `int` in it, great. Can you put `std::pair<int, int>` in it? Does it work as well?

Sesse__ · on July 22, 2022

B-trees don't work well for CPU caches; 64b lines are typically too small to gain much. (OK, the M1 has 128b lines, but still.) And you still get the branch mispredicts, and a significant increase in code complexity, so hashing is still significantly better. (I've seen C++ projects where the main structure was a B+-tree because the lower levels could be residing on disk, but where they maintained a separate hash table in memory to skip the top first few B-tree levels!)

> I would be interested to learn more about when that's the case. But, it's also not very flexible. You can put an `int` in it, great. Can you put `std::pair<int, int>` in it? Does it work as well?

If by “it” you're talking about std::unordered_map (as key); yes, you can, but you'd have to supply your own hash function, because std::pair<A, B> does not have a std::hash specialization. (It's a bit sad, but the problem is highly specific to std::pair/std::tuple.) Likewise, you can use any arbitrary type you create yourself, as long as it has operator== and you've written a hash function for it.

Shorel · on July 23, 2022

With a modern Intel processor, it was much faster to go through a 10k integers array than sort it first and then do binary search.

The small N is not that small anymore, and the difference speed between stack and heap is bigger now than ever.