
Many of these algorithms are bandwidth- or cache-limited on modern machines, so you can get significant speedup by storing your data in fewer bytes, even if you expand it in registers before actually doing computation on it.
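A rough sketch of the "store narrow, expand before computing" idea, assuming IEEE 754 half precision as the compact format. The helper names and sample values are made up for illustration, and plain Python only shows the storage arithmetic, not the speedup itself, which depends on doing the expansion in vector registers:

```python
import struct

def pack_fp16(values):
    # "e" is the struct format code for IEEE 754 binary16 (2 bytes per value).
    return struct.pack(f"<{len(values)}e", *values)

def pack_fp32(values):
    # "f" is IEEE 754 binary32 (4 bytes per value).
    return struct.pack(f"<{len(values)}f", *values)

def unpack_fp16(buf):
    # Expand the stored halves to full-width floats before doing arithmetic.
    return list(struct.unpack(f"<{len(buf) // 2}e", buf))

values = [0.5, -1.25, 3.0, 0.1]   # hypothetical sample data
stored = pack_fp16(values)        # half the bytes of pack_fp32(values)
expanded = unpack_fp16(stored)    # widened copy used for computation
total = sum(expanded)
```

The first three values survive the round trip exactly; 0.1 comes back slightly off, which is the precision you trade for moving half as many bytes.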


We're reaching a point where it's often faster to keep pages compressed in RAM with a fast algorithm like LZ4 and decompress them on access than to copy the same data uncompressed from RAM into L1 cache.


Exactly. In this case, the limit is memory bandwidth.


Wow, that's illuminating. I naively assumed the overhead of converting between datatypes would outweigh whatever you save in cache misses. Does this also have anything to do with the AVX-512 instructions?


Yes, this AVX-512F instruction makes the fp16-to-fp32 conversion efficient:

https://software.intel.com/sites/landingpage/IntrinsicsGuide...

The result is that Winograd convolutions can achieve an effective FMA rate of twice the peak rate of the CPU.

The Winograd transform reduces the required number of FMAs by a factor of 5, but you can only issue FMAs at half the peak rate (because you are bandwidth limited), so you come out ahead by a factor of 5 x 0.5 = 2.5 in theory (2x in practice).

Without fp16, that 2x advantage would be lost.
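The arithmetic above, as a back-of-the-envelope sketch. It assumes the simple model implied by the comment (speedup = reduction in FMAs times the fraction of peak FMA rate you actually sustain); the function name is made up, and the 0.4 figure is only what that model implies for the observed 2x, not a measured number:

```python
def effective_speedup(fma_reduction, fraction_of_peak):
    # Fewer FMAs needed, but each one is issued at a reduced rate.
    return fma_reduction * fraction_of_peak

# 5x fewer FMAs at half the peak rate.
theoretical = effective_speedup(5.0, 0.5)

# Fraction of peak this model implies for the 2x seen in practice.
observed = 2.0
implied_fraction = observed / 5.0
```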



