This is the "little part" of what fits into an earpiece. Each of those cores is maybe 0.04 square millimeters of die on e.g. a 28nm process. RAM takes some area, but that's dwarfed by the analog and power components and packaging. The marginal cost of the gates making up the processors is effectively zero.
So 1 mm² peppered with those cores at 300 MHz would give you 4 TFLOPS, and a whole 200mm wafer ~100 PFLOPS, like 10 B200s, at less than $3K/wafer. Giving half the area to memory, we'd get 50 PFLOPS with 300 Gb of RAM. Power draw is something like 10-20 kW. So given these numbers, I'd guess Cerebras has tremendous margins and is just printing money :)
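A quick back-of-envelope check of that math (a sketch; the per-mm² throughput, clock, and cost figures are the assumptions above, not measured numbers):

    import math

    flops_per_mm2 = 4e12                         # assumed: 1 mm^2 of cores at 300 MHz ~ 4 TFLOPS
    wafer_area_mm2 = math.pi * (200 / 2) ** 2    # 200mm wafer, ~31,416 mm^2

    wafer_flops = wafer_area_mm2 * flops_per_mm2
    print(f"{wafer_flops / 1e15:.0f} PFLOPS per wafer")  # ~126, i.e. the ~100 PFLOPS ballpark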
Sure, assuming you don't need to connect anything together and that RAM is much tinier than it really is. At 28nm, ~3 megabits per square millimeter is about what you get from SRAM, so an entire wafer only gets you ~12 gigabytes of memory.
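Concretely, with the same 200mm wafer and that rough 28nm density:

    import math

    wafer_area_mm2 = math.pi * (200 / 2) ** 2   # ~31,416 mm^2
    sram_mbit_per_mm2 = 3                       # rough 28nm SRAM density, arrays + overhead

    total_bytes = wafer_area_mm2 * sram_mbit_per_mm2 * 1e6 / 8
    print(f"{total_bytes / 1e9:.1f} GB")        # ~11.8 GB for the *entire* wafer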
And, of course, most of Cerebras' costs are NRE and things like getting heat out of that wafer and power into it.
Same reason Cerebras doesn't use DRAM. The whole point of putting memory close is to increase performance and bandwidth, and DRAM is fundamentally high-latency.
Also, a process that is good at making logic isn't necessarily good at making DRAM. Yes, eDRAM exists, but most designs don't put DRAM on the same die as logic; they stack it or put it off-chip instead.
Almost all single-die microcontrollers have flash+SRAM, and almost all microprocessor cache designs are SRAM (with some designs using off-die L3 DRAM), for these reasons.
>The whole point of putting memory close is to increase performance and bandwidth, and DRAM is fundamentally high-latency.
When the access patterns are well established and understood, as in the case of transformers, you can mitigate latency with prefetching (you could even build a very beefy prefetch pipeline knowing you're targeting transformers), while putting memory on the same die gives you a huge number of data lines and thus huge bandwidth.
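The shape of that prefetch pipeline, as a sketch (prefetch/wait/compute here are hypothetical stand-ins; real hardware would use DMA engines or async copies):

    def prefetch(layer_id):
        return layer_id                  # stand-in: kick off an async weight fetch, return a handle

    def wait(handle):
        return handle                    # stand-in: block until that fetch has landed

    def compute(weights, x):
        return x + 1                     # stand-in for the layer's matmuls

    def run(num_layers, x):
        handle = prefetch(0)                      # fetch layer 0 up front
        for i in range(num_layers):
            w = wait(handle)                      # ideally already done, hidden behind prior compute
            if i + 1 < num_layers:
                handle = prefetch(i + 1)          # overlap the next fetch with this layer's compute
            x = compute(w, x)
        return x

    print(run(12, 0))   # 12 "layers" of dummy work

Since a transformer's layer order is static, the predictor is trivial: the next fetch target is always known in advance.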
With embedded SRAM close by, you get startling amounts of bandwidth: Cerebras claims to attain >2 bytes/FLOP in practice, vs an H200 attaining more like 0.001-0.002 to its external DRAM. So we're talking about a 3-order-of-magnitude difference.
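Where the H200 end of that ratio comes from, roughly (spec-sheet ballpark: ~4.8 TB/s of HBM3e bandwidth against ~2 PFLOPS of dense FP8, about double with sparsity):

    hbm_bw = 4.8e12          # bytes/s, H200 HBM3e
    fp8_dense = 2.0e15       # FLOP/s, dense FP8 (ballpark)

    print(hbm_bw / fp8_dense)        # ~0.0024 bytes/FLOP dense
    print(hbm_bw / (2 * fp8_dense))  # ~0.0012 with sparsity, i.e. the 0.001-0.002 range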
Would it be a little better with on-wafer distributed DRAM and sophisticated prefetch? Sure, but it wouldn't match SRAM, and you'd end up with a lot more interconnect and associated logic. And, of course, there's no clear path to embedding DRAM cells in a leading-edge logic process.
It's also why you batch for inference on an H200, whereas Cerebras can get full performance at very small batch sizes.
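A toy model of why batching matters on the GPU side (all numbers hypothetical/ballpark; it ignores KV-cache and attention traffic): each decode step streams the weights once, every token in the batch shares that read, so tokens/s grows with batch size until compute becomes the limit.

    def decode_tokens_per_s(params, batch, bw=4.8e12, flops=2.0e15, bytes_per_param=1):
        t_mem = params * bytes_per_param / bw    # weight traffic, shared by the whole batch
        t_compute = 2 * params * batch / flops   # ~2 FLOPs per parameter per token
        return batch / max(t_mem, t_compute)

    for b in (1, 8, 64, 512):
        print(b, round(decode_tokens_per_s(70e9, b)))   # hypothetical 70B FP8 model
    # 1 -> ~69 tok/s, 512 -> ~14,286 tok/s: ~200x from batching alone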