l1k · 2026-06-04T12:39:36 1780576776

This is using Thunderbolt networking as transport, which incurs a bit of overhead.

But starting with the upcoming Linux v7.2, there's a new feature called USB4STREAM to use raw Thunderbolt packets as transport with minimum overhead and a super simple user interface:

https://lore.kernel.org/r/20260511102744.1867485-1-mika.west...

Release of v7.2-rc1 is predicted for Jul 5, that's when this will first be available as a tarball. Until then you have to clone from thunderbolt.git/next:

https://git.kernel.org/pub/scm/linux/kernel/git/westeri/thun...

Or alternatively linux-next:

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-n...

Press coverage:

https://www.phoronix.com/news/Intel-Linux-USB4STREAM

grw_ · 2026-06-04T16:12:09 1780589529

author here! it's not on top of USB4NET, no (RXE can already do that, it's compared in the benchmarks). it's built with the same tb primitives as the networking stack obviously, just assembled differently to emulate a verbs device instead of a nic. happy to answer any other q!

eqvinox · 2026-06-04T19:03:52 1780599832

A bit unclear, maybe I missed it in the article: how much of InfiniBand is there? Is it just the verbs interface? Or are there some (higher) layers of InfiniBand actually carried as bits on the cable? It can't be "bridged" into actual InfiniBand, right?

grw_ · 2026-06-04T20:12:56 1780603976

I actually didn't know there was more to InfiniBand than verbs (at least at this abstraction level, above PHY), so probably the answer is 'not much more'. The device imitates a RoCE V2 device and the higher level abstractions I used on top were GPU-ish libraries like NCCL and JACCL.

Good q about 'bridging into actual InfiniBand', I don't know the answer there either. My naive understanding would be that: since this is host-initiated RDMA (it's still the host cpu invoking into dma buffers, though they may be device-memory mapped), actually it should work fine, at least between two machines? I'm curious enough to try- I have a couple of machines with thunderbolt AND RoCE-capable NICs- the experiment is to see if we can use this across diverse transports simultaneously? I think this is what it does already (since the MacOS FA57 vs linux native are already 'different transports'), but say if you have a better scenario to demonstrate what 'bridging into actual infiniband' would look like!

eqvinox · 2026-06-04T21:57:16 1780610236

InfiniBand is its entire own networking standard, if you have Mellanox NICs you can switch them into IB mode and... short version, it's not Ethernet anymore. It's not even the same speeds/baud rates (e.g. there is a FDR rate at 14.0625Gbaud.) (NB: InfiniBand is indeed not RoCE, that E is Ethernet. InfiniBand had RDMA way before RoCE became a thing; probably why its APIs are being used for it.)

It sounds like you're really just doing the IB verbs (which is kinda really RDMA verbs). I don't think any kind of "bridging" (other than IP routing) is really possible (you'd need a chip that understands both TB and IB and can somehow translate RDMA requests between the two.)

grw_ · 2026-06-05T00:06:39 1780617999

Ah right, yes- I think we're talking about the same thing- this driver just chooses to pretend to be a RoCE v2 device (instead of e.g. an MLX NIC in IB mode), but nothing would change if it did, I think. Or at least that's what the libibverbs abstraction promises.

There's no IB OR Ethernet underneath- I could have implemented this properly as its own distinct transport kind, but it seemed easier just to pretend to be something that is already known.

The 'chip that understands both TB and IB and can translate RDMA requests between the two' in this instance is your CPU, so orders-of-magnitude worse latency than an ASIC, but still better than anything on top of IP/Ethernet. I think there's also potential to do device-initiated RDMA, where e.g. the GPU itself can write to some mailbox and have the message appear across the abstracted transport in another GPU's mailbox. Even if the CPU is involved in shuffling pointers across mailboxes, it doesn't necessarily mean it'll be a bottleneck.

eqvinox · 2026-06-04T14:19:34 1780582774

> This is using Thunderbolt networking as transport,

Are you sure? It doesn't sound like it in some places in the text, e.g.:

>> a kernel driver that sits alongside thunderbolt-net, allocating DMA rings from the controller's NHI port in the same way

but I don't have the domain knowledge to tell…

adrian_b · 2026-06-04T15:24:52 1780586692

Yes, the description from TFA does not match the traditional Thunderbolt networking protocol, whose performance may be as low as that of a 10 Gb/s Ethernet interface.

The description from TFA matches what the poster above you said about a new Linux device driver that allows access to the raw Thunderbolt protocol for transferring data between computers. This appears to be an independent implementation of the same principle as in the device driver that will be merged into mainline Linux.

While the official Linux device driver makes the raw Thunderbolt appear like a file, which can be written and read to transfer data, this implementation emulates an InfiniBand interface, which presumably was simpler to use for distributing work over multiple GPUs.

They actually mention that with traditional Thunderbolt networking on the same computers, they had obtained only 9 Gb/s, i.e. more than 5 times slower than what they obtained with raw Thunderbolt.

scottlamb · 2026-06-04T17:59:09 1780595949

> traditional Thunderbolt networking protocol ... performance may be as low as that of a 10 Gb/s Ethernet interface.

Ouch. Why so much lower than the physical bandwidth (or what they've achieved here)?

grw_ · 2026-06-04T18:21:39 1780597299

A USB4 40Gbps cable consists of two 20G tx/rx pairs. The in-kernel networking implementation is single-stream and just uses one pair, and won't e.g. stripe across both pairs or across multiple cables, which was the main bandwidth unlock in TFA. Doing so would be a much more complicated undertaking, since now you've re-introduced out-of-order delivery, which complicates re-assembly of large packets, retries, handling loss etc. The verbs interface is a lot simpler than that of a full IP stack, so although it was possible to get this working across rails, it may not be so simple for something pretending to be ethernet.
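To make the out-of-order point concrete, here is a toy sketch in Python of striping a payload across two rails and putting it back together on the other side. It is not the driver from TFA: the function names, the 4-byte chunk size and the sequence-number framing are all made up for illustration, and it ignores loss and retries entirely.

```python
import random

def stripe(payload: bytes, rails: int = 2, chunk_size: int = 4):
    """Split payload into (seq, chunk) frames, round-robin across rails."""
    queues = [[] for _ in range(rails)]
    for seq, off in enumerate(range(0, len(payload), chunk_size)):
        queues[seq % rails].append((seq, payload[off:off + chunk_size]))
    return queues

def interleave(queues, rng):
    """Simulate arrival: each rail stays in order, but rails race each other."""
    pending = [list(q) for q in queues]
    arrived = []
    while any(pending):
        rail = rng.choice([i for i, q in enumerate(pending) if q])
        arrived.append(pending[rail].pop(0))
    return arrived

def reassemble(frames):
    """Receiver side: buffer early frames, release them in sequence order."""
    buffered, next_seq, out = {}, 0, bytearray()
    for seq, chunk in frames:
        buffered[seq] = chunk
        while next_seq in buffered:
            out += buffered.pop(next_seq)
            next_seq += 1
    return bytes(out)
```

Even in this lossless toy the receiver needs a reorder buffer keyed by sequence number; once frames can be dropped, it also needs timeouts and retransmission on top.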

scottlamb · 2026-06-04T18:54:32 1780599272

> now you've re-introduced out-of-order delivery which complicates re-assembly of large packets, retries, handling loss etc.

Still confused though. For a standard TCP/IP networking stack, that support is all there anyway, as it's not meant for point-to-point links, and out-of-order delivery is a thing that happens on the Internet. I haven't tried thunderbolt-net, but it says it implements Apple's ThunderboltIP, so I'd expect it's IP-based networking on top, and so it'd all work? Is it that out-of-order delivery is far more common than usual, and this path is so much slower (by impairing LRO/GRO) that it's not worth aggregating at all?

I'd understand if each pair is logically represented as a separate networking device, and then you have to set up link aggregation on top of that. (And iirc at least with some forms of aggregation a particular flow is bound to one link, so you'd have to have a bunch of streams to actually get bandwidth benefits.) So caveats for sure but I'd expect something to be possible. But does it just not support using both pairs at all?

Even with using one pair I still don't understand why you'd only get about 10G rather than 20G on a pair. I do see chapter 4 of the (your?) article talks about the single DMA ring maybe imposing the 10 Gbps limit but I don't have any good intuition for why. I don't know say how large the rings are or what latencies to expect on their operations or what packet sizes are supported which might help me understand.

grw_ · 2026-06-04T20:29:00 1780604940

Yeah, thunderbolt-net is IP on top and it does work as you say, with a few caveats:

- On a single cable with two rails available, the thunderbolt-net grabs one and uses that. Without patching the kernel, there's no way to make it present a second interface using the remaining pair.

- If you had a second cable between the machines (for 4 total rails), thunderbolt-net will still only grab one rail, because the abstraction across which it's making the links sees an identical peer at the end of both links and so falls into the same trap as above. There is no LRO/GRO anyway (or it's buggy- I forget) on the linux version.

- Why you only get 10G rather than 20G on single pair- actually, this might be something specific to the Strix Halo SoC that I was testing on- on a different (still AMD) chipset and an Apple TB5 Mac I did see closer to 22G in one direction, but still 8 in the other. The Strix Halo NHI seems to be 'stripped down' (as expected, for mobile) in ways I don't really understand.

- Intuition on why- I can't point you to the line number, but I think it has to do with a fixed 4kb page size when communicating with the NHI that ends up becoming a bottleneck, perhaps 16kb pages on aarch64 apple help here?

scottlamb · 2026-06-05T15:52:46 1780674766

Ugh, yeah, gross for `thunderbolt-net` to only support one link in total, though presumably fixable.

> - Intuition on why- I can't point you to the line number, but I think it has to do with a fixed 4kb page size when communicating with the NHI that ends up becoming a bottleneck, perhaps 16kb pages on aarch64 apple help here?

I'm used to page size making a difference (due to TLB pressure) but not a factor of 2. I'm not familiar with DMA, so maybe there's some reason it'd be that dramatic there, but I'm unsure.

If the total size vs the latency of draining is just so small that it frequently fills and stalls, or if the sender and receiver can't be accessing it at once (but I don't think should be true?), it might make more sense. I think if I were wanting to make this thing go more smoothly, I'd probably start by measuring fractions of the time the tx/rx buffers are completely empty and completely full.

Actually, I'm not sure I'm understanding the text "we only have a single DMA ring for tx and rx" either. Does that mean one for tx and one for rx? or really one ring in total? if the latter, does it have to say drain fully before switching modes? that would seem pretty crippling.

l1k · 2026-05-18T13:21:01 1779110461

Fun fact (or not so fun if you're a subscriber):

Somebody is spamming kernel mailing lists under the name Marian Corcodel with a 26 MByte message multiple times per day containing a collection of nonsensical patches. Looks AI-generated, perhaps with the intention to poison LLMs. This has been going on for a few days now.

https://lore.kernel.org/all/CAGg4U=GNtCObd_Nbm_1Rr5FEvPb69Yz...

probably_wrong · 2026-05-18T13:32:15 1779111135

I'd warn HN users not to click on that link simply because it will load a 26MB message that will likely cause quite a strain on kernel.org's servers if everyone here does it.

sillysaurusx · 2026-05-18T15:11:21 1779117081

I was curious how much of an impact HN could have. Napkin math:

HN gets 24M views a day. Assume those views are evenly distributed across the front page (they aren’t), and that’s about 1M views for each front page post, assuming each user clicks on one post.

By the rule of 10s (also not exact), there are 10x fewer views on comment threads. So assume around 100k views on a comment thread as a theoretical average.

If everyone in this thread clicked on the link, that would be 2.6 TB of transfer across the day. But by the rule of 10s we have to assume 10x fewer people will interact (upvote, click, anything) than view. So we’re down to 260GB transfer over the course of a day.

I wonder how close that is. It seems plausible that a link in the top comment of a thread could garner 10,000 clicks.

That’s still about one click every 8 seconds, which at 10Mbit/s would indeed overwhelm the server by a factor of about 2.5x. But I clicked through and it loaded in just a few seconds, so presumably the pipe is faster than 10Mbit/s.

Another caveat is that many websites are already several megabytes, so it seems strange that 26MB would be the breaking point for a reasonable web host.
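The napkin math above, written out as a small Python sketch so the steps can be checked. The inputs are the guesses from this comment (1M views per front-page post, two applications of the rule of 10s, a 26 MB message, a 10 Mbit/s pipe), not measurements, and the function name is made up.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def napkin(post_views=1_000_000, message_mb=26, link_mbit_s=10):
    """Estimate daily load from a link in a top comment (all inputs are guesses)."""
    thread_views = post_views // 10          # rule of 10s: viewers who open comments
    clicks = thread_views // 10              # rule of 10s again: viewers who click
    total_gb = clicks * message_mb / 1000    # transfer per day, decimal GB
    seconds_per_click = SECONDS_PER_DAY / clicks
    required_mbit_s = message_mb * 8 / seconds_per_click
    return {
        "clicks": clicks,
        "total_gb": total_gb,
        "seconds_per_click": seconds_per_click,
        "required_mbit_s": required_mbit_s,
        "overload_factor": required_mbit_s / link_mbit_s,
    }
```

With the defaults this gives 10,000 clicks, 260 GB per day, one click every 8.64 seconds and a sustained rate of roughly 24 Mbit/s, which is where the "about 2.5x" over a 10 Mbit/s link comes from.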

devsda · 2026-05-18T17:43:35 1779126215

Don't forget scrapers. Scrapers can be biased towards top posts and comments.

cyanydeez · 2026-05-18T23:46:23 1779147983

Aren't AI agents worse than scrapers now, since they're basically a DDoS that runs over and over where scrapers will actually cache data?

perching_aix · 2026-05-18T16:27:11 1779121631

> HN gets 24M views a day

This is available info?

shagie · 2026-05-18T17:40:35 1779126035

https://news.ycombinator.com/item?id=33450094

2022 from dang:

> There's no stats page but last I checked it was around 5M monthly unique users (depending on how you count them), perhaps 10M page views a day (including a guess at API traffic), and something like 1300 submissions (stories) and 13k comments a day.

> The most interesting number is the 1300 submissions because that hasn't grown since 2011 - it just fluctuates. Everything else has been growing more or less linearly for a long time, which is how we like it.

kraftman · 2026-05-18T15:50:58 1779119458

Plenty of people deliberately posting to HN have their servers overwhelmed.

jedberg · 2026-05-18T20:07:58 1779134878

It's mirrored by Akamai, which is designed to repeatedly serve the same thing over and over. It won't really hurt anyone.

jmalicki · 2026-05-18T14:31:28 1779114688

Does a 26MB message actually cause noticeable strain on the server much beyond loading the page? I would think serving a contiguous 26MB chunk would be relatively similar to say 20 normal sized messages.

mort96 · 2026-05-18T16:44:09 1779122649

Way off. I went to an arbitrary message on lore.kernel.org. Firefox's network inspector says 7.37kB was transferred, including stylesheets. 26MB is roughly 3500x 7.37kB.

jmalicki · 2026-05-18T17:40:58 1779126058

Data transferred is not what generates load. sendfile() is about the lowest-overhead thing a web server does.
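For anyone who hasn't used it, a minimal Python sketch of what "just sendfile() it" looks like. `socket.sendfile()` is the standard-library wrapper that uses `os.sendfile()` where the platform supports it and falls back to plain `send()` otherwise. The helper names and the `socketpair()` demo are only for illustration and say nothing about how lore.kernel.org actually serves messages.

```python
import os
import socket
import tempfile

def send_file(sock, path):
    """Send the file at `path` over `sock`; returns the number of bytes sent."""
    with open(path, "rb") as f:
        # The kernel copies file pages straight to the socket where possible,
        # so the data does not pass through a userspace buffer.
        return sock.sendfile(f)

def demo(payload=b"x" * 4096):
    """Round-trip a small payload through a local socket pair."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    a, b = socket.socketpair()
    try:
        sent = send_file(a, path)
        a.close()  # signal EOF to the reader
        received = b""
        while chunk := b.recv(65536):
            received += chunk
        return sent, received
    finally:
        a.close()
        b.close()
        os.unlink(path)
```

The per-request work on the server is then mostly one open and one syscall, regardless of whether the file is 7 kB or 26 MB; what grows with size is the bandwidth, not the CPU cost.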

leonidasrup · 2026-05-18T13:48:47 1779112127

https://web.archive.org/web/20260518134447/https://lore.kern...

OuterVale · 2026-05-18T14:03:47 1779113027

I don't think needlessly straining the Internet Archive's servers is any better.

embedding-shape · 2026-05-18T15:17:35 1779117455

IA's infra is slightly better for big loads though, they tend to just have higher latency rather than aborted/timed out requests, for better or worse. It can be bit slow, but as long as you're ready to wait, you'll eventually get the response. Usually hosts just cut you off with a hardcoded timeout instead, which for people on high latency/low bandwidth connections can be super fun.

HDBaseT · 2026-05-19T03:13:18 1779160398

IA's resources are very limited as is. There are so many people (emulation/ROM YouTubers) linking to Archive.org downloads for full ROM sets.

It's a big problem. Donate to Archive.org if you can!

grosswait · 2026-05-18T14:08:28 1779113308

Will clicking on this link download a 26MB message putting extra load on archive.org's servers?

shevy-java · 2026-05-18T14:42:48 1779115368

Thank you for the warning. I rarely click on links these days though; only exception I make for HN links for main articles.

embedding-shape · 2026-05-18T15:18:14 1779117494

How do you navigate the web, everything is CTRL+L then manually type the address, or you have some fancier solution?

kelsey98765431 · 2026-05-18T15:48:40 1779119320

the web is useless outside of hn

embedding-shape · 2026-05-18T17:17:36 1779124656

90% of it yeah, but the 10% is still worth it, like HN.

neksn · 2026-05-18T17:10:56 1779124256

The page is gzipped in transit - only 5 MB of traffic are generated.

Bender · 2026-05-24T14:02:24 1779631344

Why can't they block the people doing this? Bring on the ban hammer.

Phelinofist · 2026-05-18T15:56:56 1779119816

> perhaps with the intention to poison LLMs

How does that work?

stefan_ · 2026-05-18T16:23:23 1779121403

This is just nonsensical changes and slurs, but particularly degenerate input data can cause big issues in training:

https://x.com/gabriberton/status/2051873677998956851

l1k · 2026-04-30T03:27:00 1777519620

It does enable address space separation of secret keys from user space, which some people love:

https://blog.cloudflare.com/the-linux-kernel-key-retention-s...

https://www.youtube.com/watch?v=7djRRjxaCKk

https://www.youtube.com/watch?v=lvZaDE578yc

So it's not as simple as "should not exist". I agree though that there doesn't seem to be a valid need to expose authencesn to user space.

Disclosure: I'm co-maintaining crypto/asymmetric_keys/ in the kernel and the author/presenter in the first two links is another co-maintainer.

ebiggers · 2026-04-30T03:56:17 1777521377

That can be done in userspace too -- different userspace processes have different address spaces too.

The fact that the first link recommends using keyctl() for RSA private keys is also "interesting", given that the kernel's implementation of RSA isn't hardened against timing attacks (but userspace implementations of RSA typically are).

ngomez · 2026-04-30T04:09:17 1777522157

The CloudFlare blog discusses that idea when they talk about having an "agent process" to hold cryptographic material, but they list drawbacks like having to develop two processes, implement a well-defined interface, and enforce ACLs. I'm not convinced that "developing two processes" is a reason not to do it, since the kernel is effectively just the second process now, but everything else makes sense.

It's unfortunate though since this is one thing I think Windows does decently well. The Windows crypto and TLS APIs do use a key isolation process by default (LSASS) and have a stable interface for other processes to use it [0]. I imagine systemd could implement something similar, but I also know that there are very strong opinions about adding more surface area to systemd.

[0] https://blackhat.com/docs/us-16/materials/us-16-Kambic-Cunni...

lostmsu · 2026-04-30T11:24:14 1777548254

TBH LSASS is privileged enough to be a good target for exploits.

l1k · 2026-04-30T04:30:40 1777523440

> the kernel's implementation of RSA isn't hardened against timing attacks

Cloudflare is using custom BoringSSL-based crypto code in the kernel:

https://lore.kernel.org/all/CALrw=nEyTeP=6QcdEvaeMLZEq_pYB9W...

400thecat · 2026-04-30T08:14:05 1777536845

can you please give me a real-life example of an application, on a typical linux laptop or typical linux server, which userspace application would use this CRYPTO_USER_API ? None that I looked at seem to use it: openssl, pgp, sha256sum

l1k · 2026-04-30T08:34:54 1777538094

As Eric has correctly stated above, we believe iwd (Intel Wireless Daemon), or rather the ell library it relies on (Embedded Linux Library) is the only relatively widespread user space application relying on it.
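For reference, this is roughly what using the interface from user space looks like: a minimal Python sketch that asks the kernel for a SHA-256 digest over an AF_ALG socket. The function name is made up, and the fallback to `None` is an assumption for systems where AF_ALG isn't available (non-Linux, sandboxed, or a kernel built without CONFIG_CRYPTO_USER_API_HASH).

```python
import hashlib
import socket

def afalg_sha256(data: bytes):
    """Hash `data` with the kernel's sha256 via AF_ALG, or return None."""
    if not hasattr(socket, "AF_ALG"):
        return None  # not Linux, or Python built without AF_ALG support
    try:
        with socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET, 0) as alg:
            alg.bind(("hash", "sha256"))   # (algorithm type, algorithm name)
            op, _ = alg.accept()           # per-operation socket
            with op:
                op.sendall(data)
                return op.recv(32)         # SHA-256 digest is 32 bytes
    except OSError:
        return None  # AF_ALG or the algorithm is unavailable here

def matches_userspace(data: bytes = b"abc"):
    """Compare the kernel digest with hashlib's; None if AF_ALG is unavailable."""
    digest = afalg_sha256(data)
    return None if digest is None else digest == hashlib.sha256(data).digest()
```

Two socket calls and a round trip into the kernel per hash gives a feel for why tools like sha256sum simply link a userspace implementation instead.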

XorNot · 2026-04-30T09:39:26 1777541966

Isn't the better argument to ask whether there'd be benefit if all those things did?

A lack of adoption isn't a priori a good argument against an interface, and serious bugs can happen anywhere.

My personal opinion for a while has been that crypto operations should be in the kernel so we can end the madness that is every application shipping its own crypto and trust system, which has only gotten worse since containers were invented.

acdha · 2026-04-30T12:54:27 1777553667

> My personal opinion for a while has been that crypto operations should be in the kernel so we can end the madness that is every application shipping its own crypto and trust system, which has only gotten worse since containers were invented.

There’s a valid argument here but I think that’d devolve into the DNSSec trap without both a very well-designed API and a stable way to ship updates for older kernels. If people can’t get good user experience or have to force kernel upgrades to improve security, most applications will avoid it. Things like Chrome shipping their own crypto mean that they can very quickly ship things like PQC without waiting years or having to deal with issues like kernel n+1 having unrelated driver or performance issues which force things into a security vs. functionality fight.

XorNot · 2026-04-30T13:53:34 1777557214

Which does sort of loop around to the issue of Linux not having a stable ABI as a feature I suppose which would be one way to implement it with long term compatibility on kernel modules.

But the Chrome example also highlights the problem: Chrome might ship it, but vanishingly little software is ever going to upgrade and we've got an explosion of statically linked languages now.

MarsIronPI · 2026-05-01T03:05:30 1777604730

If Linux does that, I really hope it can be done in a standardized way that doesn't make porting to *BSD more difficult than it already might be. Standards are a good thing.

bawolff · 2026-04-30T12:51:31 1777553491

> A lack of adoption isn't a priori a good argument against an interface

I mean it kind of is (perhaps not a priori, but why is that relevant?). If something is not being used, it's not meeting needs, so it's just increasing attack surface without benefit.

l1k · on Oct 24, 2024

NTB = Non-Transparent Bridge

DW = DesignWare

l1k · on July 7, 2024

Some of the sites that went online around 1994 are still there:

https://north.pole.org/

https://town.hall.org/

I believe Carl Malamud (Internet Multicasting Service) was behind these.

The audio files are in Sun Audio format, which all browsers supported natively back then. Chromium apparently no longer does, requires saving and opening in VLC.

DamonHD · on July 7, 2024

One of my pages from around then still just about lives:

http://www.exnet.com/springboard.html

l1k · on May 11, 2024

"They Were There"

https://www.youtube.com/watch?v=MmVCePfMXAU

l1k · on Oct 19, 2023

If you need a lot of RAM, usually you need to buy servers with multiple CPUs to which you can attach the memory. Because the amount of DRAM you can attach to one CPU is limited.

If you don't have the need for all the extra CPUs, just being able to attach more memory to a single CPU through CXL may be cheaper.

l1k · on Oct 11, 2023

The original Raspberry Pi SoC (BCM2835) is ARMv6 with VFP2 Hard Float support.

Debian's "armhf" architecture is ARMv7 with VFP3. It doesn't support BCM2835.

Debian's "armel" architecture is ARMv4. It doesn't use BCM2835 to its full potential.

So the BCM2835 is awkwardly positioned in-between Debian's two stock ARM 32-bit architectures, which motivated the decision to recompile all packages for a BCM2835-specific "armhf" distribution.

In a sense, it's a historic artifact.

rbanffy · on Oct 11, 2023

Interesting.

For my "canned mainframe" (https://github.com/rbanffy/vm370) ARM images, I'm using Debian as a base, since it has the armv6 architecture listed and I didn't notice any adverse effects.

I wonder if I should have used an RPi-specific base image.

OTOH, that'd render the container image incompatible with other 32-bit ARM boards.

l1k · on Sept 28, 2023

The point here is likely to pull the rug out from under scalpers' feet.

With the Raspberry Pi 5 out in two weeks, all the held-back inventory of older models will be dumped, prices will plummet, availability will become a non-issue.

In that sense it's a wise move.

alangibson · on Sept 28, 2023

Mouser sold about 3000 Pi 4s in the last couple of weeks. I'm hoping a few scalpers are about to get seriously burned.

Interestingly Digikey has over 2000 left. I wonder if they are limiting quantities.

michaeltimo · on Sept 28, 2023

Finding older models was also almost impossible in the past two years. It's unlikely that Raspberry Pi 5 will solve the issue. But even so, it's not a wise move because what is the point of bringing a new model when they can't make it available to normal people?

fanf2 · on Sept 28, 2023

The Raspberry Pi 5 will only be for sale to individuals until the end of this year (no industrial customers competing for inventory like the older models)

extraduder_ire · on Sept 28, 2023

Plus, they're only launching the 4/8gb models to start. So there'll be another wave of cheaper ones a little later. Really hoping they still hit the $35 price point on the 1gb model.

meragrin_ · on Sept 28, 2023

> they can't make it available to normal people

I guess you haven't been looking recently? I can go to a local store and pick one up. It looks easy to pick up one online too.

https://rpilocator.com/

giancarlostoro · on Sept 28, 2023

I think I've seen Pi kits at my local Target. The issue with those is, they're for niche things I might not care about and now I've got tech waste on my hands, but they also might not be the exact model I want.

Note I'm not disagreeing, just saying in some cases, the ones in-store are kits.

mlyle · on Sept 28, 2023

I wouldn't expect Target to carry bare Raspberry Pi-- how many people walk into a Target wanting a Pi with no accessories?

eldaisfish · on Sept 28, 2023

I have and it is still a pain. Many websites still have limits on how many you can buy. The situation has improved but it is far from what your comment implies.

mindcrime · on Sept 28, 2023

> Many websites still have limits on how many you can buy.

For a hobbyist / individual, is that really a big deal? I mean, how many do you need at one time?

Anyway, the claim all along has been that supplies would be "back to normal" by the end of this year, and so far things seem to be tracking that way. If you look at rpilocator.com now, the entire first two pages are full of green lines, which is a DRASTIC improvement compared to just 6 months ago. And some of the major distributors are getting in shipments of 5,000, 6,000 at a time of some models and having them in stock for weeks on end. So one can clearly see that the situation is improving rapidly.

That said, I will make no claim one way or the other with regards to the question of whether or not shipping a Pi 5 is a "good idea" or not.

dixie_land · on Sept 28, 2023

On a similar note, I'm genuinely curious as to why Pi chose the "authorized reseller" model instead of selling them directly.

djbusby · on Sept 28, 2023

B2C, small quantity sales are not fun. B2B selling pallets full.

rewmie · on Sept 28, 2023

I'd be surprised if reselling through authorized sellers isn't much simpler and problem-free than selling them directly.

I also expect that using resellers ensures better odds of protecting the brand/project goodwill. Resellers deal with problems like "I paid a ton of cash for a board and it arrived late and/or broken". Support alone is a nightmare, and I recall that Raspberry Pi struggled with PR when they started out. I vaguely recall Liz Upton being behind some ill-advised episodes that didn't improve Raspberry Pi's image and would get any other PR person sacked.

extraduder_ire · on Sept 28, 2023

They do also sell them directly, though not in large quantities. They even have a retail store somewhere in the UK.

KaiserPro · on Sept 28, 2023

Because world wide sales is really hard.

Having trusted local resellers is a much more scalable way to sell to local markets.

AmIDev · on Sept 28, 2023

If scalpers are able to sell a product at higher price, doesn't that mean the company priced the product too low?

extraduder_ire · on Sept 28, 2023

I think scalping is more of a supply issue, raising the official price of the product would only require more cash when scalpers are doing their buying.

sanity · on Sept 28, 2023

Raspberry should raise the price until scalping isn't profitable. Keeping the price low is just handing money to scalpers that should be going towards future product development, until they can meet the demand.

l1k · on April 2, 2023

A lot of RISC CPU arches which were popular in the 1990's declined because their promulgators stopped investments and bet on switching to IA64 instead. Around the year 2000, VLIW was seen as the future and all the CISC and RISC architectures were considered obsolete.

That strategic failure by competitors allowed x86 to grow market share at the high end, which benefited Intel more than the money lost on Itanium.

ghaff · on April 2, 2023

It's more complicated than that.

Sun didn't slow down on UltraSPARC or make an Itanium side bet. IBM did (and continues to) place their big hardware bet on Power--Itanium was mostly a cover your bases thing. I don't know what HP would have done--presumably either gone their own way with VLIW or kept PA-RISC going.

Pretty much all the other RISC/Unix players had to go to a standard processor; some were already on x86. Intel mostly recovered from Itanium specifically but it didn't do them any favors.

sliken · on April 2, 2023

Actually, they did. Intel promised an aggressive delivery schedule, performance ramp, and performance. The industry took it hook, line, and sinker. Meanwhile AMD decided not to limit 64 bit to the high end and brought out x86-64.

Sun did an IA64 port of Solaris, which is definitely an Itanium side bet.

HP was involved in the IA64 effort and definitely was planning on the replacement of pa-risc from day 1.

davidgay · on April 2, 2023

> HP was involved in the IA64 effort and definitely was planning on the replacement of pa-risc from day 1.

As my memory remembers and https://en.wikipedia.org/wiki/Itanium agrees, Itanium originated at HP. So yes, a replacement for pa-risc from day 1, but even more so...

rodgerd · on April 2, 2023

Another way to look at the Itanic is that HP somehow conned Intel into betting the farm on building HP-PA3 for HP. Which is pretty impressive.

panick21_ · on April 4, 2023

Sun didn't slow down on UltraSPARC but they were just not very good at designing processors.

foobiekr · on April 2, 2023

This isn't really true. IBM/Motorola need to own the failure of POWER and PowerPC, and MIPS straight up died on the performance side. Sun continued with UltraSPARC.

It wasn't that IA64 killed them, it's that they were getting shaky and IA64 appealed _because_ of that. Plus the lack of a 64bit x86.

panick21_ · on April 2, 2023

It's simply economics: Intel had the volume. Sun and SGI simply didn't have the economics to invest the same amount, and they were also not chip companies; they both didn't invest enough in chip design or invested it wrongly.

Sun spent an unbelievable amount of money on dumb ass processor projects.

Towards the end of the 90s all of them realized their business model would not do well against Intel, so pretty much all of them were looking for an exit and IA64 hype basically killed most of them. Sun stuck it out with Sparc with mixed results. IBM POWER continues but in a thin slice of the market.

Ironically there was a section of Digital and Intel who thought that Alpha should be the basis of 64 bit x86. That would have made Intel pretty dominant. Alpha (maybe a TSO version) with 32 bit x86 compatibility mode.

PAPPPmAc · on April 2, 2023

Look closely at AMD designs (and staff) of the very late 90s and early 2000s and/or all modern x86 parts and see that ...more or less, that's what happened, just not with an Alpha mode.

Dirk Meyer (Co-Architect of the DEC Alpha 21064 and 21264) led the K7 (Athlon) project, and they run on a licensed EV6 bus borrowed from the Alpha.

Jim Keller (Co-Architect of the DEC Alpha 21164 and 21264) led the K8 (first gen x86-64) project, and there are a number of design decisions in the K8 evocative of the later Alpha designs.

The vast majority of x86 parts since the (NexGen Nx686 which became) AMD K6 and Pentium Pro (P6) have been internal RISC-ish cores with decoders that ingest x86 instructions and chunk them up to be scheduled on an internal RISC architecture.

It has turned out to sort of be a better-than-both-worlds thing almost by accident. A major part of what did in the VLIW-ish designs was that "You can't statically schedule dynamic behavior" and a major problem for the RISC designs was that exposing architectural innovations on a RISC requires you change the ISA and/or memory behavior in visible ways from generation to generation, interfering with compatibility so... the RISC-behind-x86-decoder designs get to follow the state of the art changing whatever they need to behind the decoder without breaking compatibility AND get to have the decoder do the micro-scheduling dynamically.

panick21_ · on April 2, 2023

Yes, that's very much part of the history.

However I disagree that it's the best of both worlds.

RISC doesn't necessarily require changing the ISA, not any more than on x86.

PAPPPmAc · on April 3, 2023

I'm certainly not going to claim that x86 and its irregularities and extensions of extensions is in _any way_ a good choice for the lingua franca instruction set (or IR in this way of thinking). Its aggressively strictly ordered memory model likely even makes it particularly unsuitable, it just had good inertia and early entrance.

The "RISC of the 80s and 90s" RISC principles were that you exposed your actual hardware features and didn't microcode to keep circuit paths short and simple and let the compiler be clever, so at the time it sort of did imply you couldn't make dramatic changes to your execution model without exposing it to the instruction set. It was about '96 before the RISC designs (PA-RISC2.0 parts, MIPS R10000) started extensively hiding behaviors from the interface so they could go out-of-order.

That changed later, and yeah, modern "RISC" designs are rich instruction sets being picked apart into whatever micro ops are locally convenient by deep out of order dynamic decoders in front of very wide arrays of microop execution units (eg. ARM A77 https://en.wikichip.org/wiki/arm_holdings/microarchitectures... ), but it took a later change of mindset to get there.

Really, the A64 instruction set is one of the few in wide use that is clearly _designed_ for the paradigm, and that has probably helped with its success (and should continue to, as long as ARM, Inc. doesn't squeeze too hard on the licensing front).

panick21_ · on April 3, 2023

Seems to me that you just have to be careful when bringing out a new version. You can't change the memory model from chip to chip but that goes for x86 too. Not sure what other behaviors are not really changeable.

Can you give me an example of this? SPARC of the late 90s ran 32bit SPARC.

userbinator · on April 2, 2023

> Plus the lack of a 64bit x86.

If you look at the definitions of various structures and opcodes in x86 you'll notice gaps that would've been ideal for a 64-bit expansion, so I think they had a plan besides IA64, but AMD beat them to it (and IMHO with a far more inelegant extension.)

andrekandre · on April 3, 2023

  > and IMHO with a far more inelegant extension

what could they have done that would have been better?

Dalewyn · on April 2, 2023

>That strategic failure by competitors allowed x86 to grow market share at the high end, which benefited Intel more than the money lost on Itanium.

In that sense, Itanium was a resounding success for Intel (and AMD).

panick21_ · on April 2, 2023

Itanium was a success right until they actually made a chip.

What they should have done is hype Itanium and then the day it came out they should have said yeah this was a joke, what we did is buy Alpha from Compaq and it's literally just Alpha with x86 compatibility mode.

Then they would have dominated.