vineyardmike · 2026-06-15T17:42:22 1781545342

For Claude specifically, (1) enterprises pay API rates on top of subscriptions, so subscription profitability questions are only relevant for smaller companies and indie devs, many of whom probably have sporadic or low usage, which helps balance out some heavy users.

Again, for Claude, (2) it’s rumored that their API rates have around a 90% profit margin. It’s also claimed that the subscription limits get you around 10x tokens per monthly dollar vs buying them with API rates.

Edit: to drive it home. If a token’s true cost to Anthropic is 1/10 of what they sell it for at API rates, and a subscription gets you tokens at 1/10 the price, that’s cost-neutral for the business if every subscriber uses every token. They’re selling tokens at cost, not at a loss. Many subscription users won’t use their full allotment. That means serving some users doesn’t cost the business as much - which might push the subscription business from cost-neutral to profitable.
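The arithmetic above can be checked with a few lines of Python. This is only a sketch of the rumored numbers (a 90% API margin and 10x tokens per subscription dollar); the function name and the `utilization` parameter are made up for illustration, and none of the figures are confirmed.

```python
def margin_per_subscription_dollar(api_margin: float,
                                   token_multiplier: float,
                                   utilization: float) -> float:
    """Profit kept from each subscription dollar, under the rumored figures.

    api_margin       -- fraction of the API price that is profit (rumored ~0.9)
    token_multiplier -- tokens per subscription dollar vs. per API dollar (claimed ~10)
    utilization      -- fraction of the allotment the subscriber actually uses (assumed)
    """
    # Serving cost of one API-dollar's worth of tokens is (1 - api_margin).
    # A subscription dollar buys token_multiplier times as many tokens.
    serving_cost = (1 - api_margin) * token_multiplier * utilization
    return 1 - serving_cost


# Every subscriber uses every token: roughly break-even.
print(round(margin_per_subscription_dollar(0.9, 10, 1.0), 2))
# Subscribers use half their allotment on average: about half of revenue is margin.
print(round(margin_per_subscription_dollar(0.9, 10, 0.5), 2))
```

The result is very sensitive to the two rumored inputs: under the same formula, an 80% API margin with full utilization would put the subscription at a loss.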

dominotw · 2026-06-15T17:43:20 1781545400

not sure how that concludes that subscriptions are not losing money.

vineyardmike · 2026-06-14T09:50:21 1781430621

Refusing to enact new laws around a thing most people don’t like, don’t want, and don’t care about (oh and is used for scams often) is quite different than a secret back door war.

mlrtime · 2026-06-15T10:01:36 1781517696

"around a thing most people don’t like"

Liking or not liking has nothing to do with it; this is literally the job of the government. Why do you think states started enacting their own crypto regulation laws (NYDFS)? Because the administration did 0.

vineyardmike · 2026-06-13T07:36:49 1781336209

If your country doesn't have any leading models, why not legalize distillation, either explicitly or implicitly?

(Chinese labs famously distilled American models, and that seems to be going well for them. They now have a competitive industry, home-grown talent choosing not to leave, and they now can truly compete without distillation).

vineyardmike · 2026-06-13T06:46:25 1781333185

The article addresses a pretty compelling reason...

Why would the makers of open models (mostly Chinese firms) continue to open them up, now that the value chain and economy have shifted? Previously, it was a (Chinese) national goal to force the market to compress OpenAI/Anthropic margins (and compress their revenue along the way), to ensure the Chinese had access to high quality models and could afford to compete. Now there is an opportunity to usurp and be the international default, and claim the margin for themselves by closing their models.

Beyond that, there is likely an upper bound of capability-per-parameter, which means that there is an upper bound on "local" models, and once you need the cloud, why would the government not target clouds next?

vineyardmike · 2026-06-10T17:05:15 1781111115

Recently I switched to OpenCode to try out many of the Non-US-Frontier-Labs models. My unexpected favorite model to use was Mercury (a diffusion model). Not because it was “smart” but because it was stupid fast. It was more of a pair-programming experience instead of the SOTA agentic experience of prompting and waiting. Honestly, it was also way more fun and brought back some of the pre-AI coding experience while still getting some benefits of AI. It felt less like a slot machine where you prompt, wait, and hope it went in the right direction. It even made me use the tiny models like Gemini Flash Lite and GPT Mini/Nano more too.

Anyways, so excited for an open-weight model and I hope it performs well. I’ll be testing this ASAP.

onlyrealcuzzo · 2026-06-10T18:05:27 1781114727

If you can run your tests fast and cheaply, and have metrics that show what bad/sloppy code is that are cheap & fast to generate, a worse fast model can outperform a far better far slower model if you value time...

I've had pretty good success with LLMs after putting in place metrics to measure true complexity (not cyclomatic), and automatically pushing back everything until the added complexity is within reason for the feature.

bee_rider · 2026-06-10T20:10:23 1781122223

How do you measure “true” complexity? Cyclomatic seems a bit… I dunno, artificial? Blunt? But it has the benefit of being defined.

onlyrealcuzzo · 2026-06-10T20:30:47 1781123447

There was a ton of research on this in the 80s... and interestingly, I haven't seen a lot of recent research.

Surprisingly, it seems most languages don't have a standard package to do a lot of these detections.

Ruby has Flay to detect structural similarity (something LLMs are prone to produce): basically, re-writing a huge function with only a couple of minor differences that should probably be params...

One of the things I rely on most is "pressure" -> which conditions are causing the most checks throughout the code-base. Those are things you should Type away.

Dynamically typed languages like Ruby create a huge surface area for type slop for LLMs, which is why I would not recommend using a dynamically typed language for vibe coding.

You can have type "pressure" and nil "pressure" -> where you set a value to nil somewhere (that you probably shouldn't have) -> and that has ripple effects all throughout your codebase. Similarly, you can do this for values -> one place it's a string (where it shouldn't be), everywhere else a symbol (what it should be) -> but now you've got hundreds of casts to_sym or to_s in your codebase.

There's also state drift & reification misses. State drift -> you constantly update two states (that should probably just be one new value or a function) and sometimes you forget to update one (more of a bug possibility than complexity). Reification misses -> you constantly check for multiple conditions that should probably be one value or a function, and similarly you may sometimes miss one (again, a bug risk).

Complexity comes down to state and control flow -> so you want to check what's causing you to make the most decisions (especially state/time based), and where it's coming from. Where do you have the most state and why...
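As a rough illustration of the nil and cast "pressure" idea described above, here is a minimal Python sketch that counts, per identifier, how often Ruby source nil-checks it or casts it with `to_sym`/`to_s`. It is not the commenter's tool: the regexes, the `pressure` function, and the sample snippet are all hypothetical, and a real implementation would parse the code rather than pattern-match text.

```python
import re
from collections import Counter

# Hypothetical, crude textual proxies; a real tool would use a Ruby parser.
PATTERNS = {
    "nil_check": re.compile(r"(\w+)\.nil\?"),
    "cast": re.compile(r"(\w+)\.to_(?:sym|s)\b"),
}


def pressure(source: str) -> dict[str, Counter]:
    """Count, per identifier, how often it is nil-checked or cast."""
    return {name: Counter(pattern.findall(source))
            for name, pattern in PATTERNS.items()}


# Made-up Ruby snippet used only to exercise the counters.
sample = """
return if user.nil?
name = status.to_sym
log(status.to_s) unless user.nil?
key = status.to_sym
"""

report = pressure(sample)
# Identifiers with the highest counts are candidates to "type away" at the source.
for kind, counts in report.items():
    print(kind, counts.most_common(3))
```

Identifiers that dominate either counter point at the place where the value should have been normalized once instead of being checked or cast everywhere.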

I'm hoping to release everything in the next few weeks, but it takes a while to polish things, especially when it's a side-quest of a side project...

aleksiy123 · 2026-06-10T22:23:22 1781130202

Interesting, I do think blending the fuzziness of models with the determinism of hard checks/conformance is the way to go.

But using some kind of metrics as guardrails/steering seems interesting.

epiccoleman · 2026-06-10T20:50:27 1781124627

> Dynamically typed languages like Ruby create a huge surface area for type slop for LLMs, which is why I would not recommend using a dynamically typed language for vibe coding.

I totally understand this, and have seen the problems firsthand. But Elixir / Phoenix / LiveView, along with Tidewave, have become my favorite "vibe slop stack." Just so quick and easy, and the LLM seems to get things right quite often.

fridder · 2026-06-11T17:47:50 1781200070

I wonder if a dedicated client or mode in a client would provide some benefits. Might also be interesting to do adversarial stuff too where it argues with itself or another model

Daishiman · 2026-06-10T19:22:59 1781119379

What metrics have you found useful?

yeodev · 2026-06-10T18:38:41 1781116721

I wonder how much this will impact locally used models for coding. I can imagine using diffusion models that are x-times faster than Qwen or Gemma 4 - where I have to do more "pre-ai" work which is a good thing and can have a very fast, very cheap model to work with locally. I assume since it doesn't do heavy computing for a long time that it's cheaper to run on local hardware as well?

irthomasthomas · 2026-06-10T21:54:04 1781128444

Mercury-2 is amazing. I am using it frequently as the arbiter in llm-consortium. The context window is relatively small, so to make it work with larger consortiums I can construct a recursive sort-of meta consortium like this:

  llm consortium save cns-glm -m glm-5.2 -n 5 --arbiter mercury-2 --judging-method rank

  llm consortium save cns-kimi -m k2.6 -n 5 --arbiter mercury-2 --judging-method rank

  llm consortium save cns-meta-glm-kimi -m cns-glm -m cns-kimi --arbiter mercury-2 --judging-method synthesis

Now when I prompt cns-meta-glm-kimi it will pick the best of five from kimi and glm before creating a synthesis from the two winners.
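The nesting described above can be sketched in plain Python to show the control flow. This does not use or reproduce llm-consortium's actual API: `consortium`, `Model`, `Arbiter`, and the stub models and arbiters below are all hypothetical stand-ins, and the "rank" stub simply picks the longest candidate.

```python
from typing import Callable, Sequence

Model = Callable[[str], str]                   # prompt -> response
Arbiter = Callable[[str, Sequence[str]], str]  # (prompt, candidates) -> final answer


def consortium(models: Sequence[Model], n: int, arbiter: Arbiter) -> Model:
    """Build a model that samples each member n times and lets the arbiter decide."""
    def run(prompt: str) -> str:
        candidates = [model(prompt) for model in models for _ in range(n)]
        return arbiter(prompt, candidates)
    return run


# Stub models and arbiters (assumptions, for illustration only).
def glm(prompt: str) -> str:
    return f"glm:{prompt}"

def kimi(prompt: str) -> str:
    return f"kimi:{prompt}"

def rank(prompt: str, candidates: Sequence[str]) -> str:
    return max(candidates, key=len)   # stand-in for "pick the best"

def synthesize(prompt: str, candidates: Sequence[str]) -> str:
    return " + ".join(candidates)     # stand-in for "merge the winners"


cns_glm = consortium([glm], n=5, arbiter=rank)
cns_kimi = consortium([kimi], n=5, arbiter=rank)
cns_meta = consortium([cns_glm, cns_kimi], n=1, arbiter=synthesize)

print(cns_meta("hi"))
```

Because each inner consortium is itself just a prompt-to-response function, the arbiter at the top only ever sees two candidates, which is what keeps the small context window workable.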

SwellJoe · 2026-06-10T22:07:51 1781129271

I've found the average output of many suboptimal models is still suboptimal, especially when it comes to judging the accuracy/correctness of the work of other models.

I did some benchmarks recently of how well various models find security vulnerabilities, and then follow up testing of the judging process of whether the models found the right bug and whether other bugs it reported were false positives or legitimate other bugs. A committee of good-not-great models (DeepSeek, MiMo, Gemma 4) cannot replicate the accuracy of Opus by itself. Even when all three of the other models disagreed with Opus, Opus was almost always the one that was actually right.

It's an interesting area for research. And, a model that's very fast can make a lot more attempts at a solution, and in cases where there is an unambiguous "right" solution that can be proven by some sort of static rule, "very fast" may be a useful characteristic. Small classification problems, where you need to make thousands of decisions about some specific aspect of a large corpus of data, seems like a sweet spot for a model like Mercury.
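A minimal sketch of that "many fast attempts plus a static check" pattern, assuming the model's attempts arrive as an iterable of strings and the check is a plain function. The names `first_verified`, `attempts`, and `budget` are made up for illustration, and the toy verifier stands in for a compiler, test suite, or schema check.

```python
import itertools
from typing import Callable, Iterable, Optional


def first_verified(attempts: Iterable[str],
                   verify: Callable[[str], bool],
                   budget: int) -> Optional[str]:
    """Return the first attempt that passes verify, trying at most budget attempts."""
    for candidate in itertools.islice(attempts, budget):
        if verify(candidate):
            return candidate
    return None


# Toy stand-ins: the "model" proposes numbers, the "static rule" checks x * x == 16.
proposals = ["3", "5", "4", "4"]

def is_square_root_of_16(answer: str) -> bool:
    return int(answer) ** 2 == 16

print(first_verified(proposals, is_square_root_of_16, budget=3))
print(first_verified(proposals, is_square_root_of_16, budget=2))
```

The faster and cheaper each attempt is, the larger the budget you can afford, which only helps when the verifier is unambiguous.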

irthomasthomas · 2026-06-10T22:25:27 1781130327

I have had a better experience with my own use. I use it every day and it rarely fails to improve tasks. Perhaps the prompts and rubrics make a difference. And finding bugs is one of the better use cases because it is essentially a search problem. As long as models are non-deterministic and there is some diversity in training data, then an ensemble that iterates on the problem is more likely to cover the ground needed to solve a problem.

Some tasks benefit from this approach more than others. There was a paper from Google on a version they made which was very similar and achieved then-SOTA results on planning and pathfinding benchmarks.

edit:

Mind Evolution paper https://deepmind.google/research/publications/122391/

(That was a month after I published llm-consortium :) https://xcancel.com/karpathy/status/1870692546969735361

evilturnip · 2026-06-10T20:00:18 1781121618

I get exactly what you mean. After getting frustrated with how slow Claude was on my personal projects, I switched to Google Antigravity with Flash models and the speed difference is huge. I feel more in the flow and just more focused on the task. I did not realize how much a difference speed can make.

Claude is better for extremely complicated, large codebases where its slower response time might be a good trade-off for the complexity of the task. Antigravity and other fast models work so much better for smaller projects where you want a "flowy" code, run, debug cycle.

bpavuk · 2026-06-10T19:52:16 1781121136

YESSSS!!! speed is THE way! I like my boilerplate POJOs/data classes generated at breakneck pace of 300+ tok/s, Flash-Lite is more useful than GPT-5.5 for me this way. if it's too slow, you just stay in that goddamn async death loop

embedding-shape · 2026-06-10T20:46:50 1781124410

> I like my boilerplate POJOs/data classes generated at breakneck pace of 300+ tok/s

Regardless of speed, use the LLM to eliminate the need for boilerplate rather than just creating more code faster.

> if it's too slow, you just stay in that goddamn async death loop

Things get slow when you're ballooning the size of your code, files, design and architecture, and things get more involved and complicated, piling fast hacks on top of fast hacks until everything gets brittle.

Slow is fast, longer-term anyways.

bpavuk · 2026-06-11T10:11:34 1781172694

previously, mugging through docs to turn them into serializables for some API took weeks of grueling work if you wanted to cover an entire API surface that's as big as, say, GitHub's. nowadays, just "copy Markdown" from the very same GitHub, put 10-12 data classes, and let LLM extrapolate from there. with Gemini's 65.5k max token output, that is just several prompts and about two hours. that's the boilerplate. there is practically no way to automate this unless GitHub adopts OpenAPI spec in a way that's not buggy, so that we can just hit an endpoint and point procedural source generators at them

embedding-shape · 2026-06-11T11:21:47 1781176907

> that's the boilerplate

Sounds like you're trying to just re-implement a HTTP API, not really boilerplate.

Boilerplate is code you could have avoided writing, but you take the "temporary" shortcut of copy-pasting the code instead of building a proper abstraction. That is what I'm saying is the wrong direction.

elxr · 2026-06-10T21:53:47 1781128427

For boilerplate, yeah. But when asking research or exploratory questions, or weighing whether a feature is well designed, or asking "can I implement _x_ feature using these libraries without introducing unnecessary complexity", then GPT-5.5 medium is still fast enough.

10-20 seconds times a couple turns on a new feature isn't bad. Kimi is also similarly fast if not faster.

I do agree with smaller models for more constrained/routine tasks though.

bpavuk · 2026-06-11T10:01:41 1781172101

well, I can usually think for myself or hit someone up in Discord (or Teams, if it's for a living) and in a worst case (that person just deflects to AI anyway) just save some token budget for myself

elxr · 2026-06-11T11:03:10 1781175790

I always think for myself too, but when learning to do something I've never implemented before, it's nice to have little sanity checks using something with the reasoning ability (plus the fast natural language search on hundreds of pages of documentation) of a model like GPT-5.5.

Every line I put in my app, I still reason about myself. But when deciding between 5+ ways of building some random, non-straightforward feature, it's nice to have what's essentially a "mentor" AI.

fittingopposite · 2026-06-11T06:39:26 1781159966

Mercury is a US LLM from https://www.inceptionlabs.ai/

desireco42 · 2026-06-10T21:29:39 1781126979

Wow... I forgot about that. Mercury is brutal. I had him review lint errors and the speed is just insane

skybrian · 2026-06-10T18:06:00 1781114760

Could you say more about how you use it? What does your workflow look like?

vineyardmike · 2026-06-10T18:53:24 1781117604

Imagine you’re entirely pre-AI… to do some work, you read code, think, then write some code across a number of files. Usually then a small dance with compilation/unit tests to address anything broken. Along the way, you use your human judgement on style and quality, and midway through your change you might refactor something based on learned best practices (eg, when to use a static method, or helper class).

Today, even the dumbest AI agents can trivially loop through the final dance to get compilation, and often unit tests (depending on scope of failure). Big SOTA agents have OK code quality, but if left unattended or unchecked will still generate pretty sloppy repos after a while. That’s true even when using models like Opus which is ridiculously expensive in comparison.

When using the models in this fast “pair programming” style, I find that I (the human) mostly do all the “plan and think” work, and usually point the smaller agent towards specific files/directories, with specific targeted changes. It’s slower than 1-shot prompting an entire feature, but slightly faster than doing it manually, and I find the code is less “slop” because the changes are smaller and more human. It retains the agentic benefits of handling imports, compilation iteration, etc and can do basic cross-file plumbing. It’s also cheap and fast to do iterations like “wait make that method static” or “let’s update this to use <other util class>” and things like that. When the agent is slow to make localized edits, I find I’m less likely to push for minor nit-picks and style updates.

andai · 2026-06-10T18:35:46 1781116546

So you're making smaller edits?

vineyardmike · 2026-06-10T06:50:47 1781074247

I think one of life's big questions is defining the size of your sphere of responsibility.

Some people decide that sphere is really big, and they go on to be those historical figures. Others define it really big, and wallow in angst, aware of the suffering and powerless against it. Others still define it too small, and by the end of their life, find regret that they didn't try to help those within reach.

vineyardmike · 2026-06-09T20:39:39 1781037579

While I agree that AIs would do a good job…

Would you rather take instructions from a ruthless robot or ruthless flesh sack?

yoyohello13 · 2026-06-09T20:49:42 1781038182

Weirdly enough, I'd take the robot. At least we can pretend the robot doesn't know any better. The human is actively choosing to be a dick and profiting off it.

vitally3643 · 2026-06-09T20:49:29 1781038169

The flesh sack is choosing to be an insufferable twat, and the robot either doesn't have any choice or has a decent statistical justification for what it does.

ungreased0675 · 2026-06-09T23:15:50 1781046950

A robot with access to all of the company data and mediocre decision making wouldn’t be terrible.

vineyardmike · 2026-06-08T21:09:00 1780952940

The obvious answer to where the AI Labs get customers is Cloud GPUs. Most users (globally) have cheap phones with poor CPUs and small amounts of RAM. They can't run usable models locally, and it's not clear from the Google-Apple deal if G is selling access to their cloud compute as part of that $1B, or just sharing the weights/IP.

Apple themselves have said there are usage limits, with a subscription upgrade for more usage. So clearly AI Labs are directly competing on that front; it's just a normal choice between the default and a chosen alternative. Considering there are defaults and still successful competitors (eg. Safari v Chrome), there's no reason to think that competition can't handle this too.

Edit: I want to add that Google is also probably willing to give the model away at a discount to its true value in exchange for guaranteeing that their primary competition (who has tons of cash) won’t have an economic incentive to enter the foundation model training arms race.

Most users who actually want these features for anything more serious than summarization and style updates will probably find value in a modest subscription or ad-supported tier of higher quality models, even if just for occasional usage. Apple can provide this, but once you're comparing features, for many Gemini/Claude/ChatGPT may be a better fit.

Oh, and I think there is an unfortunate but real risk that once again, Apple totally over-promises here, and the AI models that they ship end up being pretty poor, and that drives users further into subscriptions.

dwaite · 2026-06-09T00:23:34 1780964614

> Apple themselves have said there are usage limits, with a subscription upgrade for more usage.

Specifically for image generation. They haven't indicated you have limits for Siri interactions.

vineyardmike · 2026-06-09T01:41:39 1780969299

> "Some features, including image generation, have daily usage limits because they rely on powerful server models. Increased access is available with most iCloud plus subscription plans".

Start at 1:07:00 in their announcement video. Craig is absolutely talking about "Apple Intelligence" as a whole in this segment.

Pragmatically, of course they'd need to add metering to any cloud available APIs that rely on large models. There's no way they will eat the cost of serving unlimited access to a cloud LLM to end users if they won't eat the cost of an image generation model.

dofm · 2026-06-08T21:20:40 1780953640

> Oh, and I think there is an unfortunate but real risk that once again, Apple totally over-promises here, and the AI models that they ship end up being pretty poor, and that drives users further into subscriptions.

OK, that I would concede is a possibility. Though Gemini is clearly capable, and the (alleged) story is that they have licensed a one-trillion parameter form of Gemini. I don't think they are making the same mistake.

ETA: I also concede they could make a different mistake ;-)

avidphantasm · 2026-06-08T23:19:59 1780960799

The AI labs are racing to create a moat out of trillion-parameter models and the GPUs that can run them. The problem is this is the wrong architecture for most AI inference use cases. On-device inference is where this is going, and clearly Apple believes this too. So Zitron is entirely correct about this AI datacenter build out being a boondoggle with no ROI.

vineyardmike · 2026-06-08T08:17:44 1780906664

I'll take the contrary position and say that I think the "tokenmaxxing" we've previously seen was useful (but shouldn't continue indefinitely). My TLDR position is that TokenMaxxing was a way to force discovery of Product Market Fit.

The push by companies to incorporate AI into everything is (depending on the company) either hype and cargo-culting or an attempt by management to (1) try and discover if/what new workflows or tools could use it and (2) force the haters to use it as it got better.

Where I work, there is an obvious split between people who have been willing to use AI, and those that hated it from day 1 and mocked the "stochastic parrots". Senior folks were disproportionately haters, and generally didn't see much productivity lift from early AI stuff. They strongly resisted the mandates to use AI, and completely missed the "agentic" inflection point that other colleagues experienced. The more willing users saw Claude Code/agents and were able to experience this as the genuine benefit it can be. Now that the more senior folks are using agentic programming, they're genuinely able to maintain code quality and see meaningful speed improvements in coding tasks.

Today, tokenmaxxing doesn't make sense because we found the product-market-fit of agentic coding. Now that most (?) employees are onboard with using it, the industry can shift focus to cost-effective usage and positive-ROI usage. For example, Uber shifting to a fixed per-employee token budget.

red-iron-pine · 2026-06-08T12:58:28 1780923508

> The push by companies to incorporate AI into everything is (depending on the company) either hype and cargo-culting or an attempt by management to (1) try and discover if/what new workflows or tools could use it and (2) force the haters to use it as it got better.

"we need to figure out if we can replace you with AI, or if it just extends your abilities"

apothegm · 2026-06-08T10:20:43 1780914043

Usually it’s the seller whose responsibility it is to find PMF, not the buyer.

vineyardmike · 2026-06-08T16:28:46 1780936126

Most tech companies are buyers of AI, many are also sellers of AI in their own products, and a few are also building it.

apothegm · 2026-06-09T11:54:19 1781006059

Sure. But the ones who are “tokenmaxxing” (I hate that term) are generally maximizing their usage as consumers.

“Try and discover if/what new workflows or tools could use it” is something that’s supposed to be done by the companies selling a product so they can then convince people to buy and use it — not something that the buyers are supposed to do.

vineyardmike · 2026-06-05T00:18:03 1780618683

Well, the article links to a source for the unreliability claim. You read the article, right?

The "coal is unreliable" narrative has existed for many years because coal plants break down and have mechanical issues at higher rates than other sources, renewable and otherwise. Coal is less reliable than natural gas and oil based power and nuclear, to say nothing of the comparison to solar, which obviously doesn’t need mechanical repairs. The method of converting coal to power is actually less reliable because it’s more mechanically intense, and has more moving parts, which makes it more fragile and repair prone.

If you google this, you’ll find that statistically across coal plants, they’ve historically had a reliability of 70-90% of target capacity due to mechanical issues.