I ran the same test I ran on Opus 4.6: feeding it my whole personal collection of ~900 poems, spanning ~16 years.
It is a far cry from Opus 4.6.
Opus 4.6 was (is!) a giant leap, the largest since Gemini 2.5 Pro. It didn't hallucinate anything and produced honestly mind-blowing analyses of the collection as a whole.
Sonnet 4.6 feels like an evolution of whatever the previous models were doing. It is marginally better in that its mistakes seemed fewer and less severe, but ultimately it made all the usual ones (making things up, saying it'll quote a poem and then quoting another, getting time periods mixed up, etc.).
My initial experiments with coding leave the same feeling. It is better than previous similar models, but a long distance away from Opus 4.6. And I've really been spoiled by Opus.
Opus 4.6 is outstanding for code, and, for the little I have used it outside of that context, in everything else as well. My productivity with code is at least 3x what I was getting with 5.2, and it can handle entire projects fairly responsibly. It doesn’t patronize the user, and it makes a very strong effort to capture and follow intentions. Unlike 5.2, I’ve never had to throw out a day’s work that it covertly screwed up by taking shortcuts and just guessing.
I’ve had it make some pretty obvious mistakes. I have to hold back the impulse to “unstick” it manually. In my case, it’s been surprisingly good at eventually figuring out what it was doing wrong - though sometimes it burns a few minutes of tokens in the process.
Claude's willingness to poke outside of its present directory can definitely be a little worrying. Just the other day, it started trying to access my jails after I specifically told it not to.
On a Mac, I use built-in sandboxing to jail Claude (and every other agent) to $CWD so it doesn’t read/write anything it shouldn’t, doesn’t leak env, etc. This is done by dynamically generating access policies and I open sourced this at https://agent-safehouse.dev
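For a rough idea of what "dynamically generating access policies" can look like (this is a simplified sketch of my own, not the actual policies the linked tool generates; the allowed system paths and rules are assumptions, and a real profile needs many more read allowances for an agent to run at all):

```python
def make_profile(cwd: str) -> str:
    """Generate a deny-by-default sandbox-exec (SBPL) profile that
    confines file writes to the given working directory.

    Hypothetical sketch: the system read paths below are illustrative,
    and sandbox-exec is technically deprecated but still ships with macOS.
    """
    return f"""(version 1)
(deny default)
(allow process-fork)
(allow process-exec*)
(allow file-read* (subpath "/usr") (subpath "/System") (subpath "/Library"))
(allow file-read* file-write* (subpath "{cwd}"))
(allow network-outbound)
"""
```

You would then launch the agent with something like `sandbox-exec -p "$(python gen_profile.py)" claude`, regenerating the profile per invocation so the jail always tracks the current directory.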
By any chance, do you know what Claude Code's sandbox feature uses under the hood, and how that relates to your solution? From what I remember it also uses the native macOS sandbox framework, but I haven't looked too deeply into it and don't trust it fully.
Claude Code sandboxing uses the same basic OS primitive but grants read access to the entire filesystem and includes escape hatches (some commands bypass sandboxing). Also, I wanted something solid I can use to limit every agent (OpenCode, Pi, Auggie, etc).
I haven’t found an easy way, but I have a working theory:
sandbox-exec cannot filter based on domain names, but it can restrict outbound network connections to a specific IP/port (and drop the rest). If I can run a proxy on localhost:19999, I can allow agents to connect through it and filter connections by hostname. From my research, most agents support $HTTP_PROXY, so I'll try redirecting their HTTP requests through my security proxy. IIRC, if I do this at the CONNECT level, I don't need to MITM their traffic nor require a trusted root cert.
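The key point of doing this at the CONNECT level is that the hostname appears in plaintext in the proxy request itself, so the filtering decision needs no TLS interception. A minimal sketch of that decision (the allowlist and port policy here are hypothetical):

```python
# Hypothetical hostname allowlist for agent egress filtering.
ALLOWED_HOSTS = {"api.anthropic.com", "github.com"}

def allow_connect(request_line: str, allowed=ALLOWED_HOSTS) -> bool:
    """Decide whether a proxy should tunnel a CONNECT request.

    An HTTP proxy CONNECT request line looks like:
        CONNECT host:port HTTP/1.1
    Since the target hostname is right there in the request, the proxy
    can filter by domain without MITM-ing TLS or a trusted root cert.
    """
    parts = request_line.split()
    if len(parts) != 3 or parts[0] != "CONNECT":
        return False
    host, _, port = parts[1].rpartition(":")
    return host.lower() in allowed and port.isdigit()
```

An allowed request would then be tunneled by opening a TCP connection to host:port and splicing bytes in both directions; a denied one gets a 403 and a closed socket.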
Recently, Codex CLI implemented something like DNS filtering for their sandbox, so I'd investigate their repo.
Some commercial firewalls will snoop on the SNI header in TLS requests and send a RST towards the client if the hostname isn’t on a whitelist. Reasonably effective. If there’s a way with the macOS sandboxing to intercept socket connections, you might find proxy software that already supports this.
I like seeing this analysis on new model releases, any chance you can aggregate your opinions in one place (instead of the hackernews comment sections for these model releases)?
Opus 4.6 has been awful for me and my team. It immediately goes off the rails, jumps to conclusions about what we want and ask for, and just keeps chugging along forever, refusing to let anything stop it down whatever path it decides on. 4.5 was awesome and is still our go-to model.
That's interesting; 4.6 is when AI finally started to become good in my eyes. I have a very strict planning phase: argue, plan, then partially execute. I like it to do the boilerplate, then I do the hard stuff myself and have it do a once-over at the end.
Although I have had it try to debug something and just get stuck chugging tokens.
I have found this to be true too, and I thought I was the only one. Everyone is praising 4.6, and while it’s great at agentic and tool use, it does not follow instructions as cleanly as 4.5. I also feel like 4.5 was just way more efficient.
I think that's because not everyone does the same job within the same stack and constraints. I've yet to find an LLM that writes the kind of C++ I dabble with without my having to manually tweak it (or that truly understands our codebase). Conversely, I find that LLMs are now excellent at Python and orchestration tasks, for instance. It's very situational.
This seems to agree with my own previous tests of Sonnet vs Opus (not on this version). If I give them a task with a large list of constraints ("do this, don't do this, make sure of this"), like 20-40, Sonnet will forget half of it, while Opus correctly applies all directives.
My intuition is that this is just related to model size / its "working memory", and will likely be fixed neither by training Sonnet with Opus nor by steadily optimizing its agentic capabilities.
I'd agree that this effect is probably due mainly to architectural parameters such as the number and dimensions of attention heads and the hidden dimension, rather than to the raw parameter count or the amount of training.
Saw something about Sonnet 4.6 having had a greatly increased amount of RL training over 4.5.
I'm curious how this would compare with codex 5.3. I've heard Codex actually is pretty good but Opus 4.6 has become synonymous with AI coding because all the big names praise it. I haven't compared them against each other though so can't really draw a conclusion.
There are no universals. You have to try it on your particular codebase and see what works for you.
For me, OpenAI is ahead in intelligence, and Anthropic is ahead in alignment. I use both but for different tasks.
Given the pace of change, intuition is somewhat of a liability: what's true today may not be true tomorrow. You have to constantly keep an open mind and try new things.