I ran the same test I ran on Opus 4.6: feeding it my whole personal collection of ~900 poems, spanning ~16 years.
It is a far cry from Opus 4.6.
Opus 4.6 was (is!) a giant leap, the largest since Gemini 2.5 Pro. It didn't hallucinate anything and produced honestly mind-blowing analyses of the collection as a whole.
Sonnet 4.6 feels like an evolution of whatever the previous models were doing. It is marginally better in that its mistakes seemed fewer and less severe, but ultimately it made all the usual ones (making things up, saying it'll quote a poem and then quoting another, getting time periods mixed up, etc.).
My initial experiments with coding leave the same feeling. It is better than previous similar models, but a long distance away from Opus 4.6. And I've really been spoiled by Opus.
Opus 4.6 is outstanding for code, and, for the little I have used it outside of that context, in everything else as well. My productivity with code is at least 3x what I was getting with 5.2, and it can handle entire projects fairly responsibly. It doesn’t patronize the user, and it makes a very strong effort to capture and follow intentions. Unlike 5.2, I’ve never had to throw out a day’s work that it covertly screwed up by taking shortcuts and just guessing.
I’ve had it make some pretty obvious mistakes. I have to hold back the impulse to “unstick” it manually. In my case, it’s been surprisingly good at eventually figuring out what it was doing wrong - though sometimes it burns a few minutes of tokens in the process.
Claude's willingness to poke outside of its present directory can definitely be a little worrying. Just the other day, it started trying to access my jails after I specifically told it not to.
On a Mac, I use built-in sandboxing to jail Claude (and every other agent) to $CWD so it doesn’t read/write anything it shouldn’t, doesn’t leak env, etc. This is done by dynamically generating access policies and I open sourced this at https://agent-safehouse.dev
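For a rough idea of what "dynamically generating access policies" can look like (this is a simplified sketch of my own, not the actual policies the linked tool generates; the allowed system paths and rules are assumptions, and a real profile needs many more read allowances for an agent to run at all):

```python
def make_profile(cwd: str) -> str:
    """Generate a deny-by-default sandbox-exec (SBPL) profile that
    confines file writes to the given working directory.

    Hypothetical sketch: the system read paths below are illustrative,
    and sandbox-exec is technically deprecated but still ships with macOS.
    """
    return f"""(version 1)
(deny default)
(allow process-fork)
(allow process-exec*)
(allow file-read* (subpath "/usr") (subpath "/System") (subpath "/Library"))
(allow file-read* file-write* (subpath "{cwd}"))
(allow network-outbound)
"""
```

You would then launch the agent with something like `sandbox-exec -p "$(python gen_profile.py)" claude`, regenerating the profile per invocation so the jail always tracks the current directory.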
By any chance, do you know what Claude Code's sandbox feature uses under the hood, and how that relates to your solution? From what I remember it also uses the native macOS sandbox framework, but I haven't looked too deeply into it and don't trust it fully.
Claude Code sandboxing uses the same basic OS primitive but grants read access to the entire filesystem and includes escape hatches (some commands bypass sandboxing). Also, I wanted something solid I can use to limit every agent (OpenCode, Pi, Auggie, etc).
I haven’t found an easy way, but I have a working theory:
sandbox-exec cannot filter based on domain names, but it can restrict outbound network connections to a specific IP/port (and drop the rest). If I can run a proxy on localhost:19999, I can allow agents to connect through it and filter connections by hostname. From my research, most agents support $HTTP_PROXY, so I'll try redirecting their HTTP requests through my security proxy. IIRC, if I do this at the CONNECT level, I don't need to MITM their traffic nor require a trusted root cert.
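The key point of doing this at the CONNECT level is that the hostname appears in plaintext in the proxy request itself, so the filtering decision needs no TLS interception. A minimal sketch of that decision (the allowlist and port policy here are hypothetical):

```python
# Hypothetical hostname allowlist for agent egress filtering.
ALLOWED_HOSTS = {"api.anthropic.com", "github.com"}

def allow_connect(request_line: str, allowed=ALLOWED_HOSTS) -> bool:
    """Decide whether a proxy should tunnel a CONNECT request.

    An HTTP proxy CONNECT request line looks like:
        CONNECT host:port HTTP/1.1
    Since the target hostname is right there in the request, the proxy
    can filter by domain without MITM-ing TLS or a trusted root cert.
    """
    parts = request_line.split()
    if len(parts) != 3 or parts[0] != "CONNECT":
        return False
    host, _, port = parts[1].rpartition(":")
    return host.lower() in allowed and port.isdigit()
```

An allowed request would then be tunneled by opening a TCP connection to host:port and splicing bytes in both directions; a denied one gets a 403 and a closed socket.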
Recently, Codex CLI implemented something like DNS filtering for their sandbox, so I'd investigate their repo.
Some commercial firewalls will snoop on the SNI header in TLS requests and send a RST towards the client if the hostname isn’t on a whitelist. Reasonably effective. If there’s a way with the macOS sandboxing to intercept socket connections, you might find proxy software that already supports this.
I like seeing this analysis on new model releases, any chance you can aggregate your opinions in one place (instead of the hackernews comment sections for these model releases)?
Opus 4.6 has been awful for me and my team. It immediately goes off the rails, jumps to conclusions about what we want and ask for, and just keeps chugging along forever, refusing to let anything stop it down whatever path it decides on. 4.5 was awesome and is still our go-to model.
That's interesting; 4.6 is when AI finally started to become good in my eyes. I have a very strict planning phase: argue, plan, then partially execute. I like it to do the boilerplate, then I do the hard stuff myself and have it do a once-over at the end.
Although I have had it try to debug something and just get stuck chugging tokens.
I have found this to be true too, and I thought I was the only one. Everyone is praising 4.6, and while it’s great at agentic and tool use, it does not follow instructions as cleanly as 4.5. I also feel like 4.5 was just way more efficient.
I think that's because not everyone does the same job within the same stack and constraints. I've yet to find an LLM that writes the kind of C++ I dabble with without my having to manually tweak it (or that truly understands our codebase). Conversely, I find that LLMs are now excellent at Python and orchestration tasks, for instance. It's very situational.
This seems to agree with my own previous tests of Sonnet vs Opus (not on this version). If I give them a task with a large list of constraints ("do this, don't do this, make sure of this"), like 20-40, Sonnet will forget half of it, while Opus correctly applies all directives.
My intuition is that this is just related to model size / its "working memory", and will likely be fixed neither by training Sonnet with Opus nor by steadily optimizing its agentic capabilities.
I'd agree that this effect is probably due mainly to architectural parameters such as the number and dimensions of attention heads and the hidden dimension, rather than to the raw parameter count or the amount of training.
Saw something about Sonnet 4.6 having had a greatly increased amount of RL training over 4.5.
I'm curious how this would compare with codex 5.3. I've heard Codex actually is pretty good but Opus 4.6 has become synonymous with AI coding because all the big names praise it. I haven't compared them against each other though so can't really draw a conclusion.
There are no universals. You have to try it on your particular codebase and see what works for you.
For me, OpenAI is ahead in intelligence, and Anthropic is ahead in alignment. I use both but for different tasks.
Given the pace of change, intuition is somewhat of a liability: what's true today may not be true tomorrow. You have to constantly keep an open mind and try new things.