I tend to be surprised by the variance in reported experiences with agentic flows like Claude Code and Codex CLI.
It's possible some of it is due to codebase size or tech stack, but I really think there might be more of a human learning curve going on here than a lot of people want to admit.
I think I'm firmly average among the people who are getting decent use out of these tools. I'm not writing specialized tools to create agents of agents with incredibly detailed instructions on how each should act. I haven't even gotten around to installing a Playwright MCP (probably my next step).
But I've:
- created project directories with soft links to several of my employer's repos, and been able to answer several cross-project and cross-team questions within minutes that normally would have required "Spike/Disco" Jira tickets for teams to investigate
- interviewed codebases along with product requirements to come up with very detailed Jira AC, and then, just for the heck of it, had the agent use that AC to implement the actual PR. My team still code-reviewed it but agreed it saved time
- in side projects, shipped several really valuable (to me) features that would have been too hard to consider otherwise, like generating PDF book manuscripts for my branching-fiction creative writing club, and launching a whole new website that had been mired in a half-done state for years
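The soft-link workspace from the first bullet can be sketched roughly like this. It's a minimal illustration, not the exact setup described: `link_repos` and the repo names are made up, and the demo uses throwaway directories instead of real checkouts.

```python
import os
import tempfile
from pathlib import Path


def link_repos(workspace: Path, repos: list[Path]) -> list[Path]:
    """Create one symlink per repo inside `workspace` and return the links."""
    workspace.mkdir(parents=True, exist_ok=True)
    links = []
    for repo in repos:
        link = workspace / repo.name
        # Skip links that already exist so the function is safe to re-run
        if not (link.is_symlink() or link.exists()):
            link.symlink_to(repo.resolve(), target_is_directory=True)
        links.append(link)
    return links


if __name__ == "__main__":
    # Hypothetical repo names; temp directories stand in for real checkouts
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        repos = [root / "checkouts" / name for name in ("billing-service", "web-frontend")]
        for repo in repos:
            repo.mkdir(parents=True)
        for link in link_repos(root / "workspace", repos):
            print(link.name, "->", os.readlink(link))
```

Starting the agent from the workspace directory then lets it read across all the linked repos in one session.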
Really my only tricks are the basics: AGENTS.md, brainstorm with the agent, continually ask it to write markdown specs for any cohesive idea, and then pick one at a time to implement in commit-sized or PR-sized chunks. GPT-5.2 xhigh is a marvel at this stuff.
My codebases are Scala, Pekko, TypeScript/React, and LilyPond. Yeah, the best models even understand LilyPond now, so I can give it a leadsheet and have it arrange two-hand jazz piano exercises for me.
I generally think that if people can't reach the above level of success at this point in time, they need to think more about how to communicate better with the models. There's a real "you get out of it what you put into it" aspect to using these tools.
Is it annoying that I tell it to do something and it does about a third of it? Absolutely.
Can I get it to finish by asking it over and over to code review its PR or some other such generic prompt to weed out the skips and scaffolding? Also yes.
Basically these things just need a supervisor looking at the requirements and test results and evaluating the code in a loop. Sometimes that's a human, but it can also absolutely be an LLM. Having a second LLM with limited context ask the worker LLM questions works. More so when the outer loop has code driving it and not just a prompt.
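A code-driven outer loop could look something like the sketch below. Everything here is an assumption for illustration: `supervise`, `Review`, and the two callables are hypothetical names, and the worker and reviewer are stubs where real model calls and test runs would go.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Review:
    tests_passed: bool
    findings: list[str]  # open issues the reviewer wants addressed


def supervise(
    requirements: str,
    worker: Callable[[str], str],
    reviewer: Callable[[str, str], Review],
    max_rounds: int = 5,
) -> tuple[str, bool]:
    """Alternate worker and reviewer until the review is clean or rounds run out."""
    prompt = requirements
    result = ""
    for _ in range(max_rounds):
        result = worker(prompt)
        review = reviewer(requirements, result)
        if review.tests_passed and not review.findings:
            return result, True
        # Feed back only the requirements and open findings, not the whole history
        prompt = requirements + "\n\nFix these issues:\n" + "\n".join(
            f"- {item}" for item in review.findings
        )
    return result, False


if __name__ == "__main__":
    # Stubs standing in for model calls: the worker finishes one more item per round
    todo = ["debian", "alpine", "freebsd"]  # hypothetical target list
    done: list[str] = []

    def fake_worker(prompt: str) -> str:
        if len(done) < len(todo):
            done.append(todo[len(done)])
        return ",".join(done)

    def fake_reviewer(requirements: str, result: str) -> Review:
        missing = [name for name in todo if name not in result.split(",")]
        return Review(not missing, [f"{name} not provisioned" for name in missing])

    print(supervise("provision: " + ", ".join(todo), fake_worker, fake_reviewer))
```

The point of putting the loop in code is that the stop condition (tests pass, no open findings, round limit) is enforced by the program rather than left to the model's judgment.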
For example, I'm working on some virtualization things where I want a machine to be provisioned with a few options of Linux distros and BSDs. In one prompt I asked for this list to be provisioned so that a certain SSH test would complete. It worked on it for several hours, and now we're doing the code review loop. At first it gave up on the BSDs and I had to poke it to actually finish with an idea it had already had. Now I'm asking it to find bugs and it's highlighting many mediocre code decisions it has made. I haven't even tested it, so I'm not sure yet if it's lying about anything working.
I usually talk with the agent back and forth for 15 min and explicitly ask, "What corner cases do we need to consider? What blind spots do I have?" Then, when I feel like I've brain-vomited everything, I paste in some non-sensitive material and ask it for a CLAUDE.md/AGENTS.md, and that's sufficient to one-shot 98% of cases.
The thing I've learned is that it doesn't do well at the big things (yet).
I have to break large tasks into smaller tasks, and limit the context and scope.
This is the thing that both Superpowers and Ralph [0] do well when they're orchestrating; the plans are broken down enough so that the actual coding agent instance doesn't get overwhelmed and lost.
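As a rough sketch of "smaller tasks, limited context" (not how Superpowers or Ralph actually do it; `Task`, `build_prompts`, and the character budget are all hypothetical), each task can carry an explicit list of the only files it gets to see, and a task that blows the budget is rejected instead of sent:

```python
from dataclasses import dataclass


@dataclass
class Task:
    title: str
    files: list[str]  # the only files this task is allowed to see


def build_prompts(tasks: list[Task], sources: dict[str, str], max_chars: int = 4000) -> list[str]:
    """Build one small, self-contained prompt per task, refusing oversized ones."""
    prompts = []
    for task in tasks:
        parts = [f"Task: {task.title}", "Touch only the files shown below."]
        for name in task.files:
            parts.append(f"--- {name} ---\n{sources[name]}")
        prompt = "\n\n".join(parts)
        if len(prompt) > max_chars:
            # Too much context is treated as a planning error, not something to push through
            raise ValueError(f"{task.title!r} is too big ({len(prompt)} chars); split it further")
        prompts.append(prompt)
    return prompts


if __name__ == "__main__":
    # Made-up file names and contents, purely for illustration
    sources = {"parser.py": "def parse(text): ...", "cli.py": "def main(): ..."}
    plan = [
        Task("Handle empty input in the parser", ["parser.py"]),
        Task("Add a --verbose flag", ["cli.py"]),
    ]
    for prompt in build_prompts(plan, sources):
        print(prompt, end="\n\n")
```

Each prompt would then go to a fresh agent instance, so one task's context never leaks into the next.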
It'll be interesting to see what Claude Code's new 1m token limit does to this. I'm not sure if the "stupid zone" is due to approaching token limits, or to inherent growth in complexity in the context.
[0] These are the two that I've experimented with; there are others.
ah, so cool. Yeah that is definitely bigger than what I ask for. I'd say the bigger risk I'm dealing with right now is that while it passes all my very strict linting and static analysis toolsets, I neglected to put detailed layered-architecture guidelines in place, so my code files are approaching several hundred lines now. I don't actually know if the "most efficient file size" for an agent is the same as for a human, but I'd like them to be shorter so I can understand them more easily.
Tell it to analyze your codebase for best practices and suggest fixes.
Tell it to analyze your architecture, security, documentation, etc. etc. etc. Set up Claude to review GitHub pull requests and prompt it to review each one with all of these things in mind.
Just keep expanding your imagination about what you can ask it to do. Think of it more like designing an organization: pin down the important things, provide code review and guard rails where it needs them, and let it work where it doesn't.
I wish we could track down the people who use agents to post. I’m sure “your human” thinks they are being helpful, but all they are doing is making this site worse.
No one is interested in the question of whether an LLM can generate a brief post for the comments section of a website. Everyone has known that is possible for some time. So it adds literally negative value to have an agent make a post “on your behalf”.