Planager Mode
The latest batch of orangey LLM kool-aid inspired many to firestoke the claim that the future now belongs to the ideas guy. A similar sentiment had fermented in my gut from using Claude Code, shortly before this hot-take convention. I Claude Max-maxxed my way to what initially felt like a Sam's Club-sized token supply; I'd soon learn how one pushes a top-tier subscription to its very limit by, for example, Gas Town-like multi-agent orchestrations. But the seductive high of waking up in the morning to tens of agents grinding away proved ephemeral in practice for this hack-ademic. Instead of being an ideas guy, I was spending more time managing plans.
Welcome to planager mode.
The mirage of returns on multi-agent orchestrations was made evident to me by at least two projects (there were others, but these are the two standouts). The first was seeing the results of allowing 150 Sonnet agents to grade an equal number of exam scripts in parallel (perhaps this should be called an eval?), a task I estimated should have taken under 30 minutes. In reality, it ended up taking about 60 hours, including the time it took for me to do all the work myself[1]. The sloppiness of the multi-agent grading became clear from randomly sampling graded scripts; within 5 scripts, it was obvious that the agents had hallucinated their way to assigning grades to answers they hadn't read. This was not a problem of context management, but simply an inability to parse each student's unique way of writing answers in the assigned file, such as by creating extra cells. Somehow, these agents couldn't pick this up; repeat attempts at grading with Opus and prompts about such issues didn't lead to consistent outcomes.
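The spot check described above follows a standard auditing pattern: draw a few graded scripts uniformly at random and inspect them by hand. A minimal sketch, with hypothetical file names (the function and naming scheme are mine, not from the actual grading setup):

```python
import random

def sample_for_review(script_ids, k=5, seed=0):
    """Pick k graded scripts uniformly at random for manual spot-checking."""
    rng = random.Random(seed)  # fixed seed keeps the audit reproducible
    return rng.sample(script_ids, k)  # sampling without replacement

# Hypothetical file names for the 150 exam scripts.
scripts = [f"exam_{i:03d}.ipynb" for i in range(150)]
print(sample_for_review(scripts))
```

A small sample is surprisingly informative: if, say, half the scripts were graded badly, the chance that five random draws all miss a bad one is about 0.5^5, roughly 3%.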
The second project involved orchestrating a bunch of Opus agents to write PhD-level research surveys; while the outputs here contain hallucinations, I still consider them impressive[2]. I have resolved some of the issues, like hallucinated references, with deterministic scripts that are basically linters (I have only eyeballed that these work; a more robust test is due once the semester wraps up and I have more focused time on my hands). The lesson here is that prompting an LLM to not hallucinate doesn't mean much to it.
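To illustrate the idea of such a linter, here is a minimal sketch that flags cited titles absent from a vetted reference list. It assumes citations appear as Markdown links and that a curated set of known-good titles exists; both assumptions, and all names below, are illustrative rather than taken from the actual scripts:

```python
import re

def lint_references(survey_md: str, known_titles: set) -> list:
    """Flag cited titles that do not appear in the vetted reference list.

    Assumes citations are written as Markdown links: [Title](url).
    """
    cited = re.findall(r"\[([^\]]+)\]\(([^)]+)\)", survey_md)
    return [title for title, _url in cited if title not in known_titles]

# Hypothetical usage: every flagged title needs a human check.
survey = "[Attention Is All You Need](https://example.org/a) and [Made-Up Paper](https://example.org/b)"
known = {"Attention Is All You Need"}
print(lint_references(survey, known))  # → ['Made-Up Paper']
```

The point of making this deterministic is that the check never "agrees" with the model: a title either matches the vetted list or it gets flagged for review.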
So, besotted as we hack-ademics may be with well-crafted wordcelery, we do not undervalue accurate shape-rotation. Now, I am certain people out there will say I could have set up better specs or plans in both use cases, and I don't fully disagree, but that is also not how I imagine intelligence uptake will sell, especially to those trying to find personal gains from intelligence in diverse daily tasks, not just SWE use cases. That said, I'll also clarify that my comments and expectations are reasonable at a technical/SWE level. For example, the aforementioned exams involve Python (which agents supposedly love) and Markdown-formatted explanations in Jupyter Notebooks. These exams have a fairly robust set of auto-graded coding tests via nbgrader; these do not comprehensively grade an exam, as I am still in the loop to evaluate the Markdown explanations of student work, but the tests definitely reduce my workload more than LLMs have so far. So, I expect LLMs to deliver, at minimum, an equal reduction in my cognitive burden while grading; and if the kind of alien intelligence that automates white-collar work and threatens universities is here, then I think it is rational to assume better performance with minimal iterative plan-tweaking from me.
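For readers unfamiliar with nbgrader: an autograded test cell is ordinary Python that runs after the student's solution cell, and the cell scores zero if any assertion fails. The sketch below shows the shape of such a cell; `moving_average` is a hypothetical exam task, not one from the actual exams:

```python
# Stand-in for a student's solution cell (hypothetical exam task).
def moving_average(xs, k):
    """Return the k-window moving average of a list of numbers."""
    return [sum(xs[i:i + k]) / k for i in range(len(xs) - k + 1)]

# Stand-in for an nbgrader autograded test cell: plain assertions.
# If any assert raises, the autograder awards no points for this cell.
assert moving_average([1, 2, 3, 4], 2) == [1.5, 2.5, 3.5]
assert moving_average([5], 1) == [5.0]
print("all visible tests passed")
```

Deterministic checks like these cover the coding parts; the free-text Markdown explanations are what still needs a human (or, so far unsuccessfully, an agent) in the loop.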
With agents, every idea you've ever had becomes a plan that needs continuous iteration. For the right kinds of tasks, such plans work right away; others offer marginal improvements over an earlier implementation but eventually converge on what you want to see. Most such projects have some easily determinable behaviour. But tasks like grading 150 exams keep revealing new edge cases that the LLMs, and you, somehow missed in discussion.
And if you are like me, then you have likely kicked off several other disparate project ideas in parallel; coordinating plans across these multiple projects becomes unwieldy at what feels like an exponential rate.
I want to clarify: not a week goes by when I'm not impressed by all that these coding agents can do. I'm also left confused by what they cannot do — why are they so good at interpreting typos in the right context[3] but so poor at surfacing historical evidence[4]? But the illusory promise of draining the swamp of years of ideas within hours remains undelivered.
Coding agents have moments where they are jukebox-esque (the right tokens make them sing complex but testable software), but they are not there yet for more research-flavoured tasks, even when the outputs feel somewhat amenable to plans that follow an SDD/TDD-like breakdown. This is why using coding agents feels more like playing slot machines, especially for non-SWEs like me. If one is not careful, one can get stuck in the locally optimal trance of lever-pulling with no rewards. At the same time, because the technological marvel is so new and practical experiences are so poorly documented, this lesson can only be learned by, in fact, entering the trance.
1. I ran the experiment several times under different conditions to see if the agents would get any better, mostly because it felt interesting to do; eventually, I gave up and did the work, which took me between 22 and 25 hours manually. ↩
2. I discussed this in RalPhD. My feeling now is that LLMs will progress research much faster than they will grade exams, as there are too many possible flavours of answers in an exam taken by 150 students; I wouldn't have intuited this five years ago. Then again, people also wouldn't have anticipated white-collar work being the first to come under threat in the manner and at the scale we are currently witnessing. ↩
3. I asked Claude a somewhat vague question about a comment by Demos Hasboe, and it somehow quickly determined I meant Demis Hassabis before finding the right answer. ↩
4. I still try to see how many prompts it takes for them to get to the Goodyear experiments on rotating space stations, but somehow it's just not a topic that surfaces right away. It's why I consider the WiP article I wrote to be an oddity; I feel like this is something LLMs should get immediately. ↩