RalPhD and The Autonomous Paper

Towards PhD-level Research Papers in a Fully Autonomous Ralph-loop

After reading Maggie Appleton's excellently illustrated blogpost, I began building a Gas Town-inspired multi-agent system that can perform aspects of PhD-level research1. One of this system's early outputs spans 94 pages and surveys soft robotics for space applications, wholly authored by a team of 20+ agents (I use the word agent quite loosely here but will clarify this later in the essay) in roughly 8 hours (worthy of a blog post of its own, perhaps on a later date) from a 2-minute voice prompt without any further steering. The agents run as Github Actions jobs, so all work is completed inside the repository; this allows overnight jobs and breaks from a screen (over Christmas, a Cloudflare tunnel to my laptop, when it was on, allowed triggering work while not near it) while staying within Github Actions' free limits, with the downside of higher latency than a setup involving a virtual machine or sandbox.

A quick skim of this output showed hallucinated references; this paper is unworthy of submission to a scientific journal for a variety of reasons, but I have submitted it to the new Journal of AI Generated Papers. I also tested this system to generate reports on weather prediction using CorDiff for Rich Pauloo, a dear friend. While I confirmed the hallucinated references (I wasn't surprised by these hallucinations; a vibe-coded system that relies only on prompts can only be so effective), I suspect there are also hallucinated claims, though I have not closely checked for these.

Skimming the outputs, Rich and I felt that such setups can definitely be steered towards accuracy by engineering better workflows. My personal belief from the longer paper is somewhat stronger: that such a vibe-coded system comprising large language models (LLMs) could be engineered to fully autonomously write hallucination-free papers of serious breadth, depth, length, insight, and rigour. The article's quality could even be made good enough for acceptance into a peer-reviewed journal. This feels like a fun experiment for learning how to use modern AI tooling; it partially underlies my attempt at making RalPhD. I am less sure what this means for academia and publishing; if papers become easy to write and are written chiefly to chase metrics, that is not a good enough end. But if AI-augmented researchers can pull futuristic ideas into the modern day more rapidly, especially the cooler sci-fi ones, then I think history will be less harsh on the community. What I am quite confident about is that this is going to happen.

In this post, I present some of the key things I have learned in building the RalPhD system, which is yet to be battle-tested. I wouldn't recommend anyone use it yet; I think it's riddled with bugs. But I needed to write things down to clear my thoughts, as I have been spending far too much time in Claude Code.

The Ralph Loop

To improve on these early attempts at autonomous papers, Rich nudged me to study and potentially incorporate the Ralph loop into my system (which should explain why the new system is called RalPhD). Introduced by Geoff Huntley in mid-2025, Ralph remains a hot concept in software engineering circles2: the idea is to have the LLM implement a sequential plan from inside a while loop. The LLM works, uninterrupted, until all tasks in the plan are checked off; the bash script that achieves this is effective and deceptively simple:

while true; do
	cat build-prompt.md | claude --dangerously-skip-permissions
done

When I reviewed my first autonomous paper system, I found a similar-looking while loop; but a closer examination revealed that a while loop does not a Ralph loop make. So, what does?

Well, here is what my system was doing:

while true; do
	cat implementation-plan.md | claude --dangerously-skip-permissions
done

If you look closely at the markdown file being piped into Claude, you will notice that my approach was looping directly through an entire plan (i.e., implementation-plan.md), whereas Huntley's passes a meta-prompt (i.e., build-prompt.md) that tasks the LLM with completing only one task from the plan per iteration of the loop. Specifically, his prompt tells the agent to study the implementation plan and pick the most important thing to do. This is followed by: IMPORTANT: update the implementation plan when the task is done (all-caps ensures Claude pays special attention to these bits). You can see this below in his Lynchian masterpiece; I suspect its presentation alone makes it a worthy rival to Engelbart's Mother of All Demos.

The Lynchian demonstration of a Ralph Loop; maybe this will be seen as the new mother of all demos?

My orchestration system's implementation-plan.md lacks this higher-level constraint, so Claude would plow through the plan in the same context window, which was prone to automatic compaction3. This is certainly one of the causes of the hallucinations. I have since improved my first implementation to match Huntley's task-level prompt.
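For concreteness, a task-level meta-prompt in this spirit might look like the following; this is my own sketch, not Huntley's exact wording:

```
# build-prompt.md
Study implementation-plan.md and pick the single most important
unchecked task. Complete ONLY that task in this session.

IMPORTANT: update implementation-plan.md to check the task off
when it is done, then stop.
```

Because the loop restarts Claude on every iteration, each task begins in a fresh context window instead of inheriting a compacted one.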

As research tasks can be less clear-cut and possibly longer-running, I also made my loop smarter by asking Claude to code in some context thresholding, which makes the agent aware of its own context window limits. At this threshold value, a separate bash script deterministically writes a yield file, which triggers the agent to finish its current subtask, make a commit, and update a checkpoint.md noting the next subtask to work on in a new session. Note that I do not ask the agent to update the implementation-plan.md at yield time; as the task remains unchecked, the new agent re-situates itself in the unchecked task, and the checkpoint.md provides the needed granularity at the sub-task level.
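A minimal sketch of such a thresholding check, assuming a crude characters-per-token heuristic; the names `THRESHOLD_TOKENS`, `estimate_tokens`, and `yield.flag` are my illustrations, not RalPhD's actual code:

```python
from pathlib import Path

THRESHOLD_TOKENS = 120_000   # stay under the context limit (illustrative value)
CHARS_PER_TOKEN = 4          # rough heuristic for English text

def estimate_tokens(transcript: str) -> int:
    """Crude token estimate: roughly 4 characters per token."""
    return len(transcript) // CHARS_PER_TOKEN

def maybe_write_yield(transcript: str, yield_path: Path = Path("yield.flag")) -> bool:
    """Write the yield file once the session transcript nears the limit."""
    if estimate_tokens(transcript) >= THRESHOLD_TOKENS:
        yield_path.write_text(
            "yield: finish current subtask, commit, update checkpoint.md\n"
        )
        return True
    return False
```

The agent's prompt then instructs it to check for this file periodically and wind down gracefully when it appears.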

While an LLM's prompt-following is not always reliable, Huntley's suggestions on how to Ralph provided testable guardrails for my first implementation; so far, every task has started in a fresh context window and I have noticed improvements, such as fewer hallucinated references. That said, I have also incorporated linters that offer more deterministic checks against reference hallucination, so I couldn't say for sure which change led to the improvements.
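To give a flavour of what a deterministic reference check can look like, here is a toy linter that flags bibliography entries lacking a DOI or URL, a cheap tripwire for hallucinated references. This is my illustration; the post's actual linters may work differently:

```python
import re

# An entry is "locatable" if it carries a DOI or any URL.
DOI_OR_URL = re.compile(r"(doi\.org/10\.\d{4,}|https?://\S+)")

def lint_references(entries: list[str]) -> list[str]:
    """Return entries with no resolvable locator (candidates for hand-checking)."""
    return [entry for entry in entries if not DOI_OR_URL.search(entry)]
```

A real check would go further and actually resolve each DOI, but even this string-level pass turns "trust the model" into a testable guardrail.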

But I was now suspicious about a multi-agent system achieving great things based on prompting vibes alone so I decided to rebuild a new multi-agent system using Ralph principles — RalPhD — from the ground-up.

Yes, after nearly two months of Claude Code, I was reading code again.

Of Coding Agents in Ralph Loops and Harnesses

Now, the ideas I'm about to discuss seem straightforward and common sense in retrospect, but the density of useful information in the AI domain is concentrated in podcasts and Zoom recordings (all hail oral culture, I guess?); this has made the knowledge feel more ephemeral and less diffuse than it actually might be. My bad habit of dumping links into Claude for summarisation, and general Claude Code fatigue, has ensured poor retention of information; integrating it into a gestalt of useful knowledge has somehow never come into focus until now4.

It was in one of these AI builder community call recordings that I learned of the smart and dumb zones in an LLM's context window. It is purported that LLMs stay in the smart zone only up to somewhere between 80k and 128k tokens, between 40% and 64% of the earlier Opus's 200k-token context window (Dex is still holding onto the 100k tokens even with Opus 4.6's new 1M window), so less is more. Loading too many MCPs, for example, fills the context window enough to put the LLM in the dumb zone; this can happen even before a single user message is sent.

The dumb zone should be avoided.

Being more judicious with each of the items in blue can better ensure that your agent operates in the "smart zone" for longer, as shown in the schematic below.

Operate in the smart zone by being smarter.

Ralphing had helped resolve issues (potentially) linked to compaction and context rot. I wondered what other ways there might be of more effectively using context and also making LLM behaviours more deterministic. It was around this time that I sat more deeply with Huntley’s piece on coding agents; I realised that his ideas integrated nicely with the notion of the smart and dumb zones5.

Huntley's explanation (even if others described it before him, this was my first introduction to a specific definition) of a coding agent is that it's a program that restrains the agentic urge of an LLM so that it only uses6 a specific set of tools to build software to specifications (or specs). One can program an agent's tool-use capabilities in a Python script by accessing only the reasoning engine and giving that engine only those tools essential for its role in the plan.

For example, consider two agents in the RalPhD setup: the deep reader and the scout. The former should only do two things: read a paper, article, or blogpost stored in the project's library, and write notes on it. If it feels the need to read something not in this collection, it shouldn't be allowed to go out and find it on its own, as this would fill up its context window. Instead, it should delegate that task to the scout, which is much like a research librarian; the scout shouldn't take notes on a paper or read its contents deeply, but should find the relevant literature and collect interesting papers by reading abstracts7. To ensure they operate in the smart zone, I can program them, per Huntley's idea, to use only specific tools in completing actions. To this end, I have a Python script for such agents, say agent.py; pseudocode for the aforementioned scout and deep reader is shown below:

# scout's inner loop (runs when agent.py is invoked with --agent "scout")
while True:
	response = claude.complete(task)
	if response.wants_tool:
		result = web_search(response.query)  # the only discovery tool it has
		task = result
	if yield_file_exists():
		break
# output: scored_papers.md — NOT analysis notes

# deep-reader's inner loop (runs when agent.py is invoked with --agent "deep-reader")
while True:
	response = claude.complete(task)
	if response.wants_tool:
		result = read_pdf(response.pages)  # the only input tool it has
		task = result
	if yield_file_exists():
		break
# output: notes.md — NOT the next paper to read

In each case, you can see claude; this is the reasoning engine alone, not the hyper-tooled ferret you get in Claude Code with access to all tools. In the first while loop, the scout only searches the web (i.e., web_search), whereas the deep reader only reads the PDF (i.e., read_pdf). This while loop, also called the inner loop, is where the agent can keep using its allocated set of tools.

Each agent is then invoked inside a Ralph loop — this is the outer loop — as shown below:

while true; do
	cat build-prompt.md | python3 agent.py --agent "$AGENT"
done

You will notice that the revised Ralph loop runs the agent via a Python script (python3 agent.py), whereas the erstwhile loop passed build-prompt.md directly to claude. So we now have a tool-delimited coding agent; this ought to lead to more deterministic behaviour from each agent.

In the grander RalPhD setup, the scout gets web_search, citation_lookup, and citation_download; the deep-reader gets read_file, write_file, and some tools for manipulating PDFs. There are a slew of other agents like coder, paper-writer, and so on; I’m hoping to battle-test this multi-agent implementation later this week.
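The per-agent tool sets above can be captured in a simple allowlist inside agent.py. The dispatch sketch below is my illustration: the scout and deep-reader tool names come from this post, but `pdf_extract_pages` stands in for the unspecified PDF-manipulation tools:

```python
# Map each agent role to the only tools it may call.
AGENT_TOOLS = {
    "scout": {"web_search", "citation_lookup", "citation_download"},
    "deep-reader": {"read_file", "write_file", "pdf_extract_pages"},
}

def allowed(agent: str, tool: str) -> bool:
    """An agent may only call tools on its own allowlist."""
    return tool in AGENT_TOOLS.get(agent, set())
```

The inner loop then refuses (or simply never exposes) any tool for which allowed() is False, which is what keeps the scout from reading PDFs deeply and the deep reader from wandering the web.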

Hopefully, this clarifies a bit of what I am working towards with RalPhD.

Terminology Clarifications

Going over foundational ideas in agentic engineering has been beneficial at an implementation level, but it has also provided some clarifications on terminology. For example, not all LLMs should be seen as agentic8. My spinterpretation (spin + interpretation) of this is that agentic LLMs are the subset of all LLMs with the distinguishing property that they are capable of using tools to script programs and, therefore, effect change in your codebase (either on your machine, a virtual machine, or a repository on Github or elsewhere). By extension, coding agents are the subset of agentic LLMs with programmatically restricted tool-calling capabilities.

Another piece also cleared up the confusing terminology of harnesses:

I forget where on Huntley's site I found the snippet below.

But that said, if all else fails, the version I have found most useful is that a coding agent is like a powerful but well-trained dog in a harness. An agentic LLM is one step removed from that; it's like a powerful dog that takes you for a walk, as opposed to you walking it. As for an LLM… well, I don't have a good analogy, so I'll just say it's like a fox; you see it once in a while from a distance, but harbour limited desire to get closer to one, even if it looks kinda cute — especially after you've had a dog.

Closing Remarks

While Ralph loops feel quite fundamental right now, I suspect they will become less relevant to software engineering. Eliminating the poor performance of models in the supposed dumb zone is probably already a big focus at frontier labs; fully autonomous software engineering (whether with agentic LLMs or coding agents) that is totally reliable should arrive in the next couple of years.

What I am less sure about is whether widening the smart zone in an LLM's context window will lead to models independently and autonomously making new discoveries; I suspect this will be particularly hard in domains that are data-deficient (i.e., domains unlike the life sciences). That said, I am more confident that AI-augmented human researchers will help here: they will make discoveries faster, better explore the breadth of the design space of a problem, and find connections in the literature by better steering synthesis agents in a multi-agent team.

But, given current trends, these researchers might need to engineer robust AI harnesses on their own. Here, I think Ralph’s principles will remain relevant to research; while the loop currently emphasises better context window management of LLMs, I can see it limiting the volume of LLM outputs humans have to navigate. In other words, the loop will help better manage humans’ context windows. This thought is what underpins my pursuit of developing RalPhD.

My longer-term objective is to use these principles to focus on making more personally meaningful and tangible progress on [[ angadh.com/_notes/spaceship-engineering-1|building spaceships ]] without, for example, the hallucinations of the above paper. With few funders9 for such work, perhaps a team of AI agents working alongside humans — an agentic research team, if you will — can help drag this idea from the future into the present.

  1. As this lacks a singular definition, I define this as the kind of research question that lacks an answer — or, at least, a definitive one. A PhD is awarded for making the question seem worthwhile to have spent time thinking about alongside developing an acceptable set of answers that either resolve the question or lead to even more questions. 

  2. This is saying something for an idea in AI, where progress is so rapid that anything that lasts a couple of months feels Lindy. The Ralph concept is about 9 months old at this point. 

  3. I am no expert on how compaction happens, but my suspicion is that an LLM is likely being used to summarize the work when maximum context is used. So, if a long implementation plan is being executed on, lossy summarization is the result. 

  4. My intent in writing this is to finally integrate the key things I learned in the last month. I suspect some of this is incorrectly interpreted by me; this also means I may have poorly vibe-coded RalPhD. 

  5. This realisation might have been due to something Jeff said in a podcast or presentation; but searchability is a problem with inaccurately transcribed oral culture. 

  6. Software engineers (SWEs) employ the word call instead of use; I am no SWE so I will use the words interchangeably as, to me, they don’t lose much meaning. 

  7. Note that both agents can also write checkpoint files, like any subagent in this RalPhD setup, if their context windows are reaching the pre-defined threshold value. 

  8. A practical example to understand this idea is that when you use Grok on Twitter; it does not use tools to build you stuff (yet). In contrast, Claude and ChatGPT can be invoked in Cursor, Claude Code, or Codex to build you cool things. This is my understanding so you can shoot the messenger if he’s wrong (i.e., shoot me, not Huntley). 

  9. The more time I spend with Claude Code, the more I sense that the current model of academic research funding will be shaken up, as will what counts as meaningful outputs. I am unsure exactly how this will happen, but it feels inevitable that an AI-augmented researcher can operate with 100x-1000x less funding/resources in many scenarios to produce the same quality/quantity of outputs in a more compressed time-span. Also, much of what I have been proposing is possibly just not ambitious enough. 


