Loop design in Genkit Python · Agent Internals

Lance Martin published a guide this week on designing loops with Fable 5. Strip away the product names and one idea is doing the real work: don’t ask a model to check its own work. Put a separate verifier agent in the loop. The worker produces, an independent grader checks the result against a fixed standard, and the worker revises until the grader is satisfied. That is most of what you need, and it maps cleanly onto Genkit Python. (Memory across sessions is the other technique in the guide; more on that at the end.)

Why a separate agent instead of self-review? A model grading its own output reasons the same way it did when it wrote it, so it reproduces its own blind spots and tends to bless the work it already convinced itself was good. An independent verifier breaks that. It opens in a fresh context, sees only the artifact and a standard it did not author, and has no narrative to defend. Martin found a verifier sub-agent reliably beats self-critique for exactly this reason, and Claude Managed Agents ships it as Outcomes: you say what “done” looks like, and the platform spawns a stateless grader that re-checks the work every turn until each criterion passes. The Genkit version is a second prompt you own.

Concretely, the loop is: call the worker, hand its output to the verifier, read the verdict. If it passes, you are done. If it fails, append the feedback to the conversation and let the worker revise, not restart, then go again. You do not pick a magic number of tries; you loop until the verifier passes, with a cap as a safety valve so an impossible standard cannot spin forever.

The verifier needs that external standard fixed before the loop runs, and two things stay separate: the objective (what to build, which the worker reads) and the rubric (the checkable criteria, which only the grader reads). The separation is the whole point; a worker that can see the rubric is back to grading itself. (This is the distinction Martin draws between a /goal and a rubric, and the two are easy to conflate.) In Genkit both are just data you pass in.

The two prompts live in .prompt files. The YAML front matter holds the model and the input/output schemas, written in Picoschema (a compact schema notation), and the body is a Handlebars template, so conditionals and loops stay in the prompt file instead of being baked into code. The worker prompt renders the objective on the first pass and a revision instruction after that:

---
model: googleai/gemini-pro-latest
input:
  schema:
    task: string
    feedback?: string
---
{{#if feedback}}
That attempt did not pass. Reviewer feedback: {{feedback}}
Revise your previous answer to satisfy every criterion.
{{else}}
{{task}}
{{/if}}

The verifier prompt is the independent grader. Its front matter pins the output to the Verdict schema, so the model is constrained to return a structured verdict, not prose:

---
model: googleai/gemini-pro-latest
input:
  schema:
    criteria(array): string
    output: string
output:
  schema: Verdict
  format: json
---
Grade this output against each criterion. Set passed=false if any criterion is unmet.

Criteria:
{{#each criteria}}
- {{this}}
{{/each}}

Output:
{{output}}

The flow loads both by name and wires them together:

from pathlib import Path

from genkit import Genkit
from genkit_google_genai import GoogleAI
from pydantic import BaseModel

ai = Genkit(
    plugins=[GoogleAI()],
    model='googleai/gemini-pro-latest',
    prompt_dir=Path(__file__).parent / 'prompts',
)

class Verdict(BaseModel):
    passed: bool
    feedback: str

ai.define_schema('Verdict', Verdict)  # lets the .prompt reference it by name

class Rubric(BaseModel):
    criteria: list[str]

class TaskInput(BaseModel):
    task: str               # the objective the worker sees
    rubric: Rubric          # the criteria only the verifier sees
    max_iterations: int = 10  # safety valve, not the target

@ai.flow()
async def correction_loop(input: TaskInput) -> str:
    # One conversation that grows across attempts. The worker sees every prior
    # draft and every critique, so it revises instead of restarting.
    messages = []
    feedback = ""
    output = ""

    for _ in range(input.max_iterations):
        # First pass renders {{task}}; later passes render the revision branch.
        # `messages` carries every prior draft, so the worker revises in place.
        response = await ai.prompt('worker')(
            input={'task': input.task, 'feedback': feedback},
            messages=messages,
        )
        messages = response.messages
        output = response.text

        # Fresh call: the verifier sees only the artifact and the rubric,
        # never the worker's conversation. That is its independence.
        result = await ai.prompt('verifier')(
            input={'criteria': input.rubric.criteria, 'output': output},
        )
        verdict = Verdict.model_validate(result.output)

        if verdict.passed:
            return output
        feedback = verdict.feedback

    raise RuntimeError(
        f"did not converge in {input.max_iterations} iterations; last output:\n{output}"
    )

Notice what carries the loop. The verifier is a separate prompt that gets only the output and the rubric, with no access to the worker’s conversation, which is what gives it an independent read instead of the worker’s self-serving one. The rubric is passed in by the caller, not inferred from the task text, so “done” is fixed before the loop starts. And the worker prompt is called with messages, not a freshly concatenated string: each pass appends the worker’s attempt and the verifier’s critique to one conversation, so the worker is editing its own prior draft with full context rather than re-deriving it from a single feedback line. That is the difference between “try again” and “revise.”

Two design choices worth being explicit about. First, the loop runs until the verifier passes; max_iterations is a circuit breaker, not a quota. Set it generously. Second, exhausting it is a real failure, so this raises instead of quietly returning a draft that never passed. A silent best-effort return is how a broken pipeline looks healthy. If your verdict carried a numeric score instead of a bool, you could keep the best attempt and surface it with its score; with a pass/fail verdict there is no “best,” only “passed” or “did not.”
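To make the scored variant concrete, here is a minimal sketch that does not depend on Genkit. Everything in it is an assumption for illustration: ScoredVerdict, DidNotConverge, scored_loop, the 0.0 to 1.0 score scale, and the threshold are hypothetical names and values, and worker and grader are plain synchronous callables standing in for the two prompts. It still raises when the cap is hit, but the exception carries the best attempt and its score so the caller can decide what to do with it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ScoredVerdict:
    score: float   # hypothetical 0.0-1.0 grade from the verifier
    feedback: str


class DidNotConverge(RuntimeError):
    """Raised when the cap is hit; carries the best attempt and its score."""

    def __init__(self, best_output: str, best_score: float, iterations: int):
        super().__init__(
            f"did not converge in {iterations} iterations; "
            f"best score {best_score:.2f}"
        )
        self.best_output = best_output
        self.best_score = best_score


def scored_loop(
    worker: Callable[[str], str],            # feedback in ("" on first pass), draft out
    grader: Callable[[str], ScoredVerdict],  # draft in, scored verdict out
    threshold: float = 0.9,                  # assumed pass mark
    max_iterations: int = 10,                # circuit breaker, not a quota
) -> str:
    best_output, best_score = "", float("-inf")
    feedback = ""
    for _ in range(max_iterations):
        output = worker(feedback)
        verdict = grader(output)
        # Track the highest-scoring draft seen so far.
        if verdict.score > best_score:
            best_output, best_score = output, verdict.score
        if verdict.score >= threshold:
            return output
        feedback = verdict.feedback
    # Exhaustion is still a failure; the best draft is surfaced, not returned.
    raise DidNotConverge(best_output, best_score, max_iterations)
```

A caller that wants best-effort behaviour can catch DidNotConverge and read best_output and best_score explicitly, which keeps the "never passed" status visible instead of hiding it in a normal return value.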

If you are deciding what to put in criteria, the useful test is: would a human reviewer flag a violation of this rule without needing to interpret it? Concrete conditions work (“the function handles an empty list”, “the response is under 150 words”). Vague ones do not (“the output is high quality”, “the code is clean”). Vague criteria produce lenient verdicts, which defeats the loop.
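One way to sanity-check a criterion is to ask whether you could write a mechanical check for it. The sketch below is an illustration of that test, not part of the loop above: the list names and the two helper functions are hypothetical, and the model-based verifier still does the grading. The concrete criteria translate into a few lines of code; the vague ones have no obvious translation, which is the warning sign.

```python
from typing import Callable

# Concrete criteria: a reviewer can flag a violation without interpretation.
CONCRETE_CRITERIA = [
    "The response is under 150 words.",
    "The function handles an empty list.",
]

# Vague criteria: these invite lenient verdicts.
VAGUE_CRITERIA = [
    "The output is high quality.",
    "The code is clean.",
]


def under_word_limit(text: str, limit: int = 150) -> bool:
    """Mechanical check for 'the response is under <limit> words'."""
    return len(text.split()) < limit


def handles_empty_list(fn: Callable[[list], object]) -> bool:
    """Mechanical check for 'the function handles an empty list'."""
    try:
        fn([])
    except Exception:
        return False
    return True
```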

The second loop in Martin’s guide operates at a different timescale. The self-correction loop runs within a single task. The memory loop runs across tasks, accumulating facts in one session and making them available in the next. It is an outer loop: each run of the inner loop can generate learnings, those learnings get written to persistent storage, and future runs read them before starting.

Genkit Python has no built-in memory primitive. You can build one with structured output and a JSON file.

import json
from pathlib import Path

MEMORY_FILE = Path("agent_memory.json")

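def load_memory() -> list[str]:
    # No file yet means no prior sessions: start with empty memory.
    if MEMORY_FILE.exists():
        return json.loads(MEMORY_FILE.read_text())
    return []

def save_memory(facts: list[str]) -> None:
    # Overwrite the file with the full, updated list of facts.
    MEMORY_FILE.write_text(json.dumps(facts, indent=2))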

class MemoryUpdate(BaseModel):
    learned_facts: list[str]

@ai.flow()
async def memory_loop(input: TaskInput) -> str:
    memory = load_memory()
    ctx = "\n".join(f"- {f}" for f in memory) if memory else "No prior knowledge."

    output = (await ai.generate(
        prompt=f"Prior learnings:\n{ctx}\n\nTask: {input.task}"
    )).text

    new_facts = (await ai.generate(
        prompt=(
            f"Task: {input.task}\n"
            f"Output: {output}\n\n"
            f"What rules should be remembered for future similar tasks?"
        ),
        output_format='json',
        output_schema=MemoryUpdate,
    )).output.learned_facts

    save_memory(memory + new_facts)
    return output

The distillation call at the end matters. You are not appending the full task and output to memory. You are asking the model to extract transferable rules from the experience. Martin describes this as a five-rung ladder: fail, investigate, verify, distill, consult. The distillation call here handles verify and distill together. What you get back is rules, not logs. Future runs consult the rules rather than rederiving them from scratch.
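One thing the flow above does not handle is repetition: memory + new_facts will store the same rule again every time the model re-derives it, and the file grows without bound. A minimal sketch of a merge step follows. The merge_facts helper, its normalization rule (case and whitespace), and the limit of 50 are all assumptions for illustration, not something from Martin’s guide or Genkit.

```python
def merge_facts(existing: list[str], new: list[str], limit: int = 50) -> list[str]:
    """Append new facts, skipping duplicates that differ only in case or spacing.

    Keeps only the most recent `limit` facts so the prompt context stays bounded.
    """
    seen: set[str] = set()
    merged: list[str] = []
    for fact in existing + new:
        # Normalize for comparison only; the stored text keeps its casing.
        key = " ".join(fact.lower().split())
        if key and key not in seen:
            seen.add(key)
            merged.append(fact.strip())
    return merged[-limit:] if limit > 0 else []
```

With this in place, the last step of the flow would become save_memory(merge_facts(memory, new_facts)) instead of save_memory(memory + new_facts). Exact-match deduplication will not catch paraphrases; that would need a model call or embeddings, which this sketch leaves out.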

If you are deciding whether to combine these two flows, the answer depends on whether your tasks are independent or sequential. If each run is a fresh task with no relationship to prior runs, memory adds noise. If your agent is working through a domain repeatedly (debugging the same codebase, drafting in the same domain, operating the same pipeline), memory compounds. The correction loop is always worth having. The memory loop earns its cost only when the domain recurs.
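If you do combine them, the shape is the outer loop wrapping the inner one. The sketch below shows only that composition, with plain callables standing in for the Genkit pieces: inner_loop is a stand-in for correction_loop, distill is a stand-in for the distillation call, and run_with_memory is a hypothetical name. It is synchronous for brevity, whereas the real flows are async.

```python
from typing import Callable


def run_with_memory(
    task: str,
    inner_loop: Callable[[str], str],          # stand-in for correction_loop
    distill: Callable[[str, str], list[str]],  # stand-in for the distillation call
    memory: list[str],
) -> tuple[str, list[str]]:
    """Run one verified task with prior learnings and return updated memory."""
    ctx = "\n".join(f"- {f}" for f in memory) if memory else "No prior knowledge."
    # The worker sees prior learnings alongside the task; the rubric stays
    # with the verifier inside inner_loop.
    output = inner_loop(f"Prior learnings:\n{ctx}\n\nTask: {task}")
    # Only reached if inner_loop returned, i.e. the verifier passed. If the
    # inner loop raises, nothing is distilled from the failed run.
    return output, memory + distill(task, output)
```

One consequence of this ordering is that memory is only written from outputs that passed verification, so a run that never converged cannot leave rules behind.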

Both of these are just flows. A flow is an async function with @ai.flow(). A loop is a for loop inside it. The framework gives you generate() and structured output via Pydantic. The rest is Python.

What Genkit Python does not give you is a first-class Rubric type or a verifier primitive (the managed-grader convenience that Outcomes provides). You define it yourself. That is probably the right call for now: the pattern is clear enough that a shared primitive would be more opinionated than useful. If you find yourself writing the same verifier scaffolding across many flows, that is the signal that an abstraction is worth building. One flow is not that signal.

A note on model selection: every prompt above pins googleai/gemini-pro-latest in its front matter. If you want a faster, cheaper worker, change the model: line in worker.prompt and leave the verifier on pro, since the verifier is the call that has to reason carefully against the criteria. The worker is also the call that re-sends the whole growing conversation each retry, so the cheaper model does double duty there.

That growing conversation is also why a loop like this eventually needs compaction. Every iteration re-sends every prior draft and critique, so a loop that runs ten passes on a hard task is carrying nine stale attempts the model no longer needs verbatim. On a long-running loop you would clip the old drafts the same way a coding agent clips old file bodies, which is the subject of a separate post.
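As a rough illustration of what clipping could look like, here is a minimal sketch. It assumes messages are plain dicts with 'role' and 'content' keys, which is an illustrative shape and not Genkit’s message type, and the compact function, the keep_last parameter, and the placeholder text are hypothetical. It keeps the first message (the task) and the most recent messages verbatim, replaces older model drafts with a short placeholder, and leaves the critiques in place.

```python
def compact(messages: list[dict], keep_last: int = 4) -> list[dict]:
    """Clip stale drafts while keeping the task, critiques, and recent turns."""
    if len(messages) <= keep_last + 1:
        return list(messages)  # nothing old enough to clip

    cutoff = len(messages) - keep_last
    compacted = [messages[0]]  # the original task stays verbatim
    for msg in messages[1:cutoff]:
        if msg["role"] == "model":
            # Old draft: the worker no longer needs it word for word.
            compacted.append({"role": "model", "content": "[earlier draft clipped]"})
        else:
            # Critiques are short and still informative, so keep them.
            compacted.append(msg)
    compacted.extend(messages[cutoff:])  # recent turns stay verbatim
    return compacted
```

In the correction loop this would run on the conversation before each worker call, once the history is longer than some threshold you choose.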

Martin’s full guide is here.

To run the samples against PyPI genkit==0.8.1:

uv add genkit genkit-google-genai