## Documentation Index

Fetch the complete documentation index at https://docs.swarms.world/llms.txt and use it to discover all available pages before exploring further.
## Overview

The PlannerGeneratorEvaluator is a domain-agnostic, three-agent orchestration harness inspired by the GAN-style architecture described in Anthropic's harness design research. From a short natural-language prompt, it coordinates long-running autonomous tasks, using an iterative generate-evaluate feedback loop to converge on high-quality output in any domain.
All three agents communicate through a single shared file on disk.
The harness follows this workflow:
- Planning: Planner expands a short prompt into an ambitious plan with steps and evaluation criteria
- Contract Negotiation: Generator proposes what “done” looks like for each step; Evaluator reviews
- Execution: Generator produces concrete output and self-assesses before handoff
- Evaluation: Evaluator scores output per-criterion with hard thresholds — any criterion below its threshold fails the step
- Feedback Loop: On failure, Generator receives scores + trajectory signal (refine or pivot) and retries
- All state on disk: The shared state file is the single append-only record of the entire run
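The control flow above can be mirrored in a short, purely illustrative Python sketch. Everything here is hypothetical: the real harness delegates `generate` and `evaluate` to LLM agents and persists state to disk, while this sketch only captures the retry loop and hard-threshold check.

```python
# Illustrative sketch of the plan -> generate -> evaluate -> retry control flow.
# All names are hypothetical; the real harness drives LLM agents, not callables.

def run_harness(plan, generate, evaluate, max_retries_per_step=3):
    """Run each plan step through a generate/evaluate loop with retries."""
    results = []
    for step in plan:
        history = []  # per-criterion (score, threshold) dicts across retries
        output = None
        for attempt in range(max_retries_per_step + 1):
            output = generate(step, history)   # Generator produces output
            scores = evaluate(step, output)    # Evaluator scores per criterion
            history.append(scores)
            # Hard thresholds: every criterion must meet its minimum score
            if all(s >= t for s, t in scores.values()):
                break
        results.append((step, output, history))
    return results
```

The `history` list is what makes the refine-or-pivot signal possible: each retry's scores are kept so the trend across attempts can be inspected.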
## Installation
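Assuming the harness ships with the `swarms` Python package (an assumption; confirm against the documentation index above), installation is the standard pip command:

```shell
# Assumes the harness is distributed via the swarms package on PyPI.
pip install -U swarms
```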
## Key Features
| Feature | Description |
|---|---|
| GAN-Style Separation | Distinct Generator and Evaluator agents prevent self-evaluation bias |
| Step Contracts | Generator and Evaluator agree on success criteria before execution |
| Hard Threshold Enforcement | Any single criterion below its threshold fails the step |
| Score Trajectory | Tracks score trends across retries — signals Generator to refine or pivot |
| Self-Assessment | Generator self-evaluates before Evaluator handoff |
| Shared State File | Single append-only .md file for all inter-agent communication |
| Domain-Agnostic | Planner defines evaluation criteria tailored to the task domain |
| Custom Agents | Pass pre-configured agents with tools, MCP, or any Agent settings |
| Configurable Thresholds | Default thresholds plus Planner-defined per-criterion thresholds |
## Attributes

- `model_name`: Model identifier for all three agents
- Override model for the Planner
- Override model for the Generator
- Override model for the Evaluator
- `max_steps`: Upper bound on plan steps to execute
- `max_retries_per_step`: Max evaluation failures before advancing
- Directory where output is produced
- Path for the shared state file (auto-generated if `None`)
- Fallback score thresholds by criterion name
- `output_type`: Format for output (`dict`, `str`, `list`, `final`, `json`, `yaml`)
- Enable verbose logging
- Pre-configured `Agent` for planning
- Pre-configured `Agent` for generation (e.g., with file/code tools)
- Pre-configured `Agent` for evaluation (e.g., with Playwright MCP)
| Exception | Condition |
|---|---|
| `ValueError` | If `max_steps` < 1, `max_retries_per_step` < 0, or `model_name` is empty |
## Methods
### run()

Execute the full PGE harness pipeline from a short prompt to completed output.

**Arguments:**

- `task` (str): A short natural-language description of the desired task
Returns output formatted according to `output_type`.
After `run()` completes, access `harness.last_result` for structured metadata:

| Field | Type | Description |
|---|---|---|
| `output_path` | str | Path to the shared state file |
| `plan` | str | The generated plan text |
| `step_logs` | List[Dict] | Per-step metadata (contract, scores, retries) |
| `total_duration` | float | Wall-clock time in seconds |
| `total_steps_completed` | int | Number of steps that passed evaluation |
| `total_retries` | int | Total retry attempts across all steps |
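For illustration, here is a fabricated `last_result` shaped like the table above, with a small summary computed from it. The values and the dict-style access are assumptions for the sketch, not real harness output:

```python
# Illustrative only: a fabricated last_result matching the documented fields.
last_result = {
    "output_path": "runs/state.md",
    "plan": "1. Draft outline\n2. Write sections\n3. Review",
    "step_logs": [
        {"step": 1, "scores": {"accuracy": 9}, "retries": 0},
        {"step": 2, "scores": {"accuracy": 8}, "retries": 2},
    ],
    "total_duration": 42.5,
    "total_steps_completed": 2,
    "total_retries": 2,
}

# Summarize which steps needed retries from the per-step metadata.
heavy = [log["step"] for log in last_result["step_logs"] if log["retries"] > 0]
print(f"{last_result['total_steps_completed']} steps completed, retries on: {heavy}")
```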
### batched_run()

Run the harness on multiple tasks sequentially.

**Arguments:**

- `tasks` (List[str]): List of task prompts to process
## Usage Examples
### Basic Usage
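A minimal sketch of a run. The import path, constructor argument names, and model identifier below are assumptions inferred from this page's attribute and method descriptions; confirm them against the documentation index before use:

```python
# Hypothetical import path; the argument names below mirror this page's
# attribute descriptions and may differ in the actual package.
from swarms import PlannerGeneratorEvaluator

harness = PlannerGeneratorEvaluator(
    model_name="gpt-4.1",        # model identifier for all three agents
    max_steps=5,                 # upper bound on plan steps to execute
    max_retries_per_step=3,      # max evaluation failures before advancing
)

# run() drives the full plan -> generate -> evaluate pipeline.
output = harness.run(task="Write a technical brief comparing vector databases")

# Structured metadata about the run is available afterwards.
print(harness.last_result)
```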
### Custom Agents with Tools
Pass pre-configured agents with tools so the Generator can write files and the Evaluator can verify them on disk.

### Evaluator with Playwright MCP (Web App Testing)
For web application development, give the Evaluator browser automation via Playwright MCP so it can test the running app like a real user.

### Custom Thresholds
Provide default score thresholds that apply when the Planner doesn't define them.

## Architecture Details
### Shared State File
All inter-agent communication flows through a single append-only markdown file. Each section is timestamped and labeled.

### Refine vs. Pivot
When a step fails evaluation, the harness computes a score trajectory across retries:

- Scores improving: REFINE. Keep the current direction and fix specific issues.
- Scores declining or stagnant: PIVOT. Take a fundamentally different approach.
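One way to compute such a signal is to average each retry's per-criterion scores and compare the trend. This is an illustrative reimplementation, not the harness's actual code:

```python
def trajectory_signal(score_history):
    """Return 'REFINE' if average scores improved on the latest retry,
    'PIVOT' if they declined or stagnated. Illustrative logic only."""
    if len(score_history) < 2:
        return "REFINE"  # nothing to compare yet; keep the current direction
    averages = [sum(scores.values()) / len(scores) for scores in score_history]
    return "REFINE" if averages[-1] > averages[-2] else "PIVOT"

# Scores rose from 5.5 to 7.0 on average, so the signal is to refine.
print(trajectory_signal([{"accuracy": 5, "clarity": 6},
                         {"accuracy": 7, "clarity": 7}]))  # REFINE
```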
### Evaluation Criteria
The Planner defines domain-appropriate criteria as part of the plan. Each criterion has:

| Field | Description |
|---|---|
| Name | Short label (e.g., “accuracy”, “clarity”) |
| Weight | Relative importance (high, standard, low) |
| Description | What it measures and what good/bad looks like |
| Threshold | Minimum passing score (1-10). Any criterion below threshold = step fails |
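The hard-threshold rule can be expressed compactly. This is a sketch with illustrative names; `default_threshold` stands in for the harness's fallback thresholds:

```python
def step_passes(scores, thresholds, default_threshold=7):
    """A step passes only if every criterion meets its threshold.
    `thresholds` holds Planner-defined minimums per criterion name;
    criteria without one fall back to `default_threshold`."""
    return all(
        score >= thresholds.get(name, default_threshold)
        for name, score in scores.items()
    )

# One low criterion fails the whole step, regardless of the others.
print(step_passes({"accuracy": 9, "clarity": 4}, {"clarity": 6}))  # False
```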