AgentJudge

The AgentJudge evaluates and critiques outputs from other AI agents, providing structured feedback on quality, accuracy, and areas for improvement. It supports both single-shot evaluation and iterative refinement across multiple evaluation loops with context building. It is based on the research paper "Agent-as-a-Judge: Evaluate Agents with Agents" (Zhuge et al., 2024; see the Reference section below).

| Capability | Description |
|------------|-------------|
| Quality Assessment | Evaluates correctness, clarity, and completeness of agent outputs |
| Structured Feedback | Provides detailed critiques with strengths, weaknesses, and suggestions |
| Multimodal Support | Can evaluate text outputs alongside images |
| Context Building | Maintains evaluation context across multiple iterations |
| Custom Evaluation Criteria | Supports weighted evaluation criteria for domain-specific assessments |
| Batch Processing | Efficiently processes multiple evaluations |

Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| id | str | uuid4() | Unique identifier for the judge instance |
| agent_name | str | "Agent Judge" | Name of the agent judge |
| description | str | "You're an expert AI agent judge..." | Description of the agent's role |
| system_prompt | str | None | Custom system instructions (uses the default prompt if None) |
| model_name | str | "openai/o1" | LLM model used for evaluation |
| max_loops | int | 1 | Maximum number of evaluation iterations |
| verbose | bool | False | Enable verbose logging |
| evaluation_criteria | Optional[Dict[str, float]] | None | Dictionary mapping evaluation criteria to weights |
| return_score | bool | False | Whether to return a numerical score instead of the full conversation |
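
Putting these parameters together, a minimal configuration sketch (every keyword argument comes from the table above; the criteria names and weights are illustrative, not required values):

from swarms.agents.agent_judge import AgentJudge

judge = AgentJudge(
    agent_name="doc-judge",
    description="Evaluates documentation answers for accuracy and clarity.",
    model_name="openai/o1",  # the default model listed above
    max_loops=3,             # up to three evaluation passes
    verbose=True,            # log each evaluation step
    # Illustrative criterion names; weights express relative importance.
    evaluation_criteria={"accuracy": 0.6, "clarity": 0.4},
    return_score=False,      # return the full conversation, not an int
)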

Methods

step()

Processes a single task and returns the agent’s evaluation.
step(task: str, img: Optional[str] = None) -> str
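
A usage sketch (the task string and image path are illustrative; the expected format of img is not specified in the docs, so treating it as a local path is an assumption):

from swarms.agents.agent_judge import AgentJudge

judge = AgentJudge(agent_name="single-shot-judge")

# One evaluation pass; no iteration or context building.
critique = judge.step(task="The mitochondria is the powerhouse of the cell.")
print(critique)

# Multimodal variant (illustrative; assumes img accepts a local path or URL):
# critique = judge.step(task="Assess this chart summary.", img="chart.png")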

run()

Runs the evaluation for up to max_loops iterations, feeding each iteration's output back in as context for the next.

run(task: str, img: Optional[str] = None) -> Union[str, int]

Returns the full conversation as a str when return_score=False (the default), or a numerical score as an int when return_score=True.
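A short sketch of both return modes (the task text is illustrative):

from swarms.agents.agent_judge import AgentJudge

# Default mode: the full evaluation conversation comes back as a string.
judge = AgentJudge(max_loops=2)
conversation = judge.run(task="2 + 2 = 4")

# Scoring mode: the same call returns an int instead.
scoring_judge = AgentJudge(max_loops=2, return_score=True)
score = scoring_judge.run(task="2 + 2 = 4")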

run_batched()

Evaluates a list of tasks and returns one result per task.

run_batched(tasks: List[str]) -> List[Union[str, int]]

Each element is a str or an int, following the judge's return_score setting.
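
For example, a scored batch (a sketch; it assumes results come back in the same order as the input list):

from swarms.agents.agent_judge import AgentJudge

judge = AgentJudge(return_score=True)

# One int score per task, in input order (assumed).
scores = judge.run_batched(tasks=[
    "Water boils at 100 degrees Celsius at sea level.",
    "The moon is made of cheese.",
])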

Examples

Basic Evaluation

from swarms.agents.agent_judge import AgentJudge

judge = AgentJudge(
    agent_name="quality-judge",
    model_name="claude-sonnet-4-6",
    max_loops=2,  # two evaluation passes with context building
)

# Output from another agent that the judge will critique.
agent_output = "The capital of France is Paris. The city is known for its famous Eiffel Tower."

# Full evaluation conversation (return_score defaults to False).
evaluations = judge.run(task=agent_output)

Custom Evaluation Criteria

from swarms.agents.agent_judge import AgentJudge

judge = AgentJudge(
    agent_name="technical-judge",
    model_name="claude-sonnet-4-6",
    max_loops=1,
    evaluation_criteria={
        # Relative weights per criterion (here they sum to 1.0).
        "accuracy": 0.4,
        "completeness": 0.3,
        "clarity": 0.2,
        "logic": 0.1,
    },
)

technical_output = "To solve x^2 + 5x + 6 = 0, we use the quadratic formula..."
evaluations = judge.run(task=technical_output)

Scoring Mode

from swarms.agents.agent_judge import AgentJudge

judge = AgentJudge(
    agent_name="scoring-judge",
    model_name="claude-sonnet-4-6",
    max_loops=2,
    return_score=True,  # run() returns an int instead of the conversation
)

score = judge.run(task="This is a correct and well-explained answer.")

Batch Processing

from swarms.agents.agent_judge import AgentJudge

judge = AgentJudge()  # all defaults: model "openai/o1", max_loops=1

tasks = [
    "The capital of France is Paris.",
    "2 + 2 = 4",
    "The Earth is flat."
]

evaluations = judge.run_batched(tasks=tasks)
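
To pair each task with its evaluation (a usage sketch; it assumes results are returned in input order):

for task, evaluation in zip(tasks, evaluations):
    print(f"Task: {task}")
    print(f"Evaluation: {evaluation}")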

Reference

@misc{zhuge2024agentasajudgeevaluateagentsagents,
    title={Agent-as-a-Judge: Evaluate Agents with Agents},
    author={Mingchen Zhuge and Changsheng Zhao and Dylan Ashley and Wenyi Wang and Dmitrii Khizbullin and Yunyang Xiong and Zechun Liu and Ernie Chang and Raghuraman Krishnamoorthi and Yuandong Tian and Yangyang Shi and Vikas Chandra and Jürgen Schmidhuber},
    year={2024},
    eprint={2410.10934},
    archivePrefix={arXiv},
    primaryClass={cs.AI},
}