# AgentJudge
AgentJudge evaluates and critiques outputs from other AI agents, providing structured feedback on quality, accuracy, and areas for improvement. It supports single-shot evaluations and iterative refinement through multiple evaluation loops with context building.
Based on the research paper *Agent-as-a-Judge: Evaluate Agents with Agents*.
## Capabilities

| Capability | Description |
|---|---|
| Quality Assessment | Evaluates correctness, clarity, and completeness of agent outputs |
| Structured Feedback | Provides detailed critiques with strengths, weaknesses, and suggestions |
| Multimodal Support | Can evaluate text outputs alongside images |
| Context Building | Maintains evaluation context across multiple iterations |
| Custom Evaluation Criteria | Supports weighted evaluation criteria for domain-specific assessments |
| Batch Processing | Efficiently processes multiple evaluations |
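A minimal usage sketch for a single-shot evaluation. The import path and the assumption that `run()` accepts the task text directly are inferred from the parameters and methods documented below; adjust to your installed swarms version:

```python
# Minimal sketch: single-shot evaluation of another agent's output.
# Assumes AgentJudge is importable from the top-level swarms package and
# that run() accepts the text to evaluate, per the method docs below.
from swarms import AgentJudge

judge = AgentJudge(
    agent_name="code-review-judge",
    model_name="openai/o1",  # default evaluation model per the table below
    max_loops=1,             # single evaluation pass
)

feedback = judge.run(
    "Task: Summarize the quarterly report.\n"
    "Agent output: Revenue grew 12% quarter-over-quarter, driven by..."
)
print(feedback)  # full structured critique as a string
```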
## Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| id | str | uuid4() | Unique identifier for the judge instance |
| agent_name | str | "Agent Judge" | Name of the agent judge |
| description | str | "You're an expert AI agent judge..." | Description of the agent's role |
| system_prompt | str | None | Custom system instructions (uses the default prompt if None) |
| model_name | str | "openai/o1" | LLM model used for evaluation |
| max_loops | int | 1 | Maximum number of evaluation iterations |
| verbose | bool | False | Enable verbose logging |
| evaluation_criteria | Optional[Dict[str, float]] | None | Mapping of evaluation criteria to their weights |
| return_score | bool | False | Return a numerical score instead of the full conversation |
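A sketch of configuring weighted criteria and a numeric score. The criterion names ("accuracy", "clarity", "completeness") are illustrative assumptions; the constructor arguments follow the table above:

```python
# Sketch: domain-specific judge with weighted criteria and a numeric score.
# The criterion names and weights here are illustrative assumptions; weights
# are passed via evaluation_criteria as documented in the table above.
from swarms import AgentJudge

scoring_judge = AgentJudge(
    agent_name="qa-judge",
    evaluation_criteria={
        "accuracy": 0.5,
        "clarity": 0.3,
        "completeness": 0.2,
    },
    return_score=True,  # run() returns an int instead of the conversation
)

score = scoring_judge.run(
    "Agent output: The capital of Australia is Sydney."
)
print(score)  # expect a low accuracy-weighted score for the incorrect claim
```

Because return_score=True yields a plain int, a judge configured this way can be dropped into automated filtering or ranking pipelines without parsing free-form critiques.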
## Methods

### step()

Processes a single task and returns the agent's evaluation.

### run()

Executes the evaluation over multiple iterations with context building. Returns str (the full conversation) if return_score=False, or int (a numerical score) if return_score=True.
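A sketch contrasting the two methods, assuming both accept the task text directly as documented above:

```python
# Sketch: step() for one evaluation pass vs. run() for iterative refinement.
# Assumes both methods accept the task text directly; with max_loops > 1,
# run() carries each critique forward as context for the next iteration.
from swarms import AgentJudge

judge = AgentJudge(agent_name="iterative-judge", max_loops=3)

task = "Agent output: def add(a, b): return a - b"

single_pass = judge.step(task)  # one evaluation, no context building
refined = judge.run(task)       # up to 3 iterations, each building on the last

print(single_pass)
print(refined)
```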