AgentJudge evaluates and critiques outputs from other AI agents, providing structured feedback on quality, accuracy, and areas for improvement. It supports single-shot evaluations and iterative refinement through multiple evaluation loops with context building.
Based on the research paper "Agent-as-a-Judge: Evaluate Agents with Agents".
| Capability | Description |
|---|---|
| Quality Assessment | Evaluates correctness, clarity, and completeness of agent outputs |
| Structured Feedback | Provides detailed critiques with strengths, weaknesses, and suggestions |
| Multimodal Support | Can evaluate text outputs alongside images |
| Context Building | Maintains evaluation context across multiple iterations |
| Custom Evaluation Criteria | Supports weighted evaluation criteria for domain-specific assessments |
| Batch Processing | Efficiently processes multiple evaluations |
Architecture
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| id | str | uuid4() | Unique identifier for the judge instance |
| agent_name | str | "Agent Judge" | Name of the agent judge |
| description | str | "You're an expert AI agent judge..." | Description of the agent’s role |
| system_prompt | str | None | Custom system instructions (uses default if None) |
| model_name | str | "openai/o1" | LLM model for evaluation |
| max_loops | int | 1 | Maximum evaluation iterations |
| verbose | bool | False | Enable verbose logging |
| evaluation_criteria | Optional[Dict[str, float]] | None | Dictionary of evaluation criteria and weights |
| return_score | bool | False | Whether to return a numerical score instead of full conversation |
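To make the `evaluation_criteria` shape concrete, here is a minimal sketch of how a `Dict[str, float]` of weights could be combined with per-criterion ratings into a single number. The helper `weighted_score`, the criterion names, and the 0-10 rating scale are all assumptions made for illustration; this is not AgentJudge's internal scoring logic, which the table above does not specify.

```python
from typing import Dict


def weighted_score(ratings: Dict[str, float], criteria: Dict[str, float]) -> float:
    """Combine per-criterion ratings into one score using normalized weights."""
    total_weight = sum(criteria.values())
    if total_weight <= 0:
        raise ValueError("criteria weights must sum to a positive value")
    missing = set(criteria) - set(ratings)
    if missing:
        raise KeyError(f"missing ratings for: {sorted(missing)}")
    # Weighted average: dividing by the total lets weights be given on any scale.
    return sum(ratings[name] * weight for name, weight in criteria.items()) / total_weight


# Hypothetical criteria in the Dict[str, float] shape the parameter expects.
evaluation_criteria = {"correctness": 0.5, "clarity": 0.3, "completeness": 0.2}
# Hypothetical per-criterion ratings on an assumed 0-10 scale.
ratings = {"correctness": 8.0, "clarity": 6.0, "completeness": 10.0}

score = weighted_score(ratings, evaluation_criteria)
print(round(score, 2))
```

Normalizing by the sum of the weights means the dictionary does not have to add up to 1.0, which keeps domain-specific criteria easy to adjust.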
Methods
step()
Processes a single task and returns the agent’s evaluation.
run()
Executes evaluation in multiple iterations with context building.
Returns: str (full conversation) if return_score=False, or int (numerical score) if return_score=True.
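The interaction between `max_loops`, context building, and `return_score` can be sketched with a small stand-alone loop. Everything here is hypothetical: `run_judge`, `fake_evaluate`, and `fake_score` are illustrative stand-ins rather than AgentJudge's API, and taking the score from the last iteration is an assumption, not documented behavior.

```python
from typing import Callable, List, Union


def run_judge(
    task: str,
    evaluate: Callable[[str], str],
    score_of: Callable[[str], int],
    max_loops: int = 1,
    return_score: bool = False,
) -> Union[str, int]:
    """Evaluate `task` up to `max_loops` times, feeding prior critiques back in."""
    if max_loops < 1:
        raise ValueError("max_loops must be at least 1")
    history: List[str] = []
    for _ in range(max_loops):
        # Context building: each iteration sees the task plus earlier critiques.
        prompt = "\n".join([task, *history])
        history.append(evaluate(prompt))
    if return_score:
        # Assumption: the score is derived from the final critique only.
        return score_of(history[-1])
    return "\n".join(history)


def fake_evaluate(prompt: str) -> str:
    # Stand-in evaluator; a real judge would call an LLM here.
    return f"critique #{prompt.count('critique') + 1}"


def fake_score(critique: str) -> int:
    # Stand-in scorer that reads the number embedded in the fake critique.
    return int(critique.split("#")[1])


conversation = run_judge("Review this output.", fake_evaluate, fake_score, max_loops=3)
final_score = run_judge(
    "Review this output.", fake_evaluate, fake_score, max_loops=3, return_score=True
)
```

With `return_score=False` the caller gets the accumulated critiques as one string; with `return_score=True` the same loop yields an integer instead, mirroring the two return types described above.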