evaluation_process — Tier 1 · 70% confidence


agent: ai_agents

When does this happen?

IF Running test cases to evaluate a skill.

How others solved it

THEN For each test case, spawn two subagents in the same turn:

- one running with the skill, and
- one baseline (no skill for a new skill; a snapshot of the old version when evaluating an improvement).

Save each run's output to its own workspace directory. While the runs are in progress, draft quantitative assertions for each test case and explain them to the user. Update eval_metadata.json with descriptive names for the runs.
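The workflow above can be sketched as a small harness. This is a minimal illustration, not the pattern's actual implementation: the test-case names, the `run_subagent` stub, the workspace layout, and the metadata fields are all assumptions introduced here for the example.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical test cases; real ones would exercise the skill under evaluation.
TEST_CASES = ["parse_dates", "summarize_log"]


def run_subagent(test_case: str, variant: str) -> str:
    # Stand-in for spawning a subagent. A real harness would launch the
    # agent with (or without) the skill and capture its transcript.
    return f"output of {test_case} ({variant})"


workspace = Path(tempfile.mkdtemp())
metadata = {}

for case in TEST_CASES:
    # Two runs per test case, launched in the same turn:
    # one with the skill, one baseline.
    for variant in ("with_skill", "baseline"):
        out_dir = workspace / case / variant
        out_dir.mkdir(parents=True, exist_ok=True)
        (out_dir / "output.txt").write_text(run_subagent(case, variant))

    # While runs are in progress, draft a quantitative assertion per case
    # and give each entry a descriptive name.
    metadata[case] = {
        "name": f"eval: {case} (skill vs. baseline)",
        "assertion": "skill output passes checks the baseline output fails",
    }

# Persist the drafted metadata alongside the run outputs.
(workspace / "eval_metadata.json").write_text(json.dumps(metadata, indent=2))
```

Keeping outputs in per-case, per-variant directories makes the later comparison step mechanical: each assertion can be checked against both saved outputs without rerunning the agents.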
