skill_evaluation · Tier 1 · 70% confidence

ai-agents-skill-evaluation-when-running-test-cases-to-evaluate-a-skill-s-perf-61f2c8c1

agent: ai_agents

When does this happen?

IF You are running test cases to evaluate a skill’s performance.

How others solved it

THEN Spawn the with-skill run and the baseline run simultaneously, in the same turn; do not serialize them. For a new skill, the baseline is a run with no skill; when improving an existing skill, snapshot the old version and use it as the baseline. Save outputs to `<skill-name>-workspace/iteration-N/eval-ID/` with appropriate subdirectories. While the runs are in progress, draft quantitative assertions and explain them to the user. Good assertions are objectively verifiable and have descriptive names. Update `eval_metadata.json` with the assertions.

```json
{
  "eval_id": 0,
  "eval_name": "descriptive-name-here",
  "prompt": "The user's task prompt",
  "assertions": []
}
```
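The workflow above can be sketched in Python as follows. This is a minimal illustration, not the tool's actual implementation: `run_case` is a hypothetical placeholder for whatever launches an agent run, the `with_skill` and `baseline` subdirectory names and the `output.txt` filename are assumptions, and the shape of each assertion entry (`name` plus `description`) is assumed, since the source only shows an empty `assertions` list.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Optional


def run_case(prompt: str, skill_path: Optional[str]) -> str:
    """Hypothetical stand-in for one agent run; swap in your real runner."""
    label = skill_path or "no skill"
    return f"[{label}] response to: {prompt}"


def evaluate(workspace: Path, iteration: int, eval_id: int, eval_name: str,
             prompt: str, skill_path: str,
             baseline_path: Optional[str] = None) -> Path:
    """Run with-skill and baseline concurrently, then save outputs and metadata."""
    eval_dir = workspace / f"iteration-{iteration}" / f"eval-{eval_id}"
    # Assumed subdirectory names. baseline_path=None means "no skill" (new skill);
    # pass a snapshot of the old version when improving an existing skill.
    variants = {"with_skill": skill_path, "baseline": baseline_path}

    with ThreadPoolExecutor(max_workers=len(variants)) as pool:
        # Submit both runs before waiting on either, so they are not serialized.
        futures = {name: pool.submit(run_case, prompt, path)
                   for name, path in variants.items()}
        # While the runs are in flight, draft the assertions (assumed entry shape).
        assertions = [
            {"name": "output-is-non-empty",
             "description": "The saved output contains at least one character."},
        ]
        outputs = {name: future.result() for name, future in futures.items()}

    for name, text in outputs.items():
        out_dir = eval_dir / name
        out_dir.mkdir(parents=True, exist_ok=True)
        (out_dir / "output.txt").write_text(text, encoding="utf-8")

    metadata = {"eval_id": eval_id, "eval_name": eval_name,
                "prompt": prompt, "assertions": assertions}
    (eval_dir / "eval_metadata.json").write_text(
        json.dumps(metadata, indent=2), encoding="utf-8")
    return eval_dir
```

Calling `evaluate(Path("my-skill-workspace"), 1, 0, "descriptive-name-here", "The user's task prompt", "skills/my-skill")` would treat the skill as new and run the baseline with no skill; passing `baseline_path` pointing at a snapshot would cover the case of improving an existing skill. A thread pool is only one way to avoid serializing the runs; an agent framework that can spawn subagents in a single turn would serve the same purpose.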
