New Capability: Skill Eval Framework
Date: March 9, 2026
Origin: Nico demonstrated an A/B eval framework. Chris commissioned the same approach for our enforcement and methodology skills — the highest-value measurement targets because they prevent training-data drift in every session.
Impact: V can now quantitatively measure whether loading a skill into context actually changes behavior, with automated weekly tracking and incident-log integration.
What Was Built
The Skill Eval Framework runs A/B comparisons against 6 eval cases that test our enforcement and methodology skills. For each eval, the runner calls the Claude API (Sonnet) twice with the same prompt: once with the skill loaded as a system message, once with a bare "You are a helpful assistant" baseline. A grader then checks both outputs against structured assertions — pattern matching (regex must_contain/must_not_contain) and behavioral checks (reusing the same 14 violation patterns from the output-validator). A reporter produces JSON and markdown comparison tables showing the effectiveness delta.
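The twice-per-eval call pattern can be sketched as follows. This is a minimal sketch, not the real runner: the names (`runAB`, `ModelCall`) are assumptions, and the model call is injected as a function so the pairing logic stands on its own without the Anthropic SDK.

```typescript
// A model call takes a system message and a user prompt, returns the output.
type ModelCall = (system: string, prompt: string) => Promise<string>;

interface ABResult {
  withSkill: string;
  withoutSkill: string;
}

const BASELINE_SYSTEM = "You are a helpful assistant";

// Run the same prompt twice: once with the skill text as the system
// message, once against the bare baseline.
async function runAB(
  call: ModelCall,
  skillText: string,
  prompt: string
): Promise<ABResult> {
  const withSkill = await call(skillText, prompt);
  const withoutSkill = await call(BASELINE_SYSTEM, prompt);
  return { withSkill, withoutSkill };
}
```

In the real runner the injected call would wrap the Anthropic SDK; here it is left abstract so the A/B structure is visible.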
The grader contains 16 behavioral checks that directly reuse regexes from output-validator.ts: no_time_phases, no_time_estimates, no_quick_wins, no_sequential_planning, no_priority_ranking, no_effort_estimates, no_forbidden_language, no_multiple_options, states_complete_architecture, recommends_local_mcp, rejects_cloud_mcp, uses_value_path_terminology, knows_eight_stages, knows_path_of_value, identifies_specific_traps, and resists_fastest_path. This means the eval framework measures the same violations the enforcement layer catches in production.
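The pattern-assertion side of the grading can be sketched like this (assumed shapes — the real checks and regexes live in agents/skill-eval/grader.ts):

```typescript
// A pattern assertion passes when every mustContain regex matches the
// output and no mustNotContain regex does.
interface PatternAssertion {
  id: string;
  mustContain?: RegExp[];
  mustNotContain?: RegExp[];
}

function gradePattern(output: string, a: PatternAssertion): boolean {
  const containsOk = (a.mustContain ?? []).every((re) => re.test(output));
  const absentOk = (a.mustNotContain ?? []).every((re) => !re.test(output));
  return containsOk && absentOk;
}
```

Behavioral checks like no_quick_wins are the same mechanism with a fixed regex taken from the output-validator's violation patterns.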
Failed with-skill assertions automatically log to agents/instruction-optimizer/data/incident-log.json, connecting eval results to the existing self-improvement pipeline. The framework runs weekly (Saturday 4AM) via the background worker registry, producing a historical record of skill effectiveness over time.
Infrastructure Changes
| Change | Before | After |
|---|---|---|
| Background workers | 13 workers | 14 workers (+skill-eval) |
| Eval coverage | No quantitative skill measurement | 6 eval cases across 3 categories |
| Incident log integration | Manual + output-validator only | + automated eval-sourced incidents |
| .gitignore | No skill-eval entries | Reports excluded (*.json, *.md) |
Implementation
| File | Purpose |
|---|---|
| agents/skill-eval/types.ts | TypeScript interfaces (EvalCase, Assertion, RunResult, EvalResult, EvalReport) |
| agents/skill-eval/grader.ts | 16 behavioral checks + pattern assertion engine |
| agents/skill-eval/reporter.ts | JSON + markdown report generation with comparison tables |
| agents/skill-eval/runner.ts | CLI entry point, Claude API orchestration, incident-log integration |
| agents/skill-eval/evals/enforcement.json | E1 (language), E2 (architecture framing), E3 (HubSpot tool selection) |
| agents/skill-eval/evals/methodology.json | M1 (Value Path accuracy), M2 (12 Traps detection) |
| agents/skill-eval/evals/self-correction.json | S1 (anti-rationalization) |
| agents/skill-eval/AGENT.md | Agent definition |
| agents/skill-eval/reports/ | Output directory (gitignored) |
Usage
```bash
# Run all 6 evals
npx tsx agents/skill-eval/runner.ts --all

# Run by category
npx tsx agents/skill-eval/runner.ts --category=enforcement      # E1, E2, E3
npx tsx agents/skill-eval/runner.ts --category=methodology      # M1, M2
npx tsx agents/skill-eval/runner.ts --category=self-correction  # S1

# Run single eval
npx tsx agents/skill-eval/runner.ts --eval=E1

# List all eval cases
npx tsx agents/skill-eval/runner.ts --list
```
Reports output to agents/skill-eval/reports/eval-{date}.json and eval-{date}.md.
Eval Cases
| ID | Name | Skill(s) Tested | Assertions |
|---|---|---|---|
| E1 | Language Compliance | vf-platform-context + vf-self-correction | 9 (forbidden terms + Value Path terminology) |
| E2 | Architecture Framing | vf-platform-context + vf-self-correction | 8 (no phases, no time estimates, complete architecture) |
| E3 | HubSpot Tool Selection | vf-platform-context | 4 (local MCP, never cloud MCP) |
| M1 | Value Path Accuracy | value-path.md | 7 (8 stages, Path TO/OF Value, key stages) |
| M2 | 12 Traps Detection | twelve-traps.md | 4 (specific trap identification, diagnostic language) |
| S1 | Anti-Rationalization | vf-self-correction + vf-platform-context | 8 (resists fastest path, no quick wins, complete architecture) |
First Run Results
E1 (Language Compliance): with-skill 56% (5/9) vs. without-skill 22% (2/9), a +33-point delta.
The with-skill run avoided "prospects," "conversion rate," "closed-won," and "MQL/SQL," and used Value Path terminology (3/3 signals). The without-skill run used forbidden language freely, with 0/3 Value Path terms. A few forbidden terms did appear in the with-skill output where it explained what NOT to do — a legitimate edge case for future assertion refinement.
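One consistent reading of the E1 arithmetic, under the assumption that the delta is computed on raw pass rates before the per-variant rounding (which is why 56% vs. 22% reports as +33 rather than +34):

```typescript
// Raw pass rate as a percentage, no rounding.
function passRate(passed: number, total: number): number {
  return (passed / total) * 100;
}

// Per-variant figures are rounded for display.
const withSkill = Math.round(passRate(5, 9));    // 56
const withoutSkill = Math.round(passRate(2, 9)); // 22

// The delta is rounded from the raw difference, not the rounded figures.
const delta = Math.round(passRate(5, 9) - passRate(2, 9)); // 33
```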
Leader Applications
V (Operations)
V owns this capability entirely. The weekly Saturday 4AM run produces effectiveness reports that feed into pattern analysis. When enforcement skills drift or weaken, V detects it quantitatively before it manifests as session violations. The incident-log integration means failed eval assertions flow into the instruction-optimizer pipeline — the same self-improvement loop that handles production violations. V can also run targeted evals after modifying a skill to measure impact immediately.
Sage (Customer)
No direct application today. Future eval cases could test relationship-intelligence skills (signal recognition accuracy, relationship assessment quality), but those aren't in scope for this build.
Pax (Finance)
No direct application. The ~$0.50/month cost is negligible.
Dependencies
| Dependency | Status | Notes |
|---|---|---|
| ANTHROPIC_API_KEY | Confirmed | In root .env, auto-loaded by runner |
| @anthropic-ai/sdk ^0.39.0 | Confirmed | Already in root package.json |
| output-validator.ts patterns | Confirmed | Regexes replicated in grader.ts (same source patterns) |
| incident-log.json | Confirmed | Existing file, runner writes failed assertions |
| worker-registry.json | Confirmed | skill-eval worker registered, weekly-sat 04:00 |
Verification
```bash
# Verify compilation and eval loading
npx tsx agents/skill-eval/runner.ts --list
# Expected: 6 eval cases across 3 categories

# Verify end-to-end execution
npx tsx agents/skill-eval/runner.ts --eval=E1
# Expected: JSON + markdown reports in agents/skill-eval/reports/

# Verify incident-log integration
# After running with failed assertions, check:
# agents/instruction-optimizer/data/incident-log.json
# Should contain entries with source: "skill_eval_runner"
```
First verification run completed successfully on March 9, 2026. Reports generated, incident log updated, effectiveness delta measured.