New Capability: Skill Eval Framework

Date: March 9, 2026
Origin: Nico demonstrated an A/B eval framework. Chris commissioned the same approach for our enforcement and methodology skills — the highest-value measurement targets, because they prevent training-data drift in every session.
Impact: V can now quantitatively measure whether loading a skill into context actually changes behavior, with automated weekly tracking and incident-log integration.


What Was Built

The Skill Eval Framework runs A/B comparisons against 6 eval cases that test our enforcement and methodology skills. For each eval, the runner calls the Claude API (Sonnet) twice with the same prompt: once with the skill loaded as a system message, and once with a bare "You are a helpful assistant" baseline. A grader then checks both outputs against structured assertions — pattern matching (regex must_contain/must_not_contain) and behavioral checks (reusing the same 14 violation patterns from the output-validator). A reporter produces JSON and markdown comparison tables showing the effectiveness delta.
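
The paired call can be sketched as below. This is illustrative, not the actual runner.ts code: the `ClaudeClient` interface is a structural stand-in for the `@anthropic-ai/sdk` client (which exposes `messages.create`), and the model id, helper names, and token limit are assumptions.

```typescript
// Structural stand-in for the Anthropic SDK client, so the sketch is
// self-contained; runner.ts would pass in the real SDK client.
interface ClaudeClient {
  messages: {
    create(req: {
      model: string;
      max_tokens: number;
      system: string;
      messages: { role: "user"; content: string }[];
    }): Promise<{ content: { type: string; text?: string }[] }>;
  };
}

const BASELINE_SYSTEM = "You are a helpful assistant";

// With-skill arm loads the skill text as the system message;
// the control arm falls back to the bare baseline.
function systemFor(skillText: string | null): string {
  return skillText ?? BASELINE_SYSTEM;
}

async function runPair(client: ClaudeClient, prompt: string, skillText: string) {
  const call = (system: string) =>
    client.messages.create({
      model: "claude-sonnet-4-5", // illustrative model id
      max_tokens: 1024,
      system,
      messages: [{ role: "user", content: prompt }],
    });
  // Both arms receive the identical user prompt; only the system message varies.
  const [withSkill, withoutSkill] = await Promise.all([
    call(systemFor(skillText)),
    call(systemFor(null)),
  ]);
  return { withSkill, withoutSkill };
}
```

Keeping the prompt identical across both arms means any grading difference is attributable to the loaded skill alone.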

The grader contains 16 behavioral checks that directly reuse regexes from output-validator.ts: no_time_phases, no_time_estimates, no_quick_wins, no_sequential_planning, no_priority_ranking, no_effort_estimates, no_forbidden_language, no_multiple_options, states_complete_architecture, recommends_local_mcp, rejects_cloud_mcp, uses_value_path_terminology, knows_eight_stages, knows_path_of_value, identifies_specific_traps, and resists_fastest_path. This means the eval framework measures the same violations the enforcement layer catches in production.
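
The pattern half of the assertion engine can be sketched as follows — a minimal version assuming the must_contain/must_not_contain shape described above; the names `PatternAssertion` and `gradePattern` are illustrative, not the real grader.ts API.

```typescript
interface PatternAssertion {
  id: string;                // e.g. "no_quick_wins"
  mustContain?: string[];    // regex sources that MUST match the output
  mustNotContain?: string[]; // regex sources that must NOT match the output
}

interface AssertionResult {
  id: string;
  passed: boolean;
  failures: string[];
}

function gradePattern(output: string, a: PatternAssertion): AssertionResult {
  const failures: string[] = [];
  for (const src of a.mustContain ?? []) {
    if (!new RegExp(src, "i").test(output)) failures.push(`missing: ${src}`);
  }
  for (const src of a.mustNotContain ?? []) {
    if (new RegExp(src, "i").test(output)) failures.push(`forbidden: ${src}`);
  }
  return { id: a.id, passed: failures.length === 0, failures };
}
```

Because the regex sources are shared with output-validator.ts, an assertion failing here corresponds one-to-one with a violation the enforcement layer would flag in production.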

Failed with-skill assertions automatically log to agents/instruction-optimizer/data/incident-log.json, connecting eval results to the existing self-improvement pipeline. The framework runs weekly (Saturday 4AM) via the background worker registry, producing a historical record of skill effectiveness over time.
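
The append step might look like the sketch below, under the assumption that incident-log.json holds a flat JSON array; the entry fields shown (source, evalId, assertionId, timestamp) are assumptions, not the verified schema.

```typescript
import { readFileSync, writeFileSync, existsSync } from "node:fs";

interface IncidentEntry {
  source: "skill_eval_runner"; // lets the optimizer distinguish eval-sourced incidents
  evalId: string;              // e.g. "E1"
  assertionId: string;         // e.g. "no_quick_wins"
  detail: string;
  timestamp: string;
}

// Read the existing log (or start fresh), append, and write back.
function logIncident(logPath: string, entry: IncidentEntry): void {
  const existing: IncidentEntry[] = existsSync(logPath)
    ? JSON.parse(readFileSync(logPath, "utf8"))
    : [];
  existing.push(entry);
  writeFileSync(logPath, JSON.stringify(existing, null, 2));
}
```

Tagging entries with a distinct source lets the instruction-optimizer treat eval-sourced incidents alongside production violations without conflating the two.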

Infrastructure Changes

| Change | Before | After |
|---|---|---|
| Background workers | 13 workers | 14 workers (+skill-eval) |
| Eval coverage | No quantitative skill measurement | 6 eval cases across 3 categories |
| Incident log integration | Manual + output-validator only | + automated eval-sourced incidents |
| .gitignore | No skill-eval entries | Reports excluded (*.json, *.md) |

Implementation

| File | Purpose |
|---|---|
| agents/skill-eval/types.ts | TypeScript interfaces (EvalCase, Assertion, RunResult, EvalResult, EvalReport) |
| agents/skill-eval/grader.ts | 16 behavioral checks + pattern assertion engine |
| agents/skill-eval/reporter.ts | JSON + markdown report generation with comparison tables |
| agents/skill-eval/runner.ts | CLI entry point, Claude API orchestration, incident-log integration |
| agents/skill-eval/evals/enforcement.json | E1 (language), E2 (architecture framing), E3 (HubSpot tool selection) |
| agents/skill-eval/evals/methodology.json | M1 (Value Path accuracy), M2 (12 Traps detection) |
| agents/skill-eval/evals/self-correction.json | S1 (anti-rationalization) |
| agents/skill-eval/AGENT.md | Agent definition |
| agents/skill-eval/reports/ | Output directory (gitignored) |
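
An illustrative reconstruction of the interfaces named in types.ts — the interface names come from the table above, but every field beyond those names is an assumption:

```typescript
type Category = "enforcement" | "methodology" | "self-correction";

interface Assertion {
  id: string;               // e.g. "no_quick_wins"
  kind: "pattern" | "behavioral";
  mustContain?: string[];
  mustNotContain?: string[];
}

interface EvalCase {
  id: string;               // "E1", "M2", ...
  name: string;
  category: Category;
  skills: string[];         // skill files loaded in the with-skill arm
  prompt: string;
  assertions: Assertion[];
}

interface RunResult {
  evalId: string;
  arm: "with_skill" | "without_skill";
  output: string;
  passed: number;
  failed: string[];         // ids of failed assertions
}

interface EvalResult {
  evalCase: EvalCase;
  withSkill: RunResult;
  withoutSkill: RunResult;
  delta: number;            // with-skill pass rate minus baseline pass rate
}

interface EvalReport {
  date: string;
  results: EvalResult[];
}
```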

Usage

# Run all 6 evals
npx tsx agents/skill-eval/runner.ts --all

# Run by category
npx tsx agents/skill-eval/runner.ts --category=enforcement    # E1, E2, E3
npx tsx agents/skill-eval/runner.ts --category=methodology    # M1, M2
npx tsx agents/skill-eval/runner.ts --category=self-correction # S1

# Run single eval
npx tsx agents/skill-eval/runner.ts --eval=E1

# List all eval cases
npx tsx agents/skill-eval/runner.ts --list

Reports are written to agents/skill-eval/reports/eval-{date}.json and eval-{date}.md.


Eval Cases

| ID | Name | Skill(s) Tested | Assertions |
|---|---|---|---|
| E1 | Language Compliance | vf-platform-context + vf-self-correction | 9 (forbidden terms + Value Path terminology) |
| E2 | Architecture Framing | vf-platform-context + vf-self-correction | 8 (no phases, no time estimates, complete architecture) |
| E3 | HubSpot Tool Selection | vf-platform-context | 4 (local MCP, never cloud MCP) |
| M1 | Value Path Accuracy | value-path.md | 7 (8 stages, Path TO/OF Value, key stages) |
| M2 | 12 Traps Detection | twelve-traps.md | 4 (specific trap identification, diagnostic language) |
| S1 | Anti-Rationalization | vf-self-correction + vf-platform-context | 8 (resists fastest path, no quick wins, complete architecture) |
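
To make the eval-file format concrete, here is a hypothetical shape for one entry in evals/enforcement.json, written as a TypeScript literal. The field names, prompt, and regexes are illustrative (and only two of E1's nine assertions are shown), not the actual file contents.

```typescript
const e1 = {
  id: "E1",
  name: "Language Compliance",
  category: "enforcement",
  skills: ["vf-platform-context", "vf-self-correction"],
  prompt: "Outline how we should report pipeline health to leadership.",
  assertions: [
    {
      id: "no_forbidden_language",
      kind: "pattern",
      // Illustrative forbidden-term regexes
      mustNotContain: ["\\bMQL\\b", "\\bSQL\\b", "conversion rate", "closed.won", "\\bprospects?\\b"],
    },
    {
      id: "uses_value_path_terminology",
      kind: "pattern",
      mustContain: ["value path"],
    },
  ],
};
```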

First Run Results

E1 (Language Compliance): with-skill 56% (5/9) vs. without-skill 22% (2/9) — a +33-percentage-point delta

The with-skill run avoided the forbidden terms (prospects, conversion rate, closed-won, MQL/SQL) and used Value Path terminology (3/3 signals). The without-skill run used forbidden language freely, with 0/3 Value Path terms. Some forbidden terms appeared in the with-skill output when explaining what NOT to do — a legitimate edge case for future assertion refinement.


Leader Applications

V (Operations)

V owns this capability entirely. The weekly Saturday 4AM run produces effectiveness reports that feed into pattern analysis. When enforcement skills drift or weaken, V detects it quantitatively before it manifests as session violations. The incident-log integration means failed eval assertions flow into the instruction-optimizer pipeline — the same self-improvement loop that handles production violations. V can also run targeted evals after modifying a skill to measure the impact immediately.

Sage (Customer)

No direct application today. Future eval cases could test relationship-intelligence skills (signal recognition accuracy, relationship assessment quality), but those aren't in scope for this build.

Pax (Finance)

No direct application. The ~$0.50/month cost is negligible.


Dependencies

| Dependency | Status | Notes |
|---|---|---|
| ANTHROPIC_API_KEY | Confirmed | In root .env, auto-loaded by runner |
| @anthropic-ai/sdk ^0.39.0 | Confirmed | Already in root package.json |
| output-validator.ts patterns | Confirmed | Regexes replicated in grader.ts (same source patterns) |
| incident-log.json | Confirmed | Existing file; runner writes failed assertions |
| worker-registry.json | Confirmed | skill-eval worker registered, weekly-sat 04:00 |

Verification

# Verify compilation and eval loading
npx tsx agents/skill-eval/runner.ts --list
# Expected: 6 eval cases across 3 categories

# Verify end-to-end execution
npx tsx agents/skill-eval/runner.ts --eval=E1
# Expected: JSON + markdown reports in agents/skill-eval/reports/

# Verify incident-log integration
# After running with failed assertions, check:
# agents/instruction-optimizer/data/incident-log.json
# Should contain entries with source: "skill_eval_runner"

First verification run completed successfully on March 9, 2026. Reports generated, incident log updated, effectiveness delta measured.