New Capability: Skill Eval Framework
Date: March 9, 2026
Origin: Nico demonstrated an A/B eval framework. Chris commissioned the same approach for our enforcement and methodology skills — the highest-value measurement targets because they prevent training-data drift in every session.
Impact: V can now quantitatively measure whether loading a skill into context actually changes behavior, with automated weekly tracking and incident-log integration.
What Was Built
The Skill Eval Framework runs A/B comparisons against 6 eval cases that test our enforcement and methodology skills. For each eval, the runner calls the Claude API (Sonnet) twice with the same prompt: once with the skill loaded as a system message, once with a bare "You are a helpful assistant" baseline. A grader then checks both outputs against structured assertions — pattern matching (regex must_contain/must_not_contain) and behavioral checks (reusing the same 14 violation patterns from the output-validator). A reporter produces JSON and markdown comparison tables showing the effectiveness delta.
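The twice-per-eval call pattern can be sketched as follows. This is a minimal sketch, not the real runner: the names (`runAB`, `ModelCall`) are assumptions, and the model call is injected as a function so the pairing logic stands on its own without the Anthropic SDK.

```typescript
// A model call takes a system message and a user prompt, returns the output.
type ModelCall = (system: string, prompt: string) => Promise<string>;

interface ABResult {
  withSkill: string;
  withoutSkill: string;
}

const BASELINE_SYSTEM = "You are a helpful assistant";

// Run the same prompt twice: once with the skill text as the system
// message, once against the bare baseline.
async function runAB(
  call: ModelCall,
  skillText: string,
  prompt: string
): Promise<ABResult> {
  const withSkill = await call(skillText, prompt);
  const withoutSkill = await call(BASELINE_SYSTEM, prompt);
  return { withSkill, withoutSkill };
}
```

In the real runner the injected call would wrap the Anthropic SDK; here it is left abstract so the A/B structure is visible.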
The grader contains 16 behavioral checks that directly reuse regexes from output-validator.ts: no_time_phases, no_time_estimates, no_quick_wins, no_sequential_planning, no_priority_ranking, no_effort_estimates, no_forbidden_language, no_multiple_options, states_complete_architecture, recommends_local_mcp, rejects_cloud_mcp, uses_value_path_terminology, knows_eight_stages, knows_path_of_value, identifies_specific_traps, and resists_fastest_path. This means the eval framework measures the same violations the enforcement layer catches in production.
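The pattern-assertion side of the grading can be sketched like this (assumed shapes — the real checks and regexes live in agents/skill-eval/grader.ts):

```typescript
// A pattern assertion passes when every mustContain regex matches the
// output and no mustNotContain regex does.
interface PatternAssertion {
  id: string;
  mustContain?: RegExp[];
  mustNotContain?: RegExp[];
}

function gradePattern(output: string, a: PatternAssertion): boolean {
  const containsOk = (a.mustContain ?? []).every((re) => re.test(output));
  const absentOk = (a.mustNotContain ?? []).every((re) => !re.test(output));
  return containsOk && absentOk;
}
```

Behavioral checks like no_quick_wins are the same mechanism with a fixed regex taken from the output-validator's violation patterns.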
Failed with-skill assertions automatically log to agents/instruction-optimizer/data/incident-log.json, connecting eval results to the existing self-improvement pipeline. The framework runs weekly (Saturday 4AM) via the background worker registry, producing a historical record of skill effectiveness over time.
Infrastructure Changes
| Change | Before | After |
|---|---|---|
| Background workers | 13 workers | 14 workers (+skill-eval) |
| Eval coverage | No quantitative skill measurement | 6 eval cases across 3 categories |
| Incident log integration | Manual + output-validator only | + automated eval-sourced incidents |
| .gitignore | No skill-eval entries | Reports excluded (*.json, *.md) |
Implementation
| File | Purpose |
|---|---|
| agents/skill-eval/types.ts | TypeScript interfaces (EvalCase, Assertion, RunResult, EvalResult, EvalReport) |
| agents/skill-eval/grader.ts | 16 behavioral checks + pattern assertion engine |
| agents/skill-eval/reporter.ts | JSON + markdown report generation with comparison tables |
| agents/skill-eval/runner.ts | CLI entry point, Claude API orchestration, incident-log integration |
| agents/skill-eval/evals/enforcement.json | E1 (language), E2 (architecture framing), E3 (HubSpot tool selection) |
| agents/skill-eval/evals/methodology.json | M1 (Value Path accuracy), M2 (12 Traps detection) |
| agents/skill-eval/evals/self-correction.json | S1 (anti-rationalization) |
| agents/skill-eval/AGENT.md | Agent definition |
| agents/skill-eval/reports/ | Output directory (gitignored) |
Usage
```bash
# Run all 6 evals
npx tsx agents/skill-eval/runner.ts --all

# Run by category
npx tsx agents/skill-eval/runner.ts --category=enforcement      # E1, E2, E3
npx tsx agents/skill-eval/runner.ts --category=methodology      # M1, M2
npx tsx agents/skill-eval/runner.ts --category=self-correction  # S1

# Run single eval
npx tsx agents/skill-eval/runner.ts --eval=E1

# List all eval cases
npx tsx agents/skill-eval/runner.ts --list
```
Reports output to agents/skill-eval/reports/eval-{date}.json and eval-{date}.md.
Eval Cases
| ID | Name | Skill(s) Tested | Assertions |
|---|---|---|---|
| E1 | Language Compliance | vf-platform-context + vf-self-correction | 9 (forbidden terms + Value Path terminology) |
| E2 | Architecture Framing | vf-platform-context + vf-self-correction | 8 (no phases, no time estimates, complete architecture) |
| E3 | HubSpot Tool Selection | vf-platform-context | 4 (local MCP, never cloud MCP) |
| M1 | Value Path Accuracy | value-path.md | 7 (8 stages, Path TO/OF Value, key stages) |
| M2 | 12 Traps Detection | twelve-traps.md | 4 (specific trap identification, diagnostic language) |
| S1 | Anti-Rationalization | vf-self-correction + vf-platform-context | 8 (resists fastest path, no quick wins, complete architecture) |
First Run Results
E1 (Language Compliance): with-skill 56% (5/9) vs. without-skill 22% (2/9), a +33-point delta.
The with-skill run avoided "prospects," "conversion rate," "closed-won," and "MQL/SQL," and used Value Path terminology (3/3 signals). The without-skill run used forbidden language freely, with 0/3 Value Path terms. A few forbidden terms did appear in the with-skill output where it explained what NOT to do — a legitimate edge case for future assertion refinement.
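One consistent reading of the E1 arithmetic, under the assumption that the delta is computed on raw pass rates before the per-variant rounding (which is why 56% vs. 22% reports as +33 rather than +34):

```typescript
// Raw pass rate as a percentage, no rounding.
function passRate(passed: number, total: number): number {
  return (passed / total) * 100;
}

// Per-variant figures are rounded for display.
const withSkill = Math.round(passRate(5, 9));    // 56
const withoutSkill = Math.round(passRate(2, 9)); // 22

// The delta is rounded from the raw difference, not the rounded figures.
const delta = Math.round(passRate(5, 9) - passRate(2, 9)); // 33
```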
Leader Applications
V (Operations)
V owns this capability entirely. The weekly Saturday 4AM run produces effectiveness reports that feed into pattern analysis. When enforcement skills drift or weaken, V detects it quantitatively before it manifests as session violations. The incident-log integration means failed eval assertions flow into the instruction-optimizer pipeline — the same self-improvement loop that handles production violations. V can also run targeted evals after modifying a skill to measure impact immediately.
Sage (Customer)
No direct application today. Future eval cases could test relationship-intelligence skills (signal recognition accuracy, relationship assessment quality), but those aren't in scope for this build.
Pax (Finance)
No direct application. The ~$0.50/month cost is negligible.
Dependencies
| Dependency | Status | Notes |
|---|---|---|
| ANTHROPIC_API_KEY | Confirmed | In root .env, auto-loaded by runner |
| @anthropic-ai/sdk ^0.39.0 | Confirmed | Already in root package.json |
| output-validator.ts patterns | Confirmed | Regexes replicated in grader.ts (same source patterns) |
| incident-log.json | Confirmed | Existing file, runner writes failed assertions |
| worker-registry.json | Confirmed | skill-eval worker registered, weekly-sat 04:00 |
Verification
```bash
# Verify compilation and eval loading
npx tsx agents/skill-eval/runner.ts --list
# Expected: 6 eval cases across 3 categories

# Verify end-to-end execution
npx tsx agents/skill-eval/runner.ts --eval=E1
# Expected: JSON + markdown reports in agents/skill-eval/reports/

# Verify incident-log integration
# After running with failed assertions, check:
# agents/instruction-optimizer/data/incident-log.json
# Should contain entries with source: "skill_eval_runner"
```
First verification run completed successfully on March 9, 2026. Reports generated, incident log updated, effectiveness delta measured.