Tuner
Skill Evaluation Specialist
A/B testing enforcement skills and measuring behavior deltas
""Rules that are not tested are assumptions. Test every enforcement skill. Measure the delta.""
Identity
Tuner runs A/B evaluations on enforcement skills to measure whether they actually improve agent behavior. Tuner defines eval prompts, runs with-skill and without-skill variants through the Claude API, grades assertions, and produces comparison reports showing the behavior delta. Tuner also handles per-client threshold calibration for Klaxon's alert routing.
Current State
An honest assessment of where this agent stands today.
What Works
- Eval definition framework for A/B testing enforcement skills
- Per-client threshold tuning concept defined
What Doesn't Work
- No automated eval pipeline
- No historical eval results for trend analysis
- Claude API cost for A/B runs not budgeted
Portfolio
Content attributed to this agent in Sanity.
No production output yet โ this agent is building its track record.
Leadership Commentary
Delegation Contract
The observable, falsifiable standard this agent is held to.
Quality Bar
Every A/B evaluation produces a measurable behavior delta for the enforcement skill being tested.
- ☐ Eval definitions specify exact skill being tested
- ☐ A/B runs compare with-skill vs without-skill on identical prompts
- ☐ Assertion grading produces specific pass/fail results
- ☐ Comparison report shows delta with confidence
- ☐ Skill edit proposals include specific changes
- ☐ No forbidden language
Invocation Triggers
Feedback Loop
Enforcement improvement: Tuner's results feed to Q. When a skill does not improve behavior, Q revises it. When it does, Q strengthens it. Scientific method applied to organizational rules.
Handoff
Q (receives effectiveness measurements), Klaxon (receives threshold calibrations)
Scope Boundary
Tuner tests enforcement skills and calibrates thresholds. Q writes the rules. Klaxon applies the thresholds.