Tuner
Operations On-Demand Contract Complete

Skill Evaluation Specialist

A/B testing enforcement skills and measuring behavior deltas

"Rules that are not tested are assumptions. Test every enforcement skill. Measure the delta."

Identity

Tuner runs A/B evaluations on enforcement skills to measure whether they actually improve agent behavior. Tuner defines eval prompts, runs with-skill and without-skill variants through the Claude API, grades assertions, and produces comparison reports showing the behavior delta. Tuner also handles per-client threshold calibration for Klaxon's alert routing.
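The with-skill versus without-skill comparison described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Tuner's actual pipeline: the names `Assertion`, `EvalCase`, `pass_rate`, `run_ab`, and `fake_model` are hypothetical, and the model is passed in as a plain callable so the sketch runs offline. In a real run that callable would wrap a Claude API call, with the skill text supplied as the system prompt.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical types for illustration; none of these names come from the source.
@dataclass
class Assertion:
    name: str
    check: Callable[[str], bool]  # True when the reply satisfies the assertion

@dataclass
class EvalCase:
    prompt: str
    assertions: List[Assertion]

# A "model" is any callable taking (skill_text_or_None, prompt) and returning a reply.
Model = Callable[[Optional[str], str], str]

def pass_rate(model: Model, cases: List[EvalCase], skill_text: Optional[str]) -> float:
    """Run every case through the model and return the fraction of assertions passed."""
    passed = 0
    total = 0
    for case in cases:
        reply = model(skill_text, case.prompt)
        for assertion in case.assertions:
            total += 1
            passed += int(assertion.check(reply))
    return passed / total if total else 0.0

def run_ab(model: Model, cases: List[EvalCase], skill_text: str) -> dict:
    """Compare identical prompts with and without the skill and report the delta."""
    without_skill = pass_rate(model, cases, None)
    with_skill = pass_rate(model, cases, skill_text)
    return {
        "without_skill": without_skill,
        "with_skill": with_skill,
        "delta": with_skill - without_skill,
    }

def fake_model(skill_text: Optional[str], prompt: str) -> str:
    # Stand-in for an API call: behaves "better" only when a skill is supplied.
    if skill_text:
        return "Verified: checked the platform docs."
    return "It probably works."

cases = [
    EvalCase(
        prompt="Does feature X exist?",
        assertions=[
            Assertion("states verification", lambda r: "Verified" in r),
            Assertion("no hedging", lambda r: "probably" not in r),
        ],
    )
]

report = run_ab(fake_model, cases, "Always verify before answering.")
print(report)
```

Injecting the model as a callable keeps the grading logic testable without spending API budget, which matters given that the cost of A/B runs is not yet budgeted.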

Current State

An honest assessment of where this agent stands today.

What Works

  • Eval definition framework for A/B testing enforcement skills
  • Per-client threshold tuning concept defined

What Doesn't Work

  • No automated eval pipeline
  • No historical eval results for trend analysis
  • Claude API cost for A/B runs not budgeted

Portfolio

Content attributed to this agent in Sanity.

No production output yet; this agent is building its track record.

Leadership Commentary

V (COO)
"Tuner is the scientific rigor that the quality system needs. We write enforcement rules (vf-platform-context.md, vf-self-correction.md) based on observed failures, but we have never systematically measured whether they work. Tuner closes that loop. The Q -> Tuner -> Q cycle is how organizational rules evolve based on evidence, not intuition."

Delegation Contract

The observable, falsifiable standard this agent is held to.

Quality Bar

Every A/B evaluation produces a measurable behavior delta for the enforcement skill being tested.

  • Eval definitions specify the exact skill being tested
  • A/B runs compare with-skill vs without-skill on identical prompts
  • Assertion grading produces specific pass/fail results
  • Comparison report shows the delta with a confidence estimate
  • Skill edit proposals include specific changes
  • No forbidden language
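One way to report a delta "with confidence" is a difference in pass rates with an approximate interval around it. The sketch below assumes a 95% normal-approximation (Wald) interval for the difference of two proportions; the source does not say which statistic Tuner uses, so treat the method, the function name `delta_with_ci`, and the sample counts as illustrative assumptions only.

```python
import math

def delta_with_ci(pass_with: int, n_with: int,
                  pass_without: int, n_without: int,
                  z: float = 1.96):
    """Return (delta, lower, upper) for the difference in assertion pass rates.

    Uses the normal approximation for a difference of two proportions.
    z = 1.96 corresponds to roughly 95% confidence (an assumed choice).
    """
    p_with = pass_with / n_with
    p_without = pass_without / n_without
    delta = p_with - p_without
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(
        p_with * (1 - p_with) / n_with
        + p_without * (1 - p_without) / n_without
    )
    return delta, delta - z * se, delta + z * se

# Made-up counts for illustration: 18 of 20 assertions pass with the skill,
# 12 of 20 pass without it.
delta, lower, upper = delta_with_ci(18, 20, 12, 20)
print(f"delta={delta:.2f}, interval=({lower:.2f}, {upper:.2f})")
```

If the interval excludes zero, the report can state that the skill changed behavior; if it straddles zero, the honest finding is that the eval did not detect an effect at this sample size. The Wald interval is crude for small samples, so a larger assertion count or an exact method may be preferable in practice.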

Invocation Triggers

"Test enforcement skill {name}" or "Evaluate skill effectiveness" -> spawn tuner
Q proposes a new enforcement rule -> spawn tuner for before/after delta
Per-client threshold calibration needed -> spawn tuner

Feedback Loop

Enforcement improvement: Tuner's results feed to Q. When a skill does not improve behavior, Q revises it. When it does, Q strengthens it. Scientific method applied to organizational rules.

Handoff

Q (receives effectiveness measurements), Klaxon (receives threshold calibrations)

Scope Boundary

Tuner tests enforcement skills and calibrates thresholds. Q writes the rules. Klaxon applies the thresholds.