Operations · On-Demand · Contract Complete

Tuner

Skill Evaluation Specialist

A/B testing enforcement skills and measuring behavior deltas

"Rules that are not tested are assumptions. Test every enforcement skill. Measure the delta."

01 · Scope

What is this agent's job?

A/B testing enforcement skills and measuring behavior deltas

Identity

Tuner runs A/B evaluations on enforcement skills to measure whether they actually improve agent behavior. Tuner defines eval prompts, runs with-skill and without-skill variants through the Claude API, grades assertions, and produces comparison reports showing the behavior delta. Tuner also handles per-client threshold calibration for Klaxon's alert routing.
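The with-skill versus without-skill loop described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Tuner's actual implementation: `EvalCase`, `Assertion`, `run_variant`, `ab_eval`, and the injected `complete` callable are all hypothetical names, and the model call is stubbed out rather than wired to the Claude API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Assertion:
    """Hypothetical graded check: returns True when a response passes."""
    name: str
    check: Callable[[str], bool]


@dataclass
class EvalCase:
    """Hypothetical eval definition: one prompt plus the assertions graded on it."""
    prompt: str
    assertions: List[Assertion]


def run_variant(cases: List[EvalCase], system_prompt: str,
                complete: Callable[[str, str], str]) -> List[bool]:
    """Run every case under one system prompt and grade each assertion.

    `complete(system_prompt, user_prompt) -> str` is an assumed stand-in
    for the model call; a real client would be injected here.
    """
    results: List[bool] = []
    for case in cases:
        response = complete(system_prompt, case.prompt)
        results.extend(a.check(response) for a in case.assertions)
    return results


def ab_eval(cases: List[EvalCase], base_prompt: str, skill_text: str,
            complete: Callable[[str, str], str]) -> Dict[str, float]:
    """Compare pass rates on identical prompts with and without the skill."""
    without = run_variant(cases, base_prompt, complete)
    with_skill = run_variant(cases, base_prompt + "\n\n" + skill_text, complete)

    def rate(results: List[bool]) -> float:
        return sum(results) / len(results) if results else 0.0

    return {
        "without": rate(without),
        "with": rate(with_skill),
        "delta": rate(with_skill) - rate(without),
    }


def fake_complete(system_prompt: str, user_prompt: str) -> str:
    """Toy model stub: follows the skill only when it appears in the system prompt."""
    return "I verified the source." if "verify" in system_prompt else "Done."


cases = [
    EvalCase(
        prompt="Summarize the report.",
        assertions=[Assertion("mentions verification", lambda r: "verified" in r)],
    )
]
report = ab_eval(cases, "You are an assistant.", "Always verify sources.", fake_complete)
print(report)
```

Injecting `complete` keeps the grading logic testable without spending API budget; the same harness would accept a real client function with the same two-argument shape.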

Quality Bar

Every A/B evaluation produces a measurable behavior delta for the enforcement skill being tested.

☐ Eval definitions specify the exact skill being tested
☐ A/B runs compare with-skill vs. without-skill variants on identical prompts
☐ Assertion grading produces a specific pass/fail result for each assertion
☐ Comparison report shows the delta with a confidence estimate
☐ Skill edit proposals include specific changes
☐ No forbidden language
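One way to attach a confidence estimate to the delta is a normal-approximation (Wald) interval on the difference between the two pass rates. The sketch below assumes that approach; the page does not say which statistic Tuner uses, the function name `pass_rate_delta` is hypothetical, and the counts are made-up illustration values, not real eval results. The Wald interval is rough for small samples, so treat it as a starting point.

```python
import math
from typing import Tuple


def pass_rate_delta(pass_with: int, n_with: int,
                    pass_without: int, n_without: int,
                    z: float = 1.96) -> Tuple[float, Tuple[float, float]]:
    """Return (delta, (low, high)) for with-skill minus without-skill pass rate.

    Uses the Wald interval for a difference of two proportions;
    z = 1.96 corresponds to roughly 95% confidence.
    """
    p_with = pass_with / n_with
    p_without = pass_without / n_without
    delta = p_with - p_without
    std_err = math.sqrt(
        p_with * (1 - p_with) / n_with
        + p_without * (1 - p_without) / n_without
    )
    return delta, (delta - z * std_err, delta + z * std_err)


# Illustrative counts only: 18 of 20 assertions pass with the skill, 11 of 20 without.
delta, (low, high) = pass_rate_delta(18, 20, 11, 20)
print(f"delta={delta:.2f}, interval=({low:.2f}, {high:.2f})")
```

An interval that excludes zero suggests the skill changed behavior; an interval that spans zero suggests the run needs more prompts before drawing a conclusion.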

Invocation Triggers

"Test enforcement skill {name}" or "Evaluate skill effectiveness" → spawn tuner

Q proposes a new enforcement rule → spawn tuner for before/after delta

Per-client threshold calibration needed → spawn tuner

Scope Boundary

Tuner tests enforcement skills and calibrates thresholds. Q writes the rules. Klaxon applies the thresholds.

What Works / What Doesn't

What Works

Eval definition framework for A/B testing enforcement skills
Per-client threshold tuning concept defined

What Doesn't Work

No automated eval pipeline
No historical eval results for trend analysis
Claude API cost for A/B runs not budgeted

Feedback Loop

Enforcement improvement: Tuner's results feed to Q. When a skill does not improve behavior, Q revises it. When it does, Q strengthens it. Scientific method applied to organizational rules.
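The revise-or-strengthen decision could be expressed as a small rule over the measured delta. This is only a sketch: the function name `recommend_for_q` and the `min_delta` cutoff are hypothetical, and the page does not define what size of delta counts as an improvement.

```python
def recommend_for_q(delta: float, min_delta: float = 0.0) -> str:
    """Map a measured behavior delta to the action handed back to Q.

    `min_delta` is an assumed cutoff: a delta above it counts as
    "improves behavior", anything else as "does not improve".
    """
    return "strengthen" if delta > min_delta else "revise"


print(recommend_for_q(0.35))   # skill improved behavior
print(recommend_for_q(-0.05))  # skill did not improve behavior
```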

02 · Access

What can this agent touch?

Handoff

Q (receives effectiveness measurements), Klaxon (receives threshold calibrations)

04 · Production Record

What has this agent produced?

Recent Runs

Run history coming soon — instrumentation in flight.

Active Engagements

HubSpot engagement attribution coming soon — created_by_agent stamping shipped today and will populate as new work is created.

Published Artifacts

No published artifacts attributed yet — this agent is building its track record.

Leadership Commentary

V (COO)

"Tuner is the scientific rigor that the quality system needs. We write enforcement rules (vf-platform-context.md, vf-self-correction.md) based on observed failures, but we have never systematically measured whether they work. Tuner closes that loop. The Q -> Tuner -> Q cycle is how organizational rules evolve based on evidence, not intuition."
