Primitive 07 / Eval

Eval runner card

Run an eval suite against a workflow. Rubric axes carry weights, samples map to per-axis scores, and the weighted overall drives the headline gauge. Mufflermen rubric: accuracy · tone · safety · cost.

Production answer

Eval runner card is a reusable Oak Flats Muffler Men UI primitive with documented states, accessibility expectations, theme behavior, and implementation evidence.

Primary CTA: Review Eval runner card states
Generative search brief

Eval runner card: Run an eval suite against a workflow. Rubric axes carry weights, samples map to per-axis scores, and the weighted overall drives the headline gauge. Mufflermen rubric: accuracy · tone · safety · cost.

State A · quote estimator suite · 6 samples
Eval runner

Quote estimator · v3.2

Weighted overall score 87/100
Pass threshold · 80/100 · 4/6 samples cleared the bar · last run Today 09:14
Rubric axes
  • Accuracy · weight × 0.40 · score 81
  • Tone · weight × 0.20 · score 85
  • Safety · weight × 0.25 · score 99
  • Cost · weight × 0.15 · score 87
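As a rough check on how the headline gauge could follow from the rubric axes, here is a minimal Python sketch that reproduces the State A figure. The `Axis` dataclass, the `weighted_overall` helper, the whole-percent weight encoding, and the round-half-up rule are all assumptions made for illustration; the card does not document its actual rounding or data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Axis:
    """Hypothetical shape for one rubric axis (not the card's real model)."""
    name: str
    weight_pct: int  # weight as a whole percentage, e.g. 40 for x 0.40
    score: int       # axis score on a 0-100 scale


def weighted_overall(axes: list[Axis]) -> int:
    """Return the weighted overall on a 0-100 scale, rounded half up."""
    total_weight = sum(axis.weight_pct for axis in axes)
    if total_weight != 100:
        raise ValueError(f"weights must sum to 100%, got {total_weight}%")
    weighted_sum = sum(axis.weight_pct * axis.score for axis in axes)
    # Integer arithmetic avoids float drift; adding 50 rounds half up.
    return (weighted_sum + 50) // 100


# Axis weights and scores as listed for State A.
STATE_A = [
    Axis("Accuracy", 40, 81),
    Axis("Tone", 20, 85),
    Axis("Safety", 25, 99),
    Axis("Cost", 15, 87),
]

print(weighted_overall(STATE_A))  # prints 87
```

Storing weights as whole percentages keeps the sum-to-100 check exact, which is harder to guarantee with values such as 0.15 held as floats.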
Scoreboard
Sample inputs scored across rubric axes.
Sample | Accuracy | Tone | Safety | Cost | Overall
Hilux N80 cat-back · long-range | 96 | 92 | 100 | 78 | 94
Commodore SS quote follow-up | 82 | 88 | 100 | 88 | 89
Manta DPF warranty rattle | 64 | 72 | 92 | 84 | 76
Falcon BA mid-pipe stock | 90 | 84 | 100 | 92 | 92
Saturday booking confirmation | 98 | 96 | 100 | 96 | 98
Engineered exhaust ADR cert | 58 | 76 | 100 | 82 | 76
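The State A scoreboard can be cross-checked against the figures the card reports. The sketch below is an illustration under stated assumptions: per-sample overalls use the State A axis weights, the axis scores shown in the rubric list are per-column means, values round half up, and a sample clears the bar when its overall is greater than or equal to the threshold. None of these rules is stated by the card, and every name here is hypothetical.

```python
# State A axis weights as whole percentages, in column order.
WEIGHTS_PCT = {"accuracy": 40, "tone": 20, "safety": 25, "cost": 15}
PASS_THRESHOLD = 80

# (sample, accuracy, tone, safety, cost) as shown in the State A scoreboard.
SAMPLES = [
    ("Hilux N80 cat-back · long-range", 96, 92, 100, 78),
    ("Commodore SS quote follow-up", 82, 88, 100, 88),
    ("Manta DPF warranty rattle", 64, 72, 92, 84),
    ("Falcon BA mid-pipe stock", 90, 84, 100, 92),
    ("Saturday booking confirmation", 98, 96, 100, 96),
    ("Engineered exhaust ADR cert", 58, 76, 100, 82),
]


def round_half_up(numerator: int, denominator: int) -> int:
    """Integer division that rounds .5 upward (assumed rounding rule)."""
    return (2 * numerator + denominator) // (2 * denominator)


def sample_overall(scores: tuple) -> int:
    """Weighted overall for one row of axis scores."""
    weighted = sum(w * s for w, s in zip(WEIGHTS_PCT.values(), scores))
    return round_half_up(weighted, 100)


overalls = [sample_overall(row[1:]) for row in SAMPLES]

# Assumption: meeting the threshold exactly counts as clearing the bar.
cleared = sum(overall >= PASS_THRESHOLD for overall in overalls)

# Per-axis mean across samples, one value per scoreboard column.
columns = zip(*(row[1:] for row in SAMPLES))
axis_means = [round_half_up(sum(col), len(SAMPLES)) for col in columns]

print(overalls)    # prints [94, 89, 76, 92, 98, 76]
print(cleared)     # prints 4
print(axis_means)  # prints [81, 85, 99, 87]
```

Under these assumptions the computed values match the Overall column, the 4/6 pass count, and the axis scores in the rubric list.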
State B · SMS triage suite · 3 samples · safety-weighted
Eval runner

SMS triage · v1.0

Weighted overall score 89/100
Pass threshold · 75/100 · 3/3 samples cleared the bar · last run 07:48
Rubric axes
  • Intent accuracy · weight × 0.35 · score 79
  • Aussie register · weight × 0.15 · score 83
  • Safety · weight × 0.40 · score 100
  • Cost · weight × 0.10 · score 92
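Worked through by hand, the State B headline is consistent with a plain weighted sum of the listed axis scores. Rounding to the nearest whole number is an assumption, since the card does not state its rounding rule.

```latex
\[
\begin{aligned}
\text{overall}_B &= 0.35 \cdot 79 + 0.15 \cdot 83 + 0.40 \cdot 100 + 0.10 \cdot 92 \\
                 &= 27.65 + 12.45 + 40.00 + 9.20 \\
                 &= 89.30 \approx 89
\end{aligned}
\]
```

With safety weighted at 0.40, that one axis contributes 40 of the 89.3 points, which is what makes this suite safety-weighted.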
Scoreboard
Sample inputs scored across rubric axes.
Sample | Intent accuracy | Aussie register | Safety | Cost | Overall
Quote intent · N80 fitment | 92 | 88 | 100 | 95 | 95
Booking intent · Saturday late | 86 | 90 | 100 | 94 | 93
Ambiguous · ECU tuning ask | 58 | 72 | 100 | 88 | 80
State C · early build · single failing sample
Eval runner

Blog draft · v0.3 baseline

Weighted overall score 66/100
Pass threshold · 70/100 · 0/1 samples cleared the bar
Rubric axes
  • Accuracy · weight × 0.40 · score 48
  • Tone · weight × 0.20 · score 56
  • Safety · weight × 0.25 · score 92
  • Cost · weight × 0.15 · score 84
Scoreboard
Sample inputs scored across rubric axes.
Sample | Accuracy | Tone | Safety | Cost | Overall
DPF cleaning vs replacement | 48 | 56 | 92 | 84 | 66
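For the single State C sample, the same weighted sum explains the failing gauge. The second line is an illustrative what-if that assumes the other three axis scores stay fixed and that meeting the threshold exactly counts as a pass; neither assumption comes from the card.

```latex
\[
\begin{aligned}
\text{overall}_C &= 0.40 \cdot 48 + 0.20 \cdot 56 + 0.25 \cdot 92 + 0.15 \cdot 84 \\
                 &= 19.2 + 11.2 + 23.0 + 12.6 = 66.0 < 70 \\
0.40\,a + 46.8 \ge 70 &\iff a \ge 58
\end{aligned}
\]
```

Here $a$ is a hypothetical accuracy score: under those assumptions, lifting accuracy from 48 to 58 would be enough on its own to reach the 70/100 threshold.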