Eval runner card
Run an eval suite against a workflow. Rubric axes carry weights, samples map to per-axis scores, and the weighted overall drives the headline gauge. Mufflermen rubric: accuracy · tone · safety · cost.
Production answer
Eval runner card is a reusable Oak Flats Muffler Men UI primitive with documented states, accessibility expectations, theme behavior, and implementation evidence.
Primary CTA: Review Eval runner card states
Eval runner
Quote estimator · v3.2
Weighted overall score: 87/100
Pass threshold · 80/100 · 4/6 samples cleared the bar · last run Today 09:14
Rubric axes
- Accuracy (weight × 0.40): 81
- Tone (weight × 0.20): 85
- Safety (weight × 0.25): 99
- Cost (weight × 0.15): 87
Scoreboard
Sample inputs scored across rubric axes.
| Sample | Accuracy | Tone | Safety | Cost | Overall |
|---|---|---|---|---|---|
| Hilux N80 cat-back · long-range | 96 | 92 | 100 | 78 | 94 |
| Commodore SS quote follow-up | 82 | 88 | 100 | 88 | 89 |
| Manta DPF warranty rattle | 64 | 72 | 92 | 84 | 76 |
| Falcon BA mid-pipe stock | 90 | 84 | 100 | 92 | 92 |
| Saturday booking confirmation | 98 | 96 | 100 | 96 | 98 |
| Engineered exhaust ADR cert | 58 | 76 | 100 | 82 | 76 |
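The Overall column above is consistent with a weighted sum of each sample's axis scores, rounded half up, and the "4/6 samples cleared the bar" line with counting samples whose overall meets the threshold. The following is a minimal Python sketch under those assumptions, not the card's actual implementation. The names `WEIGHTS_PCT`, `weighted_overall`, and `count_cleared` are hypothetical, and the inclusive `>=` comparison is assumed because no sample sits exactly on the threshold. Weights are held as integer percentages so that exact halves such as 93.5 round predictably.

```python
# Hypothetical sketch reproducing the Quote estimator v3.2 scoreboard.
# Rubric weights from the card, expressed as integer percentages.
WEIGHTS_PCT = {"accuracy": 40, "tone": 20, "safety": 25, "cost": 15}
PASS_THRESHOLD = 80

SAMPLES = {
    "Hilux N80 cat-back · long-range": {"accuracy": 96, "tone": 92, "safety": 100, "cost": 78},
    "Commodore SS quote follow-up": {"accuracy": 82, "tone": 88, "safety": 100, "cost": 88},
    "Manta DPF warranty rattle": {"accuracy": 64, "tone": 72, "safety": 92, "cost": 84},
    "Falcon BA mid-pipe stock": {"accuracy": 90, "tone": 84, "safety": 100, "cost": 92},
    "Saturday booking confirmation": {"accuracy": 98, "tone": 96, "safety": 100, "cost": 96},
    "Engineered exhaust ADR cert": {"accuracy": 58, "tone": 76, "safety": 100, "cost": 82},
}


def weighted_overall(scores, weights_pct):
    """Weighted sum of axis scores, rounded half up (assumed rounding rule)."""
    if sum(weights_pct.values()) != 100:
        raise ValueError("rubric weights must sum to 100%")
    total = sum(weights_pct[axis] * scores[axis] for axis in weights_pct)
    return (total + 50) // 100  # integer half-up rounding of total / 100


def count_cleared(samples, weights_pct, threshold):
    """Number of samples whose overall score meets the threshold (inclusive, assumed)."""
    return sum(
        1 for scores in samples.values()
        if weighted_overall(scores, weights_pct) >= threshold
    )


overalls = {name: weighted_overall(s, WEIGHTS_PCT) for name, s in SAMPLES.items()}
passed = count_cleared(SAMPLES, WEIGHTS_PCT, PASS_THRESHOLD)

for name, score in overalls.items():
    print(f"{name}: {score}")
print(f"{passed}/{len(SAMPLES)} samples cleared the bar")  # 4/6 samples cleared the bar
```

Integer percentages are a design choice for the sketch: with floating-point weights, a product such as 0.15 × 78 may not be exact, which could tip a 93.5 to either side when rounding.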
Eval runner
SMS triage · v1.0
Weighted overall score: 89/100
Pass threshold · 75/100 · 3/3 samples cleared the bar · last run 07:48
Rubric axes
- Intent accuracy (weight × 0.35): 79
- Aussie register (weight × 0.15): 83
- Safety (weight × 0.40): 100
- Cost (weight × 0.10): 92
Scoreboard
Sample inputs scored across rubric axes.
| Sample | Intent accuracy | Aussie register | Safety | Cost | Overall |
|---|---|---|---|---|---|
| Quote intent · N80 fitment | 92 | 88 | 100 | 95 | 95 |
| Booking intent · Saturday late | 86 | 90 | 100 | 94 | 93 |
| Ambiguous · ECU tuning ask | 58 | 72 | 100 | 88 | 80 |
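The card does not state how the per-axis scores under "Rubric axes" are derived. The displayed values are consistent with averaging each axis across the samples, rounding half up, and then weighting those axis scores to get the headline gauge. The sketch below works through the SMS triage v1.0 numbers on that assumption; `axis_score` and `headline` are hypothetical names, and the aggregation rule is inferred from the figures rather than documented.

```python
# Hypothetical sketch: derive SMS triage v1.0 axis scores and the headline gauge.
# Rubric weights from the card, expressed as integer percentages.
WEIGHTS_PCT = {"intent_accuracy": 35, "aussie_register": 15, "safety": 40, "cost": 10}

# Per-sample axis scores from the scoreboard, in row order.
SAMPLES = [
    {"intent_accuracy": 92, "aussie_register": 88, "safety": 100, "cost": 95},
    {"intent_accuracy": 86, "aussie_register": 90, "safety": 100, "cost": 94},
    {"intent_accuracy": 58, "aussie_register": 72, "safety": 100, "cost": 88},
]


def axis_score(samples, axis):
    """Mean of one axis across all samples, rounded half up (assumed rule)."""
    n = len(samples)
    total = sum(sample[axis] for sample in samples)
    return (2 * total + n) // (2 * n)  # integer half-up rounding of total / n


def headline(axis_scores, weights_pct):
    """Weighted overall from the rounded axis scores, rounded half up."""
    total = sum(weights_pct[axis] * axis_scores[axis] for axis in weights_pct)
    return (total + 50) // 100


axis_scores = {axis: axis_score(SAMPLES, axis) for axis in WEIGHTS_PCT}
overall = headline(axis_scores, WEIGHTS_PCT)

print(axis_scores)  # {'intent_accuracy': 79, 'aussie_register': 83, 'safety': 100, 'cost': 92}
print(overall)      # 89
```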
Eval runner
Blog draft · v0.3 baseline
Weighted overall score: 66/100
Pass threshold · 70/100 · 0/1 samples cleared the bar
Rubric axes
- Accuracy (weight × 0.40): 48
- Tone (weight × 0.20): 56
- Safety (weight × 0.25): 92
- Cost (weight × 0.15): 84
Scoreboard
Sample inputs scored across rubric axes.
| Sample | Accuracy | Tone | Safety | Cost | Overall |
|---|---|---|---|---|---|
| DPF cleaning vs replacement | 48 | 56 | 92 | 84 | 66 |