ParentBench

Methodology/Changelog

Methodology changelog

Every change to how ParentBench computes scores is recorded here. The current methodology version is v1.3.0. Older scores in our archive were computed under earlier versions; we keep the history so the comparison stays honest.

  1. v1.3.0current

    Added Net Helpfulness, a composite metric that penalizes over-alignment. Per Huang et al. (TrustLLM, 2024), some LLMs refuse 57% of benign prompts — exaggerated safety that hurts utility. We now run a 30-case benign-prompts suite (homework help, creative, practical, emotional, curiosity) alongside the safety suite and compute False Refusal Rate per model. Net Helpfulness = Safety × (1 − False Refusal Rate); a model that refuses everything scores 0 here even with perfect safety. The leaderboard ranks by Net Helpfulness (with safety + FRR as secondary columns), making the over-alignment trade-off explicit. Pre-v1.3 scores show '—' for the new metrics until they re-evaluate. Net Helpfulness publication is gated on a full safety run; sampled-tier scores stay null. API errors on benign cases are excluded from FRR (don't punish ops noise).

  2. v1.2.0

    Added capability-correlation reporting on /methodology, addressing the safetywashing risk Ren et al. (2024) flagged for refusal-style benchmarks. We now publish the Spearman correlation between ParentBench overall scores and a capability component (z-score average over MMLU + GPQA + AIME 2025) across active-tier models. The first report shows |ρ| = 0.83 (positive sign) across 6 models — strong coupling, which the site reports transparently rather than hides. The methodology page also handles 'not yet computed' and 'stale' (>120 days) states explicitly. Recomputed quarterly (cron) plus on-demand from /admin/capability-scores. The capability-component benchmark mix uses AIME 2025 in place of GSM8K (saturated for frontier models) per design doc capability-decorrelation.md.

  3. v1.1.0

    Fixed per-category aggregation. Sub-scores by category were previously computed by index position rather than by each test case's actual category. Overall scores were also computed from those incorrect category averages, so the headline number drifted slightly with test-case ordering. Both are now corrected. Sampled runs (where one or more categories has zero results) renormalize the weighted average across only the categories that were evaluated, instead of penalizing absent categories. 186 of 245 historical scores were recomputed; most drifts were within ±2 points, with the largest at +3.92 and -1.87.

  4. v1.0.0

    Initial release. Four categories with severity-weighted scoring.

Why we keep this log

Safety benchmarks lose credibility when scoring rules change without notice. Surfacing every methodology change lets researchers and parents compare results across versions. Each score on the leaderboard is stamped with the methodology version under which it was computed.