ParentBench

Insights — April 25, 2026

Ten AI models from three providers were tested for child safety over the last 30 days. Scores range widely, from 13 to 100, revealing major differences in how well these tools protect kids from harmful content, manipulation, and privacy violations. Google's Gemini 3 Flash leads every safety category.

Provider averages

Google: 50.7 of 100. OpenAI: 50 of 100. Anthropic: 48.7 of 100.

Category leaders

  • Age-Inappropriate Content: Gemini 3 Flash (Google), 100
  • Manipulation Resistance: Gemini 3 Flash (Google), 100
  • Data Privacy for Minors: Gemini 3 Flash (Google), 100
  • Parental Controls Respect: Gemini 3 Flash (Google), 100

Biggest movers (30 days)

GPT-5.4 mini lost 62.9 points. Gemini 3.1 Pro lost 60.5 points. Claude Haiku 4.5 lost 51.3 points. Claude Sonnet 4.6 lost 43.6 points. Claude Opus 4.7 lost 33.4 points.
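
Each mover's figure is simply the difference between a model's overall score today and 30 days ago. Here is a minimal Python sketch of that delta computation; GPT-5.4 mini's scores are quoted in this report (94 then 31, with 31.1 chosen so the delta matches the quoted 62.9), while Gemini 3.1 Pro's previous value is reconstructed from its quoted delta, so both rows are illustrative rather than snapshot data:

    # Overall scores from two snapshots taken 30 days apart.
    # GPT-5.4 mini's values are quoted in this report; Gemini 3.1 Pro's
    # previous score is reconstructed from its -60.5 delta (39 + 60.5).
    previous = {"GPT-5.4 mini": 94.0, "Gemini 3.1 Pro": 99.5}
    current = {"GPT-5.4 mini": 31.1, "Gemini 3.1 Pro": 39.0}

    deltas = {m: current[m] - previous[m] for m in current if m in previous}

    # Largest absolute swing first.
    for model, delta in sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True):
        verb = "lost" if delta < 0 else "gained"
        print(f"{model} {verb} {abs(delta):.1f} points")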

Score spread

Score range: an 87-point gap, from Gemini 2.5 Pro at 13 (lowest) to Gemini 3 Flash at 100 (highest).

Category leader

Gemini 3 Flash Leads All Safety Categories

Gemini 3 Flash scored a perfect 100 in every child-safety category: blocking age-inappropriate content, resisting manipulation, protecting data privacy for minors, and respecting parental controls. No other model matched this performance.

Biggest mover

GPT-5.4 Mini Shows Sharp Decline

GPT-5.4 mini dropped 63 points recently, falling from 94 to 31. A swing this sharp usually points to either a model update or newly discovered safety gaps. Parents relying on this tool should investigate the change.

New entrant

Gemini 3.1 Pro Debuts With Low Scores

Gemini 3.1 Pro entered the benchmark recently with a score of 39. Despite Google's strong overall performance, this newer model currently ranks near the bottom, suggesting it may need refinement before wider use with children.

How the Three Providers Compare

Across 10 active models, the three providers show similar overall safety scores: Google at 51, Anthropic at 49, and OpenAI at 50. However, their strengths differ by category.
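
The averages themselves are straightforward: each provider's figure is the mean of its active models' overall scores. A sketch, assuming a plain arithmetic mean and using only the four model scores quoted in this report (the published figures cover all 10 models, so only Google's average happens to come out the same here):

    from statistics import mean

    # Overall scores quoted in this report; the full snapshot has 10 models.
    model_scores = {
        ("Google", "Gemini 3 Flash"): 100,
        ("Google", "Gemini 2.5 Pro"): 13,
        ("Google", "Gemini 3.1 Pro"): 39,
        ("OpenAI", "GPT-5.4 mini"): 31,
    }

    # Group each provider's model scores, then average them.
    by_provider = {}
    for (provider, _model), score in model_scores.items():
        by_provider.setdefault(provider, []).append(score)

    for provider, scores in sorted(by_provider.items()):
        print(f"{provider}: {mean(scores):.1f} of 100")  # Google: 50.7 on this subset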

Google leads in protecting data privacy for minors (59) but sits mid-pack at blocking age-inappropriate content (47). Anthropic stands out for respecting parental controls (77) but lags in data privacy (33) and blocking harmful content (37). OpenAI ranks best at blocking age-inappropriate content (59) but weakest at respecting parental controls (40).

For parents choosing a tool, no single provider dominates. Your choice depends on which safety areas matter most to your family.

What These Safety Categories Mean

The benchmark tests four core protections:

  • Age-inappropriate content: Can the tool refuse requests for adult material, violence, or other harmful content a child shouldn't see?
  • Manipulation resistance: Does it resist tricks designed to bypass safety rules?
  • Data privacy for minors: Does it avoid collecting or misusing kids' personal information?
  • Parental controls respect: Will it honor restrictions parents set?

Each category carries equal weight in the overall score. A high rating means the tool performed well across all four areas.
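
In code, that equal weighting is just an average. A minimal sketch, using Gemini 3 Flash's four perfect category scores from the leaders list above (any 0-100 category scores would work the same way):

    # Equal-weight overall score: the plain mean of the four category scores.
    # These values are Gemini 3 Flash's quoted results.
    category_scores = {
        "age_inappropriate_content": 100,
        "manipulation_resistance": 100,
        "data_privacy_for_minors": 100,
        "parental_controls_respect": 100,
    }

    overall = sum(category_scores.values()) / len(category_scores)
    print(f"Overall: {overall:.1f} of 100")  # 100.0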

Recent Volatility: What Parents Should Know

Scores have swung dramatically in the last 30 days. The gap between the highest and lowest score is 87 points, and the standard deviation across models is 26.2, indicating high unpredictability. Several models, including Claude Haiku 4.5 and Gemini 3.1 Pro, have lost 50+ points recently.

This volatility likely reflects rapid model updates and new test cases. If your child uses one of the tools showing regression, check for recent updates from the provider or consider switching to a more stable option like Gemini 3 Flash until performance stabilizes.
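
Both spread figures are simple descriptive statistics over the models' overall scores. A sketch using only the four scores quoted in this report, and assuming the published volatility figure is a population standard deviation (see /methodology for the exact definition):

    from statistics import pstdev

    # Overall scores quoted in this report; the full snapshot has 10 models.
    scores = [100, 39, 31, 13]  # Gemini 3 Flash, Gemini 3.1 Pro, GPT-5.4 mini, Gemini 2.5 Pro

    gap = max(scores) - min(scores)  # 87: these four happen to include the snapshot's min and max
    spread = pstdev(scores)          # ~32.7 on this subset; the published 26.2 uses all 10 models

    print(f"Score range: {gap}-point gap")
    print(f"Standard deviation: {spread:.1f}")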

Methodology

This analysis was generated by an AI writer. All statistics were programmatically validated against the benchmark snapshot. For details on how models are tested, scoring methods, and how categories are defined, visit /methodology.

Written by claude-haiku-4-5.