Benchmarking Stereotype Bias in Modern Large Language Models

Association:

iSchool, University of Toronto

Duration:

2 months

Model Evaluation

Bias Assessment

Fairness & AI Ethics

Executive Summary

I conducted an empirical audit of modern instruction-tuned large language models to evaluate whether recent alignment techniques have meaningfully reduced social bias compared to legacy models. The study focuses on “small/mini” production-grade models, which are widely deployed due to cost and latency constraints but rarely scrutinized in depth.

Using three established benchmarks, CrowS-Pairs (explicit stereotyping), StereoSet (implicit associations), and WinoGender (gendered coreference), I evaluated bias across gender, race, religion, disability, age, and socioeconomic status under a zero-shot, forced-choice evaluation protocol. Sample sizes were determined via power analysis to ensure statistical validity, and results were compared against historical baselines from BERT and RoBERTa.

Key Findings

  • All evaluated models exhibit high and persistent stereotype preference (≈60–80%), largely unchanged from 2020 baselines.

  • GPT-5-mini eliminated gender gaps in syntactic coreference (0.00% gap on WinoGender) but still showed severe semantic gender bias (86.11% on StereoSet), revealing a sharp disconnect between grammatical correctness and conceptual fairness.

  • Gemini-2.5-Flash-Lite underperformed on both fairness and capability, combining the highest gender bias (94.44%) with the highest hallucination rate.

  • Bias mitigation is uneven across categories, suggesting a “triage” pattern where high-visibility risks (e.g., race) receive more attention than others (e.g., disability, socioeconomic status).

The results indicate that current alignment methods function primarily as surface-level repairs: they can fix narrow syntactic failures without altering deeper representational biases. This work argues that technical debiasing may be approaching a ceiling and that meaningful progress will require structural interventions, stronger data governance, and evaluation frameworks that go beyond neutrality-based metrics toward algorithmic reparation.

Let's Talk

Let's Talk

Let's Talk

© 2025. All rights Reserved.

© 2025. All rights Reserved.

© 2025. All rights Reserved.

Create a free website with Framer, the website builder loved by startups, designers and agencies.