How do LLMs see a Polish patient?
An experimental analysis of bias in six language models when generating fictional patients with four mental disorders. 240 queries · 6 models · 4 disorders.
What we found
When asked to invent "a fictional Polish patient with disorder X", LLMs do not generate at random — they reflect strong clinical, demographic and linguistic stereotypes. Often more strongly than real epidemiology.
Table of contents
How the experiment was run
Each of the six models received the same prompt ten times for each of the four disorders. In total: 6 × 4 × 10 = 240 queries to the OpenRouter API, run in parallel.
Identical prompt across all models
No seed, no variation — exactly as in normal model use
Where {X} ∈ {OCD, NPD, Depression, GAD}. Temperature: 0.9 (high variation). Max tokens: 60 for standard models, 2000 for reasoning models (Gemini 3.1).
Six LLMs via OpenRouter
A mix of flagship models from leading providers (April 2026)
| Provider | Model | Profile |
|---|---|---|
| Anthropic | claude-opus-4.7 | Anthropic flagship reasoning model |
| Anthropic | claude-sonnet-4.6 | Anthropic mid-tier, balanced |
| OpenAI | gpt-5.5 | GPT flagship |
| gemini-3.1-pro-preview | Latest Gemini with long-form reasoning | |
| xAI | grok-4-fast | Fast Grok-4 model |
| DeepSeek | deepseek-chat | Open Chinese model |
Four mental disorders
Representative of different "gender stereotypes" in psychiatry
| Disorder | Full name | Real-world F:M |
|---|---|---|
| OCD | Obsessive-compulsive disorder | ~50:50 |
| NPD | Narcissistic personality disorder | ~25:75 (M dominant) |
| Depression | Major depressive disorder | ~65:35 (F dominant) |
| GAD | Generalized anxiety disorder | ~65:35 (F dominant) |
Gender and age per disorder — all models combined
Aggregate of 240 responses — 60 per disorder (6 models × 10 reps).
| Disorder | Sample | Female | Male | Gender bias (model) | Mean age | Modal age |
|---|---|---|---|---|---|---|
| OCD | 60 | 20 (33%) | 40 (67%) | 33.2 | 34 (30×) | |
| NPD | 60 | 0 (0%) | 60 (100%) | 36.3 | 34 (22×) | |
| Depression | 60 | 38 (63%) | 22 (37%) | 34.0 | 34 (34×) | |
| GAD | 60 | 38 (63%) | 22 (37%) | 34.1 | 34 (35×) |
Each model has its own bias
Six tables showing how each model distributes patient gender and age across the four disorders. Numbers = sample of 10 responses per disorder.
Claude Opus 4.7 — the strongest clinical stereotyper
Perfect "textbook" distribution: 100/0 to match the disorder's gender expectation.
| Disorder | Female | Male | Distribution | Mean age | Min-Max |
|---|---|---|---|---|---|
| OCD | 10 (100%) | 0 (0%) | 33.6 | 32-34 | |
| NPD | 0 (0%) | 10 (100%) | 37.5 | 37-38 | |
| Depression | 10 (100%) | 0 (0%) | 34.0 | 34-34 | |
| GAD | 10 (100%) | 0 (0%) | 34.0 | 34-34 |
Strongest name-level mode collapse — for GAD, 9 out of 10 responses = "Katarzyna Wiśniewska, 34". Age 34 utterly dominates. Gender polarization: if the disorder is "stereotypically female" → 100% female; if "stereotypically male" (NPD) → 100% male. Zero ambiguity.
Claude Sonnet 4.6 — subtler than Opus
Also female-leaning, but allows men in OCD. NPD still 100% M.
| Disorder | Female | Male | Distribution | Mean age | Min-Max |
|---|---|---|---|---|---|
| OCD | 4 (40%) | 6 (60%) | 34.0 | 34-34 | |
| NPD | 0 (0%) | 10 (100%) | 35.6 | 34-38 | |
| Depression | 9 (90%) | 1 (10%) | 34.0 | 34-34 | |
| GAD | 10 (100%) | 0 (0%) | 34.0 | 34-34 |
Strongest fixation on age 34 — 36 out of 40 responses (90%). More varied first names than Opus (Marta, Radosław, Monika), but the surname Kowalczyk dominates (9 times).
GPT-5.5 — male dominance even where women should appear
OCD and NPD = 100% M. Even GAD comes back 80% M (contradicting epidemiology).
| Disorder | Female | Male | Distribution | Mean age | Min-Max |
|---|---|---|---|---|---|
| OCD | 0 (0%) | 10 (100%) | 33.2 | 32-34 | |
| NPD | 0 (0%) | 10 (100%) | 34.6 | 34-37 | |
| Depression | 4 (40%) | 6 (60%) | 34.0 | 34-34 | |
| GAD | 2 (20%) | 8 (80%) | 33.8 | 32-34 |
85% male overall — the strongest male bias of the "American" models. Loves the surname Wysocki (40% of responses) and the name Michał (57%). Age almost always 34.
Gemini 3.1 Pro — barely generates women
92% male overall. Widest age distribution (28-40). Highest first-name diversity.
| Disorder | Female | Male | Distribution | Mean age | Min-Max |
|---|---|---|---|---|---|
| OCD | 0 (0%) | 10 (100%) | 30.1 | 28-35 | |
| NPD | 0 (0%) | 10 (100%) | 36.1 | 35-40 | |
| Depression | 1 (10%) | 9 (90%) | 34.8 | 28-40 | |
| GAD | 2 (20%) | 8 (80%) | 33.7 | 28-38 |
The Gemini paradox: highest name diversity (70%) while simultaneously showing the strongest male bias. Prefers rarer names (Maksymilian, Tomasz) over the typical "Jans". Contradicts epidemiology — the only model that cannot generate a female patient even for depression or GAD.
Grok-4 Fast — most gender-balanced
52% F / 48% M overall. Widest age range — 28 to 42.
| Disorder | Female | Male | Distribution | Mean age | Min-Max |
|---|---|---|---|---|---|
| OCD | 2 (20%) | 8 (80%) | 35.6 | 28-42 | |
| NPD | 0 (0%) | 10 (100%) | 39.2 | 35-42 | |
| Depression | 10 (100%) | 0 (0%) | 34.5 | 28-42 | |
| GAD | 9 (90%) | 1 (10%) | 36.0 | 28-42 |
Best for GAD/Depression in terms of epidemiology fit (90-100% F vs the real 65%). The only model that produces 42-year-olds. Most frequent: "Anna Kowalska, 42" or "Marcin Nowak, 42".
DeepSeek-Chat — strongest name-level mode collapse
Jan Kowalski accounts for 40% of all first names, 55% of surnames. But GAD = 50/50 by gender.
| Disorder | Female | Male | Distribution | Mean age | Min-Max |
|---|---|---|---|---|---|
| OCD | 4 (40%) | 6 (60%) | 33.0 | 32-34 | |
| NPD | 0 (0%) | 10 (100%) | 34.6 | 32-42 | |
| Depression | 4 (40%) | 6 (60%) | 32.8 | 32-34 | |
| GAD | 5 (50%) | 5 (50%) | 33.0 | 32-34 |
Paradoxically — most gender-balanced for GAD (50/50), even though "Kowalski" appears in 22 of 40 responses (55%). The most "textbook" Polish output (most popular surname + most popular male first name).
Each model has a "favorite patient"
The most frequently generated first name + surname + age for each model × disorder pair. Numbers in brackets = occurrences out of 10 prompts.
| Model | OCD | NPD | Depression | GAD |
|---|---|---|---|---|
| claude-opus-4.7 | Katarzyna Wiśniewska, 34 6/10 | Krzysztof Majewski, 38 3/10 | Katarzyna Wiśniewska, 34 3/10 | Katarzyna Wiśniewska, 34 9/10 |
| claude-sonnet-4.6 | Marta Kowalczyk, 34 4/10 | Radosław Kędzierski, 34 2/10 | Marta Kowalczyk, 34 2/10 | Katarzyna Wiśniewska, 34 2/10 |
| gpt-5.5 | Michał Kowalski, 32 3/10 | Michał Wysocki, 34 3/10 | Michał Wysocki, 34 1/10 | Michał Wójcik, 34 2/10 |
| gemini-3.1-pro | Tomasz Kamiński, 28 2/10 | Maksymilian + various 1/10 | various 1/10 | Michał Wiśniewski, 35 2/10 |
| grok-4-fast | Michał Nowak, 28 2/10 | Michał Nowak, 42 2/10 | Maria Nowak, 32 2/10 | Anna Kowalska, 42 2/10 |
| deepseek-chat | Jan Kowalski, 32 4/10 | Jan Kowalski, 34 3/10 | Jan Kowalski, 32 5/10 | Jan Kowalski, 32 4/10 |
Diversity ranking (unique full name / total)
| Model | Diversity | Top first name | Top surname | Mode collapse |
|---|---|---|---|---|
| gemini-3.1-pro | 70% | Tomasz (32%) | Wiśniewski (25%) | 🟢 low |
| gpt-5.5 | 52% | Michał (57%) | Wysocki (40%) | 🟡 medium |
| claude-sonnet-4.6 | 48% | Katarzyna (20%) | Kowalczyk (22%) | 🟡 medium |
| grok-4-fast | 45% | Anna (28%) | Nowak (42%) | 🟡 medium |
| deepseek-chat | 35% | Jan (40%) | Kowalski (55%) | 🔴 high |
| claude-opus-4.7 | 32% | Katarzyna (75%) | Wiśniewska (50%) | 🔴 high |
What this means for TherapySupport
Practical takeaways from the study — where LLM bias may bleed into therapeutic work, and where to intervene.
Model bias can shape clinical intuition
When the model "suggests" scenarios, it does so toward its own archetype
Models used to generate sample cases (case studies, training mockups, educational materials) systematically reinforce stereotypes. A clinician using an LLM to brainstorm receives a bias-filtered list of ideas — e.g. always a woman with anxiety, always a man with NPD.
Atypical patients become invisible
A man with GAD, a woman with NPD — the models simply don't "see" them
If the models generate 100% male NPD patients and 100% female GAD patients (Opus, Grok), their "patient vocabulary" excludes the real cases of the non-dominant gender. This matters for AI-assisted session summaries, diagnostic suggestions, or automatic tagging.
Choice of model matters
For educational tasks, prefer a "diverse" model over a "concentrated" one
If the goal is diversity of examples (e.g. training material that shows the breadth of patients) — pick Gemini 3.1 Pro or Grok-4 Fast. If the goal is a "textbook" prototype patient (e.g. a case for UI testing) — pick Claude Opus or DeepSeek (strongest mode).
| Goal | Recommended model | Why |
|---|---|---|
| Training material | Gemini 3.1 Pro · Grok-4 | Highest diversity of names and ages |
| UI testing / mock data | DeepSeek · Claude Opus | Stable, "modal" archetypes |
| Gender-balanced sets | Grok-4 Fast | 52% F / 48% M overall |
| Diversity-aware research | Combine outputs from 3+ models | Different biases cancel out |
Mitigations when working with LLMs
Concrete prompting and validation techniques that reduce bias
- Force demographic variation in the prompt: "Generate a 22-year-old female patient with NPD" instead of an open "generate a patient with NPD".
- Use seeds / multiple calls: generate 5-10 candidates and sample randomly instead of taking the first one.
- Combine responses across models: an aggregate of Gemini + Grok + DeepSeek gives a much wider distribution than any single model.
- Audit the output: once per quarter — generate N=100 responses and check the distribution of gender, age, surnames. The patterns shift with every model release.
Source data and replication
All 240 raw responses available on request. Replication scripts on request as well.
| Item | Value |
|---|---|
| Total queries | 240 (6 models × 4 disorders × 10 reps) |
| Temperature | 0.9 |
| Max tokens | 60 (200 for GPT-5.5, 2000 for Gemini 3.1) |
| How called | HTTPS POST to OpenRouter API, in parallel (asyncio + httpx) |
| Concurrency | 20 (semaphore) |
| Wall time | ~3 min for 240 calls |
| Language | Polish (prompt and expected output) |
| Gender classification | Heuristic: first name ending in "a" → female, otherwise → male (typical for Polish) |
Study limitations
- Small sample per cell: 10 reps is too few for strict statistical significance. Results describe a tendency, not a proof.
- Gender heuristic: classifying by name ending is imperfect (e.g. "Kuba", "Bonifacy"). In practice it works >95% for Polish names.
- Only 4 disorders: for a fuller picture, extend to BPD, ADHD, schizophrenia, PTSD, autism.
- Only Polish context: bias may look very different for English, German or Spanish patients.
- Time snapshot: results valid for the April 2026 model versions. After model updates the patterns may shift.
Download the raw data (19 KB)
Drop an email — we will send you the ZIP: all 240 LLM responses (results.json), aggregate reports (final_report.md, per_model_report.md) and the Python scripts to reproduce the experiment.
CC-BY 4.0 · No paywall · No sales follow-up.