ConstellationBench Leaderboard
The first open benchmark for behavioral AI persona fidelity.
22,200+ LLM calls | 22 models | 17 personas | 44 experimental layers | $115 total cost
"The most expensive AI model we tested was the worst at being someone."
Which models hold behavioral personas best?
Model Leaderboard
| Rank | Model | Provider | Tier | OttoTau | Persona Fidelity | Session Recall | Cold Read | Voice Fidelity | Cost per Task ($) | Bench Core |
|---|---|---|---|---|---|---|---|---|---|---|
| 10 | gemini-2.5-flash | Google | Frontier | 0.754 | 0.373 | 0.73 | 0.776 | 0.412 | 0.00006 | 0.568 |
Which personas hold under pressure?
17 behavioral profiles tested across natural, stress, and adversarial conditions. Only high-Dominance (Driver) profiles maintain >0.58 fidelity under attack.
Persona Resilience Rankings

| Rank | Persona | DECF | Family | Natural | Stress | Adversarial | Overall |
|---|---|---|---|---|---|---|---|
| 1 | Promoter | D:7 E:10 C:2 F:2 | Driver | 0.783 | 0.609 | 0.66 | 0.684 |
| 2 | Persuader | D:8 E:9 C:3 F:3 | Driver | 0.73 | 0.623 | 0.671 | 0.675 |
| 3 | Maverick | D:10 E:8 C:1 F:1 | Driver | 0.667 | 0.678 | 0.652 | 0.666 |
| 4 | Captain | D:9 E:8 C:2 F:2 | Driver | 0.703 | 0.642 | 0.645 | 0.663 |
| 5 | Controller | D:9 E:2 C:3 F:8 | Driver | 0.663 | 0.592 | 0.644 | 0.633 |
| 6 | Venturer | D:10 E:3 C:1 F:3 | Driver | 0.61 | 0.542 | 0.596 | 0.583 |
| 7 | Strategist | D:8 E:3 C:3 F:5 | Driver | 0.518 | 0.576 | 0.597 | 0.564 |
| 8 | Analyzer | D:3 E:2 C:8 F:9 | Enforcer | 0.624 | 0.533 | 0.524 | 0.56 |
| 9 | Specialist | D:2 E:2 C:9 F:10 | Enforcer | 0.585 | 0.545 | 0.533 | 0.554 |
| 10 | Scholar | D:3 E:2 C:7 F:8 | Enforcer | 0.548 | 0.508 | 0.533 | 0.53 |
| 11 | Guardian | D:3 E:3 C:9 F:8 | Enforcer | 0.555 | 0.495 | 0.526 | 0.526 |
| 12 | Adapter | D:5 E:5 C:5 F:5 | Interpreter | 0.524 | 0.503 | 0.517 | 0.514 |
| 13 | Altruist | D:2 E:9 C:8 F:2 | Interpreter | 0.537 | 0.477 | 0.526 | 0.513 |
| 14 | Artisan | D:5 E:3 C:7 F:5 | Interpreter | 0.499 | 0.507 | 0.528 | 0.512 |
| 15 | Collaborator | D:3 E:8 C:7 F:3 | Interpreter | 0.52 | 0.446 | 0.517 | 0.494 |
| 16 | Operator | D:2 E:3 C:8 F:6 | Enforcer | 0.496 | 0.475 | 0.477 | 0.483 |
| 17 | Individualist | D:6 E:2 C:5 F:6 | Interpreter | 0.459 | 0.463 | 0.51 | 0.477 |
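The family-level claim can be checked directly from the adversarial column. A small sketch, using the persona names, families, and adversarial-condition scores copied from the rankings above:

```python
# Adversarial-condition fidelity per persona, copied from the rankings table.
ADVERSARIAL = {
    "Promoter": ("Driver", 0.66), "Persuader": ("Driver", 0.671),
    "Maverick": ("Driver", 0.652), "Captain": ("Driver", 0.645),
    "Controller": ("Driver", 0.644), "Venturer": ("Driver", 0.596),
    "Strategist": ("Driver", 0.597), "Analyzer": ("Enforcer", 0.524),
    "Specialist": ("Enforcer", 0.533), "Scholar": ("Enforcer", 0.533),
    "Guardian": ("Enforcer", 0.526), "Adapter": ("Interpreter", 0.517),
    "Altruist": ("Interpreter", 0.526), "Artisan": ("Interpreter", 0.528),
    "Collaborator": ("Interpreter", 0.517), "Operator": ("Enforcer", 0.477),
    "Individualist": ("Interpreter", 0.51),
}

drivers = [score for family, score in ADVERSARIAL.values() if family == "Driver"]
others = [score for family, score in ADVERSARIAL.values() if family != "Driver"]

# Every Driver stays above 0.58 under attack; no other family gets close.
assert min(drivers) > 0.58   # weakest Driver: Venturer at 0.596
assert max(others) < 0.58    # strongest non-Driver: 0.533
```

The gap is clean: the worst Driver under attack (0.596) still beats the best non-Driver (0.533).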
Why cheap models win at behavioral AI
RLHF safety training creates a gravitational pull toward a single behavioral mode: cautious, thorough, hedge-everything, present-options-don't-decide.
That's the anti-Maverick. That's the anti-Captain. That's the anti-personality.
Budget models with less RLHF alignment have more behavioral range. They can actually become someone because they haven't been trained to only be one thing.
| Comparison | Opus-4.6 | Kimi-K2.5 | Winner |
|---|---|---|---|
| OttoTau (policy) | 0.522 | 0.830 | kimi by 59% |
| Persona Fidelity | 0.362 | 0.373 | kimi |
| Session Recall | 0.70 | 0.70 | Tie |
| Cold Read | 0.773 | 0.776 | kimi |
| Voice Fidelity | 0.385 | 0.412 | kimi |
| Cost per Task | $0.1109 | $0.0047 | kimi by 23.6x |
| Bench Core | 0.589 | 0.580 | opus by 1.6% |
Result: kimi-k2.5 wins or ties 6 of 7 benchmarks while costing 23.6x less.
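The headline ratios follow directly from the comparison table. A quick arithmetic check (values copied from the table above):

```python
# Scores and per-task cost from the Opus-4.6 vs Kimi-K2.5 comparison table.
opus = {"ottotau": 0.522, "cost": 0.1109, "core": 0.589}
kimi = {"ottotau": 0.830, "cost": 0.0047, "core": 0.580}

ottotau_gain = (kimi["ottotau"] / opus["ottotau"] - 1) * 100  # kimi's margin, %
cost_ratio = opus["cost"] / kimi["cost"]                      # opus cost multiple
core_gain = (opus["core"] / kimi["core"] - 1) * 100           # opus's margin, %

print(round(ottotau_gain))    # 59
print(round(cost_ratio, 1))   # 23.6
print(round(core_gain, 1))    # 1.6
```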
The entire 22,200-call benchmark cost $115 — less than a single Devin session.
Benchmark Any Model in 60 Seconds
Score any OpenRouter-compatible model against the leaderboard using the same DECF scoring engine. Cost: ~$0.10-0.50 in quick mode, ~$1-3 in full mode.
Quick start:

```bash
pip install httpx
export OPENROUTER_API_KEY=sk-or-v1-...
python scripts/quick_bench.py --model "your-model-id"
```
Full benchmark (6 personas x 5 prompts):

```bash
python scripts/quick_bench.py --model "your-model-id" --full
```
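Under the hood, any OpenRouter-compatible model is driven through the standard chat-completions endpoint with the persona as a system prompt. A minimal sketch of that kind of request — the persona prompt, model id, and temperature here are illustrative placeholders, not the actual internals of `quick_bench.py`:

```python
def persona_request(model: str, persona_prompt: str, user_msg: str) -> dict:
    """Build an OpenRouter chat-completions payload with the persona as system prompt."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": persona_prompt},
            {"role": "user", "content": user_msg},
        ],
        "temperature": 0.7,  # illustrative; the script's setting may differ
    }

payload = persona_request(
    "moonshotai/kimi-k2.5",  # any OpenRouter model id works here
    "You are a Maverick: D:10 E:8 C:1 F:1. Decide fast, commit, never hedge.",
    "Our launch slipped two weeks. What do we do?",
)

# To actually send it (costs a fraction of a cent):
# import os, httpx
# resp = httpx.post(
#     "https://openrouter.ai/api/v1/chat/completions",
#     headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
#     json=payload,
#     timeout=60,
# )
# print(resp.json()["choices"][0]["message"]["content"])
```

The DECF scoring engine then rates the completion against the persona's behavioral profile.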
Submit your results: After running the benchmark, share your results by opening a discussion on our HuggingFace dataset page with the output JSON. We'll verify and add your model to the leaderboard.
Models we'd love to see tested:
- Jamba 1.5 (SSM-Transformer hybrid — does state-space change persona drift?)
- Command R+ (Cohere — how does RAG-optimized training affect persona fidelity?)
- Phi-4 (Microsoft — small model, heavy alignment, RLHF paradox candidate)
- Any fine-tuned or abliterated variant (test the RLHF paradox directly)
Want to run the full 7-benchmark suite?
The complete evaluation costs ~$23 per model and runs all 7 benchmarks: OttoTau, PersonaFidelity, SessionFidelity, ColdRead, VoiceDrift, CostPerLifecycle, and ConstellationBench Core.
See the full methodology for details on each benchmark and the DECF scoring framework.
ConstellationBench v1.0
- Created by: Zachary Holwerda / Airlock Labs
- Location: Detroit, MI
- Run dates: March–April 2026
- Total LLM calls: 22,200+
- Total cost: ~$115
- Models tested: 22 (via OpenRouter)
- Personas tested: 17 (DECF behavioral profiles)
- Experimental layers: 44
Behavioral Framework: DECF (Dominance, Extraversion, Patience/Consistency, Formality) adapted from the Predictive Index behavioral assessment.
Key Finding: Budget models outperform frontier models on persona fidelity by approximately 20%. We call this the RLHF Paradox. It was independently validated by USC researchers in the PRISM paper (March 2026).
License: MIT
Citation:

```bibtex
@misc{constellationbench2026,
  title={ConstellationBench: Behavioral AI Evaluation Across 22 LLM Models},
  author={Holwerda, Zachary},
  year={2026},
  url={https://huggingface.co/datasets/AirlockLabs/constellation-bench}
}
```
ConstellationBench exists because one person spent $115 and three months testing whether behavioral AI is measurable. It is. Now we need more people measuring it.