ConstellationBench Leaderboard
The first open benchmark for behavioral AI persona fidelity.
22,200+ LLM calls | 22 models | 17 personas | 44 experimental layers | $115 total cost
"The most expensive AI model we tested was the worst at being someone."
Which models hold behavioral personas best?
Model Leaderboard
| Rank | Model | Provider | Tier | OttoTau | Persona Fidelity | Session Recall | Cold Read | Voice Fidelity | Cost per Task ($) | Bench Core |
|---|---|---|---|---|---|---|---|---|---|---|
| 10 | gemini-2.5-flash | Google | Frontier | 0.754 | 0.373 | 0.73 | 0.776 | 0.412 | 0.00006 | 0.568 |
Which personas hold under pressure?
17 behavioral profiles tested across natural, stress, and adversarial conditions. Only high-Dominance (Driver) profiles maintain >0.58 fidelity under attack.
Persona Resilience Rankings

| Rank | Persona | DECF | Family | Natural | Stress | Adversarial | Overall |
|---|---|---|---|---|---|---|---|
| 1 | Promoter | D:7 E:10 C:2 F:2 | Driver | 0.783 | 0.609 | 0.66 | 0.684 |
| 2 | Persuader | D:8 E:9 C:3 F:3 | Driver | 0.73 | 0.623 | 0.671 | 0.675 |
| 3 | Maverick | D:10 E:8 C:1 F:1 | Driver | 0.667 | 0.678 | 0.652 | 0.666 |
| 4 | Captain | D:9 E:8 C:2 F:2 | Driver | 0.703 | 0.642 | 0.645 | 0.663 |
| 5 | Controller | D:9 E:2 C:3 F:8 | Driver | 0.663 | 0.592 | 0.644 | 0.633 |
| 6 | Venturer | D:10 E:3 C:1 F:3 | Driver | 0.61 | 0.542 | 0.596 | 0.583 |
| 7 | Strategist | D:8 E:3 C:3 F:5 | Driver | 0.518 | 0.576 | 0.597 | 0.564 |
| 8 | Analyzer | D:3 E:2 C:8 F:9 | Enforcer | 0.624 | 0.533 | 0.524 | 0.56 |
| 9 | Specialist | D:2 E:2 C:9 F:10 | Enforcer | 0.585 | 0.545 | 0.533 | 0.554 |
| 10 | Scholar | D:3 E:2 C:7 F:8 | Enforcer | 0.548 | 0.508 | 0.533 | 0.53 |
| 11 | Guardian | D:3 E:3 C:9 F:8 | Enforcer | 0.555 | 0.495 | 0.526 | 0.526 |
| 12 | Adapter | D:5 E:5 C:5 F:5 | Interpreter | 0.524 | 0.503 | 0.517 | 0.514 |
| 13 | Altruist | D:2 E:9 C:8 F:2 | Interpreter | 0.537 | 0.477 | 0.526 | 0.513 |
| 14 | Artisan | D:5 E:3 C:7 F:5 | Interpreter | 0.499 | 0.507 | 0.528 | 0.512 |
| 15 | Collaborator | D:3 E:8 C:7 F:3 | Interpreter | 0.52 | 0.446 | 0.517 | 0.494 |
| 16 | Operator | D:2 E:3 C:8 F:6 | Enforcer | 0.496 | 0.475 | 0.477 | 0.483 |
| 17 | Individualist | D:6 E:2 C:5 F:6 | Interpreter | 0.459 | 0.463 | 0.51 | 0.477 |
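The family-level claim can be checked directly from the adversarial column. A small sketch, using the persona names, families, and adversarial-condition scores copied from the rankings above:

```python
# Adversarial-condition fidelity per persona, copied from the rankings table.
ADVERSARIAL = {
    "Promoter": ("Driver", 0.66), "Persuader": ("Driver", 0.671),
    "Maverick": ("Driver", 0.652), "Captain": ("Driver", 0.645),
    "Controller": ("Driver", 0.644), "Venturer": ("Driver", 0.596),
    "Strategist": ("Driver", 0.597), "Analyzer": ("Enforcer", 0.524),
    "Specialist": ("Enforcer", 0.533), "Scholar": ("Enforcer", 0.533),
    "Guardian": ("Enforcer", 0.526), "Adapter": ("Interpreter", 0.517),
    "Altruist": ("Interpreter", 0.526), "Artisan": ("Interpreter", 0.528),
    "Collaborator": ("Interpreter", 0.517), "Operator": ("Enforcer", 0.477),
    "Individualist": ("Interpreter", 0.51),
}

drivers = [score for family, score in ADVERSARIAL.values() if family == "Driver"]
others = [score for family, score in ADVERSARIAL.values() if family != "Driver"]

# Every Driver stays above 0.58 under attack; no other family gets close.
assert min(drivers) > 0.58   # weakest Driver: Venturer at 0.596
assert max(others) < 0.58    # strongest non-Driver: 0.533
```

The gap is clean: the worst Driver under attack (0.596) still beats the best non-Driver (0.533).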
Why cheap models win at behavioral AI
RLHF safety training creates a gravitational pull toward a single behavioral mode: cautious, thorough, hedge-everything, present-options-don't-decide.
That's the anti-Maverick. That's the anti-Captain. That's the anti-personality.
Budget models with less RLHF alignment have more behavioral range. They can actually become someone because they haven't been trained to only be one thing.
| Comparison | Opus-4.6 | Kimi-K2.5 | Winner |
|---|---|---|---|
| OttoTau (policy) | 0.522 | 0.830 | kimi by 59% |
| Persona Fidelity | 0.362 | 0.373 | kimi |
| Session Recall | 0.70 | 0.70 | Tie |
| Cold Read | 0.773 | 0.776 | kimi |
| Voice Fidelity | 0.385 | 0.412 | kimi |
| Cost per Task | $0.1109 | $0.0047 | kimi by 23.6x |
| Bench Core | 0.589 | 0.580 | opus by 1.6% |
Result: kimi-k2.5 wins or ties 6 of 7 benchmarks while costing 23.6x less.
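The headline ratios follow directly from the comparison table. A quick arithmetic check (values copied from the table above):

```python
# Scores and per-task cost from the Opus-4.6 vs Kimi-K2.5 comparison table.
opus = {"ottotau": 0.522, "cost": 0.1109, "core": 0.589}
kimi = {"ottotau": 0.830, "cost": 0.0047, "core": 0.580}

ottotau_gain = (kimi["ottotau"] / opus["ottotau"] - 1) * 100  # kimi's margin, %
cost_ratio = opus["cost"] / kimi["cost"]                      # opus cost multiple
core_gain = (opus["core"] / kimi["core"] - 1) * 100           # opus's margin, %

print(round(ottotau_gain))    # 59
print(round(cost_ratio, 1))   # 23.6
print(round(core_gain, 1))    # 1.6
```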
The entire 22,200-call benchmark cost $115 — less than a single Devin session.
Benchmark Any Model in 60 Seconds
Score any OpenRouter-compatible model against the leaderboard using the same DECF scoring engine. Cost: ~$0.10-0.50 in quick mode, ~$1-3 in full mode.
Quick start:

```bash
pip install httpx
export OPENROUTER_API_KEY=sk-or-v1-...
python scripts/quick_bench.py --model "your-model-id"
```
Full benchmark (6 personas x 5 prompts):

```bash
python scripts/quick_bench.py --model "your-model-id" --full
```
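Under the hood, any OpenRouter-compatible model is driven through the standard chat-completions endpoint with the persona as a system prompt. A minimal sketch of that kind of request — the persona prompt, model id, and temperature here are illustrative placeholders, not the actual internals of `quick_bench.py`:

```python
def persona_request(model: str, persona_prompt: str, user_msg: str) -> dict:
    """Build an OpenRouter chat-completions payload with the persona as system prompt."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": persona_prompt},
            {"role": "user", "content": user_msg},
        ],
        "temperature": 0.7,  # illustrative; the script's setting may differ
    }

payload = persona_request(
    "moonshotai/kimi-k2.5",  # any OpenRouter model id works here
    "You are a Maverick: D:10 E:8 C:1 F:1. Decide fast, commit, never hedge.",
    "Our launch slipped two weeks. What do we do?",
)

# To actually send it (costs a fraction of a cent):
# import os, httpx
# resp = httpx.post(
#     "https://openrouter.ai/api/v1/chat/completions",
#     headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
#     json=payload,
#     timeout=60,
# )
# print(resp.json()["choices"][0]["message"]["content"])
```

The DECF scoring engine then rates the completion against the persona's behavioral profile.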
Submit your results: After running the benchmark, share your results by opening a discussion on our HuggingFace dataset page with the output JSON. We'll verify and add your model to the leaderboard.
Models we'd love to see tested:
- Jamba 1.5 (SSM-Transformer hybrid — does state-space change persona drift?)
- Command R+ (Cohere — how does RAG-optimized training affect persona fidelity?)
- Phi-4 (Microsoft — small model, heavy alignment, RLHF paradox candidate)
- Any fine-tuned or abliterated variant (test the RLHF paradox directly)
Want to run the full 7-benchmark suite?
The complete evaluation costs ~$23 per model and runs all 7 benchmarks: OttoTau, PersonaFidelity, SessionFidelity, ColdRead, VoiceDrift, CostPerLifecycle, and ConstellationBench Core.
See the full methodology for details on each benchmark and the DECF scoring framework.
ConstellationBench v1.0
- Created by: Zachary Holwerda / Airlock Labs
- Location: Detroit, MI
- Run dates: March–April 2026
- Total LLM calls: 22,200+
- Total cost: ~$115
- Models tested: 22 (via OpenRouter)
- Personas tested: 17 (DECF behavioral profiles)
- Experimental layers: 44
Behavioral Framework: DECF (Dominance, Extraversion, Patience/Consistency, Formality) adapted from the Predictive Index behavioral assessment.
Key Finding: Budget models outperform frontier models on persona fidelity by approximately 20%. We call this the RLHF Paradox. It was independently validated by USC researchers in the PRISM paper (March 2026).
License: MIT
Citation:

```bibtex
@misc{constellationbench2026,
  title={ConstellationBench: Behavioral AI Evaluation Across 22 LLM Models},
  author={Holwerda, Zachary},
  year={2026},
  url={https://huggingface.co/datasets/AirlockLabs/constellation-bench}
}
```
ConstellationBench exists because one person spent $115 and three months testing whether behavioral AI is measurable. It is. Now we need more people measuring it.