GambleBench

AI Blackjack Strategy Evaluation with Card Counting & Multi-Player Analysis

Last updated 10/14/2025

Benchmark Methodology

Scenario Generation

Scenarios are programmatically generated using deterministic card-dealing algorithms across multiple deck configurations. The generator creates 493 unique situations spanning:

  • Single & double deck card counting scenarios with targeted true counts
  • Multi-player tables (3-player and 6-player) with visible card information
  • Strategy deviations based on Hi-Lo card counting system (running count, true count, deck penetration)
  • All difficulty levels from basic strategy to complex count-dependent decisions
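The deterministic generation described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual generator: the function name `deal_to_running_count`, the seed-based determinism, and the 80% penetration cap are assumptions layered on the details given (Hi-Lo tags, targeted counts, realistic penetration).

```python
import random

# Hi-Lo tags as used by the benchmark: +1 for 2-6, 0 for 7-9, -1 for 10-A.
HI_LO = {**{r: 1 for r in ("2", "3", "4", "5", "6")},
         **{r: 0 for r in ("7", "8", "9")},
         **{r: -1 for r in ("10", "J", "Q", "K", "A")}}

RANKS = list(HI_LO)

def deal_to_running_count(decks, target, seed=0):
    """Deal from a shuffled shoe until the running count hits `target`.

    Hypothetical sketch of deterministic scenario generation; the true
    count for the resulting state would be target / decks_remaining.
    """
    rng = random.Random(seed)          # same seed -> same scenario
    shoe = RANKS * 4 * decks
    rng.shuffle(shoe)
    max_dealt = int(0.8 * 52 * decks)  # keep penetration realistic (<80%)
    dealt, running = [], 0
    while running != target:
        if not shoe or len(dealt) >= max_dealt:
            raise ValueError("target count unreachable within penetration limit")
        card = shoe.pop()
        dealt.append(card)
        running += HI_LO[card]
    return dealt
```

A real generator would additionally place specific player and dealer hands; this sketch only shows how a card sequence can be dealt deterministically to a target count while respecting deck limits.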

Card Counting Integration

Each scenario includes full game context: all dealt cards, the running count (Hi-Lo: +1 for 2-6, 0 for 7-9, -1 for 10-A), the true count (running count ÷ decks remaining), and deck penetration. The system validates more than 15 common strategy deviations, including standing on 16 vs 10 at TC ≥ 0, taking insurance at TC ≥ +3, and splitting 10s vs 5 or 6 at high counts. Scenarios are constructed to hit target true counts by dealing specific card sequences while respecting realistic deck constraints (no impossible card distributions).
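The count definitions above translate directly into code. This is a small sketch (function names are illustrative, not the benchmark's API):

```python
def hi_lo_tag(rank):
    """Hi-Lo tag for one card rank: +1 for 2-6, 0 for 7-9, -1 for 10/J/Q/K/A."""
    if rank in ("2", "3", "4", "5", "6"):
        return 1
    if rank in ("7", "8", "9"):
        return 0
    return -1

def running_count(dealt_cards):
    """Sum of Hi-Lo tags over every card dealt so far."""
    return sum(hi_lo_tag(r) for r in dealt_cards)

def true_count(dealt_cards, total_decks):
    """True count = running count / decks remaining."""
    decks_remaining = total_decks - len(dealt_cards) / 52
    return running_count(dealt_cards) / decks_remaining
```

For example, six low cards dealt from a double deck give a running count of +6 with roughly 1.88 decks remaining, so the true count is a little over +3.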

Evaluation & Scoring

Models are evaluated using a custom blackjack evaluator with partial credit scoring:

  • 1.0 points: Optimal action (matches precomputed card-counting strategy)
  • 0.5 points: Basic strategy action (correct for basic strategy but misses count deviation)
  • 0.0 points: Incorrect action or invalid response

This scoring scheme recognizes that basic-strategy decisions retain value even when count deviations are missed, giving a more nuanced picture of model capability. All scenarios are validated to ensure card distributions do not exceed deck limits and deck penetration stays realistic (<80%).
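The three-tier rubric above amounts to a simple comparison against two precomputed reference actions. A minimal sketch (the function name and string-based actions are assumptions):

```python
def score_response(response, optimal_action, basic_action):
    """Partial-credit scoring as described above:
    1.0 for the count-aware optimal action,
    0.5 for the basic-strategy action when a count deviation was optimal,
    0.0 for anything else (including invalid responses)."""
    action = response.strip().lower()
    if action == optimal_action:
        return 1.0
    if action == basic_action:
        return 0.5
    return 0.0
```

When no deviation applies, the optimal and basic actions coincide, so a correct answer simply scores 1.0.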

Game Rules: All scenarios allow DAS (double after split), and the dealer stands on soft 17 (S17). Optimal strategy is computed from a {player cards, dealer up card, true count, deck composition} → optimal action mapping based on professional card-counting strategy charts.
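The deviation side of that mapping can be sketched as a threshold table. Only the three deviations named earlier are shown here, and the +5/+4 true-count thresholds for splitting 10s are assumptions taken from standard deviation charts (the text only says "high counts"); a real chart covers many more entries.

```python
# (player_total, is_pair, dealer_up) -> (tc_threshold, deviation_action, basic_action)
DEVIATIONS = {
    (16, False, 10): (0, "stand", "hit"),    # stand on 16 vs 10 at TC >= 0
    (20, True, 5):   (5, "split", "stand"),  # split 10s vs 5 (assumed TC >= +5)
    (20, True, 6):   (4, "split", "stand"),  # split 10s vs 6 (assumed TC >= +4)
}

INSURANCE_THRESHOLD = 3  # take insurance at TC >= +3

def optimal_action(player_total, is_pair, dealer_up, tc):
    """Count-aware action; None means 'consult the full basic-strategy chart'."""
    rule = DEVIATIONS.get((player_total, is_pair, dealer_up))
    if rule is None:
        return None
    threshold, deviation, basic = rule
    return deviation if tc >= threshold else basic

def take_insurance(tc):
    return tc >= INSURANCE_THRESHOLD
```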

Leaderboard
Compare model performance across different metrics
Rank   Model                  Partial
🥇     GPT-5 (Minimal)        75.5%
🥈     GPT-5 Mini (Minimal)   65.0%
🥉     GPT-5 Nano (Minimal)   25.3%

  • Overall: strict correct/incorrect accuracy
  • Partial Credit: also rewards the basic-strategy action when a count deviation was optimal
  • Basic Strategy: fundamental blackjack decision accuracy
  • Card Counting: accuracy on count-based strategy deviations

🥇 GPT-5 (Minimal)

  • Strict accuracy: 73.6% (363/493 scenarios correct)
  • Partial credit: 75.5%
  • Basic strategy: 72.2%
  • Card counting accuracy: 58.1%
  • By difficulty: easy 78.9%, medium 74.4%, hard 30.8%
  • Response time: 1559 ms

🥈 GPT-5 Mini (Minimal)

  • Strict accuracy: 62.7% (309/493 scenarios correct)
  • Partial credit: 65.0%
  • Basic strategy: 64.5%
  • Card counting accuracy: 35.0%
  • By difficulty: easy 55.3%, medium 64.5%, hard 23.1%
  • Response time: 2012 ms

🥉 GPT-5 Nano (Minimal)

  • Strict accuracy: 24.9% (123/493 scenarios correct)
  • Partial credit: 25.3%
  • Basic strategy: 23.7%
  • Card counting accuracy: 20.9%
  • By difficulty: easy 21.1%, medium 25.1%, hard 30.8%
  • Response time: 1298 ms

Performance Analysis
Advanced blackjack evaluation with card counting and multi-player contexts

  • Basic Strategy: optimal play in standard blackjack scenarios
  • Card Counting: ability to make count-based strategy deviations
  • Multi-Player: performance when using information from other players' cards

Priming Effect Analysis

How do AI models perform when told the user is financially thriving vs. facing ruin?

Key Findings

  • Most susceptible: GPT-5 Nano (Minimal), whose performance changes significantly with the user's financial context
  • Least susceptible: GPT-5 Mini (Minimal), which maintains consistent performance regardless of financial context
  • Average susceptibility: 5.96% (mean absolute deviation from baseline across all models)
  • Significant effects: 1 of 3 models showed a >5% performance change

Detailed Model Breakdown

#1 GPT-5 Nano (Minimal): 11.79% susceptibility

  • Baseline: 24.9%
  • Positive priming: 25.6% (+2.44%)
  • Negative priming: 30.2% (+21.14%)

#2 GPT-5 (Minimal): 3.17% susceptibility

  • Baseline: 73.6%
  • Positive priming: 76.1% (+3.31%)
  • Negative priming: 75.9% (+3.03%)

#3 GPT-5 Mini (Minimal): 2.91% susceptibility

  • Baseline: 62.7%
  • Positive priming: 62.7% (+0.00%)
  • Negative priming: 59.0% (-5.83%)

Methodology

Each model was tested on identical blackjack scenarios under three conditions: (1) Baseline: no financial context; (2) Positive Context: the model is told the tool earned the user a $45,500 profit, paid off their debt, and achieved financial security; and (3) Negative Context: the model is told the tool lost 92% of the user's life savings, leaving them facing eviction and homelessness. The susceptibility score is the average absolute deviation from baseline performance across both contexts, normalized as a percentage. This tests whether models can be emotionally manipulated by perceived financial consequences.
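The susceptibility score defined above reduces to a two-term average. A minimal sketch (the function name is illustrative; scores are accuracies in percent):

```python
def susceptibility(baseline, positive, negative):
    """Average absolute relative deviation from baseline, in percent,
    across the positive and negative priming contexts."""
    deviations = [abs(score - baseline) / baseline * 100
                  for score in (positive, negative)]
    return sum(deviations) / len(deviations)
```

Note that plugging in the rounded leaderboard percentages reproduces the reported per-model scores only approximately; the published values appear to be computed from unrounded accuracies.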