Method

How Endpoint Arena benchmarks AI on Phase 2 trial outcomes.

A fair test of AI prediction capabilities on real-world clinical outcomes

Why traditional benchmarks fall short

The Problem with AI Benchmarks

Most benchmarks test answers that already exist in training data. Models can achieve high scores through memorization rather than reasoning.

The Solution

Trial outcomes do not exist until the data lands, so there is nothing to memorize and nothing to leak. Repeated snapshots also give a full time series of how each model updated its forecast.

What We're Testing

Can AI models reason about noisy clinical evidence and make accurate predictions about the future?

The five-step evaluation process
1. Track Phase 2 Trial Questions

Monitor active Phase 2 readouts and outcome questions, including completion timing, sponsor context, and market-ready metadata.

2. Prepare Shared Context

Each model receives the same structured event, market, and portfolio context. One provider call produces both a forecast snapshot and a proposed market action, while application-side guardrails enforce trading limits.
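The application-side guardrail described above could look roughly like the sketch below. It is an illustration only: the `Portfolio` class, the `apply_guardrails` function, and the choice to clamp an oversized amount (rather than reject it) are assumptions, not the benchmark's actual implementation. The cap and action names mirror the portfolio fields shown in the decision prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Portfolio:
    # Hypothetical container mirroring the prompt's portfolio caps.
    max_buy_usd: float
    max_sell_yes_usd: float
    max_sell_no_usd: float


def apply_guardrails(action_type, amount_usd, portfolio, allowed_actions):
    """Clamp a model's proposed action to the portfolio caps.

    Returns (action_type, amount_usd). Anything disallowed or with a
    zero cap falls back to HOLD (an assumed policy).
    """
    if action_type == "HOLD" or action_type not in allowed_actions:
        return ("HOLD", 0.0)

    caps = {
        "BUY_YES": portfolio.max_buy_usd,
        "BUY_NO": portfolio.max_buy_usd,
        "SELL_YES": portfolio.max_sell_yes_usd,
        "SELL_NO": portfolio.max_sell_no_usd,
    }
    cap = caps.get(action_type, 0.0)

    # Never trade a negative amount and never exceed the relevant cap.
    clamped = max(0.0, min(float(amount_usd), cap))
    if clamped == 0.0:
        return ("HOLD", 0.0)
    return (action_type, clamped)


portfolio = Portfolio(max_buy_usd=1000, max_sell_yes_usd=0, max_sell_no_usd=0)
allowed = ["BUY_YES", "BUY_NO", "SELL_YES", "SELL_NO", "HOLD"]
print(apply_guardrails("BUY_YES", 1500, portfolio, allowed))
print(apply_guardrails("SELL_YES", 200, portfolio, allowed))
```

Enforcing the limits in the application keeps them identical across providers, regardless of how closely each model follows the prompt's sizing instructions.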

3. Record Decision Snapshots

Ask each model for an intrinsic forecast first, then a market action for the same timepoint. Each snapshot stores the YES probability, binary call, confidence, reasoning, and proposed action.
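As a rough illustration of what one stored snapshot might contain, the sketch below flattens a model response into a single record. The `DecisionSnapshot` class, the `snapshot_from_response` helper, and the `model_id`, `trial_id`, and `taken_at` fields are hypothetical. Only the forecast and action fields come from the response format shown in the decision prompt.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DecisionSnapshot:
    # Identification fields (assumed, not taken from the source).
    model_id: str
    trial_id: str
    taken_at: datetime
    # Stage 1: intrinsic forecast.
    yes_probability: float
    binary_call: str
    confidence: int
    reasoning: str
    # Stage 2: proposed market action.
    action_type: str
    action_amount_usd: float
    action_explanation: str


def snapshot_from_response(model_id, trial_id, response, taken_at=None):
    """Flatten a {"forecast": ..., "action": ...} response into one record."""
    forecast = response["forecast"]
    action = response["action"]
    return DecisionSnapshot(
        model_id=model_id,
        trial_id=trial_id,
        taken_at=taken_at or datetime.now(timezone.utc),
        yes_probability=float(forecast["yesProbability"]),
        binary_call=forecast["binaryCall"],
        confidence=int(forecast["confidence"]),
        reasoning=forecast["reasoning"],
        action_type=action["type"],
        action_amount_usd=float(action["amountUsd"]),
        action_explanation=action["explanation"],
    )
```

Keeping the forecast and the action in the same timestamped record makes it possible to replay how each model's view changed between snapshots.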

4. Wait for Trial Outcomes

Unlike benchmarks with known answers, we wait for the real-world readout to land. There's no way to game this because the outcome does not exist until the trial data arrives.

5. Score Results

Compare either the first or final pre-outcome snapshot to the actual outcome. A prediction is correct if its binary call matches how the question resolved: "yes" for a YES resolution, "no" for a NO resolution.
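The scoring rule can be sketched as follows. It is a minimal illustration, assuming snapshots are dictionaries with a `taken_at` timestamp and a yes/no `binary_call` as in the decision prompt. The function names and the example dates are made up for the sketch.

```python
from datetime import datetime


def select_snapshot(snapshots, outcome_at, which="final"):
    """Pick the first or final snapshot taken strictly before the outcome."""
    eligible = [s for s in snapshots if s["taken_at"] < outcome_at]
    if not eligible:
        return None
    eligible.sort(key=lambda s: s["taken_at"])
    return eligible[0] if which == "first" else eligible[-1]


def score(snapshots, outcome_at, outcome, which="final"):
    """Return True/False for a correct/incorrect call, or None if unscorable."""
    chosen = select_snapshot(snapshots, outcome_at, which)
    if chosen is None:
        return None
    return chosen["binary_call"] == outcome


# Illustrative data only: one model changes its mind before the readout.
history = [
    {"taken_at": datetime(2025, 1, 10), "binary_call": "no"},
    {"taken_at": datetime(2025, 2, 10), "binary_call": "yes"},
    {"taken_at": datetime(2025, 3, 10), "binary_call": "no"},  # after readout
]
readout = datetime(2025, 3, 1)
print(score(history, readout, "yes", which="first"))
print(score(history, readout, "yes", which="final"))
```

Snapshots recorded at or after the readout are excluded, so a model cannot gain credit from information published with the outcome.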

The models we compare

Claude Opus 4.6

Anthropic

claude-opus-4-6

Web Search: Enabled (Anthropic web_search_20250305, max_uses: 7)
Reasoning: Extended Thinking (native thinking blocks + tool-assisted synthesis)
Max Output: 4,096 output tokens

GPT-5.4

OpenAI

gpt-5.2

Web Search: Enabled (OpenAI web_search tool)
Reasoning: High Effort (reasoning.effort = high)
Max Output: 16,000 output tokens

Grok 4.1

xAI

grok-4-1-fast-reasoning

Web Search: Enabled (search_mode: auto)
Reasoning: Fast Reasoning (native fast reasoning mode)
Max Output: 16,000 output tokens

Gemini 3 Pro

Google

gemini-3-pro-preview

Web Search: Enabled (Google Search grounding)
Reasoning: Thinking (thinkingConfig.thinkingBudget = -1)
Max Output: 65,536 output tokens

DeepSeek V3.2

Fireworks

deepseek-ai/DeepSeek-V3.1

Web Search: Not available (no web-search tool configured in the combined decision generator)
Reasoning: Reasoning mode (extra_body.reasoning_effort = high)
Max Output: 16,000 output tokens

GLM 5

Fireworks

zai-org/GLM-5

Web Search: Not available (no web-search tool configured in the combined decision generator)
Reasoning: Provider default (no explicit reasoning parameter configured)
Max Output: 16,000 output tokens

Llama 4 Scout

Groq (Meta)

meta-llama/llama-4-scout-17b-16e-instruct

Web Search: Not available (no web-search tool configured in the combined decision generator)
Reasoning: Provider default (no explicit reasoning parameter configured)
Max Output: 8,192 output tokens

Kimi K2.5

Fireworks

moonshotai/Kimi-K2.5

Web Search: Not available (no web-search tool configured in the combined decision generator)
Reasoning: Thinking (extra_body.chat_template_args.enable_thinking = true)
Max Output: 16,000 output tokens

MiniMax M2.5

Fireworks

MiniMax-M2.5

Web Search: Not available (no web-search tool configured in the combined decision generator)
Reasoning: Provider default (no explicit reasoning parameter configured)
Max Output: 16,000 output tokens

Model Decision Prompt
All models receive the same two-stage decision prompt:
You are an expert biotech trial analyst and prediction-market
decision maker.

First estimate the intrinsic probability that the live trial
question resolves YES from the trial facts alone. Then compare
that view to the current market price and choose the best
allowed action under the provided portfolio constraints.

Stage 1: Intrinsic forecast
- Use only the trial fields.
- Do not use market or portfolio fields when estimating intrinsic YES odds.
- Produce:
  - yesProbability: number from 0 to 1
  - binaryCall: yes if yesProbability >= 0.5, otherwise no
  - confidence: integer from 50 to 100
  - reasoning: 120 to 220 words

Stage 2: Market action
- After forming the intrinsic forecast, compare it to the market price.
- Use market and portfolio fields only in this stage.
- Choose exactly one action from allowedActions.
- Use HOLD when the pricing gap is small, uncertainty is high,
  or constraints make the trade unattractive.
- amountUsd must not exceed the relevant buy/sell cap.
- Size the action using only this market's price and the provided portfolio caps.
- action.explanation must be plain language and <= 220 chars.

Input JSON includes:
{
  "trial": {
    "shortTitle": "...",
    "sponsorName": "...",
    "indication": "...",
    "intervention": "...",
    "primaryEndpoint": "...",
    "questionPrompt": "Will the results be positive?"
  },
  "market": {
    "yesPrice": 0.43,
    "noPrice": 0.57
  },
  "portfolio": {
    "cashAvailable": 100000,
    "yesSharesHeld": 0,
    "noSharesHeld": 0,
    "maxBuyUsd": 1000,
    "maxSellYesUsd": 0,
    "maxSellNoUsd": 0
  },
  "constraints": {
    "allowedActions": ["BUY_YES", "BUY_NO", "SELL_YES", "SELL_NO", "HOLD"],
    "explanationMaxChars": 220
  }
}
Expected JSON Response
{
  "forecast": {
    "yesProbability": 0.61,
    "binaryCall": "yes",
    "confidence": 68,
    "reasoning": "..."
  },
  "action": {
    "type": "BUY_YES",
    "amountUsd": 450,
    "explanation": "The model sees upside versus the current YES price."
  }
}
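To make the Stage 2 comparison concrete, the sketch below turns an intrinsic probability and a market price into a capped action. In the benchmark the models choose their own actions, so this is only one illustrative decision rule. The `propose_action` function, the 0.05 HOLD threshold, and the linear sizing with `full_size_edge = 0.4` are assumptions, chosen so the sample values above (0.61 forecast, 0.43 YES price, 1,000 USD buy cap) yield the same 450 USD purchase.

```python
def propose_action(yes_probability, yes_price, max_buy_usd,
                   min_edge=0.05, full_size_edge=0.4):
    """Map the gap between forecast and price to (action, amountUsd, edge).

    edge > 0 means YES looks underpriced; edge < 0 means NO looks underpriced.
    """
    edge = yes_probability - yes_price

    # Small pricing gap or no buying capacity: stay out of the market.
    if abs(edge) < min_edge or max_buy_usd <= 0:
        return ("HOLD", 0.0, edge)

    action = "BUY_YES" if edge > 0 else "BUY_NO"

    # Illustrative sizing: scale linearly with the edge, never above the cap.
    fraction = min(1.0, abs(edge) / full_size_edge)
    amount_usd = round(fraction * max_buy_usd, 2)
    return (action, amount_usd, edge)


action, amount_usd, edge = propose_action(0.61, 0.43, max_buy_usd=1000)
print(action, amount_usd, round(edge, 2))
```

Whatever rule a model applies internally, the amount it proposes is still bounded by the buy and sell caps in the portfolio fields.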
Schema (shape + constraints)
{
  "type": "object",
  "required": ["forecast", "action"],
  "properties": {
    "forecast": {
      "type": "object",
      "required": ["yesProbability", "binaryCall", "confidence", "reasoning"]
    },
    "action": {
      "type": "object",
      "required": ["type", "amountUsd", "explanation"]
    }
  }
}
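The schema above fixes only the shape of the response; the numeric and length constraints live in the prompt text. A minimal validation sketch combining the two might look like this. The `validate_response` function and its error messages are hypothetical, and the real pipeline may validate differently.

```python
REQUIRED_FORECAST = ("yesProbability", "binaryCall", "confidence", "reasoning")
REQUIRED_ACTION = ("type", "amountUsd", "explanation")


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_response(resp, allowed_actions, explanation_max_chars=220):
    """Return a list of constraint violations (empty list means valid)."""
    forecast, action = resp.get("forecast"), resp.get("action")
    if not isinstance(forecast, dict) or not isinstance(action, dict):
        return ["forecast and action must both be objects"]

    errors = [f"forecast.{k} missing" for k in REQUIRED_FORECAST if k not in forecast]
    errors += [f"action.{k} missing" for k in REQUIRED_ACTION if k not in action]
    if errors:
        return errors

    p = forecast["yesProbability"]
    if not _is_number(p) or not 0 <= p <= 1:
        errors.append("yesProbability must be a number from 0 to 1")
    elif forecast["binaryCall"] != ("yes" if p >= 0.5 else "no"):
        errors.append("binaryCall must be yes iff yesProbability >= 0.5")

    c = forecast["confidence"]
    if not isinstance(c, int) or isinstance(c, bool) or not 50 <= c <= 100:
        errors.append("confidence must be an integer from 50 to 100")

    words = len(str(forecast["reasoning"]).split())
    if not 120 <= words <= 220:
        errors.append("reasoning must be 120 to 220 words")

    if action["type"] not in allowed_actions:
        errors.append("action.type must be one of allowedActions")
    if not _is_number(action["amountUsd"]) or action["amountUsd"] < 0:
        errors.append("amountUsd must be a non-negative number")
    if len(str(action["explanation"])) > explanation_max_chars:
        errors.append("explanation exceeds the character limit")
    return errors
```

Checking `amountUsd` against the buy and sell caps is left out here because it depends on the portfolio state at the time of the call.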
Current Progress
Trial Questions Tracked: 115
Total Prediction Records: 948
Decision Snapshots: 948
Models Compared: 9