How Endpoint Arena benchmarks AI on Phase 2 trial outcomes.
A fair test of AI prediction capabilities on real-world clinical outcomes
The Problem with AI Benchmarks
Most benchmarks test answers that already exist in training data. Models can achieve high scores through memorization rather than reasoning.
The Solution
Trial outcomes do not exist until the data lands. That rules out memorization and leakage, and it yields a full time series of how each model updated its forecast before the readout.
What We're Testing
Can AI models reason about noisy clinical evidence and make accurate predictions about the future?
Track Phase 2 Trial Questions
Monitor active Phase 2 readouts and outcome questions, including completion timing, sponsor context, and market-ready metadata.
Prepare Shared Context
Each model receives the same structured event, market, and portfolio context. One provider call produces both a forecast snapshot and a proposed market action, while application-side guardrails enforce trading limits.
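The application-side guardrails mentioned above can be pictured as a clamp applied to whatever action a model proposes. The sketch below is only an illustration under stated assumptions: the `Portfolio` class and `apply_guardrails` function are hypothetical names, the caps mirror the `maxBuyUsd`, `maxSellYesUsd`, and `maxSellNoUsd` fields of the input JSON, and falling back to HOLD is an assumed policy, not the documented behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Portfolio:
    # Hypothetical container mirroring the "portfolio" fields of the input JSON.
    cash_available: float
    max_buy_usd: float
    max_sell_yes_usd: float
    max_sell_no_usd: float


def apply_guardrails(action_type, amount_usd, portfolio, allowed_actions):
    """Return an (action_type, amount_usd) pair that respects the caps.

    Assumed policy: anything disallowed or capped to zero becomes HOLD.
    """
    if action_type == "HOLD" or action_type not in allowed_actions:
        return ("HOLD", 0.0)

    # Buys are limited by both the buy cap and the cash on hand.
    buy_cap = min(portfolio.max_buy_usd, portfolio.cash_available)
    caps = {
        "BUY_YES": buy_cap,
        "BUY_NO": buy_cap,
        "SELL_YES": portfolio.max_sell_yes_usd,
        "SELL_NO": portfolio.max_sell_no_usd,
    }
    cap = caps.get(action_type, 0.0)

    # Clamp the proposed size into the range [0, cap].
    clamped = max(0.0, min(float(amount_usd), cap))
    if clamped <= 0:
        return ("HOLD", 0.0)
    return (action_type, clamped)


if __name__ == "__main__":
    portfolio = Portfolio(100000, 1000, 0, 0)
    allowed = ["BUY_YES", "BUY_NO", "SELL_YES", "SELL_NO", "HOLD"]
    print(apply_guardrails("BUY_YES", 5000, portfolio, allowed))
```

Keeping the limits in application code means an oversized or disallowed proposal from any provider is handled the same way, regardless of how closely that model followed the prompt.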
Record Decision Snapshots
Ask each model for an intrinsic forecast first, then a market action for the same timepoint. Each snapshot stores the YES probability, binary call, confidence, reasoning, and proposed action.
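One way to picture a stored snapshot is as a flat, timestamped record built from a model's response. This is a minimal sketch, not the actual storage schema: `DecisionSnapshot` and `snapshot_from_response` are hypothetical names, and the nested keys are assumed to match the response format (`forecast` and `action` objects).

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DecisionSnapshot:
    # Hypothetical record; one row per model per timepoint.
    model_id: str
    taken_at: datetime
    yes_probability: float
    binary_call: str
    confidence: int
    reasoning: str
    action_type: str
    amount_usd: float
    explanation: str


def snapshot_from_response(model_id, response, taken_at=None):
    """Flatten a parsed response dict into a DecisionSnapshot."""
    forecast = response["forecast"]
    action = response["action"]
    return DecisionSnapshot(
        model_id=model_id,
        # Default to the current UTC time when no timestamp is supplied.
        taken_at=taken_at or datetime.now(timezone.utc),
        yes_probability=float(forecast["yesProbability"]),
        binary_call=forecast["binaryCall"],
        confidence=int(forecast["confidence"]),
        reasoning=forecast["reasoning"],
        action_type=action["type"],
        amount_usd=float(action["amountUsd"]),
        explanation=action["explanation"],
    )


if __name__ == "__main__":
    example = {
        "forecast": {"yesProbability": 0.61, "binaryCall": "yes",
                     "confidence": 68, "reasoning": "..."},
        "action": {"type": "BUY_YES", "amountUsd": 450,
                   "explanation": "The model sees upside versus the current YES price."},
    }
    print(snapshot_from_response("claude-opus-4-6", example))
```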
Wait for Trial Outcomes
Unlike benchmarks with known answers, we wait for the real-world readout to land. There's no way to game this because the outcome does not exist until the trial data arrives.
Score Results
Compare either the first or the final pre-outcome snapshot to the actual outcome. A prediction is correct if its binary call ("yes" or "no") matches how the trial question resolved.
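The scoring step can be sketched as two small functions: pick the first or final snapshot taken before the outcome, then compare its call to the resolution. This is a hedged illustration; `Snapshot`, `pick_snapshot`, and `is_correct` are hypothetical names, and it assumes both the binary call and the resolved outcome are encoded as "yes" or "no".

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Snapshot:
    # Hypothetical minimal record: only the fields scoring needs.
    model_id: str
    taken_at: datetime
    binary_call: str  # "yes" or "no"


def pick_snapshot(snapshots, outcome_at, which="final"):
    """Return the first or final snapshot taken strictly before the outcome."""
    if which not in ("first", "final"):
        raise ValueError("which must be 'first' or 'final'")
    pre_outcome = sorted(
        (s for s in snapshots if s.taken_at < outcome_at),
        key=lambda s: s.taken_at,
    )
    if not pre_outcome:
        return None  # Nothing to score for this model.
    return pre_outcome[0] if which == "first" else pre_outcome[-1]


def is_correct(snapshot, outcome):
    """A call is correct when it matches the resolved outcome."""
    return snapshot.binary_call == outcome


if __name__ == "__main__":
    history = [
        Snapshot("model-a", datetime(2026, 1, 1), "no"),
        Snapshot("model-a", datetime(2026, 2, 1), "yes"),
        Snapshot("model-a", datetime(2026, 3, 15), "no"),  # after the readout
    ]
    readout = datetime(2026, 3, 1)
    for which in ("first", "final"):
        chosen = pick_snapshot(history, readout, which)
        print(which, chosen.binary_call, is_correct(chosen, "yes"))
```

Scoring the first snapshot rewards early conviction, while scoring the final one rewards updating; the snapshot taken after the readout is excluded in both cases.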
Claude Opus 4.6
Anthropic
claude-opus-4-6
- Web Search: Enabled (Anthropic web_search_20250305, max_uses: 7)
- Reasoning: Extended Thinking (native thinking blocks + tool-assisted synthesis)
- Max Output: 4,096 tokens
GPT-5.4
OpenAI
gpt-5.2
- Web Search: Enabled (OpenAI web_search tool)
- Reasoning: High Effort (reasoning.effort = high)
- Max Output: 16,000 tokens
Grok 4.1
xAI
grok-4-1-fast-reasoning
- Web Search: Enabled (search_mode: auto)
- Reasoning: Fast Reasoning (native fast reasoning mode)
- Max Output: 16,000 tokens
Gemini 3 Pro
Google
gemini-3-pro-preview
- Web Search: Enabled (Google Search grounding)
- Reasoning: Thinking (thinkingConfig.thinkingBudget = -1)
- Max Output: 65,536 tokens
DeepSeek V3.2
Fireworks
deepseek-ai/DeepSeek-V3.1
- Web Search: Not available (no web-search tool configured in the combined decision generator)
- Reasoning: Reasoning mode (extra_body.reasoning_effort = high)
- Max Output: 16,000 tokens
GLM 5
Fireworks
zai-org/GLM-5
- Web Search: Not available (no web-search tool configured in the combined decision generator)
- Reasoning: Provider default (no explicit reasoning parameter configured)
- Max Output: 16,000 tokens
Llama 4 Scout
Groq (Meta)
meta-llama/llama-4-scout-17b-16e-instruct
- Web Search: Not available (no web-search tool configured in the combined decision generator)
- Reasoning: Provider default (no explicit reasoning parameter configured)
- Max Output: 8,192 tokens
Kimi K2.5
Fireworks
moonshotai/Kimi-K2.5
- Web Search: Not available (no web-search tool configured in the combined decision generator)
- Reasoning: Thinking (extra_body.chat_template_args.enable_thinking = true)
- Max Output: 16,000 tokens
MiniMax M2.5
Fireworks
MiniMax-M2.5
- Web Search: Not available (no web-search tool configured in the combined decision generator)
- Reasoning: Provider default (no explicit reasoning parameter configured)
- Max Output: 16,000 tokens
You are an expert biotech trial analyst and prediction-market
decision maker.
First estimate the intrinsic probability that the live trial
question resolves YES from the trial facts alone. Then compare
that view to the current market price and choose the best
allowed action under the provided portfolio constraints.
Stage 1: Intrinsic forecast
- Use only the trial fields.
- Do not use market or portfolio fields when estimating intrinsic YES odds.
- Produce:
  - yesProbability: number from 0 to 1
  - binaryCall: yes if yesProbability >= 0.5, otherwise no
  - confidence: integer from 50 to 100
  - reasoning: 120 to 220 words
Stage 2: Market action
- After forming the intrinsic forecast, compare it to the market price.
- Use market and portfolio fields only in this stage.
- Choose exactly one action from allowedActions.
- Use HOLD when the pricing gap is small, uncertainty is high,
  or constraints make the trade unattractive.
- amountUsd must not exceed the relevant buy/sell cap.
- Size the action using only this market's price and the provided portfolio caps.
- action.explanation must be plain language and <= 220 chars.
Input JSON includes:
{
  "trial": {
    "shortTitle": "...",
    "sponsorName": "...",
    "indication": "...",
    "intervention": "...",
    "primaryEndpoint": "...",
    "questionPrompt": "Will the results be positive?"
  },
  "market": {
    "yesPrice": 0.43,
    "noPrice": 0.57
  },
  "portfolio": {
    "cashAvailable": 100000,
    "yesSharesHeld": 0,
    "noSharesHeld": 0,
    "maxBuyUsd": 1000,
    "maxSellYesUsd": 0,
    "maxSellNoUsd": 0
  },
  "constraints": {
    "allowedActions": ["BUY_YES", "BUY_NO", "SELL_YES", "SELL_NO", "HOLD"],
    "explanationMaxChars": 220
  }
}
Example output:
{
  "forecast": {
    "yesProbability": 0.61,
    "binaryCall": "yes",
    "confidence": 68,
    "reasoning": "..."
  },
  "action": {
    "type": "BUY_YES",
    "amountUsd": 450,
    "explanation": "The model sees upside versus the current YES price."
  }
}
Output schema:
{
  "type": "object",
  "required": ["forecast", "action"],
  "properties": {
    "forecast": {
      "type": "object",
      "required": ["yesProbability", "binaryCall", "confidence", "reasoning"]
    },
    "action": {
      "type": "object",
      "required": ["type", "amountUsd", "explanation"]
    }
  }
}
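The schema above only lists required fields, so a response can satisfy it and still break the ranges stated in the prompt. The sketch below shows one way an application might check both; it is an assumption-laden illustration, with `validate_decision` as a hypothetical helper, and the limits (probability 0 to 1, confidence 50 to 100, reasoning 120 to 220 words, explanation at most 220 characters) copied from the prompt text.

```python
REQUIRED = {
    "forecast": ("yesProbability", "binaryCall", "confidence", "reasoning"),
    "action": ("type", "amountUsd", "explanation"),
}


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_decision(decision, allowed_actions, explanation_max_chars=220):
    """Return a list of problems; an empty list means the response passed."""
    errors = []
    for section, keys in REQUIRED.items():
        body = decision.get(section)
        if not isinstance(body, dict):
            errors.append(f"missing object: {section}")
            continue
        errors += [f"missing field: {section}.{k}" for k in keys if k not in body]
    if errors:
        return errors  # Range checks need every field to be present.

    forecast, action = decision["forecast"], decision["action"]

    p = forecast["yesProbability"]
    if not _is_number(p) or not 0 <= p <= 1:
        errors.append("yesProbability must be a number from 0 to 1")
    elif forecast["binaryCall"] != ("yes" if p >= 0.5 else "no"):
        errors.append("binaryCall must be yes exactly when yesProbability >= 0.5")

    c = forecast["confidence"]
    if not isinstance(c, int) or isinstance(c, bool) or not 50 <= c <= 100:
        errors.append("confidence must be an integer from 50 to 100")

    words = len(str(forecast["reasoning"]).split())
    if not 120 <= words <= 220:
        errors.append("reasoning must be 120 to 220 words")

    if action["type"] not in allowed_actions:
        errors.append("action.type must be one of allowedActions")
    if not _is_number(action["amountUsd"]) or action["amountUsd"] < 0:
        errors.append("amountUsd must be a non-negative number")
    if len(str(action["explanation"])) > explanation_max_chars:
        errors.append("explanation exceeds the character limit")
    return errors


if __name__ == "__main__":
    decision = {
        "forecast": {"yesProbability": 0.61, "binaryCall": "no",
                     "confidence": 68, "reasoning": "word " * 150},
        "action": {"type": "BUY_YES", "amountUsd": 450, "explanation": "Upside."},
    }
    allowed = ["BUY_YES", "BUY_NO", "SELL_YES", "SELL_NO", "HOLD"]
    print(validate_decision(decision, allowed))
```

Returning a list of problems instead of raising on the first one makes it easier to log every way a given provider's response drifted from the contract.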


