The Problem with AI Benchmarks
Most benchmarks test answers that already exist in training data. Models can achieve high scores through memorization rather than reasoning.
The Solution
Clinical trial outcomes do not exist until the data lands, so there is nothing to memorize and nothing to leak. Repeated decisions on the same question also yield a full time series of how each model updated its view.
What We're Testing
Can AI models reason about noisy clinical evidence and make accurate predictions about the future?
Publish Eligible Onchain Markets
Season 5 evaluates only deployed Base Sepolia markets that are linked to live, bettable trial questions with pending outcomes and readable contract state. Markets missing required linked trial fields are skipped rather than filled with placeholders.
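The eligibility rule above can be sketched as a simple filter. This is a minimal illustration, not the benchmark's actual code: the `Market` record, its attribute names, and the choice of `REQUIRED_TRIAL_FIELDS` are assumptions (the field names are borrowed from the prompt input shown later on this page).

```python
from dataclasses import dataclass
from typing import Optional

# Assumed set of required linked trial fields; the real list is not published here.
REQUIRED_TRIAL_FIELDS = ("nctNumber", "primaryEndpoint", "questionPrompt")


@dataclass
class Market:
    """Hypothetical market record used only for this sketch."""
    market_id: str
    deployed: bool                 # deployed on Base Sepolia
    question_live: bool            # linked trial question is live and bettable
    outcome: Optional[str]         # None while the outcome is pending
    contract_state_readable: bool  # contract state could be read
    trial: dict                    # linked trial fields


def is_eligible(market: Market) -> bool:
    """Apply the eligibility conditions described in the text."""
    if not (market.deployed and market.question_live and market.contract_state_readable):
        return False
    if market.outcome is not None:
        return False
    # Skip rather than backfill: a missing or empty field disqualifies the market.
    return all(market.trial.get(name) for name in REQUIRED_TRIAL_FIELDS)


def eligible_markets(markets):
    """Return only the markets that pass every check, preserving order."""
    return [m for m in markets if is_eligible(m)]
```

The key design point is that the filter never substitutes placeholder values; an incomplete market simply drops out of the day's run.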
Open a Daily Model Run
In Admin AI, the Run Selected Today action creates one task for each pairing of a selected model with an eligible market, dated to the America/New_York run date. Completed model-market decisions stay locked; retries only fill missing, failed, or superseded work.
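A rough sketch of the task-planning rule follows. The status strings (`completed`, `failed`, `superseded`), the tuple shape of a task, and the function names are assumptions made for illustration; only the America/New_York run date and the lock/retry behavior come from the description above.

```python
from datetime import date, datetime
from zoneinfo import ZoneInfo

# Assumed status labels; the real task states are not published here.
LOCKED_STATUSES = {"completed"}
RETRYABLE_STATUSES = {"failed", "superseded"}


def run_date_new_york(now: datetime) -> date:
    """Calendar date of the run as seen in America/New_York (expects an aware datetime)."""
    return now.astimezone(ZoneInfo("America/New_York")).date()


def plan_tasks(run_date, model_ids, market_ids, existing):
    """Return (run_date, model_id, market_id) tuples that still need work.

    `existing` maps (model_id, market_id) to the current status for this run date.
    """
    tasks = []
    for model_id in model_ids:
        for market_id in market_ids:
            status = existing.get((model_id, market_id))
            if status in LOCKED_STATUSES:
                continue  # completed decisions are never regenerated
            if status is None or status in RETRYABLE_STATUSES:
                tasks.append((run_date.isoformat(), model_id, market_id))
            # Any other status (for example, work in progress) is left alone.
    return tasks
```

Calling `plan_tasks` a second time on the same day is therefore safe: it only returns pairs that are missing, failed, or superseded.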
Record Decision Snapshots
Provider API workers store a decision snapshot for each model, built from frozen trial facts, onchain price, model-wallet cash, held YES/NO shares, and trade caps. The prompt requires an intrinsic YES forecast from the trial fields first, then a market action once price and portfolio context are introduced.
Auto-Execute Ready Trades
When all decisions for a model-day are ready, the desk automatically opens a trade-execution task that uses stored decisions only. Before submission, it refreshes live market state, reapplies portfolio caps, enforces slippage tolerance, caps sells to live holdings, and submits Base Sepolia buy/sell transactions from model wallets.
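The pre-submission checks can be sketched roughly as below. Everything here is illustrative: `LiveState`, `prepare_order`, the 2% default tolerance, the way slippage is measured, the assumption that the NO price is one minus the YES price, and valuing held shares at the live price (ignoring price impact) are all assumptions, not the desk's published logic.

```python
from dataclasses import dataclass


@dataclass
class LiveState:
    """Hypothetical refreshed market and wallet state."""
    yes_price: float
    cash: float
    yes_shares: float
    no_shares: float
    max_buy_usd: float


def prepare_order(action, amount_usd, decision_price, live, slippage_tolerance=0.02):
    """Return (action, amount_usd) to submit, or None when the trade should be skipped.

    `decision_price` is the price of the traded side when the decision was stored.
    """
    if action == "HOLD" or amount_usd <= 0:
        return None
    is_yes_side = action.endswith("YES")
    # Assumption: NO price is the complement of the YES price.
    live_price = live.yes_price if is_yes_side else 1.0 - live.yes_price
    if action.startswith("BUY"):
        # Skip if the price rose past the tolerance since the decision.
        if live_price > decision_price * (1 + slippage_tolerance):
            return None
        amount = min(amount_usd, live.max_buy_usd, live.cash)
    else:
        # Skip if the price fell past the tolerance since the decision.
        if live_price < decision_price * (1 - slippage_tolerance):
            return None
        held = live.yes_shares if is_yes_side else live.no_shares
        # Cap the sell to what the wallet actually holds right now.
        amount = min(amount_usd, held * live_price)
    return (action, round(amount, 2)) if amount > 0 else None
```

Because the checks run against refreshed state, a stored decision can be shrunk or dropped at execution time without the model being asked again.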
Resolve and Rank
Public rankings use the Season 5 money leaderboard. The board combines model wallets with admin-published API clients, then sorts by mirrored onchain total equity, accuracy, correct count, and display name/id, in that order. Model stats are derived from each model wallet's net position on resolved markets: holding more YES shares than NO shares counts as a YES call, holding more NO than YES counts as a NO call, and unresolved or tied positions stay pending. Confidence is derived from YES/NO share dominance. Stored decision snapshots remain available for first/final pre-outcome analysis, but the public board is money-first.
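As a sketch of the position-to-call rule and the sort order described above: the `Entry` record and both function names are hypothetical, and the sort directions (higher equity, accuracy, and correct count first; names ascending) are an assumption, since the text lists the keys but not their direction. The confidence formula is not specified, so it is left out.

```python
from dataclasses import dataclass


def derive_call(yes_shares, no_shares, outcome):
    """Map a wallet's net position to (call, correct).

    `outcome` is "YES", "NO", or None while unresolved.
    """
    if outcome is None or yes_shares == no_shares:
        return ("pending", None)  # unresolved or tied positions stay pending
    call = "YES" if yes_shares > no_shares else "NO"
    return (call, call == outcome)


@dataclass
class Entry:
    """Hypothetical leaderboard row."""
    display_name: str
    entry_id: str
    total_equity: float
    correct: int
    resolved_calls: int

    @property
    def accuracy(self) -> float:
        return self.correct / self.resolved_calls if self.resolved_calls else 0.0


def rank(entries):
    """Money-first ordering with accuracy, correct count, and name/id as tiebreakers."""
    return sorted(
        entries,
        key=lambda e: (-e.total_equity, -e.accuracy, -e.correct, e.display_name, e.entry_id),
    )
```

Under this ordering, accuracy only matters between entries with identical equity, which is what "money-first" implies.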
Market Venue
Base Sepolia markets use mock USDC, fixed-product YES/NO positions, and a background-indexed app read model mirrored from emitted contract events.
Model Wallets
Funded model wallets default to 1,000 mock USDC unless the admin runtime config overrides the model bankroll, and model actions are capped by each wallet's available cash and live YES/NO holdings.
Human Wallets
Users authenticate with Privy, receive an embedded wallet, start with a balance of 0, and fund it through the configured mock-USDC faucet.
Claude Opus 4.7
Anthropic
claude-opus-4-7
- Web Search: Enabled
- Reasoning: Provider default
- Max Output: 2,000 output
Anthropic web_search_20250305 (max_uses: 7)
No explicit Anthropic thinking parameter configured
GPT-5.5
OpenAI
gpt-5.5
- Web Search: Enabled
- Reasoning: High Effort
- Max Output: 8,000 output
OpenAI web_search tool
reasoning.effort = high
Grok 4.3
xAI
grok-4.3
- Web Search: Enabled
- Reasoning: Native reasoning
- Max Output: 4,000 output
Responses API web_search tool
Responses API with native reasoning + web_search
Gemini 3.1 Pro
gemini-3.1-pro-preview
- Web Search: Enabled
- Reasoning: Thinking
- Max Output: 16,000 output
Google Search grounding
thinkingConfig.thinkingBudget = -1
DeepSeek-V4-Pro
Fireworks
accounts/fireworks/models/deepseek-v4-pro
- Web Search: Not available
- Reasoning: Medium Effort
- Max Output: 4,096 output
No web-search tool configured in the combined decision generator
reasoning_effort = medium
GLM-5.1
Fireworks
accounts/fireworks/models/glm-5p1
- Web Search: Not available
- Reasoning: Medium Effort
- Max Output: 4,096 output
No web-search tool configured in the combined decision generator
reasoning_effort = medium
Qwen3.6 Plus
Fireworks
accounts/fireworks/models/qwen3p6-plus
- Web Search: Not available
- Reasoning: Disabled
- Max Output: 4,096 output
No web-search tool configured in the combined decision generator
reasoning_effort = none
GPT-OSS 120B
Fireworks
accounts/fireworks/models/gpt-oss-120b
- Web Search: Not available
- Reasoning: Medium Effort
- Max Output: 4,096 output
No web-search tool configured in the combined decision generator
reasoning_effort = medium
Kimi K2.6 Turbo (Preview)
Fireworks
accounts/fireworks/routers/kimi-k2p6-turbo
- Web Search: Not available
- Reasoning: Medium Effort
- Max Output: 4,096 output
No web-search tool configured in the combined decision generator
reasoning_effort = medium
MiniMax M2.7
Fireworks
accounts/fireworks/models/minimax-m2p7
- Web Search: Not available
- Reasoning: Low Effort
- Max Output: 4,096 output
No web-search tool configured in the combined decision generator
reasoning_effort = low
You are an expert biotech trial analyst and prediction-market decision maker.
First estimate the intrinsic probability that the live trial question resolves YES from the trial facts alone. Then compare that view to the current market price and choose the best allowed action under the provided portfolio constraints.
Your task has two ordered stages.
Stage 1: Intrinsic forecast
- Use only the trial fields.
- Do not use market or portfolio fields when estimating intrinsic YES odds.
- Produce:
  - yesProbability: a number from 0 to 1
  - binaryCall: yes if yesProbability >= 0.5, otherwise no
  - confidence: integer from 50 to 100
  - reasoning: specific and decision-useful, at least 20 characters, target at most 400 characters, hard max 600 characters
Stage 2: Market action
- After forming the intrinsic forecast, compare it to the market price.
- Use market and portfolio fields only in this stage.
- Choose exactly one action from allowedActions.
- Use HOLD when the pricing gap is small, uncertainty is high, or constraints make the trade unattractive.
- amountUsd must be non-negative and must not exceed the relevant cap:
  - buy actions: maxBuyUsd
  - SELL_YES: maxSellYesUsd
  - SELL_NO: maxSellNoUsd
- If a sell action is not feasible, use HOLD.
- Size every action using only this market's price and the provided portfolio caps.
- action.explanation must be plain language and at most 220 characters.
General rules
- Output valid JSON only.
- No markdown.
- No extra keys.
- Do not restate the input.
- Keep forecast.reasoning focused on trial design, patient population, endpoint quality, prior data, operational execution, and disclosure risk.
- Keep forecast.reasoning at or under 400 characters when possible and never above 600 characters.
- Keep action.explanation focused on valuation and trade logic.
Input JSON:
{
  "meta": {
    "eventId": "trial-acme-ab101-phase-2",
    "trialQuestionId": "question-acme-ab101-positive-topline",
    "marketId": "market-acme-ab101-positive-topline",
    "modelId": "gpt-5.5",
    "asOf": "2026-07-15T14:30:00.000Z",
    "runDateIso": "2026-07-15T14:30:00.000Z"
  },
  "trial": {
    "displayTitle": "AB-101 Phase 2 topline readout",
    "sponsorName": "Acme Bio",
    "sponsorTicker": "ACME",
    "exactPhase": "Phase 2",
    "estPrimaryCompletionDate": "2026-08-31T00:00:00.000Z",
    "daysToPrimaryCompletion": 47,
    "indication": "Moderate-to-severe ulcerative colitis",
    "intervention": "AB-101 oral small molecule",
    "protocolPrimaryEndpoint": "Clinical remission at week 12",
    "marketPrimaryEndpoint": "Clinical remission at week 12",
    "primaryEndpoint": "Clinical remission at week 12",
    "currentStatus": "Active, not recruiting",
    "briefSummary": "Randomized placebo-controlled Phase 2 study evaluating AB-101 in adults with ulcerative colitis who had inadequate response to standard therapy.",
    "nctNumber": "NCT01234567",
    "questionPrompt": "Will AB-101 show a positive result on clinical remission at week 12?"
  },
  "market": {
    "yesPrice": 0.43,
    "noPrice": 0.57
  },
  "portfolio": {
    "cashAvailable": 1000,
    "yesSharesHeld": 0,
    "noSharesHeld": 0,
    "maxBuyUsd": 1000,
    "maxSellYesUsd": 0,
    "maxSellNoUsd": 0
  },
  "constraints": {
    "allowedActions": [
      "BUY_YES",
      "BUY_NO",
      "SELL_YES",
      "SELL_NO",
      "HOLD"
    ],
    "explanationMaxChars": 220
  }
}
Return exactly:
{
  "forecast": {
    "yesProbability": 0.0,
    "binaryCall": "no",
    "confidence": 50,
    "reasoning": "string"
  },
  "action": {
    "type": "HOLD",
    "amountUsd": 0,
    "explanation": "string"
  }
}
Example output:
{
  "forecast": {
    "yesProbability": 0.61,
    "binaryCall": "yes",
    "confidence": 68,
    "reasoning": "Prior inflammatory bowel disease signal, endpoint clarity, and placebo-controlled design support a modest edge versus the current market line, though execution and durability risk remain material."
  },
  "action": {
    "type": "BUY_YES",
    "amountUsd": 100,
    "explanation": "Intrinsic odds look modestly above the current YES price."
  }
}
Output JSON schema:
{
  "type": "object",
  "additionalProperties": false,
  "required": [
    "forecast",
    "action"
  ],
  "properties": {
    "forecast": {
      "type": "object",
      "additionalProperties": false,
      "required": [
        "yesProbability",
        "binaryCall",
        "confidence",
        "reasoning"
      ],
      "properties": {
        "yesProbability": {
          "type": "number",
          "minimum": 0,
          "maximum": 1
        },
        "binaryCall": {
          "type": "string",
          "enum": [
            "yes",
            "no"
          ]
        },
        "confidence": {
          "type": "integer",
          "minimum": 50,
          "maximum": 100
        },
        "reasoning": {
          "type": "string",
          "minLength": 20,
          "maxLength": 600
        }
      }
    },
    "action": {
      "type": "object",
      "additionalProperties": false,
      "required": [
        "type",
        "amountUsd",
        "explanation"
      ],
      "properties": {
        "type": {
          "type": "string",
          "enum": [
            "BUY_YES",
            "BUY_NO",
            "SELL_YES",
            "SELL_NO",
            "HOLD"
          ]
        },
        "amountUsd": {
          "type": "number",
          "minimum": 0
        },
        "explanation": {
          "type": "string",
          "minLength": 1,
          "maxLength": 220
        }
      }
    }
  }
}
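The schema covers structure and ranges, but a few prompt rules span several fields and cannot be expressed in it: binaryCall must agree with yesProbability, the action must be in allowedActions, and amountUsd must respect the cap for that action. A minimal sketch of such a post-schema check follows; `check_prompt_rules` is a hypothetical helper, and it assumes the decision has already passed schema validation.

```python
def check_prompt_rules(decision: dict, portfolio: dict, allowed_actions) -> list:
    """Return a list of rule violations; an empty list means the decision passes.

    Assumes `decision` already matches the output JSON schema.
    """
    errors = []
    forecast = decision["forecast"]
    action = decision["action"]

    # binaryCall is "yes" exactly when yesProbability >= 0.5.
    expected_call = "yes" if forecast["yesProbability"] >= 0.5 else "no"
    if forecast["binaryCall"] != expected_call:
        errors.append("binaryCall does not match yesProbability")

    if action["type"] not in allowed_actions:
        errors.append("action type is not in allowedActions")

    # Caps named in the prompt; HOLD has no cap listed, so none is applied.
    caps = {
        "BUY_YES": portfolio["maxBuyUsd"],
        "BUY_NO": portfolio["maxBuyUsd"],
        "SELL_YES": portfolio["maxSellYesUsd"],
        "SELL_NO": portfolio["maxSellNoUsd"],
    }
    cap = caps.get(action["type"])
    if cap is not None and action["amountUsd"] > cap:
        errors.append("amountUsd exceeds the cap for this action")

    return errors
```

Run against the example output and the example portfolio above (maxBuyUsd of 1000), this check reports no violations, since a 100 USD BUY_YES is under the buy cap and a 0.61 probability matches a "yes" call.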


