Claude Code Benchmark Results
This page records the Claude Code compatibility-backend runs available in the local experiment artifacts. Bayesian-Agent handled benchmark orchestration, grading, and result writing; Claude Code executed each task prompt through the adapter boundary.
Evidence type: newly run / verified local artifact. Metrics were read from
results/claude_code/**/baseline/results.json and
results/claude_code_smoke/**/baseline/results.json.
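As a rough illustration of how those two glob patterns resolve, the sketch below collects the matching files with `pathlib`. Only the two patterns come from this page; the helper name `find_baseline_results` and the choice to pass the repository root as an argument are assumptions made for the example.

```python
from pathlib import Path

# The two glob patterns quoted above, relative to the repository root.
PATTERNS = (
    "results/claude_code/**/baseline/results.json",
    "results/claude_code_smoke/**/baseline/results.json",
)


def find_baseline_results(repo_root: str) -> list[Path]:
    """Return every baseline results.json matched by PATTERNS, sorted.

    Hypothetical helper: the benchmark code may locate its files differently.
    """
    root = Path(repo_root)
    found: set[Path] = set()
    for pattern in PATTERNS:
        # Path.glob treats "**" as "this directory and all subdirectories".
        found.update(root.glob(pattern))
    return sorted(found)
```

Files written by other modes, such as a `bayesian_full/results.json`, are not matched because both patterns pin the parent directory to `baseline`.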
Commands
Full baseline runs used this harness shape:

    python experiments/run_benchmarks.py \
      --harness claude-code \
      --model "$MODEL" \
      --bench "$BENCH" \
      --mode baseline \
      --out-root "results/claude_code/${MODEL_SLUG}/${BENCH}"
Smoke runs used the same backend with --limit 1 and smoke-specific output
roots under results/claude_code_smoke/.
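The sketch below shows one way the full and smoke command lines could be assembled for each model and benchmark. The flags are the ones shown above. The benchmark slugs (`sop`, `lifelong`, `realfin`) and model slugs are inferred from the result paths listed on this page, while the `model_slug` rule and the exact layout under `results/claude_code_smoke/` are assumptions. The function only builds the argument list and does not run anything.

```python
import re
import shlex

# Benchmark slugs as they appear in the result directory names on this page.
BENCHES = ("sop", "lifelong", "realfin")


def model_slug(model: str) -> str:
    """Assumed slug rule: lowercase, runs of non-alphanumerics become "_"."""
    return re.sub(r"[^a-z0-9]+", "_", model.lower()).strip("_")


def build_command(model: str, bench: str, smoke: bool = False) -> list[str]:
    """Build the argv for one baseline run (hypothetical helper)."""
    # Assumption: smoke runs mirror the full-run layout under a different root.
    root = "results/claude_code_smoke" if smoke else "results/claude_code"
    cmd = [
        "python", "experiments/run_benchmarks.py",
        "--harness", "claude-code",
        "--model", model,
        "--bench", bench,
        "--mode", "baseline",
        "--out-root", f"{root}/{model_slug(model)}/{bench}",
    ]
    if smoke:
        cmd += ["--limit", "1"]  # smoke runs execute a single task
    return cmd


if __name__ == "__main__":
    for bench in BENCHES:
        # shlex.join quotes arguments such as "deepseek-v4-pro[1m]" for the shell.
        print(shlex.join(build_command("deepseek-v4-pro[1m]", bench, smoke=True)))
```

Returning a list rather than a shell string keeps model names with brackets intact if the list is later handed to `subprocess.run`.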
Full Baseline Results
| Model | Benchmark | Accuracy | Total Tokens | Efficiency | Failed Tasks |
|---|---|---|---|---|---|
| deepseek-v4-flash | SOP-Bench | 18/20 (90%) | 5,887,709 | 3.06 | sop_15, sop_16 |
| deepseek-v4-flash | Lifelong AgentBench | 20/20 (100%) | 1,552,970 | 12.88 | none |
| deepseek-v4-flash | RealFin-Bench | 31/40 (77.5%) | 49,414,353 | 0.63 | 9 tasks |
| deepseek-v4-pro[1m] | SOP-Bench | 13/20 (65%) | 2,759,895 | 4.71 | 7 tasks |
| deepseek-v4-pro[1m] | Lifelong AgentBench | 20/20 (100%) | 1,552,632 | 12.88 | none |
| deepseek-v4-pro[1m] | RealFin-Bench | 26/40 (65%) | 27,034,780 | 0.96 | 14 tasks |
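The page does not define the Efficiency column, but the values in the table above are consistent with passed tasks per million total tokens. The sketch below reproduces them under that assumption. The function name and the zero-token convention are illustrative, the latter chosen to match the `0.00` reported for runs that used no tokens.

```python
def efficiency(passed: int, total_tokens: int) -> float:
    """Passed tasks per million tokens (assumed definition of the column)."""
    if total_tokens == 0:
        # Assumed convention: a run with no model calls reports 0.00.
        return 0.0
    return passed / (total_tokens / 1_000_000)


# (model, benchmark, passed tasks, total tokens) copied from the table above.
ROWS = [
    ("deepseek-v4-flash", "SOP-Bench", 18, 5_887_709),
    ("deepseek-v4-flash", "Lifelong AgentBench", 20, 1_552_970),
    ("deepseek-v4-flash", "RealFin-Bench", 31, 49_414_353),
    ("deepseek-v4-pro[1m]", "SOP-Bench", 13, 2_759_895),
    ("deepseek-v4-pro[1m]", "Lifelong AgentBench", 20, 1_552_632),
    ("deepseek-v4-pro[1m]", "RealFin-Bench", 26, 27_034_780),
]

if __name__ == "__main__":
    for model, bench, passed, tokens in ROWS:
        print(f"{model:22s} {bench:20s} {efficiency(passed, tokens):.2f}")
```

Under this reading, a higher value means more solved tasks for the same token spend, which is why the token-heavy RealFin-Bench rows score below 1.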
Bayesian-Agent Modes with Claude Code Backend
These runs use Claude Code as the task execution backend and Bayesian-Agent for benchmark orchestration, verifier grading, Skill evidence updates, and failure-mode patch injection.
Evidence type: newly run / verified local artifact, 2026-06-06 to 2026-06-07. Metrics were read from:
results/claude_code/{deepseek_v4_flash,deepseek_v4_pro_1m}/{sop,lifelong,realfin}/{baseline,bayesian_full,bayesian_incremental}/results.json
Commands used the same shape for all three benchmarks:

    python experiments/run_benchmarks.py \
      --harness claude-code \
      --model "$MODEL" \
      --bench "$BENCH" \
      --mode "$MODE" \
      --out-root "results/claude_code/${MODEL_SLUG}/${BENCH}"
For bayesian_incremental, the runner received the matching Claude Code
baseline results.json and reran only failed tasks.
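To make the incremental selection step concrete, here is a minimal sketch of picking out the failed tasks from a baseline result. The layout of `results.json` is not shown on this page, so the `tasks`, `task_id`, and `passed` keys below are hypothetical, as is the helper name. The sample data is illustrative apart from `sop_15` and `sop_16`, which are the recorded SOP-Bench failures for deepseek-v4-flash.

```python
def failed_task_ids(baseline: dict) -> list[str]:
    """Return the ids of tasks that did not pass, in their original order.

    Assumes a hypothetical schema: {"tasks": [{"task_id": str, "passed": bool}]}.
    """
    return [task["task_id"] for task in baseline["tasks"] if not task["passed"]]


# Illustrative stand-in for a parsed baseline results.json.
sample_baseline = {
    "tasks": [
        {"task_id": "sop_14", "passed": True},
        {"task_id": "sop_15", "passed": False},
        {"task_id": "sop_16", "passed": False},
        {"task_id": "sop_17", "passed": True},
    ]
}

if __name__ == "__main__":
    to_rerun = failed_task_ids(sample_baseline)
    print(to_rerun)
```

When the list is empty, there is nothing to rerun, which corresponds to a repair run that makes no model calls.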
deepseek-v4-flash
| Benchmark | Mode | Score | Repaired | Input Tokens | Output Tokens | Total Tokens | Efficiency | Cumulative Tokens |
|---|---|---|---|---|---|---|---|---|
| SOP-Bench | baseline | 18/20 (90.0%) | - | 5.81M | 78k | 5.89M | 3.06 | - |
| SOP-Bench | bayesian_full | 20/20 (100.0%) | - | 3.39M | 55k | 3.44M | 5.81 | - |
| SOP-Bench | bayesian_incremental | 20/20 (100.0%) | 2/2 | 360k | 6k | 366k | 5.46 | 6.25M |
| Lifelong AgentBench | baseline | 20/20 (100.0%) | - | 1.53M | 25k | 1.55M | 12.88 | - |
| Lifelong AgentBench | bayesian_full | 20/20 (100.0%) | - | 1.54M | 24k | 1.57M | 12.77 | - |
| Lifelong AgentBench | bayesian_incremental | 20/20 (100.0%) | 0/0 | 0 | 0 | 0 | 0.00 | 1.55M |
| RealFin-Bench | baseline | 31/40 (77.5%) | - | 48.57M | 841k | 49.41M | 0.63 | - |
| RealFin-Bench | bayesian_full | 32/40 (80.0%) | - | 47.18M | 774k | 47.95M | 0.67 | - |
| RealFin-Bench | bayesian_incremental | 35/40 (87.5%) | 4/9 | 7.04M | 152k | 7.19M | 0.56 | 56.61M |
deepseek-v4-pro[1m]
| Benchmark | Mode | Score | Repaired | Input Tokens | Output Tokens | Total Tokens | Efficiency | Cumulative Tokens |
|---|---|---|---|---|---|---|---|---|
| SOP-Bench | baseline | 13/20 (65.0%) | - | 2.71M | 50k | 2.76M | 4.71 | - |
| SOP-Bench | bayesian_full | 19/20 (95.0%) | - | 2.78M | 45k | 2.82M | 6.74 | - |
| SOP-Bench | bayesian_incremental | 20/20 (100.0%) | 7/7 | 958k | 20k | 977k | 7.16 | 3.74M |
| Lifelong AgentBench | baseline | 20/20 (100.0%) | - | 1.53M | 24k | 1.55M | 12.88 | - |
| Lifelong AgentBench | bayesian_full | 20/20 (100.0%) | - | 1.55M | 24k | 1.57M | 12.74 | - |
| Lifelong AgentBench | bayesian_incremental | 20/20 (100.0%) | 0/0 | 0 | 0 | 0 | 0.00 | 1.55M |
| RealFin-Bench | baseline | 26/40 (65.0%) | - | 26.53M | 509k | 27.03M | 0.96 | - |
| RealFin-Bench | bayesian_full | 27/40 (67.5%) | - | 32.04M | 691k | 32.73M | 0.82 | - |
| RealFin-Bench | bayesian_incremental | 30/40 (75.0%) | 4/14 | 14.15M | 296k | 14.45M | 0.28 | 41.48M |
Notes:
- Incremental rows report repair-only token usage. `Cumulative Tokens` reports baseline cost plus the incremental repair run cost.
- Lifelong AgentBench had no failed baseline tasks, so the incremental run performed no model calls and records `0` repair tokens for both models.
- SOP-Bench shows the clearest benefit. With `deepseek-v4-flash`, Bayesian full self-evolution reaches 100% while using fewer total tokens than the Claude Code baseline. With `deepseek-v4-pro[1m]`, incremental repair fixes all 7 baseline failures and raises final accuracy from 65.0% to 100.0%.
- RealFin-Bench benefits from incremental repair on both models: `deepseek-v4-flash` rises from 77.5% to 87.5%, and `deepseek-v4-pro[1m]` rises from 65.0% to 75.0%.
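The arithmetic behind the incremental rows can be sketched as follows: the final score adds the repaired tasks to the baseline passes, and the cumulative cost adds the repair-only tokens to the baseline tokens. The function names are illustrative. Because the inputs here are the rounded values from the tables, a sum can differ from the Cumulative Tokens column by about 0.01M, presumably because that column was computed from unrounded counts.

```python
def final_accuracy(baseline_passed: int, repaired: int, total: int) -> float:
    """Fraction of tasks passing once repairs are added to the baseline passes."""
    return (baseline_passed + repaired) / total


def cumulative_tokens(baseline_tokens: float, repair_tokens: float) -> float:
    """Baseline cost plus repair-only cost, in whatever unit both inputs share."""
    return baseline_tokens + repair_tokens


if __name__ == "__main__":
    # deepseek-v4-pro[1m] on SOP-Bench: 13 baseline passes, 7 of 7 failures repaired.
    print(f"{final_accuracy(13, 7, 20):.1%}")
    print(f"{cumulative_tokens(2.76, 0.977):.2f}M")

    # deepseek-v4-flash on RealFin-Bench: 31 baseline passes, 4 of 9 failures repaired.
    print(f"{final_accuracy(31, 4, 40):.1%}")
```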
Smoke Results
| Model | Benchmark | Accuracy | Total Tokens | Efficiency | Failed Tasks |
|---|---|---|---|---|---|
| deepseek-v4-flash | SOP-Bench | 1/1 (100%) | 276,483 | 3.62 | none |
| deepseek-v4-flash | Lifelong AgentBench | 1/1 (100%) | 75,551 | 13.24 | none |
| deepseek-v4-flash | RealFin-Bench | 0/1 (0%) | 686,454 | 0.00 | task_01_macd_rsi_filter |
| deepseek-v4-pro[1m] | SOP-Bench | 1/1 (100%) | 149,332 | 6.70 | none |
| deepseek-v4-pro[1m] | Lifelong AgentBench | 1/1 (100%) | 75,565 | 13.23 | none |
| deepseek-v4-pro[1m] | RealFin-Bench | 0/1 (0%) | 1,040,695 | 0.00 | task_01_macd_rsi_filter |
RealFin Failures
deepseek-v4-flash
- task_01_macd_rsi_filter
- task_11_semiconductor_macd
- task_15_volume_price_contraction_breakout
- task_16_pe_bollinger_reversal
- task_21_morning_star
- task_22_bullish_sandwich
- task_27_atr_volatility_breakout
- task_31_triple_timeframe_macd
- task_32_weekly_breakout_daily_pullback

deepseek-v4-pro[1m]
- task_11_semiconductor_macd
- task_20_ultimate_multi_condition
- task_21_morning_star
- task_22_bullish_sandwich
- task_23_gentle_volume_rise
- task_24_doji_to_surge
- task_26_ma_convergence_divergence
- task_27_atr_volatility_breakout
- task_28_volatility_compression_explosion
- task_31_triple_timeframe_macd
- task_32_weekly_breakout_daily_pullback
- task_33_sector_leadership
- task_39_interest_rate_sector_rotation
- task_40_commodity_equity_linkage
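The two RealFin failure lists can be compared with plain set operations, which may help when deciding which tasks are hard for both models and which are model-specific. This is a small sketch over the task ids listed in this section. The variable names are illustrative and no result files are read.

```python
# Baseline RealFin-Bench failures, copied from the lists in this section.
flash_failed = {
    "task_01_macd_rsi_filter",
    "task_11_semiconductor_macd",
    "task_15_volume_price_contraction_breakout",
    "task_16_pe_bollinger_reversal",
    "task_21_morning_star",
    "task_22_bullish_sandwich",
    "task_27_atr_volatility_breakout",
    "task_31_triple_timeframe_macd",
    "task_32_weekly_breakout_daily_pullback",
}
pro_failed = {
    "task_11_semiconductor_macd",
    "task_20_ultimate_multi_condition",
    "task_21_morning_star",
    "task_22_bullish_sandwich",
    "task_23_gentle_volume_rise",
    "task_24_doji_to_surge",
    "task_26_ma_convergence_divergence",
    "task_27_atr_volatility_breakout",
    "task_28_volatility_compression_explosion",
    "task_31_triple_timeframe_macd",
    "task_32_weekly_breakout_daily_pullback",
    "task_33_sector_leadership",
    "task_39_interest_rate_sector_rotation",
    "task_40_commodity_equity_linkage",
}

shared = sorted(flash_failed & pro_failed)      # failed by both models
flash_only = sorted(flash_failed - pro_failed)  # failed only by deepseek-v4-flash
pro_only = sorted(pro_failed - flash_failed)    # failed only by deepseek-v4-pro[1m]

if __name__ == "__main__":
    print(len(shared), len(flash_only), len(pro_only))
```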
Artifacts
The committed result artifacts intentionally include run summaries and
results.json files only. Large per-task workspaces, transcripts, and tool logs
remain under the ignored results/ tree unless explicitly force-added later.
- Full runs: `results/claude_code/`
- Smoke runs: `results/claude_code_smoke/`
- Adapter: `bayesian_agent/adapters/claude_code.py`
- Runner: `experiments/run_benchmarks.py --harness claude-code`