Claude Code Benchmark Results

This page records the Claude Code compatibility-backend runs available in the local experiment artifacts. Bayesian-Agent handled benchmark orchestration, grading, and result writing; Claude Code executed each task prompt through the adapter boundary.

Evidence type: newly run / verified local artifact. Metrics were read from results/claude_code/**/baseline/results.json and results/claude_code_smoke/**/baseline/results.json.

Commands

Full baseline runs used this harness shape:

python experiments/run_benchmarks.py \
  --harness claude-code \
  --model "$MODEL" \
  --bench "$BENCH" \
  --mode baseline \
  --out-root "results/claude_code/${MODEL_SLUG}/${BENCH}"

Smoke runs used the same backend with --limit 1 and smoke-specific output roots under results/claude_code_smoke/.
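
The command shape above can be generated for every model and benchmark pair. The following is a minimal sketch, not the project's own tooling: `model_slug` and `build_command` are hypothetical helpers, the slug rule is inferred from the directory names `deepseek_v4_flash` and `deepseek_v4_pro_1m` listed later on this page, the `--bench` values are assumed to match the `sop`, `lifelong`, and `realfin` directory names, and the exact layout below `results/claude_code_smoke/` is an assumption.

```python
import re
import shlex
from typing import List, Optional


def model_slug(model: str) -> str:
    """Assumed slug rule: lowercase, runs of non-alphanumerics become '_'."""
    return re.sub(r"[^a-z0-9]+", "_", model.lower()).strip("_")


def build_command(model: str, bench: str, mode: str = "baseline",
                  limit: Optional[int] = None) -> List[str]:
    """Build the argv list for one run; nothing is executed here."""
    # Assumption: smoke runs mirror the full-run layout under a different root.
    root = "results/claude_code_smoke" if limit is not None else "results/claude_code"
    argv = [
        "python", "experiments/run_benchmarks.py",
        "--harness", "claude-code",
        "--model", model,
        "--bench", bench,
        "--mode", mode,
        "--out-root", f"{root}/{model_slug(model)}/{bench}",
    ]
    if limit is not None:
        argv += ["--limit", str(limit)]
    return argv


# Print a shell-quoted command line for each model and benchmark.
for model in ("deepseek-v4-flash", "deepseek-v4-pro[1m]"):
    for bench in ("sop", "lifelong", "realfin"):
        print(shlex.join(build_command(model, bench)))
```

Building an argv list and quoting it with `shlex.join` keeps the brackets in `deepseek-v4-pro[1m]` from being interpreted by the shell.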

Full Baseline Results

| Model | Benchmark | Accuracy | Total Tokens | Efficiency | Failed Tasks |
| --- | --- | --- | --- | --- | --- |
| deepseek-v4-flash | SOP-Bench | 18/20 (90%) | 5,887,709 | 3.06 | sop_15, sop_16 |
| deepseek-v4-flash | Lifelong AgentBench | 20/20 (100%) | 1,552,970 | 12.88 | none |
| deepseek-v4-flash | RealFin-Bench | 31/40 (77.5%) | 49,414,353 | 0.63 | 9 tasks |
| deepseek-v4-pro[1m] | SOP-Bench | 13/20 (65%) | 2,759,895 | 4.71 | 7 tasks |
| deepseek-v4-pro[1m] | Lifelong AgentBench | 20/20 (100%) | 1,552,632 | 12.88 | none |
| deepseek-v4-pro[1m] | RealFin-Bench | 26/40 (65%) | 27,034,780 | 0.96 | 14 tasks |
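
The Efficiency column is not defined on this page, but every reported value is consistent with passed tasks per one million total tokens. A minimal sketch under that assumed definition, checking it against the six full-baseline rows (the `efficiency` helper is illustrative, not taken from the runner):

```python
def efficiency(passed: int, total_tokens: int) -> float:
    """Passed tasks per one million tokens (definition inferred from the table)."""
    if total_tokens == 0:
        return 0.0
    return passed / (total_tokens / 1_000_000)


# (model, benchmark, passed tasks, total tokens, reported efficiency)
rows = [
    ("deepseek-v4-flash", "SOP-Bench", 18, 5_887_709, 3.06),
    ("deepseek-v4-flash", "Lifelong AgentBench", 20, 1_552_970, 12.88),
    ("deepseek-v4-flash", "RealFin-Bench", 31, 49_414_353, 0.63),
    ("deepseek-v4-pro[1m]", "SOP-Bench", 13, 2_759_895, 4.71),
    ("deepseek-v4-pro[1m]", "Lifelong AgentBench", 20, 1_552_632, 12.88),
    ("deepseek-v4-pro[1m]", "RealFin-Bench", 26, 27_034_780, 0.96),
]

for model, bench, passed, tokens, reported in rows:
    computed = efficiency(passed, tokens)
    # Agreement to two decimals means a difference below half a hundredth.
    matches = abs(computed - reported) < 0.005
    print(f"{model:20} {bench:20} computed={computed:.2f} reported={reported:.2f} match={matches}")
```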

Bayesian-Agent Modes with Claude Code Backend

These runs use Claude Code as the task execution backend and Bayesian-Agent for benchmark orchestration, verifier grading, Skill evidence updates, and failure-mode patch injection.

Evidence type: newly run / verified local artifact, 2026-06-06 to 2026-06-07. Metrics were read from:

results/claude_code/{deepseek_v4_flash,deepseek_v4_pro_1m}/{sop,lifelong,realfin}/{baseline,bayesian_full,bayesian_incremental}/results.json

Commands used the same shape for all three benchmarks:

python experiments/run_benchmarks.py \
  --harness claude-code \
  --model "$MODEL" \
  --bench "$BENCH" \
  --mode "$MODE" \
  --out-root "results/claude_code/${MODEL_SLUG}/${BENCH}"

For bayesian_incremental, the runner received the matching Claude Code baseline results.json and reran only failed tasks.
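
The selection step can be pictured as a filter over the baseline results. This is only a sketch: the real `results.json` schema is not shown on this page, so the `tasks`, `id`, and `passed` field names are hypothetical, as is the `failed_task_ids` helper. The sample data reuses the two SOP-Bench failures reported for deepseek-v4-flash, and the passing entry is illustrative.

```python
import json
from typing import List


def failed_task_ids(results: dict) -> List[str]:
    """Return the ids of tasks that did not pass (hypothetical schema)."""
    return [task["id"] for task in results["tasks"] if not task["passed"]]


# Stand-in for a baseline results.json; field names are assumptions.
baseline_json = """
{
  "tasks": [
    {"id": "sop_14", "passed": true},
    {"id": "sop_15", "passed": false},
    {"id": "sop_16", "passed": false}
  ]
}
"""

to_rerun = failed_task_ids(json.loads(baseline_json))
print(to_rerun)  # only these tasks would be rerun in incremental mode
```

When the baseline has no failures, the list is empty and no model calls are needed, which matches the zero-token Lifelong AgentBench incremental rows.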

deepseek-v4-flash

| Benchmark | Mode | Score | Repaired | Input Tokens | Output Tokens | Total Tokens | Efficiency | Cumulative Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SOP-Bench | baseline | 18/20 (90.0%) | - | 5.81M | 78k | 5.89M | 3.06 | - |
| SOP-Bench | bayesian_full | 20/20 (100.0%) | - | 3.39M | 55k | 3.44M | 5.81 | - |
| SOP-Bench | bayesian_incremental | 20/20 (100.0%) | 2/2 | 360k | 6k | 366k | 5.46 | 6.25M |
| Lifelong AgentBench | baseline | 20/20 (100.0%) | - | 1.53M | 25k | 1.55M | 12.88 | - |
| Lifelong AgentBench | bayesian_full | 20/20 (100.0%) | - | 1.54M | 24k | 1.57M | 12.77 | - |
| Lifelong AgentBench | bayesian_incremental | 20/20 (100.0%) | 0/0 | 0 | 0 | 0 | 0.00 | 1.55M |
| RealFin-Bench | baseline | 31/40 (77.5%) | - | 48.57M | 841k | 49.41M | 0.63 | - |
| RealFin-Bench | bayesian_full | 32/40 (80.0%) | - | 47.18M | 774k | 47.95M | 0.67 | - |
| RealFin-Bench | bayesian_incremental | 35/40 (87.5%) | 4/9 | 7.04M | 152k | 7.19M | 0.56 | 56.61M |

deepseek-v4-pro[1m]

| Benchmark | Mode | Score | Repaired | Input Tokens | Output Tokens | Total Tokens | Efficiency | Cumulative Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SOP-Bench | baseline | 13/20 (65.0%) | - | 2.71M | 50k | 2.76M | 4.71 | - |
| SOP-Bench | bayesian_full | 19/20 (95.0%) | - | 2.78M | 45k | 2.82M | 6.74 | - |
| SOP-Bench | bayesian_incremental | 20/20 (100.0%) | 7/7 | 958k | 20k | 977k | 7.16 | 3.74M |
| Lifelong AgentBench | baseline | 20/20 (100.0%) | - | 1.53M | 24k | 1.55M | 12.88 | - |
| Lifelong AgentBench | bayesian_full | 20/20 (100.0%) | - | 1.55M | 24k | 1.57M | 12.74 | - |
| Lifelong AgentBench | bayesian_incremental | 20/20 (100.0%) | 0/0 | 0 | 0 | 0 | 0.00 | 1.55M |
| RealFin-Bench | baseline | 26/40 (65.0%) | - | 26.53M | 509k | 27.03M | 0.96 | - |
| RealFin-Bench | bayesian_full | 27/40 (67.5%) | - | 32.04M | 691k | 32.73M | 0.82 | - |
| RealFin-Bench | bayesian_incremental | 30/40 (75.0%) | 4/14 | 14.15M | 296k | 14.45M | 0.28 | 41.48M |

Notes:

  • Incremental rows report repair-only token usage. Cumulative Tokens reports baseline cost plus the incremental repair run cost.
  • Lifelong AgentBench had no failed baseline tasks, so the incremental run performed no model calls and records 0 repair tokens for both models.
  • SOP-Bench shows the clearest benefit. With deepseek-v4-flash, Bayesian full self-evolution (bayesian_full) reaches 100% while using fewer total tokens than the Claude Code baseline (3.44M vs. 5.89M). With deepseek-v4-pro[1m], incremental repair fixes all 7 baseline failures and raises final accuracy from 65.0% to 100.0%.
  • RealFin-Bench benefits from incremental repair on both models: deepseek-v4-flash rises from 77.5% to 87.5% (4 of 9 baseline failures repaired), and deepseek-v4-pro[1m] rises from 65.0% to 75.0% (4 of 14 repaired).
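
The incremental bookkeeping described in these notes can be reproduced from the tables. A minimal sketch, assuming the final score is baseline passes plus repaired tasks and Cumulative Tokens is baseline tokens plus repair tokens; the `final_accuracy` and `cumulative_millions` helpers are illustrative. Repair token counts are only published rounded (for example 366k), so cumulative values can differ from the reported ones by about 0.01M.

```python
def final_accuracy(baseline_passed: int, repaired: int, total: int) -> float:
    """Final accuracy in percent after incremental repair."""
    return 100.0 * (baseline_passed + repaired) / total


def cumulative_millions(baseline_tokens: int, repair_tokens: int) -> float:
    """Baseline cost plus repair-run cost, in millions of tokens."""
    return (baseline_tokens + repair_tokens) / 1_000_000


# (model, benchmark, baseline passed, repaired, total tasks,
#  baseline tokens, repair tokens (rounded in the source), reported cumulative M)
rows = [
    ("deepseek-v4-flash", "SOP-Bench", 18, 2, 20, 5_887_709, 366_000, 6.25),
    ("deepseek-v4-flash", "RealFin-Bench", 31, 4, 40, 49_414_353, 7_190_000, 56.61),
    ("deepseek-v4-pro[1m]", "SOP-Bench", 13, 7, 20, 2_759_895, 977_000, 3.74),
    ("deepseek-v4-pro[1m]", "RealFin-Bench", 26, 4, 40, 27_034_780, 14_450_000, 41.48),
]

for model, bench, passed, repaired, total, base_tok, repair_tok, reported in rows:
    acc = final_accuracy(passed, repaired, total)
    cum = cumulative_millions(base_tok, repair_tok)
    print(f"{model:20} {bench:14} final={acc:.1f}% cumulative={cum:.2f}M reported={reported:.2f}M")
```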

Smoke Results

| Model | Benchmark | Accuracy | Total Tokens | Efficiency | Failed Tasks |
| --- | --- | --- | --- | --- | --- |
| deepseek-v4-flash | SOP-Bench | 1/1 (100%) | 276,483 | 3.62 | none |
| deepseek-v4-flash | Lifelong AgentBench | 1/1 (100%) | 75,551 | 13.24 | none |
| deepseek-v4-flash | RealFin-Bench | 0/1 (0%) | 686,454 | 0.00 | task_01_macd_rsi_filter |
| deepseek-v4-pro[1m] | SOP-Bench | 1/1 (100%) | 149,332 | 6.70 | none |
| deepseek-v4-pro[1m] | Lifelong AgentBench | 1/1 (100%) | 75,565 | 13.23 | none |
| deepseek-v4-pro[1m] | RealFin-Bench | 0/1 (0%) | 1,040,695 | 0.00 | task_01_macd_rsi_filter |

RealFin Failures

deepseek-v4-flash

  • task_01_macd_rsi_filter
  • task_11_semiconductor_macd
  • task_15_volume_price_contraction_breakout
  • task_16_pe_bollinger_reversal
  • task_21_morning_star
  • task_22_bullish_sandwich
  • task_27_atr_volatility_breakout
  • task_31_triple_timeframe_macd
  • task_32_weekly_breakout_daily_pullback

deepseek-v4-pro[1m]

  • task_11_semiconductor_macd
  • task_20_ultimate_multi_condition
  • task_21_morning_star
  • task_22_bullish_sandwich
  • task_23_gentle_volume_rise
  • task_24_doji_to_surge
  • task_26_ma_convergence_divergence
  • task_27_atr_volatility_breakout
  • task_28_volatility_compression_explosion
  • task_31_triple_timeframe_macd
  • task_32_weekly_breakout_daily_pullback
  • task_33_sector_leadership
  • task_39_interest_rate_sector_rotation
  • task_40_commodity_equity_linkage
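
The two failure lists above can be compared with plain set operations to see which tasks fail for both models. A small sketch using only the task names listed on this page; the variable names are illustrative.

```python
flash_failures = {
    "task_01_macd_rsi_filter",
    "task_11_semiconductor_macd",
    "task_15_volume_price_contraction_breakout",
    "task_16_pe_bollinger_reversal",
    "task_21_morning_star",
    "task_22_bullish_sandwich",
    "task_27_atr_volatility_breakout",
    "task_31_triple_timeframe_macd",
    "task_32_weekly_breakout_daily_pullback",
}

pro_failures = {
    "task_11_semiconductor_macd",
    "task_20_ultimate_multi_condition",
    "task_21_morning_star",
    "task_22_bullish_sandwich",
    "task_23_gentle_volume_rise",
    "task_24_doji_to_surge",
    "task_26_ma_convergence_divergence",
    "task_27_atr_volatility_breakout",
    "task_28_volatility_compression_explosion",
    "task_31_triple_timeframe_macd",
    "task_32_weekly_breakout_daily_pullback",
    "task_33_sector_leadership",
    "task_39_interest_rate_sector_rotation",
    "task_40_commodity_equity_linkage",
}

shared = sorted(flash_failures & pro_failures)      # fail on both models
flash_only = sorted(flash_failures - pro_failures)  # fail only on deepseek-v4-flash
pro_only = sorted(pro_failures - flash_failures)    # fail only on deepseek-v4-pro[1m]

print("shared:", shared)
print("flash only:", flash_only)
print("pro only:", pro_only)
```

Six tasks appear in both lists, three fail only with deepseek-v4-flash, and eight fail only with deepseek-v4-pro[1m].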

Artifacts

Only run summaries and results.json files are committed, by design. Large per-task workspaces, transcripts, and tool logs stay in the ignored results/ tree unless they are explicitly force-added later.

  • Full runs: results/claude_code/
  • Smoke runs: results/claude_code_smoke/
  • Adapter: bayesian_agent/adapters/claude_code.py
  • Runner: experiments/run_benchmarks.py --harness claude-code