mini-swe-agent Benchmark Results¶
This page records the full mini-swe-agent compatibility-backend run from 2026-06-05. Bayesian-Agent handled benchmark orchestration, Bayesian full self-evolution, incremental repair, grading, and artifact writing; mini-swe-agent executed each task prompt through the adapter boundary.
Evidence type: newly run and verified from local result artifacts. Metrics were
read from results/mini_swe_agent/*/*/*/results.json after the runs completed.
RealFin final runs used --max-turns 16 --mini-swe-env-timeout 180.
For the RealFin model-size ablation with Qwen3.5 35B-A3B, Qwen3.5 122B-A10B, DeepSeek V4 Flash, and DeepSeek V4 Pro, see Model Scaling on RealFin-Bench.
Commands¶
Baseline, Bayesian full, and Bayesian incremental runs used this harness shape:
python experiments/run_benchmarks.py \
--harness mini-swe-agent \
--mini-swe-agent-root ../mini-swe-agent \
--model "$MODEL" \
--bench "$BENCH" \
--mode all \
--out-root "results/mini_swe_agent/${MODEL_SLUG}/${BENCH}"
For Pro RealFin, the interrupted Bayesian full run was rerun to completion, and Bayesian incremental was then run from the completed baseline file:
python experiments/run_benchmarks.py \
--harness mini-swe-agent \
--mini-swe-agent-root ../mini-swe-agent \
--model deepseek-v4-pro \
--bench realfin \
--mode bayesian-incremental \
--out-root results/mini_swe_agent/deepseek_v4_pro/realfin \
--baseline-results results/mini_swe_agent/deepseek_v4_pro/realfin/baseline/results.json \
--max-turns 16 \
--mini-swe-env-timeout 180
Summary¶
| Model | Benchmark | Baseline | Bayesian full | Bayesian incremental final | Incremental repairs |
|---|---|---|---|---|---|
| deepseek-v4-flash | SOP-Bench | 20/20 (100%) | 19/20 (95%) | 20/20 (100%) | 0/0 |
| deepseek-v4-flash | Lifelong AgentBench | 17/20 (85%) | 19/20 (95%) | 20/20 (100%) | 3/3 |
| deepseek-v4-flash | RealFin-Bench | 24/40 (60%) | 22/40 (55%) | 28/40 (70%) | 4/16 |
| deepseek-v4-pro | SOP-Bench | 19/20 (95%) | 20/20 (100%) | 20/20 (100%) | 1/1 |
| deepseek-v4-pro | Lifelong AgentBench | 18/20 (90%) | 20/20 (100%) | 20/20 (100%) | 2/2 |
| deepseek-v4-pro | RealFin-Bench | 28/40 (70%) | 28/40 (70%) | 32/40 (80%) | 4/12 |
Token Usage¶
| Model | Benchmark | Mode | Total Tokens | Efficiency |
|---|---|---|---|---|
| deepseek-v4-flash | SOP-Bench | baseline | 1,017,462 | 19.66 |
| deepseek-v4-flash | SOP-Bench | Bayesian full | 1,012,377 | 18.77 |
| deepseek-v4-flash | SOP-Bench | incremental extra | 0 | 0.00 |
| deepseek-v4-flash | Lifelong AgentBench | baseline | 580,764 | 29.27 |
| deepseek-v4-flash | Lifelong AgentBench | Bayesian full | 497,914 | 38.16 |
| deepseek-v4-flash | Lifelong AgentBench | incremental extra | 36,456 | 82.29 |
| deepseek-v4-flash | RealFin-Bench | baseline | 5,728,953 | 4.19 |
| deepseek-v4-flash | RealFin-Bench | Bayesian full | 6,956,061 | 3.16 |
| deepseek-v4-flash | RealFin-Bench | incremental extra | 2,506,388 | 1.60 |
| deepseek-v4-pro | SOP-Bench | baseline | 907,207 | 20.94 |
| deepseek-v4-pro | SOP-Bench | Bayesian full | 953,207 | 20.98 |
| deepseek-v4-pro | SOP-Bench | incremental extra | 48,434 | 20.65 |
| deepseek-v4-pro | Lifelong AgentBench | baseline | 348,842 | 51.60 |
| deepseek-v4-pro | Lifelong AgentBench | Bayesian full | 329,718 | 60.66 |
| deepseek-v4-pro | Lifelong AgentBench | incremental extra | 27,302 | 73.25 |
| deepseek-v4-pro | RealFin-Bench | baseline | 6,100,563 | 4.59 |
| deepseek-v4-pro | RealFin-Bench | Bayesian full | 6,734,253 | 4.16 |
| deepseek-v4-pro | RealFin-Bench | incremental extra | 2,016,244 | 1.98 |
RealFin Remaining Failures After Incremental¶
deepseek-v4-flash¶
task_11_semiconductor_macdtask_12_catl_correlation_kdjtask_16_pe_bollinger_reversaltask_20_ultimate_multi_conditiontask_22_bullish_sandwichtask_23_gentle_volume_risetask_30_composite_scoringtask_31_triple_timeframe_macdtask_33_sector_leadershiptask_34_etf_constituent_arbitragetask_37_floor_ceiling_reversaltask_40_commodity_equity_linkage
deepseek-v4-pro¶
task_11_semiconductor_macdtask_16_pe_bollinger_reversaltask_20_ultimate_multi_conditiontask_22_bullish_sandwichtask_23_gentle_volume_risetask_33_sector_leadershiptask_39_interest_rate_sector_rotationtask_40_commodity_equity_linkage
Artifacts¶
- Adapter:
bayesian_agent/adapters/mini_swe_agent.py - Runner:
experiments/run_benchmarks.py --harness mini-swe-agent - Results root:
results/mini_swe_agent/ - Local ignored convenience summary:
results/mini_swe_agent/summary.md