Model Scaling on RealFin-Bench

This note records a focused model-size ablation for Bayesian-Agent on RealFin-Bench. All full runs use the same experiment shape:

  • Harness: mini-swe-agent compatibility backend.
  • Evolution mode: bayesian_full.
  • Benchmark: realfin with 40 tasks.
  • Runtime controls: --max-turns 16 --mini-swe-env-timeout 180.
  • Metrics: task completion accuracy, token usage, and efficiency.

Efficiency is computed as successful tasks per one million tokens:

efficiency = successes / total_tokens * 1,000,000
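
As a quick check of the formula, the sketch below recomputes efficiency from the success counts and rounded total-token figures reported in the Results table of this note. The function name and the `runs` dictionary are illustrative only; because the token totals are rounded to three significant figures, the recomputed values should only be expected to agree to about two decimals.

```python
def efficiency(successes: int, total_tokens: float) -> float:
    """Successful tasks per one million tokens."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return successes / total_tokens * 1_000_000


# (successes, rounded total tokens) copied from the Results table.
runs = {
    "qwen3.5-35b-a3b": (18, 7.32e6),
    "qwen3.5-122b-a10b": (3, 3.34e6),
    "deepseek-v4-flash": (22, 6.96e6),
    "deepseek-v4-pro": (28, 6.73e6),
}

for model, (successes, total_tokens) in runs.items():
    print(f"{model}: {efficiency(successes, total_tokens):.2f}")
```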

Commands

Before the full run, a one-task smoke test was run for both DashScope models to verify that the API, adapter, and verifier chain worked:

| Model | Smoke Result | Total Tokens | Evidence |
|---|---|---|---|
| qwen3.5-35b-a3b | 1/1 | 151,658 | results/model_scaling_mini_swe_realfin_qwen_smoke_20260612_004502/qwen3_5_35b_a3b |
| qwen3.5-122b-a10b | 0/1 | 65,267 | results/model_scaling_mini_swe_realfin_qwen_smoke_20260612_004502/qwen3_5_122b_a10b |

The smoke test is only an execution check, not a quality claim.

Full Qwen runs used the DashScope OpenAI-compatible endpoint:

cd /Users/wuxiaojun/code/My-Agent/Bayesian-Agent
set -a && . ./.env && set +a

.venv/bin/python experiments/run_benchmarks.py \
  --harness mini-swe-agent \
  --mini-swe-agent-root /Users/wuxiaojun/code/My-Agent/mini-swe-agent \
  --model qwen3.5-35b-a3b \
  --bench realfin \
  --mode bayesian-full \
  --out-root results/model_scaling_mini_swe_realfin_qwen_full_screen_20260612_004933/qwen3_5_35b_a3b \
  --api-key-env ALI_API_KEY \
  --base-url https://dashscope.aliyuncs.com/compatible-mode/v1 \
  --max-turns 16 \
  --mini-swe-env-timeout 180 \
  --limit 0

The qwen3.5-122b-a10b command was identical except for --model and --out-root. In this local desktop shell, nohup exited immediately with empty logs, so the full runs were submitted as detached screen sessions. The result artifacts and logs are still written under the same run root.
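
The following sketch shows one way to generate both commands from a single template so that only --model and --out-root differ. The benchmark flags are copied from the command above. Three things are assumptions, not the exact invocation used for these runs: the helper names, the screen session names, and the use of `screen -dmS` to start a detached session. The environment from .env must already be exported in the shell that launches screen.

```python
import re
import shlex

RUN_ROOT = "results/model_scaling_mini_swe_realfin_qwen_full_screen_20260612_004933"


def model_slug(model: str) -> str:
    # qwen3.5-35b-a3b -> qwen3_5_35b_a3b, matching the directory names in this note.
    return re.sub(r"[^0-9A-Za-z]+", "_", model)


def benchmark_argv(model: str) -> list:
    # Flags copied from the full-run command; only model and out-root vary.
    return [
        ".venv/bin/python", "experiments/run_benchmarks.py",
        "--harness", "mini-swe-agent",
        "--mini-swe-agent-root", "/Users/wuxiaojun/code/My-Agent/mini-swe-agent",
        "--model", model,
        "--bench", "realfin",
        "--mode", "bayesian-full",
        "--out-root", f"{RUN_ROOT}/{model_slug(model)}",
        "--api-key-env", "ALI_API_KEY",
        "--base-url", "https://dashscope.aliyuncs.com/compatible-mode/v1",
        "--max-turns", "16",
        "--mini-swe-env-timeout", "180",
        "--limit", "0",
    ]


def screen_argv(model: str) -> list:
    # Hypothetical session name; -dmS starts a detached, named screen session.
    return ["screen", "-dmS", f"realfin_{model_slug(model)}", *benchmark_argv(model)]


for model in ("qwen3.5-35b-a3b", "qwen3.5-122b-a10b"):
    print(shlex.join(screen_argv(model)))
```

The script only prints the commands, which makes it easy to inspect them before pasting into a shell.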

Results

The Qwen rows are newly run in this session. The DeepSeek rows are preexisting full RealFin artifacts from the mini-swe-agent backend.

| Model | Provider | Size Label | Evidence Type | Success | Accuracy | Input Tokens | Output Tokens | Total Tokens | Efficiency |
|---|---|---|---|---|---|---|---|---|---|
| qwen3.5-35b-a3b | DashScope | 35B / A3B | newly_run | 18/40 | 45.0% | 6.95M | 370k | 7.32M | 2.46 |
| qwen3.5-122b-a10b | DashScope | 122B / A10B | newly_run | 3/40 | 7.5% | 3.19M | 144k | 3.34M | 0.90 |
| deepseek-v4-flash | DeepSeek | 284B, user-provided | preexisting_artifact | 22/40 | 55.0% | 6.50M | 453k | 6.96M | 3.16 |
| deepseek-v4-pro | DeepSeek | 1.6T, user-provided | preexisting_artifact | 28/40 | 70.0% | 6.28M | 459k | 6.73M | 4.16 |

Evidence paths:

  • results/model_scaling_mini_swe_realfin_qwen_full_screen_20260612_004933/qwen3_5_35b_a3b/bayesian_full/results.json
  • results/model_scaling_mini_swe_realfin_qwen_full_screen_20260612_004933/qwen3_5_122b_a10b/bayesian_full/results.json
  • results/mini_swe_agent/deepseek_v4_flash/realfin/bayesian_full/results.json
  • results/mini_swe_agent/deepseek_v4_pro/realfin/bayesian_full/results.json

Failure Shape

The failure distribution matters as much as the final score because Bayesian-Agent can only evolve Skills from observable execution evidence.

| Model | Failed Tasks | Failures With No Output File | Failures With Output File But Invalid Details |
|---|---|---|---|
| qwen3.5-35b-a3b | 22 | 10 | 12 |
| qwen3.5-122b-a10b | 37 | 37 | 0 |
| deepseek-v4-flash | 18 | 9 | 9 |
| deepseek-v4-pro | 12 | 7 | 5 |
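
The two failure columns should partition the failed tasks, and failed tasks plus successes should equal the 40 benchmark tasks. The sketch below checks this against the figures in this note and prints the share of failures that produced no output file. The tuple layout and function name are illustrative.

```python
TOTAL_TASKS = 40

# (model, successes, failed, no_output_file, file_but_invalid)
# Successes come from the Results table, the rest from the Failure Shape table.
rows = [
    ("qwen3.5-35b-a3b", 18, 22, 10, 12),
    ("qwen3.5-122b-a10b", 3, 37, 37, 0),
    ("deepseek-v4-flash", 22, 18, 9, 9),
    ("deepseek-v4-pro", 28, 12, 7, 5),
]


def no_output_share(successes: int, failed: int, no_file: int, invalid: int) -> float:
    """Validate one row and return the fraction of failures with no output file."""
    if successes + failed != TOTAL_TASKS:
        raise ValueError("successes and failures do not add up to the task count")
    if no_file + invalid != failed:
        raise ValueError("failure categories do not add up to the failed count")
    return no_file / failed


for model, *counts in rows:
    print(f"{model}: {no_output_share(*counts):.0%} of failures had no output file")
```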

qwen3.5-122b-a10b is the most important negative result. All 37 of its failures have file_created = 0, which means the backend did not produce the required benchmark artifact for those tasks. That is a failure of the harness output contract rather than merely a weakness in indicator computation. It also explains why the 122B run spent fewer tokens: many tasks ended before producing a valid artifact, so lower token usage here should not be read as better efficiency.

qwen3.5-35b-a3b is a more useful Skill-evolution target. Many failed tasks created a file and missed specific verifier predicates. Those failures expose actionable evidence, such as wrong format, incomplete metrics, or one missing condition, so posterior-weighted Skill patches have something concrete to learn from.
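
The distinction between the two failure shapes can be expressed as a small classifier over per-task records. The record layout here (`passed`, `file_created`, `predicates`) is a hypothetical schema chosen for illustration; the actual structure of results.json is not shown in this note. The predicate names in the first example are the ones reported for task_38_feature_engineering in the Case Notes.

```python
from typing import Dict, List


def classify_attempt(record: Dict) -> str:
    """Return 'pass', 'no_output_file', or 'invalid_details' for one task record."""
    if record.get("passed"):
        return "pass"
    if not record.get("file_created"):
        return "no_output_file"
    return "invalid_details"


def failed_predicates(record: Dict) -> List[str]:
    """Names of verifier predicates that were evaluated and came back false."""
    return sorted(name for name, ok in record.get("predicates", {}).items() if not ok)


# Hypothetical records: one with a file but failed predicates, one with no file.
with_file = {
    "passed": False,
    "file_created": 1,
    "predicates": {"feature_count_valid": False, "signal_present": False},
}
no_file = {"passed": False, "file_created": 0, "predicates": {}}

for record in (with_file, no_file):
    print(classify_attempt(record), failed_predicates(record))
```

Only the first record yields predicate names that a Skill patch could target; the second yields an empty list.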

Case Notes

task_30_composite_scoring shows that scaling is not monotonic across models. Qwen 35B-A3B and DeepSeek V4 Flash passed it, while DeepSeek V4 Pro and Qwen 122B-A10B failed. The 122B run did not create the file; Pro created no valid artifact for this task in the stored run.

task_38_feature_engineering shows the opposite pattern. Qwen 122B-A10B and DeepSeek V4 Pro passed, while Qwen 35B-A3B created a file but missed feature_count_valid and signal_present. Larger models can still help on feature-design-heavy tasks, but the benefit only appears when the backend successfully preserves the file-writing contract.

task_22_bullish_sandwich and task_23_gentle_volume_rise failed for all four models. For Qwen 35B-A3B, DeepSeek Flash, and DeepSeek Pro, the output files were created but missed a specific technical predicate such as MA trend, valid return, or turnover validity. These are good candidates for targeted Skill rewrite rules. For Qwen 122B-A10B, the same tasks produced no output file, so the immediate fix should be adapter/prompt-contract stabilization before domain-specific Skill evolution.

task_39_interest_rate_sector_rotation and task_40_commodity_equity_linkage failed for all four models with no valid output file in the compared runs. These failures likely need stronger data access/tooling guidance in the harness, not only larger model size.
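
The case notes above amount to a small task-by-model outcome matrix. The sketch below encodes only the outcomes stated in this section (the DeepSeek V4 Flash result on task_38 is not stated, so it is left out) and separates tasks that every model failed from tasks where models disagree. The data layout and function names are illustrative.

```python
MODELS = (
    "qwen3.5-35b-a3b",
    "qwen3.5-122b-a10b",
    "deepseek-v4-flash",
    "deepseek-v4-pro",
)


def all_failed() -> dict:
    return dict.fromkeys(MODELS, False)


# True = passed, False = failed; a missing model means the note does not say.
outcomes = {
    "task_22_bullish_sandwich": all_failed(),
    "task_23_gentle_volume_rise": all_failed(),
    "task_30_composite_scoring": {
        "qwen3.5-35b-a3b": True,
        "deepseek-v4-flash": True,
        "deepseek-v4-pro": False,
        "qwen3.5-122b-a10b": False,
    },
    "task_38_feature_engineering": {
        "qwen3.5-122b-a10b": True,
        "deepseek-v4-pro": True,
        "qwen3.5-35b-a3b": False,
    },
    "task_39_interest_rate_sector_rotation": all_failed(),
    "task_40_commodity_equity_linkage": all_failed(),
}


def failed_by_all(table: dict) -> list:
    """Tasks with a recorded failure for every model."""
    return sorted(t for t, r in table.items() if all(r.get(m) is False for m in MODELS))


def split_tasks(table: dict) -> list:
    """Tasks where at least one model passed and at least one failed."""
    return sorted(t for t, r in table.items() if True in r.values() and False in r.values())


print("failed by all:", failed_by_all(outcomes))
print("split:", split_tasks(outcomes))
```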

Takeaways

Within the DeepSeek family, the larger Pro model raises RealFin accuracy in bayesian_full mode from 55.0% (22/40) to 70.0% (28/40) while using slightly fewer total tokens (6.73M versus 6.96M). That is the cleanest size-scaling signal in this set.

Across providers and model families, the result is not monotonic. Qwen 35B-A3B outperforms Qwen 122B-A10B by a wide margin under the current mini-swe-agent adapter, and DeepSeek Flash also outperforms Qwen 122B-A10B. This suggests that Bayesian-Agent performance depends on the full stack:

  • base model capability,
  • backend tool behavior,
  • file-writing contract adherence,
  • benchmark data access,
  • Skill evolution evidence quality.

For Bayesian Skill evolution, a model that reliably produces imperfect but verifiable artifacts may be more valuable than a larger model that frequently fails to emit the required artifact. Posterior updates and Skill rewrites need observations; no-output failures provide much weaker learning signal than fine-grained verifier failures.
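
To make the point about observations concrete, here is a toy Beta-Bernoulli sketch. It is not Bayesian-Agent's actual update rule, which this note does not describe; the class, the uniform Beta(1, 1) prior, and the per-predicate bookkeeping are all assumptions for illustration. An attempt that produced a gradable file contributes one observation per verifier predicate, while a no-output attempt contributes none.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class BetaPosterior:
    """Beta(alpha, beta) belief that a predicate passes; starts at uniform Beta(1, 1)."""
    alpha: float = 1.0
    beta: float = 1.0

    def update(self, passed: bool) -> None:
        if passed:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


def update_from_attempt(
    posteriors: Dict[str, BetaPosterior],
    predicates: Optional[Dict[str, bool]],
) -> int:
    """Apply one attempt's predicate results; return how many observations it gave."""
    if not predicates:  # no output file, so nothing was graded
        return 0
    for name, passed in predicates.items():
        posteriors.setdefault(name, BetaPosterior()).update(passed)
    return len(predicates)


posteriors: Dict[str, BetaPosterior] = {}
# File created but two predicates missed (names from the task_38 case note).
graded = update_from_attempt(
    posteriors, {"feature_count_valid": False, "signal_present": False}
)
# No output file: no predicate-level evidence at all.
ungraded = update_from_attempt(posteriors, None)
print(graded, ungraded, {k: round(v.mean, 3) for k, v in posteriors.items()})
```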

Protocol Risks

  • Single run per model; no variance estimate.
  • Cross-provider comparison mixes model size with serving stack, tokenizer, decoding defaults, and adapter behavior.
  • Qwen size labels give the total and active parameter counts as encoded in the model identifiers; DeepSeek size labels are user-provided.
  • RealFin-Bench depends on cached market-data access and file-based grading, so failures may reflect harness/data behavior as well as model reasoning.
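
With a single run per model, one partial stand-in for a variance estimate is a binomial interval over the 40 tasks. The sketch below computes a 95% Wilson score interval for each reported success count. This is an added illustration, not part of the original protocol, and it treats tasks as independent trials, which is an assumption. It captures task-sampling uncertainty only; it says nothing about run-to-run variation from decoding or the serving stack.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half


# Success counts out of 40 tasks, from the Results table.
successes_by_model = {
    "qwen3.5-35b-a3b": 18,
    "qwen3.5-122b-a10b": 3,
    "deepseek-v4-flash": 22,
    "deepseek-v4-pro": 28,
}

for model, k in successes_by_model.items():
    low, high = wilson_interval(k, 40)
    print(f"{model}: {k}/40, 95% interval {low:.3f} to {high:.3f}")
```

Comparing the printed intervals gives a rough sense of which accuracy gaps exceed task-sampling noise at this benchmark size.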