Bayesian vs Frequentist on RealFin-Bench

This note compares Bayesian Skill evolution with a frequentist Skill-evolving control on RealFin-Bench. The goal is not to claim that Bayesian updating is better in every run. The narrower claim supported here is:

When agent runs are expensive, samples are limited, and failures are sparse but informative, a Bayesian evolution layer gives a more stable way to use prior Skill belief, uncertainty, and failure-mode evidence than a pure empirical-frequency update.

RealFin-Bench is a useful stress test for this claim because each task requires multi-step market-data inspection, strict output contracts, and robust handling of missing or blank fields. A single failed case is expensive in tokens and latency, so the method cannot wait for large-sample frequency estimates to stabilize.

Setup

All rows below are full RealFin-Bench runs with 40 tasks. Efficiency is the benchmark runner's success-per-million-token score: successful tasks divided by total tokens in millions.

| Field | Value |
| --- | --- |
| Benchmark | RealFin-Bench |
| Task count | 40 |
| Backends | GenericAgent compatibility backend; Bayesian-Agent native backend |
| Models | deepseek-v4-flash; deepseek-v4-pro |
| Bayesian method | Categorical Bayesian Skill evolution |
| Frequentist control | Empirical success-rate Skill evolution, no prior, no smoothing |
| Mode | Full self-evolving run from scratch |

Evidence types:

  • preexisting_artifact: already available local result artifact.
  • newly_run: produced in the 2026-06-11 rerun for the frequentist native backend comparison.

Main Results

| Backend | Model | Method | Accuracy | Success | Input Tokens | Output Tokens | Total Tokens | Efficiency | Elapsed | Evidence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GA | deepseek-v4-flash | Bayesian | 52.5% | 21/40 | 2.90M | 244k | 3.15M | 6.67 | 2318s | preexisting_artifact |
| GA | deepseek-v4-flash | Frequentist | 52.5% | 21/40 | 3.06M | 244k | 3.31M | 6.35 | 2958s | preexisting_artifact |
| GA | deepseek-v4-pro | Bayesian | 65.0% | 26/40 | 3.38M | 323k | 3.70M | 7.02 | 5756s | preexisting_artifact |
| GA | deepseek-v4-pro | Frequentist | 57.5% | 23/40 | 3.46M | 309k | 3.77M | 6.10 | 6338s | preexisting_artifact |
| BA native | deepseek-v4-flash | Bayesian | 70.0% | 28/40 | 10.43M | 463k | 10.89M | 2.57 | 4485s | preexisting_artifact |
| BA native | deepseek-v4-flash | Frequentist | 37.5% | 15/40 | 3.79M | 236k | 4.03M | 3.72 | 3698s | newly_run |
| BA native | deepseek-v4-pro | Bayesian | 70.0% | 28/40 | 9.33M | 579k | 9.91M | 2.83 | 10573s | preexisting_artifact |
| BA native | deepseek-v4-pro | Frequentist | 45.0% | 18/40 | 3.86M | 299k | 4.16M | 4.32 | 5684s | newly_run |

What The Table Shows

On the GA backend, Bayesian is never worse than the frequentist control in aggregate accuracy, although individual tasks can go either way. With deepseek-v4-flash, both methods solve 21 of 40 tasks, but Bayesian uses fewer tokens and finishes faster. With deepseek-v4-pro, Bayesian solves 26 tasks while the frequentist control solves 23.

On the BA native backend, the gap is much larger. Bayesian solves 28 of 40 tasks for both models. The frequentist control solves only 15 tasks with deepseek-v4-flash and 18 tasks with deepseek-v4-pro.

The native frequentist runs spend far fewer tokens, so their token efficiency can look higher. That number is misleading if read alone: the lower token count often comes from early failures such as missing requested output files or crashes on blank OHLCV fields. For RealFin, task completion is the primary metric; token efficiency is meaningful only after accuracy is held roughly constant.
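The relationship between completion and token efficiency can be checked with a short calculation. The sketch below assumes the efficiency score is successful tasks divided by total tokens in millions, which reproduces the table's values up to rounding; the function names are illustrative and not part of the benchmark runner.

```python
# Rounded values copied from the Main Results table:
# (backend, model, method, successes, total tokens in millions)
ROWS = [
    ("GA", "deepseek-v4-flash", "Bayesian", 21, 3.15),
    ("GA", "deepseek-v4-flash", "Frequentist", 21, 3.31),
    ("BA native", "deepseek-v4-flash", "Bayesian", 28, 10.89),
    ("BA native", "deepseek-v4-flash", "Frequentist", 15, 4.03),
]
TASK_COUNT = 40


def accuracy_pct(successes, task_count=TASK_COUNT):
    """Share of tasks solved, in percent."""
    return 100.0 * successes / task_count


def efficiency(successes, total_tokens_millions):
    """Assumed definition: successful tasks per million tokens."""
    return successes / total_tokens_millions


for backend, model, method, successes, tokens_m in ROWS:
    print(
        f"{backend:9s} {model:18s} {method:11s} "
        f"accuracy={accuracy_pct(successes):5.1f}% "
        f"efficiency={efficiency(successes, tokens_m):.2f}"
    )
```

On the BA native flash rows, the frequentist run scores higher on efficiency while solving 13 fewer tasks, which is why the score is only comparable once accuracy is similar.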

Pairwise Deltas

| Backend | Model | Bayesian Success | Frequentist Success | Accuracy Delta | Token Delta | Interpretation |
| --- | --- | --- | --- | --- | --- | --- |
| GA | deepseek-v4-flash | 21/40 | 21/40 | 0.0 pts | Bayesian uses 161k fewer tokens | Accuracy tie; Bayesian wins efficiency. |
| GA | deepseek-v4-pro | 26/40 | 23/40 | +7.5 pts | Bayesian uses 68k fewer tokens | Bayesian wins both completion and token cost. |
| BA native | deepseek-v4-flash | 28/40 | 15/40 | +32.5 pts | Bayesian uses 6.87M more tokens | Bayesian spends more exploration budget but solves far more tasks. |
| BA native | deepseek-v4-pro | 28/40 | 18/40 | +25.0 pts | Bayesian uses 5.75M more tokens | Bayesian again wins completion; frequentist is cheaper but much less reliable. |
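The deltas can be recomputed from the success counts and total token figures. This is a minimal sketch using the rounded totals from the Main Results table, so token deltas may differ from the table in the last digit (for example 6.86M here versus 6.87M above); the `PAIRS` structure and function names are illustrative.

```python
TASK_COUNT = 40

# (backend, model, Bayesian successes, frequentist successes,
#  Bayesian total tokens in millions, frequentist total tokens in millions)
PAIRS = [
    ("GA", "deepseek-v4-flash", 21, 21, 3.15, 3.31),
    ("GA", "deepseek-v4-pro", 26, 23, 3.70, 3.77),
    ("BA native", "deepseek-v4-flash", 28, 15, 10.89, 4.03),
    ("BA native", "deepseek-v4-pro", 28, 18, 9.91, 4.16),
]


def accuracy_delta_pts(bayes_ok, freq_ok, task_count=TASK_COUNT):
    """Bayesian minus frequentist accuracy, in percentage points."""
    return 100.0 * (bayes_ok - freq_ok) / task_count


def token_delta_millions(bayes_tokens_m, freq_tokens_m):
    """Bayesian minus frequentist total tokens, in millions (rounded inputs)."""
    return round(bayes_tokens_m - freq_tokens_m, 2)


for backend, model, b_ok, f_ok, b_tok, f_tok in PAIRS:
    print(
        f"{backend:9s} {model:18s} "
        f"accuracy delta={accuracy_delta_pts(b_ok, f_ok):+5.1f} pts "
        f"token delta={token_delta_millions(b_tok, f_tok):+.2f}M"
    )
```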

Case Analysis

Case 1: Frequentist Overreacts To Sparse Native Evidence

In the BA native deepseek-v4-flash run, Bayesian succeeds on several tasks where the frequentist control fails with either blank-field crashes or missing output files.

| Task | Bayesian | Frequentist | Frequentist failure mode |
| --- | --- | --- | --- |
| task_04_consecutive_rise | success | fail | blank_ohlcv_field_crashed_calculation |
| task_05_triple_golden_cross | success | fail | blank_ohlcv_field_crashed_calculation |
| task_06_bollinger_squeeze | success | fail | blank_ohlcv_field_crashed_calculation |
| task_12_catl_correlation_kdj | success | fail | missing_requested_output_file |
| task_18_momentum_portfolio | success | fail | missing_requested_output_file |
| task_29_momentum_reversal_combo | success | fail | blank_ohlcv_field_crashed_calculation |
| task_30_composite_scoring | success | fail | missing_requested_output_file |
| task_31_triple_timeframe_macd | success | fail | missing_requested_output_file |
| task_36_intraday_anomaly | success | fail | missing_requested_output_file |
| task_38_feature_engineering | success | fail | blank_ohlcv_field_crashed_calculation |

This is the most important qualitative pattern in the native comparison. The failure is not merely "the answer was numerically off"; many frequentist failures happen before the benchmark can even evaluate the requested artifact. Bayesian's advantage here is consistent with a posterior-driven Skill layer that preserves robust guardrails for output contracts, blank-field handling, and evidence extraction, instead of relying only on the empirical success ratio seen so far.
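The split between the two failure modes in the table above can be tallied directly. This is a small sketch over the ten listed tasks only; the dictionary is transcribed by hand from the table and is not the schema of results.json, which is not shown here.

```python
from collections import Counter

# Frequentist failure modes for the ten tasks listed in Case 1
# (BA native, deepseek-v4-flash), transcribed from the table.
FREQUENTIST_FAILURES = {
    "task_04_consecutive_rise": "blank_ohlcv_field_crashed_calculation",
    "task_05_triple_golden_cross": "blank_ohlcv_field_crashed_calculation",
    "task_06_bollinger_squeeze": "blank_ohlcv_field_crashed_calculation",
    "task_12_catl_correlation_kdj": "missing_requested_output_file",
    "task_18_momentum_portfolio": "missing_requested_output_file",
    "task_29_momentum_reversal_combo": "blank_ohlcv_field_crashed_calculation",
    "task_30_composite_scoring": "missing_requested_output_file",
    "task_31_triple_timeframe_macd": "missing_requested_output_file",
    "task_36_intraday_anomaly": "missing_requested_output_file",
    "task_38_feature_engineering": "blank_ohlcv_field_crashed_calculation",
}

# Count how many listed tasks fall under each failure mode.
mode_counts = Counter(FREQUENTIST_FAILURES.values())

for mode, count in mode_counts.most_common():
    print(f"{mode}: {count}")
```

Both modes account for five of the ten listed tasks, and in both the run ends before a gradable artifact exists.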

Case 2: Bayesian Is Not Just Spending More Tokens Blindly

The BA native deepseek-v4-flash Bayesian run spends 10.89M tokens, while the frequentist run spends 4.03M. More token use alone would not be impressive if it only produced marginal gains. Here the completion gap is large: 28/40 vs 15/40.

Several high-complexity tasks show this pattern:

| Task | Bayesian scores | Frequentist scores |
| --- | --- | --- |
| task_29_momentum_reversal_combo | all six checks pass | all six checks fail after blank-field crash |
| task_30_composite_scoring | all six checks pass | output file missing; all checks fail |
| task_31_triple_timeframe_macd | all five checks pass | output file missing; all checks fail |
| task_38_feature_engineering | all six checks pass | blank-field crash; all checks fail |

These tasks require combining multiple indicators, maintaining output format discipline, and handling dataset irregularities. The Bayesian run spends more context and tool budget, but it converts that budget into completed artifacts.

Case 3: GA Flash Shows A Tie, Not A Universal Win

On GA with deepseek-v4-flash, both methods finish at 21/40. The task-level swaps are balanced:

| Direction | Tasks |
| --- | --- |
| Bayesian succeeds, frequentist fails | task_10_macd_histogram_trend, task_12_catl_correlation_kdj, task_32_weekly_breakout_daily_pullback, task_36_intraday_anomaly |
| Frequentist succeeds, Bayesian fails | task_17_fund_flow_obv, task_19_52week_high_followthrough, task_27_atr_volatility_breakout, task_38_feature_engineering |

This is a useful negative-control-style result. Bayesian is not automatically better on every model/backend pairing. In this setting, the better claim is efficiency: same completion rate, fewer tokens, and shorter elapsed time.
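Task-level swaps like these can be derived from per-task outcomes with two filtered comparisons. The sketch below assumes each run has been reduced to a hypothetical task-to-boolean mapping (the actual results.json layout is not shown in this note) and includes only the eight tasks on which the two GA deepseek-v4-flash runs disagree.

```python
def swap_sets(bayesian, frequentist):
    """Return (Bayesian-only successes, frequentist-only successes).

    Both arguments map task name -> True if the run solved the task.
    Only tasks present in both mappings are compared.
    """
    shared = bayesian.keys() & frequentist.keys()
    bayes_only = sorted(t for t in shared if bayesian[t] and not frequentist[t])
    freq_only = sorted(t for t in shared if frequentist[t] and not bayesian[t])
    return bayes_only, freq_only


# Hypothetical per-task outcomes limited to the tasks listed in Case 3.
BAYES_WINS = [
    "task_10_macd_histogram_trend",
    "task_12_catl_correlation_kdj",
    "task_32_weekly_breakout_daily_pullback",
    "task_36_intraday_anomaly",
]
FREQ_WINS = [
    "task_17_fund_flow_obv",
    "task_19_52week_high_followthrough",
    "task_27_atr_volatility_breakout",
    "task_38_feature_engineering",
]
bayesian_run = {**{t: True for t in BAYES_WINS}, **{t: False for t in FREQ_WINS}}
frequentist_run = {**{t: False for t in BAYES_WINS}, **{t: True for t in FREQ_WINS}}

bayes_only, freq_only = swap_sets(bayesian_run, frequentist_run)
net_swing = len(bayes_only) - len(freq_only)
print(f"Bayesian-only: {len(bayes_only)}, frequentist-only: {len(freq_only)}, net: {net_swing}")
```

A net swing of zero with four swaps in each direction matches the 21/40 tie while showing that the two runs do not solve the same tasks.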

Case 4: GA Pro Shows Bayesian's Advantage With A Stronger Model

On GA with deepseek-v4-pro, Bayesian solves 26 tasks and the frequentist control solves 23. The Bayesian-only successes are:

task_01_macd_rsi_filter
task_24_doji_to_surge
task_27_atr_volatility_breakout
task_31_triple_timeframe_macd
task_32_weekly_breakout_daily_pullback
task_38_feature_engineering

The frequentist-only successes are:

task_12_catl_correlation_kdj
task_25_golden_valley
task_37_floor_ceiling_reversal

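On the GA backend, this suggests the Bayesian layer benefits more when the backend model is capable enough to use the evolved Skill context: the accuracy gap is 0.0 points with deepseek-v4-flash and +7.5 points with deepseek-v4-pro. The BA native runs do not show the same ordering (+32.5 points with flash, +25.0 points with pro), so this reading is tentative. The posterior evidence does not replace model capability; it conditions a capable model toward more reliable task behavior.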

Case 5: Frequentist Still Has Local Wins

The BA native deepseek-v4-pro frequentist run succeeds on task_24_doji_to_surge, while the Bayesian run fails with invalid_realfin_output_format. This matters because it prevents an overly clean story.

The evidence supports a practical claim, not a dogma: Bayesian evolution is more robust overall in these runs, but single-case outcomes can still favor the frequentist control because LLM execution is stochastic, task difficulty varies, and current policies are still heuristic.

Why This Supports The Bayesian Claim

The frequentist control estimates Skill reliability from observed frequencies. In small samples, this is brittle:

p_hat(success | skill, context) = successes / observations

Bayesian evolution keeps a belief state instead:

P(skill quality | evidence) proportional to P(evidence | skill quality) P(skill quality)

In this implementation, that belief is operationalized through posterior-weighted Skill selection, failure-mode accumulation, and conservative rewrite/patch activation. The important distinction is not decorative math. It changes the control behavior:

| Issue | Frequentist control | Bayesian evolution |
| --- | --- | --- |
| Very few observations | Empirical rate can swing sharply. | Prior and posterior uncertainty keep updates conservative. |
| Repeated failure modes | Treated mostly as counts. | Become evidence for targeted Skill patches. |
| Expensive cases | Needs more observations to stabilize. | Can act under uncertainty with fewer samples. |
| Cross-harness transfer | Frequency estimates are tied to observed local history. | Belief state can be conditioned by benchmark, model, harness, and failure metadata. |

RealFin demonstrates this difference because each task is costly, and failures such as missing output files or blank OHLCV crashes are sparse but highly diagnostic. Bayesian evolution can preserve and reuse those diagnostics as Skill evidence; a pure frequentist controller has less structure for deciding how much to trust or generalize early observations.
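The small-sample contrast can be made concrete with a Dirichlet-categorical update over outcome categories. This is a minimal sketch, not the project's implementation: the prior pseudo-counts, the three-category outcome set, and the function names are assumptions chosen for illustration. The control's estimator follows its description above (no prior, no smoothing).

```python
from collections import Counter

# Outcome categories: success plus the two failure modes named in this note.
OUTCOMES = (
    "success",
    "blank_ohlcv_field_crashed_calculation",
    "missing_requested_output_file",
)

# Hypothetical Dirichlet prior pseudo-counts (an assumption, not a tuned value).
PRIOR = {
    "success": 2.0,
    "blank_ohlcv_field_crashed_calculation": 1.0,
    "missing_requested_output_file": 1.0,
}


def empirical_rates(observations):
    """Frequentist control: raw frequencies, undefined with no data."""
    n = len(observations)
    if n == 0:
        return None
    counts = Counter(observations)
    return {o: counts[o] / n for o in OUTCOMES}


def posterior_mean(observations, prior=PRIOR):
    """Dirichlet posterior mean: (prior count + observed count) / total.

    Assumes every observation is one of OUTCOMES.
    """
    counts = Counter(observations)
    total = sum(prior.values()) + len(observations)
    return {o: (prior[o] + counts[o]) / total for o in OUTCOMES}


# A single observed failure for one Skill in one context.
observed = ["missing_requested_output_file"]
print("empirical:", empirical_rates(observed))
print("posterior:", posterior_mean(observed))
```

After one failure the empirical success rate drops to 0, while the posterior mean under this prior moves from 0.5 to 0.4 and still records which failure mode occurred. How strongly the prior resists early evidence depends entirely on the pseudo-counts chosen.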

Caveats

These results should be read as local experimental evidence, not as a universal theorem.

  • Each row is a single full run, not a repeated-seed average.
  • Some artifacts are from earlier local runs, while the native frequentist rows were newly run on 2026-06-11.
  • The native Bayesian rows spend substantially more tokens, so they should be interpreted as stronger completion performance rather than cheaper execution.
  • RealFin is data-heavy and output-contract-sensitive; results may differ on short text-only benchmarks.
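One way to see how much a single 40-task run can support is to put an interval around each success rate. The sketch below uses the Wilson score interval at roughly 95% coverage (z = 1.96); it treats the 40 tasks as independent trials, which is an assumption, and comparing intervals for overlap is only a rough check, not a formal test.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


def overlaps(a, b):
    """True if two (low, high) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Success counts from the Main Results table, 40 tasks per run.
ga_pro_bayes = wilson_interval(26, 40)
ga_pro_freq = wilson_interval(23, 40)
native_flash_bayes = wilson_interval(28, 40)
native_flash_freq = wilson_interval(15, 40)

print("GA pro intervals overlap:", overlaps(ga_pro_bayes, ga_pro_freq))
print("BA native flash intervals overlap:", overlaps(native_flash_bayes, native_flash_freq))
```

Under these assumptions the GA deepseek-v4-pro intervals overlap, while the BA native deepseek-v4-flash intervals do not, which is consistent with treating the native gap as the stronger signal and the GA gap as suggestive.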

Artifact Inventory

| Backend | Model | Method | Evidence Type | Artifact |
| --- | --- | --- | --- | --- |
| GA | deepseek-v4-flash | Bayesian | preexisting_artifact | results/realfin_deepseek_v4_flash_full_20260602/bayesian_full/results.json |
| GA | deepseek-v4-flash | Frequentist | preexisting_artifact | results/frequentist_ga_deepseek_v4_flash_20260611/realfin/bayesian_full/results.json |
| GA | deepseek-v4-pro | Bayesian | preexisting_artifact | results/realfin_deepseek_v4_pro_20260602/bayesian_full/results.json |
| GA | deepseek-v4-pro | Frequentist | preexisting_artifact | results/frequentist_ga_deepseek_v4_pro_realfin_20260611/bayesian_full/results.json |
| BA native | deepseek-v4-flash | Bayesian | preexisting_artifact | results/native_harness_deepseek_v4_flash_full/realfin/bayesian_full/results.json |
| BA native | deepseek-v4-flash | Frequentist | newly_run | results/native_backend_frequentist_realfin_full_rerun_20260611_203737/deepseek_v4_flash/bayesian_full/results.json |
| BA native | deepseek-v4-pro | Bayesian | preexisting_artifact | results/native_harness_deepseek_v4_pro_full/realfin_retry/bayesian_full/results.json |
| BA native | deepseek-v4-pro | Frequentist | newly_run | results/native_backend_frequentist_realfin_full_rerun_20260611_203737/deepseek_v4_pro/bayesian_full/results.json |