Experiments

The first prototype was validated inside GenericAgent with deepseek-v4-flash. Bayesian-Agent (BA) now also includes a first-party native harness, so experiments can run either inside BA itself or through an external compatibility backend.

These experiments are meant to show three deployment paths: Bayesian-Agent can run a minimal native harness, run a full self-evolving loop from scratch, and attach to an existing agent as an incremental repair layer.

Running Benchmarks

The repository includes one model-agnostic script for SOP-Bench, Lifelong AgentBench, and RealFin-Bench. By default it uses the first-party BA native harness. Select the benchmark with --bench core, --bench sop, --bench lifelong, or --bench realfin.

cd Bayesian-Agent
export DEEPSEEK_API_KEY="sk-..."
export MODEL="deepseek-v4-flash"
python \
  experiments/run_benchmarks.py \
  --harness bayesian-agent \
  --model "$MODEL" \
  --mode all \
  --bench core

External backends remain available:

--harness genericagent
--harness mini-swe-agent
--harness claude-code

--bench core selects SOP-Bench and Lifelong AgentBench together, but it does not write a combined result root. It fans out to results/sop_${MODEL//-/_} and results/lifelong_${MODEL//-/_}. If you pass --out-root temp/core_${MODEL//-/_}, that path is treated as a parent and the benchmark roots become temp/core_${MODEL//-/_}/sop and temp/core_${MODEL//-/_}/lifelong.
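
The fan-out described above can be sketched in Python. This is a minimal illustration, not the repository's code: the helper name `core_result_roots` and the `CORE_BENCHES` constant are hypothetical, and only the path patterns are taken from the paragraph above.

```python
from pathlib import Path
from typing import Optional

# Hypothetical helper, not the repository's code: it mirrors the fan-out
# described above for --bench core.
CORE_BENCHES = ("sop", "lifelong")


def core_result_roots(model: str, out_root: Optional[str] = None) -> dict[str, Path]:
    """Return the per-benchmark result root for each benchmark in --bench core."""
    slug = model.replace("-", "_")  # same effect as ${MODEL//-/_} in bash
    if out_root is None:
        # Default: sibling roots under results/, one per benchmark.
        return {bench: Path("results") / f"{bench}_{slug}" for bench in CORE_BENCHES}
    # With --out-root, the given path is treated as a parent directory.
    return {bench: Path(out_root) / bench for bench in CORE_BENCHES}


if __name__ == "__main__":
    print(core_result_roots("deepseek-v4-flash"))
    print(core_result_roots("deepseek-v4-flash", "temp/core_deepseek_v4_flash"))
```

Either way there are two separate roots, so any downstream tooling should expect one result directory per benchmark rather than a combined one.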

The default, --mode all, runs three modes:

baseline: selected harness without Bayesian Skill context.
bayesian_full: Bayesian full self-evolution from scratch.
bayesian_incremental: Bayesian repair using the fresh baseline and rerunning only failed tasks.

Bayesian modes now persist per-task Skill evolution artifacts under:

<run-root>/skill_evolution/
  index.json
  sop_bench/
    sop_01/
      skill_context_before.md
      skill_context_after.md
      posterior_context_before.md
      posterior_context_after.md
      belief_before.json
      belief_after.json
      snapshot_before.json
      snapshot_after.json

skill_context_before.md is the exact model-facing Skill/SOP text injected into that task. For the built-in benchmarks, it contains stable benchmark guardrails and any active Bayesian Failure-Mode Patches. A patch becomes active only after the same failure mode has at least two verified occurrences, so single failures stay audit-only. skill_context_after.md is the next model-facing Skill/SOP text after verifier feedback is recorded.
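
The activation rule can be illustrated with a small sketch. It assumes each verified failure is reported as a plain failure-mode label; the labels and the function name `split_patches` are hypothetical, and only the threshold of two verified occurrences comes from the text above.

```python
from collections import Counter
from typing import Iterable

# Assumption: each verified failure is reported as a failure-mode label.
# Only the threshold of two verified occurrences comes from the text above.
ACTIVATION_THRESHOLD = 2


def split_patches(verified_failures: Iterable[str]) -> tuple[list[str], list[str]]:
    """Return (active, audit_only) failure modes, each sorted by name."""
    counts = Counter(verified_failures)
    active = sorted(m for m, n in counts.items() if n >= ACTIVATION_THRESHOLD)
    audit_only = sorted(m for m, n in counts.items() if n < ACTIVATION_THRESHOLD)
    return active, audit_only


active, audit_only = split_patches(
    ["missing_output_file", "wrong_date_format", "missing_output_file"]
)
print(active)      # ['missing_output_file']
print(audit_only)  # ['wrong_date_format']
```

Under this reading, a one-off failure is still recorded for audit but never changes the model-facing text.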

posterior_context_before.md and posterior_context_after.md are audit artifacts for the Bayesian belief state. They may include posterior summaries such as posterior_success, alpha, beta, observations, and rewrite decisions, but those numeric summaries are not injected into the benchmark prompt.
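
The field names alpha and beta suggest a Beta posterior over a success probability. The sketch below assumes exactly that: a Beta-Bernoulli update with posterior_success taken as the posterior mean. The `Belief` class, the uniform Beta(1, 1) prior, and the update rule are illustrative assumptions, not the project's confirmed implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Belief:
    # Assumption: alpha and beta are Beta-distribution pseudo-counts.
    # The uniform Beta(1, 1) prior is illustrative, not taken from the project.
    alpha: float = 1.0
    beta: float = 1.0
    observations: int = 0

    @property
    def posterior_success(self) -> float:
        """Posterior mean of the success probability."""
        return self.alpha / (self.alpha + self.beta)

    def update(self, success: bool) -> "Belief":
        """Return a new belief after one verified success or failure."""
        if success:
            return Belief(self.alpha + 1, self.beta, self.observations + 1)
        return Belief(self.alpha, self.beta + 1, self.observations + 1)


belief = Belief().update(True).update(True).update(False)
print(belief.alpha, belief.beta, belief.observations)  # 3.0 2.0 3
print(round(belief.posterior_success, 3))              # 0.6
```

Keeping these numbers out of the prompt, as the text describes, means the model sees only the resulting Skill/SOP wording, while the audit files keep the evidence behind each rewrite decision.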

For older result directories that only contain results.json, rebuild the Skill evolution trail without rerunning the model:

bayesian-agent replay-skill-artifacts \
  --results results/sop_deepseek_v4_flash/bayesian_full/results.json
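
After a run or a replay, a quick completeness check of the artifact trail can be useful. The sketch below assumes the run root is the directory that contains skill_evolution/ and that task directories sit two levels below it, as in the layout shown earlier. The function name `missing_artifacts` is hypothetical; the file names are the ones listed above.

```python
from pathlib import Path

# File names come from the directory layout shown earlier in this section.
EXPECTED_FILES = [
    f"{stem}_{phase}.{ext}"
    for stem, ext in [
        ("skill_context", "md"),
        ("posterior_context", "md"),
        ("belief", "json"),
        ("snapshot", "json"),
    ]
    for phase in ("before", "after")
]


def missing_artifacts(run_root: str) -> dict[str, list[str]]:
    """Map each task directory under skill_evolution/ to its missing files."""
    evolution_root = Path(run_root) / "skill_evolution"
    report: dict[str, list[str]] = {}
    # Assumed layout: skill_evolution/<benchmark>/<task>/
    for task_dir in sorted(evolution_root.glob("*/*")):
        if not task_dir.is_dir():
            continue
        missing = [name for name in EXPECTED_FILES if not (task_dir / name).exists()]
        if missing:
            report[task_dir.relative_to(evolution_root).as_posix()] = missing
    return report
```

An empty dictionary means every task directory found contains all eight files; it does not validate their contents.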

Use --limit 1 for a smoke test before full runs. To switch to deepseek-v4-pro, set MODEL=deepseek-v4-pro; the script itself is the same. For RealFin-Bench, use the same entrypoint with --bench realfin.
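
For scripted runs, the same invocation can be assembled programmatically. This sketch uses only flags that appear in this section and assumes --limit combines with them as shown; the helper name `benchmark_command` is hypothetical, and the command is expected to run from the repository root.

```python
import shlex
from typing import Optional


def benchmark_command(model: str, bench: str, limit: Optional[int] = None) -> list[str]:
    """Build the argv for experiments/run_benchmarks.py with the native harness."""
    argv = [
        "python", "experiments/run_benchmarks.py",
        "--harness", "bayesian-agent",
        "--model", model,
        "--mode", "all",
        "--bench", bench,
    ]
    if limit is not None:
        # Assumption: --limit takes the number of tasks, as in "--limit 1".
        argv += ["--limit", str(limit)]
    return argv


# Smoke test on RealFin-Bench with deepseek-v4-pro, one task only.
print(shlex.join(benchmark_command("deepseek-v4-pro", "realfin", limit=1)))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues.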

To repair an existing GenericAgent (GA) baseline instead of using a fresh baseline from the same run, pass the baseline result files:

"$GENERICAGENT_ROOT/.venv/bin/python" \
  experiments/run_benchmarks.py \
  --harness genericagent \
  --genericagent-root "$GENERICAGENT_ROOT" \
  --model "$MODEL" \
  --mode bayesian-incremental \
  --bench core \
  --baseline-results artifacts/ga_deepseek_baseline/sop_results.json \
  --baseline-results artifacts/ga_deepseek_baseline/lifelong_results.json

Native Harness Full-Sample Results

These are local full-sample checks with the first-party BA native harness. SOP-Bench and Lifelong AgentBench contain 20 tasks each; RealFin-Bench contains 40 tasks.

For the focused RealFin comparison between Bayesian Skill evolution and the frequentist control, including task-level case analysis, see Bayesian vs Frequentist on RealFin-Bench.

For the focused model-size ablation on RealFin-Bench with the mini-swe-agent backend, see Model Scaling on RealFin-Bench.

Benchmark	Model	Mode	Score	Total Tokens	Evidence
SOP-Bench	deepseek-v4-flash	baseline	19/20 (95.0%)	1.05M	`results/native_harness_deepseek_v4_flash_full/sop`
SOP-Bench	deepseek-v4-flash	bayesian_full	20/20 (100.0%)	870k	`results/native_harness_deepseek_v4_flash_full/sop`
SOP-Bench	deepseek-v4-flash	bayesian_incremental	20/20 final, 1/1 repaired	45k incremental	`results/native_harness_deepseek_v4_flash_full/sop`
Lifelong AgentBench	deepseek-v4-flash	baseline	19/20 (95.0%)	538k	`results/native_harness_deepseek_v4_flash_full/lifelong`
Lifelong AgentBench	deepseek-v4-flash	bayesian_full	20/20 (100.0%)	514k	`results/native_harness_deepseek_v4_flash_full/lifelong`
Lifelong AgentBench	deepseek-v4-flash	bayesian_incremental	20/20 final, 1/1 repaired	65k incremental	`results/native_harness_deepseek_v4_flash_full/lifelong`
RealFin-Bench	deepseek-v4-flash	baseline	25/40 (62.5%)	10.29M	`results/native_harness_deepseek_v4_flash_full/realfin`
RealFin-Bench	deepseek-v4-flash	bayesian_full	28/40 (70.0%)	10.89M	`results/native_harness_deepseek_v4_flash_full/realfin`
RealFin-Bench	deepseek-v4-flash	bayesian_incremental	29/40 final, 4/15 repaired	3.76M incremental	`results/native_harness_deepseek_v4_flash_full/realfin`
SOP-Bench	deepseek-v4-pro	baseline	20/20 (100.0%)	744k	`results/native_harness_deepseek_v4_pro_full/sop`
SOP-Bench	deepseek-v4-pro	bayesian_full	20/20 (100.0%)	739k	`results/native_harness_deepseek_v4_pro_full/sop`
Lifelong AgentBench	deepseek-v4-pro	baseline	20/20 (100.0%)	422k	`results/native_harness_deepseek_v4_pro_full/lifelong`
Lifelong AgentBench	deepseek-v4-pro	bayesian_full	20/20 (100.0%)	437k	`results/native_harness_deepseek_v4_pro_full/lifelong`
RealFin-Bench	deepseek-v4-pro	baseline	26/40 (65.0%)	9.54M	`results/native_harness_deepseek_v4_pro_full/realfin_retry`
RealFin-Bench	deepseek-v4-pro	bayesian_full	28/40 (70.0%)	9.91M	`results/native_harness_deepseek_v4_pro_full/realfin_retry`
RealFin-Bench	deepseek-v4-pro	bayesian_incremental	31/40 final, 5/14 repaired	4.59M incremental	`results/native_harness_deepseek_v4_pro_full/realfin_retry`
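
The bayesian_incremental rows in the table above combine two numbers: tasks repaired out of baseline failures, and the final score. The arithmetic is sketched below, assuming that tasks the baseline already passed are kept as passes, since only failed tasks are rerun. The function name `incremental_final` is hypothetical.

```python
def incremental_final(baseline_passed: int, total: int, repaired: int) -> tuple[int, int, float]:
    """Return (failed_in_baseline, final_passed, final_accuracy_percent)."""
    failed = total - baseline_passed
    if not 0 <= repaired <= failed:
        raise ValueError("repaired must be between 0 and the number of baseline failures")
    final_passed = baseline_passed + repaired
    return failed, final_passed, 100.0 * final_passed / total


# Figures taken from the RealFin-Bench rows in the table above.
print(incremental_final(25, 40, 4))  # deepseek-v4-flash: (15, 29, 72.5)
print(incremental_final(26, 40, 5))  # deepseek-v4-pro:   (14, 31, 77.5)
```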

The native harness is intentionally simple: an LLM, workspace tools, trajectory capture, and three-layer memory. Its job is to execute and observe; capability improvements beyond that are delegated to Bayesian Skill/SOP evolution.

Published GA Validation

The earlier published validation used GenericAgent as the execution backend on larger benchmark slices.

Baseline

Benchmark	Agent	Model	Accuracy	Input Tokens	Output Tokens	Total Tokens	Efficiency
SOP-Bench	GA	deepseek-v4-flash	80%	1.34M	57k	1.39M	11.47
Lifelong AgentBench	GA	deepseek-v4-flash	90%	649k	42k	690k	26.07

Full Self-Evolving Mode

Benchmark	Agent	Model	Accuracy	Input Tokens	Output Tokens	Total Tokens	Efficiency
SOP-Bench	GA+Bayesian	deepseek-v4-flash	100%	1.07M	52k	1.12M	17.86
Lifelong AgentBench	GA+Bayesian	deepseek-v4-flash	95%	666k	44k	710k	26.77

Incremental Repair Mode

Bayesian-Agent read the GA baseline traces and reran only failed tasks.

Benchmark	Agent	Model	Final Accuracy	Incremental Input	Incremental Output	Incremental Total	Incremental Efficiency
SOP-Bench	GA+BayesianIncremental	deepseek-v4-flash	100%	254k	14k	268k	14.93
Lifelong AgentBench	GA+BayesianIncremental	deepseek-v4-flash	100%	129k	10k	139k	14.41
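
This section does not define the Efficiency columns. The reported values are, however, consistent with one reading: solved tasks (or repaired tasks, in incremental mode) per million total tokens, on 20-task runs. Both the formula and the task count are inferences from the numbers, not documented definitions. The sketch below reproduces the check from the rounded token totals in the three tables above.

```python
# Assumption: 20 tasks per benchmark, so 80% accuracy means 16 solved tasks.
# Each row: (label, tasks counted, total tokens in millions, reported Efficiency).
ROWS = [
    ("SOP-Bench baseline", 16, 1.39, 11.47),
    ("Lifelong baseline", 18, 0.690, 26.07),
    ("SOP-Bench full", 20, 1.12, 17.86),
    ("Lifelong full", 19, 0.710, 26.77),
    ("SOP-Bench incremental", 4, 0.268, 14.93),  # 4 repaired tasks (80% -> 100%)
    ("Lifelong incremental", 2, 0.139, 14.41),   # 2 repaired tasks (90% -> 100%)
]


def tasks_per_million_tokens(tasks: int, tokens_millions: float) -> float:
    """Inferred metric: tasks counted divided by total tokens in millions."""
    return tasks / tokens_millions


for label, tasks, tokens_m, reported in ROWS:
    estimate = tasks_per_million_tokens(tasks, tokens_m)
    print(f"{label}: estimated {estimate:.2f}, reported {reported:.2f}")
```

The estimates differ slightly from the reported figures because the token totals in the tables are rounded.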

Historical GA-Backed RealFin Run

The earlier RealFin validation used GenericAgent as the execution backend with deepseek-v4-pro.

Benchmark	Agent	Model	Accuracy	Total Tokens	Evidence
RealFin-Bench	GA	deepseek-v4-pro	60%	3.72M	`results/realfin_deepseek_v4_pro_20260602`
RealFin-Bench	GA+Bayesian	deepseek-v4-pro	65%	3.70M	`results/realfin_deepseek_v4_pro_20260602`
RealFin-Bench	GA+BayesianIncremental	deepseek-v4-pro	68%	1.72M incremental	`results/realfin_deepseek_v4_pro_20260602`

Compared with this historical GA-backed RealFin run, the BA native harness reaches 77.5% final accuracy (31/40 in bayesian_incremental mode) on deepseek-v4-pro, against 68% for GA+BayesianIncremental. It spends more tokens, however, because the minimal first-party harness lets the model inspect cached market data directly.

Published example artifacts are stored under artifacts/. New live runs write their own result and Skill evolution artifacts under each benchmark-specific result root.

The cross-harness path depends on the same evidence format: any agent framework that emits verified trajectories can feed the Bayesian Skill registry and receive model-facing Skill/SOP text through an adapter.
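
A minimal sketch of what such an adapter boundary could look like follows. Every name here (`VerifiedTrajectory`, `HarnessAdapter`, `SkillRegistry`, `run_with_registry`) and the trajectory schema are hypothetical and chosen for illustration; the project's actual interfaces may differ. The toy registry reuses the two-occurrence activation rule described earlier in this section.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class VerifiedTrajectory:
    """One task attempt with its verifier verdict (hypothetical schema)."""
    task_id: str
    success: bool
    failure_mode: Optional[str] = None


class HarnessAdapter(Protocol):
    """What a backend would need to expose (hypothetical interface)."""

    def run_task(self, task_id: str, skill_context: str) -> VerifiedTrajectory: ...


class SkillRegistry:
    """Toy in-memory registry: collects evidence, renders model-facing text."""

    def __init__(self) -> None:
        self._trajectories: list[VerifiedTrajectory] = []

    def record(self, trajectory: VerifiedTrajectory) -> None:
        self._trajectories.append(trajectory)

    def skill_context(self) -> str:
        # Only failure modes seen at least twice become model-facing text.
        modes = Counter(t.failure_mode for t in self._trajectories if t.failure_mode)
        active = sorted(m for m, n in modes.items() if n >= 2)
        return "\n".join(f"- Avoid: {m}" for m in active)


def run_with_registry(adapter: HarnessAdapter, registry: SkillRegistry, task_ids: list[str]) -> int:
    """Run tasks in order, feeding each verdict back; return the pass count."""
    passed = 0
    for task_id in task_ids:
        trajectory = adapter.run_task(task_id, registry.skill_context())
        registry.record(trajectory)
        passed += int(trajectory.success)
    return passed
```

Because `HarnessAdapter` is a structural `Protocol`, a backend only needs a matching `run_task` method; it does not have to inherit from anything in this sketch.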