Experiments
The first prototype was validated inside GenericAgent (GA) with deepseek-v4-flash. Bayesian-Agent (BA) now also includes a first-party native harness, so experiments can run either inside BA itself or through an external compatibility backend.
These experiments are meant to show three deployment paths: Bayesian-Agent can run a minimal native harness, run a full self-evolving loop from scratch, and attach to an existing agent as an incremental repair layer.
Running Benchmarks
The repository includes one model-agnostic script for SOP-Bench, Lifelong AgentBench, and RealFin-Bench. By default it uses the first-party BA native harness. Select the benchmark with --bench core, --bench sop, --bench lifelong, or --bench realfin.
cd Bayesian-Agent
export DEEPSEEK_API_KEY="sk-..."
export MODEL="deepseek-v4-flash"
python \
experiments/run_benchmarks.py \
--harness bayesian-agent \
--model "$MODEL" \
--mode all \
--bench core
External backends remain available through --harness, for example --harness genericagent.
--bench core selects SOP-Bench and Lifelong AgentBench together, but it does not write a combined result root. It fans out to results/sop_${MODEL//-/_} and results/lifelong_${MODEL//-/_}. If you pass --out-root temp/core_${MODEL//-/_}, that path is treated as a parent and the benchmark roots become temp/core_${MODEL//-/_}/sop and temp/core_${MODEL//-/_}/lifelong.
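The path rules above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the script's actual code: the function name `benchmark_roots` and its arguments are hypothetical.

```python
from pathlib import Path
from typing import Dict, Optional


def benchmark_roots(model: str, out_root: Optional[str] = None) -> Dict[str, Path]:
    """Hypothetical helper mirroring the --bench core output layout described above."""
    # Same effect as the bash expansion ${MODEL//-/_}: every "-" becomes "_".
    slug = model.replace("-", "_")
    if out_root is None:
        # No --out-root: each benchmark gets its own root under results/.
        return {
            "sop": Path("results") / f"sop_{slug}",
            "lifelong": Path("results") / f"lifelong_{slug}",
        }
    # With --out-root: the given path is a parent of the two benchmark roots.
    parent = Path(out_root)
    return {"sop": parent / "sop", "lifelong": parent / "lifelong"}


roots = benchmark_roots("deepseek-v4-flash")
print(roots["sop"])       # results/sop_deepseek_v4_flash (on POSIX)
print(roots["lifelong"])  # results/lifelong_deepseek_v4_flash (on POSIX)
```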
The default --mode all runs three modes:
- baseline: the selected harness without Bayesian Skill context.
- bayesian_full: Bayesian full self-evolution from scratch.
- bayesian_incremental: Bayesian repair that reuses the fresh baseline and reruns only the failed tasks.
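As a rough sketch of what "rerunning only failed tasks" means, the snippet below picks the rerun set from a list of baseline records. The record fields `task_id` and `passed` are assumptions made for illustration; the real results.json schema may differ.

```python
from typing import Dict, List


def tasks_to_rerun(baseline_results: List[Dict]) -> List[str]:
    """Return ids of tasks the baseline failed, in their original order.

    Assumes each record has hypothetical keys "task_id" and "passed".
    """
    return [r["task_id"] for r in baseline_results if not r["passed"]]


baseline = [
    {"task_id": "sop_01", "passed": True},
    {"task_id": "sop_02", "passed": False},
    {"task_id": "sop_03", "passed": True},
]
print(tasks_to_rerun(baseline))  # ['sop_02']
```

Only the returned ids would be sent back to the harness, which is why the incremental mode reports its token cost separately from the baseline.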
Bayesian modes now persist per-task Skill evolution artifacts under:
<run-root>/skill_evolution/
  index.json
  sop_bench/
    sop_01/
      skill_context_before.md
      skill_context_after.md
      posterior_context_before.md
      posterior_context_after.md
      belief_before.json
      belief_after.json
      snapshot_before.json
      snapshot_after.json
skill_context_before.md is the exact model-facing Skill/SOP text injected into that task. For the built-in benchmarks, it contains stable benchmark guardrails and any active Bayesian Failure-Mode Patches. A patch becomes active only after the same failure mode has at least two verified occurrences, so single failures stay audit-only. skill_context_after.md is the next model-facing Skill/SOP text after verifier feedback is recorded.
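The activation rule (a patch is injected only once its failure mode has two or more verified occurrences) can be illustrated with a small counter. The function name and the failure-mode labels below are hypothetical; only the threshold of two comes from the text.

```python
from collections import Counter
from typing import Iterable, List

ACTIVATION_THRESHOLD = 2  # from the text: at least two verified occurrences


def active_failure_modes(verified_failures: Iterable[str]) -> List[str]:
    """Return the failure modes whose patches would be active (illustrative only)."""
    counts = Counter(verified_failures)
    return sorted(mode for mode, n in counts.items() if n >= ACTIVATION_THRESHOLD)


# Hypothetical failure-mode labels: one repeated, one seen a single time.
observed = ["wrong_output_format", "missing_tool_call", "wrong_output_format"]
print(active_failure_modes(observed))  # ['wrong_output_format']
```

Here `missing_tool_call` has a single occurrence, so it would stay audit-only and would not appear in the model-facing text.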
posterior_context_before.md and posterior_context_after.md are audit artifacts for the Bayesian belief state. They may include posterior summaries such as posterior_success, alpha, beta, observations, and rewrite decisions, but those numeric summaries are not injected into the benchmark prompt.
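The field names alpha and beta suggest a Beta-Bernoulli belief, although this section does not spell out the update rule. The sketch below assumes a uniform Beta(1, 1) prior and assumes that posterior_success is the posterior mean alpha / (alpha + beta); both are assumptions, and the `Belief` class is hypothetical rather than the project's own type.

```python
from dataclasses import dataclass


@dataclass
class Belief:
    """Hypothetical Beta-Bernoulli belief over a Skill's success rate."""

    alpha: float = 1.0  # assumed prior pseudo-count of successes
    beta: float = 1.0   # assumed prior pseudo-count of failures
    observations: int = 0

    def update(self, success: bool) -> None:
        # A verified success increments alpha, a verified failure increments beta.
        if success:
            self.alpha += 1
        else:
            self.beta += 1
        self.observations += 1

    @property
    def posterior_success(self) -> float:
        # Posterior mean of a Beta(alpha, beta) distribution.
        return self.alpha / (self.alpha + self.beta)


belief = Belief()
for outcome in (True, True, False):
    belief.update(outcome)
print(belief.alpha, belief.beta, belief.observations)  # 3.0 2.0 3
```

Under these assumptions, two successes and one failure give a posterior mean of 3 / 5 = 0.6.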
For older result directories that only contain results.json, rebuild the Skill evolution trail without rerunning the model:
bayesian-agent replay-skill-artifacts \
--results results/sop_deepseek_v4_flash/bayesian_full/results.json
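After a replay or a live run, a quick completeness check can confirm that a task directory holds all eight artifact files listed earlier. This is a minimal sketch: the helper `missing_artifacts` is hypothetical, and the file names are taken from the directory listing above.

```python
from pathlib import Path
from typing import List, Union

# File names copied from the skill_evolution/ listing in this section.
EXPECTED_FILES = [
    "skill_context_before.md",
    "skill_context_after.md",
    "posterior_context_before.md",
    "posterior_context_after.md",
    "belief_before.json",
    "belief_after.json",
    "snapshot_before.json",
    "snapshot_after.json",
]


def missing_artifacts(task_dir: Union[str, Path]) -> List[str]:
    """Return the expected artifact names that are absent from task_dir."""
    task_dir = Path(task_dir)
    return [name for name in EXPECTED_FILES if not (task_dir / name).is_file()]
```

For example, `missing_artifacts("<run-root>/skill_evolution/sop_bench/sop_01")` would return an empty list when the trail for that task is complete.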
Use --limit 1 for a smoke test before full runs. To switch to deepseek-v4-pro, set MODEL=deepseek-v4-pro; the script itself is the same. For RealFin-Bench, use the same entrypoint with --bench realfin.
To repair an existing GenericAgent (GA) baseline instead of using a fresh baseline from the same run, pass the baseline result files, one --baseline-results flag per benchmark:
"$GENERICAGENT_ROOT/.venv/bin/python" \
experiments/run_benchmarks.py \
--harness genericagent \
--genericagent-root "$GENERICAGENT_ROOT" \
--model "$MODEL" \
--mode bayesian-incremental \
--bench core \
--baseline-results artifacts/ga_deepseek_baseline/sop_results.json \
--baseline-results artifacts/ga_deepseek_baseline/lifelong_results.json
Native Harness Full-Sample Results
These are local full-sample checks with the first-party BA native harness. SOP-Bench and Lifelong AgentBench contain 20 tasks each; RealFin-Bench contains 40 tasks.
For the focused RealFin comparison between Bayesian Skill evolution and the frequentist control, including task-level case analysis, see Bayesian vs Frequentist on RealFin-Bench.
For the focused model-size ablation on RealFin-Bench with the mini-swe-agent backend, see Model Scaling on RealFin-Bench.
| Benchmark | Model | Mode | Score | Total Tokens | Evidence |
|---|---|---|---|---|---|
| SOP-Bench | deepseek-v4-flash | baseline | 19/20 (95.0%) | 1.05M | results/native_harness_deepseek_v4_flash_full/sop |
| SOP-Bench | deepseek-v4-flash | bayesian_full | 20/20 (100.0%) | 870k | results/native_harness_deepseek_v4_flash_full/sop |
| SOP-Bench | deepseek-v4-flash | bayesian_incremental | 20/20 final, 1/1 repaired | 45k incremental | results/native_harness_deepseek_v4_flash_full/sop |
| Lifelong AgentBench | deepseek-v4-flash | baseline | 19/20 (95.0%) | 538k | results/native_harness_deepseek_v4_flash_full/lifelong |
| Lifelong AgentBench | deepseek-v4-flash | bayesian_full | 20/20 (100.0%) | 514k | results/native_harness_deepseek_v4_flash_full/lifelong |
| Lifelong AgentBench | deepseek-v4-flash | bayesian_incremental | 20/20 final, 1/1 repaired | 65k incremental | results/native_harness_deepseek_v4_flash_full/lifelong |
| RealFin-Bench | deepseek-v4-flash | baseline | 25/40 (62.5%) | 10.29M | results/native_harness_deepseek_v4_flash_full/realfin |
| RealFin-Bench | deepseek-v4-flash | bayesian_full | 28/40 (70.0%) | 10.89M | results/native_harness_deepseek_v4_flash_full/realfin |
| RealFin-Bench | deepseek-v4-flash | bayesian_incremental | 29/40 final, 4/15 repaired | 3.76M incremental | results/native_harness_deepseek_v4_flash_full/realfin |
| SOP-Bench | deepseek-v4-pro | baseline | 20/20 (100.0%) | 744k | results/native_harness_deepseek_v4_pro_full/sop |
| SOP-Bench | deepseek-v4-pro | bayesian_full | 20/20 (100.0%) | 739k | results/native_harness_deepseek_v4_pro_full/sop |
| Lifelong AgentBench | deepseek-v4-pro | baseline | 20/20 (100.0%) | 422k | results/native_harness_deepseek_v4_pro_full/lifelong |
| Lifelong AgentBench | deepseek-v4-pro | bayesian_full | 20/20 (100.0%) | 437k | results/native_harness_deepseek_v4_pro_full/lifelong |
| RealFin-Bench | deepseek-v4-pro | baseline | 26/40 (65.0%) | 9.54M | results/native_harness_deepseek_v4_pro_full/realfin_retry |
| RealFin-Bench | deepseek-v4-pro | bayesian_full | 28/40 (70.0%) | 9.91M | results/native_harness_deepseek_v4_pro_full/realfin_retry |
| RealFin-Bench | deepseek-v4-pro | bayesian_incremental | 31/40 final, 5/14 repaired | 4.59M incremental | results/native_harness_deepseek_v4_pro_full/realfin_retry |
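The bayesian_incremental rows combine the baseline score with the number of repaired tasks. The arithmetic can be checked with a short helper; the function name is hypothetical and the inputs are the figures from the table.

```python
from typing import Tuple


def incremental_final(baseline_passed: int, total: int, repaired: int) -> Tuple[int, int, float]:
    """Return (final passed, tasks rerun, final accuracy in percent)."""
    failed = total - baseline_passed  # only these tasks are rerun
    if not 0 <= repaired <= failed:
        raise ValueError("repaired must be between 0 and the number of failed tasks")
    final = baseline_passed + repaired
    return final, failed, 100.0 * final / total


# RealFin-Bench, deepseek-v4-flash: baseline 25/40, 4 repaired.
print(incremental_final(25, 40, 4))  # (29, 15, 72.5)
# RealFin-Bench, deepseek-v4-pro: baseline 26/40, 5 repaired.
print(incremental_final(26, 40, 5))  # (31, 14, 77.5)
```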
The native harness is intentionally simple: LLM, workspace tools, trajectory capture, and three-layer memory. Its job is to execute and observe. Capability improvements beyond that are left to Bayesian Skill/SOP evolution.
Published GA Validation
The earlier published validation used GenericAgent as the execution backend on larger benchmark slices.
Baseline
| Benchmark | Agent | Model | Accuracy | Input Tokens | Output Tokens | Total Tokens | Efficiency |
|---|---|---|---|---|---|---|---|
| SOP-Bench | GA | deepseek-v4-flash | 80% | 1.34M | 57k | 1.39M | 11.47 |
| Lifelong AgentBench | GA | deepseek-v4-flash | 90% | 649k | 42k | 690k | 26.07 |
Full Self-Evolving Mode
| Benchmark | Agent | Model | Accuracy | Input Tokens | Output Tokens | Total Tokens | Efficiency |
|---|---|---|---|---|---|---|---|
| SOP-Bench | GA+Bayesian | deepseek-v4-flash | 100% | 1.07M | 52k | 1.12M | 17.86 |
| Lifelong AgentBench | GA+Bayesian | deepseek-v4-flash | 95% | 666k | 44k | 710k | 26.77 |
Incremental Repair Mode
Bayesian-Agent read the GA baseline traces and reran only failed tasks.
| Benchmark | Agent | Model | Final Accuracy | Incremental Input | Incremental Output | Incremental Total | Incremental Efficiency |
|---|---|---|---|---|---|---|---|
| SOP-Bench | GA+BayesianIncremental | deepseek-v4-flash | 100% | 254k | 14k | 268k | 14.93 |
| Lifelong AgentBench | GA+BayesianIncremental | deepseek-v4-flash | 100% | 129k | 10k | 139k | 14.41 |
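The Efficiency columns are not defined in this section. One reading that reproduces the published values to within rounding of the token totals is "solved tasks per million total tokens", assuming 20 tasks per benchmark (and, for the incremental table, counting only the repaired tasks against the incremental tokens). Both the formula and the task count are inferences from the numbers, not statements from the source.

```python
def efficiency(solved_tasks: int, total_tokens_millions: float) -> float:
    """Assumed metric: solved tasks per million total tokens."""
    return solved_tasks / total_tokens_millions


# SOP-Bench, GA+Bayesian: 100% of an assumed 20 tasks, 1.12M tokens (published: 17.86).
print(round(efficiency(20, 1.12), 2))  # 17.86
# SOP-Bench, incremental: 4 repaired tasks (80% -> 100% of 20), 268k tokens (published: 14.93).
print(round(efficiency(4, 0.268), 2))  # 14.93
```

The other rows land within a few hundredths of the published figures under the same assumptions, which is the size of difference expected from token totals rounded to three significant digits.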
Historical GA-Backed RealFin Run
The earlier RealFin validation used GenericAgent as the execution backend with deepseek-v4-pro.
| Benchmark | Agent | Model | Accuracy | Total Tokens | Evidence |
|---|---|---|---|---|---|
| RealFin-Bench | GA | deepseek-v4-pro | 60% | 3.72M | results/realfin_deepseek_v4_pro_20260602 |
| RealFin-Bench | GA+Bayesian | deepseek-v4-pro | 65% | 3.70M | results/realfin_deepseek_v4_pro_20260602 |
| RealFin-Bench | GA+BayesianIncremental | deepseek-v4-pro | 68% | 1.72M incremental | results/realfin_deepseek_v4_pro_20260602 |
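Compared with this historical GA-backed RealFin run, the BA native harness reaches 77.5% final accuracy on deepseek-v4-pro (31/40 in bayesian_incremental mode), but it spends more tokens (9.54M for the native baseline versus 3.72M for the GA baseline) because the minimal first-party harness lets the model inspect cached market data directly.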
Published example artifacts are stored under artifacts/. New live runs write their own result and Skill evolution artifacts under each benchmark-specific result root.
The cross-harness path depends on the same evidence format: any agent framework that emits verified trajectories can feed the Bayesian Skill registry and receive model-facing Skill/SOP text through an adapter.
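To make the adapter idea concrete, here is a toy sketch of the loop: a harness runs a task with the current Skill text, a verified trajectory comes back, and the registry updates the text for the next task. Every name here (`VerifiedTrajectory`, `HarnessAdapter`, `ToySkillRegistry`, `run_with_skills`) is hypothetical and is not Bayesian-Agent's real interface; the two-occurrence threshold is the one described earlier in this page.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Protocol


@dataclass
class VerifiedTrajectory:
    task_id: str
    passed: bool
    failure_mode: Optional[str] = None  # set by the verifier when the task fails


class HarnessAdapter(Protocol):
    """Anything that can run one task with injected Skill text (hypothetical)."""

    def run_task(self, task_id: str, skill_context: str) -> VerifiedTrajectory: ...


@dataclass
class ToySkillRegistry:
    failure_counts: Dict[str, int] = field(default_factory=dict)

    def record(self, trajectory: VerifiedTrajectory) -> None:
        # Count verified failures per failure mode; successes are ignored here.
        if not trajectory.passed and trajectory.failure_mode:
            mode = trajectory.failure_mode
            self.failure_counts[mode] = self.failure_counts.get(mode, 0) + 1

    def skill_context(self) -> str:
        # Only failure modes seen at least twice produce model-facing text.
        active = sorted(m for m, n in self.failure_counts.items() if n >= 2)
        return "\n".join(f"Avoid failure mode: {m}" for m in active)


def run_with_skills(adapter: HarnessAdapter, registry: ToySkillRegistry,
                    task_ids: List[str]) -> List[VerifiedTrajectory]:
    results = []
    for task_id in task_ids:
        trajectory = adapter.run_task(task_id, registry.skill_context())
        registry.record(trajectory)
        results.append(trajectory)
    return results
```

Because `HarnessAdapter` is a structural protocol, any object with a matching `run_task` method would fit, which is the property the cross-harness path relies on.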