Native Harness¶
Bayesian-Agent now includes a first-party native harness. It is intentionally small: the harness owns the executable loop, while most long-term capability improvement is pushed into Bayesian Skill evolution.
The goal is not to rebuild a large monolithic agent runtime. The goal is to provide a lean, observable substrate where verified trajectories can become better Skills and SOPs.
Minimal Design¶
The native harness has three core parts:
| Layer | File | Responsibility |
|---|---|---|
| LLM | bayesian_agent/harness/llm.py |
A tiny OpenAI-compatible chat-completions client for DeepSeek/OpenAI-style APIs. |
| Tools | bayesian_agent/harness/tools.py |
Workspace-scoped file_read, file_write, code_run, and finish. |
| Loop | bayesian_agent/harness/native.py |
Turn loop, tool dispatch, usage accounting, transcript capture, and trajectory persistence. |
Memory is deliberately kept to three layers:
| Memory | Role |
|---|---|
| Hippocampus | Fast episodic notes for the current run. |
| Intermediate State | Current task/run state such as last task, workspace, and outcome. |
| Cortex | Durable notes and the persistent Bayesian Skill belief registry. |
The harness stays simple:
- no vendored GenericAgent code
- no browser/runtime framework dependency
- no hidden global memory stack
- no model-specific reasoning framework
- no large prompt-engineering surface inside the harness
The harness is expected to execute, observe, and record. Bayesian Skill evolution is expected to learn reusable procedures from verified outcomes.
Why Keep The Harness Small?¶
A large harness can improve short-term task performance, but it also makes the learning layer harder to inspect. Bayesian-Agent keeps the native harness small for three reasons:
- Attribution: when a run improves, the change should be traceable to a Skill/SOP update, not an opaque runtime behavior.
- Portability: the same Skill evolution logic should transfer to GA, mini-swe-agent, Claude Code, or another harness.
- Efficiency: the native harness should spend tokens on task evidence and verification, not on maintaining a complicated harness persona.
This means the harness provides the minimum executable environment:
Prompt + Skill context
|
v
LLM turn
|
v
Workspace tool call
|
v
Observation / transcript
|
v
Verifier result
|
v
Bayesian Skill update
CLI Usage¶
bayesian-agent is now the default harness:
python experiments/run_benchmarks.py \
--harness bayesian-agent \
--model deepseek-v4-flash \
--bench core \
--mode all \
--limit 1
External harnesses remain available:
Use --dry-run to confirm the runner is first-party:
{
"harness": "bayesian-agent",
"ba_harness_core": true,
"native_first_party": true,
"harness_root": ".../Bayesian-Agent"
}
For RealFin-Bench, the native harness can read cache symlinks inside the workspace, while writes remain restricted to the task workspace.
Trajectory Artifacts¶
Every native run writes:
<task-workspace>/
harness_task.json
harness_run.json
native_harness_trajectory.json
model_response_log.txt
transcript.txt
native_harness_trajectory.json records the BA-owned turns, tool calls, observations, usage, and final run metadata. This is the evidence surface consumed by benchmark grading and Bayesian Skill evolution.
Full-Sample Results¶
The table below uses full benchmark samples rather than one-task debug checks. SOP-Bench and Lifelong AgentBench contain 20 tasks each; RealFin-Bench contains 40 tasks.
| Benchmark | Model | Mode | Score | Total Tokens | Evidence |
|---|---|---|---|---|---|
| SOP-Bench | deepseek-v4-flash | baseline | 19/20 (95.0%) | 1.05M | results/native_harness_deepseek_v4_flash_full/sop |
| SOP-Bench | deepseek-v4-flash | bayesian_full | 20/20 (100.0%) | 870k | results/native_harness_deepseek_v4_flash_full/sop |
| SOP-Bench | deepseek-v4-flash | bayesian_incremental | 20/20 final, 1/1 repaired | 45k incremental | results/native_harness_deepseek_v4_flash_full/sop |
| Lifelong AgentBench | deepseek-v4-flash | baseline | 19/20 (95.0%) | 538k | results/native_harness_deepseek_v4_flash_full/lifelong |
| Lifelong AgentBench | deepseek-v4-flash | bayesian_full | 20/20 (100.0%) | 514k | results/native_harness_deepseek_v4_flash_full/lifelong |
| Lifelong AgentBench | deepseek-v4-flash | bayesian_incremental | 20/20 final, 1/1 repaired | 65k incremental | results/native_harness_deepseek_v4_flash_full/lifelong |
| SOP-Bench | deepseek-v4-pro | baseline | 20/20 (100.0%) | 744k | results/native_harness_deepseek_v4_pro_full/sop |
| SOP-Bench | deepseek-v4-pro | bayesian_full | 20/20 (100.0%) | 739k | results/native_harness_deepseek_v4_pro_full/sop |
| Lifelong AgentBench | deepseek-v4-pro | baseline | 20/20 (100.0%) | 422k | results/native_harness_deepseek_v4_pro_full/lifelong |
| Lifelong AgentBench | deepseek-v4-pro | bayesian_full | 20/20 (100.0%) | 437k | results/native_harness_deepseek_v4_pro_full/lifelong |
| RealFin-Bench | deepseek-v4-flash | baseline | 25/40 (62.5%) | 10.29M | results/native_harness_deepseek_v4_flash_full/realfin |
| RealFin-Bench | deepseek-v4-flash | bayesian_full | 28/40 (70.0%) | 10.89M | results/native_harness_deepseek_v4_flash_full/realfin |
| RealFin-Bench | deepseek-v4-flash | bayesian_incremental | 29/40 final, 4/15 repaired | 3.76M incremental | results/native_harness_deepseek_v4_flash_full/realfin |
| RealFin-Bench | deepseek-v4-pro | baseline | 26/40 (65.0%) | 9.54M | results/native_harness_deepseek_v4_pro_full/realfin_retry |
| RealFin-Bench | deepseek-v4-pro | bayesian_full | 28/40 (70.0%) | 9.91M | results/native_harness_deepseek_v4_pro_full/realfin_retry |
| RealFin-Bench | deepseek-v4-pro | bayesian_incremental | 31/40 final, 5/14 repaired | 4.59M incremental | results/native_harness_deepseek_v4_pro_full/realfin_retry |
Interpretation:
- The native harness can now run every benchmark itself and emit its own verified trajectories.
- On SOP/Lifelong, the minimal native harness reaches 95-100% full-sample accuracy and uses less token budget than the historical GA-backed full runs.
- On RealFin, BA native improves
deepseek-v4-profinal accuracy from the historical GA-backed 68% to 77.5%, but spends more tokens because the first-party harness keeps domain logic minimal and lets the model inspect cached market data directly. - The intended design tradeoff is deliberate: the harness remains simple and observable, while long-term capability improvement is pushed into Bayesian Skill/SOP evolution.
Published GA Validation¶
The original GA-backed validation remains useful because it covers larger benchmark slices:
| Benchmark | Agent | Accuracy | Total Tokens | Efficiency |
|---|---|---|---|---|
| SOP-Bench | GA baseline | 80% | 1.39M | 11.47 |
| SOP-Bench | GA+Bayesian full | 100% | 1.12M | 17.86 |
| SOP-Bench | GA+Bayesian incremental | 100% final | 268k incremental | 14.93 |
| Lifelong AgentBench | GA baseline | 90% | 690k | 26.07 |
| Lifelong AgentBench | GA+Bayesian full | 95% | 710k | 26.77 |
| Lifelong AgentBench | GA+Bayesian incremental | 100% final | 139k incremental | 14.41 |
| RealFin-Bench | GA baseline | 60% | 3.72M | 6.46 |
| RealFin-Bench | GA+Bayesian full | 65% | 3.70M | 7.02 |
| RealFin-Bench | GA+Bayesian incremental | 68% final | 1.72M incremental | 1.74 |
The native harness result is therefore best read as a harness-minimality result: BA can now run tasks itself, emit trajectories itself, and still rely on the same Bayesian Skill evolution layer for durable improvement.