CLI

The bayesian-agent command provides trace ingestion, result summarization, and incremental repair utilities.

Install the CLI

python -m pip install -e .
bayesian-agent --help

`evolve`

Update a Bayesian Skill registry from one or more results JSON files.

bayesian-agent evolve \
  --results artifacts/ga_deepseek_baseline/sop_results.json \
  --results artifacts/ga_deepseek_baseline/lifelong_results.json \
  --registry temp/beliefs.json \
  --algorithm categorical_bayes \
  --context-out temp/skill_context.md

Arguments:

| Argument | Required | Description |
| --- | --- | --- |
| `--results` | yes | Path to a results JSON file. Can be repeated. |
| `--registry` | yes | Output registry JSON path. |
| `--algorithm` | no | Belief backend. Defaults to `categorical_bayes`; `naive_bayes` is accepted as a legacy alias, and `beta_bernoulli` enables the global success-rate baseline. |
| `--context-out` | no | Optional Markdown path for the rendered Skill context. |
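
For intuition about the `beta_bernoulli` option, a global success rate can be tracked with a conjugate Beta-Bernoulli update. The sketch below is illustrative only: the function names, the uniform Beta(1, 1) prior, and the list-of-booleans input are assumptions, not the package's actual implementation.

```python
from typing import Iterable, Tuple


def beta_bernoulli_update(
    outcomes: Iterable[bool], alpha: float = 1.0, beta: float = 1.0
) -> Tuple[float, float]:
    """Return the posterior (alpha, beta) after observing success/failure outcomes.

    Hypothetical helper: the Beta(1, 1) default prior is an assumption.
    """
    for ok in outcomes:
        if ok:
            alpha += 1.0  # each success adds one to alpha
        else:
            beta += 1.0  # each failure adds one to beta
    return alpha, beta


def posterior_mean(alpha: float, beta: float) -> float:
    """Expected success rate under Beta(alpha, beta)."""
    return alpha / (alpha + beta)


# Three successes and one failure, starting from the uniform prior.
a, b = beta_bernoulli_update([True, True, False, True])
print(a, b, posterior_mean(a, b))
```

Because the belief is a single pair of counts, it ignores which task or Skill produced each outcome, which is what makes it a natural global baseline.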

`repair-plan`

List failed task ids for incremental repair.

bayesian-agent repair-plan \
  --baseline artifacts/ga_deepseek_baseline/sop_results.json \
  --out temp/failed_tasks.json

Output shape:

{
  "sop_bench": ["sop_12", "sop_13"]
}
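
A downstream script might consume this file as follows. This is a minimal sketch that assumes only the output shape shown above (benchmark name mapped to a list of failed task ids); the `count_failed` helper is hypothetical and not part of the CLI.

```python
import json
from typing import Dict, List


def count_failed(plan: Dict[str, List[str]]) -> Dict[str, int]:
    """Count failed task ids per benchmark in a repair plan."""
    return {benchmark: len(task_ids) for benchmark, task_ids in plan.items()}


# Inline sample mirroring the output shape above. In practice the plan
# would be loaded from the path passed to --out.
plan = json.loads('{"sop_bench": ["sop_12", "sop_13"]}')

for benchmark, task_ids in plan.items():
    for task_id in task_ids:
        print(f"{benchmark}: rerun {task_id}")
```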

`summarize`

Summarize accuracy and token usage for a result file.

bayesian-agent summarize \
  --results artifacts/bayesian_full/results.json \
  --out temp/summary.json

`incremental-summary`

Summarize the lift from a baseline run plus repair traces.

bayesian-agent incremental-summary \
  --baseline artifacts/ga_deepseek_baseline/sop_results.json \
  --repairs artifacts/bayesian_incremental/results.json \
  --out temp/incremental_summary.json

This command is useful for measuring the additional inference cost required to repair failed tasks.
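
The arithmetic behind such a comparison can be sketched as below. This is not the command's actual implementation: the run fields (`task_id`, `success`, `total_tokens`), the output keys, and the rule that a task counts as repaired when any repair run for it succeeds are all assumptions made for illustration. It also assumes each task id appears once in the baseline.

```python
from typing import Any, Dict, List

Run = Dict[str, Any]


def incremental_lift(baseline: List[Run], repairs: List[Run]) -> Dict[str, float]:
    """Compare baseline accuracy with accuracy after applying repair runs."""
    if not baseline:
        raise ValueError("baseline must contain at least one run")
    failed = {run["task_id"] for run in baseline if not run["success"]}
    # Assumed rule: a task is repaired if it failed in the baseline
    # and at least one repair run for it succeeded.
    repaired = {
        run["task_id"]
        for run in repairs
        if run["success"] and run["task_id"] in failed
    }
    total = len(baseline)
    baseline_accuracy = (total - len(failed)) / total
    final_accuracy = (total - len(failed) + len(repaired)) / total
    return {
        "baseline_accuracy": baseline_accuracy,
        "final_accuracy": final_accuracy,
        "lift": final_accuracy - baseline_accuracy,
        # Extra inference cost spent on repair attempts, successful or not.
        "repair_tokens": float(sum(run["total_tokens"] for run in repairs)),
    }


baseline_runs = [
    {"task_id": "sop_10", "success": True, "total_tokens": 1000},
    {"task_id": "sop_11", "success": True, "total_tokens": 1200},
    {"task_id": "sop_12", "success": False, "total_tokens": 900},
    {"task_id": "sop_13", "success": False, "total_tokens": 1100},
]
repair_runs = [
    {"task_id": "sop_12", "success": True, "total_tokens": 500},
    {"task_id": "sop_13", "success": False, "total_tokens": 300},
]
summary = incremental_lift(baseline_runs, repair_runs)
print(summary)
```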

`replay-skill-artifacts`

Rebuild per-task Skill evolution artifacts from an existing results.json without rerunning the model.

bayesian-agent replay-skill-artifacts \
  --results results/sop_deepseek_v4_flash/bayesian_full/results.json

By default, artifacts are written next to the result file:

<result-dir>/skill_evolution/
  index.json
  <benchmark>/<task_id>/
    skill_context_before.md
    skill_context_after.md
    posterior_context_before.md
    posterior_context_after.md
    belief_before.json
    belief_after.json
    snapshot_before.json
    snapshot_after.json

The `skill_context_*.md` files store the exact model-facing Skill/SOP text. The `posterior_context_*.md` files store posterior summaries for auditing and debugging; they are not part of the built-in benchmark prompt.

Use `--out-root` when the run root is different from the result file's parent directory.
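
To check that a replay produced everything, the default layout above can be enumerated with a small helper. This is a sketch: `expected_artifacts` is a hypothetical function, and the benchmark name and task id in the usage line are placeholders; only the directory and file names come from the layout shown earlier.

```python
from pathlib import PurePosixPath
from typing import List

# File names taken from the default artifact layout.
ARTIFACT_NAMES = (
    "skill_context_before.md",
    "skill_context_after.md",
    "posterior_context_before.md",
    "posterior_context_after.md",
    "belief_before.json",
    "belief_after.json",
    "snapshot_before.json",
    "snapshot_after.json",
)


def expected_artifacts(
    result_dir: str, benchmark: str, task_id: str
) -> List[PurePosixPath]:
    """List the per-task artifact paths expected under <result-dir>/skill_evolution/."""
    task_dir = PurePosixPath(result_dir) / "skill_evolution" / benchmark / task_id
    return [task_dir / name for name in ARTIFACT_NAMES]


# Placeholder benchmark and task id, for illustration only.
paths = expected_artifacts(
    "results/sop_deepseek_v4_flash/bayesian_full", "sop_bench", "sop_12"
)
for path in paths:
    print(path)
```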

Result File Assumptions

The CLI accepts benchmark-style JSON result files that can be normalized into benchmark names and run lists. Each run should contain at least:

- `task_id`
- a success signal, such as `success`
- token fields, such as `input_tokens`, `output_tokens`, and `total_tokens`

Extra fields are preserved in evidence metadata.
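
Before passing a file to the CLI, a run can be checked against these minimum fields. The sketch below is a hypothetical pre-flight check, not the CLI's own validation; it assumes each run is a JSON object, that the success signal is stored under `success`, and that at least one of the listed token fields is enough.

```python
from typing import Any, Dict, List

TOKEN_FIELDS = ("input_tokens", "output_tokens", "total_tokens")


def check_run(run: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one run; an empty list means it looks usable."""
    problems: List[str] = []
    if "task_id" not in run:
        problems.append("missing task_id")
    if "success" not in run:  # assumed name of the success signal
        problems.append("missing success signal")
    if not any(field in run for field in TOKEN_FIELDS):
        problems.append("missing token fields")
    return problems


good_run = {"task_id": "sop_12", "success": True, "total_tokens": 1500, "note": "extra"}
bad_run = {"task_id": "sop_13"}

print(check_run(good_run))
print(check_run(bad_run))
```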
