Agentensor Training Runbook¶
Overview¶
- Training runs are dispatched over the existing workflow websocket (
type: "train_workflow"). Payload mirrors evaluation but accepts atrainingblock with dataset/evaluators plus optimizer settings (epochs,checkpoint_interval,case_timeout_seconds,max_concurrency). - AgentensorNode enforces guardrails: concurrency is capped at 8 (or the requested
max_concurrency, whichever is lower) and case timeouts default to 30s (max 300s). - Checkpoints are emitted at
checkpoint_intervalboundaries and whenever a new best score is observed. Each checkpoint captures the runnable config (including prompts), metrics, and metadata (epoch, summary).
Storage & Migrations¶
- SQLite:
agentensor_checkpointstable is auto-created alongside execution history.CREATE TABLE IF NOT EXISTS agentensor_checkpoints ( id TEXT PRIMARY KEY, workflow_id TEXT NOT NULL, config_version INTEGER NOT NULL, runnable_config TEXT NOT NULL, metrics TEXT NOT NULL, metadata TEXT NOT NULL DEFAULT '{}', artifact_url TEXT, is_best INTEGER NOT NULL DEFAULT 0, created_at TEXT NOT NULL ); CREATE INDEX IF NOT EXISTS idx_agentensor_checkpoints_workflow ON agentensor_checkpoints(workflow_id, config_version); CREATE INDEX IF NOT EXISTS idx_agentensor_checkpoints_best ON agentensor_checkpoints(workflow_id, is_best); - PostgreSQL: apply the equivalent DDL before deploying:
CREATE TABLE IF NOT EXISTS agentensor_checkpoints ( id TEXT PRIMARY KEY, workflow_id TEXT NOT NULL, config_version INTEGER NOT NULL, runnable_config JSONB NOT NULL, metrics JSONB NOT NULL, metadata JSONB NOT NULL DEFAULT '{}'::jsonb, artifact_url TEXT NULL, is_best BOOLEAN NOT NULL DEFAULT FALSE, created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() ); CREATE INDEX IF NOT EXISTS idx_agentensor_checkpoints_workflow ON agentensor_checkpoints (workflow_id, config_version); CREATE INDEX IF NOT EXISTS idx_agentensor_checkpoints_best ON agentensor_checkpoints (workflow_id, is_best); - Rollback/restore: checkpoints are append-only. To roll back a regression, fetch the previous best via
GET /api/workflows/{workflow_id}/agentensor/checkpoints?limit=5, select the desiredconfig_version, and reuse itsrunnable_configin a new run.
Operations¶
- Start a training run by streaming:
- Monitor progress events:
training_progress: per-case output/evaluationstraining_checkpoint: persisted checkpoint payloadtraining_epoch_complete: summary per epoch- Download/reuse:
GET /api/workflows/{workflow_id}/agentensor/checkpoints/{checkpoint_id}returns the stored config and metrics for reuse.
Performance & Concurrency Checks¶
- The training node caps
max_concurrencyto 8 and applies wait-for timeouts per case. Increasecheckpoint_intervalto reduce write amplification in long runs. - For load testing, run concurrent websocket training sessions against SQLite and Postgres backends; verify checkpoint ordering and
is_bestflag consistency. On Postgres, wrap the DDL above in migrations to validate apply/rollback.