lab: AFK run lifecycle — auto-reap on PR success, fail on death/timeout #63
Reference
Cloonar/nixos#63
Parent
#61
What to build
A watcher in `lab`'s poll loop that completes AFK runs on its own, since `claude --remote-control` never exits by itself (it opens the PR, then idles holding its slot). A run is done when a PR with head `afk/<N>` exists; it is a failure when its session dies with no PR or it overruns a time budget. `lab` reaps accordingly and frees the slot, which also gives the auto-loop (Slice 3) an "in-flight" signal that actually clears.

Acceptance criteria
- `lab` enumerates live AFK runs from session names `<project>~<slot>~afk-<N>` and tracks each run's start time (the budget clock; resetting it on a lab restart is acceptable).
- A run is done when a PR exists with head `afk/<N>`, matched client-side from `tea pulls list --fields index,head,state`.
- On success, lab removes the worktree `~/.local/state/lab/worktrees/<project>~<N>`; the branch + PR survive (the PR's `Closes #N` closes the issue on merge).
- […] `in-progress` for manual requeue; it is not a success and (per Slice 4) must not count as a failure. Supersedes Slice 1's interim "Stop removes the worktree."

Blocked by
Agent Brief
Category: enhancement
Summary: Add a periodic background watcher to lab that completes ("reaps") AFK runs on its own: success when the run's `afk/<N>` PR appears, failure on session-death-without-PR or a budget overrun, freeing the instance slot in every terminal case.

Context:
`claude --remote-control` never self-exits: an AFK run opens its PR and then idles, holding its tmux session (and its instance slot) forever. lab therefore can't use "the session ended" as the done signal. This is Slice 2 of #61, building on the manual-run substrate from #62 (now merged). The design is locked in ADR-0007 (docs/adr/0007-lab-drives-afk-runs.md) and the AFK run / Instance glossary entries in CONTEXT.md. Follow them; this brief only sharpens the slice boundary and bakes in two decisions taken at triage.

Current behavior:
- […] `ready-for-agent` issue (flipping it to `in-progress`), creates an isolated worktree on branch `afk/<N>`, and spawns a seeded `--remote-control` session named `<project>~<slot>~afk-<N>`, shown as an instance row badged `AFK #N`.
- Manual Stop (`handleStop`) currently also removes the worktree. This is interim Slice-1 behavior, marked in-code as superseded by this issue. The `afk/<N>` branch and pushed commits survive.
- […] (`startCapture`) and the browser's own ~4s fragment poll. This slice introduces lab's first periodic watcher goroutine, started once at process startup.

Desired behavior:
A single periodic watcher (one long-lived goroutine, ticking on a tunable interval ~30s, within ADR-0007's 30–60s band) enumerates live AFK runs and classifies each from three inputs (does an `afk/<N>` PR exist, is the session still alive, run age vs. budget), taking exactly one terminal action:
- Success: an `afk/<N>` PR exists (open or merged). Stop the session if still alive, then remove its worktree. Branch, commits, and PR survive (the PR's `Closes #N` closes the issue on merge). A present PR means success regardless of session liveness: a run that opened its PR and then died is a success, not a death-failure.
- Failure (death): the session is dead and no `afk/<N>` PR exists. Keep the worktree for inspection; the issue stays `in-progress`. (Nothing to stop; the worktree was never removed.)
- Failure (timeout): the session is still alive, no `afk/<N>` PR exists, and the run has exceeded a ~45-minute budget. Stop the session; keep the worktree; the issue stays `in-progress`.

PR matching is client-side from the project's own tracker (`tea pulls list`, head + state fields). A PR whose head is `afk/<N>` counts as done only when open or merged; a closed-and-unmerged `afk/<N>` PR is treated as no PR (so the run fails on death/timeout rather than being falsely reaped as success).

Run start times (the budget clock) are held in memory, keyed by session name, stamped lazily the first time the watcher sees a run. Re-deriving the live-run set each tick from session names means a lab restart re-adopts in-flight runs with the budget clock reset to the restart time, which is acceptable per ADR-0007.
Reaping in every terminal case kills the tmux session, so the run leaves the instances list and stops counting against the global cap with no extra slot bookkeeping (the cap counts live sessions).
The manual Stop becomes neutral, superseding the interim behavior: it keeps the worktree and leaves the issue `in-progress` for manual requeue. It is neither a success nor a failure. The coupling that makes `handleStop` remove an AFK worktree must go.

Key interfaces:
- The `Tracker` seam (already wraps `tea` for issue queries, with a substitutable test fake) gains a way to list a project's pull requests with their head branch and state, so the watcher matches `afk/<N>` client-side. Mirror the existing ready-queue method and keep the `tea` shell-out inside this seam so it stays unit-testable via the fake.
- A pure classifier over `(prPresent bool, sessionAlive bool, age, budget)` returning the outcome (in-progress / success / failure-with-reason). This is the unit-tested core; it must not touch tmux, tea, or the clock directly.
- The existing helper (`removeAFKWorktree`) already has the correct success semantics: it removes the worktree and preserves branch/commits/PR. Reuse it for the success path.

Acceptance criteria:
- lab enumerates live AFK runs from `<project>~<slot>~afk-<N>` session names and tracks each run's start time in memory (reset on lab restart is acceptable).
- A run is done when a PR exists with head `afk/<N>`, matched client-side from the project's `tea` pull list (head + state).
- On success (an open or merged `afk/<N>` PR): the session is stopped and the run's worktree is removed; the `afk/<N>` branch and PR survive.
- […] `in-progress`.
- Manual Stop keeps the worktree and leaves the issue `in-progress`; it is not a success and not a failure (supersedes the interim "Stop removes the worktree").
- A closed-and-unmerged `afk/<N>` PR does not reap the run as success.
- The classifier is unit-tested over `(PR-present, session-alive, run-age)` inputs, including PR-present-but-session-dead = success, and closed-unmerged-PR = not-success.

Out of scope:
- The `consecutiveFailures` counter, the 3-strikes auto-pause, and the UI reset: Slice 4 (#65). This slice only exposes the outcome seam those hook onto.
- […] `(N ready)` count hint: Slice 5 (#66).

Implementer note: lab's Go tests do not run in the repo pre-commit hook (it is eval-only). Run `go test ./...`, `go vet`, and `go build` for this module locally before opening the PR; a Go regression won't be caught by the hook.
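For the enumeration criterion above, recognizing AFK runs among live session names might be done with a small parser like the following sketch. `parseAFKSession` is a hypothetical helper, the slot is kept as a string because its type is not stated in this issue, and the sketch assumes project names never contain `~`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseAFKSession splits a session name of the form
// <project>~<slot>~afk-<N>. It returns ok=false for any other shape,
// so non-AFK sessions are simply skipped by the caller.
func parseAFKSession(name string) (project, slot string, issue int, ok bool) {
	parts := strings.Split(name, "~")
	if len(parts) != 3 || parts[0] == "" || parts[1] == "" {
		return "", "", 0, false
	}
	num := strings.TrimPrefix(parts[2], "afk-")
	// Reject a missing prefix, an empty number, or a leading sign.
	if num == parts[2] || num == "" || num[0] < '0' || num[0] > '9' {
		return "", "", 0, false
	}
	n, err := strconv.Atoi(num)
	if err != nil || n <= 0 {
		return "", "", 0, false
	}
	return parts[0], parts[1], n, true
}

func main() {
	sessions := []string{"nixos~1~afk-63", "nixos~2", "web~3~afk-7"}
	for _, s := range sessions {
		if project, slot, issue, ok := parseAFKSession(s); ok {
			fmt.Printf("%s slot=%s issue=#%d\n", project, slot, issue)
		}
	}
}
```

A pure parser like this can be covered by the same local `go test ./...` run as the classifier, since it needs no tmux session to exercise.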