lab: AFK safety cap — pause a project after 3 consecutive failures, with UI reset #65
Reference
Cloonar/nixos#65
Parent
#61
What to build
Keep a project's auto-loop from burning runs on a recurring failure. Track consecutive failed runs per project, pause auto at 3, and expose a Reset in the UI so a human can re-arm it.
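A minimal sketch of the counter this describes, assuming an in-memory, mutex-guarded map. The names `failureStore`, `IncrementFailures`, `ResetFailures`, and `Paused` are hypothetical and not taken from the lab codebase, and the real store would also persist the value, which is omitted here.

```go
package main

import (
	"fmt"
	"sync"
)

// pauseThreshold is the consecutive-failure count at which auto pauses
// (3, per this issue).
const pauseThreshold = 3

// failureStore is a hypothetical in-memory stand-in for lab's store.
// Persistence to disk is deliberately left out of this sketch.
type failureStore struct {
	mu    sync.Mutex
	fails map[string]int // project name -> consecutive failed runs
}

func newFailureStore() *failureStore {
	return &failureStore{fails: make(map[string]int)}
}

// Failures returns the current consecutive-failure count for a project.
func (s *failureStore) Failures(project string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.fails[project]
}

// IncrementFailures performs the read-modify-write under the store's own
// lock and returns the new count, so callers never do get-then-set.
func (s *failureStore) IncrementFailures(project string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.fails[project]++
	return s.fails[project]
}

// ResetFailures zeroes the counter for a project.
func (s *failureStore) ResetFailures(project string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.fails, project)
}

// Paused reports whether the project has reached the threshold.
func (s *failureStore) Paused(project string) bool {
	return s.Failures(project) >= pauseThreshold
}

func main() {
	s := newFailureStore()

	// Three concurrent failures still add up to exactly 3.
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.IncrementFailures("demo")
		}()
	}
	wg.Wait()
	fmt.Println(s.Failures("demo"), s.Paused("demo"))

	// A human Reset re-arms the loop.
	s.ResetFailures("demo")
	fmt.Println(s.Failures("demo"), s.Paused("demo"))
}
```

Keeping the increment inside the store means two writers cannot interleave a read and a write, which is the property the pause threshold depends on.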
Acceptance criteria
- Per-project consecutiveFailures counter persisted in lab's store.
- When consecutiveFailures reaches 3, the project's auto-loop pauses: the scheduler launches no further auto-runs for it even while autoEnabled is on.
- A paused project shows "Auto paused · N fails" and a Reset button (POST /afk/reset/<project>) that zeroes the counter and re-arms the loop.
- The indicator and Reset button appear in the project's ⋯ menu.
Blocked by
Agent Brief
Category: enhancement
Summary: Pause a project's automatic AFK loop after 3 consecutive failed runs (tracked per project, persisted), and expose a UI Reset that zeroes the counter and re-arms the loop.
Context: Slice 4 of #61; design locked in ADR-0007. Blockers #63 (reap lifecycle) and #64 (auto scheduler) are merged, so this is unblocked. The reap chokepoint and the auto-launch predicate were deliberately left with seams for this slice.
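The two seams mentioned above could carry the counter roughly as follows. This is a sketch under assumptions: `nextFailureCount` is a hypothetical pure helper (the real reaper would presumably call the store's atomic increment and reset instead), and every `afkAutoDecision` field other than `Paused` is guessed from the scheduler description rather than taken from the code.

```go
package main

import "fmt"

// afkOutcome mirrors the three terminal classifications named in this issue.
type afkOutcome int

const (
	afkSuccess afkOutcome = iota
	afkFailureDeath
	afkFailureTimeout
)

// pauseThreshold is the failure count at which auto pauses (3, per this issue).
const pauseThreshold = 3

// nextFailureCount is a hypothetical helper showing the counter rule at reap:
// success resets to zero, either failure kind adds one.
func nextFailureCount(current int, outcome afkOutcome) int {
	switch outcome {
	case afkSuccess:
		return 0
	case afkFailureDeath, afkFailureTimeout:
		return current + 1
	default:
		return current
	}
}

// afkAutoDecision holds the inputs to the auto-launch predicate.
// Only Paused comes from this issue; the other fields are assumptions.
type afkAutoDecision struct {
	AutoEnabled   bool // project's auto toggle is on
	RunActive     bool // a run is already in flight (serial per project)
	UnderCap      bool // global instance cap not yet reached
	HasReadyIssue bool // a ready-for-agent issue is available
	Paused        bool // consecutiveFailures >= threshold
}

// shouldLaunchAuto gates automatic launches only. Manual Start does not go
// through this predicate, so a paused project can still be started by hand.
func shouldLaunchAuto(d afkAutoDecision) bool {
	return d.AutoEnabled && !d.RunActive && d.UnderCap && d.HasReadyIssue && !d.Paused
}

func main() {
	fails := 0
	for _, o := range []afkOutcome{afkFailureDeath, afkFailureTimeout, afkFailureDeath} {
		fails = nextFailureCount(fails, o)
	}
	d := afkAutoDecision{AutoEnabled: true, UnderCap: true, HasReadyIssue: true}
	d.Paused = fails >= pauseThreshold
	fmt.Println(fails, shouldLaunchAuto(d))

	// One success clears the streak and the predicate passes again.
	fails = nextFailureCount(fails, afkSuccess)
	d.Paused = fails >= pauseThreshold
	fmt.Println(fails, shouldLaunchAuto(d))
}
```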
Current behavior:
An AFK run is reaped at a single chokepoint that classifies each terminal run as success (an afk/<N> PR appeared), death-failure (session gone, no PR), or timeout-failure (over the run budget, no PR). The auto scheduler launches runs for auto-enabled Forgejo projects, serial per project, under the global instance cap. Nothing tracks repeated failure: an auto-enabled project whose issues keep failing will keep claiming the next ready-for-agent issue and burning a full run every sweep, indefinitely. There is no persisted failure counter and no way to pause one misbehaving project short of switching its auto toggle off. A user-initiated Stop is already neutral: it kills and forgets the run so the reaper never counts it as a failure.
Desired behavior:
Key interfaces:
- Store/projectState: add a persisted consecutiveFailures integer beside the existing autoEnabled, following the same omitempty + atomic load/save pattern. Expose store methods that mutate atomically under the store's own lock (read-modify-write inside the store): a getter, an increment, and a reset-to-zero. Do not make callers do get-then-set: the reaper goroutine and the Reset HTTP handler write this concurrently, so a caller-side compound op would race. The reset method is shared by the success reap and the Reset action.
- Reap chokepoint (reapAFKRun): the single point every terminal outcome already routes through. On afkSuccess, reset the project's counter; on afkFailureDeath/afkFailureTimeout, increment it. This is the only place the run lifecycle writes the counter. (Its current comment marks this exact seam as reserved for this slice; update that comment.)
- Auto-launch predicate (afkAutoDecision + shouldLaunchAuto): add a Paused term so an automatic launch additionally requires the project to be unpaused; the scheduler populates it from consecutiveFailures >= threshold. CRITICAL: this gate belongs to the scheduler predicate only. It must not be added to the shared select→claim→spawn path used by both manual and auto starts, or it would wrongly block manual Start too.
- Reset handler (POST /afk/reset/<project>): registered in the router and mirroring the existing auto-toggle handler. It is POST-only, resolves the project, zeroes the counter, then kicks one scheduler sweep so a re-armed project picks up a ready issue promptly (the sweep self-gates on the auto toggle, so kicking it is safe whether auto is on or off). It must use the shared success/fail response plumbing so the fetch path gets the re-rendered #live fragment and the no-JS path gets the 303 redirect, identical to the other per-project actions.
- Snapshot (projectGroup) + index template ⋯ menu: the snapshot should expose the per-project failure count and a derived "paused" flag for Forgejo projects. The menu (which already renders "Start AFK run" and "Auto AFK runs: On/Off" as server-text form-buttons) gains, when paused, an "Auto paused · N fails" indicator and a Reset form-button. Render these as server text the morph syncs, never client-owned <input>/checked/open state, exactly like the existing auto toggle, so both the no-JS and fetch/morph paths stay correct. Keep the ≥44px tap targets the menu items already use (the UI is mobile-first).
Acceptance criteria:
- projectState carries a persisted consecutiveFailures that round-trips across a store reload (mirror the existing autoEnabled persistence test).
- When consecutiveFailures reaches the threshold (3), the scheduler launches no further automatic runs for that project even while its auto toggle is on; other projects are unaffected.
- The Reset action posts to /afk/reset/<project>, zeroes the counter, and re-arms the loop.
Out of scope:
Notes for the implementer:
Run go test ./... && go vet ./... && go build ./... in the lab module locally before opening the PR. Don't commit build artifacts (the prebuilt binary is gitignored).