lab: AFK safety cap — pause a project after 3 consecutive failures, with UI reset #65

Closed
opened 2026-06-01 12:24:05 +02:00 by dominik.polakovics · 1 comment

Parent

#61

What to build

Keep a project's auto-loop from burning runs on a recurring failure. Track consecutive failed runs per project, pause auto at 3, and expose a Reset in the UI so a human can re-arm it.

Acceptance criteria

  • New persisted per-project field consecutiveFailures in lab's store.
  • A failure reap (from Slice 2: session-death-without-PR or timeout) increments the counter; a success reap zeroes it; a user-initiated Stop leaves it unchanged.
  • When consecutiveFailures reaches 3, the project's auto-loop pauses: the scheduler launches no further auto-runs for it even while autoEnabled is on.
  • The card shows Auto paused · N fails and a Reset button (POST /afk/reset/<project>) that zeroes the counter and re-arms the loop.
  • Reset works both no-JS (form post) and via the fetch/morph path, and is reachable from the ⋯ menu.
  • The pause is per project; other projects are unaffected; the threshold (3) is a tunable constant.
  • Go unit tests cover counter transitions (failure increments, success resets, user-Stop is neutral, pause at threshold, reset clears).
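
The counter transitions listed above can be sketched as a small pure function, which is also the easiest shape to unit-test. This is only an illustration: the names `reapOutcome`, `nextFailures`, `autoPaused`, and `afkFailureThreshold` are hypothetical and not taken from lab's code, and user-Stop is modelled as an explicit outcome here purely so its neutrality is visible.

```go
package main

// afkFailureThreshold is the tunable pause threshold (3 per this issue).
// The constant name is an assumption made for this sketch.
const afkFailureThreshold = 3

// reapOutcome is a hypothetical stand-in for lab's reap classification.
type reapOutcome int

const (
	outcomeSuccess  reapOutcome = iota // a PR appeared
	outcomeDeath                       // session died without a PR
	outcomeTimeout                     // run exceeded its budget without a PR
	outcomeUserStop                    // user-initiated Stop: neutral
)

// nextFailures returns the counter value after one terminal outcome.
func nextFailures(current int, o reapOutcome) int {
	switch o {
	case outcomeSuccess:
		return 0 // any success clears the streak
	case outcomeDeath, outcomeTimeout:
		return current + 1 // both failure kinds count the same
	default:
		return current // user-Stop leaves the counter untouched
	}
}

// autoPaused reports whether the auto-loop should be held for a project.
func autoPaused(failures int) bool {
	return failures >= afkFailureThreshold
}
```

A table-driven test over `nextFailures` would cover every transition named in the last criterion without needing a scheduler or a store.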

Blocked by

  • #63 (Slice 2 — provides the success/failure reap signal the counter consumes)
  • #64 (Slice 3 — provides the auto-loop that the pause gates)
Author
Owner

This was generated by AI during triage.

Agent Brief

Category: enhancement
Summary: Pause a project's automatic AFK loop after 3 consecutive failed runs (tracked per project, persisted), and expose a UI Reset that zeroes the counter and re-arms the loop.

Context: Slice 4 of #61; design locked in ADR-0007. Blockers #63 (reap lifecycle) and #64 (auto scheduler) are merged, so this is unblocked. The reap chokepoint and the auto-launch predicate were deliberately left with seams for this slice.

Current behavior:
An AFK run is reaped at a single chokepoint that classifies each terminal run as success (an afk/<N> PR appeared), death-failure (session gone, no PR), or timeout-failure (over the run budget, no PR). The auto scheduler launches runs for auto-enabled Forgejo projects, serial per project, under the global instance cap. Nothing tracks repeated failure: an auto-enabled project whose issues keep failing will keep claiming the next ready-for-agent issue and burning a full run every sweep, indefinitely. There is no persisted failure counter and no way to pause one misbehaving project short of switching its auto toggle off. A user-initiated Stop is already neutral — it kills and forgets the run so the reaper never counts it as a failure.

Desired behavior:

  • lab tracks consecutive failed AFK runs per project, persisted across restarts.
  • A failing reap (death or timeout) increments the project's counter; a successful reap resets it to zero; a user-initiated Stop leaves it unchanged (already true today). The counter is kind-agnostic — both manual and automatic runs feed it — but only the automatic loop is paused by it. (A consequence: a successful manual run re-arms a paused auto-loop, which is intended.)
  • When the counter reaches the threshold (default 3), the scheduler launches no further automatic runs for that project, even while its auto toggle stays on. Manual "Start AFK run" is unaffected and still works.
  • The pause is per project; other projects are untouched. Only a Reset (or a subsequent successful run) clears the counter and re-arms the loop — merely toggling auto off and on does not clear it.
  • The project card surfaces the paused state as "Auto paused · N fails" and offers a Reset action in the ⋯ menu that zeroes the counter and re-arms the loop. Reset must work both without JS (plain form POST → redirect) and via the fetch/morph path, like every other per-project action.
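
To make the manual/automatic asymmetry described above concrete, here is a minimal sketch of the launch decision. Every name in it (`runKind`, `projectFlags`, `mayLaunch`, `toggleAuto`, `failureThreshold`) is hypothetical, and the other launch conditions (instance cap, serial-per-project, a ready issue existing) are deliberately left out.

```go
package main

// failureThreshold mirrors the default of 3 from the brief; the name is
// an assumption made for this sketch.
const failureThreshold = 3

// runKind distinguishes who asked for the run (hypothetical type).
type runKind int

const (
	manualRun runKind = iota
	autoRun
)

// projectFlags holds the two per-project values this decision needs.
type projectFlags struct {
	AutoEnabled         bool
	ConsecutiveFailures int
}

// mayLaunch reports whether a run of the given kind may start.
// Only automatic runs consult the toggle and the failure counter.
func mayLaunch(kind runKind, p projectFlags) bool {
	if kind == manualRun {
		return true // manual Start is never gated by the pause
	}
	return p.AutoEnabled && p.ConsecutiveFailures < failureThreshold
}

// toggleAuto flips the toggle and deliberately leaves the counter alone,
// so switching auto off and on again does not re-arm a paused project.
func toggleAuto(p projectFlags) projectFlags {
	p.AutoEnabled = !p.AutoEnabled
	return p
}
```

With `projectFlags{AutoEnabled: true, ConsecutiveFailures: 3}`, `mayLaunch(autoRun, p)` is false while `mayLaunch(manualRun, p)` stays true. Only zeroing the counter (Reset or a successful run) flips the automatic case back.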

Key interfaces:

  • Store / projectState — add a persisted consecutiveFailures integer beside the existing autoEnabled, following the same omitempty + atomic load/save pattern. Expose store methods that mutate atomically under the store's own lock (read-modify-write inside the store): a getter, an increment, and a reset-to-zero. Do not make callers do get-then-set — the reaper goroutine and the Reset HTTP handler write this concurrently, so a caller-side compound op would race. The reset method is shared by the success reap and the Reset action.
  • The reap chokepoint (reapAFKRun) — the single point every terminal outcome already routes through. On afkSuccess, reset the project's counter; on afkFailureDeath / afkFailureTimeout, increment it. This is the only place the run lifecycle writes the counter. (Its current comment marks this exact seam as reserved for this slice; update that comment.)
  • The auto-launch predicate (afkAutoDecision + shouldLaunchAuto) — add a Paused term so an automatic launch additionally requires the project to be unpaused; the scheduler populates it from consecutiveFailures >= threshold. CRITICAL: this gate belongs to the scheduler predicate only. It must not be added to the shared select→claim→spawn path used by both manual and auto starts, or it would wrongly block manual Start too.
  • A new action POST /afk/reset/<project>, registered in the router and mirroring the existing auto-toggle handler: POST-only, resolves the project, zeroes the counter, then kicks one scheduler sweep so a re-armed project picks up a ready issue promptly (the sweep self-gates on the auto toggle, so kicking it is safe whether auto is on or off). It must use the shared success/fail response plumbing so the fetch path gets the re-rendered #live fragment and the no-JS path gets the 303 redirect — identical to the other per-project actions.
  • The threshold — a single named constant (default 3), defined beside the other AFK tunables.
  • The project view model (projectGroup) + index template ⋯ menu — the snapshot should expose the per-project failure count and a derived "paused" flag for Forgejo projects. The menu (which already renders "Start AFK run" and "Auto AFK runs: On/Off" as server-text form-buttons) gains, when paused, an "Auto paused · N fails" indicator and a Reset form-button. Render these as server text the morph syncs — never client-owned <input>/checked/open state — exactly like the existing auto toggle, so both the no-JS and fetch/morph paths stay correct. Keep the ≥44px tap targets the menu items already use (the UI is mobile-first).
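
The store requirement above (read-modify-write inside the store, never get-then-set in the caller) might look roughly like the following. This is a sketch under assumptions: the JSON encoding, the tag names, the single-mutex layout, and the method names `IncrementFailures` / `ResetFailures` / `LoadStore` are illustrative and not lab's actual API, and the atomic write-to-disk step is omitted.

```go
package main

import (
	"encoding/json"
	"sync"
)

// projectState is a cut-down stand-in for lab's persisted per-project record.
// ASSUMPTION: JSON with omitempty tags; the brief only says the new field
// follows the same omitempty pattern as autoEnabled.
type projectState struct {
	AutoEnabled         bool `json:"autoEnabled,omitempty"`
	ConsecutiveFailures int  `json:"consecutiveFailures,omitempty"`
}

// Store guards all project state with one mutex (hypothetical layout).
type Store struct {
	mu       sync.Mutex
	projects map[string]projectState
}

func NewStore() *Store {
	return &Store{projects: make(map[string]projectState)}
}

// ConsecutiveFailures returns the current count for a project.
func (s *Store) ConsecutiveFailures(project string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.projects[project].ConsecutiveFailures
}

// IncrementFailures performs the whole read-modify-write under the lock
// and returns the new count, so callers never need get-then-set.
func (s *Store) IncrementFailures(project string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	st := s.projects[project]
	st.ConsecutiveFailures++
	s.projects[project] = st
	return st.ConsecutiveFailures
}

// ResetFailures zeroes the count; the success reap and the Reset action
// would both call this one method.
func (s *Store) ResetFailures(project string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	st := s.projects[project]
	st.ConsecutiveFailures = 0
	s.projects[project] = st
}

// Marshal serializes the state; a zero counter is omitted by omitempty.
func (s *Store) Marshal() ([]byte, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return json.Marshal(s.projects)
}

// LoadStore rebuilds a store from serialized state.
func LoadStore(data []byte) (*Store, error) {
	s := NewStore()
	if err := json.Unmarshal(data, &s.projects); err != nil {
		return nil, err
	}
	if s.projects == nil { // input was JSON null
		s.projects = make(map[string]projectState)
	}
	return s, nil
}
```

Returning the new count from `IncrementFailures` lets a caller log or act on the value it just produced without a second locked read that a concurrent Reset could interleave with.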

Acceptance criteria:

  • projectState carries a persisted consecutiveFailures that round-trips across a store reload (mirror the existing autoEnabled persistence test).
  • A death-failure or timeout-failure reap increments the project's counter; a PR-success reap resets it to zero; a user-initiated Stop leaves it unchanged.
  • When consecutiveFailures reaches the threshold (3), the scheduler launches no further automatic runs for that project even while its auto toggle is on; other projects are unaffected.
  • Manual "Start AFK run" still works on a paused project — the pause gates only the automatic loop.
  • The card shows "Auto paused · N fails" and a Reset control reachable from the ⋯ menu; Reset posts to /afk/reset/<project>, zeroes the counter, and re-arms the loop.
  • Reset works both no-JS (form POST → redirect) and via the fetch/morph path.
  • The threshold (3) is a single tunable constant.
  • Go unit tests cover the counter transitions: failure increments, success resets, user-Stop is neutral, the scheduler pauses at threshold, and Reset clears the counter.
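
For the Reset criteria, a rough handler sketch follows. It is not lab's implementation: the issue does not say how the shared response plumbing tells the fetch path from the no-JS path, so the `X-Fetch` header below is a pure assumption, as are `counters`, `resetHandler`, and `tryReset`. The sketch also skips resolving the project against the known project list, and the `#live` fragment is a placeholder.

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// counters is a minimal in-memory stand-in for lab's store.
type counters struct {
	mu sync.Mutex
	n  map[string]int
}

func newCounters() *counters { return &counters{n: make(map[string]int)} }

func (c *counters) reset(project string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n[project] = 0
}

func (c *counters) get(project string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.n[project]
}

// resetHandler sketches POST /afk/reset/<project>.
// ASSUMPTION: the fetch path is detected through an "X-Fetch" request
// header; lab's real response plumbing is not shown in the issue.
func resetHandler(c *counters, kickSweep func()) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			w.Header().Set("Allow", http.MethodPost)
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		project := strings.TrimPrefix(r.URL.Path, "/afk/reset/")
		if project == "" || project == r.URL.Path {
			http.NotFound(w, r)
			return
		}
		c.reset(project)
		kickSweep() // one sweep so a re-armed project picks up work promptly
		if r.Header.Get("X-Fetch") != "" {
			// Fetch/morph path: answer with the (placeholder) live fragment.
			w.Header().Set("Content-Type", "text/html; charset=utf-8")
			_, _ = w.Write([]byte(`<div id="live"></div>`))
			return
		}
		// No-JS path: plain form POST, then redirect.
		http.Redirect(w, r, "/", http.StatusSeeOther)
	})
}

// tryReset drives the handler in memory and returns the status code.
// In a real module this helper would live in a _test.go file.
func tryReset(h http.Handler, method, path string, fetch bool) int {
	req := httptest.NewRequest(method, path, nil)
	if fetch {
		req.Header.Set("X-Fetch", "1")
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}
```

Passing the sweep kick in as a function keeps the handler testable without a running scheduler: a test can count the calls and check that a rejected GET neither resets the counter nor triggers a sweep.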

Out of scope:

  • Auto-retry of failed runs, backoff, or any automatic un-pause — re-arming is human-initiated via Reset (or implicit on the next success). ADR-0007 explicitly rejects auto-retry.
  • Changing how a run is classified (success/death/timeout) or the reap mechanics — Slice 2 (#63) owns those.
  • Changing the scheduler cadence, the global instance cap, or manual-run additivity.
  • The "(N ready)" menu count hint — that's the sibling Slice 5 (#66).
  • Per-project configurable thresholds or any settings UI — the threshold stays a code constant.
  • Any notification/alert when a project pauses.

Notes for the implementer:

  • The pre-commit hook is eval-only (it dry-builds the NixOS config; it does not compile or test the lab Go module). Run go test ./... && go vet ./... && go build ./... in the lab module locally before opening the PR. Don't commit build artifacts (the prebuilt binary is gitignored).
dominik.polakovics 2026-06-02 15:24:41 +02:00
Reference
Cloonar/nixos#65