feat(dev): cap each lab session's open-file budget so a runaway can't wedge the VM #77
Reference
Cloonar/nixos!77
What
lab spawned every agent as a bare tmux new-session -d, so the process tree inherited the system-default RLIMIT_NOFILE (~524288 hard). A single agent leaking file descriptors could drive the whole dev microvm to system-wide fd exhaustion (ENFILE). On 2026-06-02 it took lab and SSH down together, recoverable only by a reboot (#76).
This pins every spawned session's soft and hard RLIMIT_NOFILE to a configurable cap via prlimit, so a descriptor leak hits the agent's own EMFILE long before it can deplete the system pool. sshd, nscd, and lab keep their high default ceiling and stay responsive.
Design (honoring the issue's settled decisions)
Start), AFK runs (StartCommand + seed prompt), the login flow (StartCommand): route through StartCommand. The wrapper is prepended there, composing with %s session-name substitution and AFK's trailing seed-prompt arg.
prlimit --nofile=N:N -- <cmd> is the pane's command, so the bound reaches the whole agent tree (claude + node + LSPs + watchers) and survives new-session -d daemonizing the pane under the shared server. Both limits are set equal so the process can't raise its soft limit back toward the system ceiling; the trailing -- lets the command's own flags (claude --remote-control …) pass through.
-prlimit flag, resolved off the service PATH (util-linux added), matching the tmux/claude/git/tea convention.
Changes
sessions.go: prlimitBin + nofile fields; pure newSessionArgs() prefixes prlimit --nofile=N:N -- onto the substituted command when nofile > 0, bare otherwise (zero value leaves existing tests / minimal configs unaffected).
main.go: -prlimit (default prlimit) and -session-nofile (default 16384) flags; set on Sessions right after construction.
default.nix: util-linux on the service PATH + nativeCheckInputs; -session-nofile 16384 set on the service via ExecStart.
Tests
go test ./... green (run locally; the pre-commit hook is eval-only for lab Go; fw dry-build: OK).
TestSessions_newSessionArgsNofileCap: argv carries the prlimit wrapper (with %s substituted) when capped, bare at 0.
TestSessions_nofileCapPropagatesThroughDaemonization: real tmux; the spawned pane reports soft+hard NOFILE == cap, proving the limit survives new-session -d.
TestSessions_nofileCapBindsEachSessionOnSharedServer: two capped sessions on one shared server both report the cap, proving attach-to-a-running-server doesn't bypass the per-pane wrapper.
Headroom
max-instances = 6 (login excluded) → ≤ 7 capped sessions including login. Worst case 7 × 16384 = 114,688 fds.
fs.file-max = LONG_MAX ≈ 9.2e18 (measured on dev during the 2026-06-02 incident, per #76) → ratio ~1e-14; the system file table cannot be depleted by capped agents.
fs.nr_open (per-process hard ceiling, kernel default 1,048,576) → the 16384 cap is ~1.6% of it, comfortably settable by prlimit.
The arithmetic is conclusive on documented values. One belt-and-suspenders item is pending: a fresh live read-only sysctl fs.file-max fs.nr_open on dev to re-confirm against current kernel state. That SSH needs explicit approval; I'll append the output here once approved.
Acceptance criteria
new-session -d).
max-instances × cap ≪ fs.file-max / fs.nr_open: arithmetic recorded above; live dev sysctl re-confirmation pending SSH approval.
Closes #76
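For reviewers who want the shape of the wrapper without opening the diff, here is a minimal sketch of how the two flags could feed the argv builder. It is not the code in this PR: the Sessions type below is a stand-in, and the newSessionArgs signature, the parseFlags helper, passing the command to tmux as a single shell-command string, and the my-agent command are all assumptions made for illustration. Only the field names, flag names, defaults, and the prlimit --nofile=N:N -- prefix come from the description above.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// Sessions is a hypothetical stand-in for the type in sessions.go; only the
// two fields named in this PR are modeled.
type Sessions struct {
	prlimitBin string // prlimit binary, resolved off PATH
	nofile     int    // cap for soft and hard RLIMIT_NOFILE; 0 disables the wrapper
}

// newSessionArgs builds a tmux argv for one detached session. The signature
// is an assumption. The start command has %s replaced by the session name and
// is prefixed with the prlimit wrapper only when nofile > 0.
func (s Sessions) newSessionArgs(name, startCommand string) []string {
	cmd := strings.ReplaceAll(startCommand, "%s", name)
	if s.nofile > 0 {
		// Soft and hard limits are set to the same value.
		cmd = fmt.Sprintf("%s --nofile=%d:%d -- %s", s.prlimitBin, s.nofile, s.nofile, cmd)
	}
	return []string{"new-session", "-d", "-s", name, cmd}
}

// parseFlags is a hypothetical helper that mirrors the two flags and their
// defaults as listed under Changes.
func parseFlags(args []string) (Sessions, error) {
	set := flag.NewFlagSet("lab", flag.ContinueOnError)
	bin := set.String("prlimit", "prlimit", "prlimit binary used to cap each session")
	nofile := set.Int("session-nofile", 16384, "per-session RLIMIT_NOFILE cap (0 = no cap)")
	if err := set.Parse(args); err != nil {
		return Sessions{}, err
	}
	return Sessions{prlimitBin: *bin, nofile: *nofile}, nil
}

func main() {
	s, err := parseFlags([]string{"-session-nofile", "16384"})
	if err != nil {
		panic(err)
	}
	// "my-agent --session %s" is a placeholder start command, not lab's real one.
	args := s.newSessionArgs("agent-1", "my-agent --session %s")
	fmt.Println("tmux", strings.Join(args, " "))
}
```

Keeping the builder pure (no process spawning) is what lets the zero value of nofile fall back to the bare command, which matches the note that existing tests and minimal configs stay unaffected.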
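The headroom arithmetic recorded under Headroom can be re-derived with a few lines of Go. This is only a sketch: worstCaseFDs is a hypothetical helper, and the constants are the documented values quoted in this description (max-instances 6 plus one login session, cap 16384, fs.nr_open default 1,048,576, fs.file-max taken as LONG_MAX), not a live read from dev.

```go
package main

import (
	"fmt"
	"math"
)

// worstCaseFDs is a hypothetical helper: every capped session (max-instances
// plus the one login session) is assumed to sit exactly at its cap.
func worstCaseFDs(maxInstances, nofileCap int64) int64 {
	return (maxInstances + 1) * nofileCap
}

func main() {
	const (
		maxInstances = 6         // login session excluded
		nofileCap    = 16384     // -session-nofile default
		nrOpen       = 1_048_576 // documented kernel default for fs.nr_open
	)
	fileMax := float64(math.MaxInt64) // fs.file-max taken as LONG_MAX

	total := worstCaseFDs(maxInstances, nofileCap)
	fmt.Printf("worst case: %d fds\n", total)
	fmt.Printf("share of fs.file-max: %.2e\n", float64(total)/fileMax)
	fmt.Printf("cap as share of fs.nr_open: %.2f%%\n", 100*float64(nofileCap)/nrOpen)
}
```

If the pending sysctl read on dev returns different values, only fileMax and nrOpen need to change to redo the check.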