Isolate lab agents so one runaway can't wedge the dev VM #76
Cloonar/nixos#76
Problem
`lab` spawns each Claude agent as a bare detached tmux session (`Sessions.StartCommand` → `tmux new-session -d …` in `hosts/fw/vms/dev/modules/lab/sessions.go`), with no resource isolation. The session's process tree runs as user `dominik` outside any bounded cgroup, sharing the same global resource pool as `sshd`, `nscd`/`nsncd`, and `lab` itself.
The consequence: a single agent's resource runaway is not contained. It exhausts a system-wide limit and takes down the entire `dev` microvm.
Observed incident (2026-06-02)
The dev VM wedged: `lab.cloonar.com` returned `tmux: … too many open files in system` and SSH into dev failed simultaneously.
Diagnosis: system-wide file-descriptor exhaustion (`ENFILE`), not OOM.
- `fs.file-max` = `LONG_MAX`, `fs.inotify.max_user_instances`/`max_user_watches` = 524288, `lab`/`nscd` `LimitNOFILE` = 524288, `kernel.pid_max` = 4194304.
- `nsncd` failed every `getservbyname_r` with `code 23` (= `ENFILE`) in a continuous storm from 12:05 until reboot. nsncd was a victim, clean at idle afterward.
- `lab` itself holds ~640 fds at idle and does not leak.
Because the limit hit was system-wide, every process needing a new fd or NSS lookup failed at once, which is why lab and SSH died together and the box could only be recovered by a reboot.
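For reproducing the diagnosis on a live box, a minimal Go sketch of a read-only check is shown here. It assumes a Linux host that exposes `/proc/sys/fs/file-nr`, whose three fields are the allocated file handles, the unused handles, and the system-wide maximum (`fs.file-max`). The helper name `parseFileNr` is hypothetical and not part of lab.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseFileNr parses the contents of /proc/sys/fs/file-nr, which holds three
// whitespace-separated fields: allocated handles, unused handles, and the
// system-wide maximum (fs.file-max). Hypothetical helper, not lab code.
func parseFileNr(s string) (allocated, limit uint64, err error) {
	fields := strings.Fields(s)
	if len(fields) != 3 {
		return 0, 0, fmt.Errorf("file-nr: want 3 fields, got %d", len(fields))
	}
	if allocated, err = strconv.ParseUint(fields[0], 10, 64); err != nil {
		return 0, 0, err
	}
	if limit, err = strconv.ParseUint(fields[2], 10, 64); err != nil {
		return 0, 0, err
	}
	return allocated, limit, nil
}

func main() {
	// Read-only: this opens a single file and changes nothing on the host.
	raw, err := os.ReadFile("/proc/sys/fs/file-nr")
	if err != nil {
		fmt.Fprintln(os.Stderr, "read file-nr:", err)
		os.Exit(1)
	}
	allocated, limit, err := parseFileNr(string(raw))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("allocated=%d fs.file-max=%d\n", allocated, limit)
}
```

Sampling the allocated count over time while an agent runs is one way to see whether the system-wide total keeps climbing, as opposed to a single process hitting its own `EMFILE`.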
Suspected fd leak (motivation for this issue)
At the time, only one agent was actively running and one was idle, yet the whole VM still wedged. Two idle/lightly-loaded agents should not exhaust a system with effectively-infinite count ceilings — so there is likely a file-descriptor leak in an agent's process tree (claude/node/LSP/watcher opening fds and never closing them).
That underlying leak is out of scope here — it can't be safely investigated while a single agent can take down the entire server. This issue is the prerequisite: contain the blast radius first, so the leak can be reproduced and debugged on a live box without losing SSH and lab.
Goal — per-agent resource isolation
Launch each agent in its own bounded systemd scope/slice with per-agent limits, e.g. via `systemd-run --scope --slice=lab-agents.slice -p LimitNOFILE=… -p MemoryMax=… -p TasksMax=…` instead of a bare `tmux new-session`. A runaway agent then hits its own `EMFILE`/limit and dies alone, leaving `sshd`, `nscd`, and `lab` alive and the VM reachable.
Acceptance criteria
- […] `LimitNOFILE`, and ideally `MemoryMax`/`TasksMax`).
- […] `lab.cloonar.com` stay responsive.
- […] `sshd`, `nscd`, `lab`) retain reserved headroom so the box stays recoverable under agent pressure.
Out of scope / follow-ups
- […] `-max-instances` (default 6) and VM RAM/inotify sysctls for higher safe concurrency.
Pointers
- `hosts/fw/vms/dev/modules/lab/sessions.go`: `StartCommand` (the bare `tmux new-session -d` launch path).
- `hosts/fw/vms/dev/modules/lab/default.nix`: the `lab` systemd unit and `ExecStart` (`-max-instances` defaults to 6, `handlers.go`).
- `hosts/fw/vms/dev/default.nix`: microvm sizing (12 GB / 4 vcpu).
Agent Brief
Category: enhancement
Summary: Bound each lab-spawned session's open-file count so one runaway agent hits its own EMFILE and dies alone, instead of exhausting the dev VM system-wide.
Current behavior:
lab launches every session (manual instances, AFK runs, and the `claude auth login` flow) by handing an argv to its tmux-session wrapper (the `Sessions` type), which runs `tmux new-session -d` on the shared default tmux socket. The spawned process tree inherits the system-default `RLIMIT_NOFILE` (~524288 hard), and since a process can raise its own soft limit up to that hard ceiling, a single agent leaking file descriptors can consume system-grade resources. There is no per-session bound. On 2026-06-02, one active + one idle agent drove the VM to system-wide fd exhaustion (`ENFILE`): lab and SSH died together and the box needed a reboot. Note: `new-session -d` daemonizes the session under a long-lived shared server, so a bound applied to the tmux invocation would not reach the agent tree.
Desired behavior:
Every session lab spawns runs under a per-process hard `RLIMIT_NOFILE` low enough that a descriptor leak hits `EMFILE` on itself (locally, well before it can drive the system to `ENFILE`), while trusted services (sshd, nscd, lab) keep their high default ceiling and stay responsive. The bound must apply to the agent's whole process tree (claude + node + language servers + watchers), so it is set on the inner command from inside the pane (after tmux daemonizes), not on the tmux call. Both soft and hard limits are set to the cap, so the process cannot raise its soft limit back up. The cap is operator-configurable without a recompile.
Design decisions (settled during triage; honor these, don't relitigate):
- […] `RLIMIT_NOFILE` via `prlimit`), NOT cgroups. The observed failure was fd exhaustion; NOFILE is the exact lever and needs no new privilege, no systemd user-manager lingering, no controller delegation. `systemd-run --scope` cgroup limits (`MemoryMax`/`TasksMax`) are a possible future follow-up, deliberately out of scope: memory is already mitigated by the VM's zramSwap and did not fire in the incident.
- `NOFILE` only. Do NOT add `RLIMIT_NPROC` (enforced per-UID system-wide, so a per-session cap would punish unrelated `dominik` processes: SSH logins, lab, sibling agents) or `RLIMIT_AS` (node/V8 reserve huge virtual address space; an AS cap tends to kill claude).
- […] `%s`) substitution and with AFK's trailing seed-prompt argument.
Key interfaces:
- `Sessions` type (lab's tmux-session wrapper): its start path should prepend a descriptor-cap wrapper to the already-substituted argv before `tmux new-session`, conditional on a configured cap. When the cap is 0/unset, spawn bare (tests and minimal configs unaffected).
- `-session-nofile` (int, default 16384), wired from `main` into `Sessions` and set in the lab service's `ExecStart`. 16384 errs generous (a busy agent running claude + multiple LSPs + inotify + ripgrep + the forgejo MCP server can hold low-thousands of fds); containment holds at any value far below 524288, so start generous and tune down.
- `prlimit` binary, following lab's convention of pinning each external tool behind a flag (as for tmux/claude/git/tea): add a `-prlimit` flag and put `util-linux` on the lab service PATH, since the inner command resolves against the restricted service PATH.
- […] `util-linux` alongside the existing `tmux` so the integration test runs in `checkPhase`.
Acceptance criteria:
- […] `RLIMIT_NOFILE` equal to the configured cap.
- […] `ExecStart`.
- […] `Sessions` tests) starts a capped session whose stand-in reports its own soft+hard NOFILE and asserts both equal the cap, proving the limit survives `new-session -d` daemonization. A small cap value is fine; propagation is value-independent.
- `max-instances` × cap stays far below the live VM's `fs.file-max`/`fs.nr_open`, grounded with a read-only `sysctl` check on dev and recorded in the PR.
Verification (safe, no fault injection):
Original criterion 2 ("a runaway fails that agent only; SSH + lab stay up") follows from (a) the integration tests proving the cap propagates and binds the tree, and (b) the headroom arithmetic proving agents collectively can't deplete the system pool, so deliberately exhausting fds on the live VM is unnecessary. Optionally, once a healthy agent is running, a read-only `ls /proc/<pid>/fd | wc -l` confirms 16384 has margin over steady-state.
Out of scope:
- […] `systemd-run --scope`, `MemoryMax`, `TasksMax`) and the user-manager lingering they need.
- […] `-max-instances` (stays at 6) and tuning VM RAM / inotify sysctls.
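
The wrapper described under "Key interfaces" and the headroom product from the acceptance criteria could be sketched in Go roughly as follows. This is an illustration under stated assumptions, not lab's actual code: `capArgv` and `worstCaseFDs` are hypothetical names, the real `Sessions` start path may build its argv differently, and the `prlimit` binary is assumed to be the util-linux one, which accepts `--nofile=soft:hard` followed by a command to run.

```go
package main

import (
	"fmt"
	"strconv"
)

// capArgv prepends a prlimit invocation to an already-substituted argv so the
// command starts with soft and hard RLIMIT_NOFILE both set to nofile.
// A cap of 0 (or less) means "unset": the argv is returned unchanged.
// Hypothetical helper; the name and signature are not taken from lab.
func capArgv(prlimitBin string, nofile int, argv []string) []string {
	if nofile <= 0 || len(argv) == 0 {
		return argv
	}
	limit := strconv.Itoa(nofile)
	wrapped := make([]string, 0, len(argv)+3)
	// "--" ends prlimit's own options, so the inner argv is passed through as-is.
	wrapped = append(wrapped, prlimitBin, "--nofile="+limit+":"+limit, "--")
	return append(wrapped, argv...)
}

// worstCaseFDs is the headroom product from the brief: max-instances × cap.
func worstCaseFDs(maxInstances, nofile int) int {
	return maxInstances * nofile
}

func main() {
	// Illustrative values: the brief's default cap and a stand-in agent argv
	// with a trailing seed-prompt argument.
	argv := capArgv("prlimit", 16384, []string{"claude", "seed prompt"})
	fmt.Printf("%q\n", argv)
	fmt.Println("max-instances x cap:", worstCaseFDs(6, 16384))
}
```

Because the wrapper is part of the inner command, it runs inside the pane after tmux has daemonized, which is the propagation property the integration test is meant to prove. The product is then compared against the values a read-only `sysctl` check reports on dev.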