Isolate lab agents so one runaway can't wedge the dev VM #76

Closed
opened 2026-06-02 13:12:01 +02:00 by dominik.polakovics · 1 comment

Problem

lab spawns each Claude agent as a bare detached tmux session (Sessions.StartCommand → tmux new-session -d … in hosts/fw/vms/dev/modules/lab/sessions.go), with no resource isolation. The session's process tree runs as user dominik outside any bounded cgroup, sharing the same global resource pool as sshd, nscd/nsncd, and lab itself.

The consequence: a single agent's resource runaway is not contained — it exhausts a system-wide limit and takes down the entire dev microvm.

Observed incident (2026-06-02)

The dev VM wedged: lab.cloonar.com returned tmux: … too many open files in system and SSH into dev failed simultaneously.

Diagnosis: system-wide file-descriptor exhaustion (ENFILE), not OOM.

  • No OOM-killer fired; userspace RAM was not the limit.
  • All count ceilings were huge and ruled out: fs.file-max = LONG_MAX, fs.inotify.max_user_instances/max_user_watches = 524288, lab/nscd LimitNOFILE = 524288, kernel.pid_max = 4194304.
  • Independent confirmation of the system-wide condition: nsncd failed every getservbyname_r with code 23 (= ENFILE) in a continuous storm from 12:05 until reboot — nsncd was a victim, clean at idle afterward.
  • lab itself holds ~640 fds at idle and does not leak.

Because the limit hit was system-wide, every process needing a new fd or NSS lookup failed at once — which is why lab and SSH died together and the box could only be recovered by a reboot.

Suspected fd leak (motivation for this issue)

At the time, only one agent was actively running and one was idle, yet the whole VM still wedged. Two idle/lightly-loaded agents should not exhaust a system with effectively-infinite count ceilings — so there is likely a file-descriptor leak in an agent's process tree (claude/node/LSP/watcher opening fds and never closing them).

That underlying leak is out of scope here — it can't be safely investigated while a single agent can take down the entire server. This issue is the prerequisite: contain the blast radius first, so the leak can be reproduced and debugged on a live box without losing SSH and lab.

Goal — per-agent resource isolation

Launch each agent in its own bounded systemd scope/slice with per-agent limits, e.g. via systemd-run --scope --slice=lab-agents.slice -p LimitNOFILE=… -p MemoryMax=… -p TasksMax=… instead of a bare tmux new-session. A runaway agent then hits its own EMFILE/limit and dies alone, leaving sshd, nscd, and lab alive and the VM reachable.

Acceptance criteria

  • Each lab-spawned agent runs in a dedicated, resource-bounded cgroup (own LimitNOFILE, and ideally MemoryMax/TasksMax).
  • A single agent exhausting its fd budget fails that agent only — SSH into dev and lab.cloonar.com stay responsive.
  • Critical services (sshd, nscd, lab) retain reserved headroom so the box stays recoverable under agent pressure.
  • tmux's session daemonization (double-fork) is handled so the limits actually apply to the agent tree, not just the launching call.

Out of scope / follow-ups

  • Root-causing and fixing the actual fd leak (separate issue — unblocked by this one).
  • Right-sizing -max-instances (default 6) and VM RAM/inotify sysctls for higher safe concurrency.
  • fd/mem/inotify monitoring (promtail/victoriametrics) to observe per-agent usage and set the cap empirically.

Pointers

  • hosts/fw/vms/dev/modules/lab/sessions.go: StartCommand (the bare tmux new-session -d launch path).
  • hosts/fw/vms/dev/modules/lab/default.nix: the lab systemd unit and ExecStart (-max-instances defaults to 6, handlers.go).
  • hosts/fw/vms/dev/default.nix: microvm sizing (12 GB / 4 vcpu).
Author
Owner

This was generated by AI during triage.

Agent Brief

Category: enhancement
Summary: Bound each lab-spawned session's open-file count so one runaway agent hits its own EMFILE and dies alone, instead of exhausting the dev VM system-wide.

Current behavior:
lab launches every session — manual instances, AFK runs, and the claude auth login flow — by handing an argv to its tmux-session wrapper (the Sessions type), which runs tmux new-session -d on the shared default tmux socket. The spawned process tree inherits the system-default RLIMIT_NOFILE (~524288 hard), and since a process can raise its own soft limit up to that hard ceiling, a single agent leaking file descriptors can consume system-grade resources. There is no per-session bound. On 2026-06-02, one active + one idle agent drove the VM to system-wide fd exhaustion (ENFILE): lab and SSH died together and the box needed a reboot. Note: new-session -d daemonizes the session under a long-lived shared server, so a bound applied to the tmux invocation would not reach the agent tree.

Desired behavior:
Every session lab spawns runs under a per-process hard RLIMIT_NOFILE low enough that a descriptor leak hits EMFILE on itself — locally, well before it can drive the system to ENFILE — while trusted services (sshd, nscd, lab) keep their high default ceiling and stay responsive. The bound must apply to the agent's whole process tree (claude + node + language servers + watchers), so it is set on the inner command from inside the pane (after tmux daemonizes), not on the tmux call. Both soft and hard limits are set to the cap, so the process cannot raise its soft limit back up. The cap is operator-configurable without a recompile.

Design decisions (settled during triage — honor these, don't relitigate):

  • Use a per-process rlimit (RLIMIT_NOFILE via prlimit), NOT cgroups. The observed failure was fd exhaustion; NOFILE is the exact lever and needs no new privilege, no systemd user-manager lingering, no controller delegation. systemd-run --scope cgroup limits (MemoryMax/TasksMax) are a possible future follow-up, deliberately out of scope — memory is already mitigated by the VM's zramSwap and did not fire in the incident.
  • NOFILE only. Do NOT add RLIMIT_NPROC (enforced per-UID system-wide, so a per-session cap would punish unrelated dominik processes — SSH logins, lab, sibling agents) or RLIMIT_AS (node/V8 reserve huge virtual address space; an AS cap tends to kill claude).
  • Apply the cap uniformly in the single spawn chokepoint every session routes through, not per-call-site. It must compose with the existing session-name (%s) substitution and with AFK's trailing seed-prompt argument.
  • Keep the single shared tmux server (no per-agent sockets). Capping the inner command makes daemonization irrelevant.

Key interfaces:

  • The Sessions type (lab's tmux-session wrapper): its start path should prepend a descriptor-cap wrapper to the already-substituted argv before tmux new-session, conditional on a configured cap. When the cap is 0/unset, spawn bare (tests and minimal configs unaffected).
  • A new flag, suggested -session-nofile (int, default 16384), wired from main into Sessions and set in the lab service's ExecStart. 16384 errs on the generous side (a busy agent running claude, multiple LSPs, inotify, ripgrep, and the forgejo MCP server can hold low-thousands of fds); containment holds at any value far below 524288, so start generous and tune down.
  • A pinned prlimit binary, following lab's convention of pinning each external tool behind a flag (as for tmux/claude/git/tea): add a -prlimit flag and put util-linux on the lab service PATH, since the inner command resolves against the restricted service PATH.
  • The lab derivation's check inputs: add util-linux alongside the existing tmux so the integration test runs in checkPhase.
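
A minimal sketch of how the spawn chokepoint could compose these interfaces, assuming the flag names suggested above. The helper names substituteSessionName and capArgv, the agent-cmd template, and the session name agent-1 are hypothetical and are not taken from lab's source. The prlimit invocation uses util-linux's --nofile=soft:hard form followed by the command to run.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// substituteSessionName replaces every "%s" placeholder in argv with the
// session name. Hypothetical stand-in for lab's existing substitution step.
func substituteSessionName(argv []string, name string) []string {
	out := make([]string, len(argv))
	for i, a := range argv {
		out[i] = strings.ReplaceAll(a, "%s", name)
	}
	return out
}

// capArgv prefixes argv with a prlimit call that sets the soft and hard
// RLIMIT_NOFILE to nofile. A cap of 0 (or an empty argv) returns argv
// unchanged, so tests and minimal configs spawn bare.
func capArgv(prlimitPath string, nofile int, argv []string) []string {
	if nofile <= 0 || len(argv) == 0 {
		return argv
	}
	wrapped := make([]string, 0, len(argv)+2)
	wrapped = append(wrapped, prlimitPath, fmt.Sprintf("--nofile=%d:%d", nofile, nofile))
	return append(wrapped, argv...)
}

func main() {
	nofile := flag.Int("session-nofile", 16384, "per-session RLIMIT_NOFILE (soft and hard); 0 disables the cap")
	prlimit := flag.String("prlimit", "prlimit", "path to the util-linux prlimit binary")
	flag.Parse()

	// Illustrative template only: a placeholder for the session name plus a
	// trailing seed prompt, as an AFK run would append.
	template := []string{"agent-cmd", "--session", "%s", "seed prompt"}

	// Substitute first, then wrap, so the wrapper never sees a raw "%s".
	inner := capArgv(*prlimit, *nofile, substituteSessionName(template, "agent-1"))

	// The capped argv is what tmux runs inside the pane, after daemonizing.
	tmuxArgv := append([]string{"tmux", "new-session", "-d", "-s", "agent-1"}, inner...)
	fmt.Println(strings.Join(tmuxArgv, " "))
}
```

Because the wrapper is the first element of the inner command, the trailing seed prompt stays the last argument of the agent command, and the limit is applied by a process that already runs inside the pane.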

Acceptance criteria:

  • Every spawned session (manual instance, AFK run, login flow) runs with soft and hard RLIMIT_NOFILE equal to the configured cap.
  • The cap is a flag with a sensible default, set on the running service via its ExecStart.
  • A unit test asserts the spawn argv is prefixed with the cap wrapper when a cap is configured, and bare when the cap is 0.
  • An integration test (real tmux, like the existing Sessions tests) starts a capped session whose stand-in reports its own soft+hard NOFILE and asserts both equal the cap — proving the limit survives new-session -d daemonization. A small cap value is fine; propagation is value-independent.
  • That integration test includes a two-session case (two capped sessions on the same shared server), asserting both panes report the cap. This proves that starting a session on an already-running server does not bypass the wrapper.
  • max-instances × cap stays far below the live VM's fs.file-max/fs.nr_open, grounded with a read-only sysctl check on dev and recorded in the PR.
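
One way the integration test could read the limits back, sketched under assumptions: the stand-in in the pane is a plain cat /proc/self/limits, and the test parses the captured text. The function name parseNofile and the sample text are illustrative. A shell stand-in is suggested here because Go programs that import os raise their own soft NOFILE limit to the hard limit at startup (Go 1.19 and later), so a Go stand-in could mask a soft limit that was set lower than the hard one.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseNofile extracts the soft and hard "Max open files" values from text
// in the /proc/<pid>/limits format.
func parseNofile(limits string) (soft, hard uint64, err error) {
	sc := bufio.NewScanner(strings.NewReader(limits))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "Max open files") {
			continue
		}
		// Fields: "Max" "open" "files" <soft> <hard> "files"
		f := strings.Fields(line)
		if len(f) < 5 {
			return 0, 0, fmt.Errorf("short limits line: %q", line)
		}
		if soft, err = strconv.ParseUint(f[3], 10, 64); err != nil {
			return 0, 0, fmt.Errorf("soft limit: %w", err)
		}
		if hard, err = strconv.ParseUint(f[4], 10, 64); err != nil {
			return 0, 0, fmt.Errorf("hard limit: %w", err)
		}
		return soft, hard, nil
	}
	return 0, 0, fmt.Errorf("no \"Max open files\" line found")
}

func main() {
	// Illustrative sample, shaped like pane output captured from a session
	// capped at 64; not taken from a real run.
	const sample = "Limit                     Soft Limit           Hard Limit           Units\n" +
		"Max open files            64                   64                   files\n"

	soft, hard, err := parseNofile(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("soft=%d hard=%d\n", soft, hard)
}
```

The test would then assert soft == hard == cap for each pane, including both panes in the two-session case.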

Verification (safe — no fault injection):
Original criterion 2 ("a runaway fails that agent only; SSH + lab stay up") follows from (a) the integration tests proving the cap propagates and binds the tree, and (b) the headroom arithmetic proving agents collectively can't deplete the system pool — so deliberately exhausting fds on the live VM is unnecessary. Optionally, once a healthy agent is running, a read-only ls /proc/<pid>/fd | wc -l confirms 16384 has margin over steady-state.
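
The headroom arithmetic in (b) can be sketched with the numbers already quoted in this issue (6 instances, a 16384 cap, fs.file-max = LONG_MAX). The function names and the safety factor of 1000 are assumptions made for illustration; the real ceilings should come from the read-only sysctl check on dev. Since RLIMIT_NOFILE is enforced per process, the product is exact only if each session has one fd-heavy process; a session with several such processes can hold more.

```go
package main

import (
	"fmt"
	"math"
)

// worstCaseAgentFds is the product named in the acceptance criteria:
// max-instances × cap.
func worstCaseAgentFds(maxInstances, nofileCap uint64) uint64 {
	return maxInstances * nofileCap
}

// hasHeadroom reports whether worst stays at least factor times below
// ceiling. The factor is an illustrative reading of "far below".
func hasHeadroom(worst, ceiling, factor uint64) bool {
	return worst <= ceiling/factor
}

func main() {
	const (
		maxInstances uint64 = 6
		nofileCap    uint64 = 16384
		fileMax      uint64 = math.MaxInt64 // fs.file-max = LONG_MAX, as recorded in the incident
	)

	worst := worstCaseAgentFds(maxInstances, nofileCap)
	fmt.Printf("%d x %d = %d\n", maxInstances, nofileCap, worst)
	fmt.Println("far below fs.file-max:", hasHeadroom(worst, fileMax, 1000))
}
```

With the defaults this gives 6 × 16384 = 98304 descriptors.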

Out of scope:

  • Root-causing/fixing the underlying fd leak (separate follow-up; this issue is its prerequisite).
  • cgroup memory/process caps (systemd-run --scope, MemoryMax, TasksMax) and the user-manager lingering they need.
  • Right-sizing -max-instances (stays at 6) and tuning VM RAM / inotify sysctls.
  • fd/memory/inotify monitoring to set the cap empirically (follow-up; this ships a conservative default + a knob).
Reference
Cloonar/nixos#76