Migrate dev to a self-managed QEMU VM fleet host (ADR-0018) #159

Closed
opened 2026-06-14 13:50:39 +02:00 by dominik.polakovics · 1 comment

Tracks the migration decided in ADR-0018 — move dev from a shared-store astro/microvm.nix guest to a self-managed QEMU VM fleet host.
ADR: https://git.cloonar.com/Cloonar/nixos/src/branch/main/docs/adr/0018-dev-self-managed-qemu-vm.md

Why

fw's /nix/store shared RO + a guest overlay corrupts under two independent GCs — ESTALE from fw's nix-collect-garbage, plus rootfs inode exhaustion — recoverable only by a reboot that kills every Claude/lab session. microvm.nix can't fix it (it direct-kernel-boots fw's toplevel from a read-only store). Full rationale + rejected alternatives in the ADR.

Target

  • dev is a peer fleet host (hosts/dev/), pulled via bento, applied with its own nixos-rebuild switch.
  • Boots on a plain QEMU VM that fw only launches (the vms/openclaw/ pattern); the VM shell stays under hosts/fw/vms/dev/.
  • New utils/modules/qemu-vm.nix → cloonar.vms.<name>; openclaw migrates onto it.
  • Cattle (no backup), regenerate-and-re-onboard identity, bento auto-switch + bento-reboot off + lab.service KillMode=process, independent store + aggressive GC.
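
The last bullet could be sketched as follows in hosts/dev/, using only standard NixOS options. The GC schedule and retention window are illustrative assumptions, not values from the ADR. The bento settings are left out because their option names are not given here.

```nix
{ ... }:
{
  # KillMode=process makes systemd stop only the unit's main process, so
  # child processes spawned by lab (e.g. running sessions) are left alive
  # when the unit is stopped or restarted during a switch.
  systemd.services.lab.serviceConfig.KillMode = "process";

  # dev owns its store, so it can collect garbage on its own schedule.
  # "daily" and "3d" are placeholder values (assumptions).
  nix.gc = {
    automatic = true;
    dates = "daily";
    options = "--delete-older-than 3d";
  };
}
```

This fragment merges with the rest of the lab service definition through the module system, so it can live next to the ported development module rather than inside it.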

Plan — parallel-validate on a temp IP, then swap

PR1 — module + VM shells

  • Add utils/modules/qemu-vm.nix exposing cloonar.vms.<name> (state dir, init oneshot for cloud image + seed ISO, tap-up → qemu → tap-down; opts: mac / ip / mem / vcpu / diskSizeG / autostart / cpuWeight / cloudInit).
  • Migrate openclaw onto cloonar.vms.openclaw — the two-consumer proof.
  • Add cloonar.vms.dev on a temp .97.16 + temp MAC, autostart, 100 G, cpuWeight. Old microvm keeps .97.15 untouched.
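
A minimal sketch of how the option surface of utils/modules/qemu-vm.nix might be declared. The option names come from the list above; every type, default, and unit below is an assumption, and the systemd units (init oneshot, tap-up → qemu → tap-down) are omitted.

```nix
{ lib, ... }:
let
  inherit (lib) mkOption types;
in
{
  # One attribute per VM: cloonar.vms.<name> = { ... };
  options.cloonar.vms = mkOption {
    default = { };
    description = "Plain QEMU VMs that this host only launches.";
    type = types.attrsOf (types.submodule {
      options = {
        mac = mkOption { type = types.str; description = "Guest MAC address."; };
        ip = mkOption { type = types.str; description = "Guest IPv4 address."; };
        # Unit (MiB) is an assumption.
        mem = mkOption { type = types.ints.positive; description = "Guest memory in MiB."; };
        vcpu = mkOption { type = types.ints.positive; description = "Number of vCPUs."; };
        diskSizeG = mkOption { type = types.ints.positive; description = "Disk size in GiB."; };
        autostart = mkOption { type = types.bool; default = false; description = "Start the VM at boot."; };
        # null means "leave the systemd default CPU weight" (assumed semantics).
        cpuWeight = mkOption { type = types.nullOr types.ints.positive; default = null; description = "CFS weight for the VM unit."; };
        # Representation of cloud-init data is a guess; the real module may differ.
        cloudInit = mkOption { type = types.nullOr types.path; default = null; description = "Path to cloud-init user-data for the seed ISO."; };
      };
    });
  };
}
```

Using attrsOf with a submodule lets openclaw and dev share one implementation, which is what the two-consumer proof is meant to exercise.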

PR2 — hosts/dev + onboard

  • Console into the .97.16 Ubuntu VM, run the README nixos-infect flow (local VM, not Hetzner — bring up static net; serial console is the safety net).
  • Author hosts/dev/ (port from hosts/fw/vms/dev/): development module, lab with KillMode=process, forgejo-mcp, users, sops, GC, bento, no borg, hardware-configuration.nix from infect.
  • Onboard: new age key → .sops.yaml &dev, new pubkey → fleet.nix, ./scripts/update-secrets-keys (secrets move to hosts/dev/secrets.yaml).
  • Drop dev from scripts/pre-commit's skip rule so hosts/dev/ dry-builds as its own host.
  • Deploy; confirm bento converges. Validate hard: spawn a test Claude session; forgejo-mcp reachable from web (.97.5 → :8090); a switch does not drop the session; a reboot boots a guest-built kernel.

PR3 — swap + retire

  • Park sessions on the old microvm; flip cloonar.vms.dev + hosts/dev/ net .97.16 → .97.15.
  • Delete ./vms/dev microvm wrapper + config; retarget cpu-priorities.nix CFS weight from microvm@dev to the new unit.
  • Reclaim: rm the old 51 G rootfs.img + /var/lib/microvm-persist/dev.

Notes

  • Fresh install → new stateVersion (infect nixos-25.11); dev then joins the staged 26.05 fleet upgrade.
  • dnsmasq dev.cloonar.com → .97.15 preserved (IP unchanged after the swap), so the lab ingress + Authelia path don't move.
  • Clean start: re-clone projects, re-auth Claude (no /home copy).
  • Human-gated steps: console/infect, sops master keys, IP swaps.

Split into #160 (ready-for-agent, PR1 — directly actionable), #161 (ready-for-human, PR2, depends on #160), and #162 (needs-triage, PR3 — blocked by #161). Closing this umbrella in favour of the three.
