Migrate dev to a self-managed QEMU VM fleet host (ADR-0018) #159
Tracks the migration decided in ADR-0018: move `dev` from a shared-store astro/microvm.nix guest to a self-managed QEMU VM fleet host.

ADR: https://git.cloonar.com/Cloonar/nixos/src/branch/main/docs/adr/0018-dev-self-managed-qemu-vm.md

Why
fw's `/nix/store` shared RO + a guest overlay corrupts under two independent GCs (ESTALE from fw's `nix-collect-garbage`, plus rootfs inode exhaustion), recoverable only by a reboot that kills every Claude/`lab` session. microvm.nix can't fix it (it direct-kernel-boots fw's toplevel from a read-only store). Full rationale + rejected alternatives in the ADR.

Target
- `dev` is a peer fleet host (`hosts/dev/`), pulled via bento, applied with its own `nixos-rebuild switch`.
- `vms/openclaw/` pattern); the VM shell stays under `hosts/fw/vms/dev/`.
- `utils/modules/qemu-vm.nix` → `cloonar.vms.<name>`; openclaw migrates onto it.
- `switch` + `bento-reboot` off + `lab.service` `KillMode=process`, independent store + aggressive GC.

Plan: parallel-validate on a temp IP, then swap

PR1: module + VM shells
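
A minimal sketch of the option surface such a module might declare, assuming the usual `mkOption` / `types.attrsOf (types.submodule …)` pattern. The option names come from the list in this PR; every type, default, unit, and description here is an assumption for illustration, and the service wiring (init oneshot, tap-up → qemu → tap-down) is left out.

```nix
# Hypothetical sketch of utils/modules/qemu-vm.nix (option declarations only).
# Types, defaults and units are assumptions, not taken from the real module.
{ lib, ... }:
let
  inherit (lib) mkOption types;

  # One submodule instance per VM under cloonar.vms.<name>.
  vmOpts = { ... }: {
    options = {
      mac = mkOption { type = types.str; description = "MAC address of the guest NIC."; };
      ip = mkOption { type = types.str; description = "Static guest IP."; };
      mem = mkOption { type = types.ints.positive; default = 4096; description = "Guest memory (MiB assumed)."; };
      vcpu = mkOption { type = types.ints.positive; default = 2; description = "Number of vCPUs."; };
      diskSizeG = mkOption { type = types.ints.positive; default = 20; description = "Disk size in GiB."; };
      autostart = mkOption { type = types.bool; default = true; description = "Start the VM at boot."; };
      cpuWeight = mkOption { type = types.nullOr types.ints.positive; default = null; description = "CPU weight for the VM unit, if set."; };
      cloudInit = mkOption { type = types.attrsOf types.anything; default = { }; description = "Data for the cloud-init seed ISO."; };
    };
  };
in
{
  options.cloonar.vms = mkOption {
    type = types.attrsOf (types.submodule vmOpts);
    default = { };
    description = "Self-managed QEMU VMs, keyed by name.";
  };
}
```

A consumer for the temp-IP phase might then read as follows. The MAC and the full IP are placeholders (the issue only gives the trailing `.97.16`), and the weight is an arbitrary example value.

```nix
{
  cloonar.vms.dev = {
    mac = "02:00:00:00:00:16"; # placeholder temp MAC
    ip = "192.0.2.16";         # placeholder; the real address ends in .97.16
    diskSizeG = 100;
    autostart = true;
    cpuWeight = 200;           # placeholder value
  };
}
```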
- `utils/modules/qemu-vm.nix` exposing `cloonar.vms.<name>` (state dir, init oneshot for cloud image + seed ISO, tap-up → qemu → tap-down; opts: mac / ip / mem / vcpu / diskSizeG / autostart / cpuWeight / cloudInit).
- `openclaw` onto `cloonar.vms.openclaw`: the two-consumer proof.
- `cloonar.vms.dev` on a temp `.97.16` + temp MAC, autostart, 100 G, cpuWeight. Old microvm keeps `.97.15` untouched.

PR2: hosts/dev + onboard
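
A sketch of two of the `hosts/dev/` settings this PR calls for, `lab` with `KillMode=process` and aggressive GC on the guest's own store. The file name, the GC schedule, and the retention window are assumptions; it also assumes the `lab` unit itself is defined by an existing module. The bento and sops wiring is omitted because the issue does not spell it out.

```nix
# Hypothetical fragment of a hosts/dev/ module; values are illustrative.
{ ... }:
{
  # With KillMode=process, systemd only kills the unit's main process on
  # stop/restart, so sessions spawned by lab are left running across a switch.
  systemd.services.lab.serviceConfig.KillMode = "process";

  # dev owns its store, so it can collect garbage on its own schedule.
  nix.gc = {
    automatic = true;
    dates = "daily";                     # assumed schedule
    options = "--delete-older-than 7d";  # assumed retention
  };
}
```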
- `.97.16` Ubuntu VM, run the README `nixos-infect` flow (local VM, not Hetzner: bring up static net; serial console is the safety net).
- `hosts/dev/` (port from `hosts/fw/vms/dev/`): development module, `lab` with `KillMode=process`, forgejo-mcp, users, sops, GC, bento, no borg, `hardware-configuration.nix` from infect.
- `.sops.yaml` `&dev`, new pubkey → `fleet.nix`, `./scripts/update-secrets-keys` (secrets move to `hosts/dev/secrets.yaml`).
- `dev` from `scripts/pre-commit`'s skip rule so `hosts/dev/` dry-builds as its own host.
- `.97.5` → `:8090`); a `switch` does not drop the session; a reboot boots a guest-built kernel.

PR3: swap + retire
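
For the CFS weight retarget in this PR, one possible shape is a plain `CPUWeight=` on the new unit. This is only a sketch: the unit name `qemu-vm-dev` and the weight are hypothetical, and how `cpu-priorities.nix` currently applies the weight to `microvm@dev` is not shown in this issue. If the module's `cpuWeight` option already sets this, the override below is unnecessary.

```nix
# Hypothetical: "qemu-vm-dev" stands in for whatever unit the new module
# creates for cloonar.vms.dev; 200 is a placeholder weight.
{ ... }:
{
  # CPUWeight accepts 1..10000; systemd's default is 100.
  systemd.services."qemu-vm-dev".serviceConfig.CPUWeight = 200;
}
```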
- `cloonar.vms.dev` + `hosts/dev/` net `.97.16 → .97.15`.
- `./vms/dev` microvm wrapper + config; retarget `cpu-priorities.nix` CFS weight from `microvm@dev` to the new unit.
- `rm` the old 51 G `rootfs.img` + `/var/lib/microvm-persist/dev`.

Notes
- `stateVersion` (infect `nixos-25.11`); dev then joins the staged 26.05 fleet upgrade.
- dnsmasq `dev.cloonar.com → .97.15` preserved (IP unchanged after the swap), so the lab ingress + Authelia path don't move.
- `/home` copy).

Split into #160 (ready-for-agent, PR1, directly actionable), #161 (ready-for-human, PR2, depends on #160), and #162 (needs-triage, PR3, blocked by #161). Closing this umbrella in favour of the three.