Agent has no diagnostic SSH access to fw microVMs (e.g. web-02) #88
Reference: Cloonar/nixos#88
Problem
Diagnosing services that run inside the `fw` microVMs (e.g. Invidious + `invidious-companion` on `web-02`) requires reading container logs/status from inside the VM, but the agent has no SSH path into the VMs.
Why
- The SSH config has a `Host` block for the bare-metal hosts (`fw web-arm mail amzebs-01 nas nb`) using `User diag` + `IdentityFile /run/secrets/diag-ssh-key` (the read-only `diag` wrapper).
- The microVMs on `fw` (e.g. `web-02` at `<prefix>.97.5`) authorize root only, via `utils/ssh-keys.nix`. The agent's key (`ssh-ed25519 …dominik@dev`) is not in that list, and the VM FQDNs are not in the SSH config, so they fall through to `Host *` (no user/key) and the connection is refused (`Permission denied (publickey)` / `Too many authentication failures`).
Impact
Every service hosted in a microVM is undiagnosable by the agent without falling back to human-in-the-loop. On `web-02` alone: Invidious/companion, Matrix (synapse + mautrix bridges), n8n, phpldapadmin, mcp-forgejo, lab.
Possible approaches (for triage)
1. Extend the `diag` pattern into the VMs: add the `diag` user + diag-wrapper (`utils/modules/diag`) to the microVM configs, and add the VM FQDNs (`web-02.cloonar.com`, …) to the fleet `Host` block in the SSH config. Keeps access read-only.
2. Reach the VMs through `fw` (the VMs sit on the `.97.0/24` LAN behind it).
3. Add the agent's key to `utils/ssh-keys.nix`, granting full root on every VM.
Prefer the read-only diag-wrapper model (#1) given the VMs run sensitive services.
Context
Surfaced while diagnosing: Yattee fails to open videos; `https://invidious.cloonar.com/api/v1/videos/<id>` returns `{"error":"Companion is starting. Please wait until a valid potoken is found…"}` (persistent across probes). Companion root-cause work continues via HITL.
Agent Brief
Category: enhancement
Summary: Extend the read-only `diag` channel (ADR-0005) into the `web-02` fw microVM so the agent on `dev` can SSH in to diagnose its services, with the wrapper's secret-path denylist widened to cover web-02's service set.
Current behavior:
The agent runs on the `dev` microVM and can reach the bare-metal fleet hosts (`fw`, `web-arm`, `mail`, `amzebs-01`, `nas`, `nb`) over the read-only `diag` channel from ADR-0005: a non-root `diag` system user pinned by `restrict` + `ForceCommand` to the diag-wrapper allowlist, reached via the diag client SSH `matchBlocks` in the home-manager config using a SOPS-deployed private key. The fw guest microVMs were never added to that channel. `web-02` authorizes only the shared admin keys from `utils/ssh-keys.nix` and has no `diag` user, so the agent's own key is refused: every service inside web-02 (Invidious + companion, Matrix/synapse + mautrix bridges, n8n, phpldapadmin, mcp-forgejo, lab) is undiagnosable without falling back to a human.
The network path itself already works and needs no change: fw permits `server → vm-*` traffic, web-02 runs sshd on port 22, and split-horizon DNS already resolves `web-02.cloonar.com` to its internal LAN address from inside `dev`. The only missing piece is authorization.
Desired behavior:
From the `dev` microVM, `ssh web-02 '<allowed diag command>'` succeeds as the read-only `diag` user, subject to the same wrapper allowlist and audit logging as the rest of the fleet, and cannot read web-02's service secrets on disk. `ssh web-02 '<mutating or disallowed command>'` is still rejected by the wrapper exactly as on the bare-metal hosts.
Key interfaces:
- The `diag` NixOS module (ADR-0005): reuse it unchanged on web-02. It authorizes the already-committed diag public key, so no new SOPS secret is required; the matching private key already exists on `dev`.
- The diag client SSH config (home-manager `matchBlocks` for the diag channel): must gain a host entry so that connecting to `web-02` uses `User diag` + the diag identity file (resolving to `web-02.cloonar.com`), instead of falling through to the default block and offering the agent's personal key.
- The diag-wrapper secret-path denylist: must be widened so that the `cat`/`head`/`tail`/`ls` allowances cannot read web-02's on-disk secrets. At minimum cover the service state dirs that hold credentials/keys (Matrix synapse, the mautrix-* bridges, matrix-authentication-service, n8n, zammad) and ensure the existing SSH host-key denial still holds for web-02 even though its host keys live under the impermanence `/persist` tree (the denylist must match the realpath-resolved location, not just `/etc/ssh/...`). The denylist must keep winning over the allowlist.
Acceptance criteria:
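- The configuration builds (`scripts/test-configuration`).
- web-02 gains the `diag` user via the shared diag module; no new secret is added to any `secrets.yaml`.
- The diag client SSH config maps `web-02` to `User diag` + the diag identity, so the agent no longer offers its personal key to web-02.
- The wrapper denylist covers web-02's service secrets and its SSH host keys at their realpath-resolved `/persist` location, with the deny rules winning over the read allowlist.
- `ssh web-02 'systemctl status <svc>'` succeeds read-only, while `ssh web-02 'cat <a synapse/n8n secret>'` and any mutating command are refused.
Out of scope: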
- Other fw microVMs (`forgejo-runner`, `openclaw`): this issue is web-02 only; the same pattern can follow later if wanted.
- `dev` itself: deliberately excluded by ADR-0005 (it's the agent's own host, where `dominik` already has passwordless sudo).
- Changing the `utils/ssh-keys.nix` root authorization (approach #3, granting the agent full root on the VMs, is explicitly rejected).
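Illustrative note on the denylist requirement under Key interfaces: the point that deny rules must match the realpath-resolved location and win over the read allowlist can be sketched as below. This is a minimal Python sketch, not the actual diag-wrapper (whose implementation is not shown in this issue); `is_read_allowed`, `demo`, and the temporary directory layout are hypothetical stand-ins for `/etc/ssh` being a symlink into `/persist`.

```python
import os
import tempfile
from pathlib import Path


def is_read_allowed(path, allow_roots, deny_roots):
    """Hypothetical check: resolve symlinks first, then let deny rules win."""
    real = Path(os.path.realpath(path))

    def under(root):
        # Resolve the rule as well, so a rule written against a symlinked
        # directory still matches the real on-disk location.
        resolved_root = Path(os.path.realpath(root))
        return real == resolved_root or resolved_root in real.parents

    # Deny is evaluated before allow, so it always takes precedence.
    if any(under(d) for d in deny_roots):
        return False
    return any(under(a) for a in allow_roots)


def demo():
    """Build a throwaway tree that mimics an /etc/ssh -> /persist symlink."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        persisted = root / "persist" / "etc" / "ssh"
        persisted.mkdir(parents=True)
        (persisted / "ssh_host_ed25519_key").write_text("placeholder")
        (root / "etc").mkdir()
        (root / "etc" / "hostname").write_text("web-02")
        os.symlink(persisted, root / "etc" / "ssh")

        allow = [root / "etc"]                      # broad read allowance
        deny = [root / "persist" / "etc" / "ssh"]   # real location of host keys
        return (
            is_read_allowed(root / "etc" / "hostname", allow, deny),
            is_read_allowed(root / "etc" / "ssh" / "ssh_host_ed25519_key", allow, deny),
        )
```

Under these assumptions `demo()` returns `(True, False)`: the ordinary file under the allowed root is readable, while the host key requested through the symlinked path is refused because its resolved location falls under a deny root.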