Agent has no diagnostic SSH access to fw microVMs (e.g. web-02) #88

Closed
opened 2026-06-04 09:00:47 +02:00 by dominik.polakovics · 1 comment

Problem

Diagnosing services that run inside the fw microVMs (e.g. Invidious + invidious-companion on web-02) requires reading container logs/status from inside the VM, but the agent has no SSH path into the VMs.

Why

  • The agent's SSH config only defines a fleet Host block for the bare-metal hosts (fw web-arm mail amzebs-01 nas nb), using User diag + IdentityFile /run/secrets/diag-ssh-key (the read-only diag wrapper).
  • The microVMs on fw (e.g. web-02 at <prefix>.97.5) authorize root only, via utils/ssh-keys.nix. The agent's key (ssh-ed25519 …dominik@dev) is not in that list, and the VM FQDNs are not in the SSH config — so they fall through to Host * (no user/key) and the connection is refused (Permission denied (publickey) / Too many authentication failures).

Impact

Every service hosted in a microVM is undiagnosable by the agent without falling back to human-in-the-loop. On web-02 alone: Invidious/companion, Matrix (synapse + mautrix bridges), n8n, phpldapadmin, mcp-forgejo, lab.

Possible approaches (for triage)

  1. Extend the diag pattern into the VMs — add the diag user + diag-wrapper (utils/modules/diag) to the microVM configs, and add the VM FQDNs (web-02.cloonar.com, …) to the fleet Host block in the SSH config. Keeps access read-only.
  2. Provide a read-only jump path via fw (the VMs sit on the .97.0/24 LAN behind it).
  3. (Discouraged — too broad) add the agent key to utils/ssh-keys.nix, granting full root on every VM.

Prefer the read-only diag-wrapper model (approach 1), given that the VMs run sensitive services.

Context

Surfaced while diagnosing: Yattee fails to open videos; https://invidious.cloonar.com/api/v1/videos/<id> returns {"error":"Companion is starting. Please wait until a valid potoken is found…"} (persistent across probes). Companion root-cause work continues via HITL.

Author
Owner

This was generated by AI during triage.

Agent Brief

Category: enhancement
Summary: Extend the read-only diag channel (ADR-0005) into the web-02 fw microVM so the agent on dev can SSH in to diagnose its services, with the wrapper's secret-path denylist widened to cover web-02's service set.

Current behavior:
The agent runs on the dev microVM and can reach the bare-metal fleet hosts (fw, web-arm, mail, amzebs-01, nas, nb) over the read-only diag channel from ADR-0005: a non-root diag system user pinned by restrict + ForceCommand to the diag-wrapper allowlist, reached via the diag client SSH matchBlocks in the home-manager config using a SOPS-deployed private key. The fw guest microVMs were never added to that channel. web-02 authorizes only the shared admin keys from utils/ssh-keys.nix and has no diag user, so the agent's own key is refused — every service inside web-02 (Invidious + companion, Matrix/synapse + mautrix bridges, n8n, phpldapadmin, mcp-forgejo, lab) is undiagnosable without falling back to a human.

The network path itself already works and needs no change: fw permits server → vm-* traffic, web-02 runs sshd on port 22, and split-horizon DNS already resolves web-02.cloonar.com to its internal LAN address from inside dev. The only missing piece is authorization.

Desired behavior:
From the dev microVM, ssh web-02 '<allowed diag command>' succeeds as the read-only diag user, subject to the same wrapper allowlist and audit logging as the rest of the fleet, and cannot read web-02's service secrets on disk. ssh web-02 '<mutating or disallowed command>' is still rejected by the wrapper exactly as on the bare-metal hosts.
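
The accept/reject behavior described above can be pictured with a minimal sketch. It is not the actual diag-wrapper, whose implementation is not shown in this issue. The command set, the systemctl verbs, and the function name `is_allowed` are assumptions chosen for illustration; only cat/head/tail/ls, systemctl status, and journal reads are mentioned in the brief.

```python
import shlex

# Hypothetical allowlist: the real diag-wrapper's list is not shown in this issue.
READ_ONLY_COMMANDS = {"cat", "head", "tail", "ls", "journalctl"}
READ_ONLY_SYSTEMCTL_VERBS = {"status", "is-active", "list-units"}

# Characters that would let a caller chain, redirect, or substitute commands.
SHELL_METACHARACTERS = ";|&><`$"


def is_allowed(original_command: str) -> bool:
    """Return True if the requested command is read-only under this sketch's rules."""
    if any(ch in original_command for ch in SHELL_METACHARACTERS):
        return False
    try:
        argv = shlex.split(original_command)
    except ValueError:
        return False  # unbalanced quotes: refuse rather than guess
    if not argv:
        return False
    program, args = argv[0], argv[1:]
    if program == "systemctl":
        # Only the first argument is inspected, so leading flags are refused too.
        return bool(args) and args[0] in READ_ONLY_SYSTEMCTL_VERBS
    return program in READ_ONLY_COMMANDS
```

With ForceCommand in place, sshd exposes the command the client asked for in the `SSH_ORIGINAL_COMMAND` environment variable, so a wrapper of this shape would pass that string to `is_allowed` before executing anything. Path arguments are deliberately not checked here; that is the job of the path denylist.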

Key interfaces:

  • The shared diag NixOS module (ADR-0005) — reuse it unchanged on web-02. It authorizes the already-committed diag public key, so no new SOPS secret is required; the matching private key already exists on dev.
  • The diag client SSH config (home-manager matchBlocks for the diag channel) — must gain a host entry so that connecting to web-02 uses User diag + the diag identity file (resolving to web-02.cloonar.com), instead of falling through to the default block and offering the agent's personal key.
  • The diag-wrapper path denylist — currently curated for the bare-metal hosts' services. It must be widened so the cat/head/tail/ls allowances cannot read web-02's on-disk secrets. At minimum cover the service state dirs that hold credentials/keys — Matrix synapse, the mautrix-* bridges, matrix-authentication-service, n8n, zammad — and ensure the existing SSH host-key denial still holds for web-02 even though its host keys live under the impermanence /persist tree (the denylist must match the realpath-resolved location, not just /etc/ssh/...). The denylist must keep winning over the allowlist.
  • ADR-0005 — amend it (or add a short extending ADR) to record that the diag channel now reaches fw guest microVMs (web-02), noting the widened denylist and that the journal-read tradeoff now also covers web-02's sensitive services.
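
The denylist requirement (match the realpath-resolved location, and let deny win over allow) can be sketched as follows. This is an illustration under stated assumptions, not the wrapper's real code: the root paths and the names `may_read`, `DENY_ROOTS`, and `ALLOW_ROOTS` are hypothetical, and the sketch assumes the persisted host keys are reached through symlinks, which `os.path.realpath` can resolve.

```python
import os

# Hypothetical roots: the real denylist and the exact state paths are not given in the issue.
DENY_ROOTS = ("/persist/etc/ssh", "/var/lib/matrix-synapse", "/var/lib/n8n")
ALLOW_ROOTS = ("/etc", "/var/log", "/var/lib")


def _is_under(path: str, root: str) -> bool:
    """True if path equals root or lies below it, compared per path component."""
    return os.path.commonpath([path, root]) == root


def may_read(path: str, deny_roots=DENY_ROOTS, allow_roots=ALLOW_ROOTS) -> bool:
    """Decide on the symlink-resolved path; any deny match beats every allow match."""
    real = os.path.realpath(path)
    resolved_deny = [os.path.realpath(root) for root in deny_roots]
    resolved_allow = [os.path.realpath(root) for root in allow_roots]
    if any(_is_under(real, root) for root in resolved_deny):
        return False
    return any(_is_under(real, root) for root in resolved_allow)
```

Resolving the requested path before matching is what keeps a request for a key under `/etc/ssh/...` denied when the file actually lives under `/persist`. Checking the deny roots first means a deny root nested inside an allowed tree still blocks the read.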

Acceptance criteria:

  • All affected fw configs dry-build clean via the pre-commit hook (scripts/test-configuration).
  • web-02's guest config gains the diag user via the shared diag module; no new secret is added to any secrets.yaml.
  • The diag client SSH config routes web-02 to User diag + the diag identity, so the agent no longer offers its personal key to web-02.
  • The diag-wrapper denylist denies reads of web-02's service secret/state dirs (synapse, mautrix-*, matrix-authentication-service, n8n, zammad) and of web-02's SSH host private keys at their real /persist location, with the deny rules winning over the read allowlist.
  • The wrapper's existing allowlist/denylist behavior for the bare-metal hosts is unchanged (no regression).
  • ADR-0005 is updated (or a new ADR added) documenting the extension to guest microVMs and the denylist change.
  • The PR description records the one post-merge check the agent cannot self-perform (prod SSH is human-gated): after deploy, ssh web-02 'systemctl status <svc>' succeeds read-only, while ssh web-02 'cat <a synapse/n8n secret>' and any mutating command are refused.

Out of scope:

  • Adding diag to the other guest microVMs (forgejo-runner, openclaw) — this issue is web-02 only; the same pattern can follow later if wanted.
  • Adding diag to dev itself — deliberately excluded by ADR-0005 (it's the agent's own host, where dominik already has passwordless sudo).
  • Any change to the bare-metal hosts' diag setup or to utils/ssh-keys.nix root authorization (approach #3 — granting the agent full root on the VMs — is explicitly rejected).
  • Diagnosing or fixing the underlying Invidious/companion potoken issue that surfaced this gap — tracked separately.
  • Building an MCP-based diagnostic interface (the deferred alternative in ADR-0005).
Reference
Cloonar/nixos#88