feat(logging): grafana-alloy shared module + nas canary (promtail→alloy step 1) #121

Closed
opened 2026-06-07 14:33:14 +02:00 by dominik.polakovics · 0 comments

Step 1 of #118 (promtail→alloy migration). The eval-gate + sops prerequisites land in the PR "fix(fw,web-arm): pin docker_29 to unblock fleet eval" — merge that first.

Agent Brief

Category: enhancement

Summary: Create a shared utils/modules/alloy module (grafana-alloy) that ships the systemd journal to Loki as a drop-in promtail replacement, and switch nas to it as the canary. Verify in Grafana before fanning the rest out (step 2).

Background / why: 26.05 removed promtail; nas (already on 26.05) dropped the import and currently has no central log shipping. grafana-alloy (services.alloy, present in both 25.11 and 26.05) is the vendor successor. The design was grilled and settled — see below.

Settled design decisions:

  • Committed utils/modules/alloy/config.alloy, surfaced via environment.etc."alloy/config.alloy".source (keeps hot-reload). It's a committed Alloy-language file, not Nix attrs.
  • Produced via the hybrid method: run alloy convert --source-format=promtail <rendered promtail.yaml> as a correctness oracle, then commit a hand-cleaned config and diff the two component-by-component to prove parity. (Render the current promtail config to YAML first — it's Nix-generated.)
  • Secret = Option D: sops secret alloy-env holding LOKI_PASSWORD=<pw>, fed via services.alloy.environmentFile, referenced as sys.env("LOKI_PASSWORD") in the loki.write basic_auth. Keeps the module's DynamicUser sandbox; systemd reads the env file as root. The .sops.yaml rule + utils/modules/alloy/secrets.yaml scaffold already exist (from the docker_29 PR); populate the value with sops utils/modules/alloy/secrets.yaml if not already done.
  • Journal access is free — the alloy module already sets SupplementaryGroups = [ "systemd-journal" ].
  • Logs-only scope; metrics consolidation is deferred to #120.
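A minimal sketch of how these decisions could fit together in utils/modules/alloy/default.nix. It assumes sops-nix is already imported fleet-wide and that the key inside secrets.yaml is named alloy-env; the file name default.nix and the layout are illustrative, not taken from the repository.

```nix
# utils/modules/alloy/default.nix (file name is an assumption)
{ config, ... }:
{
  # Option D: the env file holds LOKI_PASSWORD=<pw>. systemd reads it as
  # root, so the service can stay inside its DynamicUser sandbox.
  sops.secrets.alloy-env = {
    sopsFile = ./secrets.yaml; # scaffold from the docker_29 PR
  };

  services.alloy = {
    enable = true;
    environmentFile = config.sops.secrets.alloy-env.path;
  };

  # The committed Alloy-language file, surfaced under /etc/alloy.
  environment.etc."alloy/config.alloy".source = ./config.alloy;
}
```

Inside config.alloy, the loki.write basic_auth block would then read the password with sys.env("LOKI_PASSWORD"), so the secret never lands in the Nix store.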

Current behavior (must be preserved): utils/modules/promtail scrapes the journal (json, max_age 12h, job=systemd-journal) and pushes to https://loki.cloonar.com/loki/api/v1/push (basic auth promtail@cloonar.com). Pipeline: JSON-extract _TRANSPORT/_SYSTEMD_UNIT/MESSAGE/COREDUMP_*, default unit→transport, reshape coredumps into a human-readable message, normalize session-N.scope→session.scope, drop known noise (nscd inotify, rpi undervoltage, refused connection: IN=), relabel __journal__hostname→host.

Desired behavior: utils/modules/alloy reproduces the same labels (host, unit, coredump_unit, job), drops, and coredump/session reshaping, pushing to the same Loki endpoint with the same creds. Only nas imports it in this step; promtail stays on the other four (removed in step 2).
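On the host side, the canary switch could be as small as one extra import on nas. The hosts/nas path and the relative import below are assumptions about the repository layout; the other four hosts would keep their existing promtail import until step 2.

```nix
# hosts/nas/configuration.nix (hypothetical path)
{ ... }:
{
  imports = [
    # Shared grafana-alloy log shipper; only nas imports it in step 1.
    ../../utils/modules/alloy
  ];
}
```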

Acceptance:

  • pre-commit eval green for all hosts (docker_29 PR merged first).
  • After nas deploys: Grafana Loki shows nas's journal with labels host/unit/job; coredump reshaping and the noise drops verified via LogQL.

Depends on: the docker_29 eval-unblock PR. Part of #118.

Reference
Cloonar/nixos#121