feat(logging): grafana-alloy shared module + nas canary (promtail→alloy step 1) #124

Merged
dominik.polakovics merged 1 commit from afk/121 into main 2026-06-07 16:32:23 +02:00

What

Step 1 of the promtail→alloy migration (#118). 26.05 removed services.promtail, so nas (already on 26.05) had no central journal shipping. This adds a shared utils/modules/alloy module (grafana-alloy) as a drop-in promtail replacement and switches nas to it as the canary. The other four hosts stay on promtail until step 2.

How

  • utils/modules/alloy/config.alloy: committed Alloy-language pipeline, surfaced via environment.etc."alloy/config.alloy".source so it hot-reloads (not a Nix-rendered store path). Reproduces promtail's journal scrape exactly: labels host/unit/coredump_unit/job, unit→transport defaulting, coredump reshaping, session-N.scope→session.scope normalization, and the four noise drops; pushes to the same Loki endpoint.
  • utils/modules/alloy/default.nix: services.alloy.enable, environmentFile = the alloy-env sops secret, and the environment.etc config. --disable-reporting suppresses Alloy's anonymous telemetry (promtail had none).
  • hosts/nas/configuration.nix: imports ./utils/modules/alloy, replacing the "promtail dropped" placeholder.
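
For orientation, the shape of the shared module described in the list above might look roughly like the following. This is a minimal sketch, not the committed default.nix: it assumes sops-nix is already imported on the host, that the default sops file carries an alloy-env entry holding a LOKI_PASSWORD=... line, and that config.alloy sits next to the module. Only the option names (services.alloy.enable, environmentFile, extraFlags, environment.etc) come from this PR.

```nix
# Illustrative sketch of utils/modules/alloy/default.nix (layout is assumed).
{ config, ... }:
{
  # Assumption: sops-nix is configured elsewhere and knows an `alloy-env`
  # secret whose decrypted content is an env file (LOKI_PASSWORD=...).
  sops.secrets.alloy-env = { };

  services.alloy = {
    enable = true;
    # systemd reads this file as root before starting the unit, so the
    # DynamicUser service never needs read access to the secret itself.
    environmentFile = config.sops.secrets.alloy-env.path;
    # Suppress Alloy's anonymous usage reporting.
    extraFlags = [ "--disable-reporting" ];
  };

  # Expose the committed pipeline as /etc/alloy/config.alloy.
  environment.etc."alloy/config.alloy".source = ./config.alloy;
}
```

A host opts in by adding the module path to its imports list, which is all the nas change in this PR does. Inside config.alloy the password is then available as sys.env("LOKI_PASSWORD") rather than through a password file.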

Parity proof (hybrid method)

Rendered the current promtail config from the committed module to YAML, ran alloy convert --source-format=promtail on it as a correctness oracle, then hand-cleaned config.alloy and diffed it against the oracle component by component. All four components (discovery.relabel, loki.source.journal, loki.process, loki.write) are byte-identical to the oracle except for the credential, which follows Option D: LOKI_PASSWORD comes from the alloy-env sops secret via services.alloy.environmentFile and is read as sys.env("LOKI_PASSWORD"), which keeps the module's DynamicUser sandbox intact (systemd reads the env file as root).

Verification

  • alloy fmt accepts config.alloy (config is already canonical).
  • Pre-commit eval green for all hosts.

Post-merge (manual, per the issue)

After nas deploys, confirm in Grafana that Loki shows nas's journal with labels host/unit/job, and verify coredump reshaping + the noise drops via LogQL before fanning the rest out (step 2).

Closes #121

26.05 removed services.promtail, leaving nas (already on 26.05) with no
central journal shipping. Add a shared utils/modules/alloy module that
ships the systemd journal to Loki as a drop-in promtail replacement, and
switch nas to it as the migration canary (step 1 of #118).

The pipeline lives in a committed config.alloy (surfaced via
environment.etc for hot-reload), reproducing promtail's labels (host,
unit, coredump_unit, job), coredump/session reshaping, and known noise
drops, pushing to the same Loki endpoint with the same credentials.
Parity was proven by running `alloy convert --source-format=promtail` on
the rendered promtail config and diffing component-by-component; the only
divergence is the credential mechanism: LOKI_PASSWORD comes from the
alloy-env sops secret via services.alloy.environmentFile and is read as
sys.env("LOKI_PASSWORD"), keeping the module's DynamicUser sandbox.
--disable-reporting suppresses Alloy's anonymous telemetry (promtail had
none).

The other four hosts stay on promtail until step 2.

Closes #121
Author
Owner

This was generated by AI while landing a PR.

Validation: PASS

Signal relied on: the repo's commit-time gate — the pre-commit dry-build eval (scripts/test-configuration), reported green for all hosts in the PR body; not re-run, per the repo's gate model. I independently corroborated the two non-obvious NixOS option references resolve: services.alloy.environmentFile and services.alloy.extraFlags both exist and are stable (25.11 → unstable; 26.05 isn't indexed yet but sits between them and branched from unstable). The PR changes no derivation src/*Hash, so there's no build-time failure the eval-only gate would miss.

Checked:

  • Mergeable ✔, not a draft. Branch afk/121 is current with main (merge-base = tip of main).
  • AFK contract (ADR-0007): head afk/121 + body Closes #121 match → merge auto-closes the issue and releases the branch claim.
  • Conventions: title is Conventional Commits (feat(logging):); no secrets.yaml edits in the diff (scaffold untouched, value populated separately on main); no system.stateVersion change; module imported by explicit path (./utils/modules/alloy).
  • Parity (the core of the change): config.alloy reproduces utils/modules/promtail component-for-component: __journal__hostname→host relabel; journal json / max_age 12h / job=systemd-journal; the 8-field json extract; unit→transport default; coredump regex + reshape; session.scope normalization; coredump_unit/unit label promotion; all four noise drops; same Loki endpoint + promtail@cloonar.com basic_auth. The single intentional divergence is the credential (password_file → sys.env("LOKI_PASSWORD") from the alloy-env sops secret via environmentFile), matching the settled "Option D" in #121.

Deferred to post-deploy (by design, per #121's acceptance criteria): runtime confirmation in Grafana that nas's journal reaches Loki with labels host/unit/job, plus coredump reshaping + the noise drops via LogQL. The eval gate can't observe runtime log flow.

Verdict: ready to merge.

Reference
Cloonar/nixos!124