feat(logging): migrate fleet journal shipping from promtail to grafana-alloy #118

Open
opened 2026-06-07 12:53:04 +02:00 by dominik.polakovics · 1 comment

This was generated by AI during triage.

Agent Brief

Category: enhancement
Summary: Replace the shared utils/modules/promtail journal shipper with a grafana-alloy equivalent so central logging survives the 25.11→26.05 upgrade.

Background / why now:
nixos-26.05 removed promtail entirely: the services.promtail module is now a mkRemovedOptionModule stub, the pkgs.promtail package is a throw stub ("reached its end of life"), and the binary is no longer bundled in grafana-loki (its subPackages dropped cmd/promtail). Upstream's replacement is grafana-alloy (services.alloy.enable, v1.16.0 in 26.05); fluent-bit is the lighter alternative.

The shared module utils/modules/promtail is imported by all five hosts (amzebs-01, fw, mail, nas, web-arm). nas was the first to bump to 26.05 (#103 / PR #117) and had to drop the import, so central journald→Loki shipping is currently paused on nas (local journald intact). Every remaining host bump hits the same wall: fw #105, mail #107, web-arm #109, amzebs-01 #111. The Loki receiver (loki.cloonar.com) is unaffected — only the per-host shipper needs replacing.

Current behavior:
The shared promtail module runs services.promtail on each host: scrapes the systemd journal (json, max_age = 12h, job = systemd-journal) and pushes to https://loki.cloonar.com/loki/api/v1/push with HTTP basic auth (user promtail@cloonar.com, password from the sops secret promtail-password). Its pipeline_stages do non-trivial processing that must be preserved:

  • JSON-extract journal fields (_TRANSPORT, _SYSTEMD_UNIT, MESSAGE, the COREDUMP_* set)
  • default the unit label to the transport (so audit/kernel records get a unit)
  • reformat coredump records into a human-readable message
  • normalize session scope units (session-1234.scope → session.scope) to cap label cardinality
  • drop known noise lines (nscd inotify spam, rpi undervoltage, internet portscan "refused connection")
  • relabel __journal__hostname → host

Desired behavior:
A new shared module (utils/modules/alloy) runs grafana-alloy on each host as a drop-in replacement: reads the systemd journal and writes to the same Loki endpoint with the same auth, preserving the label set (host, unit, coredump_unit, job) and the same noise-dropping and coredump/session reshaping. After the swap, all hosts import utils/modules/alloy instead of utils/modules/promtail, and the promtail module is deleted. nas regains central shipping; the other four keep it through their 26.05 bumps.

Equivalent alloy components to evaluate: loki.source.journal (journal read) → loki.process (the pipeline_stages map onto alloy stage.json / stage.template / stage.regex / stage.replace / stage.drop / stage.labels / stage.output blocks) → loki.relabel (hostname→host) → loki.write (basic-auth push). Confirm the exact stage mapping against alloy's docs.
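
As a starting point for that evaluation, the chain could look roughly like the sketch below. It is not the converted config: the component labels (journal, debug), the extracted-value names (transport, unit, msg, coredump_unit), and the single drop pattern are illustrative assumptions, and the real drop patterns and the coredump template still have to be ported from the promtail module. The hostname rule is attached through relabel_rules on the journal source rather than as a later hop, on the understanding that the __journal_* labels are only visible at that point. Output goes to loki.echo so the stages can be inspected without credentials.

```alloy
// Sketch only; all component labels and extracted-value names are assumptions.

// Used purely for its exported rules, hence the empty forward_to.
loki.relabel "journal" {
  forward_to = []

  rule {
    source_labels = ["__journal__hostname"]
    target_label  = "host"
  }
}

loki.source.journal "journal" {
  format_as_json = true
  max_age        = "12h"
  labels         = { job = "systemd-journal" }
  relabel_rules  = loki.relabel.journal.rules
  forward_to     = [loki.process.journal.receiver]
}

loki.process "journal" {
  forward_to = [loki.echo.debug.receiver]

  // Pull the journal fields out of the JSON-formatted entry.
  stage.json {
    expressions = {
      transport     = "_TRANSPORT",
      unit          = "_SYSTEMD_UNIT",
      msg           = "MESSAGE",
      coredump_unit = "COREDUMP_UNIT",
    }
  }

  // Fall back to the transport when no unit was extracted.
  stage.template {
    source   = "unit"
    template = "{{ if .unit }}{{ .unit }}{{ else }}{{ .transport }}{{ end }}"
  }

  // Collapse session-<n>.scope to session.scope to cap label cardinality.
  stage.replace {
    source     = "unit"
    expression = "^(session-\\d+\\.scope)$"
    replace    = "session.scope"
  }

  // Placeholder pattern; copy the real noise filters from the promtail module.
  stage.drop {
    source     = "msg"
    expression = ".*refused connection.*"
  }

  // The coredump reformatting template is intentionally omitted here.

  stage.labels {
    values = {
      unit          = "",
      coredump_unit = "",
    }
  }

  // Ship the plain message instead of the raw JSON entry.
  stage.output {
    source = "msg"
  }
}

// Prints processed entries to Alloy's stdout for inspection.
loki.echo "debug" { }
```

Pointing forward_to at the receiver of a loki.write component instead of loki.echo.debug.receiver would turn this into the shipping pipeline.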

Secrets (needs human):
Alloy needs the same Loki basic-auth password. Per repo policy the agent must not edit secrets files — it should state the secret name/owner it expects (keep promtail-password, or rename to alloy-password) and the maintainer wires the value via sops.

Acceptance criteria:

  • A shared utils/modules/alloy module ships journal logs to https://loki.cloonar.com/loki/api/v1/push with basic auth, replacing utils/modules/promtail.
  • Processing/label parity: host (from __journal__hostname), unit (defaulting to transport), normalized session.scope, coredump_unit, reformatted coredump messages, and all existing drop filters are preserved.
  • All five hosts (amzebs-01, fw, mail, nas, web-arm) import the alloy module; utils/modules/promtail is removed.
  • pre-commit dry-build is green for every affected host (a shared-module change rebuilds all hosts).
  • Build-verified on 26.05, not just eval (eval misses 26.05 build gates).
  • Verified on a live host: that host's logs appear in Loki/Grafana after switch (human verify, like the bump-verify issues).

Out of scope:

  • Changing the Loki server, its URL, or its auth scheme.
  • Adding metrics/traces collection — this is logs-only parity.
  • The fluent-bit alternative — alloy is the chosen path.
  • Retiring central logging (the opposite decision) — this issue assumes we keep it.

Sequencing: Land this before fw (#105) is armed so the remaining 26.05 bumps don't silently lose logging. Relates to the staged upgrade (#103, #105, #107, #109, #111) and re-enables shipping on nas.

Author
Owner

Grilled the design; here's the resolved plan, split into a sequence.

Decisions (logs-only):

  • Tool: grafana-alloy (services.alloy); the Loki receiver is untouched.
  • Config: committed utils/modules/alloy/config.alloy via environment.etc."alloy/config.alloy".source (keeps hot-reload). Produced hybrid: alloy convert --source-format=promtail as a correctness oracle, then a hand-cleaned committed config, diffed component-by-component for parity.
  • Secret: sops alloy-env = LOKI_PASSWORD=<pw> via services.alloy.environmentFile + sys.env("LOKI_PASSWORD"). Keeps the module's DynamicUser sandbox; journal access is already granted by the module (SupplementaryGroups = [ "systemd-journal" ]).
  • Metrics consolidation explicitly deferred → #120.
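
The secret decision above could translate into a write component along these lines. This is a minimal sketch, assuming the component label central (a made-up name) and assuming LOKI_PASSWORD reaches the Alloy process through services.alloy.environmentFile; the URL and user are the ones the issue lists for the existing promtail setup.

```alloy
// Sketch only; "central" is an arbitrary component label.
loki.write "central" {
  endpoint {
    url = "https://loki.cloonar.com/loki/api/v1/push"

    basic_auth {
      username = "promtail@cloonar.com"
      // Taken from the service environment, so the password never
      // appears in the committed config file.
      password = sys.env("LOKI_PASSWORD")
    }
  }
}
```

Upstream components would then forward to loki.write.central.receiver.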

Execution (3 PRs):

  • PR0 — #123 (open): docker_29 eval-unblock + alloy sops scaffold. A prerequisite, because shared-path changes can't pass the pre-commit hook while fw/web-arm fail eval on docker_28.
  • PR1 — #121 (needs-triage): alloy module + switch nas (canary), verify in Grafana.
  • PR2 — #122 (needs-triage): fan out to fw/mail/web-arm/amzebs-01 + delete the promtail module.
Reference
Cloonar/nixos#118