feat(logging): migrate fleet journal shipping from promtail to grafana-alloy #118
Reference
Cloonar/nixos#118
Agent Brief
Category: enhancement
Summary: Replace the shared utils/modules/promtail journal shipper with a grafana-alloy equivalent so central logging survives the 25.11→26.05 upgrade.

Background / why now:
nixos-26.05 removed promtail entirely: the services.promtail module is now a mkRemovedOptionModule stub, the pkgs.promtail package is a throw stub ("reached its end of life"), and the binary is no longer bundled in grafana-loki (its subPackages dropped cmd/promtail). Upstream's replacement is grafana-alloy (services.alloy.enable, v1.16.0 in 26.05); fluent-bit is the lighter alternative.

The shared module utils/modules/promtail is imported by all five hosts (amzebs-01, fw, mail, nas, web-arm). nas was the first to bump to 26.05 (#103 / PR #117) and had to drop the import, so central journald→Loki shipping is currently paused on nas (local journald intact). Every remaining host bump hits the same wall: fw #105, mail #107, web-arm #109, amzebs-01 #111. The Loki receiver (loki.cloonar.com) is unaffected; only the per-host shipper needs replacing.

Current behavior:
The shared promtail module runs services.promtail on each host: it scrapes the systemd journal (json, max_age = 12h, job = systemd-journal) and pushes to https://loki.cloonar.com/loki/api/v1/push with HTTP basic auth (user promtail@cloonar.com, password from the sops secret promtail-password). Its pipeline_stages do non-trivial processing that must be preserved:
- […] _TRANSPORT, _SYSTEMD_UNIT, MESSAGE, the COREDUMP_* set)
- […] unit label to the transport (so audit/kernel records get a unit)
- […] session-1234.scope → session.scope) to cap label cardinality
- […] __journal__hostname → host

Desired behavior:
A new shared module (utils/modules/alloy) runs grafana-alloy on each host as a drop-in replacement: it reads the systemd journal and writes to the same Loki endpoint with the same auth, preserving the label set (host, unit, coredump_unit, job) and the same noise-dropping and coredump/session reshaping. After the swap, all hosts import utils/modules/alloy instead of utils/modules/promtail, and the promtail module is deleted. nas regains central shipping; the other four keep it through their 26.05 bumps.

Equivalent alloy components to evaluate:
loki.source.journal (journal read) → loki.process (the pipeline_stages map onto alloy stage.json / stage.template / stage.regex / stage.replace / stage.drop / stage.labels / stage.output blocks) → loki.relabel (hostname → host) → loki.write (basic-auth push). Confirm the exact stage mapping against alloy's docs.

Secrets (needs human):
Alloy needs the same Loki basic-auth password. Per repo policy the agent must not edit secrets files: it should state the secret name/owner it expects (keep promtail-password, or rename to alloy-password) and the maintainer wires the value via sops.

Acceptance criteria:
- utils/modules/alloy module ships journal logs to https://loki.cloonar.com/loki/api/v1/push with basic auth, replacing utils/modules/promtail.
- host (from __journal__hostname), unit (defaulting to transport), normalized session.scope, coredump_unit, reformatted coredump messages, and all existing drop filters are preserved.
- utils/modules/promtail is removed.

Out of scope:
- fluent-bit alternative: alloy is the chosen path.

Sequencing: Land this before fw (#105) is armed so the remaining 26.05 bumps don't silently lose logging. Relates to the staged upgrade (#103, #105, #107, #109, #111) and re-enables shipping on nas.
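
The component chain listed under "Equivalent alloy components to evaluate" might look roughly like the sketch below. It is a starting point under stated assumptions, not the parity-checked config. The component labels (journal, read, loki), the extracted-map key names, and the session regex are hypothetical. The LOKI_PASSWORD variable name is taken from the decisions recorded later in this thread. The sketch wires loki.relabel into the source through relabel_rules instead of as a separate hop after loki.process, on the assumption that the __journal_* labels are only visible at that point. The drop filters and the coredump reshaping are left out because the brief does not give their expressions.

```alloy
// Assumption: rules are consumed by the journal source, so nothing is forwarded here.
loki.relabel "journal" {
  forward_to = []

  // __journal__hostname -> host, as the promtail module does today.
  rule {
    source_labels = ["__journal__hostname"]
    target_label  = "host"
  }
}

loki.source.journal "read" {
  forward_to     = [loki.process.journal.receiver]
  relabel_rules  = loki.relabel.journal.rules
  format_as_json = true
  max_age        = "12h"
  labels         = { job = "systemd-journal" }
}

loki.process "journal" {
  forward_to = [loki.write.loki.receiver]

  // Pull the journal fields out of the JSON entry (key names are illustrative).
  stage.json {
    expressions = {
      transport = "_TRANSPORT",
      unit      = "_SYSTEMD_UNIT",
      msg       = "MESSAGE",
    }
  }

  // Fall back to the transport when the entry carries no systemd unit.
  stage.template {
    source   = "unit"
    template = "{{ if .unit }}{{ .unit }}{{ else }}{{ .transport }}{{ end }}"
  }

  // session-1234.scope -> session.scope: the captured "-1234" is replaced with "".
  stage.replace {
    source     = "unit"
    expression = `^session(-\d+)\.scope$`
    replace    = ""
  }

  // Promote the extracted unit to a label and ship only MESSAGE as the log line.
  stage.labels {
    values = { unit = "" }
  }

  stage.output {
    source = "msg"
  }
}

loki.write "loki" {
  endpoint {
    url = "https://loki.cloonar.com/loki/api/v1/push"

    basic_auth {
      username = "promtail@cloonar.com"
      password = sys.env("LOKI_PASSWORD")
    }
  }
}
```

Each stage here should still be diffed against the promtail pipeline and alloy's documentation before it is trusted, in particular how stage.template treats entries where _SYSTEMD_UNIT is absent.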
Grilled the design; here's the resolved plan, split into a sequence.
Decisions (logs-only):
- […] services.alloy); the Loki receiver is untouched.
- […] utils/modules/alloy/config.alloy via environment.etc."alloy/config.alloy".source (keeps hot-reload). Produced as a hybrid: alloy convert --source-format=promtail as a correctness oracle, then a hand-cleaned committed config, diffed component-by-component for parity.
- […] alloy-env = LOKI_PASSWORD=<pw> via services.alloy.environmentFile + sys.env("LOKI_PASSWORD"). Keeps the module's DynamicUser sandbox; journal access is already granted by the module (SupplementaryGroups = [ "systemd-journal" ]).

Execution (3 PRs):