feat(logging): grafana-alloy shared module + nas canary (promtail→alloy step 1) #124
Reference
Cloonar/nixos!124
What
Step 1 of the promtail→alloy migration (#118). NixOS 26.05 removed `services.promtail`, so nas (already on 26.05) had no central journal shipping. This adds a shared `utils/modules/alloy` module (grafana-alloy) as a drop-in promtail replacement and switches nas to it as the canary. The other four hosts stay on promtail until step 2.

How
- `utils/modules/alloy/config.alloy`: committed Alloy-language pipeline, surfaced via `environment.etc."alloy/config.alloy".source` so it hot-reloads (not a Nix-rendered store path). Reproduces promtail's journal scrape exactly: labels `host`/`unit`/`coredump_unit`/`job`, unit→transport defaulting, coredump reshaping, `session-N.scope`→`session.scope` normalization, and the four noise drops; pushes to the same Loki endpoint.
- `utils/modules/alloy/default.nix`: sets `services.alloy.enable`, `environmentFile` = the `alloy-env` sops secret, and the `environment.etc` config. `--disable-reporting` suppresses Alloy's anonymous telemetry (promtail had none).
- `hosts/nas/configuration.nix`: imports `./utils/modules/alloy`, replacing the "promtail dropped" placeholder.

Parity proof (hybrid method)
Rendered the current promtail config from the committed module to YAML, ran `alloy convert --source-format=promtail` as a correctness oracle, then hand-cleaned `config.alloy` and diffed component-by-component. All four components (`discovery.relabel`, `loki.source.journal`, `loki.process`, `loki.write`) are byte-identical to the oracle except the credential. That follows Option D: `LOKI_PASSWORD` comes from the `alloy-env` sops secret via `services.alloy.environmentFile` and is read as `sys.env("LOKI_PASSWORD")`, which keeps the module's `DynamicUser` sandbox (systemd reads the env file as root).

Verification
`alloy fmt` accepts `config.alloy` (config is already canonical).

Post-merge (manual, per the issue)
After nas deploys, confirm in Grafana that Loki shows nas's journal with labels `host`/`unit`/`job`, and verify coredump reshaping and the noise drops via LogQL before fanning out to the remaining hosts (step 2).

Closes #121
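
For orientation, the wiring described under How for `utils/modules/alloy/default.nix` might look roughly like the sketch below. This is an illustration, not the committed file: it assumes sops-nix provides the secret (the `sops.secrets.alloy-env` declaration and its `.path` attribute are that assumption), and the relative `./config.alloy` path assumes the pipeline file sits next to the module. The `services.alloy` options and the `--disable-reporting` flag are the ones named in this PR.

```nix
# Hypothetical sketch of utils/modules/alloy/default.nix; the real module may differ.
{ config, ... }:
{
  # Assumption: sops-nix manages the env file that contains LOKI_PASSWORD=...
  sops.secrets.alloy-env = { };

  services.alloy = {
    enable = true;
    # systemd loads this file as root, so the DynamicUser sandbox stays intact
    # and config.alloy can read the value with sys.env("LOKI_PASSWORD").
    environmentFile = config.sops.secrets.alloy-env.path;
    # Suppress Alloy's anonymous usage reporting.
    extraFlags = [ "--disable-reporting" ];
  };

  # Ship the committed pipeline to /etc/alloy/config.alloy.
  environment.etc."alloy/config.alloy".source = ./config.alloy;
}
```

A host would then opt in by adding the module to its `imports`, as nas does here with `./utils/modules/alloy`.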
Validation: PASS ✅
Signal relied on: the repo's commit-time gate, i.e. the pre-commit dry-build eval (`scripts/test-configuration`), reported green for all hosts in the PR body; not re-run, per the repo's gate model. I independently corroborated that the two non-obvious NixOS option references resolve: `services.alloy.environmentFile` and `services.alloy.extraFlags` both exist and are stable (25.11 → unstable; 26.05 isn't indexed yet but sits between them and branched from unstable). The PR changes no derivation `src`/`*Hash`, so there's no build-time failure the eval-only gate would miss.

Checked:
- `afk/121` is current with `main` (merge-base = tip of `main`).
- Branch `afk/121` and the body's `Closes #121` match, so merging auto-closes the issue and releases the branch claim.
- Title uses the `feat(logging):` prefix; no `secrets.yaml` edits in the diff (scaffold untouched, value populated separately on `main`); no `system.stateVersion` change; module imported by explicit path (`./utils/modules/alloy`).
- `config.alloy` reproduces `utils/modules/promtail` component-for-component: `__journal__hostname`→`host` relabel; journal json / `max_age 12h` / `job=systemd-journal`; the 8-field json extract; unit→transport default; coredump regex + reshape; `session.scope` normalization; `coredump_unit`/`unit` label promotion; all four noise drops; same Loki endpoint + `promtail@cloonar.com` basic_auth. The single intentional divergence is the credential (`password_file` → `sys.env("LOKI_PASSWORD")` from the `alloy-env` sops secret via `environmentFile`), matching the settled "Option D" in #121.

Deferred to post-deploy (by design, per #121's acceptance criteria): runtime confirmation in Grafana that nas's journal reaches Loki with labels `host`/`unit`/`job`, plus coredump reshaping + the noise drops via LogQL. The eval gate can't observe runtime log flow.

Verdict: ready to merge.