feat(web-arm): fetch PowerSync sync rules live from Cloonar/fit instead of vendoring #95

Closed
opened 2026-06-04 21:20:24 +02:00 by dominik.polakovics · 0 comments

Goal

powersync.reptide.eu (self-hosted PowerSync on web-arm, ADR-0012) currently mounts a vendored copy of the sync rules at hosts/web-arm/modules/powersync/sync-rules.yaml, hand-copied from the app repo on every change. Make the deployed service source its sync rules live from the app repo so app-team changes reach prod with no nixos edit.

Source of truth becomes Cloonar/fit:powersync/sync-rules.yaml on master (ssh://forgejo@git.cloonar.com/Cloonar/fit.git). A merge to fit:master is already human-reviewed, so that review — not a nixos PR — is the gate.

Scope: sync rules only. service.yaml stays nix-rendered (it carries the sops DSN / storage / JWKS) — do not change how it is produced.

The design below was settled in a grilling session. Implement it; do not re-litigate the approach. If a blocker invalidates a decision, raise it in the PR rather than silently diverging.

Design (decided)

Runtime fetch-and-restart — not an eval-time fetchGit. (An unpinned eval-time fetch would couple every web-arm rebuild, and the local pre-commit dry-build, to git.cloonar.com availability and break reproducibility — rejected.)

  • powersync-syncrules-fetch.service (oneshot, runs as root) + a .timer every 5 min.
  • Fetch Cloonar/fit:powersync/sync-rules.yaml@master over HTTPS via Forgejo's raw API using a read-only token (sops secret reptide-powersync-syncrules-token, already created — see Secret). Expected endpoint (verify against the running Forgejo): GET https://git.cloonar.com/api/v1/repos/Cloonar/fit/raw/powersync/sync-rules.yaml?ref=master, header Authorization: token <T>.
  • Pre-swap validation gate: HTTP 200 → non-empty → response is the file (not an HTML/JSON error body) → parses as YAML. If the journeyapps/powersync-service image exposes a cheap sync-rules validate/dry-run subcommand, run it too (verify; the liveness step backstops it if absent).
  • Swap only if content changed: snapshot the current live file to sync-rules.yaml.prev, write the new file, then systemctl try-restart podman-powersync.service. Use try-restart (no-op when the container is not running, e.g. at boot) — PowerSync re-reads rules only on restart, it does not hot-reload.
  • Post-restart rollback (keep): poll the liveness endpoint (http://127.0.0.1:8080/probes/liveness) until healthy or an N-second timeout; if it does not go healthy, restore .prev, restart again, and exit non-zero so it pages. Net effect: a reviewed-but-bad-for-this-deploy file self-heals to last-good and still alerts.
  • No success notification — a normal applied change is silent (the human already reviewed it on master).
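The bullets above can be sketched in Python. This is an illustration only; the issue leaves the script language open. It is built on several assumptions. The YAML parser is injected; in production that might be PyYAML's yaml.safe_load, assuming it is available to the script. GateError, pre_swap_gate, try_restart, liveness_ok and apply are hypothetical names. The 60-second timeout stands in for the issue's unspecified N. The restart callback also reports whether the container was running. That way, a boot-time swap made before the container has started skips the liveness wait instead of rolling back a file that was never tried.

```python
# Sketch only. Assumptions: parse_yaml is injected (e.g. yaml.safe_load);
# restart() returns whether the container was running; the 60 s timeout is
# a placeholder for the issue's "N-second timeout".
from __future__ import annotations

import json
import subprocess
import time
import urllib.request
from pathlib import Path
from typing import Callable

UNIT = "podman-powersync.service"
LIVENESS_URL = "http://127.0.0.1:8080/probes/liveness"


class GateError(Exception):
    """The fetched body must not replace the live file."""


def pre_swap_gate(status: int, body: bytes, parse_yaml: Callable[[str], object]) -> str:
    if status != 200:
        raise GateError(f"HTTP {status}")
    try:
        text = body.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise GateError("body is not UTF-8") from exc
    if not text.strip():
        raise GateError("empty body")
    if text.lstrip().lower().startswith(("<!doctype", "<html")):
        raise GateError("HTML error page, not the file")
    try:
        maybe_json = json.loads(text)
    except ValueError:
        maybe_json = None
    # JSON is valid YAML, so a Forgejo {"message": ...} error body must be rejected explicitly.
    if isinstance(maybe_json, dict) and "message" in maybe_json:
        raise GateError("JSON error body, not the file")
    try:
        doc = parse_yaml(text)
    except Exception as exc:
        raise GateError(f"not valid YAML: {exc}") from exc
    if not isinstance(doc, dict):
        raise GateError("top level is not a mapping")
    return text


def try_restart() -> bool:
    was_running = subprocess.run(["systemctl", "is-active", "--quiet", UNIT]).returncode == 0
    subprocess.run(["systemctl", "try-restart", UNIT], check=True)  # no-op if stopped
    return was_running


def liveness_ok() -> bool:
    try:
        with urllib.request.urlopen(LIVENESS_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, or an HTTP error status
        return False


def apply(live: Path, new_text: str, restart: Callable[[], bool] = try_restart,
          probe: Callable[[], bool] = liveness_ok,
          timeout_s: float = 60.0, interval_s: float = 2.0) -> int:
    """Return the unit's exit code: 0 for a no-op or healthy swap, 1 after a rollback."""
    prev = live.with_name(live.name + ".prev")
    tmp = live.with_name(live.name + ".tmp")
    old = live.read_text(encoding="utf-8") if live.exists() else None
    if old == new_text:
        return 0  # unchanged: no restart
    if old is not None:
        prev.write_text(old, encoding="utf-8")  # snapshot last-good
    tmp.write_text(new_text, encoding="utf-8")
    tmp.replace(live)  # atomic rename: no half-written live file after a crash
    if not restart():
        return 0  # container not running (e.g. at boot): it starts from the new file
    deadline = time.monotonic() + timeout_s
    while not probe():
        if time.monotonic() >= deadline:
            if old is not None:  # restore .prev and restart on last-good
                tmp.write_text(prev.read_text(encoding="utf-8"), encoding="utf-8")
                tmp.replace(live)
                restart()
            return 1  # non-zero so the failed unit pages
        time.sleep(interval_s)
    return 0
```

In use, the script would call apply(live, pre_swap_gate(status, body, parser)) after the fetch. A GateError leaves the live file untouched. The rename replaces the file's inode, so a single-file bind mount keeps seeing the old one until the container restarts. That is harmless here only because every swap is followed by a restart. One point is worth verifying on the host, since it is not confirmed here. podman-powersync.service is ordered after the fetch unit, so systemd may hold the container's start until the fetch unit's own, still-running start job finishes. In that case, a synchronous try-restart issued from inside that unit could hang. If it does, that is the kind of blocker the issue asks to raise in the PR.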

File & persistence

  • Live file at /var/lib/powersync/sync-rules.yaml. web-arm root is plain ext4 and fully persistent (no impermanence), so the last good rules always survive reboots. No persistence wiring needed beyond creating the dir.
  • Delete hosts/web-arm/modules/powersync/sync-rules.yaml. No seed file: Cloonar/fit@master is the sole source of truth; a copy in nixos reintroduces the drift trap.
  • In the container, split the mounts: keep service.yaml from the nix store (unchanged), mount sync-rules.yaml from /var/lib/powersync (read-only into the container). The current configDir runCommand copies both — rework so only service.yaml comes from the store.
  • Ordering / fresh provision: order podman-powersync.service after the fetch unit, and run the fetch oneshot at boot. The fetch unit must exit 0 when a usable file already exists even if the network fetch failed (so a git.cloonar.com blip at boot never blocks the container, which starts from the persisted file). Hard-fail only when no usable file exists at all (truly fresh host) — that legitimately blocks startup and should page.
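The boot rule above reduces to a small decision: exit 0 whenever a usable persisted file exists, and fail only on a truly fresh host. What follows is a hedged Python sketch of that decision. usable, exit_code_after_fetch_error and the injected parse_yaml are hypothetical names. "Usable" is assumed to mean the file exists, is non-empty, and parses to a YAML mapping.

```python
from __future__ import annotations

import sys
from pathlib import Path
from typing import Callable

LIVE = Path("/var/lib/powersync/sync-rules.yaml")


def usable(path: Path, parse_yaml: Callable[[str], object]) -> bool:
    """Assumed definition: the file exists, is non-empty and parses to a mapping."""
    try:
        text = path.read_text(encoding="utf-8")
    except (OSError, UnicodeDecodeError):  # missing or unreadable
        return False
    if not text.strip():
        return False
    try:
        return isinstance(parse_yaml(text), dict)
    except Exception:
        return False


def exit_code_after_fetch_error(error: Exception, parse_yaml: Callable[[str], object],
                                live: Path = LIVE) -> int:
    """The fetch (or its gate) failed: fall back to the persisted file if there is one."""
    if usable(live, parse_yaml):
        print(f"fetch failed ({error}); keeping {live}", file=sys.stderr)
        return 0  # the container starts from the last good rules
    print(f"fetch failed ({error}) and no usable {live}", file=sys.stderr)
    return 1  # fresh host: fail so the unit pages
```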

Alerting (reuse existing Pushover)

  • The host alerts via Grafana-managed rules → Pushover (cp_dominik_normal / cp_dominik_emergency, see hosts/web-arm/modules/grafana/default.nix); node-exporter's systemd collector exposes node_systemd_unit_state.
  • Page when the fetch unit fails. NOTE: hosts/web-arm/modules/grafana/alerting/service/services_down.nix alerts on state="active" == 0, which is wrong for a oneshot (a healthy oneshot is inactive, not active). Instead alert on node_systemd_unit_state{instance="web-arm:9100", name="powersync-syncrules-fetch.service", state="failed"} == 1, or wire OnFailure= to a oneshot that pushes Pushover. Route to the existing normal receiver.
  • "PowerSync itself wedged" is already covered by ADR-0012's launch-blocking blackbox_powersync_liveness probe — no change there.
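The Python model below shows why state="active" == 0 misfires for a oneshot. node-exporter's systemd collector emits node_systemd_unit_state as one series per state (activating, active, deactivating, inactive, failed). The series for the unit's current state is 1, and the rest are 0. A Type=oneshot unit without RemainAfterExit=yes goes back to inactive after every successful run, so the existing rule would fire permanently. By contrast, state="failed" == 1 fires only after a failed run, and it clears once a later timer run succeeds. The function names are hypothetical.

```python
from __future__ import annotations

# One-hot encoding used by node_systemd_unit_state: 1 for the current state.
STATES = ("activating", "active", "deactivating", "inactive", "failed")


def unit_state(current: str) -> dict[str, int]:
    if current not in STATES:
        raise ValueError(f"unknown state {current!r}")
    return {state: int(state == current) for state in STATES}


def services_down_rule(series: dict[str, int]) -> bool:
    """Existing services_down.nix condition: state="active" == 0."""
    return series["active"] == 0


def oneshot_failed_rule(series: dict[str, int]) -> bool:
    """Condition proposed in the issue: state="failed" == 1."""
    return series["failed"] == 1
```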

Secret (already created — do NOT touch secrets files)

reptide-powersync-syncrules-token exists in hosts/web-arm/secrets.yaml: a read-only Forgejo token that can read Cloonar/fit. Wire it via sops.secrets.reptide-powersync-syncrules-token and feed the fetch service (env file or sops.templates, mirroring how reptide-powersync-source-dsn -> powersync.env is done in the same module). Per repo policy, never edit secrets.yaml.
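A Python sketch of how the fetch script might read the token. Both names are assumptions. SYNCRULES_TOKEN stands in for whatever variable a sops.templates env file, loaded via EnvironmentFile=, would define. The file fallback uses sops-nix's default secret path, /run/secrets/<name>. Stripping whitespace keeps a trailing newline in the secret out of the Authorization header.

```python
from __future__ import annotations

import os
from pathlib import Path
from typing import Mapping

# Hypothetical variable name; the path is sops-nix's default /run/secrets/<name>.
TOKEN_ENV = "SYNCRULES_TOKEN"
TOKEN_FILE = Path("/run/secrets/reptide-powersync-syncrules-token")


def load_token(env: Mapping[str, str] = os.environ, path: Path = TOKEN_FILE) -> str:
    raw = env.get(TOKEN_ENV)
    if raw is None:  # no env file wired: read the decrypted secret directly
        raw = path.read_text(encoding="utf-8")
    token = raw.strip()  # a secret may end with a newline
    if not token:
        raise ValueError("sync-rules token is empty")
    return token
```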

Verify at impl time

  1. Exact Forgejo raw route + auth on the running version (the /api/v1/repos/.../raw/...?ref=master form above is the expectation).
  2. Whether the journeyapps/powersync-service image has a sync-rules validate/dry-run subcommand (adds a pre-swap check; liveness rollback covers it if absent).
  3. Privilege to restart the container — root is fine for a system maintenance oneshot; if you pick a hardened user, add the polkit rule.
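For verification item 1, here is a Python sketch that builds the expected request and returns the HTTP status instead of raising. The route and the "Authorization: token <T>" header are the issue's expectation, not something confirmed against the running Forgejo. raw_file_request and fetch are hypothetical names.

```python
from __future__ import annotations

import urllib.error
import urllib.parse
import urllib.request

FORGEJO = "https://git.cloonar.com"


def raw_file_request(token: str, owner: str = "Cloonar", repo: str = "fit",
                     path: str = "powersync/sync-rules.yaml",
                     ref: str = "master") -> urllib.request.Request:
    """Build the expected raw-file request (route still to be confirmed)."""
    url = (f"{FORGEJO}/api/v1/repos/{owner}/{repo}/raw/"
           f"{urllib.parse.quote(path)}?{urllib.parse.urlencode({'ref': ref})}")
    return urllib.request.Request(url, headers={"Authorization": f"token {token}"})


def fetch(req: urllib.request.Request, timeout: float = 30.0) -> tuple[int, bytes]:
    """Return (status, body); HTTP errors become a status, network errors propagate."""
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, err.read()
```

On the host, fetch(raw_file_request(token)) should return 200 and the YAML body. A 404 for a path that exists would suggest the route differs on the running version. Returning the status keeps the "HTTP 200" check in the pre-swap gate. Network-level failures (DNS, TLS, connection refused) still raise URLError, so the caller can fall back to the persisted file.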

ADR (required part of this issue)

Write docs/adr/0015-*.md in the repo's ADR style (see docs/adr/0012-self-hosted-powersync-on-web-arm.md). It must:

  • State that sync rules are now fetched live from Cloonar/fit@master at runtime, validated before swap, with the vendored copy removed.
  • Explicitly note it amends ADR-0012, which deliberately chose "sync rules mounted byte-for-byte … static and auditable from the repo." Record the trade-off: live propagation + upstream (fit PR) review, at the cost of in-repo auditability and the nixos review gate.
  • Capture rejected alternatives (eval-time unpinned fetchGit; an in-nixos structural/table-allowlist validator) and why.

Constraints / conventions

  • Branch from origin/main; open a PR (tea pr create) — never push to main. Conventional Commits, scope feat(web-arm).
  • The pre-commit hook only dry-builds (eval). The HTTPS fetch, container restart, liveness rollback and Pushover alert can only be verified on the deployed host; note this in the PR (as ADR-0012 does).
  • Do not modify system.stateVersion.

Acceptance criteria

  • hosts/web-arm/modules/powersync/sync-rules.yaml deleted; nixos holds no sync-rule content.
  • Fetch oneshot + 5-min timer pull Cloonar/fit:powersync/sync-rules.yaml@master over HTTPS using reptide-powersync-syncrules-token.
  • Pre-swap gate (HTTP/non-empty/not-error/YAML) + swap-only-if-changed via try-restart.
  • .prev snapshot + post-restart liveness rollback; rollback exits non-zero and pages.
  • Container mounts sync-rules.yaml from /var/lib/powersync; service.yaml still from the store; container ordered after the fetch unit; boot fetch exits 0 when a usable file exists.
  • Fetch-unit failure pages via existing Pushover (modeled for a oneshot's failed state).
  • docs/adr/0015-*.md written, amending ADR-0012.
  • Dry-build passes; PR opened with deployed-host functional-verification notes.

Relates to ADR-0012; follows the PowerSync work in #37 (operator prerequisites) / #38 (implementation).

Reference
Cloonar/nixos#95