docs: K3s HA hardening notes - CoreDNS, CNPG, pooler spread

This commit is contained in:
Hoid 2026-02-18 15:41:44 +00:00
parent 8407a3a941
commit c085858b5e
2 changed files with 22 additions and 0 deletions


@@ -1,5 +1,8 @@
# MEMORY.md - Long-Term Memory
## Lessons Learned
- **CEO sessions need 1 hour timeout** (`runTimeoutSeconds: 3600`). Default 10min is way too short — CEOs hire sub-agents for long-running tasks. Always set explicitly.
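  A minimal sketch of where this could live in an agent config file — only `runTimeoutSeconds: 3600` comes from these notes; the surrounding keys (`agents`, `ceo`) are hypothetical placeholders for whatever the actual config schema uses:

  ```yaml
  # Hypothetical config layout — only runTimeoutSeconds is from the notes above.
  agents:
    ceo:
      runTimeoutSeconds: 3600   # default (~600s) is too short for sub-agent work
  ```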
## Product Ideas & Future CEOs
- `projects/ideas/product-ideas.md` — All product ideas + SnapAPI CEO setup plan
- Selected next product: **SnapAPI** (Screenshot API) — ready to launch when user says go
@@ -15,6 +18,16 @@
- Old server (167.235.156.214) kept for git push + SMTP relay only
- Total infra cost: €17.06/mo (3x CAX11 + LB)
## K3s HA Hardening (2026-02-18)
- **CoreDNS**: 3 replicas with podAntiAffinity (one per node) — was single SPOF
- **CNPG operator**: 2 replicas with topologySpreadConstraints (w1 + w2) — was single SPOF preventing DB failover
- **PgBouncer pooler**: anti-affinity via Pooler CRD template (w1 + w2) — was landing both on same node
- **DocFast prod**: preferredDuringScheduling anti-affinity to spread across workers
- **App v0.2.7**: `client.release(true)` destroys dead pool connections on transient errors
- **HA test PASSED**: Shut down either worker → prod stays up, DB failover works, zero downtime
- **Note**: Staging runs 1 replica = not HA by design. CoreDNS replica count may not persist across K3s upgrades — check after updates.
- **Note**: Deployment patches to system components (CoreDNS, CNPG operator) are runtime changes. Document in infra notes so they can be re-applied if needed.
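The CoreDNS and CNPG spread patterns above can be sketched as manifest fragments. These are illustrative, not the patches that were actually applied: the label selectors and the CNPG operator labels are assumptions and must be matched against the live Deployments before use (`k8s-app: kube-dns` is the stock K3s CoreDNS label).

```yaml
# Illustrative fragments — verify labels/namespaces against the live objects.
# CoreDNS (kube-system): one replica per node via required pod anti-affinity.
spec:
  replicas: 3
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  k8s-app: kube-dns
              topologyKey: kubernetes.io/hostname
---
# CNPG operator: 2 replicas spread across workers via topologySpreadConstraints.
# (The selector label below is an assumption; copy it from the operator Deployment.)
spec:
  replicas: 2
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: cloudnative-pg
```

After a K3s upgrade, `kubectl -n kube-system get deploy coredns -o jsonpath='{.spec.replicas}'` is a quick way to confirm the replica count survived.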
## Game Save Files
- `memory/d2r.json` — Diablo II: Resurrected progress (Necro "Baltasar", Summoner build)
- `memory/bg3.json` — Baldur's Gate 3 progress (Act 1, level 3)


@@ -1,3 +1,12 @@
## BUG-075: Database Connection Pool Does Not Handle PgBouncer Failover
- **Date:** 2026-02-18 14:05 UTC
- **Severity:** CRITICAL
- **Issue:** When a PgBouncer pooler instance goes down (K8s node failure), pods keep stale TCP connections and get "no available server" errors. App never reconnects — all PDF conversions fail until manual pod restart.
- **Root cause:** `pg.Pool` in `src/services/db.ts` has no keepAlive, no connectionTimeoutMillis, no retry logic. No mechanism to detect or recover from dead connections.
- **Impact:** Complete service outage for pods connected to failed pooler. Violates HA requirements.
- **Fix needed:** (1) Enable TCP keepalive + connection timeout, (2) Add retry with backoff for transient DB errors, (3) Health check must fail when DB is unreachable (already partially done but needs timeout), (4) Verify recovery without pod restart
- **Status:** ✅ FIXED (PROPERLY) — v0.2.6 fix was insufficient (retried on same dead pool connections). Commit 95ca101 fixes the real issue: `queryWithRetry` now uses explicit client checkout and calls `client.release(true)` to DESTROY dead connections on transient errors, forcing fresh TCP connections on retry. `connectWithRetry` validates with `SELECT 1` before returning. `idleTimeoutMillis` reduced to 10s. Health check destroys bad connections. **Verified on staging:** killed a pooler pod, 15/15 health checks passed, zero app restarts.
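The `client.release(true)` pattern from the fix can be sketched as below. This is not the production code from `src/services/db.ts`: the `Pool`/`Client` interfaces are stand-ins that mirror pg's shape, and the transient-error code list is illustrative.

```typescript
// Sketch of the BUG-075 fix pattern: check out an explicit client, and on a
// transient error release it with destroy=true so the pool discards the dead
// TCP connection instead of recycling it. Stand-in interfaces, not pg itself.

interface Client {
  query(sql: string): Promise<unknown>;
  release(destroy?: boolean): void; // pg: release(true) destroys the connection
}

interface Pool {
  connect(): Promise<Client>;
}

// Illustrative set of error codes treated as transient.
const TRANSIENT = new Set(["ECONNRESET", "ETIMEDOUT", "57P01"]);

async function queryWithRetry(
  pool: Pool,
  sql: string,
  retries = 3,
  backoffMs = 100,
): Promise<unknown> {
  let lastErr: unknown;
  for (let attempt = 0; attempt < retries; attempt++) {
    const client = await pool.connect();
    try {
      const result = await client.query(sql);
      client.release(); // healthy: return the connection to the pool
      return result;
    } catch (err) {
      // Destroy the (possibly dead) connection so the next attempt gets a
      // fresh TCP connection rather than the same stale one.
      client.release(true);
      lastErr = err;
      const code = (err as { code?: string }).code;
      if (!code || !TRANSIENT.has(code)) throw err; // non-transient: give up
      await new Promise((r) => setTimeout(r, backoffMs * 2 ** attempt));
    }
  }
  throw lastErr;
}
```

The key difference from the v0.2.6 attempt: retrying the pool-level `query()` can hand back the same dead connection, while explicit checkout plus `release(true)` guarantees the retry dials a new one.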
## BUG-074: Email Broken on K3s Production — SMTP Misconfigured
- **Date:** 2026-02-18 13:00 UTC
- **Severity:** CRITICAL