docs: K3s HA hardening notes - CoreDNS, CNPG, pooler spread

Hoid 2026-02-18 15:41:44 +00:00
parent 8407a3a941
commit c085858b5e
2 changed files with 22 additions and 0 deletions

@@ -1,3 +1,12 @@
## BUG-075: Database Connection Pool Does Not Handle PgBouncer Failover
- **Date:** 2026-02-18 14:05 UTC
- **Severity:** CRITICAL
- **Issue:** When a PgBouncer pooler instance goes down (e.g. on a K8s node failure), pods keep stale TCP connections and get "no available server" errors. The app never reconnects; all PDF conversions fail until the pod is manually restarted.
- **Root cause:** `pg.Pool` in `src/services/db.ts` has no keepAlive, no connectionTimeoutMillis, no retry logic. No mechanism to detect or recover from dead connections.
- **Impact:** Complete service outage for pods connected to failed pooler. Violates HA requirements.
- **Fix needed:** (1) Enable TCP keepalive + connection timeout, (2) Add retry with backoff for transient DB errors, (3) Health check must fail when DB is unreachable (already partially done but needs timeout), (4) Verify recovery without pod restart
- **Status:** ✅ FIXED (PROPERLY). The v0.2.6 fix was insufficient: it retried on the same dead pool connections. Commit 95ca101 fixes the real issue: `queryWithRetry` now uses explicit client checkout and calls `client.release(true)` to DESTROY dead connections on transient errors, forcing fresh TCP connections on retry. `connectWithRetry` validates with `SELECT 1` before returning. `idleTimeoutMillis` reduced to 10s. The health check destroys bad connections. **Verified on staging:** killed a pooler pod; 15/15 health checks passed with zero app restarts.
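The checkout-and-destroy retry pattern described above can be sketched as follows. This is an illustrative reconstruction, not the actual `src/services/db.ts` code: the `PoolLike` interface, the transient-error code list, and the backoff constants are all assumptions. The real `pg` library's `Pool#connect()` and `client.release(true)` (destroy) match these shapes, and the config side of the fix would look roughly like `new Pool({ keepAlive: true, connectionTimeoutMillis: 5000, idleTimeoutMillis: 10000 })`.

```typescript
// Minimal interface mirroring pg's checkout API (assumption: close enough
// to `pg`'s Pool/PoolClient for illustration).
interface Client {
  query(text: string, params?: unknown[]): Promise<unknown>;
  // release(true) destroys the underlying socket instead of returning it
  // to the pool, as pg's PoolClient#release(destroy) does.
  release(destroy?: boolean): void;
}
interface PoolLike {
  connect(): Promise<Client>;
}

// Illustrative set of errors worth retrying (ECONNRESET etc. for TCP-level
// failures, 57P01 for PostgreSQL admin_shutdown during failover).
const TRANSIENT = new Set(["ECONNRESET", "ECONNREFUSED", "ETIMEDOUT", "57P01"]);

async function queryWithRetry(
  pool: PoolLike,
  text: string,
  params: unknown[] = [],
  retries = 3,
): Promise<unknown> {
  for (let attempt = 0; ; attempt++) {
    // Explicit checkout so we control what happens to THIS connection.
    const client = await pool.connect();
    try {
      const res = await client.query(text, params);
      client.release(); // healthy: hand the connection back to the pool
      return res;
    } catch (err) {
      // Destroy the (possibly dead) connection so the next checkout
      // opens a fresh TCP connection instead of reusing a stale one.
      client.release(true);
      const code = (err as { code?: string }).code ?? "";
      if (attempt >= retries || !TRANSIENT.has(code)) throw err;
      // Exponential backoff: 100ms, 200ms, 400ms, ...
      await new Promise((r) => setTimeout(r, 100 * 2 ** attempt));
    }
  }
}
```

The key difference from a naive retry loop is `client.release(true)`: retrying through the pool without destroying the failed connection just checks the same dead socket back out, which is exactly the failure mode of the v0.2.6 fix.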
## BUG-074: Email Broken on K3s Production — SMTP Misconfigured
- **Date:** 2026-02-18 13:00 UTC
- **Severity:** CRITICAL