docs: K3s HA hardening notes - CoreDNS, CNPG, pooler spread
Parent 8407a3a941 · commit c085858b5e · 2 changed files with 22 additions and 0 deletions
## BUG-075: Database Connection Pool Does Not Handle PgBouncer Failover
- **Date:** 2026-02-18 14:05 UTC
- **Severity:** CRITICAL
- **Issue:** When a PgBouncer pooler instance goes down (K8s node failure), pods keep stale TCP connections and get "no available server" errors. The app never reconnects; all PDF conversions fail until a manual pod restart.
- **Root cause:** `pg.Pool` in `src/services/db.ts` has no keepAlive, no connectionTimeoutMillis, no retry logic. No mechanism to detect or recover from dead connections.
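For reference, the missing pool options map directly onto node-pg's `Pool` constructor. A minimal hardened-config sketch (values are assumptions to tune per cluster, not the repo's settings):

```typescript
import { Pool } from "pg";

// Hardened pool settings (sketch; option names are node-pg's, values assumed).
export const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  keepAlive: true,                // TCP keepalive so dead peers are detected
  connectionTimeoutMillis: 5_000, // fail fast on checkout instead of hanging
  idleTimeoutMillis: 10_000,      // recycle idle sockets quickly after failover
  max: 10,                        // pool size (assumed)
});
```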
- **Impact:** Complete service outage for pods connected to failed pooler. Violates HA requirements.
- **Fix needed:** (1) Enable TCP keepalive + connection timeout, (2) Add retry with backoff for transient DB errors, (3) Health check must fail when DB is unreachable (already partially done but needs timeout), (4) Verify recovery without pod restart
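Step (2) above could look like the following minimal sketch. The helper name, attempt count, and delay schedule are assumptions for illustration, not the repo's code:

```typescript
// Retry an async operation with exponential backoff on transient errors.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (i < attempts - 1) {
        // backoff: baseDelayMs, 2x, 4x, ...
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastErr;
}
```

A real implementation would also classify errors and only retry transient ones (connection reset, pooler unavailable), never constraint violations or query bugs.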
- **Status:** ✅ FIXED (PROPERLY). The v0.2.6 fix was insufficient (it retried on the same dead pool connections). Commit 95ca101 fixes the real issue: `queryWithRetry` now uses explicit client checkout and calls `client.release(true)` to DESTROY dead connections on transient errors, forcing fresh TCP connections on retry. `connectWithRetry` validates with `SELECT 1` before returning. `idleTimeoutMillis` reduced to 10s. The health check destroys bad connections. **Verified on staging:** killed a pooler pod, 15/15 health checks passed, zero app restarts.
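The checkout-and-destroy pattern described in the status note can be sketched as follows. Structural types stand in for node-pg's `Pool`/`PoolClient` so the sketch is self-contained; this is an assumed shape, not the actual code from commit 95ca101:

```typescript
// Minimal structural stand-ins for node-pg's Pool/PoolClient.
interface PoolClientLike {
  query(sql: string): Promise<unknown>;
  release(destroy?: boolean): void;
}
interface PoolLike {
  connect(): Promise<PoolClientLike>;
}

// Explicit checkout; on error, release(true) DESTROYS the socket so the
// next attempt checks out a fresh TCP connection instead of a stale one.
async function queryWithRetry(
  pool: PoolLike,
  sql: string,
  attempts = 3,
): Promise<unknown> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    const client = await pool.connect();
    try {
      const res = await client.query(sql);
      client.release(); // healthy connection goes back to the pool
      return res;
    } catch (err) {
      client.release(true); // destroy the dead connection, don't reuse it
      lastErr = err;
    }
  }
  throw lastErr;
}
```

The key design choice is `release(true)`: returning a dead socket to the pool just hands the same failure to the next caller, while destroying it forces the pool to dial the (recovered) pooler again.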
## BUG-074: Email Broken on K3s Production — SMTP Misconfigured
- **Date:** 2026-02-18 13:00 UTC
- **Severity:** CRITICAL