diff --git a/MEMORY.md b/MEMORY.md
index 9cc7c34..50ef6a5 100644
--- a/MEMORY.md
+++ b/MEMORY.md
@@ -1,5 +1,8 @@
 # MEMORY.md - Long-Term Memory
 
+## Lessons Learned
+- **CEO sessions need a 1-hour timeout** (`runTimeoutSeconds: 3600`). The 10-minute default is far too short — CEOs hire sub-agents for long-running tasks. Always set it explicitly.
+
 ## Product Ideas & Future CEOs
 - `projects/ideas/product-ideas.md` — all product ideas + the SnapAPI CEO setup plan
 - Selected next product: **SnapAPI** (screenshot API) — ready to launch when the user says go
@@ -15,6 +18,16 @@
 
 - Old server (167.235.156.214) kept for git push + SMTP relay only
 - Total infra cost: €17.06/mo (3x CAX11 + LB)
+## K3s HA Hardening (2026-02-18)
+- **CoreDNS**: 3 replicas with podAntiAffinity (one per node) — was a SPOF
+- **CNPG operator**: 2 replicas with topologySpreadConstraints (w1 + w2) — was a SPOF blocking DB failover
+- **PgBouncer pooler**: anti-affinity via the Pooler CRD template (w1 + w2) — both instances were landing on the same node
+- **DocFast prod**: preferredDuringScheduling anti-affinity to spread pods across workers
+- **App v0.2.7**: `client.release(true)` destroys dead pool connections on transient errors
+- **HA test PASSED**: shut down either worker → prod stays up, DB failover works, zero downtime
+- **Note**: staging runs 1 replica = not HA, by design. The CoreDNS scale-up may not persist across K3s upgrades — check after updates.
+- **Note**: deployment patches to system components (CoreDNS, CNPG operator) are runtime changes. Document them in the infra notes so they can be re-applied if needed.
+
 ## Game Save Files
 - `memory/d2r.json` — Diablo II: Resurrected progress (Necro "Baltasar", Summoner build)
 - `memory/bg3.json` — Baldur's Gate 3 progress (Act 1, level 3)
diff --git a/projects/business/memory/bugs.md b/projects/business/memory/bugs.md
index 0abcba4..e7afc66 100644
--- a/projects/business/memory/bugs.md
+++ b/projects/business/memory/bugs.md
@@ -1,3 +1,12 @@
+## BUG-075: Database Connection Pool Does Not Handle PgBouncer Failover
+- **Date:** 2026-02-18 14:05 UTC
+- **Severity:** CRITICAL
+- **Issue:** When a PgBouncer pooler instance goes down (K8s node failure), pods keep stale TCP connections and get "no available server" errors. The app never reconnects — all PDF conversions fail until the pods are restarted manually.
+- **Root cause:** `pg.Pool` in `src/services/db.ts` has no keepAlive, no connectionTimeoutMillis, and no retry logic — there is no mechanism to detect or recover from dead connections.
+- **Impact:** Complete service outage for every pod connected to the failed pooler. Violates HA requirements.
+- **Fix needed:** (1) enable TCP keepalive + a connection timeout, (2) add retry with backoff for transient DB errors, (3) make the health check fail when the DB is unreachable (partially done already, but it needs a timeout), (4) verify recovery without a pod restart.
+- **Status:** ✅ FIXED (PROPERLY) — the v0.2.6 fix was insufficient (it retried on the same dead pool connections). Commit 95ca101 fixes the real issue: `queryWithRetry` now checks out a client explicitly and calls `client.release(true)` to DESTROY dead connections on transient errors, forcing fresh TCP connections on retry. `connectWithRetry` validates with `SELECT 1` before returning. `idleTimeoutMillis` reduced to 10s. The health check destroys bad connections. **Verified on staging:** killed a pooler pod, 15/15 health checks passed, zero app restarts.
+
 ## BUG-074: Email Broken on K3s Production — SMTP Misconfigured
 - **Date:** 2026-02-18 13:00 UTC
 - **Severity:** CRITICAL