From 5c9f55d2dbeebf61bf8c2bc4a0aab90af4c7e7ae Mon Sep 17 00:00:00 2001
From: Hoid
Date: Wed, 18 Feb 2026 15:42:26 +0000
Subject: [PATCH] docs: update HA hardening notes after successful failover testing

---
 MEMORY.md | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/MEMORY.md b/MEMORY.md
index 50ef6a5..0103489 100644
--- a/MEMORY.md
+++ b/MEMORY.md
@@ -19,14 +19,16 @@
 - Total infra cost: €17.06/mo (3x CAX11 + LB)
 
 ## K3s HA Hardening (2026-02-18)
-- **CoreDNS**: 3 replicas with podAntiAffinity (one per node) — was single SPOF
+- **CoreDNS**: 3 replicas with podAntiAffinity (one per node) — was single SPOF, all DNS broke when node died
 - **CNPG operator**: 2 replicas with topologySpreadConstraints (w1 + w2) — was single SPOF preventing DB failover
-- **PgBouncer pooler**: anti-affinity via Pooler CRD template (w1 + w2) — was landing both on same node
+- **PgBouncer pooler**: requiredDuringScheduling anti-affinity via Pooler CRD template (w1 + w2) — was landing both on same node
 - **DocFast prod**: preferredDuringScheduling anti-affinity to spread across workers
 - **App v0.2.7**: `client.release(true)` destroys dead pool connections on transient errors
-- **HA test PASSED**: Shut down either worker → prod stays up, DB failover works, zero downtime
+- **HA test PASSED**: Shut down either worker → prod stays up, DB failover works, 4/4 health checks pass over 3 minutes
+- **Root causes found**: CoreDNS (1 replica), CNPG operator (1 replica), PgBouncer (both same node), app dead connections
 - **Note**: Staging is 1 replica = not HA by design. CoreDNS scale may not persist K3s upgrades — check after updates.
 - **Note**: Deployment patches to system components (CoreDNS, CNPG operator) are runtime changes. Document in infra notes so they can be re-applied if needed.
+- **Note**: CNPG Pooler CRD supports `spec.template.spec.affinity` but requires `containers` field too (name+image of pgbouncer)
 
 ## Game Save Files
 - `memory/d2r.json` — Diablo II: Resurrected progress (Necro "Baltasar", Summoner build)