Session 54: k3s-w1 node down, HA working, escalated

Hoid 2026-02-18 16:03:38 +00:00
parent 5c9f55d2db
commit 331b4c1517
4 changed files with 41 additions and 49 deletions

@@ -24,11 +24,17 @@
## BUG-073: Staging Landing Page Shows Wrong Pro Plan Quota (2,500 vs 5,000)
- **Date:** 2026-02-18 13:05 UTC
- **Severity:** MEDIUM
- **Environment:** Staging (https://staging.docfast.dev)
- **Issue:** The landing page on both staging and production shows the Pro plan as "2,500 PDFs per month", while the Stripe checkout page says "5,000 PDF conversions per month". Previous bugs (BUG-045, BUG-057) referenced 5,000 and 10,000 PDFs. The advertised quota (2,500) does not match the checkout quota (5,000).
- **Impact:** Customer confusion — they see 2,500 on the pricing page but 5,000 on the checkout page
- **Fix:** Align landing page and Stripe product description to the same number
- **Status:** OPEN
- **Issue:** Landing page showed "2,500" but Stripe said "5,000". Mismatch.
- **Fix:** Landing page + JSON-LD updated to 5,000. Tagged v0.2.4.
- **Status:** ✅ FIXED (Session 53)
## BUG-076: k3s-w1 Node Down — Complete Network Unreachability
- **Date:** 2026-02-18 16:00 UTC
- **Severity:** HIGH (degraded HA, not outage)
- **Issue:** k3s-w1 (159.69.23.121) completely unreachable — 100% packet loss from both external and private network (k3s-mgr). Node shows NotReady in K8s. CNPG failover triggered: primary moved to main-db-2 on w2. Production running on single node (w2 only).
- **Impact:** HA is degraded — running on 1 worker. If w2 also fails, full outage. No data loss (DB failover worked).
- **Requires:** Investor to reboot k3s-w1 via Hetzner Console (CEO's API token doesn't have access to K3s project).
- **Status:** OPEN — escalated to investor
---
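
The impact statement in BUG-076 (one worker left means degraded HA, zero workers means full outage) can be expressed as a small classification rule. The following is only a sketch under stated assumptions: the function name `ha_state` and the node name `k3s-w2` are hypothetical, and the readiness flags are hard-coded rather than read from the cluster.

```python
def ha_state(workers: dict[str, bool]) -> str:
    """Classify availability from worker readiness.

    `workers` maps a node name to True when Kubernetes reports it Ready.
    """
    ready = sum(1 for is_ready in workers.values() if is_ready)
    if ready == 0:
        return "outage"      # no worker left to serve traffic
    if ready < len(workers):
        return "degraded"    # still serving, but no redundancy margin
    return "healthy"


# "k3s-w1" is taken from the bug report; "k3s-w2" is an assumed full name for "w2".
print(ha_state({"k3s-w1": False, "k3s-w2": True}))
```

With the values above, the rule reports the same state the bug entry describes: production is up, but a second failure would take it down.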

@@ -1285,3 +1285,23 @@
- **Budget:** €181.71 remaining, Revenue: €9
- **Open bugs:** ZERO
- **Status:** LAUNCH-READY — K3s migration verified, all post-migration issues resolved
## Session 54 — 2026-02-18 16:00 UTC (Late Afternoon Session)
- **k3s-w1 NODE DOWN (BUG-076 HIGH):**
- Discovery: k3s-w1 (159.69.23.121) completely unreachable — 100% packet loss from external AND private network
- K8s status: NotReady, CNPG auto-failover triggered (primary → main-db-2 on w2)
- Production: Running on w2 only (1 pod serving traffic, ~100ms response times)
- HA validation: Failover worked perfectly — zero downtime, DB switched primaries, traffic routed to w2
- Cannot reboot: CEO's Hetzner API token only covers old docfast-1 project, not K3s cluster
- **Escalated to investor:** Need Hetzner Console reboot of k3s-w1
- **Support check:** Zero open tickets ✅
- **Production health:** 5/5 health checks passed, all ~100ms, DB connected (PostgreSQL 17.4)
- **Investor Test:**
1. Trust with money? ✅ Yes (working, fast)
2. Data loss on crash? ✅ No (CNPG replication + MinIO backups)
3. Free tier abuse? ✅ Rate limited + usage enforced
4. Lost key recovery? ✅ Yes
5. Features match website? ✅ Yes
- **Budget:** €181.71 remaining, Revenue: €9
- **Open bugs:** 0 CRITICAL, 1 HIGH (BUG-076 node down), 0 MEDIUM, 0 LOW
- **Status:** Production operational but HA degraded — single worker node

@@ -3,7 +3,7 @@
"phaseLabel": "Build Production-Grade Product",
"status": "launch-ready",
"product": "DocFast \u2014 HTML/Markdown to PDF API",
"currentPriority": "K3s migration verified. All post-migration issues resolved. Zero open bugs. Launch-ready.",
"currentPriority": "k3s-w1 NODE DOWN — running on w2 only. HA degraded. Escalated to investor for Hetzner reboot.",
"ownerDirectives_PRIORITY": "Process these IN ORDER. Do not skip.",
"ownerDirectives": [
"Stripe: owner has existing Stripe account from another project \u2014 use same account, just create separate Product + webhook endpoint for DocFast.",
@@ -104,10 +104,10 @@
},
"openBugs": {
"CRITICAL": [],
"HIGH": [],
"HIGH": ["BUG-076: k3s-w1 node down, HA degraded, needs Hetzner reboot"],
"MEDIUM": [],
"LOW": [],
"note": "Session 53: BUG-074 CRITICAL (email broken on K3s) fixed. BUG-073 MEDIUM (quota mismatch) fixed. CNPG backups configured with MinIO. Old Docker server decommissioned. ZERO open bugs."
"note": "Session 54: k3s-w1 node down. CNPG failover to main-db-2 worked. Production running on w2 only. HA validated but degraded."
},
"blockers": [],
"resolvedBlockers": [
@@ -120,5 +120,5 @@
"Checkout .env persistence + CI/CD secrets pipeline \u2014 DONE 2026-02-17"
],
"startDate": "2026-02-14",
"sessionCount": 53
"sessionCount": 54
}
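
The "Open bugs" line in the session log and the `openBugs` object in this state file carry the same counts, so the summary could be derived from the JSON instead of being typed by hand. A minimal sketch, assuming the state file keeps the `openBugs` shape shown in this diff; the helper name `summarize_open_bugs` is hypothetical, and the inline JSON is a trimmed stand-in for the real file.

```python
import json

SEVERITIES = ("CRITICAL", "HIGH", "MEDIUM", "LOW")


def summarize_open_bugs(state: dict) -> str:
    """Build a one-line count per severity from the `openBugs` object."""
    open_bugs = state.get("openBugs", {})
    return ", ".join(f"{len(open_bugs.get(sev, []))} {sev}" for sev in SEVERITIES)


# Trimmed copy of the updated state; only the keys needed here are kept.
state = json.loads("""
{
  "openBugs": {
    "CRITICAL": [],
    "HIGH": ["BUG-076: k3s-w1 node down, HA degraded, needs Hetzner reboot"],
    "MEDIUM": [],
    "LOW": []
  }
}
""")

print(summarize_open_bugs(state))
```

Non-list keys such as `note` are ignored because only the four severity names are looked up.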

@@ -1,45 +1,11 @@
# DocFast Support Log
## 2026-02-16 20:17 UTC
## 2026-02-18 16:00 UTC
**Ticket #369** - Lost API key
- Customer: dominik@superbros.tv
- Issue: Lost API key recovery
- Action: Replied with key recovery instructions (POST /v1/recover endpoint)
- Status: Resolved with self-service solution
## 2026-02-16 20:21 UTC — Ticket #369
- **Customer:** dominik@superbros.tv
- **Subject:** Lost API key
- **Action:** Replied with self-service recovery instructions (website link + API endpoint)
- **Status:** Replied, awaiting customer confirmation
**Tickets Checked:**
- All tickets: 0 found
- Pending tickets: 0 found
## 2026-02-16 20:24 UTC
- **Ticket #369** (dominik@superbros.tv): Lost API key → Replied with recovery flow instructions. Simple case.
**Status:** ✅ No open support tickets requiring action.
## 2026-02-16 20:27 UTC
- **Ticket #369** (dominik@superbros.tv): Lost API key → Replied with recovery flow instructions. Straightforward.
## 2026-02-17 13:02 UTC — Ticket #370
- **Customer:** office@cloonar.com (dominik.polakovics@cloonar.com)
- **Subject:** Lost API key
- **Issue:** Customer lost API key and couldn't receive password reset email (verification code never arrived)
- **Root Cause:** BUG-050 — cloonar.com mail server was rejecting noreply@docfast.dev due to sender verification (not a real mailbox)
- **Fix Applied:** DocFast updated email sender configuration to use a verified sender address
- **Action:** Replied to ticket confirming fix is applied, asked customer to retry recovery flow
- **Status:** Awaiting customer retry; should resolve once email is received
## 2026-02-17 16:00 UTC — Ticket #370 RESOLVED
- **Follow-up:** Customer confirmed still not receiving verification email
- **Resolution:** Provided two options: (1) retry recovery flow now that email is fixed, or (2) direct key generation from our side
- **Notes:** Customer has been patient through multiple attempts; acknowledged inconvenience and recommended storing keys securely
- **Status:** CLOSED — awaiting customer confirmation of preferred resolution method
## 2026-02-18 08:00 UTC — Ticket #374 TEST/RESOLVED
- **Customer:** dominik.polakovics@cloonar.com (CEO)
- **Subject:** Security Notice: Your DocFast API Key Has Been Rotated
- **Issue:** Test ticket with security notice about API key rotation
- **Messages:** Multiple test messages from franz.hubert@docfast.dev (2026-02-17 21:57 onward) verifying email formatting
- **Customer Question:** CEO asked what tools/binaries the support team has access to
- **Franz's Response:** Appropriately declined to share internal tooling info; redirected to DocFast support scope
- **Status:** No further action needed — ticket appears to be a test of support system; properly handled by Franz
- **Notes:** This appears to be an internal test of the support system with test messages; no customer action required
**Notes:** System clean, no replies needed.