From 83595c17fb155bf2b46384bfe2583ac4cb9325ce Mon Sep 17 00:00:00 2001
From: Hoid
Date: Thu, 19 Feb 2026 08:58:44 +0000
Subject: [PATCH] Add K3s infrastructure skill

---
 skills/k3s-infra/SKILL.md | 225 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 225 insertions(+)
 create mode 100644 skills/k3s-infra/SKILL.md

diff --git a/skills/k3s-infra/SKILL.md b/skills/k3s-infra/SKILL.md
new file mode 100644
index 0000000..d0a789f
--- /dev/null
+++ b/skills/k3s-infra/SKILL.md
@@ -0,0 +1,225 @@
---
name: k3s-infra
description: "K3s Kubernetes cluster management. Use when working on cluster infrastructure, backups, deployments, HA, networking, scaling, CNPG databases, Traefik ingress, cert-manager, or any K8s operations on the Hetzner cluster."
---

# K3s Infrastructure Skill

Manage the 3-node K3s Kubernetes cluster on Hetzner Cloud.

## Quick Reference

**SSH access:**
```bash
ssh k3s-mgr   # Control plane (188.34.201.101)
ssh k3s-w1    # Worker (159.69.23.121)
ssh k3s-w2    # Worker (46.225.169.60)
```

**kubectl (on k3s-mgr):**
```bash
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
export PATH=$PATH:/usr/local/bin
```

**Hetzner API:**
```bash
source /home/openclaw/.openclaw/workspace/.credentials/services.env
# Use $COOLIFY_HETZNER_API_KEY for Hetzner Cloud API calls
```

**NEVER read credential files directly. Source them in scripts.**

## Cluster Architecture

```
Internet → Hetzner LB (46.225.37.135:80/443)
    ↓
k3s-w1 / k3s-w2 (Traefik DaemonSet)
    ↓
App Pods (DocFast, SnapAPI, ...)
    ↓
PgBouncer Pooler (main-db-pooler.postgres.svc:5432)
    ↓
CloudNativePG (main-db, 2 instances, PostgreSQL 17.4)
```

## Nodes

| Node | Role | Public IP | Private IP | Hetzner ID |
|------|------|-----------|------------|------------|
| k3s-mgr | Control plane (tainted NoSchedule) | 188.34.201.101 | 10.0.1.5 | 121365837 |
| k3s-w1 | Worker | 159.69.23.121 | 10.0.1.6 | 121365839 |
| k3s-w2 | Worker | 46.225.169.60 | 10.0.1.7 | 121365840 |

- All nodes: CAX11 ARM64 (2 vCPU, 4 GB RAM), Hetzner nbg1
- SSH key: `/home/openclaw/.ssh/id_ed25519`
- Private network: 10.0.0.0/16, Flannel on enp7s0
- K3s version: v1.34.4+k3s1

## Load Balancer

- Name: `k3s-lb`, Hetzner ID 5834131
- IP: 46.225.37.135
- Targets: k3s-w1, k3s-w2 on ports 80/443
- Health check: TCP, 15s interval, 3 retries

## Namespaces

| Namespace | Purpose | Replicas |
|-----------|---------|----------|
| `docfast` | DocFast production | 2 |
| `docfast-staging` | DocFast staging | 1 |
| `snapapi` | SnapAPI production | 2 (target) |
| `snapapi-staging` | SnapAPI staging | 1 |
| `postgres` | CNPG cluster + PgBouncer | 2+2 |
| `cnpg-system` | CNPG operator | 2 |
| `cert-manager` | Let's Encrypt certs | - |
| `kube-system` | CoreDNS, Traefik, etc. | - |

## Database (CloudNativePG)

- **Cluster:** `main-db` in the `postgres` namespace, 2 instances
- **PostgreSQL:** 17.4, 10Gi local-path storage per instance
- **PgBouncer:** `main-db-pooler`, 2 instances, transaction mode
- **Connection:** `main-db-pooler.postgres.svc:5432`
- **User:** docfast / docfast (shared across projects for now)

**Databases:**

| Database | Project | Namespace |
|----------|---------|-----------|
| `docfast` | DocFast prod | docfast |
| `docfast_staging` | DocFast staging | docfast-staging |
| `snapapi` | SnapAPI prod | snapapi |
| `snapapi_staging` | SnapAPI staging | snapapi-staging |

**DB access (find the primary first!):**
```bash
# Check which instance is the primary
kubectl get pods -n postgres -l role=primary -o name
# Or check manually
kubectl exec -n postgres main-db-1 -c postgres -- psql -U postgres -c "SELECT pg_is_in_recovery();"
# f = primary, t = replica

# Connect to a database on the primary
kubectl exec -it -n postgres <primary-pod> -c postgres -- psql -U docfast -d <database>
```

## HA Configuration

**⚠️ All spread constraints are RUNTIME PATCHES — may not survive K3s upgrades!**

| Component | Replicas | Strategy |
|-----------|----------|----------|
| CoreDNS | 3 | preferredDuringScheduling podAntiAffinity (mgr+w1+w2) |
| CNPG Operator | 2 | topologySpreadConstraints DoNotSchedule (w1+w2) |
| PgBouncer | 2 | requiredDuringScheduling podAntiAffinity (w1+w2) |
| DocFast Prod | 2 | preferredDuringScheduling podAntiAffinity (w1+w2) |

**Failover tuning:**
- Readiness probe: every 5s, fail after 2 (~10s detection)
- Liveness probe: every 10s, fail after 3
- Node tolerations: 10s (down from the 300s default)
- **Result: ~10-15 second failover**

**HA validated:** all three failover scenarios tested and passed (w1 down, w2 down, mgr down).
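The manual primary check in the database section (`pg_is_in_recovery()` returns `f` on the primary, `t` on a replica) can be wrapped in a small helper. This is a minimal sketch, not existing tooling: the `role_from_recovery_flag` function is hypothetical, and the `kubectl` invocation in the comment assumes the `main-db-*` pod names shown above.

```shell
#!/bin/sh
# Map the output of `psql -tAc "SELECT pg_is_in_recovery();"` to a role name.
# 'f' (not in recovery) = primary, 't' (in recovery) = replica.
role_from_recovery_flag() {
  case "$(printf '%s' "$1" | tr -d '[:space:]')" in
    f) echo primary ;;
    t) echo replica ;;
    *) echo unknown ;;
  esac
}

# On the cluster, the flag would come from something like:
#   flag=$(kubectl exec -n postgres main-db-1 -c postgres -- \
#            psql -U postgres -tAc "SELECT pg_is_in_recovery();")
role_from_recovery_flag "f"   # prints "primary"
role_from_recovery_flag "t"   # prints "replica"
```

Querying the flag directly can be handy when the `role=primary` pod label has not yet caught up during a switchover.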
## CI/CD Pipeline

**Pattern (per project):**
- Push to `main` → Forgejo CI → ARM64 image (QEMU cross-compile) → staging namespace
- Push tag `v*` → production namespace
- Registry: `git.cloonar.com/openclawd/`
- Runner: agent host (178.115.247.134), x86

**Per-project Forgejo secrets:**
- `REGISTRY_TOKEN` — PAT with write:package scope
- `KUBECONFIG` — base64-encoded deployer kubeconfig

**Deployer ServiceAccount:** namespace-scoped RBAC (update deployments, list/watch/exec pods, no secret access)

## DNS

| Record | Type | Value | Status |
|--------|------|-------|--------|
| docfast.dev | A | 46.225.37.135 | ✅ |
| staging.docfast.dev | A | 46.225.37.135 | ❌ NOT SET |
| MX for docfast.dev | MX | mail.cloonar.com. | ✅ |

## Firewall

- Hetzner firewall: `coolify-fw` (ID 10553199)
- Port 6443: 10.0.0.0/16 + 178.115.247.134/32 (CI runner)

## Backup (TO IMPLEMENT)

**Current state: ❌ NO BACKUPS**

**Plan: Borg → Hetzner Storage Box**
- Target: `u149513-sub11@u149513-sub11.your-backup.de:23`
- SSH key already configured on k3s-mgr (`/root/.ssh/id_ed25519`, fingerprint `docfast-backup`)
- Per-machine subdirs: `./docfast-1/` (existing), `./k3s-cluster/` and `./k3s-db/` (planned)

**What to back up:**
1. **Cluster state:** etcd snapshots + `/var/lib/rancher/k3s/server/manifests/` → daily
2. **Databases:** pg_dump of all databases → every 6h
3. **K8s manifests:** export all resources as YAML → daily

**Recovery:** fresh nodes → K3s install → restore etcd or re-apply manifests → restore DB → update DNS → ~15-30 min

## Common Operations

### Deploy a new project
1. Create namespace: `kubectl create namespace <project>`
2. Create the database on the CNPG primary: `CREATE DATABASE <db> OWNER docfast;`
3. Create secrets: `kubectl create secret generic <project>-secrets -n <project> ...`
4. Copy the pull secret: `kubectl get secret forgejo-registry -n docfast -o json | sed ... | kubectl apply -f -`
5. Create deployer SA + RBAC
6. Set up CI/CD workflows in the Forgejo repo
7. Deploy + Ingress + cert-manager TLS

### Check cluster health
```bash
ssh k3s-mgr 'kubectl get nodes; kubectl get pods -A | grep -v Running'
```

### Power cycle a node (Hetzner API)
```bash
source ~/.openclaw/workspace/.credentials/services.env
# Status
curl -s -H "Authorization: Bearer $COOLIFY_HETZNER_API_KEY" "https://api.hetzner.cloud/v1/servers/<server-id>"
# Power on/off
curl -s -X POST -H "Authorization: Bearer $COOLIFY_HETZNER_API_KEY" "https://api.hetzner.cloud/v1/servers/<server-id>/actions/poweron"
```

### Scale a deployment
```bash
kubectl scale deployment <deployment> -n <namespace> --replicas=<count>
```

### Force-delete a stuck pod
```bash
kubectl delete pod <pod> -n <namespace> --force --grace-period=0
```

### Check CNPG cluster status
```bash
kubectl get cluster -n postgres
kubectl get pods -n postgres -o wide
```

## Old Server (PENDING DECOMMISSION)

- IP: 167.235.156.214, SSH alias: `docfast`
- Still used for: git push to Forgejo (SSH access)
- **No longer serves traffic** — decommission to save €4.50/mo

## Future Improvements

See `projects/business/memory/infrastructure.md` for the full roadmap.

**High priority:**
- Implement Borg backup
- DNS: staging.docfast.dev
- Persist HA constraints as infrastructure-as-code
- Decommission the old server
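The planned 6-hourly `pg_dump` step from the backup section could be sketched as follows. This is illustrative only, since backups are not implemented yet: the `dump_name` helper, the `./k3s-db/` staging directory, and the borg/kubectl invocations in the comments are assumptions, not existing tooling.

```shell
#!/bin/sh
# Sketch of the planned 6-hourly database dump job (NOT yet implemented).
# The database list matches the Databases table in this document.
DATABASES="docfast docfast_staging snapapi snapapi_staging"

# Build a timestamped dump filename for one database.
dump_name() {
  printf '%s-%s.sql.gz\n' "$1" "$2"
}

backup_all() {
  stamp=$(date -u +%Y%m%dT%H%M)
  for db in $DATABASES; do
    out="./k3s-db/$(dump_name "$db" "$stamp")"
    # On the cluster, each iteration would run something like:
    #   kubectl exec -n postgres <primary-pod> -c postgres -- \
    #     pg_dump -U postgres "$db" | gzip > "$out"
    # followed by a borg push to the Storage Box target above.
    echo "would dump $db to $out"
  done
}

backup_all
```

Dumping per database (rather than one `pg_dumpall`) keeps restores of a single project's database independent of the others.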