---
name: k3s-infra
description: "K3s Kubernetes cluster management. Use when working on cluster infrastructure, backups, deployments, HA, networking, scaling, CNPG databases, Traefik ingress, cert-manager, or any K8s operations on the Hetzner cluster."
---

# K3s Infrastructure Skill

Manage the 3-node K3s Kubernetes cluster on Hetzner Cloud.

## Quick Reference

**SSH Access:**
```bash
ssh k3s-mgr   # Control plane (188.34.201.101)
ssh k3s-w1    # Worker (159.69.23.121)
ssh k3s-w2    # Worker (46.225.169.60)
```

**kubectl (on k3s-mgr):**
```bash
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
export PATH=$PATH:/usr/local/bin
```

**Hetzner API:**
```bash
source /home/openclaw/.openclaw/workspace/.credentials/services.env
# Use $COOLIFY_HETZNER_API_KEY for Hetzner Cloud API calls
```

**NEVER read credential files directly. Source them in scripts.**

## Cluster Architecture

```
Internet → Hetzner LB (46.225.37.135:80/443)
    ↓
k3s-w1 / k3s-w2 (Traefik DaemonSet)
    ↓
App Pods (DocFast, SnapAPI, ...)
    ↓
PgBouncer Pooler (main-db-pooler.postgres.svc:5432)
    ↓
CloudNativePG (main-db, 2 instances, PostgreSQL 17.4)
```

## Nodes

| Node | Role | Public IP | Private IP | Hetzner ID |
|------|------|-----------|------------|------------|
| k3s-mgr | Control plane (tainted NoSchedule) | 188.34.201.101 | 10.0.1.5 | 121365837 |
| k3s-w1 | Worker | 159.69.23.121 | 10.0.1.6 | 121365839 |
| k3s-w2 | Worker | 46.225.169.60 | 10.0.1.7 | 121365840 |

- All CAX11 ARM64 (2 vCPU, 4 GB RAM), Hetzner nbg1
- SSH key: `/home/openclaw/.ssh/id_ed25519`
- Private network: 10.0.0.0/16, Flannel on enp7s0
- K3s version: v1.34.4+k3s1

## Load Balancer

- Name: `k3s-lb`, Hetzner ID 5834131
- IP: 46.225.37.135
- Targets: k3s-w1, k3s-w2 on ports 80/443
- Health checks: TCP, 15 s interval, 3 retries

## Namespaces

| Namespace | Purpose | Replicas |
|-----------|---------|----------|
| `docfast` | DocFast production | 2 |
| `docfast-staging` | DocFast staging | 1 |
| `snapapi` | SnapAPI production | 2 (target) |
| `snapapi-staging` | SnapAPI staging | 1 |
| `postgres` | CNPG cluster + PgBouncer | 2+2 |
| `cnpg-system` | CNPG operator | 2 |
| `cert-manager` | Let's Encrypt certs | - |
| `kube-system` | CoreDNS, Traefik, etc. | - |

## Database (CloudNativePG)

- **Cluster:** `main-db` in the `postgres` namespace, 2 instances
- **PostgreSQL:** 17.4, 10Gi local-path storage per instance
- **PgBouncer:** `main-db-pooler`, 2 instances, transaction mode
- **Connection:** `main-db-pooler.postgres.svc:5432`
- **User:** docfast / docfast (shared across projects for now)

**Databases:**

| Database | Project | Namespace |
|----------|---------|-----------|
| `docfast` | DocFast prod | docfast |
| `docfast_staging` | DocFast staging | docfast-staging |
| `snapapi` | SnapAPI prod | snapapi |
| `snapapi_staging` | SnapAPI staging | snapapi-staging |

**DB access (find the primary first!):**
```bash
# Check which pod is primary
kubectl get pods -n postgres -l role=primary -o name

# Or check manually
kubectl exec -n postgres main-db-1 -c postgres -- \
  psql -U postgres -c "SELECT pg_is_in_recovery();"   # f = primary, t = replica

# Connect to a database
kubectl exec -it -n postgres <primary-pod> -c postgres -- psql -U docfast -d <database>
```

## HA Configuration

**⚠️ All spread constraints are RUNTIME PATCHES — may not survive K3s upgrades!**

| Component | Replicas | Strategy |
|-----------|----------|----------|
| CoreDNS | 3 | preferredDuringScheduling podAntiAffinity (mgr+w1+w2) |
| CNPG Operator | 2 | topologySpreadConstraints DoNotSchedule (w1+w2) |
| PgBouncer | 2 | requiredDuringScheduling podAntiAffinity (w1+w2) |
| DocFast Prod | 2 | preferredDuringScheduling podAntiAffinity (w1+w2) |

**Failover tuning:**
- Readiness: every 5 s, fail after 2 (~10 s detection)
- Liveness: every 10 s, fail after 3
- Node tolerations: 10 s (down from the 300 s default)
- **Result: ~10-15 second failover**

**HA validated:** all three failover scenarios tested and passed (w1 down, w2 down, mgr down).
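Because these constraints are runtime patches, keeping them in a patch file makes them easy to re-apply after an upgrade. A minimal sketch for the DocFast prod rule; the deployment name `docfast` and the `app: docfast` label are assumptions (verify with `kubectl get deploy,pods -n docfast --show-labels`):

```yaml
# docfast-antiaffinity.yaml — re-apply after upgrades with:
#   kubectl patch deployment docfast -n docfast --patch-file docfast-antiaffinity.yaml
# (deployment name and app=docfast label are assumptions)
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: docfast
                topologyKey: kubernetes.io/hostname
```

Checking such patch files into a repo is one path toward the "persist HA constraints as infra-as-code" goal listed under Future Improvements.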
## CI/CD Pipeline

**Pattern (per project):**
- Push to `main` → Forgejo CI → ARM64 image (QEMU cross-compile) → staging namespace
- Push tag `v*` → production namespace
- Registry: `git.cloonar.com/openclawd/`
- Runner: agent host (178.115.247.134), x86

**Per-project Forgejo secrets:**
- `REGISTRY_TOKEN` — PAT with write:package
- `KUBECONFIG` — base64-encoded deployer kubeconfig

**Deployer ServiceAccount:** namespace-scoped RBAC (update deployments, list/watch/exec pods, no secret access)

## DNS

| Record | Type | Value | Status |
|--------|------|-------|--------|
| docfast.dev | A | 46.225.37.135 | ✅ |
| staging.docfast.dev | A | 46.225.37.135 | ❌ NOT SET |
| docfast.dev | MX | mail.cloonar.com. | ✅ |

## Firewall

- Hetzner FW: `coolify-fw` (ID 10553199)
- Port 6443: 10.0.0.0/16 + 178.115.247.134/32 (CI runner)

## Backup ✅ OPERATIONAL

**Borg → Hetzner Storage Box**
- Target: `u149513-sub10@u149513-sub10.your-backup.de:23`
- SSH key: `/root/.ssh/id_ed25519` (k3s-mgr-backup)
- Passphrase: `/root/.borg-passphrase` (on k3s-mgr)
- Key exports: `/root/.borg-key-cluster`, `/root/.borg-key-db`
- Script: `/root/k3s-backup.sh`
- Log: `/var/log/k3s-backup.log`

**Repos:**

| Repo | Contents | Size |
|------|----------|------|
| `./k3s-cluster` | K3s SQLite, manifests, token, all namespace YAML exports, CNPG specs | ~45 MB |
| `./k3s-db` | pg_dump of all databases (docfast, docfast_staging, snapapi, snapapi_staging) + globals | ~30 KB |

**Schedule (cron on k3s-mgr):**
- `0 */6 * * *` — DB backup (pg_dump) every 6 hours
- `0 3 * * *` — full backup (DB + cluster state + manifests) daily at 03:00 UTC

**Retention:** 7 daily, 4 weekly (auto-pruned)

**Recovery:**
1. Provision 3 fresh CAX11 nodes
2. Install K3s, restore the SQLite DB from Borg (`/var/lib/rancher/k3s/server/db/`)
3. Or: fresh K3s + re-apply the manifest YAMLs from Borg
4. Restore databases: `psql -U postgres -d <database> < dump.sql`
5. Update DNS to the new LB IP
6. Estimated recovery time: ~15-30 minutes

**Verify backup:**
```bash
ssh k3s-mgr 'export BORG_RSH="ssh -p23"; export BORG_PASSPHRASE=$(cat /root/.borg-passphrase); borg list ssh://u149513-sub10@u149513-sub10.your-backup.de/./k3s-db'
```

## Common Operations

### Deploy new project

1. Create the namespace: `kubectl create namespace <project>`
2. Create the database on the CNPG primary: `CREATE DATABASE <project> OWNER docfast;`
3. Create secrets: `kubectl create secret generic <project>-secrets -n <project> ...`
4. Copy the pull secret: `kubectl get secret forgejo-registry -n docfast -o json | sed ... | kubectl apply -f -`
5. Create the deployer SA + RBAC
6. Set up CI/CD workflows in the Forgejo repo
7. Deploy + Ingress + cert-manager TLS

### Check cluster health
```bash
ssh k3s-mgr 'kubectl get nodes; kubectl get pods -A | grep -v Running'
```

### Power cycle a node (Hetzner API)
```bash
source ~/.openclaw/workspace/.credentials/services.env

# Status (server IDs are in the Nodes table)
curl -s -H "Authorization: Bearer $COOLIFY_HETZNER_API_KEY" \
  "https://api.hetzner.cloud/v1/servers/<server-id>"

# Power on/off
curl -s -X POST -H "Authorization: Bearer $COOLIFY_HETZNER_API_KEY" \
  "https://api.hetzner.cloud/v1/servers/<server-id>/actions/poweron"
```

### Scale a deployment
```bash
kubectl scale deployment <name> -n <namespace> --replicas=<count>
```

### Force-delete a stuck pod
```bash
kubectl delete pod <name> -n <namespace> --force --grace-period=0
```

### Check CNPG cluster status
```bash
kubectl get cluster -n postgres
kubectl get pods -n postgres -o wide
```

## Old Server (PENDING DECOMMISSION)

- IP: 167.235.156.214, SSH alias: `docfast`
- Still used for: git push to Forgejo (SSH access)
- **No longer serves traffic** — decommission to save €4.5/mo

## Future Improvements

See `projects/business/memory/infrastructure.md` for the full roadmap.

**High priority:**
- DNS: staging.docfast.dev
- Persist HA constraints as infra-as-code
- Decommission the old server

## Staging IP Whitelist

All staging environments are IP-whitelisted to the openclaw-vm public IP only.
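A quick sanity check of the whitelist, as a hedged sketch: the `classify` helper only mirrors the middleware's allow/deny decision locally for illustration, while the commented `curl` probe exercises the real path (the `--resolve` pin is needed because staging.docfast.dev has no DNS record yet; see the DNS table):

```bash
#!/usr/bin/env bash
# Sketch: staging whitelist check. classify() is a local illustration of
# the middleware decision, not the middleware itself.
set -euo pipefail

ALLOWED_IP="178.115.247.134"   # openclaw-vm public IP (from the whitelist above)

classify() {  # classify <client-ip> → allowed|denied
  [ "$1" = "$ALLOWED_IP" ] && echo allowed || echo denied
}

# Live probe — run from the machine under test. Expect HTTP 200 from
# openclaw-vm and 403 from anywhere else:
#   curl -s -o /dev/null -w '%{http_code}\n' \
#     --resolve staging.docfast.dev:443:46.225.37.135 https://staging.docfast.dev/
classify "$ALLOWED_IP"     # → allowed
classify "203.0.113.7"     # → denied
```

If openclaw-vm's public IP changes, both this check and every staging-ipwhitelist middleware need the new address.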
**How it works:**
- The Hetzner LB has proxy protocol enabled (both port 80 and 443)
- Traefik is configured with `proxyProtocol.trustedIPs` for the LB IP (46.225.37.135/32) and the private network (10.0.0.0/8)
- A Traefik Middleware `staging-ipwhitelist` in each staging namespace allows only 178.115.247.134/32
- The middleware is attached to staging ingresses via the annotation `traefik.ingress.kubernetes.io/router.middlewares`

**For new projects:**

1. Create the middleware in the staging namespace:

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: staging-ipwhitelist
  namespace: <project>-staging
spec:
  ipAllowList:
    sourceRange:
      - 178.115.247.134/32
```

2. Annotate the staging ingress:

```
traefik.ingress.kubernetes.io/router.middlewares: <project>-staging-staging-ipwhitelist@kubernetescrd
```

**Traefik Helm config (managed via `helm upgrade`):**
- `additionalArguments`: `proxyProtocol.trustedIPs` for the web + websecure entrypoints
- `logs.access.enabled=true` for debugging
- The DaemonSet updateStrategy must be patched back to `maxUnavailable: 1` after each `helm upgrade` (helm resets it)

**Note:** If openclaw-vm's public IP changes, update ALL staging-ipwhitelist middlewares.

## CI/CD Deployer Kubeconfigs

**Critical rules when generating deployer kubeconfigs:**

1. **Always use the PUBLIC IP** (188.34.201.101:6443) — CI runners run externally and cannot reach private IPs (10.0.1.5)
2. **Must be base64-encoded** for Forgejo secrets — the workflow does `base64 -d` before use
3. **Use `kubectl config` commands** to build the kubeconfig, NOT heredoc interpolation — this avoids CA cert corruption
4. **Cross-namespace RoleBindings** — each deployer SA needs access to both the staging and prod namespaces (e.g. the docfast SA in the `docfast` namespace needs a RoleBinding in `docfast-staging` too)
5. **Never read kubeconfig contents** — generate on k3s-mgr, base64-encode, scp to /tmp on openclaw-vm, and let the user paste it into Forgejo

**Generation script pattern (run on k3s-mgr):**
```bash
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml

TOKEN=$(kubectl -n <namespace> get secret deployer-token -o jsonpath="{.data.token}" | base64 -d)
kubectl -n <namespace> get secret deployer-token -o jsonpath="{.data.ca\.crt}" | base64 -d > /tmp/ca.crt

KUBECONFIG=/tmp/deployer.yaml kubectl config set-cluster k3s \
  --server=https://188.34.201.101:6443 \
  --certificate-authority=/tmp/ca.crt --embed-certs=true
KUBECONFIG=/tmp/deployer.yaml kubectl config set-credentials deployer --token="$TOKEN"
KUBECONFIG=/tmp/deployer.yaml kubectl config set-context deployer --cluster=k3s --user=deployer
KUBECONFIG=/tmp/deployer.yaml kubectl config use-context deployer

# Verify before encoding
kubectl --kubeconfig=/tmp/deployer.yaml -n <namespace> get pods

# Encode for Forgejo
base64 -w0 /tmp/deployer.yaml > /tmp/kubeconfig-b64.txt
rm /tmp/ca.crt /tmp/deployer.yaml
```
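Rule 4 above (cross-namespace RoleBindings) can be sketched as a manifest; the Role and ServiceAccount names `deployer` are assumptions, so match them to the actual SA before applying:

```yaml
# Grants the docfast deployer SA its namespace-scoped rights in
# docfast-staging as well. Assumes a Role named "deployer" already
# exists in docfast-staging with the same rules as in docfast.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deployer-docfast
  namespace: docfast-staging
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: deployer
subjects:
  - kind: ServiceAccount
    name: deployer
    namespace: docfast
```

The subject's `namespace: docfast` is what makes this cross-namespace: the SA lives in `docfast`, the binding in `docfast-staging`.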