
---
name: k3s-infra
description: K3s Kubernetes cluster management. Use when working on cluster infrastructure, backups, deployments, HA, networking, scaling, CNPG databases, Traefik ingress, cert-manager, or any K8s operations on the Hetzner cluster.
---

K3s Infrastructure Skill

Manage the 3-node K3s Kubernetes cluster on Hetzner Cloud.

Quick Reference

SSH Access:

ssh k3s-mgr   # Control plane (188.34.201.101)
ssh k3s-w1    # Worker (159.69.23.121)
ssh k3s-w2    # Worker (46.225.169.60)

kubectl (on k3s-mgr):

export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
export PATH=$PATH:/usr/local/bin

Hetzner API:

source /home/openclaw/.openclaw/workspace/.credentials/services.env
# Use $COOLIFY_HETZNER_API_KEY for Hetzner Cloud API calls

NEVER read credential files directly. Source them in scripts.

Cluster Architecture

Internet → Hetzner LB (46.225.37.135:80/443)
              ↓
         k3s-w1 / k3s-w2 (Traefik DaemonSet)
              ↓
         App Pods (DocFast, SnapAPI, ...)
              ↓
         PgBouncer Pooler (main-db-pooler.postgres.svc:5432)
              ↓
         CloudNativePG (main-db, 2 instances, PostgreSQL 17.4)

Nodes

| Node | Role | Public IP | Private IP | Hetzner ID |
|---|---|---|---|---|
| k3s-mgr | Control plane (tainted NoSchedule) | 188.34.201.101 | 10.0.1.5 | 121365837 |
| k3s-w1 | Worker | 159.69.23.121 | 10.0.1.6 | 121365839 |
| k3s-w2 | Worker | 46.225.169.60 | 10.0.1.7 | 121365840 |
  • All CAX11 ARM64 (2 vCPU, 4GB RAM), Hetzner nbg1
  • SSH key: /home/openclaw/.ssh/id_ed25519
  • Private network: 10.0.0.0/16, Flannel on enp7s0
  • K3s version: v1.34.4+k3s1

Load Balancer

  • Name: k3s-lb, Hetzner ID 5834131
  • IP: 46.225.37.135
  • Targets: k3s-w1, k3s-w2 on ports 80/443
  • Health: TCP, 15s interval, 3 retries
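
The LB's targets and health-check settings can be inspected via the Hetzner Cloud API. A sketch, assuming the credentials file has been sourced as described above and `jq` is available:

```shell
# Inspect LB 5834131: public IP, target servers, per-service health checks
source /home/openclaw/.openclaw/workspace/.credentials/services.env
curl -s -H "Authorization: Bearer $COOLIFY_HETZNER_API_KEY" \
  "https://api.hetzner.cloud/v1/load_balancers/5834131" \
  | jq '.load_balancer
        | {name,
           ip: .public_net.ipv4.ip,
           targets: [.targets[].server.id],
           services: [.services[] | {listen_port, health_check}]}'
```

Useful for confirming both workers are still registered as targets after node maintenance.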

Namespaces

| Namespace | Purpose | Replicas |
|---|---|---|
| docfast | DocFast production | 2 |
| docfast-staging | DocFast staging | 1 |
| snapapi | SnapAPI production | 2 (target) |
| snapapi-staging | SnapAPI staging | 1 |
| postgres | CNPG cluster + PgBouncer | 2+2 |
| cnpg-system | CNPG operator | 2 |
| cert-manager | Let's Encrypt certs | - |
| kube-system | CoreDNS, Traefik, etc. | - |

Database (CloudNativePG)

  • Cluster: main-db in postgres namespace, 2 instances
  • PostgreSQL: 17.4, 10Gi local-path storage per instance
  • PgBouncer: main-db-pooler, 2 instances, transaction mode
  • Connection: main-db-pooler.postgres.svc:5432
  • User: docfast / docfast (shared across projects for now)
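
For an interactive session through the pooler (rather than exec'ing into the primary), a port-forward works as a quick sketch; it assumes a psql client is installed wherever kubectl runs:

```shell
# Forward the PgBouncer service to localhost, then connect through it
kubectl -n postgres port-forward svc/main-db-pooler 5432:5432 &
psql -h 127.0.0.1 -p 5432 -U docfast -d docfast_staging
```

Note that PgBouncer runs in transaction mode, so session-level features (prepared statements, LISTEN/NOTIFY) may behave differently than a direct connection.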

Databases:

| Database | Project | Namespace |
|---|---|---|
| docfast | DocFast prod | docfast |
| docfast_staging | DocFast staging | docfast-staging |
| snapapi | SnapAPI prod | snapapi |
| snapapi_staging | SnapAPI staging | snapapi-staging |

DB access (find primary first!):

# Check which is primary
kubectl get pods -n postgres -l role=primary -o name
# Or check manually
kubectl exec -n postgres main-db-1 -c postgres -- psql -U postgres -c "SELECT pg_is_in_recovery();"
# f = primary, t = replica

# Connect to a database
kubectl exec -n postgres <primary-pod> -c postgres -- psql -U docfast -d <dbname>

HA Configuration

⚠️ All spread constraints are RUNTIME PATCHES — may not survive K3s upgrades!

| Component | Replicas | Strategy |
|---|---|---|
| CoreDNS | 3 | preferredDuringScheduling podAntiAffinity (mgr+w1+w2) |
| CNPG Operator | 2 | topologySpreadConstraints DoNotSchedule (w1+w2) |
| PgBouncer | 2 | requiredDuringScheduling podAntiAffinity (w1+w2) |
| DocFast Prod | 2 | preferredDuringScheduling podAntiAffinity (w1+w2) |
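
The runtime patches follow this general shape (sketch for DocFast prod; the `app: docfast` label is an assumption, verify against the actual pod labels before applying):

```yaml
# Illustrative deployment patch: prefer spreading pods across nodes
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: docfast   # assumed label; check with: kubectl get pods --show-labels
                topologyKey: kubernetes.io/hostname
```

`preferred` (soft) keeps single-node operation possible during a worker outage, while the PgBouncer `required` (hard) rule guarantees the pooler never co-locates.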

Failover Tuning:

  • Readiness: every 5s, fail after 2 (~10s detection)
  • Liveness: every 10s, fail after 3
  • Node tolerations: 10s (was 300s default)
  • Result: ~10-15 second failover
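
In pod-spec terms the tuning above corresponds roughly to the following (sketch; the probe path and port are placeholders, not the actual app values):

```yaml
readinessProbe:
  httpGet:
    path: /healthz      # placeholder path
    port: 8080          # placeholder port
  periodSeconds: 5
  failureThreshold: 2   # ~10s until traffic is withdrawn
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
tolerations:
  - key: node.kubernetes.io/not-ready
    effect: NoExecute
    tolerationSeconds: 10   # evict ~10s after node failure (default is 300s)
  - key: node.kubernetes.io/unreachable
    effect: NoExecute
    tolerationSeconds: 10
```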

HA validated: All 3 failover scenarios tested and passed (w1 down, w2 down, mgr down).

CI/CD Pipeline

Pattern (per project):

  • Push to main → Forgejo CI → ARM64 image (QEMU cross-compile) → staging namespace
  • Push tag v* → production namespace
  • Registry: git.cloonar.com/openclawd/<project>
  • Runner: agent host (178.115.247.134), x86

Per-project Forgejo secrets:

  • REGISTRY_TOKEN — PAT with write:package
  • KUBECONFIG — base64 deployer kubeconfig

Deployer ServiceAccount: namespace-scoped RBAC (update deployments, list/watch/exec pods, no secret access)
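
A Role matching that description might look like the following (sketch; the exact verbs and resource list are inferred from the summary above, not copied from the live cluster):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployer
  namespace: docfast-staging   # one Role + RoleBinding per target namespace
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]
  # Deliberately no rule for secrets: a leaked CI kubeconfig
  # cannot read application credentials.
```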

DNS

| Record | Type | Value | Status |
|---|---|---|---|
| docfast.dev | A | 46.225.37.135 | |
| staging.docfast.dev | A | 46.225.37.135 | NOT SET |
| docfast.dev | MX | mail.cloonar.com. | |

Firewall

  • Hetzner FW: coolify-fw (ID 10553199)
  • Port 6443: 10.0.0.0/16 + 178.115.247.134/32 (CI runner)

Backup OPERATIONAL

Borg → Hetzner Storage Box

  • Target: u149513-sub10@u149513-sub10.your-backup.de:23
  • SSH key: /root/.ssh/id_ed25519 (k3s-mgr-backup)
  • Passphrase: /root/.borg-passphrase (on k3s-mgr)
  • Key exports: /root/.borg-key-cluster, /root/.borg-key-db
  • Script: /root/k3s-backup.sh
  • Log: /var/log/k3s-backup.log

Repos:

| Repo | Contents | Size |
|---|---|---|
| ./k3s-cluster | K3s SQLite, manifests, token, all namespace YAML exports, CNPG specs | ~45 MB |
| ./k3s-db | pg_dump of all databases (docfast, docfast_staging, snapapi, snapapi_staging) + globals | ~30 KB |

Schedule (cron on k3s-mgr):

  • 0 */6 * * * — DB backup (pg_dump) every 6 hours
  • 0 3 * * * — Full backup (DB + cluster state + manifests) daily at 03:00 UTC

Retention: 7 daily, 4 weekly (auto-pruned)

Recovery:

  1. Provision 3 fresh CAX11 nodes
  2. Install K3s, restore SQLite DB from Borg (/var/lib/rancher/k3s/server/db/)
  3. Or: fresh K3s + re-apply manifest YAMLs from Borg
  4. Restore databases: psql -U postgres -d <dbname> < dump.sql
  5. Update DNS to new LB IP
  6. Estimated recovery time: ~15-30 minutes
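
The database half of step 4 might look like this on k3s-mgr (sketch; the archive name is a placeholder and the dump filename inside the archive is an assumption, check with `borg list` first):

```shell
export BORG_RSH="ssh -p23"
export BORG_PASSPHRASE=$(cat /root/.borg-passphrase)

# Pick the newest archive, then extract it into the current directory
borg list ssh://u149513-sub10@u149513-sub10.your-backup.de/./k3s-db
borg extract ssh://u149513-sub10@u149513-sub10.your-backup.de/./k3s-db::<archive-name>

# Restore one database into the running CNPG primary
# (dump filename assumed; verify what the extract produced)
kubectl exec -i -n postgres main-db-1 -c postgres -- \
  psql -U postgres -d docfast < docfast.sql
```

Restore globals (roles, passwords) before the per-database dumps, since object ownership depends on them.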

Verify backup:

ssh k3s-mgr 'export BORG_RSH="ssh -p23"; export BORG_PASSPHRASE=$(cat /root/.borg-passphrase); borg list ssh://u149513-sub10@u149513-sub10.your-backup.de/./k3s-db'

Common Operations

Deploy new project

  1. Create namespace: kubectl create namespace <project>
  2. Create database on CNPG primary: CREATE DATABASE <name> OWNER docfast;
  3. Create secrets: kubectl create secret generic <project>-secrets -n <namespace> ...
  4. Copy pull secret: kubectl get secret forgejo-registry -n docfast -o json | sed ... | kubectl apply -f -
  5. Create deployer SA + RBAC
  6. Set up CI/CD workflows in Forgejo repo
  7. Deploy + Ingress + cert-manager TLS
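
Steps 1-4 sketched for a hypothetical project `myapp` (the sed-based namespace rewrite is one possible implementation of the pull-secret copy; all values here are illustrative):

```shell
PROJECT=myapp

# 1. Namespace
kubectl create namespace "$PROJECT"

# 2. Database on the CNPG primary
kubectl exec -n postgres main-db-1 -c postgres -- \
  psql -U postgres -c "CREATE DATABASE $PROJECT OWNER docfast;"

# 3. App secrets (fill in the real keys for the project)
kubectl create secret generic "$PROJECT-secrets" -n "$PROJECT" \
  --from-literal=DATABASE_URL=placeholder

# 4. Copy the registry pull secret across namespaces
#    (naive rewrite; assumes "namespace: docfast" appears only once)
kubectl get secret forgejo-registry -n docfast -o yaml \
  | sed "s/namespace: docfast/namespace: $PROJECT/" \
  | kubectl apply -f -
```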

Check cluster health

ssh k3s-mgr 'kubectl get nodes; kubectl get pods -A | grep -v Running'

Power cycle a node (Hetzner API)

source ~/.openclaw/workspace/.credentials/services.env
# Status
curl -s -H "Authorization: Bearer $COOLIFY_HETZNER_API_KEY" "https://api.hetzner.cloud/v1/servers/<ID>"
# Power on/off
curl -s -H "Authorization: Bearer $COOLIFY_HETZNER_API_KEY" "https://api.hetzner.cloud/v1/servers/<ID>/actions/poweron" -X POST

Scale a deployment

kubectl scale deployment <name> -n <namespace> --replicas=<N>

Force-delete stuck pod

kubectl delete pod <name> -n <namespace> --force --grace-period=0

Check CNPG cluster status

kubectl get cluster -n postgres
kubectl get pods -n postgres -o wide

Old Server (PENDING DECOMMISSION)

  • IP: 167.235.156.214, SSH alias: docfast
  • Still used for: git push to Forgejo (SSH access)
  • No longer serves traffic — decommission to save €4.5/mo

Future Improvements

See projects/business/memory/infrastructure.md for full roadmap.

High priority:

  • DNS: staging.docfast.dev
  • Persist HA constraints as infra-as-code
  • Decommission old server

Staging IP Whitelist

All staging environments are IP-whitelisted to the openclaw-vm public IP only.

How it works:

  • Hetzner LB has proxy protocol enabled (both port 80 and 443)
  • Traefik configured with proxyProtocol.trustedIPs for the LB IP (46.225.37.135/32) and private network (10.0.0.0/8)
  • Traefik Middleware staging-ipwhitelist in each staging namespace allows only 178.115.247.134/32
  • Middleware attached to staging ingresses via annotation traefik.ingress.kubernetes.io/router.middlewares

For new projects:

  1. Create middleware in the staging namespace:
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: staging-ipwhitelist
  namespace: <project>-staging
spec:
  ipAllowList:
    sourceRange:
      - 178.115.247.134/32
  2. Annotate the staging ingress:
traefik.ingress.kubernetes.io/router.middlewares: <project>-staging-staging-ipwhitelist@kubernetescrd

Traefik Helm config (managed via helm upgrade):

  • additionalArguments: proxyProtocol.trustedIPs for web + websecure entrypoints
  • logs.access.enabled=true for debugging
  • DaemonSet updateStrategy must be patched to maxUnavailable: 1 after each helm upgrade (helm resets it)
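
A sketch of the upgrade-then-patch sequence (the exact helm values are assumptions reconstructed from the bullets above, not the recorded release values):

```shell
# Upgrade Traefik with proxy-protocol trust + access logs (values illustrative)
helm upgrade traefik traefik/traefik -n kube-system --reuse-values \
  --set "additionalArguments={--entryPoints.web.proxyProtocol.trustedIPs=46.225.37.135/32\,10.0.0.0/8,--entryPoints.websecure.proxyProtocol.trustedIPs=46.225.37.135/32\,10.0.0.0/8}" \
  --set logs.access.enabled=true

# Helm resets the DaemonSet updateStrategy -- re-apply after EVERY upgrade,
# otherwise a rollout can take down both ingress pods at once
kubectl -n kube-system patch daemonset traefik --type merge \
  -p '{"spec":{"updateStrategy":{"rollingUpdate":{"maxUnavailable":1}}}}'
```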

Note: If openclaw-vm's public IP changes, update ALL staging-ipwhitelist middlewares.