| name | description |
|---|---|
| k3s-infra | K3s Kubernetes cluster management. Use when working on cluster infrastructure, backups, deployments, HA, networking, scaling, CNPG databases, Traefik ingress, cert-manager, or any K8s operations on the Hetzner cluster. |
# K3s Infrastructure Skill
Manage the 3-node K3s Kubernetes cluster on Hetzner Cloud.
## Quick Reference
**SSH Access:**

```bash
ssh k3s-mgr   # Control plane (188.34.201.101)
ssh k3s-w1    # Worker (159.69.23.121)
ssh k3s-w2    # Worker (46.225.169.60)
```
**kubectl (on k3s-mgr):**

```bash
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
export PATH=$PATH:/usr/local/bin
```
**Hetzner API:**

```bash
source /home/openclaw/.openclaw/workspace/.credentials/services.env
# Use $COOLIFY_HETZNER_API_KEY for Hetzner Cloud API calls
```

**NEVER read credential files directly. Source them in scripts.**
## Cluster Architecture
```
Internet → Hetzner LB (46.225.37.135:80/443)
        ↓
k3s-w1 / k3s-w2 (Traefik DaemonSet)
        ↓
App Pods (DocFast, SnapAPI, ...)
        ↓
PgBouncer Pooler (main-db-pooler.postgres.svc:5432)
        ↓
CloudNativePG (main-db, 2 instances, PostgreSQL 17.4)
```
### Nodes
| Node | Role | Public IP | Private IP | Hetzner ID |
|---|---|---|---|---|
| k3s-mgr | Control plane (tainted NoSchedule) | 188.34.201.101 | 10.0.1.5 | 121365837 |
| k3s-w1 | Worker | 159.69.23.121 | 10.0.1.6 | 121365839 |
| k3s-w2 | Worker | 46.225.169.60 | 10.0.1.7 | 121365840 |
- All CAX11 ARM64 (2 vCPU, 4 GB RAM), Hetzner nbg1
- SSH key: `/home/openclaw/.ssh/id_ed25519`
- Private network: 10.0.0.0/16, Flannel on enp7s0
- K3s version: v1.34.4+k3s1
### Load Balancer
- Name: `k3s-lb`, Hetzner ID 5834131
- IP: 46.225.37.135
- Targets: k3s-w1, k3s-w2 on ports 80/443
- Health check: TCP, 15s interval, 3 retries
### Namespaces
| Namespace | Purpose | Replicas |
|---|---|---|
| `docfast` | DocFast production | 2 |
| `docfast-staging` | DocFast staging | 1 |
| `snapapi` | SnapAPI production | 2 (target) |
| `snapapi-staging` | SnapAPI staging | 1 |
| `postgres` | CNPG cluster + PgBouncer | 2+2 |
| `cnpg-system` | CNPG operator | 2 |
| `cert-manager` | Let's Encrypt certs | - |
| `kube-system` | CoreDNS, Traefik, etc. | - |
## Database (CloudNativePG)
- Cluster: `main-db` in `postgres` namespace, 2 instances
- PostgreSQL: 17.4, 10Gi local-path storage per instance
- PgBouncer: `main-db-pooler`, 2 instances, transaction mode
- Connection: `main-db-pooler.postgres.svc:5432`
- User: docfast / docfast (shared across projects for now)
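For reference, a minimal sketch of what the CNPG `Pooler` resource behind this setup might look like, built only from the values listed above (the live object may carry additional fields):

```yaml
# Sketch of the PgBouncer pooler; values taken from the bullets above.
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: main-db-pooler
  namespace: postgres
spec:
  cluster:
    name: main-db        # the CNPG cluster it fronts
  instances: 2
  type: rw               # route to the primary
  pgbouncer:
    poolMode: transaction
```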
**Databases:**

| Database | Project | Namespace |
|---|---|---|
| `docfast` | DocFast prod | `docfast` |
| `docfast_staging` | DocFast staging | `docfast-staging` |
| `snapapi` | SnapAPI prod | `snapapi` |
| `snapapi_staging` | SnapAPI staging | `snapapi-staging` |
**DB access (find the primary first!):**

```bash
# Check which pod is primary
kubectl get pods -n postgres -l role=primary -o name

# Or check manually
kubectl exec -n postgres main-db-1 -c postgres -- psql -U postgres -c "SELECT pg_is_in_recovery();"
# f = primary, t = replica

# Connect to a database
kubectl exec -n postgres <primary-pod> -c postgres -- psql -U docfast -d <dbname>
```
## HA Configuration
⚠️ All spread constraints are RUNTIME PATCHES — may not survive K3s upgrades!
| Component | Replicas | Strategy |
|---|---|---|
| CoreDNS | 3 | preferredDuringScheduling podAntiAffinity (mgr+w1+w2) |
| CNPG Operator | 2 | topologySpreadConstraints DoNotSchedule (w1+w2) |
| PgBouncer | 2 | requiredDuringScheduling podAntiAffinity (w1+w2) |
| DocFast Prod | 2 | preferredDuringScheduling podAntiAffinity (w1+w2) |
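As an illustration of what one of these runtime patches looks like, here is a sketch of the `requiredDuringScheduling` anti-affinity for PgBouncer; the label selector is an assumption (check the actual pod labels before patching):

```yaml
# Sketch: pod anti-affinity as patched onto the pooler deployment.
# cnpg.io/poolerName is assumed to be the pod label; verify on the cluster.
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  cnpg.io/poolerName: main-db-pooler
              topologyKey: kubernetes.io/hostname
```

Applied with something like `kubectl patch deployment main-db-pooler -n postgres --patch-file patch.yaml`; as the warning above notes, such patches may be reverted by upgrades.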
**Failover Tuning:**
- Readiness: every 5s, fail after 2 (~10s detection)
- Liveness: every 10s, fail after 3
- Node tolerations: 10s (was 300s default)
- Result: ~10-15 second failover
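The tuning values above map onto a pod spec roughly like this (container name is a placeholder; probe handlers omitted):

```yaml
# Sketch: probe and toleration settings matching the failover tuning above.
spec:
  template:
    spec:
      containers:
        - name: app
          readinessProbe:
            periodSeconds: 5      # check every 5s
            failureThreshold: 2   # ~10s detection
          livenessProbe:
            periodSeconds: 10
            failureThreshold: 3
      tolerations:
        - key: node.kubernetes.io/not-ready
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 10   # evict after 10s (default 300s)
        - key: node.kubernetes.io/unreachable
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 10
```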
**HA validated:** All 3 failover scenarios tested and passed (w1 down, w2 down, mgr down).
## CI/CD Pipeline
**Pattern (per project):**

- Push to `main` → Forgejo CI → ARM64 image (QEMU cross-compile) → staging namespace
- Push tag `v*` → production namespace
- Registry: `git.cloonar.com/openclawd/<project>`
- Runner: agent host (178.115.247.134), x86
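The staging leg of this pattern might look like the following Forgejo Actions workflow; the project name, runner label, and action versions are illustrative, not taken from the actual repos:

```yaml
# Sketch: .forgejo/workflows/deploy-staging.yml for a hypothetical project
# "myproject". A sibling workflow triggered on tags v* would target prod.
on:
  push:
    branches: [main]
jobs:
  build-deploy:
    runs-on: docker
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-qemu-action@v3       # ARM64 build on the x86 runner
      - uses: docker/setup-buildx-action@v3
      - name: Log in to registry
        run: echo "${{ secrets.REGISTRY_TOKEN }}" | docker login git.cloonar.com -u ci --password-stdin
      - name: Build and push ARM64 image
        run: docker buildx build --platform linux/arm64 -t git.cloonar.com/openclawd/myproject:${{ github.sha }} --push .
      - name: Deploy to staging
        run: |
          echo "${{ secrets.KUBECONFIG }}" | base64 -d > kubeconfig
          KUBECONFIG=kubeconfig kubectl set image deployment/myproject \
            app=git.cloonar.com/openclawd/myproject:${{ github.sha }} -n myproject-staging
```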
**Per-project Forgejo secrets:**

- `REGISTRY_TOKEN` — PAT with write:package
- `KUBECONFIG` — base64-encoded deployer kubeconfig
**Deployer ServiceAccount:** namespace-scoped RBAC (update deployments, list/watch/exec pods, no secret access)
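A sketch of a Role matching that description, for the `docfast` namespace (resource names are assumptions):

```yaml
# Sketch: namespace-scoped deployer Role. Note the absence of any
# rule for "secrets" -- the deployer cannot read them.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployer
  namespace: docfast
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deployer
  namespace: docfast
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: deployer
subjects:
  - kind: ServiceAccount
    name: deployer
    namespace: docfast
```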
## DNS
| Record | Type | Value | Status |
|---|---|---|---|
| docfast.dev | A | 46.225.37.135 | ✅ |
| staging.docfast.dev | A | 46.225.37.135 | ❌ NOT SET |
| MX for docfast.dev | MX | mail.cloonar.com. | ✅ |
## Firewall
- Hetzner FW: `coolify-fw` (ID 10553199)
- Port 6443: open to 10.0.0.0/16 + 178.115.247.134/32 (CI runner)
## Backup ✅ OPERATIONAL
**Borg → Hetzner Storage Box**

- Target: `u149513-sub10@u149513-sub10.your-backup.de:23`
- SSH key: `/root/.ssh/id_ed25519` (k3s-mgr-backup)
- Passphrase: `/root/.borg-passphrase` (on k3s-mgr)
- Key exports: `/root/.borg-key-cluster`, `/root/.borg-key-db`
- Script: `/root/k3s-backup.sh`
- Log: `/var/log/k3s-backup.log`
**Repos:**

| Repo | Contents | Size |
|---|---|---|
| `./k3s-cluster` | K3s SQLite, manifests, token, all namespace YAML exports, CNPG specs | ~45 MB |
| `./k3s-db` | pg_dump of all databases (docfast, docfast_staging, snapapi, snapapi_staging) + globals | ~30 KB |
**Schedule (cron on k3s-mgr):**

- `0 */6 * * *` — DB backup (pg_dump) every 6 hours
- `0 3 * * *` — Full backup (DB + cluster state + manifests) daily at 03:00 UTC
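The schedule above corresponds to a root crontab on k3s-mgr along these lines; the `db`/`full` arguments to the script are assumptions, as are the log redirections:

```
# Sketch of the root crontab on k3s-mgr (arguments are assumed)
0 */6 * * * /root/k3s-backup.sh db   >> /var/log/k3s-backup.log 2>&1
0 3 * * *   /root/k3s-backup.sh full >> /var/log/k3s-backup.log 2>&1
```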
**Retention:** 7 daily, 4 weekly (auto-pruned)
**Recovery:**

1. Provision 3 fresh CAX11 nodes
2. Install K3s, restore SQLite DB from Borg (`/var/lib/rancher/k3s/server/db/`)
3. Or: fresh K3s + re-apply manifest YAMLs from Borg
4. Restore databases: `psql -U postgres -d <dbname> < dump.sql`
5. Update DNS to new LB IP

Estimated recovery time: ~15-30 minutes
**Verify backup:**

```bash
ssh k3s-mgr 'export BORG_RSH="ssh -p23"; export BORG_PASSPHRASE=$(cat /root/.borg-passphrase); borg list ssh://u149513-sub10@u149513-sub10.your-backup.de/./k3s-db'
```
## Common Operations
### Deploy new project

1. Create namespace: `kubectl create namespace <project>`
2. Create database on CNPG primary: `CREATE DATABASE <name> OWNER docfast;`
3. Create secrets: `kubectl create secret generic <project>-secrets -n <namespace> ...`
4. Copy pull secret: `kubectl get secret forgejo-registry -n docfast -o json | sed ... | kubectl apply -f -`
5. Create deployer SA + RBAC
6. Set up CI/CD workflows in Forgejo repo
7. Deploy + Ingress + cert-manager TLS
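The final step can be sketched as an Ingress with cert-manager TLS; the project name, host, and issuer name here are placeholders, not values from the cluster:

```yaml
# Sketch: Ingress + cert-manager TLS for a hypothetical project "myproject".
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myproject
  namespace: myproject
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt   # issuer name is assumed
spec:
  ingressClassName: traefik
  rules:
    - host: myproject.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myproject
                port:
                  number: 80
  tls:
    - hosts: [myproject.example.com]
      secretName: myproject-tls   # cert-manager fills this secret
```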
### Check cluster health

```bash
ssh k3s-mgr 'kubectl get nodes; kubectl get pods -A | grep -v Running'
```
### Power cycle a node (Hetzner API)

```bash
source ~/.openclaw/workspace/.credentials/services.env

# Status
curl -s -H "Authorization: Bearer $COOLIFY_HETZNER_API_KEY" "https://api.hetzner.cloud/v1/servers/<ID>"

# Power on/off
curl -s -H "Authorization: Bearer $COOLIFY_HETZNER_API_KEY" "https://api.hetzner.cloud/v1/servers/<ID>/actions/poweron" -X POST
```
### Scale a deployment

```bash
kubectl scale deployment <name> -n <namespace> --replicas=<N>
```
### Force-delete stuck pod

```bash
kubectl delete pod <name> -n <namespace> --force --grace-period=0
```
### Check CNPG cluster status

```bash
kubectl get cluster -n postgres
kubectl get pods -n postgres -o wide
```
## Old Server (PENDING DECOMMISSION)

- IP: 167.235.156.214, SSH alias: `docfast`
- Still used for: git push to Forgejo (SSH access)
- No longer serves traffic — decommission to save €4.5/mo
## Future Improvements
See `projects/business/memory/infrastructure.md` for the full roadmap.
**High priority:**

- Implement Borg backup ✅ (done; see Backup section)
- DNS: staging.docfast.dev
- Persist HA constraints as infra-as-code
- Decommission old server