Hoid 83595c17fb Add K3s infrastructure skill

2026-02-19 08:58:44 +00:00

7.2 KiB

name	description
k3s-infra	K3s Kubernetes cluster management. Use when working on cluster infrastructure, backups, deployments, HA, networking, scaling, CNPG databases, Traefik ingress, cert-manager, or any K8s operations on the Hetzner cluster.

K3s Infrastructure Skill

Manage the 3-node K3s Kubernetes cluster on Hetzner Cloud.

Quick Reference

SSH Access:

ssh k3s-mgr   # Control plane (188.34.201.101)
ssh k3s-w1    # Worker (159.69.23.121)
ssh k3s-w2    # Worker (46.225.169.60)

kubectl (on k3s-mgr):

export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
export PATH=$PATH:/usr/local/bin

Hetzner API:

source /home/openclaw/.openclaw/workspace/.credentials/services.env
# Use $COOLIFY_HETZNER_API_KEY for Hetzner Cloud API calls

NEVER open or print credential files directly. Source them inside scripts so the values stay out of terminal output and logs.

Cluster Architecture

Internet → Hetzner LB (46.225.37.135:80/443)
              ↓
         k3s-w1 / k3s-w2 (Traefik DaemonSet)
              ↓
         App Pods (DocFast, SnapAPI, ...)
              ↓
         PgBouncer Pooler (main-db-pooler.postgres.svc:5432)
              ↓
         CloudNativePG (main-db, 2 instances, PostgreSQL 17.4)

Nodes

Node	Role	Public IP	Private IP	Hetzner ID
k3s-mgr	Control plane (tainted NoSchedule)	188.34.201.101	10.0.1.5	121365837
k3s-w1	Worker	159.69.23.121	10.0.1.6	121365839
k3s-w2	Worker	46.225.169.60	10.0.1.7	121365840

All CAX11 ARM64 (2 vCPU, 4GB RAM), Hetzner nbg1
SSH key: /home/openclaw/.ssh/id_ed25519
Private network: 10.0.0.0/16, Flannel on enp7s0
K3s version: v1.34.4+k3s1

Load Balancer

Name: k3s-lb, Hetzner ID 5834131
IP: 46.225.37.135
Targets: k3s-w1, k3s-w2 on ports 80/443
Health: TCP, 15s interval, 3 retries

Namespaces

Namespace	Purpose	Replicas
`docfast`	DocFast production	2
`docfast-staging`	DocFast staging	1
`snapapi`	SnapAPI production	2 (target)
`snapapi-staging`	SnapAPI staging	1
`postgres`	CNPG cluster + PgBouncer	2+2
`cnpg-system`	CNPG operator	2
`cert-manager`	Let's Encrypt certs	-
`kube-system`	CoreDNS, Traefik, etc.	-

Database (CloudNativePG)

Cluster: main-db in postgres namespace, 2 instances
PostgreSQL: 17.4, 10Gi local-path storage per instance
PgBouncer: main-db-pooler, 2 instances, transaction mode
Connection: main-db-pooler.postgres.svc:5432
User: docfast / docfast (shared across projects for now)

Databases:

Database	Project	Namespace
`docfast`	DocFast prod	docfast
`docfast_staging`	DocFast staging	docfast-staging
`snapapi`	SnapAPI prod	snapapi
`snapapi_staging`	SnapAPI staging	snapapi-staging

DB access (find primary first!):

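# Check which is primary (cnpg.io/instanceRole is the current CNPG label; the bare "role" label is deprecated)
kubectl get pods -n postgres -l cnpg.io/instanceRole=primary -o name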
# Or check manually
kubectl exec -n postgres main-db-1 -c postgres -- psql -U postgres -c "SELECT pg_is_in_recovery();"
# f = primary, t = replica

# Connect to a database
kubectl exec -n postgres <primary-pod> -c postgres -- psql -U docfast -d <dbname>
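The manual check above can be wrapped in a small helper. This is a minimal sketch, not the cluster's actual tooling: the function names `find_primary` and `db_shell` are hypothetical, and it assumes the instance pods are named main-db-1 and main-db-2 (pass other pod names as arguments if CNPG has recreated an instance under a new number).

```shell
# Sketch: locate the CNPG primary by asking each instance whether it is in recovery.
# Assumption: instance pods are main-db-1 and main-db-2 unless names are passed in.
find_primary() {
  [ "$#" -gt 0 ] || set -- main-db-1 main-db-2
  for pod in "$@"; do
    # -tA makes psql print a bare "f" or "t" without headers or padding
    state=$(kubectl exec -n postgres "$pod" -c postgres -- \
      psql -U postgres -tA -c "SELECT pg_is_in_recovery();" 2>/dev/null) || continue
    if [ "$state" = "f" ]; then
      echo "$pod"
      return 0
    fi
  done
  echo "no primary found among: $*" >&2
  return 1
}

# Open an interactive psql session on the primary for the given database.
db_shell() {
  primary=$(find_primary) || return 1
  kubectl exec -it -n postgres "$primary" -c postgres -- psql -U docfast -d "$1"
}
```

Usage: `db_shell docfast_staging`. Probing each pod keeps the helper independent of pod labels, at the cost of one exec per instance.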

HA Configuration

⚠️ All spread constraints are RUNTIME PATCHES — may not survive K3s upgrades!

Component	Replicas	Strategy
CoreDNS	3	preferredDuringScheduling podAntiAffinity (mgr+w1+w2)
CNPG Operator	2	topologySpreadConstraints DoNotSchedule (w1+w2)
PgBouncer	2	requiredDuringScheduling podAntiAffinity (w1+w2)
DocFast Prod	2	preferredDuringScheduling podAntiAffinity (w1+w2)

Failover Tuning:

Readiness: every 5s, fail after 2 (~10s detection)
Liveness: every 10s, fail after 3
Pod tolerations for not-ready/unreachable nodes: tolerationSeconds 10 (Kubernetes default is 300)
Result: ~10-15 second failover

HA validated: All 3 failover scenarios tested and passed (w1 down, w2 down, mgr down).
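Because the spread constraints are runtime patches, it can help to re-check after an upgrade that replicas still sit on different nodes. The following is a rough sketch; `pods_spread_ok` is a hypothetical helper, and the label selector in the usage line is an assumption about how the deployments are labelled.

```shell
# Sketch: succeed only if every pod matching a label selector runs on a different node.
# Arguments: namespace, label selector.
pods_spread_ok() {
  nodes=$(kubectl get pods -n "$1" -l "$2" \
    -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}') || return 1
  if [ -z "$nodes" ]; then
    echo "no scheduled pods matched $2 in $1" >&2
    return 1
  fi
  # uniq -d prints only the node names that occur more than once
  dupes=$(printf '%s\n' "$nodes" | sort | uniq -d)
  if [ -n "$dupes" ]; then
    echo "co-located pods on: $dupes" >&2
    return 1
  fi
}
```

Usage: `pods_spread_ok docfast app=docfast` (the `app=docfast` label is illustrative; use whatever selector the deployment actually carries).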

CI/CD Pipeline

Pattern (per project):

Push to main → Forgejo CI → ARM64 image (QEMU cross-compile) → staging namespace
Push tag v* → production namespace
Registry: git.cloonar.com/openclawd/<project>
Runner: agent host (178.115.247.134), x86

Per-project Forgejo secrets:

REGISTRY_TOKEN — PAT with write:package
KUBECONFIG — base64 deployer kubeconfig

Deployer ServiceAccount: namespace-scoped RBAC (update deployments, list/watch/exec pods, no secret access)
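One way to confirm the deployer's permissions match the description is `kubectl auth can-i` with impersonation. A minimal sketch, assuming the ServiceAccount is named `deployer` in each project namespace (the name is not stated in this document) and that the caller is allowed to impersonate it:

```shell
# Sketch: verify the deployer ServiceAccount can update deployments but cannot read secrets.
# Arguments: namespace, optional ServiceAccount name (assumed default: deployer).
check_deployer_rbac() {
  ns=$1
  sa="system:serviceaccount:${ns}:${2:-deployer}"
  # --quiet suppresses the yes/no output; the exit code carries the answer
  if ! kubectl auth can-i update deployments -n "$ns" --as="$sa" --quiet; then
    echo "deployer cannot update deployments in $ns" >&2
    return 1
  fi
  if kubectl auth can-i get secrets -n "$ns" --as="$sa" --quiet; then
    echo "deployer can read secrets in $ns (should be denied)" >&2
    return 1
  fi
}
```

Usage: `check_deployer_rbac docfast-staging`.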

DNS

Record	Type	Value	Status
docfast.dev	A	46.225.37.135	✅
staging.docfast.dev	A	46.225.37.135	❌ NOT SET
MX for docfast.dev	MX	mail.cloonar.com.	✅

Firewall

Hetzner FW: coolify-fw (ID 10553199)
Port 6443 (Kubernetes API): allowed from 10.0.0.0/16 and 178.115.247.134/32 (CI runner)

Backup (TO IMPLEMENT)

Current: ❌ NO BACKUPS

Plan: Borg → Hetzner Storage Box

Target: u149513-sub11@u149513-sub11.your-backup.de:23
SSH key already configured on k3s-mgr (/root/.ssh/id_ed25519, fingerprint docfast-backup)
Per-machine subdir: ./docfast-1/ (existing), ./k3s-cluster/ and ./k3s-db/ (planned)

What to back up:

Cluster state: etcd snapshots + /var/lib/rancher/k3s/server/manifests/ → daily
Databases: pg_dump all databases → every 6h
K8s manifests: export all resources as YAML → daily

Recovery (estimated 15-30 min): fresh nodes → K3s install → restore etcd or re-apply manifests → restore DB → update DNS
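The database part of the plan could look roughly like the sketch below. Nothing here is implemented yet. It assumes the ./k3s-db repository has already been initialised with `borg init`, that any passphrase is exported as BORG_PASSPHRASE by the caller, and that dumps run as the postgres superuser. The helper name `backup_databases` and the `db-<timestamp>` archive naming are illustrative.

```shell
# Sketch of the planned 6-hourly database backup: pg_dump each database, then archive with Borg.
# Assumptions: repository already initialised; $1 is the current primary pod.
BORG_REPO_DB="ssh://u149513-sub11@u149513-sub11.your-backup.de:23/./k3s-db"
DATABASES="docfast docfast_staging snapapi snapapi_staging"

backup_databases() {
  primary=$1
  stamp=$(date -u +%Y-%m-%dT%H%M%SZ)
  tmp=$(mktemp -d) || return 1
  for db in $DATABASES; do
    # Plain-format dump, written on the calling host rather than inside the pod
    if ! kubectl exec -n postgres "$primary" -c postgres -- \
        pg_dump -U postgres -d "$db" > "$tmp/$db.sql"; then
      echo "pg_dump failed for $db" >&2
      rm -rf "$tmp"
      return 1
    fi
  done
  # One archive per run; cd first so the archive holds relative paths
  (cd "$tmp" && borg create --compression zstd "${BORG_REPO_DB}::db-${stamp}" .)
  rc=$?
  rm -rf "$tmp"
  return "$rc"
}
```

Usage from cron on k3s-mgr might be `backup_databases main-db-1`, with the pod name resolved first since the primary can move. Dumping to a temporary directory rather than piping straight into Borg means a failed pg_dump aborts the run before any archive is created.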

Common Operations

Deploy new project

Create namespace: kubectl create namespace <project>
Create database on CNPG primary: CREATE DATABASE <name> OWNER docfast;
Create secrets: kubectl create secret generic <project>-secrets -n <namespace> ...
Copy pull secret: kubectl get secret forgejo-registry -n docfast -o json | sed ... | kubectl apply -f -
Create deployer SA + RBAC
Set up CI/CD workflows in Forgejo repo
Deploy + Ingress + cert-manager TLS
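The first two steps are mechanical enough to script. A minimal sketch covering only namespace and database creation; `bootstrap_project` is a hypothetical name, and secrets, pull secret, RBAC, and CI setup are deliberately left manual because their details are project-specific.

```shell
# Sketch of steps 1 and 2 for a new project.
# Arguments: project name (used as namespace), database name, current primary pod.
bootstrap_project() {
  project=$1
  dbname=$2
  primary=$3
  # Validate the identifier before interpolating it into SQL
  case "$dbname" in
    ''|*[!a-z0-9_]*) echo "invalid database name: $dbname" >&2; return 1 ;;
  esac
  # create --dry-run piped into apply makes namespace creation idempotent
  kubectl create namespace "$project" --dry-run=client -o yaml | kubectl apply -f - || return 1
  kubectl exec -n postgres "$primary" -c postgres -- \
    psql -U postgres -c "CREATE DATABASE ${dbname} OWNER docfast;"
}
```

Usage: `bootstrap_project myapp myapp main-db-1`. The CREATE DATABASE call fails if the database already exists, which is a reasonable guard against reusing a name by accident.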

Check cluster health

ssh k3s-mgr 'kubectl get nodes; kubectl get pods -A | grep -v Running'
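For use in scripts, the same check can return an exit code instead of relying on someone reading the output. A rough sketch to run on k3s-mgr; `cluster_healthy` is a hypothetical helper that treats Completed pods (for example finished Helm jobs) as healthy.

```shell
# Sketch: exit non-zero when a node is not Ready or a pod is neither Running nor Completed.
cluster_healthy() {
  nodes=$(kubectl get nodes --no-headers) || return 1
  pods=$(kubectl get pods -A --no-headers) || return 1
  # STATUS is column 2 for nodes; "Ready,SchedulingDisabled" still counts as Ready
  bad_nodes=$(printf '%s\n' "$nodes" | awk 'NF && $2 !~ /^Ready/ {print $1}')
  # With -A the columns are NAMESPACE NAME READY STATUS ..., so STATUS is column 4
  bad_pods=$(printf '%s\n' "$pods" |
    awk 'NF && $4 != "Running" && $4 != "Completed" {print $1 "/" $2}')
  [ -z "$bad_nodes" ] || echo "nodes not ready: $bad_nodes" >&2
  [ -z "$bad_pods" ] || echo "unhealthy pods: $bad_pods" >&2
  [ -z "$bad_nodes" ] && [ -z "$bad_pods" ]
}
```

Usage: `cluster_healthy && echo OK`.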

Power cycle a node (Hetzner API)

source ~/.openclaw/workspace/.credentials/services.env
# Status
curl -s -H "Authorization: Bearer $COOLIFY_HETZNER_API_KEY" "https://api.hetzner.cloud/v1/servers/<ID>"
# Power on (use /actions/poweroff instead to power off)
curl -s -H "Authorization: Bearer $COOLIFY_HETZNER_API_KEY" "https://api.hetzner.cloud/v1/servers/<ID>/actions/poweron" -X POST
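A full power cycle combines both actions with a status poll in between. The sketch below is illustrative: the helper names are hypothetical, `jq` is assumed to be installed, and HETZNER_ENV_FILE is an invented override for the credentials path. Note that `poweroff` is a hard power-off; the API's `shutdown` action sends an ACPI signal instead and is gentler when the node still responds.

```shell
# Sketch: wrap Hetzner Cloud API calls so the key is sourced in a subshell and never printed.
hetzner_api() {
  # Usage: hetzner_api METHOD PATH
  (
    . "${HETZNER_ENV_FILE:-$HOME/.openclaw/workspace/.credentials/services.env}" || exit 1
    curl -fsS -X "$1" -H "Authorization: Bearer $COOLIFY_HETZNER_API_KEY" \
      "https://api.hetzner.cloud/v1$2"
  )
}

# Print the server status string (e.g. running, off); requires jq.
server_status() { hetzner_api GET "/servers/$1" | jq -r '.server.status'; }

power_cycle() {
  hetzner_api POST "/servers/$1/actions/poweroff" >/dev/null || return 1
  # Poll until the server reports "off" (about 60s at most), then power it back on
  i=0
  while [ "$(server_status "$1")" != "off" ]; do
    i=$((i + 1))
    [ "$i" -lt 30 ] || return 1
    sleep 2
  done
  hetzner_api POST "/servers/$1/actions/poweron" >/dev/null
}
```

Usage: `power_cycle 121365839` for k3s-w1, using the Hetzner IDs from the Nodes table.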

Scale a deployment

kubectl scale deployment <name> -n <namespace> --replicas=<N>

Force-delete stuck pod

kubectl delete pod <name> -n <namespace> --force --grace-period=0

Check CNPG cluster status

kubectl get cluster -n postgres
kubectl get pods -n postgres -o wide

Old Server (PENDING DECOMMISSION)

IP: 167.235.156.214, SSH alias: docfast
Still used for: git push to Forgejo (SSH access)
No longer serves traffic — decommission to save €4.5/mo

Future Improvements

See projects/business/memory/infrastructure.md for full roadmap.

High priority:

Implement Borg backup
DNS: staging.docfast.dev
Persist HA constraints as infra-as-code
Decommission old server
