From 83595c17fb155bf2b46384bfe2583ac4cb9325ce Mon Sep 17 00:00:00 2001
From: Hoid
Date: Thu, 19 Feb 2026 08:58:44 +0000
Subject: [PATCH] Add K3s infrastructure skill

---
 skills/k3s-infra/SKILL.md | 225 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 225 insertions(+)
 create mode 100644 skills/k3s-infra/SKILL.md

diff --git a/skills/k3s-infra/SKILL.md b/skills/k3s-infra/SKILL.md
new file mode 100644
index 0000000..d0a789f
--- /dev/null
+++ b/skills/k3s-infra/SKILL.md
@@ -0,0 +1,225 @@
---
name: k3s-infra
description: "K3s Kubernetes cluster management. Use when working on cluster infrastructure, backups, deployments, HA, networking, scaling, CNPG databases, Traefik ingress, cert-manager, or any K8s operations on the Hetzner cluster."
---

# K3s Infrastructure Skill

Manage the 3-node K3s Kubernetes cluster on Hetzner Cloud.

## Quick Reference

**SSH access:**
```bash
ssh k3s-mgr   # Control plane (188.34.201.101)
ssh k3s-w1    # Worker (159.69.23.121)
ssh k3s-w2    # Worker (46.225.169.60)
```

**kubectl (on k3s-mgr):**
```bash
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
export PATH=$PATH:/usr/local/bin
```

**Hetzner API:**
```bash
source /home/openclaw/.openclaw/workspace/.credentials/services.env
# Use $COOLIFY_HETZNER_API_KEY for Hetzner Cloud API calls
```

**NEVER read credential files directly. Source them in scripts.**

## Cluster Architecture

```
Internet → Hetzner LB (46.225.37.135:80/443)
    ↓
k3s-w1 / k3s-w2 (Traefik DaemonSet)
    ↓
App Pods (DocFast, SnapAPI, ...)
    ↓
PgBouncer Pooler (main-db-pooler.postgres.svc:5432)
    ↓
CloudNativePG (main-db, 2 instances, PostgreSQL 17.4)
```

## Nodes

| Node | Role | Public IP | Private IP | Hetzner ID |
|------|------|-----------|------------|------------|
| k3s-mgr | Control plane (tainted NoSchedule) | 188.34.201.101 | 10.0.1.5 | 121365837 |
| k3s-w1 | Worker | 159.69.23.121 | 10.0.1.6 | 121365839 |
| k3s-w2 | Worker | 46.225.169.60 | 10.0.1.7 | 121365840 |

- All nodes: CAX11 ARM64 (2 vCPU, 4 GB RAM), Hetzner nbg1
- SSH key: `/home/openclaw/.ssh/id_ed25519`
- Private network: 10.0.0.0/16, Flannel on enp7s0
- K3s version: v1.34.4+k3s1

## Load Balancer

- Name: `k3s-lb`, Hetzner ID 5834131
- IP: 46.225.37.135
- Targets: k3s-w1, k3s-w2 on ports 80/443
- Health check: TCP, 15s interval, 3 retries

## Namespaces

| Namespace | Purpose | Replicas |
|-----------|---------|----------|
| `docfast` | DocFast production | 2 |
| `docfast-staging` | DocFast staging | 1 |
| `snapapi` | SnapAPI production | 2 (target) |
| `snapapi-staging` | SnapAPI staging | 1 |
| `postgres` | CNPG cluster + PgBouncer | 2+2 |
| `cnpg-system` | CNPG operator | 2 |
| `cert-manager` | Let's Encrypt certs | - |
| `kube-system` | CoreDNS, Traefik, etc. | - |

## Database (CloudNativePG)

- **Cluster:** `main-db` in the `postgres` namespace, 2 instances
- **PostgreSQL:** 17.4, 10Gi local-path storage per instance
- **PgBouncer:** `main-db-pooler`, 2 instances, transaction mode
- **Connection:** `main-db-pooler.postgres.svc:5432`
- **User:** docfast / docfast (shared across projects for now)

**Databases:**

| Database | Project | Namespace |
|----------|---------|-----------|
| `docfast` | DocFast prod | docfast |
| `docfast_staging` | DocFast staging | docfast-staging |
| `snapapi` | SnapAPI prod | snapapi |
| `snapapi_staging` | SnapAPI staging | snapapi-staging |

**DB access (find the primary first!):**
```bash
# Check which instance is the primary
kubectl get pods -n postgres -l role=primary -o name
# Or check manually
kubectl exec -n postgres main-db-1 -c postgres -- psql -U postgres -c "SELECT pg_is_in_recovery();"
# f = primary, t = replica

# Connect to a database on the primary
kubectl exec -it -n postgres <primary-pod> -c postgres -- psql -U docfast -d <database>
```

## HA Configuration

**⚠️ All spread constraints are RUNTIME PATCHES — may not survive K3s upgrades!**

| Component | Replicas | Strategy |
|-----------|----------|----------|
| CoreDNS | 3 | preferredDuringScheduling podAntiAffinity (mgr+w1+w2) |
| CNPG Operator | 2 | topologySpreadConstraints DoNotSchedule (w1+w2) |
| PgBouncer | 2 | requiredDuringScheduling podAntiAffinity (w1+w2) |
| DocFast Prod | 2 | preferredDuringScheduling podAntiAffinity (w1+w2) |

**Failover tuning:**
- Readiness probe: every 5s, fail after 2 (~10s detection)
- Liveness probe: every 10s, fail after 3
- Node tolerations: 10s (down from the 300s default)
- **Result: ~10-15 second failover**

**HA validated:** all three failover scenarios tested and passed (w1 down, w2 down, mgr down).
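The manual primary check in the database section (`pg_is_in_recovery()` returns `f` on the primary, `t` on a replica) can be wrapped in a small helper. This is a minimal sketch, not existing tooling: the `role_from_recovery_flag` function is hypothetical, and the `kubectl` invocation in the comment assumes the `main-db-*` pod names shown above.

```shell
#!/bin/sh
# Map the output of `psql -tAc "SELECT pg_is_in_recovery();"` to a role name.
# 'f' (not in recovery) = primary, 't' (in recovery) = replica.
role_from_recovery_flag() {
  case "$(printf '%s' "$1" | tr -d '[:space:]')" in
    f) echo primary ;;
    t) echo replica ;;
    *) echo unknown ;;
  esac
}

# On the cluster, the flag would come from something like:
#   flag=$(kubectl exec -n postgres main-db-1 -c postgres -- \
#            psql -U postgres -tAc "SELECT pg_is_in_recovery();")
role_from_recovery_flag "f"   # prints "primary"
role_from_recovery_flag "t"   # prints "replica"
```

Querying the flag directly can be handy when the `role=primary` pod label has not yet caught up during a switchover.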
## CI/CD Pipeline

**Pattern (per project):**
- Push to `main` → Forgejo CI → ARM64 image (QEMU cross-compile) → staging namespace
- Push tag `v*` → production namespace
- Registry: `git.cloonar.com/openclawd/`
- Runner: agent host (178.115.247.134), x86

**Per-project Forgejo secrets:**
- `REGISTRY_TOKEN` — PAT with write:package scope
- `KUBECONFIG` — base64-encoded deployer kubeconfig

**Deployer ServiceAccount:** namespace-scoped RBAC (update deployments, list/watch/exec pods, no secret access)

## DNS

| Record | Type | Value | Status |
|--------|------|-------|--------|
| docfast.dev | A | 46.225.37.135 | ✅ |
| staging.docfast.dev | A | 46.225.37.135 | ❌ NOT SET |
| MX for docfast.dev | MX | mail.cloonar.com. | ✅ |

## Firewall

- Hetzner firewall: `coolify-fw` (ID 10553199)
- Port 6443: 10.0.0.0/16 + 178.115.247.134/32 (CI runner)

## Backup (TO IMPLEMENT)

**Current state: ❌ NO BACKUPS**

**Plan: Borg → Hetzner Storage Box**
- Target: `u149513-sub11@u149513-sub11.your-backup.de:23`
- SSH key already configured on k3s-mgr (`/root/.ssh/id_ed25519`, fingerprint `docfast-backup`)
- Per-machine subdirs: `./docfast-1/` (existing), `./k3s-cluster/` and `./k3s-db/` (planned)

**What to back up:**
1. **Cluster state:** etcd snapshots + `/var/lib/rancher/k3s/server/manifests/` → daily
2. **Databases:** pg_dump of all databases → every 6h
3. **K8s manifests:** export all resources as YAML → daily

**Recovery:** fresh nodes → K3s install → restore etcd or re-apply manifests → restore DB → update DNS → ~15-30 min

## Common Operations

### Deploy a new project
1. Create namespace: `kubectl create namespace <project>`
2. Create the database on the CNPG primary: `CREATE DATABASE <db> OWNER docfast;`
3. Create secrets: `kubectl create secret generic <project>-secrets -n <project> ...`
4. Copy the pull secret: `kubectl get secret forgejo-registry -n docfast -o json | sed ... | kubectl apply -f -`
5. Create deployer SA + RBAC
6. Set up CI/CD workflows in the Forgejo repo
7. Deploy + Ingress + cert-manager TLS

### Check cluster health
```bash
ssh k3s-mgr 'kubectl get nodes; kubectl get pods -A | grep -v Running'
```

### Power cycle a node (Hetzner API)
```bash
source ~/.openclaw/workspace/.credentials/services.env
# Status
curl -s -H "Authorization: Bearer $COOLIFY_HETZNER_API_KEY" "https://api.hetzner.cloud/v1/servers/<server-id>"
# Power on/off
curl -s -X POST -H "Authorization: Bearer $COOLIFY_HETZNER_API_KEY" "https://api.hetzner.cloud/v1/servers/<server-id>/actions/poweron"
```

### Scale a deployment
```bash
kubectl scale deployment <deployment> -n <namespace> --replicas=<count>
```

### Force-delete a stuck pod
```bash
kubectl delete pod <pod> -n <namespace> --force --grace-period=0
```

### Check CNPG cluster status
```bash
kubectl get cluster -n postgres
kubectl get pods -n postgres -o wide
```

## Old Server (PENDING DECOMMISSION)

- IP: 167.235.156.214, SSH alias: `docfast`
- Still used for: git push to Forgejo (SSH access)
- **No longer serves traffic** — decommission to save €4.50/mo

## Future Improvements

See `projects/business/memory/infrastructure.md` for the full roadmap.

**High priority:**
- Implement Borg backup
- DNS: staging.docfast.dev
- Persist HA constraints as infrastructure-as-code
- Decommission the old server
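The planned 6-hourly `pg_dump` step from the backup section could be sketched as follows. This is illustrative only, since backups are not implemented yet: the `dump_name` helper, the `./k3s-db/` staging directory, and the borg/kubectl invocations in the comments are assumptions, not existing tooling.

```shell
#!/bin/sh
# Sketch of the planned 6-hourly database dump job (NOT yet implemented).
# The database list matches the Databases table in this document.
DATABASES="docfast docfast_staging snapapi snapapi_staging"

# Build a timestamped dump filename for one database.
dump_name() {
  printf '%s-%s.sql.gz\n' "$1" "$2"
}

backup_all() {
  stamp=$(date -u +%Y%m%dT%H%M)
  for db in $DATABASES; do
    out="./k3s-db/$(dump_name "$db" "$stamp")"
    # On the cluster, each iteration would run something like:
    #   kubectl exec -n postgres <primary-pod> -c postgres -- \
    #     pg_dump -U postgres "$db" | gzip > "$out"
    # followed by a borg push to the Storage Box target above.
    echo "would dump $db to $out"
  done
}

backup_all
```

Dumping per database (rather than one `pg_dumpall`) keeps restores of a single project's database independent of the others.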