Add K3s restore guides to infra skill references
This commit is contained in:
parent
e2e9ae55f7
commit
dd5a51fdd0
2 changed files with 287 additions and 0 deletions
137
skills/k3s-infra/references/RESTORE-MGR.md
Normal file
@@ -0,0 +1,137 @@
# Quick Restore: Control Plane (k3s-mgr) Only

Use this when only k3s-mgr is down. Workers and workloads keep running — they just can't be managed until the control plane is back.

## What Still Works Without k3s-mgr

- ✅ Running pods continue serving traffic (DocFast, SnapAPI)
- ✅ Hetzner LB routes to workers directly
- ✅ CNPG PostgreSQL (runs on workers, auto-failover between w1/w2)
- ✅ Traefik ingress (DaemonSet on workers)
- ❌ `kubectl` commands fail
- ❌ No new deployments, scaling, or pod scheduling
- ❌ cert-manager can't renew certs (but existing certs remain valid for up to 90 days)
- ❌ No etcd/SQLite state changes

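This degraded state can be confirmed from the runner before starting the restore. A minimal probe sketch — it reuses the app health endpoint and the 10.0.1.5 apiserver address from this guide; the `/ping` path is the k3s supervisor's liveness endpoint:

```bash
# Sketch: confirm the degraded state from a runner host.
# "check" only reports reachability of a URL.
check() {  # check <label> <url> -> prints "OK <label>" or "DOWN <label>"
  if curl -fsk --max-time 5 "$2" >/dev/null 2>&1; then
    echo "OK   $1"
  else
    echo "DOWN $1"
  fi
}

# Expected while only k3s-mgr is down:
#   check app       https://docfast.dev/health   -> OK   app
#   check apiserver https://10.0.1.5:6443/ping   -> DOWN apiserver
```

If the app check is DOWN too, the problem is bigger than the control plane and this runbook does not apply.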
## Steps

### 1. Provision New Server

Hetzner Cloud → CAX11 ARM64, Ubuntu 24.04, nbg1 datacenter.
Assign to private network 10.0.0.0/16, set the private IP to **10.0.1.5**.
Update the public IP in:

- Hetzner Firewall (allow 6443 from runner IP)
- DNS if applicable
- SSH config

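The console steps above can also be scripted with the `hcloud` CLI. A sketch only — it assumes `HCLOUD_TOKEN` is set and that the private network is named `k3s-net` (the network name is not from this guide; adjust to yours):

```bash
# Sketch: provision k3s-mgr via the hcloud CLI (requires HCLOUD_TOKEN).
# The network name "k3s-net" is an assumption.
provision_mgr() {
  cmd="hcloud server create --name k3s-mgr --type cax11 --image ubuntu-24.04 --location nbg1"
  echo "$cmd"                      # print first so a dry run is possible
  if [ "$1" = "--run" ]; then
    eval "$cmd"
    # Attach with the fixed private IP used throughout this guide.
    hcloud server attach-to-network k3s-mgr --network k3s-net --ip 10.0.1.5
  fi
}

provision_mgr    # dry run: prints the create command without executing it
```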
### 2. Install Borg & Restore Backup

```bash
apt update && apt install -y borgbackup python3-pyfuse3

# Copy SSH key for Storage Box (from password manager or another host)
mkdir -p /root/.ssh && chmod 700 /root/.ssh
# Paste id_ed25519 (k3s-mgr-backup key)
chmod 600 /root/.ssh/id_ed25519
ssh-keyscan -p 23 u149513-sub10.your-backup.de >> /root/.ssh/known_hosts

export BORG_RSH='ssh -p 23 -i /root/.ssh/id_ed25519'
export BORG_PASSPHRASE='<from password manager>'

# List archives, then mount the latest
borg list ssh://u149513-sub10@u149513-sub10.your-backup.de/./k3s-cluster
mkdir -p /mnt/borg
borg mount ssh://u149513-sub10@u149513-sub10.your-backup.de/./k3s-cluster::<latest> /mnt/borg
```

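Rather than copying the `<latest>` archive name by hand, the newest archive can be selected automatically. A sketch using borg 1.x's `--last`/`--format` options:

```bash
# Helper sketch: mount the most recent archive of a borg repo.
mount_latest() {  # mount_latest <repo-url> <mountpoint>
  latest=$(borg list --last 1 --format '{archive}' "$1") || return 1
  [ -n "$latest" ] || { echo "no archives found in $1" >&2; return 1; }
  mkdir -p "$2"
  borg mount "$1::$latest" "$2"
}

# mount_latest 'ssh://u149513-sub10@u149513-sub10.your-backup.de/./k3s-cluster' /mnt/borg
```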
### 3. Restore Token & Install K3s

```bash
mkdir -p /var/lib/rancher/k3s/server
cp /mnt/borg/var/lib/rancher/k3s/server/token /var/lib/rancher/k3s/server/token

curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.34.4+k3s1" sh -s - server \
  --node-taint CriticalAddonsOnly=true:NoSchedule \
  --flannel-iface enp7s0 \
  --cluster-cidr 10.42.0.0/16 \
  --service-cidr 10.43.0.0/16 \
  --tls-san $(curl -s http://169.254.169.254/hetzner/v1/metadata/public-ipv4) \
  --token "$(cat /var/lib/rancher/k3s/server/token)"
```

Workers will auto-reconnect using the same token. Verify:

```bash
kubectl get nodes
```

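Worker reconnection can take a minute or two, so a one-shot `kubectl get nodes` may show NotReady at first. A small polling helper (a sketch; this cluster expects 3 nodes):

```bash
# Sketch: poll until <want> nodes report Ready, or time out.
wait_nodes_ready() {  # wait_nodes_ready <want> <timeout-seconds>
  want=$1; deadline=$(( $(date +%s) + $2 ))
  while :; do
    # Column 2 of `kubectl get nodes --no-headers` is STATUS.
    ready=$(kubectl get nodes --no-headers 2>/dev/null \
      | awk '$2 == "Ready" { n++ } END { print n + 0 }')
    if [ "$ready" -ge "$want" ]; then echo "$ready/$want nodes Ready"; return 0; fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timeout: $ready/$want Ready" >&2; return 1
    fi
    sleep 5
  done
}

# wait_nodes_ready 3 300   # 3 nodes, 5-minute timeout
```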
### 4. Restore Manifests & Config

```bash
cp /mnt/borg/etc/rancher/k3s/k3s.yaml /etc/rancher/k3s/k3s.yaml
cp -r /mnt/borg/var/lib/rancher/k3s/server/manifests/* /var/lib/rancher/k3s/server/manifests/
```

### 5. Reinstall Operators (if not auto-recovered)

K3s keeps state in SQLite — if workers retained their state, pods may already be running. Check first:

```bash
kubectl get pods -A
```

If operators are missing:

```bash
# cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.17.2/cert-manager.yaml

# CloudNativePG
kubectl apply --server-side -f https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.25/releases/cnpg-1.25.1.yaml
```

Traefik runs as a DaemonSet on the workers, so it should already be running.

### 6. Re-apply HA Spread Constraints

These are runtime patches that don't survive a fresh control plane:

```bash
# CoreDNS: 3 replicas
kubectl -n kube-system scale deployment coredns --replicas=3

# CNPG operator: 2 replicas with topology spread
kubectl -n cnpg-system scale deployment cnpg-controller-manager --replicas=2
```

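The scale command restores the replica count; the topology spread itself is a separate patch. A hypothetical sketch of what that patch could look like — the label selector `app.kubernetes.io/name: cloudnative-pg` is an assumption, so verify the deployment's actual pod labels before using it:

```bash
# Hypothetical sketch: spread the two CNPG operator replicas across nodes.
# Check labels first: kubectl -n cnpg-system get pods --show-labels
apply_cnpg_spread() {
  kubectl -n cnpg-system patch deployment cnpg-controller-manager --type merge -p \
    '{"spec":{"template":{"spec":{"topologySpreadConstraints":[{"maxSkew":1,"topologyKey":"kubernetes.io/hostname","whenUnsatisfiable":"ScheduleAnyway","labelSelector":{"matchLabels":{"app.kubernetes.io/name":"cloudnative-pg"}}}]}}}}'
}

# apply_cnpg_spread
```

`ScheduleAnyway` keeps the operator schedulable even when only one worker is up, which matters during exactly the kind of outage this runbook covers.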
### 7. Restore Backup Infrastructure

```bash
# Borg passphrase: copy from password manager to /root/.borg-passphrase
chmod 600 /root/.borg-passphrase

# Restore backup script & helpers
mkdir -p /var/backup
cp /mnt/borg/var/backup/RESTORE.md /var/backup/RESTORE.md
# Or just re-run the OpenClaw setup (Hoid will recreate them)

# Restore the backup cron job
echo "30 3 * * * /root/k3s-backup.sh >> /var/log/k3s-backup.log 2>&1" | crontab -

# Unmount the backup archive
borg umount /mnt/borg
```

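A quick sanity check that the pieces from this step actually landed:

```bash
# Sketch: verify the cron entry and passphrase file step 7 put in place.
check_backup_setup() {  # check_backup_setup [passphrase-file]
  pass=${1:-/root/.borg-passphrase}
  ok=0
  crontab -l 2>/dev/null | grep -q '/root/k3s-backup.sh' \
    || { echo "missing: backup cron entry" >&2; ok=1; }
  [ -f "$pass" ] || { echo "missing: $pass" >&2; ok=1; }
  [ "$ok" -eq 0 ] && echo "backup setup OK"
  return "$ok"
}

# check_backup_setup
```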
### 8. Verify Everything

```bash
kubectl get nodes                         # 3 nodes Ready
kubectl get pods -A                       # All pods Running
kubectl -n postgres get cluster main-db   # CNPG healthy
curl -k https://docfast.dev/health        # App responding
borg-list                                 # Backup accessible
borg-backup                               # Test backup works
```

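For unattended runs, the same checks can be wrapped so every failure is counted and the script exits non-zero. A sketch — the `borg-list`/`borg-backup` helpers are omitted here; add them once step 7 has restored them:

```bash
# Sketch: run each verification command, count failures, exit non-zero if any.
verify_all() {
  failures=0
  for c in \
      "kubectl get nodes" \
      "kubectl get pods -A" \
      "kubectl -n postgres get cluster main-db" \
      "curl -fk https://docfast.dev/health"; do
    if $c >/dev/null 2>&1; then
      echo "ok   $c"
    else
      echo "FAIL $c"; failures=$((failures + 1))
    fi
  done
  echo "$failures failure(s)"
  [ "$failures" -eq 0 ]
}

# verify_all && echo "cluster restored"
```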
### Total Downtime Estimate

- Server provisioning: ~2 min (Hetzner API)
- K3s install + worker reconnect: ~3 min
- Operator recovery: ~2 min
- **Total: ~10 minutes** (workloads unaffected during this time)