# Quick Restore: Control Plane (k3s-mgr) Only

Use this when only k3s-mgr is down. Workers and workloads keep running — they just can't be managed until the control plane is back.

## What Still Works Without k3s-mgr

- ✅ Running pods continue serving traffic (DocFast, SnapAPI)
- ✅ Hetzner LB routes to workers directly
- ✅ CNPG PostgreSQL (runs on workers, auto-failover between w1/w2)
- ✅ Traefik ingress (DaemonSet on workers)
- ❌ `kubectl` commands fail
- ❌ No new deployments, scaling, or pod scheduling
- ❌ cert-manager can't renew certs (but existing certs stay valid for up to 90 days)
- ❌ No etcd/SQLite state changes

## Steps

### 1. Provision New Server

Hetzner Cloud → CAX11 ARM64, Ubuntu 24.04, nbg1 datacenter. Assign to private network 10.0.0.0/16, set IP to **10.0.1.5**.

Update the public IP in:

- Hetzner Firewall (allow 6443 from runner IP)
- DNS if applicable
- SSH config

### 2. Install Borg & Restore Backup

```bash
apt update && apt install -y borgbackup python3-pyfuse3

# Copy SSH key for Storage Box (from password manager or another host)
mkdir -p /root/.ssh && chmod 700 /root/.ssh
# Paste id_ed25519 (k3s-mgr-backup key)
chmod 600 /root/.ssh/id_ed25519
ssh-keyscan -p 23 u149513-sub10.your-backup.de >> /root/.ssh/known_hosts

export BORG_RSH='ssh -p 23 -i /root/.ssh/id_ed25519'
export BORG_PASSPHRASE=''

# List archives, then mount the repository
borg list ssh://u149513-sub10@u149513-sub10.your-backup.de/./k3s-cluster
mkdir -p /mnt/borg
borg mount ssh://u149513-sub10@u149513-sub10.your-backup.de/./k3s-cluster:: /mnt/borg
```
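Before moving on to the K3s install, it is worth confirming that the mounted archive actually contains the files the later steps copy. This sketch assumes the paths used in steps 3–4; the `check_restore_files` helper and the scratch-directory demo are illustrative, not part of the backup tooling:

```shell
# Verify that a restored backup tree contains the files the following
# steps depend on. The argument is the mount point (e.g. /mnt/borg).
check_restore_files() {
  base="$1"
  missing=0
  for f in \
    var/lib/rancher/k3s/server/token \
    etc/rancher/k3s/k3s.yaml \
    var/lib/rancher/k3s/server/manifests
  do
    if [ ! -e "$base/$f" ]; then
      echo "MISSING: $f"
      missing=1
    fi
  done
  [ "$missing" -eq 0 ] && echo "OK: all expected files present"
  return "$missing"
}

# Demo against a scratch tree standing in for /mnt/borg:
demo=$(mktemp -d)
mkdir -p "$demo/var/lib/rancher/k3s/server/manifests" "$demo/etc/rancher/k3s"
touch "$demo/var/lib/rancher/k3s/server/token" "$demo/etc/rancher/k3s/k3s.yaml"
check_restore_files "$demo"   # → OK: all expected files present
rm -rf "$demo"
```

On the real host, run `check_restore_files /mnt/borg` and stop if anything is reported missing — re-check the archive chosen from `borg list` first.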
### 3. Restore Token & Install K3s

```bash
mkdir -p /var/lib/rancher/k3s/server
cp /mnt/borg/var/lib/rancher/k3s/server/token /var/lib/rancher/k3s/server/token

curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.34.4+k3s1" sh -s - server \
  --node-taint CriticalAddonsOnly=true:NoSchedule \
  --flannel-iface enp7s0 \
  --cluster-cidr 10.42.0.0/16 \
  --service-cidr 10.43.0.0/16 \
  --tls-san $(curl -s http://169.254.169.254/hetzner/v1/metadata/public-ipv4) \
  --token "$(cat /var/lib/rancher/k3s/server/token)"
```

Workers will auto-reconnect using the same token. Verify:

```bash
kubectl get nodes
```

### 4. Restore Manifests & Config

```bash
cp /mnt/borg/etc/rancher/k3s/k3s.yaml /etc/rancher/k3s/k3s.yaml
cp -r /mnt/borg/var/lib/rancher/k3s/server/manifests/* /var/lib/rancher/k3s/server/manifests/
```

### 5. Reinstall Operators (if not auto-recovered)

K3s keeps state in SQLite — if workers retained their state, pods may already be running. Check first:

```bash
kubectl get pods -A
```

If operators are missing:

```bash
# cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.17.2/cert-manager.yaml

# CloudNativePG
kubectl apply --server-side -f https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.25/releases/cnpg-1.25.1.yaml
```

Traefik runs as a DaemonSet on the workers, so it should already be running.

### 6. Re-apply HA Spread Constraints

These are runtime patches that don't survive a fresh control plane:

```bash
# CoreDNS: 3 replicas
kubectl -n kube-system scale deployment coredns --replicas=3

# CNPG operator: 2 replicas with topology spread
kubectl -n cnpg-system scale deployment cnpg-controller-manager --replicas=2
```
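The "workers reconnected" check can be scripted rather than eyeballed. A minimal sketch that counts Ready nodes from `kubectl get nodes --no-headers` output; the `ready_count`/`wait_ready` names, the expected count of 3, and the sample node names are assumptions based on this cluster:

```shell
# Count Ready nodes from `kubectl get nodes --no-headers` fed on stdin.
# Note: a cordoned node shows "Ready,SchedulingDisabled" and would not
# be counted by this simple check.
ready_count() {
  awk '$2 == "Ready" { n++ } END { print n+0 }'
}

# Poll until the expected number of nodes is Ready.
wait_ready() {   # usage: wait_ready 3
  expected="$1"
  until [ "$(kubectl get nodes --no-headers 2>/dev/null | ready_count)" -eq "$expected" ]; do
    echo "waiting for $expected Ready nodes..."
    sleep 5
  done
  echo "all $expected nodes Ready"
}

# ready_count can be exercised without a cluster:
printf '%s\n' \
  'k3s-mgr   Ready    control-plane,master   2m    v1.34.4+k3s1' \
  'k3s-w1    Ready    <none>                 90d   v1.34.4+k3s1' \
  'k3s-w2    NotReady <none>                 90d   v1.34.4+k3s1' \
  | ready_count   # → 2
```

Running `wait_ready 3` after the install gives a clean go/no-go signal before reapplying manifests and spread constraints.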
### 7. Restore Backup Infrastructure

```bash
# Borg passphrase
# Copy from password manager to /root/.borg-passphrase
chmod 600 /root/.borg-passphrase

# Restore backup script & helpers
mkdir -p /var/backup
cp /mnt/borg/var/backup/RESTORE.md /var/backup/RESTORE.md
# Or just re-run the OpenClaw setup (Hoid will recreate them)

# Restore cron
echo "30 3 * * * /root/k3s-backup.sh >> /var/log/k3s-backup.log 2>&1" | crontab -

# Unmount borg
umount /mnt/borg
```

### 8. Verify Everything

```bash
kubectl get nodes                        # 3 nodes Ready
kubectl get pods -A                      # All pods Running
kubectl -n postgres get cluster main-db  # CNPG healthy
curl -k https://docfast.dev/health       # App responding
borg-list                                # Backup accessible
borg-backup                              # Test backup works
```

### Total Downtime Estimate

- Server provisioning: ~2 min (Hetzner API)
- K3s install + worker reconnect: ~3 min
- Operator recovery: ~2 min
- **Total: ~10 minutes** including manual steps (workloads unaffected during this time)
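For the "all pods Running" check, a small filter makes triage faster than scanning the full `kubectl get pods -A` listing by eye. A sketch (the `unhealthy_pods` name and the sample output lines are illustrative):

```shell
# From `kubectl get pods -A --no-headers` output on stdin, print every
# pod whose STATUS (field 4: NAMESPACE NAME READY STATUS ...) is neither
# Running nor Completed. Empty output means the check passed.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}

# Live use: kubectl get pods -A --no-headers | unhealthy_pods
# Exercised here on sample output:
printf '%s\n' \
  'kube-system   coredns-abc        1/1   Running            0   5m' \
  'cert-manager  cert-manager-xyz   0/1   CrashLoopBackOff   4   5m' \
  'postgres      main-db-1          1/1   Running            0   5m' \
  | unhealthy_pods   # → cert-manager/cert-manager-xyz CrashLoopBackOff
```

Anything it prints is the pod to investigate first (typically `kubectl -n <ns> describe pod <name>`), before trusting the ~10 minute downtime estimate above.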