diff --git a/skills/k3s-infra/references/RESTORE-FULL.md b/skills/k3s-infra/references/RESTORE-FULL.md
new file mode 100644
index 0000000..63e9eb1
--- /dev/null
+++ b/skills/k3s-infra/references/RESTORE-FULL.md
@@ -0,0 +1,150 @@

# K3s Cluster Restore Guide

## Prerequisites
- Fresh Ubuntu 24.04 server (CAX11 ARM64, Hetzner)
- Borg backup repo access: `ssh://u149513-sub10@u149513-sub10.your-backup.de/./k3s-cluster`
- SSH key for Storage Box: `/root/.ssh/id_ed25519` (fingerprint: `k3s-mgr-backup`)
- Borg passphrase (stored in password manager)

## 1. Install Borg & Mount Backup

```bash
apt update && apt install -y borgbackup python3-pyfuse3
export BORG_RSH='ssh -p 23 -i /root/.ssh/id_ed25519'
export BORG_PASSPHRASE=''   # paste from password manager

# List available archives
borg list ssh://u149513-sub10@u149513-sub10.your-backup.de/./k3s-cluster

# Mount latest archive
mkdir -p /mnt/borg
borg mount ssh://u149513-sub10@u149513-sub10.your-backup.de/./k3s-cluster:: /mnt/borg
```

## 2. Install K3s (Control Plane)

```bash
# Restore K3s token (needed for worker rejoin)
mkdir -p /var/lib/rancher/k3s/server
cp /mnt/borg/var/lib/rancher/k3s/server/token /var/lib/rancher/k3s/server/token

# Install K3s server (tainted, no workloads on mgr)
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.34.4+k3s1" sh -s - server \
  --node-taint CriticalAddonsOnly=true:NoSchedule \
  --flannel-iface enp7s0 \
  --cluster-cidr 10.42.0.0/16 \
  --service-cidr 10.43.0.0/16 \
  --tls-san 188.34.201.101 \
  --token "$(cat /var/lib/rancher/k3s/server/token)"
```

## 3. Rejoin Worker Nodes

On each worker (k3s-w1: 159.69.23.121, k3s-w2: 46.225.169.60), using the token restored in step 2:
```bash
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.34.4+k3s1" \
  K3S_URL=https://188.34.201.101:6443 \
  K3S_TOKEN="" \
  sh -s - agent --flannel-iface enp7s0
```

## 4. Restore K3s Manifests & Config

```bash
cp /mnt/borg/etc/rancher/k3s/k3s.yaml /etc/rancher/k3s/k3s.yaml
cp -r /mnt/borg/var/lib/rancher/k3s/server/manifests/* /var/lib/rancher/k3s/server/manifests/
```

## 5. Install Operators

```bash
# cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.17.2/cert-manager.yaml

# CloudNativePG
kubectl apply --server-side -f https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.25/releases/cnpg-1.25.1.yaml

# Traefik (via Helm)
helm repo add traefik https://traefik.github.io/charts
helm install traefik traefik/traefik -n kube-system \
  --set deployment.kind=DaemonSet \
  --set nodeSelector."kubernetes\.io/os"=linux \
  --set tolerations={}
```

## 6. Restore K8s Manifests (Namespaces, Secrets, Services)

```bash
# Apply namespace manifests from backup
for ns in postgres docfast docfast-staging snapapi snapapi-staging cert-manager; do
  kubectl apply -f /mnt/borg/var/backup/manifests/${ns}.yaml
done
```

**Review before applying** — manifests contain secrets and may reference node-specific IPs.

## 7. Restore CNPG PostgreSQL

Option A: Let CNPG create a fresh cluster, then restore from SQL dumps:
```bash
# After the CNPG cluster is healthy:
PRIMARY=$(kubectl -n postgres get pods -l cnpg.io/cluster=main-db,role=primary -o name | head -1)

for db in docfast docfast_staging snapapi snapapi_staging; do
  # Create database if needed
  kubectl -n postgres exec "$PRIMARY" -- psql -U postgres -c "CREATE DATABASE ${db};" 2>/dev/null || true
  # Restore dump
  kubectl -n postgres exec -i "$PRIMARY" -- psql -U postgres "${db}" < /mnt/borg/var/backup/postgresql/${db}.sql
done
```

Option B: Restore CNPG from backup (if configured with barman/S3 — not currently used).

## 8. Restore Application Deployments

Deployments are in the namespace manifests, but it's cleaner to redeploy from CI/CD:
```bash
# DocFast: push to main (staging) or tag v* (prod) on Forgejo
# SnapAPI: same workflow
```

Or apply from backup manifests:
```bash
kubectl apply -f /mnt/borg/var/backup/manifests/docfast.yaml
kubectl apply -f /mnt/borg/var/backup/manifests/snapapi.yaml
```

## 9. Verify

```bash
kubectl get nodes                         # All 3 nodes Ready
kubectl get pods -A                       # All pods Running
kubectl -n postgres get cluster main-db   # CNPG healthy
curl -k https://docfast.dev/health        # App responding
```

## 10. Post-Restore Checklist

- [ ] DNS: `docfast.dev` A record → 46.225.37.135 (Hetzner LB)
- [ ] Hetzner LB targets updated (w1 + w2 on ports 80/443)
- [ ] Let's Encrypt certs issued (cert-manager auto-renews)
- [ ] Stripe webhook endpoint updated if IP changed
- [ ] HA spread constraints re-applied (CoreDNS 3 replicas, CNPG operator 2 replicas, PgBouncer anti-affinity)
- [ ] Borg backup cron re-enabled: `30 3 * * * /root/k3s-backup.sh`
- [ ] Verify backup works: `borg-backup && borg-list`

## Cluster Info

| Node | IP (Public) | IP (Private) | Role |
|------|-------------|--------------|------|
| k3s-mgr | 188.34.201.101 | 10.0.1.5 | Control plane (tainted) |
| k3s-w1 | 159.69.23.121 | 10.0.1.6 | Worker |
| k3s-w2 | 46.225.169.60 | 10.0.1.7 | Worker |

| Resource | Detail |
|----------|--------|
| Hetzner LB | ID 5834131, IP 46.225.37.135 |
| Private Network | 10.0.0.0/16, ID 11949384 |
| Firewall | coolify-fw, ID 10553199 |
| Storage Box | u149513-sub10.your-backup.de:23 |

diff --git a/skills/k3s-infra/references/RESTORE-MGR.md b/skills/k3s-infra/references/RESTORE-MGR.md
new file mode 100644
index 0000000..bbce10d
--- /dev/null
+++ b/skills/k3s-infra/references/RESTORE-MGR.md
@@ -0,0 +1,137 @@

# Quick Restore: Control Plane (k3s-mgr) Only

Use this when only k3s-mgr is down.
Workers and workloads keep running — they just can't be managed until the control plane is back.

## What Still Works Without k3s-mgr

- ✅ Running pods continue serving traffic (DocFast, SnapAPI)
- ✅ Hetzner LB routes to workers directly
- ✅ CNPG PostgreSQL (runs on workers, auto-failover between w1/w2)
- ✅ Traefik ingress (DaemonSet on workers)
- ❌ `kubectl` commands fail
- ❌ No new deployments, scaling, or pod scheduling
- ❌ cert-manager can't renew certs (but existing certs remain valid for up to 90 days from issuance)
- ❌ No etcd/SQLite state changes

## Steps

### 1. Provision New Server

Hetzner Cloud → CAX11 ARM64, Ubuntu 24.04, nbg1 datacenter.
Assign to private network 10.0.0.0/16, set IP to **10.0.1.5**.
Update the new public IP in:
- Hetzner Firewall (allow 6443 from runner IP)
- DNS if applicable
- SSH config

### 2. Install Borg & Restore Backup

```bash
apt update && apt install -y borgbackup python3-pyfuse3

# Copy SSH key for Storage Box (from password manager or another host)
mkdir -p /root/.ssh && chmod 700 /root/.ssh
# Paste id_ed25519 (k3s-mgr-backup key)
chmod 600 /root/.ssh/id_ed25519
ssh-keyscan -p 23 u149513-sub10.your-backup.de >> /root/.ssh/known_hosts

export BORG_RSH='ssh -p 23 -i /root/.ssh/id_ed25519'
export BORG_PASSPHRASE=''   # paste from password manager

# List & mount latest
borg list ssh://u149513-sub10@u149513-sub10.your-backup.de/./k3s-cluster
mkdir -p /mnt/borg
borg mount ssh://u149513-sub10@u149513-sub10.your-backup.de/./k3s-cluster:: /mnt/borg
```

### 3. Restore Token & Install K3s

```bash
mkdir -p /var/lib/rancher/k3s/server
cp /mnt/borg/var/lib/rancher/k3s/server/token /var/lib/rancher/k3s/server/token

curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.34.4+k3s1" sh -s - server \
  --node-taint CriticalAddonsOnly=true:NoSchedule \
  --flannel-iface enp7s0 \
  --cluster-cidr 10.42.0.0/16 \
  --service-cidr 10.43.0.0/16 \
  --tls-san "$(curl -s http://169.254.169.254/hetzner/v1/metadata/public-ipv4)" \
  --token "$(cat /var/lib/rancher/k3s/server/token)"
```

Workers will auto-reconnect using the same token. Verify:
```bash
kubectl get nodes
```

### 4. Restore Manifests & Config

```bash
cp /mnt/borg/etc/rancher/k3s/k3s.yaml /etc/rancher/k3s/k3s.yaml
cp -r /mnt/borg/var/lib/rancher/k3s/server/manifests/* /var/lib/rancher/k3s/server/manifests/
```

### 5. Reinstall Operators (if not auto-recovered)

K3s keeps its cluster state in SQLite on the control plane. Depending on what the workers retained, some pods may already be running — check first:
```bash
kubectl get pods -A
```

If operators are missing:
```bash
# cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.17.2/cert-manager.yaml

# CloudNativePG
kubectl apply --server-side -f https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.25/releases/cnpg-1.25.1.yaml
```

Traefik runs as a DaemonSet on the workers — it should already be running.

### 6. Re-apply HA Spread Constraints

These are runtime patches that don't survive a fresh control plane:
```bash
# CoreDNS: 3 replicas
kubectl -n kube-system scale deployment coredns --replicas=3

# CNPG operator: 2 replicas with topology spread
kubectl -n cnpg-system scale deployment cnpg-controller-manager --replicas=2
```

### 7. Restore Backup Infrastructure

```bash
# Borg passphrase
# Copy from password manager to /root/.borg-passphrase
chmod 600 /root/.borg-passphrase

# Restore backup script & helpers
mkdir -p /var/backup
cp /mnt/borg/var/backup/RESTORE.md /var/backup/RESTORE.md
# Or just re-run the OpenClaw setup (Hoid will recreate them)

# Restore cron
echo "30 3 * * * /root/k3s-backup.sh >> /var/log/k3s-backup.log 2>&1" | crontab -

# Unmount
borg umount /mnt/borg
```

### 8. Verify Everything

```bash
kubectl get nodes                         # 3 nodes Ready
kubectl get pods -A                       # All pods Running
kubectl -n postgres get cluster main-db   # CNPG healthy
curl -k https://docfast.dev/health        # App responding
borg-list                                 # Backup accessible
borg-backup                               # Test backup works
```

### Total Downtime Estimate

- Server provisioning: ~2 min (Hetzner API)
- K3s install + worker reconnect: ~3 min
- Operator recovery: ~2 min
- **Total: ~10 minutes** including manual steps (workloads unaffected throughout)
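## Appendix: Backup Script Sketch

Both guides restore what `/root/k3s-backup.sh` produces, but neither shows the script itself. The sketch below reconstructs what it plausibly does from the paths the restore steps read back; the archive naming, retention policy, manifest export list, and the `K3S_BACKUP_RUN` guard are assumptions of this sketch, not the recovered script. Compare against the copy in the borg archive before reusing it.

```shell
#!/usr/bin/env bash
# Sketch of /root/k3s-backup.sh. Repo URL, key path, and backed-up paths
# mirror the restore steps above; archive naming, retention, the manifest
# export list, and the K3S_BACKUP_RUN guard are assumptions of this sketch.
set -euo pipefail

backup() {
  export BORG_RSH='ssh -p 23 -i /root/.ssh/id_ed25519'
  BORG_PASSPHRASE="$(cat /root/.borg-passphrase)"
  export BORG_PASSPHRASE
  local repo='ssh://u149513-sub10@u149513-sub10.your-backup.de/./k3s-cluster'

  # Fresh SQL dumps into the staging dir step 7 of the full restore reads
  mkdir -p /var/backup/postgresql /var/backup/manifests
  local primary
  primary=$(kubectl -n postgres get pods -l cnpg.io/cluster=main-db,role=primary -o name | head -1)
  for db in docfast docfast_staging snapapi snapapi_staging; do
    kubectl -n postgres exec "$primary" -- pg_dump -U postgres "$db" \
      > "/var/backup/postgresql/${db}.sql"
  done

  # Per-namespace manifest exports (the resource list is an assumption)
  for ns in postgres docfast docfast-staging snapapi snapapi-staging cert-manager; do
    kubectl get all,secret,configmap,ingress -n "$ns" -o yaml \
      > "/var/backup/manifests/${ns}.yaml"
  done

  # Archive exactly the paths the restore guides expect to find
  borg create --compression zstd "${repo}::k3s-{now:%Y-%m-%d}" \
    /etc/rancher/k3s \
    /var/lib/rancher/k3s/server/token \
    /var/lib/rancher/k3s/server/manifests \
    /var/backup

  # Keep a bounded history (retention values are an assumption)
  borg prune --keep-daily 7 --keep-weekly 4 "$repo"
}

# Guard added for safe sourcing and sanity-checking; without it the script
# runs immediately, matching the original crontab line.
if [ "${K3S_BACKUP_RUN:-}" = "yes" ]; then
  backup
fi
```

If the guard is kept, the cron line becomes `30 3 * * * K3S_BACKUP_RUN=yes /root/k3s-backup.sh >> /var/log/k3s-backup.log 2>&1`; drop the guard to keep the crontab entry from step 7 unchanged.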