
K3s Cluster Restore Guide

Prerequisites

  • Fresh Ubuntu 24.04 server (CAX11 ARM64, Hetzner)
  • Borg backup repo access: ssh://u149513-sub10@u149513-sub10.your-backup.de/./k3s-cluster
  • SSH key for Storage Box: /root/.ssh/id_ed25519 (key comment: k3s-mgr-backup)
  • Borg passphrase (stored in password manager)

1. Install Borg & Mount Backup

apt update && apt install -y borgbackup python3-pyfuse3
export BORG_RSH='ssh -p 23 -i /root/.ssh/id_ed25519'
export BORG_PASSPHRASE='<from password manager>'

# List available archives
borg list ssh://u149513-sub10@u149513-sub10.your-backup.de/./k3s-cluster

# Mount latest archive
mkdir -p /mnt/borg
borg mount ssh://u149513-sub10@u149513-sub10.your-backup.de/./k3s-cluster::<archive-name> /mnt/borg

2. Install K3s (Control Plane)

# Restore K3s token (needed for worker rejoin)
mkdir -p /var/lib/rancher/k3s/server
cp /mnt/borg/var/lib/rancher/k3s/server/token /var/lib/rancher/k3s/server/token

# Install K3s server (tainted, no workloads on mgr)
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.34.4+k3s1" sh -s - server \
  --node-taint CriticalAddonsOnly=true:NoSchedule \
  --flannel-iface enp7s0 \
  --cluster-cidr 10.42.0.0/16 \
  --service-cidr 10.43.0.0/16 \
  --tls-san 188.34.201.101 \
  --token "$(cat /var/lib/rancher/k3s/server/token)"

3. Rejoin Worker Nodes

On each worker (k3s-w1: 159.69.23.121, k3s-w2: 46.225.169.60):

curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.34.4+k3s1" \
  K3S_URL=https://188.34.201.101:6443 \
  K3S_TOKEN="<token from step 2>" \
  sh -s - agent --flannel-iface enp7s0

4. Restore K3s Manifests & Config

cp /mnt/borg/etc/rancher/k3s/k3s.yaml /etc/rancher/k3s/k3s.yaml
cp -r /mnt/borg/var/lib/rancher/k3s/server/manifests/* /var/lib/rancher/k3s/server/manifests/

Note: a fresh K3s install generates a new cluster CA, so the backed-up k3s.yaml may carry stale credentials; if kubectl authentication fails after this step, keep the freshly generated /etc/rancher/k3s/k3s.yaml instead.

5. Install Operators

# cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.17.2/cert-manager.yaml

# CloudNativePG
kubectl apply --server-side -f https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.25/releases/cnpg-1.25.1.yaml

# Traefik (via Helm)
helm repo add traefik https://traefik.github.io/charts
helm repo update
helm install traefik traefik/traefik -n kube-system \
  --set deployment.kind=DaemonSet \
  --set 'nodeSelector.kubernetes\.io/os=linux' \
  --set-json 'tolerations=[]'

6. Restore K8s Manifests (Namespaces, Secrets, Services)

Review the manifests before applying: they contain secrets and may reference node-specific IPs.

# Apply namespace manifests from backup
for ns in postgres docfast docfast-staging snapapi snapapi-staging cert-manager; do
  kubectl apply -f /mnt/borg/var/backup/manifests/${ns}.yaml
done
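The backed-up manifests can be pre-screened for embedded Secret objects and hardcoded IPv4 addresses before applying; a minimal sketch (the `scan_manifest` helper and the sample file are hypothetical, not part of the backup):

```shell
# Hypothetical pre-flight helper: list Secret objects and hardcoded IPv4
# addresses in a backed-up manifest before applying it.
scan_manifest() {
  f="$1"
  echo "Secrets:"
  grep -n 'kind: Secret' "$f" || true
  echo "Hardcoded IPs:"
  grep -nE '([0-9]{1,3}\.){3}[0-9]{1,3}' "$f" || true
}

# Example run against a small sample manifest:
cat > /tmp/sample.yaml <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: app-creds
---
apiVersion: v1
kind: Service
spec:
  externalIPs:
    - 188.34.201.101
EOF
scan_manifest /tmp/sample.yaml
```

Anything flagged can be edited in a copy under /tmp before running `kubectl apply` against it.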

7. Restore CNPG PostgreSQL

Option A: Let CNPG create fresh cluster, then restore from SQL dumps:

# After CNPG cluster is healthy:
PRIMARY=$(kubectl -n postgres get pods -l cnpg.io/cluster=main-db,role=primary -o name | head -1)

for db in docfast docfast_staging snapapi snapapi_staging; do
  # Create the database if it does not exist yet
  kubectl -n postgres exec "$PRIMARY" -- psql -U postgres -c "CREATE DATABASE ${db};" 2>/dev/null || true
  # Restore the SQL dump from the mounted Borg archive
  kubectl -n postgres exec -i "$PRIMARY" -- psql -U postgres -d "${db}" < /mnt/borg/var/backup/postgresql/${db}.sql
done

Option B: Restore CNPG from backup (if configured with barman/S3 — not currently used).

8. Restore Application Deployments

Deployments are in the namespace manifests, but it's cleaner to redeploy from CI/CD:

# DocFast: push to main (staging) or tag v* (prod) on Forgejo
# SnapAPI: same workflow

Or apply from backup manifests:

kubectl apply -f /mnt/borg/var/backup/manifests/docfast.yaml
kubectl apply -f /mnt/borg/var/backup/manifests/snapapi.yaml

9. Verify

kubectl get nodes                           # All 3 nodes Ready
kubectl get pods -A                         # All pods Running
kubectl -n postgres get cluster main-db     # CNPG healthy
curl -k https://docfast.dev/health          # App responding

10. Post-Restore Checklist

  • DNS: docfast.dev A record → 46.225.37.135 (Hetzner LB)
  • Hetzner LB targets updated (w1 + w2 on ports 80/443)
  • Let's Encrypt certs issued (cert-manager auto-renews)
  • Stripe webhook endpoint updated if IP changed
  • HA spread constraints re-applied (CoreDNS 3 replicas, CNPG operator 2 replicas, PgBouncer anti-affinity)
  • Borg backup cron re-enabled: 30 3 * * * /root/k3s-backup.sh
  • Verify backup works: borg-backup && borg-list
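
Parts of the checklist above can be spot-checked from the shell; a hedged sketch (the `expect` helper is hypothetical, and the commands assume dig, kubectl, and crontab are available on the restored mgr node):

```shell
# Hypothetical post-restore spot checks; IPs and names are taken from this guide.
expect() {  # expect <label> <expected> <actual>
  if [ "$2" = "$3" ]; then echo "OK   $1"; else echo "FAIL $1 (want $2, got $3)"; fi
}

# DNS A record points at the Hetzner LB:
expect "docfast.dev A record" "46.225.37.135" "$(dig +short docfast.dev A | head -1)"
# CoreDNS spread back to 3 replicas:
expect "coredns replicas" "3" "$(kubectl -n kube-system get deploy coredns -o jsonpath='{.spec.replicas}')"
# Backup cron re-enabled:
crontab -l 2>/dev/null | grep -q k3s-backup.sh && echo "OK   backup cron" || echo "FAIL backup cron"
```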

Cluster Info

| Node    | IP (Public)    | IP (Private) | Role                    |
|---------|----------------|--------------|-------------------------|
| k3s-mgr | 188.34.201.101 | 10.0.1.5     | Control plane (tainted) |
| k3s-w1  | 159.69.23.121  | 10.0.1.6     | Worker                  |
| k3s-w2  | 46.225.169.60  | 10.0.1.7     | Worker                  |

| Resource        | Detail                          |
|-----------------|---------------------------------|
| Hetzner LB      | ID 5834131, IP 46.225.37.135    |
| Private Network | 10.0.0.0/16, ID 11949384        |
| Firewall        | coolify-fw, ID 10553199         |
| Storage Box     | u149513-sub10.your-backup.de:23 |