# Quick Restore: Control Plane (k3s-mgr) Only

Use this when only k3s-mgr is down. Workers and workloads keep running; they just can't be managed until the control plane is back.

## What Still Works Without k3s-mgr

- ✅ Running pods continue serving traffic (DocFast, SnapAPI)
- ✅ Hetzner LB routes to workers directly
- ✅ CNPG PostgreSQL (runs on workers, auto-failover between w1/w2)
- ✅ Traefik ingress (DaemonSet on workers)
- ❌ `kubectl` commands fail
- ❌ No new deployments, scaling, or pod scheduling
- ❌ cert-manager can't renew certificates (existing certificates remain valid for up to 90 days from issuance)
- ❌ No etcd/SQLite state changes
## Steps

### 1. Provision New Server

Hetzner Cloud → CAX11 ARM64, Ubuntu 24.04, nbg1 datacenter.
Assign it to the private network 10.0.0.0/16 and set its private IP to **10.0.1.5**.

Update the new server's public IP in:

- Hetzner Firewall (allow 6443 from the runner IP)
- DNS, if applicable
- SSH config
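If you provision from the CLI instead of the console, the equivalent `hcloud` calls look roughly like this. The SSH key and network names are placeholders (assumptions), not values from this runbook; check them against your project:

```shell
# Create the replacement control-plane server (key name is a placeholder)
hcloud server create --name k3s-mgr --type cax11 --image ubuntu-24.04 \
  --location nbg1 --ssh-key <your-key-name>

# Attach it to the private network with the fixed IP the workers expect
hcloud server attach-to-network k3s-mgr --network <private-net-name> --ip 10.0.1.5
```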
### 2. Install Borg & Restore Backup

```bash
apt update && apt install -y borgbackup python3-pyfuse3

# Copy the SSH key for the Storage Box (from the password manager or another host)
mkdir -p /root/.ssh && chmod 700 /root/.ssh
# Paste id_ed25519 (k3s-mgr-backup key)
chmod 600 /root/.ssh/id_ed25519
ssh-keyscan -p 23 u149513-sub10.your-backup.de >> /root/.ssh/known_hosts

export BORG_RSH='ssh -p 23 -i /root/.ssh/id_ed25519'
export BORG_PASSPHRASE='<from password manager>'

# List archives, then mount the latest
borg list ssh://u149513-sub10@u149513-sub10.your-backup.de/./k3s-cluster
mkdir -p /mnt/borg
borg mount ssh://u149513-sub10@u149513-sub10.your-backup.de/./k3s-cluster::<latest> /mnt/borg
```
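If the FUSE mount fails (e.g. pyfuse3 problems on a fresh image), the same files can be pulled with `borg extract` instead, which writes paths relative to the current directory:

```shell
# Fallback: extract only the needed paths instead of mounting.
# Archive name comes from `borg list`; archive paths have no leading slash.
mkdir -p /tmp/restore && cd /tmp/restore
borg extract ssh://u149513-sub10@u149513-sub10.your-backup.de/./k3s-cluster::<latest> \
  var/lib/rancher/k3s/server/token etc/rancher/k3s/k3s.yaml
```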
### 3. Restore Token & Install K3s

```bash
mkdir -p /var/lib/rancher/k3s/server
cp /mnt/borg/var/lib/rancher/k3s/server/token /var/lib/rancher/k3s/server/token

curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.34.4+k3s1" sh -s - server \
  --node-taint CriticalAddonsOnly=true:NoSchedule \
  --flannel-iface enp7s0 \
  --cluster-cidr 10.42.0.0/16 \
  --service-cidr 10.43.0.0/16 \
  --tls-san "$(curl -s http://169.254.169.254/hetzner/v1/metadata/public-ipv4)" \
  --token "$(cat /var/lib/rancher/k3s/server/token)"
```

Workers will auto-reconnect using the same token. Verify:

```bash
kubectl get nodes
```
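The CIDR flags passed to the installer must not overlap each other or the node's private network. A quick offline sanity check with Python's `ipaddress` module, using the values from this runbook:

```python
import ipaddress

private_net = ipaddress.ip_network("10.0.0.0/16")    # Hetzner private network
cluster_cidr = ipaddress.ip_network("10.42.0.0/16")  # --cluster-cidr
service_cidr = ipaddress.ip_network("10.43.0.0/16")  # --service-cidr
node_ip = ipaddress.ip_address("10.0.1.5")           # k3s-mgr private IP

# The node IP must sit inside the private network; the k3s CIDRs must not
# collide with it or with each other.
assert node_ip in private_net
assert not cluster_cidr.overlaps(private_net)
assert not service_cidr.overlaps(private_net)
assert not service_cidr.overlaps(cluster_cidr)
print("ok")  # prints "ok" when no range collides
```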
### 4. Restore Manifests & Config

```bash
cp /mnt/borg/etc/rancher/k3s/k3s.yaml /etc/rancher/k3s/k3s.yaml
cp -r /mnt/borg/var/lib/rancher/k3s/server/manifests/* /var/lib/rancher/k3s/server/manifests/
```
### 5. Reinstall Operators (if not auto-recovered)

K3s keeps its state in SQLite. If the workers retained their state, pods may already be running; check first:

```bash
kubectl get pods -A
```

If operators are missing:

```bash
# cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.17.2/cert-manager.yaml

# CloudNativePG
kubectl apply --server-side -f https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.25/releases/cnpg-1.25.1.yaml
```

Traefik runs as a DaemonSet on the workers and should already be running.
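Before applying anything that depends on the operators' admission webhooks, it's worth waiting for their rollouts to finish. Deployment names below are the ones shipped in the upstream manifests; verify against `kubectl get deploy -A` if in doubt:

```shell
# Block until the operator deployments report a successful rollout
kubectl -n cert-manager rollout status deployment/cert-manager-webhook --timeout=120s
kubectl -n cnpg-system rollout status deployment/cnpg-controller-manager --timeout=120s
```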
### 6. Re-apply HA Spread Constraints

These are runtime patches that don't survive a fresh control plane:

```bash
# CoreDNS: 3 replicas
kubectl -n kube-system scale deployment coredns --replicas=3

# CNPG operator: 2 replicas with topology spread
kubectl -n cnpg-system scale deployment cnpg-controller-manager --replicas=2
```
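The scale command alone doesn't add the topology spread. A sketch of a strategic-merge patch that spreads the two operator replicas across nodes; the `matchLabels` selector is an assumption, verify it with `kubectl -n cnpg-system get pods --show-labels`:

```yaml
# spread-patch.yaml (hypothetical file name): spread CNPG operator
# replicas across nodes. The pod label below is an assumption.
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: cloudnative-pg
```

Apply it with `kubectl -n cnpg-system patch deployment cnpg-controller-manager --patch-file spread-patch.yaml`.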
### 7. Restore Backup Infrastructure

```bash
# Borg passphrase: copy from the password manager to /root/.borg-passphrase
chmod 600 /root/.borg-passphrase

# Restore the backup script & helpers
mkdir -p /var/backup
cp /mnt/borg/var/backup/RESTORE.md /var/backup/RESTORE.md
# Or just re-run the OpenClaw setup (Hoid will recreate them)

# Restore the cron job
echo "30 3 * * * /root/k3s-backup.sh >> /var/log/k3s-backup.log 2>&1" | crontab -

# Unmount the backup
borg umount /mnt/borg
```
### 8. Verify Everything

```bash
kubectl get nodes                         # 3 nodes Ready
kubectl get pods -A                       # All pods Running
kubectl -n postgres get cluster main-db   # CNPG healthy
curl -k https://docfast.dev/health        # App responding
borg-list                                 # Backup accessible
borg-backup                               # Test backup works
```
### Total Downtime Estimate

- Server provisioning: ~2 min (Hetzner API)
- K3s install + worker reconnect: ~3 min
- Operator recovery: ~2 min
- **Total: ~10 minutes** including the manual restore steps (workloads unaffected throughout)