Add K3s restore guides to infra skill references

parent e2e9ae55f7 · commit dd5a51fdd0
2 changed files with 287 additions and 0 deletions

`skills/k3s-infra/references/RESTORE-FULL.md` (new file, 150 lines)
# K3s Cluster Restore Guide

## Prerequisites

- Fresh Ubuntu 24.04 server (CAX11 ARM64, Hetzner)
- Borg backup repo access: `ssh://u149513-sub10@u149513-sub10.your-backup.de/./k3s-cluster`
- SSH key for Storage Box: `/root/.ssh/id_ed25519` (key comment: `k3s-mgr-backup`)
- Borg passphrase (stored in password manager)
## 1. Install Borg & Mount Backup

```bash
apt update && apt install -y borgbackup python3-pyfuse3
export BORG_RSH='ssh -p 23 -i /root/.ssh/id_ed25519'
export BORG_PASSPHRASE='<from password manager>'

# List available archives
borg list ssh://u149513-sub10@u149513-sub10.your-backup.de/./k3s-cluster

# Mount latest archive
mkdir -p /mnt/borg
borg mount ssh://u149513-sub10@u149513-sub10.your-backup.de/./k3s-cluster::<archive-name> /mnt/borg
```

## 2. Install K3s (Control Plane)

```bash
# Restore K3s token (needed for worker rejoin)
mkdir -p /var/lib/rancher/k3s/server
cp /mnt/borg/var/lib/rancher/k3s/server/token /var/lib/rancher/k3s/server/token

# Install K3s server (tainted, no workloads on mgr)
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.34.4+k3s1" sh -s - server \
  --node-taint CriticalAddonsOnly=true:NoSchedule \
  --flannel-iface enp7s0 \
  --cluster-cidr 10.42.0.0/16 \
  --service-cidr 10.43.0.0/16 \
  --tls-san 188.34.201.101 \
  --token "$(cat /var/lib/rancher/k3s/server/token)"
```
## 3. Rejoin Worker Nodes

On each worker (k3s-w1: 159.69.23.121, k3s-w2: 46.225.169.60):

```bash
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.34.4+k3s1" \
  K3S_URL=https://188.34.201.101:6443 \
  K3S_TOKEN="<token from step 2>" \
  sh -s - agent --flannel-iface enp7s0
```
## 4. Restore K3s Manifests & Config

```bash
cp /mnt/borg/etc/rancher/k3s/k3s.yaml /etc/rancher/k3s/k3s.yaml
cp -r /mnt/borg/var/lib/rancher/k3s/server/manifests/* /var/lib/rancher/k3s/server/manifests/
```
## 5. Install Operators

```bash
# cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.17.2/cert-manager.yaml

# CloudNativePG
kubectl apply --server-side -f https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.25/releases/cnpg-1.25.1.yaml

# Traefik (via Helm)
helm repo add traefik https://traefik.github.io/charts
helm install traefik traefik/traefik -n kube-system \
  --set deployment.kind=DaemonSet \
  --set nodeSelector."kubernetes\.io/os"=linux \
  --set tolerations={}
```
## 6. Restore K8s Manifests (Namespaces, Secrets, Services)

```bash
# Apply namespace manifests from backup
for ns in postgres docfast docfast-staging snapapi snapapi-staging cert-manager; do
  kubectl apply -f /mnt/borg/var/backup/manifests/${ns}.yaml
done
```

**Review before applying** — the backed-up manifests contain secrets and may reference node-specific IPs.
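One way to do that review (a sketch using the same backup paths) is to list each file's resource kinds and run a client-side dry run before the real apply:

```shell
# Preview what each backed-up manifest would create, without touching the cluster.
for ns in postgres docfast docfast-staging snapapi snapapi-staging cert-manager; do
  echo "=== ${ns} ==="
  # Count resource kinds in the file (Secret, Service, Deployment, ...)
  grep -E '^kind:' /mnt/borg/var/backup/manifests/${ns}.yaml | sort | uniq -c
  # Validate without sending anything to the API server
  kubectl apply --dry-run=client -f /mnt/borg/var/backup/manifests/${ns}.yaml
done
```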
## 7. Restore CNPG PostgreSQL

Option A: Let CNPG create a fresh cluster, then restore from SQL dumps:

```bash
# After the CNPG cluster is healthy:
PRIMARY=$(kubectl -n postgres get pods -l cnpg.io/cluster=main-db,role=primary -o name | head -1)

for db in docfast docfast_staging snapapi snapapi_staging; do
  # Create database if needed
  kubectl -n postgres exec $PRIMARY -- psql -U postgres -c "CREATE DATABASE ${db};" 2>/dev/null || true
  # Restore dump
  kubectl -n postgres exec -i $PRIMARY -- psql -U postgres "${db}" < /mnt/borg/var/backup/postgresql/${db}.sql
done
```

Option B: Restore CNPG from its own backup (if configured with barman/S3 — not currently used).
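If a barman object store were ever configured, the CNPG recovery bootstrap would look roughly like this; everything below (bucket path, credentials secret, replica count) is a hypothetical sketch, not current config:

```shell
# Hypothetical: CNPG bootstrap-from-backup, only valid once barmanObjectStore backups exist.
kubectl apply -f - <<'EOF'
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: main-db
  namespace: postgres
spec:
  instances: 2
  bootstrap:
    recovery:
      source: main-db-backup
  externalClusters:
    - name: main-db-backup
      barmanObjectStore:
        destinationPath: s3://example-bucket/main-db   # assumption: no bucket exists today
        s3Credentials:
          accessKeyId:
            name: backup-creds                          # hypothetical secret name
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: backup-creds
            key: SECRET_ACCESS_KEY
EOF
```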
## 8. Restore Application Deployments

Deployments are included in the namespace manifests, but it's cleaner to redeploy from CI/CD:

```bash
# DocFast: push to main (staging) or push a v* tag (prod) on Forgejo
git push origin main                       # redeploys staging
git tag v1.2.3 && git push origin v1.2.3   # redeploys prod (example tag)
# SnapAPI: same workflow
```

Or apply from backup manifests:

```bash
kubectl apply -f /mnt/borg/var/backup/manifests/docfast.yaml
kubectl apply -f /mnt/borg/var/backup/manifests/snapapi.yaml
```
## 9. Verify

```bash
kubectl get nodes                          # All 3 nodes Ready
kubectl get pods -A                        # All pods Running
kubectl -n postgres get cluster main-db    # CNPG healthy
curl -k https://docfast.dev/health         # App responding
```
## 10. Post-Restore Checklist

- [ ] DNS: `docfast.dev` A record → 46.225.37.135 (Hetzner LB)
- [ ] Hetzner LB targets updated (w1 + w2 on ports 80/443)
- [ ] Let's Encrypt certs issued (cert-manager auto-renews)
- [ ] Stripe webhook endpoint updated if IP changed
- [ ] HA spread constraints re-applied (CoreDNS 3 replicas, CNPG operator 2 replicas, PgBouncer anti-affinity)
- [ ] Borg backup cron re-enabled: `30 3 * * * /root/k3s-backup.sh`
- [ ] Verify backup works: `borg-backup && borg-list`
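A few of these items can be spot-checked from a shell (a sketch; the expected LB IP is the one listed in the Cluster Info tables):

```shell
# DNS should resolve to the Hetzner LB (expect 46.225.37.135)
dig +short docfast.dev

# TLS cert should be issued by Let's Encrypt and within its validity window
echo | openssl s_client -connect docfast.dev:443 -servername docfast.dev 2>/dev/null \
  | openssl x509 -noout -issuer -dates

# Backup cron should be installed
crontab -l | grep k3s-backup.sh
```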
## Cluster Info

| Node | IP (Public) | IP (Private) | Role |
|------|-------------|--------------|------|
| k3s-mgr | 188.34.201.101 | 10.0.1.5 | Control plane (tainted) |
| k3s-w1 | 159.69.23.121 | 10.0.1.6 | Worker |
| k3s-w2 | 46.225.169.60 | 10.0.1.7 | Worker |

| Resource | Detail |
|----------|--------|
| Hetzner LB | ID 5834131, IP 46.225.37.135 |
| Private Network | 10.0.0.0/16, ID 11949384 |
| Firewall | coolify-fw, ID 10553199 |
| Storage Box | u149513-sub10.your-backup.de:23 |
`skills/k3s-infra/references/RESTORE-MGR.md` (new file, 137 lines)
# Quick Restore: Control Plane (k3s-mgr) Only

Use this when only k3s-mgr is down. Workers and workloads keep running — they just can't be managed until the control plane is back.

## What Still Works Without k3s-mgr

- ✅ Running pods continue serving traffic (DocFast, SnapAPI)
- ✅ Hetzner LB routes to workers directly
- ✅ CNPG PostgreSQL (runs on workers, auto-failover between w1/w2)
- ✅ Traefik ingress (DaemonSet on workers)
- ❌ `kubectl` commands fail
- ❌ No new deployments, scaling, or pod scheduling
- ❌ cert-manager can't renew certs (existing certs stay valid until their 90-day expiry)
- ❌ No etcd/SQLite state changes
## Steps

### 1. Provision New Server

Hetzner Cloud → CAX11 ARM64, Ubuntu 24.04, nbg1 datacenter.
Assign to private network 10.0.0.0/16, set IP to **10.0.1.5**.

Update the public IP in:

- Hetzner Firewall (allow 6443 from runner IP)
- DNS if applicable
- SSH config
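With the `hcloud` CLI configured, the same provisioning can be scripted instead of clicked through; the SSH key name and network name below are assumptions to adapt (check `hcloud network list` and `hcloud ssh-key list`):

```shell
# Hypothetical hcloud equivalent of the console steps above.
hcloud server create \
  --name k3s-mgr \
  --type cax11 \
  --image ubuntu-24.04 \
  --location nbg1 \
  --ssh-key my-admin-key        # assumption: name of an uploaded key

# Attach to the private network with the fixed IP used by the cluster
hcloud server attach-to-network k3s-mgr \
  --network my-private-net \
  --ip 10.0.1.5                 # assumption: network name; IP matches above
```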
### 2. Install Borg & Restore Backup

```bash
apt update && apt install -y borgbackup python3-pyfuse3

# Copy SSH key for Storage Box (from password manager or another host)
mkdir -p /root/.ssh && chmod 700 /root/.ssh
# Paste id_ed25519 (k3s-mgr-backup key)
chmod 600 /root/.ssh/id_ed25519
ssh-keyscan -p 23 u149513-sub10.your-backup.de >> /root/.ssh/known_hosts

export BORG_RSH='ssh -p 23 -i /root/.ssh/id_ed25519'
export BORG_PASSPHRASE='<from password manager>'

# List & mount latest
borg list ssh://u149513-sub10@u149513-sub10.your-backup.de/./k3s-cluster
mkdir -p /mnt/borg
borg mount ssh://u149513-sub10@u149513-sub10.your-backup.de/./k3s-cluster::<latest> /mnt/borg
```
### 3. Restore Token & Install K3s

```bash
mkdir -p /var/lib/rancher/k3s/server
cp /mnt/borg/var/lib/rancher/k3s/server/token /var/lib/rancher/k3s/server/token

curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.34.4+k3s1" sh -s - server \
  --node-taint CriticalAddonsOnly=true:NoSchedule \
  --flannel-iface enp7s0 \
  --cluster-cidr 10.42.0.0/16 \
  --service-cidr 10.43.0.0/16 \
  --tls-san $(curl -s http://169.254.169.254/hetzner/v1/metadata/public-ipv4) \
  --token "$(cat /var/lib/rancher/k3s/server/token)"
```

Workers will auto-reconnect using the same token. Verify:

```bash
kubectl get nodes
```
### 4. Restore Manifests & Config

```bash
cp /mnt/borg/etc/rancher/k3s/k3s.yaml /etc/rancher/k3s/k3s.yaml
cp -r /mnt/borg/var/lib/rancher/k3s/server/manifests/* /var/lib/rancher/k3s/server/manifests/
```
### 5. Reinstall Operators (if not auto-recovered)

K3s keeps cluster state in SQLite — if the workers retained their state, pods may already be running. Check first:

```bash
kubectl get pods -A
```

If operators are missing:

```bash
# cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.17.2/cert-manager.yaml

# CloudNativePG
kubectl apply --server-side -f https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.25/releases/cnpg-1.25.1.yaml
```

Traefik runs as a DaemonSet on the workers — it should already be running.
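A quick check (assuming Traefik was installed into `kube-system` with its standard chart labels, as in the full restore guide):

```shell
# The DaemonSet should report one ready pod per worker
kubectl -n kube-system get daemonset traefik
kubectl -n kube-system get pods -l app.kubernetes.io/name=traefik -o wide
```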
### 6. Re-apply HA Spread Constraints

These are runtime patches that don't survive a fresh control plane:

```bash
# CoreDNS: 3 replicas
kubectl -n kube-system scale deployment coredns --replicas=3

# CNPG operator: 2 replicas with topology spread
kubectl -n cnpg-system scale deployment cnpg-controller-manager --replicas=2
```
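The topology spread itself can be re-applied as a patch; this is a sketch, and the label selector below is an assumption that should be checked against the deployed operator's pod labels first:

```shell
# Hypothetical patch: spread the two operator replicas across different nodes.
kubectl -n cnpg-system patch deployment cnpg-controller-manager --type merge -p '
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: cloudnative-pg   # assumption: verify pod labels
'
```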
### 7. Restore Backup Infrastructure

```bash
# Borg passphrase: copy from password manager to /root/.borg-passphrase
chmod 600 /root/.borg-passphrase

# Restore backup script & helpers
cp /mnt/borg/var/backup/RESTORE.md /var/backup/RESTORE.md
# Or just re-run the OpenClaw setup (Hoid will recreate them)

# Restore cron
echo "30 3 * * * /root/k3s-backup.sh >> /var/log/k3s-backup.log 2>&1" | crontab -

# Unmount
borg umount /mnt/borg
```
### 8. Verify Everything

```bash
kubectl get nodes                          # 3 nodes Ready
kubectl get pods -A                        # All pods Running
kubectl -n postgres get cluster main-db    # CNPG healthy
curl -k https://docfast.dev/health         # App responding
borg-list                                  # Backup accessible
borg-backup                                # Test backup works
```
### Total Downtime Estimate

- Server provisioning: ~2 min (Hetzner API)
- K3s install + worker reconnect: ~3 min
- Operator recovery: ~2 min
- **Total: ~10 minutes** (workloads unaffected during this time)