# K3s Infrastructure Documentation

*Last updated: 2026-02-18*

## Cluster Overview

| Component | Details |
|-----------|---------|
| K3s Version | v1.34.4+k3s1 |
| Datacenter | Hetzner nbg1 |
| Server Type | CAX11 (ARM64, 2 vCPU, 4GB RAM) |
| Monthly Cost | €17.06 (3× CAX11 + LB) |
| Private Network | 10.0.0.0/16, ID 11949384 |
| Cluster CIDR | 10.42.0.0/16 |
| Service CIDR | 10.43.0.0/16 |
| Flannel Interface | enp7s0 (private network) |

## Nodes

| Node | Role | Public IP | Private IP | Hetzner ID |
|------|------|-----------|------------|------------|
| k3s-mgr | Control plane (tainted NoSchedule) | 188.34.201.101 | 10.0.1.5 | 121365837 |
| k3s-w1 | Worker | 159.69.23.121 | 10.0.1.6 | 121365839 |
| k3s-w2 | Worker | 46.225.169.60 | 10.0.1.7 | 121365840 |

## Load Balancer

| Field | Value |
|-------|-------|
| Name | k3s-lb |
| Hetzner ID | 5834131 |
| Public IP | 46.225.37.135 |
| Targets | k3s-w1, k3s-w2 (ports 80/443) |
| Health Checks | TCP, 15s interval, 3 retries, 10s timeout |

## Installed Operators & Components

| Component | Version | Notes |
|-----------|---------|-------|
| Traefik | Helm (DaemonSet) | Runs on all workers, handles ingress + TLS termination |
| cert-manager | 1.17.2 | Let's Encrypt ClusterIssuer `letsencrypt-prod` |
| CloudNativePG | 1.25.1 | PostgreSQL operator |

## Database (CNPG)

| Field | Value |
|-------|-------|
| Cluster Name | main-db |
| Namespace | postgres |
| Instances | 2 (primary + replica) |
| PostgreSQL | 17.4 |
| Storage | 10Gi local-path per instance |
| Databases | `docfast` (prod), `docfast_staging` (staging) |
| PgBouncer | `main-db-pooler`, 2 instances, transaction mode |

### Credentials
- `docfast-db-credentials` secret: user=docfast, pass=docfast
- `main-db-superuser` secret: managed by CNPG

## Namespaces

| Namespace | Purpose |
|-----------|---------|
| postgres | CNPG cluster + pooler |
| docfast | Production DocFast (2 replicas) |
| docfast-staging | Staging DocFast (1 replica) |
| cnpg-system | CNPG operator |
| cert-manager | cert-manager |
| kube-system | K3s system (CoreDNS, Traefik, etc.) |

## HA Configuration

All spread constraints are **runtime patches** — may not survive K3s upgrades. Re-apply after updates.

| Component | Replicas | Spread Strategy |
|-----------|----------|-----------------|
| CoreDNS | 3 | `preferredDuringScheduling` podAntiAffinity (mgr + w1 + w2) |
| CNPG Operator | 2 | `topologySpreadConstraints DoNotSchedule` (w1 + w2) |
| PgBouncer Pooler | 2 | `requiredDuringScheduling` podAntiAffinity via Pooler CRD (w1 + w2) |
| DocFast Prod | 2 | `preferredDuringScheduling` podAntiAffinity (w1 + w2) |
| DocFast Staging | 1 | Not HA by design |

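To make re-applying less error-prone, the CoreDNS spread can be kept as a patch file. A minimal sketch, assuming the stock `coredns` Deployment in `kube-system` with the K3s default `k8s-app: kube-dns` labels:

```yaml
# coredns-spread-patch.yaml: sketch of the preferred podAntiAffinity patch.
# The k8s-app: kube-dns selector is the K3s default; verify against live pods.
# Apply with:
#   kubectl -n kube-system patch deployment coredns --patch-file coredns-spread-patch.yaml
spec:
  replicas: 3
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    k8s-app: kube-dns
```

Preferred (not required) anti-affinity keeps all three replicas schedulable even if a node is down.
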
### Failover Tuning (2026-02-18)
- **Readiness probe**: every 5s, fail after 2 → pod marked unhealthy in ~10s
- **Liveness probe**: every 10s, fail after 3 → restart after ~30s
- **Node tolerations**: pods evicted from a failed node after 10s (default was 300s)
- **Result**: failover window of ~10-15 seconds

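In Deployment terms, the tuning above corresponds to roughly this pod template fragment (a sketch: the container name, port, and `/healthz` path are assumptions; the numbers match the tuning above):

```yaml
# Sketch of the failover tuning as a pod template fragment.
spec:
  tolerations:
    # Evict from unreachable/not-ready nodes after 10s instead of the 300s default.
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 10
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 10
  containers:
    - name: docfast                 # container name is an assumption
      readinessProbe:               # unhealthy after ~10s (2 x 5s)
        httpGet: { path: /healthz, port: 8080 }
        periodSeconds: 5
        failureThreshold: 2
      livenessProbe:                # restart after ~30s (3 x 10s)
        httpGet: { path: /healthz, port: 8080 }
        periodSeconds: 10
        failureThreshold: 3
```
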
### HA Test Results (2026-02-18)
- ✅ w1 down: 4/4 health checks passed
- ✅ w2 down: 4/4 health checks passed, CNPG promoted replica
- ✅ mgr down: 4/4 health checks passed (workers keep running)

## CI/CD Pipeline

| Field | Value |
|-------|-------|
| Registry | git.cloonar.com (Forgejo container registry) |
| Runner | Agent host (178.115.247.134), x86 → ARM64 cross-compile via QEMU |
| Build time | ~8 min |
| Deployer SA | `docfast:deployer` with namespace-scoped RBAC |

### Workflows
- **deploy.yml**: Push to `main` → build + deploy to `docfast-staging`
- **promote.yml**: Tag `v*` → build + deploy to `docfast` (prod)

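The shape of `deploy.yml` is roughly the following. This is a sketch, not the live workflow: the runner label, image name, and deployment/container names are all assumptions.

```yaml
# Sketch of deploy.yml: push to main -> build ARM64 image -> roll staging.
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: docker            # runner label is an assumption
    steps:
      - uses: actions/checkout@v4
      - name: Build and push (cross-compile for ARM64 via QEMU)
        run: |
          echo "${{ secrets.REGISTRY_TOKEN }}" | docker login git.cloonar.com -u ci --password-stdin
          docker buildx build --platform linux/arm64 \
            -t git.cloonar.com/docfast/docfast:${{ github.sha }} --push .
      - name: Deploy to staging
        run: |
          echo "${{ secrets.KUBECONFIG }}" | base64 -d > kubeconfig
          KUBECONFIG=kubeconfig kubectl -n docfast-staging \
            set image deployment/docfast docfast=git.cloonar.com/docfast/docfast:${{ github.sha }}
```
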
### Secrets Required in Forgejo
- `REGISTRY_TOKEN` — PAT with write:package scope
- `KUBECONFIG` — base64 encoded deployer kubeconfig

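To produce the `KUBECONFIG` secret value, the deployer kubeconfig can be encoded like this (the default path matches the SSH Access section; `-w0` disables line wrapping so the value pastes as a single line):

```shell
# Base64-encode the deployer kubeconfig for the Forgejo KUBECONFIG secret.
# KCFG defaults to the deployer kubeconfig path on k3s-mgr.
KCFG="${KCFG:-/home/deployer/.kube-config.yaml}"
base64 -w0 "$KCFG"
```
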
### Pull Secrets
- `forgejo-registry` imagePullSecret in both `docfast` and `docfast-staging` namespaces

## DNS

| Record | Type | Value |
|--------|------|-------|
| docfast.dev | A | 46.225.37.135 (LB) |
| staging.docfast.dev | A | **NOT SET** — needed for staging TLS |
| docfast.dev | MX | mail.cloonar.com. |

## Firewall

- Name: coolify-fw, Hetzner ID 10553199
- Port 6443 open to: 10.0.0.0/16 (cluster internal) + 178.115.247.134/32 (CI runner)

## SSH Access

Config in `/home/openclaw/.ssh/config`:
- `k3s-mgr`, `k3s-w1`, `k3s-w2` — root access
- `deployer` user on k3s-mgr — limited kubeconfig at `/home/deployer/.kube-config.yaml`
- KUBECONFIG on mgr: `/etc/rancher/k3s/k3s.yaml`

---
## Backup Strategy

### Current State: ✅ OPERATIONAL (since 2026-02-19)

### Design: Borg to Hetzner Storage Box

Target: `u149513-sub11@u149513-sub11.your-backup.de:23` (already set up, SSH key configured)

**1. Cluster State (etcd snapshots)**
- K3s built-in: `--etcd-snapshot-schedule-cron` on k3s-mgr
- Borg repo: `./k3s-cluster/` on Storage Box
- Contents: etcd snapshot + `/var/lib/rancher/k3s/server/manifests/` + all applied YAML manifests
- Schedule: Daily
- Retention: 7 daily, 4 weekly

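The snapshot schedule can be pinned in the K3s config file instead of a CLI flag. A sketch for k3s-mgr; the 03:00 cron time matches the daily backup window noted in Future Improvements, and the retention value is an assumption:

```yaml
# Sketch: /etc/rancher/k3s/config.yaml on k3s-mgr.
# Scheduled snapshots land in /var/lib/rancher/k3s/server/db/snapshots/,
# where the daily Borg run can pick them up.
etcd-snapshot-schedule-cron: "0 3 * * *"
etcd-snapshot-retention: 7      # retention count is an assumption
```
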
**2. Database (pg_dump)**
- CronJob in `postgres` namespace → `pg_dump` both databases
- Push to Borg repo: `./k3s-db/` on Storage Box
- Schedule: Every 6 hours
- Retention: 7 daily, 4 weekly
- DB size: ~8 MB (tiny — Borg dedup makes this basically free)

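The CronJob looks roughly like this. It is a sketch, not the deployed manifest: the secret key name is an assumption, `main-db-rw` is the CNPG read-write service (pg_dump needs a session, so it should bypass the transaction-mode pooler), and shipping the dumps to the Borg repo is elided:

```yaml
# Sketch of the 6-hourly pg_dump CronJob in the postgres namespace.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pg-dump
  namespace: postgres
spec:
  schedule: "0 */6 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: dump
              image: postgres:17.4
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: docfast-db-credentials
                      key: password      # key name is an assumption
              command: ["/bin/sh", "-c"]
              args:
                - >
                  pg_dump -h main-db-rw -U docfast docfast > /tmp/docfast.sql &&
                  pg_dump -h main-db-rw -U docfast docfast_staging > /tmp/docfast_staging.sql
```
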
**3. Kubernetes Manifests**
- Export all namespaced resources as YAML
- Include: deployments, services, ingresses, secrets (encrypted by Borg), configmaps, CNPG cluster spec, pooler spec
- Push to Borg alongside etcd snapshots

**4. Recovery Procedure**
1. Provision 3 fresh CAX11 nodes
2. Install K3s and restore the etcd snapshot (or: fresh K3s install + re-apply manifests from Borg)
3. Restore the CNPG database from the latest pg_dump
4. Update DNS to the new LB IP

Estimated recovery time: ~15-30 minutes

### Future: CNPG Barman/S3 (when needed)
- Hetzner Object Storage (S3-compatible)
- Continuous WAL archiving for point-in-time recovery
- Worth it when DB grows past ~1 GB or revenue justifies €5/mo
- Current DB: 7.6 MB — overkill for now

---
## Future Improvements

### Priority: High
- [x] **Implement Borg backup** — operational since 2026-02-19 (DB every 6h, full daily at 03:00 UTC)
- [ ] **DNS: staging.docfast.dev** → 46.225.37.135 — needed for staging ingress TLS
- [ ] **Persist HA spread constraints** — CoreDNS scale, CNPG operator replicas, pooler anti-affinity are runtime patches. Need infra-as-code (manifests in Git) to survive K3s upgrades/reinstalls
- [x] **Old server decommissioned** (167.235.156.214) — deleted, no longer exists

### Priority: Medium
- [ ] **CNPG backup to S3** — upgrade from pg_dump to continuous WAL archiving when DB grows
- [ ] **Monitoring/alerting** — Prometheus + Grafana stack, or lightweight alternative (VictoriaMetrics)
- [ ] **Resource limits tuning** — current: 100m-1000m CPU, 256Mi-1Gi RAM per pod. Profile actual usage and right-size
- [ ] **Network policies** — restrict pod-to-pod traffic (e.g., only DocFast → PgBouncer, not direct to DB)
- [ ] **Pod Disruption Budgets** — ensure at least 1 pod stays running during voluntary disruptions (upgrades, drains)
- [ ] **Automated K3s upgrades** — system-upgrade-controller for rolling node updates

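The network-policy item could start from something like the following sketch, which admits only DocFast pods to PgBouncer on 5432. All label selectors here are assumptions and must be verified against the live pods before applying; K3s enforces NetworkPolicy via its bundled controller even on Flannel:

```yaml
# Sketch: only pods in the docfast namespace may reach the pooler on 5432.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-docfast-to-pooler
  namespace: postgres
spec:
  podSelector:
    matchLabels:
      cnpg.io/poolerName: main-db-pooler   # label is an assumption
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: docfast
      ports:
        - protocol: TCP
          port: 5432
```
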
### Priority: Low
- [ ] **Multi-project namespaces** — SnapAPI and future products get own namespaces + RBAC
- [ ] **ServiceAccount per CEO agent** — scoped kubectl access for autonomous deployment
- [ ] **Horizontal Pod Autoscaler** — scale DocFast replicas based on CPU/request load
- [ ] **External Secrets Operator** — centralized secret management instead of per-namespace secrets
- [ ] **Loki for log aggregation** — centralized logging instead of `kubectl logs`
- [ ] **Node auto-scaling** — Hetzner Cloud Controller Manager + Cluster Autoscaler