config/projects/business/memory/infrastructure.md


# K3s Infrastructure Documentation

Last updated: 2026-02-19

## Cluster Overview

| Component | Details |
|---|---|
| K3s Version | v1.34.4+k3s1 |
| Datacenter | Hetzner nbg1 |
| Server Type | CAX11 (ARM64, 2 vCPU, 4GB RAM) |
| Monthly Cost | €17.06 (3× CAX11 + LB) |
| Private Network | 10.0.0.0/16, ID 11949384 |
| Cluster CIDR | 10.42.0.0/16 |
| Service CIDR | 10.43.0.0/16 |
| Flannel Interface | enp7s0 (private network) |

## Nodes

| Node | Role | Public IP | Private IP | Hetzner ID |
|---|---|---|---|---|
| k3s-mgr | Control plane (tainted NoSchedule) | 188.34.201.101 | 10.0.1.5 | 121365837 |
| k3s-w1 | Worker | 159.69.23.121 | 10.0.1.6 | 121365839 |
| k3s-w2 | Worker | 46.225.169.60 | 10.0.1.7 | 121365840 |

## Load Balancer

| Field | Value |
|---|---|
| Name | k3s-lb |
| Hetzner ID | 5834131 |
| Public IP | 46.225.37.135 |
| Targets | k3s-w1, k3s-w2 (ports 80/443) |
| Health Checks | TCP, 15s interval, 3 retries, 10s timeout |

## Installed Operators & Components

| Component | Version | Notes |
|---|---|---|
| Traefik | Helm (DaemonSet) | Runs on all workers, handles ingress + TLS termination |
| cert-manager | 1.17.2 | Let's Encrypt ClusterIssuer letsencrypt-prod |
| CloudNativePG | 1.25.1 | PostgreSQL operator |

## Database (CNPG)

| Field | Value |
|---|---|
| Cluster Name | main-db |
| Namespace | postgres |
| Instances | 2 (primary + replica) |
| PostgreSQL | 17.4 |
| Storage | 10Gi local-path per instance |
| Databases | docfast (prod), docfast_staging (staging) |
| PgBouncer | main-db-pooler, 2 instances, transaction mode |

### Credentials

- docfast-db-credentials secret: user=docfast, pass=docfast
- main-db-superuser secret: managed by CNPG

## Namespaces

| Namespace | Purpose |
|---|---|
| postgres | CNPG cluster + pooler |
| docfast | Production DocFast (2 replicas) |
| docfast-staging | Staging DocFast (1 replica) |
| cnpg-system | CNPG operator |
| cert-manager | cert-manager |
| kube-system | K3s system (CoreDNS, Traefik, etc.) |

## HA Configuration

All spread constraints are runtime patches — may not survive K3s upgrades. Re-apply after updates.

| Component | Replicas | Spread Strategy |
|---|---|---|
| CoreDNS | 3 | preferredDuringScheduling podAntiAffinity (mgr + w1 + w2) |
| CNPG Operator | 2 | topologySpreadConstraints DoNotSchedule (w1 + w2) |
| PgBouncer Pooler | 2 | requiredDuringScheduling podAntiAffinity via Pooler CRD (w1 + w2) |
| DocFast Prod | 2 | preferredDuringScheduling podAntiAffinity (w1 + w2) |
| DocFast Staging | 1 | Not HA by design |
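Since the patches live only in the cluster, re-applying after an upgrade means repeating them by hand. A minimal sketch of what re-applying the CoreDNS spread could look like (the exact patch used originally is not recorded here; the weight is an assumption, and k8s-app=kube-dns is the K3s default label, so verify it before relying on it):

```shell
# Hypothetical re-apply of the CoreDNS scale + anti-affinity patch.
kubectl -n kube-system scale deployment coredns --replicas=3
kubectl -n kube-system patch deployment coredns --type merge -p '{
  "spec": {"template": {"spec": {"affinity": {"podAntiAffinity": {
    "preferredDuringSchedulingIgnoredDuringExecution": [{
      "weight": 100,
      "podAffinityTerm": {
        "labelSelector": {"matchLabels": {"k8s-app": "kube-dns"}},
        "topologyKey": "kubernetes.io/hostname"
      }}]}}}}}}'
```

Note that a merge patch replaces the whole `affinity` block, which is fine here but would clobber any node affinity added later.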

### Failover Tuning (2026-02-18)

- Readiness probe: every 5s, fail after 2 → pod marked unhealthy in ~10s
- Liveness probe: every 10s, fail after 3
- Node tolerations: pods evicted 10s after node failure (default was 300s)
- Result: failover window of ~10-15 seconds
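The tuning above could be captured as a patch file so it survives re-deploys, along these lines (a sketch: the deployment and container names "docfast" are assumptions):

```shell
# Hypothetical patch capturing the probe + toleration tuning described above.
cat > /tmp/failover-tuning.yaml <<'EOF'
spec:
  template:
    spec:
      tolerations:
        - key: node.kubernetes.io/not-ready
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 10
        - key: node.kubernetes.io/unreachable
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 10
      containers:
        - name: docfast            # container name is an assumption
          readinessProbe:
            periodSeconds: 5
            failureThreshold: 2
          livenessProbe:
            periodSeconds: 10
            failureThreshold: 3
EOF
kubectl -n docfast patch deployment docfast --patch-file /tmp/failover-tuning.yaml
```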

### HA Test Results (2026-02-18)

- w1 down: 4/4 health checks passed
- w2 down: 4/4 health checks passed, CNPG promoted the replica
- mgr down: 4/4 health checks passed (workers keep running)

## CI/CD Pipeline

| Field | Value |
|---|---|
| Registry | git.cloonar.com (Forgejo container registry) |
| Runner | Agent host (178.115.247.134), x86 → ARM64 cross-compile via QEMU |
| Build time | ~8 min |
| Deployer SA | docfast:deployer with namespace-scoped RBAC |

### Workflows

- deploy.yml: push to main → build + deploy to docfast-staging
- promote.yml: tag v* → build + deploy to docfast (prod)
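The deploy half of such a workflow typically reduces to a few kubectl calls; a sketch, assuming the registry image path and the deployment/container name "docfast" (both not recorded here):

```shell
# Hypothetical final step of deploy.yml: roll the freshly built image into
# staging and wait for the rollout. GITHUB_SHA is the Forgejo Actions commit SHA.
IMAGE="git.cloonar.com/docfast/docfast:${GITHUB_SHA:0:8}"   # image path is an assumption
kubectl -n docfast-staging set image deployment/docfast docfast="$IMAGE"
kubectl -n docfast-staging rollout status deployment/docfast --timeout=120s
```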

### Secrets Required in Forgejo

- REGISTRY_TOKEN — PAT with write:package scope
- KUBECONFIG — base64-encoded deployer kubeconfig
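Regenerating the KUBECONFIG value could look like this sketch (rewriting the API server address to the mgr public IP is an assumption, justified only by the firewall rule opening 6443 to the CI runner):

```shell
# Sketch: base64 value for the KUBECONFIG secret, from the deployer
# kubeconfig on k3s-mgr (path from the SSH Access section below).
ssh k3s-mgr cat /home/deployer/.kube-config.yaml \
  | sed 's|https://127.0.0.1:6443|https://188.34.201.101:6443|' \
  | base64 -w0
```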

### Pull Secrets

- forgejo-registry imagePullSecret in both docfast and docfast-staging namespaces
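If the pull secrets ever need to be recreated, the standard kubectl form is (a sketch: the registry username is a placeholder, the token is the write:package PAT above):

```shell
# (Re)create the forgejo-registry pull secret in both namespaces.
for ns in docfast docfast-staging; do
  kubectl -n "$ns" create secret docker-registry forgejo-registry \
    --docker-server=git.cloonar.com \
    --docker-username=ci-bot \
    --docker-password="$REGISTRY_TOKEN" \
    --dry-run=client -o yaml | kubectl apply -f -
done
```

The `--dry-run=client | kubectl apply` pattern makes the command idempotent, so it can be re-run safely.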

## DNS

| Record | Type | Value |
|---|---|---|
| docfast.dev | A | 46.225.37.135 (LB) |
| staging.docfast.dev | A | 46.225.37.135 (TLS working) |
| MX | MX | mail.cloonar.com. |

## Firewall

- Name: coolify-fw, Hetzner ID 10553199
- Port 6443 open to: 10.0.0.0/16 (cluster internal) + 178.115.247.134/32 (CI runner)

## SSH Access

Config in /home/openclaw/.ssh/config:

- k3s-mgr, k3s-w1, k3s-w2 — root access
- deployer user on k3s-mgr — limited kubeconfig at /home/deployer/.kube-config.yaml
- KUBECONFIG on mgr: /etc/rancher/k3s/k3s.yaml

## Backup Strategy

Current state: OPERATIONAL (since 2026-02-19)

Approach: Borg to Hetzner Storage Box

Target: u149513-sub11@u149513-sub11.your-backup.de:23 (already set up, SSH key configured)

### 1. Cluster State (etcd snapshots)

- K3s built-in: --etcd-snapshot-schedule-cron on k3s-mgr
- Borg repo: ./k3s-cluster/ on the Storage Box
- Contents: etcd snapshot + /var/lib/rancher/k3s/server/manifests/ + all applied YAML manifests
- Schedule: daily
- Retention: 7 daily, 4 weekly
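A sketch of the nightly shipping step (archive naming, compression, and schedule time are assumptions; the snapshot and manifest paths are the K3s defaults):

```shell
# Ship etcd snapshots + manifests from k3s-mgr to the Borg repo.
export BORG_REPO='ssh://u149513-sub11@u149513-sub11.your-backup.de:23/./k3s-cluster'
borg create --compression zstd \
  "::k3s-{now:%Y-%m-%d}" \
  /var/lib/rancher/k3s/server/db/snapshots \
  /var/lib/rancher/k3s/server/manifests
borg prune --keep-daily 7 --keep-weekly 4
```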

### 2. Database (pg_dump)

- CronJob in the postgres namespace → pg_dump of both databases
- Push to Borg repo: ./k3s-db/ on the Storage Box
- Schedule: every 6 hours
- Retention: 7 daily, 4 weekly
- DB size: ~8 MB (tiny — Borg dedup makes this basically free)
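What the CronJob runs is not recorded here; a sketch, assuming the CNPG default read-write service name (main-db-rw), superuser credentials, and a /backup scratch dir:

```shell
# Hypothetical body of the 6-hourly backup CronJob.
export BORG_REPO='ssh://u149513-sub11@u149513-sub11.your-backup.de:23/./k3s-db'
mkdir -p /backup
for db in docfast docfast_staging; do
  # -Fc: custom format, compressed, restorable with pg_restore
  pg_dump -h main-db-rw.postgres.svc -U postgres -Fc "$db" > "/backup/${db}.dump"
done
borg create --compression zstd "::pg-{now}" /backup
borg prune --keep-daily 7 --keep-weekly 4
```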

### 3. Kubernetes Manifests

- Export all namespaced resources as YAML
- Include: deployments, services, ingresses, secrets (encrypted by Borg), configmaps, CNPG cluster spec, pooler spec
- Push to Borg alongside the etcd snapshots
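The export step could be sketched as follows (the output directory is an assumption; `cluster` and `pooler` are the CNPG CRD resource names):

```shell
# Export the resources listed above as YAML before the daily Borg run.
OUT=/var/backups/k8s-manifests
mkdir -p "$OUT"
for ns in postgres docfast docfast-staging; do
  kubectl -n "$ns" get deployments,services,ingresses,secrets,configmaps \
    -o yaml > "$OUT/$ns.yaml"
done
kubectl -n postgres get cluster main-db -o yaml > "$OUT/cnpg-cluster.yaml"
kubectl -n postgres get pooler main-db-pooler -o yaml > "$OUT/cnpg-pooler.yaml"
```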

### 4. Recovery Procedure

1. Provision 3 fresh CAX11 nodes
2. Install K3s and restore the etcd snapshot (or: fresh K3s + re-apply manifests from Borg)
3. Restore the CNPG database from pg_dump
4. Update DNS to the new LB IP

Estimated recovery time: ~15-30 minutes
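The etcd-restore path could be sketched like this (archive and snapshot file names are placeholders; --cluster-reset-restore-path is the K3s restore mechanism):

```shell
# Pull the latest archive back from Borg and reset the cluster from the
# etcd snapshot it contains.
export BORG_REPO='ssh://u149513-sub11@u149513-sub11.your-backup.de:23/./k3s-cluster'
borg extract "::k3s-2026-02-19"       # placeholder archive name
systemctl stop k3s
k3s server \
  --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/SNAPSHOT-FILE
```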

### Future: CNPG Barman/S3 (when needed)

- Hetzner Object Storage (S3-compatible)
- Continuous WAL archiving for point-in-time recovery
- Worth it once the DB grows past ~1 GB or revenue justifies €5/mo
- Current DB: 7.6 MB — overkill for now

## Future Improvements

### Priority: High

- Borg backup — DONE: operational since 2026-02-19 (DB every 6h, full backup daily at 03:00 UTC)
- DNS: staging.docfast.dev → 46.225.37.135 — DONE: record live, staging ingress TLS working
- Persist HA spread constraints — CoreDNS scale, CNPG operator replicas, and pooler anti-affinity are runtime patches. Need infra-as-code (manifests in Git) to survive K3s upgrades/reinstalls
- Old server (167.235.156.214) — DONE: decommissioned and deleted

### Priority: Medium

- CNPG backup to S3 — upgrade from pg_dump to continuous WAL archiving when the DB grows
- Monitoring/alerting — Prometheus + Grafana stack, or a lightweight alternative (VictoriaMetrics)
- Resource limits tuning — current: 100m-1000m CPU, 256Mi-1Gi RAM per pod. Profile actual usage and right-size
- Network policies — restrict pod-to-pod traffic (e.g., only DocFast → PgBouncer, not direct to DB)
- Pod Disruption Budgets — ensure at least 1 pod stays running during voluntary disruptions (upgrades, drains)
- Automated K3s upgrades — system-upgrade-controller for rolling node updates

### Priority: Low

- Multi-project namespaces — SnapAPI and future products get their own namespaces + RBAC
- ServiceAccount per CEO agent — scoped kubectl access for autonomous deployment
- Horizontal Pod Autoscaler — scale DocFast replicas based on CPU/request load
- External Secrets Operator — centralized secret management instead of per-namespace secrets
- Loki for log aggregation — centralized logging instead of kubectl logs
- Node auto-scaling — Hetzner Cloud Controller Manager + Cluster Autoscaler