config/projects/business/memory/infrastructure.md


# K3s Infrastructure Documentation

Last updated: 2026-02-19

## Cluster Overview

| Component | Details |
|---|---|
| K3s Version | v1.34.4+k3s1 |
| Datacenter | Hetzner nbg1 |
| Server Type | CAX11 (ARM64, 2 vCPU, 4GB RAM) |
| Monthly Cost | €17.06 (3× CAX11 + LB) |
| Private Network | 10.0.0.0/16, ID 11949384 |
| Cluster CIDR | 10.42.0.0/16 |
| Service CIDR | 10.43.0.0/16 |
| Flannel Interface | enp7s0 (private network) |

## Nodes

| Node | Role | Public IP | Private IP | Hetzner ID |
|---|---|---|---|---|
| k3s-mgr | Control plane (tainted NoSchedule) | 188.34.201.101 | 10.0.1.5 | 121365837 |
| k3s-w1 | Worker | 159.69.23.121 | 10.0.1.6 | 121365839 |
| k3s-w2 | Worker | 46.225.169.60 | 10.0.1.7 | 121365840 |

## Load Balancer

| Field | Value |
|---|---|
| Name | k3s-lb |
| Hetzner ID | 5834131 |
| Public IP | 46.225.37.135 |
| Targets | k3s-w1, k3s-w2 (ports 80/443) |
| Health Checks | TCP, 15s interval, 3 retries, 10s timeout |

## Installed Operators & Components

| Component | Version | Notes |
|---|---|---|
| Traefik | Helm (DaemonSet) | Runs on all workers, handles ingress + TLS termination |
| cert-manager | 1.17.2 | Let's Encrypt ClusterIssuer letsencrypt-prod |
| CloudNativePG | 1.25.1 | PostgreSQL operator |

## Database (CNPG)

| Field | Value |
|---|---|
| Cluster Name | main-db |
| Namespace | postgres |
| Instances | 2 (primary + replica) |
| PostgreSQL | 17.4 |
| Storage | 10Gi local-path per instance |
| Databases | docfast (prod), docfast_staging (staging) |
| PgBouncer | main-db-pooler, 2 instances, transaction mode |

### Credentials

- docfast-db-credentials secret: user=docfast, pass=docfast
- main-db-superuser secret: managed by CNPG

## Namespaces

| Namespace | Purpose |
|---|---|
| postgres | CNPG cluster + pooler |
| docfast | Production DocFast (2 replicas) |
| docfast-staging | Staging DocFast (1 replica) |
| cnpg-system | CNPG operator |
| cert-manager | cert-manager |
| kube-system | K3s system (CoreDNS, Traefik, etc.) |

## HA Configuration

All spread constraints are runtime patches — may not survive K3s upgrades. Re-apply after updates.

| Component | Replicas | Spread Strategy |
|---|---|---|
| CoreDNS | 3 | preferredDuringScheduling podAntiAffinity (mgr + w1 + w2) |
| CNPG Operator | 2 | topologySpreadConstraints DoNotSchedule (w1 + w2) |
| PgBouncer Pooler | 2 | requiredDuringScheduling podAntiAffinity via Pooler CRD (w1 + w2) |
| DocFast Prod | 2 | preferredDuringScheduling podAntiAffinity (w1 + w2) |
| DocFast Staging | 1 | Not HA by design |
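Since the patches live only in the cluster, re-applying after an upgrade means repeating them by hand. A minimal sketch of what re-applying the CoreDNS spread could look like (the exact patch used originally is not recorded here; the weight is an assumption, and k8s-app=kube-dns is the K3s default label, so verify it before relying on it):

```shell
# Hypothetical re-apply of the CoreDNS scale + anti-affinity patch.
kubectl -n kube-system scale deployment coredns --replicas=3
kubectl -n kube-system patch deployment coredns --type merge -p '{
  "spec": {"template": {"spec": {"affinity": {"podAntiAffinity": {
    "preferredDuringSchedulingIgnoredDuringExecution": [{
      "weight": 100,
      "podAffinityTerm": {
        "labelSelector": {"matchLabels": {"k8s-app": "kube-dns"}},
        "topologyKey": "kubernetes.io/hostname"
      }}]}}}}}}'
```

Note that a merge patch replaces the whole `affinity` block, which is fine here but would clobber any node affinity added later.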

### Failover Tuning (2026-02-18)

- Readiness probe: every 5s, fail after 2 → pod marked unhealthy in ~10s
- Liveness probe: every 10s, fail after 3
- Node tolerations: pods evicted 10s after node failure (default was 300s)
- Result: failover window of ~10-15 seconds
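The tuning above could be captured as a patch file so it survives re-deploys, along these lines (a sketch: the deployment and container names "docfast" are assumptions):

```shell
# Hypothetical patch capturing the probe + toleration tuning described above.
cat > /tmp/failover-tuning.yaml <<'EOF'
spec:
  template:
    spec:
      tolerations:
        - key: node.kubernetes.io/not-ready
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 10
        - key: node.kubernetes.io/unreachable
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 10
      containers:
        - name: docfast            # container name is an assumption
          readinessProbe:
            periodSeconds: 5
            failureThreshold: 2
          livenessProbe:
            periodSeconds: 10
            failureThreshold: 3
EOF
kubectl -n docfast patch deployment docfast --patch-file /tmp/failover-tuning.yaml
```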

### HA Test Results (2026-02-18)

- w1 down: 4/4 health checks passed
- w2 down: 4/4 health checks passed, CNPG promoted the replica
- mgr down: 4/4 health checks passed (workers keep running)

## CI/CD Pipeline

| Field | Value |
|---|---|
| Registry | git.cloonar.com (Forgejo container registry) |
| Runner | Agent host (178.115.247.134), x86 → ARM64 cross-compile via QEMU |
| Build time | ~8 min |
| Deployer SA | docfast:deployer with namespace-scoped RBAC |

### Workflows

- deploy.yml: push to main → build + deploy to docfast-staging
- promote.yml: tag v* → build + deploy to docfast (prod)
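The deploy half of such a workflow typically reduces to a few kubectl calls; a sketch, assuming the registry image path and the deployment/container name "docfast" (both not recorded here):

```shell
# Hypothetical final step of deploy.yml: roll the freshly built image into
# staging and wait for the rollout. GITHUB_SHA is the Forgejo Actions commit SHA.
IMAGE="git.cloonar.com/docfast/docfast:${GITHUB_SHA:0:8}"   # image path is an assumption
kubectl -n docfast-staging set image deployment/docfast docfast="$IMAGE"
kubectl -n docfast-staging rollout status deployment/docfast --timeout=120s
```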

### Secrets Required in Forgejo

- REGISTRY_TOKEN — PAT with write:package scope
- KUBECONFIG — base64-encoded deployer kubeconfig
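Regenerating the KUBECONFIG value could look like this sketch (rewriting the API server address to the mgr public IP is an assumption, justified only by the firewall rule opening 6443 to the CI runner):

```shell
# Sketch: base64 value for the KUBECONFIG secret, from the deployer
# kubeconfig on k3s-mgr (path from the SSH Access section below).
ssh k3s-mgr cat /home/deployer/.kube-config.yaml \
  | sed 's|https://127.0.0.1:6443|https://188.34.201.101:6443|' \
  | base64 -w0
```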

### Pull Secrets

- forgejo-registry imagePullSecret in both docfast and docfast-staging namespaces
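If the pull secrets ever need to be recreated, the standard kubectl form is (a sketch: the registry username is a placeholder, the token is the write:package PAT above):

```shell
# (Re)create the forgejo-registry pull secret in both namespaces.
for ns in docfast docfast-staging; do
  kubectl -n "$ns" create secret docker-registry forgejo-registry \
    --docker-server=git.cloonar.com \
    --docker-username=ci-bot \
    --docker-password="$REGISTRY_TOKEN" \
    --dry-run=client -o yaml | kubectl apply -f -
done
```

The `--dry-run=client | kubectl apply` pattern makes the command idempotent, so it can be re-run safely.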

## DNS

| Record | Type | Value |
|---|---|---|
| docfast.dev | A | 46.225.37.135 (LB) |
| staging.docfast.dev | A | 46.225.37.135 (TLS working) |
| MX | MX | mail.cloonar.com. |

## Firewall

- Name: coolify-fw, Hetzner ID 10553199
- Port 6443 open to: 10.0.0.0/16 (cluster internal) + 178.115.247.134/32 (CI runner)

## SSH Access

Config in /home/openclaw/.ssh/config:

- k3s-mgr, k3s-w1, k3s-w2 — root access
- deployer user on k3s-mgr — limited kubeconfig at /home/deployer/.kube-config.yaml
- KUBECONFIG on mgr: /etc/rancher/k3s/k3s.yaml

## Backup Strategy

Current state: OPERATIONAL (since 2026-02-19)

Approach: Borg to Hetzner Storage Box

Target: u149513-sub11@u149513-sub11.your-backup.de:23 (already set up, SSH key configured)

### 1. Cluster State (etcd snapshots)

- K3s built-in: --etcd-snapshot-schedule-cron on k3s-mgr
- Borg repo: ./k3s-cluster/ on the Storage Box
- Contents: etcd snapshot + /var/lib/rancher/k3s/server/manifests/ + all applied YAML manifests
- Schedule: daily
- Retention: 7 daily, 4 weekly
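A sketch of the nightly shipping step (archive naming, compression, and schedule time are assumptions; the snapshot and manifest paths are the K3s defaults):

```shell
# Ship etcd snapshots + manifests from k3s-mgr to the Borg repo.
export BORG_REPO='ssh://u149513-sub11@u149513-sub11.your-backup.de:23/./k3s-cluster'
borg create --compression zstd \
  "::k3s-{now:%Y-%m-%d}" \
  /var/lib/rancher/k3s/server/db/snapshots \
  /var/lib/rancher/k3s/server/manifests
borg prune --keep-daily 7 --keep-weekly 4
```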

### 2. Database (pg_dump)

- CronJob in the postgres namespace → pg_dump of both databases
- Push to Borg repo: ./k3s-db/ on the Storage Box
- Schedule: every 6 hours
- Retention: 7 daily, 4 weekly
- DB size: ~8 MB (tiny — Borg dedup makes this basically free)
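What the CronJob runs is not recorded here; a sketch, assuming the CNPG default read-write service name (main-db-rw), superuser credentials, and a /backup scratch dir:

```shell
# Hypothetical body of the 6-hourly backup CronJob.
export BORG_REPO='ssh://u149513-sub11@u149513-sub11.your-backup.de:23/./k3s-db'
mkdir -p /backup
for db in docfast docfast_staging; do
  # -Fc: custom format, compressed, restorable with pg_restore
  pg_dump -h main-db-rw.postgres.svc -U postgres -Fc "$db" > "/backup/${db}.dump"
done
borg create --compression zstd "::pg-{now}" /backup
borg prune --keep-daily 7 --keep-weekly 4
```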

### 3. Kubernetes Manifests

- Export all namespaced resources as YAML
- Include: deployments, services, ingresses, secrets (encrypted by Borg), configmaps, CNPG cluster spec, pooler spec
- Push to Borg alongside the etcd snapshots
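The export step could be sketched as follows (the output directory is an assumption; `cluster` and `pooler` are the CNPG CRD resource names):

```shell
# Export the resources listed above as YAML before the daily Borg run.
OUT=/var/backups/k8s-manifests
mkdir -p "$OUT"
for ns in postgres docfast docfast-staging; do
  kubectl -n "$ns" get deployments,services,ingresses,secrets,configmaps \
    -o yaml > "$OUT/$ns.yaml"
done
kubectl -n postgres get cluster main-db -o yaml > "$OUT/cnpg-cluster.yaml"
kubectl -n postgres get pooler main-db-pooler -o yaml > "$OUT/cnpg-pooler.yaml"
```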

### 4. Recovery Procedure

1. Provision 3 fresh CAX11 nodes
2. Install K3s and restore the etcd snapshot (or: fresh K3s + re-apply manifests from Borg)
3. Restore the CNPG database from pg_dump
4. Update DNS to the new LB IP

Estimated recovery time: ~15-30 minutes
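The etcd-restore path could be sketched like this (archive and snapshot file names are placeholders; --cluster-reset-restore-path is the K3s restore mechanism):

```shell
# Pull the latest archive back from Borg and reset the cluster from the
# etcd snapshot it contains.
export BORG_REPO='ssh://u149513-sub11@u149513-sub11.your-backup.de:23/./k3s-cluster'
borg extract "::k3s-2026-02-19"       # placeholder archive name
systemctl stop k3s
k3s server \
  --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/SNAPSHOT-FILE
```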

### Future: CNPG Barman/S3 (when needed)

- Hetzner Object Storage (S3-compatible)
- Continuous WAL archiving for point-in-time recovery
- Worth it once the DB grows past ~1 GB or revenue justifies €5/mo
- Current DB: 7.6 MB — overkill for now

## Future Improvements

### Priority: High

- Borg backup — DONE: operational since 2026-02-19 (DB every 6h, full backup daily at 03:00 UTC)
- DNS: staging.docfast.dev → 46.225.37.135 — DONE: record live, staging ingress TLS working
- Persist HA spread constraints — CoreDNS scale, CNPG operator replicas, and pooler anti-affinity are runtime patches. Need infra-as-code (manifests in Git) to survive K3s upgrades/reinstalls
- Old server (167.235.156.214) — DONE: decommissioned and deleted

### Priority: Medium

- CNPG backup to S3 — upgrade from pg_dump to continuous WAL archiving when the DB grows
- Monitoring/alerting — Prometheus + Grafana stack, or a lightweight alternative (VictoriaMetrics)
- Resource limits tuning — current: 100m-1000m CPU, 256Mi-1Gi RAM per pod. Profile actual usage and right-size
- Network policies — restrict pod-to-pod traffic (e.g., only DocFast → PgBouncer, not direct to DB)
- Pod Disruption Budgets — ensure at least 1 pod stays running during voluntary disruptions (upgrades, drains)
- Automated K3s upgrades — system-upgrade-controller for rolling node updates

### Priority: Low

- Multi-project namespaces — SnapAPI and future products get their own namespaces + RBAC
- ServiceAccount per CEO agent — scoped kubectl access for autonomous deployment
- Horizontal Pod Autoscaler — scale DocFast replicas based on CPU/request load
- External Secrets Operator — centralized secret management instead of per-namespace secrets
- Loki for log aggregation — centralized logging instead of kubectl logs
- Node auto-scaling — Hetzner Cloud Controller Manager + Cluster Autoscaler