Service Monitoring & Alerting Plan
Date: 2025-05-31
Prepared by: Roo (AI Technical Leader)
1. Goal
Implement "service down" alerting in Grafana for OpenLDAP, Postfix, Dovecot, Gitea, Gitea Action Runners (via Gitea server metrics), and WireGuard. All configurations are to be managed via NixOS, using VictoriaMetrics as the central metrics backend and Grafana for alerting.
2. Hosts Involved
- fw:
  - Runs Gitea (in a container), Gitea Action Runner MicroVMs, and WireGuard interfaces.
  - Will host vmagent for metrics collection.
  - Will host wireguard_exporter.
- mail:
  - Runs OpenLDAP, Postfix, and Dovecot.
  - Will host vmagent for metrics collection.
  - Will host openldap_exporter, postfix_exporter, and dovecot_exporter.
- web-arm:
  - Runs the central VictoriaMetrics service (accessible as victoria-server.cloonar.com).
  - Runs Grafana.
  - Will host the NixOS-provisioned alert definitions for Grafana.
3. Strategy Overview
The strategy involves three main phases for each service:
- Metrics Exposure: Ensure each service (or an associated exporter) provides Prometheus-compatible metrics.
- Metrics Collection: Configure vmagent on the relevant hosts (fw, mail) to scrape these metrics and send them to the central VictoriaMetrics instance on web-arm.
- Alert Definition: Define alert rules in Grafana (on web-arm) using NixOS provisioning. These rules will query VictoriaMetrics and trigger notifications (via the existing Pushover contact point) if a service is detected as down.
4. Detailed Plan
4.1. Metrics Exposure & Collection (Modular NixOS Approach)
4.1.1. Gitea Server (on fw host)
- Action: Enable built-in Prometheus metrics in hosts/fw/modules/gitea.nix.
  - Modify services.gitea.settings within the Gitea container's configuration.
  - Example:

        metrics = {
          ENABLED = true;
          # Optional: set a token only if the endpoint should require authentication.
          TOKEN = "your_secure_token_here";
        };

  - The Gitea /metrics endpoint is expected to also provide status information for Gitea Action Runners (to be confirmed, see section 6).
- vmagent Scrape Job: Defined in hosts/fw/modules/gitea.nix.

      config.services.vmagent.scrapeJobs = [
        {
          job_name = "gitea";
          static_configs = [{
            # e.g., "localhost:3001" if vmagent runs on the host and can reach the container port
            targets = [ "<gitea_container_ip_or_localhost>:<gitea_port>" ];
          }];
          # metrics_path defaults to /metrics
        }
      ];
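If a metrics token is configured, the scrape job has to present it. Gitea expects the token as a bearer credential, and vmagent accepts the Prometheus-style `authorization` block for that. The following is a minimal sketch, not the final configuration: the secret name `gitea-metrics-token` and the target `localhost:3001` are assumptions, and it presumes the repository's sops-nix setup and the `services.vmagent.scrapeJobs` option described in 4.1.7.

```nix
# hosts/fw/modules/gitea.nix (sketch; secret name and port are assumptions)
{ config, ... }:
{
  # Hypothetical sops-nix secret holding the same value as Gitea's metrics TOKEN.
  # It must be readable by the user that vmagent runs as.
  sops.secrets.gitea-metrics-token = { };

  services.vmagent.scrapeJobs = [
    {
      job_name = "gitea";
      # Sent as "Authorization: Bearer <token>" on every scrape.
      authorization = {
        type = "Bearer";
        credentials_file = config.sops.secrets.gitea-metrics-token.path;
      };
      static_configs = [{ targets = [ "localhost:3001" ]; }];
    }
  ];
}
```

Keeping the token in a file avoids writing it into the world-readable Nix store, which is what a literal `TOKEN = "..."` in `services.gitea.settings` would do. Whether the Gitea module in the pinned nixpkgs offers a file-based token option should be checked before choosing this route.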
4.1.2. Gitea Action Runners (on fw host, MicroVMs)
- Action: Monitoring will be performed by querying metrics exposed by the Gitea server itself (see 4.1.1). No separate node_exporter or specific scrape jobs for the runner VMs will be added for this phase of alerting.
4.1.3. OpenLDAP (on mail host)
- Action:
  - Add pkgs.openldap_exporter to the mail host's system packages.
  - Configure services.openldap-exporter in hosts/mail/modules/openldap.nix.
    - Ensure it points to the local OpenLDAP instance (e.g., ldap:///) and has permissions for cn=monitor.
    - Default port is 9330.
- vmagent Scrape Job: Defined in hosts/mail/modules/openldap.nix.

      config.services.vmagent.scrapeJobs = [
        {
          job_name = "openldap";
          static_configs = [{ targets = [ "localhost:9330" ]; }];
        }
      ];
4.1.4. Postfix (on mail host)
- Action:
  - Add pkgs.postfix_exporter to the mail host's system packages.
  - Configure services.postfix-exporter in hosts/mail/modules/postfix.nix.
    - May require log file access or postconf permissions.
    - Default port is 9154.
- vmagent Scrape Job: Defined in hosts/mail/modules/postfix.nix.

      config.services.vmagent.scrapeJobs = [
        {
          job_name = "postfix";
          static_configs = [{ targets = [ "localhost:9154" ]; }];
        }
      ];
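The option path services.postfix-exporter above is a working name. In upstream nixpkgs, exporter modules are generally exposed as `services.prometheus.exporters.<name>`, which also creates the systemd unit and a dedicated user, so adding the package to the system packages may be unnecessary. A minimal sketch under that assumption (to be verified against the pinned nixpkgs revision), with the scrape target derived from the exporter's port so the two cannot drift apart:

```nix
# hosts/mail/modules/postfix.nix (sketch; assumes the upstream exporter module)
{ config, ... }:
let
  exporter = config.services.prometheus.exporters.postfix;
in
{
  services.prometheus.exporters.postfix = {
    enable = true;
    port = 9154;
    # Only vmagent on the same host needs to reach the exporter.
    listenAddress = "127.0.0.1";
    # Read Postfix log entries from the systemd journal instead of a log file.
    systemd.enable = true;
  };

  services.vmagent.scrapeJobs = [
    {
      job_name = "postfix";
      static_configs = [{
        targets = [ "127.0.0.1:${toString exporter.port}" ];
      }];
    }
  ];
}
```

Binding to the loopback address keeps the metrics endpoint off the network, so no firewall rule is needed for it.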
4.1.5. Dovecot (on mail host)
- Action:
  - In hosts/mail/modules/dovecot.nix, enable Dovecot's internal statistics service, making it accessible (e.g., via a local socket or TCP port).
  - Add pkgs.dovecot_exporter to the mail host's system packages.
  - Configure services.dovecot-exporter in hosts/mail/modules/dovecot.nix to connect to Dovecot's stats.
    - Default port is 9166.
- vmagent Scrape Job: Defined in hosts/mail/modules/dovecot.nix.

      config.services.vmagent.scrapeJobs = [
        {
          job_name = "dovecot";
          static_configs = [{ targets = [ "localhost:9166" ]; }];
        }
      ];
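For Dovecot the fragile part is the link between the exporter and the stats socket. The sketch below assumes the upstream `services.prometheus.exporters.dovecot` module rather than a services.dovecot-exporter option, and the socket path is an assumed example that has to match whatever stats listener is configured in Dovecot itself.

```nix
# hosts/mail/modules/dovecot.nix (sketch; module path and socket path are assumptions)
{ ... }:
{
  services.prometheus.exporters.dovecot = {
    enable = true;
    port = 9166;
    listenAddress = "127.0.0.1";
    # Assumed location of the stats socket. The exporter's user needs
    # read/write access to it, which is granted on the Dovecot side.
    socketPath = "/var/run/dovecot2/old-stats";
  };
}
```

If the exporter starts but returns no Dovecot metrics, socket permissions are the first thing to check: `up{job="dovecot"}` stays at 1 in that case, because it reflects only whether the exporter itself answered the scrape.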
4.1.6. WireGuard (interfaces on fw host)
- Action:
  - Add pkgs.wireguard_exporter to the fw host's system packages.
  - Configure services.wireguard-exporter in hosts/fw/modules/wireguard.nix.
    - Requires privileges for wg show all dump.
    - Default port is 9586.
- vmagent Scrape Job: Defined in hosts/fw/modules/wireguard.nix.

      config.services.vmagent.scrapeJobs = [
        {
          job_name = "wireguard";
          static_configs = [{ targets = [ "localhost:9586" ]; }];
        }
      ];
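As with the mail exporters, the upstream module is likely `services.prometheus.exporters.wireguard` rather than services.wireguard-exporter. A minimal sketch under that assumption; whether the module already grants the privileges needed for `wg show all dump` should be confirmed on the fw host.

```nix
# hosts/fw/modules/wireguard.nix (sketch; assumes the upstream exporter module)
{ config, ... }:
let
  exporter = config.services.prometheus.exporters.wireguard;
in
{
  services.prometheus.exporters.wireguard = {
    enable = true;
    port = 9586;
    # fw is the firewall host, so keep the endpoint on loopback.
    listenAddress = "127.0.0.1";
  };

  services.vmagent.scrapeJobs = [
    {
      job_name = "wireguard";
      static_configs = [{
        targets = [ "127.0.0.1:${toString exporter.port}" ];
      }];
    }
  ];
}
```

Note that `up{job="wireguard"}` reports only whether the exporter is reachable, not whether any tunnel is passing traffic. If tunnel health matters later, the exporter's latest-handshake metric (the exact name should be read from the running exporter's output) could back a follow-up rule.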
4.1.7. Central vmagent Configuration (utils/modules/victoriametrics/default.nix)
- Action:
  - Ensure the services.vmagent.scrapeJobs option is defined to allow merging of job lists from various modules:

        # In utils/modules/victoriametrics/default.nix
        options.services.vmagent.scrapeJobs = lib.mkOption {
          # listOf already concatenates the definitions from all modules,
          # so no `apply` function is needed.
          type = lib.types.listOf lib.types.attrs;
          default = [ ];
          description = "List of scrape_configs jobs for vmagent.";
        };

        # Declared as a regular definition rather than as the option default:
        # a default is discarded as soon as any module defines the option,
        # whereas definitions are merged with the per-service jobs.
        config.services.vmagent.scrapeJobs = [
          {
            # Job for the host's own node_exporter; the job name is unique per host
            job_name = "node_exporter_${config.networking.hostName}";
            stream_parse = true;
            static_configs = [{ targets = [ "${config.networking.hostName}:9100" ]; }];
          }
        ];

  - The prometheus.yml generation script (configure_prom) will use config.services.vmagent.scrapeJobs.
  - vmagent on fw and mail hosts will continue to use config.sops.secrets.victoria-agent-env for secure remote write to https://victoria-server.cloonar.com/api/v1/write (which is the web-arm host).
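How configure_prom turns the merged list into a file is not described in this plan. One possible approach, shown here only as a sketch, is to serialize the list with `builtins.toJSON`: JSON is a subset of YAML, so the result can be passed to vmagent's `-promscrape.config` flag. The file name under /etc is a hypothetical choice.

```nix
# utils/modules/victoriametrics/default.nix (sketch; the /etc path is an assumption)
{ config, pkgs, ... }:
let
  # All jobs contributed by the service modules, already merged by the module system.
  scrapeConfig = pkgs.writeText "prometheus.yml" (builtins.toJSON {
    scrape_configs = config.services.vmagent.scrapeJobs;
  });
in
{
  # Expose the generated file at a stable path that vmagent can be pointed at,
  # e.g. -promscrape.config=/etc/vmagent/prometheus.yml
  environment.etc."vmagent/prometheus.yml".source = scrapeConfig;
}
```

Because the file is derived from the option value, adding a service module with its own scrape job changes the generated configuration without touching this module.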
4.2. Grafana Alert Provisioning (on web-arm host)
- Action:
  - Create a new Nix file: hosts/web-arm/modules/grafana/alerting/infrastructure/default.nix.
  - Populate this file with alert rule groups. Example structure:

        # In hosts/web-arm/modules/grafana/alerting/infrastructure/default.nix
        { lib, config, ... }:
        {
          # Schematic only: the rule fields below use Prometheus-style shorthand
          # (alert/expr). Grafana's provisioning schema expects uid, title,
          # condition and data per rule, so translate them before deploying.
          # groups is a plain list; lists from other modules are concatenated.
          services.grafana.provision.alerting.rules.settings.groups = [
            {
              # Gitea Server Alert
              name = "Gitea Service Alerts";
              folder = "Infrastructure Services";
              interval = "1m";
              rules = [
                {
                  alert = "GiteaServerDown";
                  expr = ''up{job="gitea"} == 0'';
                  for = "2m";
                  labels = { severity = "critical"; service = "gitea"; };
                  annotations = { /* ... */ };
                }
              ];
            }
            {
              # Gitea Runner Alerts (via Gitea Server Metrics)
              name = "Gitea Runner Alerts";
              folder = "Infrastructure Services";
              interval = "1m";
              rules = [
                {
                  alert = "GiteaRunnerOffline_git-runner-1";
                  # Verify exact metric name & labels from Gitea's /metrics endpoint
                  expr = ''gitea_actions_runner_status{runner_name="git-runner-1", status="offline"} == 1'';
                  for = "5m";
                  labels = { severity = "warning"; service = "gitea-runner"; runner = "git-runner-1"; };
                  annotations = { /* ... */ };
                }
                {
                  alert = "GiteaRunnerOffline_git-runner-2";
                  expr = ''gitea_actions_runner_status{runner_name="git-runner-2", status="offline"} == 1'';
                  for = "5m";
                  labels = { severity = "warning"; service = "gitea-runner"; runner = "git-runner-2"; };
                  annotations = { /* ... */ };
                }
              ];
            }
            {
              # OpenLDAP Alert
              name = "OpenLDAP Service Alerts"; /* ... */
              rules = [ { alert = "OpenLDAPDown"; expr = ''up{job="openldap"} == 0''; /* ... */ } ];
            }
            {
              # Postfix Alert
              name = "Postfix Service Alerts"; /* ... */
              rules = [ { alert = "PostfixDown"; expr = ''up{job="postfix"} == 0''; /* ... */ } ];
            }
            {
              # Dovecot Alert
              name = "Dovecot Service Alerts"; /* ... */
              rules = [ { alert = "DovecotDown"; expr = ''up{job="dovecot"} == 0''; /* ... */ } ];
            }
            {
              # WireGuard Alert
              name = "WireGuard Service Alerts"; /* ... */
              rules = [ { alert = "WireGuardExporterDown"; expr = ''up{job="wireguard"} == 0''; /* ... */ } ];
            }
          ];
        }

  - Import this new rules file into hosts/web-arm/modules/grafana/default.nix:

        imports = [
          ./alerting/system/default.nix
          ./alerting/infrastructure/default.nix  # <-- Add this line
          ./datasources/victoriametrics.nix
        ];

  - Alerts will use the existing cp_dominik Pushover contact point by default.
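Grafana's file provisioning does not read Prometheus-style `alert`/`expr` keys; each rule carries a `uid`, a `title`, a list of query and expression steps in `data`, and a `condition` naming the step that decides whether the rule fires. The following sketch shows how the Gitea rule might look in that schema. The `uid`, the datasource UID `victoriametrics`, and the annotation text are illustrative assumptions, and the threshold step should be checked against the Grafana version running on web-arm.

```nix
# hosts/web-arm/modules/grafana/alerting/infrastructure/default.nix (sketch)
{ ... }:
{
  services.grafana.provision.alerting.rules.settings = {
    apiVersion = 1;
    groups = [
      {
        orgId = 1;
        name = "Gitea Service Alerts";
        folder = "Infrastructure Services";
        interval = "1m";
        rules = [
          {
            uid = "gitea-server-down"; # hypothetical; must be unique per rule
            title = "GiteaServerDown";
            condition = "B"; # refId of the step that decides firing
            for = "2m";
            # If the `up` series disappears entirely (for example because
            # vmagent on fw is down), treat that as an alert as well.
            noDataState = "Alerting";
            execErrState = "Alerting";
            labels = { severity = "critical"; service = "gitea"; };
            annotations = { summary = "Gitea metrics endpoint is not scrapeable"; }; # illustrative
            data = [
              {
                # Step A: instant query against VictoriaMetrics
                refId = "A";
                datasourceUid = "victoriametrics"; # assumption: UID of the provisioned datasource
                relativeTimeRange = { from = 300; to = 0; };
                model = { refId = "A"; expr = ''up{job="gitea"}''; instant = true; };
              }
              {
                # Step B: fire when the value of A is below 1
                refId = "B";
                datasourceUid = "__expr__";
                model = {
                  refId = "B";
                  type = "threshold";
                  expression = "A";
                  conditions = [
                    { evaluator = { type = "lt"; params = [ 1 ]; }; }
                  ];
                };
              }
            ];
          }
        ];
      }
    ];
  };
}
```

Setting `noDataState` to `Alerting` covers the case that a bare `up == 0` comparison misses: when the collector itself stops reporting, there is no series left to compare against zero.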
5. Diagram of Metrics Flow
graph TD
subgraph Host_FW ["Host: fw (vmagent)"]
GiteaApp[Gitea in Container] -- :3001/metrics --> VMAgentFW
subgraph RunnerVMs ["Gitea Runner VMs"]
RunnerVM1[Gitea Runner VM 1] -.-> GiteaApp
RunnerVM2[Gitea Runner VM 2] -.-> GiteaApp
end
WG[WireGuard Kernel] -- wg show --> WGExporter(wireguard_exporter :9586)
WGExporter -- metrics --> VMAgentFW
VMAgentFW[vmagent] -- remoteWrite --> VictoriaMetricsSvc
end
subgraph Host_Mail ["Host: mail (vmagent)"]
OpenLDAPApp[OpenLDAP] -- cn=monitor --> OpenLDAPExporter(openldap_exporter :9330)
OpenLDAPExporter -- metrics --> VMAgentMail
PostfixApp[Postfix] -- logs/stats --> PostfixExporter(postfix_exporter :9154)
PostfixExporter -- metrics --> VMAgentMail
DovecotApp[Dovecot] -- stats --> DovecotExporter(dovecot_exporter :9166)
DovecotExporter -- metrics --> VMAgentMail
VMAgentMail[vmagent] -- remoteWrite --> VictoriaMetricsSvc
end
subgraph Host_Web_ARM ["Host: web-arm (victoria-server.cloonar.com)"]
VictoriaMetricsSvc[VictoriaMetrics Service]
Grafana[Grafana] -- queries --> VictoriaMetricsSvc
Grafana -- Alert Rules (Nix Provisioned) --> Notifications[Pushover]
end
style VMAgentFW fill:lightgreen
style VMAgentMail fill:lightgreen
style Grafana fill:lightblue
style VictoriaMetricsSvc fill:orange
6. Pre-Implementation Checklist & Notes
- Verify Exporter Package Names: Confirm the exact NixOS package names in pkgs for:
  - openldap_exporter
  - postfix_exporter
  - dovecot_exporter
  - wireguard_exporter
- Gitea Metrics Token: Decide on and implement a token strategy for Gitea's /metrics endpoint if desired for security.
- Gitea Runner Metrics: Inspect the Gitea server's /metrics endpoint to confirm the exact metric names and labels for runner status (e.g., gitea_actions_runner_status or gitea_actions_runners_total) to ensure alert queries are accurate.
- Exporter Ports: Default ports are assumed. Adjust configurations if non-default ports are used.
- Firewall Rules: Ensure vmagent can reach all local exporter ports and that exporters can reach their respective services.
This plan provides a comprehensive approach to enhancing service monitoring and alerting.