Files
nixos/service_monitoring_plan.md

12 KiB

Service Monitoring & Alerting Plan

Date: 2025-05-31 Prepared by: Roo (AI Technical Leader)

1. Goal

Implement "service down" alerting in Grafana for OpenLDAP, Postfix, Dovecot, Gitea, Gitea Action Runners (via Gitea server metrics), and WireGuard. All configurations are to be managed via NixOS, using VictoriaMetrics as the central metrics backend and Grafana for alerting.

2. Hosts Involved

  • fw:
    • Runs Gitea (in a container), Gitea Action Runner MicroVMs, and WireGuard interfaces.
    • Will host vmagent for metrics collection.
    • Will host wireguard_exporter.
  • mail:
    • Runs OpenLDAP, Postfix, and Dovecot.
    • Will host vmagent for metrics collection.
    • Will host openldap_exporter, postfix_exporter, and dovecot_exporter.
  • web-arm:
    • Runs the central VictoriaMetrics service (accessible as victoria-server.cloonar.com).
    • Runs Grafana.
    • Will host the NixOS-provisioned alert definitions for Grafana.

3. Strategy Overview

The strategy involves three main phases for each service:

  1. Metrics Exposure: Ensure each service (or an associated exporter) provides Prometheus-compatible metrics.
  2. Metrics Collection: Configure vmagent on the relevant hosts (fw, mail) to scrape these metrics and send them to the central VictoriaMetrics instance on web-arm.
  3. Alert Definition: Define alert rules in Grafana (on web-arm) using NixOS provisioning. These rules will query VictoriaMetrics and trigger notifications (via the existing Pushover contact point) if a service is detected as down.

4. Detailed Plan

4.1. Metrics Exposure & Collection (Modular NixOS Approach)

4.1.1. Gitea Server (on fw host)

  • Action: Enable built-in Prometheus metrics in hosts/fw/modules/gitea.nix.
    • Modify services.gitea.settings within the Gitea container's configuration.
    • Example:
      metrics = {
        ENABLED = true;
        TOKEN = "your_secure_token_here"; // Optional: Consider if a token is needed/desired
      };
      
    • The Gitea /metrics endpoint will also provide status information for Gitea Action Runners.
  • vmagent Scrape Job: Defined in hosts/fw/modules/gitea.nix.
    config.services.vmagent.scrapeJobs = [
      {
        job_name = "gitea";
        static_configs = [{
          targets = ["<gitea_container_ip_or_localhost>:<gitea_port>"]; // e.g., "localhost:3001" if vmagent is on host and can reach container port
        }];
        // metrics_path defaults to /metrics
      }
    ];
    

4.1.2. Gitea Action Runners (on fw host, MicroVMs)

  • Action: Monitoring will be performed by querying metrics exposed by the Gitea server itself (see 4.1.1). No separate node_exporter or specific scrape jobs for the runner VMs will be added for this phase of alerting.

4.1.3. OpenLDAP (on mail host)

  • Action:
    1. Add pkgs.openldap_exporter to the mail host's system packages.
    2. Configure services.openldap-exporter in hosts/mail/modules/openldap.nix.
      • Ensure it points to the local OpenLDAP instance (e.g., ldap:///) and has permissions for cn=monitor.
      • Default port is 9330.
  • vmagent Scrape Job: Defined in hosts/mail/modules/openldap.nix.
    config.services.vmagent.scrapeJobs = [
      {
        job_name = "openldap";
        static_configs = [{ targets = ["localhost:9330"]; }];
      }
    ];
    

4.1.4. Postfix (on mail host)

  • Action:
    1. Add pkgs.postfix_exporter to the mail host's system packages.
    2. Configure services.postfix-exporter in hosts/mail/modules/postfix.nix.
      • May require log file access or postconf permissions.
      • Default port is 9154.
  • vmagent Scrape Job: Defined in hosts/mail/modules/postfix.nix.
    config.services.vmagent.scrapeJobs = [
      {
        job_name = "postfix";
        static_configs = [{ targets = ["localhost:9154"]; }];
      }
    ];
    

4.1.5. Dovecot (on mail host)

  • Action:
    1. In hosts/mail/modules/dovecot.nix, enable Dovecot's internal statistics service, making it accessible (e.g., via a local socket or TCP port).
    2. Add pkgs.dovecot_exporter to the mail host's system packages.
    3. Configure services.dovecot-exporter in hosts/mail/modules/dovecot.nix to connect to Dovecot's stats.
      • Default port is 9166.
  • vmagent Scrape Job: Defined in hosts/mail/modules/dovecot.nix.
    config.services.vmagent.scrapeJobs = [
      {
        job_name = "dovecot";
        static_configs = [{ targets = ["localhost:9166"]; }];
      }
    ];
    

4.1.6. WireGuard (interfaces on fw host)

  • Action:
    1. Add pkgs.wireguard_exporter to the fw host's system packages.
    2. Configure services.wireguard-exporter in hosts/fw/modules/wireguard.nix.
      • Requires privileges for wg show all dump.
      • Default port is 9586.
  • vmagent Scrape Job: Defined in hosts/fw/modules/wireguard.nix.
    config.services.vmagent.scrapeJobs = [
      {
        job_name = "wireguard";
        static_configs = [{ targets = ["localhost:9586"]; }];
      }
    ];
    

4.1.7. Central vmagent Configuration (utils/modules/victoriametrics/default.nix)

  • Action:
    1. Ensure the services.vmagent.scrapeJobs option is defined to allow merging of job lists from various modules:
      # In utils/modules/victoriametrics/default.nix
      options.services.vmagent.scrapeJobs = lib.mkOption {
        type = lib.types.listOf lib.types.attrs;
        default = [
          # Default job for the host's own node_exporter
          { job_name = "node_exporter_${config.networking.hostName}"; # Unique job name
            stream_parse = true;
            static_configs = [{ targets = ["${config.networking.hostName}:9100"]; }];
          }
        ];
        apply = lib.concatLists; // Or use lib.mkMerge for more complex merging if needed later
        description = "List of scrape_configs jobs for vmagent.";
      };
      
    2. The prometheus.yml generation script (configure_prom) will use config.services.vmagent.scrapeJobs.
    3. vmagent on fw and mail hosts will continue to use config.sops.secrets.victoria-agent-env for secure remote write to https://victoria-server.cloonar.com/api/v1/write (which is the web-arm host).

4.2. Grafana Alert Provisioning (on web-arm host)

  • Action:
    1. Create a new Nix file: hosts/web-arm/modules/grafana/alerting/infrastructure/default.nix.
    2. Populate this file with alert rule groups. Example structure:
      # In hosts/web-arm/modules/grafana/alerting/infrastructure/default.nix
      { lib, config, ... }:
      {
        services.grafana.provision.alerting.rules.settings.groups = lib.mkMerge [
          { // Gitea Server Alert
            name = "Gitea Service Alerts";
            folder = "Infrastructure Services";
            interval = "1m";
            rules = [
              {
                alert = "GiteaServerDown";
                expr = ''up{job="gitea"} == 0'';
                for = "2m";
                labels = { severity = "critical", service = "gitea" };
                annotations = { /* ... */ };
              }
            ];
          },
          { // Gitea Runner Alerts (via Gitea Server Metrics)
            name = "Gitea Runner Alerts";
            folder = "Infrastructure Services";
            interval = "1m";
            rules = [
              {
                alert = "GiteaRunnerOffline_git-runner-1";
                // Verify exact metric name & labels from Gitea's /metrics endpoint
                expr = ''gitea_actions_runner_status{runner_name="git-runner-1", status="offline"} == 1'';
                for = "5m";
                labels = { severity = "warning", service = "gitea-runner", runner = "git-runner-1" };
                annotations = { /* ... */ };
              },
              {
                alert = "GiteaRunnerOffline_git-runner-2";
                expr = ''gitea_actions_runner_status{runner_name="git-runner-2", status="offline"} == 1'';
                for = "5m";
                labels = { severity = "warning", service = "gitea-runner", runner = "git-runner-2" };
                annotations = { /* ... */ };
              }
            ];
          },
          { // OpenLDAP Alert
            name = "OpenLDAP Service Alerts"; /* ... */
            rules = [ { alert = "OpenLDAPDown"; expr = ''up{job="openldap"} == 0''; /* ... */ } ];
          },
          { // Postfix Alert
            name = "Postfix Service Alerts"; /* ... */
            rules = [ { alert = "PostfixDown"; expr = ''up{job="postfix"} == 0''; /* ... */ } ];
          },
          { // Dovecot Alert
            name = "Dovecot Service Alerts"; /* ... */
            rules = [ { alert = "DovecotDown"; expr = ''up{job="dovecot"} == 0''; /* ... */ } ];
          },
          { // WireGuard Alert
            name = "WireGuard Service Alerts"; /* ... */
            rules = [ { alert = "WireGuardExporterDown"; expr = ''up{job="wireguard"} == 0''; /* ... */ } ];
          }
        ];
      }
      
    3. Import this new rules file into hosts/web-arm/modules/grafana/default.nix:
      imports = [
        ./alerting/system/default.nix
        ./alerting/infrastructure/default.nix // <-- Add this line
        ./datasources/victoriametrics.nix
      ];
      
    4. Alerts will use the existing cp_dominik Pushover contact point by default.

5. Diagram of Metrics Flow

graph TD
    subgraph Host_FW ["Host: fw (vmagent)"]
        GiteaApp[Gitea in Container] -- :3001/metrics --> VMAgentFW
        RunnerVM1[Gitea Runner VM 1] -.-> GiteaApp; subgraph Gitea Runner VMs
            RunnerVM1
            RunnerVM2[Gitea Runner VM 2] -.-> GiteaApp;
        end
        WG[WireGuard Kernel] -- wg show --> WGExporter(wireguard_exporter :9586)
        WGExporter -- metrics --> VMAgentFW
        VMAgentFW[vmagent] -- remoteWrite --> VictoriaMetricsSvc
    end

    subgraph Host_Mail ["Host: mail (vmagent)"]
        OpenLDAPApp[OpenLDAP] -- cn=monitor --> OpenLDAPExporter(openldap_exporter :9330)
        OpenLDAPExporter -- metrics --> VMAgentMail
        PostfixApp[Postfix] -- logs/stats --> PostfixExporter(postfix_exporter :9154)
        PostfixExporter -- metrics --> VMAgentMail
        DovecotApp[Dovecot] -- stats --> DovecotExporter(dovecot_exporter :9166)
        DovecotExporter -- metrics --> VMAgentMail
        VMAgentMail[vmagent] -- remoteWrite --> VictoriaMetricsSvc
    end

    subgraph Host_Web_ARM ["Host: web-arm (victoria-server.cloonar.com)"]
        VictoriaMetricsSvc[VictoriaMetrics Service]
        Grafana[Grafana] -- queries --> VictoriaMetricsSvc
        Grafana -- Alert Rules (Nix Provisioned) --> Notifications[Pushover]
    end

    style VMAgentFW fill:#lightgreen
    style VMAgentMail fill:#lightgreen
    style Grafana fill:#lightblue
    style VictoriaMetricsSvc fill:#orange

6. Pre-Implementation Checklist & Notes

  1. Verify Exporter Package Names: Confirm the exact NixOS package names in pkgs for:
    • openldap_exporter
    • postfix_exporter
    • dovecot_exporter
    • wireguard_exporter
  2. Gitea Metrics Token: Decide on and implement a token strategy for Gitea's /metrics endpoint if desired for security.
  3. Gitea Runner Metrics: Inspect the Gitea server's /metrics endpoint to confirm the exact metric names and labels for runner status (e.g., gitea_actions_runner_status or gitea_actions_runners_total) to ensure alert queries are accurate.
  4. Exporter Ports: Default ports are assumed. Adjust configurations if non-default ports are used.
  5. Firewall Rules: Ensure vmagent can reach all local exporter ports and that exporters can reach their respective services.

This plan provides a comprehensive approach to enhancing service monitoring and alerting.