Migrate OpenLDAP tenants to per-tenant LMDB environments #5

Open
opened 2026-05-20 16:01:25 +02:00 by dominik.polakovics · 1 comment

Problem Statement

All nine olcDatabase entries on mail.cloonar.com currently set olcDbDirectory = "/var/lib/openldap/data". There is a single LMDB environment (one data.mdb + one lock.mdb) on disk shared across the primary cloonar database and every per-Tenant database. The setup works only because slapd-mdb uses fixed sub-database names internally (id2entry, dn2id, etc.) — every Tenant's entries coexist in the same store, and slapd's per-request suffix routing is what keeps a query against dc=superbros,dc=tv from returning entries belonging to dc=cloonar,dc=com.

Concrete consequences of the shared-env shape:

  • Backups cannot be taken per Tenant. A borgbackup of the LMDB env is all-or-nothing.
  • slapcat -b <suffix> does not return per-suffix data — it dumps the entire shared env regardless of which -b is passed. Tooling assuming standard slapd-mdb semantics misbehaves.
  • Audit and forensics are coarse: "what does dc=foo,dc=bar contain on disk" cannot be answered without parsing the full env.
  • The env's LMDB size limit applies globally across all Tenants combined.
  • Two suffixes currently have entries in the env that are not served by any configured olcDatabase: dc=optiprot,dc=eu (8 entries) and dc=ghetto,dc=at (1 entry). There is also one entry under the typo suffix dc=cloonar,dc=co (.co instead of .com). All of these ride forward in every backup as cruft.
  • New engineers reading the Nix config see nine databases pointing at one directory and reasonably assume it is a bug. ADR-0001 documents the deliberateness, but the canonical fix is to make storage match topology.
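
A per-suffix distribution like the one cited above (orphans and the typo suffix included) can be reproduced from a full slapcat dump. The following is a minimal sketch, not the command used in the planning session: the function names are hypothetical, DNs are grouped by their trailing run of dc= components, and base64-encoded DNs (dn::) are skipped for brevity.

```python
from collections import Counter


def dc_suffix(dn: str) -> str:
    """Return the trailing run of dc= components of a DN, lowercased."""
    parts = [p.strip().lower() for p in dn.split(",")]
    tail = []
    for part in reversed(parts):
        if not part.startswith("dc="):
            break
        tail.append(part)
    return ",".join(reversed(tail))


def suffix_distribution(ldif_text: str) -> Counter:
    """Count entries per dc= suffix in an LDIF dump."""
    # Unfold LDIF continuation lines (a newline followed by one space).
    unfolded = ldif_text.replace("\r\n", "\n").replace("\n ", "")
    dns = [
        line[3:].strip()
        for line in unfolded.split("\n")
        # Assumption: plain-text DNs only; "dn::" (base64) lines are ignored.
        if line.startswith("dn:") and not line.startswith("dn::")
    ]
    return Counter(dc_suffix(dn) for dn in dns)
```

Note that an entry such as dc=sub,dc=cloonar,dc=com would be counted under its own three-component suffix, so the output is a starting point for inspection rather than a routing decision.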

Solution

Migrate each Tenant onto its own LMDB environment by setting olcDbDirectory = "/var/lib/openldap/<tenant-slug>/", where <tenant-slug> is derived deterministically from the olcSuffix. This is a one-time offline migration on mail.cloonar.com: snapshot the existing env, slapcat it to a single LDIF, split the LDIF into per-Tenant LDIFs by DN suffix, switch the Nix config, then slapadd each filtered LDIF into its Tenant's new directory.
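
The suffix-to-directory mapping can be mirrored in the migration tooling so the script and the Nix helper agree on paths. This is a sketch only, assuming the first-DC-component slug rule stated under Implementation Decisions; the real derivation lives in the mkTenant Nix helper, and tenant_slug and db_directories are hypothetical names.

```python
import re


def tenant_slug(suffix: str) -> str:
    """Return the first DC component of an LDAP suffix, lowercased."""
    first_rdn = suffix.split(",", 1)[0].strip()
    match = re.fullmatch(r"dc\s*=\s*([a-z0-9-]+)", first_rdn, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"suffix does not start with a dc= component: {suffix!r}")
    return match.group(1).lower()


def db_directories(suffixes, root="/var/lib/openldap"):
    """Map each suffix to its per-Tenant directory, rejecting slug collisions."""
    by_slug = {}
    for suffix in suffixes:
        slug = tenant_slug(suffix)
        if slug in by_slug:
            raise ValueError(
                f"slug collision: {by_slug[slug]!r} and {suffix!r} both map to {slug!r}"
            )
        by_slug[slug] = suffix
    return {suffix: f"{root}/{slug}/" for slug, suffix in by_slug.items()}
```

Failing loudly on a collision matches the static rejection planned for the helper: two suffixes that share a first DC component never silently share a directory.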

This is the target shape committed to in ADR-0001.

The "copy data.mdb to every per-Tenant directory then clean up later" alternative was considered and rejected. Cleanup via the LDAP API is impossible: slapd's suffix routing sends a delete request for cn=foo,dc=cloonar,dc=com to whichever database claims dc=cloonar,dc=com, never to a Tenant env where that DN exists only as a ghost. Offline cleanup requires the same slapcat / filter / slapadd cycle as the upfront-filter approach, and in the meantime every Tenant's env carries a full copy of the dataset (nine copies in total). Filter-then-slapadd takes the same elapsed effort and leaves a cleaner steady state.

User Stories

  1. As a fleet operator, I want each Tenant's data to live in its own LMDB environment, so that I can back up, restore, or wipe a single Tenant without touching others.
  2. As a fleet operator, I want the on-disk storage layout to match the LDAP namingContext layout (one env per suffix), so that the system is debuggable using stock OpenLDAP knowledge instead of an undocumented quirk.
  3. As a fleet operator, I want slapcat -b <suffix> to actually return only that suffix's entries, so that per-Tenant dumps for audit, debugging, or selective restore behave as documented.
  4. As a release operator, I want a verified backup of /var/lib/openldap/data/ recorded before migration, so that I have a documented rollback target.
  5. As a release operator, I want a written runbook covering the maintenance window — downtime estimate, exact slapcat/filter/slapadd commands, pre- and post-condition checks — so that the execution is predictable and another operator could run it.
  6. As a release operator, I want a post-migration verification step that confirms each Tenant returns the same number and identity of entries it did pre-migration, so that I can declare success without ambiguity.
  7. As a fleet operator, I want the migration to optionally sweep up the orphan suffixes (dc=optiprot,dc=eu, dc=ghetto,dc=at) and the dc=cloonar,dc=co typo entry, so that the post-migration state is also a cleanup.
  8. As a future maintainer, I want the per-Tenant-env shape reflected in the mkTenant helper (already introduced by the index-padding PRD) so that newly added Tenants acquire their own env automatically.
  9. As an on-call engineer, I want the migration to be a single discrete event with a clear rollback procedure, so that a midway failure does not leave the directory in a corrupted intermediate state.

Implementation Decisions

  • Migration is performed as an offline operation on mail.cloonar.com inside a maintenance window. Online migration without downtime is explicitly not pursued.
  • Migration steps:
    1. Announce maintenance window to the fleet. Stop or otherwise tolerate disruption in services that bind LDAP (postfix, dovecot, authelia, owncloud, home-assistant).
    2. Stop openldap.service.
    3. Snapshot /var/lib/openldap/data/ to a timestamped tarball — both locally and pushed to the off-host borgbackup repo. The tarball is the documented rollback target.
    4. Run slapcat -F /etc/openldap/slapd.d -b dc=cloonar,dc=com -l /tmp/openldap-all.ldif. The -b argument is incidental — every configured database in the current setup shares the env, so any -b dumps the same thing.
    5. Split /tmp/openldap-all.ldif into one file per Tenant by DN suffix using a small script kept in the repo (likely Python — needs to handle LDIF line continuations correctly).
    6. The split script optionally drops orphan suffixes (dc=optiprot,dc=eu, dc=ghetto,dc=at, dc=cloonar,dc=co). Decision on whether to sweep them is per-execution.
    7. Deploy the updated hosts/mail/modules/openldap.nix. The helper now emits per-Tenant olcDbDirectory. The Nix activation creates /etc/openldap/slapd.d/ afresh but does not populate the data directories.
    8. For each Tenant (including cloonar), create /var/lib/openldap/<slug>/, set ownership to openldap:openldap, and run slapadd -F /etc/openldap/slapd.d -b <suffix> -l <slug>.ldif.
    9. Start openldap.service.
    10. Verify per-Tenant entry counts via ldapsearch -Y EXTERNAL -H ldapi:/// -b <suffix> -s sub "(objectClass=*)" 1.1 | grep -c "^dn:". Counts must match the pre-migration awk-derived counts captured in the runbook.
    11. Verify cross-Tenant isolation: a search with ldapsearch -Y EXTERNAL -H ldapi:/// -b <other-suffix> still routes to that suffix's database and returns entries under that suffix only. This already worked pre-migration; the post-check confirms parity.
    12. Update the borgbackup configuration to ensure /var/lib/openldap/ (now containing multiple subdirectories) is fully included.
  • The mkTenant helper from the index-padding PRD gains a derived olcDbDirectory field. The slug is the first DC component of the suffix (e.g., dc=superbros,dc=tv → superbros). Collisions between Tenants sharing a first DC component are statically rejected by the helper (assertion at module level). If a real collision arises later, the slug rule is reconsidered.
  • The cloonar primary database also gets its own dedicated directory (/var/lib/openldap/cloonar/), not the legacy data/ path. The legacy data/ is removed entirely after migration.
  • The migration script is kept under scripts/ so it can be reused if a similar migration is ever needed again (or in a staging dry-run).
  • The dry-run gate is mandatory: a copy of data.mdb is taken to a non-production environment (local VM or staging), the script is executed end-to-end there, and counts are verified before the production maintenance window opens.
  • Rollback procedure, documented inline in the runbook: stop slapd, restore the tarball from step 3 over /var/lib/openldap/, git revert the Nix change, redeploy. The trigger is time-bounded: if the post-migration verification at step 10 finds count mismatches, roll back immediately.
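
Steps 5 and 6 hinge on the split script handling folded lines correctly. The following is a minimal sketch of one possible shape, not the script from the repo: parse_entries, normalize_dn, split_by_suffix and render are hypothetical names, and DN normalization is deliberately naive (lowercasing and trimming around commas, with no handling of escaped commas inside suffixes).

```python
import base64


def parse_entries(ldif_text):
    """Yield (dn, raw_block) for each entry in an LDIF dump."""
    for block in ldif_text.replace("\r\n", "\n").split("\n\n"):
        block = block.strip("\n")
        if not block:
            continue
        # Unfold continuations only to read the DN; the raw block stays verbatim.
        dn = None
        for line in block.replace("\n ", "").split("\n"):
            if line.startswith("dn::"):  # base64-encoded DN
                dn = base64.b64decode(line[4:].strip()).decode("utf-8")
                break
            if line.startswith("dn:"):
                dn = line[3:].strip()
                break
        if dn is not None:  # skips headers such as "version: 1"
            yield dn, block


def normalize_dn(dn):
    """Naive normalization: trim around commas and lowercase."""
    return ",".join(part.strip() for part in dn.split(",")).lower()


def split_by_suffix(ldif_text, tenant_suffixes):
    """Group entry blocks by owning suffix; unmatched entries are orphans."""
    # Longest suffix first, so a nested suffix would win over its parent.
    suffixes = sorted((normalize_dn(s) for s in tenant_suffixes), key=len, reverse=True)
    per_suffix = {s: [] for s in suffixes}
    orphans = []
    for dn, block in parse_entries(ldif_text):
        ndn = normalize_dn(dn)
        owner = next((s for s in suffixes if ndn == s or ndn.endswith("," + s)), None)
        if owner is None:
            orphans.append(block)
        else:
            per_suffix[owner].append(block)
    return per_suffix, orphans


def render(blocks):
    """Serialize entry blocks back to LDIF, one blank line after each entry."""
    return "".join(block + "\n\n" for block in blocks)
```

Matching on "," + suffix rather than a bare string suffix keeps dc=cloonar,dc=co from being swallowed by dc=cloonar,dc=com. Entries are emitted in their original dump order, so parents should still precede children when each file is handed to slapadd. Orphans are returned separately, which leaves the per-execution decision of dropping or keeping them to the caller.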

Testing Decisions

  • A good test for a stateful one-time migration verifies that the post-state equals the pre-state plus the intended diff. Concretely: every Tenant has the same number and identity of entries as before, and no Tenant can see entries from any other.
  • Pre-migration measurement: capture per-suffix entry counts using the awk grouping (sort + uniq -c over the DC-suffix component of every DN), record in the runbook.
  • Post-migration measurement: per-Tenant ldapsearch count under -b <suffix> -s sub must equal the pre-migration count for that suffix.
  • Cross-Tenant isolation check: a search base outside a Tenant's suffix returns no entries from that Tenant's env. This was already true pre-migration (because of suffix routing); the post-check confirms parity.
  • Mandatory dry-run on a non-production environment. Acceptance criterion for proceeding to production: every per-Tenant count post-script equals pre-script, including for the cloonar primary.
  • Prior art for the verification approach: the diagnostic awk + ldapsearch -Y EXTERNAL -H ldapi:/// commands used during the planning session, which produced the per-suffix entry distribution that informed this PRD.
  • No unit-test layer is meaningful here — the migration is a one-shot offline procedure executed by a human operator from a runbook. The "test" is the dry-run + the post-migration verification.
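
The "same number and identity" criterion can be checked by comparing DN sets rather than bare counts. A minimal sketch under stated assumptions: the inputs are LDIF text captured before and after the migration (for example saved ldapsearch output), all function names are hypothetical, and DNs are normalized only by lowercasing and trimming around commas.

```python
import base64


def dns_in_ldif(ldif_text):
    """Collect normalized DNs from LDIF text (slapcat or ldapsearch output)."""
    # Unfold continuation lines so wrapped DNs are read whole.
    unfolded = ldif_text.replace("\r\n", "\n").replace("\n ", "")
    dns = set()
    for line in unfolded.split("\n"):
        if line.startswith("dn::"):  # base64-encoded DN
            dn = base64.b64decode(line[4:].strip()).decode("utf-8")
        elif line.startswith("dn:"):
            dn = line[3:].strip()
        else:
            continue
        dns.add(",".join(part.strip() for part in dn.split(",")).lower())
    return dns


def verify(pre_by_suffix, post_by_suffix):
    """Return only the suffixes whose DN sets differ; an empty dict means parity."""
    report = {}
    for suffix in sorted(set(pre_by_suffix) | set(post_by_suffix)):
        pre = pre_by_suffix.get(suffix, set())
        post = post_by_suffix.get(suffix, set())
        if pre != post:
            report[suffix] = {
                "missing": sorted(pre - post),
                "unexpected": sorted(post - pre),
            }
    return report


def foreign_dns(dns, suffix):
    """List DNs that do not sit at or under the given suffix (isolation check)."""
    s = suffix.lower()
    return sorted(d for d in dns if d != s and not d.endswith("," + s))
```

An empty result from verify, plus an empty foreign_dns list for every Tenant, would correspond to the acceptance criterion for both the dry-run and the production run.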

Out of Scope

  • The structural {N} index padding fix — separate, prerequisite PRD.
  • Reorganising the LDAP tree to fold Tenants under cloonar (option Z from the planning session). Rejected in ADR-0001 in favor of per-Tenant envs (Y).
  • Online migration without downtime. Explicitly accepts a maintenance window.
  • Changing the LDAP backend away from back-mdb. No change planned.
  • Per-Tenant authelia rules beyond the existing on/off flag inherited from PRD-1.
  • Splitting the cloonar primary database itself across multiple envs (e.g., one for users, one for system). Cloonar stays as one env at /var/lib/openldap/cloonar/.

Further Notes

  • Depends on the index-padding PRD landing first. The helper introduced there is the integration point for the per-Tenant olcDbDirectory change.
  • ADR-0001 commits to this target shape.
  • The pre/post entry-count verification is the single most important checkpoint. If counts disagree post-migration, roll back to the tarball before any client traffic is allowed.
  • Estimated downtime budget: 30–60 minutes for a clean run on the current ~130 entries. Dominated by tarball backup time, not slapd ops.
  • The orphan/typo cleanup is bundled as an option rather than required so the first execution can be strictly value-preserving if the operator prefers. The cleanup is then a trivial follow-up because each orphan would naturally have its own (or no) Tenant env.
Author
Owner

Depends on #4 landing first — the mkTenant helper introduced there is the integration point for the per-tenant olcDbDirectory change here.
Reference
Cloonar/nixos#5