Operational Playbooks
PKI failures rarely look like cryptographic breaks. They look like expired OCSP responder keys at 03:00, missing OCSP nonces in vendor implementations, and threadpool exhaustion under unexpected load. This chapter collects the playbooks that turn these into 10-minute recoveries.
6.1 Why this chapter exists
The preceding chapters laid out what a PKI-anchored AI deployment should look like. This one is concrete: runnable recipes for the operations that keep it honest. The reader who has shipped a QTSP will recognise most of these from production experience; the reader who has not will save weeks by reading them in advance.
We assume EATF / Aletheia as the reference signing engine. The playbooks generalise: substituting a different engine costs about two paragraphs of adaptation per playbook.
6.2 TSA setup
A Time-Stamping Authority is the second-most-load-bearing component in a PKI-anchored AI pipeline (the first being the signing key itself). When the TSA goes offline, signing stops. When the TSA’s own certificate expires unexpectedly, every signature produced between the expiry and discovery is operationally suspect. Both failure modes are common.
flowchart TB
classDef ok fill:#dcfce7,stroke:#15803d,color:#14532d
classDef warn fill:#fef9c3,stroke:#a16207,color:#713f12
classDef bad fill:#fee2e2,stroke:#b91c1c,color:#7f1d1d
P[Primary TSA <br> e.g. SK ID Solutions EE-QTSP]:::ok
F[Fallback TSA <br> e.g. FreeTSA / DigiCert]:::ok
M[Monitoring <br> cert expiry, response latency]:::warn
A[Alerting <br> PagerDuty / on-call]:::warn
Sign[eatf sign] --> Try{Try primary}
Try -->|≤ 5s| P
Try -->|timeout| F
P --> M
F --> M
M -->|threshold| A
Selection
For most deployments (including all current EATF partner integrations), a single qualified TSA from a national QTSP suffices for legal purposes, plus one non-qualified fallback for availability. The fallback is not legally equivalent, but a non-qualified timestamp is still better than no timestamp.
| Provider | Qualified? | Notes |
|---|---|---|
| SK ID Solutions (Estonia) | yes | EE-QTSP; reliable; default for Estonian deployments |
| Telia (Sweden / EE) | yes | Cross-border qualified; commercial pricing |
| DigiCert TSA | yes (US/EU) | Cross-jurisdictional; lower legal weight in EU |
| FreeTSA.org | no | Operational fallback; community-run |
| Sectigo TSA | no | Free tier; commercial upgrade for QTSP grade |
Freshness windows
Every TSA token has a genTime field that records the moment
the timestamp witnesses. Verifiers should reject tokens whose
genTime deviates from the local clock by more than the
deployment-configured drift bound:
- Strict: ±60 seconds. Default for EATF in qualified mode.
- Relaxed: ±5 minutes. For non-clustered single-host signers whose clock is NTP-locked.
- Lenient: ±1 hour. For embedded / disconnected scenarios. Use with explicit deployer-policy approval.
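The three drift bounds above reduce to a single comparison. The following is a minimal sketch, not EATF's implementation: the `DRIFT_BOUNDS` table and the `gen_time_is_fresh` function are hypothetical names, and the token's genTime is assumed to have been parsed into a timezone-aware `datetime` already.

```python
from datetime import datetime, timedelta, timezone

# Drift bounds copied from the list above; the mode keys are illustrative.
DRIFT_BOUNDS = {
    "strict": timedelta(seconds=60),
    "relaxed": timedelta(minutes=5),
    "lenient": timedelta(hours=1),
}


def gen_time_is_fresh(gen_time: datetime, now: datetime, mode: str = "strict") -> bool:
    """Return True if gen_time lies within the drift bound for `mode`.

    The bound is symmetric: a genTime in the future is rejected as readily
    as one in the past.
    """
    return abs(now - gen_time) <= DRIFT_BOUNDS[mode]


now = datetime(2026, 5, 5, 20, 15, 0, tzinfo=timezone.utc)
token_time = datetime(2026, 5, 5, 20, 14, 33, tzinfo=timezone.utc)  # 27 s earlier
stale_time = now - timedelta(minutes=3)

print(gen_time_is_fresh(token_time, now, "strict"))   # True: 27 s is within 60 s
print(gen_time_is_fresh(stale_time, now, "strict"))   # False: 3 min exceeds 60 s
print(gen_time_is_fresh(stale_time, now, "relaxed"))  # True: 3 min is within 5 min
```

Looking the bound up by mode name means an unknown mode raises `KeyError` instead of silently falling back to a looser window.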
Redundancy
Two TSAs in different administrative jurisdictions, with retry logic at the signer side. EATF defaults: 5-second timeout against primary, then 10-second against fallback, then a hard fail producing an alert. The hard fail is deliberately non-recoverable to prevent silently producing untimestamped artefacts.
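The retry order described here can be sketched without any network code by treating each TSA as a callable. Everything named below (`TsaClient`, `TimestampHardFail`, `acquire_timestamp`, the `alert` callback) is hypothetical and only illustrates the primary, fallback, hard-fail sequence with the 5-second and 10-second defaults; it is not the EATF signer.

```python
from typing import Callable

# A TSA client is modelled as a callable taking (digest, timeout_seconds) and
# returning token bytes, or raising TimeoutError / ConnectionError on failure.
TsaClient = Callable[[bytes, float], bytes]


class TimestampHardFail(RuntimeError):
    """Raised when no TSA answered; the caller must not emit an artefact."""


def acquire_timestamp(
    digest: bytes,
    primary: TsaClient,
    fallback: TsaClient,
    alert: Callable[[str], None],
    primary_timeout: float = 5.0,
    fallback_timeout: float = 10.0,
) -> tuple[str, bytes]:
    """Try the primary TSA, then the fallback; alert and raise if both fail."""
    attempts = (
        ("primary", primary, primary_timeout),
        ("fallback", fallback, fallback_timeout),
    )
    errors = []
    for name, client, timeout in attempts:
        try:
            return name, client(digest, timeout)
        except (TimeoutError, ConnectionError) as exc:
            errors.append(f"{name}: {exc!r}")
    # Both attempts failed: alert, then refuse to continue.
    alert("TSA hard fail: " + "; ".join(errors))
    raise TimestampHardFail("no timestamp acquired; refusing to sign")


def down(digest: bytes, timeout: float) -> bytes:
    raise TimeoutError(f"no response within {timeout}s")


def up(digest: bytes, timeout: float) -> bytes:
    return b"token-for-" + digest


alerts: list[str] = []
source, token = acquire_timestamp(b"\x01\x02", down, up, alerts.append)
print(source)  # fallback
```

Returning which TSA answered lets the caller record a fallback timestamp as such, since the fallback carries lower legal weight.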
6.3 OCSP / CRL discipline
OCSP responder failures are the single most common production
incident in PKI-anchored systems. The 2026-04-07 EJBCA OCSP
responder-key-expiry incident — captured in
life/reflect/2026-04-07_ejbca-ops-dossier-from-tmp.md
— is the worked example for this section.
Staleness detection
A verifier’s OCSP cache must reject responses whose nextUpdate
is in the past, even if the response was fresh when fetched. A
related and subtler bug: a verifier accepting a response with no
nextUpdate field at all (some responder implementations omit it).
Reject those too — explicitly, with a logged warning.
stateDiagram-v2
[*] --> Idle
Idle --> Fetching: cache miss
Fetching --> Validating: response received
Fetching --> Failed: timeout / network error
Validating --> Caching: signature OK + nextUpdate > now
Validating --> Suspicious: missing nextUpdate
Validating --> Stale: nextUpdate ≤ now
Caching --> Idle: TTL set
Stale --> [*]: alert
Suspicious --> [*]: alert + reject
Failed --> Retry: backoff
Retry --> Fetching: after N seconds
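The validation branch of the state diagram can be sketched as a pure function. The `OcspResponse` record and `Verdict` enum below are hypothetical stand-ins for whatever a real OCSP library returns; the `INVALID` verdict for a bad signature is an assumption, since the diagram does not show that transition.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class Verdict(Enum):
    CACHE = "cache"            # signature OK and nextUpdate in the future
    STALE = "stale"            # nextUpdate <= now: alert
    SUSPICIOUS = "suspicious"  # nextUpdate missing: alert and reject
    INVALID = "invalid"        # assumption: bad signature, not in the diagram


@dataclass(frozen=True)
class OcspResponse:
    signature_ok: bool
    next_update: Optional[datetime]  # None models a responder that omits the field


def classify(resp: OcspResponse, now: datetime) -> Verdict:
    if not resp.signature_ok:
        return Verdict.INVALID
    if resp.next_update is None:
        return Verdict.SUSPICIOUS
    if resp.next_update <= now:
        return Verdict.STALE
    return Verdict.CACHE


def usable_from_cache(resp: OcspResponse, now: datetime) -> bool:
    """Re-classify on every cache read, so a response that was fresh
    when fetched is still rejected once nextUpdate has passed."""
    return classify(resp, now) is Verdict.CACHE


now = datetime(2026, 4, 7, 3, 0, tzinfo=timezone.utc)
fresh = OcspResponse(True, now + timedelta(hours=2))
print(classify(fresh, now).value)                       # cache
print(classify(OcspResponse(True, None), now).value)    # suspicious
print(usable_from_cache(fresh, now + timedelta(hours=3)))  # False: expired in cache
```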
The 03:00 incident playbook
When the on-call alert fires for “OCSP responder unreachable” or “OCSP signing certificate expired”:
- Identify scope. Which CA’s OCSP is failing? Which downstream verifiers depend on it? Are signing operations impacted, or only verification?
- Switch verifiers to soft-fail with audit. If your environment’s policy permits temporary soft-fail, switch verifiers to “soft-fail with explicit audit marker on every accepted response”. Never soft-fail silently.
- Stop signing. If signing is impacted (because EATF captures OCSP at sign time), stop signing until the responder is restored. A signing pipeline that produces artefacts without a timestamp or a revocation check is a future audit liability.
- Contact the CA. Most QTSP-grade CAs have a 24/7 incident contact in their certificate practices statement.
- After restoration: revoke or reissue every artefact signed in the soft-fail window if policy demands; otherwise, document the gap with a signed incident report (which is itself an EATF-signed artefact).
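The difference between "soft-fail with audit" and "soft-fail silently" in the steps above can be made concrete. This is a sketch under stated assumptions: `RevocationDecision`, `decide`, and the marker string are hypothetical, and the OCSP status is reduced to `"good"`, `"revoked"`, or `None` for an unavailable responder.

```python
from dataclasses import dataclass
from typing import Optional

SOFT_FAIL_MARKER = "SOFT_FAIL_OCSP_UNAVAILABLE"  # hypothetical marker text


@dataclass(frozen=True)
class RevocationDecision:
    accepted: bool
    audit_marker: Optional[str]  # set whenever acceptance relied on soft-fail


def decide(
    ocsp_status: Optional[str],
    soft_fail_permitted: bool,
    audit_log: list[str],
) -> RevocationDecision:
    """Accept on 'good'; accept an unavailable responder only when policy
    permits soft-fail, and always leave an audit entry when doing so."""
    if ocsp_status == "good":
        return RevocationDecision(True, None)
    if ocsp_status is None and soft_fail_permitted:
        audit_log.append(SOFT_FAIL_MARKER)
        return RevocationDecision(True, SOFT_FAIL_MARKER)
    # Revoked, unknown, or unavailable without soft-fail permission: reject.
    return RevocationDecision(False, None)


audit_log: list[str] = []
print(decide(None, True, audit_log).audit_marker)   # SOFT_FAIL_OCSP_UNAVAILABLE
print(decide(None, False, audit_log).accepted)      # False
print(decide("revoked", True, audit_log).accepted)  # False
print(len(audit_log))                               # 1
```

Because the marker is both returned and logged, the artefacts accepted during the soft-fail window can be enumerated afterwards for the revoke-or-document step.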
Capturing OCSP at sign time
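Repeated from chapter 1 §1.4 because it is the operational discipline that distinguishes a defensible signing pipeline from an indefensible one: the signer captures the OCSP response for its certificate at signing time and carries it in the evidence package, so a verifier can later confirm offline that the certificate was good when the signature was made.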
6.4 Audit ledger operations
The hash-chained audit ledger from chapter 5 §5.4 has its own operational discipline.
Hash-chain integrity verification
A daily cron job checks that every block in the per-tenant ledger
satisfies block[N].header.prev_hash == sha256(block[N-1]). If
the chain breaks, the most recent valid block is identified and
the operator is alerted. Recovery is non-trivial: a broken chain
implies tampering, a bug in the ledger writer that needs
investigation, or a clock-skew-driven block-id collision.
Treat it as a security incident until proven otherwise.
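The invariant block[N].header.prev_hash == sha256(block[N-1]) can be checked in a few lines. This sketch assumes blocks are plain dictionaries hashed over sorted-key JSON; the real ledger's canonical encoding is not specified here, so `block_digest`, `make_chain`, and `first_broken_link` are illustrative names only.

```python
import hashlib
import json
from typing import Optional


def block_digest(block: dict) -> str:
    # Assumption: sorted-key JSON stands in for the ledger's canonical encoding.
    encoded = json.dumps(block, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def make_chain(payloads: list[list[str]]) -> list[dict]:
    """Build a toy chain where each header stores the previous block's digest."""
    chain: list[dict] = []
    for i, events in enumerate(payloads):
        prev = block_digest(chain[-1]) if chain else "0" * 64
        chain.append({"header": {"id": i, "prev_hash": prev}, "events": events})
    return chain


def first_broken_link(chain: list[dict]) -> Optional[int]:
    """Return the lowest N whose prev_hash does not match block N-1, or None.

    A mismatch at N means block N-1 changed after N was sealed, or block N
    was written with a wrong prev_hash.
    """
    for n in range(1, len(chain)):
        if chain[n]["header"]["prev_hash"] != block_digest(chain[n - 1]):
            return n
    return None


chain = make_chain([["a"], ["b"], ["c"], ["d"]])
print(first_broken_link(chain))  # None: chain intact

chain[1]["events"] = ["tampered"]
print(first_broken_link(chain))  # 2: block 2 no longer matches block 1
```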
    # Daily integrity check, run from cron.
    # If pasted directly into a crontab entry, escape % as \% (cron treats % as a newline).
    eatf audit verify-chain \
      --tenant my-org \
      --since 2026-04-01 \
      --output /var/log/eatf/integrity-$(date +%F).log

Block-size tuning
Per-tenant block_size controls how many events accumulate before
a block is sealed and signed. Defaults:
| block_size | Use case |
|---|---|
| 1 | Real-time forensic; one event per block. Heavy on signatures. |
| 16 | Default for EATF deployments. Balanced. |
| 64 | Batch-style verticals (water-quality-ee daily bulletin). |
| 256 | Bulk operations only. |
Tuning happens at deployer-onboarding time; later changes require a new tenant scope (not a configuration change to an existing one) to keep the chain semantics clean.
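The trade-off in the table is simple arithmetic: larger blocks mean fewer signatures, but more events waiting unsealed at any moment. A small sketch, assuming a trailing partial block is eventually sealed as well (the `signatures_needed` helper and the 10,000-event workload are illustrative):

```python
import math


def signatures_needed(events: int, block_size: int) -> int:
    """Number of sealed (signed) blocks needed to cover `events` events."""
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    return math.ceil(events / block_size)


for size in (1, 16, 64, 256):
    # Up to block_size - 1 events can be waiting in a block that is not yet sealed.
    print(size, signatures_needed(10_000, size), size - 1)
# 1 10000 0
# 16 625 15
# 64 157 63
# 256 40 255
```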
Retention and rotation
EU AI Act high-risk deployments imply 5-year minimum retention for most decisions; some sectoral overlays (medical, financial) extend this to 10 or 30 years. Practical retention strategy:
- Hot tier (≤ 90 days): PostgreSQL or equivalent, fast query, replicated.
- Warm tier (90 days – 2 years): S3-compatible object storage (e.g. Cloudflare R2 in our deployments), on-demand query.
- Cold tier (2+ years): Cold archival (Glacier-equivalent), chain integrity preserved through periodic re-verification + archival timestamping (PAdES-LTA-style).
Rotation: every block in the ledger is independently retrievable, so rotation means moving older blocks from hot to warm, with the chain preserved by archival timestamps every quarter.
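Tier assignment by block age can be sketched as follows. The `storage_tier` function is hypothetical, and approximating "2 years" as 730 days is an assumption of this sketch, not a stated policy.

```python
from datetime import date, timedelta

HOT_LIMIT = timedelta(days=90)
WARM_LIMIT = timedelta(days=730)  # assumption: 2 years approximated as 730 days


def storage_tier(sealed_on: date, today: date) -> str:
    """Map a block's age to the hot / warm / cold tiers described above."""
    age = today - sealed_on
    if age <= HOT_LIMIT:
        return "hot"
    if age <= WARM_LIMIT:
        return "warm"
    return "cold"


today = date(2026, 5, 5)
print(storage_tier(date(2026, 4, 1), today))  # hot: 34 days old
print(storage_tier(date(2025, 5, 5), today))  # warm: 365 days old
print(storage_tier(date(2023, 1, 1), today))  # cold: more than 2 years old
```

A rotation job would run this over block seal dates and move any block whose computed tier differs from where it currently lives.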
6.5 EATF signing flow
The day-to-day EATF flow has four CLI commands, in this order. Together they form the operational lifecycle of any deployer integration.
flowchart LR
Init[eatf init <br> provision deployer key + registration] --> Doctor[eatf doctor <br> self-test signing, TSA, OCSP]
Doctor --> Sign[eatf sign <br> produce .aep for content]
Sign --> Verify[eatf verify <br> offline verification of .aep]
Verify --> Loop{Daily ops}
Loop -->|new agent| Reg[eatf agents sync]
Loop -->|verify chain| Audit[eatf audit verify-chain]
Reg --> Sign
Audit --> Sign
eatf init
Provision a deployer key pair (RSA-PSS-2048 + ML-DSA-65 hybrid
by default), register with the upstream Aletheia substrate, and
download the certificate chain into ~/.eatf/.
    eatf init \
      --deployer-uri https://my-org.example.com \
      --aletheia-endpoint https://h2oatlas.ee/api \
      --hybrid-pqc \
      --tsa-primary https://timestamp.sk.ee \
      --tsa-fallback https://freetsa.org/tsr

eatf doctor
Self-test the entire pipeline before any production signing. This is the single most underused command in deployer onboarding.
    eatf doctor

    ✓ Deployer key loaded (RSA-PSS-2048 + ML-DSA-65 hybrid)
    ✓ Certificate chain validates against EE-TL anchor
    ✓ TSA primary reachable: 142 ms median round-trip
    ✓ TSA fallback reachable: 218 ms median round-trip
    ✓ OCSP responder reachable; freshness 2 hours
    ✓ Audit ledger writable; current head: block #4592
    ✓ All checks passed.

A red dot here is a hard stop: do not sign in production until
eatf doctor is green.
eatf sign
The actual signing operation. Takes content (file, stdin, or URL)
and emits an .aep Evidence Package.
    eatf sign \
      --content ./report.pdf \
      --content-type application/pdf \
      --model-id water-quality-classifier-v3.2.1 \
      --policy-version water-policy-v2.1.0 \
      --output ./report.aep \
      --metadata '{"chapter":"01","lang":"en","version":"v0.2"}'

    ✓ Canonical CBOR payload constructed (hash 0xab...cd)
    ✓ RSA-PSS signature produced
    ✓ ML-DSA-65 signature produced
    ✓ TSA token acquired (SK ID Solutions, 2026-05-05T20:14:33Z)
    ✓ OCSP captured (fresh_at 2026-05-05T20:14:33Z)
    ✓ Evidence package written: ./report.aep (4.2 KB)
    ✓ Audit-ledger event #4593 appended to block #1147

eatf verify
Offline verification of any .aep package. Does not contact the
network unless --live-ocsp is set.
    eatf verify ./report.aep

    Package ID: 0x8f3e4a2b...
    Signed by: https://my-org.example.com (deployer)
    Signed at: 2026-05-05T20:14:33Z (TSA-anchored)
    Algorithms: RSA-PSS-2048 + ML-DSA-65 (hybrid)

    ✓ All signatures validate
    ✓ Certificate chain validates under EE-TL trust anchors
    ✓ OCSP captured at sign time confirms cert was good
    ✓ TSA token validates under SK ID Solutions
    ✓ Package id binding correct (anti-replay)
    ✓ VERIFIED.

6.6 Incident response
Three concrete scenarios follow. The list is not exhaustive; these are the three that EATF deployers have actually hit.
Incident A — Signature fails to verify in production
Symptoms: A downstream verifier rejects an .aep package that
we know we produced legitimately.
Diagnosis order:
- Run eatf verify --explain against the package. The verifier should tell you which check failed.
- If signature: re-check the signing certificate’s expiry and revocation status against the captured OCSP.
- If chain: re-fetch the EE-TL and re-validate; the supervisory body may have rotated keys.
- If TSA: check the TSA’s own cert chain.
- If none of the above: log it as a verifier bug and route to the maintainers.
Incident B — TSA is offline
Symptoms: Signing fails; eatf doctor reports TSA unreachable.
Decision tree:
flowchart TB
A[TSA primary down] --> B{Fallback up?}
B -->|yes| C[Switch to fallback <br> lower legal weight, audit]
B -->|no| D{Critical signing ops?}
D -->|yes| E[Stop signing, escalate]
D -->|no| F[Defer signing, retry every 5 min]
C --> G[Document downgrade in incident log]
G --> H[Re-anchor with primary TSA when restored]
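The decision tree above is small enough to encode directly, which makes it usable from a runbook script or an alert handler. The `tsa_outage_actions` function is a hypothetical sketch; the action strings simply restate the diagram's nodes.

```python
def tsa_outage_actions(fallback_up: bool, critical_signing: bool) -> list[str]:
    """Return the ordered actions for a primary-TSA outage, per the decision tree."""
    if fallback_up:
        return [
            "switch to fallback TSA (lower legal weight, audited)",
            "document downgrade in incident log",
            "re-anchor with primary TSA when restored",
        ]
    if critical_signing:
        return ["stop signing", "escalate"]
    return ["defer signing", "retry every 5 minutes"]


print(tsa_outage_actions(fallback_up=False, critical_signing=True))
# ['stop signing', 'escalate']
```

Note that `critical_signing` is only consulted when the fallback is also down, which mirrors the diagram: with a working fallback, signing continues regardless of criticality.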
Incident C — CA rotates OCSP signing key silently
Symptoms: OCSP responses suddenly fail signature verification on previously-working packages.
Recovery:
- Fetch the CA’s updated AuthorityInfoAccess extension.
- Re-fetch the OCSP responder’s new signing certificate.
- Push the updated trust anchor / responder cert into the verifier’s cache.
- Re-run eatf doctor to confirm.
- Replay any verifications that failed during the gap.
This is rare but not unheard of: OCSP responder key rotation by the CA is policy-permitted but should be announced. When the CA fails to announce it, the deployer is left to discover the change. Hence the monitoring discipline in §6.2.
6.7 What this chapter set up for chapter 7
Chapter 7 walks the substrate-and-vertical thesis through five sectors: environmental monitoring, building audit, education, medical, financial. Each vertical inherits the operational discipline of this chapter: TSA selection, OCSP capture, audit-ledger discipline, EATF flow, incident playbooks. The verticals differ in policy and disclosure surface, not in the underlying operations.
6.8 Outgoing links
- → Chapter 7 · the substrate in production across five domains.
- → Chapter 8 · 30-year retention, PQC longevity, and other open operational problems.
- → Aletheia / EATF reference implementation.
- → life/reflect/2026-04-07_ejbca-ops-dossier-from-tmp.md for the worked OCSP-responder-expiry incident.