Skip to content

Operational Playbooks

PKI failures rarely look like cryptographic breaks. They look like expired OCSP responder keys at 03:00, missing OCSP nonces in vendor implementations, and threadpool exhaustion under unexpected load. This chapter collects the playbooks that turn these into 10-minute recoveries.

The preceding chapters laid out what a PKI-anchored AI deployment should look like. This one is concrete: runnable recipes for the operations that keep it honest. The reader who has shipped a QTSP will recognise most of these from production experience; the reader who has not will save weeks by reading them in advance.

We assume EATF / Aletheia as the reference signing engine. The playbooks generalise: substituting a different engine costs about two paragraphs of adaptation per playbook.

A Time-Stamping Authority is the second-most-load-bearing component in a PKI-anchored AI pipeline (the first being the signing key itself). When the TSA goes offline, signing stops. When the TSA’s own certificate expires unexpectedly, every signature produced between the expiry and discovery is operationally suspect. Both failure modes are common.

flowchart TB
    classDef ok fill:#dcfce7,stroke:#15803d,color:#14532d
    classDef warn fill:#fef9c3,stroke:#a16207,color:#713f12
    classDef bad fill:#fee2e2,stroke:#b91c1c,color:#7f1d1d

    P[Primary TSA <br> e.g. SK ID Solutions EE-QTSP]:::ok
    F[Fallback TSA <br> e.g. FreeTSA / DigiCert]:::ok
    M[Monitoring <br> cert expiry, response latency]:::warn
    A[Alerting <br> Pagerduty / on-call]:::warn

    Sign[eatf sign] --> Try{Try primary}
    Try -->|≤ 5s| P
    Try -->|timeout| F
    P --> M
    F --> M
    M -->|threshold| A

For most deployments — including all current EATF partner integrations — a single qualified TSA from a national QTSP suffices for legal purposes, plus one non-qualified fallback for availability. The fallback is not legally equivalent, but a non-qualified timestamp is still better than no timestamp.

ProviderQualified?Notes
SK ID Solutions (Estonia)yesEE-QTSP; reliable; default for Estonian deployments
Telia (Sweden / EE)yesCross-border qualified; commercial pricing
DigiCert TSAyes (US/EU)Cross-jurisdictional; lower legal weight in EU
FreeTSA.orgnoOperational fallback; community-run
Sectigo TSAnoFree tier; commercial upgrade for QTSP grade

Every TSA token has a genTime field that determines the moment the timestamp witnesses. Verifiers should reject tokens whose genTime deviates from the local clock by more than the deployment- configured drift bound:

  • Strict: ±60 seconds. Default for EATF in qualified mode.
  • Relaxed: ±5 minutes. For non-clustered single-host signers whose clock is NTP-locked.
  • Lenient: ±1 hour. For embedded / disconnected scenarios. Use with explicit deployer-policy approval.

Two TSAs in different administrative jurisdictions, with retry logic at the signer side. EATF defaults: 5-second timeout against primary, then 10-second against fallback, then a hard fail producing an alert. The hard fail is deliberately non-recoverable to prevent silently producing untimestamped artefacts.

OCSP responder failures are the single most common production incident in PKI-anchored systems. The 2026-04-07 EJBCA OCSP responder-key-expiry incident — captured in life/reflect/2026-04-07_ejbca-ops-dossier-from-tmp.md — is the worked example for this section.

A verifier’s OCSP cache must reject responses whose nextUpdate is in the past, even if the response was fresh when fetched. A related and subtler bug: a verifier accepting a response with no nextUpdate field at all (some responder implementations omit it). Reject those too — explicitly, with a logged warning.

stateDiagram-v2
    [*] --> Idle
    Idle --> Fetching: cache miss
    Fetching --> Validating: response received
    Fetching --> Failed: timeout / network error
    Validating --> Caching: signature OK + nextUpdate > now
    Validating --> Suspicious: missing nextUpdate
    Validating --> Stale: nextUpdate ≤ now
    Caching --> Idle: TTL set
    Stale --> [*]: alert
    Suspicious --> [*]: alert + reject
    Failed --> Retry: backoff
    Retry --> Fetching: after N seconds

When the on-call alert fires for “OCSP responder unreachable” or “OCSP signing certificate expired”:

  1. Identify scope. Which CA’s OCSP is failing? Which downstream verifiers depend on it? Are signing operations impacted, or only verification?
  2. Switch verifiers to soft-fail with audit. Do not soft- fail silently. If your environment’s policy permits temporary soft-fail, switch verifiers to “soft-fail with explicit audit marker on every accepted response” — never to “soft-fail silently”.
  3. Stop signing. If signing is impacted (because EATF captures OCSP at sign time), stop signing until the responder is restored. A signing pipeline that produces untimestamped or unrevocation-checked artefacts is a future audit liability.
  4. Contact the CA. Most QTSP-grade CAs have a 24/7 incident contact in their certificate practices statement.
  5. After restoration: revoke or reissue every artefact signed in the soft-fail window if policy demands; otherwise, document the gap with a signed incident report (which is itself an EATF-signed artefact).

Repeated from chapter 1 §1.4 because it is the operational discipline that distinguishes a defensible signing pipeline from an indefensible one:

The hash-chained audit ledger from chapter 5 §5.4 has its own operational discipline.

A daily cron checks that every block in the per-tenant ledger satisfies block[N].header.prev_hash == sha256(block[N-1]). If the chain breaks, the most recent valid block is identified and the operator is alerted. Recovery is non-trivial: a broken chain implies either tampering, or a bug in the ledger writer that needs investigation, or clock-skew driven block-id collision. Treat as a security incident until proven otherwise.

Окно терминала
# Daily integrity check, run from cron
eatf audit verify-chain \
--tenant my-org \
--since 2026-04-01 \
--output /var/log/eatf/integrity-$(date +%F).log

Per-tenant block_size controls how many events accumulate before a block is sealed and signed. Defaults:

block_sizeUse case
1Real-time forensic; one event per block. Heavy on signatures.
16Default for EATF deployments. Balanced.
64Batch-style verticals (water-quality-ee daily bulletin).
256Bulk operations only.

Tuning happens at deployer-onboarding time and changes thereafter require a new tenant scope (not a configuration change to an existing one) to keep the chain semantics clean.

EU AI Act high-risk deployments imply 5-year minimum retention for most decisions; some sectoral overlays (medical, financial) extend this to 10 or 30 years. Practical retention strategy:

  1. Hot tier (≤ 90 days): PostgreSQL or equivalent, fast query, replicated.
  2. Warm tier (90 days – 2 years): S3-compatible object storage (e.g. Cloudflare R2 in our deployments), on-demand query.
  3. Cold tier (2+ years): Cold archival (Glacier-equivalent), chain integrity preserved through periodic re-verification + archival timestamping (PAdES-LTA-style).

Rotation: every block in the ledger is independently retrievable; rotation = moving older blocks from hot to warm, with the chain preserved by archival timestamps every quarter.

The day-to-day EATF flow has four CLI commands, in this order. Together they form the operational lifecycle of any deployer integration.

flowchart LR
    Init[eatf init <br> provision deployer key + registration] --> Doctor[eatf doctor <br> self-test signing, TSA, OCSP]
    Doctor --> Sign[eatf sign <br> produce .aep for content]
    Sign --> Verify[eatf verify <br> offline verification of .aep]
    Verify --> Loop{Daily ops}
    Loop -->|new agent| Reg[eatf agents sync]
    Loop -->|verify chain| Audit[eatf audit verify-chain]
    Reg --> Sign
    Audit --> Sign

Provision a deployer key pair (RSA-PSS-2048 + ML-DSA-65 hybrid by default), register with the upstream Aletheia substrate, and download the certificate chain into ~/.eatf/.

Окно терминала
eatf init \
--deployer-uri https://my-org.example.com \
--aletheia-endpoint https://h2oatlas.ee/api \
--hybrid-pqc \
--tsa-primary https://timestamp.sk.ee \
--tsa-fallback https://freetsa.org/tsr

Self-test the entire pipeline before any production signing. This is the single most underused command in deployer onboarding.

Окно терминала
eatf doctor
Deployer key loaded (RSA-PSS-2048 + ML-DSA-65 hybrid)
Certificate chain validates against EE-TL anchor
TSA primary reachable: 142 ms median round-trip
TSA fallback reachable: 218 ms median round-trip
OCSP responder reachable; freshness 2 hours
Audit ledger writable; current head: block #4592
All checks passed.

A red dot here is a hard stop: do not sign in production until eatf doctor is green.

The actual signing operation. Takes content (file, stdin, or URL) and emits an .aep Evidence Package.

Окно терминала
eatf sign \
--content ./report.pdf \
--content-type application/pdf \
--model-id water-quality-classifier-v3.2.1 \
--policy-version water-policy-v2.1.0 \
--output ./report.aep \
--metadata '{"chapter":"01","lang":"en","version":"v0.2"}'
Canonical CBOR payload constructed (hash 0xab...cd)
RSA-PSS signature produced
ML-DSA-65 signature produced
TSA token acquired (SK ID Solutions, 2026-05-05T20:14:33Z)
OCSP captured (fresh_at 2026-05-05T20:14:33Z)
Evidence package written: ./report.aep (4.2 KB)
Audit-ledger event #4593 appended to block #1147

Offline verification of any .aep package. Does not contact the network unless --live-ocsp is set.

Окно терминала
eatf verify ./report.aep
Package ID: 0x8f3e4a2b...
Signed by: https://my-org.example.com (deployer)
Signed at: 2026-05-05T20:14:33Z (TSA-anchored)
Algorithms: RSA-PSS-2048 + ML-DSA-65 (hybrid)
All signatures validate
Certificate chain validates under EE-TL trust anchors
OCSP captured at sign time confirms cert was good
TSA token validates under SK ID Solutions
Package id binding correct (anti-replay)
VERIFIED.

Three concrete scenarios — not an exhaustive list, but the three that EATF deployers have actually hit.

Incident A — Signature fails to verify in production

Section titled “Incident A — Signature fails to verify in production”

Symptoms: A downstream verifier rejects an .aep package that we know we produced legitimately.

Diagnosis order:

  1. Run eatf verify --explain against the package. The verifier should tell you which check failed.
  2. If signature: re-check the signing certificate’s expiry and revocation status against the captured OCSP.
  3. If chain: re-fetch the EE-TL and re-validate; the supervisory body may have rotated keys.
  4. If TSA: check the TSA’s own cert chain.
  5. If none of the above: log it as a verifier bug and route to the maintainers.

Symptoms: Signing fails; eatf doctor reports TSA unreachable.

Decision tree:

flowchart TB
    A[TSA primary down] --> B{Fallback up?}
    B -->|yes| C[Switch to fallback <br> lower legal weight, audit]
    B -->|no| D{Critical signing ops?}
    D -->|yes| E[Stop signing, escalate]
    D -->|no| F[Defer signing, retry every 5 min]
    C --> G[Document downgrade in incident log]
    G --> H[Re-anchor with primary TSA when restored]

Incident C — CA rotates OCSP signing key silently

Section titled “Incident C — CA rotates OCSP signing key silently”

Symptoms: OCSP responses suddenly fail signature verification on previously-working packages.

Recovery:

  1. Fetch the CA’s updated AuthorityInfoAccess extension.
  2. Re-fetch the OCSP responder’s new signing certificate.
  3. Push the updated trust anchor / responder cert into the verifier’s cache.
  4. Re-run eatf doctor to confirm.
  5. Replay any verifications that failed during the gap.

This is rare but not unheard of — OCSP responder key rotation by the CA is policy-permitted but should be announced. When the CA fails to announce, the deployer is left to discover. Hence the monitoring discipline in §6.2.

6.7 What this chapter set up for chapter 7

Section titled “6.7 What this chapter set up for chapter 7”

Chapter 7 walks the substrate-and-vertical thesis through five sectors — environmental monitoring, building audit, education, medical, financial. Each vertical inherits the operational discipline of this chapter: TSA selection, OCSP capture, audit-ledger discipline, EATF flow, incident playbooks. The verticals differ in policy and disclosure surface, not in the underlying operations.