Rollback Procedure
When to roll back
Roll back when all three of these are true:
- A regression is confirmed (not suspected) in production
- A forward fix cannot be shipped within the MTTR SLO (< 1 h cloud, < 1 day on-prem)
- The Release DRI has approved the rollback decision
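The three conditions above act as a single gate: if any one is unmet, the rollback does not proceed. A minimal sketch of that gate follows; the `should_roll_back` function and its yes/no arguments are hypothetical conventions, not part of any existing tooling.

```shell
# should_roll_back REGRESSION_CONFIRMED FORWARD_FIX_WITHIN_SLO DRI_APPROVED
# Each argument is "yes" or "no" (hypothetical convention for this sketch).
should_roll_back() {
  regression_confirmed="$1"
  forward_fix_within_slo="$2"
  dri_approved="$3"
  # All three conditions must hold: confirmed regression, no forward fix
  # possible inside the MTTR SLO, and Release DRI approval.
  if [ "$regression_confirmed" = "yes" ] &&
     [ "$forward_fix_within_slo" = "no" ] &&
     [ "$dri_approved" = "yes" ]; then
    echo "rollback"
  else
    echo "hold"
  fi
}

should_roll_back yes no yes    # prints "rollback"
should_roll_back yes yes yes   # prints "hold" (a forward fix fits the SLO)
```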
Rollback types
| Type | What you revert | How |
|---|---|---|
| Code rollback | The application binary — revert to previous OCI image | Pull previous semver tag, redeploy |
| Config rollback | The tenant manifest — revert to previous signed manifest | Swap manifest file, rolling restart |
| Both | Binary + manifest together | Always roll back to a known-good (image, manifest) pair |
Always roll back as a pair
Rolling back the image without rolling back the manifest (or vice versa) can produce an inconsistent state. The image validates the manifest signature at startup, and a manifest signed for v1.2.0 may use a different schema than the v1.3.0 image expects.
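One way to catch a mismatched pair before redeploying is to compare the version in the image tag with the version in the manifest file name. The sketch below assumes the `meesho.v1.2.0.manifest.json` naming used in this procedure. `pair_matches` is a hypothetical helper and `abc1234` is a made-up git SHA. It only compares names and does not verify the manifest signature.

```shell
# pair_matches IMAGE_TAG MANIFEST_PATH
# Succeeds only when both carry the same vMAJOR.MINOR.PATCH version.
pair_matches() {
  # Drop an optional "-<git-sha>" suffix from the image tag.
  image_version="${1%%-*}"
  # Extract "vX.Y.Z" from a file name such as meesho.v1.2.0.manifest.json.
  manifest_version="$(basename "$2" |
    sed -n 's/.*\.\(v[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\)\.manifest\.json$/\1/p')"
  [ -n "$manifest_version" ] && [ "$image_version" = "$manifest_version" ]
}

pair_matches v1.2.0-abc1234 manifests/meesho.v1.2.0.manifest.json && echo "pair ok"
pair_matches v1.3.0-abc1234 manifests/meesho.v1.2.0.manifest.json || echo "mismatch: do not deploy"
```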
Step-by-step (Docker Compose — single box)
Step 1 — Identify the previous good tag
# List the 5 most recently created tags on this host
# (docker images shows local images only, not the registry contents)
docker images ghcr.io/flexli/dms --format "{{.Tag}}" | head -5
# Or check the release notes for the last known-good version
# Format: v<MAJOR>.<MINOR>.<PATCH>-<git-sha>
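Checking a candidate tag against the documented format before pulling it helps catch typos and floating tags such as `latest`. This is a minimal sketch: `is_release_tag` is a hypothetical helper, and the assumption that the git SHA is 7 to 40 lowercase hex characters is not stated in the release notes.

```shell
# is_release_tag TAG
# Succeeds when TAG looks like v<MAJOR>.<MINOR>.<PATCH>-<git-sha>.
# Assumption: the SHA is 7 to 40 lowercase hex characters.
is_release_tag() {
  printf '%s\n' "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+-[0-9a-f]{7,40}$'
}

is_release_tag "v1.2.0-abc1234" && echo "valid"
is_release_tag "latest" || echo "rejected"
```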
Step 2 — Pull the previous image
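A minimal sketch of the pull, assuming the `ghcr.io/flexli/dms` repository from Step 1 and a tag supplied by the caller. `pull_previous` and the `DRY_RUN` switch are hypothetical names, not part of any existing tooling.

```shell
# Repository taken from Step 1; adjust if the image lives elsewhere.
REPO="ghcr.io/flexli/dms"

# pull_previous TAG
# Pulls REPO:TAG. With DRY_RUN=1 it only prints the command it would run.
pull_previous() {
  tag="$1"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "docker pull ${REPO}:${tag}"
  else
    docker pull "${REPO}:${tag}"
  fi
}
```

Running `DRY_RUN=1 pull_previous <tag>` first shows the exact image reference before anything is downloaded.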
Step 3 — Swap the manifest
# Backup current manifest
cp manifests/meesho.manifest.json manifests/meesho.manifest.json.bak
# Restore previous manifest (versioned and immutable)
cp manifests/meesho.v1.2.0.manifest.json manifests/meesho.manifest.json
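The two copy commands above can be wrapped in a guard that refuses to touch the live manifest when the previous version is missing. This is a sketch under that assumption only; `swap_manifest` is a hypothetical helper.

```shell
# swap_manifest CURRENT PREVIOUS
# Backs up CURRENT to CURRENT.bak, then copies PREVIOUS over it.
# Fails without changing anything if PREVIOUS does not exist.
swap_manifest() {
  current="$1"
  previous="$2"
  if [ ! -f "$previous" ]; then
    echo "missing previous manifest: $previous" >&2
    return 1
  fi
  # Back up the current manifest only if one exists.
  if [ -f "$current" ]; then
    cp "$current" "$current.bak" || return 1
  fi
  cp "$previous" "$current"
}
```

For the example in this step, the call would be `swap_manifest manifests/meesho.manifest.json manifests/meesho.v1.2.0.manifest.json`.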
Step 4 — Redeploy
Step 5 — Verify
Step 6 — Update the audit log
Document in the runbook incident thread:
- Time rollback started
- Time rollback completed
- Previous version: v1.3.0
- Rollback target: v1.2.0
- Root cause (if known)
- On-call owner
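To keep entries consistent across incidents, the checklist above can be rendered from a small template. This is a sketch; `audit_entry` and its argument order are hypothetical.

```shell
# audit_entry STARTED COMPLETED PREVIOUS_VERSION TARGET_VERSION ROOT_CAUSE OWNER
# Prints the six audit fields in the order listed in the checklist.
audit_entry() {
  cat <<EOF
- Time rollback started: $1
- Time rollback completed: $2
- Previous version: $3
- Rollback target: $4
- Root cause (if known): $5
- On-call owner: $6
EOF
}

# Example with placeholder values; paste the output into the incident thread.
audit_entry "<start time>" "<end time>" "v1.3.0" "v1.2.0" "unknown" "<on-call name>"
```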
Rollback criteria
A release defines its rollback trigger before it ships. Example:
| Metric | Threshold | Action |
|---|---|---|
| HTTP 5xx error rate | > 5% for 5 min | Pause promotion; alert Release DRI |
| p95 latency (scan endpoint) | > 500 ms for 10 min | Pause promotion; investigate |
| AWB routing failures | > 1% of scans | Immediate rollback |
| Outbox drain failure | queue growing for 15 min | Immediate rollback |
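The example thresholds can be expressed as a small decision function. The sketch below assumes the caller passes values already sustained over the windows in the table (for example, the 5xx rate held for 5 minutes). `rollback_action`, `gt`, and the argument order are hypothetical.

```shell
# gt A B: succeeds when A > B; awk handles decimal values.
gt() {
  awk -v a="$1" -v b="$2" 'BEGIN { exit !(a + 0 > b + 0) }'
}

# rollback_action ERR_PCT P95_MS AWB_FAIL_PCT OUTBOX_GROWTH_MIN
# Prints the action for the example thresholds in the table above.
rollback_action() {
  err_pct="$1"; p95_ms="$2"; awb_fail_pct="$3"; outbox_growth_min="$4"
  # The immediate-rollback rows take priority over the pause rows.
  if gt "$awb_fail_pct" 1 || [ "$outbox_growth_min" -ge 15 ]; then
    echo "immediate-rollback"
  elif gt "$err_pct" 5 || gt "$p95_ms" 500; then
    echo "pause-promotion"
  else
    echo "continue"
  fi
}

rollback_action 0.4 220 0.2 0    # prints "continue"
rollback_action 6.5 220 0.2 0    # prints "pause-promotion"
rollback_action 0.4 220 1.5 0    # prints "immediate-rollback"
```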
Retained releases
The registry retains at least three previous releases. The Release DRI confirms retention before deleting older images.
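A deletion candidate list that respects this rule can be derived from a newest-first tag list. This is a sketch; `prune_candidates` is a hypothetical helper, the tags in the example are illustrative, and the Release DRI still confirms before anything is deleted.

```shell
# prune_candidates
# stdin: release tags, newest first; the first line is the current release.
# Keeps the current release plus three previous ones and prints the rest.
prune_candidates() {
  tail -n +5
}

# Example: six releases, so only the two oldest are candidates.
printf 'v1.5.0\nv1.4.0\nv1.3.0\nv1.2.0\nv1.1.0\nv1.0.0\n' | prune_candidates
```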
For current codebase (pre-CI, local dev)
Until the CI pipeline is in place (RFC PoC Week 2–3), rollback is manual: