Rollback Procedure

When to roll back

Roll back when all three of these are true:

  1. A regression is confirmed (not suspected) in production
  2. A forward fix cannot be shipped within the MTTR SLO (< 1 h cloud, < 1 day on-prem)
  3. The Release DRI has approved the rollback decision

Rollback types

Type            | What you revert                                             | How
Code rollback   | The application binary: revert to the previous OCI image    | Pull the previous semver tag, redeploy
Config rollback | The tenant manifest: revert to the previous signed manifest | Swap the manifest file, rolling restart
Both            | Binary + manifest together                                  | Always roll back to a known-good (image, manifest) pair

Always roll back as a pair

Rolling back the image without rolling back the manifest (or vice versa) can leave the deployment in an inconsistent state. The image validates the manifest signature at startup, and a manifest signed for v1.2.0 may use a different schema than the v1.3.0 image expects.
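
One way to keep the pair together is to derive both the image reference and the manifest path from a single version value, so the two cannot drift apart. The following is a minimal sketch, assuming the `ghcr.io/flexli/dms` image path and the `manifests/<tenant>.<version>.manifest.json` naming used in the steps of this runbook; the function name `rollback_pair` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: derive the (image, manifest) pair from one version.
# Assumes the image repository and manifest naming shown in this runbook.
rollback_pair() {
  # $1 = tenant name, $2 = version tag
  printf 'image=ghcr.io/flexli/dms:%s\n' "$2"
  printf 'manifest=manifests/%s.%s.manifest.json\n' "$1" "$2"
}

rollback_pair meesho v1.2.0
# prints:
#   image=ghcr.io/flexli/dms:v1.2.0
#   manifest=manifests/meesho.v1.2.0.manifest.json
```

Because both values come from the same argument, a typo in the version affects the image and the manifest alike and fails visibly, instead of silently mixing versions.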


Step-by-step (Docker Compose — single box)

Step 1 — Identify the previous good tag

# List up to 10 locally cached tags, most recently created first
# (docker images reads the local image cache, not the remote registry)
docker images ghcr.io/flexli/dms --format "{{.Tag}}" | head -10

# Or check the release notes for the last known-good version
# Format: v<MAJOR>.<MINOR>.<PATCH>-<git-sha>
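
If the tag list is long, the candidate can be picked mechanically. This is a sketch only: the helper name `previous_tag` is hypothetical, it relies on `sort -V` (version sort, available in GNU coreutils), and it returns the next-lower version by sort order, which is not necessarily a known-good release. Confirm the result against the release notes.

```shell
#!/bin/sh
# Hypothetical helper: read candidate tags on stdin (one per line) and print
# the highest semver-looking tag that sorts below the current tag ($1).
previous_tag() {
  current="$1"
  grep -E '^v[0-9]+\.[0-9]+\.[0-9]+' \
    | sort -V -u \
    | awk -v cur="$current" '
        $0 == cur { if (prev != "") print prev; exit }
        { prev = $0 }'
}

# Non-version tags such as "latest" are filtered out before sorting.
printf '%s\n' v1.3.0 v1.1.0 latest v1.2.0 v1.10.0 | previous_tag v1.3.0
# prints v1.2.0
```

Version sort matters here: a plain lexical sort would place v1.10.0 before v1.2.0.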

Step 2 — Pull the previous image

PREV_VERSION=v1.2.0
docker pull ghcr.io/flexli/dms:${PREV_VERSION}

Step 3 — Swap the manifest

# Backup current manifest
cp manifests/meesho.manifest.json manifests/meesho.manifest.json.bak

# Restore previous manifest (versioned and immutable)
cp manifests/meesho.v1.2.0.manifest.json manifests/meesho.manifest.json
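
The two `cp` commands above can be wrapped so that nothing is touched when the versioned manifest is missing. A minimal sketch, assuming the file naming shown above; the function name `swap_manifest` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: restore <dir>/<tenant>.<version>.manifest.json over the
# live manifest, refusing to change anything if the versioned file is absent.
swap_manifest() {
  dir="$1"; tenant="$2"; version="$3"
  live="$dir/$tenant.manifest.json"
  target="$dir/$tenant.$version.manifest.json"
  if [ ! -f "$target" ]; then
    echo "missing versioned manifest: $target" >&2
    return 1
  fi
  # Back up only when a live manifest exists; -p preserves mode and timestamps.
  if [ -f "$live" ]; then
    cp -p "$live" "$live.bak" || return 1
  fi
  cp "$target" "$live"
}

# Example: swap_manifest manifests meesho v1.2.0
```

Checking for the target first avoids the case where the backup succeeds, the restore fails, and the operator is left unsure which manifest is live.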

Step 4 — Redeploy

DMS_VERSION=${PREV_VERSION} docker compose up -d --no-build

Step 5 — Verify

make docker-smoke
# or (-f makes curl exit non-zero on HTTP error responses):
curl -fsS http://localhost:5269/health
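
The service may need a few seconds after redeploy before the health endpoint responds, so a single probe can fail spuriously. The sketch below retries a command a fixed number of times; the `retry` helper and the attempt and delay values are illustrative assumptions, not part of the project tooling.

```shell
#!/bin/sh
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between attempts, and succeed as soon as one attempt succeeds.
retry() {
  attempts="$1"; delay="$2"; shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    n=$((n + 1))
    # Sleep only if another attempt will follow.
    [ "$n" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Example, using the health endpoint from Step 5:
# retry 10 3 curl -fsS http://localhost:5269/health
```

If every attempt fails, `retry` returns 1, which is a signal to stop and escalate instead of declaring the rollback complete.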

Step 6 — Update the audit log

Document in the runbook incident thread:

  • Time rollback started
  • Time rollback completed
  • Version rolled back from: v1.3.0
  • Rollback target: v1.2.0
  • Root cause (if known)
  • On-call owner
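
To keep entries consistent between incidents, the fields above can be printed from a small helper and pasted into the thread. This is a sketch; the function name `rollback_record`, the field labels, and the example values are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: print the Step 6 fields as a block of text.
# Arguments: start time, end time, version rolled back from, rollback target,
# root cause (empty means unknown), on-call owner.
rollback_record() {
  printf 'Rollback started:   %s\n' "$1"
  printf 'Rollback completed: %s\n' "$2"
  printf 'Rolled back from:   %s\n' "$3"
  printf 'Rollback target:    %s\n' "$4"
  printf 'Root cause:         %s\n' "${5:-unknown}"
  printf 'On-call owner:      %s\n' "$6"
}

# UTC timestamps can be captured with: date -u +%Y-%m-%dT%H:%M:%SZ
rollback_record 2024-01-01T10:00:00Z 2024-01-01T10:12:00Z v1.3.0 v1.2.0 "" alice
```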

Rollback criteria

A release defines its rollback trigger before it ships. Example:

Metric                      | Threshold                | Action
HTTP 5xx error rate         | > 5% for 5 min           | Pause promotion; alert Release DRI
p95 latency (scan endpoint) | > 500 ms for 10 min      | Pause promotion; investigate
AWB routing failures        | > 1% of scans            | Immediate rollback
Outbox drain failure        | Queue growing for 15 min | Immediate rollback
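
The numeric rows of the example table can be expressed as a lookup that maps one metric sample to an action. This is a sketch under stated assumptions: the metric keys and action labels are hypothetical, only the threshold is checked (the "for N min" windows are not modeled), and the outbox row is omitted because it describes a trend, not a single value.

```shell
#!/bin/sh
# Hypothetical helper: map one metric sample to the action in the example table.
# $1 = metric key, $2 = observed value. Duration windows are NOT evaluated here.
rollback_action() {
  awk -v metric="$1" -v value="$2" 'BEGIN {
    limit["http_5xx_pct"]         = 5;   action["http_5xx_pct"]         = "pause-and-alert"
    limit["scan_p95_ms"]          = 500; action["scan_p95_ms"]          = "pause-and-investigate"
    limit["awb_routing_fail_pct"] = 1;   action["awb_routing_fail_pct"] = "rollback"
    if (!(metric in limit)) { print "unknown-metric"; exit 2 }
    # Thresholds are strict: a value equal to the limit does not trigger.
    print ((value + 0 > limit[metric]) ? action[metric] : "ok")
  }'
}

rollback_action awb_routing_fail_pct 1.5
# prints rollback
```

In practice the sustained-duration condition would come from the monitoring system's alert rules; this helper only illustrates how the thresholds map to actions.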

Retained releases

The registry retains at least three previous releases. The Release DRI confirms retention before deleting older images.


For the current codebase (pre-CI, local dev)

Until the CI pipeline is in place (RFC PoC Week 2–3), rollback is manual:

# Roll back to a specific git tag (this leaves the repo in detached HEAD state)
git checkout v1.2.0

# Restart the local backend
make backend-stop
make backend-bg