Skip to content

Incident Response

Severity levels

Level Definition Response time Example
S0 Complete service outage — no scans processing Immediate DMS process crashed; all scans failing
S1 Partial outage — significant throughput degradation < 15 min AWB lookup failing for subset of AWBs
S2 Degraded — feature impaired, workaround available < 1 h Outbox drain delayed; InScan notifications lagging
S3 Minor — cosmetic or low-impact Next business day Dashboard showing stale stats

First five minutes

1. Confirm the alert is real (not a test or flapping sensor)
2. Check /health on all three services:
   curl http://localhost:5269/health
   curl http://localhost:5000/health
   PGPASSWORD='...' pg_isready -U postgres

3. Check process status:
   make check

4. Check logs for the last 50 errors:
   make logs-backend | grep -i "error\|exception\|fail" | tail -50

5. Declare incident severity and notify the team

Common failure modes

DMS process down

make backend-bg   # restart
make smoke        # verify

If restart fails, check /tmp/dms-backend.log for the crash reason.

Database connection failures

# Is postgres running?
/Library/PostgreSQL/16/bin/pg_isready -U postgres

# Check connection:
PGPASSWORD='3edc#EDC' psql -U postgres -d "dms-layered" -c "SELECT 1"

OutboxWorker not draining (notifications lagging)

# Check queue depth
PGPASSWORD='3edc#EDC' psql -U postgres -d "dms-layered" \
  -c "SELECT COUNT(*) FROM meesho.inscan_sync WHERE status != 'Sent';"

# Check partner API reachability
curl -sf http://localhost:5000/health  # local dev
# or hit the real Meesho API endpoint in staging/prod

Partner API returning 401 (token expired)

# Force token refresh (restart the backend — TokenRefreshWorker runs at startup)
make backend-stop && make backend-bg

AWB lookup returning no data (routing decisions failing)

# Check data_sync table
PGPASSWORD='3edc#EDC' psql -U postgres -d "dms-layered" \
  -c "SELECT COUNT(*), MAX(updated_at) FROM meesho.data_sync;"

If MAX(updated_at) is stale (> 10 minutes), DataSyncWorker has stopped polling. Check backend logs.


Post-incident

Within 24 hours of resolution:

  1. Document the timeline in GitHub Discussions (category: Incidents)
  2. Update this runbook with any new failure modes discovered
  3. File a tracking issue for any gaps in alerting or tooling
  4. If a rollback was performed, add the root cause to the Rollback Procedure