Incident Response

Severity levels

Level	Definition	Response time	Example
S0	Complete service outage — no scans processing	Immediate	DMS process crashed; all scans failing
S1	Partial outage — significant throughput degradation	< 15 min	AWB lookup failing for subset of AWBs
S2	Degraded — feature impaired, workaround available	< 1 h	Outbox drain delayed; InScan notifications lagging
S3	Minor — cosmetic or low-impact	Next business day	Dashboard showing stale stats

First five minutes

1. Confirm the alert is real (not a test or flapping sensor)
2. Check the health of all three services (the two HTTP /health endpoints and PostgreSQL):
   curl http://localhost:5269/health
   curl http://localhost:5000/health
   PGPASSWORD='...' pg_isready -U postgres

3. Check process status:
   make check

4. Check logs for the last 50 errors:
   make logs-backend | grep -iE "error|exception|fail" | tail -50

5. Declare incident severity and notify the team
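
Steps 2 through 4 can be bundled into a single pass so that nothing is skipped under pressure. The following is a minimal sketch, not something that exists in the repository: the function names `check` and `triage_health` are hypothetical, and it assumes the ports and `make` targets listed above, plus a `make logs-backend` target that prints and exits rather than following the log.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command quietly and print a one-line verdict.
check() {
  local label="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $label"
  else
    echo "FAIL $label"
    return 1
  fi
}

# Hypothetical wrapper around steps 2-4; ports and make targets come from this runbook.
triage_health() {
  local failed=0
  check "http :5269 /health" curl -sf --max-time 5 http://localhost:5269/health || failed=1
  check "http :5000 /health" curl -sf --max-time 5 http://localhost:5000/health || failed=1
  check "postgres" pg_isready -U postgres || failed=1
  check "make check" make check || failed=1
  # Print recent error lines regardless of the verdicts above.
  make logs-backend 2>/dev/null | grep -iE "error|exception|fail" | tail -50
  return "$failed"
}
```

After sourcing the file, run `triage_health`. A non-zero return means at least one check failed, which is usually enough to pick a severity for step 5.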

Common failure modes

DMS process down

make backend-bg   # restart
make smoke        # verify

If the restart fails, check /tmp/dms-backend.log for the crash reason.
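
The restart, the smoke test, and the log check can be chained so the crash reason appears without a second command. This is a sketch only: `restart_backend` and the `DMS_LOG` override are hypothetical names, and it assumes `make smoke` tolerates whatever startup delay `make backend-bg` leaves behind.

```shell
# Hypothetical helper: restart, verify, and surface the crash reason on failure.
# DMS_LOG defaults to the log path named in this runbook.
restart_backend() {
  local log="${DMS_LOG:-/tmp/dms-backend.log}"
  if make backend-bg && make smoke; then
    echo "backend restarted; smoke test passed"
  else
    echo "restart or smoke test failed; last lines of $log:"
    tail -n 40 "$log" 2>/dev/null
    return 1
  fi
}
```

If the smoke test races the backend startup in practice, a short `sleep` between the two `make` calls may be needed; the right delay is not something this runbook specifies.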

Database connection failures

# Is postgres running?
/Library/PostgreSQL/16/bin/pg_isready -U postgres

# Check connection:
PGPASSWORD='3edc#EDC' psql -U postgres -d "dms-layered" -c "SELECT 1"
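
The exit status of `pg_isready` narrows the problem before `psql` is tried. PostgreSQL documents four statuses (0 to 3). The helper below is a hypothetical sketch that maps each status to a suggested next step; the advice wording is this sketch's own, not an official message.

```shell
# Translate a pg_isready exit status into a next step.
# Statuses 0-3 are documented by PostgreSQL; the advice text is illustrative.
explain_pg_isready() {
  case "$1" in
    0) echo "accepting connections: test credentials and database with psql" ;;
    1) echo "rejecting connections: server is probably still starting up" ;;
    2) echo "no response: server is down or unreachable" ;;
    3) echo "no attempt made: check the pg_isready parameters" ;;
    *) echo "unexpected status $1" ;;
  esac
}

# Usage: /Library/PostgreSQL/16/bin/pg_isready -U postgres; explain_pg_isready $?
```

Because `pg_isready` does not need valid credentials to report server status, a status of 0 followed by a failing `SELECT 1` points at the password, the role, or the database name rather than at the server itself.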

OutboxWorker not draining (notifications lagging)

# Check queue depth
PGPASSWORD='3edc#EDC' psql -U postgres -d "dms-layered" \
  -c "SELECT COUNT(*) FROM meesho.inscan_sync WHERE status != 'Sent';"

# Check partner API reachability
curl -sf http://localhost:5000/health  # local dev
# or hit the real Meesho API endpoint in staging/prod
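
A single queue-depth reading does not show whether the worker is making progress; two readings a short interval apart do. The sketch below assumes `PGPASSWORD` is already exported in the shell. `queue_depth` and `queue_trend` are hypothetical names, and the 30-second interval in the usage line is an arbitrary choice.

```shell
# Hypothetical helper: compare two queue-depth samples.
queue_trend() {
  local first="$1" second="$2"
  if [ "$second" -eq 0 ]; then
    echo "empty"
  elif [ "$second" -lt "$first" ]; then
    echo "draining"
  else
    echo "not draining"
  fi
}

# Hypothetical sampler: -A -t makes psql print the bare count.
# Assumes PGPASSWORD is exported in the current shell.
queue_depth() {
  psql -U postgres -d "dms-layered" -At \
    -c "SELECT COUNT(*) FROM meesho.inscan_sync WHERE status != 'Sent';"
}

# Usage: a=$(queue_depth); sleep 30; b=$(queue_depth); queue_trend "$a" "$b"
```

A "not draining" result under heavy scan intake can also mean the worker is sending but more slowly than new rows arrive, so it is worth pairing with the partner API reachability check above.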

Partner API returning 401 (token expired)

# Force a token refresh by restarting the backend (TokenRefreshWorker runs at startup)
make backend-stop && make backend-bg

AWB lookup returning no data (routing decisions failing)

# Check data_sync table
PGPASSWORD='3edc#EDC' psql -U postgres -d "dms-layered" \
  -c "SELECT COUNT(*), MAX(updated_at) FROM meesho.data_sync;"

If MAX(updated_at) is more than 10 minutes old, DataSyncWorker has stopped polling. Check the backend logs for the reason.
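
The 10-minute comparison can be done by the database instead of by eye. The following is a sketch under stated assumptions: `sync_age` and `sync_status` are hypothetical names, `PGPASSWORD` is assumed to be exported, and `updated_at` is assumed to be a timestamp type that can be subtracted from `now()`.

```shell
# Hypothetical helper: flag a sync age (in seconds) against a threshold.
# The 600 s default mirrors the 10-minute rule above.
sync_status() {
  local age="$1" limit="${2:-600}"
  if [ -z "$age" ]; then
    echo "no rows in data_sync"
  elif [ "$age" -gt "$limit" ]; then
    echo "stale (${age}s)"
  else
    echo "fresh (${age}s)"
  fi
}

# Hypothetical sampler: age of the newest row in whole seconds.
# Prints an empty line when the table has no rows (MAX returns NULL).
sync_age() {
  psql -U postgres -d "dms-layered" -At \
    -c "SELECT EXTRACT(EPOCH FROM now() - MAX(updated_at))::int FROM meesho.data_sync;"
}

# Usage: sync_status "$(sync_age)"
```

An empty table is reported separately from a stale one because it suggests the sync has never completed, which is a different failure from a worker that stopped mid-run.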

Post-incident

Within 24 hours of resolution:

1. Document the timeline in GitHub Discussions (category: Incidents)
2. Update this runbook with any new failure modes discovered
3. File a tracking issue for any gaps in alerting or tooling
4. If a rollback was performed, add the root cause to the Rollback Procedure