Incident Response¶
Severity levels¶
| Level | Definition | Response time | Example |
|---|---|---|---|
| S0 | Complete service outage — no scans processing | Immediate | DMS process crashed; all scans failing |
| S1 | Partial outage — significant throughput degradation | < 15 min | AWB lookup failing for subset of AWBs |
| S2 | Degraded — feature impaired, workaround available | < 1 h | Outbox drain delayed; InScan notifications lagging |
| S3 | Minor — cosmetic or low-impact | Next business day | Dashboard showing stale stats |
First five minutes¶
1. Confirm the alert is real (not a test or flapping sensor)
2. Check /health on all three services:
curl http://localhost:5269/health
curl http://localhost:5000/health
PGPASSWORD='...' pg_isready -U postgres
3. Check process status:
make check
4. Check logs for the last 50 errors:
make logs-backend | grep -i "error\|exception\|fail" | tail -50
5. Declare incident severity and notify the team
Common failure modes¶
DMS process down¶
If restart fails, check /tmp/dms-backend.log for the crash reason.
Database connection failures¶
# Is postgres running?
/Library/PostgreSQL/16/bin/pg_isready -U postgres
# Check connection:
PGPASSWORD='3edc#EDC' psql -U postgres -d "dms-layered" -c "SELECT 1"
OutboxWorker not draining (notifications lagging)¶
# Check queue depth
PGPASSWORD='3edc#EDC' psql -U postgres -d "dms-layered" \
-c "SELECT COUNT(*) FROM meesho.inscan_sync WHERE status != 'Sent';"
# Check partner API reachability
curl -sf http://localhost:5000/health # local dev
# or hit the real Meesho API endpoint in staging/prod
Partner API returning 401 (token expired)¶
# Force token refresh (restart the backend — TokenRefreshWorker runs at startup)
make backend-stop && make backend-bg
AWB lookup returning no data (routing decisions failing)¶
# Check data_sync table
PGPASSWORD='3edc#EDC' psql -U postgres -d "dms-layered" \
-c "SELECT COUNT(*), MAX(updated_at) FROM meesho.data_sync;"
If MAX(updated_at) is stale (> 10 minutes), DataSyncWorker has stopped polling. Check backend logs.
Post-incident¶
Within 24 hours of resolution:
- Document the timeline in GitHub Discussions (category: Incidents)
- Update this runbook with any new failure modes discovered
- File a tracking issue for any gaps in alerting or tooling
- If a rollback was performed, add the root cause to the Rollback Procedure