FLX-ENG-RFC-001: Branching, Build & Release Engineering
| Field | Value |
|---|---|
| RFC ID | FLX-ENG-RFC-001 |
| Status | WIP — target sign-off EoD 22 May 2026 |
| Author | Arun Singh, Senior Distinguished Engineer / Architect (Consulting) |
| Reviewers | Raja Choudhary (Founder), Rahul (Eng Lead), Kanchan, Tushar |
| Scope | mSort application — version control, build, release, multi-tenant rollout, PoC plan, tech stack |
| Supersedes | Ad-hoc dev/prod branching; Meesho-specific default behaviour in baseline |
TL;DR: The decision
Maintain exactly one product codebase on a trunk-based branching model. Handle client-specific behaviour through tenant-aware configuration and feature flags, not through client-specific branches or builds. For offline / semi-connected client sites, ship the same artifact with a Flexli-signed deployment manifest that activates the correct feature set per customer.
Branches manage software lifecycle; tenancy manages customer variation; the deployment manifest carries entitlement.
1. Problem
Flexli is moving from a single-customer baseline (currently carrying Meesho-specific defaults) to a multi-client product line. Without a deliberate engineering process, the natural drift is toward per-customer code forks, per-customer branches, and per-customer builds. That path looks fast for the first three customers and becomes the single largest source of incidents, merge debt, and release risk by customer ten.
Current constraints
- The baseline codebase carries Meesho-specific defaults, which do not scale to other clients
- Only `prod` and `dev` branches exist, with short-lived feature branches; there is no formal branching model
- Gaps in publishing, bundling, and deployment: no CI/CD, no container image, no artifact signing
- Client variation is in business logic itself, not only configuration values
Design principles (each future proposal that violates one needs an explicit waiver)
- Branches model lifecycle, tenants model customers. Branching expresses change delivery; tenancy expresses business segmentation. Mixing them creates merge debt.
- One codebase, one artifact, multiple activations. The pipeline produces a single versioned artifact; customer differences are activated at deployment time through governed configuration.
- Control lives with Flexli, not the customer. Feature entitlement is generated, signed, and shipped by Flexli. Customer-editable plain settings are not a control surface; operability and rollback are first-class concerns.
2. Branching Model
Trunk-based development with short-lived feature branches and on-demand release branches.
Branches
| Branch | Purpose | Lifetime | Merges into |
|---|---|---|---|
| `main` | Single source of truth. Always releasable. Protected; no direct push. | Permanent | n/a |
| `feature/<id>-<slug>` | One unit of work, scoped to a user story or task ID. | Hours to a few days | `main` via PR + CI |
| `release/<version>` | Stabilisation window for an upcoming UAT or production cut. Only fixes allowed. | Days to two weeks | `main` (and tag) |
| `hotfix/<id>` | Urgent production fix on the currently released tag. | Hours | `main` + release tag |
Rules
- Direct pushes to `main` are blocked; every change lands through a PR with green CI and at least one reviewer (two for `release/*` and `hotfix/*`)
- Feature branches rebase on `main` daily and merge within five working days
- Tags follow semantic versioning, cut from a `release/*` branch
- We do not create: `customer/<name>` branches, long-lived `develop` branches, or environment branches
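The naming rules above lend themselves to an automated check in CI. The following is a minimal, language-neutral sketch in Python (the product itself is .NET). The function names `classify_branch` and `required_reviewers` and the exact character classes for `<id>`, `<slug>`, and `<version>` are assumptions for illustration; the RFC does not define them.

```python
import re
from typing import Optional

# Allowed branch shapes, taken from the branch table. The precise character
# classes for <id>, <slug> and <version> are assumptions, not RFC text.
ALLOWED_PATTERNS = {
    "main": re.compile(r"^main$"),
    "feature": re.compile(r"^feature/[A-Za-z0-9]+-[a-z0-9]+(?:-[a-z0-9]+)*$"),
    "release": re.compile(r"^release/\d+\.\d+(?:\.\d+)?$"),
    "hotfix": re.compile(r"^hotfix/[A-Za-z0-9-]+$"),
}


def classify_branch(name: str) -> Optional[str]:
    """Return the branch kind, or None when the name is not permitted."""
    for kind, pattern in ALLOWED_PATTERNS.items():
        if pattern.match(name):
            return kind
    return None


def required_reviewers(name: str) -> int:
    """One reviewer by default, two for release/* and hotfix/* branches."""
    kind = classify_branch(name)
    if kind is None:
        raise ValueError(f"branch name not permitted: {name}")
    return 2 if kind in ("release", "hotfix") else 1
```

Because the check is an allow-list, names such as `customer/meesho` or `develop` are rejected without needing a separate deny-list.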
3. Multi-tenant Architecture
Client-specific behaviour is expressed through four mechanisms, in this order of preference:
| Mechanism | When to use | Carrier |
|---|---|---|
| Tenant configuration | Values differ per client: endpoints, limits, regex, SLAs, branding | Signed deployment manifest |
| Feature flags | Same capability, enable/disable per client or per environment | Feature registry in manifest |
| Strategy / plugin interfaces | Business logic genuinely differs (e.g., routing, validation, notification provider) | Interface + per-client implementation, selected by config |
| Deployment topology | Isolation required: regulatory, on-prem, dedicated SLO, noisy-neighbour risk | IaC module variant, same artifact |
What we do not do: edit source per client; maintain client forks; rebuild the artifact per client.
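The strategy / plugin row is the one most easily confused with a fork, so a small illustration may help. This is a language-neutral Python sketch, not the application's code: the class names, the `routing_strategy` config key, and the `route` method are all hypothetical.

```python
from typing import Callable, Dict, Protocol


class RoutingStrategy(Protocol):
    """The shared interface; each client gets an implementation, not a branch."""

    def route(self, parcel_id: str) -> str: ...


class FlexliRouting:
    def route(self, parcel_id: str) -> str:
        return f"flexli:{parcel_id}"


class MyntraRouting:
    def route(self, parcel_id: str) -> str:
        return f"myntra:{parcel_id}"


# Every implementation ships in the same artifact; configuration only selects one.
STRATEGIES: Dict[str, Callable[[], RoutingStrategy]] = {
    "flexli": FlexliRouting,
    "myntra": MyntraRouting,
}


def select_strategy(config: dict) -> RoutingStrategy:
    """Pick the implementation named by the (hypothetical) routing_strategy key."""
    key = config.get("routing_strategy")
    if key not in STRATEGIES:
        raise ValueError(f"unknown routing strategy: {key!r}")
    return STRATEGIES[key]()
```

Changing `routing_strategy` in the tenant configuration switches behaviour without a rebuild, which is the property the table asks for.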
Feature registry and manifest
All flags live in a single feature registry in code, each entry declaring key, owner, default, and dependencies. The per-client manifest references those keys. A manifest is a small signed JSON:
```json
{
  "tenantId": "meesho-blr-wh-01",
  "site": "BLR_WAREHOUSE",
  "version": "1.2.3",
  "features": {
    "awb_data_sync": true,
    "manifest_close": true,
    "manual_sort": false
  },
  "config": {
    "api_base_url": "https://api.meesho.com",
    "data_sync_interval_seconds": 300
  },
  "validityWindow": {
    "notBefore": "2026-06-01T00:00:00Z",
    "notAfter": "2026-12-31T23:59:59Z"
  },
  "signature": "<cosign-detached-signature>"
}
```
The application validates the signature at startup and refuses to start on a tampered or expired manifest.
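To make the registry-to-manifest relationship concrete, here is a minimal Python sketch of how manifest flags might be resolved against the registry. The feature keys come from the example manifest; the owners, the defaults, and the dependency of `manifest_close` on `awb_data_sync` are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class FeatureDef:
    key: str
    owner: str
    default: bool
    depends_on: Tuple[str, ...] = ()


# Owners, defaults and the dependency below are hypothetical examples.
REGISTRY: Dict[str, FeatureDef] = {
    f.key: f
    for f in (
        FeatureDef("awb_data_sync", "platform", False),
        FeatureDef("manifest_close", "platform", False, depends_on=("awb_data_sync",)),
        FeatureDef("manual_sort", "ops", False),
    )
}


def resolve_features(manifest_features: Dict[str, bool]) -> Dict[str, bool]:
    """Apply registry defaults, reject unknown keys, enforce dependencies."""
    unknown = set(manifest_features) - set(REGISTRY)
    if unknown:
        raise ValueError(f"unknown feature keys: {sorted(unknown)}")
    effective = {
        key: manifest_features.get(key, definition.default)
        for key, definition in REGISTRY.items()
    }
    for key, definition in REGISTRY.items():
        if effective[key]:
            missing = [dep for dep in definition.depends_on if not effective[dep]]
            if missing:
                raise ValueError(f"{key} requires {missing}")
    return effective
```

Rejecting unknown keys keeps a manifest from silently referencing a flag that the shipped artifact does not know about.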
4. Build, Artifact, and Deployment
Pipeline stages
| Stage | What happens |
|---|---|
| Build | dotnet build — compile |
| Test | dotnet test — unit + integration |
| Scan | Static analysis + dependency vulnerability scan |
| Package | One immutable OCI image + CycloneDX SBOM + Cosign signature |
| Promote | Same artifact moves QA → Staging → UAT → Production. No rebuild on promotion. |
| Deploy | Orchestrator validates manifest signature and binds it to the artifact |
| Verify | Smoke tests + health checks + canary window |
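The "no rebuild on promotion" rule can be enforced mechanically by comparing image digests between environments. This is a hedged Python sketch; the environment names follow the Promote row, while the `promote` function and the dictionary of deployed digests are assumptions.

```python
from typing import Dict

# Promotion order from the Promote stage above.
ENVIRONMENTS = ["qa", "staging", "uat", "production"]


def promote(deployed: Dict[str, str], digest: str, target: str) -> Dict[str, str]:
    """Record `digest` in `target` only if the previous environment runs the same digest.

    `deployed` maps environment name to image digest. A mismatch means someone
    is trying to promote a different (for example, rebuilt) artifact.
    """
    index = ENVIRONMENTS.index(target)  # raises ValueError for unknown environments
    if index > 0:
        previous = ENVIRONMENTS[index - 1]
        if deployed.get(previous) != digest:
            raise ValueError(f"{previous} does not run {digest}; refusing to promote")
    updated = dict(deployed)
    updated[target] = digest
    return updated
```

Keying on the digest rather than the tag matters because a tag can be moved, whereas a digest identifies the exact bytes that were tested.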
Online vs offline client deployment
| Topology | Manifest delivery |
|---|---|
| Flexli cloud / managed | Pulled from Flexli config service at startup and on refresh; hot-reload or rolling restart |
| Client on-prem, connected | Pulled through site agent; cached locally; rolling restart per site |
| Client on-prem, air-gapped | Signed manifest bundled with the artifact at deployment time; new bundle on the next site visit |
Critical
Even for air-gapped sites, the manifest is generated and signed by Flexli. The client does not edit toggles by hand. A missing, tampered, or expired manifest puts the app into a documented safe state and emits an alert.
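The missing / tampered / expired rule could be sketched as a single startup decision. The Python below is illustrative only: the real Cosign verification is represented by an injected `signature_ok` callable, and `decide_startup`, `StartupState`, and the alert messages are hypothetical names.

```python
from datetime import datetime, timezone
from enum import Enum
from typing import Callable, Optional


class StartupState(Enum):
    RUN = "run"
    SAFE = "safe"


def parse_utc(value: str) -> datetime:
    # Older Python versions reject a trailing "Z", so normalise it first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def decide_startup(
    manifest: Optional[dict],
    signature_ok: Callable[[dict], bool],  # stand-in for real signature verification
    now: datetime,  # must be timezone-aware
    alert: Callable[[str], None],
) -> StartupState:
    """Return RUN only for a present, correctly signed, currently valid manifest."""
    if manifest is None:
        alert("manifest missing")
        return StartupState.SAFE
    if not signature_ok(manifest):
        alert("manifest signature invalid")
        return StartupState.SAFE
    try:
        window = manifest["validityWindow"]
        not_before = parse_utc(window["notBefore"])
        not_after = parse_utc(window["notAfter"])
    except (KeyError, TypeError, ValueError):
        alert("manifest validity window malformed")
        return StartupState.SAFE
    if not (not_before <= now <= not_after):
        alert("manifest outside validity window")
        return StartupState.SAFE
    return StartupState.RUN


MANIFEST = {
    "validityWindow": {
        "notBefore": "2026-06-01T00:00:00Z",
        "notAfter": "2026-12-31T23:59:59Z",
    }
}
state = decide_startup(
    MANIFEST, lambda m: True, datetime(2026, 7, 1, tzinfo=timezone.utc), print
)
```

Every failure path both alerts and returns the safe state, so an air-gapped site cannot fail silently.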
Rollback
- Code rollback: redeploy the previous tagged artifact (retained at least three releases back)
- Config rollback: redeploy the previous signed manifest (manifests are versioned and immutable)
- Every release has a documented rollback criterion (error rate, latency, business KPI) and a named on-call owner
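A documented rollback criterion is easiest to audit when it is expressed as data. The sketch below assumes three thresholds matching the categories named above (error rate, latency, business KPI); the field names, the choice of parcels per hour as the KPI, and the helper functions are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RollbackCriterion:
    max_error_rate: float          # fraction of failed requests
    max_p95_latency_ms: float
    min_parcels_per_hour: float    # hypothetical business KPI


@dataclass(frozen=True)
class ReleaseHealth:
    error_rate: float
    p95_latency_ms: float
    parcels_per_hour: float


def breached(criterion: RollbackCriterion, health: ReleaseHealth) -> List[str]:
    """List which parts of the criterion are violated; empty means keep the release."""
    reasons = []
    if health.error_rate > criterion.max_error_rate:
        reasons.append("error rate")
    if health.p95_latency_ms > criterion.max_p95_latency_ms:
        reasons.append("latency")
    if health.parcels_per_hour < criterion.min_parcels_per_hour:
        reasons.append("business KPI")
    return reasons


def previous_artifact(tags: List[str], current: str) -> str:
    """Return the tag released just before `current`; `tags` is ordered oldest first."""
    index = tags.index(current)
    if index == 0:
        raise ValueError("no earlier artifact retained")
    return tags[index - 1]
```

Returning the reasons rather than a bare boolean gives the on-call owner something concrete to record in the audit log.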
5. Technology Stack
| Layer | Recommended choice |
|---|---|
| Source control & CI | Git (GitHub); GitHub Actions |
| Build & test | Existing .NET / dotnet test; add static-analysis and dependency-scan steps |
| Artifact format & registry | OCI container image + SBOM (CycloneDX); GHCR |
| Signing | Cosign (sigstore) with keys in cloud KMS or HSM |
| Tenant manifest | JSON Schema v1 + detached Cosign signature; bundled with artifact for offline sites |
| Feature flags | OpenFeature SDK with backend adapter selectable per tenant |
| Observability | OpenTelemetry → Grafana stack (Loki, Tempo, Prometheus / Mimir) |
| Secrets & keys | HashiCorp Vault or cloud KMS (AWS KMS / Azure Key Vault) |
| Deployment | Docker Compose (single-box sites) or k3s (multi-service sites) |
| IaC | Terraform (cloud) + Ansible (on-prem) |
6. Release Engineering Process
Two cadences
- Continuous to Flexli-managed environments: every merge to `main` that passes CI is deployed automatically to dev → staging → Flexli-internal pilot site. If canary metrics hold for 30 minutes, the build is promoted further.
- Scheduled to client sites: fixed train tied to the sprint. One MINOR release per sprint, PATCH hotfixes only when needed.
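The client-site cadence maps directly onto semantic versioning: a sprint release bumps MINOR and resets PATCH, a hotfix bumps PATCH. A minimal Python sketch follows, assuming tags carry a `v` prefix as in `vX.Y.0-rc1`; the function names and the `"sprint"` / `"hotfix"` labels are illustrative.

```python
import re
from typing import Tuple

SEMVER_TAG = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse_tag(tag: str) -> Tuple[int, int, int]:
    """Split a released tag such as v1.2.3 into (major, minor, patch)."""
    match = SEMVER_TAG.match(tag)
    if match is None:
        raise ValueError(f"not a release tag: {tag}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch)


def next_train_version(last_released: str, kind: str) -> str:
    """Sprint releases bump MINOR and reset PATCH; hotfixes bump PATCH only."""
    major, minor, patch = parse_tag(last_released)
    if kind == "sprint":
        return f"v{major}.{minor + 1}.0"
    if kind == "hotfix":
        return f"v{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown release kind: {kind}")


def rc_tag(version: str, candidate: int = 1) -> str:
    """Build the release-candidate tag cut at the start of stabilisation."""
    return f"{version}-rc{candidate}"
```

For example, after `v1.2.3` the next sprint release would be `v1.3.0`, first cut as `v1.3.0-rc1`.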
Sprint-aligned release train (2-week sprint)
| Day | What happens | Owner |
|---|---|---|
| 1–8 | Normal development. PRs merge to main continuously. Each merge auto-deploys to dev → staging → Flexli pilot. Canary metrics are watched. | All engineers |
| 9 | Release cut: tag vX.Y.0-rc1 from main. QA owns regression sweep on the rc tag in staging. | Release DRI |
| 10 | Soft freeze on rc tag: only release-blocker fixes (cherry-picked to release/X.Y). New feature work continues on main for next sprint. | Release DRI + QA |
| 11 | Promote to first pilot client (5–10% of fleet). Canary watch for 24 h on per-tenant SLIs. | Release DRI + SRE |
| 12 | If canary green: promote to remaining client sites. Publish release notes and manifest diff per client. Update audit log. | Release DRI |
| 13–14 | Retrospective on the release: DORA metrics for the sprint, any rollback or freeze events, runbook gaps, next-sprint corrections. | Release DRI + Eng Lead |
DORA + one Flexli metric
- Deploy frequency — many merges/day to Flexli envs; ≥ 1 client release per sprint
- Lead time — commit → first production tenant; target < 2 weeks
- Change-failure rate — < 15%
- MTTR — < 1 h cloud, < 1 day on-prem
- Tenant-attribution rate — share of incidents diagnosed via per-tenant telemetry without manual log diving
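Three of these metrics reduce to simple ratios or means over deployment and incident records. The following Python sketch shows one possible computation; the record shapes (`Deployment`, `Incident`) and their field names are assumptions, not an existing Flexli schema.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass(frozen=True)
class Deployment:
    failed: bool  # True if the change caused an incident or a rollback


@dataclass(frozen=True)
class Incident:
    time_to_restore: timedelta
    diagnosed_via_tenant_telemetry: bool


def change_failure_rate(deployments: List[Deployment]) -> float:
    if not deployments:
        raise ValueError("no deployments in the period")
    return sum(d.failed for d in deployments) / len(deployments)


def mean_time_to_restore(incidents: List[Incident]) -> timedelta:
    if not incidents:
        raise ValueError("no incidents in the period")
    total = sum((i.time_to_restore for i in incidents), timedelta())
    return total / len(incidents)


def tenant_attribution_rate(incidents: List[Incident]) -> float:
    if not incidents:
        raise ValueError("no incidents in the period")
    attributed = sum(i.diagnosed_via_tenant_telemetry for i in incidents)
    return attributed / len(incidents)
```

As a worked example, 2 failed deployments out of 20 give a change-failure rate of 0.10, inside the 15% target, and incidents restored in 30 and 90 minutes give an MTTR of exactly 1 hour.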
What we deliberately don't do
- No release branches kept alive between sprints (`release/X.Y` is archived once shipped)
- No manual QA sign-off on every PR: automation is the gate; manual QA owns the rc and canary
- No release committee: a rotating Release DRI per sprint owns the call
7. PoC: Distribution Management Server
What already exists (§10.1 from RFC)
- A working strategy seam. `Program.cs` already registers `IStrategy` and `IDropOffStrategy` implementations for Flexli, Myntra, MyntraSingleScan, CsvClient, LiveClientServer, X1G3, WTM. Client variation is already an interface, not a branch.
- A clean DI container. ASP.NET Core DI is the natural place to bind a tenant manifest to the right strategy at startup.
- Configuration is already file-driven. `appsettings.json` and `configuration.json` exist; today they are flat and client-editable. The PoC converts them into a Flexli-signed manifest validated at startup.
What is missing (§10.2 from RFC)
- No CI, no Dockerfile for production, no signed artifact
- No tenant manifest (strategy selection is hardcoded by configuration values the client can edit on disk)
- No per-tenant observability (App.Metrics counters are present but emit global counters; nothing is tagged with `tenantId`)
- Framework is .NET 6 (out of LTS since Nov 2024)
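The observability gap is the difference between a global counter and one keyed by tenant. The Python sketch below uses a plain in-memory counter as a stand-in; it is not the App.Metrics or OpenTelemetry API, and `TaggedCounter` and the second tenant ID are hypothetical.

```python
from collections import Counter


class TaggedCounter:
    """In-memory stand-in for a metrics counter that requires a tenantId tag."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._by_tenant: Counter = Counter()

    def add(self, amount: int, tenant_id: str) -> None:
        if not tenant_id:
            # Refusing untagged samples is what prevents a silent fall-back
            # to a single global counter.
            raise ValueError("tenantId tag is required")
        self._by_tenant[tenant_id] += amount

    def value(self, tenant_id: str) -> int:
        return self._by_tenant[tenant_id]

    def total(self) -> int:
        # The global figure remains available as a sum over tenants.
        return sum(self._by_tenant.values())


parcels_sorted = TaggedCounter("parcels_sorted")
parcels_sorted.add(3, tenant_id="meesho-blr-wh-01")
parcels_sorted.add(5, tenant_id="myntra-site-01")  # hypothetical second tenant
```

The global total is still derivable from tagged data, whereas per-tenant figures cannot be recovered from a global counter after the fact.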
PoC week plan
| Week | Work | Exit signal |
|---|---|---|
| 1 | Port project to .NET 8 LTS; resolve breaking packages; tests green locally | dotnet build + dotnet test pass |
| 1–2 | Branch protection on main; GitHub Actions workflow (restore → test → scan) | Green CI on a sample PR |
| 2–3 | Dockerfile (multi-stage); push to GHCR; semver tagging from CI | Image runs locally with docker run |
| 3 | SBOM (CycloneDX) + Cosign keyless signing of image | cosign verify succeeds against the pushed image |
| 4 | tenant.manifest.json schema; signed-manifest loader in Program.cs | Tampered manifest is rejected at startup |
| 4–5 | Wire OpenFeature; bind IStrategy / IDropOffStrategy selection to manifest keys | Switch Flexli↔Myntra by manifest change, no rebuild |
| 5–6 | OpenTelemetry SDK; tenantId tag on metrics, logs, traces; local Grafana + Loki + Tempo via compose | Per-tenant dashboard shows the right tenant's traffic |
| 6–7 | docker-compose.yml for offline-style install (app + Postgres + manifest); rollback drill against previous tag | Rollback completes within documented RTO |
| 7–8 | Second pilot tenant: generate a second signed manifest; run both tenants on the same image | Two tenants live on identical artifact, divergent manifests |
| 8–9 | Buffer for issues, runbook, demo polish, internal review | Demo to leadership; G1–G5 gates closed |
Decision gates (G1–G5)
| Gate | Condition |
|---|---|
| G1 | CI green on main; direct push blocked |
| G2 | Signed artifact published with SBOM |
| G3 | Signed manifest validated at startup; tampered or expired rejected |
| G4 | Rollback drill passes within documented RTO |
| G5 | Two tenants on one image, with per-tenant telemetry |
Kill criterion: if porting to .NET 8 or extracting tenant variation from configuration.json requires touching more than ~30% of services or controllers, pause and re-scope the manifest model before continuing.
8. Definition of Success
Implemented successfully when all hold simultaneously:
- One `main` branch in active use; no per-customer branches exist or are required
- Every release is a tagged, signed, immutable artifact paired with a signed tenant manifest
- Onboarding a new tenant is a manifest change, not a code change
- Per-tenant telemetry queryable by `tenantId` across logs, metrics, and traces
- Rollback drills run quarterly and pass within the documented RTO
- Two tenants live on identical artifacts as evidence the model scales without forks