
FLX-ENG-RFC-001 — Branching, Build & Release Engineering

| Field | Value |
| --- | --- |
| RFC ID | FLX-ENG-RFC-001 |
| Status | WIP; target sign-off EoD 22 May 2026 |
| Author | Arun Singh, Senior Distinguished Engineer / Architect (Consulting) |
| Reviewers | Raja Choudhary (Founder), Rahul (Eng Lead), Kanchan, Tushar |
| Scope | mSort application: version control, build, release, multi-tenant rollout, PoC plan, tech stack |
| Supersedes | Ad-hoc dev/prod branching; Meesho-specific default behaviour in baseline |

TL;DR — The decision

Maintain exactly one product codebase on a trunk-based branching model. Handle client-specific behaviour through tenant-aware configuration and feature flags, not through client-specific branches or builds. For offline / semi-connected client sites, ship the same artifact with a Flexli-signed deployment manifest that activates the correct feature set per customer.

Branches manage software lifecycle; tenancy manages customer variation; the deployment manifest carries entitlement.


1. Problem

Flexli is moving from a single-customer baseline (currently carrying Meesho-specific defaults) to a multi-client product line. Without a deliberate engineering process, the natural drift is toward per-customer code forks, per-customer branches, and per-customer builds. That path looks fast for the first three customers and becomes the single largest source of incidents, merge debt, and release risk by customer ten.

Current constraints

  • The baseline codebase carries Meesho-specific defaults — not scalable to other clients
  • Only prod and dev branches with short-lived feature branches — no formal branching model
  • Gaps in publishing, bundling, and deployment — no CI/CD, no container image, no artifact signing
  • Client variation is in business logic itself, not only configuration values

Design principles (any future proposal that violates one needs an explicit waiver)

  1. Branches model lifecycle, tenants model customers. Branching expresses change delivery; tenancy expresses business segmentation. Mixing them creates merge debt.
  2. One codebase, one artifact, multiple activations. The pipeline produces a single versioned artifact; customer differences are activated at deployment time through governed configuration.
  3. Control lives with Flexli, not the customer. Feature entitlement is generated, signed, and shipped by Flexli. Customer-editable plain settings are not a control surface; operability and rollback are first-class concerns.

2. Branching Model

Trunk-based development with short-lived feature branches and on-demand release branches.

Branches

| Branch | Purpose | Lifetime | Merges into |
| --- | --- | --- | --- |
| main | Single source of truth. Always releasable. Protected; no direct push. | Permanent | n/a |
| feature/<id>-<slug> | One unit of work, scoped to a user story or task ID. | Hours to a few days | main via PR + CI |
| release/<version> | Stabilisation window for an upcoming UAT or production cut. Only fixes allowed. | Days to two weeks | main (and tag) |
| hotfix/<id> | Urgent production fix on the currently released tag. | Hours | main + release tag |

Rules

  • Direct pushes to main are blocked; every change lands through a PR with green CI and at least one reviewer (two for release/* and hotfix/*)
  • Feature branches rebase on main daily and merge within five working days
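  • Release tags follow semantic versioning and are cut from a release/* branch; the release-candidate tag (vX.Y.0-rc1) is cut from main, as laid out in the release train in section 6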
  • We do not create: customer/<name> branches, long-lived develop branches, or environment branches
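
The naming and review rules above lend themselves to an automated check in CI. The following is a minimal sketch in Python (chosen for brevity; the product itself is .NET). The regular expressions, the function names `classify_branch` and `required_reviewers`, and the choice to derive the reviewer count from the branch name are all assumptions for illustration, not part of this RFC.

```python
import re
from typing import Optional

# Allowed branch shapes, taken from the branching table.
# The exact patterns are illustrative assumptions.
ALLOWED = {
    "main": re.compile(r"^main$"),
    "feature": re.compile(r"^feature/[A-Za-z0-9]+(?:-[A-Za-z0-9]+)+$"),
    "release": re.compile(r"^release/\d+\.\d+(?:\.\d+)?$"),
    "hotfix": re.compile(r"^hotfix/[A-Za-z0-9-]+$"),
}


def classify_branch(name: str) -> Optional[str]:
    """Return the branch kind, or None when the name is not permitted."""
    for kind, pattern in ALLOWED.items():
        if pattern.match(name):
            return kind
    return None


def required_reviewers(name: str) -> int:
    """One reviewer by default, two for release/* and hotfix/*."""
    kind = classify_branch(name)
    if kind is None:
        raise ValueError(f"branch name not allowed: {name}")
    return 2 if kind in ("release", "hotfix") else 1
```

Under these assumptions, names such as `customer/meesho` or `develop` match no pattern and are rejected, while `release/1.2` requires two reviewers.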

3. Multi-tenant Architecture

Client-specific behaviour is expressed through four mechanisms, in this order of preference:

| Mechanism | When to use | Carrier |
| --- | --- | --- |
| Tenant configuration | Values differ per client: endpoints, limits, regex, SLAs, branding | Signed deployment manifest |
| Feature flags | Same capability, enable/disable per client or per environment | Feature registry in manifest |
| Strategy / plugin interfaces | Business logic genuinely differs (e.g., routing, validation, notification provider) | Interface + per-client implementation, selected by config |
| Deployment topology | Isolation required: regulatory, on-prem, dedicated SLO, noisy-neighbour risk | IaC module variant, same artifact |

What we do not do: edit source per client; maintain client forks; rebuild the artifact per client.

Feature registry and manifest

All flags live in a single feature registry in code, each entry declaring key, owner, default, and dependencies. The per-client manifest references those keys. A manifest is a small signed JSON:

{
  "tenantId": "meesho-blr-wh-01",
  "site": "BLR_WAREHOUSE",
  "version": "1.2.3",
  "features": {
    "awb_data_sync": true,
    "manifest_close": true,
    "manual_sort": false
  },
  "config": {
    "api_base_url": "https://api.meesho.com",
    "data_sync_interval_seconds": 300
  },
  "validityWindow": {
    "notBefore": "2026-06-01T00:00:00Z",
    "notAfter": "2026-12-31T23:59:59Z"
  },
  "signature": "<cosign-detached-signature>"
}

The application validates the manifest signature and validity window at startup and refuses to start on a tampered or expired manifest.
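
The feature registry described above (key, owner, default, dependencies) could be sketched as follows. This is an illustration in Python rather than the product's .NET code. The feature keys come from the sample manifest, but the owners, the defaults, the dependency of `manifest_close` on `awb_data_sync`, and the names `FeatureDef` and `resolve_features` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class FeatureDef:
    key: str
    owner: str
    default: bool
    depends_on: Tuple[str, ...] = ()


# Keys are from the sample manifest; owners, defaults and the
# dependency below are made up for illustration.
REGISTRY: Dict[str, FeatureDef] = {
    f.key: f
    for f in (
        FeatureDef("awb_data_sync", owner="integrations", default=False),
        FeatureDef("manifest_close", owner="sortation", default=False,
                   depends_on=("awb_data_sync",)),
        FeatureDef("manual_sort", owner="sortation", default=False),
    )
}


def resolve_features(manifest_features: Dict[str, bool]) -> Dict[str, bool]:
    """Merge manifest values over registry defaults and validate them."""
    unknown = set(manifest_features) - set(REGISTRY)
    if unknown:
        raise ValueError(f"unknown feature keys: {sorted(unknown)}")
    resolved = {
        key: manifest_features.get(key, definition.default)
        for key, definition in REGISTRY.items()
    }
    for key, enabled in resolved.items():
        if enabled:
            missing = [d for d in REGISTRY[key].depends_on if not resolved[d]]
            if missing:
                raise ValueError(f"{key} requires {missing}")
    return resolved
```

Rejecting unknown keys keeps a manifest from silently referencing a flag that the shipped artifact does not know about.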


4. Build, Artifact, and Deployment

Pipeline stages

Build → Test → Scan → Package → Promote → Deploy → Verify

| Stage | What happens |
| --- | --- |
| Build | dotnet build (compile) |
| Test | dotnet test (unit + integration) |
| Scan | Static analysis + dependency vulnerability scan |
| Package | One immutable OCI image + CycloneDX SBOM + Cosign signature |
| Promote | Same artifact moves QA → Staging → UAT → Production. No rebuild on promotion. |
| Deploy | Orchestrator validates manifest signature and binds it to the artifact |
| Verify | Smoke tests + health checks + canary window |
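
The "no rebuild on promotion" rule in the Promote stage can be made mechanical by pinning each release to the image digest produced once at Package time. The sketch below is in Python for brevity and is not the pipeline's actual implementation; the `Release` type, the `promote` function, and the environment identifiers are assumed names.

```python
from dataclasses import dataclass, field
from typing import List

# Promotion order from the pipeline table.
ENVIRONMENTS = ["qa", "staging", "uat", "production"]


@dataclass
class Release:
    version: str
    digest: str  # immutable image digest recorded at Package time
    promoted_to: List[str] = field(default_factory=list)


def promote(release: Release, target: str, registry_digest: str) -> Release:
    """Advance a release by exactly one environment, without rebuilding."""
    if registry_digest != release.digest:
        raise RuntimeError("digest mismatch: artifact was rebuilt or replaced")
    if len(release.promoted_to) == len(ENVIRONMENTS):
        raise RuntimeError("release is already in production")
    expected = ENVIRONMENTS[len(release.promoted_to)]
    if target != expected:
        raise RuntimeError(f"next environment is {expected}, not {target}")
    release.promoted_to.append(target)
    return release
```

Here a promotion fails both when an environment is skipped and when the digest in the registry no longer matches the one recorded for the release.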

Online vs offline client deployment

| Topology | Manifest delivery |
| --- | --- |
| Flexli cloud / managed | Pulled from Flexli config service at startup and on refresh; hot-reload or rolling restart |
| Client on-prem, connected | Pulled through site agent; cached locally; rolling restart per site |
| Client on-prem, air-gapped | Signed manifest bundled with the artifact at deployment time; new bundle on the next site visit |

Critical

Even for air-gapped sites, the manifest is generated and signed by Flexli. The client does not edit toggles by hand. A missing, tampered, or expired manifest puts the app into a documented safe state and emits an alert.
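
The missing / tampered / expired distinction above can be sketched as a small startup check. This is an illustration in Python, not the product's .NET loader. Two assumptions are made loudly here: the signature is computed over the manifest content with the `signature` field excluded, and an HMAC from the standard library stands in for the real Cosign verification so that the example runs without external tools. The names `ManifestState`, `sign`, and `check_manifest` are hypothetical.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class ManifestState(Enum):
    OK = "ok"
    MISSING = "missing"
    TAMPERED = "tampered"
    EXPIRED = "expired"  # also returned for a manifest that is not yet valid


def _signed_bytes(manifest: dict) -> bytes:
    # Assumption: the signature covers everything except the signature field.
    body = {k: v for k, v in manifest.items() if k != "signature"}
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(manifest: dict, key: bytes) -> str:
    # HMAC is a stand-in for Cosign, used only to keep the sketch runnable.
    return hmac.new(key, _signed_bytes(manifest), hashlib.sha256).hexdigest()


def _parse(ts: str) -> datetime:
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def check_manifest(manifest: Optional[dict], key: bytes,
                   now: Optional[datetime] = None) -> ManifestState:
    """Classify a manifest; anything other than OK maps to the safe state."""
    now = now or datetime.now(timezone.utc)
    if manifest is None:
        return ManifestState.MISSING
    presented = str(manifest.get("signature", "")).encode("utf-8")
    if not hmac.compare_digest(sign(manifest, key).encode("utf-8"), presented):
        return ManifestState.TAMPERED
    window = manifest["validityWindow"]
    if not (_parse(window["notBefore"]) <= now <= _parse(window["notAfter"])):
        return ManifestState.EXPIRED
    return ManifestState.OK
```

The signature is checked before the validity window so that an attacker cannot extend `notAfter` on an otherwise valid manifest.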

Rollback

  • Code rollback: redeploy the previous tagged artifact (artifacts from at least the three previous releases are retained)
  • Config rollback: redeploy the previous signed manifest (manifests are versioned and immutable)
  • Every release has a documented rollback criterion (error rate, latency, business KPI) and a named on-call owner
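
Choosing the rollback target and deciding which artifacts to retain both reduce to ordering tags by semantic version. A minimal Python sketch follows, assuming tags of the form `vMAJOR.MINOR.PATCH` without pre-release suffixes; the function names are hypothetical and the retention rule is read as "the current release plus the three before it".

```python
from typing import List, Tuple


def parse_tag(tag: str) -> Tuple[int, int, int]:
    """Turn 'v1.10.0' into (1, 10, 0) so tags compare numerically."""
    major, minor, patch = tag.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def rollback_target(released: List[str], current: str) -> str:
    """Return the newest released tag that is older than the current one."""
    older = [t for t in released if parse_tag(t) < parse_tag(current)]
    if not older:
        raise LookupError(f"no release older than {current} to roll back to")
    return max(older, key=parse_tag)


def tags_to_retain(released: List[str], keep_back: int = 3) -> List[str]:
    """Keep the newest tag plus keep_back earlier releases."""
    ordered = sorted(released, key=parse_tag, reverse=True)
    return ordered[: keep_back + 1]
```

Comparing parsed tuples rather than strings matters once a minor version reaches two digits: as strings, `v1.10.0` sorts before `v1.3.1`.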

5. Technology Stack

| Layer | Recommended choice |
| --- | --- |
| Source control & CI | Git (GitHub); GitHub Actions |
| Build & test | Existing .NET / dotnet test; add static-analysis and dependency-scan steps |
| Artifact format & registry | OCI container image + SBOM (CycloneDX); GHCR |
| Signing | Cosign (sigstore) with keys in cloud KMS or HSM |
| Tenant manifest | JSON Schema v1 + detached Cosign signature; bundled with artifact for offline sites |
| Feature flags | OpenFeature SDK with backend adapter selectable per tenant |
| Observability | OpenTelemetry → Grafana stack (Loki, Tempo, Prometheus / Mimir) |
| Secrets & keys | HashiCorp Vault or cloud KMS (AWS KMS / Azure Key Vault) |
| Deployment | Docker Compose (single-box sites) or k3s (multi-service sites) |
| IaC | Terraform (cloud) + Ansible (on-prem) |

6. Release Engineering Process

Two cadences

  • Continuous to Flexli-managed environments: every merge to main that passes CI is deployed automatically to dev → staging → Flexli-internal pilot site. If canary metrics hold for 30 minutes, the build is promoted further.
  • Scheduled to client sites: fixed train tied to the sprint. One MINOR release per sprint, PATCH hotfixes only when needed.
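
The 30-minute canary rule for the continuous cadence could be expressed as a small gate function. The sketch below is in Python and rests on assumptions: the RFC does not name the canary metrics, so a single error-rate sample stream and a 1% threshold are used purely for illustration, and `canary_holds` is a hypothetical name.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

CANARY_WINDOW = timedelta(minutes=30)  # from the continuous cadence rule
MAX_ERROR_RATE = 0.01                  # assumed threshold, not from the RFC


def canary_holds(samples: List[Tuple[datetime, float]],
                 deployed_at: datetime,
                 now: datetime,
                 window: timedelta = CANARY_WINDOW,
                 max_error_rate: float = MAX_ERROR_RATE) -> bool:
    """True once the window has elapsed and every sample in it was healthy."""
    if now - deployed_at < window:
        return False  # too early to decide
    in_window = [rate for ts, rate in samples
                 if deployed_at <= ts <= deployed_at + window]
    # No data is treated as a failure rather than as a pass.
    return bool(in_window) and all(rate <= max_error_rate for rate in in_window)
```

Treating an empty sample set as "not holding" avoids promoting a build whose telemetry never arrived.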

Sprint-aligned release train (2-week sprint)

| Day | What happens | Owner |
| --- | --- | --- |
| 1–8 | Normal development. PRs merge to main continuously. Each merge auto-deploys to dev → staging → Flexli pilot. Canary metrics are watched. | All engineers |
| 9 | Release cut: tag vX.Y.0-rc1 from main. QA owns regression sweep on the rc tag in staging. | Release DRI |
| 10 | Soft freeze on rc tag: only release-blocker fixes (cherry-picked to release/X.Y). New feature work continues on main for next sprint. | Release DRI + QA |
| 11 | Promote to first pilot client (5–10% of fleet). Canary watch for 24 h on per-tenant SLIs. | Release DRI + SRE |
| 12 | If canary green: promote to remaining client sites. Publish release notes and manifest diff per client. Update audit log. | Release DRI |
| 13–14 | Retrospective on the release: DORA metrics for the sprint, any rollback or freeze events, runbook gaps, next-sprint corrections. | Release DRI + Eng Lead |

DORA + one Flexli metric

  1. Deploy frequency — many merges/day to Flexli envs; ≥ 1 client release per sprint
  2. Lead time — commit → first production tenant; target < 2 weeks
  3. Change-failure rate — < 15%
  4. MTTR — < 1 h cloud, < 1 day on-prem
  5. Tenant-attribution rate — share of incidents diagnosed via per-tenant telemetry without manual log diving
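
As a worked illustration of how three of these metrics could be computed from deployment records, here is a minimal Python sketch. The `Deployment` record shape and the function names are assumptions; real figures would come from the CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set once a failure is resolved


def change_failure_rate(deps: List[Deployment]) -> float:
    """Share of deployments that caused a failure."""
    return sum(d.failed for d in deps) / len(deps)


def mean_lead_time(deps: List[Deployment]) -> timedelta:
    """Average time from commit to deployment."""
    total = sum((d.deployed_at - d.committed_at for d in deps), timedelta())
    return total / len(deps)


def mttr(deps: List[Deployment]) -> Optional[timedelta]:
    """Mean time to restore across resolved failures, or None if there are none."""
    durations = [d.restored_at - d.deployed_at
                 for d in deps if d.failed and d.restored_at is not None]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)
```

For example, four deployments with one failure give a change-failure rate of 1/4 = 25%, which would exceed the 15% target above.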

What we deliberately don't do

  • No release branches kept alive between sprints (release/X.Y is archived once shipped)
  • No manual QA sign-off on every PR — automation is the gate; manual QA owns the rc and canary
  • No release committee — a rotating Release DRI per sprint owns the call

7. PoC — Distribution Management Server

What already exists (§10.1 from RFC)

  • A working strategy seam. Program.cs already registers IStrategy and IDropOffStrategy implementations for Flexli, Myntra, MyntraSingleScan, CsvClient, LiveClientServer, X1G3, WTM. Client variation is already an interface, not a branch.
  • A clean DI container. ASP.NET Core DI is the natural place to bind a tenant manifest to the right strategy at startup.
  • Configuration is already file-driven. appsettings.json and configuration.json exist; today they are flat and client-editable. The PoC converts them into a Flexli-signed manifest validated at startup.
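
The idea of binding a manifest to a strategy at startup can be sketched in a few lines. The example is in Python for brevity; the real seam is the ASP.NET Core DI registration in Program.cs. The manifest key `config.strategy`, the `describe` method, and the function `select_strategy` are hypothetical, and only two of the listed client strategies are shown.

```python
class FlexliStrategy:
    name = "Flexli"

    def describe(self) -> str:
        return self.name


class MyntraStrategy:
    name = "Myntra"

    def describe(self) -> str:
        return self.name


# Lookup table playing the role of the DI registrations.
STRATEGIES = {cls.name: cls for cls in (FlexliStrategy, MyntraStrategy)}


def select_strategy(manifest: dict):
    """Instantiate the strategy named in the manifest (hypothetical key)."""
    key = manifest["config"]["strategy"]
    try:
        return STRATEGIES[key]()
    except KeyError:
        raise ValueError(f"no strategy registered for {key!r}") from None
```

With this shape, switching a site from one client strategy to another is a change to the signed manifest, not to the code or the image.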

What is missing (§10.2 from RFC)

  • No CI, no Dockerfile for production, no signed artifact
  • No tenant manifest (strategy selection is driven by plain configuration values that the client can edit on disk)
  • No per-tenant observability (App.Metrics is present but emits only global counters; nothing is tagged with tenantId)
  • Framework is .NET 6 (out of support since November 2024)
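
To make the per-tenant observability gap concrete, the following Python sketch shows the difference a `tenantId` tag makes on a counter. It is a toy in-memory model and not the App.Metrics or OpenTelemetry API; `MetricSink`, `TenantScopedMetrics`, the metric name `parcels_sorted`, and the second tenant ID are hypothetical.

```python
from collections import Counter
from typing import Dict


class MetricSink:
    """Toy in-memory sink that stores counters keyed by name and tags."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def increment(self, name: str, tags: Dict[str, str], value: int = 1) -> None:
        self._counts[(name, tuple(sorted(tags.items())))] += value

    def total(self, name: str, **match: str) -> int:
        """Sum a counter, optionally filtered by tag values."""
        return sum(
            count
            for (metric, tags), count in self._counts.items()
            if metric == name and all(dict(tags).get(k) == v for k, v in match.items())
        )


class TenantScopedMetrics:
    """Wrapper that stamps every measurement with the tenant's ID."""

    def __init__(self, sink: MetricSink, tenant_id: str) -> None:
        self._sink = sink
        self._tenant_id = tenant_id

    def increment(self, name: str, value: int = 1, **tags: str) -> None:
        self._sink.increment(name, {**tags, "tenantId": self._tenant_id}, value)
```

A global counter answers only "how many in total"; with the tag in place, `sink.total("parcels_sorted", tenantId="meesho-blr-wh-01")` isolates one tenant's traffic.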

PoC week plan

| Week | Work | Exit signal |
| --- | --- | --- |
| 1 | Port project to .NET 8 LTS; resolve breaking packages; tests green locally | dotnet build + dotnet test pass |
| 1–2 | Branch protection on main; GitHub Actions workflow (restore → test → scan) | Green CI on a sample PR |
| 2–3 | Dockerfile (multi-stage); push to GHCR; semver tagging from CI | Image runs locally with docker run |
| 3 | SBOM (CycloneDX) + Cosign keyless signing of image | cosign verify succeeds against the pushed image |
| 4 | tenant.manifest.json schema; signed-manifest loader in Program.cs | Tampered manifest is rejected at startup |
| 4–5 | Wire OpenFeature; bind IStrategy / IDropOffStrategy selection to manifest keys | Switch Flexli↔Myntra by manifest change, no rebuild |
| 5–6 | OpenTelemetry SDK; tenantId tag on metrics, logs, traces; local Grafana + Loki + Tempo via compose | Per-tenant dashboard shows the right tenant's traffic |
| 6–7 | docker-compose.yml for offline-style install (app + Postgres + manifest); rollback drill against previous tag | Rollback completes within documented RTO |
| 7–8 | Second pilot tenant: generate a second signed manifest; run both tenants on the same image | Two tenants live on identical artifact, divergent manifests |
| 8–9 | Buffer for issues, runbook, demo polish, internal review | Demo to leadership; G1–G5 gates closed |

Decision gates (G1–G5)

| Gate | Condition |
| --- | --- |
| G1 | CI green on main; direct push blocked |
| G2 | Signed artifact published with SBOM |
| G3 | Signed manifest validated at startup; tampered or expired rejected |
| G4 | Rollback drill passes within documented RTO |
| G5 | Two tenants on one image, with per-tenant telemetry |

Kill criterion: if porting to .NET 8 or extracting tenant variation from configuration.json requires touching more than ~30% of services or controllers, pause and re-scope the manifest model before continuing.


8. Definition of Success

The model counts as successfully implemented when all of the following hold simultaneously:

  • One main branch in active use; no per-customer branches exist or are required
  • Every release is a tagged, signed, immutable artifact paired with a signed tenant manifest
  • Onboarding a new tenant is a manifest change, not a code change
  • Per-tenant telemetry is queryable by tenantId across logs, metrics, and traces
  • Rollback drills run quarterly and pass within the documented RTO
  • Two tenants live on identical artifacts as evidence the model scales without forks