Challenge: Production Deployment Validation
How might we make BetterFleet production deployments observable and trustworthy enough that the engineer watching the production pipeline can quickly decide whether a release truly succeeded or needs escalation? The proposed approach validates ECS rollout health, deployment-critical runtime checks such as migrations, and service-specific operational signals, starting with bf-manage-core and establishing a reusable ECS pattern that can later gate multi-region promotion.
Motivation
- Current ECS service pipelines mainly prove Terraform applied successfully.
- ECS startup failures, health check failures, crash loops, or migration failures still require manual inspection in ECS and CloudWatch.
- That manual validation slows releases and depends on operator knowledge of AWS internals.
- A reliable post-deploy validation step is also a prerequisite for safer deployment automation later.
Context
- BetterFleet ECS-backed services deploy through GitLab CI, Docker image publishing, and Terraform.
- The primary user is the engineer/operator watching the production pipeline, who needs enough evidence to decide whether to proceed, alert, or delegate.
- The first slice is bf-manage-core, but the intended pattern is cross-service ECS deployment validation.
- Validation will be evidence-based:
- success means expected success logs were discovered;
- failure means known failure logs appeared, or expected success logs did not appear in time.
- For bf-manage-core, the initial timeout for missing expected logs is about 2-3 minutes.
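The evidence rules above can be sketched as a small pure function. This is only an illustration under stated assumptions: the log signatures are hypothetical placeholders (the real ones are still an open question), the function name `classify` is invented here, and the 180-second default simply takes the upper end of the 2-3 minute range.

```python
import re
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    PENDING = "pending"


def classify(log_lines, success_patterns, failure_patterns,
             elapsed_seconds, timeout_seconds=180):
    """Apply the evidence rules to the log lines collected so far.

    All patterns are regular expressions; the ones used in examples are
    hypothetical, not real bf-manage-core signatures.
    """
    # A known failure signature fails the deploy immediately.
    for line in log_lines:
        if any(re.search(p, line) for p in failure_patterns):
            return Outcome.FAILURE, f"failure log seen: {line}"

    # Every expected success signature must appear at least once.
    missing = [p for p in success_patterns
               if not any(re.search(p, line) for line in log_lines)]
    if not missing:
        return Outcome.SUCCESS, "all expected success logs found"

    # Missing evidence only becomes a failure once the timeout has passed.
    if elapsed_seconds >= timeout_seconds:
        return Outcome.FAILURE, f"timed out waiting for: {missing}"
    return Outcome.PENDING, f"still waiting for: {missing}"
```

In this sketch a failure signature takes precedence over success signatures, so a deploy that logs both is reported as failed. A caller would poll the log source and re-run `classify` until the outcome is no longer `PENDING`.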
Domain Modelling
- Domain flow: GitLab pipeline -> Terraform apply -> ECS service update -> task rollout -> runtime health/log signals -> validation outcome -> operator decision.
- Core entities:
- Deployment pipeline
- Terraform apply result
- Regional ECS service rollout
- Task/container startup and health
- CloudWatch deployment evidence
- Service-specific operational logs
- Validation result
- Engineer/operator decision
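One possible way to represent the validation result and its supporting evidence is sketched below. The class and field names are hypothetical and not taken from any existing BetterFleet code; the sketch only shows how a Terraform result and runtime evidence might combine into a single outcome.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    source: str    # hypothetical labels, e.g. "ecs" or "cloudwatch"
    message: str   # human-readable detail shown to the operator
    passed: bool


@dataclass
class ValidationResult:
    service: str
    region: str
    terraform_applied: bool
    evidence: list = field(default_factory=list)

    @property
    def passed(self) -> bool:
        # Terraform success alone is not enough: at least one piece of
        # runtime evidence must exist, and every piece must have passed.
        return (self.terraform_applied
                and bool(self.evidence)
                and all(e.passed for e in self.evidence))
```

Requiring non-empty evidence is a deliberate choice in this sketch, so that a result cannot pass by merely restating that Terraform applied.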
High-Level Use Cases / JTBD (Required)
- As the engineer watching the production pipeline, I want the deploy flow to show whether production is actually healthy so I can decide quickly whether to proceed, alert, or delegate investigation.
- As a service owner, I want runtime deployment failures surfaced with enough evidence that I can take over targeted investigation without repeating manual AWS checks.
- As a platform team, I want the bf-manage-core pilot to establish a reusable validation pattern for other ECS-backed services.
Current State -> Desired State
| Current State (Pain/Gain) | Desired Outcome | Success Measure |
|---|---|---|
| Pain: Terraform success can hide ECS/runtime failure | Deployment status reflects runtime outcome, not only IaC completion | Fewer routine deploys require manual ECS or CloudWatch inspection |
| Pain: engineers manually inspect ECS and CloudWatch after deploys | Validation evidence is surfaced automatically in the pipeline path | Time from deploy completion to confidence decision drops materially |
| Pain: migration success is not automatically confirmed | bf-manage-core deploys confirm migration success or failure from logs | Routine migration checks no longer require manual CloudWatch review |
| Pain: service-specific operational checks are ad hoc | Services can add explicit operational validation signals over time | Teams can extend validation without redesigning the base pattern |
| Gain to protect: current Terraform-based deployment model is consistent across services | Validation strengthens the current model without replacing it | Existing service deployment model remains usable and comparable |
Assumptions & Open Questions
- Key assumptions:
- The main near-term problem is post-deploy confidence, not Terraform execution itself.
- ECS-backed services share enough rollout behavior to justify a reusable validation pattern.
- Deterministic log-based checks are sufficient for the first slice.
- Open questions:
- What exact success and failure log signatures should bf-manage-core use?
- Which service-specific operational signals are worth adding after rollout and migration checks?
- How should validation results be presented when later used for multi-region promotion decisions?
- Validation plans:
- Review recent deployment investigations and identify the most common manual checks.
- Confirm the first bf-manage-core success/failure signatures with platform and service owners.
- Trial the pattern on bf-manage-core before widening scope to other ECS-backed services.
Constraints & Out of Scope
- Constraints:
- Must fit the current GitLab CI + Terraform + ECS model.
- Must work with one-region-at-a-time deployment.
- Post-deploy validation cannot prevent deployment to the first region, because validation only runs after that deployment has happened.
- Service-level validation can at most gate later regional promotion for the same service.
- Cross-service gating belongs to the orchestration layer and is a later slice.
- Out of scope:
- Full automated production deployment.
- Cross-service release blocking in the service pipeline.
- Replacing Terraform as the deployment mechanism.
- Orchestrator-level gating in the first slice.
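The one-region-at-a-time constraint and same-service regional gating described above can be illustrated with a minimal sketch. The function name `next_region` and the region labels are hypothetical; the real region list and pipeline wiring are not specified here.

```python
def next_region(regions, results):
    """Return the next region to deploy, or None if promotion should stop.

    regions: ordered list of region names, in deployment order.
    results: dict mapping each already-deployed region to True
             (validation passed) or False (validation failed).
    """
    for region in regions:
        if region not in results:
            # Not deployed yet, and every earlier region passed validation.
            return region
        if not results[region]:
            # A failed validation blocks promotion to all later regions.
            return None
    # Every region has already been deployed and validated.
    return None
```

With no results recorded, the function always returns the first region, which reflects the constraint that validation cannot gate the first deployment. Only later regions are held back by an earlier failure.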
Evaluation
- User value:
- Faster confidence after production releases.
- Less manual AWS investigation.
- Clearer escalation when deploys fail.
- Business value & strategic alignment:
- Lower release overhead and operational interruption.
- Better production confidence for ECS services.
- A concrete stepping stone toward automation of deployment flow and promotion decisions.
Risks & Opportunities
- Risks:
- Validation could be too shallow and only restate Terraform success.
- Service-specific differences could weaken a shared pattern if introduced too early.
- Mixing service-pipeline validation with orchestration control too soon could slow delivery.
- Opportunities:
- Standardize deployment evidence across BetterFleet ECS services.
- Reduce reliance on tribal AWS knowledge.
- Build the confidence layer needed for later regional and orchestration gating.
Relationships
- Customer commitments:
- None identified; this is currently an internal engineering and platform capability challenge.
- Related projects:
- Production Hotfix Release
- 0030 - Use GitLab registry and Terraform state for ECS services
- 0006 - Trunk Based Development
- Upstream/downstream dependencies:
- Upstream: GitLab pipeline structure, Terraform outputs, ECS service definitions, and CloudWatch log availability.
- Downstream: release operations, region-promotion flow, and future orchestration-level gating.
Solution Ideas & Tradeoffs
- Idea A (Preferred): add a common post-deploy validation step for ECS rollout health and explicit log evidence.
- Best for reaching value quickly without replacing the current deployment model.
- Idea B: add service-level smoke and operational checks as the primary release validator.
- More expressive, but slower to standardize and more service-specific from the start.
- Idea C: build a broader deployment controller that owns validation and gating.
- Strategically broader, but too large and risky for the first slice.
- Tradeoffs to compare:
- speed to value versus breadth of coverage
- deterministic evidence versus richer heuristics
- reusable platform pattern versus service-specific depth
- service-pipeline validation versus orchestration-level control
- Preferred direction:
- Start with Idea A, then layer migration and service-specific checks, and only then use the validation pattern for regional and orchestration gating.
Release Sequencing
- Natural or value-based order:
- First validate bf-manage-core ECS rollout health.
- Then validate bf-manage-core migration success from CloudWatch logs.
- Then add bf-manage-core service-specific operational validation.
- Then use outcomes to gate subsequent regions for the same service.
- Later, treat cross-service orchestration gating as a separate slice.
- Smaller parts to solve first:
- explicit ECS rollout/startup failure detection
- explicit migration success/failure detection
- simple pass/fail evidence surfaced in the pipeline
- Potential slices/releases:
- Slice 1: bf-manage-core ECS rollout health validation
- Slice 2: bf-manage-core migration validation from CloudWatch logs
- Slice 3: bf-manage-core service-specific operational validation
- Slice 4: subsequent-region gating for the same service
- Slice 5: orchestrator-level gating across services
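As a starting point for Slice 1, rollout health could be derived from the ECS DescribeServices response. The sketch below takes one entry of that response's `services` list and classifies it. The function name `evaluate_rollout` and the passed/failed/pending labels are assumptions. It also assumes the service uses the rolling update (ECS) deployment controller, since `rolloutState` is only reported for that type, and a `FAILED` state generally depends on the deployment circuit breaker being enabled.

```python
def evaluate_rollout(service):
    """Classify one entry of the `services` list from ECS DescribeServices.

    Returns a (status, detail) tuple where status is "passed", "failed",
    or "pending". These labels are illustrative, not an agreed contract.
    """
    # The PRIMARY deployment is the one currently being rolled out.
    primary = next((d for d in service.get("deployments", [])
                    if d.get("status") == "PRIMARY"), None)
    if primary is None:
        return "failed", "no PRIMARY deployment found"

    state = primary.get("rolloutState")
    running = primary.get("runningCount", 0)
    desired = primary.get("desiredCount", 0)

    if state == "FAILED":
        return "failed", primary.get("rolloutStateReason", "rollout failed")
    if state == "COMPLETED" and running >= desired:
        return "passed", f"{running}/{desired} tasks running"
    # Anything else (for example IN_PROGRESS) is not yet conclusive.
    return "pending", f"rollout {state}, {running}/{desired} tasks running"
```

A pipeline step might fetch the input with boto3's `describe_services` call on the ECS client, poll until the status is no longer pending or a timeout expires, and exit non-zero on failure so the result shows up as simple pass/fail evidence in the pipeline.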