Challenge: Production Deployment Validation

How might we make BetterFleet production deployments observable and trustworthy enough that the engineer watching the production pipeline can quickly decide whether a release truly succeeded or needs escalation? The intended approach is to validate ECS rollout health, deployment-critical runtime checks such as migrations, and service-specific operational signals, starting with bf-manage-core and establishing a reusable ECS pattern that can later gate multi-region promotion.

Motivation

Current ECS service pipelines mainly prove that Terraform applied successfully, not that the service is running.
ECS startup failures, health check failures, crash loops, or migration failures still require manual inspection in ECS and CloudWatch.
That manual validation slows releases and depends on operator knowledge of AWS internals.
A reliable post-deploy validation step is also a prerequisite for safer deployment automation later.

Context

BetterFleet ECS-backed services deploy through GitLab CI, Docker image publishing, and Terraform.
The primary user is the engineer/operator watching the production pipeline, who needs enough evidence to decide whether to proceed, alert, or delegate.
The first slice is bf-manage-core, but the intended pattern is cross-service ECS deployment validation.
Validation will be evidence-based:
Success means the expected success logs were found.
Failure means a known failure log appeared, or the expected success logs did not appear in time.
For bf-manage-core, the initial timeout for missing expected logs is about 2-3 minutes.
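
The evidence rule above can be sketched as a small pure function. This is only an illustration under stated assumptions: the log signatures are hypothetical placeholders (the real ones are still an open question), the 180-second default is simply the upper end of the 2-3 minute window, and letting a failure signature outrank a success signature is an assumed tie-break, not an agreed rule.

```python
import re
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    PENDING = "pending"  # no verdict yet, keep polling


# HYPOTHETICAL signatures: placeholders only, the real ones are still to be agreed.
SUCCESS_PATTERNS = [re.compile(r"migrations applied successfully")]
FAILURE_PATTERNS = [re.compile(r"migration failed")]

# ASSUMPTION: upper end of the "about 2-3 minutes" window.
TIMEOUT_SECONDS = 180.0


def evaluate(log_lines, elapsed_seconds, timeout_seconds=TIMEOUT_SECONDS):
    """Return (Outcome, evidence) for the log lines collected so far."""
    # ASSUMPTION: a known failure signature outranks a success signature.
    for line in log_lines:
        if any(p.search(line) for p in FAILURE_PATTERNS):
            return Outcome.FAILURE, line
    for line in log_lines:
        if any(p.search(line) for p in SUCCESS_PATTERNS):
            return Outcome.SUCCESS, line
    # Neither signature seen: fail only once the timeout has passed.
    if elapsed_seconds >= timeout_seconds:
        return Outcome.FAILURE, f"no expected success log within {timeout_seconds:.0f}s"
    return Outcome.PENDING, None
```

A pipeline job would call `evaluate` repeatedly with the log lines fetched so far and stop as soon as the outcome is no longer `PENDING`. Returning the matching line as evidence keeps the result explainable to the operator.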

Domain Modelling

Domain flow: GitLab pipeline -> Terraform apply -> ECS service update -> task rollout -> runtime health/log signals -> validation outcome -> operator decision.
Core entities:
Deployment pipeline
Terraform apply result
Regional ECS service rollout
Task/container startup and health
CloudWatch deployment evidence
Service-specific operational logs
Validation result
Engineer/operator decision
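
One possible shape for the validation result entity, and for how it might be printed in the pipeline log, is sketched below. All field names, the `render_summary` helper, and the output layout are hypothetical; the point is only that a result carries its evidence with it so the operator does not need to reopen ECS or CloudWatch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ValidationResult:
    # All field names are illustrative assumptions, not an agreed schema.
    service: str
    region: str
    check: str  # e.g. "rollout" or "migration"
    passed: bool
    evidence: List[str] = field(default_factory=list)


def render_summary(result: ValidationResult) -> str:
    """Format one result as a pass/fail line followed by its evidence."""
    status = "PASS" if result.passed else "FAIL"
    lines = [f"[{status}] {result.service} {result.region} {result.check}"]
    lines += [f"  evidence: {item}" for item in result.evidence]
    return "\n".join(lines)
```

For example, a failed migration check in a placeholder region `region-a` would print a `[FAIL]` line followed by the log line that triggered it.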

High-Level Use Cases / JTBD (Required)

As the engineer watching the production pipeline, I want the deploy flow to show whether production is actually healthy so I can decide quickly whether to proceed, alert, or delegate investigation.
As a service owner, I want runtime deployment failures surfaced with enough evidence that I can take over targeted investigation without repeating manual AWS checks.
As a platform team, I want the bf-manage-core pilot to establish a reusable validation pattern for other ECS-backed services.

Current State -> Desired State

Current State (Pain/Gain)	Desired Outcome	Success Measure
Pain: Terraform success can hide ECS/runtime failure	Deployment status reflects runtime outcome, not only IaC completion	Fewer routine deploys require manual ECS or CloudWatch inspection
Pain: engineers manually inspect ECS and CloudWatch after deploys	Validation evidence is surfaced automatically in the pipeline path	Time from deploy completion to confidence decision drops materially
Pain: migration success is not automatically confirmed	`bf-manage-core` deploys confirm migration success or failure from logs	Routine migration checks no longer require manual CloudWatch review
Pain: service-specific operational checks are ad hoc	Services can add explicit operational validation signals over time	Teams can extend validation without redesigning the base pattern
Gain to protect: current Terraform-based deployment model is consistent across services	Validation strengthens the current model without replacing it	Existing service deployment model remains usable and comparable

Assumptions & Open Questions

Key assumptions:
The main near-term problem is post-deploy confidence, not Terraform execution itself.
ECS-backed services share enough rollout behavior to justify a reusable validation pattern.
Deterministic log-based checks are sufficient for the first slice.
Open questions:
What exact success and failure log signatures should bf-manage-core use?
Which service-specific operational signals are worth adding after rollout and migration checks?
How should validation results be presented when later used for multi-region promotion decisions?
Validation plans:
Review recent deployment investigations and identify the most common manual checks.
Confirm the first bf-manage-core success/failure signatures with platform and service owners.
Trial the pattern on bf-manage-core before widening scope to other ECS-backed services.

Constraints & Out of Scope

Constraints:
Must fit the current GitLab CI + Terraform + ECS model.
Must work with one-region-at-a-time deployment.
Post-deploy validation cannot prevent deployment to the first region, because validation can only run after that deployment has happened.
Service-level validation can at most gate later regional promotion for the same service.
Cross-service gating belongs to the orchestration layer and is a later slice.
Out of scope:
Full automated production deployment.
Cross-service release blocking in the service pipeline.
Replacing Terraform as the deployment mechanism.
Orchestrator-level gating in the first slice.
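
The one-region-at-a-time constraint and the "gate later regions only" rule can be illustrated with a short sketch. The `promote_regions` function and the `deploy_and_validate` callback are hypothetical names; in practice the sequencing would live in GitLab CI stages rather than in a single Python loop.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def promote_regions(
    regions: Sequence[str],
    deploy_and_validate: Callable[[str], bool],
) -> Tuple[List[str], Optional[str], List[str]]:
    """Deploy regions in order and stop promoting after the first failed validation.

    Returns (validated_regions, failed_region_or_None, skipped_regions).
    """
    validated: List[str] = []
    for index, region in enumerate(regions):
        # The deployment itself always happens before validation can run,
        # so a failure here can only protect the regions that follow.
        if not deploy_and_validate(region):
            return validated, region, list(regions[index + 1:])
        validated.append(region)
    return validated, None, []
```

Note that the first region is always deployed. The gate only decides whether the remaining regions for the same service are attempted.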

Evaluation

User value:
Faster confidence after production releases.
Less manual AWS investigation.
Clearer escalation when deploys fail.
Business value & strategic alignment:
Lower release overhead and operational interruption.
Better production confidence for ECS services.
A concrete stepping stone toward automating the deployment flow and promotion decisions.

Risks & Opportunities

Risks:
Validation could be too shallow and only restate Terraform success.
Service-specific differences could weaken a shared pattern if introduced too early.
Mixing service-pipeline validation with orchestration control too soon could slow delivery.
Opportunities:
Standardize deployment evidence across BetterFleet ECS services.
Reduce reliance on tribal AWS knowledge.
Build the confidence layer needed for later regional and orchestration gating.

Relationships

Customer commitments:
None identified; this is currently an internal engineering and platform capability challenge.
Related projects:
Production Hotfix Release
0030 - Use GitLab registry and Terraform state for ECS services
0006 - Trunk Based Development
Upstream/downstream dependencies:
Upstream: GitLab pipeline structure, Terraform outputs, ECS service definitions, and CloudWatch log availability.
Downstream: release operations, region-promotion flow, and future orchestration-level gating.

Solution Ideas & Tradeoffs

Idea A (Preferred): add a common post-deploy validation step for ECS rollout health and explicit log evidence.
Best for reaching value quickly without replacing the current deployment model.
Idea B: add service-level smoke and operational checks as the primary release validator.
More expressive, but slower to standardize and more service-specific from the start.
Idea C: build a broader deployment controller that owns validation and gating.
Strategically broader, but too large and risky for the first slice.
Tradeoffs to compare:
speed to value versus breadth of coverage
deterministic evidence versus richer heuristics
reusable platform pattern versus service-specific depth
service-pipeline validation versus orchestration-level control
Preferred direction:
Start with Idea A, then layer migration and service-specific checks, and only then use the validation pattern for regional and orchestration gating.

Release Sequencing

Natural or value-based order:
First validate bf-manage-core ECS rollout health.
Then validate bf-manage-core migration success from CloudWatch logs.
Then add bf-manage-core service-specific operational validation.
Then use outcomes to gate subsequent regions for the same service.
Later, treat cross-service orchestration gating as a separate slice.
Smaller parts to solve first:
explicit ECS rollout/startup failure detection
explicit migration success/failure detection
simple pass/fail evidence surfaced in the pipeline
Potential slices/releases:
Slice 1: bf-manage-core ECS rollout health validation
Slice 2: bf-manage-core migration validation from CloudWatch logs
Slice 3: bf-manage-core service-specific operational validation
Slice 4: subsequent-region gating for the same service
Slice 5: orchestrator-level gating across services
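
As a starting point for Slice 1, rollout health could be classified from the service description that the ECS `DescribeServices` API returns. The sketch below is a pure function over that response shape, so it can be tested without AWS access. The `classify_rollout` name and the three result labels are hypothetical. It also assumes the service uses the ECS rolling update deployment type, since `rolloutState` is only reported for those services, and that a `FAILED` state is typically only reached when the deployment circuit breaker is enabled. Both points are worth verifying for bf-manage-core.

```python
def classify_rollout(service: dict) -> str:
    """Classify one ECS service description as 'healthy', 'failed', or 'in_progress'.

    `service` is expected to look like one entry of the "services" list in a
    DescribeServices response, e.g. from boto3:
        ecs.describe_services(cluster=..., services=[...])["services"][0]
    """
    deployments = service.get("deployments", [])
    primary = next((d for d in deployments if d.get("status") == "PRIMARY"), None)
    if primary is None:
        # ASSUMPTION: with no primary deployment, let the caller's timeout decide.
        return "in_progress"

    state = primary.get("rolloutState")
    if state == "FAILED":
        return "failed"

    # Healthy only when the new deployment has completed, older deployments
    # have drained, and the running task count matches the desired count.
    settled = (
        state == "COMPLETED"
        and len(deployments) == 1
        and primary.get("runningCount") == primary.get("desiredCount")
    )
    return "healthy" if settled else "in_progress"
```

Because a stuck rollout may stay `in_progress` indefinitely, a caller would still need its own timeout and would treat expiry as a failure.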