    Challenge: Production Deployment Validation

    How might we make BetterFleet production deployments observable and trustworthy enough that the engineer watching the production pipeline can quickly decide whether a release truly succeeded or needs escalation, by validating ECS rollout health, deployment-critical runtime checks such as migrations, and service-specific operational signals, starting with bf-manage-core and establishing a reusable ECS pattern that can later gate multi-region promotion?

    Motivation

    • Current ECS service pipelines mainly prove Terraform applied successfully.
    • ECS startup failures, health check failures, crash loops, or migration failures still require manual inspection in ECS and CloudWatch.
    • That manual validation slows releases and depends on operator knowledge of AWS internals.
    • A reliable post-deploy validation step is also a prerequisite for safer deployment automation later.

    Context

    • BetterFleet ECS-backed services deploy through GitLab CI, Docker image publishing, and Terraform.
    • The primary user is the engineer/operator watching the production pipeline, who needs enough evidence to decide whether to proceed, alert, or delegate.
    • The first slice is bf-manage-core, but the intended pattern is cross-service ECS deployment validation.
    • Validation will be evidence-based:
      • success means expected success logs were discovered;
      • failure means known failure logs appeared, or expected success logs did not appear in time.
    • For bf-manage-core, the initial timeout for missing expected logs is about 2-3 minutes.
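The evidence rule above can be sketched as a small classifier. This is a minimal illustration, not the agreed implementation: the function name, the `Verdict` type, the regex-based matching, and the 180-second default (the upper end of the 2-3 minute range) are all assumptions, and the actual bf-manage-core log signatures are still an open question.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Sequence


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    PENDING = "pending"


@dataclass(frozen=True)
class Verdict:
    outcome: Outcome
    reason: str


def classify_logs(
    lines: Iterable[str],
    success_patterns: Sequence[str],
    failure_patterns: Sequence[str],
    elapsed_seconds: float,
    timeout_seconds: float = 180.0,  # assumption: upper end of "about 2-3 minutes"
) -> Verdict:
    """Classify a deploy from the log lines seen so far.

    A known failure signature fails the deploy immediately. Success needs
    every expected success signature. Otherwise the result stays pending
    until the timeout, after which missing evidence counts as failure.
    """
    seen = list(lines)
    for pattern in failure_patterns:
        if any(re.search(pattern, line) for line in seen):
            return Verdict(Outcome.FAILURE, f"failure log matched: {pattern}")
    missing = [
        p for p in success_patterns
        if not any(re.search(p, line) for line in seen)
    ]
    if not missing:
        return Verdict(Outcome.SUCCESS, "all expected success logs found")
    if elapsed_seconds >= timeout_seconds:
        return Verdict(Outcome.FAILURE, f"timed out waiting for: {missing}")
    return Verdict(Outcome.PENDING, f"still waiting for: {missing}")
```

A pipeline job could call this repeatedly while polling logs and stop on the first non-pending verdict. Checking failure patterns first keeps the result deterministic when a log stream contains both kinds of signal.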

    Domain Modelling

    • Domain flow: GitLab pipeline -> Terraform apply -> ECS service update -> task rollout -> runtime health/log signals -> validation outcome -> operator decision.
    • Core entities:
      • Deployment pipeline
      • Terraform apply result
      • Regional ECS service rollout
      • Task/container startup and health
      • CloudWatch deployment evidence
      • Service-specific operational logs
      • Validation result
      • Engineer/operator decision
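As a rough illustration of how a few of these entities might relate, the sketch below models a validation result that combines the Terraform apply result with collected runtime evidence. The class names, field names, and the `summary` format are hypothetical and only meant to make the domain flow concrete.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Evidence:
    source: str   # illustrative labels, e.g. "ecs" or "cloudwatch"
    message: str
    passed: bool


@dataclass
class ValidationResult:
    service: str
    region: str
    terraform_applied: bool
    evidence: List[Evidence] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        # Terraform success alone is not enough: at least one runtime
        # signal must exist, and every collected signal must have passed.
        return (
            self.terraform_applied
            and bool(self.evidence)
            and all(e.passed for e in self.evidence)
        )

    def summary(self) -> str:
        """Render a short text block an operator could read in the pipeline log."""
        status = "PASS" if self.passed else "FAIL"
        lines = [f"{status} {self.service} ({self.region})"]
        for e in self.evidence:
            mark = "ok" if e.passed else "x"
            lines.append(f"  [{mark}] {e.source}: {e.message}")
        return "\n".join(lines)
```

Keeping the operator decision out of the model is deliberate in this sketch: the result only carries evidence, and the engineer watching the pipeline still decides whether to proceed, alert, or delegate.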

    High-Level Use Cases / JTBD (Required)

    • As the engineer watching the production pipeline, I want the deploy flow to show whether production is actually healthy so I can decide quickly whether to proceed, alert, or delegate investigation.
    • As a service owner, I want runtime deployment failures surfaced with enough evidence that I can take over targeted investigation without repeating manual AWS checks.
    • As a platform team, I want the bf-manage-core pilot to establish a reusable validation pattern for other ECS-backed services.

    Current State -> Desired State

    | Current State (Pain/Gain) | Desired Outcome | Success Measure |
    | --- | --- | --- |
    | Pain: Terraform success can hide ECS/runtime failure | Deployment status reflects runtime outcome, not only IaC completion | Fewer routine deploys require manual ECS or CloudWatch inspection |
    | Pain: engineers manually inspect ECS and CloudWatch after deploys | Validation evidence is surfaced automatically in the pipeline path | Time from deploy completion to confidence decision drops materially |
    | Pain: migration success is not automatically confirmed | bf-manage-core deploys confirm migration success or failure from logs | Routine migration checks no longer require manual CloudWatch review |
    | Pain: service-specific operational checks are ad hoc | Services can add explicit operational validation signals over time | Teams can extend validation without redesigning the base pattern |
    | Gain to protect: current Terraform-based deployment model is consistent across services | Validation strengthens the current model without replacing it | Existing service deployment model remains usable and comparable |

    Assumptions & Open Questions

    • Key assumptions:
      • The main near-term problem is post-deploy confidence, not Terraform execution itself.
      • ECS-backed services share enough rollout behavior to justify a reusable validation pattern.
      • Deterministic log-based checks are sufficient for the first slice.
    • Open questions:
      • What exact success and failure log signatures should bf-manage-core use?
      • Which service-specific operational signals are worth adding after rollout and migration checks?
      • How should validation results be presented when later used for multi-region promotion decisions?
    • Validation plans:
      • Review recent deployment investigations and identify the most common manual checks.
      • Confirm the first bf-manage-core success/failure signatures with platform and service owners.
      • Trial the pattern on bf-manage-core before widening scope to other ECS-backed services.

    Constraints & Out of Scope

    • Constraints:
      • Must fit the current GitLab CI + Terraform + ECS model.
      • Must work with one-region-at-a-time deployment.
      • Post-deploy validation cannot prevent deployment to the first region, because validation can only run once that deployment has happened.
      • Service-level validation can at most gate later regional promotion for the same service.
      • Cross-service gating belongs to the orchestration layer and is a later slice.
    • Out of scope:
      • Full automated production deployment.
      • Cross-service release blocking in the service pipeline.
      • Replacing Terraform as the deployment mechanism.
      • Orchestrator-level gating in the first slice.
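The one-region-at-a-time constraint can be illustrated with a short sketch of same-service regional gating. The function name and the `deploy` and `validate` callables are hypothetical stand-ins for pipeline steps; nothing here reflects how the GitLab pipeline is actually structured.

```python
from typing import Callable, List, Sequence, Tuple


def promote_through_regions(
    regions: Sequence[str],
    deploy: Callable[[str], None],
    validate: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Deploy one region at a time and stop promoting after a failed validation.

    Returns (deployed, skipped). The first region is always deployed,
    because validation can only run once a deployment exists.
    """
    deployed: List[str] = []
    for index, region in enumerate(regions):
        deploy(region)
        deployed.append(region)
        if not validate(region):
            # Later regions for this service are held back, not rolled back.
            return deployed, list(regions[index + 1:])
    return deployed, []
```

Note that a failed validation only withholds later regions for the same service. It does not undo the region that was just deployed, and it makes no cross-service decision, which matches the scope limits listed above.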

    Evaluation

    • User value:
      • Faster confidence after production releases.
      • Less manual AWS investigation.
      • Clearer escalation when deploys fail.
    • Business value & strategic alignment:
      • Lower release overhead and operational interruption.
      • Better production confidence for ECS services.
      • A concrete stepping stone toward automation of deployment flow and promotion decisions.

    Risks & Opportunities

    • Risks:
      • Validation could be too shallow and only restate Terraform success.
      • Service-specific differences could weaken a shared pattern if introduced too early.
      • Mixing service-pipeline validation with orchestration control too soon could slow delivery.
    • Opportunities:
      • Standardize deployment evidence across BetterFleet ECS services.
      • Reduce reliance on tribal AWS knowledge.
      • Build the confidence layer needed for later regional and orchestration gating.

    Relationships

    • Customer commitments:
      • None identified; this is currently an internal engineering and platform capability challenge.
    • Related projects:
      • Production Hotfix Release
      • 0030 - Use GitLab registry and Terraform state for ECS services
      • 0006 - Trunk-Based Development
    • Upstream/downstream dependencies:
      • Upstream: GitLab pipeline structure, Terraform outputs, ECS service definitions, and CloudWatch log availability.
      • Downstream: release operations, region-promotion flow, and future orchestration-level gating.

    Solution Ideas & Tradeoffs

    • Idea A (Preferred): add a common post-deploy validation step for ECS rollout health and explicit log evidence.
      • Best for reaching value quickly without replacing the current deployment model.
    • Idea B: add service-level smoke and operational checks as the primary release validator.
      • More expressive, but slower to standardize and more service-specific from the start.
    • Idea C: build a broader deployment controller that owns validation and gating.
      • Strategically broader, but too large and risky for the first slice.
    • Tradeoffs to compare:
      • speed to value versus breadth of coverage
      • deterministic evidence versus richer heuristics
      • reusable platform pattern versus service-specific depth
      • service-pipeline validation versus orchestration-level control
    • Preferred direction:
      • Start with Idea A, then layer migration and service-specific checks, and only then use the validation pattern for regional and orchestration gating.

    Release Sequencing

    • Natural or value-based order:
      • First validate bf-manage-core ECS rollout health.
      • Then validate bf-manage-core migration success from CloudWatch logs.
      • Then add bf-manage-core service-specific operational validation.
      • Then use outcomes to gate subsequent regions for the same service.
      • Later, treat cross-service orchestration gating as a separate slice.
    • Smaller parts to solve first:
      • explicit ECS rollout/startup failure detection
      • explicit migration success/failure detection
      • simple pass/fail evidence surfaced in the pipeline
    • Potential slices/releases:
      • Slice 1: bf-manage-core ECS rollout health validation
      • Slice 2: bf-manage-core migration validation from CloudWatch logs
      • Slice 3: bf-manage-core service-specific operational validation
      • Slice 4: subsequent-region gating for the same service
      • Slice 5: orchestrator-level gating across services
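For Slice 1, one possible shape of the rollout health check is a pure function over the response of the ECS `DescribeServices` API. This is a sketch under assumptions: it relies on the `deployments[].rolloutState` field that ECS reports for rolling-update services, the function name and the returned status strings are invented for illustration, and the source does not specify how the pipeline would query ECS.

```python
from typing import Any, Dict, Tuple


def rollout_status(describe_services_response: Dict[str, Any]) -> Tuple[str, str]:
    """Map an ECS DescribeServices response to (status, reason).

    status is "passed", "failed", or "pending" (labels are illustrative).
    Only the first service in the response is inspected.
    """
    services = describe_services_response.get("services", [])
    if not services:
        return "failed", "service not found in response"

    deployments = services[0].get("deployments", [])
    primary = next((d for d in deployments if d.get("status") == "PRIMARY"), None)
    if primary is None:
        return "failed", "no PRIMARY deployment"

    state = primary.get("rolloutState")
    running = primary.get("runningCount", 0)
    desired = primary.get("desiredCount", 0)

    if state == "FAILED":
        return "failed", primary.get("rolloutStateReason", "rollout failed")
    if state == "COMPLETED" and running >= desired:
        return "passed", f"{running}/{desired} tasks running"
    return "pending", f"rollout {state}, {running}/{desired} tasks running"
```

In a pipeline job, the input dictionary could come from a boto3 ECS client's `describe_services(cluster=..., services=[...])` call, polled until the status is no longer pending or a timeout expires. Keeping the decision logic separate from the AWS call makes it testable with recorded responses.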