
    Challenge: BetterFleet On-Prem Continuity Control

    How might we add a reusable on-prem continuity capability to BetterFleet that keeps customer charging sites operating safely through cloud or WAN outages, without creating a second full CMS or weakening security, operability, and auditability?

    Motivation

    • Customers with operationally critical charging sites need credible continuity when cloud connectivity is degraded or unavailable.
    • The current cloud-first model exposes operational risk when site communications are disrupted.
    • BetterFleet needs a challenge statement that separates the reusable product problem from customer-specific deployment requirements.
    • The opportunity is to define a standard on-prem pattern that can be profiled per customer and site instead of redesigned for each deployment.

    Context

    • The emerging BetterFleet direction is a hybrid model in which BetterFleet Cloud remains primary while an on-site BetterFleet IoT Hub provides local continuity capability when needed.
    • The key charger constraint is structural: chargers cannot be re-pointed automatically during an outage, so continuity has to be achieved through a permanent gateway path rather than runtime reconfiguration.
    • Different customers may prefer different operator access patterns, but the core product challenge is the same: routing, authentication, authority, and degraded UX must remain deterministic when the site is isolated.
    • The intended outcome is not feature parity with the cloud CMS. It is a constrained continuity mode that preserves safe charging, site power protection, and operator control until cloud service is restored.

    Domain Modelling

    • Domain: depot and site charging continuity control in a cloud-primary EV charging platform.
    • Primary entities: site, charger, IoT Hub, cloud control plane, local continuity controller, operator session, site meter, authority state, buffered event or transaction record.
    • Core decisions: who currently holds control authority, whether site conditions are safe to continue charging, whether a mode transition is valid, and whether local state can be reconciled back to the cloud.
    • This challenge sits between product architecture, operational resilience, cybersecurity, identity, and field hardware support.

    High-Level Use Cases / JTBD

    • As a depot or site operator, I need charging to continue safely during a cloud outage so operations are not jeopardized.
    • As an operations user, I need to see clearly whether the site is in Cloud, Local Fallback, Manual Local, or Recovery Validation mode so I know which controls are valid.
    • As a site operations team, I need a continuity access model that still works during isolation so outage handling does not depend on ad hoc instructions or separate unmanaged tools.
    • As a platform owner, I need the cloud to remain the long-term system of record so reporting, optimization, integrations, and governance do not fragment.
    • As a support and security team, I need every failover, local action, offline login, and reconciliation step to be auditable so the distributed edge footprint remains supportable.

    Current State -> Desired State

    | Current State (Pain/Gain) | Desired Outcome | Success Measure |
    |---|---|---|
    | Pain: chargers currently depend on the cloud path, so continuity is limited when WAN or cloud connectivity is lost | Chargers stay on a permanent IoT Hub gateway path, with continuity achieved through a control-mode change rather than charger reconfiguration | No charger endpoint change required at failure time; fallback activates within the approved deployment profile |
    | Pain: authority boundaries between cloud and local control are currently ambiguous | Cloud and local modes follow an explicit, debounced state machine with exactly one authority at a time | All mode transitions are deterministic, logged, and free of conflicting control decisions |
    | Pain: outage handling today does not provide a defined local operator experience | Operators get an explicit continuity experience with clear mode banners and reduced controls | Operator drills show clear mode awareness and successful completion of priority continuity tasks |
    | Pain: authentication and authorization behaviour during cloud loss is underspecified | Already-authenticated users retain bounded continuity access, and new operators can log in offline using a synchronized local verifier or cache | Continuity access remains available for approved outage windows without offline privilege escalation |
    | Pain: locally taken actions and site events may diverge from cloud records after recovery | Local events, alarms, transactions, and meter-driven control actions are buffered and reconciled back to the cloud with explicit conflict handling | Critical operational records are not lost, and unresolved reconciliation failures raise visible alarms |
    | Gain to protect: BetterFleet Cloud already provides centralized monitoring, optimization, integrations, and system-of-record capability | Cloud remains the primary control plane and strategic product surface, with local scope deliberately constrained to continuity functions | Local capability stays within an approved minimum feature set and does not evolve into parallel CMS parity |
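The "explicit, debounced state machine with exactly one authority at a time" in the table could be sketched roughly as follows. This is a minimal illustration, not the BetterFleet design: the four mode names come from this challenge, but the transition table, the threshold values, the rule that local authority is held during Recovery Validation, and every class and function name are assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Mode(Enum):
    CLOUD = "Cloud"
    LOCAL_FALLBACK = "Local Fallback"
    MANUAL_LOCAL = "Manual Local"
    RECOVERY_VALIDATION = "Recovery Validation"


# Assumed transition table; the approved set is a detailed-design decision.
ALLOWED = {
    Mode.CLOUD: {Mode.LOCAL_FALLBACK},
    Mode.LOCAL_FALLBACK: {Mode.MANUAL_LOCAL, Mode.RECOVERY_VALIDATION},
    Mode.MANUAL_LOCAL: {Mode.LOCAL_FALLBACK, Mode.RECOVERY_VALIDATION},
    Mode.RECOVERY_VALIDATION: {Mode.CLOUD, Mode.LOCAL_FALLBACK},
}


def authority(mode: Mode) -> str:
    """Map each mode to exactly one authority (assumption: local until handback)."""
    return "cloud" if mode is Mode.CLOUD else "local"


@dataclass
class ContinuityController:
    failover_after_s: float = 30.0    # placeholder: continuous cloud loss before failover
    recovery_window_s: float = 300.0  # placeholder: continuous cloud health before handback
    mode: Mode = Mode.CLOUD
    _since: Optional[float] = None    # start of the streak currently being debounced
    log: list = field(default_factory=list)

    def _go(self, new: Mode, now: float, reason: str) -> None:
        if new not in ALLOWED[self.mode]:
            raise ValueError(f"invalid transition {self.mode} -> {new}")
        self.log.append((now, self.mode, new, reason))  # every transition is recorded
        self.mode = new
        self._since = None

    def observe(self, cloud_ok: bool, now: float) -> Mode:
        """Feed one health observation; return the mode after debouncing."""
        if self.mode is Mode.CLOUD:
            if cloud_ok:
                self._since = None  # a healthy sample resets the loss streak
            else:
                if self._since is None:
                    self._since = now
                if now - self._since >= self.failover_after_s:
                    self._go(Mode.LOCAL_FALLBACK, now, "cloud loss debounced")
        elif self.mode is Mode.LOCAL_FALLBACK:
            if cloud_ok:
                self._go(Mode.RECOVERY_VALIDATION, now, "cloud reachable")
                self._since = now
        elif self.mode is Mode.RECOVERY_VALIDATION:
            if not cloud_ok:
                self._go(Mode.LOCAL_FALLBACK, now, "cloud lost during validation")
            elif now - self._since >= self.recovery_window_s:
                self._go(Mode.CLOUD, now, "recovery validated")
        # MANUAL_LOCAL is left to explicit operator action and is not modeled here.
        return self.mode
```

Because `authority` is a pure function of the single `mode` field, there is never a moment in which both sides are authoritative, and the `log` list gives one entry per transition for audit.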

    Assumptions & Open Questions

    • Key assumptions:
      • BetterFleet Cloud remains the system of record and primary control plane in normal operation.
      • Chargers will be configured once to the IoT Hub virtual IP, or equivalent permanent local gateway path, and not reconfigured dynamically during outages.
      • The local control appliance can be deployed with resilient switching, health monitoring, and bounded local power continuity where the deployment requires it.
      • Local fallback is a business continuity capability with deliberately narrow functional scope.
      • Site topology, UPS expectations, auth freshness, and failover timings may vary by deployment and should be captured as site or customer profiles instead of core product rules.
      • Identity and authorization changes can be synchronized to the offline verifier or cache frequently enough to support secure outage-time login.
    • Open questions:
      • What routing and DNS strategy should BetterFleet standardize for continuity access: same URL, site-local alias, or a supported profile set?
      • What exact local feature set is approved for fallback mode, especially around manual overrides and local-only controls?
      • How should buffered records be modeled, retained, ordered, and reconciled when cloud and local state diverge?
      • What password-verification and credential-synchronization model will be used for offline logins without duplicating the full cloud identity provider locally?
      • Which deployment parameters should be product defaults versus customer-configurable site profile settings?
      • What are the final support, replacement, monitoring, and operational responsibility boundaries across BetterFleet, partners, and customer IT teams?
    • Validation plans:
      • Validate challenge scope and default solution direction across BetterFleet product, architecture, security, and operations stakeholders.
      • Validate failover, restore, UI, offline auth, and reconciliation behaviour in lab simulations before customer deployment.
      • Validate operational usability and supportability at a representative pilot site before scale rollout.
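How buffered records should be modeled, ordered, and reconciled is still an open question above. Purely to make that question concrete, one possible shape is sketched here: each record carries a hub-assigned unique ID and a hub-local sequence number, replay is idempotent, and divergent records are surfaced as conflicts rather than silently overwritten. The record fields, the in-memory `cloud` dictionary, and the `reconcile` function are hypothetical and do not describe an existing BetterFleet pipeline.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class BufferedRecord:
    record_id: str  # assumed: globally unique ID assigned on the hub
    seq: int        # assumed: hub-local monotonic sequence number
    kind: str       # e.g. "transaction", "alarm", "control_action"
    payload: dict


def reconcile(
    cloud: Dict[str, BufferedRecord], buffered: List[BufferedRecord]
) -> Tuple[List[str], List[str]]:
    """Apply buffered records to the cloud store in sequence order.

    Returns (applied_ids, conflict_ids). Conflicts are never overwritten,
    so a caller can raise a visible alarm for each one.
    """
    applied: List[str] = []
    conflicts: List[str] = []
    for rec in sorted(buffered, key=lambda r: r.seq):
        existing = cloud.get(rec.record_id)
        if existing is None:
            cloud[rec.record_id] = rec
            applied.append(rec.record_id)
        elif existing != rec:
            conflicts.append(rec.record_id)
        # An identical record means a previous replay already succeeded: no-op.
    return applied, conflicts
```

Replaying the same buffer twice applies nothing the second time, which is the property that lets the hub retry after a partial upload without duplicating transactions.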

    Constraints & Out of Scope

    • Constraints:
      • The solution must not rely on charger reconfiguration at failure time.
      • The target operating model is one permanent charger endpoint with cloud proxy and local continuity modes.
      • Exactly one control authority may exist at a time.
      • Failover and handback must be automatic or tightly bounded, and in either case debounced and fully auditable.
      • Offline continuity access must use bounded permissions and fail closed when required identity data is missing or stale beyond policy.
      • The operator experience must make current mode and authority explicit, whether the deployment uses a shared entry point or an approved continuity-specific route.
      • Local networking, hardware resilience, and power continuity requirements must be defined per supported deployment profile.
    • Out of scope:
      • Reproducing the full BetterFleet cloud CMS on site.
      • Building a separate local identity provider equivalent to the cloud identity platform.
      • Allowing offline account creation, role changes, password resets, or privilege changes.
      • Delivering full reporting, analytics, advanced optimization, or rich visualization parity in fallback mode.
      • Solving site-wide electrical outages through the control system itself.
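The offline access constraints above (bounded permissions, fail closed on missing or stale identity data, no offline privilege changes) can be illustrated with a small decision function. This is a sketch under stated assumptions: the permission and action names, the `CachedIdentity` shape, and the `max_staleness_s` policy parameter are all hypothetical, and the real staleness policy would come from the deployment profile.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Hypothetical names for the approved continuity feature set.
CONTINUITY_PERMISSIONS = frozenset(
    {"view_site", "authorize_charging", "set_site_power_limit"}
)
# Hypothetical names for the actions this challenge places out of scope offline.
BLOCKED_OFFLINE = frozenset(
    {"create_account", "change_role", "reset_password", "change_privilege"}
)


@dataclass(frozen=True)
class CachedIdentity:
    user_id: str
    permissions: FrozenSet[str]
    synced_at: float  # epoch seconds of the last successful sync from the cloud


def offline_permissions(
    identity: Optional[CachedIdentity], now: float, max_staleness_s: float
) -> FrozenSet[str]:
    """Return the permissions usable offline; empty means access is denied."""
    if identity is None:
        return frozenset()  # fail closed: no cached identity data
    age = now - identity.synced_at
    if age < 0 or age > max_staleness_s:
        return frozenset()  # fail closed: clock anomaly or stale beyond policy
    # Offline access can only narrow what the cloud granted, never widen it.
    return (identity.permissions & CONTINUITY_PERMISSIONS) - BLOCKED_OFFLINE


def is_allowed(
    action: str, identity: Optional[CachedIdentity], now: float, max_staleness_s: float
) -> bool:
    return action in offline_permissions(identity, now, max_staleness_s)
```

Intersecting with the cached grant means a user never gains a permission offline that they did not already hold, which is one way to satisfy "without offline privilege escalation".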

    Evaluation

    • User value:
      • Safer continuity of charging operations during cloud or WAN outages.
      • Less operator confusion during abnormal conditions because access path, mode state, and available controls are explicit.
      • Continued ability to authorize charging and manage site power within agreed continuity limits.
    • Business value & strategic alignment:
      • Reduces service-delivery risk across customers without abandoning the strategic benefits of a cloud-first BetterFleet platform.
      • Creates a reusable edge-resilience pattern that can be profiled by customer and site while remaining supportable and auditable.
      • Creates a defensible pilot-to-scale path that can be validated before broad rollout.

    Risks & Opportunities

    • Risks:
      • The IoT Hub becomes critical-path infrastructure; hardware, firmware, or LAN failures could affect all attached chargers.
      • A continuity access model can fail badly if routing, DNS, session handling, or authority signalling are not explicit.
      • Offline authentication introduces new security and data-governance obligations.
      • Poorly defined reconciliation logic can create reporting drift or disputed system-of-record history after recovery.
      • Local scope can grow into an unsupportable shadow CMS if governance is weak.
    • Opportunities:
      • Establish a clear edge-resilience pattern for BetterFleet deployments with similar outage concerns.
      • Improve product trust by demonstrating deterministic failover, restore, and auditability.
      • Strengthen site-network, device-management, identity-sync, and observability practices that benefit the broader platform roadmap.

    Relationships

    • Customer commitments:
      • Each deployment needs explicit continuity goals covering operator access, site electrical guardrails, outage-time permissions, and recovery expectations.
      • Customer-specific requirements should be captured in appendices or deployment profiles, not embedded in the core solution description.
    • Related projects:
      • BetterFleet Cloud platform evolution, IoT Hub hardware upgrade work, identity synchronization, observability, and site network remediation.
    • Upstream and downstream dependencies:
      • Site LAN readiness, charger Ethernet availability, and UPS or network hardware decisions are prerequisites for pilot success.
      • Identity synchronization, cloud service health signalling, and reconciliation pipelines are prerequisites for safe handback to cloud control.

    Solution Ideas & Tradeoffs

    • Idea A (Preferred): cloud-primary continuity controller using the IoT Hub as a permanent OCPP gateway and temporary local authority, with a reduced local UI exposed through a deployment-approved operator access model.
      • Best fit for the charger constraint and the need for a reusable product capability.
      • Highest design complexity around routing, auth, authority, and reconciliation.
    • Idea B: cloud-primary continuity controller with a separate local access path and explicit manual switchover UX.
      • Simpler routing and degraded-mode behaviour.
      • Weaker operator ergonomics and less consistent cross-site operating model.
    • Idea C: minimal continuity through static charger fallbacks and operational runbooks, without granting the IoT Hub temporary control authority.
      • Lower engineering and cyber complexity.
      • Unlikely to meet continuity, authorization, and visibility goals credibly.
    • Idea D: full on-prem CMS replica with broad local autonomy.
      • Maximum local independence from cloud outages.
      • Highest cost, support overhead, divergence risk, and strategic mismatch.
    • Tradeoffs to compare: implementation cost, field support burden, cyber exposure, operator simplicity, continuity quality, auditability, and long-term product maintainability.
    • Preferred direction: Idea A, provided detailed design proves that routing, offline auth, authority lock, and reconciliation are deterministic and supportable across supported deployment profiles.

    Release Sequencing

    • Natural or value-based order:
      • Alignment and scope freeze.
      • Detailed design across architecture, routing, auth, state authority, logging, sync, hardware, and test strategy.
      • Engineering and build for gateway control, reduced local UI, buffering, reconciliation, and observability.
      • Lab or factory validation for failover, restore, power, auth, and soak scenarios.
      • Pilot deployment and acceptance at a representative customer site.
      • Scale rollout using a repeatable deployment and support package.
    • Smaller parts to solve first:
      • Lock the continuity feature set and control authority rules.
      • Prove the supported operator access and offline authentication strategies.
      • Prove buffer and reconcile behaviour under realistic outage and restore scenarios.
    • Potential slices or releases:
      • Slice 1: architecture and alignment package.
      • Slice 2: detailed design and acceptance criteria.
      • Slice 3: pilot-capable implementation.
      • Slice 4: scale rollout package.

    Appendix: TTC / PowerON-Specific Context

    The following items are specific to the current TTC and PowerON opportunity and are intentionally kept out of the main body so the challenge remains reusable for other customers.

    Origin and stakeholder context

    • This challenge was prompted by TTC and PowerON work on on-prem charging continuity and by the August 2025 PowerON proposal.
    • The immediate audience includes TTC, PowerON, and BetterFleet stakeholders preparing for workshop alignment and later approval.
    • The work intersects with TTC's wider EMS program and TTC site-network readiness activities.

    TTC-specific requirements and preferences

    • TTC comment 5 pushes toward one operator entry point and one browser pane, which raises additional routing, authentication, authority, and degraded UX questions.
    • TTC and PowerON need continuity that protects site electrical limits and supports daily service requirements at TTC depots.
    • Most TTC sites have, or are expected to have, the required hardwired Ethernet topology; Arrow Road remains a known exception to address explicitly.

    TTC-specific working targets from current material

    • Automatic failover target: 30 seconds from genuine cloud connectivity loss.
    • Recovery validation window: 5 minutes before return to cloud begins.
    • Operator-visible reversion countdown: 30 seconds.
    • Temporary Hold Local extension: capped at 5 minutes.
    • Offline continuity lease for already-authenticated operators: up to 6 hours.
    • Identity and authorization sync freshness to the offline verifier or cache: within 5 minutes while connectivity is healthy.
    • UPS-backed control continuity for the IoT Hub and critical local networking: 1 hour.
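Since the main body asks for timings like these to live in site or customer profiles instead of core product rules, the working targets above could be captured as a single profile object. The sketch below is illustrative only: the `ContinuityProfile` class and its field names are hypothetical, while the values are the TTC working targets listed above, converted to seconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContinuityProfile:
    """Hypothetical per-deployment profile; all durations are in seconds."""

    name: str
    failover_after_s: int
    recovery_validation_s: int
    reversion_countdown_s: int
    hold_local_extension_max_s: int
    offline_lease_s: int
    identity_sync_freshness_s: int
    ups_runtime_s: int

    def __post_init__(self) -> None:
        # Reject nonsensical profiles early instead of at failover time.
        for field_name, value in vars(self).items():
            if field_name != "name" and value <= 0:
                raise ValueError(f"{field_name} must be positive, got {value}")


# Values taken from the TTC working targets in this appendix.
TTC_WORKING_TARGETS = ContinuityProfile(
    name="ttc-working-targets",
    failover_after_s=30,             # automatic failover target
    recovery_validation_s=5 * 60,    # recovery validation window
    reversion_countdown_s=30,        # operator-visible reversion countdown
    hold_local_extension_max_s=5 * 60,  # Temporary Hold Local cap
    offline_lease_s=6 * 60 * 60,     # lease for already-authenticated operators
    identity_sync_freshness_s=5 * 60,   # sync freshness while connectivity is healthy
    ups_runtime_s=60 * 60,           # UPS-backed control continuity
)
```

Keeping the targets in one immutable object makes it straightforward to define a different profile for another customer or site without touching the control logic.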

    TTC-specific validation and rollout expectations

    • Validate challenge scope and decisions in the workshop with TTC, PowerON, and BetterFleet stakeholders.
    • Validate failover, restore, UI, offline auth, and reconciliation behaviour in lab simulations before TTC pilot deployment.
    • Validate operational usability and supportability at an agreed TTC pilot site under controlled outage drills before wider rollout.