Challenge: BetterFleet On-Prem Continuity Control

How might we add a reusable on-prem continuity capability to BetterFleet that keeps customer charging sites operating safely through cloud or WAN outages, without creating a second full CMS or weakening security, operability, and auditability?

Motivation

Customers with operationally critical charging sites need credible continuity when cloud connectivity is degraded or unavailable.
The current cloud-first model leaves sites exposed to operational risk when site communications are disrupted.
BetterFleet needs a challenge statement that separates the reusable product problem from customer-specific deployment requirements.
The opportunity is to define a standard on-prem pattern that can be profiled per customer and site instead of redesigned for each deployment.

Context

The emerging BetterFleet direction is a hybrid model in which BetterFleet Cloud remains primary while an on-site BetterFleet IoT Hub provides local continuity capability when needed.
The key charger constraint is structural: chargers cannot be re-pointed automatically during an outage, so continuity has to be achieved through a permanent gateway path rather than runtime reconfiguration.
Different customers may prefer different operator access patterns, but the core product challenge is the same: routing, authentication, authority, and degraded UX must remain deterministic when the site is isolated.
The intended outcome is not feature parity with the cloud CMS. It is a constrained continuity mode that preserves safe charging, site power protection, and operator control until cloud service is restored.
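The permanent gateway path can be pictured with the minimal Python sketch below. It assumes a hypothetical `Gateway` class standing in for the IoT Hub's charger-facing endpoint, with two injected handlers (`forward_to_cloud`, `handle_locally`). None of these names come from BetterFleet, and real OCPP traffic handling would be considerably more involved.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Authority(Enum):
    CLOUD = "cloud"
    LOCAL = "local"


@dataclass
class Gateway:
    """Hypothetical stand-in for the IoT Hub's permanent charger-facing endpoint."""

    forward_to_cloud: Callable[[str], str]
    handle_locally: Callable[[str], str]
    authority: Authority = Authority.CLOUD

    def on_charger_message(self, message: str) -> str:
        # Chargers always talk to this one endpoint; only the handler behind it changes.
        if self.authority is Authority.CLOUD:
            return self.forward_to_cloud(message)
        return self.handle_locally(message)


gateway = Gateway(
    forward_to_cloud=lambda m: f"cloud:{m}",
    handle_locally=lambda m: f"local:{m}",
)
normal = gateway.on_charger_message("Heartbeat")
gateway.authority = Authority.LOCAL  # control-mode change, no charger reconfiguration
isolated = gateway.on_charger_message("Heartbeat")
```

The charger-side call is identical in both cases. Only the `authority` field changes, which is the property the charger constraint requires.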

Domain Modelling

Domain: depot and site charging continuity control in a cloud-primary EV charging platform.
Primary entities: site, charger, IoT Hub, cloud control plane, local continuity controller, operator session, site meter, authority state, buffered event or transaction record.
Core decisions: who currently holds control authority, whether site conditions are safe to continue charging, whether a mode transition is valid, and whether local state can be reconciled back to the cloud.
This challenge sits between product architecture, operational resilience, cybersecurity, identity, and field hardware support.
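The authority and transition decisions can be sketched as a small state table. The four mode names are the ones used in the use cases of this document; the allowed transitions and the mapping from mode to authority are assumptions made for illustration, since the actual rules are still to be designed.

```python
from enum import Enum


class Mode(Enum):
    CLOUD = "Cloud"
    LOCAL_FALLBACK = "Local Fallback"
    MANUAL_LOCAL = "Manual Local"
    RECOVERY_VALIDATION = "Recovery Validation"


# Assumed transition table; the real set of valid transitions is a design decision.
ALLOWED = {
    Mode.CLOUD: {Mode.LOCAL_FALLBACK},
    Mode.LOCAL_FALLBACK: {Mode.MANUAL_LOCAL, Mode.RECOVERY_VALIDATION},
    Mode.MANUAL_LOCAL: {Mode.LOCAL_FALLBACK, Mode.RECOVERY_VALIDATION},
    Mode.RECOVERY_VALIDATION: {Mode.CLOUD, Mode.LOCAL_FALLBACK},
}


def authority_for(mode: Mode) -> str:
    """Exactly one authority per mode. Assumption: cloud holds it only in Cloud mode."""
    return "cloud" if mode is Mode.CLOUD else "local"


def transition(current: Mode, target: Mode, audit_log: list) -> Mode:
    """Apply a transition if the table allows it; log every attempt either way."""
    accepted = target in ALLOWED[current]
    audit_log.append((current.value, target.value, accepted))
    return target if accepted else current
```

Because authority is derived from the single current mode instead of being stored separately, two authorities cannot be active at once in this sketch, and rejected transitions still leave an audit entry.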

High-Level Use Cases / JTBD

As a depot or site operator, I need charging to continue safely during a cloud outage so operations are not jeopardized.
As an operations user, I need to see clearly whether the site is in Cloud, Local Fallback, Manual Local, or Recovery Validation mode so I know which controls are valid.
As a site operations team, I need a continuity access model that still works during isolation so outage handling does not depend on ad hoc instructions or separate unmanaged tools.
As a platform owner, I need the cloud to remain the long-term system of record so reporting, optimization, integrations, and governance do not fragment.
As a support and security team, I need every failover, local action, offline login, and reconciliation step to be auditable so the distributed edge footprint remains supportable.

Current State -> Desired State

Current State (Pain/Gain)	Desired Outcome	Success Measure
Pain: chargers currently depend on the cloud path, so continuity is limited when WAN or cloud connectivity is lost	Chargers stay on a permanent IoT Hub gateway path, with continuity achieved through a control-mode change rather than charger reconfiguration	No charger endpoint change required at failure time; fallback activates within the approved deployment profile
Pain: authority boundaries between cloud and local control are currently ambiguous	Cloud and local modes follow an explicit, debounced state machine with exactly one authority at a time	All mode transitions are deterministic, logged, and free of conflicting control decisions
Pain: outage handling today does not provide a defined local operator experience	Operators get an explicit continuity experience with clear mode banners and reduced controls	Operator drills show clear mode awareness and successful completion of priority continuity tasks
Pain: authentication and authorization behaviour during cloud loss is underspecified	Already-authenticated users retain bounded continuity access, and new operators can log in offline using a synchronized local verifier or cache	Continuity access remains available for approved outage windows without offline privilege escalation
Pain: locally taken actions and site events may diverge from cloud records after recovery	Local events, alarms, transactions, and meter-driven control actions are buffered and reconciled back to the cloud with explicit conflict handling	Critical operational records are not lost, and unresolved reconciliation failures raise visible alarms
Gain to protect: BetterFleet Cloud already provides centralized monitoring, optimization, integrations, and system-of-record capability	Cloud remains the primary control plane and strategic product surface, with local scope deliberately constrained to continuity functions	Local capability stays within an approved minimum feature set and does not evolve into parallel CMS parity
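
The buffering and reconciliation outcome in the table can be illustrated with a minimal sketch. It assumes each locally buffered record carries a hypothetical idempotency key (`record_id`) and a per-site ordering counter (`sequence`), and it models the cloud store as a plain dictionary. How records are really modelled, ordered, and reconciled remains an open question, so this is one possible shape and not a proposed design.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class BufferedRecord:
    record_id: str  # hypothetical idempotency key assigned on site
    sequence: int   # hypothetical per-site ordering counter
    payload: str


def reconcile(
    buffered: List[BufferedRecord], cloud: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Replay buffered records in sequence order; return (applied_ids, conflict_ids)."""
    applied, conflicts = [], []
    for record in sorted(buffered, key=lambda r: r.sequence):
        existing = cloud.get(record.record_id)
        if existing is None:
            cloud[record.record_id] = record.payload  # new to the cloud: apply
            applied.append(record.record_id)
        elif existing != record.payload:
            conflicts.append(record.record_id)  # diverged: leave cloud untouched, flag it
        # identical payload already present: safe replay, nothing to do
    return applied, conflicts


cloud_store = {"tx-2": "stop@10:05"}
site_buffer = [
    BufferedRecord("tx-2", 2, "stop@10:07"),
    BufferedRecord("tx-1", 1, "start@09:58"),
]
applied, conflicts = reconcile(site_buffer, cloud_store)
raise_alarm = bool(conflicts)  # unresolved conflicts should surface as a visible alarm
```

Replaying the same buffer twice applies nothing new, which is the behaviour needed if a reconciliation run is interrupted and retried.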

Assumptions & Open Questions

Key assumptions:
- BetterFleet Cloud remains the system of record and primary control plane in normal operation.
- Chargers will be configured once to the IoT Hub virtual IP, or equivalent permanent local gateway path, and not reconfigured dynamically during outages.
- The local control appliance can be deployed with resilient switching, health monitoring, and bounded local power continuity where the deployment requires it.
- Local fallback is a business continuity capability with deliberately narrow functional scope.
- Site topology, UPS expectations, auth freshness, and failover timings may vary by deployment and should be captured as site or customer profiles instead of core product rules.
- Identity and authorization changes can be synchronized to the offline verifier or cache frequently enough to support secure outage-time login.
Open questions:
- What routing and DNS strategy should BetterFleet standardize for continuity access: same URL, site-local alias, or a supported profile set?
- What exact local feature set is approved for fallback mode, especially around manual overrides and local-only controls?
- How should buffered records be modelled, retained, ordered, and reconciled when cloud and local state diverge?
- What password-verification and credential-synchronization model will be used for offline logins without duplicating the full cloud identity provider locally?
- Which deployment parameters should be product defaults versus customer-configurable site profile settings?
- What are the final support, replacement, monitoring, and operational responsibility boundaries across BetterFleet, partners, and customer IT teams?
Validation plans:
- Validate challenge scope and default solution direction across BetterFleet product, architecture, security, and operations stakeholders.
- Validate failover, restore, UI, offline auth, and reconciliation behaviour in lab simulations before customer deployment.
- Validate operational usability and supportability at a representative pilot site before scale rollout.

Constraints & Out of Scope

Constraints:
- The solution must not rely on charger reconfiguration at failure time.
- The target operating model is one permanent charger endpoint with cloud proxy and local continuity modes.
- Exactly one control authority may exist at a time.
- Failover and handback must be automatic or tightly bounded, debounced, and fully auditable.
- Offline continuity access must use bounded permissions and fail closed when required identity data is missing or stale beyond policy.
- The operator experience must make current mode and authority explicit, whether the deployment uses a shared entry point or an approved continuity-specific route.
- Local networking, hardware resilience, and power continuity requirements must be defined per supported deployment profile.
Out of scope:
- Reproducing the full BetterFleet cloud CMS on site.
- Building a separate local identity provider equivalent to the cloud identity platform.
- Allowing offline account creation, role changes, password resets, or privilege changes.
- Delivering full reporting, analytics, advanced optimization, or rich visualization parity in fallback mode.
- Solving site-wide electrical outages through the control system itself.
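
The fail-closed and bounded-permission constraints on offline access can be sketched as a single check. The `CachedIdentity` type, the `max_staleness` policy parameter, and the `continuity_roles` allow-list are all hypothetical names introduced for illustration; the actual verifier or cache design is still an open question.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class CachedIdentity:
    user: str
    roles: frozenset
    synced_at: datetime  # last successful sync from the cloud identity source


def offline_access(
    identity: Optional[CachedIdentity],
    now: datetime,
    max_staleness: timedelta,
    continuity_roles: frozenset,
) -> frozenset:
    """Return the roles usable offline; an empty set means access is denied."""
    if identity is None:
        return frozenset()  # fail closed: no cached identity data
    if now - identity.synced_at > max_staleness:
        return frozenset()  # fail closed: data stale beyond policy
    # Bounded permissions: only roles on the continuity allow-list survive.
    return identity.roles & continuity_roles
```

The function can only narrow the roles a user already had at the last sync. It has no path that adds a role, which matches the exclusion of offline role or privilege changes.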

Evaluation

User value:
- Safer continuity of charging operations during cloud or WAN outages.
- Less operator confusion during abnormal conditions because access path, mode state, and available controls are explicit.
- Continued ability to authorize charging and manage site power within agreed continuity limits.
Business value & strategic alignment:
- Reduces service-delivery risk across customers without abandoning the strategic benefits of a cloud-first BetterFleet platform.
- Creates a reusable edge-resilience pattern that can be profiled by customer and site while remaining supportable and auditable.
- Creates a defensible pilot-to-scale path that can be validated before broad rollout.

Risks & Opportunities

Risks:
- The IoT Hub becomes critical-path infrastructure; hardware, firmware, or LAN failures could affect all attached chargers.
- A continuity access model can fail in confusing or unsafe ways if routing, DNS, session handling, or authority signalling is left implicit.
- Offline authentication introduces new security and data-governance obligations.
- Poorly defined reconciliation logic can create reporting drift or disputed system-of-record history after recovery.
- Local scope can grow into an unsupportable shadow CMS if governance is weak.
Opportunities:
- Establish a clear edge-resilience pattern for BetterFleet deployments with similar outage concerns.
- Improve product trust by demonstrating deterministic failover, restore, and auditability.
- Strengthen site-network, device-management, identity-sync, and observability practices that benefit the broader platform roadmap.

Relationships

Customer commitments:
- Each deployment needs explicit continuity goals covering operator access, site electrical guardrails, outage-time permissions, and recovery expectations.
- Customer-specific requirements should be captured in appendices or deployment profiles, not embedded in the core solution description.
Related projects:
- BetterFleet Cloud platform evolution, IoT Hub hardware upgrade work, identity synchronization, observability, and site network remediation.
Upstream and downstream dependencies:
- Site LAN readiness, charger Ethernet availability, and UPS or network hardware decisions are prerequisites for pilot success.
- Identity synchronization, cloud service health signalling, and reconciliation pipelines are prerequisites for safe handback to cloud control.

Solution Ideas & Tradeoffs

Idea A (Preferred): cloud-primary continuity controller using the IoT Hub as a permanent OCPP gateway and temporary local authority, with a reduced local UI exposed through a deployment-approved operator access model.
- Best fit for the charger constraint and the need for a reusable product capability.
- Highest design complexity around routing, auth, authority, and reconciliation.
Idea B: cloud-primary continuity controller with a separate local access path and explicit manual switchover UX.
- Simpler routing and degraded-mode behaviour.
- Weaker operator ergonomics and less consistent cross-site operating model.
Idea C: minimal continuity through static charger fallbacks and operational runbooks, without granting the IoT Hub temporary control authority.
- Lower engineering and cyber complexity.
- Unlikely to meet continuity, authorization, and visibility goals credibly.
Idea D: full on-prem CMS replica with broad local autonomy.
- Maximum local independence from cloud outages.
- Highest cost, support overhead, divergence risk, and strategic mismatch.
Tradeoffs to compare: implementation cost, field support burden, cyber exposure, operator simplicity, continuity quality, auditability, and long-term product maintainability.
Preferred direction: Idea A, provided detailed design proves that routing, offline auth, authority lock, and reconciliation are deterministic and supportable across supported deployment profiles.

Release Sequencing

Natural or value-based order:
- Alignment and scope freeze.
- Detailed design across architecture, routing, auth, state authority, logging, sync, hardware, and test strategy.
- Engineering and build for gateway control, reduced local UI, buffering, reconciliation, and observability.
- Lab or factory validation for failover, restore, power, auth, and soak scenarios.
- Pilot deployment and acceptance at a representative customer site.
- Scale rollout using a repeatable deployment and support package.
Smaller parts to solve first:
- Lock the continuity feature set and control authority rules.
- Prove the supported operator access and offline authentication strategies.
- Prove buffer and reconcile behaviour under realistic outage and restore scenarios.
Potential slices or releases:
- Slice 1: architecture and alignment package.
- Slice 2: detailed design and acceptance criteria.
- Slice 3: pilot-capable implementation.
- Slice 4: scale rollout package.

Appendix: TTC / PowerON-Specific Context

The following items are specific to the current TTC and PowerON opportunity and are intentionally kept out of the main body so the challenge remains reusable for other customers.

Origin and stakeholder context

This challenge was prompted by TTC and PowerON work on on-prem charging continuity and by the August 2025 PowerON proposal.
The immediate audience includes TTC, PowerON, and BetterFleet stakeholders preparing for workshop alignment and later approval.
The work intersects with TTC's wider EMS program and TTC site-network readiness activities.

TTC-specific requirements and preferences

TTC comment 5 pushes toward one operator entry point and one browser pane, which raises additional routing, authentication, authority, and degraded UX questions.
TTC and PowerON need continuity that protects site electrical limits and supports daily service requirements at TTC depots.
Most TTC sites have, or are expected to have, the required hardwired Ethernet topology; Arrow Road remains a known exception to address explicitly.

TTC-specific working targets from current material

Automatic failover target: 30 seconds from genuine cloud connectivity loss.
Recovery validation window: 5 minutes before return to cloud begins.
Operator-visible reversion countdown: 30 seconds.
Temporary Hold Local extension: capped at 5 minutes.
Offline continuity lease for already-authenticated operators: up to 6 hours.
Identity and authorization sync freshness to the offline verifier or cache: within 5 minutes while connectivity is healthy.
UPS-backed control continuity for the IoT Hub and critical local networking: 1 hour.
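
These working targets could be captured as one deployment profile, in line with the idea of profiling per customer and site. The sketch below uses a hypothetical `ContinuityProfile` type whose field names are invented here; the default values are the TTC targets listed above. The `should_fail_over` helper reflects one reading of the 30 second target, treating it as the length of continuous loss required before failover, which is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContinuityProfile:
    failover_after_s: int = 30              # automatic failover target
    recovery_validation_s: int = 5 * 60     # window before return to cloud begins
    reversion_countdown_s: int = 30         # operator-visible countdown
    hold_local_cap_s: int = 5 * 60          # Temporary Hold Local extension cap
    offline_lease_s: int = 6 * 3600         # lease for already-authenticated operators
    identity_sync_freshness_s: int = 5 * 60 # sync freshness while connectivity is healthy
    ups_backup_s: int = 3600                # IoT Hub and critical local networking


def should_fail_over(
    loss_started_at: Optional[float], now: float, profile: ContinuityProfile
) -> bool:
    """Debounce: fail over only once loss has persisted for the full target."""
    if loss_started_at is None:  # cloud currently reachable
        return False
    return now - loss_started_at >= profile.failover_after_s


def clamp_hold_local(requested_s: int, profile: ContinuityProfile) -> int:
    """Never grant a Hold Local extension longer than the profile cap."""
    return max(0, min(requested_s, profile.hold_local_cap_s))
```

Another customer's profile would then be a different `ContinuityProfile(...)` instance, with the debounce and clamp logic unchanged.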

TTC-specific validation and rollout expectations

Validate challenge scope and decisions in the workshop with TTC, PowerON, and BetterFleet stakeholders.
Validate failover, restore, UI, offline auth, and reconciliation behaviour in lab simulations before TTC pilot deployment.
Validate operational usability and supportability at an agreed TTC pilot site under controlled outage drills before wider rollout.