Specification: BetterFleet On-Prem Continuity Control¶
TLDR (Solution Summary)¶
- BetterFleet will add a cloud-primary, edge-authoritative-on-failure continuity model in which chargers connect permanently through the BetterFleet IoT Hub gateway so charging sites can continue operating safely during cloud or WAN outages.
- Charger Gateway Path: move charger control through one permanent IoT Hub virtual IP, or equivalent permanent gateway endpoint, and two operating modes (Cloud Proxy, Local Fallback); remove any dependence on charger reconfiguration at failure time.
- Authority State Machine: define explicit site states, failover rules, restore rules, reversion countdown, and Hold Local behavior using deployment-profile thresholds; make control handoff deterministic, auditable, and free of split authority.
- Continuity Operator Experience: provide an approved continuity access model with explicit mode banners and a reduced local continuity UI; keep operators in one managed access pattern while clearly constraining local actions.
- Offline Identity and Continuity Lease: preserve active operator sessions during outage and allow new outage-time login through a synchronized local verifier/cache; keep continuity access available without reproducing the full cloud identity system locally.
- Local Buffer and Reconciliation: persist local events, alarms, transactions, meter-driven decisions, and operator actions for sync-back after recovery; restore the cloud as system of record without silent data loss.
- Edge Resilience and Audit Controls: define health monitoring, standby takeover, local power continuity expectations, transition logging, and security guardrails; make the IoT Hub supportable as critical control-path infrastructure.
1. Summary¶
Problem¶
BetterFleet is currently cloud-first, with chargers connected through the cloud control path. Customers with operationally critical charging sites need charging continuity when cloud or WAN communications are lost, but the charger fleet cannot be reconfigured dynamically during an outage and the local fallback path must not become a second full CMS.
Goal and Success Criteria¶
- Define a developer-ready conceptual specification for on-prem continuity control across BetterFleet customer deployments.
- Preserve BetterFleet Cloud as the primary control plane and system of record during normal operation.
- Define deterministic failover, restore, offline access, and sync-back behavior that can be piloted before scale rollout.
- Success criteria:
- Chargers require no runtime endpoint change when cloud connectivity fails.
- Exactly one authority is active at any time for a site.
- Local fallback activates within the configured failover threshold after genuine cloud connectivity loss.
- Return to cloud begins only after the configured recovery validation window and completes only after reconciliation passes.
- Already-authenticated sessions remain usable for the configured continuity-lease duration, and new operators can log in offline using their normal credentials through a synchronized local verifier/cache.
- Local mode remains limited to the agreed continuity feature set and does not drift into cloud feature parity.
What will be built in this phase¶
- Site Continuity Gateway: define the permanent charger-to-IoT-Hub control path and virtual-IP ownership rules; keep OCPP communications stable across normal and fallback modes.
- Control Authority State Machine: define Cloud Normal, Cloud Suspect, Local Fallback, Manual Local, Recovery Validation, and Reconcile and Handback behavior; ensure one deterministic control-owner model.
- Reduced Continuity UI: define the degraded continuity experience for the approved access model with explicit mode banners, continuity-only actions, and clear authority indication; avoid operator confusion during outage handling.
- Offline Access Contract: define the offline continuity lease for existing sessions and synchronized local credential verification for new logins; preserve secure operator continuity during cloud-loss conditions.
- Buffered Record and Reconciliation Contract: define what is buffered locally, how records are ordered, and how sync-back and conflict handling occur before cloud handback; protect system-of-record integrity.
- Operational Resilience Contract: define local health checks, alarms, standby takeover expectations, and profile-defined control continuity requirements; make the IoT Hub supportable as critical edge infrastructure.
Scope (In)¶
- Permanent IoT Hub gateway path for charger communications.
- Site-level authority state machine and mode transitions.
- Reduced local continuity UI and operator-mode semantics.
- Offline continuity access model for authenticated sessions and new outage-time logins.
- Local buffering, reconciliation, and return-to-cloud rules.
- Security, audit, and resilience requirements that materially shape the conceptual design.
Scope (Out)¶
- Full local replication of the BetterFleet cloud CMS.
- Detailed database schema, storage engine, and infrastructure implementation choices.
- Final network addressing plan, VLAN design, or hardware wiring drawings.
- Detailed replacement design for the cloud identity provider or any second full identity provider on site.
- Backlog-ready stories, acceptance criteria, or BDD scenarios.
Current Baseline¶
| Area | Current behavior / source position | Implication for this spec |
|---|---|---|
| Charger connectivity | Chargers connect directly to BetterFleet Cloud today. | Continuity cannot rely on runtime charger reconfiguration. |
| IoT Hub role | The IoT Hub already exists on site and is used for power-meter ingestion and local compute. | The IoT Hub is the natural on-prem continuity boundary. |
| Fallback capability | Local control has been proposed but is not yet deployed or fully specified. | This spec must define scope boundaries, state rules, and operational contracts. |
| Operator access | Different customers may prefer either a shared entry point or a continuity-specific local route. | The spec must define supported access profiles, mode indication, and degraded-routing assumptions. |
| Authentication | Cloud identity relies on live cloud services; outage behavior is currently underspecified. | Offline continuity lease and offline login rules must be explicit. |
| Recovery behavior | Proposal intent exists, but restore validation, handback, and sync-back logic are incomplete. | This spec must define deterministic return-to-cloud semantics before build starts. |
Future evolution guardrails¶
- This phase must not make it harder to add richer local resilience controls later, but phase 1 must stay continuity-scoped.
- This phase must not require a local feature surface that mirrors the full cloud application.
- This phase must keep the cloud as the canonical long-term source of truth for reporting, optimization, and external integrations.
- This phase must keep room for multiple site profiles and future partner-specific routing/auth choices without changing the core continuity model.
Deployment profile parameters¶
- The core continuity model is shared across customers, but deployment profiles may vary by:
- operator access model
- failover threshold
- recovery validation window
- reversion countdown behavior
- Hold Local cap
- offline continuity-lease duration
- identity-sync freshness threshold
- local power continuity requirement
- Customer-specific values for these parameters belong in appendices or deployment profiles, not in the main solution description.
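As an illustration only, a deployment profile could be modeled as an immutable record of these knobs. Every field name and value below is a hypothetical assumption; concrete values belong in deployment-profile appendices, not here:

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True)
class DeploymentContinuityProfile:
    """Per-deployment parameters (field names illustrative).
    The shared authority state model stays identical across profiles;
    only these values vary."""
    operator_access_model: str            # e.g. shared entry vs. local route
    failover_threshold: timedelta
    recovery_validation_window: timedelta
    reversion_countdown: timedelta
    hold_local_cap: timedelta
    continuity_lease_duration: timedelta
    identity_sync_freshness: timedelta
    local_power_continuity: timedelta     # control-path UPS runtime expectation

# Example values are invented for illustration only.
example_profile = DeploymentContinuityProfile(
    operator_access_model="shared_entry",
    failover_threshold=timedelta(seconds=60),
    recovery_validation_window=timedelta(minutes=5),
    reversion_countdown=timedelta(minutes=2),
    hold_local_cap=timedelta(minutes=30),
    continuity_lease_duration=timedelta(hours=8),
    identity_sync_freshness=timedelta(minutes=15),
    local_power_continuity=timedelta(hours=4),
)
```

Freezing the dataclass reflects the intent that profiles are configuration, not something the edge mutates at runtime.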
2. Users and Use Cases¶
Primary personas¶
| Persona | Primary job | Why this spec matters |
|---|---|---|
| Depot Operator | Keep vehicles charging and understand site state during outages. | Needs continuity access, clear mode visibility, and approved override controls. |
| Control Room / Operations Supervisor | Coordinate site behavior and handback timing. | Needs explicit authority state, countdown visibility, and alarm signals. |
| Support / Field Service | Diagnose gateway, network, failover, auth, and reconciliation issues. | Needs auditability, health visibility, and deterministic rules. |
| Platform / Product Engineer | Build the continuity model safely across cloud and edge. | Needs canonical terms, requirements, and domain boundaries. |
| Security / IAM Owner | Control outage-time access without creating security drift. | Needs a bounded offline authentication model and auditable role use. |
High-level user stories¶
- As a depot operator, I can continue safe charging operations when cloud connectivity is lost so service pull-out is protected.
- As an operator, I can use the approved BetterFleet continuity access route and immediately see whether the site is in cloud or local continuity mode.
- As an operator already signed in when the outage begins, I can keep using continuity controls without being forced through a second login.
- As an operator arriving after the outage begins, I can log in with my normal username and password through an approved offline path.
- As support, I can explain why the site failed over, why it has or has not returned to cloud, and what happened during the outage.
- As the platform, I can restore cloud authority only after buffered local actions and site state have been reconciled.
Edge cases and failure modes¶
- Cloud connectivity is lost, but local LAN or charger communications are also degraded.
- Cloud connectivity flaps repeatedly near the failover threshold.
- An operator manually places the site into local mode while cloud remains available.
- A cloud token expires during a long outage.
- A user is disabled centrally just before an outage and the local verifier/cache is stale.
- Buffered local records cannot be uploaded or reconciled successfully when connectivity returns.
- The active IoT Hub CPU fails while the site is already in fallback mode.
3. Conceptual Model Terms and Decisions¶
Key Terms¶
| Term | Definition | Notes |
|---|---|---|
| Site Continuity Gateway | The BetterFleet IoT Hub acting as the permanent charger-facing endpoint and mode switch between cloud proxy and local control. | Canonical edge boundary for this spec. |
| Cloud Proxy Mode | Operating mode in which the IoT Hub relays charger traffic upstream and does not originate routine site control decisions except approved safety functions. | Normal operating mode. |
| Local Fallback Mode | Operating mode in which the IoT Hub becomes temporary site control authority for the minimum approved continuity feature set. | Entered automatically after debounced cloud-loss detection. |
| Manual Local Mode | Operator-requested local authority mode used for planned maintenance, testing, or operational control while cloud connectivity still exists. | Must remain explicit and auditable. |
| Authority Lock | The rule and state indicating whether cloud or local control is currently authoritative for a site. | Exactly one authority may hold the lock at a time. |
| Recovery Validation | A time-bounded state after cloud health returns in which the system verifies stable recovery before handback begins. | Duration comes from the active deployment profile. |
| Reconcile and Handback | Recovery stage in which buffered records are uploaded, charger/site state is reconciled, and local authority is released only if checks pass. | Precedes return to cloud normal. |
| Offline Continuity Lease | A bounded outage-time permission state derived from the last valid cloud-authenticated session for an already-authenticated user. | Distinct from live cloud-token validity. |
| Local Credential Verifier / Cache | Read-only outage-time verifier holding synchronized credential-verification material and role mappings for approved users. | Supports new offline logins during outage. |
| Buffered Operational Record | Any locally stored event, alarm, transaction, meter-driven decision, operator action, or transition captured during fallback. | Must survive until sync-back succeeds or an alarm is raised. |
| Reduced Continuity UI | Local operator experience exposing only approved continuity features with explicit mode banners and restricted controls. | Not a full local CMS. |
| Deployment Continuity Profile | Named configuration set defining deployment-specific thresholds, access model, and resilience expectations while keeping the shared state model unchanged. | Holds customer or site-specific values outside the core spec. |
Decision Ledger¶
| ID | Decision | Rationale | Alternatives Rejected | Implications |
|---|---|---|---|---|
| D-001 | Chargers connect permanently to the IoT Hub virtual IP rather than directly to the cloud. | Automatic failover is impossible if charger endpoints must be changed during an outage. | Runtime charger reconfiguration or manual repointing at failure time. | Gateway path becomes critical infrastructure. |
| D-002 | BetterFleet Cloud remains primary; the IoT Hub becomes temporary authority only during fallback or manual local operation. | Preserves centralized records, optimization, and integration strategy. | Full dual-primary control or permanent local-first operation. | Cloud remains system of record outside controlled outage windows. |
| D-003 | Exactly one authority lock exists per site at all times. | Prevents split-brain control and conflicting commands. | Concurrent cloud and local decision-making. | State transitions and audit logs are mandatory. |
| D-004 | Automatic failover uses a deployment-profile threshold from genuine cloud-loss detection to local authority activation. | Balances resilience with debounce against transient communication noise while allowing supported profile variation. | Immediate failover on one missed heartbeat or long manual intervention. | Health assessment must be debounced, deterministic, and profile-driven. |
| D-005 | Return to cloud is automatic after a deployment-profile recovery validation window, with reversion countdown and Hold Local behaviour bounded by policy. | Restores cloud authority predictably while allowing brief operational delay and site-specific operating profiles. | Manual-only restore, indefinite hold-local, or immediate handback on first recovered signal. | Countdown, hold, expiry, and handback events must all be logged. |
| D-006 | Already-authenticated users continue under a deployment-profile offline continuity lease with bounded duration. | Operators must not lose continuity access solely because cloud-token renewal is unavailable during outage. | Require reauthentication when token expires or silently treat expired tokens as fully valid cloud sessions. | Local UI must distinguish continuity-lease operation from normal cloud-authenticated mode. |
| D-007 | New offline logins use the operator's normal username and password through a synchronized local verifier/cache. | Outage operations cannot depend only on cached browser sessions. | Separate fallback accounts or no new offline logins. | Identity sync, disable propagation, and local verification are explicit deliverables. |
| D-008 | The local identity cache is read-only during outage mode. | Prevents security drift and local privilege escalation. | Local account admin, password reset, or role editing during outage. | Offline mode excludes provisioning and authorization changes. |
| D-009 | Local functionality is continuity-scoped and deliberately reduced. | Keeps the edge surface supportable and prevents shadow-CMS growth. | Feature parity with cloud. | Scope governance is part of the product contract. |
| D-010 | Cloud handback is blocked until buffered records are uploaded and reconciliation passes. | System-of-record integrity matters more than fastest possible UI transition. | Best-effort upload after handback or unconditional return to cloud. | Recovery failures keep the site in local mode and raise alarms. |
| D-011 | The IoT Hub and critical local networking must meet deployment-profile local power continuity requirements. | Local control path is only useful if the gateway and site LAN stay alive through the outage scenarios the deployment is meant to cover. | No power continuity requirement or charger-power continuity claims beyond site power availability. | Power-loss behavior and support responsibilities must be explicit and profile-defined. |
| D-012 | Manual Local Mode is supported as an explicit operator action for testing, maintenance, and approved operational cases. | Proposal intent includes manual switching capability and testing flexibility. | Automatic-only local control. | Manual mode needs approval rules, alarms, and cloud visibility when used. |
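D-002 and D-003 together imply a single, transferable authority holder per site with a mandatory audit trail. A minimal sketch, with hypothetical names (`Authority`, `AuthorityLock`) that are not part of the spec:

```python
from enum import Enum

class Authority(Enum):
    CLOUD = "cloud"
    LOCAL = "local"

class AuthorityLock:
    """Exactly one authority per site (D-003, conceptual sketch only)."""

    def __init__(self) -> None:
        self._holder = Authority.CLOUD  # cloud is primary by default (D-002)
        self.transitions: list[tuple[Authority, Authority]] = []  # audit trail (D-003)

    @property
    def holder(self) -> Authority:
        return self._holder

    def transfer(self, to: Authority) -> None:
        # A transfer is the only way the holder changes, so there is never
        # a moment with two holders or zero holders.
        if to is self._holder:
            raise ValueError("authority already held; no-op transfers are rejected")
        self.transitions.append((self._holder, to))
        self._holder = to
```

Because every change funnels through `transfer`, each failover and handback leaves a log entry, matching the D-003 implication that state transitions and audit logs are mandatory.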
4. Domain Model and Eventstorming (Conceptual)¶
Bounded context and ubiquitous language¶
- Bounded context: On-Prem Continuity Control.
- Adjacent contexts: BetterFleet Cloud Control, Local Operator Access, Identity Synchronization, Audit and Monitoring.
- Core actors: Depot Operator, Operations Supervisor, IoT Hub Controller, BetterFleet Cloud, Charger Fleet, Site Metering, Identity Source.
- Core commands: evaluate cloud health, declare cloud suspect, lock local authority, enter local fallback, enter manual local, authenticate offline user, record continuity action, begin recovery validation, apply hold local, upload buffered records, reconcile state, restore cloud authority.
- Core domain events: cloud health degraded, cloud unavailable declared, local authority locked, local fallback entered, manual local entered, offline continuity lease activated, offline login succeeded, buffered record stored, recovery validation started, hold local applied, reconciliation passed, reconciliation failed, cloud authority restored.
Aggregates and entities (conceptual)¶
- SiteContinuityController: owns site operating mode, authority lock, failover thresholds, recovery validation, and handback sequencing; owns the decision to enter or exit local control.
- GatewayHealthAssessment: owns cloud reachability signals, upstream OCPP proxy health, local LAN/charger health cross-checks, and debounce state; exists to distinguish transient failures from genuine outages.
- OperatorContinuitySession: owns continuity-lease state for already-authenticated users and offline-login session state for new outage-time users; owns permission scope restricted to reduced continuity UI actions.
- LocalCredentialSnapshot: owns synchronized verifier material, role mappings, freshness metadata, and disable/change timestamps received from the cloud path; is read-only during outage mode.
- BufferedRecordQueue: owns locally persisted operational records awaiting upload and reconciliation; owns acknowledgement state for sync-back completion or failure escalation.
- RecoveryBatch: owns one handback attempt, including upload result, reconciliation result, conflict markers, and final authority decision.
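The debounce behaviour that GatewayHealthAssessment owns can be sketched as a consecutive-failure counter; the threshold values, field names, and string return values below are illustrative assumptions, not normative thresholds:

```python
from dataclasses import dataclass, field

@dataclass
class GatewayHealthAssessment:
    """Debounced cloud-health check (conceptual sketch only).
    Real thresholds come from the active deployment profile."""
    suspect_after: int = 3    # consecutive failures before Cloud Suspect
    declare_after: int = 10   # consecutive failures before outage is declared
    _failures: int = field(default=0, init=False)

    def record_probe(self, ok: bool) -> str:
        # A single success resets the counter, so one transient failure
        # can never trip a site into local fallback on its own.
        self._failures = 0 if ok else self._failures + 1
        if self._failures >= self.declare_after:
            return "cloud_unavailable_declared"
        if self._failures >= self.suspect_after:
            return "cloud_suspect"
        return "cloud_normal"
```

A real assessment would also cross-check local LAN and charger health, as the aggregate description requires; this sketch shows only the debounce core.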
Invariants and business rules (normative)¶
- A charger endpoint must not be changed as part of failover or restore behavior.
- Exactly one authority lock may be active for a site at any time.
- Cloud Proxy Mode, Local Fallback Mode, and Manual Local Mode are mutually exclusive authority states.
- Automatic failover must not occur on a single transient failure; it must follow a debounced cloud-health assessment.
- Automatic handback must not begin until the site has remained in Recovery Validation for the configured validation window.
- Hold Local may extend handback only temporarily and only within the configured policy cap for the active deployment profile.
- The reduced continuity UI must expose only approved continuity functions and must not expose cloud-only administration or configuration workflows.
- The local identity cache must remain read-only in outage mode.
- Offline continuity lease permissions must never exceed the permissions last granted while cloud connectivity was healthy.
- New offline logins must authenticate against synchronized local verifier/cache data and must fail closed if required identity data is missing or stale beyond policy.
- All local control actions, mode transitions, offline auth events, and reconciliation outcomes must produce buffered operational records and audit entries.
- Cloud authority must not be restored until buffered records are uploaded and reconciliation passes.
- Reconciliation failure must keep the site in local authority mode and raise a visible alarm.
- Site-wide power loss is outside the continuity guarantee; the UPS requirement is for control continuity only.
External systems and integrations (conceptual)¶
- Charger Fleet: communicates permanently through the IoT Hub gateway path; continues charger status, control, and transaction exchange through the current authority mode.
- BetterFleet Cloud: provides normal control, system-of-record services, the identity source, and the reconciliation target; receives synced buffered records after recovery and reassumes authority only after handback passes.
- Site Metering: feeds the local power state needed for continuity load management and audit; must remain locally visible during cloud-loss conditions.
- Identity Source: synchronizes credential-verification material, role mappings, and disable/change events while connectivity is healthy; is not duplicated as a full local identity platform.
- Monitoring and Support: consumes alarms, health status, transition logs, and reconciliation outcomes for operational response.
Interaction Flow¶
flowchart LR
Health["Evaluate Cloud Health"] --> Suspect{"Failure persists?"}
Suspect -->|No| Normal["Remain in Cloud Proxy Mode"]
Suspect -->|Yes| Lock["Lock Local Authority"]
Lock --> Fallback["Enter Local Fallback Mode"]
Fallback --> Buffer["Buffer Records and Actions"]
Buffer --> Recover{"Cloud healthy for validation window?"}
Recover -->|No| Fallback
Recover -->|Yes| Countdown["Start Reversion Countdown"]
Countdown --> Hold{"Hold Local applied?"}
Hold -->|Yes, within cap| Countdown
Hold -->|No or expired| Reconcile["Upload and Reconcile"]
Reconcile --> Restore{"Reconciliation passed?"}
Restore -->|Yes| Cloud["Restore Cloud Authority"]
Restore -->|No| Alarm["Raise Alarm and Remain Local"]
Event Timeline¶
timeline
title On-Prem Continuity Event Timeline
CloudHealthDegraded: Upstream failures exceed suspect threshold
CloudUnavailableDeclared: Failover threshold reached
LocalFallbackEntered: Local authority lock applied
OfflineContinuityLeaseActivated: Existing session continues in outage mode
BufferedRecordStored: Local actions and events are persisted
RecoveryValidationStarted: Cloud health remains stable again
HoldLocalApplied: Optional short extension requested
ReconciliationCompleted: Buffered records uploaded and state checked
CloudAuthorityRestored: Site returns to cloud control
Event Dictionary¶
- CloudHealthDegraded: Cloud reachability crosses suspect threshold | starts debounce process | payload: site_id, observed_at, health_failures, last_success_at | may trigger continued retries or local failover.
- CloudUnavailableDeclared: Genuine outage declared | authorizes local control transition | payload: site_id, declared_at, failure_window, health_evidence | triggers local authority lock.
- LocalFallbackEntered: Site enters automatic fallback | continuity mode becomes active | payload: site_id, entered_at, reason, previous_mode | triggers UI banner, local controls, buffering.
- ManualLocalEntered: Operator explicitly enters manual local mode | planned local authority starts | payload: site_id, entered_at, operator_id, reason | triggers cloud alert if cloud remains reachable.
- OfflineContinuityLeaseActivated: Existing session continues past live cloud-auth boundary | preserves outage-time access | payload: site_id, user_id, lease_started_at, lease_expires_at, role_scope | keeps continuity permissions active.
- OfflineLoginSucceeded: New outage-time login accepted through local verifier/cache | grants continuity access for a new session | payload: site_id, user_id, authenticated_at, cache_version, role_scope | creates local continuity session.
- BufferedRecordStored: Local action or event persisted | protects system-of-record reconstruction | payload: record_id, site_id, record_type, occurred_at, source_mode | later drives sync-back.
- RecoveryValidationStarted: Stable cloud recovery detected | begins timed restore gate | payload: site_id, started_at, restore_threshold | may lead to countdown or fallback if health degrades again.
- HoldLocalApplied: Operator delays handback | short operational extension begins | payload: site_id, operator_id, applied_at, hold_until | delays restore within allowed cap.
- ReconciliationCompleted: Upload and state comparison finished | decides handback success/failure | payload: site_id, completed_at, result, conflict_count | triggers restore or alarm.
- CloudAuthorityRestored: Cloud regains control | ends local authority period | payload: site_id, restored_at, recovery_batch_id | returns site to cloud normal.
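The BufferedRecordStored payload and its retention-until-sync-back behaviour might be modeled roughly as below; the field types and the time-ordered queue are illustrative assumptions layered on the payload fields from the dictionary:

```python
import heapq
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class BufferedRecord:
    """Mirrors the BufferedRecordStored payload (sketch only)."""
    occurred_at: str   # ISO-8601 timestamp first, so records sort by time
    record_id: str
    site_id: str
    record_type: str
    source_mode: str

class BufferedRecordQueue:
    """Keeps records until sync-back drains them (conceptual sketch).
    A real queue would persist to disk and track per-record acks."""

    def __init__(self) -> None:
        self._pending: list[BufferedRecord] = []

    def store(self, record: BufferedRecord) -> None:
        heapq.heappush(self._pending, record)

    def drain_in_order(self) -> list[BufferedRecord]:
        # Records leave the queue only at sync-back time, oldest first,
        # which gives the cloud a deterministic replay order.
        out: list[BufferedRecord] = []
        while self._pending:
            out.append(heapq.heappop(self._pending))
        return out
```

Ordering by `occurred_at` is one simple way to satisfy the later requirement that buffered records be timestamped and ordered for reconciliation.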
5. Requirements and Constraints¶
Functional requirements¶
- FR-001: The solution must provide a permanent charger-facing gateway path through the IoT Hub virtual IP so that failover and restore do not depend on charger endpoint changes.
- FR-002: The site continuity controller must implement explicit operating states for Cloud Normal, Cloud Suspect, Local Fallback, Manual Local, Recovery Validation, and Reconcile and Handback.
- FR-003: The site continuity controller must automatically transition from Cloud Suspect to Local Fallback when the debounced cloud-loss threshold is exceeded.
- FR-004: The solution must enforce one authority lock per site and prevent simultaneous cloud and local control authority.
- FR-005: The reduced continuity UI must remain reachable through the agreed continuity access model and must display persistent mode banners indicating the active authority state.
- FR-006: The reduced continuity UI must expose only the approved fallback functions: charger availability/status, current power allocation visibility, site meter visibility, needs-based load management, continuity session authorization, basic alarms, and approved operator overrides.
- FR-007: The solution must support explicit entry into Manual Local Mode by an approved operator and must make that state visible both locally and, where cloud connectivity remains available, in the cloud experience.
- FR-008: The solution must preserve already-authenticated user access during outage through an offline continuity lease derived from the last valid cloud-authenticated identity and role state.
- FR-009: The solution must allow new outage-time operator login using the operator's normal username and password through a securely synchronized local verifier/cache.
- FR-010: The local verifier/cache must provide synchronized role mappings and account-disable/password-change state needed to govern outage-time access.
- FR-011: The solution must prevent offline account creation, role changes, password resets, and privilege escalation during outage mode.
- FR-012: The solution must buffer locally generated operational records, including failover and restore events, charging-session records, charger-status changes, power allocations and overrides, meter readings used for control, alarms, and operator actions.
- FR-013: The solution must upload buffered operational records to the cloud after recovery and before cloud authority is restored.
- FR-014: The solution must reconcile local and cloud state before handback and must remain in local authority mode with a visible alarm if reconciliation fails.
- FR-015: The solution must provide operator-visible reversion countdown behavior when recovery validation succeeds and must support a temporary Hold Local action capped by the active deployment profile.
- FR-016: The solution must record an auditable trail for every mode transition, authority change, outage-time login event, continuity lease activation, Hold Local action, and reconciliation result.
- FR-017: The solution must monitor IoT Hub controller health, standby readiness, switch/network health, and control-application health sufficiently to support active/standby continuity behavior.
- FR-018: The solution must support deployment continuity profiles that define access model, failover threshold, recovery validation window, reversion countdown behaviour, Hold Local cap, continuity-lease duration, identity-sync freshness threshold, and local power continuity requirements without changing the core authority state model.
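FR-015's countdown-plus-Hold-Local interaction can be sketched with a hard cap applied at extension time; all names and the clamping choice below are assumptions, not spec text:

```python
from datetime import datetime, timedelta

class ReversionCountdown:
    """FR-015 sketch: countdown to handback with Hold Local extensions
    capped by the deployment profile (NFR-003). Names are illustrative."""

    def __init__(self, started_at: datetime, countdown: timedelta,
                 hold_cap: timedelta) -> None:
        self.handback_at = started_at + countdown
        # Absolute ceiling: no sequence of holds can push handback past this.
        self.hold_deadline = started_at + countdown + hold_cap

    def hold_local(self, until: datetime) -> None:
        # Requested extensions are honoured only up to the profile cap.
        self.handback_at = min(until, self.hold_deadline)

    def due(self, now: datetime) -> bool:
        return now >= self.handback_at
```

Clamping against an absolute deadline (rather than per-request) is one way to honour the cap even across repeated hold requests.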
Non-functional requirements¶
- NFR-001: Automatic failover from genuine cloud connectivity loss to active local fallback must complete within the configured failover threshold under healthy local-LAN conditions.
- NFR-002: Automatic return-to-cloud must require the configured recovery validation window before the reversion countdown begins.
- NFR-003: Hold Local must not extend local authority beyond the configured cap for a single recovery attempt.
- NFR-004: Already-authenticated outage-time continuity access must remain available only for the configured continuity-lease duration from continuity-lease activation.
- NFR-005: User disables, password changes, and role changes must propagate to the local verifier/cache within the configured identity freshness threshold while cloud connectivity is healthy.
- NFR-006: The IoT Hub and critical local networking must support the configured local power continuity requirement.
- NFR-007: All buffered operational records and audit events must be timestamped and ordered in a way that supports later reconciliation and traceability.
- NFR-008: Security controls must preserve transport protection, role-based access control, audit logging, and segmented local exposure appropriate to the expanded edge attack surface.
- NFR-009: The solution must fail safe toward restricted control when authority, health, or identity state is uncertain.
- NFR-010: Recovery and reconciliation outcomes must be operationally observable so support teams can identify stuck fallback, repeated flapping, or unresolved sync failures.
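NFR-004's bounded lease, combined with the earlier invariant that lease permissions never exceed what the cloud last granted, could look roughly like this; the action names and the continuity set are hypothetical:

```python
from datetime import datetime, timedelta, timezone

class OfflineContinuityLease:
    """NFR-004 sketch: outage-time access bounded in both duration and
    scope. The continuity action set below is an invented example of the
    approved continuity feature set."""

    CONTINUITY_SET = {"view_status", "set_availability", "ack_alarm"}

    def __init__(self, cloud_granted_actions: set[str],
                 started_at: datetime, duration: timedelta) -> None:
        # Scope is the intersection of what the cloud last granted and the
        # approved continuity set: outage access can only shrink, never grow.
        self.actions = set(cloud_granted_actions) & self.CONTINUITY_SET
        self.expires_at = started_at + duration

    def allows(self, action: str, now: datetime) -> bool:
        # Both bounds must hold: inside the lease window and continuity-scoped.
        return now < self.expires_at and action in self.actions
```

The intersection makes the "never exceed last-granted permissions" invariant structural rather than something each call site must remember.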
Constraints and assumptions¶
- The BetterFleet cloud experience remains the primary control plane and long-term system of record.
- The fallback scope is intentionally limited to continuity functions and excludes cloud-only feature parity.
- Hardwired Ethernet connectivity to chargers and local LAN participation are prerequisites for sites using this model.
- Site-wide power continuity for charging is not guaranteed by this solution.
Build item coverage mapping¶
- Site Continuity Gateway -> FR-001, FR-004, FR-017, FR-018, NFR-001, NFR-006
- Control Authority State Machine -> FR-002, FR-003, FR-004, FR-014, FR-015, NFR-001, NFR-002, NFR-003, NFR-009
- Reduced Continuity UI -> FR-005, FR-006, FR-007, FR-015, FR-018
- Offline Access Contract -> FR-008, FR-009, FR-010, FR-011, FR-018, NFR-004, NFR-005, NFR-008, NFR-009
- Buffered Record and Reconciliation Contract -> FR-012, FR-013, FR-014, FR-016, NFR-007, NFR-010
- Operational Resilience Contract -> FR-016, FR-017, FR-018, NFR-006, NFR-008, NFR-010
Verification notes (high-level)¶
- FR-001 to FR-004, NFR-001: validate through lab failover simulations with chargers on the permanent gateway path.
- FR-005 to FR-007: validate through operator workflow tests in cloud, fallback, and manual local states.
- FR-008 to FR-011, NFR-004, NFR-005: validate through offline auth drills covering token expiry, new outage-time login, disabled-user propagation, and stale-cache handling.
- FR-012 to FR-014, NFR-007, NFR-010: validate through outage-and-recovery simulations with buffered-record replay and induced reconciliation failures.
- FR-017, NFR-006: validate through hardware, standby takeover, and power-interruption tests in lab and pilot environments.
6. Interaction and Flow¶
User journey and process steps¶
- In normal operation, chargers communicate through the IoT Hub while cloud retains authority and the standard BetterFleet UI is presented.
- If cloud health degrades, the IoT Hub enters Cloud Suspect and continues retries during the debounce window.
- If failure persists, local authority is locked, the site enters Local Fallback, the UI shows local mode, and local buffering begins.
- Already-authenticated users continue through the continuity lease; new operators may log in through the local verifier/cache.
- Operators perform only approved continuity tasks while alarms, overrides, and meter-driven decisions are buffered locally.
- When cloud health is stable again, Recovery Validation runs for the configured validation window.
- A visible reversion countdown starts; an operator may apply Hold Local temporarily within the active policy cap.
- When the countdown or hold expires, the system uploads buffered records, reconciles state, and restores cloud authority only if reconciliation succeeds.
- If reconciliation fails, the site remains local and raises a visible alarm for support action.
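The outage-time login step above can be sketched as a local credential check against the synchronized verifier/cache. This is a minimal illustration, not the real implementation: the cache layout, the PBKDF2 parameters, and the 5-minute freshness threshold (taken from the TTC working profile; the final value is an open question) are all assumptions, and the fail-closed behavior on a stale cache mirrors the spec's intent.

```python
import hashlib
import hmac

# Hypothetical synchronized verifier cache: username -> (salt, PBKDF2 hash, last-sync epoch seconds).
_SALT = b"per-user-salt"
VERIFIER_CACHE = {
    "operator1": (_SALT, hashlib.pbkdf2_hmac("sha256", b"correct horse", _SALT, 100_000), 1_700_000_000.0),
}

FRESHNESS_THRESHOLD_S = 5 * 60  # illustrative profile value; final threshold is an open question

def offline_login(username: str, password: str, outage_start: float) -> bool:
    """Validate an outage-time login against the local verifier cache, failing closed on staleness."""
    entry = VERIFIER_CACHE.get(username)
    if entry is None:
        return False  # unknown or disabled user: no local identity administration exists
    salt, stored_hash, last_sync = entry
    if outage_start - last_sync > FRESHNESS_THRESHOLD_S:
        return False  # cache too stale at outage onset: fail closed rather than trust old credentials
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return hmac.compare_digest(candidate, stored_hash)
```

Note that the verifier stores only password-derivation output, so the local cache never reproduces the full cloud identity system.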
Flowchart: Authority State Machine¶
```mermaid
stateDiagram-v2
    [*] --> CloudNormal
    CloudNormal --> CloudSuspect: health failures exceed suspect threshold
    CloudSuspect --> CloudNormal: health restored within debounce window
    CloudSuspect --> LocalFallback: failover threshold reached
    CloudNormal --> ManualLocal: approved operator switch
    ManualLocal --> CloudNormal: operator release and cloud healthy
    LocalFallback --> RecoveryValidation: cloud healthy for restore threshold
    ManualLocal --> RecoveryValidation: operator release requested and cloud healthy
    RecoveryValidation --> LocalFallback: health degrades again
    RecoveryValidation --> ReconcileAndHandback: countdown/hold complete
    ReconcileAndHandback --> CloudNormal: reconciliation passes
    ReconcileAndHandback --> LocalFallback: reconciliation fails
```
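The state machine above is small enough to express as an explicit transition table, which makes the "deterministic, auditable, no split authority" requirement easy to test. The sketch below is illustrative: state and event names are chosen for this example, and unknown events deliberately leave the state unchanged so no implicit transitions exist.

```python
from enum import Enum, auto

class SiteState(Enum):
    CLOUD_NORMAL = auto()
    CLOUD_SUSPECT = auto()
    LOCAL_FALLBACK = auto()
    MANUAL_LOCAL = auto()
    RECOVERY_VALIDATION = auto()
    RECONCILE_AND_HANDBACK = auto()

# Allowed transitions keyed by (state, event), mirroring the diagram above.
TRANSITIONS = {
    (SiteState.CLOUD_NORMAL, "suspect_threshold"): SiteState.CLOUD_SUSPECT,
    (SiteState.CLOUD_SUSPECT, "health_restored"): SiteState.CLOUD_NORMAL,
    (SiteState.CLOUD_SUSPECT, "failover_threshold"): SiteState.LOCAL_FALLBACK,
    (SiteState.CLOUD_NORMAL, "operator_switch"): SiteState.MANUAL_LOCAL,
    (SiteState.MANUAL_LOCAL, "operator_release"): SiteState.CLOUD_NORMAL,
    (SiteState.LOCAL_FALLBACK, "restore_threshold"): SiteState.RECOVERY_VALIDATION,
    (SiteState.MANUAL_LOCAL, "release_requested"): SiteState.RECOVERY_VALIDATION,
    (SiteState.RECOVERY_VALIDATION, "health_degraded"): SiteState.LOCAL_FALLBACK,
    (SiteState.RECOVERY_VALIDATION, "countdown_complete"): SiteState.RECONCILE_AND_HANDBACK,
    (SiteState.RECONCILE_AND_HANDBACK, "reconciliation_passed"): SiteState.CLOUD_NORMAL,
    (SiteState.RECONCILE_AND_HANDBACK, "reconciliation_failed"): SiteState.LOCAL_FALLBACK,
}

def step(state: SiteState, event: str) -> SiteState:
    """Apply an event; unknown (state, event) pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Keeping the table as data rather than branching logic also gives transition logging a single choke point, which supports the audit requirements later in this spec.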
Sequence Diagram: Automatic Failover¶
```mermaid
sequenceDiagram
    participant Cloud as BetterFleet Cloud
    participant Hub as IoT Hub
    participant Charger as Chargers
    participant UI as Operator UI
    Cloud->>Hub: Normal control and proxy traffic
    Hub->>Charger: Relay commands and receive status
    Hub->>Hub: Evaluate cloud health continuously
    Cloud--xHub: Upstream failures persist
    Hub->>Hub: Enter Cloud Suspect
    alt failure clears within debounce
        Hub->>Hub: Return to Cloud Normal
    else failover threshold reached
        Hub->>Hub: Lock local authority
        Hub->>UI: Show Local Fallback banner
        Hub->>Charger: Continue local continuity control
        Hub->>Hub: Buffer actions, records, alarms, and transactions
    end
```
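The debounce-then-failover decision in the sequence above can be sketched as a small health evaluator. The probe cadence, the three-failure suspect threshold, and the 30-second failover window are illustrative values (the last matches the TTC working profile); the real thresholds come from the deployment profile.

```python
class CloudHealthMonitor:
    """Debounced failover decision: consecutive probe failures raise Cloud Suspect,
    and only persistent failure past the failover window locks local authority.
    Thresholds here are illustrative deployment-profile values."""

    def __init__(self, suspect_after: int = 3, failover_after_s: float = 30.0):
        self.suspect_after = suspect_after        # consecutive failures before Cloud Suspect
        self.failover_after_s = failover_after_s  # seconds of persistent failure before Local Fallback
        self.failures = 0
        self.first_failure_at = None              # time of first failure in the current streak

    def record_probe(self, ok: bool, now: float) -> str:
        """Record one health probe result and return the resulting mode."""
        if ok:
            # Any successful probe within the debounce window clears the streak.
            self.failures = 0
            self.first_failure_at = None
            return "cloud_normal"
        self.failures += 1
        if self.first_failure_at is None:
            self.first_failure_at = now
        if now - self.first_failure_at >= self.failover_after_s:
            return "local_fallback"   # genuine loss persisted: lock local authority
        if self.failures >= self.suspect_after:
            return "cloud_suspect"    # keep retrying during the debounce window
        return "cloud_normal"
```

Tracking the first failure timestamp, rather than counting failures alone, keeps the failover target time-based regardless of probe frequency.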
Sequence Diagram: Return to Cloud¶
```mermaid
sequenceDiagram
    participant Cloud as BetterFleet Cloud
    participant Hub as IoT Hub
    participant UI as Operator UI
    Cloud->>Hub: Connectivity restored
    Hub->>Hub: Enter Recovery Validation
    Hub->>Hub: Verify stable health for validation window
    Hub->>UI: Show reversion countdown
    opt Hold Local requested
        UI->>Hub: Hold Local
        Hub->>UI: Show temporary extension within policy cap
    end
    Hub->>Cloud: Upload buffered operational records
    Hub->>Cloud: Reconcile charger and site state
    alt reconciliation passes
        Hub->>Hub: Release local authority
        Hub->>UI: Show Cloud Restored
    else reconciliation fails
        Hub->>UI: Remain Local Fallback and raise alarm
    end
```
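The handback rule at the end of this sequence, release local authority only after upload and reconciliation both succeed, can be captured in a few lines. This is a sketch under stated assumptions: `upload` and `reconcile` stand in for the real cloud sync calls, and the return values are hypothetical names for the resulting site mode.

```python
def attempt_handback(buffered, upload, reconcile):
    """Upload buffered records, then reconcile site state; restore cloud authority
    only if both succeed. `upload(record)` and `reconcile()` are caller-supplied
    callables returning True on success. Returns (mode, failed_records)."""
    failed = [record for record in buffered if not upload(record)]
    if failed:
        return ("local_fallback", failed)   # keep local authority; failed records stay buffered
    if not reconcile():
        return ("local_fallback", [])       # state conflict: remain local and raise a visible alarm
    return ("cloud_restored", [])
```

Making the failure path return the unsynced records keeps "no silent data loss" checkable: anything not uploaded remains in the local buffer for a later retry or support action.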
7. Non-Technical Implementation Approach¶
Approach overview¶
- Use the challenge and this specification as the alignment baseline for workshop approval and scope freeze.
- Follow with detailed design focused on routing, auth, state authority, buffering, reconciliation, hardware resilience, and test strategy.
- Build a pilot-capable implementation before any scale rollout commitment.
- Validate in lab/factory conditions before site deployment, then validate again under pilot-site outage drills.
Design considerations¶
- Product priority is continuity and safety, not cloud feature replication.
- Operational clarity is more important than hiding degraded mode; local state must be obvious.
- Security posture must remain explicit even when user experience is seamless.
- Handback quality matters more than fastest possible return to cloud.
- The IoT Hub must be treated as field-critical control infrastructure with corresponding support expectations.
Dependencies and prerequisites¶
- Charger Ethernet readiness and local LAN topology at each target site.
- Dual-controller IoT Hub and resilient local-switching design.
- Agreed continuity access model for local outage access.
- Synchronized local credential-verifier approach and IAM governance agreement.
- Pilot acceptance criteria, deployment profile values, support model, and contractual deliverables.
8. Open Questions¶
- What routing and DNS patterns should BetterFleet support for continuity access across shared-entry-point and continuity-specific local-route profiles?
- What exact operator override actions are approved in fallback mode, and which must remain cloud-only?
- What is the final conflict-resolution policy when local and cloud state differ at handback time?
- What freshness threshold should cause outage-time login to fail closed because the local verifier/cache is too stale?
- Which deployment profile values should BetterFleet standardize versus configure per customer or site?
- What are the formal pilot acceptance criteria and contractual design-deliverable set?
- What are the final standby-takeover and replacement-time expectations for IoT Hub field support?
9. Appendices¶
TTC / PowerON deployment-specific profile¶
- The current TTC and PowerON opportunity is one concrete deployment of this general solution.
- TTC comment 5 prefers one operator entry point and one browser pane.
- Most TTC sites have, or are expected to have, the required hardwired Ethernet topology; Arrow Road remains a known readiness exception outside this conceptual spec.
- TTC working profile values from current material:
- automatic failover target: 30 seconds from genuine cloud connectivity loss
- recovery validation window: 5 minutes
- operator-visible reversion countdown: 30 seconds
- Hold Local cap: 5 minutes
- offline continuity lease duration: 6 hours
- identity-sync freshness threshold: 5 minutes while connectivity is healthy
- local power continuity requirement: 1 hour of UPS-backed control continuity for the IoT Hub and critical local networking
- TTC-specific validation and rollout expectations:
- workshop validation with TTC, PowerON, and BetterFleet stakeholders
- lab simulation before TTC pilot deployment
- controlled outage drills at an agreed TTC pilot site before wider rollout
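The TTC working profile above is exactly the kind of data the spec's "deployment profile" should carry. As a minimal sketch, the profile could be a frozen value object; the field names are illustrative, while the values are the TTC figures listed in this appendix.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeploymentProfile:
    """Per-site continuity thresholds, all in seconds. Field names are
    illustrative; a real profile would also carry routing and access settings."""
    failover_target_s: int            # automatic failover from genuine connectivity loss
    recovery_validation_window_s: int # stable-health window before reversion starts
    reversion_countdown_s: int        # operator-visible countdown before handback
    hold_local_cap_s: int             # maximum Hold Local extension
    continuity_lease_s: int           # offline continuity lease duration
    identity_sync_freshness_s: int    # verifier-cache sync interval while healthy
    ups_continuity_s: int             # required UPS-backed control continuity

# TTC working profile values from the current material.
TTC_PROFILE = DeploymentProfile(
    failover_target_s=30,
    recovery_validation_window_s=5 * 60,
    reversion_countdown_s=30,
    hold_local_cap_s=5 * 60,
    continuity_lease_s=6 * 3600,
    identity_sync_freshness_s=5 * 60,
    ups_continuity_s=3600,
)
```

Freezing the dataclass reflects that profile values change through governance, not at runtime, which also keeps the standardize-versus-configure open question a data decision rather than a code change.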
Inputs reviewed¶
- On-prem continuity challenge
- PowerOn - On-premise control technical proposal.docx (August 2025 source proposal)
- Workshop gap-closure decisions captured in the user-provided draft
Conceptual interfaces and boundaries¶
- Authority State Contract: exposes current mode, authority owner, failover reason, countdown status, and Hold Local state to UI and support surfaces.
- Offline Access Contract: exposes continuity-lease validation and offline-login outcome without allowing local identity administration.
- Buffered Record Contract: exposes record type, source mode, timestamps, upload state, reconciliation status, and audit references for sync-back workflows.
- Recovery Outcome Contract: exposes whether handback passed, what conflicts were found, and whether support intervention is required.
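Two of these conceptual contracts lend themselves to a data sketch. The shapes below are illustrative only, field names and allowed values are assumptions, but they show how the Buffered Record and Recovery Outcome contracts could surface the fields listed above.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class BufferedRecord:
    """One entry under the Buffered Record Contract (illustrative field names)."""
    record_type: str                     # e.g. event | alarm | transaction | meter_decision | operator_action
    source_mode: str                     # e.g. local_fallback | manual_local
    created_at: str                      # ISO-8601 timestamp captured at write time
    upload_state: str = "pending"        # pending | uploaded | failed
    reconciliation_status: str = "unreconciled"
    audit_ref: Optional[str] = None      # reference into the transition/audit log

@dataclass
class RecoveryOutcome:
    """Result surface of the Recovery Outcome Contract (illustrative field names)."""
    handback_passed: bool
    conflicts: list = field(default_factory=list)  # local-versus-cloud state differences found
    support_required: bool = False                 # true when the site stays local pending support
```

Keeping these as plain records, rather than behavior-bearing classes, matches their role here: they are interface boundaries for UI, sync-back, and support tooling, not a second CMS.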