OSCP Protocol Documentation¶

This document complements spec.md by describing the current implemented OSCP connection-lifecycle slice rather than the full conceptual OSCP design.

Use spec.md as the source of truth for the broader architecture, domain boundaries, and future-shape OSCP model. Use this document for the current runtime FSM, handshake/timeout/heartbeat behaviour, implementation compromises in the delivered lifecycle flow, and the confirmed target behaviour that follows from lifecycle degradation into fallback mode or intentional disconnect.

Purpose¶

This document exists to:

define the BetterFleet interpretation of the OSCP connection lifecycle
describe the persisted runtime state model
document the handshake, timeout, and heartbeat logical flows
document how offline lifecycle detection hands off into fallback mode
document the intentional-disconnect handover back to non-OSCP control
capture the current implementation compromises

Scope¶

This document currently covers:

connection establishment
connection timeout handling
heartbeat handling and liveness
lifecycle-triggered fallback-mode behaviour
intentional disconnect handover
BetterFleet and CP simulator role boundaries used for development

This document does not currently (but may in future) define:

registration behaviour
capability / forecast payload semantics
long-term audit or event-sourcing strategy

Supplementary Resources¶

OSCP specification: docs/reference/system-design/oscp/spec.md
Vector DERMS specification: docs/work/active/vector-derms/spec.md

Roles¶

For BetterFleet's current OSCP implementation:

BetterFleet acts as the flexibility provider, FP
the peer system acts as the capacity provider, CP

Route ownership follows that split:

BetterFleet exposes FP routes such as /oscp/fp/2.0/...
the CP simulator exposes CP routes such as /oscp/cp/2.0/...

The OSCP specification also defines a capacity optimizer (CO) role, which is currently out of scope.

OSCP Conceptual Model¶

OSCP has three broad concerns:

Registration
Active connection lifecycle
Capabilities

State Model¶

The connection lifecycle is described through the following finite state machine.

stateDiagram-v2
    direction LR
    UNREGISTERED --> NO_CONNECTION: Register (out of scope)

    NO_CONNECTION --> CONNECTING: Start connection (rx204)
    NO_CONNECTION --> ACCEPTING_CONNECTION: Receive connection request (tx204)

    CONNECTING --> CONNECTED: Receive acknowledgement (tx204)
    CONNECTING --> NO_CONNECTION: Timeout or protocol failure

    ACCEPTING_CONNECTION --> CONNECTED: Send acknowledgement (rx204)
    ACCEPTING_CONNECTION --> NO_CONNECTION: Timeout or protocol failure

    CONNECTED --> CONNECTED: receive Heartbeat within threshold
    CONNECTED --> NO_CONNECTION: Heartbeat threshold missed or protocol failure

Registration can only occur while in the UNREGISTERED state
Active connection lifecycle relates to all other states
Capabilities can only be enacted while in the CONNECTED state
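
The diagram above can be read as a lookup table from (state, event) to next state. The following is a minimal sketch of that reading, not the delivered implementation: the event names are hypothetical labels chosen for illustration, since the diagram describes transitions in prose only.

```python
from enum import Enum


class State(Enum):
    UNREGISTERED = "UNREGISTERED"
    NO_CONNECTION = "NO_CONNECTION"
    CONNECTING = "CONNECTING"
    ACCEPTING_CONNECTION = "ACCEPTING_CONNECTION"
    CONNECTED = "CONNECTED"


# Hypothetical event names (assumption); one entry per arrow in the diagram.
TRANSITIONS = {
    (State.UNREGISTERED, "register"): State.NO_CONNECTION,
    (State.NO_CONNECTION, "handshake_sent"): State.CONNECTING,
    (State.NO_CONNECTION, "handshake_received"): State.ACCEPTING_CONNECTION,
    (State.CONNECTING, "ack_received"): State.CONNECTED,
    (State.CONNECTING, "timeout_or_failure"): State.NO_CONNECTION,
    (State.ACCEPTING_CONNECTION, "ack_sent"): State.CONNECTED,
    (State.ACCEPTING_CONNECTION, "timeout_or_failure"): State.NO_CONNECTION,
    (State.CONNECTED, "heartbeat_received"): State.CONNECTED,
    (State.CONNECTED, "heartbeat_missed_or_failure"): State.NO_CONNECTION,
}


class InvalidTransition(Exception):
    """Raised when an event is not allowed in the current state."""


def next_state(current: State, event: str) -> State:
    """Return the next state, or raise if the diagram has no such arrow."""
    try:
        return TRANSITIONS[(current, event)]
    except KeyError:
        raise InvalidTransition(f"{event!r} not allowed in {current.name}") from None
```

Keeping the arrows in one table makes it easy to check that every lifecycle command either maps to exactly one target state or is rejected.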

Communication Expectations¶

Accepted OSCP requests in this slice are acknowledged with HTTP 204.

Invalid or impossible protocol transitions are rejected with the appropriate HTTP error, for example 403 when the current runtime state does not allow the requested transition.
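
As a rough illustration of the rule above, the sketch below maps an inbound message and the current runtime state to an HTTP status. The `ALLOWED_STATES` table and the `response_status` function are hypothetical names; the accepted states are inferred from the state diagram and only the 403 case is covered.

```python
from http import HTTPStatus

# Assumption: states in which each inbound message is accepted, as read from the
# state diagram. Unknown message types are out of scope and raise KeyError here.
ALLOWED_STATES = {
    "Handshake": {"NO_CONNECTION"},
    "HandshakeAcknowledge": {"CONNECTING"},
    "Heartbeat": {"CONNECTED"},
}


def response_status(message_type: str, current_state: str) -> HTTPStatus:
    """Return 204 when the message is valid in this state, otherwise 403."""
    if current_state in ALLOWED_STATES[message_type]:
        return HTTPStatus.NO_CONTENT
    return HTTPStatus.FORBIDDEN
```

For example, `response_status("Heartbeat", "CONNECTING")` evaluates to `HTTPStatus.FORBIDDEN` because a heartbeat is only accepted once the handshake has completed.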

Registration¶

Registration exists conceptually but is out of scope for the current implementation.

The intended end state is that registration owns creation of the persisted OSCP connection-state record.

Under that model:

an OSCP connection must be registered before handshake or heartbeat lifecycle commands are valid
later lifecycle commands transition an existing registered record rather than creating one on first use
duplicate first-touch record creation races are treated as "already registered" conflicts rather than infrastructure failures

Connection¶

Connection consists of:

handshake initiation
handshake acknowledgement
heartbeat exchange
timeout / disconnect handling

Crucially, every exchange is an independent HTTP/S request, so the ongoing 'connection' is a logical construct tracked by each peer rather than a persistent transport session. OSCP does not use WebSockets, although its handshake and heartbeat flow imitates a long-lived session.

Starting a connection¶

This is initiated by BetterFleet by sending a handshake to the capacity provider.

Expected sequence¶

sequenceDiagram
    participant CP as Capacity Provider
    participant FP as BetterFleet
    participant State as DB State Store

    FP->>State: Persist state = CONNECTING
    FP->>CP: Send Handshake
    CP-->>FP: HTTP 204
    CP->>FP: HandshakeAcknowledge
    FP-->>CP: HTTP 204
    FP->>State: Persist state = CONNECTED
    FP->>State: Set connection_started_at if first connection

The outbound communication initialisation must also be robust to timeout.

The timeout handling is necessary to cover two distinct failure modes:

the peer does not return the immediate HTTP 204 to the outbound Handshake
the peer returns 204, but never sends the later HandshakeAcknowledge

Timeout handling¶

sequenceDiagram
    participant CP as Capacity Provider
    participant FP as BetterFleet
    participant State as DB State Store
    participant Scheduler as Event Bridge

    FP->>State: Persist state = CONNECTING
    FP->>Scheduler: Create one-shot CONNECTING timeout
    FP->>CP: Send handshake

    alt handshake_acknowledge received before timeout
        CP-->>FP: HTTP 204
        CP->>FP: HandshakeAcknowledge
        FP-->>CP: HTTP 204
        FP->>State: Persist state = CONNECTED
        FP->>Scheduler: Cancel CONNECTING timeout
    else timeout event fires first
        Scheduler->>FP: Trigger OSCP_CONNECTION_TIMEOUT
        FP->>State: Re-check expected state + state_updated_at
        FP->>State: Disconnect/reset state
        FP->>Scheduler: Delete timeout event
    end
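
The "re-check expected state + state_updated_at" step in the timeout branch can be sketched as a guard that ignores stale timeout events. This is an illustrative sketch under assumptions: the `ConnectionRecord` and `TimeoutEvent` shapes and the `handle_timeout` function are hypothetical, and a real store would need to perform the check and the write atomically.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ConnectionRecord:
    state: str
    state_updated_at: datetime


@dataclass(frozen=True)
class TimeoutEvent:
    # Values captured when the one-shot timeout was scheduled (assumption).
    expected_state: str
    expected_state_updated_at: datetime


def handle_timeout(record: ConnectionRecord, event: TimeoutEvent, now: datetime) -> bool:
    """Reset the record only if it is still in the state the timeout was created for.

    Returns True when the reset happened and False when the event was stale.
    """
    if record.state != event.expected_state:
        return False  # the lifecycle already moved on, e.g. to CONNECTED
    if record.state_updated_at != event.expected_state_updated_at:
        return False  # same state name, but from a newer connection attempt
    record.state = "NO_CONNECTION"
    record.state_updated_at = now
    return True
```

Comparing `state_updated_at` as well as the state name stops a timeout from an earlier attempt from resetting a later attempt that happens to be in the same state.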

Receiving a connection request¶

This is initiated by the capacity provider by sending a handshake to BetterFleet.

The acknowledgement is deliberately delayed onto an in-process async follow-up task so BetterFleet can (nearly) guarantee that the outbound HTTP 204 is returned before the outbound HandshakeAcknowledge is sent.
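
The ordering described above can be sketched with `asyncio`: the handler schedules the acknowledgement as a background task and returns its 204 before that task runs. The function names, the `events` log, and the 50 ms delay are assumptions made for illustration; no real HTTP calls are made.

```python
import asyncio
from typing import List, Set

events: List[str] = []


async def send_handshake_acknowledge(delay_s: float) -> None:
    # Hypothetical stand-in for the outbound HandshakeAcknowledge HTTP call.
    await asyncio.sleep(delay_s)
    events.append("tx HandshakeAcknowledge")


async def handle_inbound_handshake(background: Set[asyncio.Task]) -> int:
    events.append("persist ACCEPTING_CONNECTION")
    task = asyncio.create_task(send_handshake_acknowledge(delay_s=0.05))
    background.add(task)  # keep a reference so the task is not garbage collected
    events.append("tx HTTP 204")
    return 204  # the handler returns before the delayed task has run


async def main() -> None:
    background: Set[asyncio.Task] = set()
    status = await handle_inbound_handshake(background)
    assert status == 204
    await asyncio.gather(*background)  # wait for the follow-up to finish


asyncio.run(main())
print(events)
```

In this sketch the log always ends with the acknowledgement. In a real server the response bytes are written after the handler returns, which is why the delay gives a near guarantee rather than a strict one.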

Expected sequence¶

sequenceDiagram
    participant CP as Capacity Provider
    participant FP as BetterFleet
    participant State as DB State Store
    participant BG as Async follow-up task

    CP->>FP: Handshake
    FP->>State: Persist state = ACCEPTING_CONNECTION
    FP->>BG: Schedule delayed HandshakeAcknowledge
    FP-->>CP: HTTP 204
    BG->>CP: Send HandshakeAcknowledge
    CP-->>FP: HTTP 204
    FP->>State: Persist state = CONNECTED
    FP->>State: Set connection_started_at if first connection

The inbound communication acceptance must also be robust to timeout.

The timeout handling is necessary to cover the case where the peer never returns the expected HTTP 204 to BetterFleet's outbound HandshakeAcknowledge.

See the Timeouts section under Implementation Constraints and Compromises for further detail.

Timeout handling¶

sequenceDiagram
    participant CP as Capacity Provider
    participant FP as BetterFleet
    participant State as DB State Store
    participant Scheduler as Event Bridge
    participant BG as Async follow-up task

    CP->>FP: Send handshake
    FP->>State: Persist state = ACCEPTING_CONNECTION
    FP->>Scheduler: Create one-shot ACCEPTING_CONNECTION timeout
    FP->>BG: Schedule delayed HandshakeAcknowledge
    FP-->>CP: HTTP 204

    alt delayed HandshakeAcknowledge succeeds before timeout
        BG->>CP: Send HandshakeAcknowledge
        CP-->>FP: HTTP 204
        FP->>State: Persist state = CONNECTED
        FP->>Scheduler: Cancel ACCEPTING_CONNECTION timeout
    else timeout event fires first
        Scheduler->>FP: Trigger ACCEPTING_CONNECTION timeout
        FP->>State: Re-check expected state + state_updated_at
        FP->>State: Disconnect/reset state
        FP->>Scheduler: Delete timeout event
    end

Maintaining a connection¶

This is done through the use of heartbeats.

Inbound heartbeat is accepted only when the connection is effectively CONNECTED.
A valid heartbeat does not trigger a state transition.
A valid heartbeat updates heartbeat_expires_at.
Once heartbeat_expires_at has passed, the connection is effectively offline.
Heartbeat liveness uses a separate stale grace window so acceptance is not tied exactly to the raw heartbeat interval.
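
The liveness rules above can be sketched as two small functions. This is a sketch under loud assumptions: the document does not state how `heartbeat_expires_at` is derived, so the rule used here (the peer's `offline_mode_at` plus a grace window) and the 30-second `STALE_GRACE` value are both hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical grace window; the real value is not stated in this document.
STALE_GRACE = timedelta(seconds=30)


def heartbeat_expires_at(offline_mode_at: datetime, grace: timedelta = STALE_GRACE) -> datetime:
    """Assumed rule: expiry is the peer's offline_mode_at plus the grace window."""
    return offline_mode_at + grace


def is_effectively_connected(state: str, expires_at: Optional[datetime], now: datetime) -> bool:
    """True when the stored state is CONNECTED and heartbeat_expires_at has not passed."""
    return state == "CONNECTED" and expires_at is not None and now < expires_at
```

Separating the stored state from the "effectively connected" check reflects the point that a valid heartbeat changes only the expiry timestamp, not the FSM state.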

Outbound heartbeat¶

sequenceDiagram
    participant CP as Capacity Provider
    participant FP as BetterFleet
    participant State as DB State Store
    participant Scheduler as Event Bridge

    FP->>State: Persist state = CONNECTED
    FP->>Scheduler: Create recurring heartbeat schedule

    loop While connected
        Scheduler->>FP: Trigger heartbeat
        FP->>CP: Send heartbeat
    end

Inbound heartbeat¶

sequenceDiagram
    participant CP as Capacity Provider
    participant FP as BetterFleet
    participant State as DB State Store

    loop While connected
        CP->>FP: Heartbeat with offline_mode_at
        FP->>State: Persist heartbeat_expires_at
    end

Heartbeat expiry¶

sequenceDiagram
    participant CP as Capacity Provider
    participant FP as BetterFleet
    participant State as DB State Store
    participant Scheduler as Event Bridge
    participant Policy
    participant Ops as Operator Surface

    alt heartbeat_expires_at has passed
        FP->>State: Interpret connection as offline
        FP->>State: Disconnect/reset state
        FP->>Policy: Resolve fallback or gap-policy constraint for now
        FP->>Ops: Show yellow fallback mode + create notification/incident
        FP->>Scheduler: Delete heartbeat schedule
    end

Fallback-mode handover after offline detection¶

Fallback mode is not a separate OSCP connection state. It is an operator-visible operating mode layered on top of an offline or degraded connection.
When heartbeat expiry or equivalent offline detection occurs, BetterFleet transitions the OSCP connection to its offline lifecycle state and then resolves the managed-scope constraint to apply next.
If valid fallback forecast coverage exists for now, BetterFleet activates fallback-derived constraint state for the mapped managed scope.
If no matching fallback coverage exists for now, BetterFleet applies the configured gap policy. The only selectable option currently supported is the existing circuit safe default or non-OSCP path.
In both cases, BetterFleet surfaces a yellow fallback mode rather than a red fail-safe alarm and creates notification and incident context for operators.
When the connection is restored and valid primary forecast coverage resumes, BetterFleet exits fallback mode and restores the non-fallback OSCP-controlled path unless a newer accepted forecast supersedes it.
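
The choice between fallback-derived constraints and the gap policy can be sketched as a coverage lookup for the current time. The `FallbackBlock` and `Resolution` shapes, the kilowatt field, and the half-open interval check are all assumptions for illustration and are not taken from the implementation.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class FallbackBlock:
    # Hypothetical shape of one fallback forecast block.
    start: datetime
    end: datetime
    capacity_kw: float


@dataclass(frozen=True)
class Resolution:
    source: str  # "fallback_forecast" or "gap_policy"
    capacity_kw: Optional[float]  # None means: defer to the configured gap policy
    operator_mode: str = "yellow_fallback"  # surfaced to operators in both cases


def resolve_offline_constraint(blocks: List[FallbackBlock], now: datetime) -> Resolution:
    """Pick the fallback block covering `now`, else fall through to the gap policy."""
    for block in blocks:
        if block.start <= now < block.end:
            return Resolution("fallback_forecast", block.capacity_kw)
    return Resolution("gap_policy", None)
```

Both branches return the same operator mode, mirroring the statement that a yellow fallback mode is shown whether or not fallback coverage exists.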

Intentional disconnect handover¶

sequenceDiagram
    participant User as Operator
    participant FP as BetterFleet
    participant State as DB State Store
    participant MG as Managed Scope / Compatibility Path
    participant Ops as Operator Surface

    User->>FP: Disconnect OSCP connection
    FP->>Ops: Warn that active OSCP constraints will be cleared
    User->>FP: Confirm disconnect
    FP->>State: Mark connection disconnected / reset lifecycle state
    FP->>MG: Withdraw active OSCP forecast and fallback envelopes
    FP->>Ops: Exit fallback mode if active
    FP->>Ops: Show local control resumed / non-OSCP path active

Capabilities¶

Capability exchange and forecast actions are later slices. They depend on the connection model being stable first.

Runtime State Model¶

The current persisted runtime FSM uses four states:

NO_CONNECTION: no active protocol session exists.
CONNECTING: BetterFleet initiated connection establishment by sending Handshake; the connection is not yet established and BetterFleet is waiting for inbound HandshakeAcknowledge.
ACCEPTING_CONNECTION: the remote party initiated connection establishment; BetterFleet accepted the inbound handshake path and the connection is not yet established while the acknowledgement exchange completes.
CONNECTED: the handshake exchange is complete and the session remains live until heartbeat expectations are violated or a protocol failure occurs.

Fallback mode is not an additional persisted FSM state in this document. It is an operational mode that can be active while the connection lifecycle is effectively offline or intentionally disconnected.

Implementation Constraints and Compromises¶

AWS Event Bridge Constraint¶

Amazon EventBridge Scheduler recurring schedules have minute granularity. BetterFleet therefore:

keeps the protocol heartbeat preference in seconds during handshake communication
rounds scheduler cadence up to whole minutes for recurring outbound heartbeat schedules
rounds connection-timeout scheduling up to whole minutes for one-shot timeout events

This is an implementation compromise, not a protocol ideal.
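
The rounding described above amounts to a ceiling division by 60. A minimal sketch follows; the function name `scheduler_minutes` and the one-minute floor are assumptions.

```python
def scheduler_minutes(interval_seconds: int) -> int:
    """Round a protocol interval in seconds up to whole scheduler minutes."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # Integer ceiling division avoids floating-point rounding surprises.
    return max(1, (interval_seconds + 59) // 60)
```

Under this rule a 90-second heartbeat preference is scheduled every 2 minutes, so the scheduled cadence can be slower than the interval negotiated in the handshake.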

Timeouts¶

BetterFleet uses connection timeouts as a lifecycle failsafe rather than as an OSCP-defined protocol state.

This exists to ensure that transient failures do not leave the local lifecycle stuck in CONNECTING or ACCEPTING_CONNECTION forever.

Once BetterFleet has already sent an outbound request and is waiting for HTTP 204, lack of response is handled through the transport timeout / request failure path rather than by waiting for the lifecycle timeout.

This leaves a race condition: a HandshakeAcknowledge and its paired timeout can be processed at almost the same moment, and the end state depends on which one is processed first.

If the timeout is processed first, then the connection is killed and the acknowledgement fails.
If the acknowledgement is processed first, then the connection is made and the timeout fails.

This is intended behaviour. The two events may be processed by different instances or servers, so correctness relies exclusively on the deterministic transitions of the finite state machine (for example, an acknowledgement has no effect in the NO_CONNECTION state). Whichever event is processed second fails cleanly, which makes the mechanism safe.

Note also that a CP that sends an acknowledgement is actively trying to establish a connection with BetterFleet, whereas the timeout is purely a failsafe and not part of the OSCP specification. It is therefore acceptable for the connection to be established even after the timeout window has formally passed.
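
The two processing orders can be replayed with a tiny deterministic transition function. This sketch only models the CONNECTING case; the event labels `"ack"` and `"timeout"` are hypothetical, and it assumes each transition is applied atomically against the stored state.

```python
from typing import List


def apply(state: str, event: str) -> str:
    """Deterministic transition; events that do not apply leave the state unchanged."""
    if state == "CONNECTING" and event == "ack":
        return "CONNECTED"
    if state == "CONNECTING" and event == "timeout":
        return "NO_CONNECTION"
    return state  # the losing event changes nothing


def run(events: List[str]) -> str:
    """Apply events in order, starting from CONNECTING."""
    state = "CONNECTING"
    for event in events:
        state = apply(state, event)
    return state
```

Here `run(["ack", "timeout"])` ends in `CONNECTED` and `run(["timeout", "ack"])` ends in `NO_CONNECTION`. Either outcome is a valid state, which is the sense in which the race is safe.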

Self-Healing¶

In practice, the current lifecycle is intended to be self-healing. If the two peers' states become misaligned (for example, through an uncaught race condition), the misalignment persists only briefly: each peer soon detects that there is no live connection, and the connection process can restart from a known good state.