OSCP Protocol Documentation¶
This document complements spec.md by describing the current implemented OSCP connection-lifecycle slice rather than the full conceptual OSCP design.
Use spec.md as the source of truth for the broader architecture, domain boundaries, and future-shape OSCP model. Use this document for the current runtime FSM, handshake/timeout/heartbeat behaviour, implementation compromises in the delivered lifecycle flow, and the confirmed target behaviour that follows from lifecycle degradation into fallback mode or intentional disconnect.
Purpose¶
This document exists to:
- define the BetterFleet interpretation of the OSCP connection lifecycle
- describe the persisted runtime state model
- document the handshake, timeout, and heartbeat logical flows
- document how offline lifecycle detection hands off into fallback mode
- document the intentional-disconnect handover back to non-OSCP control
- capture the current implementation compromises
Scope¶
This document currently covers:
- connection establishment
- connection timeout handling
- heartbeat handling and liveness
- lifecycle-triggered fallback-mode behaviour
- intentional disconnect handover
- BetterFleet and CP simulator role boundaries used for development
This document does not currently (but may in future) define:
- registration behaviour
- capability / forecast payload semantics
- long-term audit or event-sourcing strategy
Supplementary Resources¶
- OSCP specification: docs/reference/system-design/oscp/spec.md
- Vector DERMS specification: docs/work/active/vector-derms/spec.md
Roles¶
For BetterFleet's current OSCP implementation:
- BetterFleet acts as the flexibility provider, FP
- the peer system acts as the capacity provider, CP
Route ownership follows that split:
- BetterFleet exposes FP routes such as /oscp/fp/2.0/...
- the CP simulator exposes CP routes such as /oscp/cp/2.0/...
The OSCP specification calls out the inclusion of a capacity optimizer, CO, which currently falls out of scope.
OSCP Conceptual Model¶
OSCP has three broad concerns:
- Registration
- Active connection lifecycle
- Capabilities
State Model¶
The connection lifecycle is described through the following finite state machine.
stateDiagram-v2
direction LR
UNREGISTERED --> NO_CONNECTION: Register (out of scope)
NO_CONNECTION --> CONNECTING: Start connection (rx204)
NO_CONNECTION --> ACCEPTING_CONNECTION: Receive connection request (tx204)
CONNECTING --> CONNECTED: Receive acknowledgement (tx204)
CONNECTING --> NO_CONNECTION: Timeout or protocol failure
ACCEPTING_CONNECTION --> CONNECTED: Send acknowledgement (rx204)
ACCEPTING_CONNECTION --> NO_CONNECTION: Timeout or protocol failure
CONNECTED --> CONNECTED: receive Heartbeat within threshold
CONNECTED --> NO_CONNECTION: Heartbeat threshold missed or protocol failure
- Registration can only occur while in the UNREGISTERED state
- Active connection lifecycle relates to all other states
- Capabilities can only be enacted while in the CONNECTED state
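As a rough illustration, the diagram above can be written as a lookup table keyed by state and event. This is a minimal sketch rather than the BetterFleet implementation: the state names come from the diagram, while the event names and the `next_state` helper are hypothetical labels chosen for the arrows.

```python
from enum import Enum


class State(Enum):
    UNREGISTERED = "UNREGISTERED"
    NO_CONNECTION = "NO_CONNECTION"
    CONNECTING = "CONNECTING"
    ACCEPTING_CONNECTION = "ACCEPTING_CONNECTION"
    CONNECTED = "CONNECTED"


# Event names are hypothetical labels for the arrows in the diagram.
TRANSITIONS = {
    (State.UNREGISTERED, "register"): State.NO_CONNECTION,
    (State.NO_CONNECTION, "start_connection"): State.CONNECTING,
    (State.NO_CONNECTION, "handshake_received"): State.ACCEPTING_CONNECTION,
    (State.CONNECTING, "ack_received"): State.CONNECTED,
    (State.CONNECTING, "timeout_or_failure"): State.NO_CONNECTION,
    (State.ACCEPTING_CONNECTION, "ack_sent"): State.CONNECTED,
    (State.ACCEPTING_CONNECTION, "timeout_or_failure"): State.NO_CONNECTION,
    (State.CONNECTED, "heartbeat_received"): State.CONNECTED,
    (State.CONNECTED, "timeout_or_failure"): State.NO_CONNECTION,
}


def next_state(current, event):
    """Return the next state, or None when the transition is not allowed."""
    return TRANSITIONS.get((current, event))
```

Keeping every allowed transition in one table makes it straightforward to treat anything absent from the table as an invalid or impossible transition.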
Communication Expectations¶
Accepted OSCP requests in this slice are acknowledged with HTTP 204.
Invalid or impossible protocol transitions are rejected with the appropriate HTTP error, for example 403 when the current runtime state does not allow the requested transition.
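The 204/403 split can be sketched as a small status lookup. The `ACCEPTED_INBOUND` table and the `response_status` helper are hypothetical, and the set of messages accepted in each state is an assumption made for illustration rather than a statement of the actual route behaviour.

```python
from http import HTTPStatus

# Hypothetical table: which inbound messages each runtime state accepts.
ACCEPTED_INBOUND = {
    "NO_CONNECTION": {"Handshake"},
    "CONNECTING": {"HandshakeAcknowledge"},
    "ACCEPTING_CONNECTION": set(),
    "CONNECTED": {"Heartbeat"},
}


def response_status(state, message):
    """Return 204 for an accepted request, 403 for a disallowed transition."""
    if message in ACCEPTED_INBOUND.get(state, set()):
        return HTTPStatus.NO_CONTENT
    return HTTPStatus.FORBIDDEN
```

For example, `response_status("NO_CONNECTION", "Heartbeat")` yields 403 under these assumptions, because a heartbeat is only meaningful once the connection is established.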
Registration¶
Registration exists conceptually but is out of scope for the current implementation.
The intended end state is that registration owns creation of the persisted OSCP connection-state record.
Under that model:
- an OSCP connection must be registered before handshake or heartbeat lifecycle commands are valid
- later lifecycle commands transition an existing registered record rather than creating one on first use
- duplicate first-touch record creation races are treated as "already registered" conflicts rather than infrastructure failures
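The intended ownership model could look roughly like the following in-memory sketch. Registration is out of scope for the current implementation, so every name here (`ConnectionStore`, `AlreadyRegistered`, `NotRegistered`) is hypothetical, and a real store would enforce uniqueness in the database rather than in a dictionary.

```python
class AlreadyRegistered(Exception):
    """Raised when a connection record already exists."""


class NotRegistered(Exception):
    """Raised when a lifecycle command targets an unknown connection."""


class ConnectionStore:
    def __init__(self):
        self._records = {}

    def register(self, connection_id):
        # Registration is the only operation allowed to create a record.
        if connection_id in self._records:
            raise AlreadyRegistered(connection_id)
        self._records[connection_id] = {"state": "NO_CONNECTION"}

    def transition(self, connection_id, new_state):
        # Lifecycle commands update an existing record and never create one.
        record = self._records.get(connection_id)
        if record is None:
            raise NotRegistered(connection_id)
        record["state"] = new_state
        return record
```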
Connection¶
Connection consists of:
- handshake initiation
- handshake acknowledgement
- heartbeat exchange
- timeout / disconnect handling
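Crucially, all communication takes place over individual HTTP(S) requests, so the ongoing 'connection' is a logical construct maintained by both peers rather than a persistent transport session. OSCP does not use WebSockets, although the protocol imitates a persistent connection through handshakes and heartbeats.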
Starting a connection¶
BetterFleet initiates this flow by sending a Handshake to the capacity provider.
Expected sequence¶
sequenceDiagram
participant CP as Capacity Provider
participant FP as BetterFleet
participant State as DB State Store
FP->>State: Persist state = CONNECTING
FP->>CP: Send Handshake
CP-->>FP: HTTP 204
CP->>FP: HandshakeAcknowledge
FP-->>CP: HTTP 204
FP->>State: Persist state = CONNECTED
FP->>State: Set connection_started_at if first connection
The outbound communication initialisation must also be robust to timeout.
The timeout handling is necessary to cover two distinct failure modes:
- the peer does not return the immediate HTTP 204 to the outbound Handshake
- the peer returns 204, but never sends the later HandshakeAcknowledge
Timeout handling¶
sequenceDiagram
participant CP as Capacity Provider
participant FP as BetterFleet
participant State as DB State Store
participant Scheduler as Event Bridge
FP->>State: Persist state = CONNECTING
FP->>Scheduler: Create one-shot CONNECTING timeout
FP->>CP: Send handshake
alt handshake_acknowledge received before timeout
CP-->>FP: HTTP 204
CP->>FP: HandshakeAcknowledge
FP-->>CP: HTTP 204
FP->>State: Persist state = CONNECTED
FP->>Scheduler: Cancel CONNECTING timeout
else timeout event fires first
Scheduler->>FP: Trigger OSCP_CONNECTION_TIMEOUT
FP->>State: Re-check expected state + state_updated_at
FP->>State: Disconnect/reset state
FP->>Scheduler: Delete timeout event
end
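The "re-check expected state + state_updated_at" step in the timeout branch can be sketched as a guard that ignores stale timeout events. This is an illustrative sketch only: `ConnectionRecord` and `handle_timeout` are hypothetical names, and the real check would run against the persisted record, ideally as a conditional update.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ConnectionRecord:
    state: str
    state_updated_at: datetime


def handle_timeout(record, expected_state, expected_updated_at):
    """Reset the record only if it is still in the attempt that armed the timeout.

    Returns True when the connection was reset, False when the event was stale.
    """
    if record.state != expected_state:
        return False  # e.g. the acknowledgement already moved us to CONNECTED
    if record.state_updated_at != expected_updated_at:
        return False  # a newer attempt has re-entered the same state
    record.state = "NO_CONNECTION"
    record.state_updated_at = datetime.now(timezone.utc)
    return True
```

Comparing `state_updated_at` as well as the state prevents a timeout armed for an earlier attempt from tearing down a later attempt that happens to be in the same state.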
Receiving a connection request¶
The capacity provider initiates this flow by sending a Handshake to BetterFleet.
The acknowledgement is deliberately deferred to an in-process async follow-up task so that BetterFleet can (nearly) guarantee that its HTTP 204 response is returned before the outbound HandshakeAcknowledge is sent.
Expected sequence¶
sequenceDiagram
participant CP as Capacity Provider
participant FP as BetterFleet
participant State as DB State Store
participant BG as Async follow-up task
CP->>FP: Handshake
FP->>State: Persist state = ACCEPTING_CONNECTION
FP->>BG: Schedule delayed HandshakeAcknowledge
FP-->>CP: HTTP 204
BG->>CP: Send HandshakeAcknowledge
CP-->>FP: HTTP 204
FP->>State: Persist state = CONNECTED
FP->>State: Set connection_started_at if first connection
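The ordering in the sequence above can be sketched with `asyncio`: the handler schedules the acknowledgement as a background task and returns its 204 without awaiting it. This is a minimal sketch under stated assumptions; the function names, the `events` list, and the delay value are hypothetical, and a real handler would perform HTTP calls and persistence instead of appending to a list.

```python
import asyncio

events = []  # records the observable order of operations for illustration


async def send_handshake_acknowledge():
    """Hypothetical stand-in for the outbound HandshakeAcknowledge request."""
    events.append("ack_sent")


async def delayed_acknowledge(delay_seconds):
    await asyncio.sleep(delay_seconds)
    await send_handshake_acknowledge()


async def handle_inbound_handshake(delay_seconds=0.01):
    events.append("state=ACCEPTING_CONNECTION")
    # Schedule the follow-up without awaiting it, so the 204 is returned first.
    task = asyncio.create_task(delayed_acknowledge(delay_seconds))
    events.append("204_returned")
    return 204, task


async def main():
    status, task = await handle_inbound_handshake()
    await task  # here only so the sketch finishes; a server keeps running
    return status


status = asyncio.run(main())
```

The guarantee is only "nearly" absolute because the delay is a heuristic: nothing in this pattern proves that the peer has finished processing the 204 before the acknowledgement arrives.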
The inbound communication acceptance must also be robust to timeout.
The timeout handling is necessary to cover the case where the peer never returns the expected HTTP 204 to BetterFleet's outbound HandshakeAcknowledge.
The Timeouts section under Implementation Constraints and Compromises covers this in more detail.
Timeout handling¶
sequenceDiagram
participant CP as Capacity Provider
participant FP as BetterFleet
participant State as DB State Store
participant Scheduler as Event Bridge
participant BG as Async follow-up task
CP->>FP: Send handshake
FP->>State: Persist state = ACCEPTING_CONNECTION
FP->>Scheduler: Create one-shot ACCEPTING_CONNECTION timeout
FP->>BG: Schedule delayed HandshakeAcknowledge
FP-->>CP: HTTP 204
alt delayed HandshakeAcknowledge succeeds before timeout
BG->>CP: Send HandshakeAcknowledge
CP-->>FP: HTTP 204
FP->>State: Persist state = CONNECTED
FP->>Scheduler: Cancel ACCEPTING_CONNECTION timeout
else timeout event fires first
Scheduler->>FP: Trigger ACCEPTING_CONNECTION timeout
FP->>State: Re-check expected state + state_updated_at
FP->>State: Disconnect/reset state
FP->>Scheduler: Delete timeout event
end
Maintaining a connection¶
A connection is maintained through heartbeats.
- Inbound heartbeat is accepted only when the connection is effectively CONNECTED.
- A valid heartbeat does not trigger a state transition.
- A valid heartbeat updates heartbeat_expires_at.
- Once heartbeat_expires_at has passed, the connection is effectively offline.
- Heartbeat liveness uses a separate stale grace window so acceptance is not tied exactly to the raw heartbeat interval.
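The rules above can be sketched as a single acceptance function. The way the expiry is derived here (the peer's offline_mode_at plus a grace window) is an assumption made for illustration, as are the `accept_heartbeat` name, the dictionary record, and the 15-second grace value.

```python
from datetime import datetime, timedelta, timezone


def accept_heartbeat(record, offline_mode_at, now, grace_seconds):
    """Accept a heartbeat only while the connection is effectively CONNECTED.

    Returns True and extends heartbeat_expires_at when accepted.
    The lifecycle state itself is never changed here.
    """
    live = record["state"] == "CONNECTED" and now < record["heartbeat_expires_at"]
    if not live:
        return False
    # Assumption: expiry is the peer's offline_mode_at plus a stale grace window.
    record["heartbeat_expires_at"] = offline_mode_at + timedelta(seconds=grace_seconds)
    return True


# Usage with illustrative values.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
record = {"state": "CONNECTED", "heartbeat_expires_at": now + timedelta(seconds=30)}
accepted = accept_heartbeat(record, now + timedelta(seconds=60), now, grace_seconds=15)
```

A heartbeat that arrives after heartbeat_expires_at is rejected in this sketch, which matches the rule that an expired connection is effectively offline even if the persisted state still reads CONNECTED.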
Outbound heartbeat¶
sequenceDiagram
participant CP as Capacity Provider
participant FP as BetterFleet
participant State as DB State Store
participant Scheduler as Event Bridge
FP->>State: Persist state = CONNECTED
FP->>Scheduler: Create recurring heartbeat schedule
loop While connected
Scheduler->>FP: Trigger heartbeat
FP->>CP: Send heartbeat
end
Inbound heartbeat¶
sequenceDiagram
participant CP as Capacity Provider
participant FP as BetterFleet
participant State as DB State Store
loop While connected
CP->>FP: Heartbeat with offline_mode_at
FP->>State: Persist heartbeat_expires_at
end
Heartbeat expiry¶
sequenceDiagram
participant CP as Capacity Provider
participant FP as BetterFleet
participant State as DB State Store
participant Scheduler as Event Bridge
alt heartbeat_expires_at has passed
FP->>State: Interpret connection as offline
FP->>State: Disconnect/reset state
FP->>Policy: Resolve fallback or gap-policy constraint for now
FP->>Ops: Show yellow fallback mode + create notification/incident
FP->>Scheduler: Delete heartbeat schedule
end
Fallback-mode handover after offline detection¶
- Fallback mode is not a separate OSCP connection state. It is an operator-visible operating mode layered on top of an offline or degraded connection.
- When heartbeat expiry or equivalent offline detection occurs, BetterFleet transitions the OSCP connection to its offline lifecycle state and then resolves the managed-scope constraint to apply next.
- If valid fallback forecast coverage exists for now, BetterFleet activates fallback-derived constraint state for the mapped managed scope.
- If no matching fallback coverage exists for now, BetterFleet applies the configured gap policy. The current supported selectable option is the existing circuit safe default or non-OSCP path.
- In both cases, BetterFleet surfaces a yellow fallback mode rather than a red fail-safe alarm and creates notification and incident context for operators.
- When the connection is restored and valid primary forecast coverage resumes, BetterFleet exits fallback mode and restores the non-fallback OSCP-controlled path unless a newer accepted forecast supersedes it.
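The choice between fallback coverage and the gap policy can be sketched as a lookup over fallback forecast blocks. This is a hedged illustration: `ForecastBlock`, `limit_kw`, `resolve_offline_constraint`, and the `"circuit_safe_default"` label are hypothetical names, and the half-open interval check is an assumption about how coverage for now is evaluated.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ForecastBlock:
    start: datetime
    end: datetime
    limit_kw: float


def resolve_offline_constraint(fallback_blocks, now, gap_policy="circuit_safe_default"):
    """Pick the constraint source to apply after offline detection."""
    for block in fallback_blocks:
        # Assumption: a block covers `now` when start <= now < end.
        if block.start <= now < block.end:
            return ("fallback", block.limit_kw)
    # No fallback coverage for `now`: hand over to the configured gap policy.
    return (gap_policy, None)
```

Either result would be accompanied by the yellow fallback mode and operator notification, which this sketch leaves out.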
Intentional disconnect handover¶
sequenceDiagram
participant User as Operator
participant FP as BetterFleet
participant State as DB State Store
participant MG as Managed Scope / Compatibility Path
participant Ops as Operator Surface
User->>FP: Disconnect OSCP connection
FP->>Ops: Warn that active OSCP constraints will be cleared
User->>FP: Confirm disconnect
FP->>State: Mark connection disconnected / reset lifecycle state
FP->>MG: Withdraw active OSCP forecast and fallback envelopes
FP->>Ops: Exit fallback mode if active
FP->>Ops: Show local control resumed / non-OSCP path active
Capabilities¶
Capability exchange and forecast actions are later slices. They depend on the connection model being stable first.
Runtime State Model¶
The current persisted runtime FSM uses four states:
- NO_CONNECTION: no active protocol session exists.
- CONNECTING: BetterFleet initiated connection establishment by sending Handshake; the connection is not yet established and BetterFleet is waiting for inbound HandshakeAcknowledge.
- ACCEPTING_CONNECTION: the remote party initiated connection establishment; BetterFleet accepted the inbound handshake path and the connection is not yet established while the acknowledgement exchange completes.
- CONNECTED: the handshake exchange is complete and the session remains live until heartbeat expectations are violated or a protocol failure occurs.
Fallback mode is not an additional persisted FSM state in this document. It is an operational mode that can be active while the connection lifecycle is effectively offline or intentionally disconnected.
Implementation Constraints and Compromises¶
AWS EventBridge Constraint¶
Amazon EventBridge Scheduler recurring schedules have minute granularity. BetterFleet therefore:
- keeps the protocol heartbeat preference in seconds during handshake communication
- rounds scheduler cadence up to whole minutes for recurring outbound heartbeat schedules
- rounds connection-timeout scheduling up to whole minutes for one-shot timeout events
This is an implementation compromise, not a protocol ideal.
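The rounding described above amounts to a ceiling division by 60. A minimal sketch, where the `scheduler_minutes` name and the one-minute floor for very small values are assumptions:

```python
import math


def scheduler_minutes(seconds):
    """Round a duration in seconds up to whole minutes, with a one-minute floor."""
    return max(1, math.ceil(seconds / 60))
```

For example, a 90-second heartbeat preference would be scheduled every 2 minutes under this rule, while the protocol messages would still carry the 90-second value.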
Timeouts¶
BetterFleet uses connection timeouts as a lifecycle failsafe rather than as an OSCP-defined protocol state.
This exists to ensure that transient failures do not leave the local lifecycle stuck in CONNECTING or ACCEPTING_CONNECTION forever.
Once BetterFleet has already sent an outbound request and is waiting for HTTP 204, lack of response is handled through the transport timeout / request failure path rather than by waiting for the lifecycle timeout.
This creates a race condition: a HandshakeAcknowledge and its paired timeout can be processed at nearly the same time, and the end state depends on the processing order.
- If the timeout is processed first, the connection is reset and the acknowledgement is rejected.
- If the acknowledgement is processed first, the connection is established and the timeout has no effect.
This is intended behaviour. The two events may be processed by different instances or servers, so both rely exclusively on the deterministic transitions of the finite state machine (for example, an acknowledgement does nothing in the NO_CONNECTION state). Whichever event is processed first, the resulting state is well defined, which makes the mechanism safe.
Note also that a CP sending an acknowledgement is actively trying to connect to BetterFleet, whereas the timeout is purely a local failsafe and not part of the OSCP specification. It is therefore acceptable for the connection to be established even after the timeout window has formally passed.
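The order dependence can be illustrated with a reduced state function covering only the two racing events. The `apply` and `run` helpers and the `"ack"` and `"timeout"` event names are hypothetical, and the sketch assumes that an event which is invalid for the current state leaves the state unchanged.

```python
def apply(state, event):
    """Apply one event; events that are invalid for the state are no-ops."""
    if state == "CONNECTING" and event == "ack":
        return "CONNECTED"
    if state == "CONNECTING" and event == "timeout":
        return "NO_CONNECTION"
    return state  # e.g. ack in NO_CONNECTION, or timeout in CONNECTED


def run(events, state="CONNECTING"):
    """Apply events in the given order and return the final state."""
    for event in events:
        state = apply(state, event)
    return state


ack_first = run(["ack", "timeout"])      # ends in CONNECTED
timeout_first = run(["timeout", "ack"])  # ends in NO_CONNECTION
```

The two orderings end in different states, but each outcome is a valid state of the machine and neither leaves the lifecycle stuck in CONNECTING.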
Self-Healing¶
In practice, the current lifecycle is intended to be self-healing. If the two peers' states become misaligned (for example, through an uncaught race condition), the misalignment should persist only briefly: each peer quickly detects that there is no live connection, and the connection process can restart from a known good state.