0015 - Async/co-routine exception handling pattern

Date

2024-01-10

Status

Proposed

Context

In our system architecture, we utilise long-running coroutines that are initialised at application startup. These coroutines are pivotal for continuous background processing and various asynchronous tasks. Effective error handling is crucial to ensure the resilience and reliability of these coroutines, especially since they run persistently and handle a range of operations.

Decision

We have decided to implement a two-tiered error handling strategy for our long-running coroutines:

1. Inner Loop Handling (Recoverable Errors):
   * Within the inner loop of each coroutine, we will handle errors that are deemed recoverable.
   * This includes retry logic or other recovery mechanisms for transient or expected errors, such as temporary network issues or service interruptions.
   * The coroutine should attempt to resolve these issues internally and continue its normal operation without escalating to a full restart or termination.
2. Outer Loop Handling (Critical Errors):
   * Outside the inner loop, we will implement an exception handler that catches any errors not addressed within the inner loop.
   * Upon encountering such an error, the coroutine will be terminated. This action is reserved for critical issues that indicate a fundamental problem which cannot be recovered from within the scope of the coroutine’s logic.
   * A critical log message will be generated upon such termination to alert system administrators and developers to the issue.
3. Use of Bare Except Clause:
   * In the outer handler, we will employ a bare except clause to ensure all exceptions are caught, including those not explicitly anticipated.
   * We would not use a bare except clause in short-lived code. For long-running coroutines, where continuous operation is desired, it provides a safety net against unexpected errors. We recognise the trade-off that it may catch exceptions that do not need special handling, but in this scenario the goal of uninterrupted service outweighs this concern.
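
The two tiers described above can be sketched as follows. This is a minimal illustration, assuming Python's `asyncio`, and is not a definitive implementation. The names `run_worker`, `step`, `RECOVERABLE`, `max_retries` and `retry_delay`, and the returned status strings, are hypothetical and exist only for this example. Re-raising `asyncio.CancelledError` ahead of the bare except clause is an assumption added here, not something this record specifies: a bare except clause would otherwise also intercept task cancellation.

```python
import asyncio
import logging

logger = logging.getLogger("worker")

# Hypothetical classification of recoverable (transient) errors.
RECOVERABLE = (ConnectionError, TimeoutError)


async def run_worker(step, max_retries=3, retry_delay=0.01):
    """Call `step` repeatedly until it returns a falsy value or fails critically.

    `step` is a hypothetical coroutine function performing one unit of work.
    """
    failures = 0  # consecutive recoverable failures
    try:
        while True:
            # Inner tier: handle recoverable errors and keep running.
            try:
                if not await step():
                    return "stopped"
                failures = 0  # a successful step resets the retry budget
            except RECOVERABLE as exc:
                failures += 1
                if failures > max_retries:
                    raise  # retries exhausted: escalate to the outer tier
                logger.warning(
                    "recoverable error (%d/%d): %r", failures, max_retries, exc
                )
                await asyncio.sleep(retry_delay)
    except asyncio.CancelledError:
        # Assumption: cancellation should propagate rather than be logged
        # as a critical failure.
        raise
    except:  # noqa: E722 - deliberate bare except, per this decision
        # Outer tier: log at critical level and terminate the coroutine.
        logger.critical("worker terminated by unhandled error", exc_info=True)
        return "terminated"
```

At application startup, such a coroutine might be scheduled with `asyncio.create_task(run_worker(poll_queue))`, where `poll_queue` is a hypothetical coroutine function. Returning a status string is only a convenience for the sketch; the essential behaviour is the critical log entry followed by termination.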

Consequences

Increased Resilience: The system will be more robust against recoverable errors, reducing the likelihood of coroutines failing due to transient issues.
Improved Error Visibility: Critical errors that cause coroutine termination will be clearly logged, making it easier to diagnose and address underlying issues.
Potential Overcatching: Using a bare except clause may lead to some exceptions being caught that could otherwise be handled differently, which could mask certain issues. However, this is considered an acceptable trade-off in this specific context of long-running coroutines.
Implementation Complexity: Developers will need to carefully distinguish between recoverable and critical errors and implement appropriate handling logic in both the inner and outer loops.