Handling Failures
Orkes Conductor automatically handles transient workflow and task failures without the need to write custom code. Various failure-handling configurations can be set ahead of time, which will take effect during execution.
For tasks, you can configure the following resilience parameters in the task definition:
- Retries
- Timeouts
- Rate limits
For workflows, you can configure the following resilience parameters in the workflow definition:
- Timeouts
- Compensation flows (known as failure workflow in Conductor)
- Rate limits
To deal with workflow failures post-execution, refer to Debugging Workflows.
Message delivery guarantees
Conductor guarantees at-least-once message delivery: all messages are persisted and will be delivered to task workers one or more times. If a failure occurs, a message may be delivered more than once. This guarantee ensures that:
- If a workflow has started, it will run to completion as long as all its tasks are completed.
- If a task worker fails due to restarts, crashes, or other issues, the message will be redelivered to another worker node that is alive and responding.
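Because a message can arrive more than once, it helps if worker logic is idempotent. The following is a minimal sketch of one way to do that, not Conductor's own mechanism: `IdempotentHandler`, `charge_card`, the in-memory result store, and the shape of the task message (`taskId`, `inputData`) are all hypothetical names chosen for illustration.

```python
from typing import Any, Callable, Dict


class IdempotentHandler:
    """Wraps a task handler so a redelivered message does not repeat side effects."""

    def __init__(self, handler: Callable[[Dict[str, Any]], Dict[str, Any]]) -> None:
        self._handler = handler
        # Hypothetical in-memory store. A real worker would need durable
        # storage shared by all worker nodes.
        self._results: Dict[str, Dict[str, Any]] = {}

    def handle(self, task: Dict[str, Any]) -> Dict[str, Any]:
        task_id = task["taskId"]  # assumed to identify one task execution
        if task_id in self._results:
            # Redelivery: return the stored result instead of redoing the work.
            return self._results[task_id]
        result = self._handler(task)
        self._results[task_id] = result
        return result


charges = []  # records every side effect so duplicates are visible


def charge_card(task: Dict[str, Any]) -> Dict[str, Any]:
    charges.append(task["inputData"]["orderId"])
    return {"status": "COMPLETED", "outputData": {"charged": True}}


worker = IdempotentHandler(charge_card)
message = {"taskId": "t-1", "inputData": {"orderId": "o-42"}}
first = worker.handle(message)
second = worker.handle(message)  # simulated redelivery of the same message
```

In this sketch the second delivery returns the stored result, so the side effect recorded in `charges` happens only once.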
Task retries
Automatic retries are a key strategy for handling transient task failures. If a task fails to complete, the Conductor server will make the task available for polling again after a given duration.
Retry configuration
You can configure retry behavior for a task in its task definition. The parameters for defining a task’s retry behavior are:
- Retry count
- Retry logic
- Retry delay seconds
- Backoff scale factor
Parameter | Description | Required/Optional |
---|---|---|
retryCount | The number of retry attempts if the task fails. Default value is 3. | Optional. |
retryLogic | The policy that determines the retry mechanism for the task. Supported values: FIXED, EXPONENTIAL_BACKOFF, LINEAR_BACKOFF. | Optional. |
retryDelaySeconds | The time (in seconds) to wait before each retry attempt. This provides time for the task service to recover from any transient failure before it is retried. Default value is 60. Note: The actual duration depends on the retry policy set in retryLogic. | Optional. |
backoffScaleFactor | The value multiplied by retryDelaySeconds to determine the retry interval. Default value is 1. | Optional. |
Example
// task definition
{
  "name": "someTaskDefName",
  ...
  "retryCount": 3,
  "retryLogic": "FIXED|EXPONENTIAL_BACKOFF|LINEAR_BACKOFF",
  "retryDelaySeconds": 1,
  "backoffScaleFactor": 1
}
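To make the interaction between retryLogic, retryDelaySeconds, and backoffScaleFactor more concrete, here is a rough sketch of how the delay before each retry could be computed. The formulas are assumptions for illustration (a constant delay for FIXED, delay × scale factor × attempt for LINEAR_BACKOFF, and a delay that doubles on each attempt for EXPONENTIAL_BACKOFF); the exact calculation may differ by Conductor version, so verify against your server. `compute_retry_delay` is a hypothetical helper, not part of any Conductor SDK.

```python
def compute_retry_delay(
    retry_logic: str,
    retry_delay_seconds: int,
    backoff_scale_factor: int,
    attempt: int,
) -> int:
    """Return the assumed wait in seconds before the given 1-based retry attempt."""
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    if retry_logic == "FIXED":
        # Same delay before every retry.
        return retry_delay_seconds
    if retry_logic == "LINEAR_BACKOFF":
        # Assumed: delay grows linearly with the attempt number.
        return retry_delay_seconds * backoff_scale_factor * attempt
    if retry_logic == "EXPONENTIAL_BACKOFF":
        # Assumed: delay doubles on each successive attempt.
        return retry_delay_seconds * 2 ** (attempt - 1)
    raise ValueError(f"unsupported retryLogic: {retry_logic}")


# Illustrative values, not the defaults.
task_def = {"retryCount": 3, "retryDelaySeconds": 5, "backoffScaleFactor": 2}


def delays(policy: str) -> list:
    """List the assumed delay before each of the configured retries."""
    return [
        compute_retry_delay(
            policy,
            task_def["retryDelaySeconds"],
            task_def["backoffScaleFactor"],
            attempt,
        )
        for attempt in range(1, task_def["retryCount"] + 1)
    ]


fixed = delays("FIXED")
linear = delays("LINEAR_BACKOFF")
exponential = delays("EXPONENTIAL_BACKOFF")
```

Under these assumptions, a FIXED policy waits the same 5 seconds each time, while the two backoff policies space retries further apart, which gives a struggling downstream service more time to recover.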
Example retry behavior
Based on the retry configuration in the above figure, the following sequence of events will occur in the event of a retry:
- Worker (W1) polls the Conductor server for task T1 and receives the task.
- After processing the task, the worker determines that the task execution is a failure and reports to the server with a FAILED status after 10 seconds.
- The server will persist this failed execution of T1.
- A new task T1 execution is created and scheduled for polling. Based on the retry configuration, the task will be available for polling after 5 seconds.
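The sequence above can be sketched as a small in-memory model. This is only an illustration of the timeline, not server code: `TaskExecution` and `schedule_retry` are hypothetical names, times are seconds since the first poll, and the 10-second failure time and 5-second delay are taken from the walkthrough.

```python
from dataclasses import dataclass


@dataclass
class TaskExecution:
    name: str
    attempt: int
    available_at: float  # seconds since the first poll
    status: str = "SCHEDULED"


def schedule_retry(
    failed: TaskExecution, failed_at: float, retry_delay_seconds: float
) -> TaskExecution:
    """Keep the failed execution on record and create a new, delayed execution."""
    failed.status = "FAILED"
    return TaskExecution(
        name=failed.name,
        attempt=failed.attempt + 1,
        available_at=failed_at + retry_delay_seconds,
    )


# W1 polls T1 at t=0 and reports FAILED at t=10.
first = TaskExecution("T1", attempt=0, available_at=0.0)
retry = schedule_retry(first, failed_at=10.0, retry_delay_seconds=5.0)
```

In this model the failed execution stays on record with a FAILED status, and the new execution of T1 becomes available for polling at t=15, that is, 5 seconds after the failure was reported.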
Task timeouts
A task timeout can occur if:
- There are no workers available for a given task type. This could be due to longer-than-expected system downtime or a system misconfiguration.
- The worker receives the message but dies before completely processing the task, so the task never reaches completion.
- The worker has completed the task but could not communicate with the Conductor server due to network failures, the server being down, or other issues.
Timeout configuration
You can configure timeout behavior for a task in its task definition to handle the cases above. The parameters for a task’s timeout behavior are:
- Poll timeout seconds
- Response timeout seconds
- Timeout seconds
- Timeout policy
Parameter | Description | Required/Optional |
---|---|---|
pollTimeoutSeconds | The maximum duration in seconds that a worker has to poll a task before it gets marked as TIMED_OUT. When configured with a value > 0, Conductor will wait for the task to be picked up by a worker. Useful for detecting a backlogged task queue with insufficient workers. Default value is 3600. | Optional. |
responseTimeoutSeconds | The maximum duration in seconds that a worker has to respond to the server with a status update before it gets marked as TIMED_OUT. When configured with a value > 0, Conductor will wait for the worker to return a status update, starting from when the task was picked up. If a task requires more time to complete, the worker can respond with the IN_PROGRESS status. Default value is 600. | Optional. |
timeoutSeconds | The maximum duration in seconds for the task to reach a terminal state before it gets marked as TIMED_OUT. When configured with a value > 0, Conductor will wait for the task to complete, starting from when the task was picked up. Useful for governing the overall SLA for completion. Default value is 3600. | Required. |
timeoutPolicy | The policy for handling timeouts. Supported values: RETRY, TIME_OUT_WF, ALERT_ONLY. | Optional. |
To configure tasks that never time out, set timeoutSeconds and pollTimeoutSeconds to 0.
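As a rough illustration of how the three durations relate, the sketch below decides which timeout, if any, has elapsed for a task. It is a simplified model, not the server's implementation: `check_timeout` is a hypothetical helper, all times are in seconds on a shared clock, and it assumes that each status update (such as IN_PROGRESS) restarts the response timer.

```python
from typing import Optional


def check_timeout(
    now: float,
    scheduled_at: float,
    polled_at: Optional[float],
    last_update_at: Optional[float],
    poll_timeout_seconds: int,
    response_timeout_seconds: int,
    timeout_seconds: int,
) -> Optional[str]:
    """Return the name of the elapsed timeout parameter, or None if none has elapsed."""
    if polled_at is None:
        # Not yet picked up by a worker: only the poll timeout applies.
        if poll_timeout_seconds > 0 and now - scheduled_at > poll_timeout_seconds:
            return "pollTimeoutSeconds"
        return None
    # Overall limit, measured from when the task was picked up.
    if timeout_seconds > 0 and now - polled_at > timeout_seconds:
        return "timeoutSeconds"
    # Assumption: the response timer restarts on each status update.
    reference = last_update_at if last_update_at is not None else polled_at
    if response_timeout_seconds > 0 and now - reference > response_timeout_seconds:
        return "responseTimeoutSeconds"
    return None


limits = dict(poll_timeout_seconds=3600, response_timeout_seconds=600, timeout_seconds=3600)

never_polled = check_timeout(3601, 0, None, None, **limits)
silent_worker = check_timeout(601, 0, 0, None, **limits)
heartbeating = check_timeout(900, 0, 0, 500, **limits)
too_long = check_timeout(3601, 0, 0, 3500, **limits)
disabled = check_timeout(10**9, 0, None, None, 0, 0, 0)
```

A value of 0 disables the corresponding check in this model, which mirrors the "value > 0" condition in the parameter descriptions.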
Example
// task definition
{
  "name": "someTaskDefName",
  ...
  "pollTimeoutSeconds": 3600,
  "responseTimeoutSeconds": 600,
  "timeoutSeconds": 3600,
  "timeoutPolicy": "RETRY|TIME_OUT_WF|ALERT_ONLY"
}