Task-Level Resilience in Orkes Conductor: Timeouts and Retries in Action

Karl Goeltner
Software Engineer
May 12, 2025
5 min read

In distributed systems, individual task failure is not a matter of if, but when. APIs go down, services stall, and workers disappear. What matters is how your system responds. With Orkes Conductor, you don’t just handle these failures—you design for them.

Conductor provides fine-grained control over how each task behaves under failure. With customizable timeouts and retries, you can recover from transient issues without human intervention, ensure critical steps don’t hang indefinitely, and build workflows that fail gracefully instead of catastrophically.

In this blog, we’ll explore three core capabilities that enable resilient task execution in Orkes Conductor:

  • Task Retries
  • Task Timeouts
  • System Task Timeouts

Task retries: Recovering from flaky failures

One of the most common failure scenarios is a transient error—momentary service unavailability, network hiccups, or throttling by an external API. Conductor lets you retry failed tasks automatically, using configurable backoff strategies to avoid overwhelming downstream services.

Retry parameters

  • retryCount: The maximum number of retry attempts. Default is 3.
  • retryLogic: The retry strategy for the task. Supports:
      • FIXED–Retries after a fixed interval defined by retryDelaySeconds.
      • LINEAR_BACKOFF–Retries with a delay that increases linearly: retryDelaySeconds x backoffScaleFactor x attempt_number.
      • EXPONENTIAL_BACKOFF–Retries with a delay that increases exponentially: retryDelaySeconds x (backoffScaleFactor ^ attempt_number).
  • retryDelaySeconds: The base delay between retries. The actual wait time depends on the retry policy set in retryLogic.
  • backoffScaleFactor: Multiplier applied to retryDelaySeconds to control how quickly delays increase. Default is 1.
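
To make these formulas concrete, here's a small illustrative sketch (plain Python, not part of the Conductor SDK) that computes the wait before each retry. The attempt numbering, starting at 0 for the first retry, is an assumption that matches the 2s, 4s, 8s example later in this post:

def retry_delay(retry_logic: str, retry_delay_seconds: int,
                backoff_scale_factor: int, attempt: int) -> int:
    # attempt: 0 for the first retry, 1 for the second, and so on (assumed)
    if retry_logic == 'FIXED':
        return retry_delay_seconds
    if retry_logic == 'LINEAR_BACKOFF':
        return retry_delay_seconds * backoff_scale_factor * attempt
    # EXPONENTIAL_BACKOFF
    return retry_delay_seconds * (backoff_scale_factor ** attempt)


# retryDelaySeconds=2 and backoffScaleFactor=2, as in the email example below
print([retry_delay('EXPONENTIAL_BACKOFF', 2, 2, n) for n in range(3)])  # [2, 4, 8]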

Use case: Flaky email provider

Imagine your email provider fails intermittently. The first request returns a 500 error, but a second or third attempt might succeed. This is a perfect scenario for retries.

from conductor.client.configuration.configuration import Configuration
from conductor.client.http.models import TaskDef
from conductor.client.orkes_clients import OrkesClients


def main():
    api_config = Configuration()
    clients = OrkesClients(configuration=api_config)
    metadata_client = clients.get_metadata_client()

    task_def = TaskDef()
    task_def.name = 'send_email_task'
    task_def.description = 'Send an email with retry on intermittent failures'

    # Retry settings: up to 3 retries with exponentially increasing delays
    task_def.retry_count = 3
    task_def.retry_logic = 'EXPONENTIAL_BACKOFF'
    task_def.retry_delay_seconds = 2
    task_def.backoff_scale_factor = 2

    metadata_client.register_task_def(task_def=task_def)

    print(f'Registered the task -- view at {api_config.ui_host}/taskDef/{task_def.name}')


if __name__ == '__main__':
    main()

Check out the full sample code for the send email task.

Here, if the email task fails, it will automatically retry up to 3 times with increasing delays—2s, 4s, and 8s—allowing time for the service to recover between attempts.
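
On the worker side, nothing extra is needed to opt into retries. If a worker raises an exception, Conductor marks that execution as FAILED and schedules the next attempt according to the task definition. Here's a minimal sketch using the Python SDK's worker_task decorator, with a simulated flaky provider standing in for a real email service:

import random

from conductor.client.worker.worker_task import worker_task


@worker_task(task_definition_name='send_email_task')
def send_email(to: str, subject: str) -> dict:
    # Simulate a provider that fails intermittently with a server error.
    # Raising here marks this execution FAILED, and Conductor schedules
    # the next attempt per the task definition's retry settings.
    if random.random() < 0.5:
        raise Exception('Email provider returned HTTP 500')
    return {'status': 'sent', 'to': to, 'subject': subject}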

[Figure: Email validation workflow using a Send Email task with exponential-backoff retries]

Task timeouts: Preventing workflow stalls

Retries help you recover, but timeouts prevent you from getting stuck in the first place. Whether a worker goes offline or an external service hangs, task-level timeouts ensure your workflow doesn’t wait forever.

Timeout parameters

  • pollTimeoutSeconds: The time to wait for a worker to poll this task before marking it as TIMED_OUT.
  • responseTimeoutSeconds: The time to wait for a worker to send a status update (like IN_PROGRESS) after polling.
  • timeoutSeconds: The total time allowed for the task to reach a terminal state.
  • timeoutPolicy: The action to take when a timeout occurs:
      • RETRY–Retries the task using its retry settings.
      • TIME_OUT_WF–Marks the whole workflow as TIMED_OUT.
      • ALERT_ONLY–Logs an alert but lets the task continue.

Use case: Slow inventory API

Say you're calling a third-party inventory API that sometimes takes too long to respond. You don't want to wait forever, but you also don’t want to fail immediately. Here's how you'd configure a balanced timeout with retries:

from conductor.client.configuration.configuration import Configuration
from conductor.client.http.models import TaskDef
from conductor.client.orkes_clients import OrkesClients


def main():
    api_config = Configuration()
    clients = OrkesClients(configuration=api_config)
    metadata_client = clients.get_metadata_client()

    task_def = TaskDef()
    task_def.name = 'inventory_check_task'
    task_def.description = 'Check inventory status with timeout and retry settings'
    
    # Retry settings
    task_def.retry_count = 2
    task_def.retry_logic = 'FIXED'
    task_def.retry_delay_seconds = 5

    # Timeout settings
    task_def.timeout_seconds = 30
    task_def.poll_timeout_seconds = 10
    task_def.response_timeout_seconds = 15
    task_def.timeout_policy = 'RETRY'

    metadata_client.register_task_def(task_def=task_def)

    print(f'Registered the task -- view at {api_config.ui_host}/taskDef/{task_def.name}')


if __name__ == '__main__':
    main()

Check out the full sample code for the check inventory task.

This setup gives your worker 30 seconds to complete the task. If it doesn’t respond or fails, Conductor will retry it twice, waiting 5 seconds between each attempt.
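
It can also help to sanity-check the worst case. Assuming every attempt runs to its full timeout, a rough upper bound on how long this task can hold up the workflow is a quick back-of-envelope calculation:

# Back-of-envelope worst case for the inventory task above
timeout_seconds = 30       # per-attempt limit
retry_count = 2            # retries after the original attempt
retry_delay_seconds = 5    # fixed delay between attempts

attempts = retry_count + 1
worst_case = attempts * timeout_seconds + retry_count * retry_delay_seconds
print(f'Worst case: {worst_case}s')  # 3 x 30 + 2 x 5 = 100s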

[Figure: Inventory workflow with a Check Inventory task that has a 30-second timeout]

Together, these retry and timeout configurations help you build workflows that are not just reactive, but resilient by design.

Next, let's look at how the same principles apply to Conductor's built-in system tasks, giving you end-to-end control over your system's behavior.

System task resilience

A key part of building resilient workflows is defining how long system tasks should wait on external services or heavy computations. In Orkes Conductor, each system task has default or configurable timeout settings to control this behavior.

HTTP task resilience

In Orkes Conductor, HTTP timeouts are defined by two parameters:

  • connectionTimeout: The maximum time (in milliseconds) to establish a TCP connection to the remote server. Default: 30 sec.
  • readTimeout: The maximum time (in milliseconds) to wait for a response after the connection is established and the request is sent. Default: 60 sec.

In Orkes Conductor, these defaults are enforced to ensure platform stability.
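
Conceptually, this is the same two-phase split you may know from HTTP client libraries. As a point of reference (an illustrative sketch, not Conductor code), Python's requests library expresses it as a (connect, read) timeout tuple; the URL here is hypothetical:

import requests

# 30 s to establish the TCP connection, 60 s to wait for a response,
# mirroring Conductor's connectionTimeout and readTimeout defaults
try:
    response = requests.get('https://api.example.com/inventory', timeout=(30, 60))
    print(response.status_code)
except requests.exceptions.ConnectTimeout:
    print('Could not establish a connection within 30 s')
except requests.exceptions.ReadTimeout:
    print('Connected, but no response arrived within 60 s')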

[Figure: Orkes Conductor making an HTTP request to an external server, illustrating where the connection and read timeouts apply]

Internal timeout behavior

Some system tasks have implicit timeout behaviors based on internal implementation. If you're designing workflows around system tasks, it's critical to understand and respect these limits.

System Task      Connection Timeout    Read Timeout
HTTP             30 sec                60 sec
LLM              60 sec                60 sec
Opsgenie         30 sec                60 sec
Inline           n/a                   4 sec (max execution time)
Business Rule    10 sec                120 sec

Wrap up

Task failures are unavoidable, but with proper retry and timeout configurations, they don’t have to break your workflows. Conductor’s task-level resilience features help you avoid cascading failures, handle transient issues gracefully, and prevent workflows from hanging indefinitely.

In the next article, we’ll scale this approach up and explore workflow-level failure handling strategies like timeout policies and compensation flows that give you end-to-end resilience.

Next up: Fail Fast, Recover Smart: Timeouts, Retries, and Recovery in Orkes Conductor

Orkes Conductor is an enterprise-grade Unified Application Platform for process automation, API and microservices orchestration, agentic workflows, and more. Check out the full set of features, or try it yourself using our free Developer Playground.
