
Fail Fast, Recover Smart: Timeouts, Retries, and Recovery in Orkes Conductor

Karl Goeltner
Software Engineer
May 12, 2025
5 min read

In distributed systems, failure isn’t a possibility—it’s a certainty. APIs hang, workers crash, and networks drop. What matters isn’t avoiding failure, but recovering from it quickly and cleanly.

Orkes Conductor gives you the tools to do exactly that. Timeouts and retries aren’t just configurable—they’re core to how Conductor ensures reliability at scale. In this post, you’ll learn how to use them effectively at both the task and workflow levels, how they interact with failure workflows, and how to design resilient systems that heal themselves without manual intervention.

Let’s dive in.

Why failure handling is critical in workflow applications

Imagine you're orchestrating a payment flow. After the user places an order, your workflow triggers tasks to charge the card, update inventory, and send a confirmation email.

All good—until the inventory API goes silent.

Now you're in a mess: the card is charged, but inventory isn’t updated, and the user doesn’t get a confirmation. This is exactly the kind of situation timeouts and retries are designed to avoid.

Unresponsive HTTP task in an inventory workflow

In Orkes Conductor, timeouts and retries work together to provide resilience and control:

  • Timeouts prevent your workflow from hanging indefinitely.
  • Retries recover from transient errors without manual intervention.

Together, they help you gracefully handle:

  • Unresponsive services: If a downstream API is taking too long, timeout the task and retry or move on.
  • Crashed or unresponsive workers: If a worker crashes mid-task, timeouts ensure the task doesn’t hang forever.
  • Transient errors: If a task fails quickly (like a network blip or 500 error), retries can automatically reattempt the operation.
  • Misconfigured or under-provisioned systems: If no worker is available to pick up a task, timeouts ensure the system doesn’t just wait forever.

This is more than resilience—it’s control. Instead of writing custom error-handling logic for every edge case, you get:

  • At-least-once delivery
  • Automatic retry logic
  • Configurable timeouts
  • Self-healing workflows

Task timeouts vs task retries

As you design resilient workflows, it’s important to understand the difference between timeouts and retries for individual tasks—they solve different problems:

Aspect | Retry | Timeout
What it is | A second (or third, etc.) attempt to run the same task after it fails. | A time limit for how long a task is allowed to run (or how long you wait for it to start or progress).
Trigger | Happens after a task fails (e.g., returns FAILED). | Happens if the task takes too long to respond or complete.
Example | API call fails with 500 Internal Server Error → Retry it after a delay. | API call hangs with no response after 30 seconds → Timeout triggers a recovery action.
Configured via | retryCount, retryLogic, retryDelaySeconds, backoffScaleFactor | timeoutSeconds, pollTimeoutSeconds, responseTimeoutSeconds, timeoutPolicy

In short:

  • Retries deal with tasks that fail fast.
  • Timeouts deal with tasks that hang slow.

Both are essential for robust, fault-tolerant tasks.

Failed worker resulting in task timeout and retries
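To make the distinction concrete, here is a minimal sketch of a task definition that sets both retry and timeout behavior. The task name and the specific values are illustrative, not recommendations:

{
  "name": "check_inventory",
  "retryCount": 3,
  "retryLogic": "EXPONENTIAL_BACKOFF",
  "retryDelaySeconds": 5,
  "backoffScaleFactor": 2,
  "responseTimeoutSeconds": 30,
  "timeoutSeconds": 300,
  "pollTimeoutSeconds": 60,
  "timeoutPolicy": "RETRY"
}

With settings like these, an attempt that produces no response within 30 seconds is timed out and retried with growing delays, and the task fails for good once its retries are exhausted.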

Workflow timeouts and failure workflows

Just like tasks, entire workflows can time out or fail, and Conductor gives you powerful tools to handle those cases gracefully.

Aspect | Workflow Timeout | Failure Workflow
What it is | A maximum time limit for the whole workflow to complete. | A backup workflow that runs automatically when the original workflow fails.
Trigger | Happens when the workflow exceeds its configured timeout. | Happens when the workflow ends in FAILED or TERMINATED status.
Example | A data pipeline takes too long (e.g., > 2 hours) → Workflow is marked TIMED_OUT. | A workflow fails due to repeated task failures → A failure workflow sends alerts and logs diagnostics.
Configured via | timeoutSeconds, timeoutPolicy | failureWorkflow

In short:

  • Workflow timeouts protect against indefinitely running flows.
  • Failure workflows let you define recovery logic and postmortems for failed flows.

Together, they make workflows self-aware and recoverable.

Long-running worker exceeds workflow timeout and triggers compensation workflow
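For the workflow level, the equivalent knobs live in the workflow definition itself. A rough sketch, with illustrative names and values:

{
  "name": "order_fulfillment",
  "version": 1,
  "timeoutSeconds": 7200,
  "timeoutPolicy": "TIME_OUT_WF",
  "failureWorkflow": "order_fulfillment_compensation",
  "tasks": [
    {
      "name": "check_inventory",
      "taskReferenceName": "check_inventory_ref",
      "type": "SIMPLE"
    }
  ]
}

Here the workflow is marked TIMED_OUT if it runs past two hours, and a separate order_fulfillment_compensation workflow is started whenever the run ends in FAILED or TERMINATED.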

Combining retries, timeouts, and failure workflows

Retries give tasks more chances. Timeouts limit how long each attempt can run. Workflow timeouts cap the total duration of the workflow. If all else fails, failure workflows take over.

Here’s how it all fits together:

  • Task fails → Task retry: the task is retried (up to retryCount).
  • Each retry has a timeout → Task timeout: limits how long you wait per attempt.
  • Too many failures or too long a delay → Workflow timeout: the workflow fails or times out.
  • Workflow fails → Failure workflow: a failure workflow runs to clean up or alert.

This layered approach gives you fault tolerance at every level.
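A quick back-of-the-envelope check helps when sizing the layers (illustrative numbers, ignoring queueing and scheduling overhead): with retryCount = 3, responseTimeoutSeconds = 30, and retryDelaySeconds = 10 on a task, a single task can occupy roughly

(1 original attempt + 3 retries) × 30 s + 3 × 10 s of delay ≈ 150 seconds

in the worst case, so the workflow-level timeoutSeconds should be comfortably larger than the sum of such worst cases along the longest path through the workflow.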

Default vs custom configuration: When to override

Orkes Conductor ships with sane defaults.

Task retry settings

{
  "retryCount": 3,
  "retryDelaySeconds": 60,
  "retryLogic": "FIXED",
  "backoffScaleFactor": 1
}

Task timeout settings

{
  "responseTimeoutSeconds": 600,
  "timeoutSeconds": 3600,
  "pollTimeoutSeconds": 3600,
  "timeoutPolicy": "TIME_OUT_WF"
}

Workflow settings

{
  "timeoutSeconds": 0,
  "timeoutPolicy": "ALERT_ONLY",
  "failureWorkflow": ""
}

These are a solid starting point, but not always ideal for your SLAs.

Override the defaults when:

  • You have strict latency goals, such as completing checkout in under 5 minutes.
  • You interact with unreliable third-party APIs, like a flaky partner API.
  • You need predictable system behavior under load or failure.

On the flip side, don’t override just for the sake of it. Defaults help reduce configuration sprawl and are good enough for many internal service workflows.
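As an example of a deliberate override, a card-charging task inside a checkout flow with a tight latency budget might be tuned along these lines (hypothetical task name, illustrative values):

{
  "name": "charge_card",
  "retryCount": 2,
  "retryLogic": "EXPONENTIAL_BACKOFF",
  "retryDelaySeconds": 2,
  "backoffScaleFactor": 2,
  "responseTimeoutSeconds": 15,
  "timeoutSeconds": 60,
  "pollTimeoutSeconds": 30,
  "timeoutPolicy": "TIME_OUT_WF"
}

Paired with a workflow-level timeoutSeconds of around 300 and a failureWorkflow that reverses the charge and alerts the on-call engineer, the flow either finishes inside its five-minute budget or fails loudly and cleans up after itself.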

Wrap up

In distributed systems, failure is inevitable, but downtime doesn’t have to be. With Orkes Conductor, resilience is built in, not bolted on.

Timeouts help prevent indefinite hangs, and retries offer automatic recovery from transient issues. Workflow-level timeouts and failure workflows ensure your entire system can fail gracefully and bounce back, without manual intervention.

By combining these tools thoughtfully, you get more than reliability—you get confidence that your workflows will keep running, even when parts of your system don’t.

So, whether you're orchestrating high-stakes payment flows or background data pipelines, remember: design for failure, recover automatically, and keep moving forward.

Go more in-depth on task and workflow resilience with the follow-up article Task-Level Resilience in Orkes Conductor: Timeouts and Retries in Action.

Orkes Conductor is an enterprise-grade Unified Application Platform for process automation, API and microservices orchestration, agentic workflows, and more. Check out the full set of features, or try it yourself using our free Developer Playground.
