
Fail Fast, Recover Smart: Timeouts, Retries, and Recovery in Orkes Conductor

Karl Goeltner
Software Engineer
May 12, 2025
5 min read

In distributed systems, failure isn’t a possibility—it’s a certainty. APIs hang, workers crash, and networks drop. What matters isn’t avoiding failure, but recovering from it quickly and cleanly.

Orkes Conductor gives you the tools to do exactly that. Timeouts and retries aren’t just configurable—they’re core to how Conductor ensures reliability at scale. In this post, you’ll learn how to use them effectively at both the task and workflow levels, how they interact with failure workflows, and how to design resilient systems that heal themselves without manual intervention.

Let’s dive in.

Why failure handling is critical in workflow applications

Imagine you're orchestrating a payment flow. After the user places an order, your workflow triggers tasks to charge the card, update inventory, and send a confirmation email.

All good—until the inventory API goes silent.

Now you're in a mess: the card is charged, but inventory isn’t updated, and the user doesn’t get a confirmation. This is exactly the kind of situation timeouts and retries are designed to avoid.

Unresponsive HTTP task in an inventory workflow

In Orkes Conductor, timeouts and retries work together to provide resilience and control:

  • Timeouts prevent your workflow from hanging indefinitely.
  • Retries recover from transient errors without manual intervention.

Together, they help you gracefully handle:

  • Unresponsive services: If a downstream API is taking too long, timeout the task and retry or move on.
  • Crashed or unresponsive workers: If a worker crashes mid-task, timeouts ensure the task doesn’t hang forever.
  • Transient errors: If a task fails quickly (like a network blip or 500 error), retries can automatically reattempt the operation.
  • Misconfigured or under-provisioned systems: If no worker is available to pick up a task, timeouts ensure the system doesn’t just wait forever.

This is more than resilience—it’s control. Instead of writing custom error-handling logic for every edge case, you get:

  • At-least-once delivery
  • Automatic retry logic
  • Configurable timeouts
  • Self-healing workflows

Task timeouts vs task retries

As you design resilient workflows, it’s important to understand the difference between timeouts and retries for individual tasks—they solve different problems:

Aspect | Retry | Timeout
What it is | A second (or third, etc.) attempt to run the same task after it fails. | A time limit for how long a task is allowed to run (or how long you wait for it to start or progress).
Trigger | Happens after a task fails (e.g., returns FAILED). | Happens if the task takes too long to respond or complete.
Example | API call fails with 500 Internal Server Error → Retry it after a delay. | API call hangs with no response after 30 seconds → Timeout triggers a recovery action.
Configured via | retryCount, retryLogic, retryDelaySeconds, backoffScaleFactor | timeoutSeconds, pollTimeoutSeconds, responseTimeoutSeconds, timeoutPolicy

In short:

  • Retries deal with tasks that fail fast.
  • Timeouts deal with tasks that hang slow.

Both are essential for robust, fault-tolerant tasks.

Failed worker resulting in task timeout and retries
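To make the distinction concrete, here is a minimal sketch of a task definition that sets both retry and timeout behavior. The task name and the specific values are illustrative, not recommendations:

{
  "name": "check_inventory",
  "retryCount": 3,
  "retryLogic": "EXPONENTIAL_BACKOFF",
  "retryDelaySeconds": 5,
  "backoffScaleFactor": 2,
  "responseTimeoutSeconds": 30,
  "timeoutSeconds": 300,
  "pollTimeoutSeconds": 60,
  "timeoutPolicy": "RETRY"
}

With settings like these, an attempt that produces no response within 30 seconds is timed out and retried with growing delays, and the task fails for good once its retries are exhausted.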

Workflow timeouts and failure workflows

Just like tasks, entire workflows can time out or fail, and Conductor gives you powerful tools to handle those cases gracefully.

Aspect | Workflow Timeout | Failure Workflow
What it is | A maximum time limit for the whole workflow to complete. | A backup workflow that runs automatically when the original workflow fails.
Trigger | Happens when the workflow exceeds its configured timeout. | Happens when the workflow ends in FAILED or TERMINATED status.
Example | A data pipeline takes too long (e.g., > 2 hours) → Workflow is marked TIMED_OUT. | A workflow fails due to repeated task failures → A failure workflow sends alerts and logs diagnostics.
Configured via | timeoutSeconds, timeoutPolicy | failureWorkflow

In short:

  • Workflow timeouts protect against indefinitely running flows.
  • Failure workflows let you define recovery logic and postmortems for failed flows.

Together, they make workflows self-aware and recoverable.

Long-running worker exceeds workflow timeout and triggers compensation workflow
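For the workflow level, the equivalent knobs live in the workflow definition itself. A rough sketch, with illustrative names and values:

{
  "name": "order_fulfillment",
  "version": 1,
  "timeoutSeconds": 7200,
  "timeoutPolicy": "TIME_OUT_WF",
  "failureWorkflow": "order_fulfillment_compensation",
  "tasks": [
    {
      "name": "check_inventory",
      "taskReferenceName": "check_inventory_ref",
      "type": "SIMPLE"
    }
  ]
}

Here the workflow is marked TIMED_OUT if it runs past two hours, and a separate order_fulfillment_compensation workflow is started whenever the run ends in FAILED or TERMINATED.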

Combining retries, timeouts, and failure workflows

Retries give tasks more chances. Timeouts limit how long each attempt can run. Workflow timeouts cap the total duration of the workflow. If all else fails, failure workflows take over.

Here’s how it all fits together:

  • Task fails → Task retry: the task is retried (up to retryCount).
  • Each retry has a timeout → Task timeout: limits how long you wait per attempt.
  • Too many failures or too long a delay → Workflow timeout: the workflow fails or times out.
  • Workflow fails → Failure workflow: a failure workflow runs to clean up or alert.

This layered approach gives you fault tolerance at every level.
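A quick back-of-the-envelope check helps when sizing the layers (illustrative numbers, ignoring queueing and scheduling overhead): with retryCount = 3, responseTimeoutSeconds = 30, and retryDelaySeconds = 10 on a task, a single task can occupy roughly

(1 original attempt + 3 retries) × 30 s + 3 × 10 s of delay ≈ 150 seconds

in the worst case, so the workflow-level timeoutSeconds should be comfortably larger than the sum of such worst cases along the longest path through the workflow.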

Default vs custom configuration: When to override

Orkes Conductor ships with sane defaults.

Task retry settings

{
  "retryCount": 3,
  "retryDelaySeconds": 60,
  "retryLogic": "FIXED",
  "backoffScaleFactor": 1
}

Task timeout settings

{
  "responseTimeoutSeconds": 600,
  "timeoutSeconds": 3600,
  "pollTimeoutSeconds": 3600,
  "timeoutPolicy": "TIME_OUT_WF"
}

Workflow settings

{
  "timeoutSeconds": 0,
  "timeoutPolicy": "ALERT_ONLY",
  "failureWorkflow": ""
}

These are a solid starting point, but not always ideal for your SLAs.

Override the defaults when:

  • You have strict latency goals, such as completing checkout in under 5 minutes.
  • You interact with unreliable third-party APIs, like a flaky partner API.
  • You need predictable system behavior under load or failure.

On the flip side, don’t override just for the sake of it. Defaults help reduce configuration sprawl and are good enough for many internal service workflows.
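As an example of a deliberate override, a card-charging task inside a checkout flow with a tight latency budget might be tuned along these lines (hypothetical task name, illustrative values):

{
  "name": "charge_card",
  "retryCount": 2,
  "retryLogic": "EXPONENTIAL_BACKOFF",
  "retryDelaySeconds": 2,
  "backoffScaleFactor": 2,
  "responseTimeoutSeconds": 15,
  "timeoutSeconds": 60,
  "pollTimeoutSeconds": 30,
  "timeoutPolicy": "TIME_OUT_WF"
}

Paired with a workflow-level timeoutSeconds of around 300 and a failureWorkflow that reverses the charge and alerts the on-call engineer, the flow either finishes inside its five-minute budget or fails loudly and cleans up after itself.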

Wrap up

In distributed systems, failure is inevitable, but downtime doesn’t have to be. With Orkes Conductor, resilience is built in, not bolted on.

Timeouts help prevent indefinite hangs, and retries offer automatic recovery from transient issues. Workflow-level timeouts and failure workflows ensure your entire system can fail gracefully and bounce back, without manual intervention.

By combining these tools thoughtfully, you get more than reliability—you get confidence that your workflows will keep running, even when parts of your system don’t.

So, whether you're orchestrating high-stakes payment flows or background data pipelines, remember: design for failure, recover automatically, and keep moving forward.

Go more in-depth on task and workflow resilience with the follow-up article Task-Level Resilience in Orkes Conductor: Timeouts and Retries in Action.

Orkes Conductor is an enterprise-grade Unified Application Platform for process automation, API and microservices orchestration, agentic workflows, and more. Check out the full set of features, or try it yourself using our free Developer Playground.
