Durable Execution Explained — How Conductor Delivers Resilient Systems Out Of The Box

Viren Baraiya
May 09, 2024

This is part 2 of a two-part series on durable execution: what it is, why it is important, and how to pull it off. Part 2 shows how Conductor, a workflow orchestration engine, seamlessly integrates durable execution into applications. Check out part 1 for more about what durable execution is.

In the ever-evolving landscape of application architecture, durable execution and platform engineering have been gaining traction in recent years, driven by the need for resilient, scalable, and efficient systems.

Durable execution refers to a system’s ability to persist execution even in the face of interruption or failure. This characteristic is especially important in distributed and/or long-running systems, where the chances of disruption or failure increase drastically. While there are several best practices for building durable applications and systems, one of the most effective ways is to leverage stateful platforms like Orkes Conductor.

Conductor is a workflow orchestration platform that abstracts away the complexities of underlying infrastructure, enabling developers to focus on building applications. True to its name, Conductor directs and orchestrates the performance of distributed services into a dynamic application flow. Each individual player – or task – does not need to care what the other players are doing, because Conductor keeps track of what is supposed to happen at every juncture.

Its in-built state management allows for reliable recovery in case of failure or interruption. Just like a musical conductor, it empowers applications to adapt to ever-changing conditions without going offline — whether it is automatically retrying tasks, scaling up to meet traffic spikes, or integrating new services.

Infographic of 6 key features in Conductor: a resilient engine, elastic capacity, failure handling, introspection, versioning, and metrics.
6 key Conductor features for durable execution.

How does Conductor enable you to build resilient, efficient, and scalable systems? Let’s take a look at what happens in the backend when you build your applications with Conductor as the main orchestration engine.

Conductor guarantees durable execution under the hood

Conductor’s secret sauce for fortifying systems with durable execution is decoupled infrastructure and redundancy. Let’s set the scene for an example workflow.

Say you have an online shop that makes and ships custom violins worldwide. The order process can take months to fulfill, from pre-ordering the violin to customizing and shipping it.

Enter the four key actors in our order workflow.

  • Order App—the interface where customers can pre-order violins and make payment.
  • Conductor Server—the central orchestration engine that directs and tracks the workflow. Conductor’s workflow execution service (WES) runs on this server and manages the task queues.
  • Task Workers—code units that run and complete queued tasks.
  • Conductor Stores—storage units that contain all workflow information, including metadata, task queues, and history.
Diagram stack of the Order App, Conductor Server, Task Workers, and Conductor Stores.
The tech stack for how Conductor powers applications.

Typical success scenario

In the Order App, when the user clicks the Order button during the checkout procedure, a Conductor workflow for order_processing is triggered. The Order App passes the workflow input parameters, such as the order details, shipping address, and user email to the Conductor Server. In return, the Server passes back the workflow instance ID, which can be used to track the workflow progress and manage its execution.

Diagram of workflow getting triggered by a signal.
Workflow begins upon a signal.
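
The exchange above can be sketched with a small in-memory stand-in. This is a toy model, not Conductor’s API: the ToyOrchestrator class, its method names, and the sample input values are all hypothetical, chosen only to show the shape of the hand-off, where the app submits input and keeps the returned ID for later status queries.

```python
import uuid


class ToyOrchestrator:
    """In-memory stand-in for an orchestration server (hypothetical, not Conductor's API)."""

    def __init__(self):
        self.executions = {}

    def start_workflow(self, name, workflow_input):
        # Record the execution and hand back an ID the caller can keep.
        workflow_id = str(uuid.uuid4())
        self.executions[workflow_id] = {
            "name": name,
            "input": workflow_input,
            "status": "RUNNING",
        }
        return workflow_id

    def get_status(self, workflow_id):
        # The caller only needs the ID to ask how the workflow is doing.
        return self.executions[workflow_id]["status"]


server = ToyOrchestrator()
order_input = {
    "order_details": {"item": "custom violin", "quantity": 1},
    "shipping_address": "221B Example Street",
    "user_email": "customer@example.com",
}
workflow_id = server.start_workflow("order_processing", order_input)
print(workflow_id, server.get_status(workflow_id))
```

The Order App never holds workflow state itself in this sketch. It keeps only the ID, which is the point of the design.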

Based on predefined signals and parameters, the order_processing workflow will run through a series of tasks, such as an HTTP call to a payment processor, or a piece of custom functionality for invoice calculation.

In Conductor, workflows are executed on a worker-task queue architecture, where each task type – HTTP call, webhook, and so on – has its own task queue. When the workflow execution for order_processing begins, the workflow execution service (WES) begins to add the workflow’s tasks to the relevant queues. An HTTP task that calls a third-party payment processor, capture_payment, is added to the HTTP task queue. Meanwhile, calculate_invoice, a custom function, is added to a custom task queue, while notify_invoice, a call to a third-party email service, is added to the HTTP task queue.

Diagram of tasks getting sorted into queues based on task type.
Based on the predefined workflow, Conductor Server adds tasks to the appropriate task queues.
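
As a rough illustration of queue-per-task-type routing, the sketch below sorts the three tasks named above into queues keyed by type. The definition format, the "CUSTOM" type label, and the enqueue_tasks helper are simplifying assumptions made for this example, not Conductor’s actual schema or internals.

```python
from collections import defaultdict, deque

# Hypothetical, simplified definition: task name plus task type, in run order.
ORDER_PROCESSING = [
    {"name": "capture_payment", "type": "HTTP"},
    {"name": "calculate_invoice", "type": "CUSTOM"},
    {"name": "notify_invoice", "type": "HTTP"},
]


def enqueue_tasks(workflow_tasks):
    """Place each task on the queue that matches its type."""
    queues = defaultdict(deque)
    for task in workflow_tasks:
        queues[task["type"]].append(task["name"])
    return queues


queues = enqueue_tasks(ORDER_PROCESSING)
for task_type, queue in queues.items():
    print(task_type, list(queue))
```

Workers for a given task type only need to watch their own queue, which keeps them independent of the rest of the workflow.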

While Conductor’s WES is directing and scheduling tasks to the right queue, the available Task Workers are busy polling for tasks to do. Although there are three tasks queued, the first task, capture_payment, has to be completed before the next task can begin. So, when Worker A polls for a task, the Conductor Server sends capture_payment to Worker A for completion. Once Worker A has completed the task, it updates the Server about the task completion status.

The Server registers and keeps track of each task’s status. So when it receives the update from Worker A that capture_payment has been completed, it will send the next scheduled task to the next available worker.

Diagram of workers polling for tasks, completing tasks, and sending the task status back to the server.
Worker-task queue architecture, where workers poll the server for tasks to do, and the server assigns tasks based on the defined workflow schedule.
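
The poll, complete, and update cycle can be modelled in a few lines. The ToyServer class below is a hypothetical simplification that runs tasks strictly one at a time. Its method names and the worker labels are assumptions for illustration and do not mirror Conductor’s worker SDK.

```python
class ToyServer:
    """Hands out tasks strictly in workflow order (illustrative only)."""

    def __init__(self, task_names):
        self.pending = list(task_names)   # tasks not yet completed, in order
        self.in_progress = None           # task currently held by a worker
        self.completed = []               # (task, worker) pairs, in completion order

    def poll(self, worker):
        # Only the head of the workflow may be handed out, and only
        # if no other worker is already running it.
        if self.in_progress is None and self.pending:
            self.in_progress = self.pending[0]
            return self.in_progress
        return None

    def update(self, worker, task, status):
        # A worker reports back; on success the next task becomes available.
        if task == self.in_progress and status == "COMPLETED":
            self.completed.append((task, worker))
            self.pending.pop(0)
            self.in_progress = None


server = ToyServer(["capture_payment", "calculate_invoice", "notify_invoice"])

task_a = server.poll("worker_a")          # worker A receives capture_payment
task_b = server.poll("worker_b")          # nothing yet: capture_payment is still running
server.update("worker_a", task_a, "COMPLETED")
task_b_retry = server.poll("worker_b")    # now worker B receives the next task

print(task_a, task_b, task_b_retry)
```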

This set-up is how Conductor keeps track of the workflow state as tasks get completed one after another based on the predefined workflow schedule. And voilà, with Conductor’s built-in state management, developers need not spend time building complicated infrastructure of their own. Remember the workflow instance ID that was sent when the workflow was initiated? The Order App can simply use the ID to query the Server about the workflow status at any time.

State persistence and durability

Crucially, Conductor goes beyond just state visibility. It’s built to withstand and recover from failures no matter how long the workflow runs. Cue the Conductor Stores. At every point, data gets stored on distributed, high-availability clusters so that the workflow can always pick up and resume from where it last stopped – whether from a restart in a failed run or from an idle state in a long-running flow.
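
To make the idea of resuming from stored state concrete, here is a minimal sketch that checkpoints workflow progress after every step. A local SQLite file stands in for the distributed, high-availability stores described above. The table layout, function names, and the "order-1" ID are all hypothetical.

```python
import json
import os
import sqlite3
import tempfile

TASKS = ["capture_payment", "wait_customization", "calculate_invoice", "notify_invoice"]


def open_store(path):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS workflow_state "
        "(workflow_id TEXT PRIMARY KEY, state TEXT)"
    )
    return conn


def save_state(conn, workflow_id, state):
    # Persist after every step so a restart can resume from here.
    conn.execute(
        "INSERT OR REPLACE INTO workflow_state VALUES (?, ?)",
        (workflow_id, json.dumps(state)),
    )
    conn.commit()


def load_state(conn, workflow_id):
    row = conn.execute(
        "SELECT state FROM workflow_state WHERE workflow_id = ?", (workflow_id,)
    ).fetchone()
    return json.loads(row[0]) if row else {"next_index": 0, "history": []}


def run_one_step(conn, workflow_id):
    # Read the checkpoint, record the next task in the history, write it back.
    state = load_state(conn, workflow_id)
    task = TASKS[state["next_index"]]
    state["history"].append(task)
    state["next_index"] += 1
    save_state(conn, workflow_id, state)
    return task


db_path = os.path.join(tempfile.mkdtemp(), "store.db")

conn = open_store(db_path)
first = run_one_step(conn, "order-1")
conn.close()                      # simulate the engine process going away

conn = open_store(db_path)        # a fresh process reconnects to the same store
second = run_one_step(conn, "order-1")
resumed_state = load_state(conn, "order-1")
conn.close()

print(first, second, resumed_state["history"])
```

Because the position in the workflow lives in the store rather than in process memory, the second connection picks up at the second task instead of starting over.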

For example, after capture_payment, the WES reads the next task, wait_customization, and pauses the workflow to wait for the luthier to finish crafting the instrument. The process may take several months, but with the workflow execution history, pending task queues, and predefined flow of tasks, the system can easily recover from this state of idling. Once Conductor receives a signal – perhaps the luthier clicked a confirmation button on the Order App – that the violin has been made, it will send the next scheduled task in the queue to the next available worker.

Diagram of Conductor Server receiving a signal to continue and assigning the next task to a worker.
Conductor resumes the workflow upon receiving a signal to proceed to the next task.
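
The pause-and-resume behaviour can be sketched as a workflow that stops at a wait step until an external signal arrives. ToyWorkflow, the "WAIT" and "WORK" labels, and the signal method are hypothetical names for this illustration. For brevity, the sketch treats every dispatched task as completing immediately.

```python
class ToyWorkflow:
    """Pauses at a wait step until an external signal arrives (illustrative only)."""

    def __init__(self, tasks):
        self.tasks = tasks          # list of (name, kind); kind is "WAIT" or "WORK"
        self.index = 0
        self.status = "RUNNING"
        self.dispatched = []        # tasks handed to workers, in order

    def advance(self):
        # Dispatch tasks until the workflow finishes or reaches a wait step.
        while self.index < len(self.tasks):
            name, kind = self.tasks[self.index]
            if kind == "WAIT":
                self.status = "PAUSED"
                return
            self.dispatched.append(name)
            self.index += 1
        self.status = "COMPLETED"

    def signal(self, task_name):
        # An external event (such as the luthier's confirmation) ends the wait.
        if self.status != "PAUSED":
            return
        name, _kind = self.tasks[self.index]
        if name == task_name:
            self.index += 1
            self.status = "RUNNING"
            self.advance()


wf = ToyWorkflow([
    ("capture_payment", "WORK"),
    ("wait_customization", "WAIT"),
    ("calculate_invoice", "WORK"),
    ("notify_invoice", "WORK"),
])
wf.advance()
paused_status = wf.status
dispatched_before_signal = list(wf.dispatched)

wf.signal("wait_customization")   # months may pass before this call
print(paused_status, dispatched_before_signal, wf.status, wf.dispatched)
```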

Handling failures in all shapes and sizes

Whether it’s transient failures, like services going offline briefly, or execution failures, like buggy code, or even deliberate termination, like a customer canceling an order, Conductor is equipped to handle it all.

We’ve seen a glimpse of how Conductor’s decoupled infrastructure and redundancy enable applications to run smoothly with guaranteed state visibility and persistence. But failure scenarios are where these characteristics for durability really shine through.

App server goes offline

Let’s continue with the custom violin order processing workflow. With the violin ready, the workflow proceeds to calculate_invoice, a custom functionality on the Order App. Perhaps at this moment, a blackout causes the Order App’s server to go down temporarily, which takes all the Task Workers for the calculate_invoice task offline as well. When the Conductor Server dispatches this task to be completed, there are no Workers available to complete it.

Based on the task’s retry and timeout policies, Conductor will automatically reschedule the task until the Order App’s server comes back online or until timeout occurs.

Diagram of Conductor Server attempting to reach task workers on a different server.
Conductor Server will automatically handle transient failures based on predefined parameters for retries, timeout, and so on.
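
A retry-and-timeout policy of this kind might look roughly like the sketch below. The parameter names (retry_count, retry_delay_seconds, timeout_seconds) and the run_with_policy function are assumptions for illustration rather than Conductor’s configuration keys, and time is simulated so that the example runs instantly.

```python
def run_with_policy(task, retry_count, retry_delay_seconds, timeout_seconds):
    """Retry `task` until it succeeds, retries run out, or the time budget is spent.

    `task` receives the simulated elapsed time and returns True on success.
    """
    elapsed = 0
    attempts = 0
    while attempts <= retry_count and elapsed <= timeout_seconds:
        attempts += 1
        if task(elapsed):
            return {"status": "COMPLETED", "attempts": attempts, "elapsed": elapsed}
        elapsed += retry_delay_seconds      # wait before rescheduling
    reason = "TIMED_OUT" if elapsed > timeout_seconds else "FAILED"
    return {"status": reason, "attempts": attempts, "elapsed": elapsed}


def workers_back_after_25s(now):
    # Hypothetical outage: no worker is available until 25 seconds have passed.
    return now >= 25


recovered = run_with_policy(workers_back_after_25s, retry_count=5,
                            retry_delay_seconds=10, timeout_seconds=60)
timed_out = run_with_policy(workers_back_after_25s, retry_count=5,
                            retry_delay_seconds=10, timeout_seconds=15)
print(recovered)
print(timed_out)
```

With a generous timeout the task succeeds on the fourth attempt, once the workers are back. With a 15-second budget it times out first.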

Service hits rate limit

Once the calculate_invoice task has been completed, the next task – an HTTP call for notify_invoice – is invoked. At this point, we hit another roadblock: the HTTP service for this task has reached its rate limit. As before, Conductor automatically retries the task, this time with exponential backoff, so that each attempt waits longer than the last and the task can complete once the service accepts requests again.
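
Exponential backoff simply doubles the wait between attempts. The sketch below computes a generic schedule. The base delay, the optional cap, and the backoff_delays function are illustrative assumptions, not Conductor’s exact formula.

```python
def backoff_delays(base_delay_seconds, max_retries, max_delay_seconds=None):
    """Return the wait before each retry, doubling every time."""
    delays = []
    for attempt in range(max_retries):
        delay = base_delay_seconds * (2 ** attempt)
        if max_delay_seconds is not None:
            delay = min(delay, max_delay_seconds)   # optional ceiling on the wait
        delays.append(delay)
    return delays


uncapped = backoff_delays(2, 5)
capped = backoff_delays(2, 5, max_delay_seconds=10)
print(uncapped)   # [2, 4, 8, 16, 32]
print(capped)     # [2, 4, 8, 10, 10]
```

Spacing retries out this way gives a rate-limited service room to recover instead of hitting it again at a fixed interval.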

Conductor goes offline

Conductor can be deployed in high-availability clusters to guarantee maximum uptime. Even so, on the off chance that its workflow execution service (WES) goes down, Conductor’s decoupled infrastructure ensures that task runs are not affected. Since the task queues reside on high-availability clusters, separate from the WES, workers can continue running the tasks until completion and update the Conductor Server once it comes back online.
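
One way to picture this decoupling is a worker that keeps pulling from a queue that lives apart from the engine, and holds on to its results until the engine is reachable again. The ToyEngine and ToyWorker classes and the worker-side buffering shown here are assumptions made for illustration. The source does not describe how Conductor implements this internally.

```python
from collections import deque


class ToyEngine:
    """Stand-in for the execution service; can be switched offline."""

    def __init__(self):
        self.online = True
        self.completed = []

    def report(self, task, status):
        if not self.online:
            raise ConnectionError("engine unavailable")
        self.completed.append((task, status))


class ToyWorker:
    def __init__(self, queue, engine):
        self.queue = queue          # lives apart from the engine
        self.engine = engine
        self.unsent = deque()       # results waiting to be reported

    def run_next(self):
        task = self.queue.popleft()     # the queue is still reachable
        self.unsent.append((task, "COMPLETED"))
        self.flush()

    def flush(self):
        # Deliver buffered results; stop and keep them if the engine is down.
        while self.unsent:
            try:
                self.engine.report(*self.unsent[0])
            except ConnectionError:
                return
            self.unsent.popleft()


queue = deque(["calculate_invoice"])
engine = ToyEngine()
worker = ToyWorker(queue, engine)

engine.online = False
worker.run_next()                   # the task still runs to completion
buffered = list(worker.unsent)

engine.online = True
worker.flush()                      # the update arrives once the engine is back
print(buffered, engine.completed)
```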

Introspecting workflows for debugging

Once in a while, workflows may still fail despite these automated safeguards and policies for guaranteed execution. However, Conductor makes it easy to remedy these situations. With Conductor Stores that preserve all execution history, developers can inspect what happened under the hood to troubleshoot and rectify errors before restarting the failed workflows.

Diagram of workflow failure and Conductor’s introspection feature where developers can inspect the execution from various screens.
Conductor enables you to look under the hood and find out where exactly your workflow failed and why it happened.

For example, say the number of custom violin orders has increased over time, and a number of order_processing workflow executions are taking too long or have timed out. With the ability to introspect, we can quickly pinpoint the problem. Perhaps the HTTP URL is outdated, or there are insufficient workers servicing a task. Armed with these logs, application developers can quickly troubleshoot and resolve these issues so that the workflows can restart without any roadblocks.
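
As a rough sketch of what introspecting an execution history involves, the snippet below scans task records for failures and slow steps. The record fields, the sample values, and the find_problems helper are all hypothetical. They are not Conductor’s log format or real data.

```python
# Hypothetical execution history for one order_processing run.
execution_history = [
    {"task": "capture_payment", "status": "COMPLETED", "duration_s": 2, "reason": None},
    {"task": "calculate_invoice", "status": "COMPLETED", "duration_s": 45, "reason": None},
    {"task": "notify_invoice", "status": "FAILED", "duration_s": 1,
     "reason": "HTTP 404 from outdated URL"},
]


def find_problems(history, slow_threshold_s):
    """Flag tasks that did not complete, or that completed too slowly."""
    problems = []
    for record in history:
        if record["status"] != "COMPLETED":
            problems.append((record["task"], record["status"], record["reason"]))
        elif record["duration_s"] > slow_threshold_s:
            problems.append((record["task"], "SLOW", f"took {record['duration_s']}s"))
    return problems


problems = find_problems(execution_history, slow_threshold_s=30)
for task, kind, detail in problems:
    print(task, kind, detail)
```

In this made-up run, the slow calculate_invoice step hints at too few workers, while the failed notify_invoice call points to an outdated URL.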

Importantly, because Conductor keeps state management and infrastructure separate from the Order App’s business logic, the developers can easily scale or upgrade the underlying infrastructure without any downtime.

Analyzing metrics to optimize performance

Over time, sufficient data will be collected to analyze the workflow performance in aggregate. Conductor comes equipped with a metrics dashboard that showcases key insights about latency, completion rate, failure rate, number of concurrent workflows, and so on.

Diagram of a metrics dashboard.
Conductor provides a metrics dashboard for aggregate workflow performance.
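
Aggregate metrics like these boil down to simple calculations over execution records. The sketch below derives a completion rate, a failure rate, and a mean latency from a handful of made-up records. The field names, sample values, and summarize function are hypothetical and unrelated to the dashboard’s actual implementation.

```python
from statistics import mean

# Hypothetical execution records, not real measurements.
executions = [
    {"status": "COMPLETED", "latency_s": 12},
    {"status": "COMPLETED", "latency_s": 18},
    {"status": "FAILED", "latency_s": 30},
    {"status": "COMPLETED", "latency_s": 20},
]


def summarize(records):
    """Compute aggregate rates and mean latency across executions."""
    total = len(records)
    completed = sum(1 for r in records if r["status"] == "COMPLETED")
    failed = sum(1 for r in records if r["status"] == "FAILED")
    return {
        "completion_rate": completed / total,
        "failure_rate": failed / total,
        "mean_latency_s": mean([r["latency_s"] for r in records]),
    }


summary = summarize(executions)
print(summary)
```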

These metrics can further inform decisions to optimize Conductor workflows for better performance, such as refactoring code or scaling up the infrastructure.

Refactoring workflows with no downtime

With in-built support for workflow versioning, application developers can refactor the workflow code anytime without impacting existing workflows. Once the workflow definition has been updated, new executions will run based on the latest definition, while existing workflows can be restarted to run the latest definition. All of this is made possible by Conductor’s decoupled architecture.

Diagram of different versions for the same workflow.
Conductor has in-built versioning capabilities to allow for workflow changes without impacting existing runs.
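
Version pinning can be illustrated with a small registry keyed by workflow name and version. The WorkflowRegistry class, its methods, and the verify_address task added in version 2 are hypothetical, introduced only to show how existing runs stay on their original definition while new runs pick up the latest one.

```python
class WorkflowRegistry:
    """Keeps every version of a definition side by side (illustrative only)."""

    def __init__(self):
        self.definitions = {}     # (name, version) -> ordered task names
        self.latest = {}          # name -> highest registered version

    def register(self, name, version, tasks):
        self.definitions[(name, version)] = list(tasks)
        self.latest[name] = max(version, self.latest.get(name, 0))

    def start(self, name, version=None):
        # New executions default to the latest version and stay pinned to it.
        version = self.latest[name] if version is None else version
        return {"name": name, "version": version,
                "tasks": self.definitions[(name, version)]}

    def restart_on_latest(self, execution):
        # An existing run can be restarted to pick up the newest definition.
        return self.start(execution["name"])


registry = WorkflowRegistry()
registry.register("order_processing", 1,
                  ["capture_payment", "calculate_invoice", "notify_invoice"])
old_run = registry.start("order_processing")

# Hypothetical refactor: version 2 adds an address check before payment.
registry.register("order_processing", 2,
                  ["verify_address", "capture_payment", "calculate_invoice", "notify_invoice"])
new_run = registry.start("order_processing")
restarted = registry.restart_on_latest(old_run)

print(old_run["version"], new_run["version"], restarted["version"])
```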

Key features for durable execution

In summary, Conductor bolsters your application durability with these key features:

  • Resilient Engine—built atop decoupled infrastructure, created for durability, speed, and redundancy
  • Elastic Capacity—run millions of concurrent workflows with minimal latency
  • Failure Handling—get native support for rate limits, retry policies, timeouts, and more
  • Introspection—inspect your workflow performance for troubleshooting
  • Versioning—safely and cleanly version your workflows with no disruption to production runs
  • Metrics—get aggregated insights into workflow performance

Durability for any use case

As a general-purpose orchestration engine, Conductor is versatile enough for a wide range of use cases: compliance checks in banking and finance, media encoding in entertainment, or shipping automation in logistics. Check out our case studies to discover how organizations across industries use Conductor.

Orkes Cloud is a fully managed and hosted Conductor service that can scale seamlessly according to your needs. When you use Conductor via Orkes Cloud, your engineers don’t need to worry about set-up, tuning, patching, and managing high-performance Conductor clusters. Try it out with our 14-day free trial for Orkes Cloud.
