
2 posts tagged with "workflow"


· 8 min read

Editor’s Note: This post was originally published in Sep 2022 on Medium.

Workflow Start Request

Netflix Conductor is a well-known platform for service orchestration that lets developers build stateful applications in the cloud without having to worry about resiliency, fault tolerance, and scale. With Conductor, you can build stateful applications, both automated and human-actor-driven flows, by composing services that are mostly stateless.

At Orkes, we regularly scale Conductor deployments to handle millions of workflows per day, and we often get questions from the community about how to scale Conductor and whether the system can handle billions of workflows.


We ran a benchmark to test the limits of Conductor’s scalability using a simple single-node setup on a laptop. Given our experience with Conductor, the results were not surprising: with the right configuration, Conductor can scale to handle the demands of very large workloads.

We measured the peak stable throughput Conductor can achieve on the following parameters:

  • Number of workflow start requests per second
  • Number of task updates per second, which translates to the number of state transitions happening per second
  • Number of moderately complex workflows completing per second
  • Amount of data processed per second


We measured p50, p95 and p99 latencies for the critical APIs in Conductor:

  • Workflow Start
  • Task Update

Benchmarking Tools

We used wrk2, a fantastic tool for generating stable load on a server. wrk2 improves on wrk by adding the ability to generate sustained load at a specific rate (the -R parameter).

We created a moderately complex load-testing workflow with a total of 13 steps.

For the experiment, we created a set of load-testing workers that polled for the workflow's tasks and produced dummy output.


Conductor has a pluggable metrics system; we used Prometheus to capture metrics and Grafana to visualize them.

The Setup

The experiment was designed to send a sustained workload of 200+ workflow executions per second. The test workflow also embedded a sub-workflow with a single step, so during the experiment we were starting approximately 400+ workflows per second.

For the Conductor version, we used Orkes’ build of Conductor, which is tuned and customized to handle large workflows with minimal latency and optimized operational costs.

We have a version of this available under open source at

Conductor servers are horizontally scalable. For this test we ran a single-node server; of course, your production setup should have at least 3 nodes across multiple availability zones or racks for higher availability.


The Conductor server ran on a MacBook Pro with an M1 Max CPU and 64GB of RAM; the same machine also ran a single-node Redis server. The Postgres database ran on another MacBook Pro with an M1 Max CPU and 32GB of RAM, which also ran task workers. A Core i9 MacBook Pro with 32GB of RAM ran Prometheus, Grafana, a load-generator Docker container, and more task workers.

The systems communicated over a WiFi 6 network with no wired connectivity; wired networking could potentially have improved latencies a bit. The machines were roughly equivalent to AWS c7g.2xlarge instances (memory consumption in the tests stayed below the 16GB offered by that instance type).

Peak Workflow Start Requests

This test measures how many workflows can be started in a burst. To take network latency out of the equation, we ran wrk on the same host as the Conductor server. This gives a theoretical max requests/sec, assuming the network is limitless (which it isn’t).

Workflow start requests per second. Under normal load, the number of start requests averages about 1.8K/sec.

Latencies under high load at ~2K requests/sec

Workflow Completion Metrics

For this experiment, we asked a question:

If all the tasks in a workflow were instantaneous, how many workflows can we start and complete per second under sustained load, with minimal task backlog?

To test this, we used wrk2 to send a sustained load of 210 workflow execution requests per second, where each workflow contains a sub-workflow and worker tasks.

Workflow Execution Graph

Workflows getting completed per second

Task Level Metrics

Conductor publishes the depth of the pending task queue, which is useful when deciding when to scale your worker clusters up or down. Here is a snapshot of the worker queue depth. A sustained high number for a given task indicates worker resource starvation and a need to scale out the worker cluster to meet demand.

Pending queue size of tasks at a given point in time. Sustained high numbers indicate worker starvation and a need to scale out workers.
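As a sketch of how this metric can drive scaling decisions, here is a simple policy in Python. The function name, thresholds, and sampling scheme are illustrative assumptions, not part of Conductor:

```python
# Hypothetical autoscaling policy driven by pending-queue depth samples.
# Thresholds and names are illustrative, not Conductor APIs.

def scaling_decision(queue_depth_samples, high_water=1000, low_water=10):
    """Return 'scale_out', 'scale_in', or 'hold' from recent depth samples.

    Sustained depth above high_water suggests worker starvation;
    consistently near-empty queues suggest over-provisioned workers.
    """
    if all(d > high_water for d in queue_depth_samples):
        return "scale_out"
    if all(d < low_water for d in queue_depth_samples):
        return "scale_in"
    return "hold"
```

The key detail is using a window of samples rather than a single reading, so a momentary spike does not trigger a scale-out.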

We achieved a consistent throughput of 1,450+ task executions per second. This included polling for the task, executing the business logic (in our case, producing randomized test output), and workers updating the task status on the server. Each successful task completion initiates a state transition of the workflow.

Number of worker tasks getting updated per second.

Critical to throughput is the update of task execution results back to the Conductor server; we found this to be the API operation most directly responsible for the server's throughput. We optimized the server to limit tail latencies, ensuring p99 stays well in check.

Task update from worker latencies in milliseconds.

Task Poll

Conductor uses long-polling for task polling: the request waits until a task is available for the worker or the timeout expires, which was set to 100 ms in this experiment. Polling implements batching for more efficient use of the network and connections. In this test, workers polled with a batch size of 10.

Latencies for task poll
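The long-poll-with-batching behavior can be sketched in plain Python with an in-process queue. This is a simplified stand-in for the worker client, not Conductor's actual API: wait up to the poll timeout for a first task, then opportunistically drain up to the batch size:

```python
# Simulation of long-poll batching semantics (not the Conductor client):
# block up to `timeout` seconds for a first task, then fill the rest of
# the batch without further waiting.
import queue

def long_poll(task_queue, batch_size=10, timeout=0.1):
    batch = []
    try:
        # Block until one task arrives or the long-poll timeout expires.
        batch.append(task_queue.get(timeout=timeout))
    except queue.Empty:
        return batch  # timeout with no work: empty response
    # Opportunistically fill the rest of the batch without blocking.
    while len(batch) < batch_size:
        try:
            batch.append(task_queue.get_nowait())
        except queue.Empty:
            break
    return batch
```

Compared with polling one task per request, batching amortizes the per-request network cost across up to ten task deliveries.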

Data Processed by Workflow

We generated fake data for the test to simulate real-world scenarios in terms of data transfer. The amount of data processed by workflows affects the requirements for provisioned IOPS (in cloud environments) and network throughput.

The experiment averaged a sustained rate of ~80 MB/sec data processed.

Amount of data (task inputs and outputs) being processed at a given point in time.

Scaling to handle billions of workflows / month

The experiment used a total of three commodity machines (roughly equivalent to c7g.2xlarge instances on AWS). Throughout the experiment, the JVM heap size consistently remained below the 2GB mark.

The experiment created a workload of 210 moderately complex workflows per second, which, if run constantly for a month, generates about 540M workflows.
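The arithmetic behind that estimate, assuming a 30-day month:

```python
# Back-of-the-envelope check of the monthly workflow volume at the
# sustained rate measured in the experiment.
rate_per_sec = 210
seconds_per_month = 60 * 60 * 24 * 30   # 2,592,000 seconds in 30 days
workflows_per_month = rate_per_sec * seconds_per_month
print(workflows_per_month)              # 544320000, i.e. ~540M
```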

Conductor Servers

Conductor servers themselves are stateless and can be scaled out to handle larger workload demands. Each server node handles load based on its available CPU and network capacity.


Redis

Redis serves as the backbone for state management and for the queues that servers use to communicate. Scaling to a higher workload requires either scaling up Redis or using Redis cluster to better distribute the load.


Postgres

Postgres is used for indexing workflow data. Beyond disk storage, scaling Postgres comes down to two factors: 1) adequate CPU and 2) sufficient IOPS (especially in cloud environments) to handle writes under heavy workloads. Writes to Postgres are done asynchronously using durable queues (check out orkes-queues), but longer delays mean completed workflows remain in Redis for longer, requiring more Redis memory.


We often get asked: can Conductor be scaled to handle billions of workflows per month? The answer is a resounding YES, it can. If you would like to give it a try, check out our developer playground at

Orkes, founded by the founding engineers of Netflix Conductor, offers Conductor as a fully managed service in the cloud and on-prem. Check out our community edition for a fully open-source version of the Orkes stack.

If you are an enterprise and have a use case you would like to run as a PoC, please reach out to us.

Don’t forget to give us a ⭐️

· 5 min read
Doug Sillars

What is a workflow? Wikipedia says "A workflow consists of an orchestrated and repeatable pattern of activity, enabled by the systematic organization of resources into processes that transform materials, provide services, or process information."

None of that is incorrect. But it is certainly a mouthful. If I am asked in an elevator what workflow orchestration is, I like to use analogies to make the point as approachable as possible:

"It's like a recipe for code. A recipe has a series of steps that must be run in a specific order. It can also have different options for food allergies, or for different ingredients you might have on hand. A workflow is a recipe for your code, and can be built to handle many of the changes that might reasonably occur when the code runs."

Workflows as recipes

You may have seen one of the many videos out there of parents teaching kids to code by writing out the steps to create a peanut butter and jelly sandwich.

We're not going to be quite as silly as that Dad, but we'll run through some instructions on how to create a PB&J from Instructables.

If you look at the URL of that recipe - it appears it took 4 tries to get it right :D

The steps are (skipping some substeps for clarity):

  1. Gather Your Ingredients for the Sandwich
  2. Pull Out Two Slices of Bread
  3. Open Peanut Butter and Jelly
  4. Spread the Peanut Butter Onto One Slice of Bread
  5. Spread the Jelly Onto the Other Slice of Bread
  6. Combine the Two Slices
  7. Clean Up Your Workspace
  8. Enjoy Your Sandwich

(if you read closely, I skipped 3 steps - wearing gloves, removing the crust and cutting the sandwich in half... because, well - that's all just ridiculous)

The eight steps above are a workflow. They must be performed in that order to create a sandwich - you cannot spread the PB before you lay out the bread.
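The eight steps above can be sketched as a strictly ordered workflow in Python. The step names follow the recipe; the runner is purely illustrative, not Conductor code:

```python
# The PB&J recipe as an ordered list of tasks. A sequential workflow
# runs each step only after the previous one completes.
STEPS = [
    "gather_ingredients",
    "pull_out_bread",
    "open_pb_and_jelly",
    "spread_peanut_butter",
    "spread_jelly",
    "combine_slices",
    "clean_up",
    "enjoy",
]

def run_workflow(steps):
    completed = []
    for step in steps:
        completed.append(step)  # a real engine would invoke a task here
    return completed
```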

Building a Conductor Workflow

We can turn these 8 steps into a Conductor workflow.

PB&J example workflow

If you'd like to see the definition of this workflow, you can check out the code on the Orkes Playground. It's free to sign up.

This workflow shows clear steps with arrows pointing to the next step - so it is visually possible to see how the workflow will progress.

Improving the workflow

Each task is run by a microservice, so making changes and adding tasks is easy to do. In this example - you see that the recipe calls for PB to be spread first, and then the jelly. This works for a single human - these steps each take 2 hands. But there is no reason that the jelly cannot go first, and THEN the PB. Or - if you had more hands - the PB and the Jelly could be spread simultaneously.

Independent tasks

Let's remove the PB dependency from the jelly spreading - as the order of PB vs. jelly does not matter.

We can do this in Conductor with a FORK. A fork splits your workflow into two asynchronous paths, and then a JOIN reconnects the workflow into a single path.

We can now use a fork where one branch applies the PB and a second branch applies the jelly:

PB&J example workflow

This is version 2 of the workflow, and you can see it in the playground.

Now, these operations are independent, and if there is space in the jelly task queue, the jelly can be applied ahead of the PB.
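A minimal sketch of the fork/join idea using Python threads, with two trivial placeholder tasks. This stands in for Conductor's FORK and JOIN; it is not the Conductor API:

```python
# FORK: run the PB and jelly tasks concurrently.
# JOIN: wait for both results before combining the slices.
from concurrent.futures import ThreadPoolExecutor

def spread_peanut_butter():
    return "slice with PB"

def spread_jelly():
    return "slice with jelly"

def make_sandwich():
    with ThreadPoolExecutor(max_workers=2) as pool:
        pb = pool.submit(spread_peanut_butter)   # fork branch 1
        jelly = pool.submit(spread_jelly)        # fork branch 2
        # .result() blocks until each branch finishes: the join.
        return f"{pb.result()} + {jelly.result()}"
```

Either branch may finish first; the join simply waits for both before the workflow continues on a single path.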

Recipe variations

Often, recipes have variations in preparation, and they can be read like an IF statement in programming: if (fresh tomatoes) {do x} else if (tinned tomatoes) {do y}.

Cooking a burrito

Let's look at a common example - from a burrito in my freezer. The preparation directions vary depending on the cooking method.

burrito instructions

We can emulate this in a workflow using a SWITCH task. The switch task takes the workflow's ovenType input (${workflow.input.ovenType}) and makes a decision based on this value. The default case is the microwave, and the second decisionCase is set to "oven". From that input, the different tasks can be run.

burrito workflow
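The switch logic can be sketched in plain Python. The ovenType input name matches the post; the zap/flip/bake sequences are illustrative placeholders, not the exact workflow definition:

```python
# A plain-Python sketch of the burrito SWITCH task.
def cook_burrito(oven_type):
    # decisionCase "oven" takes the bake path; any other value falls
    # through to the default (microwave) path.
    if oven_type == "oven":
        return ["bake", "flip", "bake"]
    return ["zap", "flip", "zap"]  # default case: microwave
```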


Workflows are a series of tasks that must be run in a certain order. In this post, we used cooking recipes as an analogy for workflows: they too are a series of tasks that must be followed in a certain order.

We created sample workflows for making a peanut butter and jelly sandwich (version 1 and version 2) and another workflow to cook a frozen burrito with microwave or oven instructions.

We were able to reuse a number of tasks:

  • The zap task is used twice in the microwave branch.
  • The bake task is reused in the oven branch.
  • The flip task is used in both branches.

When reusing tasks, the taskReferenceName (shown at the top of the box) must be unique.

If you're curious about workflow orchestration, it might be a fun exercise to build your favorite recipe as a workflow. You can build on the workflows from this post in our free playground. Feel free to share what you come up with in our Discord. We love seeing creative uses of workflows!