Guide to Scaling Workers

Workers execute the business logic in workflow applications. Scaling and tuning workers depends on the metrics described below.

Conductor servers publish metrics that let you monitor the health of your workers. The PromQL queries below show how to use these metrics.

Tip: Each of the metrics below carries taskType as a tag, which you can use to monitor the metric for a specific task.

Pending requests (Gauge)

max(task_queue_depth{taskType="<task_definition_name>"})

How to use the metric:

The goal is to keep the queue depth steady (it may not be zero if the tasks are long-running).
Configure alerts and autoscaling policies for your workers based on how much the queue depth increases or decreases over a given time period.
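As a rough illustration of the policy described above, the following Python sketch turns a window of queue depth samples into a scaling decision. The function name `scaling_decision`, the `tolerance` parameter, and its default of 10% are hypothetical and not part of Conductor; in practice the samples would come from the task_queue_depth query and the decision would feed whatever autoscaler you use.

```python
from typing import Sequence


def scaling_decision(depth_samples: Sequence[float], tolerance: float = 0.1) -> str:
    """Return 'scale_up', 'scale_down', or 'hold'.

    depth_samples holds queue depth readings for one task type,
    ordered oldest to newest (hypothetical input format).
    """
    if len(depth_samples) < 2:
        return "hold"  # not enough data to see a trend

    first, last = depth_samples[0], depth_samples[-1]
    # Relative change over the window; max(first, 1.0) avoids dividing by zero
    # when the queue started out empty.
    change = (last - first) / max(first, 1.0)

    if change > tolerance:
        return "scale_up"    # queue is growing: workers are falling behind
    if change < -tolerance:
        return "scale_down"  # queue is draining: capacity may be excessive
    return "hold"            # depth is roughly constant, which is the goal
```

For example, a window of `[100, 105, 150]` yields `scale_up`, while `[100, 98, 102]` stays within the tolerance and yields `hold`.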

Task throughput (Counter)

task_completed_seconds_count is published as a counter with taskType as a tag.

rate(task_completed_seconds_count{taskType="<task_definition_name>"}[$__rate_interval])

How to use the metric:

The metric shows the throughput, that is, the number of tasks completed per second. The goal is to keep the throughput at the level the application needs.
Configure alerts and autoscaling policies for your workers based on increases or decreases in the throughput.
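To make the throughput figure concrete, here is a simplified Python sketch of the per-second rate that a query like the one above derives from a counter. It is an approximation for illustration only: PromQL's rate() also extrapolates to the window boundaries, which this sketch does not attempt. The function name `throughput_per_second` and the sample format are assumptions.

```python
from typing import Sequence, Tuple


def throughput_per_second(samples: Sequence[Tuple[float, float]]) -> float:
    """Average completions per second over a window of counter samples.

    samples holds (timestamp_seconds, counter_value) pairs, oldest first
    (hypothetical input format).
    """
    if len(samples) < 2:
        return 0.0

    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A counter only goes up; a drop means the process restarted and the
        # counter began again from zero, so the new value is the increase.
        increase += cur - prev if cur >= prev else cur

    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0
```

With samples `[(0, 100), (30, 160), (60, 220)]`, the counter grows by 120 over 60 seconds, so the function returns 2.0 tasks per second.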

Task queue wait time

This metric measures how long a task sat in the queue before a worker picked it up.

max(task_queue_wait_seconds{quantile="<quantile>",taskType="<task_definition_name>"})

How to use the metric:

If the value is large (more than a few seconds), check the following:

Check the number of workers and whether they are all busy processing tasks. If they are, increase the number of workers.
Check the workers' polling interval and reduce it if needed.

Note: Reducing the polling interval could increase the number of API requests to the server, triggering system limits.
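The trade-off in the note can be estimated with simple arithmetic. The Python sketch below assumes that each poller sends one poll request per polling interval while idle; that is a simplification, and actual behavior depends on the worker SDK and its configuration. The function name and parameters are hypothetical.

```python
def poll_requests_per_second(
    worker_count: int,
    polling_interval_ms: float,
    pollers_per_worker: int = 1,
) -> float:
    """Estimate the poll requests per second that idle workers send to the server.

    Assumes one request per poller per polling interval (a simplification).
    """
    if polling_interval_ms <= 0:
        raise ValueError("polling_interval_ms must be positive")
    return worker_count * pollers_per_worker * 1000.0 / polling_interval_ms
```

Under this assumption, 10 workers polling every 1000 ms send about 10 requests per second, while the same workers polling every 100 ms send about 100, so compare the estimate against your server's limits before lowering the interval.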