About the author
Rakesh Upadhayaya
Staff Engineer @ Freshworks
Rakesh is a Staff Engineer at Freshworks, where he works as a technical guide and developer, reimagining workflows using Orkes Conductor.
In today's rapidly evolving digital landscape, the ability to execute complex workflows at scale is vital. Imagine running a billion workflows per day, each consisting of 10-20 nodes—an intimidating task, right? However, with the right architecture, tools, and strategies, this challenge is not only achievable but also sustainable. In this blog, we’ll explore the challenges, key considerations, and best practices for running a billion moderate-sized workflows daily.
Understanding the Workflow Challenge
Workflows are the backbone of automated systems, enabling the orchestration of complex tasks across various services. A workflow with 10-20 nodes might involve anything from data validation, processing, and storage to triggering asynchronous APIs, sending emails, and generating reports. While running such workflows at a small scale is manageable, scaling up to billions of executions per day introduces significant challenges.
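To make this concrete, here is a minimal sketch of how such a workflow might be defined using Conductor's Java metadata classes (WorkflowDef, WorkflowTask). The workflow and task names are hypothetical, and a real definition would also carry inputs, outputs, and task-specific configuration:

```java
import com.netflix.conductor.common.metadata.workflow.WorkflowDef;
import com.netflix.conductor.common.metadata.workflow.WorkflowTask;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WorkflowDefinitionSketch {

    public static WorkflowDef buildWorkflow() {
        // Hypothetical task names for illustration only.
        List<String> taskNames = Arrays.asList(
                "validate_input", "process_record", "persist_result",
                "trigger_async_api", "send_email", "generate_report");

        List<WorkflowTask> tasks = new ArrayList<>();
        for (String taskName : taskNames) {
            WorkflowTask task = new WorkflowTask();
            task.setName(taskName);
            task.setTaskReferenceName(taskName + "_ref");
            task.setType("SIMPLE");                  // executed by a custom worker
            tasks.add(task);
        }

        WorkflowDef def = new WorkflowDef();
        def.setName("sample_processing_workflow");   // hypothetical workflow name
        def.setVersion(1);
        def.setTasks(tasks);
        // Additional fields (owner email, timeouts) are needed when registering the definition.
        return def;
    }
}
```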
Leveraging Conductor’s Out-of-the-Box Features
We relied on Conductor's vision for durable computing, utilising its durability features at every step and for each worker. These included retry mechanisms, rate limiting, callback options, and even making certain tasks optional so that workflows continue without being impacted by sudden failures.
We also extensively used Conductor’s UI for debugging, analysing input/output, monitoring execution statistics, and tracing exceptions in case of failures.
Key Challenges
Ensuring that your infrastructure can handle the massive volume of workflows without compromising performance.
Maintaining high availability and fault tolerance to ensure that workflows are executed correctly and on time.
Optimizing resource utilization to keep costs under control while maximizing throughput.
Performance Optimisation Techniques
Conductor with Redis Queueing and Columnar DB: Redis was utilized for task queuing, and Columnar DB was employed for Conductor persistence to achieve better IOPS. Enhanced Columnar DB clusters provide superior auto-scaling and IOPS with minimal latency. Additionally, Columnar DB offers advanced observability dashboards for managing I/O and monitoring throttling.
Task Queue Depth: Tasks with deep queues were identified and scaled horizontally by provisioning more workers for those tasks. This involved assigning different queues to the workers and optimising the number of threads for each custom/system worker.
Batch Polling: The latest Conductor SDK, which supports batch polling of tasks from respective queues, was used to improve efficiency.
Optimised Poll Intervals: Optimal poll intervals were determined based on queue depth and worker execution times, fine-tuning the system for better performance.
Thread Management: The appropriate number of threads was allocated to different workers based on their queue depths and participation in workflows, as illustrated in the worker configuration sketch after this list.
Resource Auto-Scaling: Auto-scaling policies were implemented based on real-time metrics like CPU usage, memory consumption, and request rate. This ensures the system scales up during peak times and scales down during low demand, optimizing resource usage and costs.
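The sketch below illustrates how thread counts and poll intervals are typically wired up with the Conductor Java client (Worker, TaskClient, TaskRunnerConfigurer). The worker name, server URL, and numbers are hypothetical, and the exact builder options (for example per-task thread counts and batch polling behaviour) vary between SDK versions:

```java
import com.netflix.conductor.client.automator.TaskRunnerConfigurer;
import com.netflix.conductor.client.http.TaskClient;
import com.netflix.conductor.client.worker.Worker;
import com.netflix.conductor.common.metadata.tasks.Task;
import com.netflix.conductor.common.metadata.tasks.TaskResult;

import java.util.Arrays;
import java.util.List;

public class WorkerRuntimeSketch {

    // A hypothetical custom worker for one of the workflow tasks.
    static class ProcessRecordWorker implements Worker {

        @Override
        public String getTaskDefName() {
            return "process_record";                 // task/queue this worker polls
        }

        @Override
        public int getPollingInterval() {
            return 200;                              // poll interval in ms, tuned from queue depth
        }

        @Override
        public TaskResult execute(Task task) {
            TaskResult result = new TaskResult(task);
            // ... business logic goes here ...
            result.setStatus(TaskResult.Status.COMPLETED);
            return result;
        }
    }

    public static void main(String[] args) {
        TaskClient taskClient = new TaskClient();
        taskClient.setRootURI("http://localhost:8080/api/");   // hypothetical Conductor server URL

        List<Worker> workers = Arrays.asList(new ProcessRecordWorker());

        // Thread count controls how many tasks are polled and executed in parallel;
        // newer SDKs fetch up to this many tasks per batch poll.
        TaskRunnerConfigurer configurer =
                new TaskRunnerConfigurer.Builder(taskClient, workers)
                        .withThreadCount(20)
                        .build();
        configurer.init();                            // starts the polling loops
    }
}
```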
Benchmarking
After experimenting with multiple configurations, we identified two key approaches:
Cell Architecture: Each module should be isolated with its own custom worker services and Conductor instances to avoid noisy neighbour issues and unexpected loads from other modules.
Per Unit Load: Identify the right number of workflows executed per pod, using CPU (min-max) of 2-3 cores and memory (min-max) of 3-4 GB with a heap size of 2 GB. This can be further tuned based on the nature and type of workflows under execution; a rough sizing example follows below.
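As a back-of-the-envelope example of that math, assume a pod of this shape sustains roughly 50 workflows per second and that peak traffic is about twice the daily average; both figures are assumptions for illustration, not measured numbers. One billion workflows per day averages out to roughly 11,574 workflows per second, which translates into min/max pod counts like so:

```java
public class PodSizingSketch {

    public static void main(String[] args) {
        long workflowsPerDay = 1_000_000_000L;                  // target volume from the post
        double avgPerSecond = workflowsPerDay / 86_400.0;       // ~11,574 workflows/sec on average

        // Hypothetical per-pod throughput for a 2-3 vCPU / 3-4 GB pod with a 2 GB heap.
        double workflowsPerPodPerSecond = 50.0;

        double peakFactor = 2.0;                                // assumed peak-to-average ratio

        long minPods = (long) Math.ceil(avgPerSecond / workflowsPerPodPerSecond);
        long maxPods = (long) Math.ceil(avgPerSecond * peakFactor / workflowsPerPodPerSecond);

        System.out.printf("avg rate: %.0f wf/s, min pods: %d, max pods: %d%n",
                avgPerSecond, minPods, maxPods);
    }
}
```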
By applying this math, we can also establish the minimum and maximum pod limits for auto-scaling, ensuring that the system remains both efficient and cost-effective.
If you're looking to streamline your workflow orchestration, eliminate the complexities, and focus solely on your business use cases while reaping all the benefits that Conductor offers, then reach out to orkes.io. They can help you leverage Conductor as your orchestration engine, allowing you to concentrate on what matters most: your business. To gain a deeper understanding of how managed Conductor scales at Orkes, please take a moment to review this information. It will provide insights into how Orkes manages scaling, ensuring optimal performance and reliability for your workflows.
Conclusion
By implementing these strategies and optimisations, we can successfully scale a system to execute a billion moderate-sized workflows daily. As we continue to refine our approach, we look forward to pushing the boundaries of what’s possible in workflow automation.
Please feel free to reach out to me if you need any more information on this topic. I'm here to help!
Happy Coding!