© 2026 Orkes. All Rights Reserved.
AGENTIC ENGINEERING

Building Durable Loops with Conductor, Part 1: Why Agentic Loops, and Why Now?

Nick Lotz
Content Engineer
Last updated: July 1, 2026
June 30, 2026
12 min read


In many respects, an AI agent is a loop. It asks a model what to do next, takes an action, and repeats until the task is done. This pattern certainly predates large language models, but LLMs made it the default behavior in agentic software. For example, the ReAct formulation of interleaved reasoning and acting (released in 2022!) is a loop, and so is every self-correcting variant that followed it.

Loops are easy to write and just as easy to underestimate. A while statement might behave perfectly in isolation, but it stops behaving well the moment the loop has to outlive a single short session. Think slow tool calls, an approval that takes a day, or a redeploy that restarts the process several times. When the process holding the loop dies, the loop's state dies with it.

This series is about building a more robust approach. A durable loop keeps its iteration state in a runtime instead of in your process memory. We'll start by discussing how we got here: why agentic systems are loops, why the naive loop fails as tasks get longer, and what specifically makes a loop durable rather than merely restartable.

We'll then ground these concepts in a small working example on Conductor, the orchestration engine powered by Orkes. Later, we'll build the patterns production agents need on top of a durable loop.

A loop is an agent's control structure

A modern agent can be thought of as a control loop with a model inside it. On each pass, the model reads the current state and proposes an action, something executes that action, and the result is appended to the state. Then the loop decides whether to go again.

This structure has been implemented in several different ways. ReAct runs a single loop that alternates a reasoning step with a tool call. Reflexion wraps an outer loop around it: after an attempt, the agent writes a critique of its own trajectory and runs the inner loop again with that reflection in context. Plan-and-execute agents split the work in two, one pass to draft a plan and a loop to carry it out and revise it.

But ultimately a survey of these systems finds a common architecture underneath the variations: memory, a step that decides on an action, and execution, iterated until a goal is met.

Figure: The agent loop (read memory, decide, execute, repeat). ReAct, Reflexion, and plan-and-execute all run the same loop: read the memory, decide on an action, execute it, and repeat until the goal is met.
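To make the shared structure concrete, here is a minimal sketch of the in-process version of that loop. The names run_agent_loop, decide, and execute are hypothetical stand-ins (a real decide would call a model, a real execute would call a tool); nothing here is Conductor-specific.

```python
def run_agent_loop(decide, execute, goal, max_steps=10):
    """Naive in-process agent loop: read memory, decide, execute, repeat."""
    memory = [f"goal: {goal}"]              # all state lives in this process
    for _ in range(max_steps):              # ceiling, so the loop always ends
        action = decide(memory)             # the model proposes the next action
        if action == "finish":
            break
        result = execute(action)            # something executes that action
        memory.append(f"{action} -> {result}")  # the result is appended to the state
    return memory


# Hypothetical stubs standing in for a model call and a tool call.
def decide(memory):
    return "finish" if len(memory) >= 3 else f"lookup_{len(memory)}"


def execute(action):
    return f"result of {action}"


memory = run_agent_loop(decide, execute, "demo")
```

Note that memory is an ordinary local variable. If the process exits between two passes, the list and the loop position are both gone, which is the weakness the rest of this post addresses.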

Because the loop is the control structure, its reliability is the agent's reliability. Anything the loop forgets, the agent forgets. Any pass that silently runs twice, the agent does twice. The question of how durable your agent is reduces to the question of how durable its loop is.

Why naive loops break at length

The case against the in-process loop starts as an arithmetic problem. Suppose every step of an agent succeeds, end to end, 99% of the time, and that steps fail independently. A 20-step task then succeeds about 82% of the time, because 0.99 to the 20th power is about 0.82.

A 100-step run, the size of a serious multi-agent task once sub-agent calls are counted, drops to about 37%. Past roughly 69 steps, most runs hit at least one failed step.
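The arithmetic is easy to check. This short sketch assumes steps fail independently with the same per-step success rate; the function names are illustrative only.

```python
import math


def run_success_probability(step_success, steps):
    """Probability that every one of `steps` independent steps succeeds."""
    return step_success ** steps


def steps_until_below(step_success, target=0.5):
    """Smallest number of steps at which the run's success rate drops below `target`."""
    return math.floor(math.log(target) / math.log(step_success)) + 1


p20 = run_success_probability(0.99, 20)     # about 0.82
p100 = run_success_probability(0.99, 100)   # about 0.37
coin_flip = steps_until_below(0.99)         # 69 steps

print(f"20 steps: {p20:.2f}, 100 steps: {p100:.2f}, below 50% at {coin_flip} steps")
```

At 99% per-step reliability, a run longer than 68 steps is more likely to hit a failure than not.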

Length also exposes failure modes that have nothing to do with the model's judgment! An in-process loop loses its state when the process dies mid-iteration. It repeats side effects when a restart replays a step whose effect already landed, so an agent that crashes after charging a card and before recording the charge will charge again on retry.

It also keeps no durable record of what the agent saw and decided, so a run that does something strange at three in the morning is unauditable after the fact. And it offers no way to meaningfully observe or intervene: once the loop is running, a human or a supervising agent cannot tell it to stop or reconsider.

A while loop has one response to any of this: start over from the top. That spends again every token the run had already used, and it is unsafe wherever an earlier step already had an effect.

Underneath sits a distinction the rest of this series depends on: durability is not continuity. Checkpointing your state to a database gives you durability, meaning your data survives. It does not give you continuity, meaning knowledge of which step was in flight when the process died and whether its effect already landed.

Continuity is what lets a system guarantee that a step runs effectively once, however many times the process dies between attempting it and recording it. A durable loop needs both. It has to persist the agent's state and the position of the loop itself.

What makes a loop durable

Durable execution means that a computation's progress is persisted as it runs, so after a failure it resumes from the last committed point instead of restarting.

The idea predates agents by decades. Its lineage runs from the Sagas model of long-lived transactions, through event-sourced systems that reconstruct state by replaying a log, into the workflow engines that now carry agents. What they share is a commitment to treating each step's completion as a durable fact rather than a line of code that happened to run.

A loop is durable when four things hold:

  • The loop counter lives in the runtime. Which iteration you are on is server state, not a variable in your process. A restart reads it back instead of resetting to zero.
  • Each iteration is checkpointed. A pass's result is committed before the next pass starts, so the boundary between iterations is also a recovery point.
  • Iteration boundaries are idempotent. A step retried after a crash has the same effect as running it once, which is what makes it safe to re-run the step that was in flight when the process died.
  • Termination is explicit. The loop has both a condition and a ceiling, so it ends on purpose instead of running until it exhausts a budget or a context window.

Figure: Four properties of a durable loop. A loop is durable when the runtime owns the counter, commits each iteration, keeps steps idempotent, and terminates on purpose.
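The four properties can be illustrated with a toy stand-in for the runtime. This is a sketch under stated assumptions, not how Conductor is implemented: LoopStore, run_loop, and flaky_step are hypothetical names, the "runtime" is just a JSON file, and the step is idempotent because it has no external side effect.

```python
import json
import os
import tempfile


class LoopStore:
    """Toy runtime: keeps the iteration counter and loop state on disk."""

    def __init__(self, path):
        self.path = path

    def load(self):
        if not os.path.exists(self.path):
            return {"iteration": 0, "state": [], "done": False}
        with open(self.path) as f:
            return json.load(f)

    def commit(self, record):
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(record, f)
        os.replace(tmp, self.path)          # atomic swap: a pass is committed or it is not


def run_loop(store, step, ceiling=6):
    record = store.load()                   # the counter is read back, not reset to zero
    while not record["done"] and record["iteration"] < ceiling:  # condition and ceiling
        n = record["iteration"] + 1
        result, done = step(n, record["state"])
        record = {"iteration": n, "state": record["state"] + [result], "done": done}
        store.commit(record)                # checkpoint before the next pass starts
    return record


crashed = {"once": False}


def flaky_step(n, state):
    """Simulates a worker that dies in the middle of pass 3, one time only."""
    if n == 3 and not crashed["once"]:
        crashed["once"] = True
        raise RuntimeError("worker died mid-pass")
    return f"pass-{n}", n >= 4


store = LoopStore(os.path.join(tempfile.mkdtemp(), "loop.json"))
try:
    run_loop(store, flaky_step)             # first process: crashes during pass 3
except RuntimeError:
    pass
after_crash = store.load()
final = run_loop(store, flaky_step)         # second process: resumes at pass 3
```

After the simulated crash the store still holds two committed passes, and the second call picks up at pass 3 rather than pass 1.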

For example, Conductor models this with a loop task called DO_WHILE. The tasks inside it run once per iteration, and a loopCondition decides whether to repeat. The loop's state is owned by the Conductor server, not by the workers that do the work, which is the property that makes everything later in this post possible.

From a while-loop to a saga

A durable loop is one point on a spectrum. At one end is the in-process while loop: maximally flexible, and gone the instant the process dies. At the other is the late-bound saga, where the execution graph does not exist when the run starts and the model synthesizes it edge by edge, committing each edge to a ledger before it runs.

A durable loop sits between them. Its structure is fixed, a body and a condition, but its state and position belong to the runtime.

This is a feature, not a limitation. A loop whose structure is known ahead of time is easy to reason about, easy to bound, and easy to recover, which is exactly what you want for the parts of an agent that repeat a well-defined step until a condition holds. Where the plan itself has to be invented as the agent goes, you reach for a saga. Where the agent repeats a known step, a durable loop is the smaller and sturdier tool.

Building one with Conductor

Let's make it concrete. Start by launching a local Conductor instance via the official Docker container:

bash
docker run --init -p 8080:8080 -p 5000:5000 conductoross/conductor:latest

A Conductor workflow is defined as a JSON document. This one runs a step called agent_think until the step reports done, or until a ceiling of six iterations:

json
{
  "name": "durable_agent_loop",
  "version": 1,
  "tasks": [
    {
      "name": "agent_loop",
      "taskReferenceName": "agent_loop",
      "type": "DO_WHILE",
      "loopCondition": "if ($.agent_loop['iteration'] < 6 && $.think['done'] != true) { true; } else { false; }",
      "loopOver": [
        {
          "name": "agent_think",
          "taskReferenceName": "think",
          "type": "SIMPLE",
          "inputParameters": {
            "goal": "${workflow.input.goal}",
            "draft": "${think.output.draft}",
            "iteration": "${agent_loop.output.iteration}"
          }
        }
      ]
    }
  ]
}

Two references show the durability aspect. $.agent_loop['iteration'] is the loop counter, maintained by the server. ${think.output.draft} feeds the previous pass's output into the next pass, which means the running state lives in the workflow execution rather than in any worker.

Figure: The DO_WHILE loop in the Conductor UI. The DO_WHILE task is the loop. Conductor owns the iteration count and the per-pass state; the worker only does one pass at a time.
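The termination logic in loopCondition can be traced in plain Python. This sketch mirrors the expression only; it does not call Conductor. It assumes the iteration counter reads 1 after the first pass and that the condition is evaluated after each pass, as in a do-while. The names should_continue and count_passes are hypothetical.

```python
def should_continue(iteration, done, ceiling=6):
    """Mirror of the loopCondition: repeat while under the ceiling and not done."""
    return iteration < ceiling and done is not True


def count_passes(done_on_pass=None, ceiling=6):
    """Count how many passes run if the step reports done on `done_on_pass` (or never)."""
    iteration = 0
    while True:
        iteration += 1                      # the body runs first, then the check
        done = done_on_pass is not None and iteration >= done_on_pass
        if not should_continue(iteration, done, ceiling):
            return iteration


passes_when_done_at_4 = count_passes(done_on_pass=4)
passes_when_never_done = count_passes(done_on_pass=None)
```

Under those assumptions, a step that reports done on its fourth pass stops the loop there, and a step that never reports done is cut off after six passes.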

Putting an agent in the loop

The step itself is an ordinary worker: a process that polls Conductor for agent_think work, calls a model, and reports the result. Here it runs one self-refinement pass with gpt-5.5 using the critique-and-improve move that Reflexion formalizes.

python
import json
import time

import requests
from openai import OpenAI

client = OpenAI()                       # reads OPENAI_API_KEY
SERVER = "http://localhost:8080/api"
POLL_INTERVAL = 1.0                     # seconds to wait when the queue is empty
SYSTEM = (
    "You refine a definition over successive passes. Make one improvement per "
    "pass and reply as JSON {draft, critique, done}. Always set done=false "
    "before pass 4; from pass 4 on, set done=true only when the draft is "
    "accurate, concrete, names the failure it prevents, and is 45 words or fewer."
)

def refine(goal, draft, n):
    """Run one self-refinement pass and return the parsed {draft, critique, done} dict."""
    resp = client.chat.completions.create(
        model="gpt-5.5",
        response_format={"type": "json_object"},
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": json.dumps(
                      {"goal": goal, "current_draft": draft, "pass": n})}])
    return json.loads(resp.choices[0].message.content)

while True:                             # a worker is just a process that polls
    r = requests.get(f"{SERVER}/tasks/poll/agent_think", timeout=30)
    if r.status_code == 204 or not r.text.strip():
        time.sleep(POLL_INTERVAL)       # nothing queued; wait instead of busy-polling
        continue
    task = r.json()
    if not task or not task.get("taskId"):
        time.sleep(POLL_INTERVAL)
        continue
    d = task["inputData"]
    out = refine(d["goal"], d.get("draft") or "", d.get("iteration") or 0)
    # Report the finished pass; Conductor decides whether to schedule another.
    requests.post(f"{SERVER}/tasks", timeout=30, json={
        "taskId": task["taskId"],
        "workflowInstanceId": task["workflowInstanceId"],
        "status": "COMPLETED", "outputData": out}).raise_for_status()

Register the workflow and start a run:

bash
curl -X POST localhost:8080/api/metadata/taskdefs -H 'Content-Type: application/json' \
  -d '[{"name":"agent_think","responseTimeoutSeconds":20,"retryCount":3}]'
curl -X POST localhost:8080/api/metadata/workflow -H 'Content-Type: application/json' \
  -d @durable_agent_loop.json
curl -X POST localhost:8080/api/workflow/durable_agent_loop -H 'Content-Type: application/json' \
  -d '{"goal":"Define durable execution for a senior engineer."}'

Notice there is no agent loop in the worker. It handles one pass and returns; Conductor decides whether to schedule another. On this run the agent took four passes, about sixteen seconds, before it judged the draft good enough and set done to true.

Figure: Four gpt-5.5 passes on the timeline, each one iteration of the same DO_WHILE loop, recorded by the server as they ran.

It converged on:

Durable execution is a runtime model that persists execution history, checkpoints, and recorded external I/O results so, after crashes, restarts, or redeploys, it can deterministically resume from the last committed step instead of re-running completed work, preventing duplicate side effects.

Surviving a crash

Here is the part the in-process loop cannot do. We start the loop, let two passes commit, and kill -9 the worker while the third pass is mid-flight, in the middle of its model call. The process is gone. Nothing is running.

The loop does not go with it. The interrupted pass sits as IN_PROGRESS on the server until its response timeout elapses; the server then marks it TIMED_OUT and re-queues it.

We start a fresh worker. It polls, picks up the re-queued pass, finishes it, and the loop runs to completion. The first two passes are never repeated, because the server already committed them.

Figure: The interrupted pass times out and re-runs on a new worker. Pass think__3 timed out when its worker was killed, then ran again on a fresh worker and completed. The loop finished without losing the earlier passes.

This is the continuity we mentioned earlier, not just durability. The server knew which iteration was in flight, knew it had not committed, and handed it to another worker. The two committed passes were not lost and not repeated, so the loop resumed at the third pass instead of restarting from the top.

Designing loops that converge

A durable loop removes the penalty for running long, which makes it easy to write loops that never stop. Two guards matter. First, give every loop a ceiling, so a model that never declares itself finished still terminates. Second, make each step idempotent, so the re-queue after a crash does not double an effect. Refining a draft was safe to retry because it has no external side effect; a step that charges a card or sends an email needs an idempotency key to be equally safe.
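One way to picture an idempotency key is sketched below. Everything in it is an assumption for illustration: idempotency_key and FakePayments are hypothetical names, the key layout is one possible choice, and a real payment provider would do the deduplication on its side. The point is that the key is derived from the workflow, the task reference, and the iteration, not from the attempt, so a retry of the same pass reuses it.

```python
import hashlib


def idempotency_key(workflow_id, task_ref, iteration):
    """Same pass of the same run always yields the same key, however often it is retried."""
    raw = f"{workflow_id}:{task_ref}:{iteration}"
    return hashlib.sha256(raw.encode()).hexdigest()


class FakePayments:
    """Stand-in for a provider that deduplicates requests by idempotency key."""

    def __init__(self):
        self.charges = {}

    def charge(self, key, amount_cents):
        if key not in self.charges:         # only the first request with this key lands
            self.charges[key] = amount_cents
        return self.charges[key]


payments = FakePayments()
key = idempotency_key("wf-123", "charge_card", 3)

payments.charge(key, 4999)                  # first attempt; the worker crashes afterwards
payments.charge(key, 4999)                  # retry of the same pass reuses the key
total_charged = sum(payments.charges.values())
```

The retried pass sends the same key, so the customer is charged once even though the step ran twice.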

There is also a subtlety in how state crosses a crash: ${think.output.draft} resolves to the most recent task instance, so a pass retried after a timeout reads that timed-out instance's empty output rather than the last good draft. When a loop must carry accumulated state through a mid-pass failure, hold it in a workflow variable or reference the last committed output explicitly.
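As an illustration of "reference the last committed output explicitly", the sketch below picks the newest committed draft out of a list of task attempts. The record shape (status and outputData keys, oldest attempt first) and the helper name last_committed_draft are assumptions made for this example, not a documented Conductor API.

```python
def last_committed_draft(attempts, default=""):
    """Return the draft from the most recent COMPLETED attempt, skipping failed ones."""
    for task in reversed(attempts):         # newest attempt first
        if task.get("status") != "COMPLETED":
            continue                        # timed-out or failed attempts carry no draft
        draft = (task.get("outputData") or {}).get("draft")
        if draft:
            return draft
    return default


# Hypothetical history: two committed passes, then a pass that timed out.
history = [
    {"status": "COMPLETED", "outputData": {"draft": "v1", "done": False}},
    {"status": "COMPLETED", "outputData": {"draft": "v2", "done": False}},
    {"status": "TIMED_OUT", "outputData": {}},
]
draft_for_retry = last_committed_draft(history)
```

With this history, the retried pass starts from "v2" instead of the empty output of the timed-out attempt.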

Where we are going next

Part 1 established the core idea: move the loop off your process and into a runtime that owns its state, and the agent stops being something that dies when its process does. The next part builds on the same DO_WHILE foundation with the patterns production agents actually need: pausing a loop for human approval and resuming it on a signal, fanning a loop out across parallel sub-agents, and changing a loop's definition without breaking the runs already in flight.

Further reading

  • Orkes Developer Edition
  • The DO_WHILE operator reference
  • Why your agent is not an LLM in a loop
  • Idempotency and retry safety in distributed workflows
  • Durable execution, explained
  • Join the Conductor community