This is Part 2 of the Prompt Engineering series, where we put prompt engineering tactics into practice, covering model choice and LLM parameters like temperature. Check out Part 1 on the importance of creating effective prompts.
Prompt engineering is the practice of enhancing LLM responses by writing optimized prompts and tuning parameters like temperature or topP. There are many strategies and tactics for creating effective prompts, but how do these translate into real-life situations? Which tactics are the most effective? How should I combine prompt writing and parameter tuning? Let’s explore some real examples of prompt engineering put into practice.
OpenAI's documentation outlines six main strategies for prompt engineering: write clear instructions, provide reference text, split complex tasks into simpler subtasks, give the model time to think, use external tools, and test changes systematically.
These strategies point to the overarching principle of prompt engineering: providing clear, specific, well-structured instructions. There are many ways to split hairs about how exactly to do that, with numerous guides available online (including our previous Guide to Prompt Engineering, where we introduced the basics of creating clear and effective prompts). What we want to do here is dig deeper into how model choice, prompt writing tactics, and parameter tuning interact to produce the ideal response.
In the following sections, we will explore each facet in turn and finally put it all together, so you have a concrete idea of what to focus on when implementing prompt engineering in your AI projects.
Without any prompt engineering, your choice of LLM determines the baseline quality of the response output. To illustrate, here is the response from Cohere's command model compared to Mistral's mistral-small model. Both models were asked to summarize an article about NASA's Europa Clipper mission:
Summarize ${text}
// Assume that ${text} is a dynamic variable that is injected with the relevant content at runtime
While both models provide a relatively accurate summary of the article, Mistral offers a well-structured response out of the box without additional prompting.
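To make this kind of side-by-side comparison easy to repeat, you can script it. The sketch below is a minimal TypeScript example; the generate helper is a hypothetical stand-in for each provider's SDK call, since the exact client code differs between Cohere and Mistral.

```typescript
// Hypothetical helper: sends a single-turn prompt to the named model and
// returns its text response. Replace the body with the relevant provider
// SDK call (Cohere, Mistral, etc.).
async function generate(model: string, prompt: string): Promise<string> {
  throw new Error(`wire up the SDK call for ${model}`);
}

const text = "..."; // the article content, injected at runtime

// Run the identical prompt against both models to compare their baseline output.
for (const model of ["command", "mistral-small"]) {
  const summary = await generate(model, `Summarize ${text}`);
  console.log(`--- ${model} ---\n${summary}\n`);
}
```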
Depending on your use cases and system requirements, Mistral's response could be over-engineered or the better fit. Nevertheless, using prompt engineering tactics, you can get command to respond just like mistral-small.
To provide clear, specific, well-structured instructions, these are some of the tactics employed in the prompts shown below:
Break up the instructions for clarity.
Instead of combining the article and the instructions in a single sentence, the prompt is now divided into distinct sections: one for the content and another for the instructions on how to evaluate the content.
Demarcate your prompt with special characters or tags where appropriate.
The article content is now enclosed in quotation marks to explicitly mark the text that should be summarized.
Include details about the desired output. Quantify where possible.
In one variation, the prompt now asks for the response to be “clear and concise” and even specifies the desired format of using bullet points. In another, it explicitly states the desired length of the output.
Check out how the responses have improved in each of these prompts.
Prompt: Here is the text from an article: "${text}". Provide a summary of this text in a
clear and concise manner.
Response: NASA has launched the Europa Clipper mission, a $5.2 billion robotic spacecraft,
to explore Jupiter's moon, Europa, which is believed to have a global ocean of liquid water
beneath its icy crust and could potentially support extraterrestrial life. The mission aims to
determine if Europa has the necessary ingredients for life, including water, organic compounds,
and an energy source. The spacecraft, the largest and most expensive planetary science
mission in NASA's history, will conduct 49 close encounters with Europa, studying its ocean,
atmosphere, and surface features. Scientists are particularly interested in the possibility of
hydrothermal vents on Europa's ocean floor, which could support simple life forms. The
mission also aims to find potential landing sites for future lander missions. Europa Clipper's
findings will shape future exploration of Europa and the search for life beyond Earth.
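These tactics also lend themselves to a reusable prompt template. The sketch below is illustrative: the buildSummaryPrompt function and its options are hypothetical, but they show how the article content, the delimiters, and the output requirements can be assembled in one place.

```typescript
// Hypothetical prompt template that applies the tactics above: separate the
// content from the instructions, demarcate the article with quotation marks,
// and spell out (and quantify) the desired output.
interface SummaryOptions {
  bulletPoints?: boolean; // ask for a bulleted summary
  maxCharacters?: number; // quantify the desired length
}

function buildSummaryPrompt(text: string, opts: SummaryOptions = {}): string {
  const parts = [
    `Here is the text from an article: "${text}".`,
    "Provide a summary of this text in a clear and concise manner.",
  ];
  if (opts.bulletPoints) {
    parts.push("Format the summary as bullet points.");
  }
  if (opts.maxCharacters) {
    parts.push(`Keep the summary under ${opts.maxCharacters} characters.`);
  }
  return parts.join(" ");
}

// Example usage for the two variations described above:
// buildSummaryPrompt(articleText, { bulletPoints: true });
// buildSummaryPrompt(articleText, { maxCharacters: 280 });
```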
In some cases, it is also useful to tune the temperature of the responses. This LLM parameter controls how deterministic or creative the output is: the higher the temperature, the more random and creative the output.
While temperature is associated with creative tasks like generating a poem or a story, it can also apply to factual writing, like summaries. Take a look at this response below, where the temperature is set to 0.1. The text output reuses similar words or phrasing, and this repetitiveness makes the summary rather stale and unengaging.
In contrast, a temperature of 1 leads to more dynamic and lively text but tends to exaggerate or include less important information, making the summarizer prone to creating misleading snapshots of the articles.
A more moderate temperature of 0.4 could be the sweet spot for an article summarizer, providing relevant information with livelier language.
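Sketched below is one way to compare temperatures programmatically. The temperature parameter behaves similarly across most chat APIs; this example assumes the OpenAI Node SDK and a placeholder model name purely for illustration, so swap in your own provider's client as needed.

```typescript
import OpenAI from "openai";

// Illustrative only: shown here with the OpenAI Node SDK, but most chat APIs
// expose a similar temperature option.
const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

const text = "..."; // the article content, injected at runtime
const prompt =
  `Here is the text from an article: "${text}". ` +
  "Provide a summary of this text in a clear and concise manner.";

for (const temperature of [0.1, 0.4, 1.0]) {
  const completion = await client.chat.completions.create({
    model: "gpt-4o-mini", // any chat-capable model works for this comparison
    messages: [{ role: "user", content: prompt }],
    temperature, // 0.1 = near-deterministic, 1.0 = more varied and creative
  });
  console.log(`temperature=${temperature}`);
  console.log(completion.choices[0].message.content, "\n");
}
```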
We’ve seen individually how model choice, prompt writing tactics, and LLM parameters each affect the output. But how do these factors combine to produce optimized results at scale? While prompt writing tactics give you the most mileage in obtaining a particular result, parameter tuning and model choice can be the difference between an average response and a stellar one.
Using the following prompt with different models and temperature values:
Here is the text from an article: "${text}". Summarize the text clearly in 280 characters using this format "Article summary: <your_summary_here>".
We get the following responses:
At a small sample size, it is easy to pick which answer is more desirable. But in a production setting, it is best practice to test changes systematically on a larger sample size to distinguish true improvement from randomness.
One way to systematically track changes is to evaluate the LLM responses against a benchmark, using relevant criteria or gold-standard answers. In this case, we want summaries that are concise yet comprehensive, answering the 5W1H (who, what, when, where, why, and how).
These evaluations can be manually done by humans or AI-assisted. Whichever route you choose, ensure that you can evaluate a large sample set with sufficient variety to cover all the edge cases in your use scenarios.
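As a sketch of what such systematic testing could look like, the snippet below sweeps the model and temperature combinations over a sample of articles and scores each summary against simple automatic checks (length and format). The generate helper and the scoring rules are illustrative stand-ins; in practice, you would add human or AI-assisted grading of 5W1H coverage against your gold-standard answers.

```typescript
// Hypothetical helper that calls the given model at the given temperature.
async function generate(
  model: string,
  prompt: string,
  temperature: number,
): Promise<string> {
  throw new Error(`wire up the SDK call for ${model}`);
}

const models = ["command", "mistral-small"];
const temperatures = [0.1, 0.4, 1.0];
const articles: string[] = [
  /* a representative sample of article texts, large enough to cover edge cases */
];

for (const model of models) {
  for (const temperature of temperatures) {
    let withinLimit = 0;
    let correctFormat = 0;
    for (const text of articles) {
      const prompt =
        `Here is the text from an article: "${text}". ` +
        `Summarize the text clearly in 280 characters using this format ` +
        `"Article summary: <your_summary_here>".`;
      const summary = await generate(model, prompt, temperature);
      if (summary.length <= 280) withinLimit += 1;
      if (summary.startsWith("Article summary:")) correctFormat += 1;
    }
    console.log(
      `${model} @ temperature ${temperature}: ` +
        `${withinLimit}/${articles.length} within length, ` +
        `${correctFormat}/${articles.length} in the requested format`,
    );
  }
}
```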
In practice, an application that leverages generative AI is likely far more complex than a simple one-shot prompt. Multiple prompts can be used across different tasks in a process, or AI techniques like retrieval-augmented generation (RAG), computer vision, and other data pre-processing tasks may come into play.
An AI orchestration engine like Orkes Conductor can streamline such development efforts with its built-in AI tasks and feature-rich AI prompt builder, where you can test various prompt engineering strategies. With one-step integrations for a dozen LLM providers (including OpenAI, HuggingFace, and more), Orkes’ AI prompt builder unlocks the convenience of plugging, testing, and playing with prompts across models.
Besides testing, these prompts can be templatized, saved, and safely used in a production setting for AI-driven workflows and applications, such as document classification, RAG-based search retrieval, approval journeys, and more.
By putting various prompt engineering tactics into practice, we have explored how to combine model choice, prompt writing, and parameter tuning, and how to systematically test these changes.
Beyond what we have explored here, there are many other ways to improve model outputs, such as model fine-tuning or retrieval-augmented generation (RAG). A strategic combination of these different techniques and methods will unleash the full potential of generative AI in automating creative, complex, or human-involved tasks.
Using Orkes Conductor, you can rapidly build, optimize, and test AI-enabled applications in a distributed environment. Conductor is an open-source orchestration platform used widely in many mission-critical applications for orchestrating microservices, AI services, event handling, and more. Try out a fully-managed and hosted Conductor service with the 14-day free trial for Orkes Cloud.