A hot topic among conversation designers and others is how to capture the scope and behavior of LLMs. While there is a need to create a design language and to constantly test and document prompts in tools like Voiceflow, there’s another layer of documentation that needs to happen before we can start defining patterns and creating standards and best practices.
Until we become more familiar with the intricacies of the models, capturing the scope and behavior of LLMs is done through trial and error, rigorous testing and documentation, and understanding the variables that make up a successful prompt and prompt output. It all goes back to Maaike Groenewege’s point at Unparsed 2023 that the best output comes down to content optimization.
Prompt documentation and tracking
I’ve seen a few tools like Vidura that track prompts, but what I’m not seeing is a way to track the output behavior. When you change one variable in a prompt, the outcome of the entire experience can change, so it’s really important to document every single detail. This way, it’s easier to track what you’ve tried before, pinpoint exactly how to fix a prompt that might not respond the way you want it to, and mix and match the best parts of each prompt.
Prompt variables
Here are the types of variables to test and document each time you change or add a prompt (these will vary depending on the tool you use; see the sketch after the list for one way to capture them):
- Goal/Task
- Prompt
- System prompt (if applicable)
- Data source (e.g. AI model or knowledge base)
- Model type (e.g. GPT-3, ChatGPT, GPT-4, Claude 2, Llama 2)
- Knowledge base (e.g. text, documents, URLs)
- Temperature
- Tokens
- Prompt response type (if applicable) (e.g. prompt, memory, memory and prompt)
- Expected behavior or output
- Examples of expected behavior or output
- Behavior and output considerations (e.g. hallucinations, sensitive data)
- What worked well about the prompt output
- What could be improved
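To make this more concrete, here’s a minimal sketch of how these variables could be captured as a structured record, assuming you’re comfortable keeping a log in code alongside your design tool. The class and field names (PromptRecord, worked_well, and so on) are my own placeholders, not part of any particular tool, so rename them to match your own template.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PromptRecord:
    """One documented prompt, mirroring the variables listed above."""
    goal: str                       # Goal/Task the prompt should accomplish
    prompt: str                     # The prompt text itself
    system_prompt: Optional[str]    # System prompt, if the tool exposes one
    data_source: str                # e.g. "AI model" or "knowledge base"
    model_type: str                 # e.g. "GPT-4", "Claude 2", "Llama 2"
    knowledge_base: Optional[str]   # e.g. text, documents, URLs
    temperature: float              # Sampling temperature used
    max_tokens: int                 # Token limit used for the response
    response_type: Optional[str]    # e.g. "prompt", "memory", "memory and prompt"
    expected_behavior: str          # What the output should look like
    example_outputs: list[str] = field(default_factory=list)  # Sample outputs
    considerations: str = ""        # e.g. hallucinations, sensitive data
    worked_well: str = ""           # What worked well about the output
    to_improve: str = ""            # What could be improved
```

A flat record like this maps one-to-one onto a spreadsheet row, which makes it easy to move between a quick script and a more collaborative tool later.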
What we ideally need is a way to track these variables within a tool like Voiceflow or with a tool like Airtable that allows you to filter, color-code, search, compare, and track changes. This will help with scalability and cross-functional collaboration. But until we have dedicated tools for prompt management and documentation, FigJam’s been working for me so far. I started with a spreadsheet, but FigJam was a lot faster and easier.
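Until those dedicated tools exist, even a small script can keep the log filterable and comparable. Below is a rough sketch, not tied to any product’s API: it appends each documented prompt as a row in a CSV file that you could then import into Airtable or a spreadsheet. The file name and the log_prompt helper are hypothetical, so treat them as a starting point rather than a finished workflow.

```python
import csv
from pathlib import Path

# Columns mirror the prompt variables listed above; adjust them to your template.
FIELDS = [
    "goal", "prompt", "system_prompt", "data_source", "model_type",
    "knowledge_base", "temperature", "max_tokens", "response_type",
    "expected_behavior", "example_outputs", "considerations",
    "worked_well", "to_improve",
]

def log_prompt(row: dict, path: str = "prompt_log.csv") -> None:
    """Append one documented prompt as a CSV row, writing a header on first use."""
    file = Path(path)
    is_new = not file.exists()
    with file.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)  # Columns you leave out are simply left blank

# Example: record one prompt change so it can be filtered and compared later.
log_prompt({
    "goal": "Answer billing FAQs",
    "model_type": "GPT-4",
    "temperature": 0.2,
    "worked_well": "Stayed on topic",
    "to_improve": "Too verbose",
})
```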
This is very detailed and intensive work, but it has to be done until we understand how to create consistent behavior and output.
Prompt variable spreadsheet template
I created an initial spreadsheet template to demonstrate what I’m envisioning. This is not scalable and only accounts for individual prompts, but it’s a starting point to build on. If you want to use it, please feel free to make a copy and modify it based on your specific needs!
Working document of prompt variables and definitions
For those of you who may not be familiar with some of the terms mentioned above, I’ve created an initial working document that defines all of the prompt variables. These definitions are based on my own research and understanding, so if anything is inaccurate or missing, please let me know and I will adjust it.
Prompt Variables and Definitions
I hope this is helpful! Please reach out on LinkedIn if you have any questions.