A man holding a phone over a laptop, out of it comes a chatbot.

Building Generative AI Products

Written by: Jakob Friberg and Felix Weiland, Data Analysts at Bontouch

Over the past year, Bontouch has experimented extensively internally, built several proofs of concept, and begun developing new AI-backed functionality in the products we make together with our partners. These new large language models (LLMs) open up a world of possibilities, but amid the AI buzz that has captivated so many, some challenges often remain overshadowed. In this article, we will explore how these kinds of products are built, and why building real products on top of LLMs that work out in the wild isn’t always easy.

What is Generative AI?

Generative AI entered the public consciousness in 2022 with the introduction of capable image generation products, such as DALL-E, Midjourney, and Stable Diffusion, and with the release of ChatGPT at the end of November 2022. Read more about Generative AI in our other blog post here.

From a tech standpoint, Generative AI refers to a collection of machine learning algorithms that can produce content from the information you feed them. For instance, it can condense text into a summary, write code, or create a lifelike image in a desired aesthetic for your slideshow. Since the breakthrough, many products have popped up trying to solve both old and new problems with this technology.

The Blueprint for Generative AI Products

Many of the Generative AI products we see popping up now are built on top of large language models (LLMs). These are the very capable engines propelling the AI application. But an engine alone, no matter how advanced, won’t get you to your destination. LLMs are just one part of a larger system that delivers a functional AI-driven product. Here’s the idea:

A blueprint laid out on a desktop, surrounded by rulers and a pen.
Image created by Bontouch using DALL·E 3, a generative AI tool.

Large Language Model:
The LLM generates creative text, answers questions, translates languages, and solves other language-related tasks out-of-the-box based on the input prompt.

Guardrails:
Guardrails guide AI output, preventing missteps and ensuring no harmful content is generated.

Domain Data:
While the LLM understands language broadly, domain data helps it navigate specific industries or topics, ensuring its answers are accurate and relevant. Training an LLM from scratch on only domain-specific knowledge isn’t just costly; it’s the training on diverse data that gives the model its versatility. Instead of starting from scratch, injecting domain knowledge into prompts and fine-tuning on domain language are more effective strategies.

Memory:
All current models are stateless, meaning that they do not ‘remember’ past messages in a conversation out of the box. This challenge has different solutions depending on what you want to achieve, but memory is always something we build on top of the models (see the sketch after this list).

Evaluation:
There are a ton of benchmarks for AI models out there, but they are often generic and built for research rather than for the type of LLM applications we are building here at Bontouch. In general, our approach is to create standardized test sets specific to the product we want to build, in order to measure accuracy.

User interface:
Everything we build should serve a purpose, and most of the time, that means a user interacting with some interface. How that interface is designed is as important as ever, but the way LLMs handle natural language inputs opens up new ways of building them.
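To make the blueprint concrete, here is a minimal sketch of how the first four components can meet in a single call. It assumes the OpenAI Python library (the pre-v1 ChatCompletion interface) and a fictional support assistant; evaluation and the user interface live outside this snippet:

import openai  # assumes OPENAI_API_KEY is set in the environment

# Guardrails: a system prompt constraining scope, tone, and length
SYSTEM_PROMPT = (
    "You are a customer support assistant for Acme. "
    "Only answer questions about Acme's products, "
    "in a professional tone, in at most three sentences."
)

# Memory: the application stores the conversation itself,
# since the model does not remember past messages
history = [
    {"role": "user", "content": "Hi! What does the Pro plan cost?"},
    {"role": "assistant", "content": "The Pro plan costs $99 per month."},
]

def answer(question: str, domain_facts: str) -> str:
    """One LLM call combining guardrails, memory, and domain data."""
    messages = (
        [{"role": "system", "content": SYSTEM_PROMPT}]
        + history
        # Domain data: retrieved facts are injected alongside the question
        + [{"role": "user", "content": f"Facts: {domain_facts}\n\nQuestion: {question}"}]
    )
    response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
    return response.choices[0].message.content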

How to implement these components is up to you. LangChain and Microsoft’s Guidance are great examples of frameworks that abstract the key concepts and cover many everyday use cases, such as Q&A and chatbots, and they are good starting points for getting up to speed. Both serve the purpose of assembling the components mentioned above into a fully functional product. We are still in the early stages of this new landscape, and development is moving fast, with new solutions and application areas being discovered daily.

At Bontouch, we’ve made use of libraries like these as well as our own custom-built solutions. Whatever the tooling, the guiding principles have revolved around this blueprint.

The challenges with Generative AI

Large language models are inherently unpredictable. This section will cover common obstacles and some insights into how we have solved them. Disclaimer: The following section delves into technical details. If you’re up for a deep dive, read on! If not, feel free to jump to our concluding remarks at the end of this post.

Large Language Model

Deciding which LLM to use depends on the required capability, data privacy, and acceptable latency. To date, the smaller models that can be run locally, and the larger open-source models that can be run on your own GPU cluster, do not provide capabilities on the same level as ChatGPT’s. Therefore, you are usually bound to LLM APIs from providers such as OpenAI, Anthropic, or Microsoft.

The larger models available as APIs are a lot more capable, but that comes with a caveat: the response time is longer, meaning your user might be left waiting for an answer. In web and app development, we typically talk about response times in milliseconds to keep a product from feeling sluggish. A call to an LLM API may sometimes take as long as 30 seconds (!).

The two key factors that determine speed are how much text the model needs to generate and which model you use. Higher-complexity models, such as GPT-4, have slower response times but better accuracy in their outputs. Based on the specific needs of your application, you should carefully consider which model(s) to use; if your application makes multiple calls, or even chains calls together, you can of course use more than one model. Furthermore, some of the API services provide fine-tuning functionality that can improve both the speed and the accuracy of a model: a fine-tuned smaller model can achieve performance similar to a larger model for your specific use cases.

Regarding data privacy, we will dig deeper into that further down in this post.

Guardrails

Typically, the API providers have guardrails implemented that ensure the model doesn’t produce harmful content and limit the length of responses. However, at Bontouch we also include our own custom guardrails to make the model produce output in the required format. The term “prompt engineering” has become a popular expression for this type of work, and it’s a tough nut to crack. Today’s large language models pay attention to different words in the input prompt based on which context the words are in. This is the “self-attention” mechanism. Prompt engineers need to understand this to get the model to take into account all instructions in their prompts.

As a beginner, it’s common to prompt by simply mashing instructions and information into one piece of text with no common thread, leading to the model ignoring some instructions or missing the context.

Beginner prompting:
I want you to do A based on this text below:
TEXT: <TEXT>
You should only answer in Swedish.
Don't mention anything about topic B.
Answer in a professional tone and in a bullet point format.

To solve this, you need to think less like you’re programming and more like you’re instructing a human. The models are trained on natural language, meaning the output will be more predictable if your input is… natural language. This is an exciting concept because, contrary to what one might expect, proficient prompt engineers are not thinking like developers. They’re thinking like writers, or efficient communicators.

Basic prompt engineering patterns include providing an instruction word such as “translate” or “summarize” along with the format and tonality you want in the output. However, as more complex tasks need to be solved, more robust prompt formats become necessary. One-shot and few-shot prompting are well-established techniques that provide the model with in-context learning. Chain-of-thought prompting is a more recent technique that has proven very successful in guiding the model through a series of reasoning steps. Below are two example prompts using these techniques that you can give to an LLM. Note how both techniques involve describing to the model what output you expect.

One-shot prompting:
I would like to go to London next week, traveling from Stockholm.
Here's the destination and origin: {from: "Stockholm", to: "London"}

I would like to book a trip from Tokyo to Berlin on the 4th of December.
Here's the destination and origin: <Model response>

Chain-of-thought prompting:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?

A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.


Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?


A: <Model response>

If we want the model to do something other than provide a natural-text answer back, we must rely on it to output its response in the correct format. Say we want to translate a user query into an API call; then the model must respond in exactly the right structure. This can be achieved through prompt engineering, fine-tuning, or a combination of both.
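A sketch of what such a format guardrail can look like in code: the model is asked for JSON only, and the response is validated before anything downstream touches it, with a retry on failure. The prompt wording, key names, and retry budget below are illustrative assumptions, not a fixed recipe:

import json

import openai  # pre-v1 OpenAI library, OPENAI_API_KEY set in the environment

PROMPT_TEMPLATE = (
    "Translate the user's request into a JSON object with exactly the keys "
    '"from" and "to". Respond with JSON only, no other text.\n\n'
    "Request: {request}"
)

def parse_trip_request(request: str, max_retries: int = 3) -> dict:
    """Guardrail: only accept output that parses as the expected JSON."""
    for _ in range(max_retries):
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(request=request)}],
            temperature=0,  # lower randomness for structured output
        )
        text = response.choices[0].message.content
        try:
            parsed = json.loads(text)
            if {"from", "to"} <= parsed.keys():
                return parsed
        except json.JSONDecodeError:
            pass  # malformed output: fall through and retry
    raise ValueError("Model did not produce valid JSON within the retry budget")

# parse_trip_request("I want to go from Stockholm to London")
# -> {"from": "Stockholm", "to": "London"}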

In the end, guard-railing your AI product will make up a substantial portion of the work, and it’s a continuous trial-and-error process. Never expect your guardrails to be static: updated language models and changes in user prompting behavior necessitate updated guardrails because of the models’ stochastic nature.

Memory and Domain data

To enable an LLM product to manage memory and external domain data not inherently known to generalist models, one common technique is to supply relevant knowledge as context in the user prompt. For instance, a chatbot answering questions about your organization’s product suite needs to be given that information in the input sent to the model. This works well for a few paragraphs of text, but what if it’s 50 pages of product descriptions? Similarly, you can provide the complete chat history as context up to a certain point, after which it becomes too large to fit inside the model’s maximum input length (also referred to as its context window).
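A common workaround on the chat-history side is a sliding window: keep the most recent messages that fit within a token budget and drop the rest. A minimal sketch, assuming the tiktoken tokenizer and a budget picked purely for illustration:

import tiktoken

encoder = tiktoken.encoding_for_model("gpt-3.5-turbo")

def trim_history(history: list[dict], max_tokens: int = 3000) -> list[dict]:
    """Keep the most recent messages that fit within the token budget."""
    trimmed, used = [], 0
    for message in reversed(history):  # walk from the newest message backwards
        tokens = len(encoder.encode(message["content"]))
        if used + tokens > max_tokens:
            break
        trimmed.append(message)
        used += tokens
    return list(reversed(trimmed))  # restore chronological order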

A transparent cube divided in two.
Image created by Bontouch using DALL·E 3, a generative AI tool.

Today’s go-to solution involves chunking data into smaller pieces and including only the most relevant pieces of information together with the user’s question. This is often done through similarity matching: a sentence-embedding model turns text into a high-dimensional vector that semantically represents it, which allows scoring the semantic similarity between the user’s input and the chunks of domain data. In practice, a vector database is used, such as Weaviate, Pinecone, or Qdrant; they offer robust integrations into common programming languages.
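Under the hood, that matching boils down to comparing vectors, most often with cosine similarity. A toy sketch with NumPy, where embed is a placeholder for whatever sentence-embedding model you choose:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Score how semantically similar two embedding vectors are (-1 to 1)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# `embed` stands in for an embedding model, e.g. text-embedding-ada-002:
# query_vector = embed("What does the Pro plan cost?")
# scores = [cosine_similarity(query_vector, embed(chunk)) for chunk in chunks]
# best_chunks = [c for _, c in sorted(zip(scores, chunks), reverse=True)[:3]]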

With a vector database in place, the main things you need to worry about are selecting an appropriate embedding model, determining which data to use, and providing the correct input for similarity matching. Below is an example using a Pinecone vector database through LangChain.

import os

import pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

# initialize pinecone
pinecone.init(
    api_key=os.getenv("PINECONE_API_KEY"),  # find at app.pinecone.io
    environment=os.getenv("PINECONE_ENV"),  # next to API key in console
)

index_name = "langchain-demo"

# First, check if our index already exists. If it doesn't, we create it.
if index_name not in pinecone.list_indexes():
    # The OpenAI embedding model `text-embedding-ada-002` uses 1536 dimensions
    pinecone.create_index(
        name=index_name,
        metric="cosine",
        dimension=1536,
    )

# `docs` is assumed to be a list of LangChain Document objects
# containing your chunked domain data.
embeddings = OpenAIEmbeddings()
docsearch = Pinecone.from_documents(docs, embeddings, index_name=index_name)

# Extract the most relevant documents in our vector database based on
# the text in the query.
query = "What did the president say about Ketanji Brown Jackson"
docs = docsearch.similarity_search(query)
Source: https://python.langchain.com/docs/integrations/vectorstores/pinecone

Chaining

So now you’ve optimized your prompt engineering, you’ve provided the model with a memory of past interactions, you’ve made sure it has access to specific domain knowledge… and it still doesn’t perform? – Enter LLM chaining.

A conductor that guides an orchestra, above them is a map of the notes.
Image created by Bontouch using DALL·E 3, a generative AI tool.

Just like an orchestra has different instruments playing together, we join LLM calls to achieve something more significant than the sum of the parts. By chaining together LLM calls in parallel and series, we can have one model respond to the output of another model or have two, three, or four models respond to the user input and have a fifth model summarize all their outputs for a more diverse response.


This is an exciting area where the power of LLMs can excel. Because we can break down complex operations into multiple calls with different configurations, we can process user input in ways that would have required immense development effort just a year ago. However, it comes at a price: more LLM calls mean both a higher cost for the service and higher latency. Making calls asynchronously in the background can partly address the latency, but at the end of the day it boils down to balancing a complex chaining schema against fewer prompts with more carefully worked-out prompt engineering.
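As a sketch of the pattern, the snippet below fans one user input out to two models in parallel and lets a third call summarize their drafts. It assumes the pre-v1 OpenAI Python library’s async interface; the model choices are arbitrary:

import asyncio

import openai  # assumes OPENAI_API_KEY is set in the environment

async def ask(model: str, prompt: str) -> str:
    response = await openai.ChatCompletion.acreate(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

async def chained_answer(user_input: str) -> str:
    # Parallel step: several models answer the same question independently
    drafts = await asyncio.gather(
        ask("gpt-3.5-turbo", user_input),
        ask("gpt-4", user_input),
    )
    # Series step: a final call summarizes the drafts into one response
    summary_prompt = (
        "Combine the following answers into one concise response:\n\n"
        + "\n---\n".join(drafts)
    )
    return await ask("gpt-4", summary_prompt)

# asyncio.run(chained_answer("What should I consider when choosing a vector database?"))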

Evaluation

Evaluating the performance of LLM functionality, both during and after development, is a crucial but challenging aspect of the process. There are several reasons AI evaluation becomes complex:

Lack of Ground Truth:
Unlike traditional software, where you have a definitive set of inputs and expected outputs, LLMs often deal with tasks that don’t have a single correct answer. This makes it challenging to determine what constitutes a “correct” response.

Subjectivity:
LLM systems are often used to generate or interpret human language. Evaluating the quality of language generation or understanding can be highly subjective and context-dependent.

Data and Bias:
LLMs learn from data, and if the training data is biased, the model can produce biased results. Identifying and mitigating bias in these systems is a challenging ethical and technical issue.

Real-World Variability:
Evaluating LLM systems’ performance in a real-world, dynamic environment can be complex, as they need to adapt to changing conditions and user behaviors.

Unintended Behaviors:
LLM systems can exhibit unexpected or unintended behaviors, which may not become apparent until they are in use. These behaviors can be difficult to predict and evaluate.

So far, we’ve seen success in combining the knowledge we already have in testing digital products, through classic user testing, with the capabilities of large language models themselves, by having the models evaluate their own outputs. An LLM can be prompt-engineered to score the factual consistency of a response, the relevancy of the context provided with the user prompt, or how relevant the output is to the provided prompt. For instance, one useful metric is the number of sentences in the provided context that are relevant to the user’s prompt. Below is an example:

Relevant context extraction:
User: Tell me about the Eiffel Tower's architectural design.
Context: The Eiffel Tower, also known as the Iron Lady, is a wrought-iron lattice tower located on the Champ de Mars in Paris, France. It was named after the engineer Gustave Eiffel, whose company designed and built the tower. Construction of the Eiffel Tower began in 1887 and was completed in 1889 as the entrance arch for the 1889 World's Fair, held in Paris to celebrate the 100th anniversary of the French Revolution. The tower stands 330 meters (1,083 feet) tall, making it one of the most recognizable landmarks in the world.


Apart from its architectural significance, the Eiffel Tower has been featured in numerous movies and books. It has also been the site of various events and ceremonies over the years, making it an iconic symbol of Paris. During its construction, it used over 18,000 individual iron pieces and 2.5 million rivets. Workers had to manually shape each rivet before it could be used in the tower's construction. This was a time-consuming and painstaking process.

In this example, two sentences in the context are not related to the user’s question, meaning the context will be slightly penalized, receiving a lower score in the evaluation.
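Here is a sketch of how such a metric can be computed with a model acting as its own judge; the grading prompt and the output convention are our own illustrative choices:

import openai  # pre-v1 OpenAI library, OPENAI_API_KEY set in the environment

GRADER_PROMPT = (
    "Question: {question}\n\nContext:\n{context}\n\n"
    "Count how many sentences in the context are relevant to the question, "
    "and how many sentences the context contains in total. "
    "Answer with two integers separated by a slash, e.g. 7/9."
)

def context_relevancy(question: str, context: str) -> float:
    """Fraction of context sentences relevant to the question, judged by an LLM."""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": GRADER_PROMPT.format(question=question, context=context)}],
        temperature=0,  # grading should be as deterministic as possible
    )
    # As with any LLM output, a production system would validate this format
    relevant, total = response.choices[0].message.content.strip().split("/")
    return int(relevant) / int(total)

# Run the metric over a standardized test set and track the average score
# whenever prompts, models, or retrieval settings change.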

Using these types of metrics, it’s possible to score the whole system of language models, guardrails, domain data, etc., specific to the product you’re building. This also makes the system design process less subjective and more efficient as you can put a number on how well the system performs and test a lot more use cases compared to manually testing the system.

Building solid frameworks to evaluate the performance of the models we put into production is essential, both for learning how they operate at a more granular level and for ensuring that they perform well enough. But this is an area where there is a lot more to uncover.

User Interface

LLMs, with their ability to interpret and generate natural language, present both challenges and opportunities for user interaction. The first thing people tend to think about when it comes to LLMs and user interfaces is a chatbot, and LLMs are well suited for that application. But there are many other interesting applications where the LLM operates more under the hood. The models’ understanding of context and meaning in natural language has opened up a new world of input methods. A user’s input can, for instance, be parsed into a JSON format and processed programmatically despite spelling mistakes or missing keywords. A user interface that previously presented dropdown lists, checkboxes, and sorting components can now be “outsourced” to an LLM parsing a free-text field where the user explains their request.

Simple text parsing with an LLM:
“I’d like to go by train to Stockholm on Friday next week”.
Travel method: Train
Destination: Stockholm
Date: 2023-10-26

We see many possible use cases for these and similar approaches where an LLM is working together with other parts of the application. There are ways of integrating information from current views, fine-tuning models to parse information quickly in the correct format, and remembering previous interactions for personalization.

However, there are challenges as well. Latency in API response times can affect real-time interactions and disrupt the fluidity expected in modern interfaces. As mentioned before, the response time of available LLM APIs depends on how much text the model needs to generate and which model you use. Currently, this is a complex problem to work around, and the solution falls back to designing AI features around the limitation and incorporating feedback mechanisms, like loading animations or interim responses from a parallel API call, to keep the user engaged while awaiting the primary response.
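One concrete feedback mechanism is streaming: most LLM APIs can return the answer token by token, so the interface can start rendering text immediately instead of waiting for the full response. A minimal sketch with the pre-v1 OpenAI Python library:

import openai  # assumes OPENAI_API_KEY is set in the environment

def stream_answer(prompt: str) -> str:
    """Print tokens as they arrive so the user sees progress immediately."""
    answer = []
    for chunk in openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # yield partial deltas instead of one final response
    ):
        delta = chunk.choices[0].delta.get("content", "")
        print(delta, end="", flush=True)
        answer.append(delta)
    return "".join(answer)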

Another challenge is, somewhat paradoxically, the models’ vast capability. Given the striking capabilities you encounter when first trying out tools like ChatGPT, it’s easy to overestimate what the models can actually do. If they seem to handle such a wide variety of tasks at that level, then adding more information for them to know, or instructing them how to behave, should surely be possible, right? Unfortunately, it is not that simple. There are definite limitations to what you can do with an LLM, and the more we explore and work with the technology, the better we understand them.

Conclusion

In this article, we have presented a conceptual framework for how to build AI products and the challenges involved in doing so. With this framework, we have a foundation on which to build the new kinds of innovation this technology makes possible, and we are continuing to explore it within our organization and with our partners.

This technology is still new, with discoveries and applications popping up everywhere. At Bontouch, we’re constantly exploring the latest advances to further develop our expertise in building digital products. Suddenly, it has become possible to strap a thin interface around a state-of-the-art language model and label it a product without much effort. However, the principles of digital product development haven’t changed: that approach results in an impressive demo at best. Integrating an advanced LLM into a real digital product still requires expertise and a lot of work from a dedicated team.

Bontouch has worked with AI and Machine Learning for more than seven years. We’re running labs exploring technology and use cases with our partners, incorporating machine learning into products, building chatbots, voice assistants, and more. As we move forward, we will continue to embrace the power of AI, unlocking its potential to revolutionize our digital products and improve the lives of our users. Together, we embark on a journey toward a future where AI is helpful, accessible, and responsible for all.

If you want to know more or explore what Generative AI means for you and your brand, don’t hesitate to reach out to us at [email protected] so we can explore this together.