
Build Safe and Reliable AI Systems

October 10, 2024

Learn how to make your AI system safer and more reliable.

In this tutorial, you'll improve a chatbot's reliability by learning how to:

  • Improve your prompt engineering.
  • Reduce AI hallucinations.
  • Create prompt engineering test suites.

To make your learning experience more engaging, we’ll use the demo application from Okta Developer Day: an online learning platform called “Identiflix”. The web app includes a chatbot that lets users ask questions about identity and security topics. However, this chatbot sometimes answers questions on unrelated topics and provides information that’s inaccurate or even made up. Let’s fix that throughout this tutorial.

Why Use AI Reliability Techniques?

AI systems' output is not always accurate. Inaccuracy can stem from many factors: the initial training data, a lack of context, or other causes. It can have serious consequences, including loss of customers, damage to your brand, and legal problems.

To improve the accuracy of your AI system's output, you should consider using AI reliability techniques such as content filters, fact-checking mechanisms, and prompt testing suites.

These are common techniques used by AI practitioners and have a significant impact on AI-based applications:

  • Improved Accuracy. By implementing content filters and fact-checking mechanisms, you can reduce hallucinations and unexpected responses, leading to more accurate and trustworthy AI outputs.
  • Enhanced User Experience. Reliable AI systems provide consistent and appropriate responses, resulting in better user satisfaction and engagement with your chatbots or AI applications.
  • Risk Mitigation. Implementing guardrails helps prevent potential reputational damage or legal issues that could arise from inappropriate or false AI-generated content.
  • Quality Assurance. A prompt testing suite allows you to systematically verify your AI's performance, ensuring it meets your standards and expectations across various scenarios.
  • Scalability. As your AI applications grow, these techniques provide a foundation for maintaining reliability, making it easier to expand and improve your systems over time.
  • Competitive Edge. Mastering these techniques sets you apart in the AI development field, as reliability is becoming increasingly crucial in AI applications.

By adopting these reliability techniques, you'll be better equipped to ship powerful, safe, and trustworthy AI systems.

Project Setup

Start by cloning the project into your local machine:

git clone https://github.com/oktadev/devday-24-labs-demo-app.git

Make the project directory your current working directory:

cd devday-24-labs-demo-app

Next, install the Next.js project dependencies:

npm install

To start the development server and reload the application whenever you make changes to the codebase, run the following command. It will print the URL (localhost plus port) your application is running on, so you can open it in your browser.

npm run dev

The application's chatbot was designed to work with two different LLMs. You can choose which one you'd like to use based on the following:

Xenova/Qwen1.5-0.5B-Chat

By default, the application will download and run a local LLM, so you won't need to configure any API keys. However, there are some considerations for using this model:

  • The model has 620M parameters, which makes it a relatively small LLM, so the responses you get from it won't be optimal.
  • The model still requires significant processing power and may take some time to produce a response, depending on the hardware it's running on.

OpenAI

You can opt to use OpenAI instead of running a local model. With OpenAI, you'll access state-of-the-art models that respond effectively to your prompts without sacrificing performance. If you decide to use OpenAI, follow these steps:

  • Get an OpenAI API key
  • Edit the .env.local or .env file in your application's project and add a new environment variable: OPENAI_API_KEY. Assign your OpenAI API key as its value.
  • The application will automatically switch to OpenAI when this environment variable is set.
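
For example, your .env.local file might contain a single line like this (the key value below is just a placeholder; use your actual API key):

OPENAI_API_KEY=sk-your-api-key-here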

Depending on whether you are using the Xenova/Qwen1.5-0.5B-Chat model or OpenAI, you will be modifying different parts of the code. If you are using OpenAI, you will edit the parts of the code that use the useOpenAI conditional statements.
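
For reference, the switch might look roughly like the following (a simplified sketch, not the exact code in the repo):

// Hypothetical sketch: the app switches to OpenAI based on the presence of the API key.
const useOpenAI = Boolean(process.env.OPENAI_API_KEY);

if (useOpenAI) {
  // Send the chat messages to the OpenAI API
} else {
  // Run the local Xenova/Qwen1.5-0.5B-Chat model
}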

Reducing Hallucinations Through Prompting

Hallucinations are responses generated by an AI system that contain false or misleading information presented as facts. There can be many reasons for hallucinations: from the initial training data to the very nature of the model. However, one of the main reasons is a lack of context.

You will learn how to add context to a prompt in order to mitigate the risk of hallucinations.

Go to the /data folder of the project and open the ai.ts file, which contains the code that interacts with the AI model. The code in this file defines the systemMessage constant, which is our system prompt:

const systemMessage = {
  role: "system",
  content: `You are a chatbot that answers questions about identity, authentication, and authorization.`,
};

A system prompt is a fixed prompt providing instructions to the model. The user can't change this prompt, so you can use it to define a set of constraints within which the model can respond.

The prompt you see in the code snippet above defines the topics the model can respond to, but it lacks robustness. Let's improve it step by step.

Step 1: Adding Context and Citation Requirements

The first step is to define a context and some rules. Let's start by modifying the content property of our system prompt as shown below:

const systemMessage = {
  role: "system",
  content: `You are a chatbot that answers questions about identity, authentication, and authorization. Rules:
  1. Use ONLY the information from the numbered context below.
  2. Cite sources using [1], [2], etc. after EACH piece of information.
  Context:
  ${numberedContext}
  Remember to cite your sources and only use the information provided above.`,
};

You added two rules and a context.

The first rule forces the model to take the context defined below as the only source of information. This ensures that you are in control of the source of information that the model can provide.

The second rule asks the model to provide the references to the sources it used to formulate the answer.

Then, after the Context: string, there is a variable bound to a list of resources about the topics our model specializes in, i.e., identity, authentication, and authorization. The value of the numberedContext variable is a string with the actual content that the model must consider.

The last sentence simply reinforces the two rules defined earlier.
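
To make the structure of the context concrete, here is one way numberedContext could be built from a list of retrieved documents. This is a hypothetical sketch; the actual construction in ai.ts may differ:

// Hypothetical search results; the shape (content plus metadata.source) matches
// what the citation-processing code later in this tutorial expects.
const searchResults = [
  {
    content: "Authentication is the process of verifying a user's identity.",
    metadata: { source: "https://example.com/authn" },
  },
  {
    content: "Authorization determines which resources a user can access.",
    metadata: { source: "https://example.com/authz" },
  },
];

// Prefix each document with its citation number: [1], [2], ...
const numberedContext = searchResults
  .map((result, index) => `[${index + 1}] ${result.content}`)
  .join("\n");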

Step 2: Implementing Response Constraints

In this second step, you will refine the system prompt to include a couple of additional constraints:

const systemMessage = {
  role: "system",
  content: `You are a chatbot that answers questions about identity, authentication, and authorization. Rules:
  1. Use ONLY the information from the numbered context below.
  2. Cite sources using [1], [2], etc. after EACH piece of information.
  3. Your response MUST be 50 words or less.
  4. Your response MUST contain at least one citation in square brackets, like [1].
  Context:
  ${numberedContext}
  Remember to cite your sources and only use the information provided above.`,
};

As you can see, two new rules were added.

The third rule limits the length of responses, while the fourth rule ensures that at least one citation is used and defines its format.

Step 3: Fine-tuning Response Behavior

Let's complete the improvement of the system prompt to reduce hallucinations by adding two more rules that tell the model how to behave when it does not find an answer. Change the prompt as follows:

const systemMessage = {
  role: "system",
  content: `You are a chatbot that answers questions about identity, authentication, and authorization. Rules:
  1. Use ONLY the information from the numbered context below.
  2. Your response MUST be 50 words or less.
  3. Cite sources using [1], [2], etc. after EACH piece of information.
  4. If you can't answer using the context, try to provide a partial answer or related information.
  5. Do not add any information not present in the context.
  6. Your response MUST contain at least one citation in square brackets, like [1].
  Context:
  ${numberedContext}
  Remember to cite your sources and only use the information provided above.`,
};

This final step adds instructions for handling incomplete information and explicitly prohibits adding information not in the context, further reducing the likelihood of hallucinations.

By implementing these incremental improvements, we significantly enhance the reliability and trustworthiness of the AI's responses while maintaining its focus on the provided context.

Citation Processing

The current solution has the desired behavior, but it only provides citation numbers in square brackets, which isn't very meaningful to the user. You need a way to associate the links of the cited documents with these numbers.

Let's create a processCitations function that implements this citation processing:

function processCitations(
  assistantResponse: string,
  searchResults: any[],
): string {
  // Find every citation like [1] and expand it with its source URL
  let processedResponse = assistantResponse.replace(
    /\[(\d+)\]/g,
    (match, p1) => {
      // Citation numbers are 1-based; array indices are 0-based
      const index = parseInt(p1) - 1;
      if (index >= 0 && index < searchResults.length) {
        return `[${p1}: ${searchResults[index].metadata.source}]`;
      }
      // Leave citations without a matching search result unchanged
      return match;
    },
  );
  return processedResponse;
}

This function uses a regular expression to find citation numbers in square brackets (e.g., [1], [2]). It then replaces each citation with an expanded version that includes the source URL. In detail:

  1. The regex /\[(\d+)\]/g matches citation numbers in square brackets.
  2. For each match, it calculates the index in the searchResults array (subtracting 1 as array indices start at 0).
  3. If a corresponding result exists, it replaces the citation with [number: source_url].
  4. If no matching result is found, the original citation is left unchanged.

The processed response with expanded citations is then returned and used as the final output of the AI assistant.
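
For example, calling the function with a short response and the hypothetical searchResults from earlier would expand the citation like this:

const response = "Authentication is the process of verifying a user's identity [1].";

console.log(processCitations(response, searchResults));
// => "Authentication is the process of verifying a user's identity [1: https://example.com/authn]."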

Creating a Prompt Engineering Test Suite

To ensure consistent AI response quality, we'll create a robust Prompt Engineering Test Suite using unit tests. This suite will systematically evaluate the AI's responses against predefined criteria.

Step 1: Define Test Cases

In the /tests folder of the project, you will find a testPrompts.json file with a variety of test cases, defined as an array of objects as follows:

[
  {
    "name": "Basic Authentication Definition",
    "input": "What is authentication?",
    "expectedOutput": "authentication verifying identity",
    "criteria": "contains"
  },
  {
    "name": "Authorization Definition",
    "input": "Define authorization in the context of security.",
    "expectedOutput": "authorization determines access resources actions",
    "criteria": "contains"
  }
  // ... more test cases ...
]

The name property indicates a description of the test case.

The input property contains the input for the AI assistant.

The expectedOutput property is a sequence of keywords related to the expected answer for the input.

The criteria property describes the criterion that makes the output valid. The keyword contains indicates that the keywords in the expectedOutput property must appear in the answer.
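
The test runner you'll see in the next step also supports length and citation criteria. Hypothetical test cases using them could look like this (these are illustrative, not part of the repo's test file):

{
  "name": "Response Length Limit",
  "input": "What is multi-factor authentication?",
  "expectedOutput": "50",
  "criteria": "length"
},
{
  "name": "Citation Format",
  "input": "What is single sign-on?",
  "expectedOutput": "",
  "criteria": "citation"
}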

Step 2: Implement the Test Runner

The promptTestingSuite.js file in the /tests folder implements the test runner for the test cases defined in the previous step. Let's analyze two functions implemented in this file to get a high-level idea of how the test runner works: evaluateResponse and runTests.

The code of the evaluateResponse function is as follows:

function evaluateResponse(actualOutput, expectedOutput, criteria) {
  const normalizeText = (text) => {
    return text
      .toLowerCase()
      .replace(
        /\b(the|a|an|and|or|but|in|on|at|to|for|of|with|by|from|up)\b/g,
        "",
      )
      .replace(/[^\w\s\[\]]/g, "")
      .trim();
  };

  const normalizedActual = normalizeText(actualOutput);
  const normalizedExpected = normalizeText(expectedOutput);

  switch (criteria) {
    case "exact":
      return normalizedActual === normalizedExpected;
    case "contains":
      const keywords = normalizedExpected
        .split(/\s+/)
        .filter((word) => word.length > 3);
      const matchCount = keywords.filter((keyword) =>
        normalizedActual.includes(keyword),
      ).length;
      return matchCount / keywords.length >= 0.7;
    case "length":
      return actualOutput.split(/\s+/).length <= parseInt(expectedOutput);
    case "citation":
      const openBrackets = (actualOutput.match(/\[/g) || []).length;
      const closeBrackets = (actualOutput.match(/\]/g) || []).length;
      return openBrackets > 0 && openBrackets === closeBrackets;
    default:
      return false;
  }
}

It first normalizes the actual and expected outputs for comparison by converting them to lowercase and removing common stop words and most punctuation.

Then it checks the output against one of the following criteria:

  • 'exact': Checks if normalized outputs are identical.
  • 'contains': Ensures a significant portion of keywords in expected output are present in actual output (at least 70%).
  • 'length': Verifies that the word count of actual output does not exceed the expected length.
  • 'citation': Checks for balanced square brackets in actual output to ensure proper citations.
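
As an illustration, a hypothetical call using the contains criterion would pass, because all three keywords survive normalization and appear in the response:

evaluateResponse(
  "Authentication is the process of verifying a user's identity [1].",
  "authentication verifying identity",
  "contains",
); // => true (3 of 3 keywords matched, above the 70% threshold)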

The runTests function is shown below:

async function runTests() {
  try {
    const prompts = JSON.parse(fs.readFileSync(PROMPTS_FILE, "utf8"));
    const results = [];

    for (const [index, prompt] of prompts.entries()) {
      console.log(
        `Running test ${index + 1}/${prompts.length}: ${prompt.name}`,
      );

      const startTime = Date.now();
      const response = await getAIResponse([
        { role: "user", content: prompt.input },
      ]);
      const endTime = Date.now();

      const result = {
        name: prompt.name,
        input: prompt.input,
        expectedOutput: prompt.expectedOutput,
        actualOutput: response,
        passed: evaluateResponse(
          response,
          prompt.expectedOutput,
          prompt.criteria,
        ),
        executionTime: endTime - startTime,
      };

      results.push(result);
      console.log(`Test ${index + 1} completed. Passed: ${result.passed}`);
    }

    fs.writeFileSync(RESULTS_FILE, JSON.stringify(results, null, 2));

    const passedTests = results.filter((r) => r.passed).length;
    console.log(`\nTest Summary:`);
    console.log(`Total Tests: ${results.length}`);
    console.log(`Passed: ${passedTests}`);
    console.log(`Failed: ${results.length - passedTests}`);
    console.log(`\nDetailed results saved to ${RESULTS_FILE}`);
  } catch (error) {
    console.error("Error running tests:", error);
  }
}

To run the test suite, it:

  • Reads prompts from the specified PROMPTS_FILE.
  • Initializes an empty results array to store test outcomes.
  • Iterates through prompts, logging progress and measuring response time for each test.
  • Evaluates the actual response against the expected output using evaluateResponse.
  • Stores results, including pass/fail status and execution time.
  • Writes summarized results to RESULTS_FILE and logs a summary of total, passed, and failed tests.

Step 3: Run the Test Suite

Now, all we have to do is run the tests with the following command:

node tests/promptTestingSuite.js

You will get a summary of the results on the screen and the details in the testResults.json file.
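
The on-screen summary follows the format of the console.log calls in runTests; for example (the counts below are illustrative):

Test Summary:
Total Tests: 10
Passed: 8
Failed: 2

Detailed results saved to testResults.json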

Step 4: Analyze Results

Review the generated testResults.json file to identify areas for improvement in your system prompt. Look for patterns in failed tests and adjust your prompts or AI model accordingly.
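
Each entry in testResults.json mirrors the result object built in runTests. A single passing entry might look like this (the values are illustrative):

{
  "name": "Basic Authentication Definition",
  "input": "What is authentication?",
  "expectedOutput": "authentication verifying identity",
  "actualOutput": "Authentication is the process of verifying a user's identity [1].",
  "passed": true,
  "executionTime": 1243
}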

Step 5: Iterate and Improve

Based on the test results:

  1. Refine your prompts to address any consistently failing tests.
  2. Add new test cases to cover edge cases or newly identified scenarios.
  3. Adjust the evaluation criteria if necessary to better capture the desired response quality.

Keep in mind that, unlike traditional testing, you're not aiming for a 100% passing score: depending on the model, that's unlikely to be achievable every time.

By systematically testing and iterating on your prompts, you can ensure consistent and high-quality AI responses across various scenarios. This approach helps maintain reliability and prevents regression as you continue to develop and refine your AI system.

Recap

Throughout this tutorial, you learned a few techniques to improve the reliability of your AI-based application and prevent hallucinations. In particular, you learned:

  • How to add context and constraints to your system prompt in order to make the underlying AI model more focused on the specific topics you provided.
  • How to process the output generated by the AI model to make it more helpful for the user. Specifically, you expanded the citations by adding the URL of the source document.
  • How to create automatic tests to ensure that the output generated by the AI model is what you expected.

By applying the strategies you learned, you can ensure that your AI models not only perform accurately but also stay grounded in facts and aligned with user expectations.

Thanks!
