0
08.05.2025
1. In the era of AI, applications are the primary interface through which users engage with AI technologies, like large language models (LLMs). They play a pivotal role in delivering LLM-driven solutions to end-users, making it essential to ensure they function accurately, reliably, and safely. However, testing applications presents unique challenges. Unlike general-purpose AI models, testing LLM-based applications requires addressing the complexities of use cases, industry requirements, and the diverse variations in language and user interactions.
2. While internal testing is indispensable for establishing the baseline safety and performance of LLM applications, third-party (3P) testing provides an additional layer of scrutiny. This is particularly important given the evolving and nascent state of science in LLM application testing. 3P testing allows organisations to enhance safety by identifying potential blind spots and uncovering risks that might otherwise go unnoticed. It also builds greater trust among users and other stakeholders by ensuring high standards of reliability.
3. Recognising this, the Infocomm Media Development Authority of Singapore (IMDA) partnered with Singapore Airlines (SIA) and Resaro to conduct a 3P testing of SIA’s LLM-based search assistant application. The objective of the exercise (and this blogpost) is to develop and share a structured methodology as a reference point for future testers of similar RAG-based search assistant applications (a common use case across industries today), and contribute to global efforts to improve and standardise testing of LLM applications.
4. As outlined in Bengio et al. (2025)1 , risk identification and assessment form the initial stages of AI risk management practices. We followed this approach to determine key areas of concern for testing. We first identified potential risks and unintended outcomes that could arise from the application’s use. We then analysed the severity and likelihood of these risks, taking into account existing safety measures implemented in the application. The outcome led us to focus on understanding the risk of inaccurate or incomplete responses, which could mislead users and diminish trust in the application. Significantly, SIA’s Enterprise Risk Management Framework had also identified this risk as a key concern, providing an additional layer of validation to our assessment and emphasising the critical nature of addressing this issue in AI systems. To address this, testing assessed the accuracy and completeness of the application’s responses. A largely synthetic testing dataset was designed to simulate real-world usage scenarios. The testing followed a two-stage approach: First, we measured the end-to-end performance of the application – looking at overall correctness of its output against ground-truth data. Second, we looked into component-level performance of the pipeline using more granular metrics for the application’s retrieval and synthesis mechanisms to obtain actionable insights for improvement. Both stages were conducted on non-perturbed and perturbed questions, with the latter providing insight into the application’s robustness.
5. This testing approach provided structured and human-interpretable insights, facilitating informed deployment decisions. In making such decisions, organisations should set appropriate thresholds based on their risk tolerance and assess issues identified through testing in relation to real-world acceptability. Beyond pre-deployment testing, organisations should also implement continuous monitoring to detect emerging issues and ensure sustained application safety and reliability. Finally, the exercise concluded with an examination of possible future enhancements to the testing process, such as the incorporation of metrics for linguistic quality to better capture user experience.
6. Retrieval-Augmented Generation (RAG) has emerged as a widely adopted framework for enhancing generative AI applications by grounding outputs in relevant external data. RAG-based applications typically combine information retrieval with LLM-based text generation, enabling the generation of responses that are both accurate and contextually relevant. This approach is highly versatile and is used in various industries for different types of applications.
7. One implementation of RAG is in the development of LLM-based search applications. These represent a common industry archetype that enhance users’ ability to access, interpret and synthesise information efficiently, whether from databases or the web. Implementing the RAG framework, these applications typically integrate information retrieval with LLM-based text generation to deliver synthesised and contextually relevant responses to user queries.
8. The SIA search assistant, a consumer-facing application integrated into the organisation’s public website, exemplifies this archetype. Its primary purpose is to help users derive answers to their queries relating to travel on SIA. As set out in Figure 1, the application comprised four key components:
Figure 1 – Simplified architecture of SIA’s application
9. To ensure a structured and targeted approach to testing, a risk identification and assessment process was undertaken2. This process aimed to identify key areas of concern where the application’s performance could have significant implications and informed the selection and development of appropriate testing methodologies.
10. In the initial risk identification phase, a systematic analysis of potential risks and unintended outcomes that could arise from the application’s use was conducted. As outlined in Bengio et al. (2025), this stage drew on AI-specific risk taxonomies3. The analysis identified five primary categories of risks:
11. Following risk identification, each risk was assessed based on its severity and likelihood of occurrence. This phase considered the effectiveness of mitigation measures implemented by SIA during the application’s development as these would have an impact on the risk’s severity and likelihood of occurrence. SIA had employed a comprehensive, multilayered approach to mitigate Gen AI risks, implementing enterprise-wide, application-level and use-case specific controls. For example, during internal testing, SIA had proactively identified potential edge cases where certain types of user inputs could challenge the application. To address this, SIA developed an innovative layered prompt structure to enhance the application’s robustness. This approach encapsulates the user’s query within two prompts: (a) a pre-prompt placed before the user query to reinforce appropriate response behaviour; and (b) a post-prompt that reiterated key safety instructions.
12. Taking the safeguards into account, the residual risk of hallucinations and inaccuracies remained a key concern, especially given their potential real-world consequences. For example, if the search application retrieves the wrong documents or misinforms users about travel requirements, it could result in financial losses or reputational damage for the organisation. This assessment was further supported by internal assessments that SIA had conducted earlier using its Enterprise Risk Management Framework, which had also identified this risk as a key concern. As a result, testing focused on assessing the accuracy and completeness of the application’s output to identify instances of factual inaccuracies, omissions or fabricated details.
13. As part of application development, SIA had conducted multiple rounds of internal testing with questions provided by various stakeholders across its different business units. Aligned with the risk assessment stated above, testing focused on output correctness, particularly the accuracy and completeness of the application’s response, and the relevance of the links provided in the response. As part of SIA’s commitment to continuous improvement, internal testing identified opportunities for enhancement. These insights were valuable in refining the application’s performance, and SIA promptly implemented the necessary improvements prior to 3P testing.
14. SIA supplemented its internal testing with 3P testing by Resaro. This enabled independent validation by experts in LLM application testing and the detection of potential issues that might have been missed during the internal testing process. This exercise had two main objectives:
15. With these objectives in mind, Resaro developed a framework to assess the application in a human-interpretable and actionable manner. The framework, illustrated in Figure 2, focused on the primary metric of correctness of the answer relative to the ground truth. This aligned with the outcome from the risk identification and assessment exercise and provided a high-level assessment of the application’s output. Secondary metrics were introduced to provide a more granular understanding of the application’s output by categorising issues and identifying where they occurred in the pipeline. These metrics allowed for deeper analysis when the primary metric of generation correctness signalled potential concerns.
Figure 2 - Overview of the testing framework
16. To ensure comprehensive testing, SIA and Resaro collaborated to create a diverse dataset that included both real-world examples and carefully generated synthetic data, providing a broad range of scenarios for thorough evaluation. To start with, SIA provided access to the questions used in its internal testing, as well as ground-truth retrieved documents and answers for these. Resaro expanded this dataset in two ways:
17. The final dataset comprised 2,800 questions, offering a comprehensive mix of real-world and synthetic data. Approximately two-thirds of the dataset was fully synthetic, while the remaining one-third was partially synthetic, based on perturbations of SIA’s internal testing dataset.
18. Application testing comprised two stages, which were carried out in relation to both the non-perturbed (to identify baseline performance) and the perturbed questions (to assess robustness):
First Stage: Baseline Performance – Testing of the Final Output
19. The first stage focused on measuring the correctness of the application’s final output , providing insights into its alignment with ground-truth data. This stage comprised of the following four steps:
Second Stage: Deep Dive into Baseline Performance – Testing of Individual Components
20. The F1 score from the first stage provided a high-level measure of the application’s performance but lacked the granularity needed to provide actionable insights into specific components within the application pipeline. For example, precision challenges, such as the inclusion of incorrect or extraneous information in the output, could stem from issues like hallucination (i.e., content not grounded in retrieved documents) or retrieval of irrelevant documents. However, the F1 score alone would not shed light on this, highlighting the need for other metrics.
21. The second stage of testing therefore aimed to test the individual components of the application using three additional metrics:
First and Second Stages: Robustness Testing – Perturbed Questions
22. This portion of the testing process examined the application’s ability to handle perturbed questions that reflect real-world variations in user input, such as misspellings and the use of synonyms. Robustness testing offered a valuable perspective on the application’s capacity to manage such diverse inputs, effectively complementing the baseline performance test.
23. These metrics from testing are useful in providing actionable insights:
24. To move from metrics to a go/no-go decision on deployments requires thresholds. These need to be determined by the organisation based on the nature of the application and the tolerance for error. In making a deployment decision, organisations may consider:
25. While pre-deployment testing helps to validate that the application meets the organisation’s own internal safety standards, continuous testing of the application’s output and behavior post-deployment is equally important as it can help detect emerging issues in real time, enabling timely intervention. One should run these tests when there are material changes made to the application, to check if the changes have impacted safety and performance. However, even if no changes are made, such testing is still important. For example, it allows for the detection of data drift, where the real-world data that the application encounters during deployment differs from the training data it was built on. Such testing for data drift can be performed by capturing a share of the production data that is as different as possible from the pre-deployment data and drafting ground-truth responses to test against the actual responses of the application.
26. The current testing methodology provided a structured framework for testing the output correctness of a RAG-based website search application. By combining various metrics such as precision, recall and generation faithfulness, the approach offered valuable insights into the application’s ability to provide accurate and complete responses, at both the output and component levels.
27. The test design laid a foundation for future iterations that could: (a) account for other factors in user satisfaction, such as clarity, relevance, and fluency; (b) take into account the relative importance of the claims in the ground-truth data; and (c) account for LLMs’ tendency to generate varying outputs for the same query. These aspects, elaborated upon below, represent opportunities for a more comprehensive evaluation of the application’s performance.
28. Capturing linguistic quality and user experience: Responses that are correct but poorly structured or difficult to understand can undermine an application’s utility. Additional metrics such as linguistic quality, relevance or user experience are thus vital to obtaining a more comprehensive understanding of the application’s overall performance. That said, there will still be a need for ground truth-based metrics. User feedback is useful to understand user experience but given that users are unlikely to have ground-truth information, it is unlikely to accurately capture the correctness of a response.
29. Prioritising claims: Incorporating a measure of relative importance between claims would facilitate a prioritisation of the most critical claims, allowing the testing process to better align with human judgment and provide more meaningful insights into the application’s performance.
30. Executing multiple runs: Since LLMs are probabilistic and can generate varying outputs for the same query, conducting multiple runs per input can help improve the statistical robustness of the testing process. Averaging performance across multiple runs would provide a more reliable measure of the application’s output correctness, reducing the impact of response variability. Additionally, multiple runs allow for the identification of inputs that the application consistently struggles to get right, highlighting particularly problematic queries that may warrant greater focus.
31. This exercise demonstrated that LLM application testing, complemented with 3P testing, is an important step in ensuring overall AI safety and reliability. The structured approach shared serves as a reference point for future testers and contributes to global efforts to standardise LLM application testing. Nonetheless, opportunities for methodological improvement remain, given that the science of AI testing is nascent and evolving quickly. Also, testing should also not be a one-time exercise; it must be conducted periodically and complemented by continuous monitoring to address evolving challenges and maintain application safety and reliability over time.
1 Bengio, Y., Mindermann, S., Privitera, D., Besiroglu, T., Bommasani, R., Casper, S., Choi, Y., Fox, P., Garfinkel, B., Goldfarb, D., Heidari, H., Ho, A., Kapoor, S., Khalatbari, L., Longpre, S., Manning, S., Mavroudis, V., Mazeika, M., Michael, J., … Zeng, Y. (2025). International AI Safety Report. arXiv. https://doi.org/10.48550/arXiv.2501.17805
2 As set out in Bengio et al. (2025), risk identification and assessment form the initial stages of AI risk management practices. Bengio, Y., Mindermann, S., Privitera, D., Besiroglu, T., Bommasani, R., Casper, S., Choi, Y., Fox, P., Garfinkel, B., Goldfarb, D., Heidari, H., Ho, A., Kapoor, S., Khalatbari, L., Longpre, S., Manning, S., Mavroudis, V., Mazeika, M., Michael, J., … Zeng, Y. (2025). International AI Safety Report. arXiv. https://doi.org/10.48550/arXiv.2501.17805
3 We referred to the risk taxonomy in Slattery et al. (2025). Slattery, P., Saeri, A. K., Grundy, E. A. C., Graham, J., Noetel, M., Uuk, R., Dao, J., Pour, S., Casper, S., & Thompson, N. (2024). The AI risk repository: A comprehensive meta-review, database, and taxonomy of risks from artificial intelligence. arXiv. https://doi.org/10.48550/arXiv.2408.12622
4 Claims are distinct pieces of information or factual statements. For example, a generated response from the application might be “Travellers require a visa for entry and their passport must be valid for at least six months”. This response might be broken down into two claims: (1) A visa is required for entry and (2) The passport must be valid for at least six months.
5 Precision measures the accuracy of a system’s outputs by assessing how many of its results are correct. Imagine searching for “red shoes” online. High precision means that most of the results are actually red shoes, not sandals.
6 Recall measures the completeness of a system’s outputs by assessing how many of the total possible correct results it successfully identifies. Using the same example, high recall means the search engine found most of the red shoes available online, even if it also showed other red items.
7 The F1 score combines precision and recall into a single score. It measures the overall accuracy of a system, balancing showing only relevant results (i.e, precision) with finding all the relevant results (i.e. recall). A high F1 score for red shoes would mean that most of the results are red shoes and not too many other things.
8 For example, in email spam filters, while high recall (i.e. catching all spam emails) is beneficial, it is generally less harmful to let some spam emails through than to misclassify critical messages as spam and risk them being overlooked.