As generative AI applications move into real-world deployment, the question is no longer just how models perform in isolation, but how entire systems behave in practice. This paper looks at the role of application-level testing in evaluating generative AI systems — testing them through the interfaces and workflows that users actually interact with.
Drawing on four case studies conducted by Resaro under the Infocomm Media Development Authority (IMDA)’s Global AI Assurance Pilot, the evaluations cover applications in recruitment, financial crime compliance, healthcare authoring, and enterprise search. Each case applies structured testing approaches based on the IMDA Starter Kit for Safety Testing of LLM-based applications, assessing risks such as hallucinations, bias, data disclosure, and vulnerability to adversarial prompts.
The results highlight why testing at the application layer matters. Many real-world issues emerge not from the foundation model itself, but from how models are integrated with prompts, retrieval systems, tools, and user interfaces. Application-level testing provides a practical way to surface these risks and evaluate whether generative AI systems perform reliably and safely under operational conditions.