18.07.2025

Resaro x Tookitaki (Global AI Assurance Pilot)

French philosopher Voltaire’s centuries-old caution still resonates: “Doubt is not a pleasant condition, but certainty is absurd.”

This tension between innovation and caution is one Tookitaki knows well. While building AI systems for financial crime compliance, the company has kept accuracy, explainability, and reliability front and center.

As part of the Global AI Assurance Pilot launched by IMDA and the AI Verify Foundation, Resaro conducted independent testing of Tookitaki’s FinMate, a GenAI assistant designed to streamline anti-money laundering (AML) investigations.

FinMate automates case summaries and responds to investigator queries, a role that demands both high accuracy and robust safeguards against misuse.

Our testing combined automated assessments of precision, recall, faithfulness, and misuse resistance. We ran batch jobs, API queries, database extracts, and log-file analysis to evaluate FinMate’s performance, drawing on diverse datasets spanning multiple languages and complexity levels, including missing and corrupted inputs, to probe resilience. This rigorous, independent scrutiny helped catch hallucinations, verify factual accuracy, and confirm that governance controls were airtight.
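To make the shape of these checks concrete, the sketch below shows one minimal way fact-level precision, recall, and faithfulness could be scored over a batch of generated case summaries. It is purely illustrative: the function names are hypothetical, and the naive sentence-level claim matching stands in for the NLI- or judge-based verification a production harness would use. This is not Resaro’s actual test harness.

```python
# Illustrative sketch only: fact-level scoring of generated case summaries.
# All names are hypothetical; this is not Resaro's actual test harness.
from dataclasses import dataclass


@dataclass
class CaseResult:
    precision: float  # share of generated claims supported by the record
    recall: float     # share of reference facts the summary covers
    faithful: bool    # True if the summary contains no unsupported claims


def extract_claims(summary: str) -> set[str]:
    """Naive stand-in claim extractor: one normalised claim per sentence.
    A production pipeline would use an NLI model or an LLM judge instead."""
    return {s.strip().lower() for s in summary.split(".") if s.strip()}


def evaluate_case(summary: str, reference_facts: set[str]) -> CaseResult:
    claims = extract_claims(summary)
    supported = claims & reference_facts
    precision = len(supported) / len(claims) if claims else 0.0
    recall = len(supported) / len(reference_facts) if reference_facts else 0.0
    return CaseResult(precision, recall, faithful=(claims == supported))


# Batch evaluation over exported summaries, as in a batch job or log audit.
cases = [
    ("Account flagged for structuring. Transactions exceed threshold.",
     {"account flagged for structuring", "transactions exceed threshold"}),
    ("Customer is a politically exposed person. Funds routed via a shell firm.",
     {"customer is a politically exposed person"}),  # second claim unsupported
]
for summary, facts in cases:
    r = evaluate_case(summary, facts)
    print(f"precision={r.precision:.2f} recall={r.recall:.2f} faithful={r.faithful}")
```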

“Working with Resaro on the Global AI Assurance Pilot was highly productive. Their rigorous and collaborative approach helped identify valuable insights, significantly improving our testing practices and enhancing product resilience. Collaborations like this reinforce Tookitaki’s commitment to trustworthy AI and advance our mission to build the trust layer powering the future of financial services. We appreciate Resaro’s role in advancing independent AI assurance practices globally.” — Abhishek Chatterjee, Founder & CEO of Tookitaki

Key Lessons

1/ You can’t evaluate quality without first defining the context of use—who the user is, what tasks they are trying to complete, and what “good enough” quality looks like. That sounds basic, but it is often overlooked.

2/ Risks tend to emerge from system-level interactions. It is not merely the underlying model that matters, but also how it is prompted, what guardrails exist, and how outputs are consumed. We found, for example, that the same LLM used across different systems could pass or fail hallucination tests depending on prompt structure and domain-specific context (a minimal sketch of this effect follows the list).

3/ The scope of GenAI application testing should be proportional to the risks identified. We found more value in right-sizing the breadth and depth of technical tests to the use case first, and only then applying the corresponding metrics and indicators to produce results and insights that reflect the quality levels expected of the GenAI application.
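To illustrate the second lesson, the sketch below scores the same model under two prompt templates; only the prompt differs, yet the measured hallucination rate can diverge. Everything here, including the templates, function names, and naive claim matching, is hypothetical and deliberately simplified.

```python
# Illustrative only: the same model can pass or fail a hallucination check
# depending purely on prompt structure. Templates and names are hypothetical.

def claims_of(text: str) -> set[str]:
    # Naive stand-in for claim extraction: one claim per sentence.
    return {s.strip().lower() for s in text.split(".") if s.strip()}


def hallucination_rate(generate, template, cases) -> float:
    """Share of generated claims unsupported by the case record.
    `generate` is any callable wrapping the model under test;
    `cases` is a list of (case_text, reference_facts) pairs."""
    unsupported = total = 0
    for case_text, facts in cases:
        claims = claims_of(generate(template.format(case=case_text)))
        unsupported += len(claims - facts)
        total += len(claims)
    return unsupported / total if total else 0.0


UNGROUNDED = "Summarise this AML case: {case}"
GROUNDED = ("Summarise this AML case using ONLY the facts given. "
            "If a detail is not present, write 'not stated'. Case: {case}")

# The same `generate` callable is scored under both templates:
# rate_a = hallucination_rate(generate, UNGROUNDED, test_cases)
# rate_b = hallucination_rate(generate, GROUNDED, test_cases)
```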

Read the full case study: https://assurance.aiverifyfoundation.sg/wp-content/uploads/2025/05/Tookitaki-X-Resaro.pdf


More about Resaro and our role in the Global AI Assurance Pilot

Resaro is an independent, third-party AI assurance provider that builds AI testing tools to evaluate the performance, safety, and security of dual-use AI systems. With co-headquarters in Singapore and Europe, we combine our global engineering capabilities with deep domain expertise in mission-critical use cases to offer scientific and reliable approaches to testing.

In the Global AI Assurance Pilot, we collaborated with industry partners to conduct comprehensive application-level testing of their GenAI applications. Our role was first to assess potential quality risks in each GenAI application through architectural deep-dives and stakeholder interviews. We then identified quality indicators and metrics for these risk areas and implemented technical tests covering performance, hallucination detection, and robustness. The results and insights from these tests were documented and shared with our partners.
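As a notional example, a risk-to-test mapping produced by such an assessment might look like the following. The risk areas, metric names, and thresholds are invented for illustration and are not drawn from the FinMate engagement.

```python
# Hypothetical test plan: risk areas mapped to metrics and pass thresholds.
# All areas, metric names, and numbers are invented for illustration.
TEST_PLAN = {
    "summary_accuracy": {
        "metrics": ["fact_precision", "fact_recall"],
        "pass_if": {"fact_precision": 0.95, "fact_recall": 0.90},
    },
    "hallucination": {
        "metrics": ["unsupported_claim_rate"],
        "pass_if": {"unsupported_claim_rate": 0.02},  # at most 2% of claims
    },
    "robustness": {
        "metrics": ["degraded_input_accuracy"],       # missing/corrupted inputs
        "pass_if": {"degraded_input_accuracy": 0.85},
    },
    "misuse": {
        "metrics": ["jailbreak_refusal_rate"],
        "pass_if": {"jailbreak_refusal_rate": 0.99},
    },
}
```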

We joined the Pilot because we saw an urgent need to move beyond model-level benchmarks and focus on how GenAI performs where it matters most: in the hands of end users. Many of today’s risks, such as hallucinated content and bias, do not show up in foundation-model testing alone. They emerge only when the full system is in use.

Our goal was to help demonstrate how application-level testing can surface these issues in a structured, repeatable way. We also wanted to help shape a shared language of quality, so that regulators, deployers, and testers can better align on what “good enough” looks like for GenAI applications. Through this, we hope to draw attention to the importance of third-party testing, especially for GenAI systems.

The initiative also showed that repeated testing brings greater observability, especially for GenAI applications where emergent risks and threats evolve rapidly. With a quality framework to benchmark a GenAI application in its context of use, organisations can equip themselves with the right tools and assets to communicate assurance to their stakeholder groups in a repeatable, scalable way.

Our Co-CEO April Chin recently joined CNA’s Genevieve Woo and Simon Chesterman on CNA Singapore Tonight to discuss the Pilot: https://www.youtube.com/watch?si=iUs3VtON-m-4RgFS&v=W1YJe3Ab5Og&feature=youtu.be