Global AI Assurance Project: How Unique’s Use Cases Are Tested By Partner QuantPi

by Dr. Sina Wulfmeyer
Jun 3, 2025

As generative AI becomes more embedded in financial workflows, the challenge isn't just building smart tools – it's making sure they're safe, fair, and reliable. For financial institutions, where the stakes are high and the rules are strict, that assurance is non-negotiable. That’s why Unique, a pioneer in agentic AI for finance, joined forces with QuantPi, a deep-tech startup from Germany’s CISPA research institute, to put their AI systems to the test. Together, they explored how to meaningfully evaluate the performance and integrity of Large Language Models (LLMs) in real-world, regulated environments.


As Shameek Kundu aptly said at Asia Tech x Singapore: 

"Our goal is to make AI boring and predictable because that's what it takes to deploy in the real world."


The Global AI Assurance Project, led by the Singapore-based AI Verify Foundation (an initiative of IMDA), focused on technical testing of GenAI applications – not just the foundation models, but their real-world implementations. 
 
The result is a field-tested blueprint showing how 16 independent assurance teams evaluated 17 GenAI applications across industries such as healthcare, banking, and aviation for accuracy, robustness, and reliability. 

 

Why AI Assurance Matters Now

 

Generative AI holds incredible promise for financial services, enabling everything from investment analysis to client communications. But with that power comes serious risks: hallucinations, biased recommendations, regulatory breaches, and data misuse.

Unique's Investment Research Assistant, an AI-powered tool used by top-tier financial institutions like LGT Private Banking, Pictet Group, Julius Baer, and SIX, represents exactly the kind of critical application that demands rigorous testing. To ensure it performs accurately, fairly, and safely, Unique and QuantPi joined forces in a pilot under the Global AI Assurance initiative.

 

Key Players

 

Unique AG

Unique develops enterprise-ready, agentic AI solutions tailored to the financial industry. Their Investment Research Assistant helps banking professionals query stock universes, extract insights, and generate client-ready investment stories. Unique is one of the first European firms to certify its AI Management System to ISO 42001, setting a new benchmark in AI governance.

QuantPi

A spin-off from CISPA Helmholtz Center for Information Security, QuantPi brings state-of-the-art black-box AI testing via its PiCrystal platform. QuantPi enables standardized evaluation of AI models for performance, fairness, and robustness – regardless of data type or model provider.

 

Unique’s Contribution: Collaborative Testing for GenAI – A Spotlight on Key Use Cases 

 

Dr. Martin Fadler and Sina Wulfmeyer are proud to have played a pivotal role in advancing the technical testing of GenAI applications. In collaboration with our testing partner, QuantPi, we focused on two critical use cases: 

1️⃣ Investment Insights Agent 
This use case supports bank relationship managers by enabling them to query stock universes, analyze fact sheets, and generate tailored investment recommendations. 

2️⃣ Document Extraction 
A vital sub-step within the Investment Insights Agent pipeline, this process ensures accurate and efficient data extraction to support downstream decision-making. 

 

Key GenAI Risks Identified During the Pilot

 

1 Regulatory and Policy Non-Adherence 
Non-compliance with consumer protection laws or internal policies can have serious consequences, including reputational damage, regulatory penalties, and inconsistent or inappropriate advice. For instance, recommending unsuitable products, such as cryptocurrencies, could harm both clients and the firm. 

2 Fairness and Inclusivity Concerns 
Biases in underlying models or poor performance for specific client groups can result in unfair or inappropriate investment recommendations. This not only undermines trust but also challenges the inclusivity and equity of the advisory process, potentially favoring certain companies or countries. 

3 Trust and Transparency Issues 
A lack of clarity in explaining recommendations, coupled with over-reliance on client-specific data, can erode trust. This may compromise the judgment of advisors and heighten the risk of data misuse, further straining client relationships. 

4 Ethical and Financial Risks 
The misuse of advisory tools – whether for personal financial gain or biased recommendations – poses significant ethical and financial risks. Such actions can lead to regulatory scrutiny, ethical violations, and unfavorable financial outcomes for both clients and the firm. 

5 Client Satisfaction and Stability 
Biased recommendations and poor financial outcomes not only diminish client satisfaction but also jeopardize the firm’s financial stability and long-term credibility. Ensuring fairness and accuracy is essential to maintaining trust and fostering sustainable growth. 

 

How Were The Risks Tested?

Use Case 1 – Investment Research Assistant
  • Accuracy risks: Cosine Similarity between predicted responses and ground truth

  • Hallucination risks: Faithfulness metric to check if responses were grounded in provided context

  • Robustness/Reliability risks: Tested across query difficulty levels, domain bias, and typo tolerance
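The accuracy and robustness checks above can be sketched in a few lines. The pilot write-up does not name the embedding model used for the cosine-similarity comparison, so a toy term-frequency vectorizer stands in here, and the typo-perturbation helper is one plausible way to generate the typo-tolerance test inputs, not QuantPi's actual method:

```python
import random
from collections import Counter

import numpy as np


def cosine_similarity_score(predicted: str, ground_truth: str) -> float:
    """Cosine similarity between bag-of-words vectors of two answers.

    Assumption: the real pipeline would embed both texts with a proper
    sentence-embedding model; simple term frequencies stand in here.
    """
    p = Counter(predicted.lower().split())
    g = Counter(ground_truth.lower().split())
    vocab = sorted(set(p) | set(g))
    v1 = np.array([p[w] for w in vocab], dtype=float)
    v2 = np.array([g[w] for w in vocab], dtype=float)
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(v1 @ v2 / denom) if denom else 0.0


def perturb_with_typos(text: str, rate: float = 0.3, seed: int = 0) -> str:
    """Swap adjacent characters in a fraction of words to probe typo tolerance."""
    rng = random.Random(seed)
    out = []
    for w in text.split():
        if len(w) > 3 and rng.random() < rate:
            i = rng.randrange(len(w) - 1)
            w = w[:i] + w[i + 1] + w[i] + w[i + 2:]
        out.append(w)
    return " ".join(out)
```

A robustness run would then compare `cosine_similarity_score` on clean queries against the same queries passed through `perturb_with_typos`, flagging any large drop as a reliability concern.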

Use Case 2 – Document Search

  • Risks of inaccurate or irrelevant results: measured using three metrics:

    • Word Overlap Rate

    • Mean Reciprocal Rank (MRR)

    • Lenient Retrieval Accuracy
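The three retrieval metrics can be sketched as follows. The report names the metrics but does not spell out their exact definitions, so these are minimal, conventional interpretations – in particular, "lenient retrieval accuracy" is read here as a top-k hit rate:

```python
def word_overlap_rate(retrieved: str, expected: str) -> float:
    """Fraction of the expected passage's unique words found in the retrieved text."""
    exp = set(expected.lower().split())
    ret = set(retrieved.lower().split())
    return len(exp & ret) / len(exp) if exp else 0.0


def mean_reciprocal_rank(ranked_results: list[list[str]], relevant: list[str]) -> float:
    """MRR: average over queries of 1/rank of the first relevant hit (0 if missing)."""
    total = 0.0
    for results, rel in zip(ranked_results, relevant):
        for rank, doc_id in enumerate(results, start=1):
            if doc_id == rel:
                total += 1.0 / rank
                break
    return total / len(relevant)


def lenient_retrieval_accuracy(ranked_results: list[list[str]],
                               relevant: list[str], k: int = 3) -> float:
    """Share of queries whose relevant document appears in the top k results."""
    hits = sum(rel in results[:k] for results, rel in zip(ranked_results, relevant))
    return hits / len(relevant)
```

For example, if the relevant document sits at rank 2 for one query and rank 1 for another, MRR is (1/2 + 1) / 2 = 0.75, while lenient accuracy at k=3 counts both as hits.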

 

The Results: How We Manage Risks for the Investment Insights Agent

 

The pilot project, a first-of-its-kind initiative, sets the stage for establishing future norms and standards in the technical testing of Generative AI (GenAI) applications. While our project demonstrated significant progress, it also highlighted certain limitations. Due to constraints in data availability, testing was conducted using a limited set of documents. This small sample size inherently restricts the generalizability of the results and reduces their statistical significance. 

Moreover, the rapidly evolving nature of GenAI presents unique challenges in risk identification and management. With new models emerging frequently, pinpointing the most critical risks is complex. These risks span several areas, including the dynamic and ever-changing underlying data, vulnerabilities within the large language models (LLMs) and their providers, and risks associated with prompting techniques and social engineering. Addressing these challenges requires a forward-looking, adaptive approach to ensure robust and reliable outcomes in this emerging field. 

 

Key Lessons in Testing GenAI Applications 

 

1️⃣ Context is King 
Testing Generative AI (GenAI) is far from a one-size-fits-all approach. The specific context of your application dictates which risks are relevant and which can be deprioritized. Investing time upfront to design tailored and effective tests is critical to achieving meaningful results. 

2️⃣ The Challenge of the Golden Dataset 
Identifying a dataset that seamlessly works across diverse use cases, prompts, and evolving Retrieval-Augmented Generation (RAG) data is no small feat. This challenge becomes even more pronounced in highly regulated industries like Financial Services, where compliance and precision are paramount. 

3️⃣ LLMs Can’t Replace Human Judgment 
While Large Language Models (LLMs) excel at processing and evaluating information, they are not infallible. Human judgment remains an essential component to ensure nuanced and contextually appropriate outcomes. 

4️⃣ Test the Entire Pipeline 
Focusing solely on the final output is insufficient. For agent-based systems, it’s equally important to test interim steps, such as document extraction, to uncover and address risks throughout the entire pipeline. Comprehensive testing ensures a more robust and reliable system. 

 

Helpful Links: 

1. Unique’s Case Study: Case Studies – Global AI Assurance Pilot 

2. Executive Summary – Global AI Assurance Pilot 

3. PDF of Unique’s Case Study: Case Studies – Global AI Assurance Pilot

4. AI Verify Post: GenAI testing: A first-of-its-kind pilot for reliability – AI Verify Foundation on LinkedIn

📺 CNA: https://lnkd.in/gs3SGHvF 
📰 Straits Times: https://lnkd.in/gdxhVmtR 
📄 Main Report: https://lnkd.in/gvGRiswn