Reflections from Resaro’s AI Research Intern, Yeo You Ming

You Ming, a penultimate-year Software Engineering student at the Singapore Institute of Technology (SIT), joined Resaro for a six-month internship.

Over the course of his stint, he worked across AI evaluation, assurance, and applied research, getting hands-on experience with how these systems are tested and assessed in practice. In this reflective piece, he shares what it was like juggling school with a full-time internship, the key projects he worked on, the challenges he ran into, and the lessons that shaped his time at Resaro.

When I started in January, I had two quiet hopes: that the computer vision skills I picked up during my time at the Digital and Intelligence Service (DIS) would finally come in handy, and that I could somehow survive juggling a part-time internship with full-time Y2 studies at SIT.

Both turned out to be true, though not always in the ways I expected.

Project 1: Designing Evaluation Frameworks for Generative AI

My first project was research-heavy in a way I hadn't experienced before. The brief sounded simple: “design evaluation frameworks for text-to-image and image-editing models across axes such as command execution, spatial reasoning, demographic control, and local contextual grounding”.

What I didn't appreciate at the start was how subjective rating model output actually is, and how much of the work was about making subjectivity tractable. Two people can look at the same generated image and disagree on whether it followed the prompt. Both can be right, given different assumptions. That meant the framework had to do more than just ask "is this image good?". It had to define what good meant, in a way that two evaluators reading the rubric would arrive at similar scores.

This involved lengthy team calls, where we worked through ideas like human preference alignment (i.e., how likely a human is to judge a given output as good) and temporal stability (i.e., whether a generated video stays visually coherent across frames without flickering or jitter). We also discussed multilingual prompt design and real-image stress tests to surface the failure modes that English-only benchmarks miss in a Singapore context.

I came out of this project with a more rigorous sense of what "evaluation" actually demands and a healthier respect for the fact that "objective metric" is often a polite phrase for "we agreed on a rubric and stuck to it."

Project 2: Building an Offline-Ready Evaluation Pipeline

My second project was a client delivery: a full end-to-end pipeline for filtering flawed images out of large batches of AI-generated images in an air-gapped environment.

Part 1: Choosing the Right Metrics

Before any production code, the team needed to answer a more fundamental question: which metrics should we ship? This phase was the most research-heavy chunk of the internship, and arguably the most consequential, as picking the wrong metric is harder to walk back than picking the wrong implementation of the right one.

I evaluated three candidates across image and video modalities: Soft-TIFA (an evaluation framework for prompt adherence under Meta’s GenEval 2 for images), Q-Align (an established open-source perceptual quality metric for images), and VideoReward (a three-dimension reward model for videos). Each was tested on its own controlled dataset against human judgments.

After this testing, we landed on a two-metric setup: Soft-TIFA scores prompt adherence and Q-Align scores the perceived visual quality of the outputs. We ruled out VideoReward due to its poor generalisation to client-specific content.

For Soft-TIFA, the production pipeline needed to score images fully automatically, so manually authoring questions per prompt wasn't feasible. After working through several options, I landed on using an LLM to generate questions directly from the image prompt. The next question I wanted answered was: does it actually correlate with what humans think? To find out, I built an image evaluation dataset of client-specific prompts, generated on two image generation models with deliberately contrasting quality (a SOTA model and an older one) so we'd have a clear good/bad signal. Every image was rated by a human reviewer who didn't know which model produced it. Soft-TIFA's correlation with human preference was meaningful, but not strong.

More interesting were the failure modes I found along the way. The Soft-TIFA evaluator from the GenEval 2 codebase has two methodology issues that silently degrade scores: it only properly handles yes/no and "how many" question formats, and its default per-image aggregator is not suitable for other question formats.

The biggest single finding, though, was that either-or question phrasing ("is X seated or stationary?") causes catastrophic score failures. The model interprets the "or" as a forced choice and outputs one of the alternative words instead of "Yes". This prompted me to design a Question Generation Checklist, with specific rules developed through careful testing, that any LLM authoring questions for the pipeline must follow. About two-thirds of the either-or questions in our demo dataset went from near-zero scores to passing once we rewrote them to follow the checklist.
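To make the either-or problem concrete, here is a minimal sketch of how generated questions could be linted before they reach the scorer. The three rules, the `check_question` helper, and the list of accepted opening words are all hypothetical illustrations; the actual Question Generation Checklist is not reproduced in this post and may look quite different.

```python
import re

# Hypothetical rule set: NOT the actual checklist, just an illustration
# of flagging either-or phrasing before a yes/no scorer sees the question.
_EITHER_OR = re.compile(r"\bor\b", re.IGNORECASE)
_YES_NO_OPENERS = ("is", "are", "does", "do", "has", "have", "can", "was", "were")


def check_question(question: str) -> list[str]:
    """Return the list of (hypothetical) rule violations for one question."""
    problems = []
    text = question.strip()
    words = text.lower().split()
    if not words or words[0] not in _YES_NO_OPENERS:
        problems.append("not phrased as a yes/no question")
    if _EITHER_OR.search(text):
        problems.append("contains an either-or alternative")
    if not text.endswith("?"):
        problems.append("missing question mark")
    return problems


# Made-up example questions.
questions = [
    "Is the person seated or stationary?",
    "Is the person seated?",
    "How is the person posed?",
]
report = {q: check_question(q) for q in questions}
for q, problems in report.items():
    print(q, "->", problems or "ok")
```

A word-level check like this is deliberately crude: it would also flag a legitimate question such as "Is there a cat or dog bed in the room?", so in practice it would only be a first filter before a human or LLM rewrite.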

For Q-Align, the question was simpler: does it work off-the-shelf, or do we need to fine-tune it? I ran a smaller validation study (20 hand-rated images across multiple text-to-image and image-editing services) and confirmed that Q-Align reliably agrees with humans on the good end of the scale, but systematically overestimates quality on mediocre images. That makes it a clean complement to Soft-TIFA, since one covers prompt adherence and the other covers visual quality, but it was unsuitable for our use case as-is.

I therefore ran a LoRA fine-tuning experiment on a self-curated training dataset of images generated by new, high-performing models such as Flux2.Klein and Z-Image, to see if we could improve Q-Align for our use case. The LoRA-fine-tuned Q-Align performed worse on my evaluation dataset, which contained a mix of very low-quality images (from older models) and good-quality images. The fine-tuning had specialised it for the narrow quality differences in newer SOTA model outputs, which made it worse at the more obvious good-vs-bad separation on the demo set. The implication for production was clear: Q-Align needs to be modular, with per-service LoRAs for cases where even decent-looking images have to be filtered out.

For VideoReward, the question was whether it generalised to video in a client-specific context. The answer was a clear "partially". I tested 15 hand-labelled videos across its three dimensions (Visual Quality, Motion Quality, Text Alignment). Text Alignment correlated strongly with human judgment. Visual Quality and Motion Quality essentially didn't: their correlations were close to zero. This makes sense in hindsight: VideoReward was trained on internet videos across general categories, and our prompts referenced client-specific use cases, which are out-of-distribution.

This confirmed our decision to scope the project down to images only, with the two-metric setup: Soft-TIFA for prompt adherence, Q-Align for perceived quality. I co-developed the pipeline with a fellow intern, and took charge of developing the prompt adherence component.

This phase was uncomfortable in a useful way. Most metrics "work" in their original papers, but nearly all of them have some sharp edge once you push them outside the conditions they were validated under. Knowing where those edges are turned out to be more valuable than picking the highest-scoring one.

Part 2: Shipping the Pipeline

With the metrics decided and the experiment findings documented, our proof of concept concluded with positive results and received a go-ahead from the client, so we moved on to shipping the end-to-end pipeline. The prompt adherence component has two parts: (1) an LLM that authors yes/no VQA questions from the prompt, following the checklist I designed, and (2) a vision-language model that answers those questions against the image. The per-question probabilities are then aggregated into a per-prompt score. In our case, all of this had to run with no internet access, on a single H100 shared with our client’s bespoke image generation model.
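As a rough illustration of that last scoring step, the sketch below turns a list of per-question "yes" probabilities into a single per-prompt score. The two aggregators shown (arithmetic and geometric mean) are assumptions chosen for illustration, not necessarily what the shipped pipeline or the GenEval 2 codebase uses, and the probabilities are made up.

```python
import math


def arithmetic_mean(probs: list[float]) -> float:
    """Average probability: one failed question only lowers the score a bit."""
    if not probs:
        raise ValueError("need at least one per-question probability")
    return sum(probs) / len(probs)


def geometric_mean(probs: list[float], eps: float = 1e-9) -> float:
    """Multiplicative average: one failed question drags the score down hard."""
    if not probs:
        raise ValueError("need at least one per-question probability")
    # Clamp at eps so a probability of exactly 0 does not hit log(0).
    log_sum = sum(math.log(max(p, eps)) for p in probs)
    return math.exp(log_sum / len(probs))


# Made-up probabilities for three questions about one prompt:
# two checks pass confidently, one clearly fails.
per_question = [0.95, 0.90, 0.10]
am = arithmetic_mean(per_question)
gm = geometric_mean(per_question)
print(f"arithmetic mean: {am:.3f}, geometric mean: {gm:.3f}")
```

The choice matters for filtering: the geometric mean is never higher than the arithmetic mean, so it is the stricter option when an image should be rejected as soon as any single requirement in the prompt is missed.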

The LLM spike. Before any production code, I spent weeks characterising the question-authoring side. I built a 75-prompt benchmark spanning simple compositions to Singapore-context scenes, generated a Claude baseline of VQA questions for each, then benchmarked four open-weight candidates (Qwen 3.5 9B, Granite 4.1 8B, Qwen 3.5 35B-A3B-AWQ, Gemma 4 31B-AWQ) against it under a Spearman correlation criterion, on which Gemma 4 31B-AWQ came out clearly ahead. This phase felt much closer to academic research than engineering: most of my time was spent reading papers, sanity-checking probability math, and debating choices in a notebook before any of it became code.
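For readers unfamiliar with the criterion, here is a small dependency-free sketch of a Spearman rank correlation between two lists of scores. Exactly which quantities were correlated in the benchmark is not spelled out above, so the score lists and candidate names below are made-up placeholders; the point is only that Spearman compares rankings rather than raw values.

```python
import math


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder scores for five prompts (illustrative only).
baseline = [0.91, 0.40, 0.75, 0.15, 0.62]
candidate_a = [0.88, 0.35, 0.80, 0.20, 0.55]  # same ordering as baseline
candidate_b = [0.50, 0.60, 0.30, 0.70, 0.40]  # mostly reversed ordering
rho_a = spearman(baseline, candidate_a)
rho_b = spearman(baseline, candidate_b)
print(f"candidate_a rho = {rho_a:.2f}, candidate_b rho = {rho_b:.2f}")
```

Because only the ordering matters, candidate_a reaches a perfect correlation even though none of its raw scores match the baseline exactly.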

The production phase. I shipped three components: a QuestionGenerator that loads Gemma via vLLM and produces questions against two prompt templates (text-to-image and image editing), with the Question Generation Checklist baked in as nine validity rules; a SoftTIFAScorer that loads Qwen3-VL via raw transformers and computes per-question probabilities from the first-token softmax, with the two methodology patches from Part 1 preserved; and a SoftTIFAPipeline that wires them together. I'm glad I managed to ship the offline-ready Soft-TIFA pipeline before the internship ended.
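The "first-token softmax" idea can be sketched without loading any model. The toy example below assumes the scorer reads the logits for the first answer token and sums the probability mass on "yes"-like tokens; the tiny three-token vocabulary, the logit values, and the `yes_probability` helper are hypothetical stand-ins and are not taken from the SoftTIFAScorer itself.

```python
import math


def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Numerically stable softmax over a token -> logit mapping."""
    peak = max(logits.values())
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def yes_probability(first_token_logits: dict[str, float],
                    yes_tokens: tuple[str, ...] = ("Yes", "yes")) -> float:
    """Probability mass that the first generated token is a 'yes' variant."""
    probs = softmax(first_token_logits)
    return sum(probs.get(tok, 0.0) for tok in yes_tokens)


# Toy first-token logits; a real model's vocabulary has many thousands of
# entries, and these numbers are invented for illustration.
clean_question = {"Yes": 4.0, "No": 1.0, "Seated": 0.0}
either_or_question = {"Yes": 0.5, "No": 0.0, "Seated": 4.0}

p_clean = yes_probability(clean_question)
p_either_or = yes_probability(either_or_question)
print(f"clean: {p_clean:.3f}, either-or: {p_either_or:.3f}")
```

This also shows why either-or phrasing is so damaging under this scoring scheme: if the model starts its answer with one of the offered alternatives, almost no probability mass is left on "Yes", even when the image is correct.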

Beyond the technical work

Beyond the tools and projects, what made this internship stand out was the environment itself. Resaro operates with a lot of autonomy - you're trusted to own your work - but that's balanced by the rhythm of sprint planning and regular check-ins that keep things grounded and moving. Whenever I hit unfamiliar territory, the team was generous with their time; it was never a matter of being left to figure things out alone.

Colleagues would sit down or get on a call with me, point me to the right resources, and walk me through the first steps rather than just hand me an answer. That was especially true during my first project on evaluation frameworks, which involved a surprisingly comprehensive process with a lot of moving parts. I was guided through the work patiently, from thinking about the right parameters to identifying quality indicators to running the actual evaluation, each step building on the last. It's the kind of mentorship that's hard to replicate in a classroom, and it will shape how I approach future projects in school.

This was also my first real exposure to proper software development methodology: Jira sprints, time estimates, ticket management. Tools I had only heard about in school suddenly became part of my daily workflow, and I now understand why they exist. Balancing all of this with school was no easy feat, but it pushed me to grow in ways classroom learning alone couldn't.

A huge thank you to the colleagues at Resaro who made this experience what it was: Jet, Wen Yi, Jun Yu, Miguel and Peter. I’m moving on to my next semester at university. Really grateful for everything Resaro taught me, and excited to bring this back into my studies 😊

We’re Hiring

We’re always looking for curious and motivated interns to join our team. If you’re interested in gaining hands-on experience with us, drop us a message through our contact form here: https://resaro.ai/contact
