The Problem: Judging the Best LLM Response
The WSDM Cup 2025 challenged participants with a deceptively simple yet deeply technical task:
Given a user prompt and two AI-generated responses, predict which one is better.
This wasn't just a binary classification problem—it was about understanding how LLMs behave, how users judge quality, and how we can automate that judgment. In an era where generative models power everything from chatbots to copilots, this task is directly applicable to LLM alignment, safety, and real-time feedback systems.
🛠️ My Dual Approach: Embedding-Based + Fine-Tuned Models
To tackle the challenge, I explored two distinct pipelines:
1. 🔍 Embedding + LightGBM Classifier
Baseline Setup:
Used the Universal Sentence Encoder (USE) to embed prompt, response_a, and response_b.
Extracted contextual chunks from each text (start, middle, end) to preserve tone and logical flow.
Generated similarity and statistical features between prompt–response pairs.
Model: Trained a LightGBM classifier using the engineered features.
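Here's a minimal sketch of that pipeline. The feature set and hyperparameters below are illustrative, and train_rows stands in for the competition data with its prompt / response_a / response_b / winner columns (the "model_a"/"model_b" label encoding is my assumption):

```python
import numpy as np
import tensorflow_hub as hub
import lightgbm as lgb

# Load USE once; it maps arbitrary-length text to a 512-dim vector.
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def pair_features(prompt: str, response: str) -> list:
    """Similarity and simple statistical features for one prompt-response pair."""
    p, r = use([prompt, response]).numpy()
    cosine = float(np.dot(p, r) / (np.linalg.norm(p) * np.linalg.norm(r)))
    return [cosine, len(response.split()), len(response) / max(len(prompt), 1)]

def row_features(row: dict) -> list:
    fa = pair_features(row["prompt"], row["response_a"])
    fb = pair_features(row["prompt"], row["response_b"])
    return fa + fb + [fa[0] - fb[0]]  # cross-response similarity delta

# train_rows: list of dicts with the competition columns (assumed encoding).
X = np.array([row_features(r) for r in train_rows])
y = np.array([1 if r["winner"] == "model_b" else 0 for r in train_rows])

clf = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
clf.fit(X, y)
```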
Why It Worked:
Fast and scalable
Easy to interpret
Performed well on short and mid-length prompts
Limitation:
Couldn’t fully capture semantic nuances or deeper reasoning, especially in long-form responses.
2. 🧠 LLM Fine-Tuning with DeepSeek R1 + QLoRA (4-bit)
To go deeper, I fine-tuned a DeepSeek-7B (R1) model using QLoRA in 4-bit precision.
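For readers who want the mechanics, here is a sketch of a 4-bit QLoRA setup with Hugging Face transformers, bitsandbytes, and peft. The checkpoint name and LoRA hyperparameters are my assumptions, not a prescription:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # assumed checkpoint

# 4-bit NF4 quantization keeps the frozen base model small in GPU memory.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Only small low-rank adapters are trained; r and alpha are illustrative.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
```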
Challenge: The combined length of a prompt plus two responses often exceeded the model's context window.
Solution:
I split each prompt and response into three sections: start, middle, and end.
I then concatenated those sections to retain structure and tone while staying within the context limit.
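A minimal sketch of that chunking step, assuming an even three-way token budget (the exact split ratio is a design choice):

```python
def chunk_start_middle_end(text: str, tokenizer, max_tokens: int) -> str:
    """Keep the start, middle, and end of a long text, dropping the rest."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    if len(ids) <= max_tokens:
        return text  # already fits, keep it untouched
    k = max_tokens // 3
    mid = len(ids) // 2
    kept = ids[:k] + ids[mid - k // 2 : mid + (k - k // 2)] + ids[-k:]
    return tokenizer.decode(kept)
```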
Training Strategy:
Input: Prompt + Response A + Response B
Target: Winner label (A or B)
Contrastive fine-tuning to encourage preference prediction
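A sketch of one way to serialize the examples; the prompt template is illustrative, and swapping A and B is a common trick (assumed here) to double the data and counter position bias:

```python
TEMPLATE = (
    "### Prompt:\n{prompt}\n\n"
    "### Response A:\n{a}\n\n"
    "### Response B:\n{b}\n\n"
    "Which response is better? Answer:"
)

def make_example(row: dict, swap: bool = False) -> dict:
    """Serialize one comparison; swap flips A/B and the label together."""
    a, b = (
        (row["response_b"], row["response_a"]) if swap
        else (row["response_a"], row["response_b"])
    )
    winner_is_a = (row["winner"] == "model_a") != swap  # XOR flips the label
    return {
        "text": TEMPLATE.format(prompt=row["prompt"], a=a, b=b),
        "target": " A" if winner_is_a else " B",
    }
```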
Outcome:
Stronger reasoning in long-context samples
Better performance than classical methods on nuanced cases
⚔️ Real Challenges Faced
Context Overflow: Managing lengthy prompt–response pairs within model limits required creative sampling.
Multilingual Bias: Language imbalance in the training data made it hard to generalize across languages.
Noisy Annotations: Human preferences are sometimes inconsistent, which affected ground truth reliability.
📊 Performance Snapshot
| Method | Accuracy | Strengths | Weaknesses |
| --- | --- | --- | --- |
| USE + LightGBM | ~63% | Fast, interpretable | Shallow understanding |
| DeepSeek + QLoRA | ~68% | Contextual depth, stronger preference prediction | Slower, resource-heavy |
Both approaches were language-agnostic and adaptable, with the second model showing promising generalization in real-world QA tasks.
🌐 Final Thoughts: From Competition to Real-Time AI Systems
This challenge is more than just academic—it’s the foundation for AI agents that think before they respond.
As LLMs generate more content across industries, we must build systems that can evaluate, select, and justify model responses automatically. Here’s how this work transitions to production:
💡 Real-World Applications:
Chatbot Moderation: Rank or reject responses based on quality, tone, or safety.
Copilot Assistants (Legal, Finance, Healthcare): Reduce hallucination risk by flagging weaker outputs.
LLM Orchestration: Pick the best completion from multiple models behind the scenes.
Reinforcement Learning from Human Feedback (RLHF): My fine-tuned model could act as an automated proxy for human preference.
🤖 What’s Next: My Interview Chatbot
I’m now integrating this evaluation model into my personal portfolio chatbot—a digital twin that mimics me in interviews.
🔗 It will:
Take questions from recruiters
Generate multiple LLM responses
Score them using the fine-tuned model
Respond with the most human-aligned answer that reflects my actual tone, background, and thinking style
This will simulate a live interview conversation—bridging AI engineering, personalization, and conversational intelligence. It's more than just a demo—it's a step toward autonomous digital personas that can represent humans in professional settings.
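As a rough sketch of that selection loop, generate_candidate and preference_score below are placeholders for the LLM call and the fine-tuned judge; the round-robin scoring is one simple way to pick a winner among several candidates:

```python
import random

def generate_candidate(question: str) -> str:
    """Placeholder for an LLM call; returns a canned reply for illustration."""
    return f"Draft answer #{random.randint(0, 999)} to: {question}"

def preference_score(question: str, a: str, b: str) -> float:
    """Placeholder for the fine-tuned judge; would return P(a beats b)."""
    return random.random()

def best_answer(question: str, n: int = 4) -> str:
    """Generate n candidates, run a pairwise round-robin, return the top scorer."""
    candidates = [generate_candidate(question) for _ in range(n)]
    wins = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if preference_score(question, candidates[i], candidates[j]) > 0.5:
                wins[i] += 1
            else:
                wins[j] += 1
    return candidates[wins.index(max(wins))]
```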
📬 Let’s Connect
I’m actively exploring AI safety, model alignment, and LLM evaluation agents. If you’re building something in this space—or want to try the interview bot—feel free to connect: