WSDM Cup 2025: Evaluating LLMs with Embeddings & Fine-Tuning

Solved WSDM Cup 2025: predict the better of two LLM responses (A or B) to a given prompt, on a dataset labeled by domain experts across multiple languages. Built two models: USE embeddings + LightGBM, and a DeepSeek-7B fine-tuned with QLoRA, handling long context via smart slicing.

Tags: AI, LLM, Fine-Tuning

The Problem: Judging the Best LLM Response

The WSDM Cup 2025 challenged participants with a deceptively simple yet deeply technical task:

Given a user prompt and two AI-generated responses, predict which one is better.

This wasn't just a binary classification problem—it was about understanding how LLMs behave, how users judge quality, and how we can automate that judgment. In an era where generative models power everything from chatbots to copilots, this task is directly applicable to LLM alignment, safety, and real-time feedback systems.


🛠️ My Dual Approach: Embedding-Based + Fine-Tuned Models

To tackle the challenge, I explored two distinct pipelines:


1. 🔍 Embedding + LightGBM Classifier

Baseline Setup:

  • Used Universal Sentence Encoder (USE) to embed prompt, response_a, and response_b.

  • Extracted contextual chunks from each text (start, middle, end) to preserve tone and logical flow.

  • Generated similarity and statistical features between prompt–response pairs.

Model: Trained a LightGBM classifier using the engineered features.
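
Here's a minimal sketch of this pipeline. The column names (`prompt`, `response_a`, `response_b`, `winner`), the label encoding, and the exact feature set are illustrative; assume `train_df` is a pandas DataFrame in that shape:

```python
import numpy as np
import tensorflow_hub as hub
import lightgbm as lgb

# Load the Universal Sentence Encoder (512-dim sentence embeddings).
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def embed(texts):
    """Embed a list of strings into an (n, 512) numpy array."""
    return use(texts).numpy()

def cosine(a, b):
    """Row-wise cosine similarity between two embedding matrices."""
    return np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

def build_features(df):
    p = embed(df["prompt"].tolist())
    ra = embed(df["response_a"].tolist())
    rb = embed(df["response_b"].tolist())
    stats = np.column_stack([
        cosine(p, ra),   # prompt <-> response A similarity
        cosine(p, rb),   # prompt <-> response B similarity
        cosine(ra, rb),  # how close the two responses are to each other
        df["response_a"].str.len() - df["response_b"].str.len(),  # length gap
    ])
    return np.hstack([stats, ra - rb])  # stats plus the embedding difference

X = build_features(train_df)
y = (train_df["winner"] == "model_a").astype(int)  # adjust to the actual label format

clf = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
clf.fit(X, y)
```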

Why It Worked:

  • Fast and scalable

  • Easy to interpret

  • Performed well on short and mid-length prompts

Limitation:

  • Couldn’t fully capture semantic nuances or deeper reasoning, especially in long-form responses.


2. 🧠 LLM Fine-Tuning with DeepSeek R1 + QLoRA (4-bit)

To go deeper, I fine-tuned a DeepSeek-7B (R1) model using QLoRA in 4-bit precision.

Challenge: The combined input (prompt + two responses) often exceeded the model's context window.

Solution:

  • Split each prompt and each response into three sections: start, middle, and end.

  • Concatenated those intelligently to retain structure and tone while staying within the context limit.
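
A minimal sketch of that slicing, assuming a Hugging Face-style tokenizer and an illustrative per-field token budget:

```python
def slice_start_middle_end(text, tokenizer, budget=512):
    """Keep the start, middle, and end of a long text within a token budget."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    if len(ids) <= budget:
        return text  # already fits, keep intact
    k = budget // 3  # roughly equal tokens per section
    mid = len(ids) // 2
    sections = [ids[:k], ids[mid - k // 2 : mid + k // 2], ids[-k:]]
    # Explicit " ... " markers signal the elision to the model.
    return " ... ".join(tokenizer.decode(s) for s in sections)
```

Applied to the prompt and both responses, this keeps each field's opening, core, and conclusion while the concatenated input stays under the context limit.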

Training Strategy:

  • Input: Prompt + Response A + Response B

  • Target: Winner label (A or B)

  • Contrastive fine-tuning to encourage preference prediction
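
A minimal sketch of the 4-bit QLoRA setup with `transformers`, `peft`, and `bitsandbytes`, framing the task as two-way sequence classification; the checkpoint ID and LoRA hyperparameters shown are illustrative, not the exact ones I tuned:

```python
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization keeps the 7B base model within a single GPU's memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Two labels: response A wins vs. response B wins.
model = AutoModelForSequenceClassification.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",  # illustrative checkpoint
    num_labels=2,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)  # only the LoRA adapters are trainable
```

Each training example concatenates the sliced prompt, response A, and response B into one sequence, with the winner label as the classification target.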

Outcome:

  • Stronger reasoning in long-context samples

  • Better performance than classical methods on nuanced cases


⚔️ Real Challenges Faced

  1. Context Overflow: Managing lengthy prompt–response pairs within model limits required creative sampling.

  2. Multilingual Bias: Language imbalance across the dataset made it hard to generalize equally well to every language.

  3. Noisy Annotations: Human preferences are sometimes inconsistent, which affected ground truth reliability.


📊 Performance Snapshot

| Method | Accuracy | Strengths | Weaknesses |
| --- | --- | --- | --- |
| USE + LightGBM | ~63% | Fast, interpretable | Shallow understanding |
| DeepSeek + QLoRA | ~68% | Contextual depth, stronger preference prediction | Slower, resource-heavy |

Both approaches were language-agnostic and adaptable, with the second model showing promising generalization in real-world QA tasks.


🌐 Final Thoughts: From Competition to Real-Time AI Systems

This challenge is more than just academic—it’s the foundation for AI agents that think before they respond.

As LLMs generate more content across industries, we must build systems that can evaluate, select, and justify model responses automatically. Here’s how this work transitions to production:

💡 Real-World Applications:

  • Chatbot Moderation: Rank or reject responses based on quality, tone, or safety.

  • Copilot Assistants (Legal, Finance, Healthcare): Prevent hallucinations by flagging weaker outputs.

  • LLM Orchestration: Pick the best completion from multiple models behind the scenes.

  • Reinforcement Learning from Human Feedback (RLHF): My fine-tuned model could act as an automated proxy for human preference.

🤖 What’s Next: My Interview Chatbot

I’m now integrating this evaluation model into my personal portfolio chatbot—a digital twin that mimics me in interviews.

🔗 It will:

  • Take questions from recruiters

  • Generate multiple LLM responses

  • Score them using the fine-tuned model

  • Respond with the most human-aligned answer that reflects my actual tone, background, and thinking style
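
A minimal sketch of that generate-score-select loop; `generate_candidates` and `preference_score` are hypothetical helpers standing in for the LLM call and the fine-tuned evaluator:

```python
def answer(question, n_candidates=4):
    """Generate several candidate answers and return the best-scoring one."""
    candidates = generate_candidates(question, n=n_candidates)  # hypothetical LLM call
    # Round-robin: score each candidate against every other with the
    # fine-tuned preference model, which returns P(first beats second).
    totals = [
        sum(preference_score(question, c, other)
            for other in candidates if other is not c)
        for c in candidates
    ]
    return candidates[totals.index(max(totals))]
```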

This will simulate a live interview conversation—bridging AI engineering, personalization, and conversational intelligence. It's more than just a demo—it's a step toward autonomous digital personas that can represent humans in professional settings.


📬 Let’s Connect

I’m actively exploring AI safety, model alignment, and LLM evaluation agents. If you’re building something in this space—or want to try the interview bot—feel free to connect:

Ask me anything