WSDM Cup 2025: Evaluating LLMs with Embeddings & Fine-Tuning

Solved WSDM Cup 2025: predict the better of two LLM responses (A or B) to a given prompt, on a dataset labeled by domain experts across multiple languages. Built two models: USE embeddings + LightGBM, and a DeepSeek-7B fine-tuned with QLoRA, handling long context via smart slicing.

Tags: AI, LLM, Fine-Tuning

The Problem: Judging the Best LLM Response

The WSDM Cup 2025 challenged participants with a deceptively simple yet deeply technical task:

Given a user prompt and two AI-generated responses, predict which one is better.

This wasn't just a binary classification problem—it was about understanding how LLMs behave, how users judge quality, and how we can automate that judgment. In an era where generative models power everything from chatbots to copilots, this task is directly applicable to LLM alignment, safety, and real-time feedback systems.


🛠️ My Dual Approach: Embedding-Based + Fine-Tuned Models

To tackle the challenge, I explored two distinct pipelines:


1. 🔍 Embedding + LightGBM Classifier

Baseline Setup:

  • Used Universal Sentence Encoder (USE) to embed prompt, response_a, and response_b.

  • Extracted contextual chunks from each text (start, middle, end) to preserve tone and logical flow.

  • Generated similarity and statistical features between prompt–response pairs.

Model: Trained a LightGBM classifier using the engineered features.
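
Here's a minimal sketch of this pipeline. The column names (`prompt`, `response_a`, `response_b`, `winner`), the label encoding, and the exact feature set are illustrative; assume `train_df` is a pandas DataFrame in that shape:

```python
import numpy as np
import tensorflow_hub as hub
import lightgbm as lgb

# Load the Universal Sentence Encoder (512-dim sentence embeddings).
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def embed(texts):
    """Embed a list of strings into an (n, 512) numpy array."""
    return use(texts).numpy()

def cosine(a, b):
    """Row-wise cosine similarity between two embedding matrices."""
    return np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

def build_features(df):
    p = embed(df["prompt"].tolist())
    ra = embed(df["response_a"].tolist())
    rb = embed(df["response_b"].tolist())
    stats = np.column_stack([
        cosine(p, ra),   # prompt <-> response A similarity
        cosine(p, rb),   # prompt <-> response B similarity
        cosine(ra, rb),  # how close the two responses are to each other
        df["response_a"].str.len() - df["response_b"].str.len(),  # length gap
    ])
    return np.hstack([stats, ra - rb])  # stats plus the embedding difference

X = build_features(train_df)
y = (train_df["winner"] == "model_a").astype(int)  # adjust to the actual label format

clf = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
clf.fit(X, y)
```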

Why It Worked:

  • Fast and scalable

  • Easy to interpret

  • Performed well on short and mid-length prompts

Limitation:

  • Couldn’t fully capture semantic nuances or deeper reasoning, especially in long-form responses.


2. 🧠 LLM Fine-Tuning with DeepSeek R1 + QLoRA (4-bit)

To go deeper, I fine-tuned a DeepSeek-7B (R1) model using QLoRA in 4-bit precision.

Challenge: The combined input (prompt + two responses) often exceeded the model's context window.

Solution:

  • Split each prompt and each response into three sections: start, middle, and end.

  • Concatenated those intelligently to retain structure and tone while staying within the context limit.
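
A minimal sketch of that slicing, assuming a Hugging Face-style tokenizer and an illustrative per-field token budget:

```python
def slice_start_middle_end(text, tokenizer, budget=512):
    """Keep the start, middle, and end of a long text within a token budget."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    if len(ids) <= budget:
        return text  # already fits, keep intact
    k = budget // 3  # roughly equal tokens per section
    mid = len(ids) // 2
    sections = [ids[:k], ids[mid - k // 2 : mid + k // 2], ids[-k:]]
    # Explicit " ... " markers signal the elision to the model.
    return " ... ".join(tokenizer.decode(s) for s in sections)
```

Applied to the prompt and both responses, this keeps each field's opening, core, and conclusion while the concatenated input stays under the context limit.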

Training Strategy:

  • Input: Prompt + Response A + Response B

  • Target: Winner label (A or B)

  • Contrastive fine-tuning to encourage preference prediction
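
A minimal sketch of the 4-bit QLoRA setup with `transformers`, `peft`, and `bitsandbytes`, framing the task as two-way sequence classification; the checkpoint ID and LoRA hyperparameters shown are illustrative, not the exact ones I tuned:

```python
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization keeps the 7B base model within a single GPU's memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Two labels: response A wins vs. response B wins.
model = AutoModelForSequenceClassification.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",  # illustrative checkpoint
    num_labels=2,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)  # only the LoRA adapters are trainable
```

Each training example concatenates the sliced prompt, response A, and response B into one sequence, with the winner label as the classification target.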

Outcome:

  • Stronger reasoning in long-context samples

  • Better performance than classical methods on nuanced cases


⚔️ Real Challenges Faced

  1. Context Overflow: Managing lengthy prompt–response pairs within model limits required creative sampling.

  2. Multilingual Bias: Language imbalance across the dataset made it hard to generalize equally well to every language.

  3. Noisy Annotations: Human preferences are sometimes inconsistent, which affected ground truth reliability.


📊 Performance Snapshot

| Method | Accuracy | Strengths | Weaknesses |
| --- | --- | --- | --- |
| USE + LightGBM | ~63% | Fast, interpretable | Shallow understanding |
| DeepSeek + QLoRA | ~68% | Contextual depth, stronger preference prediction | Slower, resource-heavy |

Both approaches were language-agnostic and adaptable, with the second model showing promising generalization in real-world QA tasks.


🌐 Final Thoughts: From Competition to Real-Time AI Systems

This challenge is more than just academic—it’s the foundation for AI agents that think before they respond.

As LLMs generate more content across industries, we must build systems that can evaluate, select, and justify model responses automatically. Here’s how this work transitions to production:

💡 Real-World Applications:

  • Chatbot Moderation: Rank or reject responses based on quality, tone, or safety.

  • Copilot Assistants (Legal, Finance, Healthcare): Prevent hallucinations by flagging weaker outputs.

  • LLM Orchestration: Pick the best completion from multiple models behind the scenes.

  • Reinforcement Learning from Human Feedback (RLHF): My fine-tuned model could act as an automated proxy for human preference.

🤖 What’s Next: My Interview Chatbot

I’m now integrating this evaluation model into my personal portfolio chatbot—a digital twin that mimics me in interviews.

🔗 It will:

  • Take questions from recruiters

  • Generate multiple LLM responses

  • Score them using the fine-tuned model

  • Respond with the most human-aligned answer that reflects my actual tone, background, and thinking style
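
A minimal sketch of that generate-score-select loop; `generate_candidates` and `preference_score` are hypothetical helpers standing in for the LLM call and the fine-tuned evaluator:

```python
def answer(question, n_candidates=4):
    """Generate several candidate answers and return the best-scoring one."""
    candidates = generate_candidates(question, n=n_candidates)  # hypothetical LLM call
    # Round-robin: score each candidate against every other with the
    # fine-tuned preference model, which returns P(first beats second).
    totals = [
        sum(preference_score(question, c, other)
            for other in candidates if other is not c)
        for c in candidates
    ]
    return candidates[totals.index(max(totals))]
```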

This will simulate a live interview conversation—bridging AI engineering, personalization, and conversational intelligence. It's more than just a demo—it's a step toward autonomous digital personas that can represent humans in professional settings.


📬 Let’s Connect

I’m actively exploring AI safety, model alignment, and LLM evaluation agents. If you’re building something in this space—or want to try the interview bot—feel free to connect:

Ask me anything