Engineering Deterministic Outputs from Open-Source LLMs

How to make large language models reproducible for safety-critical production systems — techniques I use in aviation AI and aerospace compliance pipelines.

LLM · Determinism · Production AI · Open-Source · Aerospace

The Problem: Non-Determinism Kills Trust

When you deploy an LLM in aviation compliance, medical diagnostics, or financial auditing, you cannot afford variability. The same prompt with the same context must produce the same output — every single time. A compliance report that changes wording between runs is not just annoying; it's a regulatory failure.

Most frontier APIs abstract this away behind closed walls. But when you're running open-source models like Llama 3, Mistral, DeepSeek, or Gemma on your own infrastructure, you need to engineer determinism yourself. Here are the seven techniques I use in production today.


1. Greedy Decoding: Temperature Zero is Not Enough

The first instinct is to set temperature=0. This collapses the probability distribution to always pick the highest-probability token. But this alone doesn't guarantee determinism — floating-point arithmetic on different GPU architectures can produce subtly different softmax outputs.

The correct configuration is a triple lock:

temperature=0.0 + top_k=1 + do_sample=False

This forces greedy decoding: no sampling, no randomness, just argmax over the logit vector at every step.

In HuggingFace Transformers, this looks like:

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.0,        # belt and braces: irrelevant once sampling is off
    top_k=1,                # restrict the candidate set to the single top token
    do_sample=False,        # the switch that actually disables sampling (pure argmax)
    repetition_penalty=1.0  # 1.0 = no penalty, so logits are never reshaped between runs
)

2. Seed Locking: Reproducible Random States

Even with greedy decoding, the underlying CUDA kernels and PyTorch operations can introduce non-determinism through parallel thread scheduling. Seed locking pins the random number generators across all relevant layers:

import os
import random

import numpy as np
import torch

def set_deterministic(seed=42):
    # Required for deterministic cuBLAS matmuls on CUDA;
    # must be set before the first CUDA operation runs.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Available since PyTorch 1.8
    torch.use_deterministic_algorithms(True)

The critical line is torch.use_deterministic_algorithms(True). This forces PyTorch to use deterministic implementations of all operations, trading some performance for exact reproducibility. If an operation has no deterministic implementation, PyTorch raises an error instead of silently producing different results. On CUDA, deterministic cuBLAS matrix multiplies additionally require the CUBLAS_WORKSPACE_CONFIG environment variable (e.g. ":4096:8") to be set before any GPU work runs.


3. Constrained Decoding: Grammar-Guided Generation

The most powerful technique for deterministic outputs is constrained decoding — forcing the model to generate only tokens that conform to a predefined grammar or schema. Instead of hoping the model produces valid JSON, you mathematically guarantee it.

Tools like Outlines, LMQL, and Guidance implement this by masking invalid tokens at each generation step. The model can only choose from tokens that keep the output conforming to your schema:

import json

import outlines

model = outlines.models.transformers("mistralai/Mistral-7B-v0.3")

schema = {
    "type": "object",
    "properties": {
        "risk_level": {"type": "string", "enum": ["LOW", "MEDIUM", "HIGH", "CRITICAL"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "findings": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["risk_level", "confidence", "findings"]
}

# Outlines expects the schema as a JSON string (or a Pydantic model)
generator = outlines.generate.json(model, json.dumps(schema))
result = generator(prompt)  # always valid JSON, always matches the schema

This is transformative for compliance pipelines. Every output is structurally identical, parseable, and auditable. The model's creativity is channeled into content within the structure, not the structure itself.


4. Structured Output Schemas & Function Calling

Building on constrained decoding, structured output schemas add a validation layer. I use Pydantic models to define the exact shape of every LLM response:

from pydantic import BaseModel, Field
from typing import List, Literal

class ComplianceReport(BaseModel):
    document_id: str
    regulation: str
    status: Literal["COMPLIANT", "NON_COMPLIANT", "NEEDS_REVIEW"]
    findings: List[str] = Field(min_length=1)
    confidence_score: float = Field(ge=0.0, le=1.0)
    recommended_actions: List[str]

The Pydantic model acts as both the generation constraint and the runtime validator. If the model somehow produces output that doesn't match (e.g., confidence > 1.0), it's caught at the validation boundary rather than propagating into downstream systems.
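As a usage sketch of that validation boundary (assuming Pydantic v2; the response values are hypothetical sample data), a response with an out-of-range confidence score is rejected before it can propagate:

```python
from typing import List, Literal

from pydantic import BaseModel, Field, ValidationError

# ComplianceReport as defined above
class ComplianceReport(BaseModel):
    document_id: str
    regulation: str
    status: Literal["COMPLIANT", "NON_COMPLIANT", "NEEDS_REVIEW"]
    findings: List[str] = Field(min_length=1)
    confidence_score: float = Field(ge=0.0, le=1.0)
    recommended_actions: List[str]

# Simulated LLM output (hypothetical sample data) with an invalid score.
raw_response = {
    "document_id": "DOC-001",
    "regulation": "14 CFR 25.1309",
    "status": "NEEDS_REVIEW",
    "findings": ["Redundancy analysis incomplete"],
    "confidence_score": 1.7,  # invalid: must be <= 1.0
    "recommended_actions": ["Request updated failure analysis"],
}

try:
    report = ComplianceReport.model_validate(raw_response)
except ValidationError as exc:
    # Caught at the boundary, not in a downstream system.
    print(exc.errors()[0]["loc"])
```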


5. KV-Cache Pinning for Identical Prompts

When the same prompt hits your system repeatedly, which is common when batch-processing regulatory documents, KV-cache pinning ensures byte-identical outputs by caching the key-value attention states.

The idea is simple: compute the KV-cache once for a given prompt, serialize it, and reuse it for identical future requests. This eliminates any numerical drift from recomputation and dramatically speeds up inference for repeated queries.
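A minimal sketch of that pattern, using only the standard library: compute_kv_states is a hypothetical stand-in for the model forward pass that builds the cache (in HuggingFace Transformers, the past_key_values it returns); the store pins the serialized bytes and reuses them verbatim.

```python
import hashlib
import pickle

def compute_kv_states(prompt: str) -> list:
    """Hypothetical placeholder for the expensive forward pass that
    builds the KV-cache. Deterministic fake states for illustration."""
    return [ord(c) for c in prompt]

class KVCacheStore:
    """Pin serialized KV states by prompt hash; reuse them byte-for-byte."""

    def __init__(self):
        self._store = {}   # prompt hash -> serialized KV states
        self.computes = 0  # how many times we actually ran the model

    def get_or_compute(self, prompt: str):
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in self._store:
            # Compute once, serialize, and pin the exact bytes.
            self._store[key] = pickle.dumps(compute_kv_states(prompt))
            self.computes += 1
        # Identical future prompts deserialize the same bytes,
        # so there is no numerical drift from recomputation.
        return pickle.loads(self._store[key])
```

In a real pipeline the serialized value would be the model's attention cache tensors rather than a toy list, but the pin-by-hash structure is the same.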

Combined with greedy decoding and seed locking, KV-cache pinning creates a triple guarantee: same prompt → same cache → same tokens → same output. In our aviation compliance pipeline, this reduced output variance to exactly zero across 10,000 identical test runs.


6. Quantization-Aware Determinism

Quantization (GPTQ, AWQ, GGUF) introduces another source of non-determinism. Different quantization methods round weights differently, and some quantization kernels use non-deterministic CUDA operations for speed.

The key insight: quantize once, deploy the exact artifact everywhere. Never re-quantize. Store the quantized model as an immutable artifact with a content hash. Every deployment loads the exact same bytes.
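A stdlib-only sketch of that hashing discipline (function names are illustrative): record the artifact's SHA-256 at quantization time, then refuse to load anything whose bytes have changed.

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path) -> str:
    """Stream a file through SHA-256 so large model shards never sit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_artifact(path: Path, expected_hash: str) -> None:
    """Refuse to load a quantized model whose bytes differ from the recorded hash."""
    actual = sha256_file(path)
    if actual != expected_hash:
        raise RuntimeError(
            f"Model artifact drift: expected {expected_hash}, got {actual}"
        )
```

The expected hash lives alongside the deployment config; every load verifies it before the model touches a request.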

For AWQ specifically, use version="gemv" instead of version="gemm" — the GEMV kernel is deterministic while GEMM is not due to parallel reduction order.
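As a configuration sketch (assuming the AutoAWQ library; the model and output paths are illustrative, and this is a one-time offline job, not something rerun per deployment):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-v0.3"  # illustrative

# Quantization happens exactly once; the saved directory is the artifact.
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMV",  # deterministic kernel; GEMM's parallel reductions are not
}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized("mistral-7b-awq-gemv")   # immutable artifact: hash it, ship it
tokenizer.save_pretrained("mistral-7b-awq-gemv")
```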


7. Output Post-Processing: Hash & Cache

The final defense layer is deterministic caching at the API boundary. Hash the full request (prompt + parameters + model version) and cache the response:

import hashlib
import json

def deterministic_generate(prompt, params, model_version, cache):
    # Key on everything that can change the output:
    # prompt, decoding parameters, and the exact model version.
    cache_key = hashlib.sha256(
        json.dumps(
            {"prompt": prompt, "model_version": model_version, **params},
            sort_keys=True,
        ).encode()
    ).hexdigest()

    if cache_key in cache:
        return cache[cache_key]

    response = model.generate(prompt, **params)
    cache[cache_key] = response
    return response

This is the pragmatic last mile. Even if all other techniques aren't perfectly configured, the cache ensures that once a response is generated, it's locked for that exact input combination.


Real-World Application: Aviation Compliance

In our pipeline at NIAR, we process thousands of aerospace regulatory documents through LLM-powered compliance analysis. Each document gets a structured compliance report with risk classification, specific findings, and recommended actions.

The stack combines all seven techniques:

  • Greedy decoding with seed locking for base reproducibility

  • Outlines constrained decoding for schema-guaranteed JSON

  • Pydantic validators for runtime safety nets

  • KV-cache pinning for batch processing consistency

  • Immutable quantized artifacts deployed via Docker

  • SHA-256 response caching at the API boundary

The result: zero variance across identical inputs, every report auditable, every output traceable to an exact model version + prompt + parameter combination. This is what production-grade AI looks like in regulated industries.


Key Takeaways

Determinism in LLMs isn't a single switch — it's a stack of complementary techniques. Each layer catches what the previous one misses:

  1. Greedy decoding eliminates sampling randomness

  2. Seed locking eliminates CUDA non-determinism

  3. Constrained decoding eliminates structural variance

  4. Schema validation eliminates edge-case failures

  5. KV-cache pinning eliminates recomputation drift

  6. Quantization discipline eliminates weight rounding variance

  7. Response caching provides the final guarantee

If you're deploying LLMs in any domain where consistency matters — and it almost always does — these techniques are not optional. They're the difference between a demo and a production system.
