LLM Engineering in Production: Beyond the Prompt

January 30, 2025

Production LLM systems differ from prototypes in the same way a load-bearing wall differs from a cardboard cutout. Both look similar from a distance. One of them holds up a building. This document covers the engineering patterns required to move from prototype to production: evaluation frameworks, retrieval-augmented generation, prompt management, model routing, cost control, safety guardrails, and observability.

Evaluation Frameworks

Without a structured evaluation pipeline, prompt changes amount to guesswork. LLM outputs are non-deterministic and high-dimensional. Equality checks against expected values do not work. Production systems require layered evaluation with automated gating.

+--------------------------------------------------------------------+
|                      EVALUATION PIPELINE                           |
+--------------------------------------------------------------------+
|                                                                    |
|  Stage 1: Deterministic Checks                                     |
|  +--------------------------------------------------------------+ |
|  |  - JSON schema validation (jsonschema, Pydantic)              | |
|  |  - Output length within [min_tokens, max_tokens]              | |
|  |  - Required field presence                                    | |
|  |  - Regex pattern matching for structured outputs              | |
|  |  - Language detection (expected locale match)                 | |
|  +--------------------------------------------------------------+ |
|                          |                                          |
|                          v                                          |
|  Stage 2: Semantic Evaluation                                      |
|  +--------------------------------------------------------------+ |
|  |  - Claim extraction + entailment verification                 | |
|  |  - Embedding cosine similarity to golden answers              | |
|  |  - BERTScore F1 against reference responses                   | |
|  |  - Named entity overlap ratio                                 | |
|  +--------------------------------------------------------------+ |
|                          |                                          |
|                          v                                          |
|  Stage 3: LLM-as-Judge                                             |
|  +--------------------------------------------------------------+ |
|  |  - Pairwise comparison (A vs B verdicts)                      | |
|  |  - Rubric-based scoring on [relevance, accuracy, tone]        | |
|  |  - Safety and policy compliance checks                        | |
|  |  - Faithfulness scoring against retrieved context             | |
|  +--------------------------------------------------------------+ |
|                          |                                          |
|                          v                                          |
|  Stage 4: Human Review (sampled)                                   |
|  +--------------------------------------------------------------+ |
|  |  - Weekly audit of flagged outputs                            | |
|  |  - Inter-annotator agreement tracking (target: kappa > 0.7)  | |
|  |  - New edge cases added to eval dataset                       | |
|  +--------------------------------------------------------------+ |
|                                                                    |
+--------------------------------------------------------------------+

Stage 1: Deterministic Checks

These run on every response with sub-millisecond overhead. They catch structural failures: malformed JSON, outputs exceeding length constraints, missing required fields, wrong output language.

import json
from jsonschema import validate, ValidationError
 
RESPONSE_SCHEMA = {
    "type": "object",
    "required": ["answer", "confidence", "sources"],
    "properties": {
        "answer": {"type": "string", "minLength": 1},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "sources": {
            "type": "array",
            "items": {"type": "string"},
            "minItems": 1,
        },
    },
}
 
def validate_response(raw_output: str) -> dict:
    """Parse and validate LLM output against the expected schema."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError as e:
        raise ValueError(f"Output is not valid JSON: {e}")
 
    validate(instance=parsed, schema=RESPONSE_SCHEMA)
 
    token_count = len(raw_output.split())
    if token_count > 2000:
        raise ValueError(
            f"Response length {token_count} tokens exceeds 2000 limit"
        )
 
    return parsed

A system handling 50K daily requests should expect 2-5% of responses to fail schema validation on the first attempt. Retry with a stricter prompt or fall back to a template-based response.

Stage 2: Semantic Evaluation

Semantic evaluation measures whether the response content is correct, not just whether its format is correct.

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity as cos_sim
 
model = SentenceTransformer("all-MiniLM-L6-v2")  # 384 dimensions, 80MB
 
 
def embedding_similarity(response: str, reference: str) -> float:
    """Cosine similarity between response and reference embeddings."""
    embeddings = model.encode([response, reference])
    return float(cos_sim([embeddings[0]], [embeddings[1]])[0][0])
 
 
def claim_precision(
    response_claims: list[str],
    reference_claims: list[str],
    threshold: float = 0.85,
) -> float:
    """Fraction of response claims supported by reference claims."""
    if not response_claims:
        return 1.0
    resp_emb = model.encode(response_claims)
    ref_emb = model.encode(reference_claims)
    sims = cos_sim(resp_emb, ref_emb)
    supported = sum(1 for row in sims if row.max() >= threshold)
    return supported / len(response_claims)
 
 
def claim_recall(
    response_claims: list[str],
    reference_claims: list[str],
    threshold: float = 0.85,
) -> float:
    """Fraction of reference claims covered by the response."""
    if not reference_claims:
        return 1.0
    resp_emb = model.encode(response_claims)
    ref_emb = model.encode(reference_claims)
    sims = cos_sim(ref_emb, resp_emb)
    covered = sum(1 for row in sims if row.max() >= threshold)
    return covered / len(reference_claims)

BERTScore provides an alternative that operates at the token level rather than the claim level. For factual QA tasks, claim-level metrics are more interpretable. For summarization, BERTScore or ROUGE-L are more appropriate.

Stage 3: LLM-as-Judge

Pairwise comparison outperforms absolute scoring. When switching from "rate 1-5" to "which is better, A or B?", inter-annotator agreement with human raters typically improves from around 60% to 82-87%.

JUDGE_PROMPT = """You are evaluating two responses to a question.
 
Question: {question}
Context (ground truth): {context}
 
Response A:
{response_a}
 
Response B:
{response_b}
 
Evaluate on these criteria:
1. Factual accuracy relative to the provided context
2. Completeness (does it address the full question?)
3. Conciseness (no unnecessary information)
 
Provide your reasoning, then state your verdict.
 
Output JSON:
{{"reasoning": "...", "verdict": "A" | "B" | "TIE"}}
"""
 
 
def run_judge(
    question: str,
    context: str,
    response_a: str,
    response_b: str,
    client,
) -> dict:
    """Run pairwise LLM-as-judge evaluation."""
    prompt = JUDGE_PROMPT.format(
        question=question,
        context=context,
        response_a=response_a,
        response_b=response_b,
    )
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        response_format={"type": "json_object"},
    )
    return json.loads(result.choices[0].message.content)

Position bias is a known issue. The judge model may prefer whichever response appears first. Mitigate this by running each comparison twice with swapped positions and discarding cases where the verdict flips.

Eval Dataset Management

The eval dataset should grow continuously. Start with 50-100 examples. Every production bug becomes a new test case. Target 500+ examples by month three.

  +--------------------+
  |  Modify prompt,    |
  |  model, or config  |
  +---------+----------+
            |
            v
  +---------+----------+
  |  Run eval suite    |
  |  (all 4 stages)    |
  +---------+----------+
            |
            v
  +---------+----------+
  |  Compare to        |
  |  baseline metrics  |
  +---------+----------+
            |
      +-----+------+
      |            |
      v            v
  +---+----+  +---+----+
  | Pass:  |  | Fail:  |
  | deploy |  | revert |
  +--------+  +--------+

Track eval runs in LangSmith or Weights & Biases. LangSmith provides LLM-specific trace visualization, dataset versioning, and annotation queues. W&B is more general-purpose but integrates well with experiment tracking across model training and inference.

Retrieval-Augmented Generation

RAG grounds LLM responses in external data. A naive implementation retrieves irrelevant chunks, wastes context window tokens, and produces hallucinated responses that cite real-looking but fabricated sources. Production RAG requires careful engineering at every stage.

Architecture

                       +-------------------+
                       |    User Query     |
                       +--------+----------+
                                |
                       +--------v----------+
                       |  Query Processing |
                       |  - Classification |
                       |  - Expansion      |
                       |  - HyDE (optional)|
                       +--------+----------+
                                |
                  +-------------+-------------+
                  |                           |
         +--------v----------+     +---------v---------+
         |  Vector Search    |     |  Lexical Search   |
         |  (ANN, HNSW)     |     |  (BM25)           |
         |  top-k=50        |     |  top-k=50         |
         +--------+----------+     +---------+---------+
                  |                           |
                  +-------------+-------------+
                                |
                       +--------v----------+
                       |  Reciprocal Rank  |
                       |  Fusion (k=60)    |
                       |  output: top 20   |
                       +--------+----------+
                                |
                       +--------v----------+
                       |  Cross-Encoder    |
                       |  Reranker         |
                       |  output: top 5    |
                       +--------+----------+
                                |
                       +--------v----------+
                       |  Context Assembly |
                       |  - Deduplication  |
                       |  - Metadata       |
                       |  - Token budget   |
                       +--------+----------+
                                |
                       +--------v----------+
                       |  LLM Generation   |
                       |  + Citation       |
                       |  Extraction       |
                       +--------+----------+
                                |
                       +--------v----------+
                       |  Post-Processing  |
                       |  - Citation check |
                       |  - Safety filter  |
                       |  - Format output  |
                       +-------------------+

Embedding Model Selection

The embedding model determines the ceiling of retrieval quality. Switching models changes recall@10 by 10-20 percentage points depending on the domain.

ModelDimensionsSizeCost (per 1M tokens)Best For
text-embedding-3-small1536API$0.02General English, cost-sensitive
text-embedding-3-large3072API$0.13General English, high accuracy
voyage-code-31024API$0.06Code retrieval
Cohere embed-v31024API$0.10Multilingual, compressed binary
all-MiniLM-L6-v238480MBFree (self-host)Low-resource, CPU inference
multilingual-e5-large-instruct10242.2GBFree (self-host)Multilingual, MTEB top-tier
nomic-embed-text-v1.5768548MBFree (self-host)Matryoshka, variable dims

Instruction-prefixed embeddings improve retrieval by 3-8% on benchmarks. Prefix documents with "search_document:" and queries with "search_query:" (exact prefix varies by model). This asymmetry accounts for the structural difference between a short query and a long passage.

Matryoshka embeddings (supported by text-embedding-3-large and nomic-embed-text-v1.5) allow truncating the embedding vector to fewer dimensions at query time. Reducing from 3072 to 1024 dimensions typically costs less than 2% recall while cutting storage and search costs by 66%.

Chunking Strategies

Chunking determines what the retrieval system can find. A naive RAG pipeline retrieving 20 chunks at 500 tokens each consumes 10K context tokens per query. At $3 per million input tokens, that is $0.03 per query, or $1,500/day at 50K queries. Reducing to 8 well-chosen chunks of 300 tokens saves 7,600 tokens per query and $2.28K/day.

StrategyChunk SizeOverlapProsCons
Fixed-size512 tokens50 tokensSimple, predictableSplits mid-sentence
Recursive splitting500-1000 chars100-200 charsRespects paragraph boundariesStill arbitrary boundaries
Semantic chunkingVariableNonePreserves topic coherenceRequires embedding each sentence
Document-structureSection-levelNoneRespects author's organizationHighly variable chunk sizes
Code-awareFunction/classNonePreserves logical unitsLanguage-specific parsers needed
from langchain.text_splitter import RecursiveCharacterTextSplitter
 
 
# Fixed-size: baseline approach
def fixed_chunks(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size token chunks with overlap."""
    tokens = tokenizer.encode(text)
    chunks = []
    for i in range(0, len(tokens), size - overlap):
        chunk_tokens = tokens[i : i + size]
        chunks.append(tokenizer.decode(chunk_tokens))
    return chunks
 
 
# Recursive splitting: respects paragraph and sentence boundaries
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=150,
    separators=["\n\n", "\n", ". ", ", ", " "],
    length_function=lambda t: len(tokenizer.encode(t)),
)
 
 
# Semantic chunking: split on topic shifts
def semantic_chunk(
    text: str,
    embedding_model,
    threshold: float = 0.72,
    min_chunk_size: int = 100,
) -> list[str]:
    """Split text at points where consecutive sentence
    embeddings drop below a similarity threshold."""
    sentences = segment_into_sentences(text)
    if len(sentences) <= 1:
        return [text]
 
    embeddings = embedding_model.encode(sentences)
    chunks = []
    current = [sentences[0]]
 
    for i in range(1, len(sentences)):
        sim = float(
            cos_sim([embeddings[i - 1]], [embeddings[i]])[0][0]
        )
        if sim < threshold and len(" ".join(current)) >= min_chunk_size:
            chunks.append(" ".join(current))
            current = [sentences[i]]
        else:
            current.append(sentences[i])
 
    if current:
        chunks.append(" ".join(current))
    return chunks

Guidelines for chunk size by content type:

  • FAQ, Q&A pairs: 150-300 tokens. Precision matters more than context.
  • Technical documentation: 400-800 tokens. Enough context for an explanation to be self-contained.
  • Legal and regulatory text: 800-1500 tokens. Clauses reference each other; splitting them destroys meaning.
  • Source code: chunk by function, method, or class. A 5-line utility and a 200-line class each form one atomic chunk.

Metadata enrichment is critical. Attach document title, section heading, page number, last-modified timestamp, and source URL to every chunk. This enables filtered retrieval ("find chunks from documents updated in the last 90 days") and proper citation in responses.

Pure vector search fails on exact-match queries. A user searching for error code ERR_CONN_REFUSED_443 gets chunks about network errors in general. BM25 matches the literal string. Pure keyword search fails on conceptual queries. A query about "authentication" misses documents using "login flow," "JWT validation," and "session management." Vector search captures the semantic relationship.

Combining both with Reciprocal Rank Fusion:

from collections import defaultdict
from dataclasses import dataclass
 
 
@dataclass
class SearchResult:
    doc_id: str
    content: str
    metadata: dict
    score: float = 0.0
 
 
def hybrid_search(
    query: str,
    vector_store,
    bm25_index,
    k: int = 20,
    rrf_k: int = 60,
    vector_weight: float = 1.0,
    keyword_weight: float = 1.0,
) -> list[SearchResult]:
    """Hybrid search with weighted Reciprocal Rank Fusion.
 
    Args:
        query: Search query string.
        vector_store: Vector index with similarity_search method.
        bm25_index: BM25 index with search method.
        k: Number of final results to return.
        rrf_k: RRF smoothing constant (default 60 from original paper).
        vector_weight: Weight for vector search contribution.
        keyword_weight: Weight for keyword search contribution.
    """
    vector_results = vector_store.similarity_search(query, k=k * 3)
    keyword_results = bm25_index.search(query, k=k * 3)
 
    scores = defaultdict(float)
    doc_map = {}
 
    for rank, doc in enumerate(vector_results):
        scores[doc.doc_id] += vector_weight / (rrf_k + rank + 1)
        doc_map[doc.doc_id] = doc
 
    for rank, doc in enumerate(keyword_results):
        scores[doc.doc_id] += keyword_weight / (rrf_k + rank + 1)
        doc_map[doc.doc_id] = doc
 
    ranked_ids = sorted(scores, key=scores.get, reverse=True)[:k]
    return [
        SearchResult(
            doc_id=did,
            content=doc_map[did].content,
            metadata=doc_map[did].metadata,
            score=scores[did],
        )
        for did in ranked_ids
    ]

The RRF smoothing constant k=60 comes from the original Cormack et al. paper. Values between 40 and 80 produce similar results in practice. The weight parameters allow tuning the balance between semantic and lexical signals. For technical documentation with many acronyms and identifiers, increase keyword_weight to 1.5.

Reranking

Bi-encoder embeddings encode query and document independently. Cross-encoders process the query-document pair together, enabling token-level interaction. This produces more accurate relevance scores but is too slow to run over the full corpus.

The reranking step takes the top 20-50 candidates from hybrid search and reorders them. On internal benchmarks, adding a cross-encoder reranker improves answer accuracy from 71-73% to 83-86%.

RerankerLatency (20 docs)NDCG@10 (MS MARCO)Deployment
cross-encoder/ms-marco-MiniLM-L-6-v245ms0.39Self-hosted, CPU
Cohere Rerank v380-120ms0.42API
bge-reranker-v2-m360ms0.41Self-hosted, GPU
cross-encoder/ms-marco-TinyBERT-L-2-v212ms0.36Self-hosted, CPU
from sentence_transformers import CrossEncoder
 
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
 
 
def rerank(
    query: str,
    candidates: list[SearchResult],
    top_k: int = 5,
) -> list[SearchResult]:
    """Rerank candidates using a cross-encoder model."""
    pairs = [(query, c.content) for c in candidates]
    scores = reranker.predict(pairs)
    scored = sorted(
        zip(candidates, scores), key=lambda x: x[1], reverse=True
    )
    return [
        SearchResult(
            doc_id=c.doc_id,
            content=c.content,
            metadata=c.metadata,
            score=float(s),
        )
        for c, s in scored[:top_k]
    ]

Budget 50-150ms for reranking in the latency budget. For sub-50ms requirements, use the TinyBERT variant or limit candidates to 10.

Context Assembly and Token Budgeting

After reranking, assemble the context window for the LLM. This involves deduplication, ordering, and fitting within the token budget.

import tiktoken
 
enc = tiktoken.encoding_for_model("gpt-4o")
 
 
def assemble_context(
    chunks: list[SearchResult],
    max_context_tokens: int = 6000,
    system_prompt_tokens: int = 500,
    max_output_tokens: int = 1000,
) -> str:
    """Assemble retrieved chunks into a context string
    that fits within the token budget."""
    available = max_context_tokens - system_prompt_tokens
    seen_content = set()
    selected = []
    total_tokens = 0
 
    for chunk in chunks:
        content_hash = hash(chunk.content.strip())
        if content_hash in seen_content:
            continue
        seen_content.add(content_hash)
 
        chunk_text = (
            f"[Source: {chunk.metadata.get('title', 'Unknown')}]\n"
            f"{chunk.content.strip()}\n"
        )
        chunk_tokens = len(enc.encode(chunk_text))
 
        if total_tokens + chunk_tokens > available:
            break
 
        selected.append(chunk_text)
        total_tokens += chunk_tokens
 
    return "\n---\n".join(selected)

Token budget breakdown for a typical RAG query (GPT-4o, 128K context window):

+-----------------------------------------+
| Component          | Tokens  | Cost     |
|--------------------+---------+----------|
| System prompt      |    500  | $0.0013  |
| Retrieved context  |  6,000  | $0.0150  |
| User query         |    100  | $0.0003  |
| Output             |  1,000  | $0.0100  |
| Total per query    |  7,600  | $0.0266  |
| Daily (50K queries)|  380M   | $1,330   |
+-----------------------------------------+
(Based on GPT-4o pricing: $2.50/M input, $10/M output)

Prompt Engineering Patterns

Chain of Thought with Structured Output

Requesting step-by-step reasoning before the final answer reduces classification errors by 15-30% on multi-step tasks.

You are classifying customer support tickets.

For each ticket:
1. Identify the primary issue category.
2. Determine severity:
   - P0: Service outage, data loss, security incident
   - P1: Major feature broken, widespread impact
   - P2: Minor feature issue, workaround available
   - P3: Cosmetic issue, feature request, general question
3. Check escalation rules:
   - P0/P1: always escalate
   - P2: escalate if customer is enterprise tier or mentions legal action
   - P3: never escalate
4. Produce the classification.

Output JSON:
{
  "reasoning": "<step-by-step analysis>",
  "category": "<string>",
  "severity": "P0" | "P1" | "P2" | "P3",
  "escalate": true | false
}

Including the reasoning field is not just for debugging. The model produces better values for severity and escalate when it writes its reasoning first, because autoregressive generation conditions later tokens on earlier ones.

Prompt Chaining

When a single prompt handles too many tasks, accuracy degrades. Decompose into a pipeline where each step has a focused prompt, a specific model, and independent evaluation.

+--------------------------------------------------------------------+
|                       PROMPT CHAIN                                 |
|              (Document Q&A Pipeline)                               |
+--------------------------------------------------------------------+
|                                                                    |
|  Step 1: Query Analysis              Model: gpt-4o-mini           |
|  +--------------------------------------------------------------+ |
|  | Input:  Raw user question                                    | |
|  | Task:   Classify intent, extract entities, rewrite query     | |
|  | Output: {intent, entities, rewritten_query}                  | |
|  | Cost:   ~150 input + 100 output tokens = $0.0001             | |
|  +------------------------------+-------------------------------+ |
|                                 |                                  |
|                                 v                                  |
|  Step 2: Retrieval               Model: none (retrieval only)     |
|  +--------------------------------------------------------------+ |
|  | Input:  Rewritten query + entity filters                     | |
|  | Task:   Hybrid search + reranking                            | |
|  | Output: Top 5 relevant chunks with metadata                  | |
|  | Cost:   Compute only, no LLM tokens                          | |
|  +------------------------------+-------------------------------+ |
|                                 |                                  |
|                                 v                                  |
|  Step 3: Answer Generation       Model: gpt-4o                   |
|  +--------------------------------------------------------------+ |
|  | Input:  Question + retrieved chunks + citation instructions  | |
|  | Task:   Generate answer with inline citations                | |
|  | Output: Answer text with [1], [2] citation markers           | |
|  | Cost:   ~6500 input + 500 output tokens = $0.021             | |
|  +------------------------------+-------------------------------+ |
|                                 |                                  |
|                                 v                                  |
|  Step 4: Verification            Model: gpt-4o-mini              |
|  +--------------------------------------------------------------+ |
|  | Input:  Answer + source chunks                               | |
|  | Task:   Verify each citation is supported by its source      | |
|  | Output: {verified: bool, unsupported_claims: [...]}          | |
|  | Cost:   ~3000 input + 200 output tokens = $0.0013            | |
|  +--------------------------------------------------------------+ |
|                                                                    |
|  Total per query: ~$0.023                                          |
|  Total latency:   ~1.2s (steps 1+3+4 sequential, step 2 ~200ms)  |
+--------------------------------------------------------------------+

Each step can be evaluated independently. If Step 4 flags unsupported claims above a threshold, the response can be regenerated or routed to human review.

Few-Shot Example Selection

Static few-shot examples are effective but suboptimal for diverse query distributions. Dynamic few-shot selection retrieves the most relevant examples from an example bank based on the input query.

from sentence_transformers import SentenceTransformer
import numpy as np
 
example_model = SentenceTransformer("all-MiniLM-L6-v2")
 
 
class DynamicFewShotSelector:
    """Select few-shot examples most similar to the input query."""
 
    def __init__(self, examples: list[dict]):
        self.examples = examples
        self.embeddings = example_model.encode(
            [ex["input"] for ex in examples]
        )
 
    def select(self, query: str, k: int = 3) -> list[dict]:
        query_emb = example_model.encode([query])
        sims = cos_sim(query_emb, self.embeddings)[0]
        top_indices = np.argsort(sims)[-k:][::-1]
        return [self.examples[i] for i in top_indices]
 
 
# Usage
selector = DynamicFewShotSelector(example_bank)
examples = selector.select(user_query, k=3)
prompt = build_prompt(system_instructions, examples, user_query)

Include at least one example where the correct output is a refusal or "insufficient information" response. Without this, the model attempts to answer every query regardless of whether the context supports an answer.

Model Routing

Not every request requires the most capable model. A routing layer classifies request complexity and dispatches to the appropriate model tier.

+--------------------------------------------------------------------+
|                    MODEL ROUTING                                    |
+--------------------------------------------------------------------+
|                                                                    |
|                  +-----------------+                                |
|                  | Incoming Query  |                                |
|                  +--------+--------+                                |
|                           |                                         |
|                  +--------v--------+                                |
|                  |   Complexity    |                                |
|                  |   Classifier    |                                |
|                  +--------+--------+                                |
|                           |                                         |
|            +--------------+--------------+                          |
|            |              |              |                          |
|       LOW  |         MED  |        HIGH  |                          |
|            v              v              v                          |
|   +--------+---+ +-------+----+ +-------+----+                    |
|   | Tier 1     | | Tier 2     | | Tier 3     |                    |
|   | Haiku /    | | Sonnet /   | | Opus /     |                    |
|   | GPT-4o-    | | GPT-4o     | | GPT-4 /    |                    |
|   | mini       | |            | | o1         |                    |
|   +------------+ +------------+ +------------+                    |
|   Input: $0.25/M  Input: $3/M   Input: $15/M                     |
|   Output: $1.25/M Output: $15/M  Output: $75/M                   |
|   Latency: 200ms  Latency: 600ms Latency: 2-30s                  |
|                                                                    |
|   Tasks:          Tasks:          Tasks:                           |
|   - Classification - Summarization - Multi-step reasoning          |
|   - Extraction    - Q&A with RAG  - Ambiguous queries             |
|   - Formatting    - Code gen      - Long-form analysis            |
|   - Simple Q&A    - Translation   - Complex code review           |
|                                                                    |
|              +------- Fallback path -------+                       |
|              | If Tier 1 fails quality      |                      |
|              | checks, retry with Tier 2.   |                      |
|              | If Tier 2 fails, retry       |                      |
|              | with Tier 3.                 |                      |
|              +------------------------------+                      |
+--------------------------------------------------------------------+
from enum import Enum
 
 
class ModelTier(Enum):
    TIER_1 = "claude-haiku-4-5-20251001"    # $0.80/M input, $4/M output
    TIER_2 = "claude-sonnet-4-6-20250514"   # $3/M input, $15/M output
    TIER_3 = "claude-opus-4-6-20250901"     # $15/M input, $75/M output
 
 
def classify_complexity(query: str) -> float:
    """Estimate query complexity on a 0-1 scale.
 
    Features:
    - Token count (longer queries correlate with complexity)
    - Question count (multiple questions = higher complexity)
    - Presence of reasoning keywords (compare, analyze, why)
    - Domain-specific signals
    """
    tokens = len(query.split())
    question_marks = query.count("?")
    reasoning_keywords = sum(
        1 for w in ["compare", "analyze", "why", "evaluate", "tradeoff"]
        if w in query.lower()
    )
 
    score = 0.0
    score += min(tokens / 200, 0.3)          # length component
    score += min(question_marks * 0.15, 0.3)  # multi-question component
    score += min(reasoning_keywords * 0.1, 0.4)  # reasoning component
 
    return min(score, 1.0)
 
 
def route_request(query: str) -> ModelTier:
    """Route to appropriate model tier based on complexity."""
    complexity = classify_complexity(query)
    if complexity < 0.3:
        return ModelTier.TIER_1
    elif complexity < 0.65:
        return ModelTier.TIER_2
    else:
        return ModelTier.TIER_3
 
 
def execute_with_fallback(query: str, client, max_tier: int = 3) -> str:
    """Execute query with automatic fallback to higher tiers."""
    tier = route_request(query)
    tiers = list(ModelTier)
    start_idx = tiers.index(tier)
 
    for t in tiers[start_idx:max_tier]:
        response = client.generate(model=t.value, prompt=query)
        if passes_quality_checks(response):
            return response
 
    return generate_fallback_response(query)

Cost impact of routing at scale:

+--------------------------------------------------+
| Scenario         | Requests/day | Monthly Cost   |
|------------------+--------------+----------------|
| All Tier 3       |  200,000     | $45,000        |
| Routed (65/25/10)|  200,000     | $10,800        |
| Savings          |              | $34,200 (76%)  |
+--------------------------------------------------+
Assumes average 1K input + 500 output tokens per request.

Cost Optimization

Prompt Caching

Anthropic's prompt caching avoids reprocessing identical system prompt prefixes. A 2,000-token system prompt sent with every request costs $5/M tokens. With caching, the first request pays full price and subsequent requests pay $0.30/M for the cached portion, a 94% reduction on that component.

Output Length Control

Set max_tokens to the minimum necessary for the task. A classification task returning a JSON label needs at most 50 tokens, not the default 4,096. At $10/M output tokens, reducing average output from 500 to 100 tokens saves $0.004 per request, or $200/day at 50K requests.

Batch Processing

For non-real-time workloads (report generation, bulk classification, nightly data processing), use batch APIs. Most providers offer 50% discounts on batch endpoints. OpenAI's Batch API processes requests within 24 hours at half price.

import json
 
 
def create_batch_file(requests: list[dict], output_path: str):
    """Create a JSONL batch file for OpenAI Batch API."""
    with open(output_path, "w") as f:
        for i, req in enumerate(requests):
            batch_req = {
                "custom_id": f"req-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o-mini",
                    "messages": req["messages"],
                    "max_tokens": req.get("max_tokens", 500),
                },
            }
            f.write(json.dumps(batch_req) + "\n")

Self-Hosted Inference

For sustained high-volume workloads, self-hosted open models reduce per-token cost. vLLM is the standard inference server.

ConfigurationModelGPUThroughputCost/1M tokens
vLLM, Llama 3.1 8B8B params1x A100 80GB~2,500 tok/s~$0.08
vLLM, Llama 3.1 70B70B params4x A100 80GB~800 tok/s~$0.40
vLLM, Mistral 7B7B params1x A10G 24GB~1,800 tok/s~$0.05
API, GPT-4o-miniN/AN/AN/A$0.60 (avg)

The breakeven point for self-hosting versus API depends on utilization. At 80%+ GPU utilization, self-hosting a 7B model breaks even at approximately 200K-300K requests/day versus GPT-4o-mini pricing.

vLLM configuration for production:

# Launch vLLM server
# vllm serve meta-llama/Llama-3.1-8B-Instruct \
#     --tensor-parallel-size 1 \
#     --max-model-len 8192 \
#     --gpu-memory-utilization 0.90 \
#     --enable-chunked-prefill \
#     --max-num-seqs 256
 
# Key parameters:
# --gpu-memory-utilization 0.90   Reserve 10% for overhead
# --max-num-seqs 256              Max concurrent sequences
# --enable-chunked-prefill        Better TTFT for long prompts
# --tensor-parallel-size N        Shard across N GPUs for large models

Monitor GPU utilization. Below 60% indicates overprovisioning. Above 90% risks out-of-memory errors under burst load. Target 70-85%.

Guardrails and Safety

Input Sanitization

import re
 
INJECTION_PATTERNS = [
    r"ignore\s+(all\s+)?previous\s+instructions",
    r"you\s+are\s+now\s+a",
    r"system\s*prompt",
    r"reveal\s+your\s+(instructions|prompt|rules)",
    r"pretend\s+(you\s+are|to\s+be)",
    r"override\s+(previous|system)",
]
 
 
def detect_injection(user_input: str) -> bool:
    """Check for common prompt injection patterns."""
    normalized = user_input.lower().strip()
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, normalized):
            return True
    return False
 
 
def sanitize_input(user_input: str, max_length: int = 4000) -> str:
    """Basic input sanitization."""
    # Truncate excessively long inputs
    if len(user_input) > max_length:
        user_input = user_input[:max_length]
 
    # Remove null bytes and control characters
    user_input = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", user_input)
 
    return user_input.strip()

Pattern-based detection is the weakest layer. It catches obvious attacks but misses obfuscated or novel injection attempts.

Output Filtering

Output filtering is the last line of defense. Check for PII leakage, system prompt content, off-topic responses, and policy violations.

import re
from dataclasses import dataclass
 
 
@dataclass
class FilterResult:
    passed: bool
    violations: list[str]
 
 
def filter_output(
    output: str,
    system_prompt: str,
    allowed_topics: list[str] | None = None,
) -> FilterResult:
    """Post-generation output filter."""
    violations = []
 
    # Check for system prompt leakage
    prompt_fragments = [
        system_prompt[i : i + 50]
        for i in range(0, len(system_prompt) - 50, 25)
    ]
    for fragment in prompt_fragments:
        if fragment.lower() in output.lower():
            violations.append("system_prompt_leakage")
            break
 
    # Check for PII patterns
    pii_patterns = {
        "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
        "credit_card": r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b",
        "email": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b",
        "phone": r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b",
    }
    for pii_type, pattern in pii_patterns.items():
        if re.search(pattern, output):
            violations.append(f"pii_{pii_type}")
 
    return FilterResult(
        passed=len(violations) == 0,
        violations=violations,
    )

Instruction Hierarchy

Place critical safety instructions at the end of the system prompt, after any dynamic content. Models assign higher weight to later instructions, making them harder to override via injection in user input.

[System prompt structure]
1. Role and task description
2. Dynamic context (RAG chunks, user history)
3. Safety constraints and refusal instructions  <-- hardest to override

Anthropic's API supports explicit system/user/assistant role separation. Use the system role for all instructions, never place instructions in user-role messages where they can be confused with user input.

Guardrail Frameworks

NeMo Guardrails (NVIDIA) provides a configuration-based approach to defining conversational boundaries: allowed topics, banned topics, moderation flows, and fact-checking rails. It runs as middleware between the application and the LLM API.

Guardrails AI provides a validator-based approach with pre-built validators for PII detection, toxicity, competitor mentions, and format compliance. Each validator runs independently and can halt or modify the response.

Observability

Tracing

Every LLM request should produce a trace containing: request ID, timestamp, model, prompt version hash, input tokens, output tokens, latency, user ID, session ID, and the full prompt/response (or a reference to it in a log store).

import time
import hashlib
import structlog
from opentelemetry import trace
 
logger = structlog.get_logger()
tracer = trace.get_tracer("llm-service")
 
 
def traced_llm_call(
    client,
    model: str,
    messages: list[dict],
    prompt_version: str,
    user_id: str,
    **kwargs,
) -> dict:
    """LLM call with structured logging and OpenTelemetry tracing."""
    prompt_hash = hashlib.sha256(
        str(messages).encode()
    ).hexdigest()[:12]
 
    with tracer.start_as_current_span("llm_call") as span:
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.prompt_version", prompt_version)
        span.set_attribute("llm.prompt_hash", prompt_hash)
        span.set_attribute("user.id", user_id)
 
        start = time.monotonic()
        try:
            response = client.chat.completions.create(
                model=model, messages=messages, **kwargs
            )
            latency_ms = (time.monotonic() - start) * 1000
            usage = response.usage
 
            span.set_attribute("llm.input_tokens", usage.prompt_tokens)
            span.set_attribute("llm.output_tokens", usage.completion_tokens)
            span.set_attribute("llm.latency_ms", latency_ms)
            span.set_attribute("llm.status", "success")
 
            logger.info(
                "llm_call_complete",
                model=model,
                prompt_version=prompt_version,
                input_tokens=usage.prompt_tokens,
                output_tokens=usage.completion_tokens,
                latency_ms=round(latency_ms, 1),
                user_id=user_id,
            )
 
            return {
                "content": response.choices[0].message.content,
                "usage": {
                    "input_tokens": usage.prompt_tokens,
                    "output_tokens": usage.completion_tokens,
                },
                "latency_ms": latency_ms,
            }
 
        except Exception as e:
            latency_ms = (time.monotonic() - start) * 1000
            span.set_attribute("llm.status", "error")
            span.set_attribute("llm.error", str(e))
            logger.error(
                "llm_call_failed",
                model=model,
                error=str(e),
                latency_ms=round(latency_ms, 1),
            )
            raise

Dashboards

Key metrics to track:

MetricGranularityAlert Threshold
Latency p50, p95, p99Per endpointp99 > 5s
Error ratePer model, per endpoint> 2% over 5 min
Token usage (input + output)Per endpoint, daily> 120% of baseline
Daily spendPer model tier> daily budget
Eval pass ratePer prompt version< baseline - 3%
Retrieval recall@5Per index update< baseline - 5%
Output filter trigger ratePer filter type> 5%

LangSmith provides trace waterfall views, dataset management, and annotation queues specifically for LLM applications. OpenTelemetry provides vendor-neutral instrumentation that integrates with Datadog, Grafana, Honeycomb, and other observability platforms.

Prompt Versioning

Treat prompts as code. Store them in version control. Tag each production prompt with a version identifier. Log the version with every request. This enables correlating quality regressions with specific prompt changes.

PROMPT_REGISTRY = {
    "ticket_classifier": {
        "version": "v2.4.1",
        "template": "...",
        "model": "gpt-4o-mini",
        "max_tokens": 200,
        "temperature": 0.0,
        "eval_baseline": {
            "accuracy": 0.91,
            "latency_p95_ms": 450,
        },
    },
    "qa_generator": {
        "version": "v3.1.0",
        "template": "...",
        "model": "gpt-4o",
        "max_tokens": 1000,
        "temperature": 0.2,
        "eval_baseline": {
            "faithfulness": 0.87,
            "relevance": 0.92,
            "latency_p95_ms": 1800,
        },
    },
}

Production Checklist

Evaluation:

  • Eval dataset with 100+ examples covering normal and edge cases
  • Automated eval suite gating every prompt and model change
  • Baseline metrics for accuracy, faithfulness, relevance, safety
  • Every production bug added as a regression test case

Reliability:

  • Timeout on all LLM API calls (recommended: 30s default, 10s for classification)
  • Graceful degradation when the LLM API is unavailable
  • Rate limiting on inbound requests and outbound API calls
  • Input validation, length limits, and injection detection
  • Output filtering for PII, policy violations, and off-topic content

Cost:

  • Daily spend alerts per model tier
  • Token usage tracking per endpoint
  • Model routing or a documented plan for implementing it
  • Output length constraints (max_tokens) set per task

Operations:

  • Structured logging with trace IDs on every request
  • Dashboards for latency, error rate, token usage, and cost
  • Prompt version tracking in logs
  • Runbook for common failure modes (API timeout, model degradation, cost spike)
  • Rollback procedure for prompt and model changes

Fine-Tuning Decision Framework

Fine-tuning is warranted when prompt engineering reaches diminishing returns, the task is narrow and well-defined, and sufficient labeled data (1,000+ examples) is available.

FactorPrompt EngineeringFine-Tuning
Data requirement3-10 examples1,000+ examples
Iteration speedMinutesHours to days
Per-token costHigher (longer prompts)Lower (shorter prompts)
Task specificityGeneralNarrow, well-defined
MaintenanceUpdate prompt textRetrain on new data
RiskLow (revert prompt)Catastrophic forgetting

Use LoRA (Low-Rank Adaptation) rather than full fine-tuning. LoRA trains 0.1-1% of model parameters, reducing compute cost by 10-100x while achieving 80-95% of full fine-tuning quality. Evaluate the fine-tuned model against the base model on a held-out test set covering both the target task and general capabilities to detect catastrophic forgetting.

# LoRA fine-tuning with Hugging Face PEFT
from peft import LoraConfig, get_peft_model, TaskType
 
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                          # Rank: 8-64, higher = more capacity
    lora_alpha=32,                 # Scaling factor, typically 2x rank
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
)
 
model = get_peft_model(base_model, lora_config)
# Trainable params: ~0.3% of total
# Training time: ~2 hours on 1x A100 for 7B model, 5K examples

Latency Optimization

End-to-end latency for a RAG query breaks down as follows:

+----------------------------------------------+
| Component              | Typical Latency     |
|------------------------+---------------------|
| Query embedding        | 10-30ms             |
| Vector search (HNSW)   | 5-15ms              |
| BM25 search            | 5-10ms              |
| Reranking (20 docs)    | 50-150ms            |
| Context assembly       | 1-5ms               |
| LLM generation (TTFT)  | 200-800ms           |
| LLM generation (total) | 500-3000ms          |
| Output filtering       | 5-20ms              |
| Total                  | 800-4000ms          |
+----------------------------------------------+

Optimization strategies:

  • Run vector search and BM25 search in parallel. Saves 5-15ms.
  • Stream LLM output to reduce perceived latency. Time-to-first-token matters more than total generation time for user-facing applications.
  • Use speculative decoding (vLLM supports this) with a small draft model to speed up generation by 2-3x for self-hosted models.
  • Cache frequent queries. A 10% cache hit rate on 50K daily queries saves 5K LLM calls.
  • Pre-compute embeddings for common query patterns.
import asyncio
 
 
async def parallel_search(query: str, vector_store, bm25_index, k: int = 50):
    """Run vector and keyword search concurrently."""
    vector_task = asyncio.create_task(
        vector_store.async_similarity_search(query, k=k)
    )
    keyword_task = asyncio.create_task(
        bm25_index.async_search(query, k=k)
    )
    vector_results, keyword_results = await asyncio.gather(
        vector_task, keyword_task
    )
    return vector_results, keyword_results

All latency numbers above assume cloud-hosted infrastructure in the same region as the LLM API provider. Cross-region API calls add 50-200ms of network latency.