Production LLM systems differ from prototypes in the same way a load-bearing wall differs from a cardboard cutout. Both look similar from a distance. One of them holds up a building. This document covers the engineering patterns required to move from prototype to production: evaluation frameworks, retrieval-augmented generation, prompt management, model routing, cost control, safety guardrails, and observability.
Evaluation Frameworks
Without a structured evaluation pipeline, prompt changes amount to guesswork. LLM outputs are non-deterministic and high-dimensional. Equality checks against expected values do not work. Production systems require layered evaluation with automated gating.
+--------------------------------------------------------------------+
| EVALUATION PIPELINE |
+--------------------------------------------------------------------+
| |
| Stage 1: Deterministic Checks |
| +--------------------------------------------------------------+ |
| | - JSON schema validation (jsonschema, Pydantic) | |
| | - Output length within [min_tokens, max_tokens] | |
| | - Required field presence | |
| | - Regex pattern matching for structured outputs | |
| | - Language detection (expected locale match) | |
| +--------------------------------------------------------------+ |
| | |
| v |
| Stage 2: Semantic Evaluation |
| +--------------------------------------------------------------+ |
| | - Claim extraction + entailment verification | |
| | - Embedding cosine similarity to golden answers | |
| | - BERTScore F1 against reference responses | |
| | - Named entity overlap ratio | |
| +--------------------------------------------------------------+ |
| | |
| v |
| Stage 3: LLM-as-Judge |
| +--------------------------------------------------------------+ |
| | - Pairwise comparison (A vs B verdicts) | |
| | - Rubric-based scoring on [relevance, accuracy, tone] | |
| | - Safety and policy compliance checks | |
| | - Faithfulness scoring against retrieved context | |
| +--------------------------------------------------------------+ |
| | |
| v |
| Stage 4: Human Review (sampled) |
| +--------------------------------------------------------------+ |
| | - Weekly audit of flagged outputs | |
| | - Inter-annotator agreement tracking (target: kappa > 0.7) | |
| | - New edge cases added to eval dataset | |
| +--------------------------------------------------------------+ |
| |
+--------------------------------------------------------------------+
Stage 1: Deterministic Checks
These run on every response with sub-millisecond overhead. They catch structural failures: malformed JSON, outputs exceeding length constraints, missing required fields, wrong output language.
import json
from jsonschema import validate, ValidationError
RESPONSE_SCHEMA = {
"type": "object",
"required": ["answer", "confidence", "sources"],
"properties": {
"answer": {"type": "string", "minLength": 1},
"confidence": {"type": "number", "minimum": 0, "maximum": 1},
"sources": {
"type": "array",
"items": {"type": "string"},
"minItems": 1,
},
},
}
def validate_response(raw_output: str) -> dict:
"""Parse and validate LLM output against the expected schema."""
try:
parsed = json.loads(raw_output)
except json.JSONDecodeError as e:
raise ValueError(f"Output is not valid JSON: {e}")
validate(instance=parsed, schema=RESPONSE_SCHEMA)
    token_count = len(raw_output.split())  # whitespace split: a rough proxy for tokens
if token_count > 2000:
raise ValueError(
f"Response length {token_count} tokens exceeds 2000 limit"
)
    return parsed

A system handling 50K daily requests should expect 2-5% of responses to fail schema validation on the first attempt. Retry with a stricter prompt or fall back to a template-based response.
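The retry-with-stricter-prompt fallback can be sketched as follows. `call_model` and `validate` are stand-ins for the provider client and the validate_response function above, and the fallback payload is a placeholder, not a real template:

```python
import json


def generate_with_retry(call_model, validate, messages, max_retries=2):
    """Retry schema-failing generations with a stricter follow-up message.

    call_model(messages) -> raw string; validate(raw) -> parsed dict,
    raising ValueError on failure. Both are stand-ins for the real
    client and validator.
    """
    raw = ""
    for _ in range(max_retries + 1):
        raw = call_model(messages)
        try:
            return validate(raw)
        except ValueError as err:
            # Feed the validation error back and demand strict JSON.
            messages = messages + [
                {"role": "assistant", "content": raw},
                {
                    "role": "user",
                    "content": (
                        f"Your previous output failed validation: {err}. "
                        "Respond with ONLY valid JSON matching the schema."
                    ),
                },
            ]
    # All attempts failed: template-based fallback response.
    return {"answer": "Unable to process this request.",
            "confidence": 0.0, "sources": ["fallback"]}
```

The error message from the failed attempt is included in the retry prompt so the model can correct the specific problem rather than guess.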
Stage 2: Semantic Evaluation
Semantic evaluation measures whether the response content is correct, not just whether its format is correct.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity as cos_sim
model = SentenceTransformer("all-MiniLM-L6-v2") # 384 dimensions, 80MB
def embedding_similarity(response: str, reference: str) -> float:
"""Cosine similarity between response and reference embeddings."""
embeddings = model.encode([response, reference])
return float(cos_sim([embeddings[0]], [embeddings[1]])[0][0])
def claim_precision(
response_claims: list[str],
reference_claims: list[str],
threshold: float = 0.85,
) -> float:
"""Fraction of response claims supported by reference claims."""
if not response_claims:
return 1.0
resp_emb = model.encode(response_claims)
ref_emb = model.encode(reference_claims)
sims = cos_sim(resp_emb, ref_emb)
supported = sum(1 for row in sims if row.max() >= threshold)
return supported / len(response_claims)
def claim_recall(
response_claims: list[str],
reference_claims: list[str],
threshold: float = 0.85,
) -> float:
"""Fraction of reference claims covered by the response."""
if not reference_claims:
return 1.0
resp_emb = model.encode(response_claims)
ref_emb = model.encode(reference_claims)
sims = cos_sim(ref_emb, resp_emb)
covered = sum(1 for row in sims if row.max() >= threshold)
    return covered / len(reference_claims)

BERTScore provides an alternative that operates at the token level rather than the claim level. For factual QA tasks, claim-level metrics are more interpretable. For summarization, BERTScore or ROUGE-L are more appropriate.
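For automated gating, claim precision and recall are commonly collapsed into a single F1-style faithfulness score. A minimal sketch; the 0.8 threshold is illustrative and should be tuned against the eval dataset:

```python
def claim_f1(precision: float, recall: float) -> float:
    """Harmonic mean of claim precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def passes_semantic_gate(precision: float, recall: float,
                         min_f1: float = 0.8) -> bool:
    """Gate a response on claim-level F1 (threshold is illustrative)."""
    return claim_f1(precision, recall) >= min_f1
```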
Stage 3: LLM-as-Judge
Pairwise comparison outperforms absolute scoring. When switching from "rate 1-5" to "which is better, A or B?", agreement between the judge model and human raters typically improves from around 60% to 82-87%.
JUDGE_PROMPT = """You are evaluating two responses to a question.
Question: {question}
Context (ground truth): {context}
Response A:
{response_a}
Response B:
{response_b}
Evaluate on these criteria:
1. Factual accuracy relative to the provided context
2. Completeness (does it address the full question?)
3. Conciseness (no unnecessary information)
Provide your reasoning, then state your verdict.
Output JSON:
{{"reasoning": "...", "verdict": "A" | "B" | "TIE"}}
"""
def run_judge(
question: str,
context: str,
response_a: str,
response_b: str,
client,
) -> dict:
"""Run pairwise LLM-as-judge evaluation."""
prompt = JUDGE_PROMPT.format(
question=question,
context=context,
response_a=response_a,
response_b=response_b,
)
result = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}],
temperature=0.0,
response_format={"type": "json_object"},
)
    return json.loads(result.choices[0].message.content)

Position bias is a known issue. The judge model may prefer whichever response appears first. Mitigate this by running each comparison twice with swapped positions and discarding cases where the verdict flips.
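A minimal sketch of that swap-and-compare mitigation, with `judge_fn` standing in for a wrapper around run_judge that returns only the verdict string:

```python
def debiased_judge(question, context, response_a, response_b, judge_fn):
    """Run the pairwise judge twice with swapped positions.

    judge_fn(question, context, a, b) -> "A" | "B" | "TIE". Returns the
    verdict only when it is stable under position swapping; otherwise
    the comparison is treated as a tie and discarded from win-rate stats.
    """
    first = judge_fn(question, context, response_a, response_b)
    swapped = judge_fn(question, context, response_b, response_a)
    # Map the swapped verdict back to the original labels.
    remap = {"A": "B", "B": "A", "TIE": "TIE"}
    if first == remap[swapped]:
        return first
    return "TIE"  # verdict flipped with position: unreliable
```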
Eval Dataset Management
The eval dataset should grow continuously. Start with 50-100 examples. Every production bug becomes a new test case. Target 500+ examples by month three.
+--------------------+
| Modify prompt, |
| model, or config |
+---------+----------+
|
v
+---------+----------+
| Run eval suite |
| (all 4 stages) |
+---------+----------+
|
v
+---------+----------+
| Compare to |
| baseline metrics |
+---------+----------+
|
+-----+------+
| |
v v
+---+----+ +---+----+
| Pass: | | Fail: |
| deploy | | revert |
+--------+ +--------+
Track eval runs in LangSmith or Weights & Biases. LangSmith provides LLM-specific trace visualization, dataset versioning, and annotation queues. W&B is more general-purpose but integrates well with experiment tracking across model training and inference.
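The compare-to-baseline gate in the flow above can be sketched as a simple threshold check. The metric names and the 2-point regression tolerance are illustrative:

```python
def gate_deployment(
    current: dict[str, float],
    baseline: dict[str, float],
    max_regression: float = 0.02,
) -> tuple[bool, list[str]]:
    """Compare eval metrics against the stored baseline.

    Returns (passed, regressed_metrics). A metric regresses if it falls
    more than max_regression below its baseline value.
    """
    regressed = [
        name
        for name, base_value in baseline.items()
        if current.get(name, 0.0) < base_value - max_regression
    ]
    return (len(regressed) == 0, regressed)
```

A failed gate blocks deployment and reports exactly which metrics regressed, which keeps the pass/revert decision auditable.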
Retrieval-Augmented Generation
RAG grounds LLM responses in external data. A naive implementation retrieves irrelevant chunks, wastes context window tokens, and produces hallucinated responses that cite real-looking but fabricated sources. Production RAG requires careful engineering at every stage.
Architecture
+-------------------+
| User Query |
+--------+----------+
|
+--------v----------+
| Query Processing |
| - Classification |
| - Expansion |
| - HyDE (optional)|
+--------+----------+
|
+-------------+-------------+
| |
+--------v----------+ +---------v---------+
| Vector Search | | Lexical Search |
| (ANN, HNSW) | | (BM25) |
| top-k=50 | | top-k=50 |
+--------+----------+ +---------+---------+
| |
+-------------+-------------+
|
+--------v----------+
| Reciprocal Rank |
| Fusion (k=60) |
| output: top 20 |
+--------+----------+
|
+--------v----------+
| Cross-Encoder |
| Reranker |
| output: top 5 |
+--------+----------+
|
+--------v----------+
| Context Assembly |
| - Deduplication |
| - Metadata |
| - Token budget |
+--------+----------+
|
+--------v----------+
| LLM Generation |
| + Citation |
| Extraction |
+--------+----------+
|
+--------v----------+
| Post-Processing |
| - Citation check |
| - Safety filter |
| - Format output |
+-------------------+
Embedding Model Selection
The embedding model determines the ceiling of retrieval quality. Switching models changes recall@10 by 10-20 percentage points depending on the domain.
| Model | Dimensions | Size | Cost (per 1M tokens) | Best For |
|---|---|---|---|---|
| text-embedding-3-small | 1536 | API | $0.02 | General English, cost-sensitive |
| text-embedding-3-large | 3072 | API | $0.13 | General English, high accuracy |
| voyage-code-3 | 1024 | API | $0.06 | Code retrieval |
| Cohere embed-v3 | 1024 | API | $0.10 | Multilingual, compressed binary |
| all-MiniLM-L6-v2 | 384 | 80MB | Free (self-host) | Low-resource, CPU inference |
| multilingual-e5-large-instruct | 1024 | 2.2GB | Free (self-host) | Multilingual, MTEB top-tier |
| nomic-embed-text-v1.5 | 768 | 548MB | Free (self-host) | Matryoshka, variable dims |
Instruction-prefixed embeddings improve retrieval by 3-8% on benchmarks. Prefix documents with "search_document:" and queries with "search_query:" (exact prefix varies by model). This asymmetry accounts for the structural difference between a short query and a long passage.
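A small helper makes the asymmetric prefixing explicit. The prefix strings below follow the nomic-embed convention mentioned above; swap them for whatever the chosen model's card specifies:

```python
def add_retrieval_prefixes(
    queries: list[str],
    documents: list[str],
    query_prefix: str = "search_query: ",
    doc_prefix: str = "search_document: ",
) -> tuple[list[str], list[str]]:
    """Apply asymmetric instruction prefixes before encoding.

    Prefixes are model-specific; these defaults match the nomic-embed
    convention and are not universal.
    """
    return (
        [query_prefix + q for q in queries],
        [doc_prefix + d for d in documents],
    )
```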
Matryoshka embeddings (supported by text-embedding-3-large and nomic-embed-text-v1.5) allow truncating the embedding vector to fewer dimensions at query time. Reducing from 3072 to 1024 dimensions typically costs less than 2% recall while cutting storage and search costs by 66%.
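Truncating a Matryoshka embedding requires re-normalizing the shortened vector so cosine similarity remains meaningful. A minimal sketch with NumPy:

```python
import numpy as np


def truncate_matryoshka(vec: np.ndarray, dims: int = 1024) -> np.ndarray:
    """Truncate a Matryoshka embedding to `dims` and re-normalize.

    Only valid for models trained with Matryoshka representation
    learning; truncating an ordinary embedding this way degrades it.
    """
    truncated = vec[:dims].astype(np.float64)
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated
```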
Chunking Strategies
Chunking determines what the retrieval system can find. A naive RAG pipeline retrieving 20 chunks at 500 tokens each consumes 10K context tokens per query. At $3 per million input tokens, that is $0.03 per query, or $1,500/day at 50K queries. Reducing to 8 well-chosen chunks of 300 tokens saves 7,600 tokens per query and $1,140/day.
| Strategy | Chunk Size | Overlap | Pros | Cons |
|---|---|---|---|---|
| Fixed-size | 512 tokens | 50 tokens | Simple, predictable | Splits mid-sentence |
| Recursive splitting | 500-1000 chars | 100-200 chars | Respects paragraph boundaries | Still arbitrary boundaries |
| Semantic chunking | Variable | None | Preserves topic coherence | Requires embedding each sentence |
| Document-structure | Section-level | None | Respects author's organization | Highly variable chunk sizes |
| Code-aware | Function/class | None | Preserves logical units | Language-specific parsers needed |
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

tokenizer = tiktoken.get_encoding("cl100k_base")  # tokenizer used by the chunkers below
# Fixed-size: baseline approach
def fixed_chunks(text: str, size: int = 512, overlap: int = 50) -> list[str]:
"""Split text into fixed-size token chunks with overlap."""
tokens = tokenizer.encode(text)
chunks = []
for i in range(0, len(tokens), size - overlap):
chunk_tokens = tokens[i : i + size]
chunks.append(tokenizer.decode(chunk_tokens))
return chunks
# Recursive splitting: respects paragraph and sentence boundaries
splitter = RecursiveCharacterTextSplitter(
chunk_size=800,
chunk_overlap=150,
separators=["\n\n", "\n", ". ", ", ", " "],
length_function=lambda t: len(tokenizer.encode(t)),
)
# Semantic chunking: split on topic shifts
def semantic_chunk(
text: str,
embedding_model,
threshold: float = 0.72,
min_chunk_size: int = 100,
) -> list[str]:
"""Split text at points where consecutive sentence
embeddings drop below a similarity threshold."""
    sentences = segment_into_sentences(text)  # any sentence segmenter, e.g. nltk.sent_tokenize
if len(sentences) <= 1:
return [text]
embeddings = embedding_model.encode(sentences)
chunks = []
current = [sentences[0]]
for i in range(1, len(sentences)):
sim = float(
cos_sim([embeddings[i - 1]], [embeddings[i]])[0][0]
)
if sim < threshold and len(" ".join(current)) >= min_chunk_size:
chunks.append(" ".join(current))
current = [sentences[i]]
else:
current.append(sentences[i])
if current:
chunks.append(" ".join(current))
    return chunks

Guidelines for chunk size by content type:
- FAQ, Q&A pairs: 150-300 tokens. Precision matters more than context.
- Technical documentation: 400-800 tokens. Enough context for an explanation to be self-contained.
- Legal and regulatory text: 800-1500 tokens. Clauses reference each other; splitting them destroys meaning.
- Source code: chunk by function, method, or class. A 5-line utility and a 200-line class each form one atomic chunk.
Metadata enrichment is critical. Attach document title, section heading, page number, last-modified timestamp, and source URL to every chunk. This enables filtered retrieval ("find chunks from documents updated in the last 90 days") and proper citation in responses.
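A sketch of what that enrichment and filtered retrieval might look like; the field names are illustrative, not a fixed schema:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class EnrichedChunk:
    """A chunk plus the metadata needed for filtering and citation."""
    content: str
    title: str
    section: str
    source_url: str
    last_modified: datetime


def updated_within(chunks: list[EnrichedChunk], days: int = 90) -> list[EnrichedChunk]:
    """Filtered retrieval: keep chunks from recently updated documents."""
    cutoff = datetime.now() - timedelta(days=days)
    return [c for c in chunks if c.last_modified >= cutoff]
```

In production this filter is usually pushed down into the vector store's metadata query rather than applied in application code, but the shape of the data is the same.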
Hybrid Search
Pure vector search fails on exact-match queries. A user searching for error code ERR_CONN_REFUSED_443 gets chunks about network errors in general. BM25 matches the literal string. Pure keyword search fails on conceptual queries. A query about "authentication" misses documents using "login flow," "JWT validation," and "session management." Vector search captures the semantic relationship.
Combining both with Reciprocal Rank Fusion:
from collections import defaultdict
from dataclasses import dataclass
@dataclass
class SearchResult:
doc_id: str
content: str
metadata: dict
score: float = 0.0
def hybrid_search(
query: str,
vector_store,
bm25_index,
k: int = 20,
rrf_k: int = 60,
vector_weight: float = 1.0,
keyword_weight: float = 1.0,
) -> list[SearchResult]:
"""Hybrid search with weighted Reciprocal Rank Fusion.
Args:
query: Search query string.
vector_store: Vector index with similarity_search method.
bm25_index: BM25 index with search method.
k: Number of final results to return.
rrf_k: RRF smoothing constant (default 60 from original paper).
vector_weight: Weight for vector search contribution.
keyword_weight: Weight for keyword search contribution.
"""
vector_results = vector_store.similarity_search(query, k=k * 3)
keyword_results = bm25_index.search(query, k=k * 3)
scores = defaultdict(float)
doc_map = {}
for rank, doc in enumerate(vector_results):
scores[doc.doc_id] += vector_weight / (rrf_k + rank + 1)
doc_map[doc.doc_id] = doc
for rank, doc in enumerate(keyword_results):
scores[doc.doc_id] += keyword_weight / (rrf_k + rank + 1)
doc_map[doc.doc_id] = doc
ranked_ids = sorted(scores, key=scores.get, reverse=True)[:k]
return [
SearchResult(
doc_id=did,
content=doc_map[did].content,
metadata=doc_map[did].metadata,
score=scores[did],
)
for did in ranked_ids
    ]

The RRF smoothing constant k=60 comes from the original Cormack et al. paper. Values between 40 and 80 produce similar results in practice. The weight parameters allow tuning the balance between semantic and lexical signals. For technical documentation with many acronyms and identifiers, increase keyword_weight to 1.5.
Reranking
Bi-encoder embeddings encode query and document independently. Cross-encoders process the query-document pair together, enabling token-level interaction. This produces more accurate relevance scores but is too slow to run over the full corpus.
The reranking step takes the top 20-50 candidates from hybrid search and reorders them. On internal benchmarks, adding a cross-encoder reranker improves answer accuracy from 71-73% to 83-86%.
| Reranker | Latency (20 docs) | NDCG@10 (MS MARCO) | Deployment |
|---|---|---|---|
| cross-encoder/ms-marco-MiniLM-L-6-v2 | 45ms | 0.39 | Self-hosted, CPU |
| Cohere Rerank v3 | 80-120ms | 0.42 | API |
| bge-reranker-v2-m3 | 60ms | 0.41 | Self-hosted, GPU |
| cross-encoder/ms-marco-TinyBERT-L-2-v2 | 12ms | 0.36 | Self-hosted, CPU |
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
def rerank(
query: str,
candidates: list[SearchResult],
top_k: int = 5,
) -> list[SearchResult]:
"""Rerank candidates using a cross-encoder model."""
pairs = [(query, c.content) for c in candidates]
scores = reranker.predict(pairs)
scored = sorted(
zip(candidates, scores), key=lambda x: x[1], reverse=True
)
return [
SearchResult(
doc_id=c.doc_id,
content=c.content,
metadata=c.metadata,
score=float(s),
)
for c, s in scored[:top_k]
    ]

Allow 50-150ms for reranking in the latency budget. For sub-50ms requirements, use the TinyBERT variant or limit candidates to 10.
Context Assembly and Token Budgeting
After reranking, assemble the context window for the LLM. This involves deduplication, ordering, and fitting within the token budget.
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
def assemble_context(
chunks: list[SearchResult],
max_context_tokens: int = 6000,
system_prompt_tokens: int = 500,
) -> str:
"""Assemble retrieved chunks into a context string
that fits within the token budget."""
available = max_context_tokens - system_prompt_tokens
seen_content = set()
selected = []
total_tokens = 0
for chunk in chunks:
content_hash = hash(chunk.content.strip())
if content_hash in seen_content:
continue
seen_content.add(content_hash)
chunk_text = (
f"[Source: {chunk.metadata.get('title', 'Unknown')}]\n"
f"{chunk.content.strip()}\n"
)
chunk_tokens = len(enc.encode(chunk_text))
if total_tokens + chunk_tokens > available:
break
selected.append(chunk_text)
total_tokens += chunk_tokens
    return "\n---\n".join(selected)

Token budget breakdown for a typical RAG query (GPT-4o, 128K context window):
+-----------------------------------------+
| Component | Tokens | Cost |
|--------------------+---------+----------|
| System prompt | 500 | $0.0013 |
| Retrieved context | 6,000 | $0.0150 |
| User query | 100 | $0.0003 |
| Output | 1,000 | $0.0100 |
| Total per query | 7,600 | $0.0266 |
| Daily (50K queries)| 380M | $1,330 |
+-----------------------------------------+
(Based on GPT-4o pricing: $2.50/M input, $10/M output)
Prompt Engineering Patterns
Chain of Thought with Structured Output
Requesting step-by-step reasoning before the final answer reduces classification errors by 15-30% on multi-step tasks.
You are classifying customer support tickets.
For each ticket:
1. Identify the primary issue category.
2. Determine severity:
- P0: Service outage, data loss, security incident
- P1: Major feature broken, widespread impact
- P2: Minor feature issue, workaround available
- P3: Cosmetic issue, feature request, general question
3. Check escalation rules:
- P0/P1: always escalate
- P2: escalate if customer is enterprise tier or mentions legal action
- P3: never escalate
4. Produce the classification.
Output JSON:
{
"reasoning": "<step-by-step analysis>",
"category": "<string>",
"severity": "P0" | "P1" | "P2" | "P3",
"escalate": true | false
}
Including the reasoning field is not just for debugging. The model produces better values for severity and escalate when it writes its reasoning first, because autoregressive generation conditions later tokens on earlier ones.
Prompt Chaining
When a single prompt handles too many tasks, accuracy degrades. Decompose into a pipeline where each step has a focused prompt, a specific model, and independent evaluation.
+--------------------------------------------------------------------+
| PROMPT CHAIN |
| (Document Q&A Pipeline) |
+--------------------------------------------------------------------+
| |
| Step 1: Query Analysis Model: gpt-4o-mini |
| +--------------------------------------------------------------+ |
| | Input: Raw user question | |
| | Task: Classify intent, extract entities, rewrite query | |
| | Output: {intent, entities, rewritten_query} | |
| | Cost: ~150 input + 100 output tokens = $0.0001 | |
| +------------------------------+-------------------------------+ |
| | |
| v |
| Step 2: Retrieval Model: none (retrieval only) |
| +--------------------------------------------------------------+ |
| | Input: Rewritten query + entity filters | |
| | Task: Hybrid search + reranking | |
| | Output: Top 5 relevant chunks with metadata | |
| | Cost: Compute only, no LLM tokens | |
| +------------------------------+-------------------------------+ |
| | |
| v |
| Step 3: Answer Generation Model: gpt-4o |
| +--------------------------------------------------------------+ |
| | Input: Question + retrieved chunks + citation instructions | |
| | Task: Generate answer with inline citations | |
| | Output: Answer text with [1], [2] citation markers | |
| | Cost: ~6500 input + 500 output tokens = $0.021 | |
| +------------------------------+-------------------------------+ |
| | |
| v |
| Step 4: Verification Model: gpt-4o-mini |
| +--------------------------------------------------------------+ |
| | Input: Answer + source chunks | |
| | Task: Verify each citation is supported by its source | |
| | Output: {verified: bool, unsupported_claims: [...]} | |
| | Cost: ~3000 input + 200 output tokens = $0.0013 | |
| +--------------------------------------------------------------+ |
| |
| Total per query: ~$0.023 |
| Total latency: ~1.2s (steps 1+3+4 sequential, step 2 ~200ms) |
+--------------------------------------------------------------------+
Each step can be evaluated independently. If Step 4 flags unsupported claims above a threshold, the response can be regenerated or routed to human review.
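One possible routing policy on top of Step 4's output; the action names and the regenerate threshold are assumptions for illustration, not part of the pipeline above:

```python
def route_verified_answer(verification: dict, answer: str,
                          max_for_regen: int = 2):
    """Route on Step 4 output: {"verified": bool, "unsupported_claims": [...]}.

    Returns (action, payload). Thresholds and action names are
    illustrative and should be tuned per application.
    """
    unsupported = verification.get("unsupported_claims", [])
    if verification.get("verified", False) and not unsupported:
        return ("deliver", answer)
    if len(unsupported) <= max_for_regen:
        return ("regenerate", None)   # retry, feeding the flags back as feedback
    return ("human_review", None)     # too many unsupported claims to trust a retry
```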
Few-Shot Example Selection
Static few-shot examples are effective but suboptimal for diverse query distributions. Dynamic few-shot selection retrieves the most relevant examples from an example bank based on the input query.
from sentence_transformers import SentenceTransformer
import numpy as np
example_model = SentenceTransformer("all-MiniLM-L6-v2")
class DynamicFewShotSelector:
"""Select few-shot examples most similar to the input query."""
def __init__(self, examples: list[dict]):
self.examples = examples
self.embeddings = example_model.encode(
[ex["input"] for ex in examples]
)
def select(self, query: str, k: int = 3) -> list[dict]:
query_emb = example_model.encode([query])
sims = cos_sim(query_emb, self.embeddings)[0]
top_indices = np.argsort(sims)[-k:][::-1]
return [self.examples[i] for i in top_indices]
# Usage
selector = DynamicFewShotSelector(example_bank)
examples = selector.select(user_query, k=3)
prompt = build_prompt(system_instructions, examples, user_query)

Include at least one example where the correct output is a refusal or "insufficient information" response. Without this, the model attempts to answer every query regardless of whether the context supports an answer.
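A minimal example bank including one refusal entry might look like this; all contents are placeholders:

```python
# Illustrative example bank for dynamic few-shot selection.
example_bank = [
    {
        "input": "What is the refund window for annual plans?",
        "output": "Annual plans can be refunded within 30 days of purchase.",
    },
    {
        "input": "How do I rotate my API key?",
        "output": "Go to Settings > API Keys and click Rotate.",
    },
    {
        # Refusal example: teaches the model to decline rather than guess.
        "input": "What did the CEO announce at yesterday's all-hands?",
        "output": "The provided context does not contain enough information to answer this question.",
    },
]
```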
Model Routing
Not every request requires the most capable model. A routing layer classifies request complexity and dispatches to the appropriate model tier.
+--------------------------------------------------------------------+
| MODEL ROUTING |
+--------------------------------------------------------------------+
| |
| +-----------------+ |
| | Incoming Query | |
| +--------+--------+ |
| | |
| +--------v--------+ |
| | Complexity | |
| | Classifier | |
| +--------+--------+ |
| | |
| +--------------+--------------+ |
| | | | |
| LOW | MED | HIGH | |
| v v v |
| +--------+---+ +-------+----+ +-------+----+ |
| | Tier 1 | | Tier 2 | | Tier 3 | |
| | Haiku / | | Sonnet / | | Opus / | |
| | GPT-4o- | | GPT-4o | | GPT-4 / | |
| | mini | | | | o1 | |
| +------------+ +------------+ +------------+ |
| Input: $0.25/M Input: $3/M Input: $15/M |
| Output: $1.25/M Output: $15/M Output: $75/M |
| Latency: 200ms Latency: 600ms Latency: 2-30s |
| |
| Tasks: Tasks: Tasks: |
| - Classification - Summarization - Multi-step reasoning |
| - Extraction - Q&A with RAG - Ambiguous queries |
| - Formatting - Code gen - Long-form analysis |
| - Simple Q&A - Translation - Complex code review |
| |
| +------- Fallback path -------+ |
| | If Tier 1 fails quality | |
| | checks, retry with Tier 2. | |
| | If Tier 2 fails, retry | |
| | with Tier 3. | |
| +------------------------------+ |
+--------------------------------------------------------------------+
from enum import Enum
class ModelTier(Enum):
TIER_1 = "claude-haiku-4-5-20251001" # $0.80/M input, $4/M output
TIER_2 = "claude-sonnet-4-6-20250514" # $3/M input, $15/M output
TIER_3 = "claude-opus-4-6-20250901" # $15/M input, $75/M output
def classify_complexity(query: str) -> float:
"""Estimate query complexity on a 0-1 scale.
Features:
- Token count (longer queries correlate with complexity)
- Question count (multiple questions = higher complexity)
- Presence of reasoning keywords (compare, analyze, why)
- Domain-specific signals
"""
tokens = len(query.split())
question_marks = query.count("?")
reasoning_keywords = sum(
1 for w in ["compare", "analyze", "why", "evaluate", "tradeoff"]
if w in query.lower()
)
score = 0.0
score += min(tokens / 200, 0.3) # length component
score += min(question_marks * 0.15, 0.3) # multi-question component
score += min(reasoning_keywords * 0.1, 0.4) # reasoning component
return min(score, 1.0)
def route_request(query: str) -> ModelTier:
"""Route to appropriate model tier based on complexity."""
complexity = classify_complexity(query)
if complexity < 0.3:
return ModelTier.TIER_1
elif complexity < 0.65:
return ModelTier.TIER_2
else:
return ModelTier.TIER_3
def execute_with_fallback(query: str, client, max_tier: int = 3) -> str:
"""Execute query with automatic fallback to higher tiers."""
tier = route_request(query)
tiers = list(ModelTier)
start_idx = tiers.index(tier)
for t in tiers[start_idx:max_tier]:
response = client.generate(model=t.value, prompt=query)
if passes_quality_checks(response):
return response
    return generate_fallback_response(query)

Cost impact of routing at scale:
+-----------------------------------------------------+
| Scenario          | Requests/day | Monthly Cost     |
|-------------------+--------------+------------------|
| All Tier 3        | 200,000      | $315,000         |
| Routed (65/25/10) | 200,000      | $50,700          |
| Savings           |              | $264,300 (84%)   |
+-----------------------------------------------------+
Assumes average 1K input + 500 output tokens per request.
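Using the per-tier prices from the routing diagram (Tier 1 $0.25/$1.25, Tier 2 $3/$15, Tier 3 $15/$75 per million input/output tokens), the blended cost of a traffic mix can be computed directly:

```python
def monthly_cost(
    requests_per_day: int,
    mix: list[float],
    prices: list[tuple[float, float]],
    in_tokens: int = 1000,
    out_tokens: int = 500,
    days: int = 30,
) -> float:
    """Blended monthly cost for a routed traffic mix.

    mix: fraction of traffic sent to each tier (must align with prices).
    prices: (input $/M tokens, output $/M tokens) per tier.
    """
    per_request = [
        (in_tokens * p_in + out_tokens * p_out) / 1_000_000
        for p_in, p_out in prices
    ]
    blended = sum(f * c for f, c in zip(mix, per_request))
    return requests_per_day * days * blended
```

With a 65/25/10 mix at 200K requests/day, this gives roughly $50,700/month versus $315,000/month for all-Tier-3 traffic.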
Cost Optimization
Prompt Caching
Anthropic's prompt caching avoids reprocessing identical system prompt prefixes. A 2,000-token system prompt sent with every request costs $5/M tokens. With caching, the first request pays full price and subsequent requests pay $0.30/M for the cached portion, a 94% reduction on that component.
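Whether caching pays off can be estimated directly. The sketch below uses the prices quoted above plus a 1.25x surcharge on the first cache write, which follows Anthropic's documented pricing; verify current rates before relying on these numbers:

```python
def prompt_caching_savings(
    prefix_tokens: int,
    requests: int,
    base_price_per_m: float = 5.0,
    cache_read_price_per_m: float = 0.30,
    cache_write_multiplier: float = 1.25,
) -> tuple[float, float]:
    """Return (uncached_cost, cached_cost) in dollars for a stable
    system-prompt prefix, assuming every request after the first hits
    the cache (real hit rates depend on traffic patterns and TTL)."""
    uncached = prefix_tokens * requests * base_price_per_m / 1e6
    cached = (
        prefix_tokens * base_price_per_m * cache_write_multiplier / 1e6  # first write
        + prefix_tokens * (requests - 1) * cache_read_price_per_m / 1e6  # reads
    )
    return uncached, cached
```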
Output Length Control
Set max_tokens to the minimum necessary for the task. A classification task returning a JSON label needs at most 50 tokens, not the default 4,096. At $10/M output tokens, reducing average output from 500 to 100 tokens saves $0.004 per request, or $200/day at 50K requests.
Batch Processing
For non-real-time workloads (report generation, bulk classification, nightly data processing), use batch APIs. Most providers offer 50% discounts on batch endpoints. OpenAI's Batch API processes requests within 24 hours at half price.
import json
def create_batch_file(requests: list[dict], output_path: str):
"""Create a JSONL batch file for OpenAI Batch API."""
with open(output_path, "w") as f:
for i, req in enumerate(requests):
batch_req = {
"custom_id": f"req-{i}",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": "gpt-4o-mini",
"messages": req["messages"],
"max_tokens": req.get("max_tokens", 500),
},
}
            f.write(json.dumps(batch_req) + "\n")

Self-Hosted Inference
For sustained high-volume workloads, self-hosted open models reduce per-token cost. vLLM is the standard inference server.
| Configuration | Model | GPU | Throughput | Cost/1M tokens |
|---|---|---|---|---|
| vLLM, Llama 3.1 8B | 8B params | 1x A100 80GB | ~2,500 tok/s | ~$0.08 |
| vLLM, Llama 3.1 70B | 70B params | 4x A100 80GB | ~800 tok/s | ~$0.40 |
| vLLM, Mistral 7B | 7B params | 1x A10G 24GB | ~1,800 tok/s | ~$0.05 |
| API, GPT-4o-mini | N/A | N/A | N/A | $0.60 (avg) |
The breakeven point for self-hosting versus API depends on utilization. At 80%+ GPU utilization, self-hosting a 7B model breaks even at approximately 200K-300K requests/day versus GPT-4o-mini pricing.
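A back-of-envelope breakeven check, assuming the GPU runs 24 hours at full utilization and ignoring ops overhead and redundancy (both push the practical breakeven higher). The default API price matches the GPT-4o-mini figure in the table; the GPU hourly rate is an input, not a vendor quote:

```python
def breakeven_requests_per_day(
    gpu_cost_per_hour: float,
    api_price_per_m: float = 0.60,
    tokens_per_request: int = 1500,
) -> float:
    """Requests/day at which a 24h self-hosted GPU matches API spend.

    Ignores throughput limits: whether the GPU can actually serve that
    volume must be checked separately against the table above.
    """
    daily_gpu_cost = gpu_cost_per_hour * 24
    api_cost_per_request = tokens_per_request * api_price_per_m / 1e6
    return daily_gpu_cost / api_cost_per_request
```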
vLLM configuration for production:
# Launch vLLM server
# vllm serve meta-llama/Llama-3.1-8B-Instruct \
# --tensor-parallel-size 1 \
# --max-model-len 8192 \
# --gpu-memory-utilization 0.90 \
# --enable-chunked-prefill \
# --max-num-seqs 256
# Key parameters:
# --gpu-memory-utilization 0.90 Reserve 10% for overhead
# --max-num-seqs 256 Max concurrent sequences
# --enable-chunked-prefill Better TTFT for long prompts
# --tensor-parallel-size N   Shard across N GPUs for large models

Monitor GPU utilization. Below 60% indicates overprovisioning. Above 90% risks out-of-memory errors under burst load. Target 70-85%.
Guardrails and Safety
Input Sanitization
import re
INJECTION_PATTERNS = [
r"ignore\s+(all\s+)?previous\s+instructions",
r"you\s+are\s+now\s+a",
r"system\s*prompt",
r"reveal\s+your\s+(instructions|prompt|rules)",
r"pretend\s+(you\s+are|to\s+be)",
r"override\s+(previous|system)",
]
def detect_injection(user_input: str) -> bool:
"""Check for common prompt injection patterns."""
normalized = user_input.lower().strip()
for pattern in INJECTION_PATTERNS:
if re.search(pattern, normalized):
return True
return False
def sanitize_input(user_input: str, max_length: int = 4000) -> str:
"""Basic input sanitization."""
# Truncate excessively long inputs
if len(user_input) > max_length:
user_input = user_input[:max_length]
# Remove null bytes and control characters
user_input = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", user_input)
    return user_input.strip()

Pattern-based detection is the weakest layer. It catches obvious attacks but misses obfuscated or novel injection attempts.
Output Filtering
Output filtering is the last line of defense. Check for PII leakage, system prompt content, off-topic responses, and policy violations.
import re
from dataclasses import dataclass
@dataclass
class FilterResult:
passed: bool
violations: list[str]
def filter_output(
output: str,
system_prompt: str,
allowed_topics: list[str] | None = None,
) -> FilterResult:
"""Post-generation output filter."""
violations = []
# Check for system prompt leakage
prompt_fragments = [
system_prompt[i : i + 50]
for i in range(0, len(system_prompt) - 50, 25)
]
for fragment in prompt_fragments:
if fragment.lower() in output.lower():
violations.append("system_prompt_leakage")
break
# Check for PII patterns
pii_patterns = {
"ssn": r"\b\d{3}-\d{2}-\d{4}\b",
"credit_card": r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b",
"email": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b",
"phone": r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b",
}
for pii_type, pattern in pii_patterns.items():
if re.search(pattern, output):
violations.append(f"pii_{pii_type}")
return FilterResult(
passed=len(violations) == 0,
violations=violations,
    )

Instruction Hierarchy
Place critical safety instructions at the end of the system prompt, after any dynamic content. Models assign higher weight to later instructions, making them harder to override via injection in user input.
[System prompt structure]
1. Role and task description
2. Dynamic context (RAG chunks, user history)
3. Safety constraints and refusal instructions <-- hardest to override
Anthropic's API supports explicit system/user/assistant role separation. Use the system role for all instructions, never place instructions in user-role messages where they can be confused with user input.
Guardrail Frameworks
NeMo Guardrails (NVIDIA) provides a configuration-based approach to defining conversational boundaries: allowed topics, banned topics, moderation flows, and fact-checking rails. It runs as middleware between the application and the LLM API.
Guardrails AI provides a validator-based approach with pre-built validators for PII detection, toxicity, competitor mentions, and format compliance. Each validator runs independently and can halt or modify the response.
Observability
Tracing
Every LLM request should produce a trace containing: request ID, timestamp, model, prompt version hash, input tokens, output tokens, latency, user ID, session ID, and the full prompt/response (or a reference to it in a log store).
import time
import hashlib
import structlog
from opentelemetry import trace
logger = structlog.get_logger()
tracer = trace.get_tracer("llm-service")
def traced_llm_call(
client,
model: str,
messages: list[dict],
prompt_version: str,
user_id: str,
**kwargs,
) -> dict:
"""LLM call with structured logging and OpenTelemetry tracing."""
prompt_hash = hashlib.sha256(
str(messages).encode()
).hexdigest()[:12]
with tracer.start_as_current_span("llm_call") as span:
span.set_attribute("llm.model", model)
span.set_attribute("llm.prompt_version", prompt_version)
span.set_attribute("llm.prompt_hash", prompt_hash)
span.set_attribute("user.id", user_id)
start = time.monotonic()
try:
response = client.chat.completions.create(
model=model, messages=messages, **kwargs
)
latency_ms = (time.monotonic() - start) * 1000
usage = response.usage
span.set_attribute("llm.input_tokens", usage.prompt_tokens)
span.set_attribute("llm.output_tokens", usage.completion_tokens)
span.set_attribute("llm.latency_ms", latency_ms)
span.set_attribute("llm.status", "success")
logger.info(
"llm_call_complete",
model=model,
prompt_version=prompt_version,
input_tokens=usage.prompt_tokens,
output_tokens=usage.completion_tokens,
latency_ms=round(latency_ms, 1),
user_id=user_id,
)
return {
"content": response.choices[0].message.content,
"usage": {
"input_tokens": usage.prompt_tokens,
"output_tokens": usage.completion_tokens,
},
"latency_ms": latency_ms,
}
except Exception as e:
latency_ms = (time.monotonic() - start) * 1000
span.set_attribute("llm.status", "error")
span.set_attribute("llm.error", str(e))
logger.error(
"llm_call_failed",
model=model,
error=str(e),
latency_ms=round(latency_ms, 1),
)
            raise
Dashboards
Key metrics to track:
| Metric | Granularity | Alert Threshold |
|---|---|---|
| Latency p50, p95, p99 | Per endpoint | p99 > 5s |
| Error rate | Per model, per endpoint | > 2% over 5 min |
| Token usage (input + output) | Per endpoint, daily | > 120% of baseline |
| Daily spend | Per model tier | > daily budget |
| Eval pass rate | Per prompt version | < baseline - 3% |
| Retrieval recall@5 | Per index update | < baseline - 5% |
| Output filter trigger rate | Per filter type | > 5% |
LangSmith provides trace waterfall views, dataset management, and annotation queues specifically for LLM applications. OpenTelemetry provides vendor-neutral instrumentation that integrates with Datadog, Grafana, Honeycomb, and other observability platforms.
Prompt Versioning
Treat prompts as code. Store them in version control. Tag each production prompt with a version identifier. Log the version with every request. This enables correlating quality regressions with specific prompt changes.
PROMPT_REGISTRY = {
"ticket_classifier": {
"version": "v2.4.1",
"template": "...",
"model": "gpt-4o-mini",
"max_tokens": 200,
"temperature": 0.0,
"eval_baseline": {
"accuracy": 0.91,
"latency_p95_ms": 450,
},
},
"qa_generator": {
"version": "v3.1.0",
"template": "...",
"model": "gpt-4o",
"max_tokens": 1000,
"temperature": 0.2,
"eval_baseline": {
"faithfulness": 0.87,
"relevance": 0.92,
"latency_p95_ms": 1800,
},
},
}
Production Checklist
Evaluation:
- Eval dataset with 100+ examples covering normal and edge cases
- Automated eval suite gating every prompt and model change
- Baseline metrics for accuracy, faithfulness, relevance, safety
- Every production bug added as a regression test case
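The gating item above can be sketched as a baseline comparison run in CI before a prompt or model change ships. `gate_release` is a hypothetical helper; the 3% tolerance matches the eval pass rate alert threshold in the dashboard table:

```python
def gate_release(
    metrics: dict[str, float],
    baseline: dict[str, float],
    tolerance: float = 0.03,
) -> tuple[bool, list[str]]:
    """Fail the gate if any metric drops more than `tolerance` below baseline.

    Metrics missing from the new run count as regressions.
    """
    regressions = []
    for name, base in baseline.items():
        value = metrics.get(name, 0.0)
        if value < base - tolerance:
            regressions.append(
                f"{name}: {value:.3f} below baseline {base:.3f} - {tolerance}"
            )
    return len(regressions) == 0, regressions
```

A CI job would load the candidate prompt's eval results, compare them against the registry's `eval_baseline`, and block the deploy when the gate fails.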
Reliability:
- Timeout on all LLM API calls (recommended: 30s default, 10s for classification)
- Graceful degradation when the LLM API is unavailable
- Rate limiting on inbound requests and outbound API calls
- Input validation, length limits, and injection detection
- Output filtering for PII, policy violations, and off-topic content
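The rate-limiting item above can be sketched as a token bucket on outbound API calls. `TokenBucket` is an illustrative helper, not a library API:

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: allow bursts up to `burst`, refill at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; return False to shed the request."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A denied `allow()` call would typically queue the request or return a 429 upstream rather than hammering the LLM API.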
Cost:
- Daily spend alerts per model tier
- Token usage tracking per endpoint
- Model routing or a documented plan for implementing it
- Output length constraints (max_tokens) set per task
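A sketch of the per-request cost accounting behind spend alerts. The prices in `PRICE_PER_M` are illustrative placeholders, not current provider rates:

```python
# Hypothetical USD prices per million tokens; substitute your provider's rates.
PRICE_PER_M = {
    "gpt-4o":      {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def request_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of a single request from token counts and per-million prices."""
    p = PRICE_PER_M[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def over_budget(daily_costs: list[float], budget_usd: float) -> bool:
    """True when accumulated daily spend exceeds the configured budget."""
    return sum(daily_costs) > budget_usd
```

Accumulating `request_cost_usd` per model tier in the tracing pipeline gives the daily spend metric the dashboard table alerts on.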
Operations:
- Structured logging with trace IDs on every request
- Dashboards for latency, error rate, token usage, and cost
- Prompt version tracking in logs
- Runbook for common failure modes (API timeout, model degradation, cost spike)
- Rollback procedure for prompt and model changes
Fine-Tuning Decision Framework
Fine-tuning is warranted when prompt engineering reaches diminishing returns, the task is narrow and well-defined, and sufficient labeled data (1,000+ examples) is available.
| Factor | Prompt Engineering | Fine-Tuning |
|---|---|---|
| Data requirement | 3-10 examples | 1,000+ examples |
| Iteration speed | Minutes | Hours to days |
| Per-token cost | Higher (longer prompts) | Lower (shorter prompts) |
| Task specificity | General | Narrow, well-defined |
| Maintenance | Update prompt text | Retrain on new data |
| Risk | Low (revert prompt) | Catastrophic forgetting |
Use LoRA (Low-Rank Adaptation) rather than full fine-tuning. LoRA trains 0.1-1% of model parameters, reducing compute cost by 10-100x while achieving 80-95% of full fine-tuning quality. Evaluate the fine-tuned model against the base model on a held-out test set covering both the target task and general capabilities to detect catastrophic forgetting.
# LoRA fine-tuning with Hugging Face PEFT
from peft import LoraConfig, get_peft_model, TaskType
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16, # Rank: 8-64, higher = more capacity
lora_alpha=32, # Scaling factor, typically 2x rank
lora_dropout=0.05,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
)
model = get_peft_model(base_model, lora_config)
# Trainable params: ~0.3% of total
# Training time: ~2 hours on 1x A100 for 7B model, 5K examples
Latency Optimization
End-to-end latency for a RAG query breaks down as follows:
| Component | Typical Latency |
|---|---|
| Query embedding | 10-30ms |
| Vector search (HNSW) | 5-15ms |
| BM25 search | 5-10ms |
| Reranking (20 docs) | 50-150ms |
| Context assembly | 1-5ms |
| LLM generation (TTFT) | 200-800ms |
| LLM generation (total) | 500-3000ms |
| Output filtering | 5-20ms |
| Total | 800-4000ms |
Optimization strategies:
- Run vector search and BM25 search in parallel. Saves 5-15ms.
- Stream LLM output to reduce perceived latency. Time-to-first-token matters more than total generation time for user-facing applications.
- Use speculative decoding (vLLM supports this) with a small draft model to speed up generation by 2-3x for self-hosted models.
- Cache frequent queries. A 10% cache hit rate on 50K daily queries saves 5K LLM calls.
- Pre-compute embeddings for common query patterns.
import asyncio
async def parallel_search(query: str, vector_store, bm25_index, k: int = 50):
"""Run vector and keyword search concurrently."""
vector_task = asyncio.create_task(
vector_store.async_similarity_search(query, k=k)
)
keyword_task = asyncio.create_task(
bm25_index.async_search(query, k=k)
)
vector_results, keyword_results = await asyncio.gather(
vector_task, keyword_task
)
    return vector_results, keyword_results
All latency numbers above assume cloud-hosted infrastructure in the same region as the LLM API provider. Cross-region API calls add 50-200ms of network latency.