LLM Log Debugging Glossary

A reference guide to common terms used when exploring, debugging, and curating LLM logs with Hyperparam.

Agent span: A distinct unit of execution within a trace that represents a specific, autonomous action taken by an AI agent, such as a tool call, a reasoning step or a planning phase. It acts as a nested container within larger distributed traces, recording input/output data, latency and metadata for debugging complex autonomous workflows.

Chunking: A technique for splitting large logs into smaller, semantically meaningful units of text or data to improve comprehension, retention or AI processing efficiency.
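A minimal sketch of the idea, assuming a simple character-based splitter with a small overlap so content that straddles a boundary appears in both neighboring chunks (the sizes and overlap are illustrative, not fixed values):

```python
def chunk_text(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into fixed-size character chunks. Each chunk repeats the
    last `overlap` characters of its predecessor so boundary content isn't lost."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

log = "x" * 500
parts = chunk_text(log, size=200, overlap=20)
print(len(parts), [len(p) for p in parts])  # → 3 [200, 200, 140]
```

Production systems usually chunk by tokens or by semantic boundaries (paragraphs, log entries) rather than raw characters, but the overlap trick carries over.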

Context: The total input data (system instructions, conversation history, user prompts and retrieved documents) that a model considers during a run, often called the "context window." It serves as short-term memory, enabling coherence and relevance in responses, and is measured in tokens.

Context rot: Also called "context loss" or "context degradation," refers to the measurable decline of an LLM's ability to recall, maintain or act upon information as the input context window fills up, even before the maximum token limit is reached. It's noticeable as a continuous, silent degradation of the model's output, frequently characterized by the model struggling to retrieve information from long prompts or forgetting earlier instructions.

Dataset curation: The process of using LLM interaction data (prompts, responses, user feedback, metadata) generated in production to identify, filter, clean and refine data for future training or fine-tuning. Techniques include annotation, labeling, filtering, segmentation and derived columns.

Dataset-scale inspection: The review of an entire, multi-gigabyte dataset of LLM logs. Helpful for pinpointing issues that are harder to see with sampling, such as distribution drift or repeated wasted calls.

Distribution drift: Occurs when the topics, language style or complexity of user prompts shift over time away from the data the model was originally trained on, leading to irrelevant responses, reduced accuracy or hallucinations.

Evaluation drift: The gradual, subtle decline in the quality of a model's outputs over time despite the model and prompts remaining technically unchanged. Shows up as a slow erosion of trust, where the model generates fluent and confident answers that are increasingly inaccurate or irrelevant.

Evaluation signals: Metrics and data points captured while a model runs that indicate the quality, safety and operational performance of its outputs. Essential for identifying performance regressions or policy violations in production.

Hallucination detection: Identifying scenarios where the LLM produces plausible but factually incorrect information. Often involves evaluating model outputs against reliable sources — such as provided context in RAG systems — to ensure faithfulness, factual consistency and accuracy.

Hallucinations: Occur when the model generates confident but illogical, fabricated or false information that differs from the provided data or known facts. They take the form of incorrect answers, fabricated citations or auto-generated pseudo-facts.

Invalid tool argument: An error that occurs when the model tries to call a function or tool but generates arguments that don't match the expected structure, type or format defined in the tool's schema. The LLM essentially acts like a buggy API client, producing incorrect inputs that the backend system can't execute.
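A sketch of how a backend might catch this error class before executing a tool, assuming a minimal hand-rolled schema of `{name: (type, required)}`; real systems typically validate against the tool's JSON Schema instead:

```python
def validate_args(args: dict, schema: dict) -> list[str]:
    """Check model-generated tool arguments against a minimal schema of
    {name: (expected_type, required)}. Returns human-readable errors."""
    errors = []
    for name, (expected_type, required) in schema.items():
        if name not in args:
            if required:
                errors.append(f"missing required argument: {name}")
            continue
        if not isinstance(args[name], expected_type):
            errors.append(f"{name}: expected {expected_type.__name__}, "
                          f"got {type(args[name]).__name__}")
    for name in args:
        if name not in schema:
            errors.append(f"unexpected argument: {name}")
    return errors

# Hypothetical weather-lookup tool: the model sent "3" instead of 3
schema = {"city": (str, True), "days": (int, False)}
print(validate_args({"city": "Oslo", "days": "3"}, schema))
# → ['days: expected int, got str']
```

Logging these validation errors alongside the raw model output makes it easy to see whether failures come from the schema, the prompt or the model.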

Label broadcasting: The automated, scalable application of metadata, categories or tags (generated by an LLM) across large, unstructured datasets.

Latency (P90/P95/P99): The time a model takes to return a response, summarized by percentiles: PN is the latency at or below which N% of responses complete. P90 reflects what the majority of users experience, ignoring the slowest 10% of responses. A high P95 points to systemic slowness rather than rare outliers. P99 captures the slowest 1% of responses, the worst-case tail.
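A worked example using the nearest-rank method (one of several percentile conventions; libraries may interpolate and give slightly different values):

```python
import math

def percentile(latencies_ms: list[float], p: float) -> float:
    """Nearest-rank percentile: the value at or below which p% of samples fall."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical response times; note how one slow outlier dominates P99
latencies = [120, 95, 110, 480, 105, 130, 100, 2500, 115, 125]
for p in (50, 90, 99):
    print(f"P{p}: {percentile(latencies, p)} ms")
# → P50: 115 ms / P90: 480 ms / P99: 2500 ms
```

The gap between P50 (115 ms) and P99 (2500 ms) is exactly the kind of tail behavior averages hide.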

Large Language Model (LLM): A large deep learning model that excels at natural language processing tasks. LLMs are trained on massive datasets to understand, summarize, generate and predict human-like language. Popular examples include ChatGPT, Claude and Gemini.

LLM-as-a-judge: An automated evaluation method where an LLM acts as an evaluator to analyze, score and provide feedback on the outputs of another system. Within an LLM log, this appears as structured, AI-generated critiques including numerical scores and categorical feedback stored alongside each user prompt and AI response. Evaluation metrics typically include relevance, consistency, accuracy/factuality and ROUGE/BLEU.
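A sketch of what such a structured critique might look like as a log row; the field names here are illustrative, not a fixed schema:

```python
import json

# Hypothetical judge record stored alongside a prompt/response pair
judge_record = {
    "prompt_id": "req-1042",
    "scores": {"relevance": 4, "consistency": 5, "factuality": 3},
    "verdict": "partial",
    "critique": "Cites a 2021 figure for a question about 2023 data.",
}
print(json.dumps(judge_record, indent=2))
```

Keeping scores numeric and verdicts categorical makes judge output easy to filter and aggregate at dataset scale.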

LLM call span: A direct invocation of an LLM capturing inputs, outputs, sampling parameters (e.g., temperature) and token usage.

LLM log: A record of all interactions, API calls and internal processes related to a large language model. These logs contain input prompts, generated output, and metadata (latency, user ID). Used for debugging, monitoring AI performance, tracking usage for billing and analyzing security events.

Log templatization/parsing: Using LLMs to convert raw, unstructured log data into structured events by identifying static templates and dynamic variables. Simplifies log analysis, enabling AI to detect patterns and anomalies efficiently.
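A toy illustration of the core transformation: replace the dynamic parts of a raw line with placeholders to recover the static template. Real parsers (and LLM-based ones) handle far more variable types than the numbers and quoted strings assumed here:

```python
import re

def templatize(line: str) -> tuple[str, list[str]]:
    """Replace numeric and quoted values in a raw log line with <*>
    placeholders, returning the static template and extracted variables."""
    variables = []

    def capture(match: re.Match) -> str:
        variables.append(match.group(0))
        return "<*>"

    template = re.sub(r'"[^"]*"|\b\d+(?:\.\d+)?\b', capture, line)
    return template, variables

template, variables = templatize('user 4521 fetched "report.pdf" in 83 ms')
print(template)   # → user <*> fetched <*> in <*> ms
print(variables)  # → ['4521', '"report.pdf"', '83']
```

Grouping lines by template turns millions of unique strings into a handful of event types, which is what makes anomaly detection tractable.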

Model version: A specific identifier (often logged as model, model_name or model_id) that tracks which exact version of an LLM was used for a particular inference call or request. Tracking model versions is critical for comparing performance over time and rolling back to older versions if a new model performs poorly.

Perplexity: A key evaluation metric measuring how confidently and accurately a model predicts the next token in a sequence. A lower perplexity score indicates better performance.
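Concretely, perplexity is the exponentiated average negative log-likelihood of the observed tokens, which can be computed directly from per-token log probabilities (the values below are made up for illustration):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(-mean(log p(token))). Lower is better: a model
    that assigns high probability to each observed token scores near 1."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token natural-log probabilities from two runs
print(round(perplexity([-0.1, -0.3, -0.2]), 3))  # → 1.221 (confident)
print(round(perplexity([-2.0, -2.5, -1.8]), 3))  # → 8.166 (uncertain)
```

Intuitively, a perplexity of 8 means the model was, on average, as uncertain as if it were choosing uniformly among 8 tokens at each step.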

Prompt regressions: Instances where a change to a prompt, a model update or a configuration tweak causes the model's performance to degrade, introducing errors or diminishing the quality of output. Often manifest as silent failures — the system returns a response, but the output is irrelevant, less accurate or violates safety guidelines.

Prompt versioning: The systematic tracking of prompt iterations (changes in text, few-shot examples and parameters) linked directly to the specific outputs, latency and costs they produced. Creates an audit trail that allows developers to debug, compare performance and roll back to previous versions.

Rabbit-holing: A failure mode where a model or agent latches onto a bad line of reasoning, tool path or task interpretation and keeps pursuing it past the point where it should have corrected course, asked for clarification or stopped.

Retrieval augmented generation (RAG): A framework that improves LLM accuracy by fetching relevant, external data to ground its answers. Instead of relying on only pre-trained knowledge, RAG retrieves current, domain-specific data, reducing hallucinations and allowing for accurate, cited and up-to-date responses.

RAG monitoring: The practice of capturing, analyzing and auditing the internal steps of a RAG pipeline to ensure it remains accurate, reliable and cost-effective.

Root span: The top-level unit of work that represents an entire user request or high-level operation from start to finish. Acts as the "parent" of a trace, providing the overall context for all sub-operations that happen during that request.

Semantic drift: The gradual loss of meaning, relevance or factual accuracy in a model's output over time or across a long generation. A silent failure where the structure of the log remains valid but the value of the content decays.

Sensitive data scanner: A natively integrated tool that scans inputs and outputs for PII (personally identifiable information) or compliance breaches.

Span: A single unit of work within a trace, such as a direct call to the LLM, a database retrieval or a tool execution.

Summarization: The act of condensing chunks or entire documents into a shorter form that retains only the essential information the LLM needs to know.

Sycophancy: A model's tendency to tailor its responses to match a user's stated or implied beliefs, even when those beliefs are factually incorrect. Instead of prioritizing truth, the model acts as a "yes-man" to maximize perceived helpfulness or user satisfaction.

Throughput: The number of queries a system can handle in a specific time frame.

Token usage: The total number of tokens (input + output) processed, crucial for tracking costs and latency.
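The cost side is simple arithmetic once input and output counts are logged. A sketch, with placeholder per-1K-token prices (check your provider's actual rates, which also differ between input and output tokens):

```python
def call_cost(input_tokens: int, output_tokens: int,
              in_price_per_1k: float, out_price_per_1k: float) -> float:
    """Dollar cost of one call given per-1K-token prices."""
    return (input_tokens / 1000 * in_price_per_1k
            + output_tokens / 1000 * out_price_per_1k)

# Hypothetical prices: $0.003 per 1K input tokens, $0.015 per 1K output tokens
print(f"${call_cost(1200, 400, 0.003, 0.015):.4f}")  # → $0.0096
```

Summing this over all logged calls is the usual first step in attributing spend to features or users.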

Tool failure: Occurs when a model attempts to use an external capability (like an API, search engine or database) but the process breaks. Because LLMs often act as agents that plan and execute multi-step tasks, a single tool failure can cause a "reliability cliff" where the entire request collapses.

Tool-call traces: Detailed records that capture the entire lifecycle of an AI agent's interaction with external tools, APIs or data sources. Provide a step-by-step audit trail — including inputs, outputs, latency and errors — of what the model decided to do, which tools it called and the results it received.

Trace: A log of the entire request path as it moves through the application, encompassing all intermediate steps such as tool calls, data retrieval and LLM reasoning.
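The trace/span relationship above can be sketched as a small tree; the span names and the `llm:` prefix convention here are illustrative, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """One unit of work; children nest sub-operations beneath it."""
    name: str
    duration_ms: float
    children: list["Span"] = field(default_factory=list)

def total_llm_time(span: Span) -> float:
    """Walk a trace and sum the duration of every LLM call span."""
    own = span.duration_ms if span.name.startswith("llm:") else 0.0
    return own + sum(total_llm_time(c) for c in span.children)

# A root span with retrieval and two LLM call spans nested beneath it
trace = Span("root:answer_question", 900, [
    Span("retrieval:vector_store", 120),
    Span("llm:rerank", 200),
    Span("llm:generate", 500),
])
print(total_llm_time(trace))  # → 700.0
```

Queries like "how much of this request was spent inside the model?" fall out naturally once spans are stored as a tree rather than flat lines.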

Trace inspection: The process of manually or automatically reviewing a trace to understand the "why" behind a model's output. While monitoring alerts you to a system failure, trace inspection helps determine exactly where the logic broke down.

Vector store: Databases that hold vector representations of log data to facilitate retrieval-based analysis.

Verbosity: A performance metric that tracks how wordy a model is relative to the information it provides. Important for cost control, latency and debugging.

Waiting-state failure: A failure mode where a model or agent enters a stalled state and waits for user input when it should have continued the task, used an available tool or returned a more complete response.

Wasted call: Also referred to as a "failed tool call," a logged attempt by the LLM to use an external tool (e.g., API, database) that results in an error, incorrect function execution or invalid output. Common causes include missing parameters, malformed JSON, incorrect arguments, rate limits or hallucinations where the model invokes non-existent tools.
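A sketch of scanning logged tool calls for the failure modes listed above; the row fields (`tool`, `status`, `error`) and the set of known tools are assumptions for illustration:

```python
def find_wasted_calls(log_rows: list[dict]) -> list[dict]:
    """Flag tool-call rows that errored or invoked a tool that doesn't exist."""
    known_tools = {"search", "calculator", "db_query"}  # hypothetical registry
    return [row for row in log_rows
            if row.get("status") == "error" or row.get("tool") not in known_tools]

rows = [
    {"tool": "search", "status": "ok"},
    {"tool": "search", "status": "error", "error": "rate_limit"},
    {"tool": "get_weather", "status": "ok"},  # hallucinated tool name
]
print(len(find_wasted_calls(rows)))  # → 2
```

Counting wasted calls per tool or per prompt version is a quick way to spot which schemas the model struggles with.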
