Error Analysis for AI Systems
Generic, pre-built metrics rarely work as intended: each use case has its own goals and failure modes. We must therefore look at the actual data to understand what is really happening. By identifying and categorizing common errors, we can address them directly and dramatically improve our systems. Rather than starting from abstract notions like hallucination or truthfulness, we are better served by examining real examples and building a custom categorization of errors.
General Process (Image Classification to LLMs)
This applies across classification, generation, and dialogue tasks.
Step 1: Curate Realistic Inputs
- Collect 30–100 representative user queries.
- Define key variables:
- Intent
- Persona
- Scenario
- Use LLMs to synthesize diverse combinations of these dimensions.
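The cross-product of these dimensions can be sketched in a few lines. The dimension values below are hypothetical placeholders, not prescribed categories; each resulting spec would be handed to an LLM to write a realistic query.

```python
from itertools import product

# Hypothetical dimension values -- replace with ones drawn from your real users.
intents = ["refund request", "product question", "bug report"]
personas = ["new user", "power user"]
scenarios = ["mobile app", "email support"]

def build_query_specs(intents, personas, scenarios):
    """Cross the dimensions to get diverse query specifications.

    Each spec can be turned into an LLM prompt such as:
    "Write a realistic user message from a {persona} with a {intent}
    in the context of {scenario}."
    """
    return [
        {"intent": i, "persona": p, "scenario": s}
        for i, p, s in product(intents, personas, scenarios)
    ]

specs = build_query_specs(intents, personas, scenarios)
print(len(specs))  # 3 * 2 * 2 = 12 combinations
```

Even this small grid yields enough variety to seed the 30–100 queries recommended above.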
Step 2: Initial Prompt Design
- Use tools like AI Playground to test and iterate on prompts, observing the quality and variability of outputs.
Step 3: Generate Outputs (Traces)
- Run inputs through the LLM to collect full responses and internal reasoning steps.
- Each trace includes:
- Input
- LLM output
- Intermediate reasoning or tool use
- Aim for 100+ traces for a meaningful sample size.
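A trace can be as simple as a small record type; this is a minimal sketch (the field names are illustrative, not a fixed schema):

```python
from dataclasses import dataclass, field

@dataclass
class Trace:
    """One end-to-end record of the system handling a single input."""
    input: str                                  # the user query
    output: str                                 # the final LLM response
    steps: list = field(default_factory=list)   # intermediate reasoning or tool use

trace = Trace(
    input="How do I reset my password?",
    output="Click 'Forgot password' on the login page.",
    steps=["retrieved: account_help.md", "reasoned: user needs self-service flow"],
)
```

Keeping traces in a structured form like this makes the later annotation and counting steps trivial.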
Step 4: Manual Review Using a Custom Annotator (Important step!)
- Build an annotation interface with fields for:
- Input
- Model response
- Behavior notes
- Binary and categorical labels
- Add features like:
- One-click buttons for correctness
- Open-ended feedback fields
- Filtering and sorting by error types
- Keyboard shortcuts for speed
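The annotation record behind such an interface can be sketched in plain Python; the shortcut keys and field names below are assumptions for illustration:

```python
# One-key shortcuts mapped to a binary correctness label.
SHORTCUTS = {"c": ("correct", True), "x": ("correct", False)}

def annotate(trace, key, note="", categories=()):
    """Attach labels to a trace from a one-key shortcut plus free-form notes."""
    label_name, label_value = SHORTCUTS[key]
    return {
        "input": trace["input"],
        "response": trace["output"],
        label_name: label_value,          # binary label
        "categories": list(categories),   # categorical error labels
        "notes": note,                    # open-ended feedback
    }

ann = annotate(
    {"input": "q", "output": "a"},
    "x",
    note="cited the wrong document",
    categories=["retrieval_error"],
)
```

Filtering and sorting then reduce to ordinary list operations over these records.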
Step 5: Add Notes on Each Trace
- For every example, document:
- What went wrong?
- Was the input flawed?
- What could the system have done differently?
Step 6: Categorize Mistakes
- Use LLMs to help summarize and group failure patterns.
- Deep dive on recurring errors:
- Identify recurring error categories
- Count frequency by type
- Manually refine labels if necessary
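Counting frequency by type is a one-liner once labels are collected; the label names here are hypothetical:

```python
from collections import Counter

# Hypothetical per-trace error labels produced during manual review.
labels = [
    ["wrong_tone"],
    ["missing_citation", "wrong_tone"],
    ["wrong_tone"],
    [],  # a trace with no observed errors
    ["missing_citation"],
]

counts = Counter(tag for trace_labels in labels for tag in trace_labels)
for category, n in counts.most_common():
    print(category, n)
```

Sorting by frequency immediately surfaces which failure modes deserve the deep dive.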
Step 7: Refine Labeling
- Consolidate into a set of binary labels.
- These correspond to specific, repeatable model failure modes.
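Consolidation can be expressed as mapping each trace's free-form categories onto a fixed set of binary failure-mode labels; the mode names are placeholders:

```python
# Fixed set of failure modes distilled from the open-ended categories.
FAILURE_MODES = ["wrong_tone", "missing_citation", "hallucinated_fact"]

def to_binary_labels(trace_categories, failure_modes=FAILURE_MODES):
    """Map a trace's observed error categories onto the fixed label set."""
    observed = set(trace_categories)
    return {mode: (mode in observed) for mode in failure_modes}

print(to_binary_labels(["wrong_tone"]))
```

Binary labels make pass/fail rates per failure mode directly comparable across runs.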
Step 8: Iterate and Repeat
- Repeat the entire process each time the prompt or model changes.
- This can be time-intensive.
- Once failure modes stabilize, LLM-as-judge can automate evaluation on fixed inputs.
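The LLM-as-judge automation can be sketched as one judge call per failure mode. Everything here is an assumption: `call_llm` is a stub standing in for a real provider API, and the prompt wording is illustrative.

```python
# Sketch of automating evaluation with an LLM-as-judge once failure modes
# are stable. `call_llm` is a placeholder, not a real client.
JUDGE_PROMPT = (
    "Given the user input and the model response, answer PASS or FAIL for "
    "the failure mode '{mode}'.\n\nInput: {input}\nResponse: {output}"
)

def call_llm(prompt):
    # Placeholder: swap in a real API client here.
    return "PASS"

def judge(trace, failure_modes):
    """Run the judge once per failure mode and collect binary verdicts."""
    return {
        mode: call_llm(JUDGE_PROMPT.format(mode=mode, **trace)) == "PASS"
        for mode in failure_modes
    }

verdicts = judge({"input": "q", "output": "a"}, ["wrong_tone"])
```

Because the inputs are fixed, the judge's verdicts are directly comparable across prompt or model revisions.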
After identifying the common errors, we can take the following actions, depending on the problem space.
Prompt Fix Techniques
| Technique | Purpose |
| --- | --- |
| Revise instructions | Improve tone, remove redundancy, fill gaps |
| Add few-shot examples | Teach style, voice, or desired behavior |
| Add formatting constraints | Ensure consistent output structure |
| Re-run fixed input set | Compare performance after changes |
| Automate with LLM-as-judge | Reduce manual effort once categories stabilize |
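Re-running the fixed input set and comparing reduces to a pass-rate calculation; the before/after results below are made-up illustrative data:

```python
# Compare pass rates on a fixed input set before and after a prompt change.
def pass_rate(results):
    """results: list of bools, one per trace, True = no failure observed."""
    return sum(results) / len(results)

before = [True, False, True, False, True]   # hypothetical pre-fix run
after = [True, True, True, False, True]     # hypothetical post-fix run
print(f"before={pass_rate(before):.0%} after={pass_rate(after):.0%}")
```

Holding the input set constant is what makes this comparison meaningful.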
Bootstrapping with Synthetic Data
- Synthetic data is effective, even when no real users exist yet.
- Key principle: ground synthetic examples in real system constraints.
- Binary classification setups are especially quick to iterate on in early stages.
References and Further Reading
- Field Guide to Evaluating LLMs – Hamel
- YouTube-Blog Error Analysis Prompt
- Braintrust Eval Playgrounds