Error Analysis for AI Systems
Generic, pre-built metrics rarely work as intended: each use case has its own goals and failure modes. We must therefore look at the actual data to understand what is really happening. By identifying and categorizing common errors, we can address them directly and dramatically improve our systems. Rather than starting from abstract notions like hallucination or truthfulness, we are better served by examining real examples and building a custom categorization of errors.
General Process (Image Classification to LLMs)
This applies across classification, generation, and dialogue tasks.
Step 1: Curate Realistic Inputs
- Collect 30–100 representative user queries.
- Define key variables:
- Intent
- Persona
- Scenario
- Use LLMs to synthesize diverse combinations of these dimensions.
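The cross-product of these dimensions can be sketched in a few lines. The dimension values below are hypothetical placeholders, not prescribed categories; each resulting spec would be handed to an LLM to write a realistic query.

```python
from itertools import product

# Hypothetical dimension values -- replace with ones drawn from your real users.
intents = ["refund request", "product question", "bug report"]
personas = ["new user", "power user"]
scenarios = ["mobile app", "email support"]

def build_query_specs(intents, personas, scenarios):
    """Cross the dimensions to get diverse query specifications.

    Each spec can be turned into an LLM prompt such as:
    "Write a realistic user message from a {persona} with a {intent}
    in the context of {scenario}."
    """
    return [
        {"intent": i, "persona": p, "scenario": s}
        for i, p, s in product(intents, personas, scenarios)
    ]

specs = build_query_specs(intents, personas, scenarios)
print(len(specs))  # 3 * 2 * 2 = 12 combinations
```

Even this small grid yields enough variety to seed the 30–100 queries recommended above.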
Step 2: Initial Prompt Design
- Use tools like AI Playground to test and iterate on prompts, observing the quality and variability of outputs.
Step 3: Generate Outputs (Traces)
- Run inputs through the LLM to collect full responses and internal reasoning steps.
- Each trace includes:
- Input
- LLM output
- Intermediate reasoning or tool use
- Aim for 100+ traces for a meaningful sample size.
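A trace can be as simple as a small record type; this is a minimal sketch (the field names are illustrative, not a fixed schema):

```python
from dataclasses import dataclass, field

@dataclass
class Trace:
    """One end-to-end record of the system handling a single input."""
    input: str                                  # the user query
    output: str                                 # the final LLM response
    steps: list = field(default_factory=list)   # intermediate reasoning or tool use

trace = Trace(
    input="How do I reset my password?",
    output="Click 'Forgot password' on the login page.",
    steps=["retrieved: account_help.md", "reasoned: user needs self-service flow"],
)
```

Keeping traces in a structured form like this makes the later annotation and counting steps trivial.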
Step 4: Manual Review Using a Custom Annotator (Important step!)
- Build an annotation interface with fields for:
- Input
- Model response
- Behavior notes
- Binary and categorical labels
- Add features like:
- One-click buttons for correctness
- Open-ended feedback fields
- Filtering and sorting by error types
- Keyboard shortcuts for speed
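The annotation record behind such an interface can be sketched in plain Python; the shortcut keys and field names below are assumptions for illustration:

```python
# One-key shortcuts mapped to a binary correctness label.
SHORTCUTS = {"c": ("correct", True), "x": ("correct", False)}

def annotate(trace, key, note="", categories=()):
    """Attach labels to a trace from a one-key shortcut plus free-form notes."""
    label_name, label_value = SHORTCUTS[key]
    return {
        "input": trace["input"],
        "response": trace["output"],
        label_name: label_value,          # binary label
        "categories": list(categories),   # categorical error labels
        "notes": note,                    # open-ended feedback
    }

ann = annotate(
    {"input": "q", "output": "a"},
    "x",
    note="cited the wrong document",
    categories=["retrieval_error"],
)
```

Filtering and sorting then reduce to ordinary list operations over these records.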
Step 5: Add Notes on Each Trace
- For every example, document:
- What went wrong?
- Was the input flawed?
- What could the system have done differently?
Step 6: Categorize Mistakes
- Use LLMs to help summarize and group failure patterns.
- Deep dive on recurring errors:
- Identify recurring error categories
- Count frequency by type
- Manually refine labels if necessary
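Counting frequency by type is a one-liner once labels are collected; the label names here are hypothetical:

```python
from collections import Counter

# Hypothetical per-trace error labels produced during manual review.
labels = [
    ["wrong_tone"],
    ["missing_citation", "wrong_tone"],
    ["wrong_tone"],
    [],  # a trace with no observed errors
    ["missing_citation"],
]

counts = Counter(tag for trace_labels in labels for tag in trace_labels)
for category, n in counts.most_common():
    print(category, n)
```

Sorting by frequency immediately surfaces which failure modes deserve the deep dive.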
Step 7: Refine Labeling
- Consolidate into a set of binary labels.
- These correspond to specific, repeatable model failure modes.
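Consolidation can be expressed as mapping each trace's free-form categories onto a fixed set of binary failure-mode labels; the mode names are placeholders:

```python
# Fixed set of failure modes distilled from the open-ended categories.
FAILURE_MODES = ["wrong_tone", "missing_citation", "hallucinated_fact"]

def to_binary_labels(trace_categories, failure_modes=FAILURE_MODES):
    """Map a trace's observed error categories onto the fixed label set."""
    observed = set(trace_categories)
    return {mode: (mode in observed) for mode in failure_modes}

print(to_binary_labels(["wrong_tone"]))
```

Binary labels make pass/fail rates per failure mode directly comparable across runs.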
Step 8: Iterate and Repeat
- Repeat the entire process each time the prompt or model changes.
- This can be time-intensive.
- Once failure modes stabilize, LLM-as-judge can automate evaluation on fixed inputs.
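The LLM-as-judge automation can be sketched as one judge call per failure mode. Everything here is an assumption: `call_llm` is a stub standing in for a real provider API, and the prompt wording is illustrative.

```python
# Sketch of automating evaluation with an LLM-as-judge once failure modes
# are stable. `call_llm` is a placeholder, not a real client.
JUDGE_PROMPT = (
    "Given the user input and the model response, answer PASS or FAIL for "
    "the failure mode '{mode}'.\n\nInput: {input}\nResponse: {output}"
)

def call_llm(prompt):
    # Placeholder: swap in a real API client here.
    return "PASS"

def judge(trace, failure_modes):
    """Run the judge once per failure mode and collect binary verdicts."""
    return {
        mode: call_llm(JUDGE_PROMPT.format(mode=mode, **trace)) == "PASS"
        for mode in failure_modes
    }

verdicts = judge({"input": "q", "output": "a"}, ["wrong_tone"])
```

Because the inputs are fixed, the judge's verdicts are directly comparable across prompt or model revisions.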
After identifying the common errors, we can take the following actions, depending on the problem space.
Prompt Fix Techniques
| Technique | Purpose |
| --- | --- |
| Revise instructions | Improve tone, remove redundancy, fill gaps |
| Add few-shot examples | Teach style, voice, or desired behavior |
| Add formatting constraints | Ensure consistent output structure |
| Re-run fixed input set | Compare performance after changes |
| Automate with LLM-as-judge | Reduce manual effort once categories stabilize |
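Re-running the fixed input set and comparing reduces to a pass-rate calculation; the before/after results below are made-up illustrative data:

```python
# Compare pass rates on a fixed input set before and after a prompt change.
def pass_rate(results):
    """results: list of bools, one per trace, True = no failure observed."""
    return sum(results) / len(results)

before = [True, False, True, False, True]   # hypothetical pre-fix run
after = [True, True, True, False, True]     # hypothetical post-fix run
print(f"before={pass_rate(before):.0%} after={pass_rate(after):.0%}")
```

Holding the input set constant is what makes this comparison meaningful.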
Bootstrapping with Synthetic Data
- Synthetic data is effective, even when no real users exist yet.
- Key principle: ground synthetic examples in real system constraints.
- Binary classification setups are especially quick to iterate on in early stages.
References and Further Reading
- Field Guide to Evaluating LLMs – Hamel
- YouTube-Blog Error Analysis Prompt
- Braintrust Eval Playgrounds