Muralidhar Kashipathi's blog

Error Analysis for AI Systems

Generic metrics rarely work as intended, because each use case has different goals and failure modes. We therefore need to look at the actual data to understand what is really happening. By identifying and categorizing common errors, we can address them directly and substantially improve our systems. Instead of starting from generic measures such as hallucination or truthfulness, we are better served by looking at real examples and building a custom categorization of errors.

General Process (Image Classification to LLMs)

This applies across classification, generation, and dialogue tasks.

Step 1: Curate Realistic Inputs

Step 2: Initial Prompt Design

Step 3: Generate Outputs (Traces)

Step 4: Manual Review Using a Custom Annotator (Important step!)

Step 5: Add Notes on Each Trace

Step 6: Categorize Mistakes

Step 7: Refine Labeling

Step 8: Iterate and Repeat
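Steps 4 to 6 can be sketched roughly as follows. This is only a minimal illustration, not a specific annotation tool: the `Trace` class, the `summarize_errors` helper, the example traces, and the category names are all hypothetical. The idea is that each reviewed trace carries a free-form note and an optional error category, and tallying the categories shows which failure modes to fix first.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trace:
    """One input/output pair plus the reviewer's annotations (hypothetical schema)."""
    input_text: str
    output_text: str
    note: str = ""                   # free-form observation from manual review (Step 5)
    category: Optional[str] = None   # error label from Step 6; None means no error found


def summarize_errors(traces):
    """Return (category, count, share_of_all_traces) tuples, most frequent first."""
    counts = Counter(t.category for t in traces if t.category is not None)
    total = len(traces)
    return [(cat, n, n / total) for cat, n in counts.most_common()]


# Made-up traces, for illustration only.
traces = [
    Trace("Reschedule my 3pm meeting", "Meeting cancelled.", "cancelled instead of moving", "wrong_action"),
    Trace("Summarize this email", "Sure! Here is a summary: ...", "too chatty", "tone"),
    Trace("Book a table for two", "Booked for 2 at 7pm."),
    Trace("Cancel my order", "Order rescheduled.", "did the wrong thing", "wrong_action"),
]

for category, count, share in summarize_errors(traces):
    print(f"{category}: {count} ({share:.0%})")
```

Keeping the note separate from the category leaves room for Step 7: the labels can be renamed or merged later without losing the original observations.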

After identifying the common errors, we can take the following actions, depending on the problem space.


Prompt Fix Techniques

| Technique | Purpose |
| --- | --- |
| Revise instructions | Improve tone, remove redundancy, fill gaps |
| Add few-shot examples | Teach style, voice, or desired behavior |
| Add formatting constraints | Ensure consistent output structure |
| Re-run fixed input set | Compare performance after changes |
| Automate with LLM-as-judge | Reduce manual effort once categories stabilize |
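As a rough sketch of what re-running the fixed input set might look like, the snippet below compares per-category error rates before and after a prompt change. The function names and labels are hypothetical. It assumes each run has already been labeled, by hand or by an LLM judge, with one error category per input, or `None` when no error was found.

```python
from collections import Counter


def error_rates(labels):
    """Map each error category to its share of all traces; None means 'no error'."""
    total = len(labels)
    counts = Counter(label for label in labels if label is not None)
    return {cat: n / total for cat, n in counts.items()}


def compare_runs(before, after):
    """Return {category: (rate_before, rate_after)} for every category seen in either run."""
    if len(before) != len(after):
        raise ValueError("both runs must cover the same fixed input set")
    rates_before, rates_after = error_rates(before), error_rates(after)
    categories = sorted(set(rates_before) | set(rates_after))
    return {cat: (rates_before.get(cat, 0.0), rates_after.get(cat, 0.0)) for cat in categories}


# Hypothetical labels for the same five inputs, before and after a prompt revision.
before = ["tone", "wrong_format", None, "tone", "wrong_format"]
after = [None, "wrong_format", None, "tone", None]

for category, (old, new) in compare_runs(before, after).items():
    print(f"{category}: {old:.0%} -> {new:.0%}")
```

Holding the input set fixed is what makes the two columns comparable: any change in a category's rate can then be attributed to the prompt revision rather than to a different mix of inputs.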

Bootstrapping with Synthetic Data


References and Further Reading

#Evaluation #LLM #til