You've built a prompt that generates documentation. Now you need to know: Is it actually good? And more importantly: How do you measure "good" systematically?
This exercise teaches you the fundamentals of LLM evaluation using LangSmith's TypeScript SDK in a Jupyter notebook.
Three levels of evaluation rigor:

- Level 0: Manual smoke tests - reading outputs yourself (essential, but doesn't scale)
- Level 1: Basic metrics - simple, automated checks you can code yourself
- Level 2: LLM-as-judge - a strong model scores outputs against a rubric
If you haven't already, create a notebook named `monday-docs-evaluation.ipynb` using the Deno kernel and add the setup cell shown below.
Run the cell (Shift+Enter). Deno will auto-install dependencies.
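A minimal setup cell might look like the sketch below. It assumes your LangSmith and OpenAI keys are exported as the `LANGSMITH_API_KEY` and `OPENAI_API_KEY` environment variables; adjust the names if your setup differs.

```typescript
// Setup: import the SDKs (Deno fetches the npm packages on first run)
// and warn early if an API key is missing.
import { Client } from "npm:langsmith";
import OpenAI from "npm:openai";

for (const key of ["LANGSMITH_API_KEY", "OPENAI_API_KEY"]) {
  if (!Deno.env.get(key)) console.warn(`Missing environment variable: ${key}`);
}

const client = new Client(); // LangSmith client, reads LANGSMITH_API_KEY
const openai = new OpenAI(); // OpenAI client, reads OPENAI_API_KEY

console.log("Setup complete");
```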
In the previous exercise, you:

- Built a prompt that generates Monday.com documentation
- Created a dataset of examples in LangSmith
- Judged the outputs by reading them yourself
This is Evaluation Level 0 - essential but doesn't scale.
Now let's make it systematic...
Simple, automated checks you can code yourself:
Step 1: Define Your Metrics
For Monday.com documentation, what should we check?
| Metric | Good For | Example 1 | Example 2 | Example 3 |
|---|---|---|---|---|
| Contains key terms | Critical info is present | ✅ Function names | ✅ Parent class refs | ❌ |
| Length bounds | Prevent too short/long | ✅ 200-800 tokens | ✅ 300-1000 tokens | ❌ |
| Citation present | Ensure attribution | ✅ Style guide mention | ✅ Parent doc link | ❌ |
| Schema validity | JSON structure | ❌ | ❌ | ✅ Valid JSON |
Step 2: Write Evaluator Functions
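Here's a sketch of the first check using the LangSmith SDK's custom-evaluator signature: the function receives the run and its reference example and returns a keyed score. It assumes each dataset example stores its expected terms in a `keyTerms` array on the reference outputs; rename that field to match your dataset.

```typescript
import type { Run, Example } from "npm:langsmith/schemas";
import type { EvaluationResult } from "npm:langsmith/evaluation";

// Score = fraction of expected key terms that appear in the generated docs.
function containsKeyTerms(run: Run, example?: Example): EvaluationResult {
  const output = String(run.outputs?.output ?? "");
  const keyTerms: string[] = example?.outputs?.keyTerms ?? [];
  const found = keyTerms.filter((term) =>
    output.toLowerCase().includes(term.toLowerCase()),
  );
  return {
    key: "contains_key_terms",
    score: keyTerms.length === 0 ? 1 : found.length / keyTerms.length,
    comment: `Found ${found.length}/${keyTerms.length} expected terms`,
  };
}
```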
Step 3: Add More Evaluators
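Two more heuristic evaluators, sketched with some assumptions: word count stands in for tokens, and a "citation" is anything that mentions the style guide or the developer.monday.com docs. Tune the bounds and the regex to your own examples.

```typescript
import type { Run } from "npm:langsmith/schemas";
import type { EvaluationResult } from "npm:langsmith/evaluation";

// Length check: pass if the output falls inside a rough word-count window
// (a cheap proxy for the token bounds in the table above).
function lengthBounds(run: Run): EvaluationResult {
  const output = String(run.outputs?.output ?? "");
  const words = output.split(/\s+/).filter(Boolean).length;
  const inRange = words >= 200 && words <= 800;
  return { key: "length_bounds", score: inRange ? 1 : 0, comment: `${words} words` };
}

// Citation check: look for a style-guide mention or a docs link.
function citationPresent(run: Run): EvaluationResult {
  const output = String(run.outputs?.output ?? "");
  const hasCitation = /style guide|developer\.monday\.com/i.test(output);
  return { key: "citation_present", score: hasCitation ? 1 : 0 };
}
```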
Step 4: Prepare Your Dataset
You already have the dataset in LangSmith from the context engineering exercise. Let's reference it:
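A quick sanity check, assuming the dataset was saved as `monday-docs-dataset` (swap in whatever name you used) and that the setup cell's `client` is in scope:

```typescript
// Confirm the dataset exists and count its examples before evaluating.
const datasetName = "monday-docs-dataset"; // adjust to your dataset's name

const dataset = await client.readDataset({ datasetName });

let exampleCount = 0;
for await (const _example of client.listExamples({ datasetName })) {
  exampleCount++;
}
console.log(`${dataset.name}: ${exampleCount} examples`);
```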
Step 5: Define Your Application Function
This wraps your prompt so it can be evaluated:
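Here's a sketch of that wrapper. The `code` input field, the `SYSTEM_PROMPT` placeholder, and the model name are stand-ins for whatever you used in the context engineering exercise; the only hard requirement is that the function takes an example's inputs and returns an object the evaluators can read (here, `{ output }`).

```typescript
// Target function: takes a dataset example's inputs, returns the
// generated documentation in the `output` field the evaluators expect.
const SYSTEM_PROMPT = "..."; // paste your documentation prompt from the previous exercise

async function generateDocs(inputs: { code: string }): Promise<{ output: string }> {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini", // placeholder; use the model your prompt was built for
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
      { role: "user", content: inputs.code },
    ],
  });
  return { output: completion.choices[0].message.content ?? "" };
}
```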
Step 6: Run Evaluation
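With the target function and evaluators defined in the cells above, the run itself is a single `evaluate` call. The experiment prefix is just a label so you can find this run in the LangSmith UI.

```typescript
import { evaluate } from "npm:langsmith/evaluation";

// Runs generateDocs on every example in the dataset and applies each evaluator.
await evaluate(generateDocs, {
  data: datasetName,
  evaluators: [containsKeyTerms, lengthBounds, citationPresent],
  experimentPrefix: "monday-docs-basic-metrics",
  maxConcurrency: 2, // keep the run gentle on rate limits
});

console.log("Experiment finished; open LangSmith to inspect per-example scores.");
```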
Step 7: Analyze Results
Use a powerful LLM (like GPT-4) to evaluate your outputs on subjective criteria:
Why this matters: Basic metrics can't judge "Is this explanation clear?" Only another LLM (or human) can.
Step 1: Define Your Rubric
For Monday.com documentation, what matters?
| Criterion | Weight | What to Check | 1-5 Scale |
|---|---|---|---|
| Technical Accuracy | 30% | Code descriptions are factually correct | 1=wrong, 3=mostly right, 5=perfect |
| Clarity | 25% | Understandable to a new engineer | 1=confusing, 3=clear, 5=crystal clear |
| Style Match | 20% | Follows Monday.com doc style | 1=ignores, 3=partial, 5=perfect match |
| Completeness | 15% | Covers key functions without bloat | 1=missing major stuff, 5=just right |
| Citation Quality | 10% | References style guide appropriately | 1=no citations, 5=proper citations |
Step 2: Write the Judge Prompt
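Below is a sketch that turns the rubric into a judge prompt and a judge evaluator. It reuses the `openai` client from the setup cell, asks for JSON output, and collapses the five criteria into one weighted 0-1 score; the wording, the `gpt-4o` model choice, and the `code` input field are assumptions you can adjust.

```typescript
import type { Run, Example } from "npm:langsmith/schemas";
import type { EvaluationResult } from "npm:langsmith/evaluation";

const JUDGE_PROMPT = `You are grading documentation generated for Monday.com engineers.
Score each criterion from 1 (worst) to 5 (best):
- technical_accuracy: code descriptions are factually correct
- clarity: understandable to a new engineer
- style_match: follows the Monday.com documentation style
- completeness: covers the key functions without bloat
- citation_quality: references the style guide appropriately
Respond with JSON only:
{"technical_accuracy": n, "clarity": n, "style_match": n, "completeness": n, "citation_quality": n, "reasoning": "one short paragraph"}`;

// LLM-as-judge evaluator: sends the source code and generated docs to a
// stronger model and converts the rubric scores into a single 0-1 score.
async function llmJudge(run: Run, example?: Example): Promise<EvaluationResult> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o", // use a strong judge model
    response_format: { type: "json_object" },
    messages: [
      { role: "system", content: JUDGE_PROMPT },
      {
        role: "user",
        content: `Source code:\n${example?.inputs?.code ?? ""}\n\nGenerated documentation:\n${run.outputs?.output ?? ""}`,
      },
    ],
  });

  const s = JSON.parse(response.choices[0].message.content ?? "{}");
  // Weighted average using the rubric weights, rescaled from 1-5 to 0-1.
  const weighted =
    0.30 * s.technical_accuracy +
    0.25 * s.clarity +
    0.20 * s.style_match +
    0.15 * s.completeness +
    0.10 * s.citation_quality;

  return { key: "llm_judge", score: (weighted - 1) / 4, comment: s.reasoning };
}
```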
Step 4: Run Complete Evaluation
Now run with ALL evaluators (basic + judge):
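This is the same `evaluate` call as before with the judge appended, assuming the earlier cells (target function, basic evaluators, judge) have run:

```typescript
// Full run: cheap heuristics plus the LLM judge on every example.
await evaluate(generateDocs, {
  data: datasetName,
  evaluators: [containsKeyTerms, lengthBounds, citationPresent, llmJudge],
  experimentPrefix: "monday-docs-with-judge",
  maxConcurrency: 2,
});
```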
Step 5: Analyze Judge Results
Now you have three layers of evaluation:

- Manual smoke tests from the previous exercise
- Basic metrics (key terms, length bounds, citations, schema validity)
- LLM-as-judge scores against the five-criterion rubric
Exercise: Find Patterns
Look through the low-scoring outputs for recurring failure modes, adjust your prompt to address them, then re-run the evaluation to see if you improved!
```
Smoke Test (manual, slow, doesn't scale)
        ↓
Basic Metrics (fast, cheap, catches obvious issues)
        ↓
LLM-as-Judge (expensive, slow, catches nuanced issues)
        ↓
Human Review (gold standard, very expensive)
```
Use the right tool for the job:
✅ DO:

- Run the basic metrics on every prompt change; they're fast and cheap
- Bring in the LLM judge when you need scores for clarity, style, and accuracy
- Save human review for final sign-off; it's the gold standard but very expensive

❌ DON'T:

- Rely on manual smoke tests alone; they don't scale
- Use the expensive LLM judge for problems a simple length or keyword check can catch
- Ship prompt changes without re-running the evaluation