[M]:

The Goal

You've built a prompt that generates documentation. Now you need to know: Is it actually good? And more importantly: How do you measure "good" systematically?

This exercise teaches you the fundamentals of LLM evaluation using LangSmith's TypeScript SDK in a Jupyter notebook.


What You'll Learn

Three levels of evaluation rigor:

  1. Smoke Test - Quick manual sanity checks (you already did this)
  2. Basic Metrics - Automated reference-based scoring with custom code
  3. LLM-as-Judge - Using another AI to evaluate your AI

Setup: TypeScript Jupyter Notebook (5 mins)

Step 1: Install Deno Jupyter Kernel

If you haven't already:

[ ]:
bash
# Install Deno (if not installed)
brew install deno

# Install Jupyter kernel
deno jupyter --install

# Start Jupyter
jupyter lab

[M]:

Step 2: Create New Notebook

  1. In Jupyter, click "New" → "Deno"
  2. Save as monday-docs-evaluation.ipynb
  3. You're ready to write TypeScript!

Step 3: Install LangSmith SDK

[ ]:
typescript
// Import LangSmith SDK
import { evaluate } from "npm:langsmith/evaluation";
import { Client } from "npm:langsmith";

// Verify it works
console.log("LangSmith SDK loaded!");

[M]:

Run the cell (Shift+Enter). Deno will auto-install dependencies.

[M]:
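
Before going further, make sure the SDK can authenticate. A quick sanity check, assuming you use the LANGSMITH_API_KEY environment variable (set it before launching Jupyter if this fails):

[ ]:
typescript
// Verify the LangSmith API key is visible to the notebook.
// Assumption: credentials are provided via LANGSMITH_API_KEY.
const langsmithKey = Deno.env.get("LANGSMITH_API_KEY");
console.log(langsmithKey ? "✅ LANGSMITH_API_KEY is set" : "❌ LANGSMITH_API_KEY is missing");

[M]: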

Phase 1: You Already Did This (Smoke Test)

In the previous exercise, you:

  • ✅ Ran your prompt on examples
  • ✅ Eyeballed the outputs
  • ✅ Compared to golden outputs
  • ✅ Rated with 🔥/👍/🤷/💩

This is Evaluation Level 0 - essential but doesn't scale.

Now let's make it systematic...


Phase 2: Basic Metrics (20 mins)

What Are Basic Metrics?

Simple, automated checks you can code yourself:

  • Contains Check: Does output include key phrases?
  • Length Check: Is output within expected token range?
  • Schema Validation: For structured outputs, is format correct?
  • Citation Check: Does it reference the expected sources?

Exercise: Write Custom Evaluators

Step 1: Define Your Metrics

For Monday.com documentation, what should we check?

| Metric | Good For | Example 1 | Example 2 | Example 3 |
|---|---|---|---|---|
| Contains key terms | Critical info is present | ✅ Function names | ✅ Parent class refs | |
| Length bounds | Prevent too short/long | ✅ 200-800 tokens | ✅ 300-1000 tokens | |
| Citation present | Ensure attribution | ✅ Style guide mention | ✅ Parent doc link | |
| Schema validity | JSON structure | ✅ Valid JSON | | |

Step 2: Write Evaluator Functions

[ ]:
typescript
import type { EvaluationResult } from "npm:langsmith/evaluation";

// Evaluator 1: Check if output contains key technical terms
const containsKeyTerms = async ({
  outputs,
  referenceOutputs
}: {
  outputs: Record<string, any>;
  referenceOutputs?: Record<string, any>;
}): Promise<EvaluationResult> => {
  const output = outputs.text?.toLowerCase() || "";
  const required = referenceOutputs?.required_terms || [];
  const found = required.filter((term: string) =>
    output.includes(term.toLowerCase())
  );
  const score = required.length > 0 ? found.length / required.length : 1.0;
  return {
    key: "contains_key_terms",
    score: score,
    comment: `Found ${found.length}/${required.length} terms: ${found.join(", ")}`
  };
};

console.log("✅ containsKeyTerms evaluator defined");

[M]:

Step 3: Add More Evaluators

[ ]:
typescript
// Evaluator 2: Check output length is reasonable
const lengthInRange = async ({
  outputs
}: {
  outputs: Record<string, any>;
}): Promise<EvaluationResult> => {
  const text = outputs.text || "";
  const tokenEstimate = text.split(/\s+/).length; // Rough word-count estimate used as a token proxy
  const min = 200;
  const max = 800;
  const inRange = tokenEstimate >= min && tokenEstimate <= max;
  return {
    key: "length_in_range",
    score: inRange ? 1.0 : 0.0,
    comment: `${tokenEstimate} tokens (target: ${min}-${max})`
  };
};

// Evaluator 3: Check if the style guide is cited
const hasCitation = async ({
  outputs
}: {
  outputs: Record<string, any>;
}): Promise<EvaluationResult> => {
  const text = outputs.text || "";
  const citations = ["style guide", "documentation standard", "Monday.com style"];
  const found = citations.some(cite =>
    text.toLowerCase().includes(cite.toLowerCase())
  );
  return {
    key: "has_citation",
    score: found ? 1.0 : 0.0,
    comment: found ? "Citation found" : "No style guide citation found"
  };
};

console.log("✅ lengthInRange and hasCitation evaluators defined");
[M]:
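
The metrics table in Step 1 also lists a schema-validity check. It only applies if you ask the prompt for structured JSON output, which this exercise does not require; here is a hedged sketch of what such a check could look like, should you switch to structured output:

[ ]:
typescript
import type { EvaluationResult } from "npm:langsmith/evaluation";

// Evaluator 4 (optional): does the output parse as JSON?
// Only meaningful when the prompt is instructed to return structured JSON.
const schemaValid = async ({
  outputs
}: {
  outputs: Record<string, any>;
}): Promise<EvaluationResult> => {
  const text = outputs.text || "";
  try {
    JSON.parse(text);
    return { key: "schema_valid", score: 1.0, comment: "Output parses as JSON" };
  } catch {
    return { key: "schema_valid", score: 0.0, comment: "Output is not valid JSON" };
  }
};

console.log("✅ schemaValid evaluator defined (optional, not used in the runs below)");

[M]: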

Step 4: Prepare Your Dataset

You already have the dataset in LangSmith from the context engineering exercise. Let's reference it:

[ ]:
typescript
const client = new Client();
const datasetName = "Monday Code Documentation Challenge";

// Verify the dataset exists (listDatasets returns an async iterable, so iterate over it)
let dataset: any = null;
for await (const d of client.listDatasets()) {
  if (d.name === datasetName) {
    dataset = d;
    break;
  }
}

if (dataset) {
  console.log(`✅ Found dataset: ${dataset.name} with ${dataset.example_count} examples`);
} else {
  console.log("❌ Dataset not found. Make sure you completed the context engineering exercise.");
}

[M]:

Step 5: Define Your Application Function

This wraps your prompt so it can be evaluated:

[ ]:
typescript
// This simulates calling your prompt from the previous exercise.
// Replace it with your actual LangSmith prompt call.
const runDocumentationPrompt = async (inputs: Record<string, any>) => {
  // TODO: Replace this with your actual prompt execution
  // For now, we simulate calling the LangSmith API
  const promptName = "your-prompt-name-here"; // From the context engineering exercise
  // This is a placeholder - you'd call your actual prompt here
  return {
    text: "Generated documentation would appear here...",
    metadata: { model: "gpt-4", tokens: 500 }
  };
};

[M]:
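
If you want real outputs instead of the placeholder, one hedged option is to call a model directly with the prompt text you built in the previous exercise. The sketch below assumes an OPENAI_API_KEY environment variable and uses gpt-4o-mini as an example model; paste your own prompt into SYSTEM_PROMPT and pass runDocumentationPromptLive to evaluate() in the next step (or simply overwrite runDocumentationPrompt with it).

[ ]:
typescript
import OpenAI from "npm:openai";

// Assumes OPENAI_API_KEY is set; the model name is an example, not a requirement
const openai = new OpenAI({ apiKey: Deno.env.get("OPENAI_API_KEY") });

// Paste the documentation prompt you built in the context engineering exercise
const SYSTEM_PROMPT = `<your documentation prompt goes here>`;

const runDocumentationPromptLive = async (inputs: Record<string, any>) => {
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
      // Pass the dataset inputs (e.g. the code to document) as the user turn
      { role: "user", content: JSON.stringify(inputs) }
    ]
  });
  return { text: response.choices[0].message.content ?? "" };
};

console.log("✅ runDocumentationPromptLive defined");

[M]: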

Step 6: Run Evaluation

[ ]:
typescript
// Run evaluation with your basic metrics
const results = await evaluate(
  runDocumentationPrompt,
  {
    data: datasetName,
    evaluators: [
      containsKeyTerms,
      lengthInRange,
      hasCitation
    ],
    experimentPrefix: "basic-metrics"
  }
);

console.log("✅ Evaluation complete!");
console.log("Experiment name:", results.experimentName); // open this experiment in the LangSmith UI

[M]:

Step 7: Analyze Results

[ ]:
typescript
// Quick summary of results
const summary = {
  totalExamples: results.results.length,
  averageScores: {
    containsKeyTerms: 0,
    lengthInRange: 0,
    hasCitation: 0
  } as Record<string, number>
};

results.results.forEach(result => {
  summary.averageScores.containsKeyTerms += Number(result.evaluationResults.results.find(
    r => r.key === "contains_key_terms"
  )?.score ?? 0);
  summary.averageScores.lengthInRange += Number(result.evaluationResults.results.find(
    r => r.key === "length_in_range"
  )?.score ?? 0);
  summary.averageScores.hasCitation += Number(result.evaluationResults.results.find(
    r => r.key === "has_citation"
  )?.score ?? 0);
});

// Calculate averages (guard against an empty result set)
Object.keys(summary.averageScores).forEach(key => {
  summary.averageScores[key] /= summary.totalExamples || 1;
});

console.log("📊 Summary:", summary);

[M]:

Phase 3: LLM-as-Judge (25 mins)

What Is LLM-as-Judge?

Use a powerful LLM (like GPT-4) to evaluate your outputs on subjective criteria:

  • Accuracy of technical descriptions
  • Clarity for new engineers
  • Style/tone match
  • Completeness without verbosity

Why this matters: Basic metrics can't judge "Is this explanation clear?" Only another LLM (or human) can.


Exercise: Build Your First Judge

Step 1: Define Your Rubric

For Monday.com documentation, what matters?

| Criterion | Weight | What to Check | 1-5 Scale |
|---|---|---|---|
| Technical Accuracy | 30% | Code descriptions are factually correct | 1 = wrong, 3 = mostly right, 5 = perfect |
| Clarity | 25% | Understandable to a new engineer | 1 = confusing, 3 = clear, 5 = crystal clear |
| Style Match | 20% | Follows Monday.com doc style | 1 = ignores, 3 = partial, 5 = perfect match |
| Completeness | 15% | Covers key functions without bloat | 1 = missing major stuff, 5 = just right |
| Citation Quality | 10% | References style guide appropriately | 1 = no citations, 5 = proper citations |
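
The overall judge score used later in this notebook is simply the weighted average of these five criterion scores. A minimal sketch of that arithmetic with made-up scores (the snake_case keys match the judge output format defined below):

[ ]:
typescript
// Weights from the rubric above; the sample scores are illustrative only
const rubricWeights = {
  technical_accuracy: 0.30,
  clarity: 0.25,
  style_match: 0.20,
  completeness: 0.15,
  citation_quality: 0.10
};

const sampleScores: Record<string, number> = {
  technical_accuracy: 4,
  clarity: 5,
  style_match: 3,
  completeness: 4,
  citation_quality: 2
};

// Weighted average on the 1-5 scale
const overallSample = Object.entries(rubricWeights).reduce(
  (sum, [criterion, weight]) => sum + weight * sampleScores[criterion],
  0
);

console.log(`Weighted overall: ${overallSample.toFixed(2)} / 5`); // 3.85 for these sample scores

[M]: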

Step 2: Write the Judge Prompt

[ ]:
typescript
const JUDGE_PROMPT = `You are an expert technical documentation reviewer at Monday.com.

Evaluate the following documentation against our standards.

## Input Code:
{code}

## Generated Documentation:
{output}

## Style Guide Reference:
{style_guide}

## Reference Output (for comparison only):
{reference}

---

Score the documentation on these criteria (1-5 scale):

1. **Technical Accuracy** (30% weight): Are all code descriptions factually correct?
- 1: Contains errors or misrepresentations
- 3: Mostly accurate with minor issues
- 5: Completely accurate

2. **Clarity** (25% weight): Would a new Monday.com engineer understand this?
- 1: Confusing or unclear
- 3: Clear but could be better
- 5: Exceptionally clear

3. **Style Match** (20% weight): Does it follow Monday.com documentation style?
- 1: Ignores style guide
- 3: Partially follows
- 5: Perfect style match

4. **Completeness** (15% weight): Does it cover the key functions without bloat?
- 1: Missing major functionality
- 3: Covers most of what matters
- 5: Covers exactly what matters, no padding

5. **Citation Quality** (10% weight): Does it reference the style guide appropriately?
- 1: No citations
- 3: Some citations, inconsistently applied
- 5: Proper citations throughout

Respond with ONLY a JSON object in this exact format:

{
  "scores": {
    "technical_accuracy": <1-5>,
    "clarity": <1-5>,
    "style_match": <1-5>,
    "completeness": <1-5>,
    "citation_quality": <1-5>
  },
  "verdict": "<one-sentence overall judgment>",
  "key_issues": ["<issue 1>", "<issue 2>"],
  "strengths": ["<strength 1>", "<strength 2>"]
}`;

console.log("✅ Judge prompt defined");
[M]:
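
Step 3: Turn the Prompt into an Evaluator

The evaluator below feeds the judge prompt to a separate model and converts its JSON reply into a LangSmith EvaluationResult. Treat it as a minimal sketch, not the only way to do this: it assumes an OPENAI_API_KEY environment variable, uses gpt-4o as the judge model (per the "use a different model than the one you're testing" guideline later in this exercise), and assumes your dataset inputs are named code and style_guide as in the context engineering exercise. Adjust the field names and model to match your setup.

[ ]:
typescript
import OpenAI from "npm:openai";
import type { EvaluationResult } from "npm:langsmith/evaluation";

// Judge client - assumes OPENAI_API_KEY is set in the environment
const judgeClient = new OpenAI({ apiKey: Deno.env.get("OPENAI_API_KEY") });

const llmAsJudge = async ({
  inputs,
  outputs,
  referenceOutputs
}: {
  inputs: Record<string, any>;
  outputs: Record<string, any>;
  referenceOutputs?: Record<string, any>;
}): Promise<EvaluationResult> => {
  // Fill the placeholders defined in JUDGE_PROMPT.
  // The field names (code, style_guide) are assumptions - rename them to match your dataset.
  const prompt = JUDGE_PROMPT
    .replace("{code}", inputs.code ?? JSON.stringify(inputs))
    .replace("{output}", outputs.text ?? "")
    .replace("{style_guide}", inputs.style_guide ?? "")
    .replace("{reference}", referenceOutputs?.text ?? JSON.stringify(referenceOutputs ?? {}));

  const response = await judgeClient.chat.completions.create({
    model: "gpt-4o",      // judge with a different model than the one under test
    temperature: 0,        // deterministic judging
    response_format: { type: "json_object" },
    messages: [{ role: "user", content: prompt }]
  });

  // Parse the judge's JSON verdict defensively
  let details: Record<string, any>;
  try {
    details = JSON.parse(response.choices[0].message.content ?? "{}");
  } catch {
    return { key: "llm_judge", score: 0, comment: "Judge returned unparseable output" };
  }

  // Weighted overall score on the 1-5 rubric scale
  const weights: Record<string, number> = {
    technical_accuracy: 0.30,
    clarity: 0.25,
    style_match: 0.20,
    completeness: 0.15,
    citation_quality: 0.10
  };
  const overall = Object.entries(weights).reduce(
    (sum, [criterion, weight]) => sum + weight * (details.scores?.[criterion] ?? 0),
    0
  );

  return {
    key: "llm_judge",
    score: overall,
    // The analysis cells below JSON.parse this comment to read verdict, scores, key_issues, strengths
    comment: JSON.stringify(details)
  };
};

console.log("✅ llmAsJudge evaluator defined");

[M]: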

Step 4: Run Complete Evaluation

Now run with ALL evaluators (basic + judge):

[ ]:
typescript
const fullResults = await evaluate(
  runDocumentationPrompt,
  {
    data: datasetName,
    evaluators: [
      // Basic metrics
      containsKeyTerms,
      lengthInRange,
      hasCitation,
      // LLM judge
      llmAsJudge
    ],
    experimentPrefix: "full-evaluation",
    maxConcurrency: 1 // Run sequentially to avoid rate limits
  }
);

console.log("✅ Full evaluation complete!");
console.log("Experiment name:", fullResults.experimentName); // open this experiment in the LangSmith UI

[M]:

Step 5: Analyze Judge Results

[ ]:
typescript
// Extract judge scores
const judgeResults = fullResults.results.map(result => {
  const judgeEval = result.evaluationResults.results.find(
    r => r.key === "llm_judge"
  );
  if (judgeEval && judgeEval.comment) {
    const details = JSON.parse(judgeEval.comment);
    return {
      example: result.example.inputs,
      score: Number(judgeEval.score ?? 0),
      verdict: details.verdict,
      scores: details.scores,
      issues: details.key_issues,
      strengths: details.strengths
    };
  }
  return null;
}).filter((r): r is NonNullable<typeof r> => r !== null);

// Display results
console.log("\n📊 LLM Judge Results:\n");
judgeResults.forEach((result, i) => {
  console.log(`Example ${i + 1}:`);
  console.log(`  Overall: ${result.score.toFixed(2)}/5.0 - ${result.verdict}`);
  console.log(`  Detailed Scores:`, result.scores);
  console.log(`  Issues:`, result.issues);
  console.log(`  Strengths:`, result.strengths);
  console.log("");
});

// Calculate average scores per criterion
const avgScores: Record<string, number> = {
  technical_accuracy: 0,
  clarity: 0,
  style_match: 0,
  completeness: 0,
  citation_quality: 0
};

judgeResults.forEach(result => {
  Object.keys(avgScores).forEach(key => {
    avgScores[key] += result.scores?.[key] || 0;
  });
});

Object.keys(avgScores).forEach(key => {
  avgScores[key] /= judgeResults.length || 1;
});

console.log("📈 Average criterion scores:", avgScores);
[M]:

Phase 4: Analysis & Iteration (15 mins)

Compare All Evaluation Layers

Now you have three layers of evaluation:

  1. Your smoke test ratings (🔥/👍/🤷/💩)
  2. Basic metrics (contains, length, citations)
  3. LLM judge scores (1-5 on each criterion)

Exercise: Find Patterns

[ ]:
typescript
// Compare basic metrics vs judge scores
const comparison = fullResults.results.map(result => {
  const basicMetrics: Record<string, number> = {};
  result.evaluationResults.results.forEach(r => {
    if (r.key !== "llm_judge") {
      basicMetrics[r.key] = Number(r.score ?? 0);
    }
  });
  const judgeScore = Number(result.evaluationResults.results.find(
    r => r.key === "llm_judge"
  )?.score) || 0;
  return {
    example: result.example.inputs.query || "Unknown",
    basicMetrics,
    judgeScore,
    alignment: judgeScore > 3.5 && Object.values(basicMetrics).every(v => v > 0.7)
  };
});

console.log("\n🔍 Evaluation Alignment Analysis:\n");
console.table(comparison);

// Find discrepancies
const discrepancies = comparison.filter(c => !c.alignment);
console.log(`\n⚠️ Found ${discrepancies.length} cases where the basic metrics and the judge disagree`);
discrepancies.forEach(d => {
  console.log(`  - ${d.example}: judge scored ${d.judgeScore.toFixed(2)}, basic metrics: ${JSON.stringify(d.basicMetrics)}`);
});

[M]:

Key Questions to Ask:

[ ]:
typescript
// Generate insights
console.log("\n💡 Key Insights:\n");

// 1. Which criterion is the weakest?
const weakestCriterion = Object.entries(avgScores)
  .sort(([, a], [, b]) => a - b)[0];
console.log(`1. Weakest area: ${weakestCriterion[0]} (${weakestCriterion[1].toFixed(2)}/5.0)`);

// 2. Are basic metrics predictive of judge scores?
const correlation = comparison.every(c =>
  (c.judgeScore > 3.5 && Object.values(c.basicMetrics).every(v => v > 0.7)) ||
  (c.judgeScore <= 3.5 && Object.values(c.basicMetrics).some(v => v <= 0.7))
);
console.log(`2. Basic metrics predict judge scores: ${correlation ? "YES ✅" : "NO ❌"}`);

// 3. Which examples need the most work?
const needsWork = comparison
  .filter(c => c.judgeScore < 3.5)
  .sort((a, b) => a.judgeScore - b.judgeScore);
console.log(`3. Examples needing improvement: ${needsWork.length}`);
needsWork.forEach(e => {
  console.log(`  - ${e.example} (score: ${e.judgeScore.toFixed(2)})`);
});

Iterate Your Prompt

Based on your findings, go back to LangSmith and adjust your prompt.

If "Style Match" scores low, add this to your prompt:

  "CRITICAL: Follow the Monday.com documentation style guide exactly.
  Use the same tone, structure, and terminology as shown in the example."
[M]:

Then re-run evaluation to see if you improved!

[ ]:
typescript
// After updating your prompt, run this:
const iterationResults = await evaluate(
  runDocumentationPrompt, // Updated prompt
  {
    data: datasetName,
    evaluators: [containsKeyTerms, lengthInRange, hasCitation, llmAsJudge],
    experimentPrefix: "iteration-2"
  }
);

console.log("📈 Iteration complete! Compare experiments in the LangSmith UI");

[M]:

Key Takeaways

The Evaluation Hierarchy

Smoke Test (manual, slow, doesn't scale)
    ↓
Basic Metrics (fast, cheap, catches obvious issues)
    ↓
LLM-as-Judge (expensive, slow, catches nuanced issues)
    ↓
Human Review (gold standard, very expensive)
[M]:

Use the right tool for the job:

  • Developing? → Smoke tests + basic metrics
  • Pre-production? → Add LLM judge
  • In production? → Monitor basic metrics + sample with judge
  • Critical decisions? → Human review on samples
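
One way to make the list above concrete is a small config that maps each stage to the evaluators defined in this notebook (the stage names are illustrative):

[ ]:
typescript
// Map each stage to an evaluator suite, reusing the evaluators defined earlier
const evaluatorsByStage = {
  development: [containsKeyTerms, lengthInRange, hasCitation],
  preProduction: [containsKeyTerms, lengthInRange, hasCitation, llmAsJudge],
  production: [containsKeyTerms, lengthInRange, hasCitation] // plus sampled judge runs
};

// Example: run the pre-production suite
// await evaluate(runDocumentationPrompt, {
//   data: datasetName,
//   evaluators: evaluatorsByStage.preProduction,
//   experimentPrefix: "pre-production"
// });

console.log("Evaluator suites configured:", Object.keys(evaluatorsByStage));

[M]: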

Judge Design Best Practices

DO:

  • Use a different model for judging than the one you're testing
  • Set temperature=0 for consistency
  • Define clear 1-5 scales with examples
  • Weight criteria by importance
  • Return structured JSON for parsing
  • Validate judge outputs (parse the JSON!)

DON'T:

  • Use the same model to judge itself (bias!)
  • Make vague criteria ("is it good?")
  • Trust the judge blindly - spot-check its decisions
  • Forget that judges can be wrong too
  • Ignore basic metrics - they're fast and often catch issues judges miss
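
For the "Validate judge outputs" point above, here is a small defensive-parsing sketch; stripping markdown code fences is an assumption about a common failure mode, not something every judge model does:

[ ]:
typescript
// Defensively parse a judge's JSON reply: strip ```json fences, return null on failure
const safeParseJudgeOutput = (raw: string): Record<string, any> | null => {
  const cleaned = raw
    .trim()
    .replace(/^```(?:json)?\s*/i, "")
    .replace(/```$/, "")
    .trim();
  try {
    return JSON.parse(cleaned);
  } catch {
    return null;
  }
};

// Spot-check before trusting a score
const parsed = safeParseJudgeOutput('```json\n{"verdict": "pass"}\n```');
console.log(parsed ? "✅ parsed:" : "❌ could not parse:", parsed);

[M]: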

Cost Awareness

[ ]:
typescript
// Track evaluation costs
console.log("\n💰 Cost Estimates:\n");
console.log(`Basic metrics: ~$0.00 (client-side logic)`);
console.log(`LLM judge calls: ${judgeResults.length} × ~$0.02 = ~$${(judgeResults.length * 0.02).toFixed(2)}`);
console.log(`Total evaluation cost: ~$${(judgeResults.length * 0.02).toFixed(2)}`);
console.log(`\n💡 Tip: Use basic metrics in CI/CD, reserve judge for pre-release checks`);
