[M]:

The Goal

You've built a prompt that generates documentation. Now you need to know: Is it actually good? And more importantly: How do you measure "good" systematically?

This exercise teaches you the fundamentals of LLM evaluation using LangSmith's TypeScript SDK in a Jupyter notebook.


What You'll Learn

Three levels of evaluation rigor:

  1. Smoke Test - Quick manual sanity checks (you already did this)
  2. Basic Metrics - Automated reference-based scoring with custom code
  3. LLM-as-Judge - Using another AI to evaluate your AI

Setup: TypeScript Jupyter Notebook (5 mins)

Step 1: Install Deno Jupyter Kernel

If you haven't already:

[ ]:
bash
# Install Deno (if not installed)
brew install deno

# Install Jupyter kernel
deno jupyter --install

# Start Jupyter
jupyter lab

[M]:

Step 2: Create New Notebook

  1. In Jupyter, click "New" → "Deno"
  2. Save as monday-docs-evaluation.ipynb
  3. You're ready to write TypeScript!

Step 3: Install LangSmith SDK

[ ]:
typescript
// Import LangSmith SDK
import { evaluate } from "npm:langsmith/evaluation";
import { Client } from "npm:langsmith";

// Verify it works
console.log("LangSmith SDK loaded!");

[M]:

Run the cell (Shift+Enter). Deno will auto-install dependencies.

[M]:
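
Before going further, make sure the SDK can authenticate. A quick sanity check, assuming you use the LANGSMITH_API_KEY environment variable (set it before launching Jupyter if this fails):

[ ]:
typescript
// Verify the LangSmith API key is visible to the notebook.
// Assumption: credentials are provided via LANGSMITH_API_KEY.
const langsmithKey = Deno.env.get("LANGSMITH_API_KEY");
console.log(langsmithKey ? "✅ LANGSMITH_API_KEY is set" : "❌ LANGSMITH_API_KEY is missing");

[M]: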

Phase 1: You Already Did This (Smoke Test)

In the previous exercise, you:

  • ✅ Ran your prompt on examples
  • ✅ Eyeballed the outputs
  • ✅ Compared to golden outputs
  • ✅ Rated with 🔥/👍/🤷/💩

This is Evaluation Level 0 - essential but doesn't scale.

Now let's make it systematic...


Phase 2: Basic Metrics (20 mins)

What Are Basic Metrics?

Simple, automated checks you can code yourself:

  • Contains Check: Does output include key phrases?
  • Length Check: Is output within expected token range?
  • Schema Validation: For structured outputs, is format correct?
  • Citation Check: Does it reference the expected sources?

Exercise: Write Custom Evaluators

Step 1: Define Your Metrics

For Monday.com documentation, what should we check?

| Metric | Good For | Example 1 | Example 2 | Example 3 |
|---|---|---|---|---|
| Contains key terms | Critical info is present | ✅ Function names | ✅ Parent class refs | |
| Length bounds | Prevent too short/long | ✅ 200-800 tokens | ✅ 300-1000 tokens | |
| Citation present | Ensure attribution | ✅ Style guide mention | ✅ Parent doc link | |
| Schema validity | JSON structure | ✅ Valid JSON | | |

Step 2: Write Evaluator Functions

[ ]:
typescript
import type { EvaluationResult } from "npm:langsmith/evaluation";

// Evaluator 1: Check if output contains key technical terms
const containsKeyTerms = async ({
  outputs,
  referenceOutputs
}: {
  outputs: Record<string, any>;
  referenceOutputs?: Record<string, any>;
}): Promise<EvaluationResult> => {
  const output = outputs.text?.toLowerCase() || "";
  const required = referenceOutputs?.required_terms || [];
  const found = required.filter((term: string) =>
    output.includes(term.toLowerCase())
  );
  const score = required.length > 0 ? found.length / required.length : 1.0;
  return {
    key: "contains_key_terms",
    score: score,
    comment: `Found ${found.length}/${required.length} terms: ${found.join(", ")}`
  };
};

console.log("✅ containsKeyTerms evaluator defined");

[M]:

Step 3: Add More Evaluators

[ ]:
typescript
// Evaluator 2: Check output length is reasonable
const lengthInRange = async ({
  outputs
}: {
  outputs: Record<string, any>;
}): Promise<EvaluationResult> => {
  const text = outputs.text || "";
  const tokenEstimate = text.split(/\s+/).length; // Rough word-count estimate used as a token proxy
  const min = 200;
  const max = 800;
  const inRange = tokenEstimate >= min && tokenEstimate <= max;
  return {
    key: "length_in_range",
    score: inRange ? 1.0 : 0.0,
    comment: `${tokenEstimate} tokens (target: ${min}-${max})`
  };
};

// Evaluator 3: Check if the style guide is cited
const hasCitation = async ({
  outputs
}: {
  outputs: Record<string, any>;
}): Promise<EvaluationResult> => {
  const text = outputs.text || "";
  const citations = ["style guide", "documentation standard", "Monday.com style"];
  const found = citations.some(cite =>
    text.toLowerCase().includes(cite.toLowerCase())
  );
  return {
    key: "has_citation",
    score: found ? 1.0 : 0.0,
    comment: found ? "Citation found" : "No style guide citation found"
  };
};

console.log("✅ lengthInRange and hasCitation evaluators defined");
[M]:
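
The metrics table in Step 1 also lists a schema-validity check. It only applies if you ask the prompt for structured JSON output, which this exercise does not require; here is a hedged sketch of what such a check could look like, should you switch to structured output:

[ ]:
typescript
import type { EvaluationResult } from "npm:langsmith/evaluation";

// Evaluator 4 (optional): does the output parse as JSON?
// Only meaningful when the prompt is instructed to return structured JSON.
const schemaValid = async ({
  outputs
}: {
  outputs: Record<string, any>;
}): Promise<EvaluationResult> => {
  const text = outputs.text || "";
  try {
    JSON.parse(text);
    return { key: "schema_valid", score: 1.0, comment: "Output parses as JSON" };
  } catch {
    return { key: "schema_valid", score: 0.0, comment: "Output is not valid JSON" };
  }
};

console.log("✅ schemaValid evaluator defined (optional, not used in the runs below)");

[M]: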

Step 4: Prepare Your Dataset

You already have the dataset in LangSmith from the context engineering exercise. Let's reference it:

[ ]:
typescript
const client = new Client();
const datasetName = "Monday Code Documentation Challenge";

// Verify the dataset exists (listDatasets returns an async iterable, so iterate over it)
let dataset: any = null;
for await (const d of client.listDatasets()) {
  if (d.name === datasetName) {
    dataset = d;
    break;
  }
}

if (dataset) {
  console.log(`✅ Found dataset: ${dataset.name} with ${dataset.example_count} examples`);
} else {
  console.log("❌ Dataset not found. Make sure you completed the context engineering exercise.");
}

[M]:

Step 5: Define Your Application Function

This wraps your prompt so it can be evaluated:

[ ]:
typescript
// This simulates calling your prompt from the previous exercise.
// Replace it with your actual LangSmith prompt call.
const runDocumentationPrompt = async (inputs: Record<string, any>) => {
  // TODO: Replace this with your actual prompt execution
  // For now, we simulate calling the LangSmith API
  const promptName = "your-prompt-name-here"; // From the context engineering exercise
  // This is a placeholder - you'd call your actual prompt here
  return {
    text: "Generated documentation would appear here...",
    metadata: { model: "gpt-4", tokens: 500 }
  };
};

[M]:
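
If you want real outputs instead of the placeholder, one hedged option is to call a model directly with the prompt text you built in the previous exercise. The sketch below assumes an OPENAI_API_KEY environment variable and uses gpt-4o-mini as an example model; paste your own prompt into SYSTEM_PROMPT and pass runDocumentationPromptLive to evaluate() in the next step (or simply overwrite runDocumentationPrompt with it).

[ ]:
typescript
import OpenAI from "npm:openai";

// Assumes OPENAI_API_KEY is set; the model name is an example, not a requirement
const openai = new OpenAI({ apiKey: Deno.env.get("OPENAI_API_KEY") });

// Paste the documentation prompt you built in the context engineering exercise
const SYSTEM_PROMPT = `<your documentation prompt goes here>`;

const runDocumentationPromptLive = async (inputs: Record<string, any>) => {
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
      // Pass the dataset inputs (e.g. the code to document) as the user turn
      { role: "user", content: JSON.stringify(inputs) }
    ]
  });
  return { text: response.choices[0].message.content ?? "" };
};

console.log("✅ runDocumentationPromptLive defined");

[M]: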

Step 6: Run Evaluation

[ ]:
typescript
// Run evaluation with your basic metrics
const results = await evaluate(
  runDocumentationPrompt,
  {
    data: datasetName,
    evaluators: [
      containsKeyTerms,
      lengthInRange,
      hasCitation
    ],
    experimentPrefix: "basic-metrics"
  }
);

console.log("✅ Evaluation complete!");
console.log("Experiment name:", results.experimentName); // open this experiment in the LangSmith UI

[M]:

Step 7: Analyze Results

[ ]:
typescript
// Quick summary of results
const summary = {
  totalExamples: results.results.length,
  averageScores: {
    containsKeyTerms: 0,
    lengthInRange: 0,
    hasCitation: 0
  } as Record<string, number>
};

results.results.forEach(result => {
  summary.averageScores.containsKeyTerms += Number(result.evaluationResults.results.find(
    r => r.key === "contains_key_terms"
  )?.score ?? 0);
  summary.averageScores.lengthInRange += Number(result.evaluationResults.results.find(
    r => r.key === "length_in_range"
  )?.score ?? 0);
  summary.averageScores.hasCitation += Number(result.evaluationResults.results.find(
    r => r.key === "has_citation"
  )?.score ?? 0);
});

// Calculate averages (guard against an empty result set)
Object.keys(summary.averageScores).forEach(key => {
  summary.averageScores[key] /= summary.totalExamples || 1;
});

console.log("📊 Summary:", summary);

[M]:

Phase 3: LLM-as-Judge (25 mins)

What Is LLM-as-Judge?

Use a powerful LLM (like GPT-4) to evaluate your outputs on subjective criteria:

  • Accuracy of technical descriptions
  • Clarity for new engineers
  • Style/tone match
  • Completeness without verbosity

Why this matters: Basic metrics can't judge "Is this explanation clear?" Only another LLM (or human) can.


Exercise: Build Your First Judge

Step 1: Define Your Rubric

For Monday.com documentation, what matters?

| Criterion | Weight | What to Check | 1-5 Scale |
|---|---|---|---|
| Technical Accuracy | 30% | Code descriptions are factually correct | 1 = wrong, 3 = mostly right, 5 = perfect |
| Clarity | 25% | Understandable to a new engineer | 1 = confusing, 3 = clear, 5 = crystal clear |
| Style Match | 20% | Follows Monday.com doc style | 1 = ignores, 3 = partial, 5 = perfect match |
| Completeness | 15% | Covers key functions without bloat | 1 = missing major stuff, 5 = just right |
| Citation Quality | 10% | References style guide appropriately | 1 = no citations, 5 = proper citations |
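
The overall judge score used later in this notebook is simply the weighted average of these five criterion scores. A minimal sketch of that arithmetic with made-up scores (the snake_case keys match the judge output format defined below):

[ ]:
typescript
// Weights from the rubric above; the sample scores are illustrative only
const rubricWeights = {
  technical_accuracy: 0.30,
  clarity: 0.25,
  style_match: 0.20,
  completeness: 0.15,
  citation_quality: 0.10
};

const sampleScores: Record<string, number> = {
  technical_accuracy: 4,
  clarity: 5,
  style_match: 3,
  completeness: 4,
  citation_quality: 2
};

// Weighted average on the 1-5 scale
const overallSample = Object.entries(rubricWeights).reduce(
  (sum, [criterion, weight]) => sum + weight * sampleScores[criterion],
  0
);

console.log(`Weighted overall: ${overallSample.toFixed(2)} / 5`); // 3.85 for these sample scores

[M]: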

Step 2: Write the Judge Prompt

[ ]:
typescript
const JUDGE_PROMPT = `You are an expert technical documentation reviewer at Monday.com.

Evaluate the following documentation against our standards.

## Input Code:
{code}

## Generated Documentation:
{output}

## Style Guide Reference:
{style_guide}

## Reference Output (for comparison only):
{reference}

---

Score the documentation on these criteria (1-5 scale):

1. **Technical Accuracy** (30% weight): Are all code descriptions factually correct?
- 1: Contains errors or misrepresentations
- 3: Mostly accurate with minor issues
- 5: Completely accurate

2. **Clarity** (25% weight): Would a new Monday.com engineer understand this?
- 1: Confusing or unclear
- 3: Clear but could be better
- 5: Exceptionally clear

3. **Style Match** (20% weight): Does it follow Monday.com documentation style?
- 1: Ignores style guide
- 3: Partially follows
- 5: Perfect style match

4. **Completeness** (15% weight): Does it cover the key functions without bloat?
- 1: Missing major functionality
- 3: Covers most of what matters
- 5: Covers exactly what matters, no padding

5. **Citation Quality** (10% weight): Does it reference the style guide appropriately?
- 1: No citations
- 3: Some citations, inconsistently applied
- 5: Proper citations throughout

Respond with ONLY a JSON object in this exact format:

{
  "scores": {
    "technical_accuracy": <1-5>,
    "clarity": <1-5>,
    "style_match": <1-5>,
    "completeness": <1-5>,
    "citation_quality": <1-5>
  },
  "verdict": "<one-sentence overall judgment>",
  "key_issues": ["<issue 1>", "<issue 2>"],
  "strengths": ["<strength 1>", "<strength 2>"]
}`;

console.log("✅ Judge prompt defined");
[M]:
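
Step 3: Turn the Prompt into an Evaluator

The evaluator below feeds the judge prompt to a separate model and converts its JSON reply into a LangSmith EvaluationResult. Treat it as a minimal sketch, not the only way to do this: it assumes an OPENAI_API_KEY environment variable, uses gpt-4o as the judge model (per the "use a different model than the one you're testing" guideline later in this exercise), and assumes your dataset inputs are named code and style_guide as in the context engineering exercise. Adjust the field names and model to match your setup.

[ ]:
typescript
import OpenAI from "npm:openai";
import type { EvaluationResult } from "npm:langsmith/evaluation";

// Judge client - assumes OPENAI_API_KEY is set in the environment
const judgeClient = new OpenAI({ apiKey: Deno.env.get("OPENAI_API_KEY") });

const llmAsJudge = async ({
  inputs,
  outputs,
  referenceOutputs
}: {
  inputs: Record<string, any>;
  outputs: Record<string, any>;
  referenceOutputs?: Record<string, any>;
}): Promise<EvaluationResult> => {
  // Fill the placeholders defined in JUDGE_PROMPT.
  // The field names (code, style_guide) are assumptions - rename them to match your dataset.
  const prompt = JUDGE_PROMPT
    .replace("{code}", inputs.code ?? JSON.stringify(inputs))
    .replace("{output}", outputs.text ?? "")
    .replace("{style_guide}", inputs.style_guide ?? "")
    .replace("{reference}", referenceOutputs?.text ?? JSON.stringify(referenceOutputs ?? {}));

  const response = await judgeClient.chat.completions.create({
    model: "gpt-4o",      // judge with a different model than the one under test
    temperature: 0,        // deterministic judging
    response_format: { type: "json_object" },
    messages: [{ role: "user", content: prompt }]
  });

  // Parse the judge's JSON verdict defensively
  let details: Record<string, any>;
  try {
    details = JSON.parse(response.choices[0].message.content ?? "{}");
  } catch {
    return { key: "llm_judge", score: 0, comment: "Judge returned unparseable output" };
  }

  // Weighted overall score on the 1-5 rubric scale
  const weights: Record<string, number> = {
    technical_accuracy: 0.30,
    clarity: 0.25,
    style_match: 0.20,
    completeness: 0.15,
    citation_quality: 0.10
  };
  const overall = Object.entries(weights).reduce(
    (sum, [criterion, weight]) => sum + weight * (details.scores?.[criterion] ?? 0),
    0
  );

  return {
    key: "llm_judge",
    score: overall,
    // The analysis cells below JSON.parse this comment to read verdict, scores, key_issues, strengths
    comment: JSON.stringify(details)
  };
};

console.log("✅ llmAsJudge evaluator defined");

[M]: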

Step 4: Run Complete Evaluation

Now run with ALL evaluators (basic + judge):

[ ]:
typescript
const fullResults = await evaluate(
  runDocumentationPrompt,
  {
    data: datasetName,
    evaluators: [
      // Basic metrics
      containsKeyTerms,
      lengthInRange,
      hasCitation,
      // LLM judge
      llmAsJudge
    ],
    experimentPrefix: "full-evaluation",
    maxConcurrency: 1 // Run sequentially to avoid rate limits
  }
);

console.log("✅ Full evaluation complete!");
console.log("Experiment name:", fullResults.experimentName); // open this experiment in the LangSmith UI

[M]:

Step 5: Analyze Judge Results

[ ]:
typescript
// Extract judge scores
const judgeResults = fullResults.results.map(result => {
  const judgeEval = result.evaluationResults.results.find(
    r => r.key === "llm_judge"
  );
  if (judgeEval && judgeEval.comment) {
    const details = JSON.parse(judgeEval.comment);
    return {
      example: result.example.inputs,
      score: Number(judgeEval.score ?? 0),
      verdict: details.verdict,
      scores: details.scores,
      issues: details.key_issues,
      strengths: details.strengths
    };
  }
  return null;
}).filter((r): r is NonNullable<typeof r> => r !== null);

// Display results
console.log("\n📊 LLM Judge Results:\n");
judgeResults.forEach((result, i) => {
  console.log(`Example ${i + 1}:`);
  console.log(`  Overall: ${result.score.toFixed(2)}/5.0 - ${result.verdict}`);
  console.log(`  Detailed Scores:`, result.scores);
  console.log(`  Issues:`, result.issues);
  console.log(`  Strengths:`, result.strengths);
  console.log("");
});

// Calculate average scores per criterion
const avgScores: Record<string, number> = {
  technical_accuracy: 0,
  clarity: 0,
  style_match: 0,
  completeness: 0,
  citation_quality: 0
};

judgeResults.forEach(result => {
  Object.keys(avgScores).forEach(key => {
    avgScores[key] += result.scores?.[key] || 0;
  });
});

Object.keys(avgScores).forEach(key => {
  avgScores[key] /= judgeResults.length || 1;
});

console.log("📈 Average criterion scores:", avgScores);
[M]:

Phase 4: Analysis & Iteration (15 mins)

Compare All Evaluation Layers

Now you have three layers of evaluation:

  1. Your smoke test ratings (🔥/👍/🤷/💩)
  2. Basic metrics (contains, length, citations)
  3. LLM judge scores (1-5 on each criterion)

Exercise: Find Patterns

[ ]:
typescript
// Compare basic metrics vs judge scores
const comparison = fullResults.results.map(result => {
  const basicMetrics: Record<string, number> = {};
  result.evaluationResults.results.forEach(r => {
    if (r.key !== "llm_judge") {
      basicMetrics[r.key] = Number(r.score ?? 0);
    }
  });
  const judgeScore = Number(result.evaluationResults.results.find(
    r => r.key === "llm_judge"
  )?.score) || 0;
  return {
    example: result.example.inputs.query || "Unknown",
    basicMetrics,
    judgeScore,
    alignment: judgeScore > 3.5 && Object.values(basicMetrics).every(v => v > 0.7)
  };
});

console.log("\n🔍 Evaluation Alignment Analysis:\n");
console.table(comparison);

// Find discrepancies
const discrepancies = comparison.filter(c => !c.alignment);
console.log(`\n⚠️ Found ${discrepancies.length} cases where the basic metrics and the judge disagree`);
discrepancies.forEach(d => {
  console.log(`  - ${d.example}: judge scored ${d.judgeScore.toFixed(2)}, basic metrics: ${JSON.stringify(d.basicMetrics)}`);
});

[M]:

Key Questions to Ask:

[ ]:
typescript
// Generate insights
console.log("\n💡 Key Insights:\n");

// 1. Which criterion is the weakest?
const weakestCriterion = Object.entries(avgScores)
  .sort(([, a], [, b]) => a - b)[0];
console.log(`1. Weakest area: ${weakestCriterion[0]} (${weakestCriterion[1].toFixed(2)}/5.0)`);

// 2. Are basic metrics predictive of judge scores?
const correlation = comparison.every(c =>
  (c.judgeScore > 3.5 && Object.values(c.basicMetrics).every(v => v > 0.7)) ||
  (c.judgeScore <= 3.5 && Object.values(c.basicMetrics).some(v => v <= 0.7))
);
console.log(`2. Basic metrics predict judge scores: ${correlation ? "YES ✅" : "NO ❌"}`);

// 3. Which examples need the most work?
const needsWork = comparison
  .filter(c => c.judgeScore < 3.5)
  .sort((a, b) => a.judgeScore - b.judgeScore);
console.log(`3. Examples needing improvement: ${needsWork.length}`);
needsWork.forEach(e => {
  console.log(`  - ${e.example} (score: ${e.judgeScore.toFixed(2)})`);
});

Iterate Your Prompt

Based on your findings, go back to LangSmith and adjust your prompt.

If "Style Match" scores low, add this to your prompt:

  "CRITICAL: Follow the Monday.com documentation style guide exactly.
  Use the same tone, structure, and terminology as shown in the example."
[M]:

Then re-run evaluation to see if you improved!

[ ]:
typescript
// After updating your prompt, run this:
const iterationResults = await evaluate(
  runDocumentationPrompt, // Updated prompt
  {
    data: datasetName,
    evaluators: [containsKeyTerms, lengthInRange, hasCitation, llmAsJudge],
    experimentPrefix: "iteration-2"
  }
);

console.log("📈 Iteration complete! Compare experiments in the LangSmith UI");

[M]:

Key Takeaways

The Evaluation Hierarchy

Smoke Test (manual, slow, doesn't scale)
    ↓
Basic Metrics (fast, cheap, catches obvious issues)
    ↓
LLM-as-Judge (expensive, slow, catches nuanced issues)
    ↓
Human Review (gold standard, very expensive)
[M]:

Use the right tool for the job:

  • Developing? → Smoke tests + basic metrics
  • Pre-production? → Add LLM judge
  • In production? → Monitor basic metrics + sample with judge
  • Critical decisions? → Human review on samples
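
One way to make the list above concrete is a small config that maps each stage to the evaluators defined in this notebook (the stage names are illustrative):

[ ]:
typescript
// Map each stage to an evaluator suite, reusing the evaluators defined earlier
const evaluatorsByStage = {
  development: [containsKeyTerms, lengthInRange, hasCitation],
  preProduction: [containsKeyTerms, lengthInRange, hasCitation, llmAsJudge],
  production: [containsKeyTerms, lengthInRange, hasCitation] // plus sampled judge runs
};

// Example: run the pre-production suite
// await evaluate(runDocumentationPrompt, {
//   data: datasetName,
//   evaluators: evaluatorsByStage.preProduction,
//   experimentPrefix: "pre-production"
// });

console.log("Evaluator suites configured:", Object.keys(evaluatorsByStage));

[M]: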

Judge Design Best Practices

DO:

  • Use a different model for judging than the one you're testing
  • Set temperature=0 for consistency
  • Define clear 1-5 scales with examples
  • Weight criteria by importance
  • Return structured JSON for parsing
  • Validate judge outputs (parse the JSON!)

DON'T:

  • Use the same model to judge itself (bias!)
  • Make vague criteria ("is it good?")
  • Trust the judge blindly - spot-check its decisions
  • Forget that judges can be wrong too
  • Ignore basic metrics - they're fast and often catch issues judges miss
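
For the "Validate judge outputs" point above, here is a small defensive-parsing sketch; stripping markdown code fences is an assumption about a common failure mode, not something every judge model does:

[ ]:
typescript
// Defensively parse a judge's JSON reply: strip ```json fences, return null on failure
const safeParseJudgeOutput = (raw: string): Record<string, any> | null => {
  const cleaned = raw
    .trim()
    .replace(/^```(?:json)?\s*/i, "")
    .replace(/```$/, "")
    .trim();
  try {
    return JSON.parse(cleaned);
  } catch {
    return null;
  }
};

// Spot-check before trusting a score
const parsed = safeParseJudgeOutput('```json\n{"verdict": "pass"}\n```');
console.log(parsed ? "✅ parsed:" : "❌ could not parse:", parsed);

[M]: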

Cost Awareness

[ ]:
typescript
// Track evaluation costs
console.log("\n💰 Cost Estimates:\n");
console.log(`Basic metrics: ~$0.00 (client-side logic)`);
console.log(`LLM judge calls: ${judgeResults.length} × ~$0.02 = ~$${(judgeResults.length * 0.02).toFixed(2)}`);
console.log(`Total evaluation cost: ~$${(judgeResults.length * 0.02).toFixed(2)}`);
console.log(`\n💡 Tip: Use basic metrics in CI/CD, reserve judge for pre-release checks`);
