🔬 AI Observability in Practice: Tracking AI Context Evolution Like a Git Diff

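When we debug code, we have debuggers and profilers; when we track versions, we have git diff. But what do we have when we debug an AI's behavior?

This article covers the AI Observability tooling I built in the Context Engineering Lab, along with a striking finding: on a three-review sentiment extraction test, adding just 52 tokens of instructions took accuracy from 0% to 100%.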

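🎯 Why AI Observability?

Traditional software development has a mature debugging toolchain:

Code Diff: see every line that changed
Performance Profiler: find performance bottlenecks
Error Tracking: trace the context in which an error occurred

AI development raises new questions:

❓ What changed in the context, and how much did it matter?
❓ Why does the same prompt produce different results?
❓ Is the increase in token count worth it?
❓ Which strategy has the best ROI?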

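🛠️ The AI Observability Tool Landscape

1. MCP (Model Context Protocol) - Dynamic Context Injection

MCP is a protocol introduced by Anthropic that lets an AI fetch external information on demand. Think of it as a plugin system for AI: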

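# MCP tool example
# find_symbol and get_dependencies are project-specific helpers (not shown)
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("code-context")

@mcp.tool()
def get_code_context(file_path: str, symbol_name: str) -> dict:
    """Dynamically inject relevant code into the context."""
    return {
        "file": file_path,
        "symbol": find_symbol(symbol_name),
        "dependencies": get_dependencies(symbol_name),
    }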

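Practical use cases:

🔍 When the AI needs to understand a function, load the relevant code automatically
📊 When it queries a database, inject the schema automatically
📝 When it edits a document, load the related sections on demand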
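
As a rough sketch of the schema-injection scenario, the function below reads table definitions from a SQLite database and prepends them to a question. The name `build_schema_context`, the prompt wording, and the choice of SQLite are assumptions for illustration. It is plain Python rather than MCP SDK code, though the same function body could sit behind an MCP tool.

```python
import sqlite3


def build_schema_context(conn: sqlite3.Connection, question: str) -> str:
    """Prepend the database schema to a question (hypothetical helper)."""
    # sqlite_master stores the original CREATE statement of every table
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'table' AND sql IS NOT NULL ORDER BY name"
    ).fetchall()
    schema = "\n".join(row[0] for row in rows)
    return f"Database schema:\n{schema}\n\nQuestion: {question}"


# Demo with an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (id INTEGER PRIMARY KEY, body TEXT)")
context = build_schema_context(conn, "How many reviews mention Bluetooth?")
print(context)
```

The model then sees the table layout before it sees the question, so it does not have to guess column names.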

2. Langfuse/LangSmith - Production Tracing

These tools focus on tracing AI behavior in production:

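from langfuse import observe  # SDK v3; on v2 use: from langfuse.decorators import observe

# build_context and call_ai are application-specific helpers (not shown)

# Trace every AI call
@observe()
def analyze_sentiment(text):
    context = build_context(text)
    response = call_ai(context)

    # Recorded on the trace:
    # - function input and output (the text and the response)
    # - latency
    # Token usage, cost, and quality scores can be attached to the same
    # trace, either by an LLM integration or explicitly
    return response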

Core features:

📈 Token usage analysis
⏱️ Latency monitoring
🎯 Accuracy tracking
💰 Cost calculation
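
To make those metrics concrete, here is a minimal stdlib-only sketch of what such a tracer records per call. It is not the Langfuse or LangSmith API: `traced`, `TRACE_LOG`, the whitespace token estimate, and the price constant are all assumptions for illustration.

```python
import functools
import time

TRACE_LOG = []  # hypothetical in-memory sink; a real tracer sends records to a backend
PRICE_PER_1K_TOKENS = 0.002  # assumed price, for illustration only


def estimate_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (not a real tokenizer)."""
    return len(text.split())


def traced(func):
    """Record latency, estimated token usage, and estimated cost of each call."""
    @functools.wraps(func)
    def wrapper(prompt: str) -> str:
        start = time.perf_counter()
        response = func(prompt)
        latency = time.perf_counter() - start
        tokens = estimate_tokens(prompt) + estimate_tokens(response)
        TRACE_LOG.append({
            "name": func.__name__,
            "latency_s": latency,
            "tokens": tokens,
            "cost": tokens / 1000 * PRICE_PER_1K_TOKENS,
        })
        return response
    return wrapper


@traced
def fake_model(prompt: str) -> str:
    """Stand-in for a real model call."""
    return "negative"


fake_model("Classify: the battery dies too fast")
print(TRACE_LOG[-1])
```

Accuracy tracking needs labeled expectations on top of this, which is why hosted tools treat quality scores as a separate signal attached to each trace.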

3. Chainlit - A UI Framework for Conversational AI

Chainlit lets you watch the context evolve in real time:

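import chainlit as cl

@cl.on_message
async def main(message: cl.Message):
    # Show each stage of context construction as its own step,
    # so a later stage does not overwrite an earlier one

    # Step 1: base context
    async with cl.Step(name="System prompt") as step:
        step.output = "Added system prompt (15 tokens)"

    # Step 2: load related material
    async with cl.Step(name="Code context") as step:
        step.output = "Added code context (300 tokens)"

    # Step 3: add examples
    async with cl.Step(name="Examples") as step:
        step.output = "Added examples (150 tokens)"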

In the chat UI, the steps for a request render like this:

💬 User: "Fix the auth bug"

🔧 Step 1: List Files
Added: project structure (150 tok)

🔧 Step 2: Find Symbol
Added: auth_user function (300 tok)

🤖 AI Response
[With full context visualization]

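🚀 A Custom Context Diff Tool in Practice

I built a dedicated Context Diff tool that tracks changes to an AI's context the way git diff tracks changes to code. Here are the results of an actual experiment:

Experiment Design: Sentiment Analysis

Test cases:

🎧 "這支耳機音質不錯，但藍牙常常斷線。" (These headphones sound good, but the Bluetooth often disconnects.)
⌨️ "The keyboard feels great, but the battery dies too fast."
📷 "相機畫質很棒，可是夜拍對焦很慢。" (The camera's image quality is great, but focusing is slow in night shots.)

Comparing Three Context Strategies

Strategy A: Baseline (vague instructions)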

context_a = """You are a sentiment analyzer.
Extract product info from this review."""
# Token count: 13

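Result: complete failure ❌

Model output (verbatim, with English glosses):

產品信息提取： (Product information extraction:)
- 產品類型：耳機 (Product type: headphones)
- 音質評價：不錯 (Sound quality: good)
- 藍牙連接問題：常常斷線 (Bluetooth connection issue: frequent disconnects)

Not in JSON format
No sentiment field
Accuracy: 0%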
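
The 0% score follows from how a reply is graded: it counts only if it parses as JSON and carries a valid `sentiment` value. The original grader is not shown, so the check below is a sketch under that assumption, and `is_valid_output` is a hypothetical name.

```python
import json

ALLOWED_SENTIMENTS = {"positive", "neutral", "negative"}


def is_valid_output(raw: str) -> bool:
    """Return True if raw is a JSON object with an allowed sentiment value."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False  # free text is rejected outright
    if not isinstance(data, dict):
        return False
    sentiment = data.get("sentiment")
    return isinstance(sentiment, str) and sentiment in ALLOWED_SENTIMENTS


free_text_output = "產品信息提取：\n- 產品類型：耳機\n- 音質評價：不錯"
structured_output = '{"sentiment": "negative", "product": "耳機", "issue": "藍牙常常斷線"}'

print(is_valid_output(free_text_output))   # False
print(is_valid_output(structured_output))  # True
```

Under a check like this, a reply can describe the review accurately and still score zero, which is what happens to the Baseline.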

Strategy B: Rules-based (explicit rules)

context_b = """You are a sentiment analyzer.

Extract the following information from product reviews:
- sentiment: must be "positive", "neutral", or "negative"
- product: the product name (string)
- issue: description of any issues (string, or empty)

Output must be valid JSON format.
Do not include markdown code blocks."""
# Token count: 65 (+52)

Result: full success ✅

{
  "sentiment": "negative",
  "product": "耳機",
  "issue": "藍牙常常斷線"
}

Accuracy: 100%
Improvement: +100 percentage points

Strategy C: Few-shot (adding examples)

context_c = context_b + """

Examples:

Input: "這支耳機音質不錯，但藍牙常常斷線。"
Output: {"sentiment": "negative", "product": "headphones", "issue": "bluetooth connection"}

Input: "The keyboard feels great, but the battery dies too fast."
Output: {"sentiment": "negative", "product": "keyboard", "issue": "battery life"}"""
# Token count: 161 (+96)

Result: equally accurate ✅

Accuracy: 100%
More consistent output
Higher token cost

📊 Visualizing the Context Diff

The tool produces a comparison view like this:

--- Context A (Baseline)
+++ Context B (Rules-based)
 You are a sentiment analyzer.
-Extract product info from this review.
+
+Extract the following information from product reviews:
+- sentiment: must be "positive", "neutral", or "negative"
+- product: the product name (string)
+- issue: description of any issues (string, or empty)
+
+Output must be valid JSON format.
+Do not include markdown code blocks.

Token evolution timeline:

📈 Context Evolution Timeline
┌────────┬─────────────┬────────┬──────────┬──────────┐
│ Step   │ Strategy    │ Tokens │ Δ Tokens │ Accuracy │
├────────┼─────────────┼────────┼──────────┼──────────┤
│ #0     │ Baseline    │     13 │          │      0%  │
│ #1     │ + Rules     │     65 │      +52 │    100%  │
│ #2     │ + Examples  │    161 │      +96 │    100%  │
└────────┴─────────────┴────────┴──────────┴──────────┘
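
The Δ Tokens column is the difference between consecutive snapshots. A small sketch that recomputes it from the token counts reported above (the `timeline` structure is an assumption for illustration, not the tool's internal format):

```python
# (strategy, tokens, accuracy %) taken from the timeline table
timeline = [
    ("Baseline", 13, 0),
    ("+ Rules", 65, 100),
    ("+ Examples", 161, 100),
]

# The first step has no predecessor, so its delta is None
deltas = [None] + [
    cur[1] - prev[1] for prev, cur in zip(timeline, timeline[1:])
]

for step, ((name, tokens, accuracy), delta) in enumerate(zip(timeline, deltas)):
    delta_text = "" if delta is None else f"{delta:+d}"
    print(f"#{step} {name:<11} {tokens:>6} {delta_text:>8} {accuracy:>7}%")
```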

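💡 Key Insights

1. Minimal Effective Change

In this experiment, 52 tokens of precise instructions were enough to raise accuracy from 0% to 100%. The 96 tokens of examples added afterwards brought no further accuracy gain, so the rules were the more token-efficient change.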

2. ROI Analysis

Strategy	Token cost	Accuracy	ROI	Recommendation
Baseline	13	0%	None	❌ Avoid
Rules	65 (5x)	100%	Highest	⭐⭐⭐ Recommended
Few-shot	161 (12x)	100%	High	⭐⭐ Specific scenarios
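
One way to quantify ROI is accuracy points gained per token added over the Baseline. The metric and the function name `gain_per_token` are assumptions for illustration; other definitions (for example, accuracy per dollar) are equally reasonable.

```python
def gain_per_token(tokens: int, accuracy: float,
                   base_tokens: int, base_accuracy: float) -> float:
    """Accuracy points gained per token added over the baseline."""
    added = tokens - base_tokens
    if added <= 0:
        raise ValueError("strategy must use more tokens than the baseline")
    return (accuracy - base_accuracy) / added


baseline = (13, 0.0)  # (tokens, accuracy %) from the experiment
rules = gain_per_token(65, 100.0, *baseline)
few_shot = gain_per_token(161, 100.0, *baseline)

print(f"Rules:    {rules:.2f} points per token")
print(f"Few-shot: {few_shot:.2f} points per token")
```

By this measure the Rules strategy returns roughly three times as much accuracy per added token as Few-shot, since both reach the same accuracy.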

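3. When to Use Each Strategy

Use Rules-based when:

✅ The task is well defined and structured
✅ The token budget is tight
✅ You need to iterate quickly

Use Few-shot when:

✅ You are handling complex edge cases
✅ You need a specific output format
✅ Stability matters more than cost

🔧 Build Your Own Context Visualizer

Here is the core code: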

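from difflib import unified_diff

import tiktoken


class ContextVisualizer:
    def __init__(self, encoding_name="cl100k_base"):
        self.snapshots = []
        # Tokenizer used for counting; pick the encoding that matches your model
        self.encoding = tiktoken.get_encoding(encoding_name)

    def count_tokens(self, content):
        """Count the tokens in a context string."""
        return len(self.encoding.encode(content))

    def add_snapshot(self, name, content, metadata=None):
        """Record a context snapshot."""
        self.snapshots.append({
            "name": name,
            "content": content,
            "tokens": self.count_tokens(content),
            "metadata": metadata or {}
        })

    def show_diff(self, idx_a, idx_b):
        """Print a unified diff between two snapshots."""
        a, b = self.snapshots[idx_a], self.snapshots[idx_b]
        diff = unified_diff(
            a["content"].splitlines(),
            b["content"].splitlines(),
            fromfile=a["name"],
            tofile=b["name"],
            lineterm=""
        )
        print("\n".join(diff))

    def show_evolution(self):
        """Print the context evolution timeline, starting from step 0."""
        previous = None
        for i, snapshot in enumerate(self.snapshots):
            # The first snapshot has no predecessor, so it gets no delta
            delta = "" if previous is None else f" ({snapshot['tokens'] - previous:+d})"
            print(f"Step {i}: {snapshot['name']} | "
                  f"Tokens: {snapshot['tokens']}{delta}")
            previous = snapshot["tokens"]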

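📈 Production Best Practices

1. Build a Context Experiment Framework

# context_experiment.py
# evaluate, count_tokens, and calculate_cost are project-specific helpers (not shown)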
class ContextExperiment:
    def __init__(self):
        self.strategies = []
        self.results = []
    
    def add_strategy(self, name, context_builder):
        self.strategies.append((name, context_builder))
    
    def run(self, test_cases):
        for name, builder in self.strategies:
            context = builder()
            score = evaluate(context, test_cases)
            self.results.append({
                "strategy": name,
                "tokens": count_tokens(context),
                "score": score,
                "cost": calculate_cost(context)
            })
    
    def get_best_strategy(self):
        # Pick the best strategy by ROI (score per unit cost)
        return max(self.results, key=lambda x: x["score"] / x["cost"])
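
The class above relies on `evaluate`, `count_tokens`, and `calculate_cost`, which are not shown. A minimal sketch of what they might look like follows. The whitespace token count, the price constant, the `(review, expected sentiment)` test-case shape, and the `call_model` stub are all assumptions, so swap in a real tokenizer and model client before trusting the numbers.

```python
import json

PRICE_PER_1K_TOKENS = 0.002  # assumed price, for illustration only


def count_tokens(context: str) -> int:
    """Approximate token count by whitespace splitting (not a real tokenizer)."""
    return len(context.split())


def calculate_cost(context: str) -> float:
    """Estimated prompt cost for one call."""
    return count_tokens(context) / 1000 * PRICE_PER_1K_TOKENS


def call_model(context: str, review: str) -> str:
    """Stub model: returns JSON only when the context asks for it."""
    if "JSON" in context:
        return json.dumps({"sentiment": "negative"})
    return "Sentiment looks negative."


def evaluate(context: str, test_cases: list[tuple[str, str]]) -> float:
    """Fraction of cases whose JSON output carries the expected sentiment."""
    correct = 0
    for review, expected in test_cases:
        try:
            data = json.loads(call_model(context, review))
        except json.JSONDecodeError:
            continue  # non-JSON output counts as a miss
        if isinstance(data, dict) and data.get("sentiment") == expected:
            correct += 1
    return correct / len(test_cases)


cases = [("The keyboard feels great, but the battery dies too fast.", "negative")]
vague_score = evaluate("You are a sentiment analyzer.", cases)
rules_score = evaluate("Output must be valid JSON format.", cases)
print(vague_score, rules_score)
```

Keeping the model call behind one function makes it easy to replace the stub with a real client without touching the scoring logic.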

2. Continuous Monitoring and Optimization

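# monitoring.py
# count_tokens, score_response, measure_latency, calculate_cost, alert,
# and THRESHOLD are project-specific (not shown)
from langfuse import observe  # SDK v3; on v2 use: from langfuse.decorators import observe


class ContextMonitor:
    @observe()
    def track_context_performance(self, context, response):
        # Each call is recorded as a trace
        metrics = {
            "token_count": count_tokens(context),
            "response_quality": score_response(response),
            "latency": measure_latency(),
            "cost": calculate_cost(context)
        }

        # Alert when tokens spent per quality point exceed the threshold
        # (a quality score of zero always alerts and avoids dividing by zero)
        quality = metrics["response_quality"]
        if quality <= 0 or metrics["token_count"] / quality > THRESHOLD:
            alert("Context efficiency below threshold!")

        return metrics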

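🎯 Conclusion and Action Items

Three Key Takeaways

1. Observability is the foundation of AI development
- You cannot optimize what you do not measure
- Context changes must be visible
- ROI should drive decisions
2. A combination of tools beats any single solution
- MCP handles dynamic injection
- Langfuse traces production
- Custom tools fill the gaps
3. Small changes can have a large impact
- 52 tokens took accuracy from 0% to 100%
- Precise instructions beat a pile of examples
- Test, measure, iterate

Take Action Now

1. Start today: install and set up an AI Observability tool
```
pip install langfuse chainlit tiktoken
```
2. Establish a baseline: measure the context efficiency of your existing AI application
3. Experiment and iterate: use A/B tests to find the best context strategy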
