Chapter 8 of 8 · 1 min read

Design Benchmark

How to load test and measure an LLM serving system


Load Testing Tools

Locust

from locust import HttpUser, task, between

class LLMUser(HttpUser):
    # Each simulated user pauses 1-3 seconds between requests
    wait_time = between(1, 3)

    @task
    def generate(self):
        # Issue one completion request against the serving endpoint
        self.client.post(
            "/v1/completions",
            json={
                "model": "llama-2-7b",
                "prompt": "Hello",
                "max_tokens": 100
            }
        )

llmperf

llmperf (from Anyscale) and vLLM's bundled benchmark scripts are command-line tools that drive a serving endpoint directly. An invocation along these lines (the exact module path and flags depend on the tool and version, so check its --help):

python -m vllm.benchmarks.llmperf \
    --endpoint http://localhost:8000 \
    --num-requests 100 \
    --output benchmark_results.json

Benchmark Scenarios

  1. Baseline Test - Single user, measure basic performance
  2. Load Test - Expected traffic, verify SLOs
  3. Stress Test - Beyond capacity, find breaking point
  4. Soak Test - Long duration, check for memory leaks

Results Analysis

A typical report covers throughput, TTFT (Time To First Token), ITL (Inter-Token Latency), and error rate:

Benchmark Results:
==================
Requests/sec: 1250
Avg TTFT: 78ms
Avg ITL: 28ms
P99 TTFT: 156ms
P99 ITL: 45ms
Error Rate: 0.01%

Bottleneck: KV cache memory at 85% utilization
Recommendation: Add 2 more GPUs or implement request queuing
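TTFT and ITL figures like those above are derived from per-request token arrival timestamps. A sketch of that computation (the `(start_time, [token_arrival_times])` record format is an assumption, not a standard):

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p%
    of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def latency_metrics(requests):
    """Compute TTFT and ITL stats from streaming timestamps.

    requests: list of (start_time, [token_arrival_times]) tuples,
    a hypothetical record format produced by the load driver.
    """
    ttfts, itls = [], []
    for start, arrivals in requests:
        ttfts.append(arrivals[0] - start)  # time to first token
        # inter-token latency: gap between consecutive token arrivals
        itls.extend(b - a for a, b in zip(arrivals, arrivals[1:]))
    return {
        "avg_ttft": sum(ttfts) / len(ttfts),
        "p99_ttft": percentile(ttfts, 99),
        "avg_itl": sum(itls) / len(itls) if itls else 0.0,
        "p99_itl": percentile(itls, 99) if itls else 0.0,
    }
```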

Pre-deployment Checklist

  • Baseline performance measured
  • SLO targets defined and verified
  • Auto-scaling tested under load
  • Failure scenarios simulated
  • Monitoring and alerting configured
  • Documentation updated

End of course - Congratulations on completing the LLM Deployment course!