Chapter 8 of 8
Chapter 8: Design Benchmark
Designing load tests and measuring the performance of an LLM serving system
Load Testing Tools
Locust
```python
from locust import HttpUser, task, between

class LLMUser(HttpUser):
    # Each simulated user waits 1-3 seconds between requests
    wait_time = between(1, 3)

    @task
    def generate(self):
        self.client.post(
            "/v1/completions",
            json={
                "model": "llama-2-7b",
                "prompt": "Hello",
                "max_tokens": 100,
            },
        )
```
llmperf / vLLM serving benchmark
Note that llmperf is a standalone tool from the Ray project rather than a vLLM module; vLLM ships its own serving benchmark script with a similar purpose. A typical invocation of vLLM's script (exact flags vary by version):
```shell
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --base-url http://localhost:8000 \
    --num-prompts 100 \
    --save-result
```
Benchmark Scenarios
- Baseline Test - Single user, measure basic performance
- Load Test - Expected traffic, verify SLOs
- Stress Test - Beyond capacity, find breaking point
- Soak Test - Long duration, check for memory leaks
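The four scenarios differ mainly in concurrency and duration. As a rough sketch (the numbers below are illustrative assumptions, not targets from this course):

```python
# Illustrative scenario parameters -- the numbers are assumptions for this
# sketch, not prescribed values; derive real ones from your SLOs and traffic.
SCENARIOS = {
    "baseline": {"users": 1,    "duration_s": 300,   "goal": "measure basic performance"},
    "load":     {"users": 200,  "duration_s": 1800,  "goal": "verify SLOs at expected traffic"},
    "stress":   {"users": 1000, "duration_s": 900,   "goal": "find the breaking point"},
    "soak":     {"users": 200,  "duration_s": 28800, "goal": "check for memory leaks"},
}

for name, cfg in SCENARIOS.items():
    print(f"{name}: {cfg['users']} users for {cfg['duration_s']}s -- {cfg['goal']}")
```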
Results Analysis
A benchmark report should surface averages and tail percentiles for time to first token (TTFT) and inter-token latency (ITL), plus the observed bottleneck and a recommendation:
```
Benchmark Results:
==================
Requests/sec: 1250
Avg TTFT:     78ms
Avg ITL:      28ms
P99 TTFT:     156ms
P99 ITL:      45ms
Error Rate:   0.01%

Bottleneck: KV cache memory at 85% utilization
Recommendation: Add 2 more GPUs or implement request queuing
```
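TTFT and ITL figures like those above can be derived from per-token arrival timestamps collected during the run. A minimal sketch using nearest-rank percentiles (function names are my own, not from any benchmarking tool):

```python
def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples <= it."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

def token_metrics(request_start, token_times):
    """TTFT and inter-token latencies for one streamed response.

    token_times: arrival time of each generated token, on the same clock
    as request_start (here, synthetic millisecond timestamps).
    """
    ttft = token_times[0] - request_start
    itls = [b - a for a, b in zip(token_times, token_times[1:])]
    return ttft, itls

ttft, itls = token_metrics(0, [78, 106, 134, 162])
print(ttft)   # 78 (ms)
print(itls)   # [28, 28, 28]
print(percentile(list(range(1, 101)), 99))  # 99
```

Collect one `token_metrics` result per request, then take P99 over all TTFTs and over the pooled ITLs to get the tail figures in the report.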
Pre-deployment Checklist
- Baseline performance measured
- SLO targets defined and verified
- Auto-scaling tested under load
- Failure scenarios simulated
- Monitoring and alerting configured
- Documentation updated
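The "SLO targets defined and verified" item can be automated as a gate in CI. A minimal sketch, assuming hypothetical target values and result keys:

```python
# Hypothetical SLO targets and result keys -- adjust to your own definitions.
SLO_TARGETS = {
    "p99_ttft_ms": 200,   # p99 time to first token
    "p99_itl_ms": 50,     # p99 inter-token latency
    "error_rate": 0.001,  # fraction of failed requests
}

def verify_slos(results, targets=SLO_TARGETS):
    """Return the names of violated SLOs; an empty list means the run passes."""
    return [
        name for name, limit in targets.items()
        if results.get(name, float("inf")) > limit
    ]

# The sample report above would pass: P99 TTFT 156ms, P99 ITL 45ms, 0.01% errors.
run = {"p99_ttft_ms": 156, "p99_itl_ms": 45, "error_rate": 0.0001}
print(verify_slos(run))  # []
```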
End of course - Congratulations on completing the LLM Deployment course!