Chapter 8 of 8
Chapter 8: Design Benchmark
Designing load tests and measuring the performance of an LLM serving system
Load Testing Tools
Locust
```python
from locust import HttpUser, task, between

class LLMUser(HttpUser):
    # Each simulated user waits 1-3 seconds between requests
    wait_time = between(1, 3)

    @task
    def generate(self):
        self.client.post(
            "/v1/completions",
            json={
                "model": "llama-2-7b",
                "prompt": "Hello",
                "max_tokens": 100,
            },
        )
```
llmperf / vLLM serving benchmark
Note that llmperf is a standalone tool from the Ray project rather than a vLLM module; vLLM ships its own serving benchmark script with a similar purpose. A typical invocation of vLLM's script (exact flags vary by version):
```shell
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --base-url http://localhost:8000 \
    --num-prompts 100 \
    --save-result
```
Benchmark Scenarios
- Baseline Test - Single user, measure basic performance
- Load Test - Expected traffic, verify SLOs
- Stress Test - Beyond capacity, find breaking point
- Soak Test - Long duration, check for memory leaks
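The four scenarios differ mainly in concurrency and duration. As a rough sketch (the numbers below are illustrative assumptions, not targets from this course):

```python
# Illustrative scenario parameters -- the numbers are assumptions for this
# sketch, not prescribed values; derive real ones from your SLOs and traffic.
SCENARIOS = {
    "baseline": {"users": 1,    "duration_s": 300,   "goal": "measure basic performance"},
    "load":     {"users": 200,  "duration_s": 1800,  "goal": "verify SLOs at expected traffic"},
    "stress":   {"users": 1000, "duration_s": 900,   "goal": "find the breaking point"},
    "soak":     {"users": 200,  "duration_s": 28800, "goal": "check for memory leaks"},
}

for name, cfg in SCENARIOS.items():
    print(f"{name}: {cfg['users']} users for {cfg['duration_s']}s -- {cfg['goal']}")
```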
Results Analysis
A benchmark report should surface averages and tail percentiles for time to first token (TTFT) and inter-token latency (ITL), plus the observed bottleneck and a recommendation:
```
Benchmark Results:
==================
Requests/sec: 1250
Avg TTFT:     78ms
Avg ITL:      28ms
P99 TTFT:     156ms
P99 ITL:      45ms
Error Rate:   0.01%

Bottleneck: KV cache memory at 85% utilization
Recommendation: Add 2 more GPUs or implement request queuing
```
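TTFT and ITL figures like those above can be derived from per-token arrival timestamps collected during the run. A minimal sketch using nearest-rank percentiles (function names are my own, not from any benchmarking tool):

```python
def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples <= it."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

def token_metrics(request_start, token_times):
    """TTFT and inter-token latencies for one streamed response.

    token_times: arrival time of each generated token, on the same clock
    as request_start (here, synthetic millisecond timestamps).
    """
    ttft = token_times[0] - request_start
    itls = [b - a for a, b in zip(token_times, token_times[1:])]
    return ttft, itls

ttft, itls = token_metrics(0, [78, 106, 134, 162])
print(ttft)   # 78 (ms)
print(itls)   # [28, 28, 28]
print(percentile(list(range(1, 101)), 99))  # 99
```

Collect one `token_metrics` result per request, then take P99 over all TTFTs and over the pooled ITLs to get the tail figures in the report.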
Pre-deployment Checklist
- Baseline performance measured
- SLO targets defined and verified
- Auto-scaling tested under load
- Failure scenarios simulated
- Monitoring and alerting configured
- Documentation updated
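The "SLO targets defined and verified" item can be automated as a gate in CI. A minimal sketch, assuming hypothetical target values and result keys:

```python
# Hypothetical SLO targets and result keys -- adjust to your own definitions.
SLO_TARGETS = {
    "p99_ttft_ms": 200,   # p99 time to first token
    "p99_itl_ms": 50,     # p99 inter-token latency
    "error_rate": 0.001,  # fraction of failed requests
}

def verify_slos(results, targets=SLO_TARGETS):
    """Return the names of violated SLOs; an empty list means the run passes."""
    return [
        name for name, limit in targets.items()
        if results.get(name, float("inf")) > limit
    ]

# The sample report above would pass: P99 TTFT 156ms, P99 ITL 45ms, 0.01% errors.
run = {"p99_ttft_ms": 156, "p99_itl_ms": 45, "error_rate": 0.0001}
print(verify_slos(run))  # []
```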
End of course - Congratulations on completing the LLM Deployment course!