Figure: Getting started with Nvidia AIPerf
The new Nvidia AIPerf tool is an excellent free tool for LLM performance testing. You can customise it to suit your needs, and it is a big step up from other tools, especially if you use Nvidia GPUs.
How to install the Nvidia AIPerf tool?
The simplest way to install AIPerf is with pip from the official PyPI index.
python3 -m venv venv
source venv/bin/activate
pip install aiperf
You can now call the tool from the command line as shown below. This example uses the llama3.2:3b model hosted on Ollama. You will need a Hugging Face (HF) token to access the Llama model, and Meta has to approve your use case first. I am benchmarking the model on an NVIDIA GeForce GTX 1660 Super, which is a fairly low-end consumer graphics card.
aiperf profile --model llama3.2:3b --url http://192.168.50.149:11434 \
  --endpoint-type chat \
  --tokenizer meta-llama/Llama-3.2-3B-Instruct \
  --synthetic-input-tokens-mean 128 \
  --synthetic-input-tokens-stddev 10 \
  --output-tokens-mean 50 \
  --output-tokens-stddev 5 \
  --concurrency 1 \
  --request-count 100 \
  --artifact-dir ./results/5250882b_c1_llama3.2_3b
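If you want to repeat the run at different concurrency levels, it can help to build the command from a script. The following is a minimal sketch, not part of AIPerf itself: `build_aiperf_command` and `RUN_BENCHMARK` are names I made up for illustration, and the only flags used are the ones from the command above.

```python
import shutil
import subprocess


def build_aiperf_command(model, url, tokenizer, artifact_dir,
                         concurrency=1, request_count=100):
    """Return the argument list for the `aiperf profile` run from this post."""
    return [
        "aiperf", "profile",
        "--model", model,
        "--url", url,
        "--endpoint-type", "chat",
        "--tokenizer", tokenizer,
        "--synthetic-input-tokens-mean", "128",
        "--synthetic-input-tokens-stddev", "10",
        "--output-tokens-mean", "50",
        "--output-tokens-stddev", "5",
        "--concurrency", str(concurrency),
        "--request-count", str(request_count),
        "--artifact-dir", artifact_dir,
    ]


cmd = build_aiperf_command(
    model="llama3.2:3b",
    url="http://192.168.50.149:11434",
    tokenizer="meta-llama/Llama-3.2-3B-Instruct",
    artifact_dir="./results/5250882b_c1_llama3.2_3b",
)
print(" ".join(cmd))

# Hypothetical opt-in switch: the benchmark only starts when this is True
# and the aiperf executable is found on PATH.
RUN_BENCHMARK = False
if RUN_BENCHMARK and shutil.which("aiperf"):
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems, and the opt-in switch keeps the script from hitting the server by accident.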
The binary & subcommand
- aiperf profile: runs a benchmarking profile that sends N requests, measures latency and throughput, and writes the results to disk.
Target
- --model llama3.2:3b: which model to call (passed as the model field in every API request)
- --url http://192.168.50.149:11434: the Ollama server to hit
- --endpoint-type chat: use /v1/chat/completions (not /v1/completions). Required because Ollama rejects array-format prompts on the completions endpoint.
Synthetic workload shape
- --tokenizer meta-llama/Llama-3.2-3B-Instruct: tokenizer used to measure prompt lengths (requires HF_TOKEN). AIPerf generates synthetic prompts and uses this tokenizer to hit the token targets.
- --synthetic-input-tokens-mean 128 / --synthetic-input-tokens-stddev 10: each request's input prompt will be ~128 ± 10 tokens.
- --output-tokens-mean 50 / --output-tokens-stddev 5: ask the model to generate ~50 ± 5 tokens per response.
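To get a feel for what a mean and standard deviation do to the workload, here is a small sketch that draws per-request token lengths. It assumes the lengths follow a normal distribution, which is my assumption and not something I have verified in AIPerf's source; `sample_token_lengths` is a hypothetical helper.

```python
import random
import statistics


def sample_token_lengths(mean, stddev, count, seed=0):
    """Draw `count` token lengths from a normal distribution (assumed shape)."""
    rng = random.Random(seed)
    # Round to whole tokens and never go below one token.
    return [max(1, round(rng.gauss(mean, stddev))) for _ in range(count)]


input_lengths = sample_token_lengths(mean=128, stddev=10, count=100)
output_lengths = sample_token_lengths(mean=50, stddev=5, count=100, seed=1)

print("input  min/mean/max:", min(input_lengths),
      round(statistics.mean(input_lengths), 1), max(input_lengths))
print("output min/mean/max:", min(output_lengths),
      round(statistics.mean(output_lengths), 1), max(output_lengths))
```

With only 100 requests the sample mean will not land exactly on the target, which is why a reported average a couple of tokens off the configured mean is nothing to worry about.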
Load shape
- --concurrency 1: one request in flight at a time (serial). Tests the single-request latency baseline.
- --request-count 100: send 100 total requests.
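If "concurrency" is a new idea, the sketch below simulates it with asyncio. It does not call AIPerf or Ollama: `run_load` is a hypothetical helper, and `asyncio.sleep` stands in for a real HTTP request. The point is only that a concurrency limit caps how many requests are in flight at once.

```python
import asyncio


async def run_load(request_count, concurrency, latency_s=0.01):
    """Simulate `request_count` requests with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak_in_flight = 0

    async def one_request():
        nonlocal in_flight, peak_in_flight
        async with semaphore:
            in_flight += 1
            peak_in_flight = max(peak_in_flight, in_flight)
            await asyncio.sleep(latency_s)  # stand-in for a real request
            in_flight -= 1

    await asyncio.gather(*(one_request() for _ in range(request_count)))
    return peak_in_flight


peak_serial = asyncio.run(run_load(request_count=10, concurrency=1))
peak_parallel = asyncio.run(run_load(request_count=10, concurrency=4))
print("peak in flight at concurrency 1:", peak_serial)
print("peak in flight at concurrency 4:", peak_parallel)
```

At concurrency 1 every request waits for the previous one to finish, so the total run time is roughly the request count multiplied by the average latency.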
Figure: Nvidia AIPerf in action
Output
- --artifact-dir ./results/5250882b_c1_llama3.2_3b: writes results here: profile_export_aiperf.json (metrics), logs/aiperf.log (process log).
What you'll get out of it: p50/p90/p99 time to first token (TTFT), inter-token latency (ITL), total request latency, and throughput (tokens/sec). TTFT will be null unless you add --streaming. Without it, the model completes the full response before returning, so there's no first-token signal.
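Once the run finishes you can pull numbers out of profile_export_aiperf.json with a few lines of Python. I am deliberately not assuming any particular layout for that file here: the hypothetical `find_key` helper just walks whatever JSON it finds and reports keys whose names contain a search word. The search words themselves ("latency", "throughput", "first_token") are guesses, so adjust them to what your export actually contains.

```python
import json
from pathlib import Path


def find_key(node, wanted):
    """Yield (path, value) for every key in nested JSON whose name contains `wanted`."""
    stack = [("", node)]
    while stack:
        path, current = stack.pop()
        if isinstance(current, dict):
            for key, value in current.items():
                child_path = f"{path}.{key}" if path else key
                if wanted in key.lower():
                    yield child_path, value
                stack.append((child_path, value))
        elif isinstance(current, list):
            for index, value in enumerate(current):
                stack.append((f"{path}[{index}]", value))


export_path = Path("./results/5250882b_c1_llama3.2_3b") / "profile_export_aiperf.json"
if export_path.exists():
    data = json.loads(export_path.read_text())
    for needle in ("latency", "throughput", "first_token"):  # guessed key fragments
        for path, value in find_key(data, needle):
            print(path, "=", value)
else:
    print("no export found at", export_path)
```

Searching by key fragment is crude, but it keeps working if the export format changes between AIPerf versions.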
Figure: Nvidia AIPerf final stats
The final report gives you a complete picture of how your model performed.
- Benchmark run completed against an Ollama server hosting llama3.2:3b at 192.168.50.149, using NVIDIA AIPerf at concurrency 1 over 100 requests.
- Average request latency was 3,533 ms (p50 3,444 ms, p99 6,428 ms), output token throughput 57.81 tokens/sec, request throughput 0.28 req/sec. Input sequence length tracked the target closely (avg 130 tokens against a 128 ± 10 target). Total benchmark duration was 353.6 seconds.
- The token-count discrepancy warning (100% of requests exceeding the 10% threshold) reflects a known mismatch between client-side Hugging Face tokenization (meta-llama/Llama-3.2-3B-Instruct) and Ollama's GGUF tokenizer accounting, which includes chat template and special tokens differently. This is expected on short sequences and does not indicate a measurement problem. The threshold can be raised via AIPERF_METRICS_USAGE_PCT_DIFF_THRESHOLD if the warning is noisy.
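A quick way to check that the reported numbers hang together is to recompute them from each other. The sketch below uses only the figures quoted above. The last calculation rests on an assumption of mine, namely that output token throughput means total output tokens divided by total duration, so treat that result as a cross-check against the output sequence length in your own report and not as a measured value.

```python
# Figures taken from the report summarised above.
request_count = 100
concurrency = 1
avg_latency_ms = 3533.0
duration_s = 353.6
reported_request_throughput = 0.28        # req/sec
reported_output_token_throughput = 57.81  # tokens/sec

# Requests per second follows directly from count and duration.
request_throughput = request_count / duration_s

# At concurrency 1 the requests run back to back, so the run should take
# about request_count * average latency.
serial_duration_s = request_count * (avg_latency_ms / 1000.0) / concurrency

# Assumption: throughput = total output tokens / duration.
implied_output_tokens_per_request = (
    reported_output_token_throughput * duration_s / request_count
)

print("request throughput (req/sec):", round(request_throughput, 3))
print("expected serial duration (s):", round(serial_duration_s, 1))
print("implied output tokens per request:", round(implied_output_tokens_per_request, 1))
```

If the implied output length differs a lot from the --output-tokens-mean you asked for, compare it with the output sequence length that AIPerf reports before drawing conclusions from the tokens/sec figure.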



