Figure: Reduce CPU spikes - AI Summarization
Summarization aims to compress a lengthy source document into a concise format while retaining its core components and key ideas.
However, when you are hosting your own LLM, handling CPU spikes (in the absence of a GPU) can be your biggest concern.
The Core Problem: The Prefill Trap
Figure: All cores touching 100% usage during a summarization call
We’ve all been there: you spin up an LLM locally or on an internal server using Ollama (for example), throw a couple of simple prompt tests at it, and it responds beautifully. It feels like magic.
Then, you deploy it into a real-world pipeline to summarize long-form articles, and reality hits.
The moment a few concurrent articles hit your endpoint, your server completely freezes. Your application logs stop scrolling, API calls time out, and your CPU utilization graph shoots up to a flat 100% line.
If this is happening to you, you are trapped in the Prefill Phase Bottleneck. Here is exactly why it happens and the configuration choices we used to fix it.
When performing an article summarization task, your prompt contains thousands of words of input text. Before an LLM can generate its first token of summary, it has to process the entire article to compute the attention keys and values for every input token. These are stored in a structure known as the Key-Value (KV) Cache.
By default, Ollama optimizes for low-latency interactive chat, not heavy document ingestion. For a long summarization task, this default behavior forces your system to:
- Map massive, uncompressed precision data arrays into system memory.
- Max out memory bandwidth trying to calculate global self-attention.
- Keep the model sitting idle in memory, blocking resources from other processes.
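One way to check whether prefill, rather than generation, is what saturates the CPU is to compare how long each phase takes. The sketch below assumes the timing fields Ollama returns from /api/generate (prompt_eval_count, prompt_eval_duration, eval_count, eval_duration, with durations in nanoseconds). The helper names and the sample numbers are made up for illustration and are not measurements from this article.

```shell
#!/bin/sh
# tokens_per_sec COUNT DURATION_NS
# Converts a token count and a duration in nanoseconds into whole tokens per second.
tokens_per_sec() {
  count=$1
  duration_ns=$2
  if [ "$duration_ns" -le 0 ]; then
    echo 0
    return
  fi
  echo $(( count * 1000000000 / duration_ns ))
}

# prefill_share PROMPT_EVAL_NS EVAL_NS
# Percentage of compute time spent ingesting the prompt (rounded down).
prefill_share() {
  echo $(( $1 * 100 / ($1 + $2) ))
}

# Hypothetical request: 3000 prompt tokens ingested in 60 s,
# 80 summary tokens generated in 8 s.
echo "prefill:       $(tokens_per_sec 3000 60000000000) tok/s"
echo "generation:    $(tokens_per_sec 80 8000000000) tok/s"
echo "prefill share: $(prefill_share 60000000000 8000000000)%"
```

If the prefill share dominates for your own requests, the tweaks that follow target the right phase.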
To turn Ollama into a lean, mean summarization machine, we need to alter its fundamental configuration. Here are the precise tweaks that saved our infrastructure.
The Solution
Part 1: Restricting Ollama Core Consumption
When you pass a massive article (several thousand words) into a local LLM, the model doesn't just start generating text immediately. It first has to process the entire input prompt to fill the Key-Value (KV) Cache with the attention keys and values for every input token. This is called the Prefill (Ingestion) Phase.
While generation is sequential, prefill is massively compute-heavy. By default, Ollama tries to perform this ingestion as aggressively as possible, attempting to maximize utilization of all available cores, including virtual/hyperthreaded cores. On a standard server CPU, this creates massive thread contention, memory bandwidth bottlenecks, and ultimately system instability.
Our solution required a two-part offensive: forcing the model to be resource-conscious and ensuring that this specific model was used every single time.
Our first breakthrough was moving away from the generic llama3.2:3b and creating our own optimized, stable variant: llama3.2:3b-clean.
For these per-model limits we didn't rely on environment variables, which can change or be misconfigured. Instead, we hard-coded the resource limits directly into the model's DNA using a custom Modelfile. Here is the exact definition:
# /config/Modelfile
FROM llama3.2:3b
# --- RESOURCE STABILITY TWEAKS ---
# 1. HARD-CAP THREADS. Sets a strict ceiling on the number of CPU threads used for inference.
# Matching the physical core count prevents thrashing on hyperthreaded/logical cores.
# Use your physical core count (e.g., 2, 3, or 4), not logical.
PARAMETER num_thread 3
# 2. OPTIMIZE BATCH SIZE. Limits how many tokens are processed simultaneously in a single batch.
# A smaller batch slows down ingestion, spreading the CPU load.
PARAMETER num_batch 128
# --- APPLICATION TWEAKS ---
# 3. SET CONTEXT CEILING. Establishes a rigid, predictable memory ceiling for the input.
PARAMETER num_ctx 4096
# 4. ENFORCE CONCISENESS. Prevents the model from generating long, unnecessary text.
PARAMETER num_predict 80
Why This Works (Technical Breakdown)
- num_thread 3: The most critical tweak. Many server CPUs have dozens of logical cores. Allowing Ollama to use 24 or 32 threads during the ingestion phase for a 3B model is massive overkill. It invites cache misses and severe thread contention. We set this to 3, matching our exact physical core allocation for this service. This provides a clear, focused path for the processor.
- num_batch 128: The default can be much higher. A large batch size lets Ollama try to ingest massive chunks of text at once, causing a violent CPU spike. Lowering this to 128 forces the ingestion phase to be more of a "stream" than a "flood," smoothing out the load.
- num_ctx 4096: Newer models can have a context length of 128k or higher. Without a cap, your server might attempt to pre-allocate a very large KV cache, leading to immediate memory pressure. 4096 is more than enough for summarizing standard articles.
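To pick a value for num_thread, it helps to know how many physical cores the host has, as opposed to logical ones. A minimal sketch, assuming a Linux host with lscpu available; the function name count_physical_cores is made up for illustration. Note that lscpu reports the host topology, so inside a container with a CPU limit the allocation for the service may be smaller than what it prints.

```shell
#!/bin/sh
# count_physical_cores
# Reads `lscpu -p=Core,Socket` output on stdin (one "core,socket" line per
# logical CPU, comment lines start with '#') and prints the number of
# distinct physical cores.
count_physical_cores() {
  grep -v '^#' | sort -u | wc -l | tr -d ' '
}

# Only query the machine when lscpu exists, so the script is safe elsewhere.
if command -v lscpu >/dev/null 2>&1; then
  echo "physical cores: $(lscpu -p=Core,Socket | count_physical_cores)"
fi
```

On a hyperthreaded machine the printed number is typically half of what a logical CPU count reports, and it is the physical figure (or your container's share of it) that belongs in the Modelfile.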
Tweaking Specific Ollama Parameters
Setting these environment variables before running ollama serve completely altered how our system handles long-form inputs.
1. OLLAMA_FLASH_ATTENTION: true
Standard attention scales quadratically with document length, frequently accessing slow system memory layers. By turning on Flash Attention, Ollama switches to an optimized, memory-efficient algorithm that computes attention in small blocks instead of materializing the full attention matrix.
- Why it matters for CPU: It dramatically minimizes redundant memory reads and writes, keeping hot data exactly where it belongs—in your processor's high-speed caches.
2. OLLAMA_KV_CACHE_TYPE: q4_0
By default, Ollama builds the KV cache using f16 (Float16) precision. For a long article, a high-precision cache eats massive chunks of memory space and bandwidth.
- Why it matters for CPU: Setting this to q4_0 compresses the context cache down to 4-bit quantization. This single change slashes the memory footprint of your article's context by roughly 75%, reducing memory thrashing and keeping processing fast and light. Ollama only applies a quantized KV cache when Flash Attention is enabled, so keep the two settings together.
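The saving can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes Llama 3.2 3B uses 28 layers, 8 KV heads, and a head dimension of 128 (taken from the published model configuration, not from this article), that f16 needs 64 bytes per block of 32 values, and that q4_0 packs the same block into 18 bytes. The function name kv_cache_mib is made up for illustration.

```shell
#!/bin/sh
# kv_cache_mib CTX_TOKENS BYTES_PER_32_VALUES
# Estimates KV cache size in MiB for an assumed Llama 3.2 3B layout.
kv_cache_mib() {
  ctx=$1
  bytes_per_block=$2
  layers=28      # assumption: transformer layers
  kv_heads=8     # assumption: key/value heads per layer
  head_dim=128   # assumption: dimension of each head
  # One key vector and one value vector per head, per layer, per token.
  values_per_token=$(( 2 * layers * kv_heads * head_dim ))
  echo $(( ctx * values_per_token / 32 * bytes_per_block / 1048576 ))
}

echo "f16,  4096 tokens:   $(kv_cache_mib 4096 64) MiB"
echo "q4_0, 4096 tokens:   $(kv_cache_mib 4096 18) MiB"
echo "f16,  131072 tokens: $(kv_cache_mib 131072 64) MiB"
```

Under these assumptions the cache drops from 448 MiB to 126 MiB at a 4096-token context, about a 72% reduction, which is in line with the rough 75% figure above. The last line shows why an uncapped 128k window is a problem on a small server.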
3. OLLAMA_CONTEXT_LENGTH: 4096
Some newer open-source models support massive context lengths by default (up to 32k or 128k tokens). If you pass text into Ollama without establishing a boundary, the server may try to pre-allocate a very large block of memory to support an unnecessarily large window.
- Why it matters for CPU: Capping the length to 4096 tokens sets a predictable resource ceiling tailored for standard articles (roughly 3,000 words). Ollama knows exactly how much memory to allocate without over-provisioning.
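A capped window only helps if the article actually fits in it. The sketch below is a rough pre-flight check, assuming the common rule of thumb of about 0.75 English words per token (which is what "4096 tokens, roughly 3,000 words" implies) and reserving room for the 80 output tokens set by num_predict. The function names are made up for illustration, and the estimate ignores the tokens used by the summarization instruction itself.

```shell
#!/bin/sh
# estimate_tokens WORDS
# Rough token estimate: about 4 tokens for every 3 words (an approximation,
# real counts depend on the tokenizer and the text).
estimate_tokens() {
  echo $(( $1 * 4 / 3 ))
}

# fits_context WORDS NUM_CTX NUM_PREDICT
# Succeeds when the estimated prompt plus the reserved output fits the window.
fits_context() {
  needed=$(( $(estimate_tokens "$1") + $3 ))
  [ "$needed" -le "$2" ]
}

for words in 1200 3000 3100; do
  if fits_context "$words" 4096 80; then
    echo "$words words: fits"
  else
    echo "$words words: too long, trim or chunk the article first"
  fi
done
```

In a pipeline the word count could come from something like `wc -w < article.txt`, where article.txt is a placeholder file name.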
4. OLLAMA_KEEP_ALIVE: 0s
Ollama normally holds a loaded model in memory for 5 minutes after a request ends, waiting for a follow-up prompt. In a multi-step backend pipeline or microservice architecture, an idle model sitting in system RAM or VRAM prevents other services or subsequent system processes from reclaiming that hardware compute.
- Why it matters for CPU: Setting this to 0s instructs Ollama to unload the model weights immediately after generating the summary. Resources are freed up instantly, preventing cumulative server degradation if your pipeline runs on a scheduled interval.
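The same unload behavior can also be requested per call: the /api/generate endpoint accepts a keep_alive field that overrides the server-wide setting, and a request with no prompt and keep_alive set to 0 unloads the model. A minimal sketch, assuming the server listens on the default port 11434; the OLLAMA_URL variable and the function names are made up for illustration.

```shell
#!/bin/sh
# Assumption: Ollama is reachable at this address (its default port).
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# unload_payload MODEL
# Prints the JSON body that asks Ollama to unload MODEL right away.
unload_payload() {
  printf '{"model": "%s", "keep_alive": 0}\n' "$1"
}

# unload_model MODEL
# Sends the unload request. Not called here, so the script runs without a server.
unload_model() {
  curl -s "$OLLAMA_URL/api/generate" -d "$(unload_payload "$1")"
}

unload_payload "llama3.2:3b-clean"
```

Running `unload_model llama3.2:3b-clean` followed by `ollama ps` should show that no model is loaded anymore.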
Whether you run Ollama in a Docker container, as a systemd service, or on k8s, the target configuration should look like this:
# Example: Docker Compose configuration snippet
services:
  ollama:
    image: ollama/ollama
    environment:
      - OLLAMA_FLASH_ATTENTION=true
      - OLLAMA_KV_CACHE_TYPE=q4_0
      - OLLAMA_CONTEXT_LENGTH=4096
      - OLLAMA_KEEP_ALIVE=0s
    ports:
      - "11434:11434"
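For the systemd case, the same four variables can go into a drop-in override for the service. A minimal sketch, assuming the unit is named ollama.service (as created by the official Linux installer); the function name print_ollama_dropin is made up for illustration, and the script only prints the file content rather than writing anything.

```shell
#!/bin/sh
# print_ollama_dropin
# Prints a systemd drop-in that sets the same variables as the Compose snippet.
print_ollama_dropin() {
  cat <<'EOF'
[Service]
Environment="OLLAMA_FLASH_ATTENTION=true"
Environment="OLLAMA_KV_CACHE_TYPE=q4_0"
Environment="OLLAMA_CONTEXT_LENGTH=4096"
Environment="OLLAMA_KEEP_ALIVE=0s"
EOF
}

print_ollama_dropin
```

To apply it, the output could be saved as /etc/systemd/system/ollama.service.d/override.conf, followed by `systemctl daemon-reload` and `systemctl restart ollama`.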
Part 2: The Automation Script
Our optimized Modelfile is only useful if it's always used. In a DevOps or automated infrastructure environment (like Docker Compose or a k3s cluster), you can't rely on a developer manually running ollama create every time.
We created an Initialization Shell Script that runs as an automated setup job before the main application service starts. This script is what actually downloads the base model and compiles our resource-conscious llama3.2:3b-clean variant.
#!/bin/sh
# file: init_optimized_model.sh
echo "Starting temporary initialization server..."
# Start the Ollama background process and remember its PID for cleanup
ollama serve &
SERVER_PID=$!
# STEP 1: Wait for local server instance to respond.
# We don't want to proceed until we can communicate with the engine.
until ollama list >/dev/null 2>&1; do
  echo "Waiting for engine to respond..."
  sleep 2
done
# STEP 2: Download and preserve the raw base model.
# This makes sure the 3B weights are present on disk.
echo "1. Downloading raw base model..."
ollama pull llama3.2:3b || { kill "$SERVER_PID"; exit 1; }
# STEP 3: Compile optimized 4K CPU model variant.
# This runs 'ollama create' using our /config/Modelfile.
# This creates the stable model variant we will use in production.
echo "2. Compiling optimized 4K CPU model variant..."
ollama create llama3.2:3b-clean -f /config/Modelfile || { kill "$SERVER_PID"; exit 1; }
echo "Initialization complete. Terminating setup process."
# Stop the temporary server so the main application server can start cleanly.
kill "$SERVER_PID"
wait "$SERVER_PID" 2>/dev/null
# The setup script exits, and we can now start the main application server.
exit 0
This script ensures that whenever our infrastructure is deployed, our specific, optimized model is built and ready, guaranteeing predictability from day one.
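A deployment can also confirm that the variant it is about to serve really carries the limits from the Modelfile. The sketch below assumes `ollama show --parameters MODEL` prints one "name value" pair per line; the function names has_param and verify_model are made up for illustration, and the expected values are the ones from /config/Modelfile.

```shell
#!/bin/sh
# has_param DUMP NAME VALUE
# Succeeds when DUMP (text with one "name value" pair per line) contains
# a line whose first field is NAME and whose second field is VALUE.
has_param() {
  printf '%s\n' "$1" | awk -v name="$2" -v value="$3" \
    '$1 == name && $2 == value { found = 1 } END { exit !found }'
}

# verify_model MODEL
# Fetches the model's parameters and checks the four limits we baked in.
# Not called here, so the script can be loaded without a running server.
verify_model() {
  dump=$(ollama show --parameters "$1") || return 1
  for pair in "num_thread 3" "num_batch 128" "num_ctx 4096" "num_predict 80"; do
    name=${pair% *}
    value=${pair#* }
    if ! has_param "$dump" "$name" "$value"; then
      echo "missing or wrong: $pair" >&2
      return 1
    fi
  done
  echo "$1 carries the expected limits"
}
```

Calling `verify_model llama3.2:3b-clean` at the end of the initialization script, before the temporary server is stopped, would make a misbuilt variant fail the deployment instead of surfacing later as a CPU spike.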
The Results
After moving from stock llama3.2:3b to our custom llama3.2:3b-clean variant (deployed via script), we saw an immediate and dramatic transformation under real summarization loads:
Figure: CPU spikes solved for summarization
You can see that one core is always free thanks to our implementation.
Figure: Results after optimizations for summarization
Final Takeaway
AI architectures don't always require throwing expensive, high-tier hardware at the problem. Often, the difference between an unstable, crashing microservice and a production-grade backend engine comes down to tuning memory bandwidth and context management.
By compressing your KV cache, enforcing rigid token caps, and pruning inactive model runtimes, you can transform Ollama into a highly predictable, rock-solid article summarizer that plays nice with the rest of your server ecosystem.
P.S. -> Written using Gemini and Claude AI tools
