How to Reduce CPU Spikes for AI Summarization


Summarization aims to compress a lengthy source document into a concise format while retaining its core components and key ideas.

However, when you are hosting your own LLM, handling CPU spikes (in the absence of a GPU) can be your biggest concern.


The Core Problem: The Prefill Trap

[Figure: CPU at 100% peak. All cores touch 100% usage during a summarization call.]

We’ve all been there: you spin up an LLM locally or on an internal server using Ollama (for example), throw a couple of simple prompt tests at it, and it responds beautifully. It feels like magic.

Then, you deploy it into a real-world pipeline to summarize long-form articles, and reality hits.

The moment a few concurrent articles hit your endpoint, your server completely freezes. Your application logs stop scrolling, API calls time out, and your CPU utilization graph shoots up to a flat 100% line.

If this is happening to you, you are trapped in the Prefill Phase Bottleneck. Here is exactly why it happens and the configuration choices we used to fix it.

When performing an article summarization task, your prompt contains thousands of words of input text. Before an LLM can generate its first summary token, it has to process the entire article to compute the attention keys and values for every input token, which it stores in the Key-Value (KV) Cache. This is the prefill phase.

By default, Ollama optimizes for low-latency interactive chat, not heavy document ingestion. For a long summarization task, this default behavior forces your system to:

  1. Hold a full-precision (f16) KV cache for the whole article in system memory.
  2. Max out memory bandwidth while computing self-attention across the full prompt.
  3. Keep the model sitting idle in memory afterwards, holding RAM that other processes could use.
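To get a feel for why a long article hurts so much more than a quick chat prompt, here is a rough back-of-the-envelope sketch. It assumes causal self-attention, where each token attends to itself and every earlier token, and it ignores the rest of the model (which scales roughly linearly with prompt length). The two token counts are illustrative values, not measurements from our setup.

```python
def causal_attention_pairs(n_tokens: int) -> int:
    """Number of (query, key) pairs scored in one causal attention pass."""
    if n_tokens < 0:
        raise ValueError("n_tokens must be non-negative")
    # Token i attends to tokens 1..i, so the total is 1 + 2 + ... + n.
    return n_tokens * (n_tokens + 1) // 2


CHAT_PROMPT_TOKENS = 50       # hypothetical short test prompt
ARTICLE_PROMPT_TOKENS = 3500  # hypothetical long-form article

ratio = (causal_attention_pairs(ARTICLE_PROMPT_TOKENS)
         / causal_attention_pairs(CHAT_PROMPT_TOKENS))
print(f"The article needs about {ratio:.0f}x the attention work of the chat prompt")
```

A prompt that is 70 times longer costs thousands of times more attention work per layer, which is why a model that feels instant in a chat test can pin every core on a real article.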

To turn Ollama into a lean, mean summarization machine, we need to alter its fundamental configuration. Here are the precise tweaks that saved our infrastructure.

The Solution

Part 1: Restricting Ollama Core Consumption

When you pass a massive article (several thousand words) into a local LLM, the model doesn't start generating text immediately. It first has to process the entire input prompt to fill the Key-Value (KV) Cache with the attention keys and values of every input token. This is called the Prefill (Ingestion) Phase.

While generation is sequential, producing one token at a time, prefill processes the whole prompt in large parallel batches and is massively compute-heavy. Left uncapped, Ollama performs this ingestion as aggressively as possible and can end up spreading work across every core it sees, including virtual/hyperthreaded cores. On a standard server CPU, this creates heavy thread contention, memory bandwidth bottlenecks, and ultimately system instability.
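If you are not sure how many physical cores your host has, a small sketch like the following can help you pick a thread cap. It assumes an x86 Linux host whose /proc/cpuinfo exposes the "physical id" and "core id" fields; on ARM boards and some virtual machines those fields are missing, in which case it falls back to the logical count. The function names are our own, and container CPU limits (cgroups) are not reflected here.

```python
import os


def count_physical_cores(cpuinfo_text: str) -> int:
    """Count unique (physical id, core id) pairs in /proc/cpuinfo-style text."""
    cores = set()
    physical_id = core_id = None
    # The extra "" guarantees the last processor block gets flushed.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            # A blank line ends one logical-processor block.
            if physical_id is not None and core_id is not None:
                cores.add((physical_id, core_id))
            physical_id = core_id = None
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "physical id":
            physical_id = value.strip()
        elif key == "core id":
            core_id = value.strip()
    return len(cores)


def physical_cores_on_this_host() -> int:
    """Best-effort physical core count, falling back to the logical count."""
    try:
        with open("/proc/cpuinfo") as fh:
            found = count_physical_cores(fh.read())
    except OSError:
        found = 0
    return found or (os.cpu_count() or 1)


# A 2-core, 4-thread CPU: four logical processors, two unique cores.
SAMPLE = (
    "processor\t: 0\nphysical id\t: 0\ncore id\t: 0\n\n"
    "processor\t: 1\nphysical id\t: 0\ncore id\t: 1\n\n"
    "processor\t: 2\nphysical id\t: 0\ncore id\t: 0\n\n"
    "processor\t: 3\nphysical id\t: 0\ncore id\t: 1\n"
)
print(count_physical_cores(SAMPLE))
```

Whatever number this reports is an upper bound for the thread cap, not a target. We deliberately stayed below the host total so other services keep a core to themselves.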

Our solution required a two-part offensive: forcing the model to be resource-conscious and ensuring that this specific model was used every single time.

Our first breakthrough was moving away from the generic llama3.2:3b and creating our own optimized, stable variant: llama3.2:3b-clean.

We didn't rely on environment variables alone, which can change or be misconfigured. Instead, we hard-coded the per-model resource limits directly into the model's DNA using a custom Modelfile. Here is the exact definition:

# /config/Modelfile
FROM llama3.2:3b

# --- RESOURCE STABILITY TWEAKS ---

# 1. HARD-CAP THREADS. Sets a strict ceiling on the number of CPU threads Ollama uses.
# Matching the physical core count prevents thrashing on hyperthreaded/logical cores.
# Use your physical core count (e.g., 2, 3, or 4), not logical.
PARAMETER num_thread 3

# 2. OPTIMIZE BATCH SIZE. Limits how many tokens are processed simultaneously in a single batch.
# A smaller batch slows down ingestion, spreading the CPU load.
PARAMETER num_batch 128

# --- APPLICATION TWEAKS ---

# 3. SET CONTEXT CEILING. Establishes a rigid, predictable memory ceiling for the input.
PARAMETER num_ctx 4096

# 4. ENFORCE CONCISENESS. Prevents the model from generating long, unnecessary text.
PARAMETER num_predict 80

Why This Works (Technical Breakdown)

  • num_thread 3: The most critical tweak. Many server CPUs have dozens of logical cores. Letting Ollama use 24 or 32 threads during the ingestion phase of a 3B model is massive overkill: it invites cache misses and severe thread contention. We set this to 3, matching our exact physical core allocation for this service. This keeps the model's threads from fighting each other, and the rest of the system, for cores.
  • num_batch 128: Ollama's default is higher (typically 512). A large batch size lets Ollama ingest big chunks of text at once, causing a violent CPU spike. Lowering this to 128 forces the ingestion phase to be more of a "stream" than a "flood," smoothing out the load at the cost of a slower prefill.
  • num_ctx 4096: Newer models can support context lengths of 128k or higher. The KV cache is sized for the configured context window, so an oversized window leads to immediate memory pressure. Pinning it to 4096 keeps memory use predictable and is more than enough for summarizing standard articles.
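The "stream versus flood" effect of num_batch can be sketched with simple arithmetic. This is a simplified model that assumes the prompt is ingested in fixed-size slices of num_batch tokens; the 3,500-token prompt is an illustrative value, and the total work stays the same either way.

```python
import math


def prefill_batches(prompt_tokens: int, num_batch: int) -> int:
    """How many slices the prompt is cut into during ingestion."""
    if num_batch <= 0:
        raise ValueError("num_batch must be positive")
    return math.ceil(prompt_tokens / num_batch)


PROMPT_TOKENS = 3500  # hypothetical long article

for num_batch in (512, 128):
    batches = prefill_batches(PROMPT_TOKENS, num_batch)
    print(f"num_batch={num_batch}: {batches} batches of up to {num_batch} tokens")
```

Smaller slices mean each burst of work is shorter and touches less memory at once, which is what flattens the spike. The price is more round trips through the model, so time to first token goes up.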

Tweaking Specific Ollama Parameters

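The remaining settings apply to the server process rather than to a single model, so they are set as environment variables before running ollama serve. Together they completely altered how our system handles long-form inputs.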

1. OLLAMA_FLASH_ATTENTION: true

Standard attention scales quadratically with document length and materializes a large score matrix that spills out of the CPU caches into slower system memory. With Flash Attention turned on, Ollama switches to a memory-efficient algorithm that computes attention in small blocks without ever storing the full matrix.

  • Why it matters for CPU: It cuts redundant memory reads and writes, keeping hot data where it belongs: in your processor's high-speed caches.

2. OLLAMA_KV_CACHE_TYPE: q4_0

By default, Ollama builds the KV cache using f16 (Float16) precision. For a long article, a high-precision cache eats massive chunks of memory space and bandwidth.

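  • Why it matters for CPU: Setting this to q4_0 quantizes the context cache to 4 bits per value. This single change cuts the memory footprint of your article's context by roughly 70-75%, reducing memory thrashing and keeping processing fast and light. Two caveats: Ollama only applies KV cache quantization when Flash Attention is enabled, and a 4-bit cache can cost some output quality, so compare a few summaries against f16 before committing.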
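The saving is easy to estimate yourself. The sketch below assumes the architecture of a Llama-3.2-3B-class model (28 layers, 8 key/value heads, head dimension 128); these numbers are our assumption, not something the Modelfile states, so check your own model's metadata (for example with ollama show). The quantized sizes follow the GGML block layouts, where every 32 values share one f16 scale.

```python
# Assumed architecture for a Llama-3.2-3B-class model (verify for your model).
N_LAYERS = 28
N_KV_HEADS = 8
HEAD_DIM = 128

# Bytes per cached value. q8_0 packs 32 values into 34 bytes,
# q4_0 packs 32 values into 18 bytes (values plus one f16 scale).
BYTES_PER_VALUE = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(num_ctx: int, cache_type: str) -> int:
    """Approximate KV cache size for a full context window."""
    # Factor 2: one tensor for keys and one for values, per layer.
    values = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * num_ctx
    return int(values * BYTES_PER_VALUE[cache_type])


MIB = 1024 * 1024
f16_size = kv_cache_bytes(4096, "f16")
for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_bytes(4096, cache_type)
    saving = 100 * (1 - size / f16_size)
    print(f"{cache_type}: {size / MIB:.0f} MiB ({saving:.0f}% smaller than f16)")
```

Under these assumptions a 4096-token window drops from 448 MiB to 126 MiB, a saving of about 72%. The scale stored with each block is why the real figure lands slightly under a clean 75%.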

3. OLLAMA_CONTEXT_LENGTH: 4096

Some newer open-source models support massive context lengths (up to 32k or 128k tokens). Ollama sizes the KV cache for the configured context window, so if nothing pins that window, a client or Modelfile that asks for a large one can make the server allocate far more memory than a summarization job needs.

  • Why it matters for CPU: Capping the length to 4096 tokens sets a predictable resource ceiling perfectly tailored for standard articles (roughly 3,000 words). Ollama knows exactly how much memory to allocate without over-provisioning.
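A fixed window also means the application should decide what gets cut when an article is too long, rather than leaving that to the server. Here is a minimal sketch of a word budget, assuming the rule of thumb above (4096 tokens is roughly 3,000 words) and a hypothetical 50-token allowance for the instruction text. A real tokenizer will give different counts, so treat the result as a safety margin, not an exact limit.

```python
TOKENS_PER_WORD = 4096 / 3000  # rule of thumb, not a tokenizer measurement


def max_article_words(num_ctx: int = 4096,
                      num_predict: int = 80,
                      instruction_tokens: int = 50) -> int:
    """Words of article text that fit next to the instructions and the summary."""
    budget = num_ctx - num_predict - instruction_tokens
    if budget <= 0:
        raise ValueError("context window too small for this configuration")
    return int(budget / TOKENS_PER_WORD)


def truncate_to_budget(article: str, max_words: int) -> str:
    """Keep only the first max_words whitespace-separated words."""
    words = article.split()
    return " ".join(words[:max_words])


limit = max_article_words()
article = "word " * 5000  # stand-in for a long article
trimmed = truncate_to_budget(article, limit)
print(limit, len(trimmed.split()))
```

Keeping the head of the article is the simplest policy and works well for news-style writing, where the key facts come first. Other content may need a different cut.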

4. OLLAMA_KEEP_ALIVE: 0s

Ollama normally holds a loaded model in memory for 5 minutes after a request ends, waiting for a follow-up prompt. In a multi-step backend pipeline or microservice architecture, an idle model sitting in system RAM or VRAM prevents other services or subsequent pipeline steps from reclaiming that memory.

  • Why it matters for CPU: Setting this to 0s instructs Ollama to unload the model weights immediately after generating the summary. Memory is freed right away, preventing cumulative server degradation if your pipeline runs on a scheduled interval. The trade-off is that every request pays the model load time again, which suits scheduled batch jobs better than back-to-back requests.
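One way to confirm the unload actually happens is to ask the server which models are still resident. The sketch below uses Ollama's /api/ps endpoint and assumes the server listens on the default localhost:11434; the helper names are our own, and the response handling only relies on the "models" list and each entry's "name" field.

```python
import json
import urllib.request


def loaded_model_names(ps_response: dict) -> list[str]:
    """Extract model names from a parsed /api/ps response."""
    return [model.get("name", "") for model in ps_response.get("models", [])]


def fetch_loaded_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Ask a running Ollama server which models are currently in memory."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return loaded_model_names(json.load(resp))


# Offline illustration with hand-written responses:
print(loaded_model_names({"models": [{"name": "llama3.2:3b-clean"}]}))
print(loaded_model_names({"models": []}))
```

With a keep-alive of 0s, calling fetch_loaded_models() shortly after a summary finishes should return an empty list. If the model is still listed, the setting did not reach the server process.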

Whether you run Ollama in a Docker container, as a systemd service, or on k8s, the target configuration should look like this:

# Example: Docker Compose configuration snippet
services:
  ollama:
    image: ollama/ollama
    environment:
      - OLLAMA_FLASH_ATTENTION=true
      - OLLAMA_KV_CACHE_TYPE=q4_0
      - OLLAMA_CONTEXT_LENGTH=4096
      - OLLAMA_KEEP_ALIVE=0s
    ports:
      - "11434:11434"

Part 2: The Automation Script

Our optimized Modelfile is only useful if it's always used. In a DevOps or automated infrastructure environment (like Docker Compose or a k3s cluster), you can't rely on a developer manually running ollama create every time.

We created an Initialization Shell Script that runs as an automated setup job before the main application service starts. This script is what actually downloads the base model and compiles our resource-conscious llama3.2:3b-clean variant.

#!/bin/sh
# file: init_optimized_model.sh
set -e  # abort on the first failed command

echo "Starting temporary initialization server..."
# Start the Ollama background process and remember its PID
ollama serve &
SERVER_PID=$!
# Stop the temporary server whenever this script exits, even on failure
trap 'kill "$SERVER_PID" 2>/dev/null || true' EXIT

# STEP 1: Wait for the local server instance to respond.
# We don't want to proceed until we can communicate with the engine.
until ollama list >/dev/null 2>&1; do
  echo "Waiting for engine to respond..."
  sleep 2
done

# STEP 2: Download and preserve the raw base model.
# This makes sure the 3B weights are present on disk.
echo "1. Downloading raw base model..."
ollama pull llama3.2:3b

# STEP 3: Compile optimized 4K CPU model variant.
# This runs 'ollama create' using our /config/Modelfile
# and creates the stable model variant we will use in production.
echo "2. Compiling optimized 4K CPU model variant..."
ollama create llama3.2:3b-clean -f /config/Modelfile

echo "Initialization complete. Terminating setup process."
# The setup script exits (the trap stops the temporary server),
# and we can now start the main application server.
exit 0

This script ensures that whenever our infrastructure is deployed, our specific, optimized model is built and ready, guaranteeing predictability from day one.
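Because the limits live inside llama3.2:3b-clean, the application side stays simple: it only has to name the model. The following is a minimal client sketch, assuming Ollama's /api/generate endpoint on the default localhost:11434. The prompt wording and function names are illustrative, not the exact ones from our pipeline.

```python
import json
import urllib.request

MODEL = "llama3.2:3b-clean"
GENERATE_URL = "http://localhost:11434/api/generate"  # assumed default address


def build_request(article: str) -> urllib.request.Request:
    """Build the HTTP request for one summarization call."""
    payload = {
        "model": MODEL,
        # Illustrative prompt; tune the instruction for your content.
        "prompt": f"Summarize the following article in two sentences:\n\n{article}",
        "stream": False,  # return one JSON object instead of a token stream
    }
    return urllib.request.Request(
        GENERATE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(article: str, timeout: float = 300.0) -> str:
    """Send the article to the local Ollama server and return the summary text."""
    with urllib.request.urlopen(build_request(article), timeout=timeout) as resp:
        return json.load(resp)["response"].strip()


request = build_request("Example article text.")
print(request.full_url, request.get_method())
```

No options block is sent on purpose. Thread, batch, context, and output limits all come from the Modelfile, so a misconfigured client cannot quietly lift them.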

The Results

After moving from stock llama3.2:3b to our custom llama3.2:3b-clean variant (deployed via script), we saw an immediate and dramatic transformation under real summarization loads:

[Figure: CPU spikes solved for summarization]

You can see that one core always stays free, thanks to the thread cap.

[Figure: Results after optimizations for summarization]

Final Takeaway

AI architectures don't always require throwing expensive, high-tier hardware at the problem. Often, the difference between an unstable, crashing microservice and a production-grade backend engine comes down to tuning memory bandwidth and context management.

By compressing your KV cache, enforcing rigid token caps, and pruning inactive model runtimes, you can transform Ollama into a highly predictable, rock-solid article summarizer that plays nice with the rest of your server ecosystem.

P.S. -> Written using Gemini and Claude AI tools



© 2007 - DMCA.com Protection Status
The content is copyrighted to Sundeep Machado


Note: The author is not responsible for damages related to improper use of software, techniques, tips and copyright claims.