Reduce CPU spikes - AI Summarization

Summarization aims to compress a lengthy source document into a concise format while retaining its core components and key ideas.

However, when you host your own LLM on a CPU-only server (no GPU), CPU spikes can become your biggest concern.

The Core Problem: The Prefill Trap

[Figure: All cores at 100% usage during a summarization call]

Setting up a local LLM via Ollama usually starts smoothly; simple prompt tests return quick, impressive results. However, scaling that setup for real-world tasks like long-form summarization introduces a major bottleneck. The moment multiple articles hit the endpoint simultaneously, system resources cap out. The server freezes, API calls time out, and CPU utilization spikes to a sustained 100%.

If this is happening to you, you are trapped in the Prefill Phase Bottleneck. Here is exactly why it happens and the configuration choices we used to fix it.

When you pass a long article (several thousand words) into a local LLM, the model doesn't start generating text immediately. It first has to process the entire input prompt to compute the attention keys and values for every token, which it stores in the Key-Value (KV) Cache. This is called the Prefill (Ingestion) Phase.

Generation is sequential, one token at a time, but prefill processes many prompt tokens in parallel and is far more compute-heavy. By default, Ollama runs this ingestion as aggressively as it can and may end up using every available core, including virtual/hyperthreaded cores. On a standard server CPU, this causes thread contention, memory bandwidth bottlenecks, and eventually system instability.
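To check whether prefill really dominates on your own server, you can compare the timing fields that Ollama returns with a non-streaming `/api/generate` response. The sketch below is a minimal helper, assuming the response has already been parsed into a dict; the sample numbers are made up for illustration and are not measurements from our setup.

```python
def prefill_share(response: dict) -> float:
    """Fraction of compute time spent on prompt ingestion (prefill).

    Ollama reports prompt_eval_duration (prefill) and eval_duration
    (generation) in nanoseconds. Either field may be missing, so we
    default to 0 and guard against dividing by zero.
    """
    prefill_ns = response.get("prompt_eval_duration", 0)
    decode_ns = response.get("eval_duration", 0)
    total_ns = prefill_ns + decode_ns
    return prefill_ns / total_ns if total_ns else 0.0


# Illustrative values only: a long prompt followed by a short summary.
sample = {
    "prompt_eval_count": 3500,
    "prompt_eval_duration": 45_000_000_000,  # 45 s spent on prefill
    "eval_count": 80,
    "eval_duration": 5_000_000_000,          # 5 s spent generating
}

print(f"prefill share: {prefill_share(sample):.0%}")
```

If the share is close to 100% for long articles, the tweaks in this post are aimed at the right phase.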

Ollama's defaults are tuned for low-latency interactive chat, not heavy document ingestion. For a long summarization task, this default behavior forces your system to:

Hold a large, uncompressed (f16) KV cache in system memory.
Max out memory bandwidth while computing self-attention over the whole prompt.
Keep the model sitting idle in memory afterwards, blocking resources from other processes.

In order to solve these problems, we need to alter Ollama's fundamental configuration. Here are the precise tweaks that saved our infrastructure.

The Solution

Part 1: Restricting Ollama Core Consumption

Our solution required a two-part offensive: forcing the model to be resource-conscious and ensuring that this specific model was used every single time.

Our first breakthrough was moving away from the generic llama3.2:3b and creating our own optimized, stable variant: llama3.2:3b-clean.

For these per-model limits, we didn't rely on environment variables, which can change or be misconfigured. Instead, we baked the resource limits directly into the model with a custom Modelfile. Here is the exact definition:

# /config/Modelfile
FROM llama3.2:3b

# --- RESOURCE STABILITY TWEAKS ---

# 1. HARD-CAP THREADS. Sets a strict ceiling on the number of CPU threads used for inference.
# Matching the physical core count avoids thrashing on hyperthreaded/logical cores.
# Use your physical core count (e.g., 2, 3, or 4), not logical.
PARAMETER num_thread 3

# 2. OPTIMIZE BATCH SIZE. Limits how many tokens are processed simultaneously in a single batch.
# A smaller batch slows down ingestion, spreading the CPU load.
PARAMETER num_batch 128

# --- APPLICATION TWEAKS ---

# 3. SET CONTEXT CEILING. Establishes a rigid, predictable memory ceiling for the input.
PARAMETER num_ctx 4096

# 4. ENFORCE CONCISENESS. Prevents the model from generating long, unnecessary text.
PARAMETER num_predict 80
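If you are unsure what your physical core count is, one way to find out on Linux is to parse the output of `lscpu -p=CPU,CORE,SOCKET`. The helper below is a small sketch of that idea, not part of our deployment; the sample output is hypothetical and describes a 2-core CPU with hyperthreading.

```python
def count_physical_cores(lscpu_parse_output: str) -> int:
    """Count unique (core, socket) pairs in `lscpu -p=CPU,CORE,SOCKET` output."""
    cores = set()
    for line in lscpu_parse_output.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and lscpu's header comments
        _cpu, core, socket = line.split(",")[:3]
        cores.add((core, socket))
    return len(cores)


# Hypothetical output: 4 logical CPUs that map onto 2 physical cores.
SAMPLE = """\
# CPU,Core,Socket
0,0,0
1,1,0
2,0,0
3,1,0
"""

print(count_physical_cores(SAMPLE))  # prints 2
```

On a real host you could feed it `subprocess.run(["lscpu", "-p=CPU,CORE,SOCKET"], capture_output=True, text=True).stdout`. Inside a container with a CPU quota, the quota may be lower than what `lscpu` reports, so treat the result as an upper bound for `num_thread`.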

Why This Works (Technical Breakdown)

num_thread 3: The most critical tweak. Many server CPUs have dozens of logical cores. Allowing Ollama to use 24 or 32 threads during the ingestion phase for a 3B model is massive overkill. It invites cache misses and severe thread contention. We set this to 3, matching the physical cores allocated to this service, so the threads do not compete with each other for the same core.
num_batch 128: The default can be much higher. A large batch size lets Ollama ingest big chunks of the prompt at once, causing a violent CPU spike. Lowering this to 128 forces the ingestion phase to be more of a "stream" than a "flood," smoothing out the load at the cost of a slower prefill.
num_ctx 4096: Newer models can have a context length of 128k or higher. Without a cap, your server might attempt to pre-allocate a very large KV cache, leading to immediate memory pressure. 4096 is more than enough for summarizing standard articles.
num_predict 80: Caps each summary at 80 generated tokens, which keeps the generation phase short and predictable.
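To get a feel for what a smaller batch does, here is a rough sketch that counts how many passes the prefill phase needs for a given prompt. It is a simplified model (the prompt is consumed in chunks of at most `num_batch` tokens), and the value 512 is only a comparison point that we assume resembles a typical default; check the default of your own Ollama version.

```python
import math


def prefill_batches(prompt_tokens: int, num_batch: int) -> int:
    """Number of chunks needed to ingest a prompt, num_batch tokens at a time."""
    return math.ceil(prompt_tokens / num_batch)


# A 4000-token article under two batch sizes.
for num_batch in (512, 128):
    passes = prefill_batches(4000, num_batch)
    print(f"num_batch={num_batch}: {passes} passes")
```

The total work is about the same in both cases. The smaller batch splits it into more, smaller steps, which is what flattens the spike.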

Tweaking Specific Ollama Parameters

Setting these environment variables before running ollama serve completely altered how our system handles long-form inputs.

1. `OLLAMA_FLASH_ATTENTION: true`

Standard attention scales quadratically with document length and builds a large score matrix that spills out of the CPU caches into slower system memory. With Flash Attention turned on, Ollama switches to a memory-efficient algorithm that computes attention block by block without storing the full matrix.

Why it matters for CPU: It cuts redundant memory reads and writes, keeping hot data where it belongs: in your processor's high-speed caches.
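The quadratic growth is easy to underestimate, so here is a back-of-the-envelope sketch. It uses the simplified textbook view in which standard attention stores one full score matrix per head, and it assumes 4-byte scores and a 128-token tile purely for illustration. Real runtimes also chunk the prompt, so actual buffer sizes will differ.

```python
MIB = 1024 * 1024


def score_matrix_bytes(seq_len: int, bytes_per_score: int = 4) -> int:
    """Memory for one head's full seq_len x seq_len attention score matrix."""
    return seq_len * seq_len * bytes_per_score


def tile_bytes(tile: int, bytes_per_score: int = 4) -> int:
    """Memory for one tile x tile block, as used by block-wise attention."""
    return tile * tile * bytes_per_score


for seq_len in (1024, 4096):
    size_mib = score_matrix_bytes(seq_len) / MIB
    print(f"{seq_len} tokens: {size_mib:.0f} MiB per head")

print(f"128-token tile: {tile_bytes(128) // 1024} KiB")
```

Quadrupling the prompt length multiplies the full matrix by sixteen, while the tile stays small enough to sit in cache.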

2. `OLLAMA_KV_CACHE_TYPE: q4_0`

By default, Ollama builds the KV cache using f16 (Float16) precision. For a long article, a high-precision cache eats a large share of memory space and bandwidth.

Why it matters for CPU: Setting this to q4_0 quantizes the context cache down to 4 bits per value. This single change cuts the memory footprint of your article's context by roughly 75%, reducing memory thrashing at the cost of some precision. Note that Ollama only applies a quantized KV cache when Flash Attention is enabled, so this setting depends on the previous one.
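To put numbers on that, the sketch below estimates the KV cache size for a 4096-token context. The layer count, KV head count, and head dimension are our assumptions about the Llama 3.2 3B architecture and should be checked against the model card. The q4_0 figure assumes the usual block layout of 32 values stored in 18 bytes.

```python
MIB = 1024 * 1024

# Assumed Llama 3.2 3B shape (verify against the model card).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 28, 8, 128

BYTES_F16 = 2.0       # 16 bits per value
BYTES_Q4_0 = 18 / 32  # 32 values per 18-byte block (4-bit values plus a scale)


def kv_cache_bytes(n_ctx: int, bytes_per_value: float) -> int:
    """Estimated KV cache size: keys and values for every layer and position."""
    values = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * n_ctx  # 2 = keys + values
    return int(values * bytes_per_value)


f16 = kv_cache_bytes(4096, BYTES_F16)
q4 = kv_cache_bytes(4096, BYTES_Q4_0)
print(f"f16:  {f16 / MIB:.0f} MiB")
print(f"q4_0: {q4 / MIB:.0f} MiB")
print(f"saving: {1 - q4 / f16:.0%}")
```

Under these assumptions the saving comes out slightly below the rough 75% figure, because each q4_0 block also stores a scale factor.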

3. `OLLAMA_CONTEXT_LENGTH: 4096`

Some newer open-source models support very large context windows (up to 32k or 128k tokens). If the context length is left unbounded or set to the model's maximum, the server may pre-allocate a KV cache far larger than a single article needs.

Why it matters for CPU: Capping the length at 4096 tokens sets a predictable resource ceiling that suits standard articles (roughly 3,000 words). Ollama knows exactly how much memory to allocate without over-provisioning.
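A fixed ceiling also means the caller should keep articles inside it. The sketch below trims an article to a word budget before sending it. It is a heuristic, not a tokenizer: the 1.4 tokens-per-word ratio and the 100-token allowance for the instruction prompt are assumptions chosen to be consistent with the "4096 tokens is roughly 3,000 words" rule of thumb.

```python
def max_article_words(
    num_ctx: int = 4096,
    num_predict: int = 80,
    prompt_overhead_tokens: int = 100,  # hypothetical room for the instructions
    tokens_per_word: float = 1.4,       # rough heuristic, not a real tokenizer
) -> int:
    """Approximate number of article words that fit in the context window."""
    budget = num_ctx - num_predict - prompt_overhead_tokens
    return int(budget / tokens_per_word)


def truncate_to_budget(article: str, max_words: int) -> str:
    """Keep the first max_words words (whitespace is normalized to single spaces)."""
    words = article.split()
    return " ".join(words[:max_words])


limit = max_article_words()
long_article = " ".join(["word"] * 5000)
trimmed = truncate_to_budget(long_article, limit)
print(limit, len(trimmed.split()))
```

For production use, counting real tokens with the model's tokenizer would be more accurate than this word-based estimate.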

4. `OLLAMA_KEEP_ALIVE: 0s`

Ollama normally holds a loaded model in memory for 5 minutes after a request ends, waiting for a follow-up prompt. In a multi-step backend pipeline or microservice architecture, an idle model sitting in system RAM or VRAM prevents other services or subsequent processes from reclaiming that memory.

Why it matters for CPU: Setting this to 0s instructs Ollama to unload the model weights immediately after generating the summary. Resources are freed up instantly, preventing cumulative server degradation if your pipeline runs on a scheduled interval. The trade-off is that the next request has to load the model again before it can start.
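The same behavior can also be requested per call, because the `/api/generate` endpoint accepts a `keep_alive` field. The sketch below only builds and sends such a request with the standard library. The prompt wording and the helper names are hypothetical, and the URL assumes Ollama's default port on localhost.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumes the default port


def build_summary_request(article: str, keep_alive=0) -> urllib.request.Request:
    """Build a non-streaming generate request that unloads the model afterwards."""
    payload = {
        "model": "llama3.2:3b-clean",
        "prompt": f"Summarize the following article in two sentences:\n\n{article}",
        "stream": False,           # return one JSON object instead of a stream
        "keep_alive": keep_alive,  # 0 = unload the model right after the reply
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(article: str) -> str:
    """Send the request to a running Ollama server and return the summary text."""
    with urllib.request.urlopen(build_summary_request(article), timeout=300) as resp:
        return json.loads(resp.read())["response"]
```

Calling `summarize(text)` requires a running server with the model already created. Setting the environment variable remains the safer default, since it also covers callers that forget the field.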

Whether you run Ollama in a Docker container, as a systemd service, or on k8s, the deployment should set the same four variables. In Docker Compose it looks like this:

# Example: Docker Compose configuration snippet
services:
  ollama:
    image: ollama/ollama
    environment:
      - OLLAMA_FLASH_ATTENTION=true
      - OLLAMA_KV_CACHE_TYPE=q4_0
      - OLLAMA_CONTEXT_LENGTH=4096
      - OLLAMA_KEEP_ALIVE=0s
    ports:
      - "11434:11434"

Part 2: The Automating Script

Our optimized Modelfile is only useful if it's always used. In a DevOps or automated infrastructure environment (like Docker Compose or a k3s cluster), you can't rely on a developer manually running ollama create every time.

We created an Initialization Shell Script that runs as an automated setup job before the main application service starts. This script is what actually downloads the base model and compiles our resource-conscious llama3.2:3b-clean variant.

#!/bin/sh
# file: init_optimized_model.sh
set -e  # abort if the pull or create step fails

echo "Starting temporary initialization server..."
# Start the Ollama background process and remember its PID so we can stop it later
ollama serve &
SERVER_PID=$!

# STEP 1: Wait for the local server instance to respond.
# We don't want to proceed until we can communicate with the engine.
until ollama list >/dev/null 2>&1; do
  echo "Waiting for engine to respond..."
  sleep 2
done

# STEP 2: Download the raw base model.
# This makes sure the 3B weights are present on disk.
echo "1. Downloading raw base model..."
ollama pull llama3.2:3b

# STEP 3: Compile optimized 4K CPU model variant.
# This runs 'ollama create' using our /config/Modelfile.
# This creates the stable model variant we will use in production.
echo "2. Compiling optimized 4K CPU model variant..."
ollama create llama3.2:3b-clean -f /config/Modelfile

echo "Initialization complete. Terminating setup process."
# Stop the temporary server so the main application server can start cleanly.
kill "$SERVER_PID"
wait "$SERVER_PID" 2>/dev/null || true
exit 0

This script ensures that whenever our infrastructure is deployed, our specific, optimized model is built and ready, guaranteeing predictability from day one.
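A deployment can also verify that the variant really exists before the application starts taking traffic. The sketch below checks a parsed response from Ollama's `/api/tags` endpoint, which lists local models. The helper name and the sample response are ours and only illustrate the shape of the check.

```python
def model_is_present(tags_response: dict, name: str) -> bool:
    """Return True if a model with this exact name appears in /api/tags output."""
    return any(m.get("name") == name for m in tags_response.get("models", []))


# Hypothetical, shortened response after the init script has run.
sample_tags = {
    "models": [
        {"name": "llama3.2:3b"},
        {"name": "llama3.2:3b-clean"},
    ]
}

print(model_is_present(sample_tags, "llama3.2:3b-clean"))  # prints True
```

Against a live server, the dict could come from `json.load(urllib.request.urlopen("http://localhost:11434/api/tags"))`, assuming the default port.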

The Results

After moving from stock llama3.2:3b to our custom llama3.2:3b-clean variant (deployed via script), we saw an immediate and dramatic transformation under real summarization loads:

[Figure: CPU spikes solved for summarization]

You can see that one core is always free thanks to our implementation.

[Figure: Results after optimizations for summarization]

Final Takeaway

AI architectures don't always require throwing expensive, high-tier hardware at the problem. Often, the difference between an unstable, crashing microservice and a production-grade backend engine comes down to tuning memory bandwidth and context management.

By compressing your KV cache, enforcing rigid token caps, and pruning inactive model runtimes, you can transform Ollama into a highly predictable, rock-solid article summarizer that plays nice with the rest of your server ecosystem.

P.S. -> Written using Gemini and Claude AI tools
