EmbeddingGemma on NVIDIA Triton Server
An embedding model is needed to generate embeddings: vector representations that let an LLM work with things like text and images.
Google recently released EmbeddingGemma-300M (a whopping 300 million parameters), which is supposed to have very low hardware requirements. We put that claim to the test in this article.
[Image: EmbeddingGemma 300M running on a commodity NVIDIA GPU]
Step 1. Download the model from Hugging Face to your local machine
 Please create a folder called "server" and inside that create another folder called model_repository.
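The equivalent shell commands, run from wherever you want the project to live:

mkdir -p server/model_repository
cd server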
We need to download the model from Hugging Face. First, create an access token on the Hugging Face website (under your account's Access Tokens settings) and grant it the rights needed to download the Google Gemma models:
[Image: Grant Gemma model repository access in Hugging Face]
As we can see above, we have only granted read access to the model repositories.
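Make the token available to the download tooling before running the commands in the next step. One common approach (assuming a recent huggingface_hub release, which also ships the hf CLI used below):

export HF_TOKEN=<your token>
# or log in interactively
hf auth login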
Now, we can download the model using the commands below (the pip show commands simply confirm that the Hugging Face hub packages are installed):
pip3.12 show huggingface_hub
pip3.12 show huggingface_hub_cli

hf download google/embeddinggemma-300m \
  --local-dir model_repository/embeddinggemma-300m/1/embeddinggemma-300m \
  --local-dir-use-symlinks False
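Alternatively, the same download can be done from Python with the huggingface_hub library; a minimal sketch, assuming the access token is exposed via HF_TOKEN:

from huggingface_hub import snapshot_download

# Download the EmbeddingGemma snapshot into the layout Triton expects (see Step 2)
snapshot_download(
    repo_id="google/embeddinggemma-300m",
    local_dir="model_repository/embeddinggemma-300m/1/embeddinggemma-300m",
)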
Triton Inference Server expects the model repository to follow a particular layout. Please make sure the files are downloaded to the path shown above.
Step 2. Create the model.py file and config.pbtxt file
The config.pbtxt file in NVIDIA Triton Inference Server is a crucial component for defining the configuration of a deployed model. It is a Protocol Buffer text format file that describes various aspects of how a model should be loaded, executed, and managed by Triton.

Folder structure (this is the local model_repository folder, which we later mount into the container as /models):

model_repository/
└── embeddinggemma-300m/
    ├── config.pbtxt
    └── 1/
        ├── model.py
        └── embeddinggemma-300m/   (empty placeholder; drop the local HF snapshot from Step 1 here)
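Here is a minimal config.pbtxt sketch that is consistent with the model metadata returned later in this article (a TEXT input and an optional MODE input as BYTES, and an FP32 EMBEDDINGS output with 768 dimensions); treat it as a starting point rather than the exact file:

name: "embeddinggemma-300m"
backend: "python"
max_batch_size: 0

input [
  {
    name: "TEXT"
    data_type: TYPE_STRING
    dims: [ -1, -1 ]
  },
  {
    name: "MODE"
    data_type: TYPE_STRING
    dims: [ -1, 1 ]
    optional: true
  }
]

output [
  {
    name: "EMBEDDINGS"
    data_type: TYPE_FP32
    dims: [ -1, -1, 768 ]
  }
]

instance_group [
  {
    kind: KIND_GPU
    count: 1
  }
]

And a minimal model.py sketch for Triton's Python backend. It assumes the Hugging Face snapshot from Step 1 sits in the embeddinggemma-300m folder next to it, loads it with transformers, and mean-pools the last hidden states into one 768-dimensional vector per input text; the optional MODE input (EmbeddingGemma distinguishes query and document prompts) is ignored here to keep the sketch short:

import os

import numpy as np
import torch
import triton_python_backend_utils as pb_utils
from transformers import AutoModel, AutoTokenizer


class TritonPythonModel:
    def initialize(self, args):
        # The HF snapshot lives under <model_repository>/<version>/embeddinggemma-300m
        model_dir = os.path.join(
            args["model_repository"], args["model_version"], "embeddinggemma-300m"
        )
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
        self.model = AutoModel.from_pretrained(model_dir).to(self.device).eval()

    def execute(self, requests):
        responses = []
        for request in requests:
            texts = pb_utils.get_input_tensor_by_name(request, "TEXT").as_numpy()
            # BYTES tensors arrive as numpy object arrays of raw bytes
            sentences = [t.decode("utf-8") for t in texts.reshape(-1)]

            enc = self.tokenizer(
                sentences, padding=True, truncation=True, return_tensors="pt"
            ).to(self.device)
            with torch.no_grad():
                hidden = self.model(**enc).last_hidden_state

            # Mean pooling over non-padding tokens -> one vector per sentence
            mask = enc["attention_mask"].unsqueeze(-1).float()
            pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

            # Reshape back to the input layout: [rows, texts per row, hidden size]
            emb = pooled.cpu().numpy().astype(np.float32)
            emb = emb.reshape(texts.shape + (emb.shape[-1],))

            responses.append(
                pb_utils.InferenceResponse(
                    output_tensors=[pb_utils.Tensor("EMBEDDINGS", emb)]
                )
            )
        return responses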
Step 3. Create a Dockerfile and build an image

We need to create a custom Dockerfile with the following content:

FROM nvcr.io/nvidia/tritonserver:25.05-py3

# 1) GPU PyTorch: nightly CUDA 12.4 wheel (>= 2.6); it bundles its own CUDA and works fine in this image
RUN python3.12 -m pip install --no-cache-dir \
    --index-url https://download.pytorch.org/whl/nightly/cu124 \
    torch --pre

# 2) Transformers that supports gemma3_text + friends
RUN python3.12 -m pip install --no-cache-dir \
    "transformers>=4.46.1" \
    "safetensors>=0.4.5" \
    numpy \
    "protobuf>=3.20.3,<5.0" \
    "sentencepiece>=0.1.99"

This Dockerfile should be inside the server folder. Now we build it and create a custom image:

sudo docker build -t triton-gemma-cpu .
Now we run the NVIDIA Triton server as a Docker container:
sudo docker run --rm --privileged --gpus all -it -p 8000:8000 -p 8001:8001 -p 8002:8002 -v $PWD/model_repository:/models triton-gemma-cpu tritonserver --model-repository=/models
The above command runs the Triton Inference Server in a Docker container with full GPU access. It maps the standard Triton ports (HTTP, gRPC, metrics) to your local machine and, most importantly, mounts your local model_repository folder as the server's model directory, allowing it to serve the models you have stored there.
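Once the container is up, you can check that the server reports ready via the standard KServe v2 health endpoint (assuming the default port mapping used above; a 200 status means ready):

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready

Then test that the model itself is loaded and running by using curl: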
curl -s http://localhost:8000/v2/models/embeddinggemma-300m
The response will be:

{"name":"embeddinggemma-300m","versions":["1"],"platform":"python","inputs":[{"name":"TEXT","datatype":"BYTES","shape":[-1,-1]},{"name":"MODE","datatype":"BYTES","shape":[-1,1]}],"outputs":[{"name":"EMBEDDINGS","datatype":"FP32","shape":[-1,-1,768]}]}

Sample request:
curl -X POST localhost:8000/v2/models/embeddinggemma-300m/infer -d \
'{
  "inputs": [
    {
      "name": "TEXT",
      "shape": [ 1, 2 ],
      "datatype": "BYTES",
      "data": [ "This is the first document", "This is the second document" ]
    }
  ]
}' 
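The same request can be sent from Python with the tritonclient package (a sketch, assuming it has been installed with pip install tritonclient[http]; the model, input, and output names match the metadata above):

import numpy as np
import tritonclient.http as httpclient

# Connect to the Triton HTTP endpoint exposed by the docker run command above
client = httpclient.InferenceServerClient(url="localhost:8000")

# Shape [1, 2]: one row carrying two documents, matching the curl example
texts = np.array([["This is the first document", "This is the second document"]], dtype=object)

text_input = httpclient.InferInput("TEXT", list(texts.shape), "BYTES")
text_input.set_data_from_numpy(texts)

result = client.infer(
    model_name="embeddinggemma-300m",
    inputs=[text_input],
    outputs=[httpclient.InferRequestedOutput("EMBEDDINGS")],
)

embeddings = result.as_numpy("EMBEDDINGS")
print(embeddings.shape)  # expected: (1, 2, 768)

Each returned vector has 768 dimensions and can be stored in a vector database or compared with cosine similarity.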
