EmbeddingGemma on NVIDIA Triton Server
An embedding model is needed to generate embeddings: vector representations that let an LLM work with things like text and images.
Google recently released EmbeddingGemma-300M (a whopping 300 million parameters), which is supposed to have very low hardware requirements. We put that claim to the test in this article.
[Image: EmbeddingGemma 300M running on a commodity NVIDIA GPU]
Step 1. Download the model from Hugging Face to your local machine
 Please create a folder called "server" and inside that create another folder called model_repository.
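The equivalent shell commands, run from wherever you want the project to live:

mkdir -p server/model_repository
cd server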
We need to download the model from Hugging Face. First, create an access token on the Hugging Face website (under your account's Access Tokens settings) and grant it the rights needed to download the Google Gemma models:
[Image: Grant Gemma model repository access in Hugging Face]
As we can see above, we have only granted read access to the model repositories.
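Make the token available to the download tooling before running the commands in the next step. One common approach (assuming a recent huggingface_hub release, which also ships the hf CLI used below):

export HF_TOKEN=<your token>
# or log in interactively
hf auth login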
Now, we can download the model using the commands below (the pip show commands simply confirm that the Hugging Face hub packages are installed):
pip3.12 show huggingface_hub
pip3.12 show huggingface_hub_cli

hf download google/embeddinggemma-300m \
  --local-dir model_repository/embeddinggemma-300m/1/embeddinggemma-300m \
  --local-dir-use-symlinks False
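Alternatively, the same download can be done from Python with the huggingface_hub library; a minimal sketch, assuming the access token is exposed via HF_TOKEN:

from huggingface_hub import snapshot_download

# Download the EmbeddingGemma snapshot into the layout Triton expects (see Step 2)
snapshot_download(
    repo_id="google/embeddinggemma-300m",
    local_dir="model_repository/embeddinggemma-300m/1/embeddinggemma-300m",
)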
Triton Inference Server expects the model repository to follow a particular layout. Please make sure the files are downloaded to the path shown above.
Step 2. Create the model.py file and config.pbtxt file
The config.pbtxt file in NVIDIA Triton Inference Server is a crucial component for defining the configuration of a deployed model. It is a Protocol Buffer text format file that describes various aspects of how a model should be loaded, executed, and managed by Triton.

Folder structure (this is the local model_repository folder, which we later mount into the container as /models):

model_repository/
└── embeddinggemma-300m/
    ├── config.pbtxt
    └── 1/
        ├── model.py
        └── embeddinggemma-300m/   (empty placeholder; drop the local HF snapshot from Step 1 here)
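Here is a minimal config.pbtxt sketch that is consistent with the model metadata returned later in this article (a TEXT input and an optional MODE input as BYTES, and an FP32 EMBEDDINGS output with 768 dimensions); treat it as a starting point rather than the exact file:

name: "embeddinggemma-300m"
backend: "python"
max_batch_size: 0

input [
  {
    name: "TEXT"
    data_type: TYPE_STRING
    dims: [ -1, -1 ]
  },
  {
    name: "MODE"
    data_type: TYPE_STRING
    dims: [ -1, 1 ]
    optional: true
  }
]

output [
  {
    name: "EMBEDDINGS"
    data_type: TYPE_FP32
    dims: [ -1, -1, 768 ]
  }
]

instance_group [
  {
    kind: KIND_GPU
    count: 1
  }
]

And a minimal model.py sketch for Triton's Python backend. It assumes the Hugging Face snapshot from Step 1 sits in the embeddinggemma-300m folder next to it, loads it with transformers, and mean-pools the last hidden states into one 768-dimensional vector per input text; the optional MODE input (EmbeddingGemma distinguishes query and document prompts) is ignored here to keep the sketch short:

import os

import numpy as np
import torch
import triton_python_backend_utils as pb_utils
from transformers import AutoModel, AutoTokenizer


class TritonPythonModel:
    def initialize(self, args):
        # The HF snapshot lives under <model_repository>/<version>/embeddinggemma-300m
        model_dir = os.path.join(
            args["model_repository"], args["model_version"], "embeddinggemma-300m"
        )
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
        self.model = AutoModel.from_pretrained(model_dir).to(self.device).eval()

    def execute(self, requests):
        responses = []
        for request in requests:
            texts = pb_utils.get_input_tensor_by_name(request, "TEXT").as_numpy()
            # BYTES tensors arrive as numpy object arrays of raw bytes
            sentences = [t.decode("utf-8") for t in texts.reshape(-1)]

            enc = self.tokenizer(
                sentences, padding=True, truncation=True, return_tensors="pt"
            ).to(self.device)
            with torch.no_grad():
                hidden = self.model(**enc).last_hidden_state

            # Mean pooling over non-padding tokens -> one vector per sentence
            mask = enc["attention_mask"].unsqueeze(-1).float()
            pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

            # Reshape back to the input layout: [rows, texts per row, hidden size]
            emb = pooled.cpu().numpy().astype(np.float32)
            emb = emb.reshape(texts.shape + (emb.shape[-1],))

            responses.append(
                pb_utils.InferenceResponse(
                    output_tensors=[pb_utils.Tensor("EMBEDDINGS", emb)]
                )
            )
        return responses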
Step 3. Create a Dockerfile and build an image

We need to create a custom Dockerfile with the following content:

FROM nvcr.io/nvidia/tritonserver:25.05-py3

# 1) GPU PyTorch: nightly CUDA 12.4 wheel (>= 2.6); it bundles its own CUDA and works fine in this image
RUN python3.12 -m pip install --no-cache-dir \
    --index-url https://download.pytorch.org/whl/nightly/cu124 \
    torch --pre

# 2) Transformers that supports gemma3_text + friends
RUN python3.12 -m pip install --no-cache-dir \
    "transformers>=4.46.1" \
    "safetensors>=0.4.5" \
    numpy \
    "protobuf>=3.20.3,<5.0" \
    "sentencepiece>=0.1.99"

This Dockerfile should be inside the server folder. Now we build it and create a custom image:

sudo docker build -t triton-gemma-cpu .
Now we run the NVIDIA Triton server as a Docker container:
sudo docker run --rm --privileged --gpus all -it -p 8000:8000 -p 8001:8001 -p 8002:8002 -v $PWD/model_repository:/models triton-gemma-cpu tritonserver --model-repository=/models
The above command runs the Triton Inference Server in a Docker container with full GPU access. It maps the standard Triton ports (HTTP, gRPC, metrics) to your local machine and, most importantly, mounts your local model_repository folder as the server's model directory, allowing it to serve the models you have stored there.
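Once the container is up, you can check that the server reports ready via the standard KServe v2 health endpoint (assuming the default port mapping used above; a 200 status means ready):

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready

Then test that the model itself is loaded and running by using curl: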
curl -s http://localhost:8000/v2/models/embeddinggemma-300m
The response will be:

{"name":"embeddinggemma-300m","versions":["1"],"platform":"python","inputs":[{"name":"TEXT","datatype":"BYTES","shape":[-1,-1]},{"name":"MODE","datatype":"BYTES","shape":[-1,1]}],"outputs":[{"name":"EMBEDDINGS","datatype":"FP32","shape":[-1,-1,768]}]}

Sample request:
curl -X POST localhost:8000/v2/models/embeddinggemma-300m/infer -d \
'{
  "inputs": [
    {
      "name": "TEXT",
      "shape": [ 1, 2 ],
      "datatype": "BYTES",
      "data": [ "This is the first document", "This is the second document" ]
    }
  ]
}' 
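The same request can be sent from Python with the tritonclient package (a sketch, assuming it has been installed with pip install tritonclient[http]; the model, input, and output names match the metadata above):

import numpy as np
import tritonclient.http as httpclient

# Connect to the Triton HTTP endpoint exposed by the docker run command above
client = httpclient.InferenceServerClient(url="localhost:8000")

# Shape [1, 2]: one row carrying two documents, matching the curl example
texts = np.array([["This is the first document", "This is the second document"]], dtype=object)

text_input = httpclient.InferInput("TEXT", list(texts.shape), "BYTES")
text_input.set_data_from_numpy(texts)

result = client.infer(
    model_name="embeddinggemma-300m",
    inputs=[text_input],
    outputs=[httpclient.InferRequestedOutput("EMBEDDINGS")],
)

embeddings = result.as_numpy("EMBEDDINGS")
print(embeddings.shape)  # expected: (1, 2, 768)

Each returned vector has 768 dimensions and can be stored in a vector database or compared with cosine similarity.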
