Following blog for fine tuning gemma-2b doesn't yield same results

#60

by chongdashu - opened May 19, 2024

Discussion

chongdashu

May 19, 2024

Following the blog here: https://huggingface.co/blog/gemma-peft

I've replicated the entire blog but don't get the same result.
It still outputs the same as prior to fine-tuning.

Here is the notebook

chongdashu

May 19, 2024

It seems that if I rely on the latest dependencies, i.e.

!pip install -q -U accelerate bitsandbytes git+https://github.com/huggingface/transformers.git
!pip install datasets -q
!pip install peft -q

I get a training failure.
But if I use the following...

!pip3 install -q -U bitsandbytes==0.42.0
!pip3 install -q -U peft==0.8.2
!pip3 install -q -U trl==0.7.10
!pip3 install -q -U accelerate==0.27.1
!pip3 install -q -U datasets==2.17.0
!pip3 install -q -U transformers==4.38.1

I can get the same results.

I am surprised that the change in libs would cause such a big drop-off.
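
Since the two install recipes differ only in versions, it can help to record exactly what is installed in each environment before comparing runs. Below is a minimal sketch using only the standard library. The package list is just the union of the names in the pip commands above, and `installed_versions` is a hypothetical helper name, not part of any of these libraries.

```python
from importlib import metadata

# Packages named in the two install recipes above.
PACKAGES = ["transformers", "trl", "peft", "accelerate", "bitsandbytes", "datasets"]


def installed_versions(packages=PACKAGES):
    """Return {package: version string, or None when it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```

Running this once in the "latest" environment and once in the pinned one gives two lists that can be diffed directly.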

ybelkada

May 20, 2024

Hi @chongdashu
Thanks for the report!
To isolate which lib is responsible, can you try the same experiment with:

peft == 0.8.2 vs peft == 0.11.0 (while keeping all other libs to the 'stable' version)
trl == 0.7.2 vs trl == 0.8.6 (while keeping all other libs to the 'stable' version)
I will also try to reproduce on my end and report back here.
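
The one-library-at-a-time comparison described above could be scripted roughly as follows. This is only a sketch: it assumes the pinned set reported earlier in the thread is the "stable" baseline, and `bisect_commands` is a hypothetical helper. pip's resolver may still refuse a combination if the pins conflict.

```python
# Known-good pins reported earlier in this thread (assumed "stable" baseline).
BASELINE = {
    "bitsandbytes": "0.42.0",
    "peft": "0.8.2",
    "trl": "0.7.10",
    "accelerate": "0.27.1",
    "datasets": "2.17.0",
    "transformers": "4.38.1",
}

# Newer versions to try, one library at a time.
CANDIDATES = {"peft": "0.11.0", "trl": "0.8.6"}


def bisect_commands(baseline=BASELINE, candidates=CANDIDATES):
    """Yield (changed_library, pip command) with exactly one pin swapped."""
    for lib, new_version in candidates.items():
        pins = dict(baseline)
        pins[lib] = new_version
        specs = " ".join(f"{name}=={ver}" for name, ver in pins.items())
        yield lib, f"pip install -q -U {specs}"


if __name__ == "__main__":
    for lib, command in bisect_commands():
        print(f"# only {lib} changed")
        print(command)
```

Each printed command goes into a fresh environment, so that any change in training behaviour can be attributed to the single library that moved.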

chongdashu

May 20, 2024

@ybelkada sure thing, let me give it a whirl

chongdashu

May 20, 2024

edited May 20, 2024

With peft==0.11.0

I get the following error on trying to train

File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:258, in GradScaler._unscale_grads_(self, optimizer, inv_scale, found_inf, allow_fp16)
    256     continue
    257 if (not allow_fp16) and param.grad.dtype == torch.float16:
--> 258     raise ValueError("Attempting to unscale FP16 gradients.")
    259 if param.grad.is_sparse:
    260     # is_coalesced() == False means the sparse grad has values with duplicate indices.
    261     # coalesce() deduplicates indices and adds all values that have the same index.
    262     # For scaled fp16 values, there's a good chance coalescing will cause overflow,
    263     # so we should check the coalesced _values().
    264     if param.grad.dtype is torch.float16:

ValueError: Attempting to unscale FP16 gradients.
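
The check in the traceback fires when a gradient being unscaled is itself float16, which is what happens when trainable parameters are stored in float16 while `fp16=True` mixed precision is on. A small sketch for spotting such parameters is below. `fp16_trainable_params` is a hypothetical helper, and it only reads the `requires_grad` and `dtype` attributes, so it is an assumption that the model exposes the usual PyTorch `named_parameters()` interface.

```python
def fp16_trainable_params(named_parameters):
    """Return names of trainable parameters stored in float16.

    `named_parameters` is any iterable of (name, param) pairs, such as
    `model.named_parameters()` on a PyTorch module. Only `requires_grad`
    and `dtype` are read, so torch does not need to be imported here.
    """
    return [
        name
        for name, param in named_parameters
        if param.requires_grad and str(param.dtype) == "torch.float16"
    ]
```

Calling `fp16_trainable_params(model.named_parameters())` just before `trainer.train()` would list the offending weights. If the list is non-empty, commonly suggested options are keeping the trainable weights in float32 or training with bf16 instead of fp16. Whether either one resolves the loss behaviour discussed in this thread is not established here.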

With trl==0.8.6 I replicate the issue: the training loss basically never decreases and the fine-tuning doesn't complete successfully.

chongdashu

May 22, 2024

Hi @ybelkada - any idea on what might be going on here with TRL?

merve

May 22, 2024

@chongdashu we are about to merge a change to transformers that'll fix finetuning issues. I will post a notebookized version of the blog soon after I confirm it works well.

chongdashu

May 22, 2024

@merve great to hear thanks!

merve

May 22, 2024

@chongdashu we have made a few changes around finetuning (also a smol change in API) you can see here: https://colab.research.google.com/drive/1x_OEphRK0H97DqqxEyiMewqsTiLD_Xmi?usp=sharing

chongdashu

May 22, 2024

Thanks @merve , will check it out!

chongdashu

May 22, 2024

edited May 22, 2024

@merve does this need an update of the transformers version?

edit
Oh wait, I see it: git+https://github.com/huggingface/transformers.git

Though it's not immediately obvious what the API change is?

chongdashu

May 22, 2024

edited May 22, 2024

I've tried using the latest transformers with trl, but still the same issue with training loss on gemma-2b.

!pip install --force-reinstall trl accelerate datasets peft bitsandbytes git+https://github.com/huggingface/transformers.git

import transformers
from trl import SFTTrainer

# model, data, lora_config, and formatting_func are assumed to be defined as in the blog post.
trainer = SFTTrainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir=".outputs",
        optim="paged_adamw_8bit"
    ),
    peft_config=lora_config,
    formatting_func=formatting_func,
    packing=False
)
trainer.train()
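
To make "same issue with training loss" concrete, the logged losses can be summarised after the run. A rough sketch, assuming the trainer keeps its logs in `trainer.state.log_history` as a list of dicts where step logs carry a `"loss"` key; `loss_trend` is a hypothetical helper name.

```python
def loss_trend(log_history):
    """Summarise training loss from a Trainer-style log history.

    `log_history` is a list of dicts; entries that carry a "loss" key are
    treated as training-step logs, everything else is ignored.
    """
    losses = [entry["loss"] for entry in log_history if "loss" in entry]
    if len(losses) < 2:
        return None  # not enough points to compare
    first, last = losses[0], losses[-1]
    return {
        "first": first,
        "last": last,
        "min": min(losses),
        "decreased": last < first,
    }
```

After `trainer.train()`, `loss_trend(trainer.state.log_history)` gives a compact summary to paste into the thread. With only 10 steps the comparison is noisy, so it is a quick sanity check rather than proof that training works.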

lkv

Google org Jan 22, 2025

edited Jan 22, 2025

Hi @chongdashu ,

Fine-tuning very large models like Gemma-2B requires proper memory management, appropriate learning rates, and a solid reward structure if we are using RL (or reinforcement-based fine-tuning). I have reproduced the issue. Could you please refer to this gist file for reference? Kindly try it and let me know if you have any concerns.

Thank you.
