# Tokenizer

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google-deepmind/gemma/blob/main/colabs/tokenizer.ipynb)

This tutorial shows how to use the Gemma tokenizer. Understanding the tokenizer is important for feeding input to the model correctly.

For more background on tokenizers, see the excellent talk by [Andrej Karpathy](https://www.youtube.com/watch?v=zduSFxRajkE).

In [None]:
!pip install -q gemma

In [None]:
# Common imports

# Gemma imports
from gemma import gm

## Tokenizer basics

Gemma tokenizers are directly available:

In [None]:
tokenizer = gm.text.Gemma3Tokenizer()

The total number of tokens is available through `.vocab_size`:

In [None]:
tokenizer.vocab_size

256000

### Encoding

You can encode a string:

* Into token ids with `.encode`:

In [None]:
tokenizer.encode('Derinkuyu is an underground city.')

[8636, 979, 78904, 603, 671, 30073, 3413, 235265]

* Into token strings with `.split`:

In [None]:
tokenizer.split('Derinkuyu is an underground city.')

['Der', 'ink', 'uyu', ' is', ' an', ' underground', ' city', '.']

Note that whitespace is part of the tokens. This means that, for the model, ` hello` and `hello` map to two different token ids.

In [None]:
# Print both results so that each cell output is visible.
print(tokenizer.encode(' hello'))
print(tokenizer.encode('hello'))

[25612]

[17534]

When doing next-word prediction, it's important not to add a trailing space, as it would make the input out of distribution.

In [None]:
# When encoding this sentence, the last token is a lone space,
# which is unusual for the model.
tokenizer.split('The capital of France is ')

['The', ' capital', ' of', ' France', ' is', ' ']
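One way to guard against this is to strip trailing spaces from the prompt before encoding it. The sketch below is plain Python and does not call the tokenizer; `clean_prompt` is a hypothetical helper name, not part of the Gemma API.

```python
def clean_prompt(prompt: str) -> str:
    """Removes trailing spaces so the prompt does not end in a lone space token."""
    # Only spaces are stripped: a trailing newline may be intentional.
    return prompt.rstrip(' ')


prompt = clean_prompt('The capital of France is ')
print(repr(prompt))  # 'The capital of France is'
```

The cleaned string can then be passed to `tokenizer.encode` as usual.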

### Decoding

Tokens can be decoded with `.decode`. You can decode a single id or an entire sentence.

In [None]:
tokenizer.decode([8636, 979, 78904, 603, 671, 30073, 3413, 235265])

'Derinkuyu is an underground city.'

In [None]:
tokenizer.decode(4567)

'Med'
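To build intuition for how ids map back to text, the sketch below pairs the ids and pieces from the outputs shown earlier in a toy lookup table. This only illustrates the idea for this one sentence; it is not how the real tokenizer is implemented, and `toy_decode` is a hypothetical name.

```python
# Token ids and pieces copied from the outputs shown earlier in this tutorial.
ids = [8636, 979, 78904, 603, 671, 30073, 3413, 235265]
pieces = ['Der', 'ink', 'uyu', ' is', ' an', ' underground', ' city', '.']

# Toy lookup table (id -> piece). The real vocabulary is far larger.
id_to_piece = dict(zip(ids, pieces))


def toy_decode(token_ids: list[int]) -> str:
    """Concatenates the piece of each id, mimicking `.decode` on this sentence."""
    return ''.join(id_to_piece[i] for i in token_ids)


text = toy_decode(ids)
print(text)  # Derinkuyu is an underground city.
```

Because the leading spaces live inside the pieces, plain concatenation is enough to restore the original sentence here.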

## Control tokens

Some tokens have a special meaning. Forgetting them can significantly degrade model quality.

Special token ids can be accessed through the `tokenizer.special_tokens` attribute.

### `<bos>` / `<eos>`

In Gemma models, the beginning-of-sentence token (`<bos>`) should appear only once, at the beginning of the input. You can add it either explicitly or with `add_bos=True`:

In [None]:
token_ids = tokenizer.encode('Hello world!')
token_ids = [tokenizer.special_tokens.BOS] + token_ids
token_ids

[<_Gemma2SpecialTokens.BOS: 2>, 4521, 2134, 235341]

In [None]:
tokenizer.encode('Hello world!', add_bos=True)

[<_Gemma2SpecialTokens.BOS: 2>, 4521, 2134, 235341]

Similarly, the model can output an `<eos>` token to indicate the prediction is complete.

When fine-tuning Gemma, you can train the model to predict `<eos>` tokens.

In [None]:
tokenizer.encode('Hello world!', add_eos=True)

[4521, 2134, 235341, <_Gemma2SpecialTokens.EOS: 1>]
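Since `<bos>` should appear only once, it can help to normalize token ids when several encoded pieces are concatenated. The following is a minimal sketch on plain integer lists; `ensure_single_bos` is a hypothetical helper, and the id `2` is an assumption taken from the `BOS: 2` output shown above.

```python
BOS_ID = 2  # Assumption: value taken from the `BOS: 2` output above.


def ensure_single_bos(token_ids: list[int], bos_id: int = BOS_ID) -> list[int]:
    """Returns ids with exactly one BOS at the start, dropping any other BOS."""
    without_bos = [t for t in token_ids if t != bos_id]
    return [bos_id] + without_bos


print(ensure_single_bos([4521, 2134, 235341]))        # [2, 4521, 2134, 235341]
print(ensure_single_bos([2, 2, 4521, 2134, 235341]))  # [2, 4521, 2134, 235341]
```

In real code, passing `tokenizer.special_tokens.BOS` as `bos_id` avoids hard-coding the value.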

### `<start_of_turn>` / `<end_of_turn>`

When using the instruction-tuned version of Gemma, the `<start_of_turn>` / `<end_of_turn>` tokens specify whether the user or the model is talking.

The `<start_of_turn>` token should be followed by either:

* `user`
* `model`

Example of a dialogue between the user and the model:

In [None]:
token_ids = tokenizer.encode("""<start_of_turn>user
Knock knock.<end_of_turn>
<start_of_turn>model
Who's there ?<end_of_turn>
<start_of_turn>user
Gemma.<end_of_turn>
<start_of_turn>model
Gemma who?<end_of_turn>""")

In [None]:
tokenizer.decode(token_ids[0])

'<start_of_turn>'
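Writing these markers by hand is error-prone, so a small formatting helper can be convenient. The sketch below builds the same dialogue string as above from `(role, text)` pairs; `format_dialogue` is a hypothetical helper written for this tutorial, not a function of the Gemma library.

```python
def format_dialogue(turns: list[tuple[str, str]]) -> str:
    """Formats (role, text) pairs with `<start_of_turn>` / `<end_of_turn>` markers."""
    for role, _ in turns:
        if role not in ('user', 'model'):
            raise ValueError(f'Unexpected role: {role!r}')
    return '\n'.join(
        f'<start_of_turn>{role}\n{text}<end_of_turn>' for role, text in turns
    )


prompt = format_dialogue([
    ('user', 'Knock knock.'),
    ('model', "Who's there ?"),
    ('user', 'Gemma.'),
    ('model', 'Gemma who?'),
])
print(prompt)
```

The resulting string can be passed to `tokenizer.encode` exactly like the hand-written version.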

### `<start_of_image>`

In Gemma 3, the prompt should contain the special `<start_of_image>` token to indicate the position of an image in the text. Internally, the Gemma model automatically expands this token to insert the soft image tokens.

(Note: There's also an `<end_of_image>` token, but it is handled internally by the model.)
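Before calling the model, it may be worth checking that the prompt contains as many placeholders as there are images. The sketch below assumes one `<start_of_image>` token per image, which this tutorial does not state explicitly; `check_image_placeholders` is a hypothetical helper.

```python
IMAGE_PLACEHOLDER = '<start_of_image>'


def check_image_placeholders(prompt: str, num_images: int) -> None:
    """Raises if the number of placeholders differs from the number of images."""
    # Assumption: each image is represented by exactly one placeholder.
    found = prompt.count(IMAGE_PLACEHOLDER)
    if found != num_images:
        raise ValueError(
            f'Prompt has {found} image placeholder(s) '
            f'but {num_images} image(s) were given.'
        )


# Passes silently: one placeholder, one image.
check_image_placeholders('Describe this image: <start_of_image>', num_images=1)
```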

### Custom tokens

In all Gemma versions, a few tokens (`99`) are unused. This allows applications to define and fine-tune their own custom tokens. Those tokens are available through `tokenizer.special_tokens.CUSTOM + xx`, with `xx` being a number between `0` and `98`.

In [None]:
tokenizer.decode(tokenizer.special_tokens.CUSTOM + 17)

'<unused17>'

You can choose which strings the custom tokens correspond to when constructing the tokenizer.

In [None]:
tokenizer = gm.text.Gemma3Tokenizer(
    custom_tokens={
        0: '<my_custom_tag>',
        17: '<my_other_tag>',
    },
)

tokenizer.encode('<my_other_tag>')

[24]

The custom token strings are encoded to the matching token ids.

In [None]:
tokenizer.special_tokens.CUSTOM + 17

24

In [None]:
tokenizer.decode(tokenizer.special_tokens.CUSTOM + 17)

'<my_other_tag>'
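The id arithmetic behind `CUSTOM + xx` can be sketched in plain Python with a bounds check. `custom_token_id` is a hypothetical helper, and the base id `7` used in the example is an assumption inferred from the `CUSTOM + 17 == 24` output above; it may differ between tokenizer versions, so prefer reading it from `tokenizer.special_tokens.CUSTOM`.

```python
NUM_CUSTOM_TOKENS = 99  # From the text above: indices 0 to 98.


def custom_token_id(index: int, custom_base: int) -> int:
    """Returns the id of custom token `index`, given the id of the first one."""
    if not 0 <= index < NUM_CUSTOM_TOKENS:
        raise ValueError(
            f'index must be in [0, {NUM_CUSTOM_TOKENS - 1}], got {index}'
        )
    return custom_base + index


# Assumption: base id 7, inferred from `CUSTOM + 17 == 24` in the output above.
print(custom_token_id(17, custom_base=7))  # 24
```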