Input Block - Tokenization
In building LLMs, the first block is the input block. In this step, the input text is passed through a tokenizer to produce token IDs. These token IDs are then passed through an embedding layer, and positional encoding is added before the result is sent to the transformer block.
I will talk about positional encoding later in the series. For now, think of IDs (numerical values) being sent to the transformer block.
Here's the architecture of LLMs showing all three blocks.
Tokenizer
The input to a transformer is numbers, not words, so we need to split the raw text into tokens (usually words, subwords, or even characters) and convert them to token IDs. This is called tokenization.
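As a minimal sketch of the idea, here is a toy word-level tokenizer with a hypothetical hand-built vocabulary (a real LLM tokenizer learns a subword vocabulary from data, as shown in the BPE sections below):
# Toy vocabulary mapping tokens to IDs (hypothetical; a real LLM
# vocabulary contains tens of thousands of learned subword tokens).
vocab = {"This": 0, "is": 1, "a": 2, "sentence": 3}

def tokenize(text):
    # Naive whitespace split; real tokenizers use subword algorithms like BPE.
    return [vocab[word] for word in text.split()]

print(tokenize("This is a sentence"))  # [0, 1, 2, 3]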
The token IDs are then passed to a neural network layer which transforms them into continuous vector representations in a very high-dimensional space. These dense vector representations capture semantic relationships between tokens, allowing machines to process language more effectively.
Here's how an embedding matrix is typically created.
Following the token "This" as an example:
- If its token ID is 2, the corresponding row of the embedding matrix could be [0.2, 0.8, ...]
Embedding Matrix
- A lookup table where each row corresponds to a specific token in the vocabulary and contains that token's learned vector representation.
Here's the summary -
Typically the embedding size (the length of each token's vector) ranges from 768 to roughly 16K, while the number of rows equals the vocabulary size.
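Here is a minimal sketch of that row lookup in Python (the matrix values below are made up; in a real model they are learned during training):
# Toy embedding matrix: one row per token in the vocabulary
# (hypothetical values; a real model learns these during training).
embedding_matrix = [
    [0.1, 0.4, 0.7],  # row 0
    [0.9, 0.2, 0.5],  # row 1
    [0.2, 0.8, 0.3],  # row 2 <- e.g. the token with ID 2 from the example above
]

token_ids = [2, 0]  # output of the tokenizer
embeddings = [embedding_matrix[i] for i in token_ids]  # one row per token
print(embeddings)  # [[0.2, 0.8, 0.3], [0.1, 0.4, 0.7]]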
Types of Tokenizers
To tokenize the input text, there are three types of tokenizers available: word-based, subword-based, and character-based. Subword-based tokenizers are what LLMs typically use, as the quick sketch below shows.
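Here is a quick comparison of the three granularities on the same word (the subword split shown is only illustrative; the actual boundaries depend on the learned vocabulary):
text = "unbelievable"

# Word-based: the whole word is one token (fails on words never seen in training).
word_tokens = [text]

# Character-based: every character is a token (produces very long sequences).
char_tokens = list(text)

# Subword-based: frequent fragments become tokens (illustrative split only;
# a trained BPE vocabulary determines the real boundaries).
subword_tokens = ["un", "believ", "able"]

print(word_tokens, char_tokens, subword_tokens)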
Byte Pair Encoding (BPE) is one such subword algorithm.
Byte Pair Encoding (BPE)
- Originally, BPE was developed as a text compression algorithm (a minimal sketch of its merge loop follows this list).
- Check out more details and the example referenced here on Wikipedia.
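The sketch below follows the spirit of the Wikipedia example: repeatedly replace the most frequent adjacent pair of symbols with a fresh symbol. The symbol names are placeholders standing in for Z, Y, X in that example:
from collections import Counter

def bpe_compress(text, num_merges=3):
    # Start from individual characters.
    symbols = list(text)
    merges = {}
    for i in range(num_merges):
        # Count all adjacent symbol pairs.
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so nothing left to compress
        new_symbol = f"<{i}>"  # placeholder for a fresh symbol like Z, Y, X
        merges[new_symbol] = a + b
        # Replace every occurrence of the pair with the new symbol.
        merged, j = [], 0
        while j < len(symbols):
            if j + 1 < len(symbols) and (symbols[j], symbols[j + 1]) == (a, b):
                merged.append(new_symbol)
                j += 2
            else:
                merged.append(symbols[j])
                j += 1
        symbols = merged
    return symbols, merges

print(bpe_compress("aaabdaaabac"))  # the string used in the Wikipedia example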
Modified BPE
In LLMs, we use something called modified Byte Pair Encoding, which is used for encoding plain text into tokens.
BPE Library
In Rust, you can use tiktoken-rs for BPE. The equivalent Python library, tiktoken, is also available.
Rust Code
Below is the Rust code using tiktoken-rs.
use tiktoken_rs::o200k_base;

fn main() {
    // Load the o200k_base BPE encoding.
    let bpe = o200k_base().unwrap();
    // Encode the input text into token IDs.
    let token: Vec<u32> = bpe.encode_with_special_tokens("This is multi line sentence for BPE with rust and a sentence with spaces");
    println!("Token: {:?}", token);
    // Decode the token IDs back into the original text.
    println!("Decoding the token {:?}", bpe.decode(token));
}
The output of this program is shown below.
$> cargo run
## OUTPUT
#Token: [2500, 382, 12151, 2543, 21872, 395, 418, 3111, 483, 20294, 326, 261, 21872, 256, 483, 18608]
#Decoding the token Ok("This is multi line sentence for BPE with rust and a sentence with spaces")
Python Code
import tiktoken

# Load the o200k_base BPE encoding.
tokenizer = tiktoken.get_encoding("o200k_base")
text_data = "Ecode using BPE via Python"
# Encode the text into token IDs. "<|END|>" is permitted as a special token,
# though this particular text does not contain it.
encoder_output = tokenizer.encode(text_data, allowed_special={"<|END|>"})
print(encoder_output)
# Decode the token IDs back into the original text.
decoder_output = tokenizer.decode(encoder_output)
print(decoder_output)
$> python3 bpe_example.py
## OUTPUT
# [36, 3056, 2360, 418, 3111, 4493, 26534]
# Ecode using BPE via Python