In the initial example you started with 7 distinct tokens and ended up with 3 distinct tokens. Later on, you said you need to perform 20 merges through the BPE algorithm to add 20 additional tokens.
So in one case the vocab size decreased, while in the other it increased? How?
I read in another source that the merged tokens are still retained in the vocabulary, so your first example is misleading.
In Step 2, we iteratively merge tokens starting from single characters. But do we remove the single characters from our vocabulary after merging them? The final vocab consists of only 3 tokens, 'low', 'est', and '</w>'; should it not still contain the original characters as well?
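To make the vocabulary question concrete, here is a minimal sketch of the standard BPE merge loop (in the style of Sennrich et al., not necessarily the article's exact implementation). The toy corpus, word frequencies, and number of merges are made up for illustration. The point it demonstrates: the base characters stay in the vocabulary, and each merge adds exactly one new token on top of them, so the vocabulary grows by one per merge rather than shrinking.

```python
import re

def get_pair_counts(corpus):
    """Count frequencies of adjacent symbol pairs across all words."""
    pairs = {}
    for word, freq in corpus.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i + 1])
            pairs[pair] = pairs.get(pair, 0) + freq
    return pairs

def merge_pair(pair, corpus):
    """Rewrite the corpus, replacing every occurrence of `pair` with the merged symbol."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    merged = {}
    for word, freq in corpus.items():
        new_word = pattern.sub(''.join(pair), word)
        merged[new_word] = merged.get(new_word, 0) + freq
    return merged

# Toy corpus: each word is pre-split into characters plus an end-of-word marker.
# The word counts (5 and 2) are arbitrary, chosen only for illustration.
corpus = {'l o w </w>': 5, 'l o w e s t </w>': 2}

# The base vocabulary contains every individual character (plus </w>);
# these entries are never removed by BPE.
vocab = set(sym for word in corpus for sym in word.split())

num_merges = 5  # each merge adds exactly one new token to the vocabulary
for _ in range(num_merges):
    pairs = get_pair_counts(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    corpus = merge_pair(best, corpus)
    vocab.add(''.join(best))  # merged token joins the vocab alongside its parts

print(sorted(vocab))
```

Running this prints both the original single characters ('l', 'o', 'w', ...) and the merged tokens (e.g. 'low', 'low</w>'): 7 base symbols plus 5 merges gives 12 vocabulary entries. That is why performing 20 merges yields 20 additional tokens on top of the base character set.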
This article left me even more confused than I was before.
Extremely well written. Covers all the points and gives a very good idea of tokenisation.