In the initial example you started with 7 distinct tokens and ended up with 3 distinct tokens. Later on, you said you need to perform 20 merges through the BPE algorithm to add 20 additional tokens.
So in one case the vocab size decreased, while in the other it increased? How?
I read in another source that the merged tokens are still retained in the vocabulary, so your first example is misleading.
In Step 2, we iteratively merge tokens starting from single characters. But do we remove the single characters from our vocabulary after merging them? The final vocab consists of only 3 tokens, 'low', 'est', and '</w>'; should it not still contain the original characters as well?
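To make the vocabulary question concrete, here is a minimal sketch of the standard BPE merge loop (in the style of Sennrich et al., not necessarily the article's exact implementation). The toy corpus, word frequencies, and number of merges are made up for illustration. The point it demonstrates: the base characters stay in the vocabulary, and each merge adds exactly one new token on top of them, so the vocabulary grows by one per merge rather than shrinking.

```python
import re

def get_pair_counts(corpus):
    """Count frequencies of adjacent symbol pairs across all words."""
    pairs = {}
    for word, freq in corpus.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i + 1])
            pairs[pair] = pairs.get(pair, 0) + freq
    return pairs

def merge_pair(pair, corpus):
    """Rewrite the corpus, replacing every occurrence of `pair` with the merged symbol."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    merged = {}
    for word, freq in corpus.items():
        new_word = pattern.sub(''.join(pair), word)
        merged[new_word] = merged.get(new_word, 0) + freq
    return merged

# Toy corpus: each word is pre-split into characters plus an end-of-word marker.
# The word counts (5 and 2) are arbitrary, chosen only for illustration.
corpus = {'l o w </w>': 5, 'l o w e s t </w>': 2}

# The base vocabulary contains every individual character (plus </w>);
# these entries are never removed by BPE.
vocab = set(sym for word in corpus for sym in word.split())

num_merges = 5  # each merge adds exactly one new token to the vocabulary
for _ in range(num_merges):
    pairs = get_pair_counts(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    corpus = merge_pair(best, corpus)
    vocab.add(''.join(best))  # merged token joins the vocab alongside its parts

print(sorted(vocab))
```

Running this prints both the original single characters ('l', 'o', 'w', ...) and the merged tokens (e.g. 'low', 'low</w>'): 7 base symbols plus 5 merges gives 12 vocabulary entries. That is why performing 20 merges yields 20 additional tokens on top of the base character set.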
This article left me even more confused than I was before.
Extremely well written. Covers all the points and gives a very good idea of tokenisation.