
BPE is not set to a certain token length, but to a target vocabulary size. It starts with bytes (or characters) as the basic units into which everything is split, and iteratively merges units (choosing the most frequent pair) until the vocabulary size is reached; a toy version of the merge loop is sketched below the links. Even 'old' BPE models contain plenty of full-word tokens. E.g. RoBERTa:

https://huggingface.co/roberta-base/raw/main/merges.txt

(You have to scroll down a bit to get to the larger merges, and imagine each line with the space removed, which is what the string looks like after that merge.)

Also see GPT-2:

https://huggingface.co/gpt2/raw/main/merges.txt
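
For illustration, here is a minimal, hedged sketch of such a merge loop in Python. It is a toy version of the idea described above (start from characters, repeatedly merge the most frequent adjacent pair until a target vocabulary size is reached), not the actual RoBERTa/GPT-2 training code, and the corpus and vocabulary size are made up:

    # Toy BPE trainer: illustrative only, not the actual GPT-2/RoBERTa implementation.
    from collections import Counter

    def train_bpe(corpus_words, target_vocab_size):
        # Each word starts as a tuple of single characters (bytes in byte-level BPE).
        word_freqs = Counter(corpus_words)
        splits = {w: tuple(w) for w in word_freqs}
        vocab = {c for units in splits.values() for c in units}
        merges = []

        while len(vocab) < target_vocab_size:
            # Count how often each adjacent pair of units occurs in the corpus.
            pair_freqs = Counter()
            for w, freq in word_freqs.items():
                units = splits[w]
                for a, b in zip(units, units[1:]):
                    pair_freqs[(a, b)] += freq
            if not pair_freqs:
                break  # every word is already a single unit
            # Merge the most frequent pair into one new vocabulary entry.
            best = max(pair_freqs, key=pair_freqs.get)
            merges.append(best)
            vocab.add(best[0] + best[1])
            for w in splits:
                units = list(splits[w])
                merged, i = [], 0
                while i < len(units):
                    if i + 1 < len(units) and (units[i], units[i + 1]) == best:
                        merged.append(units[i] + units[i + 1])
                        i += 2
                    else:
                        merged.append(units[i])
                        i += 1
                splits[w] = tuple(merged)
        return merges

    # With enough merges, frequent words end up as a single unit.
    merges = train_bpe(["low", "low", "lower", "newest", "widest"], target_vocab_size=30)
    print(merges)  # e.g. [('l', 'o'), ('lo', 'w'), ...]

Each line of a real merges.txt is exactly one such pair, in the order the merges were learned.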

I recently did some statistics. Average number of pieces per token (sampled on fairly large data, these are all models that use BBPE):

RoBERTa base (English): 1.08

RobBERT (Dutch): 1.21

roberta-base-ca-v2 (Catalan): 1.12

ukr-models/xlm-roberta-base-uk (Ukrainian): 1.68

In all these cases, the median token length in pieces was 1.

(Note: I am not disputing that newer OpenAI models use a larger vocab. I just want to show that older BBPE models didn't split tokens into ~3-character pieces; for most tokens they used a single piece.)
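
For reference, here is roughly how such a pieces-per-token statistic could be computed with a Hugging Face fast tokenizer; the exact procedure behind the numbers above is not specified, and the sample text is made up. "Token" here means a whitespace-level word, "piece" a subword unit:

    # Sketch: average/median number of subword pieces per word for a BBPE tokenizer.
    from collections import Counter
    from statistics import mean, median

    from transformers import AutoTokenizer  # assumes a "fast" tokenizer is available

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")

    def pieces_per_word(text):
        enc = tokenizer(text, add_special_tokens=False)
        # word_ids() maps every subword piece back to the word it came from.
        counts = Counter(i for i in enc.word_ids() if i is not None)
        return list(counts.values())

    sample = "Tokenization statistics are easy to eyeball on a small sample."
    lengths = pieces_per_word(sample)
    print(mean(lengths), median(lengths))

Run over a large corpus instead of one sentence, this gives the kind of average and median reported above.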


