1 code implementation • 17 Oct 2023 • Avijit Thawani, Saurabh Ghanekar, Xiaoyuan Zhu, Jay Pujara
Language models typically tokenize text into subwords, using a deterministic, hand-engineered heuristic of combining characters into longer surface-level strings such as 'ing' or whole words.