no code implementations • 30 May 2024 • Zachary Ankner, Cody Blakeney, Kartik Sreenivasan, Max Marion, Matthew L. Leavitt, Mansheej Paul
In this work, we investigate whether small language models can determine high-quality subsets of large-scale text datasets that improve the performance of larger language models.
no code implementations • 8 Sep 2023 • Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, Sara Hooker
In this work, we take a wider view and explore scalable estimates of data quality that can be used to systematically measure the quality of pretraining data.