
It was trained with "1 trillion tokens from large-scale open-source pretraining datasets, including RefinedWeb, Pile, Github data, etc."

I guess it's good that they mentioned some of it, but that isn't especially helpful when they claim it's 100% open source.

I'm not sure why they feel the need to be so secretive if all of the sources are open.



"etc." is the most important part here. There is NL SFT and code SFT data which guessing by the names are instruction data very likely from GPT-4. It is known in finetuning community that training with GPT-4 data is the easiest way of improving the model. If that's the case base JetMoE should be compared to finetuned llama, not base llama.


The Pile includes Books3. Is there a test prompt to check whether those books are present in the training data?
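
One crude way to probe this is a memorization check: prompt the model with the opening of a distinctive book passage and see whether greedy decoding reproduces the continuation verbatim. Below is a minimal sketch assuming a Hugging Face-style checkpoint; the model id "jetmoe/jetmoe-8b" and the probe passage are placeholders, not something from the thread.

    # Memorization probe sketch: compare the model's greedy continuation
    # of a book excerpt against the known next words.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "jetmoe/jetmoe-8b"  # assumed repo id; substitute as needed
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Split a distinctive passage into a prompt and the expected continuation.
    prompt = "It was the best of times, it was the worst of times, it was the age of"
    expected_continuation = "wisdom, it was the age of foolishness"

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=20,
        do_sample=False,  # greedy decoding makes verbatim recall easier to spot
    )
    completion = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True,
    )

    print("model:", completion)
    print("verbatim match:", completion.strip().startswith(expected_continuation))

Note the caveat: a public-domain passage like the one above wouldn't distinguish Books3 from other sources, so a real probe would use distinctive excerpts from books believed to appear only in Books3, and even then verbatim recall is suggestive rather than conclusive evidence of membership.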



