
It was trained with "1 trillion tokens from large-scale open-source pretraining datasets, including RefinedWeb, Pile, Github data, etc."

I guess it's good that they mentioned some of it, but that isn't especially helpful when they claim it's 100% open source.

I'm not sure why they feel the need to be so secretive if all of the sources are open.



"etc." is the most important part here. There is NL SFT and code SFT data which guessing by the names are instruction data very likely from GPT-4. It is known in finetuning community that training with GPT-4 data is the easiest way of improving the model. If that's the case base JetMoE should be compared to finetuned llama, not base llama.


The Pile includes Books3. Is there a test prompt to check whether those books are present in the training data?
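
One crude way to probe this is a memorization check: prompt the model with the opening of a distinctive book passage and see whether greedy decoding reproduces the continuation verbatim. Below is a minimal sketch assuming a Hugging Face-style checkpoint; the model id "jetmoe/jetmoe-8b" and the probe passage are placeholders, not something from the thread.

    # Memorization probe sketch: compare the model's greedy continuation
    # of a book excerpt against the known next words.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "jetmoe/jetmoe-8b"  # assumed repo id; substitute as needed
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Split a distinctive passage into a prompt and the expected continuation.
    prompt = "It was the best of times, it was the worst of times, it was the age of"
    expected_continuation = "wisdom, it was the age of foolishness"

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=20,
        do_sample=False,  # greedy decoding makes verbatim recall easier to spot
    )
    completion = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True,
    )

    print("model:", completion)
    print("verbatim match:", completion.strip().startswith(expected_continuation))

Note the caveat: a public-domain passage like the one above wouldn't distinguish Books3 from other sources, so a real probe would use distinctive excerpts from books believed to appear only in Books3, and even then verbatim recall is suggestive rather than conclusive evidence of membership.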



