@yes_this_time

yes_this_time@lemmy.world · 1 month ago

Yeah, after reading a bit into it, it seems like most of the work is up front: pre-filtering and classifying the data before it hits the model. To your point, the model training part is expensive…

Broadly, though, I think the idea that they are just throwing the kitchen sink into the models without any consideration of source quality isn’t true.

yes_this_time@lemmy.world · 1 month ago

If I’m creating a corpus for an LLM to consume, I feel like I would probably create some data source quality score and drop anything that makes my model worse.
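
To make that concrete, here is a rough sketch of what I mean by a source quality score. Everything in it is made up for illustration: the function names (`doc_quality`, `source_scores`, `filter_corpus`), the three heuristics, and the `min_score` cutoff are my own assumptions, not how any real lab does it. A real pipeline would presumably use trained classifiers and measure the effect on the model itself rather than toy text statistics.

```python
from collections import defaultdict
from statistics import mean


def doc_quality(text: str) -> float:
    """Toy heuristic score in [0, 1]. Hypothetical, not from any real pipeline."""
    words = text.split()
    if not words:
        return 0.0
    # Share of characters that are letters or whitespace (penalizes symbol debris).
    clean_ratio = sum(c.isalpha() or c.isspace() for c in text) / len(text)
    # Share of distinct words (penalizes repetitive spam).
    unique_ratio = len({w.lower() for w in words}) / len(words)
    # Length factor that saturates at 50 words (penalizes tiny fragments).
    length_factor = min(len(words) / 50, 1.0)
    return clean_ratio * unique_ratio * length_factor


def source_scores(docs):
    """Average the per-document scores for each source.

    `docs` is an iterable of (source_name, text) pairs (an assumed format).
    """
    by_source = defaultdict(list)
    for source, text in docs:
        by_source[source].append(doc_quality(text))
    return {source: mean(scores) for source, scores in by_source.items()}


def filter_corpus(docs, min_score=0.3):
    """Keep only documents whose source averages at least `min_score`."""
    docs = list(docs)
    scores = source_scores(docs)
    return [(source, text) for source, text in docs if scores[source] >= min_score]
```

You would call it as `filter_corpus(docs)` on a list of `(source, text)` pairs. The "drop anything that makes my model worse" part is the hard bit this skips: the cutoff here is a guess, whereas in practice you would want to pick it by training with and without a source and comparing results.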