You're severely underestimating how much content is on the internet and how hard it would be to see and index it all. Chat OpenAI used common crawl dataset, which is already pretty unwieldy and represents an amalgamation data gathered over several years by many crawlers.