Our site, BeInCrypto, is included in a dataset used to train and optimize AI tools similar to ChatGPT, according to a recent analysis.
BeInCrypto is included in a huge AI training dataset called C4 (Colossal Clean Crawled Corpus). Recently, The Washington Post and the Allen Institute for Artificial Intelligence studied Google’s C4 dataset to determine which websites feed AI tools. Many large language models have been trained on C4. However, OpenAI’s ChatGPT does not use this dataset.
Datasets such as C4 are built by “scraping” the Internet for content, which is then used to train large language models. The breadth of the dataset allows the AI to mimic human speech.
The newspaper categorized the C4 websites using data from the web analytics company Similarweb. Next, it ranked the top 10 million websites by the number of “tokens” they contributed. Tokens are short pieces of text, usually a word or phrase, that are used to make sense of unstructured data.
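The ranking step can be sketched in a few lines of Python. This is a minimal illustration, not the method The Washington Post used: the function names, the sample documents, and the whitespace-based token count are all assumptions made for the example. Real tokenizers typically split text into sub-word pieces rather than whole words.

```python
from collections import Counter
from urllib.parse import urlparse


def count_tokens(text: str) -> int:
    """Approximate a token count by splitting on whitespace (an assumption)."""
    return len(text.split())


def rank_sites_by_tokens(documents):
    """Sum token counts per host name, largest contributor first."""
    totals = Counter()
    for url, text in documents:
        host = urlparse(url).netloc.lower()
        totals[host] += count_tokens(text)
    return totals.most_common()


# Hypothetical sample documents, invented for illustration only.
sample_documents = [
    ("https://example.org/a", "one two three four"),
    ("https://example.com/b", "one two"),
    ("https://example.org/c", "five six seven"),
]

ranking = rank_sites_by_tokens(sample_documents)
for site, tokens in ranking:
    print(site, tokens)
```

Grouping by host name means every page scraped from the same site adds to a single total, which is what allows one site to be compared against another.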
The three largest contributors to the dataset are patents.google.com, wikipedia.org, and scribd.com, a subscription-based digital library. News organizations dominated the rest of the top 10, including The Guardian, New York Times, Forbes, Los Angeles Times and Huffington Post.
C4 First Scraped Data in 2019
Other highly ranked websites include Instructables, an online platform for sharing DIY instructions and how-tos. The researchers also found at least 27 sites identified by the United States government as markets for piracy and counterfeiting.
C4 began life as a single scrape performed by the non-profit organization CommonCrawl in 2019. The organization told The Washington Post that it tries to prioritize quality and trustworthy websites but does not attempt to avoid licensed or copyrighted material. In addition, its data is free to use and analyze.
With AI technology still threatening many industries, content scraping for large language models is becoming increasingly controversial, particularly in the sectors most at risk from AI. AI training companies do not compensate content creators for the use of their work. On top of that, artists recently hit AI image tools Midjourney and Stable Diffusion with a copyright lawsuit. The lawsuit alleges that the AI art tools violate copyright law by scraping artists’ works without their consent. Experts expect more action against online scraping to follow.