OpenAI will now allow website operators to block its web crawler by updating their site's robots.txt file or by directly blocking the IP address for OpenAI's GPTBot. Either technique will ensure that a site is not scraped for AI training by OpenAI. This is an obvious approach; I wrote about the need for it back in February while wondering if .
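For the robots.txt route, here is a minimal sketch of what the opt-out might look like and how you could sanity-check it. It assumes the crawler identifies itself with the user-agent token "GPTBot" and honors a site-wide Disallow rule; the helper name is_allowed and the example.com URL are placeholders of mine, not anything published by OpenAI. The check uses Python's standard urllib.robotparser, which only tells you what a well-behaved crawler should do, not what any crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: block the GPTBot user agent from the whole site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper for illustration; it parses the rules from a string
    instead of fetching them over the network.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com is a placeholder domain.
gptbot_allowed = is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/article")
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/article")

print("GPTBot allowed:", gptbot_allowed)
print("Other crawler allowed:", other_allowed)
```

Because the rule names only GPTBot, other crawlers are unaffected. Keep in mind that robots.txt is a request, not an enforcement mechanism, which is why blocking by IP address is offered as the second option.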
AI training techniques are the focus of intense debate. OpenAI's GPT models, like many large language models, rely heavily on vast amounts of internet data for training. However, the ethics of sourcing this data – especially without explicit consent – has been a hot topic. Platforms like Reddit and Twitter have already begun pushing back against the unrestricted use of their content by AI companies. Moreover, legal challenges have arisen, with creatives alleging unauthorized use of their works by AI companies.
By allowing sites to opt out, OpenAI is acknowledging the importance of consent in the data collection process. It's a step (albeit a small one) toward a more transparent and ethical AI ecosystem… but what about the data already ingested? ChatGPT is happy to tell you that it has ingested everything it could find prior to September 2021. Who do we see about that? Choose your metaphor: the cat's out of the bag, the genie’s out of the bottle, can’t put the toothpaste back in the tube, etc.
As always, your thoughts and comments are both welcome and encouraged. Just reply to this email. -s
P.S. My segment on Good Day NY this morning was about Kai Cenat and influencer marketing.
ABOUT SHELLY PALMER
Shelly Palmer is the Professor of Advanced Media in Residence at Syracuse University’s S.I. Newhouse School of Public Communications and CEO of The Palmer Group, a consulting practice that helps Fortune 500 companies with technology, media and marketing. He covers tech and business, is a regular commentator on CNN, and is the creator of a popular, free online course.