For more than 20 years, Kit Loffstadt has written fan fiction exploring alternate universes for “Star Wars” heroes and “Buffy the Vampire Slayer” villains, sharing her stories for free online.
But in May, Ms. Loffstadt stopped posting her creations after she learned that a data company had copied her stories and fed them into the artificial intelligence technology underlying ChatGPT, the viral chatbot. Appalled, she hid her writing behind a locked account.
Ms. Loffstadt also helped organize an act of rebellion against AI systems last month. Along with dozens of other fan fiction writers, she published a flood of irreverent stories online to overwhelm and confuse the data-collection services that feed writers’ work into AI technology.
“We must all do what we can to show them that the fruits of our creativity are not for machines to harvest as they please,” said Ms. Loffstadt, a 42-year-old voice actor from South Yorkshire in the United Kingdom.
Fan fiction writers are just one group now staging revolts against AI systems as the technology fever grips Silicon Valley and the world. In recent months, social media companies like Reddit and Twitter, news organizations including The New York Times and NBC News, and authors like Paul Tremblay and the actress Sarah Silverman have all taken a stand against AI companies sucking up their data without permission.
Their protests have taken different forms. Writers and artists are locking their files to protect their work or boycotting certain websites that publish AI-generated content, while companies like Reddit want to charge for access to their data. At least 10 lawsuits have been filed this year against AI companies, accusing them of training their systems on artists’ creative work without consent. This past week, Ms. Silverman and the authors Christopher Golden and Richard Kadrey sued OpenAI, the creator of ChatGPT, and others over AI’s use of their work.
At the heart of the uprisings is a newfound understanding that online information—stories, artwork, news articles, message board posts, and photos—can have significant untapped value.
The new wave of AI — known as “generative AI” because of the text, images and other content it generates — is built on complex systems like large language models that are capable of producing human-like prose. These models are trained on vast amounts of all kinds of data so that they can answer people’s questions, mimic writing styles or churn out comedy and poetry.
That has set off a hunt by tech companies for even more data to feed their AI systems. Google, Meta and OpenAI have essentially used information from all over the internet, including large databases of fan fiction, troves of news articles and collections of books, much of which was available free online. In tech industry parlance, this is known as “scraping” the internet.
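Concretely, “scraping” just means programmatically fetching pages and stripping them down to their usable text. The sketch below, built only on Python’s standard library, shows the core of the idea on a hard-coded snippet of HTML; the page content is invented for illustration, and real crawlers add distributed fetching, robots.txt handling and deduplication on top of this.

```python
# Toy illustration of web "scraping": reduce a page of HTML to the
# visible text a language model might be trained on. The sample page
# below is invented; a real scraper would fetch it over the network.
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text found outside of <script> and <style> tags."""

    def __init__(self):
        super().__init__()
        self.skip = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = False

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.chunks.append(data.strip())


def scrape_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)


page = "<html><body><h1>A story</h1><p>Once upon a time.</p></body></html>"
print(scrape_text(page))  # "A story Once upon a time."
```

Run at internet scale across fan fiction archives, news sites and book collections, this kind of extraction is what turns the open web into training data.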
OpenAI’s GPT-3, an AI system released in 2020, was trained on 500 billion “tokens,” each representing a piece of a word found mostly online. Some AI models are trained on more than a trillion tokens.
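What a “token” is can be illustrated with a toy tokenizer: a greedy longest-match split of text into subword pieces. The tiny vocabulary below is hand-picked for the example; production systems learn vocabularies of tens of thousands of pieces with algorithms like byte-pair encoding, not this invented list.

```python
# Toy illustration of "tokens": split text into subword pieces by
# greedily matching the longest vocabulary entry at each position.
# The vocabulary here is invented purely for this example.
VOCAB = {"fan", "fic", "tion", "writ", "er", "s", " "}


def tokenize(text: str) -> list[str]:
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: emit it as its own token.
            tokens.append(text[i])
            i += 1
    return tokens


print(tokenize("fanfiction writers"))
# ['fan', 'fic', 'tion', ' ', 'writ', 'er', 's']
```

Counting pieces like these, rather than whole words, is how figures such as “500 billion tokens” are tallied.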
The practice of internet scraping has been around for a long time and was largely disclosed by the companies and nonprofit organizations that did it. But it was not well understood or seen as especially problematic by the companies that owned the data. That changed after ChatGPT debuted in November and the public learned more about the underlying AI models that power such chatbots.
“What’s happening here is a fundamental realignment of the value of data,” said Brandon Duderstadt, the founder and CEO of Nomic, an AI company. “Before, the thinking was that you got value out of data by making it open to everyone and publishing ads. Now, the thinking is that you lock in your data because you can extract so much more value when you use it as input to your AI.”
The data protests may have little effect in the long run. Deep-pocketed tech giants like Google and Microsoft are already sitting on mountains of proprietary information and have the resources to license more. But as the era of easily scraping content ends, smaller AI startups and nonprofits that hoped to compete with the big firms may not be able to get enough content to train their systems.
In a statement, OpenAI said ChatGPT was trained on “approved content, publicly available content, and content created by human AI trainers.” It added, “We respect the rights of creators and authors, and look forward to continuing to work with them to protect their interests.”
Google said in a statement that it was involved in talks about how publishers might manage their content in the future. “We believe everyone benefits from a vibrant content ecosystem,” the company said. Microsoft did not respond to a request for comment.
The data protests erupted last year after ChatGPT became a global phenomenon. In November, a group of programmers filed a proposed class-action lawsuit against Microsoft and OpenAI, claiming the companies had infringed their copyright after their code was used to train an AI-powered programming assistant.
In January, Getty Images, which provides stock photos and videos, sued Stability AI, an AI company that creates images from text descriptions, alleging that the startup used copyrighted photos to train its systems.
Then in June, Clarkson, a law firm in Los Angeles, filed a 151-page proposed class-action lawsuit against OpenAI and Microsoft, describing how OpenAI had collected data from minors and arguing that internet scraping violated copyright law and constituted “theft.” On Tuesday, the firm filed a similar lawsuit against Google.
“The data rebellion that we’re seeing across the country is society’s way of pushing back against this idea that Big Tech is just allowed to take any and all information from any source and make it their own,” said Ryan Clarkson, the founder of Clarkson.
Eric Goldman, a professor at Santa Clara University School of Law, said the lawsuit’s arguments were expansive and unlikely to be accepted by the court. But the wave of litigation is just beginning, he said, with a “second and third wave” coming that will define AI’s future.
Bigger companies are also pushing back against AI scrapers. In April, Reddit said it wanted to charge for access to its application programming interface, or API, the method by which third parties can download and analyze the social network’s vast database of person-to-person conversations.
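The difference from scraping is that API access is authenticated: each request carries a credential, which is what lets a platform meter usage and bill for data. A minimal sketch, with an invented endpoint, header values and credential (this is not Reddit’s actual API contract):

```python
# Hedged sketch of authenticated API access. Unlike an anonymous
# scraper, the request below identifies its caller, so the platform
# can count requests and charge for them. Endpoint and key are invented.
from urllib.request import Request


def build_api_request(endpoint: str, api_key: str) -> Request:
    """Build a request that a platform can attribute to a paying caller."""
    return Request(
        endpoint,
        headers={
            "Authorization": f"Bearer {api_key}",  # identifies the caller
            "User-Agent": "example-data-client/0.1",  # many APIs require one
        },
    )


req = build_api_request("https://api.example.com/r/news/top.json", "MY_KEY")
print(req.get_header("Authorization"))  # "Bearer MY_KEY"
```

Sending the request (for example with `urllib.request.urlopen`) is omitted here; the point is that the credential, not the download itself, is the hook for charging.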
Steve Huffman, Reddit’s chief executive, said at the time that his company didn’t “need to give all of that value to some of the biggest companies in the world for free.”
That same month, Stack Overflow, a question-and-answer site for computer programmers, said it, too, would ask AI companies to pay for data. The site has nearly 60 million questions and answers. Its move was reported earlier by Wired.
News organizations are also resisting AI systems. In an internal memo on the use of generative AI in June, The Times said AI companies must “respect our intellectual property.” A Times spokesman declined to elaborate.
For individual artists and writers, fighting AI systems has meant rethinking where they publish.
Nicholas Kole, 35, an illustrator in Vancouver, British Columbia, was concerned by how his distinct art style could be replicated by an AI system and suspected that the technology was scraping his work. He plans to continue posting his creations to Instagram, Twitter and other social media to attract customers, but he has stopped publishing on sites like ArtStation, which display AI-generated content alongside human-made work.
“It just feels like mindless theft from me and other artists,” Mr. Kole said. “It puts a pit of existential dread in my stomach.”
At Archive of Our Own, a fan fiction database with more than 11 million stories, writers have increasingly pressured the site to ban data scraping and AI-generated stories.
In May, when some Twitter accounts shared examples of ChatGPT mimicking the style of popular fan fiction posted on Archive of Our Own, dozens of writers rose up in arms. They locked their stories and wrote subversive content to mislead the AI scrapers. They also pushed the leaders of Archive of Our Own to stop allowing AI-generated content.
Betsy Rosenblatt, who provides legal advice to Archive of Our Own and is a professor at the University of Tulsa College of Law, said the site had a policy of “maximum inclusivity” and did not want to be in the position of deciding which stories had been written with AI.
For Ms. Loffstadt, the fan fiction writer, the fight against AI came while she was writing a story about “Horizon Zero Dawn,” a video game where humans battle AI-powered robots in a post-apocalyptic world. In the game, she said, some of the robots were good and some were bad.
But in the real world, she said, “thanks to hubris and corporate greed, they’re twisted to do bad things.”