OpenAI has released a statement regarding the copyright infringement lawsuit filed by the New York Times (NYT), in which the newspaper accuses OpenAI and Microsoft of using millions of NYT articles without a license to train their AI models.
In the statement, OpenAI repeats its accusation that the New York Times deliberately manipulated prompts to provoke copyright infringement.
OpenAI also reiterates its position that training AI models with publicly available Internet material is fair use.
Large AI models learn from the “enormous aggregate of human knowledge” and any training content is only a “tiny” contribution to the model’s performance. In OpenAI’s view, the New York Times articles are “not significant” for training AI models like GPT-4.
The NYT believes that its content was intentionally overweighted in the training material to improve the quality of text generation. Since OpenAI has not disclosed the training data for its latest models, this claim cannot be verified.
OpenAI wants to be a “good citizen”
From OpenAI's perspective, the opt-out mechanism offered to publishers to prevent OpenAI tools from accessing their sites is a concession, made because it's "the right thing to do." For OpenAI, being a "good citizen" is more important than insisting on its rights, it says.
The memorization or “regurgitation” of content by its LLMs, as demonstrated by the New York Times in the lawsuit, is a “rare bug” in the learning process that the company is working to fix. “Much progress” has already been made with recent models, OpenAI says.
Provoking this bug through deliberate prompts is an intentional violation of OpenAI’s terms of service, the company claims. This statement was previously made by Tom Rubin, OpenAI’s head of intellectual property and content.
New York Times allegedly acted in a non-transparent manner
The New York Times is also “not telling the full story,” OpenAI claims. The failed negotiations between the NYT and OpenAI concerned the display of real-time content in ChatGPT.
The NYT had mentioned “along the way” that OpenAI’s language models could generate verbatim copies of its work, but would not show any examples, even after repeated requests. OpenAI only learned of the lawsuit through the NYT article, which came as a “surprise and disappointment.”
The examples now cited in the complaint include articles that are several years old and can also be found on various websites. In addition, the prompts were deliberately manipulated with precise article patterns to provoke memorization, OpenAI says.
But even in this scenario, its models would not typically generate article copies, OpenAI claims. It assumes that the New York Times instructed the model to generate copies of the articles, or “cherry-picked their examples from many attempts.”
OpenAI points to its efforts to support news organizations, citing partnerships with the Associated Press, Axel Springer, the American Journalism Project, and NYU to develop products to assist reporters and editors, train AI models with historical content, and display content in real-time with sources in ChatGPT.
OpenAI, Google, and Apple are said to be in talks with numerous publishers about using content for real-time display and AI training. Additional OpenAI partnerships are expected to be announced soon.
It’s not just publishers who are suing
Besides publishers, authors, programmers, and artists are also suing generative AI providers. The allegations are largely the same: AI models were trained without explicit consent on the work of people whom they may replace in the future.
According to OpenAI, ChatGPT is not a replacement for the services provided by the New York Times. OpenAI says it still hopes for a constructive partnership in the future.