Artificial intelligence startup Anthropic has been accused of aggressively harvesting data from websites to train its AI systems, potentially violating publishers’ terms of service, the Financial Times reported.
Companies like Anthropic and OpenAI train their large-scale generative AI language models using massive amounts of data from a variety of sources. Anthropic’s chatbot Claude, a competitor to OpenAI’s ChatGPT, can respond to a range of natural language prompts. Anthropic was founded by a group of former OpenAI employees, and its stated goal is to “responsibly develop and maintain advanced AI for the long-term benefit of humanity.”
But the San Francisco-based company doesn’t always live up to that claim, at least if you believe Matt Barry, CEO of Freelancer.com, an online job board where millions of freelancers offer their services. According to the Financial Times, Barry describes Anthropic’s crawler as “by far the most aggressive tool” on his online portal.
Many web publishers are affected.
According to the report, other web publishers are also accusing Anthropic of collecting data from their websites and ignoring their instructions to stop collecting their content. Freelancer.com received 3.5 million visits in four hours from a “web crawler” linked to Anthropic, the Financial Times reported, citing data available to it. Barry told the newspaper that the visits kept increasing even after Freelancer.com tried to deny the crawler’s access requests using standard web protocols for controlling crawlers. He then decided to block traffic from Anthropic’s IP addresses altogether.
“This is egregious scraping that slows down the site for everyone who works on it and ultimately impacts our revenue,” Barry said. Anthropic said it is investigating the issue.
Kyle Wiens, managing director of iFixit.com, a repair guide website, made similar claims to the Financial Times. The site received a million visits from Anthropic’s bots in a 24-hour period. Wiens said iFixit’s terms of service prohibit the use of its data for machine learning. “My first message to Anthropic is: if you’re using this data to train your model, it’s illegal. My second message is: this is not polite on the internet. Crawling is a matter of etiquette.” Websites use a file known as robots.txt to tell crawlers and other bots which content to stay away from. This robots exclusion standard sets out who is allowed to automatically browse a website’s content, a highly subjective matter that is often a subject of contention in the age of AI-powered chatbots like ChatGPT.
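To illustrate how the robots.txt mechanism works, here is a minimal Python sketch of the check a well-behaved crawler might run before fetching a page. It uses the standard library's `urllib.robotparser`. The robots.txt content, the bot name `ExampleAIBot`, and the example.com URLs are hypothetical and are not taken from Freelancer.com, iFixit, or Anthropic.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is shut out completely,
# all other crawlers are only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# The named bot is disallowed everywhere on the site.
ai_allowed = may_fetch(parser, "ExampleAIBot", "https://example.com/guides/1")

# Any other bot falls under the wildcard rules.
other_allowed = may_fetch(parser, "OtherBot", "https://example.com/guides/1")
other_private = may_fetch(parser, "OtherBot", "https://example.com/private/x")

print(ai_allowed, other_allowed, other_private)
```

The check happens entirely on the crawler's side. Nothing in robots.txt technically stops a bot that skips it, which is why a site operator who wants to enforce a ban has to fall back on measures such as blocking IP addresses.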
Controversy over AI training data
Data mining is by no means a new practice, but it has increased dramatically over the past two years as a result of the AI arms race. “Search engines have always done a lot of mining, but training generative AI takes it to a new level,” says Barry. Leading AI companies are competing to develop increasingly powerful and sophisticated language models, and they require massive amounts of data to do so. This also raises the issue of copyright and the use of data to train models. Companies like OpenAI or X frequently collect data to train their AI without asking permission. Microsoft’s AI chief, Mustafa Suleyman, recently argued that there is a social contract that allows the use of content on the internet, including for AI training. He has received a lot of pushback.
Companies are defending themselves in various ways. Reddit has begun blocking various search engines and web crawlers unless they sign a licensing agreement with the online platform. A legal dispute between the American daily newspaper The New York Times and OpenAI is attracting a lot of attention. The newspaper accuses OpenAI of violating copyright law by using thousands of articles to train its language models, and thus building a business at the newspaper’s expense. It is insisting on compensation. In May, OpenAI struck a deal with News Corp., one of the world’s largest publishers, whose titles include The Wall Street Journal, The New York Post, The Sunday Times, and The Daily Telegraph. The deal gives OpenAI access to all of these newspapers’ content. Other media companies, such as Reuters, are now licensing their content for AI training.