France’s competition watchdog has fined Google €250 million ($271 million) for breaching EU intellectual property rules by using online news content to train its AI chatbot, Gemini.
In a statement on Wednesday, the watchdog said that Google's AI-powered chatbot Bard – since rebranded as Gemini – was trained on content from publishers and news agencies without their consent or that of regulators.
"Google linked the use of the content concerned by its artificial intelligence service to the display of protected content", the watchdog said, adding that in doing so Google hindered the ability of publishers and press agencies to negotiate fair prices.
The fine arrives amid a copyright dispute in France over tech companies using online content to make large language models (LLMs) – which has sparked complaints from some of the country's biggest news organisations, including Agence France Presse (AFP).
The dispute appeared to be resolved in 2022 when Google dropped its appeal against an initial €500 million fine issued at the end of a major investigation by the Autorité de la concurrence.
But in Wednesday's statement, France’s competition watchdog said the tech giant had violated four of the seven commitments agreed in the settlement, "failing to respect commitments made in June 2022" and failing to negotiate in "good faith" with news publishers over how much to compensate them for use of their content.
Still, it added that Google has pledged not to contest the facts as part of settlement proceedings and had proposed a series of remedy measures for certain shortcomings.
“Neighbouring Rights”
Many publishers, writers and newsrooms are looking to prevent or at least limit tech companies from scraping their online content without their consent.
Google and other online platforms have been accused for years of making billions from news without sharing the revenue with those who gather it.
To tackle this, the EU created a form of copyright called "neighbouring rights", which allows print media to demand compensation from tech companies for using their content.
France has since been a test case for the rules, and after initial resistance Google and Facebook both agreed to pay some French media for articles shown in web searches. But with AI and LLMs now in the mix, the debate has flared up once again.
Over in the US, the New York Times in 2023 sued Microsoft and OpenAI, the creator of the popular AI chatbot ChatGPT, accusing them of using millions of the newspaper's articles without permission to help train chatbots.
OpenAI has also been sued by multiple authors, lawyers and even comedians, who accuse it of using their copyrighted content without consent to train its AI models.
"Despite established protocols for the purchase and use of personal information, [OpenAI] took a different approach: theft," a class-action lawsuit by a US law firm against the tech giant from last year reads.
“They systematically scraped 300 billion words from the internet – 'books, articles, websites and posts' – including personal information obtained without consent. [They] did so in secret, and without registering as a data broker as required under applicable law."
The Future of AI and Copyright
Experts have warned that the method by which AI firms obtain their data may lead to the work of millions of content creators being stolen, raising questions about the future of creative industries and the ability to tell fact from fiction.
Governments around the world are also taking note of the rapid advancement of AI. The EU parliament recently entered the final stages of passing its landmark “EU AI Act”, which aims to guard against uses of AI that pose an “unacceptable level of risk”.
The act includes a recital on the importance of transparency in ensuring accountability and facilitating the enforcement of copyright. Specifically, it requires providers of general-purpose AI (GPAI) models to draw up and make publicly available a sufficiently detailed summary of the content used to train their models:
“In order to increase transparency on the data that is used in the pre-training and training of general-purpose AI models, including text and data protected by copyright law, it is adequate that providers of such models draw up and make publicly available a sufficiently detailed summary of the content used for training the general-purpose model.
“While taking into due account the need to protect trade secrets and confidential business information, this summary should be generally comprehensive in its scope instead of technically detailed to facilitate parties with legitimate interests, including copyright holders, to exercise and enforce their rights under Union law, for example by listing the main data collections or sets that went into training the model, such as large private or public databases or data archives, and by providing a narrative explanation about other data sources used.”
This provision is poised to spark controversy, but it stands as the most potent copyright-related clause in the Act. Copyright holders are likely to welcome it, whereas tech companies may harbour concerns.
The significance of this copyright provision is that it creates a way of identifying AI-generated content, which could affect its copyrightability in jurisdictions that restrict AI authorship on the grounds that copyright requires a human author.