Wikipedia is giving AI developers its data to fend off bot scrapers

Published: Apr 18, 2025, 10:55 am Updated: Apr 18, 2025, 10:56 am

To address the growing problem of AI-driven web scraping, the Wikimedia Foundation has collaborated with Kaggle, a Google-owned data science platform, to provide an optimized version of Wikipedia content for artificial intelligence research.

This change is intended to provide AI developers with a structured, machine-readable alternative to scraping the live site, which has put a substantial burden on Wikipedia's servers.

Image: Wikipedia open in Safari on a MacBook.

The newly published dataset, now available in both English and French, includes article summaries, infoboxes, image links, and structured article parts in JSON format.

It intentionally omits references, markdown, formatting, and multimedia features to focus on content that aids AI training and analytics. The Wikimedia Foundation says the dataset was created explicitly for machine learning workflows, making it easier for developers to fine-tune, test, and align models.
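For readers curious what working with a structured dump like this might look like, here is a minimal Python sketch. It assumes one JSON object per line and uses made-up field names (`name`, `abstract`, `sections`) purely for illustration; the actual schema published on Kaggle may differ, so check the dataset documentation before relying on any of them.

```python
import json

# Hypothetical record: every field name here is an assumption made for
# illustration, not the dataset's documented schema.
SAMPLE_LINE = json.dumps({
    "name": "Example article",
    "abstract": "A short summary of the article.",
    "infobox": {"type": "example", "founded": "2001"},
    "image_url": "https://example.org/image.jpg",
    "sections": [
        {"title": "History", "text": "First section text."},
        {"title": "Usage", "text": "Second section text."},
    ],
})


def flatten_record(line: str) -> dict:
    """Turn one JSON line into a flat title/summary/text dictionary."""
    record = json.loads(line)
    # Join each section's title and body into one plain-text block.
    body = "\n\n".join(
        f"{section['title']}\n{section['text']}"
        for section in record.get("sections", [])
    )
    return {
        "title": record.get("name", ""),
        "summary": record.get("abstract", ""),
        "text": body,
    }


example = flatten_record(SAMPLE_LINE)
print(example["title"])
print(example["text"])
```

Because each record is already parsed and stripped of markup, a few lines like these can stand in for the HTML fetching and cleaning that a scraper would otherwise need.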

Since January 2024, Wikipedia's bandwidth use has increased by 50%, primarily because automated bots are gathering article content to train large language models. This scraping raises the non-profit's costs and risks degrading the experience for human readers. Wikimedia hopes that offering a consistent, high-quality dataset via Kaggle will discourage developers from using less efficient and potentially harmful scraping methods.

Although Wikipedia's content is freely licensed under Creative Commons, reusing it still requires attribution and compliance with the license terms, a requirement that many AI companies have disregarded in their rush to build large-scale models. Some contributors have also expressed concern that their work may be used to train systems that bypass sources entirely.

Kaggle will host the data, which is supplied through Wikimedia Enterprise, a high-volume reuse service, making it available to both major tech firms and individual academics.

As AI developers continue to seek large datasets for training, this effort represents a practical step toward protecting Wikipedia's infrastructure while maintaining its open-access values.
