Why Copyright Matters for Open Educational AI Projects
By early 2026, approximately 75 copyright lawsuits have been filed against AI companies, according to the Copyright Alliance. The cases span text, images, music, and video, with major publishers, news organizations, record labels, and film studios collectively claiming that AI companies used copyrighted works without permission to train their models. Settlement amounts have reached into the billions, and courts remain divided on whether training datasets built from copyrighted works fall under fair use protections.
For educational AI developers working on open models and datasets, the message is clear: while you need substantial, diverse text collections to create effective systems, relying on copyrighted material without permission carries significant ethical – and potentially legal – risks. Even if fair use might apply to your use case in theory, courts evaluate claims case by case, and outcomes are difficult to predict. Stanford Libraries’ copyright guidelines warn that litigation costs can far exceed any benefit from using the material – even if you ultimately win.
Building open models and datasets requires starting with materials you can legally share. Open-licensed educational texts provide that foundation: they come with explicit permissions that allow redistribution, and sometimes modification and commercial use (depending on the specific license). Below are some key sources for building educational AI datasets while respecting intellectual property rights.
Open Text Sources: What's Available
The open-licensed educational text ecosystem is unevenly distributed across genres. Textbooks and informational texts are abundant through initiatives like OpenStax and LibreTexts, and expository reading passages for language learners and elementary students are readily available. Overall, the pool of freely available informational text has grown steadily over the past decade.
On the other hand, contemporary fiction remains notably limited in open collections. While public domain literature is freely available through archives like Project Gutenberg, creative writing from the last few decades is largely absent from open repositories. Fiction writers and publishers have been more reluctant to release work under open licenses than their nonfiction counterparts. For writers, Creative Commons licensing for fiction can be a gamble whose benefits are uncertain and may take years to materialize. Traditional publishers, meanwhile, fear that CC licensing will hurt their bottom line, and distribution platforms often refuse CC-licensed ebooks because vendors worry readers won’t pay for books that are freely available online.
This gap matters for educational AI systems that require broad coverage of genres and writing styles. Training datasets built only from expository texts, for example, may leave AI systems unprepared for the narrative structures, dialogue, and literary devices students encounter in fiction.
Some developers turn to large web crawl datasets like Common Crawl as an alternative. However, these massive archives require extensive filtering to remove problematic content and may present significant copyright concerns. For researchers building redistributable educational corpora, the legal ambiguity and intensive curation needs make web crawl datasets a risky foundation.
A more reliable, large-scale solution likely requires partnership approaches, similar to how CommonLit collaborated with researchers to create the CLEAR corpus. And while some datasets use snippet-based approaches (COCA, for instance, removes 10 words out of every 200), many researchers developing redistributable corpora prefer not to test fair use boundaries in educational applications.
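A COCA-style snippet scheme can be sketched in a few lines: drop the last stretch of words in each fixed-size window and substitute a placeholder token. The function below is an illustrative sketch, not COCA’s actual implementation; the function name and placeholder are assumptions.

```python
def redact_snippets(text: str, window: int = 200, gap: int = 10,
                    placeholder: str = "@") -> str:
    """Remove `gap` words from every `window`-word span of `text`,
    replacing each removed word with `placeholder`.

    Illustrative sketch of a COCA-style snippet scheme; not COCA's
    actual implementation.
    """
    words = text.split()
    out = []
    for i, word in enumerate(words):
        # keep the first (window - gap) words of each window;
        # redact the remainder
        if i % window < window - gap:
            out.append(word)
        else:
            out.append(placeholder)
    return " ".join(out)
```

On a 400-word text this leaves two 10-word gaps, one at the end of each 200-word window – enough degradation that the full work cannot be reconstructed, while most of the text remains usable for analysis.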
License choice adds another layer of complexity. CC BY (Attribution) requires only source credit, making it a good choice for redistributable datasets. NonCommercial (NC) licenses, however, are riskier because the NC restriction applies to any act requiring copyright permission, including reproduction during training. “Commercial use” was not defined with AI in mind, and because its definition remains contested, even using a model trained on NC-licensed data can be legally ambiguous.
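One way to operationalize this caution when assembling a redistributable corpus is a conservative allowlist of license tags. The sketch below accepts only CC0, public-domain, CC BY, and CC BY-SA markers; the tag spellings are assumptions about how a dataset might label its items.

```python
# Conservative license allowlist for a redistributable dataset.
# Tag spellings are assumptions about how items might be labeled.
ALLOWED_LICENSES = {"cc0", "public-domain", "cc-by", "cc-by-sa"}

def is_redistributable(license_tag: str) -> bool:
    """Return True only for license tags on the allowlist.

    NC and ND variants are rejected by omission: the scope of
    "NonCommercial" in AI contexts remains contested.
    """
    tag = license_tag.strip().lower().replace(" ", "-")
    return tag in ALLOWED_LICENSES
```

An allowlist (rather than a blocklist of known-bad tags) fails closed: unfamiliar or ambiguous license strings are excluded by default.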
As more web content becomes AI-generated, these curated, open-licensed collections become increasingly valuable as sources of human-generated language. Despite the fiction gap, the resources below provide textbooks, informational passages, public domain literature, and some contemporary elementary fiction – enough to support many educational AI applications while researchers and organizations work toward more comprehensive and coordinated solutions to expand open-licensed collections.
Open Educational Resources
OER Commons
A comprehensive digital library of over 50,000 open educational resources across multiple subjects and grade levels. The platform offers robust filtering by educational standards, grade level, and resource type, making it relatively straightforward to locate materials matching specific project needs. Content includes full university courses, K-12 lesson plans and worksheets, open textbooks, and curated collections aligned to Common Core and other standards.
Use Cases:
- Building subject-specific training corpora with grade-level metadata
- Sourcing Common Core-aligned English Language Arts materials
- Finding K-12 lesson plans and activities for context-aware AI tools
OpenStax
A nonprofit publisher of high-quality, peer-reviewed open textbooks spanning K-12 and college-level courses. The collection covers mathematics, science, social studies, humanities, and career readiness, with materials aligned to standard course scope and sequence requirements. Books are updated biannually and undergo the same rigorous editorial process as traditional publishers.
Use Cases:
- Providing searchable, domain-specific textbook context for AI tutors
- Building question generators for STEM subjects
- Creating benchmark datasets for content accuracy evaluation
Global Digital Library
Reading resources for language learners and developing readers, all under Creative Commons or similar licenses, with materials categorized by reading level. While this resource does provide contemporary fiction, its breadth is limited: it includes only elementary-level texts.
Use Cases:
- Developing reading comprehension assessments for early learners
- Training text complexity analyzers for elementary content
Note: Organizations contributing to GDL – including Book Dash, Let’s Read, Storyweaver, and African Storybook – maintain their own platforms with additional multilingual and culturally diverse content.
LibreTexts
A collaborative open educational resource platform offering peer-reviewed textbooks and learning materials across 12 disciplines. Content uses various open licenses, including CC BY-NC-SA and CK-12’s custom open license (which prohibits AI training).
Use Cases:
- Training models on STEM textbook content
- Building advanced subject-matter tutoring systems
- Creating datasets that span K-12 through undergraduate materials
Public Domain & General Corpora
Project Gutenberg
Over 70,000 books in the public domain, spanning classic literature, historical documents, reference works, and texts in over 60 languages. All content is freely reusable without restriction in the United States (copyright status may differ in other jurisdictions).
Use Cases:
- Training models on canonical literature and historical texts
- Building classical literature comprehension tools
- Creating datasets on English language evolution
Internet Archive
Millions of digitized texts with varying license statuses, including substantial public domain collections and curated sets like the Children’s Library. Requires verification of licensing status for each item, but offers scale and breadth.
Use Cases:
- Accessing historical educational materials and textbooks
- Building datasets requiring specialized or niche content
- Tracking evolution of educational content over time
Wikipedia/Simple Wikipedia
Massive, open-licensed, multilingual corpus that many NLP models use. Pre-processed Wikipedia datasets are readily available through platforms like HuggingFace Datasets, which provide cleaned, ready-to-use text without requiring XML parsing or markup stripping.
Use Cases:
- Building knowledge base systems
- Creating datasets with consistent factual framing
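As a sketch of what loading one of those pre-processed datasets looks like, the snippet below streams articles with the `datasets` library rather than downloading a full dump. The dataset name and snapshot date are assumptions – check the HuggingFace hub for current versions.

```python
def article_lead(text: str) -> str:
    """Return the first paragraph of a plain-text Wikipedia article."""
    return text.split("\n\n", 1)[0].strip()

def stream_sample(n: int = 3) -> None:
    """Print titles and lead paragraphs for the first `n` articles.

    Requires the `datasets` library and network access. The dataset
    name and snapshot date below are assumptions; check the
    HuggingFace hub for current versions.
    """
    from datasets import load_dataset

    # streaming=True iterates over the dump without downloading it all
    wiki = load_dataset("wikimedia/wikipedia", "20231101.en",
                        split="train", streaming=True)
    for article in wiki.take(n):
        print(article["title"], "->", article_lead(article["text"])[:80])
```

Calling `stream_sample()` prints a few titles and lead paragraphs; swapping the config string (e.g. `"20231101.simple"`) switches to Simple English Wikipedia, assuming that snapshot exists on the hub.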
