Digital Humanities & AI — Research Project 2026
How we used AI-assisted development to collect, structure, and query publicly available patent data — entirely within the bounds of the law — to support original humanities and social science research.
Patent records are among the richest primary sources available to researchers — they document the history of invention, reveal corporate strategy, trace knowledge diffusion across industries, and capture how ideas travel between institutions and geographies. Yet for most humanities and social science researchers, patent data has remained frustratingly out of reach: too large to navigate manually, too technical to process without programming experience, and too expensive to access through commercial databases.
This project changes that. By combining AI-assisted programming with freely available government bulk data, we built a fully local, queryable patent database from scratch — no subscription, no cloud dependency, no proprietary lock-in. The result is a research tool that any scholar can replicate, extend, and adapt to their own questions.
The project unfolded in four well-defined phases, each accelerated by working iteratively with an AI coding assistant that could draft, debug, and refactor code on demand.
The USPTO publishes weekly grant files and application files as compressed XML archives through its Bulk Data Storage System. We wrote a Python downloader — generated and refined with AI assistance — that fetches specific date ranges, validates checksums, and stores the raw archives locally. No scraping, no terms-of-service violations: the files are served directly by the government for exactly this kind of downstream use.
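The download-and-verify step can be sketched roughly as follows. This is a minimal illustration rather than the project's actual downloader: the function names are hypothetical, the archive URL is supplied by the caller (no real USPTO path is assumed), and SHA-256 is an assumption, since the text above does not say which checksum is used.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_archive(url: str, dest_dir: Path, expected_sha256: str) -> Path:
    """Download one archive; skip the download if a valid copy already exists."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / url.rsplit("/", 1)[-1]
    if target.exists() and sha256_of(target) == expected_sha256:
        return target  # already downloaded and intact

    # Write to a temporary name so an interrupted run never leaves a
    # half-written file under the final name.
    partial = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, partial.open("wb") as out:
        shutil.copyfileobj(response, out)

    actual = sha256_of(partial)
    if actual != expected_sha256:
        partial.unlink()
        raise ValueError(f"checksum mismatch for {target.name}: {actual}")
    partial.replace(target)
    return target
```

Because a file that already exists and passes the checksum is returned as-is, a date-range download that was interrupted can simply be run again.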
USPTO XML files use a complex nested schema that documents patent metadata, claims, abstracts, inventor names, assignee organizations, citation networks, and classification codes. We used Python's standard-library XML parser, with schema-aware extraction logic drafted by the AI assistant, to pull the fields most useful for humanities research: title, abstract, filing date, grant date, inventors, assignee, CPC classification, and cited references. Backward citation counts come directly from each record. Forward citation counts are computed afterwards by inverting those references across the loaded corpus, since a grant cannot list the later patents that cite it.
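A small sketch of the extraction step, using `xml.etree.ElementTree` from the standard library. The element names in the sample are modeled on the grant schema but heavily simplified, so treat them as assumptions and check them against the DTD that ships with the files you download. The helper names (`split_documents`, `text_of`, `extract_record`) are hypothetical. The splitter reflects the fact that a weekly file holds many XML documents back to back, each starting with its own XML declaration.

```python
import xml.etree.ElementTree as ET

# Two tiny stand-in documents, concatenated the way a weekly file is.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<us-patent-grant>
<us-bibliographic-data-grant>
<publication-reference><document-id>
<doc-number>00000001</doc-number><date>20200107</date>
</document-id></publication-reference>
<application-reference><document-id>
<doc-number>16000001</doc-number><date>20180615</date>
</document-id></application-reference>
<invention-title>Example widget</invention-title>
<us-parties><inventors><inventor><addressbook>
<last-name>Doe</last-name><first-name>Jane</first-name>
</addressbook></inventor></inventors></us-parties>
</us-bibliographic-data-grant>
<abstract><p>A widget that <b>does</b> things.</p></abstract>
</us-patent-grant>
<?xml version="1.0" encoding="UTF-8"?>
<us-patent-grant>
<us-bibliographic-data-grant>
<invention-title>Second example</invention-title>
</us-bibliographic-data-grant>
</us-patent-grant>
"""


def split_documents(lines):
    """Yield one XML document at a time from a concatenated file."""
    buffer = []
    for line in lines:
        if line.startswith("<?xml") and buffer:
            yield "".join(buffer)
            buffer = []
        buffer.append(line)
    if buffer:
        yield "".join(buffer)


def text_of(node, path):
    """Return whitespace-normalised text under `path`, or None if absent."""
    found = node.find(path)
    if found is None:
        return None
    return " ".join("".join(found.itertext()).split())


def extract_record(xml_text):
    """Pull a flat dict of research fields out of one grant document."""
    root = ET.fromstring(xml_text)
    biblio = root.find("us-bibliographic-data-grant")
    if biblio is None:
        return None  # not a grant record; skip it
    inventors = [
        f"{text_of(inv, 'addressbook/first-name')} "
        f"{text_of(inv, 'addressbook/last-name')}"
        for inv in biblio.iterfind("us-parties/inventors/inventor")
    ]
    doc_id = "document-id"
    return {
        "patent_number": text_of(biblio, f"publication-reference/{doc_id}/doc-number"),
        "grant_date": text_of(biblio, f"publication-reference/{doc_id}/date"),
        "filing_date": text_of(biblio, f"application-reference/{doc_id}/date"),
        "title": text_of(biblio, "invention-title"),
        "abstract": text_of(root, "abstract"),
        "inventors": inventors,
    }
```

With a real file, `split_documents` can be fed an open text-mode file handle instead of `SAMPLE.splitlines(keepends=True)`, which keeps memory use flat because only one document is held at a time.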
Extracted records are loaded into a local SQLite database — chosen for its zero-configuration setup, broad tool support, and ability to handle tens of millions of rows on commodity hardware. The schema was designed for research queries: normalized tables for patents, inventors, assignees, and citations, with full-text search enabled on titles and abstracts via SQLite's FTS5 extension. The AI assistant was particularly useful here, generating optimized CREATE TABLE statements and bulk-insert routines that reduced load times by an order of magnitude.
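To make the storage step concrete, here is a reduced sketch of such a schema and a batched loader. The table and column names are illustrative assumptions, not the project's actual schema. An assignees table would follow the same pattern as inventors and is omitted for brevity. The sketch also assumes the Python build's SQLite was compiled with FTS5, which is common but not guaranteed.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS patents (
    patent_number TEXT PRIMARY KEY,
    title         TEXT,
    abstract      TEXT,
    filing_date   TEXT,
    grant_date    TEXT
);
CREATE TABLE IF NOT EXISTS inventors (
    inventor_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS patent_inventors (
    patent_number TEXT REFERENCES patents(patent_number),
    inventor_id   INTEGER REFERENCES inventors(inventor_id),
    PRIMARY KEY (patent_number, inventor_id)
);
CREATE TABLE IF NOT EXISTS citations (
    citing_patent TEXT NOT NULL,
    cited_patent  TEXT NOT NULL,
    PRIMARY KEY (citing_patent, cited_patent)
);
-- Full-text index over titles and abstracts; the patent number is stored
-- alongside for joining back, but is not itself tokenised.
CREATE VIRTUAL TABLE IF NOT EXISTS patent_fts
    USING fts5(patent_number UNINDEXED, title, abstract);
"""


def load_patents(conn, records):
    """Insert a batch of records inside a single transaction.

    Assumes each record is loaded once; reloading the same record would
    add a duplicate row to the full-text index.
    """
    rows = [
        (r["patent_number"], r["title"], r["abstract"],
         r["filing_date"], r["grant_date"])
        for r in records
    ]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO patents VALUES (?, ?, ?, ?, ?)", rows
        )
        conn.executemany(
            "INSERT INTO patent_fts (patent_number, title, abstract) "
            "VALUES (?, ?, ?)",
            [(r[0], r[1], r[2]) for r in rows],
        )


def search(conn, query):
    """Full-text search over titles and abstracts, best matches first."""
    return conn.execute(
        "SELECT patent_number, title FROM patent_fts "
        "WHERE patent_fts MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
```

Grouping many rows into one `executemany` call inside one transaction is the usual reason bulk loads into SQLite speed up, because the database commits once per batch instead of once per row.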
To make the database accessible without requiring SQL fluency, we paired it with a large language model served locally through Ollama and a lightweight chat interface. Researchers can ask questions in plain English, such as "Show me all patents assigned to universities in Texas between 1990 and 2010 in the biotechnology classification", and receive structured results. Once the data has been ingested, the entire stack runs offline, which matters for institutional data governance and reproducibility.
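One plausible way to wire a question to the database is to have the model draft SQL and then run it under a read-only guard. The sketch below is an assumption about how such a layer could look, not a description of the project's interface. The function names and prompt wording are hypothetical, and the model name is a placeholder for whichever model has been pulled locally. The endpoint and JSON fields follow Ollama's `/api/generate` REST API on its default local port.

```python
import json
import sqlite3
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "llama3"  # placeholder: any model already pulled into Ollama


def build_prompt(question, schema_sql):
    """Give the model the schema and ask for exactly one SELECT statement."""
    return (
        "You translate questions into a single SQLite SELECT statement.\n"
        f"Schema:\n{schema_sql}\n"
        f"Question: {question}\n"
        "Reply with SQL only."
    )


def ask_model(prompt):
    """Send the prompt to the local Ollama server and return its reply text."""
    payload = json.dumps(
        {"model": MODEL, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


def run_readonly(conn, sql):
    """Run model-written SQL only if it is a single read-only statement."""
    statement = sql.strip().rstrip(";")
    # Deliberately strict: this also rejects harmless queries that contain
    # a semicolon inside a string literal.
    if ";" in statement or not statement.lower().startswith(("select", "with")):
        raise ValueError("only a single SELECT statement is allowed")
    conn.execute("PRAGMA query_only = ON")  # second line of defence
    try:
        return conn.execute(statement).fetchall()
    finally:
        conn.execute("PRAGMA query_only = OFF")
```

A chat turn would then be `run_readonly(conn, ask_model(build_prompt(question, schema)))`. Models sometimes wrap their reply in markdown fences or add commentary, so a real interface would need to strip that before execution and show the generated SQL to the researcher for checking.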
The legality of this project rests on a straightforward foundation: patent documents are public records, published by a federal agency as a condition of the patent grant. The USPTO states that the text and drawings of a patent are typically not subject to copyright restrictions, with limited exceptions under 37 CFR 1.71(d)-(e) and 1.84(s), and works authored by the federal government itself are excluded from copyright under 17 U.S.C. § 105. Bulk data files distributed by the USPTO are published explicitly for downstream research and commercial use.
USPTO Bulk Data Storage System: the official government portal for downloading full patent grant and application XML files. No authentication is required, and files are updated weekly.
Google Patents Public Data: a BigQuery dataset maintained by Google in partnership with the USPTO and EPO. It covers U.S., European, and international filings and is available under Creative Commons CC BY 4.0.
EPO Open Patent Services (OPS): a RESTful API for programmatic access to the European Patent Register. The free tier supports research-scale queries under the EPO's fair-use policy.
Cooperative Patent Classification (CPC): the joint USPTO/EPO taxonomy covering ~260,000 technology categories. It is distributed as open data and is essential for filtering by technology domain.
A locally controlled patent database opens lines of inquiry that were previously gatekept by budget, technical skill, or institutional affiliation. Here are representative use cases drawn from the digital humanities, history of science, and social science literatures.
Map inventor and assignee addresses over time to trace the spatial concentration and diffusion of technological activity — from industrial-era manufacturing belts to today's biotech clusters.
Reconstruct the intellectual genealogy of a technology field. Identify foundational patents, measure knowledge spillovers between firms, and detect when a domain crosses disciplinary boundaries.
Examine how academic institutions translate research into IP over decades, compare technology transfer rates across universities, and study the Bayh-Dole Act's long-term effects.
Track how legislative changes — from the America Invents Act to software patent eligibility shifts — show up in filing patterns, claim language, and grant rates over time.
Apply computational text analysis to patent abstracts and claims to study how inventors frame novelty, how terminology evolves within a technology field, and how language varies across industries.
Use inventor name disambiguation and gender-inference tools to study the historical underrepresentation of women and minorities in patenting — and measure whether and where that is changing.
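As one illustration of how these use cases translate into queries, the citation-genealogy idea can be sketched with two SQL statements run from Python. The `citations` table and its `citing_patent` and `cited_patent` columns are assumed names for this example, not necessarily the repository's schema. Forward citation counts are obtained by grouping on the cited side, and a recursive query walks backward citations to list a patent's cited ancestors.

```python
import sqlite3


def most_cited(conn, limit=10):
    """Rank patents by forward citations, a rough proxy for 'foundational'."""
    return conn.execute(
        """
        SELECT cited_patent, COUNT(*) AS forward_citations
        FROM citations
        GROUP BY cited_patent
        ORDER BY forward_citations DESC, cited_patent
        LIMIT ?
        """,
        (limit,),
    ).fetchall()


def ancestors(conn, patent_number, max_depth=3):
    """List patents reachable through backward citations, nearest first."""
    return conn.execute(
        """
        WITH RECURSIVE lineage(patent, depth) AS (
            SELECT cited_patent, 1
            FROM citations
            WHERE citing_patent = ?
            UNION
            SELECT c.cited_patent, l.depth + 1
            FROM citations AS c
            JOIN lineage AS l ON c.citing_patent = l.patent
            WHERE l.depth < ?
        )
        SELECT patent, MIN(depth) AS depth
        FROM lineage
        GROUP BY patent
        ORDER BY depth, patent
        """,
        (patent_number, max_depth),
    ).fetchall()
```

The depth limit keeps the walk bounded, and it doubles as an analytical parameter: a depth of 1 returns direct prior art, while larger values reach further back into a field's lineage.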
Every component of this project is open-source and freely available. The stack was deliberately kept minimal to maximize longevity and ease of replication.
No external Python packages are required for the core pipeline. The chat interface — a browser UI that streams responses from a locally running language model — adds zero cloud dependencies. The only network calls in production are to USPTO's public file servers during initial data ingestion.
The AI coding assistant (Claude, accessed through an agentic workflow) played the role of an expert pair programmer throughout: drafting boilerplate, explaining unfamiliar XML schemas, suggesting database index strategies, and iterating on query performance. The human researcher remained the decision-maker for every architectural and analytical choice. This division of labour — AI handles implementation friction, researcher handles intellectual direction — is the defining feature of agentic coding as a research methodology.
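The complete source code, database schema, and step-by-step instructions are available in this repository. To get started, you need a Python installation (the core pipeline uses only the standard library) and, for the optional chat interface, a local Ollama installation.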
Start with the USPTO's most recent weekly grant file to validate your pipeline before ingesting years of data. The repository README walks through the full process from download to first query.