Digital Humanities & AI — Research Project 2026

Building a Local Patent Database with Agentic Coding

How we used AI-assisted development to collect, structure, and query publicly available patent data — entirely within the bounds of the law — to support original humanities and social science research.

Python · SQLite · USPTO Bulk Data · Local LLM · Open Access

Project Overview

12M+ patents accessible via bulk data
100% public domain data sources
0¢ licensing fees
Local: runs entirely on your machine

Patent records are among the richest primary sources available to researchers — they document the history of invention, reveal corporate strategy, trace knowledge diffusion across industries, and capture how ideas travel between institutions and geographies. Yet for most humanities and social science researchers, patent data has remained frustratingly out of reach: too large to navigate manually, too technical to process without programming experience, and too expensive to access through commercial databases.

This project changes that. By combining AI-assisted programming with freely available government bulk data, we built a fully local, queryable patent database from scratch — no subscription, no cloud dependency, no proprietary lock-in. The result is a research tool that any scholar can replicate, extend, and adapt to their own questions.

Core insight
The U.S. Patent and Trademark Office publishes its entire patent corpus as open bulk data. The barrier to using it has never been legal — it has been technical. Agentic coding lowers that barrier to the point where a researcher without a computer science background can build production-quality data infrastructure in days, not months.

How We Built It

The project unfolded in four well-defined phases, each accelerated by working iteratively with an AI coding assistant that could draft, debug, and refactor code on demand.

1. Data Acquisition

The USPTO publishes weekly grant files and application files as compressed XML archives through its Bulk Data Storage System. We wrote a Python downloader — generated and refined with AI assistance — that fetches specific date ranges, validates checksums, and stores the raw archives locally. No scraping, no terms-of-service violations: the files are served directly by the government for exactly this kind of downstream use.
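A downloader along these lines can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: BASE_URL is a placeholder, the ipgYYMMDD.zip naming pattern and the Tuesday issue schedule are assumptions to verify against the current USPTO listing, and the function names are hypothetical. The checksum is compared only when an expected value is supplied.

```python
import hashlib
import shutil
import urllib.request
from datetime import date, timedelta
from pathlib import Path
from typing import Optional

# Placeholder only: substitute the real bulk-data location published by the USPTO.
BASE_URL = "https://example.invalid/uspto/grants"


def weekly_issue_dates(start: date, end: date):
    """Yield every Tuesday in [start, end] (assumed weekly issue day)."""
    day = start + timedelta(days=(1 - start.weekday()) % 7)  # Monday=0, Tuesday=1
    while day <= end:
        yield day
        day += timedelta(days=7)


def archive_name(issue_date: date) -> str:
    """Build an archive name under the assumed ipgYYMMDD.zip convention."""
    return f"ipg{issue_date:%y%m%d}.zip"


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large archives never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch(url: str, dest: Path, expected_sha256: Optional[str] = None) -> Path:
    """Download url to dest unless it already exists, then optionally verify it."""
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        tmp = dest.with_name(dest.name + ".part")
        with urllib.request.urlopen(url) as resp, tmp.open("wb") as out:
            shutil.copyfileobj(resp, out)
        tmp.replace(dest)  # rename only after the download completed
    if expected_sha256 and sha256_of(dest) != expected_sha256:
        raise ValueError(f"checksum mismatch for {dest.name}")
    return dest
```

A date range then becomes a loop such as `for d in weekly_issue_dates(start, end): fetch(f"{BASE_URL}/{d.year}/{archive_name(d)}", Path("raw") / archive_name(d))`, where the year-based directory layout is again an assumption. Writing to a `.part` file first means an interrupted download is never mistaken for a complete archive on the next run.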

2. Parsing & Extraction

USPTO XML files use a complex nested schema that documents patent metadata, claims, abstracts, inventor names, assignee organizations, citation networks, and classification codes. We used Python's standard-library XML parser, with schema-aware extraction logic drafted by the AI assistant, to pull the fields most useful for humanities research: title, abstract, filing date, grant date, inventors, assignee, CPC classification, and cited references. Backward citation counts come directly from each record's reference list; forward citation counts are computed after loading, by counting how often each patent is cited by later ones.
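The extraction step can be illustrated with ElementTree. The sketch below rests on two assumptions that should be checked against the DTD shipped with the files: that each weekly file concatenates many standalone XML documents, each beginning with its own `<?xml` declaration, and that the element names (`us-bibliographic-data-grant`, `invention-title`, and so on) match the grant full-text schema. The function names are illustrative, and citations and CPC codes are omitted for brevity.

```python
import xml.etree.ElementTree as ET


def split_documents(lines):
    """Yield one XML document at a time from a stream of concatenated documents."""
    buf = []
    for line in lines:
        if line.startswith("<?xml") and buf:
            yield "".join(buf)
            buf = []
        buf.append(line)
    if buf:
        yield "".join(buf)


def text_of(node):
    """Return the whitespace-normalized text of an element, including children."""
    if node is None:
        return ""
    return " ".join("".join(node.itertext()).split())


def extract_record(xml_text):
    """Pull research fields from one grant document, or None for other documents."""
    root = ET.fromstring(xml_text)
    bib = root.find("us-bibliographic-data-grant")
    if bib is None:  # not a grant record, so skip it
        return None
    inventors = []
    for book in bib.iterfind(".//inventor/addressbook"):
        parts = (text_of(book.find("first-name")), text_of(book.find("last-name")))
        inventors.append(" ".join(p for p in parts if p))
    return {
        "doc_number": text_of(bib.find("publication-reference/document-id/doc-number")),
        "grant_date": text_of(bib.find("publication-reference/document-id/date")),
        "filing_date": text_of(bib.find("application-reference/document-id/date")),
        "title": text_of(bib.find("invention-title")),
        "abstract": text_of(root.find("abstract")),
        "inventors": inventors,
        "assignees": [
            text_of(org) for org in bib.iterfind(".//assignee/addressbook/orgname")
        ],
    }
```

Passing an open text file to `split_documents` keeps memory use flat, since only one patent document is held at a time. Using `itertext()` rather than `.text` matters because titles and abstracts often contain inline markup such as italics or subscripts.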

3. Structured Storage

Extracted records are loaded into a local SQLite database — chosen for its zero-configuration setup, broad tool support, and ability to handle tens of millions of rows on commodity hardware. The schema was designed for research queries: normalized tables for patents, inventors, assignees, and citations, with full-text search enabled on titles and abstracts via SQLite's FTS5 extension. The AI assistant was particularly useful here, generating optimized CREATE TABLE statements and bulk-insert routines that reduced load times by an order of magnitude.
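As a rough illustration of this storage layer, the sketch below creates a patents table, an FTS5 index over titles and abstracts, and a batched loader. The table and column names are assumptions rather than the repository's actual schema, and the inventor, assignee, and citation tables would follow the same pattern. It also assumes the local SQLite build includes FTS5, which is common but not guaranteed.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS patents (
    id          INTEGER PRIMARY KEY,
    doc_number  TEXT NOT NULL UNIQUE,
    title       TEXT,
    abstract    TEXT,
    filing_date TEXT,
    grant_date  TEXT
);
-- External-content FTS5 index: the text lives in patents, only the index is stored here.
CREATE VIRTUAL TABLE IF NOT EXISTS patents_fts USING fts5(
    title, abstract, content='patents', content_rowid='id'
);
"""

INSERT_SQL = """
INSERT OR IGNORE INTO patents (doc_number, title, abstract, filing_date, grant_date)
VALUES (:doc_number, :title, :abstract, :filing_date, :grant_date)
"""


def load_patents(conn, records):
    """Insert a batch of record dicts in one transaction, then refresh the index."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(INSERT_SQL, records)
        conn.execute("INSERT INTO patents_fts(patents_fts) VALUES ('rebuild')")


def search(conn, query, limit=10):
    """Full-text search over titles and abstracts, best matches first."""
    return conn.execute(
        """
        SELECT p.doc_number, p.title
        FROM patents_fts
        JOIN patents AS p ON p.id = patents_fts.rowid
        WHERE patents_fts MATCH ?
        ORDER BY patents_fts.rank
        LIMIT ?
        """,
        (query, limit),
    ).fetchall()
```

Wrapping each batch in a single transaction is usually the largest factor in bulk-load speed, because SQLite otherwise commits after every row. Rebuilding the index after every batch is the simplest option for a sketch; for a multi-year ingest it would be cheaper to rebuild once at the end.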

4. Local Query Interface

To make the database accessible without requiring SQL fluency, we paired it with a large language model served locally through Ollama and a lightweight chat interface. Researchers can ask questions in plain English, such as "Show me all patents assigned to universities in Texas between 1990 and 2010 in the biotechnology classification", and receive structured results. Once the data has been ingested, the entire stack runs offline, which matters for institutional data governance and reproducibility.
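One plausible way to wire a plain-English question to the database is to ask the model for a SQL statement and run it against a read-only connection. The sketch below is an assumption about how such a bridge could work, not a description of the project's interface: the prompt wording, the model name "llama3", and all function names are hypothetical. The endpoint shown is Ollama's default local address.

```python
import json
import sqlite3
import urllib.request
from pathlib import Path

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint


def build_prompt(question, schema_sql):
    """Give the model the schema and ask for exactly one SELECT statement."""
    return (
        "You translate questions into a single SQLite SELECT statement.\n"
        f"Schema:\n{schema_sql}\n"
        f"Question: {question}\n"
        "Reply with SQL only."
    )


def ask_ollama(prompt, model="llama3"):  # model name is a placeholder
    """Send one non-streaming generation request to the local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]


def clean_sql(reply):
    """Strip Markdown fences and reject anything that is not a lone SELECT."""
    sql = reply.strip().removeprefix("```sql").removeprefix("```")
    sql = sql.removesuffix("```").strip().rstrip(";")
    if not sql.lower().startswith("select") or ";" in sql:
        raise ValueError("only a single SELECT statement is allowed")
    return sql


def answer(question, db_path, schema_sql, llm=ask_ollama):
    """Return (sql, rows) for a question, executing on a read-only connection."""
    sql = clean_sql(llm(build_prompt(question, schema_sql)))
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        return sql, conn.execute(sql).fetchall()
    finally:
        conn.close()
```

Opening the database with `mode=ro` means that even a badly formed model reply cannot modify the data, and returning the generated SQL alongside the rows lets the researcher check what was actually asked of the database. The `llm` parameter makes it easy to swap in a different model call or a stub when testing.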

Why This Matters for Research

A locally controlled patent database opens lines of inquiry that were previously gatekept by budget, technical skill, or institutional affiliation. Here are representative use cases drawn from the digital humanities, history of science, and social science literatures.

🗺️

Geography of Innovation

Map inventor and assignee addresses over time to trace the spatial concentration and diffusion of technological activity — from industrial-era manufacturing belts to today's biotech clusters.

📊

Citation Network Analysis

Reconstruct the intellectual genealogy of a technology field. Identify foundational patents, measure knowledge spillovers between firms, and detect when a domain crosses disciplinary boundaries.

🏛️

University–Industry Relationships

Examine how academic institutions translate research into IP over decades, compare technology transfer rates across universities, and study the Bayh-Dole Act's long-term effects.

⚖️

Legal and Policy History

Track how legislative changes — from the America Invents Act to software patent eligibility shifts — show up in filing patterns, claim language, and grant rates over time.

🌐

Language and Rhetoric

Apply computational text analysis to patent abstracts and claims to study how inventors frame novelty, how terminology evolves within a technology field, and how language varies across industries.

👩‍🔬

Diversity in Invention

Use inventor name disambiguation and gender-inference tools to study the historical underrepresentation of women and minorities in patenting — and measure whether and where that is changing.
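To make one of these use cases concrete, the citation network analysis can be expressed directly in SQL. The sketch below assumes a hypothetical table `citations(citing_doc, cited_doc)` with one row per reference; the table, column, and function names are illustrative. It ranks patents by forward citations and walks backward citations to trace a patent's lineage.

```python
import sqlite3


def most_cited(conn, limit=10):
    """Rank patents by forward citations, i.e. how often later patents cite them."""
    return conn.execute(
        """
        SELECT cited_doc, COUNT(*) AS forward_citations
        FROM citations
        GROUP BY cited_doc
        ORDER BY forward_citations DESC, cited_doc
        LIMIT ?
        """,
        (limit,),
    ).fetchall()


def lineage(conn, doc_number, max_depth=3):
    """Follow backward citations from one patent; returns (ancestor, distance) pairs."""
    return conn.execute(
        """
        WITH RECURSIVE walk(doc, depth) AS (
            SELECT ?, 0
            UNION
            SELECT c.cited_doc, w.depth + 1
            FROM citations AS c
            JOIN walk AS w ON c.citing_doc = w.doc
            WHERE w.depth < ?
        )
        SELECT doc, MIN(depth) AS distance
        FROM walk
        WHERE depth > 0
        GROUP BY doc
        ORDER BY distance, doc
        """,
        (doc_number, max_depth),
    ).fetchall()
```

The depth limit keeps the recursive walk bounded even if the data contains citation cycles, and `MIN(depth)` reports the shortest path when an ancestor is reachable by more than one route.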

Reproducibility advantage
Because the database is built from stable, versioned government archives and runs entirely on local hardware, analyses can be reproduced end to end. Other researchers can reconstruct the same dataset from the same source files, which helps meet data-management expectations from funders such as NEH, NSF, and SSHRC without requiring a data-sharing agreement.

Technical Stack

Every component of this project is open-source and freely available. The stack was deliberately kept minimal to maximize longevity and ease of replication.

Python 3 (stdlib only) · SQLite + FTS5 · Ollama (local LLM runtime) · XML / ElementTree · HTTP server (http.server) · Vanilla HTML/CSS/JS · USPTO Bulk XML · CPC taxonomy

No external Python packages are required for the core pipeline. The chat interface — a browser UI that streams responses from a locally running language model — adds zero cloud dependencies. The only network calls in production are to USPTO's public file servers during initial data ingestion.
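To show how little is needed on the serving side, here is a minimal sketch of a local JSON endpoint built on http.server. It is an assumption about the general shape of such a backend, not the project's chat interface: the `/search` route, the `q` parameter, and the function names are hypothetical, and response streaming is left out.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import parse_qs, urlparse


def make_handler(search_fn):
    """Build a handler class that answers GET /search?q=... using search_fn."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            url = urlparse(self.path)
            if url.path != "/search":
                self.send_error(404)
                return
            query = parse_qs(url.query).get("q", [""])[0]
            payload = json.dumps(
                {"query": query, "results": search_fn(query)}
            ).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json; charset=utf-8")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, *args):
            """Suppress per-request logging to keep the console quiet."""

    return Handler


def serve(search_fn, port=8000):
    """Serve on the loopback address only, so nothing is exposed to the network."""
    server = ThreadingHTTPServer(("127.0.0.1", port), make_handler(search_fn))
    server.serve_forever()
```

Here `search_fn` is any function that takes a query string and returns JSON-serializable results, for example a wrapper around a full-text search. Binding to 127.0.0.1 rather than all interfaces keeps the service reachable only from the researcher's own machine.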

The AI coding assistant (Claude, accessed through an agentic workflow) played the role of an expert pair programmer throughout: drafting boilerplate, explaining unfamiliar XML schemas, suggesting database index strategies, and iterating on query performance. The human researcher remained the decision-maker for every architectural and analytical choice. This division of labor, in which the AI handles implementation friction and the researcher handles intellectual direction, is the defining feature of agentic coding as a research methodology.

Replicate This Project

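The complete source code, database schema, and step-by-step instructions are available in this repository. To get started, you need the components listed under Technical Stack: Python 3, SQLite with the FTS5 extension, and a local Ollama installation for the chat interface.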

Start with the USPTO's most recent weekly grant file to validate your pipeline before ingesting years of data. The repository README walks through the full process from download to first query.