Patent Database Project

Project Overview

12M+ patents accessible via bulk data
100% public domain data sources
0¢ in licensing fees
Local: runs entirely on your machine

Patent records are among the richest primary sources available to researchers — they document the history of invention, reveal corporate strategy, trace knowledge diffusion across industries, and capture how ideas travel between institutions and geographies. Yet for most humanities and social science researchers, patent data has remained frustratingly out of reach: too large to navigate manually, too technical to process without programming experience, and too expensive to access through commercial databases.

This project changes that. By combining AI-assisted programming with freely available government bulk data, we built a fully local, queryable patent database from scratch — no subscription, no cloud dependency, no proprietary lock-in. The result is a research tool that any scholar can replicate, extend, and adapt to their own questions.

Core insight

The U.S. Patent and Trademark Office publishes its entire patent corpus as open bulk data. The barrier to using it has never been legal — it has been technical. Agentic coding lowers that barrier to the point where a researcher without a computer science background can build production-quality data infrastructure in days, not months.

How We Built It

The project unfolded in four well-defined phases, each accelerated by working iteratively with an AI coding assistant that could draft, debug, and refactor code on demand.

Data Acquisition

The USPTO publishes weekly grant files and application files as compressed XML archives through its Bulk Data Storage System. We wrote a Python downloader — generated and refined with AI assistance — that fetches specific date ranges, validates checksums, and stores the raw archives locally. No scraping, no terms-of-service violations: the files are served directly by the government for exactly this kind of downstream use.
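A minimal sketch of the download-and-verify step, using only the standard library. The function names, the `.part` temporary suffix, and the choice of SHA-256 are illustrative assumptions rather than the project's actual code; the URL and the expected digest must come from wherever you obtain them for the archive in question.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_archive(url, dest_dir, expected_sha256=None):
    """Download one archive into dest_dir, skipping files already present.

    If expected_sha256 is given and does not match, the file is deleted
    and ValueError is raised so the caller can retry.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / url.rsplit("/", 1)[-1]
    if not target.exists():
        partial = target.with_name(target.name + ".part")
        with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
            shutil.copyfileobj(response, out)
        partial.replace(target)  # rename only after a complete download
    if expected_sha256 and sha256_of(target) != expected_sha256.lower():
        target.unlink()
        raise ValueError(f"checksum mismatch for {target.name}")
    return target
```

Writing to a temporary name and renaming at the end means an interrupted download never leaves a truncated archive that a later run would mistake for a finished one.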

Parsing & Extraction

USPTO XML files use a complex nested schema covering patent metadata, claims, abstracts, inventor names, assignee organizations, citations, and classification codes. We used Python's standard-library XML parser, with schema-aware extraction logic drafted by the AI assistant, to pull the fields most useful for humanities research: title, abstract, filing date, grant date, inventors, assignee, CPC classification, and citations. A grant record lists only the references that patent cites (backward citations), so forward citation counts are obtained by aggregating those references across the corpus.
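The extraction step might look roughly like the sketch below. It assumes that a weekly file concatenates many XML documents, each starting with its own `<?xml` declaration on a new line, and that the element names (`publication-reference`, `application-reference`, `invention-title`, `abstract`) match the grant schema version you are parsing; check both against your files. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def split_documents(lines):
    """Yield each XML document from a file that concatenates many of them."""
    buffer = []
    for line in lines:
        if line.startswith("<?xml") and buffer:
            document = "".join(buffer)
            if document.strip():
                yield document
            buffer = []
        buffer.append(line)
    document = "".join(buffer)
    if document.strip():
        yield document


def text_of(node):
    """Flatten an element, including nested markup, to single-spaced text."""
    if node is None:
        return None
    return " ".join("".join(node.itertext()).split())


def extract_record(xml_text):
    """Pull a few bibliographic fields out of one patent document."""
    root = ET.fromstring(xml_text)
    pub = root.find(".//publication-reference/document-id")
    app = root.find(".//application-reference/document-id")
    return {
        "patent_number": pub.findtext("doc-number") if pub is not None else None,
        "grant_date": pub.findtext("date") if pub is not None else None,
        "filing_date": app.findtext("date") if app is not None else None,
        "title": text_of(root.find(".//invention-title")),
        "abstract": text_of(root.find(".//abstract")),
    }
```

Passing an open text file to `split_documents` streams one document at a time, so a multi-gigabyte weekly file never has to be held in memory at once.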

Structured Storage

Extracted records are loaded into a local SQLite database — chosen for its zero-configuration setup, broad tool support, and ability to handle tens of millions of rows on commodity hardware. The schema was designed for research queries: normalized tables for patents, inventors, assignees, and citations, with full-text search enabled on titles and abstracts via SQLite's FTS5 extension. The AI assistant was particularly useful here, generating optimized CREATE TABLE statements and bulk-insert routines that reduced load times by an order of magnitude.
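As an illustration of the storage layer, here is a cut-down sketch with one patents table, one citations table, and an FTS5 index over titles and abstracts. The table and column names are assumptions made for this example, not the project's published schema, and the inventor and assignee tables are omitted for brevity. It also assumes your Python build ships SQLite with FTS5 enabled, which is common but not guaranteed.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS patents (
    patent_number TEXT PRIMARY KEY,
    title         TEXT,
    abstract      TEXT,
    filing_date   TEXT,
    grant_date    TEXT
);
CREATE TABLE IF NOT EXISTS citations (
    citing_patent TEXT NOT NULL REFERENCES patents(patent_number),
    cited_patent  TEXT NOT NULL,
    PRIMARY KEY (citing_patent, cited_patent)
);
CREATE VIRTUAL TABLE IF NOT EXISTS patents_fts
    USING fts5(patent_number UNINDEXED, title, abstract);
"""


def load_patents(conn, records):
    """Insert a batch of record dicts inside a single transaction."""
    records = list(records)
    with conn:  # commits on success, rolls back the whole batch on error
        conn.executemany(
            "INSERT INTO patents VALUES "
            "(:patent_number, :title, :abstract, :filing_date, :grant_date)",
            records,
        )
        conn.executemany(
            "INSERT INTO patents_fts (patent_number, title, abstract) "
            "VALUES (:patent_number, :title, :abstract)",
            records,
        )


def search(conn, query, limit=10):
    """Full-text search over titles and abstracts, best matches first."""
    return conn.execute(
        "SELECT patent_number, title FROM patents_fts "
        "WHERE patents_fts MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()
```

Batching inserts with `executemany` inside one transaction avoids a commit per row, which is usually the largest single cost when loading SQLite row by row.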

Local Query Interface

To make the database accessible without requiring SQL fluency, we paired it with a large language model served locally through Ollama and a lightweight chat interface. Researchers can ask questions in plain English, such as "Show me all patents assigned to universities in Texas between 1990 and 2010 in the biotechnology classification", and receive structured results. The entire stack runs offline, which matters for institutional data governance and reproducibility.
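One way such a question-to-SQL loop could be wired up is sketched below. It posts a prompt to Ollama's local `/api/generate` endpoint with streaming disabled, cleans the reply, and refuses anything that is not a single SELECT statement. The model name, the prompt wording, and the helper names are assumptions for illustration, and the project's actual interface may work differently.

```python
import json
import sqlite3
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local port
MODEL = "llama3"  # assumption: any model you have pulled locally

PROMPT = """You translate questions into one SQLite SELECT statement.
Schema:
{schema}
Question: {question}
Reply with SQL only."""


def ask_ollama(prompt, model=MODEL):
    """Send one non-streaming generation request to the local Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


def clean_sql(reply):
    """Strip code-fence backticks, a leading 'sql' tag, and a trailing ';'."""
    sql = reply.strip().strip("`").strip()
    if sql.lower().startswith("sql\n"):
        sql = sql[4:]
    return sql.strip().rstrip(";").strip()


def is_read_only(sql):
    """Conservative check: one statement, and it must start with SELECT."""
    return sql.lower().startswith("select") and ";" not in sql


def answer(question, conn, schema, generate=ask_ollama):
    """Turn a question into SQL via the model, then run it if it is safe."""
    sql = clean_sql(generate(PROMPT.format(schema=schema, question=question)))
    if not is_read_only(sql):
        raise ValueError(f"refusing to run non-SELECT statement: {sql!r}")
    return sql, conn.execute(sql).fetchall()
```

The `generate` parameter lets you swap in a stub for testing without a running model. Opening the database read-only, for example with `sqlite3.connect("file:patents.db?mode=ro", uri=True)` where `patents.db` is a placeholder filename, adds a second safeguard beyond the string check.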

Keeping It Legal

The legality of this project rests on a straightforward foundation: works of the U.S. federal government are not subject to copyright under 17 U.S.C. § 105, and the USPTO itself states that the text and drawings of a patent are typically not subject to copyright restrictions, with limited exceptions such as material an applicant has marked with a copyright notice. Bulk data files distributed by the USPTO are published explicitly for downstream research and commercial use.

Primary

USPTO Bulk Data Storage System

Official government portal for downloading full patent grant and application XML files. No authentication required. Files are updated weekly.

Supplemental

Google Patents Public Data

BigQuery dataset published by Google with IFI CLAIMS Patent Services. Covers U.S., European, and international filings. Available under Creative Commons CC BY 4.0.

International

EPO Open Patent Services (OPS)

RESTful API for programmatic access to the European Patent Register. Free tier supports research-scale queries under EPO's fair-use policy.

Classification

Cooperative Patent Classification (CPC)

Joint USPTO/EPO taxonomy covering ~260,000 technology categories. Distributed as open data; essential for filtering by technology domain.
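To show how CPC symbols support filtering by technology domain, the sketch below splits a symbol such as `C12N 15/10` into its section, class, subclass, main group, and subgroup. The regular expression covers the common written form only and the helper names are hypothetical; symbols stored in other layouts would need their own normalization.

```python
import re
from typing import NamedTuple, Optional

# Section letter (A-H or Y), two-digit class, subclass letter,
# then main group and subgroup separated by a slash.
CPC_PATTERN = re.compile(r"^([A-HY])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")


class CpcSymbol(NamedTuple):
    section: str
    cls: str
    subclass: str
    main_group: str
    subgroup: str


def parse_cpc(symbol: str) -> Optional[CpcSymbol]:
    """Parse a CPC symbol, returning None when it does not fit the pattern."""
    match = CPC_PATTERN.match(symbol.strip().upper())
    return CpcSymbol(*match.groups()) if match else None


def subclass_of(symbol: str) -> Optional[str]:
    """Return the four-character subclass code, e.g. 'C12N'."""
    parsed = parse_cpc(symbol)
    if parsed is None:
        return None
    return parsed.section + parsed.cls + parsed.subclass
```

Comparing on the parsed subclass rather than on a raw string prefix avoids false matches at the group level, where a prefix like `C12N1` would otherwise catch both `C12N 1/00` and `C12N 15/10`.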

What we do not do

We do not scrape commercial databases (Derwent, LexisNexis, PatSnap), reproduce full patent texts in redistributable form beyond what fair use permits, or circumvent any access controls. Our pipeline touches only data that agencies publish for download.

A note on non-U.S. patents

Copyright status of patent documents varies by jurisdiction. Canadian, European (EPO), and PCT filings are generally accessible through official open-data programs, but researchers should verify the terms of each source before building a public-facing product on top of them. For purely local, non-commercial research use, fair dealing and fair use provisions often provide considerable latitude, though their scope differs from country to country.

Why This Matters for Research

A locally controlled patent database opens lines of inquiry that were previously gatekept by budget, technical skill, or institutional affiliation. Here are representative use cases drawn from the digital humanities, history of science, and social science literatures.

🗺️

Geography of Innovation

Map inventor and assignee addresses over time to trace the spatial concentration and diffusion of technological activity — from industrial-era manufacturing belts to today's biotech clusters.

📊

Citation Network Analysis

Reconstruct the intellectual genealogy of a technology field. Identify foundational patents, measure knowledge spillovers between firms, and detect when a domain crosses disciplinary boundaries.

🏛️

University–Industry Relationships

Examine how academic institutions translate research into IP over decades, compare technology transfer rates across universities, and study the Bayh-Dole Act's long-term effects.

⚖️

Legal and Policy History

Track how legislative changes — from the America Invents Act to software patent eligibility shifts — show up in filing patterns, claim language, and grant rates over time.

🌐

Language and Rhetoric

Apply computational text analysis to patent abstracts and claims to study how inventors frame novelty, how terminology evolves within a technology field, and how language varies across industries.

👩‍🔬

Diversity in Invention

Use inventor name disambiguation and gender-inference tools to study the historical underrepresentation of women and minorities in patenting — and measure whether and where that is changing.
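As a concrete instance of the citation-network use case above, the sketch below derives forward citation counts from a table of citing/cited pairs. The `citations` table layout and the function name are assumptions made for this example; adapt them to whatever schema your database uses.

```python
import sqlite3

# Hypothetical edge table: one row per (citing patent, cited patent) pair.
CITATIONS_DDL = """
CREATE TABLE IF NOT EXISTS citations (
    citing_patent TEXT NOT NULL,
    cited_patent  TEXT NOT NULL,
    PRIMARY KEY (citing_patent, cited_patent)
)
"""

# A forward citation count is the number of distinct patents citing a patent.
FORWARD_COUNTS = """
SELECT cited_patent, COUNT(*) AS forward_citations
FROM citations
GROUP BY cited_patent
ORDER BY forward_citations DESC, cited_patent
LIMIT ?
"""


def most_cited(conn, limit=10):
    """Return (patent, forward citation count) pairs, most cited first."""
    return conn.execute(FORWARD_COUNTS, (limit,)).fetchall()
```

Because counts are computed only from the patents you have loaded, they understate citations from years outside your sample; that truncation is worth stating alongside any results.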

Reproducibility advantage

Because the database is built from stable, versioned government archives and runs entirely on local hardware, analyses can be reproduced end to end. Other researchers can reconstruct the same dataset from the same source files, which helps meet data-management requirements from funders like NEH, NSF, and SSHRC without requiring a data-sharing agreement.

Technical Stack

Every component of this project is open-source and freely available. The stack was deliberately kept minimal to maximize longevity and ease of replication.

Python 3 (stdlib only)
SQLite + FTS5
Ollama (local LLM runtime)
XML / ElementTree
HTTP server (http.server)
Vanilla HTML/CSS/JS
USPTO Bulk XML
CPC taxonomy

No external Python packages are required for the core pipeline. The chat interface — a browser UI that streams responses from a locally running language model — adds zero cloud dependencies. The only network calls in production are to USPTO's public file servers during initial data ingestion.

The AI coding assistant (Claude, accessed through an agentic workflow) played the role of an expert pair programmer throughout: drafting boilerplate, explaining unfamiliar XML schemas, suggesting database index strategies, and iterating on query performance. The human researcher remained the decision-maker for every architectural and analytical choice. This division of labour — AI handles implementation friction, researcher handles intellectual direction — is the defining feature of agentic coding as a research methodology.

Replicate This Project

The complete source code, database schema, and step-by-step instructions are available in this repository. To get started, you need:

Python 3.8 or later (no pip installs required)
Ollama installed locally for the chat interface
~50 GB of disk space for a representative multi-year patent sample
A research question

Start with the USPTO's most recent weekly grant file to validate your pipeline before ingesting years of data. The repository README walks through the full process from download to first query.