How We Reduced a 1.5GB Database by 99%
A deep technical dive into optimizing the NHTSA VPIC database for offline VIN decoding across Node.js, browsers, and edge computing.
When we set out to build Corgi, an offline VIN decoder that runs everywhere, we faced a fundamental constraint: the NHTSA’s VPIC database was 1.5GB. That’s fine for a server, but impossible for browsers, impractical for edge workers, and painful for any application that needs to bundle the data.
This is the story of how we systematically reduced that 1.5GB database to 21MB compressed, while preserving all the functionality needed for accurate VIN decoding. No magic algorithms. No lossy compression. Just methodical analysis of what data actually matters.
The Challenge: Making Government Data Mobile
The National Highway Traffic Safety Administration maintains the Product Information Catalog and Vehicle Listing (VPIC), a comprehensive database of every vehicle sold in America since 1981. It’s the authoritative source for VIN decoding, containing specifications for millions of vehicle configurations.
The problem? The database was designed for regulatory compliance on government servers, not for distribution to end users. At 1.5GB, you’re looking at:
30+ second downloads on average mobile connections
Memory pressure that crashes browser tabs
Bundle size violations for any modern web framework
Cold start timeouts on serverless platforms like Cloudflare Workers
We needed this database to work in three very different environments:
// Node.js - traditional server usage
const decoder = await createDecoder();

// Browser - loaded via SQL.js WebAssembly
const browserDecoder = await createDecoder({
  databasePath: "https://cdn.example.com/vpic.lite.db.gz",
  runtime: "browser",
});

// Cloudflare Workers - edge computing via D1
initD1Adapter(env.D1_DATABASE);
const edgeDecoder = await createDecoder({
  databasePath: "D1",
  runtime: "cloudflare",
});

Each platform has different constraints. Browsers need small downloads. Edge workers have execution time limits. Mobile devices have limited memory. The only solution was making the database dramatically smaller.
Phase 1: Understanding What’s Actually in There
Before optimizing anything, we needed to understand the database structure. We wrote a simple script to analyze table sizes:
SELECT
  name as Table_Name,
  ROUND(SUM(pgsize)/1024.0/1024.0, 2) as Size_MB
FROM dbstat
WHERE name NOT LIKE 'sqlite%'
GROUP BY name
ORDER BY SUM(pgsize) DESC
LIMIT 10;

The results were illuminating.
The largest table, WMIYearValidChars, contained 847MB of data mapping World Manufacturer Identifiers to valid model years. This is useful for strict validation, but we realized we could derive this information from other tables at query time.
First insight: The database wasn’t optimized for size because it never needed to be. Government servers have terabytes of storage.
Phase 2: The Big Win - Removing WMIYearValidChars
Dropping the WMIYearValidChars table was our single biggest optimization:
sqlite3 "$WORK_DB" "
DROP TABLE IF EXISTS WMIYearValidChars;
VACUUM;"Result: 1.5GB to 653MB. A 56% reduction from one table.
But we didn’t just delete it blindly. We verified that our decode logic could function without it by using the Wmi_VinSchema relationship table instead:
SELECT DISTINCT vs.Id as SchemaId, vs.Name as SchemaName
FROM Wmi w
JOIN Wmi_VinSchema wvs ON w.Id = wvs.WmiId
JOIN VinSchema vs ON wvs.VinSchemaId = vs.Id
WHERE w.Wmi = ?
AND ? >= wvs.YearFrom
AND (wvs.YearTo IS NULL OR ? <= wvs.YearTo)

The YearFrom and YearTo columns in Wmi_VinSchema gave us the same validation capability with a fraction of the storage.
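In application code, that lookup is a single prepared statement. Here’s a minimal sketch using better-sqlite3 against a local copy of the trimmed database; the helper is illustrative, not Corgi’s internal code:

import Database from "better-sqlite3";

const db = new Database("vpic.lite.db", { readonly: true });

// Find the VIN schemas that apply to a WMI for a given model year.
// This replaces the lookup the old WMIYearValidChars table provided.
const schemaLookup = db.prepare(`
  SELECT DISTINCT vs.Id AS SchemaId, vs.Name AS SchemaName
  FROM Wmi w
  JOIN Wmi_VinSchema wvs ON w.Id = wvs.WmiId
  JOIN VinSchema vs ON wvs.VinSchemaId = vs.Id
  WHERE w.Wmi = ?
    AND ? >= wvs.YearFrom
    AND (wvs.YearTo IS NULL OR ? <= wvs.YearTo)
`);

// "KM8" is the WMI from the Hyundai example later in this post.
const schemas = schemaLookup.all("KM8", 2023, 2023);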
Phase 3: Surgical Pattern Table Pruning
The Pattern table was our next target. At 312MB, it contained patterns for every possible VIN element - most of which applications don’t need.
The database contained patterns for elements like:
Essential: Make, Model, Series, Trim, Body Style, Engine, Fuel Type
Nice to have: Plant location, drive type, transmission
Rarely needed: ABS type, airbag locations, tire pressure monitor system type
We kept only the patterns that matter for typical VIN decode use cases:
DELETE FROM Pattern
WHERE ElementId NOT IN (
SELECT Id FROM Element
WHERE Name IN (
-- Core VIN elements
'Model', 'Make', 'Series', 'Trim',
-- Plant information
'Plant', 'Plant Country', 'Plant City', 'Plant State',
-- Vehicle characteristics
'Body Class', 'Body Style', 'Doors',
'DriveType', 'Engine Model', 'Engine Configuration',
'Fuel Type - Primary', 'Fuel Type', 'Transmission',
-- Additional useful elements
'Turbo', 'Displacement (L)', 'Displacement (CC)',
'Cylinders', 'Engine Manufacturer', 'Engine Power (KW)',
'Gross Vehicle Weight Rating', 'Brake System Type',
'Battery Type', 'Battery Energy (kWh)', 'Charger Level',
'Electric Range', 'Base Price ($)', 'Trim Level'
)
);

Result: Pattern table reduced from 312MB to 47MB.
Phase 4: Removing Orphaned Data
After pruning patterns, we discovered orphaned records - patterns referencing VIN schemas that no longer had any WMI associations:
DELETE FROM Pattern
WHERE VinSchemaId NOT IN (
SELECT VinSchemaId FROM Wmi_VinSchema
);

This cleanup removed another 23MB of dead data that would never be accessed.
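Before running a destructive DELETE like this, it’s worth counting what it would remove; an unexpectedly large number is a sign the earlier pruning went wrong. A sketch of that dry run (our own sanity check; the database path stands in for $WORK_DB):

import Database from "better-sqlite3";

const db = new Database("vpic-work.db");

// Dry run: how many Pattern rows reference schemas with no WMI mapping?
const { orphans } = db.prepare(`
  SELECT COUNT(*) AS orphans
  FROM Pattern
  WHERE VinSchemaId NOT IN (SELECT VinSchemaId FROM Wmi_VinSchema)
`).get();

console.log(`${orphans} orphaned pattern rows would be deleted`);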
Phase 5: Dropping Unused Tables
The original database contained dozens of tables for specialized features we didn’t need:
Safety feature tables (not needed for basic VIN decoding):
DROP TABLE IF EXISTS ABS;
DROP TABLE IF EXISTS AdaptiveCruiseControl;
DROP TABLE IF EXISTS AirBagLocFront;
DROP TABLE IF EXISTS AirBagLocKnee;
DROP TABLE IF EXISTS AirBagLocations;
DROP TABLE IF EXISTS BlindSpotMonitoring;
DROP TABLE IF EXISTS LaneDepartureWarning;
DROP TABLE IF EXISTS ParkAssist;
DROP TABLE IF EXISTS TPMS;
-- ... 15 more safety tables

Specialized vehicle types (motorcycles, buses, trailers):
DROP TABLE IF EXISTS BusFloorConfigType;
DROP TABLE IF EXISTS BusType;
DROP TABLE IF EXISTS CustomMotorcycleType;
DROP TABLE IF EXISTS MotorcycleChassisType;
DROP TABLE IF EXISTS TrailerBodyType;
DROP TABLE IF EXISTS TrailerType;

Redundant schema tables:
DROP TABLE IF EXISTS VehicleSpecPattern;
DROP TABLE IF EXISTS VehicleSpecSchema;
DROP TABLE IF EXISTS VehicleSpecSchema_Model;
DROP TABLE IF EXISTS DEFS_Body;
DROP TABLE IF EXISTS DEFS_Make;
DROP TABLE IF EXISTS DEFS_Model;

Each table removal was validated by running our test suite to ensure core functionality remained intact.
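The full test suite is the real gate, but the heart of that validation fits in a few lines: decode a fixed set of VINs after every destructive change and compare against known-good output. A simplified sketch, reusing the example VIN from later in this post:

import { createDecoder } from "@cardog/corgi";

// VINs with known-good expectations, re-checked after every schema change.
const fixtures = [
  { vin: "KM8K2CAB4PU001140", make: "Hyundai", model: "Kona" },
];

const decoder = await createDecoder();
for (const { vin, make, model } of fixtures) {
  const { vehicle } = (await decoder.decode(vin)).components;
  if (vehicle.make !== make || vehicle.model !== model) {
    throw new Error(`Regression decoding ${vin}: got ${vehicle.make} ${vehicle.model}`);
  }
}
console.log("All fixture VINs decoded as expected");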
Phase 6: Index Optimization for Read-Only Workloads
The original database had indexes optimized for write operations (the government updates this data regularly). We needed indexes optimized for read-only VIN decoding.
We analyzed our query patterns and created targeted composite indexes:
-- WMI lookup (the first query in every decode)
CREATE INDEX idx_wmi_lookup ON Wmi(Wmi);
-- Schema lookup with year filtering
CREATE INDEX idx_wmi_vinschema_composite
ON Wmi_VinSchema(WmiId, VinSchemaId, YearFrom, YearTo);
-- Pattern matching (the core of VIN decoding)
DROP INDEX IF EXISTS idx_pattern_optimized;
CREATE INDEX idx_pattern_composite
ON Pattern(VinSchemaId, ElementId, Keys);
-- Element lookups
CREATE INDEX idx_element_name ON Element(Name);
-- Make/Model joins
CREATE INDEX idx_make_model_composite ON Make_Model(ModelId, MakeId);
-- Update query statistics
ANALYZE;

The key insight here was index column ordering. For pattern matching, we put VinSchemaId first because that’s the most selective filter - it dramatically reduces the candidate set before checking element types.
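EXPLAIN QUERY PLAN is the quickest way to confirm SQLite actually picks up the composite index. A sketch using better-sqlite3 (the bound IDs are placeholders; any schema/element pair will do):

import Database from "better-sqlite3";

const db = new Database("vpic.lite.db", { readonly: true });

// Ask SQLite how it would execute the core pattern-matching query.
const plan = db.prepare(`
  EXPLAIN QUERY PLAN
  SELECT ElementId, AttributeId FROM Pattern
  WHERE VinSchemaId = ? AND ElementId = ?
`).all(1, 1);

// Expect a detail row like:
// SEARCH Pattern USING INDEX idx_pattern_composite (VinSchemaId=? AND ElementId=?)
console.log(plan.map((row) => row.detail));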
Phase 7: The Final VACUUM
After all modifications, running SQLite’s VACUUM command reclaimed all the freed space:
sqlite3 "$WORK_DB" "VACUUM;"This rebuilds the database file from scratch, eliminating fragmentation and dead pages.
Final uncompressed size: 64MB (down from 1.5GB)
Phase 8: Compression for Distribution
For distribution, we apply gzip compression:
gzip -9 vpic.lite.db

Final compressed size: 21MB
The database compresses well because:
SQLite stores data in pages with predictable structures
String values (make names, model names) have high repetition
Integer IDs compress extremely well
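On the client, the .gz file has to be inflated before SQL.js can open it. Corgi handles this internally, but a minimal sketch using the browser’s built-in DecompressionStream (with the placeholder CDN URL from earlier) looks like this:

import initSqlJs from "sql.js";

// Fetch the gzipped database and inflate it with the streaming gzip decoder.
const response = await fetch("https://cdn.example.com/vpic.lite.db.gz");
const inflated = response.body.pipeThrough(new DecompressionStream("gzip"));
const bytes = new Uint8Array(await new Response(inflated).arrayBuffer());

// SQL.js opens the database fully in memory from those bytes.
const SQL = await initSqlJs();
const db = new SQL.Database(bytes);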
The Results: Measurable Impact
The final numbers: 1.5GB down to 64MB on disk and 21MB over the wire - a reduction of roughly 96% uncompressed and 99% compressed. Queries got faster too, from both smaller table scans and optimized indexes. When the Pattern table is 47MB instead of 312MB, even a full scan is fast.
What We Learned
1. Profile Before Optimizing
The dbstat virtual table in SQLite is invaluable:
SELECT name, SUM(pgsize) as size
FROM dbstat
GROUP BY name
ORDER BY size DESC;

This tells you exactly where your bytes are going. We would have wasted weeks optimizing the wrong tables without this data.
2. Understand Your Access Patterns
We only needed read operations for a specific use case (VIN decoding). This let us:
Remove write-optimized indexes
Drop tables only needed for data maintenance
Create read-optimized composite indexes
If you’re optimizing for a general-purpose database, you have fewer options.
3. Validate Aggressively
Every optimization was followed by running our full test suite:
pnpm test

We caught several regressions where removing a table broke an edge case query. Automated testing made aggressive optimization safe.
4. Compression Is Not a Substitute
Compression helps distribution size, but it doesn’t help query performance. A 1.5GB database compressed to 400MB still needs 1.5GB of RAM to query in the browser, because SQL.js loads the entire file into memory. The real optimization had to happen at the data level.
The Code Is Open Source
We’ve packaged this optimized database into Corgi (@cardog/corgi), a fully offline VIN decoder:
import { createDecoder } from "@cardog/corgi";
const decoder = await createDecoder();
const result = await decoder.decode("KM8K2CAB4PU001140");
console.log(result.components.vehicle);
// {
// make: 'Hyundai',
// model: 'Kona',
// year: 2023,
// series: 'SE',
// bodyStyle: 'SUV',
// driveType: '4WD/4-Wheel Drive/4x4',
// fuelType: 'Gasoline'
// fuelType: 'Gasoline'
// }

The library handles database download, caching, and updates automatically. First run downloads the 21MB database, subsequent runs use the cached copy.
Install: npm install @cardog/corgi
Source: github.com/cardog-ai/corgi
Beyond Corgi: Applying These Techniques
The optimization approach we used isn’t specific to VIN databases. Any large dataset can benefit from systematic analysis:
Measure table sizes to find the biggest opportunities (see the sketch after this list)
Analyze query patterns to understand what data is actually accessed
Remove unused data (tables, columns, rows) aggressively
Optimize indexes for your specific access patterns
Compress for distribution, not for storage
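That first measuring step is a dozen lines against any SQLite file. A sketch (our own script, assuming a SQLite build with the dbstat virtual table compiled in):

import Database from "better-sqlite3";

// Usage: node profile.mjs path/to/any.db
const db = new Database(process.argv[2], { readonly: true });

const tables = db.prepare(`
  SELECT name, ROUND(SUM(pgsize) / 1024.0 / 1024.0, 2) AS size_mb
  FROM dbstat
  WHERE name NOT LIKE 'sqlite%'
  GROUP BY name
  ORDER BY SUM(pgsize) DESC
  LIMIT 10
`).all();

for (const { name, size_mb } of tables) console.log(`${name}: ${size_mb} MB`);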
The NHTSA spent decades collecting comprehensive vehicle data. By understanding what our application actually needed, we made that data accessible to environments that could never handle the original 1.5GB payload.
At Cardog, this optimization powers our VIN decoding across web, mobile, and API services. Whether you’re building an automotive application or just curious about database optimization, we hope this deep dive proves useful.
Sometimes the best optimization is just removing what you don’t need. Merry Christmas!


