Integration Guide - CAP Lawsuit Summarization

Overview

This tool automates the extraction of structured data from NYPD misconduct lawsuit complaint PDFs. It's designed to match the LELU (Law Enforcement Look Up) database taxonomy exactly, making it easy to import extracted data into Airtable.

What It Does

Run OCR on scanned complaint PDFs using vision AI models
Extract narrative summaries in ProPublica style
Classify using exact LELU taxonomy (allegations, force types, themes, etc.)
Extract officer names, badge numbers, precincts, and addresses
Provide source provenance (exact text + page number) for every extraction

How It Works

PDF → Vision OCR (Qwen3 VL / DeepSeek) → Extraction LLM (Llama 4 Scout) → JSON (Structured Data)
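The two-stage flow above can be sketched in Python. This is only an illustration of the shape of the pipeline, not the project's actual code: the function name two_step_extract, the injected ocr_page and extract_fields callables, and the "[Page N]" marker format are all assumptions made for the example.

```python
from typing import Callable


def two_step_extract(
    page_images: list[bytes],
    ocr_page: Callable[[bytes], str],
    extract_fields: Callable[[str], dict],
) -> dict:
    """Run OCR page by page, then pass the combined text to the extraction step.

    ocr_page and extract_fields are hypothetical stand-ins for the vision OCR
    model call and the extraction LLM call.
    """
    # Step 1: OCR each page and tag the text with its page number, so the
    # extraction step has something to cite for provenance.
    page_texts = [
        f"[Page {number}]\n{ocr_page(image)}"
        for number, image in enumerate(page_images, start=1)
    ]
    # Step 2: one extraction call over the whole document text.
    return extract_fields("\n\n".join(page_texts))
```

Passing the model calls in as plain callables keeps the sketch independent of any particular provider, which also makes it easy to try with stub functions before wiring in real models.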

Extraction Modes

Mode	OCR	Extraction	Cost/Page	Best For
DeepSeek	Self-hosted (EC2)	Llama 4	~$0.0002	Large batches, lowest cost
Two-Step	Qwen3 VL (Bedrock)	Llama 4	~$0.003	Balance of cost and quality
Unified	Qwen3 VL (single pass)	Qwen3 VL (same pass)	~$0.004	Simplest pipeline
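To compare modes for a given workload, the per-page figures in the table can be multiplied out. The sketch below is a rough estimator only: the rates are the approximate values from the table, the dictionary keys "deepseek" and "unified" are labels chosen for this example, and the 1,000-document volume is hypothetical.

```python
# Approximate per-page costs in USD, copied from the Extraction Modes table.
# The "deepseek" and "unified" keys are illustrative labels, not confirmed CLI values.
COST_PER_PAGE = {
    "deepseek": 0.0002,
    "two-step": 0.003,
    "unified": 0.004,
}


def estimate_cost(num_documents: int, avg_pages: float) -> dict[str, float]:
    """Return a rough total cost in USD for each extraction mode."""
    total_pages = num_documents * avg_pages
    return {
        mode: round(total_pages * rate, 2)
        for mode, rate in COST_PER_PAGE.items()
    }


if __name__ == "__main__":
    # Hypothetical volume: 1,000 complaints averaging 20 pages each.
    for mode, cost in estimate_cost(1000, 20).items():
        print(f"{mode}: ~${cost:.2f}")
```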

Workflow Options

Option 1: Web Interface

Use the Demo page to review individual cases with side-by-side comparison of AI extraction vs. human summaries.

Option 2: Command Line Batch

# Activate virtual environment
source venv/bin/activate

# Run extraction on all PDFs in a folder
python run_vision_extraction.py --mode two-step --input-dir ./complaints/

# Export results to CSV
python export_results.py --format csv --output lelu_import.csv

Option 3: API Integration

# JSON API endpoint
GET /api/export/json?method=two_step

# CSV export
GET /api/export/csv?method=two_step
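A minimal Python client for these endpoints might look like the following. The base URL is an assumption (a locally running instance on port 8000), and the helper names export_url and fetch_json_export are hypothetical; only the paths and the method query parameter come from the endpoints listed above.

```python
import json
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen

# Assumption: the app runs locally; replace with your deployment's address.
BASE_URL = "http://localhost:8000"


def export_url(fmt: str, method: str = "two_step", base_url: str = BASE_URL) -> str:
    """Build the export URL for the given format ('json' or 'csv')."""
    if fmt not in ("json", "csv"):
        raise ValueError(f"unsupported export format: {fmt!r}")
    query = urlencode({"method": method})
    return urljoin(base_url, f"/api/export/{fmt}") + "?" + query


def fetch_json_export(method: str = "two_step", base_url: str = BASE_URL):
    """Download and decode the JSON export (requires a running server)."""
    with urlopen(export_url("json", method, base_url), timeout=30) as response:
        return json.load(response)
```

The shape of the JSON response is not specified here, so fetch_json_export simply returns whatever the server sends after decoding.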

Output Formats

CSV/TSV for Airtable

Results export to CSV or TSV, with multi-value fields separated by pipes (|), ready for Airtable import.

Column	Example
case_id	owens_troy
summary	On December 9, 2015, Troy Owens was arrested...
allegations	Excessive force|False arrest|Malicious prosecution
force_types	Non-weapon physical force|Tight handcuffs
officers	John Doe I|John Doe II
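When reading an export back into Python, the pipe-separated columns need to be split into lists. The sketch below assumes a comma-delimited file with a header row containing the column names from the table; the MULTI_VALUE_COLUMNS tuple and the parse_export helper are hypothetical names for this example.

```python
import csv
import io

# Columns shown in the table above that hold pipe-separated values.
MULTI_VALUE_COLUMNS = ("allegations", "force_types", "officers")


def parse_export(csv_text: str) -> list[dict]:
    """Parse exported CSV text, splitting pipe-separated fields into lists."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column in MULTI_VALUE_COLUMNS:
            raw = row.get(column) or ""
            row[column] = [value.strip() for value in raw.split("|") if value.strip()]
        rows.append(row)
    return rows


# Sample row built from the example values in the table (assumed CSV layout).
SAMPLE = (
    "case_id,summary,allegations,force_types,officers\n"
    'owens_troy,"On December 9, 2015, Troy Owens was arrested...",'
    "Excessive force|False arrest|Malicious prosecution,"
    "Non-weapon physical force|Tight handcuffs,John Doe I|John Doe II\n"
)

if __name__ == "__main__":
    for record in parse_export(SAMPLE):
        print(record["case_id"], record["allegations"])
```

For a TSV export, the same approach should work by passing delimiter="\t" to csv.DictReader.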

JSON with Provenance

Full JSON includes source text and page numbers for every extracted field.

{
  "allegations": [
    {
      "type": "Excessive force/assault and battery",
      "provenance": {
        "source_text": "defendants... wrongfully touched, assaulted and battered...",
        "page_number": 3,
        "paragraph": "Second Cause of Action, Paragraph 10"
      }
    }
  ]
}
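Provenance makes it possible to spot-check each extraction against the source PDF. As a sketch, the snippet below flattens the structure shown above into (label, page, quote) tuples for review; the provenance_rows helper is a hypothetical name, and the code assumes other fields follow the same "type" plus "provenance" layout as allegations.

```python
import json

# The example record shown above.
SAMPLE_JSON = """
{
  "allegations": [
    {
      "type": "Excessive force/assault and battery",
      "provenance": {
        "source_text": "defendants... wrongfully touched, assaulted and battered...",
        "page_number": 3,
        "paragraph": "Second Cause of Action, Paragraph 10"
      }
    }
  ]
}
"""


def provenance_rows(record: dict, field: str = "allegations") -> list[tuple]:
    """Return (label, page_number, source_text) for each item in a field."""
    rows = []
    for item in record.get(field, []):
        provenance = item.get("provenance", {})
        rows.append(
            (item.get("type"), provenance.get("page_number"), provenance.get("source_text"))
        )
    return rows


if __name__ == "__main__":
    for label, page, quote in provenance_rows(json.loads(SAMPLE_JSON)):
        print(f"{label} (p. {page}): {quote}")
```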

LELU Taxonomy Coverage

The extraction prompt uses the exact LELU taxonomy from Airtable.

Allegations (18)

Abuse of process
ADA/disability claim
Conversion
Excessive force/assault and battery
Excessive pre-arraignment detention
Failure to intervene
False arrest/False imprisonment
Indifference/Denial of medical care
Malicious prosecution
Monell claim
Municipal liability
Negligence
Retaliation for exercise of constitutional rights
Strip search
Unconstitutional condition of confinement/cruel and unusual punishment
Unlawful search and seizure
Violation of NY Civil Rights Law §50-a
Wrongful conviction/imprisonment

Force Types (12)

Baton/asps/object
Canine
Chokehold
Gun pointed (not fired)
Mace/Pepper spray
Non-weapon physical force
Shooting/Discharge of firearm
Sonic weapon/LRAD
Taser
Tight handcuffs
Unknown
Vehicle/Collision

Themes (10)

Disability
Homelessness
Immigration
LGBTQ+ identity
Mental health
Police assisting EMS
Race and/or ethnicity
Religion
Sex/gender
Youth

Locations (17)

At residence but not inside (i.e. Lobby/stairwell/yard/etc)
Bike
Bus/Subway
Commercial
Hospital/Health Clinic
Inside residence
Motor vehicle
NYCHA buildings and/or grounds
Other/Unknown
Police vehicle
Precinct
Prison/Jail
Public/Open space
School
Street
Social Services
Workplace

Defendant Types (7)

City of New York
Correction Officer
Department of Corrections
Individual Police Officer
NYPD
Other City Agency
Other Individual
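Because Airtable select fields expect exact label matches, it can help to check extracted values against the taxonomy before import. The sketch below covers only two of the categories listed above to stay short; the TAXONOMY dictionary layout and the unknown_labels helper are assumptions for illustration, not part of the tool.

```python
# Labels copied from the Themes and Defendant Types lists above.
TAXONOMY = {
    "themes": {
        "Disability", "Homelessness", "Immigration", "LGBTQ+ identity",
        "Mental health", "Police assisting EMS", "Race and/or ethnicity",
        "Religion", "Sex/gender", "Youth",
    },
    "defendant_types": {
        "City of New York", "Correction Officer", "Department of Corrections",
        "Individual Police Officer", "NYPD", "Other City Agency",
        "Other Individual",
    },
}


def unknown_labels(category: str, labels: list[str]) -> list[str]:
    """Return the labels that are not exact matches for the given category."""
    allowed = TAXONOMY[category]
    return [label for label in labels if label not in allowed]


if __name__ == "__main__":
    # "Housing" is a made-up label used to show a failed match.
    print(unknown_labels("themes", ["Youth", "Housing"]))
```

Matching is case-sensitive and exact on purpose, since a near-miss such as "youth" would otherwise create a new option on import.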

Technical Requirements

AWS Account with Bedrock access (Qwen3 VL, Llama 4)
Python 3.10+ with boto3, PyMuPDF
Optional: EC2 g5.xlarge for self-hosted DeepSeek OCR

Quick Setup

git clone https://github.com/tiwhi/cap-summary-prototype
cd cap-summary-prototype
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Configure AWS credentials
aws configure

# Test extraction
python run_vision_extraction.py --mode two-step --pdf sample.pdf

Next Steps

Review demo cases to see extraction quality
Compare accuracy metrics across extraction modes
Estimate costs for your document volume
Export sample data for testing Airtable import