CAP Lawsuit Summarization

Integration Guide

How to integrate AI-powered lawsuit summarization into the LELU workflow.

Overview

This tool automates the extraction of structured data from NYPD misconduct lawsuit complaint PDFs. It's designed to match the LELU (Law Enforcement Look Up) database taxonomy exactly, making it easy to import extracted data into Airtable.

What It Does

  • OCR scanned complaint PDFs using vision AI models
  • Extract narrative summaries in ProPublica style
  • Classify using exact LELU taxonomy (allegations, force types, themes, etc.)
  • Extract officer names, badge numbers, precincts, and addresses
  • Provide source provenance (exact text + page number) for every extraction

How It Works

PDF
Vision OCR
Qwen3 VL / DeepSeek
Extraction LLM
Llama 4 Scout
JSON
Structured Data

Extraction Modes

Mode OCR Extraction Cost/Page Best For
DeepSeek Self-hosted (EC2) Llama 4 ~$0.0002 Large batches, lowest cost
Two-Step Qwen3 VL (Bedrock) Llama 4 ~$0.003 Balance of cost and quality
Unified Qwen3 VL (single pass) ~$0.004 Simplest pipeline

Workflow Options

Option 1: Web Interface

Use the Demo page to review individual cases with side-by-side comparison of AI extraction vs. human summaries.

Option 2: Command Line Batch

# Activate virtual environment
source venv/bin/activate

# Run extraction on all PDFs in a folder
python run_vision_extraction.py --mode two-step --input-dir ./complaints/

# Export results to CSV
python export_results.py --format csv --output lelu_import.csv

Option 3: API Integration

# JSON API endpoint
GET /api/export/json?method=two_step

# CSV export
GET /api/export/csv?method=two_step

Output Formats

CSV/TSV for Airtable

Export to CSV or TSV format with pipe-separated multi-value fields. Ready for Airtable import.

ColumnExample
case_idowens_troy
summaryOn December 9, 2015, Troy Owens was arrested...
allegationsExcessive force|False arrest|Malicious prosecution
force_typesNon-weapon physical force|Tight handcuffs
officersJohn Doe I|John Doe II

JSON with Provenance

Full JSON includes source text and page numbers for every extracted field.

{
  "allegations": [
    {
      "type": "Excessive force/assault and battery",
      "provenance": {
        "source_text": "defendants... wrongfully touched, assaulted and battered...",
        "page_number": 3,
        "paragraph": "Second Cause of Action, Paragraph 10"
      }
    }
  ]
}

LELU Taxonomy Coverage

The extraction prompt uses the exact LELU taxonomy from Airtable.

Allegations (18)

  • Abuse of process
  • ADA/disability claim
  • Conversion
  • Excessive force/assault and battery
  • Excessive pre-arraignment detention
  • Failure to intervene
  • False arrest/False imprisonment
  • Indifference/Denial of medical care
  • Malicious prosecution
  • Monell claim
  • Municipal liability
  • Negligence
  • Retaliation for exercise of constitutional rights
  • Strip search
  • Unconstitutional condition of confinement/cruel and unusual punishment
  • Unlawful search and seizure
  • Violation of NY Civil Rights Law ยง50-a
  • Wrongful conviction/imprisonment

Force Types (12)

  • Baton/asps/object
  • Canine
  • Chokehold
  • Gun pointed (not fired)
  • Mace/Pepper spray
  • Non-weapon physical force
  • Shooting/Discharge of firearm
  • Sonic weapon/LRAD
  • Taser
  • Tight handcuffs
  • Unknown
  • Vehicle/Collision

Themes (10)

  • Disability
  • Homelessness
  • Immigration
  • LGBTQ+ identity
  • Mental health
  • Police assisting EMS
  • Race and/or ethnicity
  • Religion
  • Sex/gender
  • Youth

Locations (17)

  • At residence but not inside (i.e. Lobby/stairwell/yard/etc)
  • Bike
  • Bus/Subway
  • Commercial
  • Hospital/Health Clinic
  • Inside residence
  • Motor vehicle
  • NYCHA buildings and/or grounds
  • Other/Unknown
  • Police vehicle
  • Precinct
  • Prison/Jail
  • Public/Open space
  • School
  • Street
  • Social Services
  • Workplace

Defendant Types (7)

  • City of New York
  • Correction Officer
  • Department of Corrections
  • Individual Police Officer
  • NYPD
  • Other City Agency
  • Other Individual

Technical Requirements

  • AWS Account with Bedrock access (Qwen3 VL, Llama 4)
  • Python 3.10+ with boto3, PyMuPDF
  • Optional: EC2 g5.xlarge for self-hosted DeepSeek OCR

Quick Setup

git clone https://github.com/tiwhi/cap-summary-prototype
cd cap-summary-prototype
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Configure AWS credentials
aws configure

# Test extraction
python run_vision_extraction.py --mode two-step --pdf sample.pdf

Next Steps

  1. Review demo cases to see extraction quality
  2. Compare accuracy metrics across extraction modes
  3. Estimate costs for your document volume
  4. Export sample data for testing Airtable import