Re-identification Risk vs Predictive Utility on US Hospital Discharge Data
About
HIPAA's Safe Harbor de-identification standard requires removing 18 categories of explicit identifiers, but a large literature, from Sweeney (2000) through current re-identification audits, shows that the quasi-identifiers left in the record (demographics, admission codes, payer, specialty) are often enough to re-identify individual patients when joined against external data. Modern AI pipelines compound the problem: every training run, vendor handoff, and research extract is a new exposure surface. The operational question is no longer “is this data anonymous?” but “which mitigation gives us defensible privacy without breaking the model that has to run on it?” This project answers that question empirically on a real, large-scale hospital dataset.



