AI Data Security & Privacy

AI data privacy in cybersecurity treats sanitization, unlearning, and inference-time leakage as engineering requirements with measurable controls. Perimeter tools such as DLP stop at egress and cannot see personal data once it has been sanitized, encoded into a model, or retrieved at inference time. Closing that gap requires controls that operate at the data, training, and inference layers.

AI Data Privacy Terms & Definitions

This page defines 24 principles, techniques, and operational practices that govern how AI systems handle personal data across its full lifecycle. Each term is mapped to our AI Readiness Framework and the PromptShield™ Risk Management Framework so data privacy connects to a specific control, not a policy clause.

Anonymization

The irreversible removal of identifying information from a dataset so that individuals cannot be re-identified, meeting a higher bar than pseudonymization under GDPR.

Confidential Computing

The use of hardware-based trusted execution environments that keep AI training data and model weights encrypted even while in use, protecting data from cloud providers and privileged insiders.

Consent Management

The systems and workflows that capture, track, and enforce user consent for AI data collection and processing, including the ability to withdraw consent and propagate that withdrawal across dependent models.

Data Classification

The tiered labeling of data by sensitivity (typically public, internal, confidential, restricted) that determines which AI controls, sanitization steps, and deployment restrictions apply to each dataset.
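As a minimal sketch of how a classification tier can drive controls, the snippet below maps the four tiers to cumulative control lists. The control names and the choice of which tier triggers which control are illustrative assumptions, not part of any framework named on this page.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Hypothetical control names; each list applies at that tier and every tier above it.
CONTROLS_BY_MIN_LEVEL = {
    Sensitivity.INTERNAL: ["pii_sanitization"],
    Sensitivity.CONFIDENTIAL: ["access_logging", "regional_deployment_only"],
    Sensitivity.RESTRICTED: ["differential_privacy_training", "no_external_api"],
}


def required_controls(level):
    """Return every control whose minimum tier is at or below `level`."""
    controls = []
    for min_level in sorted(CONTROLS_BY_MIN_LEVEL):
        if level >= min_level:
            controls.extend(CONTROLS_BY_MIN_LEVEL[min_level])
    return controls


print(required_controls(Sensitivity.CONFIDENTIAL))
```

Keeping the mapping cumulative means a dataset promoted to a higher tier never silently loses a control it already had.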

Data Lineage

The end-to-end record of where training data originated, how it was transformed, and which models it was used to train, required for GDPR erasure requests and EU AI Act provenance obligations.

Data Masking

The replacement of sensitive values with fictional but structurally valid substitutes in non-production environments, preventing PII exposure during AI development and testing.

Data Minimization

The GDPR Article 5 principle requiring organizations to collect and process only the personal data strictly necessary for the stated AI purpose, limiting training corpora to what is actually required.

Data Residency

The requirement that personal data remain within specified geographic or jurisdictional boundaries during AI training and inference, enforced through regional deployments and data routing controls.

Data Retention Policy

The policy defining how long AI training data, prompt logs, inference outputs, and model artifacts are kept, balancing operational need against privacy obligations and breach exposure.
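A retention policy only works if something enforces it. The sketch below flags artifacts that have outlived their retention window; the artifact types and the 30/90/365-day periods are hypothetical values chosen for illustration, not recommendations.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention periods per artifact type.
RETENTION = {
    "prompt_log": timedelta(days=30),
    "inference_output": timedelta(days=90),
    "training_snapshot": timedelta(days=365),
}


def is_expired(artifact_type, created_at, now=None):
    """True when the artifact is older than its retention period."""
    now = now or datetime.now(timezone.utc)
    return now - created_at > RETENTION[artifact_type]


def purge_candidates(artifacts, now=None):
    """Return the ids of artifacts that should be deleted."""
    return [a["id"] for a in artifacts if is_expired(a["type"], a["created_at"], now)]
```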

Data Subject Access Request

The formal request from an individual to access, correct, delete, or port their personal data under GDPR, CCPA, or similar laws, which for AI must cover both stored records and model-encoded data.

De-Identification

The process of removing direct and indirect identifiers from data using methods like HIPAA Safe Harbor or Expert Determination, reducing but not always eliminating re-identification risk.

Differential Privacy

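A mathematical framework that adds calibrated noise during training or to outputs, providing a provable bound on how much any single record can change what the model or query reveals. The privacy budget, epsilon, quantifies that bound: smaller values mean stronger privacy, and this page recommends epsilon below 1 for sensitive data.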
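To make the role of epsilon concrete, here is a minimal sketch of the Laplace mechanism applied to a counting query. It illustrates output perturbation only, not differentially private model training, and the function names are assumptions made for this example. A count changes by at most 1 when one record is added or removed, so the noise scale is 1/epsilon.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count satisfying epsilon-differential privacy.

    The sensitivity of a count is 1, so the Laplace scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

Halving epsilon doubles the noise scale, which is the privacy and accuracy trade-off in its simplest form.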

Federated Learning

A training architecture where models learn across decentralized data sources without centralizing the raw data, keeping personal information on-device or within organizational boundaries.
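The aggregation step can be sketched as a weighted average of client parameters, in the style of federated averaging. This is a simplified illustration: weights are plain lists of floats, and the secure transport, client selection, and local training steps are assumed to happen elsewhere.

```python
def federated_average(client_updates):
    """Average client weight vectors, weighted by local example count.

    client_updates: list of (weights, num_examples) tuples. Only the weights
    leave each client; the raw training records never do.
    """
    total_examples = sum(n for _, n in client_updates)
    dimension = len(client_updates[0][0])
    return [
        sum(weights[i] * n for weights, n in client_updates) / total_examples
        for i in range(dimension)
    ]


print(federated_average([([1.0, 2.0], 1), ([3.0, 4.0], 3)]))
```

Model updates can still leak information about local data, which is why federated learning is often paired with differential privacy or secure aggregation.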

Homomorphic Encryption

A cryptographic technique allowing computation on encrypted data without decrypting it first, enabling AI inference on sensitive inputs while the data provider retains full confidentiality.

K-Anonymity

A privacy model ensuring each record in a dataset is indistinguishable from at least k-1 others on quasi-identifiers, limiting re-identification risk to an acceptable threshold.
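A dataset's k can be measured directly: group the rows by their quasi-identifier values and take the size of the smallest group. The sketch below assumes rows are dictionaries and that the caller decides which columns count as quasi-identifiers.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


rows = [
    {"zip": "021**", "age_band": "30-39", "diagnosis": "A"},
    {"zip": "021**", "age_band": "30-39", "diagnosis": "B"},
    {"zip": "021**", "age_band": "40-49", "diagnosis": "A"},
]
print(k_anonymity(rows, ["zip", "age_band"]))
```

Here the single record in the 40-49 band forms a group of one, so the dataset as a whole is only 1-anonymous until that record is generalized further or suppressed.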

Machine Unlearning

The set of techniques, such as SISA training and influence-function-based removal, that remove the effect of specific training examples from a deployed model without full retraining, so that a GDPR erasure request reaches the model and not only the database.
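The sharding idea behind SISA can be shown with a deliberately simplified sketch. It covers sharding and isolation only and omits slicing and checkpoints; the "model" for each shard is just a mean, standing in for a real learner, and all class and method names are hypothetical. The point it illustrates is that deleting one record retrains only the shard that held it.

```python
import zlib


class ShardedMeanModel:
    """Toy ensemble: one 'model' (a mean) per shard, aggregated by averaging."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]  # record_id -> value
        self.models = [None] * num_shards
        self.retrained = []  # log of shard indices retrained, useful for audit trails

    def _shard_of(self, record_id):
        # Deterministic assignment so a record can be located again at deletion time.
        return zlib.crc32(record_id.encode("utf-8")) % len(self.shards)

    def add(self, record_id, value):
        self.shards[self._shard_of(record_id)][record_id] = value

    def _train_shard(self, index):
        values = list(self.shards[index].values())
        self.models[index] = sum(values) / len(values) if values else None
        self.retrained.append(index)

    def fit(self):
        for index in range(len(self.shards)):
            self._train_shard(index)

    def predict(self):
        trained = [m for m in self.models if m is not None]
        if not trained:
            raise ValueError("no trained shards")
        return sum(trained) / len(trained)

    def unlearn(self, record_id):
        """Delete one record and retrain only the shard that contained it."""
        index = self._shard_of(record_id)
        if record_id not in self.shards[index]:
            return False
        del self.shards[index][record_id]
        self._train_shard(index)
        return True
```

The retraining log doubles as evidence that a specific erasure request touched the model, which ties into the model version registry mentioned elsewhere on this page.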

Personal Data Processing

Any operation performed on identifiable information under GDPR Article 4, including AI training, inference, logging, and fine-tuning, each of which requires a documented lawful basis.

Privacy By Design

The GDPR Article 25 obligation to embed privacy controls into AI systems from the architecture stage rather than bolting them on after deployment, covering data minimization, default privacy settings, and purpose limitation.

Privacy Impact Assessment

The structured evaluation of how an AI system processes personal data, required under GDPR Article 35 for high-risk processing, which must identify risks, mitigations, and residual exposure before deployment.

Privacy-Preserving Machine Learning

The family of techniques (differential privacy, federated learning, homomorphic encryption, secure multi-party computation, synthetic data) that enable model training and inference without directly exposing personal data.

Pseudonymization

The replacement of identifiers with consistent fake values that preserve relational patterns but allow re-identification with a separately held key. GDPR Article 4(5) defines the technique, and Recital 26 confirms that pseudonymized data remains personal data.
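A small sketch shows why pseudonymized data stays personal data: as long as the mapping exists, anyone holding it can reverse the substitution. The `PseudonymVault` class is a hypothetical name for illustration; in practice the mapping would live in a separate, access-controlled store.

```python
import secrets


class PseudonymVault:
    """Maps identifiers to stable random tokens and keeps the reverse mapping."""

    def __init__(self):
        self._forward = {}  # identifier -> token
        self._reverse = {}  # token -> identifier (the "additional information")

    def pseudonymize(self, identifier):
        if identifier not in self._forward:
            token = "p_" + secrets.token_hex(8)
            self._forward[identifier] = token
            self._reverse[token] = identifier
        return self._forward[identifier]

    def reidentify(self, token):
        return self._reverse[token]


vault = PseudonymVault()
token = vault.pseudonymize("jane.doe@example.com")
```

The same identifier always yields the same token, which preserves joins across tables, and `reidentify` is exactly the capability that anonymization would have to eliminate.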

Purpose Limitation

The GDPR principle that personal data collected for one purpose cannot be repurposed for AI training or other uses without a compatible legal basis and typically renewed consent.

Secure Multi-Party Computation

A cryptographic technique enabling multiple organizations to jointly train or run models on their combined data without any party revealing its inputs to the others.
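The simplest building block is additive secret sharing, sketched below for a joint sum. This is an illustration of the idea under a semi-honest assumption, run in one process for readability; the modulus and function names are choices made for this example, and real deployments use vetted protocol libraries over authenticated channels.

```python
import secrets

PRIME = 2**61 - 1  # modulus for the shares; inputs must be non-negative and below it


def share(value, n_parties):
    """Split `value` into n random-looking shares that sum to it modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def secure_sum(private_inputs):
    """Simulate each party sharing its input and publishing only a partial sum."""
    n = len(private_inputs)
    distributed = [share(value, n) for value in private_inputs]
    # Party j receives the j-th share from every party and adds them up.
    partial_sums = [sum(distributed[i][j] for i in range(n)) % PRIME for j in range(n)]
    # Only the partial sums are revealed; combining them gives the total.
    return sum(partial_sums) % PRIME


print(secure_sum([12, 30, 7]))
```

No single share or partial sum reveals an individual input, yet the combined result equals the true total.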

Synthetic Data

Artificially generated records that preserve the statistical properties of real data without containing any actual personal records, used to train AI models when real data cannot be shared or retained.

PurpleSec AI Security Readiness Framework

A Practical Framework For Secure, Responsible AI

AI security is not a one-time deployment. It is an ongoing discipline. PurpleSec emphasizes structured discovery, contextual risk analysis, practical control implementation, and continuous refinement.

Frequently Asked Questions

How Is AI Data Privacy Different From Traditional Data Privacy?

Traditional data privacy protects data at rest, in transit, and at egress. Encryption, access control, and DLP cover those three states. AI data privacy adds a fourth state that traditional controls cannot see: data encoded inside model weights. Training data is compressed into the model itself. It leaks through inference responses, through RAG retrieval, and through confidence scores.

A user record deleted from the database still lives in the model that trained on it. That is why AI data privacy requires controls at the data layer, the training layer, and the inference layer, rather than only at the perimeter.

Three regulatory regimes drive most AI data privacy obligations. GDPR Article 17 (right to erasure) applies to personal data encoded in model weights, not just data stored in databases. Article 25 mandates privacy by design. Article 32 requires appropriate technical measures including pseudonymization and encryption. Article 33 sets the 72-hour breach notification clock.

CCPA adds opt-out rights for sale and sharing of personal information and applies to training data licensing. EU AI Act Article 10 requires high-risk AI providers to document data provenance, examine training data for quality issues, and apply appropriate safeguards. Treat GDPR as the rights layer, CCPA as the consumer protection layer, and the EU AI Act as the AI-specific data governance layer.

Every term on this page produces a downstream security, compliance, or brand event when it breaks. Unsanitized PII in training data surfaces verbatim in model outputs. Model inversion attacks use normal inference queries to reconstruct training data, making the model itself the exfiltration channel. Pseudonymization treated as anonymization fails re-identification testing and triggers a GDPR breach.

A RAG system indexes confidential documents and retrieves them across tenant boundaries. A decommissioned model keeps processing customer data under an expired consent framework. Each failure is a P1 or P2 incident with regulatory notification attached, not an ethics discussion item.

Most programs protect training data on the way in and ignore the three exposure surfaces on the way out.

  • Model memorization reproduces phone numbers, email addresses, and PII from training corpora when prompted.
  • Model inversion attacks extract training data through thousands of targeted inference queries.
  • Confidence scores returned in API responses leak far more information than labels alone and make inversion attacks significantly easier.
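One way to narrow the confidence-score surface is to return labels only by default and coarsen any scores that must be exposed. The sketch below is a hypothetical response filter, not a specific product feature; the parameter names and the one-decimal rounding are assumptions.

```python
def harden_response(probabilities, top_k=1, expose_scores=False, decimals=1):
    """Reduce what an inference response reveals about the model's confidence.

    probabilities: mapping of label -> probability from the model.
    By default only the top label is returned; scores, if exposed, are rounded.
    """
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    top = ranked[:top_k]
    response = {"labels": [label for label, _ in top]}
    if expose_scores:
        response["scores"] = [round(score, decimals) for _, score in top]
    return response


probs = {"approve": 0.8731, "review": 0.1042, "deny": 0.0227}
print(harden_response(probs))
```

Rounding and truncation reduce the signal available to inversion and membership inference attacks but do not eliminate it, so this belongs alongside rate limiting rather than in place of it.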

Pseudonymized data is still personal data under GDPR (Article 4(5) defines pseudonymization and Recital 26 confirms its status), which catches organizations that treat it as a compliance shortcut. Metadata carries PII that text sanitization misses: filenames, timestamps, and document properties all enable re-identification.

Machine unlearning is skipped in favor of database-only deletion, leaving model weights in violation of the erasure request.

Scope depends on the data you process and the regulations that cover it. Organizations processing EU personal data must implement data minimization, purpose limitation, DSAR workflows, data residency controls, and machine unlearning procedures.

Map each data class to the regulatory regime that covers it, then apply the terms that match.

Sort AI data privacy controls into three tiers based on where data actually leaks.

  • Tier 1, run now: data classification with four levels from public to restricted, PII detection and sanitization using tools such as Microsoft Presidio or Amazon Macie on all Level 1+ training data, and a functioning DSAR workflow that processes requests within 30 days including model-level effects.
  • Tier 2, run next quarter: differential privacy training with epsilon below 1 for models trained on sensitive data, confidence score suppression on external APIs, rate limits that prevent model inversion attack volume, and a machine unlearning procedure tied to the model version registry.
  • Tier 3, emerging watch list: federated learning for cross-organization training, homomorphic encryption for inference on encrypted inputs, and secure multi-party computation for joint model training where no party should see raw data.
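The Tier 1 sanitization step can be sketched with a regex-based redactor. This is a deliberately small stand-in for a dedicated tool, not a use of the Presidio or Macie APIs; the three patterns cover only email addresses and US-style phone and Social Security numbers and will miss most real-world PII.

```python
import re

# Illustrative patterns only; production scanners combine many recognizers with context.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text):
    """Replace each match with a placeholder and report how many were found per type."""
    counts = {}
    for label, pattern in PATTERNS.items():
        text, found = pattern.subn(f"<{label}>", text)
        if found:
            counts[label] = found
    return text, counts


clean, found = redact("Mail jane.doe@example.com or call 555-123-4567.")
print(clean, found)
```

Returning the counts alongside the cleaned text lets a pipeline stage fail the build or raise an alert when PII appears where none is expected.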

PurpleSec’s AI Readiness Framework maps each tier to concrete milestones by AI maturity.

Five metrics tell you whether a privacy program is operational:

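  • Membership inference attack accuracy below 60% across all models trained on PII, measured on a balanced set of training members and non-members. On a balanced set 50% equals random guessing, so accuracy above 60% indicates the model is leaking membership information.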
  • Differential privacy epsilon below 1 for models trained on healthcare or financial data, documented in the model card.
  • DSAR fulfillment within 30 days including database deletion, model-level unlearning, and audit trail preservation.
  • 72-hour breach notification SLA for any confirmed personal data exposure, per GDPR Article 33.
  • Zero unsanitized PII in training data, verified by automated PII scans at every pipeline stage plus annual penetration testing that attempts to extract PII from production models.
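The first metric can be measured with a simple threshold attack, sketched below: the attacker guesses "member" whenever the model's confidence on a record is at or above a threshold. The function name, the example confidences, and the 0.8 threshold are assumptions for illustration; stronger attacks such as shadow models would give a tighter estimate.

```python
def membership_inference_accuracy(member_confidences, nonmember_confidences, threshold):
    """Accuracy of a threshold attacker on a balanced member / non-member set."""
    if len(member_confidences) != len(nonmember_confidences):
        raise ValueError("use a balanced evaluation set so that 50% means random guessing")
    correct = sum(c >= threshold for c in member_confidences)
    correct += sum(c < threshold for c in nonmember_confidences)
    return correct / (len(member_confidences) + len(nonmember_confidences))


# Hypothetical model confidences on records it was and was not trained on.
members = [0.99, 0.95, 0.70, 0.60]
nonmembers = [0.90, 0.65, 0.55, 0.40]
accuracy = membership_inference_accuracy(members, nonmembers, threshold=0.8)
print(accuracy, accuracy < 0.60)
```

In this made-up example the attacker reaches 62.5% accuracy, so the model would miss the 60% target.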

Trending these five numbers quarterly is what separates a privacy program from a privacy policy.
