AI and Data Privacy: What Happens to Data You Send to AI Services
A clear-eyed look at OpenAI, Anthropic, and Google data retention policies, enterprise tiers, self-hosted alternatives, and GDPR obligations when using AI services.
When your application sends a message to OpenAI or Anthropic, what actually happens to that data? This question matters enormously — not just philosophically, but legally. The answer determines whether you need user consent, a data processing agreement, specific contractual terms with your provider, or whether you should be using a cloud API at all.
This article breaks down data handling policies across major providers, explains enterprise tiers and what they actually guarantee, covers GDPR and CCPA obligations, and walks through when self-hosted models make sense.
What Cloud LLM Providers Do with Your Data
OpenAI
OpenAI's data handling depends heavily on which product tier you're using.
Default API (pay-as-you-go)
By default, OpenAI retains API inputs and outputs for up to 30 days for abuse monitoring and debugging. Per their policy, data submitted via the API is not used to train their models unless you explicitly opt in. However, "safety review" means that authorized human reviewers may read samples of flagged conversations.
Exclusion from training is the default for API users (unlike ChatGPT consumer products, where you must opt out), but the 30-day abuse-monitoring retention cannot be disabled on standard tiers.
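As a defensive measure on standard tiers, some teams redact obvious identifiers before a prompt ever leaves their infrastructure. A minimal sketch of that idea (the regex patterns and the `redact` helper are illustrative assumptions, not exhaustive; production PII detection needs a dedicated library):

```python
import re

# Hypothetical pre-flight redaction: mask obvious identifiers before a
# prompt is sent to a standard (non-enterprise) API tier. These patterns
# are deliberately simple and will miss many real-world PII formats.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    """Replace matched identifiers with typed placeholders like [EMAIL]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text
```

Redaction is lossy, so it only suits features where the model does not need the identifier itself (summarization, classification), not ones where it does (drafting a reply to a specific person).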
OpenAI Enterprise / ChatGPT Enterprise
- Zero data retention (ZDR) available for eligible customers and endpoints: inputs and outputs are not retained after the request completes
- No training on your data
- SOC 2 Type II certified
- Business Associate Agreement (BAA) available, making it HIPAA-eligible
- Encryption in transit and at rest
Azure OpenAI
Data stays within your Azure tenant and is not used for training, and prompts are not shared with OpenAI. Abuse monitoring with limited retention applies by default, though eligible customers can apply for a modified-monitoring exemption. Azure OpenAI is available in specific compliance-focused regions and supports HIPAA, FedRAMP, and other frameworks.
Anthropic
Standard API
Anthropic's commercial terms state that inputs and outputs from paid API usage are not used to train models by default, though content flagged by trust-and-safety systems may be reviewed and retained for longer. Exact retention windows for standard API customers are not precisely disclosed in public documentation.
Claude for Enterprise
- No training on customer data
- Extended data retention controls
- DPA available for GDPR compliance
- Zero data retention option for eligible customers
Anthropic publishes a Privacy Policy and a usage policy, but the specific retention windows for standard API customers are less explicitly documented than OpenAI's. If you need specific retention guarantees, you need a contractual DPA.
Google Gemini / Vertex AI
Gemini API (Google AI Studio)
For the free tier in Google AI Studio, Google may use prompts and responses to improve its products, including through human review. Per Google's terms, paid Gemini API usage is not used to improve models; verify the terms for your specific tier before sending sensitive data.
Vertex AI
Through Vertex AI (Google Cloud), your data is not used for training, stays within your Google Cloud project, and can be configured for specific regions. Vertex AI is eligible for HIPAA BAA.
What "Not Used for Training" Actually Means
When providers say your data isn't used for training, this typically means:
- Your specific inputs/outputs are not included in gradient updates to the model
- Your data is not retained in a training dataset
It does not mean:
- Your data is never stored (it usually is, temporarily)
- No human ever sees it (safety teams typically review samples)
- That no derived signals (aggregate usage statistics, abuse classifiers) are ever produced from your traffic
True zero-data-retention means the response is generated and returned to you and the input/output is immediately discarded — no logging, no retention, no human review. This requires explicit contractual commitment, not just a policy statement.
GDPR Obligations When Using AI APIs
If you're operating in the EU or processing EU personal data, using a cloud LLM API has significant GDPR implications.
Is Your LLM Provider a Data Processor?
Under GDPR, when you send personal data to an AI provider for processing (generating a response, summarizing a document), the AI provider is acting as a data processor on your behalf. You are the data controller.
This triggers several obligations:
1. You must have a Data Processing Agreement (DPA)
Article 28 GDPR requires a written contract between controller and processor. This contract must specify:
- The subject matter and duration of processing
- The nature and purpose of processing
- The type of personal data involved
- The processor's obligations (security, sub-processors, deletion)
Without a DPA with your AI provider, you are likely in violation of GDPR if you're sending personal data through their APIs.
All major providers offer DPAs:
- OpenAI: Available under the Enterprise agreement
- Anthropic: Available on request for business accounts
- Google Cloud (Vertex AI): Available through Google Cloud's standard DPA
- Microsoft Azure OpenAI: Covered by Microsoft's standard Cloud DPA
2. Transfers outside the EU/EEA require a legal basis
If your AI provider processes data in the US (or any country without an EU adequacy decision), you need a valid transfer mechanism:
- Standard Contractual Clauses (SCCs): The most common mechanism; all major providers include these in their DPAs
- EU-US Data Privacy Framework: Companies certified under this framework can transfer EU data to the US legally
- Binding Corporate Rules: For intra-group transfers
3. Retention must be proportionate
GDPR's storage limitation principle requires you to retain personal data no longer than necessary. If your AI provider retains prompts for 30 days for debugging, you need to assess whether that retention is compatible with your stated purposes and privacy notices.
4. Data subject rights must be honoured
If users submit personal data via your AI feature and later request erasure, you must be able to fulfill that request. If the provider has retained their data, you need a mechanism to request deletion from the provider — make sure your DPA covers this.
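One pragmatic pattern is to pair local deletion with an audit record of the corresponding provider-side request. A minimal sketch, assuming you keep your own prompt log keyed by user ID (the `ErasureHandler` class and its in-memory stores are hypothetical; the mechanism for the upstream deletion request depends on what your DPA provides):

```python
from datetime import datetime, timezone

class ErasureHandler:
    """Sketch of fulfilling a GDPR Article 17 erasure request for data
    that passed through an AI feature. Local records are deleted
    immediately; the provider-side deletion is recorded as a pending
    request, since how you submit it is DPA-specific."""

    def __init__(self, prompt_log: dict):
        self.prompt_log = prompt_log          # user_id -> list of stored prompts
        self.pending_provider_requests = []   # audit trail of upstream requests

    def erase(self, user_id: str) -> dict:
        removed = len(self.prompt_log.pop(user_id, []))
        # Record that a deletion request must also go to the processor
        # (GDPR Article 28(3)(g): deletion at the controller's choice).
        self.pending_provider_requests.append({
            "user_id": user_id,
            "requested_at": datetime.now(timezone.utc).isoformat(),
        })
        return {"local_records_removed": removed, "provider_request_logged": True}
```

The audit trail matters: if a regulator asks how you handled an erasure request, you need evidence that the provider-side deletion was requested, not just that your own copy was removed.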
Practical GDPR Compliance for AI Features
```python
class GDPRCompliantLLMClient:
    def __init__(self, provider_client, dpa_signed: bool):
        if not dpa_signed:
            raise RuntimeError("Cannot process personal data without signed DPA")
        self.client = provider_client

    def process_with_pii_check(self, text: str, user_id: str) -> str:
        # Check whether the text contains personal data
        # (detect_pii and log_processing_activity are assumed to be
        # implemented elsewhere, e.g. with a PII-detection library)
        pii_detected = self.detect_pii(text)
        if pii_detected:
            # Log for audit trail (GDPR Article 30 record-keeping)
            self.log_processing_activity(
                user_id=user_id,
                data_categories=[p["type"] for p in pii_detected],
                purpose="ai_assistance",
                legal_basis="contract",  # or "legitimate_interest", "consent"
                processor="openai",
                transfer_mechanism="scc",
            )
        return self.client.complete(text)
```
CCPA Considerations
The California Consumer Privacy Act (CCPA) / CPRA adds additional obligations for companies serving California residents.
Key points for AI integrations:
- AI providers receiving personal data qualify as "service providers" under CCPA if you have a written contract restricting them from using the data for their own purposes
- Without such a contract, sharing data with the provider may constitute a "sale" or "sharing" of personal data, triggering opt-out rights
- CCPA gives consumers the right to know what personal information was collected and to request deletion — same challenge as GDPR for prompt data retained by providers
Self-Hosted LLMs: When to Consider Them
Self-hosted models eliminate the third-party data sharing concern entirely. The tradeoff is infrastructure cost, capability (most self-hosted models trail frontier models), and operational complexity.
Ollama (Local Inference)
Ollama runs models locally on developer machines or servers. No data leaves your infrastructure:
```bash
# Install and run a model locally
curl -fsSL https://ollama.ai/install.sh | sh
ollama pull llama3.1:8b
ollama serve
```

```python
import ollama

response = ollama.chat(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user_message},
    ],
)
print(response["message"]["content"])
```
Suitable for:
- Development and testing environments where you don't want to pay API costs
- Internal tools processing sensitive employee or customer data
- Air-gapped environments (defense, finance, healthcare)
- Use cases where model capability requirements are modest (summarization, classification, simple QA)
Not suitable for:
- Applications requiring frontier model capabilities (complex reasoning, code generation at scale)
- Consumer-facing products needing low latency and high throughput without dedicated hardware
vLLM (Production Self-Hosting)
For production self-hosted inference, vLLM provides PagedAttention for efficient memory usage and OpenAI-compatible APIs:
```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API
client = OpenAI(
    api_key="not-needed-for-self-hosted",
    base_url="http://your-vllm-server:8000/v1",
)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": user_message}],
)
```
Recommended models for self-hosting:
- Llama 3.1 70B / 405B (Meta): Strong general capability, permissive license for most commercial uses
- Mistral 7B / Mixtral 8x7B: Efficient models with good instruction following
- Qwen 2.5: Strong multilingual capability
- Phi-3 / Phi-4: Small but capable for constrained deployments
Hardware Requirements
| Model Size | Minimum VRAM | Recommended |
|---|---|---|
| 7B | 8 GB (4-bit) | 16 GB |
| 13B | 10 GB (4-bit) | 24 GB |
| 34B | 20 GB (4-bit) | 48 GB |
| 70B | 40 GB (4-bit) | 2x 80 GB A100 |
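These figures follow a simple rule of thumb: weight memory is roughly parameter count times bytes per parameter, plus overhead for the KV cache and activations. A back-of-envelope helper (the 1.2 overhead factor is an assumption; real usage depends heavily on context length and batch size):

```python
def estimate_vram_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    """Back-of-envelope inference VRAM estimate: weight memory times an
    overhead factor for KV cache and activations. A planning heuristic,
    not a guarantee."""
    weight_gb = params_billion * (bits / 8)  # 1B params at 8 bits is ~1 GB
    return round(weight_gb * overhead, 1)
```

For example, a 70B model at 4-bit comes out around 42 GB, consistent with the table's 40 GB minimum; the table's smaller-model minimums are more conservative because they reflect available card sizes as well as raw weight memory.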
Quantization Trade-offs
Self-hosted models often run in quantized form (INT4, INT8) to fit in available VRAM. Quantization reduces memory footprint at some cost to output quality:
```python
# Using llama.cpp with 4-bit quantization
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3.1-8b-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,  # offload all layers to the GPU
)
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": user_message}]
)
```
Building a Data Classification Policy for AI Features
Not all data is equally sensitive. Build a classification system to determine which data can go to cloud APIs and which must stay on-premises:
| Classification | Definition | AI Processing |
|---|---|---|
| Public | Already public information | Any cloud API |
| Internal | Internal business data, no personal data | Cloud API with DPA |
| Confidential | Personal data, financial records | Enterprise API with DPA + zero retention, or self-hosted |
| Restricted | Health data, payment card data, national security | Self-hosted only |
Implement this as a pre-flight check in your AI gateway:
```python
def route_to_appropriate_provider(text: str, classification: str) -> Any:
    if classification == "restricted":
        # Restricted data never leaves your infrastructure
        return local_llm_client.complete(text)
    if classification == "confidential":
        # Confidential data requires an enterprise agreement; fall back
        # to self-hosted inference if none is in place
        if not ENTERPRISE_AGREEMENT_ACTIVE:
            return local_llm_client.complete(text)
        return enterprise_api_client.complete(text)
    return standard_api_client.complete(text)
```
Key Takeaways
The privacy implications of AI integrations are substantial and frequently underestimated:
- Default API tiers are not privacy-safe for personal data — you need enterprise contracts or self-hosting
- A signed DPA is legally required under GDPR before sending EU personal data to any AI provider
- "Not used for training" does not mean "not retained" — understand retention windows and human review policies
- Self-hosted models are viable for many use cases and eliminate third-party data sharing entirely
- Data classification should gate API selection — not all requests need frontier models, and not all data can go to them
Privacy compliance for AI features is not a one-time checkbox. Provider policies change, regulations evolve, and your data flows should be reviewed regularly as both technology and compliance requirements shift.