AI Security

AI and Data Privacy: What Happens to Data You Send to AI Services

A clear-eyed look at OpenAI, Anthropic, and Google data retention policies, enterprise tiers, self-hosted alternatives, and GDPR obligations when using AI services.

September 1, 2025 · 9 min read · ShipSafer Team

When your application sends a message to OpenAI or Anthropic, what actually happens to that data? This question matters enormously — not just philosophically, but legally. The answer determines whether you need user consent, a data processing agreement, specific contractual terms with your provider, or whether you should be using a cloud API at all.

This article breaks down data handling policies across major providers, explains enterprise tiers and what they actually guarantee, covers GDPR and CCPA obligations, and walks through when self-hosted models make sense.

What Cloud LLM Providers Do with Your Data

OpenAI

OpenAI's data handling depends heavily on which product tier you're using.

Default API (pay-as-you-go)

By default, OpenAI retains API inputs and outputs for up to 30 days for abuse monitoring and to assist with debugging. Per their policy, they do not use data submitted via the API to train their models unless you explicitly opt in. However, abuse monitoring means that human reviewers may read samples of conversations.

Training on API data is therefore off by default (unlike ChatGPT consumer products, where you must opt out), but the 30-day retention for abuse monitoring cannot be disabled on standard tiers.

OpenAI Enterprise / ChatGPT Enterprise

  • Zero data retention: inputs and outputs are not retained by OpenAI after the API call
  • No training on your data
  • SOC 2 Type II certified
  • Business Associate Agreement (BAA) available, making it HIPAA-eligible
  • Encryption in transit and at rest

Azure OpenAI

Prompts and completions stay within your Azure resource and are not used to train models or shared with OpenAI. Microsoft's abuse monitoring may retain prompts for a limited period, though approved customers can apply to opt out, leaving Microsoft with no routine visibility into your prompts. Azure OpenAI is available in specific compliance-focused regions and supports HIPAA, FedRAMP, and other frameworks.

Anthropic

Standard API

Anthropic's commercial terms state that they do not train models on API inputs and outputs by default; explicit opt-in (for example, submitting feedback) is the exception. Retention periods for standard API traffic are not precisely disclosed in the public documentation.

Claude for Enterprise

  • No training on customer data
  • Extended data retention controls
  • DPA available for GDPR compliance
  • Zero data retention option for eligible customers

Anthropic publishes a Privacy Policy and a usage policy, but the specific retention windows for standard API customers are less explicitly documented than OpenAI's. If you need specific retention guarantees, you need a contractual DPA.

Google Gemini / Vertex AI

Gemini API (Google AI Studio)

On the unpaid tier (including the free tier in Google AI Studio), Google may use prompts, including human review of them, to improve their products, and Google's standard consumer data policies apply. Paid-tier API traffic is not used to improve Google's models.

Vertex AI

Through Vertex AI (Google Cloud), your data is not used for training, stays within your Google Cloud project, and can be configured for specific regions. Vertex AI is eligible for HIPAA BAA.

What "Not Used for Training" Actually Means

When providers say your data isn't used for training, this typically means:

  • Your specific inputs/outputs are not included in gradient updates to the model
  • Your data is not retained in a training dataset

It does not mean:

  • Your data is never stored (it usually is, temporarily)
  • No human ever sees it (safety teams typically review samples)
  • The model cannot incidentally "learn" from your data in any sense

True zero-data-retention means the response is generated and returned to you and the input/output is immediately discarded — no logging, no retention, no human review. This requires explicit contractual commitment, not just a policy statement.

GDPR Obligations When Using AI APIs

If you're operating in the EU or processing EU personal data, using a cloud LLM API has significant GDPR implications.

Is Your LLM Provider a Data Processor?

Under GDPR, when you send personal data to an AI provider for processing (generating a response, summarizing a document), the AI provider is acting as a data processor on your behalf. You are the data controller.

This triggers several obligations:

1. You must have a Data Processing Agreement (DPA)

Article 28 GDPR requires a written contract between controller and processor. This contract must specify:

  • The subject matter and duration of processing
  • The nature and purpose of processing
  • The type of personal data involved
  • The processor's obligations (security, sub-processors, deletion)

Without a DPA with your AI provider, you are likely in violation of GDPR if you're sending personal data through their APIs.

All major providers offer DPAs:

  • OpenAI: Available under the Enterprise agreement
  • Anthropic: Available on request for business accounts
  • Google Cloud (Vertex AI): Available through Google Cloud's standard DPA
  • Microsoft Azure OpenAI: Covered by Microsoft's standard Cloud DPA

2. Transfers outside the EU/EEA require a legal basis

If your AI provider processes data in the US (or any country without an EU adequacy decision), you need a valid transfer mechanism:

  • Standard Contractual Clauses (SCCs): The most common mechanism; all major providers include these in their DPAs
  • EU-US Data Privacy Framework: Companies certified under this framework can transfer EU data to the US legally
  • Binding Corporate Rules: For intra-group transfers

3. Retention must be proportionate

GDPR's storage limitation principle requires you to retain personal data no longer than necessary. If your AI provider retains prompts for 30 days for debugging, you need to assess whether that retention is compatible with your stated purposes and privacy notices.

4. Data subject rights must be honoured

If users submit personal data via your AI feature and later request erasure, you must be able to fulfill that request. If the provider has retained their data, you need a mechanism to request deletion from the provider — make sure your DPA covers this.
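One way to make erasure requests actionable is to record which provider requests carried a given user's personal data, so that a deletion can later be forwarded to the provider. A minimal in-memory sketch (class and method names are our own; a real system would persist this store and invoke whatever deletion process your DPA provides):

```python
from collections import defaultdict

class ErasureRegistry:
    """Tracks which provider request IDs contained a given user's personal data."""

    def __init__(self):
        self._requests_by_user = defaultdict(list)

    def record(self, user_id: str, provider: str, request_id: str) -> None:
        # Call this whenever a prompt containing the user's data is sent out
        self._requests_by_user[user_id].append((provider, request_id))

    def pending_deletions(self, user_id: str) -> list[tuple[str, str]]:
        # Everything that must be deleted provider-side to honour an
        # Article 17 erasure request; clears the user's local record
        return list(self._requests_by_user.pop(user_id, []))
```

On an erasure request you would drain `pending_deletions(user_id)` and raise a deletion request with each provider listed, per the terms of your DPA.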

Practical GDPR Compliance for AI Features

class GDPRCompliantLLMClient:
    def __init__(self, provider_client, dpa_signed: bool):
        if not dpa_signed:
            raise RuntimeError("Cannot process personal data without signed DPA")
        self.client = provider_client

    def process_with_pii_check(self, text: str, user_id: str) -> str:
        # detect_pii is an assumed helper: regex rules or a dedicated
        # PII-detection library
        pii_detected = self.detect_pii(text)

        if pii_detected:
            # Log for audit trail (GDPR Article 30 record-keeping)
            self.log_processing_activity(
                user_id=user_id,
                data_categories=[p["type"] for p in pii_detected],
                purpose="ai_assistance",
                legal_basis="contract",  # or "legitimate_interest", "consent"
                processor="openai",
                transfer_mechanism="scc",
            )

        return self.client.complete(text)
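The detect_pii helper above is left abstract. A minimal regex-based sketch might look like the following; the patterns are illustrative only, and production systems should prefer a dedicated PII-detection library over hand-rolled regexes:

```python
import re

# Illustrative patterns only -- real PII detection needs far broader coverage
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

def detect_pii(text: str) -> list[dict]:
    """Return a list of {"type": ..., "value": ...} findings for each match."""
    findings = []
    for pii_type, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append({"type": pii_type, "value": match.group()})
    return findings
```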

CCPA Considerations

The California Consumer Privacy Act (CCPA) / CPRA adds additional obligations for companies serving California residents.

Key points for AI integrations:

  • AI providers receiving personal data qualify as "service providers" under CCPA if you have a written contract restricting them from using the data for their own purposes
  • Without such a contract, sharing data with the provider may constitute a "sale" or "sharing" of personal data, triggering opt-out rights
  • CCPA gives consumers the right to know what personal information was collected and to request deletion — same challenge as GDPR for prompt data retained by providers

Self-Hosted LLMs: When to Consider Them

Self-hosted models eliminate the third-party data sharing concern entirely. The tradeoff is infrastructure cost, capability (most self-hosted models trail frontier models), and operational complexity.

Ollama (Local Inference)

Ollama runs models locally on developer machines or servers. No data leaves your infrastructure:

# Install and run a model locally
curl -fsSL https://ollama.ai/install.sh | sh
ollama pull llama3.1:8b
ollama serve

# Query the local model from Python (pip install ollama)
import ollama

response = ollama.chat(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user_message}
    ]
)
print(response["message"]["content"])

Suitable for:

  • Development and testing environments where you don't want to pay API costs
  • Internal tools processing sensitive employee or customer data
  • Air-gapped environments (defense, finance, healthcare)
  • Use cases where model capability requirements are modest (summarization, classification, simple QA)

Not suitable for:

  • Applications requiring frontier model capabilities (complex reasoning, code generation at scale)
  • Consumer-facing products needing low latency and high throughput without dedicated hardware

vLLM (Production Self-Hosting)

For production self-hosted inference, vLLM provides PagedAttention for efficient memory usage and OpenAI-compatible APIs:

from openai import OpenAI

# vLLM serves an OpenAI-compatible API
client = OpenAI(
    api_key="not-needed-for-self-hosted",
    base_url="http://your-vllm-server:8000/v1"
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": user_message}]
)

Recommended models for self-hosting:

  • Llama 3.1 70B / 405B (Meta): Strong general capability, permissive license for most commercial uses
  • Mistral 7B / Mixtral 8x7B: Efficient models with good instruction following
  • Qwen 2.5: Strong multilingual capability
  • Phi-3 / Phi-4: Small but capable for constrained deployments

Hardware Requirements

| Model Size | Minimum VRAM | Recommended |
| --- | --- | --- |
| 7B | 8 GB (4-bit) | 16 GB |
| 13B | 10 GB (4-bit) | 24 GB |
| 34B | 20 GB (4-bit) | 48 GB |
| 70B | 40 GB (4-bit) | 2x 80 GB A100 |
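The minimum-VRAM figures follow from a back-of-envelope rule: weight memory is parameter count times bits per weight divided by 8, and deployments then add headroom for the KV cache, activations, and framework overhead. A sketch (the function name is our own):

```python
def estimate_weight_memory_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """GB needed just to hold the weights: params x bits / 8."""
    return n_params_billion * bits_per_weight / 8

# A 70B model at 4-bit needs ~35 GB for weights alone; real deployments
# need additional VRAM on top of this for the KV cache and activations.
```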

Quantization Trade-offs

Self-hosted models often run in quantized form (INT4, INT8) to fit in available VRAM. Quantization reduces memory footprint at some cost to output quality:

# Using llama.cpp with 4-bit quantization
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3.1-8b-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,  # Use all GPU layers
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": user_message}]
)

Building a Data Classification Policy for AI Features

Not all data is equally sensitive. Build a classification system to determine which data can go to cloud APIs and which must stay on-premises:

| Classification | Definition | AI Processing |
| --- | --- | --- |
| Public | Already public information | Any cloud API |
| Internal | Internal business data, no personal data | Cloud API with DPA |
| Confidential | Personal data, financial records | Enterprise API with DPA + zero retention, or self-hosted |
| Restricted | Health data, payment card data, national security | Self-hosted only |

Implement this as a pre-flight check in your AI gateway:

def route_to_appropriate_provider(text: str, classification: str) -> Any:
    if classification == "restricted":
        # Restricted data never leaves our infrastructure
        return local_llm_client.complete(text)
    if classification == "confidential":
        # Confidential data requires an enterprise agreement (DPA + zero retention);
        # otherwise fall back to the self-hosted model
        if ENTERPRISE_AGREEMENT_ACTIVE:
            return enterprise_api_client.complete(text)
        return local_llm_client.complete(text)
    return standard_api_client.complete(text)

Key Takeaways

The privacy implications of AI integrations are substantial and frequently underestimated:

  1. Default API tiers are not privacy-safe for personal data — you need enterprise contracts or self-hosting
  2. A signed DPA is legally required under GDPR before sending EU personal data to any AI provider
  3. "Not used for training" does not mean "not retained" — understand retention windows and human review policies
  4. Self-hosted models are viable for many use cases and eliminate third-party data sharing entirely
  5. Data classification should gate API selection — not all requests need frontier models, and not all data can go to them

Privacy compliance for AI features is not a one-time checkbox. Provider policies change, regulations evolve, and your data flows should be reviewed regularly as both technology and compliance requirements shift.

AI privacy
GDPR
data retention
OpenAI policy
self-hosted LLM
Ollama
