Vector Database Security: Securing Embeddings and Preventing Data Extraction

Vector databases — Pinecone, Weaviate, Qdrant, Chroma, pgvector — are now foundational infrastructure for AI applications. They store embeddings (high-dimensional numerical representations of text, images, or other data) and enable semantic similarity search. Most teams treat them as a simple storage layer and pay little attention to their security properties.

This is a mistake. Vector databases contain dense representations of your most sensitive data, have novel attack vectors (embedding inversion, cross-tenant leakage) that traditional database security doesn't address, and are frequently misconfigured in ways that expose data to unauthorized users.

What Is an Embedding, and What Can an Attacker Do with It?

An embedding is a numerical vector — typically 768 to 3,072 floating-point numbers — that encodes the semantic content of a piece of text. Similar texts produce similar vectors. This is what enables semantic search: find documents similar to this query, even if they use different words.
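To make "similar texts produce similar vectors" concrete, here is a minimal sketch of cosine similarity, a measure commonly used to rank results in vector search. The three-dimensional vectors and their names are made-up stand-ins for illustration; real embeddings come from an embedding model and have far more dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", invented for this example
refund_policy = [0.9, 0.1, 0.0]
money_back_rules = [0.8, 0.2, 0.1]
server_uptime = [0.0, 0.1, 0.9]

close = cosine_similarity(refund_policy, money_back_rules)
far = cosine_similarity(refund_policy, server_uptime)
print(f"related texts: {close:.2f}, unrelated texts: {far:.2f}")
```

The two texts about refunds score close to 1.0 even though they share no words, while the unrelated text scores close to 0.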

From a security perspective, embeddings present two concerns:

1. They contain encoded information about the original content

An embedding is not a hash. A cryptographic hash is a one-way function: you cannot recover the input from the hash. An embedding is an encoding in a learned high-dimensional space. You cannot simply "decode" it back to the original text character by character, but much of the encoded information can be recovered by an attacker who holds the vector and can query the embedding model that produced it.

2. They enable finding data through similarity, bypassing keyword-based access controls

Many legacy access control systems work by checking whether users can query specific identifiers (document IDs, table names). Vector search finds documents based on semantic content — which means a user can find content they're not supposed to access by crafting queries that are semantically similar to that content, even if they don't know its identifier.

Embedding Inversion Attacks

Research has demonstrated that embeddings can be partially inverted — meaning it is possible to recover significant portions of the original text from the embedding alone.

How Inversion Works

Morris et al. (2023, "Text Embeddings Reveal (Almost) As Much As Text") showed that models trained to invert embeddings can reconstruct text with high accuracy. Their Vec2Text method, which iteratively corrects a guess and re-embeds it, is reported to recover 92% of 32-token text inputs exactly, and the paper also evaluates inversion of OpenAI's text-embedding-ada-002 embeddings.

The attack works as follows:

1. Collect a large number of (text, embedding) pairs from the target model (often possible via API)
2. Train an inversion model that maps embeddings back to text
3. Apply the inversion model to target embeddings
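A trained inversion model is beyond a short example, but the simplest form of the same threat can be sketched: an attacker who holds a stolen vector and can call the same embedding function confirms guesses by embedding candidate texts and comparing. This is candidate matching, not Vec2Text. The toy_embed function is a hypothetical hashed bag-of-words stand-in for a real embedding model, and the texts are invented for illustration.

```python
import hashlib
import math

DIMS = 256  # toy dimensionality, an assumption for this sketch

def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag of words."""
    vec = [0.0] * DIMS
    for token in text.lower().split():
        bucket = int(hashlib.sha256(token.encode()).hexdigest(), 16) % DIMS
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Inputs are unit-normalized, so the dot product is the cosine similarity
    return sum(x * y for x, y in zip(a, b))

def best_guess(stolen: list[float], candidates: list[str]) -> str:
    """Return the candidate whose embedding is closest to the stolen vector."""
    return max(candidates, key=lambda text: cosine(stolen, toy_embed(text)))

# The attacker never sees this text, only its vector
stolen_vector = toy_embed("patient diagnosed with type 2 diabetes")

candidates = [
    "patient diagnosed with asthma",
    "patient diagnosed with type 2 diabetes",
    "quarterly revenue exceeded forecast",
]
recovered = best_guess(stolen_vector, candidates)
print(recovered)
```

Trained inversion removes the need for a candidate list, which is what makes it more dangerous, but both attacks rest on the same two preconditions: access to the stored vectors and access to the embedding model.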

The attacker needs access to the embedding vectors themselves and the ability to query the embedding model. If your vector database is exposed with minimal authentication, an attacker who gains API access can extract embeddings and attempt inversion.

What This Means for Your Data

If you store embeddings of:

Customer emails or support tickets
Internal documents
Private user data
Healthcare records
Financial information

...and an attacker gains access to those embeddings, they can potentially recover significant portions of the original text using inversion techniques.

Defense: Encrypt Embeddings at Rest and in Transit

import struct

from cryptography.fernet import Fernet

class EncryptedEmbeddingStore:
    def __init__(self, encryption_key: bytes):
        # The key must be a Fernet key, e.g. from Fernet.generate_key()
        self.fernet = Fernet(encryption_key)

    def encrypt_embedding(self, embedding: list[float]) -> bytes:
        """Serialize and encrypt an embedding vector."""
        # Pack as little-endian float32 so the format is portable across machines.
        # Values are rounded to float32 precision.
        embedding_bytes = struct.pack(f'<{len(embedding)}f', *embedding)
        return self.fernet.encrypt(embedding_bytes)

    def decrypt_embedding(self, encrypted: bytes) -> list[float]:
        """Decrypt and deserialize an embedding vector."""
        embedding_bytes = self.fernet.decrypt(encrypted)
        count = len(embedding_bytes) // 4  # 4 bytes per float32
        return list(struct.unpack(f'<{count}f', embedding_bytes))

Note: Most vector databases compute similarity on plaintext vectors, so application-level encryption like this suits vectors kept outside the search index (backups, exports, archives). Encryption at rest protects against storage-layer breaches; encryption in transit means TLS on every connection to the database. Neither helps against an attacker who has API access to the database, because the API serves decrypted vectors. For inversion defense, the key control is strict API authentication and authorization.

Defense: Consider Noise Injection for Stored Embeddings

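For embeddings you need to store but don't need exact retrieval from (e.g., analytics or feature stores), adding calibrated noise makes inversion harder at the cost of similarity accuracy. The sketch below uses Laplace noise. How much of the similarity structure survives depends entirely on the noise scale, so measure retrieval quality before settling on a value:

import numpy as np

def add_privacy_noise(embedding: list[float], epsilon: float = 0.1) -> list[float]:
    """
    Add per-coordinate Laplace noise, then re-normalize.
    Smaller epsilon = more noise (more privacy, less accuracy).
    """
    embedding_array = np.asarray(embedding, dtype=float)
    # Assumed L1 sensitivity. The Laplace mechanism is calibrated to L1 sensitivity;
    # a formal differential-privacy guarantee needs a real bound for your embeddings.
    sensitivity = 1.0
    scale = sensitivity / epsilon
    # With unit-norm embeddings, epsilon=0.1 gives scale=10, so the noise dominates
    # the signal. Tune epsilon against measured retrieval quality.
    noise = np.random.laplace(0, scale, embedding_array.shape)
    noisy = embedding_array + noise
    # Re-normalize to unit sphere
    noisy = noisy / np.linalg.norm(noisy)
    return noisy.tolist()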
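Because the usefulness of a noisy embedding depends on the noise scale, it helps to measure the effect before choosing epsilon. The sketch below is a standard-library-only variant (it samples Laplace noise as the difference of two exponential draws) that reports the cosine similarity between a random unit vector and its noisy copy. The 768-dimension random vector, the seed, and the epsilon values are assumptions for illustration, not recommendations.

```python
import math
import random

def laplace_sample(rng: random.Random, scale: float) -> float:
    """Laplace(0, scale): the difference of two Exp(1) draws, times scale."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def normalize(vec: list[float]) -> list[float]:
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def similarity_after_noise(epsilon: float, dims: int = 768, seed: int = 0) -> float:
    """Cosine similarity between a random unit vector and its noisy copy."""
    rng = random.Random(seed)
    original = normalize([rng.gauss(0.0, 1.0) for _ in range(dims)])
    scale = 1.0 / epsilon  # sensitivity / epsilon, with an assumed sensitivity of 1.0
    noisy = normalize([v + laplace_sample(rng, scale) for v in original])
    return sum(a * b for a, b in zip(original, noisy))

for eps in (0.1, 10.0, 1000.0):
    sim = similarity_after_noise(eps)
    print(f"epsilon={eps}: cosine similarity to original = {sim:.3f}")
```

In this setup a very small epsilon leaves almost nothing of the original direction, while a very large epsilon leaves the vector nearly unchanged and offers correspondingly little protection. The useful range, if one exists for a given workload, lies in between and has to be found empirically.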

Authorization for Vector Search

The Problem: Semantic Search Bypasses Identifier-Based ACLs

Traditional access control: "Can user X access document ID 12345?" Enforced by checking user permissions against document identifiers.

Vector search: "Find documents semantically similar to this query." No identifier to check — the search returns whatever is most similar, regardless of access permissions.

An employee in accounting asks the company knowledge base: "What is the executive compensation plan?" They don't know the document's ID or title, but if they have access to the vector search API and there are no filters, they'll retrieve the document if it exists.

Metadata Filtering: The Primary Defense

Every document in your vector database must be stored with authorization metadata, and every query must filter by that metadata:

import hashlib
import os

from pinecone import Pinecone

# Read the key from the environment rather than hard-coding it
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

def upsert_document_with_acl(
    doc_id: str,
    content: str,
    embedding: list[float],
    allowed_roles: list[str],
    classification: str,
    owner_id: str,
):
    """Store document with access control metadata."""
    index = pc.Index("knowledge-base")
    index.upsert(vectors=[{
        "id": doc_id,
        "values": embedding,
        "metadata": {
            "content": content,
            "allowed_roles": allowed_roles,
            "classification": classification,
            "owner_id": owner_id,
            "content_hash": hashlib.sha256(content.encode()).hexdigest(),
        }
    }])

def search_with_authorization(
    query_embedding: list[float],
    user_id: str,
    user_roles: list[str],
    classification_ceiling: str,
    top_k: int = 5,
) -> list[dict]:
    """Perform similarity search with mandatory authorization filter."""

    CLASSIFICATION_ORDER = ["public", "internal", "confidential", "restricted"]
    max_class_idx = CLASSIFICATION_ORDER.index(classification_ceiling)
    allowed_classifications = CLASSIFICATION_ORDER[:max_class_idx + 1]

    filter_condition = {
        "$and": [
            {
                "$or": [
                    {"owner_id": {"$eq": user_id}},
                    {"allowed_roles": {"$in": user_roles}},
                    {"classification": {"$eq": "public"}},
                ]
            },
            {"classification": {"$in": allowed_classifications}},
        ]
    }

    index = pc.Index("knowledge-base")
    results = index.query(
        vector=query_embedding,
        filter=filter_condition,
        top_k=top_k,
        include_metadata=True,
    )

    return [
        {
            "id": match.id,
            "score": match.score,
            "content": match.metadata.get("content"),
            "classification": match.metadata.get("classification"),
        }
        for match in results.matches
    ]
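As defense in depth, the application can re-check each returned match against the same rules before passing content to a user or an LLM, and verify the stored content_hash to detect content that was altered after ingestion. This is a sketch that assumes the metadata fields written by upsert_document_with_acl above; is_authorized and content_is_intact are hypothetical helper names.

```python
import hashlib

CLASSIFICATION_ORDER = ["public", "internal", "confidential", "restricted"]

def is_authorized(metadata: dict, user_id: str, user_roles: list[str],
                  classification_ceiling: str) -> bool:
    """Mirror of the query filter, evaluated in application code."""
    classification = metadata.get("classification")
    if classification not in CLASSIFICATION_ORDER:
        return False  # fail closed on missing or unknown labels
    ceiling = CLASSIFICATION_ORDER.index(classification_ceiling)
    if CLASSIFICATION_ORDER.index(classification) > ceiling:
        return False
    return (
        metadata.get("owner_id") == user_id
        or bool(set(metadata.get("allowed_roles", [])) & set(user_roles))
        or classification == "public"
    )

def content_is_intact(metadata: dict) -> bool:
    """Recompute the SHA-256 of the content and compare with the stored hash."""
    content = metadata.get("content", "")
    expected = metadata.get("content_hash")
    return expected == hashlib.sha256(content.encode()).hexdigest()

# Example metadata, invented for illustration
text = "Executive compensation plan"
doc = {
    "content": text,
    "allowed_roles": ["hr"],
    "classification": "confidential",
    "owner_id": "u1",
    "content_hash": hashlib.sha256(text.encode()).hexdigest(),
}

print(is_authorized(doc, "u2", ["hr"], "confidential"))        # HR role, sufficient ceiling
print(is_authorized(doc, "u2", ["accounting"], "restricted"))  # no matching role
print(content_is_intact(doc))
```

A match that fails either check should be dropped and logged, since it indicates a filter bug or tampered data rather than a normal condition.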

Weaviate Authorization Example

The same pattern works in Weaviate: store authorization properties on each object and filter on them in every query. The example uses the v4 Python client:

import weaviate
import weaviate.classes as wvc

client = weaviate.connect_to_local()

# Create a collection with authorization metadata
client.collections.create(
    name="Documents",
    properties=[
        wvc.config.Property(name="content", data_type=wvc.config.DataType.TEXT),
        wvc.config.Property(name="allowed_roles", data_type=wvc.config.DataType.TEXT_ARRAY),
        wvc.config.Property(name="classification", data_type=wvc.config.DataType.TEXT),
        wvc.config.Property(name="owner_id", data_type=wvc.config.DataType.TEXT),
    ],
)

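# Query with filter. user_roles and user_id must come from the authenticated
# session, never from request parameters.
# near_text needs a vectorizer module on the collection; with self-provided
# vectors, use near_vector instead.
documents = client.collections.get("Documents")
results = documents.query.near_text(
    query="executive compensation",
    filters=(
        wvc.query.Filter.by_property("allowed_roles").contains_any(user_roles)
        | wvc.query.Filter.by_property("owner_id").equal(user_id)
        | wvc.query.Filter.by_property("classification").equal("public")
    ),
    limit=5,
)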

pgvector Authorization (PostgreSQL)

For teams using pgvector, authorization is handled through standard PostgreSQL row-level security:

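-- Enable row-level security on the embeddings table
ALTER TABLE document_embeddings ENABLE ROW LEVEL SECURITY;
-- Apply policies to the table owner as well (owners bypass RLS by default)
ALTER TABLE document_embeddings FORCE ROW LEVEL SECURITY;

-- Policy: users can see their own documents, public documents, and documents shared with them
-- current_user_id() is an application-defined function, not a PostgreSQL built-in
CREATE POLICY document_access_policy ON document_embeddings
    FOR SELECT
    USING (
        owner_id = current_user_id()
        OR classification = 'public'
        OR current_user_id() = ANY(allowed_user_ids)
    );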

-- Semantic search query automatically applies RLS
SELECT
    id,
    content,
    1 - (embedding <=> $1::vector) AS similarity
FROM document_embeddings
WHERE 1 - (embedding <=> $1::vector) > 0.7
ORDER BY embedding <=> $1::vector
LIMIT 5;
-- RLS policy automatically filters to documents the current user can access

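PostgreSQL RLS is a strong defense because it is enforced at the database level, so a forgotten filter in application code cannot expose rows. Two caveats apply. Superusers and roles with BYPASSRLS are never subject to policies, and table owners are exempt unless the table uses FORCE ROW LEVEL SECURITY, so the application should connect as an unprivileged role. Also, current_user_id() is not a built-in function; the application must define it, for example as a wrapper around a session setting.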
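For the policy to work, each request has to tell PostgreSQL who the current user is. One common approach, sketched here, is a transaction-local setting that an application-defined current_user_id() function would read with current_setting('app.current_user_id'). The setting name, the search_as_user and to_vector_literal helpers, and the DB-API cursor with %s placeholders (as in psycopg) are assumptions; the article does not specify how current_user_id() is implemented.

```python
SEARCH_SQL = """
SELECT id, content, 1 - (embedding <=> %s::vector) AS similarity
FROM document_embeddings
ORDER BY embedding <=> %s::vector
LIMIT %s
"""

def to_vector_literal(embedding: list[float]) -> str:
    """Format a Python list as pgvector text input, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"

def search_as_user(cursor, user_id: str, embedding: list[float], limit: int = 5):
    """Run the similarity query with the user identity set for this transaction.

    `cursor` is any DB-API cursor inside an open transaction.
    """
    # set_config(name, value, is_local) with is_local=true scopes the value to the
    # current transaction, so it does not carry over to the next request that
    # reuses a pooled connection.
    cursor.execute("SELECT set_config('app.current_user_id', %s, true)", (user_id,))
    vec = to_vector_literal(embedding)
    cursor.execute(SEARCH_SQL, (vec, vec, limit))
    return cursor.fetchall()
```

The user_id passed in should come from the verified session or token, since whoever controls that value controls which rows the policy exposes.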

Tenant Isolation in Multi-Tenant Vector Databases

For SaaS applications, tenant isolation in vector databases is critical. Pinecone namespaces, Weaviate multi-tenancy, and Qdrant collections all provide mechanisms.

Pinecone Namespaces

def get_namespace(tenant_id: str) -> str:
    """Deterministic namespace per tenant."""
    return f"tenant_{tenant_id}"

def index_document(tenant_id: str, doc_id: str, embedding: list[float], metadata: dict):
    index = pc.Index("shared-index")
    index.upsert(
        vectors=[{"id": doc_id, "values": embedding, "metadata": metadata}],
        namespace=get_namespace(tenant_id),
    )

def search(tenant_id: str, query_embedding: list[float], top_k: int = 5):
    index = pc.Index("shared-index")
    # Namespace isolation: only searches within this tenant's namespace
    return index.query(
        vector=query_embedding,
        namespace=get_namespace(tenant_id),
        top_k=top_k,
        include_metadata=True,
    )

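Caveat: Pinecone namespace isolation is a soft boundary. It is only as strong as the application code that selects the namespace. A query targets exactly one namespace, so a bug that omits the namespace parameter silently uses the default namespace, and a bug that derives the namespace from unvalidated input can read another tenant's data. Defense in depth: combine namespace isolation with a tenant metadata filter.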
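One way to reduce the risk of a missing or wrong namespace is to route all access through a wrapper that is constructed once per tenant and never accepts a namespace from callers. A minimal sketch: TenantScopedIndex and the tenant ID format are hypothetical, and index is assumed to be an object with Pinecone-style query and upsert methods.

```python
import re

_TENANT_ID = re.compile(r"[A-Za-z0-9_-]{1,64}")  # assumed tenant ID format

class TenantScopedIndex:
    """Wraps an index so that every call is pinned to one tenant."""

    def __init__(self, index, tenant_id: str):
        if not _TENANT_ID.fullmatch(tenant_id):
            raise ValueError("invalid tenant id")
        self._index = index
        self._tenant_id = tenant_id
        self._namespace = f"tenant_{tenant_id}"

    def upsert(self, doc_id: str, embedding: list[float], metadata: dict):
        # Stamp tenant_id into metadata so queries can filter on it as a second check
        stamped = {**metadata, "tenant_id": self._tenant_id}
        return self._index.upsert(
            vectors=[{"id": doc_id, "values": embedding, "metadata": stamped}],
            namespace=self._namespace,
        )

    def query(self, query_embedding: list[float], top_k: int = 5):
        return self._index.query(
            vector=query_embedding,
            namespace=self._namespace,
            filter={"tenant_id": {"$eq": self._tenant_id}},
            top_k=top_k,
            include_metadata=True,
        )
```

Application code would build the wrapper from the authenticated tenant, for example TenantScopedIndex(pc.Index("shared-index"), tenant_id), and pass only the wrapper to request handlers.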

Weaviate Multi-Tenancy

Weaviate v1.20+ supports native multi-tenancy with hard isolation:

# Create a multi-tenant collection
client.collections.create(
    name="TenantDocuments",
    multi_tenancy_config=wvc.config.Configure.multi_tenancy(enabled=True),
)

# Add a tenant
collection = client.collections.get("TenantDocuments")
collection.tenants.create([wvc.tenants.Tenant(name=f"tenant_{tenant_id}")])

# All operations are scoped to a specific tenant
tenant_collection = collection.with_tenant(f"tenant_{tenant_id}")
tenant_collection.data.insert({"content": "...", "metadata": {}})

Weaviate's multi-tenancy uses separate storage for each tenant, providing stronger isolation than namespace-based approaches.

Securing Vector Database Deployments

Authentication and Network Controls

# docker-compose.yml for Weaviate with authentication
version: '3.8'
services:
  weaviate:
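    # Pin a specific version tag in production instead of latest
    image: semitechnologies/weaviate:latest
    environment:
      # Enable API key authentication (load real keys from a secret store, not this file)
      AUTHENTICATION_APIKEY_ENABLED: 'true'
      AUTHENTICATION_APIKEY_ALLOWED_KEYS: 'your-secret-key-1,your-secret-key-2'
      # Keys map to users by position: key 1 -> admin@example.com, key 2 -> user2@example.com
      AUTHENTICATION_APIKEY_USERS: 'admin@example.com,user2@example.com'
      # Disable anonymous access
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'false'
      # Enable the admin-list authorization scheme (this is not fine-grained RBAC)
      AUTHORIZATION_ADMINLIST_ENABLED: 'true'
      AUTHORIZATION_ADMINLIST_USERS: 'admin@example.com'
      AUTHORIZATION_ADMINLIST_READONLY_USERS: 'user2@example.com'
    # Bind to loopback only, so the port is not reachable from other hosts
    ports:
      - "127.0.0.1:8080:8080"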

Pinecone Access Controls

# Use project-scoped API keys with the narrowest permissions each service needs:
# a read-only key for query-only services, a write-capable key only for ingestion

import os

from pinecone import Pinecone

# Ingestion service uses write key
ingestion_client = Pinecone(api_key=os.environ["PINECONE_WRITE_KEY"])

# Query service uses read-only key
query_client = Pinecone(api_key=os.environ["PINECONE_READ_KEY"])

Audit Logging for Vector Database Access

from datetime import datetime, timezone

class AuditedVectorDB:
    def __init__(self, vector_db, audit_logger):
        self.db = vector_db
        self.audit = audit_logger

    def query(self, query_embedding: list[float], user_id: str, filters: dict, top_k: int):
        results = self.db.query(
            vector=query_embedding,
            filter=filters,
            top_k=top_k,
            include_metadata=True,
        )

        # Log every query with user context
        self.audit.log({
            "event": "vector_search",
            "user_id": user_id,
            "filters_applied": filters,
            "results_count": len(results.matches),
            "result_ids": [m.id for m in results.matches],
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

        return results
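AuditedVectorDB only assumes that audit_logger has a log(dict) method. A minimal sketch of such a logger, writing one JSON object per line through the standard logging module, might look like this; JsonAuditLogger and the default logger name are hypothetical.

```python
import json
import logging

class JsonAuditLogger:
    """Writes each audit event as a single JSON line."""

    def __init__(self, logger_name: str = "vector_audit"):
        self._logger = logging.getLogger(logger_name)
        # Ensure INFO-level audit events are not dropped by the default WARNING level
        self._logger.setLevel(logging.INFO)

    def log(self, event: dict) -> None:
        # sort_keys keeps field order stable; default=str covers non-JSON types
        self._logger.info(json.dumps(event, sort_keys=True, default=str))
```

The logger still needs a handler attached (for example a file or syslog handler) before events are written anywhere. One JSON object per line keeps the output easy to parse and to search by user_id or result_ids later.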

Summary

Securing vector databases requires addressing concerns that don't appear in traditional database security:

Risk	Mitigation
Embedding inversion	Strict API authentication, encryption at rest
Cross-tenant data leakage	Namespace isolation + metadata filtering
Broken access control in RAG	Mandatory authorization filters on every query
Unauthenticated access	API key authentication, network controls
Missing audit trail	Log all queries with user context
Knowledge base poisoning	Restricted write access, content integrity checks

Vector databases are not "just a storage layer." They contain dense representations of your most sensitive data and require the same — or greater — security attention as your primary database.