Vector Database Security: Securing Embeddings and Preventing Data Extraction
A technical guide to vector database security covering embedding inversion attacks, multi-tenant access control, authorization for vector search, and securing Pinecone, Weaviate, and Chroma deployments.
Vector databases — Pinecone, Weaviate, Qdrant, Chroma, pgvector — are now foundational infrastructure for AI applications. They store embeddings (high-dimensional numerical representations of text, images, or other data) and enable semantic similarity search. Most teams treat them as a simple storage layer and pay little attention to their security properties.
This is a mistake. Vector databases contain dense representations of your most sensitive data, have novel attack vectors (embedding inversion, cross-tenant leakage) that traditional database security doesn't address, and are frequently misconfigured in ways that expose data to unauthorized users.
What Is an Embedding, and What Can an Attacker Do with It?
An embedding is a numerical vector — typically 768 to 3,072 floating-point numbers — that encodes the semantic content of a piece of text. Similar texts produce similar vectors. This is what enables semantic search: find documents similar to this query, even if they use different words.
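To make "similar texts produce similar vectors" concrete, here is a minimal sketch of cosine similarity, the measure most vector databases use for ranking. The three 4-dimensional vectors are made-up stand-ins for real embeddings, which have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; values near 1.0 mean 'very similar'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for illustration only.
refund_policy = [0.9, 0.1, 0.0, 0.2]
money_back_guarantee = [0.8, 0.2, 0.1, 0.3]
server_maintenance = [0.0, 0.1, 0.9, 0.1]

related = cosine_similarity(refund_policy, money_back_guarantee)
unrelated = cosine_similarity(refund_policy, server_maintenance)
print(f"related: {related:.3f}, unrelated: {unrelated:.3f}")
```

The two refund-related vectors score far higher against each other than against the unrelated one, even though the texts they stand for share no keywords.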
From a security perspective, embeddings present two concerns:
1. They contain encoded information about the original content
An embedding is not a hash. A cryptographic hash is a one-way function: you cannot recover the input from the hash. An embedding is an encoding in a learned high-dimensional space. While you cannot directly "decode" an embedding back to the original text character-by-character, the information is recoverable under specific attack conditions.
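The difference between a hash and an embedding can be illustrated with a toy. The sketch below compares SHA-256 digests of two near-identical sentences with a deliberately crude stand-in for an embedding model (hashed bag-of-words counts; `toy_embedding` is a hypothetical helper, not a real model). The point is only that one encoding destroys structure and the other preserves it.

```python
import hashlib
import math


def toy_embedding(text: str, dims: int = 64) -> list[float]:
    """Crude stand-in for an embedding model: hashed bag-of-words counts."""
    vec = [0.0] * dims
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode()).digest()
        vec[int.from_bytes(digest[:4], "big") % dims] += 1.0
    return vec


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def digest_agreement(a: str, b: str) -> float:
    """Fraction of hex positions at which the two SHA-256 digests agree."""
    ha = hashlib.sha256(a.encode()).hexdigest()
    hb = hashlib.sha256(b.encode()).hexdigest()
    return sum(x == y for x, y in zip(ha, hb)) / len(ha)


text_a = "the patient was prescribed 20mg of atorvastatin"
text_b = "the patient was prescribed 40mg of atorvastatin"

hash_overlap = digest_agreement(text_a, text_b)
embedding_similarity = cosine(toy_embedding(text_a), toy_embedding(text_b))
print(f"hash overlap: {hash_overlap:.2f}, embedding similarity: {embedding_similarity:.2f}")
```

The digests agree only at chance level, while the toy embeddings stay close. That retained structure is what makes semantic search work, and it is also what an inversion attack exploits.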
2. They enable finding data through similarity, bypassing identifier-based access controls
Many legacy access control systems work by checking whether users can query specific identifiers (document IDs, table names). Vector search finds documents based on semantic content — which means a user can find content they're not supposed to access by crafting queries that are semantically similar to that content, even if they don't know its identifier.
Embedding Inversion Attacks
Research has demonstrated that embeddings can be partially inverted — meaning it is possible to recover significant portions of the original text from the embedding alone.
How Inversion Works
Morris et al. (2023, "Text Embeddings Reveal (Almost) As Much As Text") showed that models trained to invert embeddings can reconstruct short texts with high accuracy. Their Vec2Text method, which iteratively corrects a candidate text and re-embeds it, is reported to recover 92% of 32-token inputs exactly, and the paper includes experiments against OpenAI's text-embedding-ada-002 embeddings.
The attack works as follows:
- Collect a large number of (text, embedding) pairs from the target model (often possible via API)
- Train an inversion model that maps embeddings back to text
- Apply the inversion model to target embeddings
The attacker needs access to the embedding vectors themselves and the ability to query the embedding model. If your vector database is exposed with minimal authentication, an attacker who gains API access can extract embeddings and attempt inversion.
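The three steps can be sketched against a toy target. This is not Vec2Text: the "embedding model" here is a hypothetical linear embedder (a sum of random per-word vectors), and the "inversion model" is a simple dot-product matcher. It only illustrates why access to stored vectors plus query access to the embedder is enough to start recovering content.

```python
import math
import random

VOCAB = ["patient", "diagnosis", "salary", "invoice", "password", "diabetes",
         "merger", "refund", "lawsuit", "prescription", "quarterly", "termination"]
DIMS = 1024

# --- Toy embedding model (stand-in for a remote embedding API) ---
_rng = random.Random(0)
_WORD_VECTORS = {
    w: [_rng.gauss(0, 1 / math.sqrt(DIMS)) for _ in range(DIMS)] for w in VOCAB
}


def embed(text: str) -> list[float]:
    """Sum of per-word random vectors; words outside the vocabulary are ignored."""
    vec = [0.0] * DIMS
    for word in text.lower().split():
        for i, x in enumerate(_WORD_VECTORS.get(word, ())):
            vec[i] += x
    return vec


# --- Attacker side: only calls embed(), never reads _WORD_VECTORS ---
def build_inverter(candidate_words: list[str]):
    # Step 1: collect (text, embedding) pairs by querying the model.
    pairs = {w: embed(w) for w in candidate_words}

    # Step 2: the "inversion model" is a matcher over those pairs.
    def invert(target: list[float], threshold: float = 0.5) -> set[str]:
        return {
            w for w, e in pairs.items()
            if sum(a * b for a, b in zip(target, e)) > threshold
        }

    return invert


invert = build_inverter(VOCAB)
stolen_vector = embed("patient diabetes prescription")  # e.g. read from an exposed index
recovered = invert(stolen_vector)  # Step 3: apply the inverter
print(sorted(recovered))
```

Real embedding models are nonlinear and real inverters are trained neural networks, so the attack is harder than this toy suggests, but the ingredients are the same.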
What This Means for Your Data
If you store embeddings of:
- Customer emails or support tickets
- Internal documents
- Private user data
- Healthcare records
- Financial information
...and an attacker gains access to those embeddings, they can potentially recover significant portions of the original text using inversion techniques.
Defense: Encrypt Embeddings at Rest and in Transit
from cryptography.fernet import Fernet
import struct


class EncryptedEmbeddingStore:
    def __init__(self, encryption_key: bytes):
        self.fernet = Fernet(encryption_key)

    def encrypt_embedding(self, embedding: list[float]) -> bytes:
        """Serialize and encrypt an embedding vector."""
        # Pack the floats as 32-bit values, then encrypt the raw bytes
        embedding_bytes = struct.pack(f'{len(embedding)}f', *embedding)
        return self.fernet.encrypt(embedding_bytes)

    def decrypt_embedding(self, encrypted: bytes) -> list[float]:
        """Decrypt and deserialize an embedding vector."""
        embedding_bytes = self.fernet.decrypt(encrypted)
        count = len(embedding_bytes) // 4  # 4 bytes per 32-bit float
        return list(struct.unpack(f'{count}f', embedding_bytes))
Note: Most vector databases compute similarity on plaintext vectors, so vectors encrypted this way cannot be searched until they are decrypted. Encrypting embeddings at rest protects against storage-layer breaches, but not against an attacker who has API access to the database, because the API serves decrypted vectors. For strong inversion defense, the key control is strict API authentication and authorization.
Defense: Consider Noise Injection for Stored Embeddings
For embeddings you need to store but do not need exact retrieval from (e.g., analytics or feature stores), adding calibrated noise makes inversion harder at the cost of some similarity accuracy. The example below uses Laplace noise:
import numpy as np


def add_privacy_noise(embedding: list[float], epsilon: float = 0.1) -> list[float]:
    """
    Add Laplace noise to an embedding, in the style of differential privacy.
    Smaller epsilon = more privacy, less accuracy.
    """
    embedding_array = np.array(embedding)
    # Assumed sensitivity bound. The Laplace mechanism is calibrated to L1
    # sensitivity, so derive this value from your own embedding norms.
    sensitivity = 1.0
    scale = sensitivity / epsilon
    noise = np.random.laplace(0, scale, embedding_array.shape)
    noisy = embedding_array + noise
    # Re-normalize to the unit sphere
    noisy = noisy / np.linalg.norm(noisy)
    return noisy.tolist()
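How much similarity survives depends heavily on epsilon. The sketch below repeats the same Laplace-then-renormalize recipe using only the standard library and measures the cosine similarity between a vector and its noisy copy. The 768-dimensional random unit vector and the epsilon values are assumptions chosen for illustration.

```python
import math
import random


def laplace_sample(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) drawn as the difference of two exponential samples."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def add_noise(embedding, epsilon, rng, sensitivity=1.0):
    """Per-dimension Laplace noise followed by re-normalization."""
    scale = sensitivity / epsilon
    noisy = [x + laplace_sample(scale, rng) for x in embedding]
    norm = math.sqrt(sum(x * x for x in noisy))
    return [x / norm for x in noisy]


rng = random.Random(42)
raw = [rng.gauss(0, 1) for _ in range(768)]
raw_norm = math.sqrt(sum(x * x for x in raw))
embedding = [x / raw_norm for x in raw]  # random unit vector standing in for an embedding

similarity_by_epsilon = {
    eps: cosine(embedding, add_noise(embedding, eps, rng))
    for eps in (0.1, 1.0, 10.0, 100.0, 1000.0)
}
for eps, sim in similarity_by_epsilon.items():
    print(f"epsilon={eps:>7}: cosine to original = {sim:.3f}")
```

In this setup, a sensitivity of 1.0 with epsilon = 0.1 adds noise far larger than the unit-norm signal, so the noisy vector is close to uncorrelated with the original. Treat epsilon as a parameter to tune against your own retrieval-quality measurements rather than a default to copy.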
Authorization for Vector Search
The Problem: Semantic Search Bypasses Identifier-Based ACLs
Traditional access control: "Can user X access document ID 12345?" Enforced by checking user permissions against document identifiers.
Vector search: "Find documents semantically similar to this query." No identifier to check — the search returns whatever is most similar, regardless of access permissions.
An employee in accounting asks the company knowledge base: "What is the executive compensation plan?" They don't know the document's ID or title, but if they have access to the vector search API and there are no filters, they'll retrieve the document if it exists.
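The scenario can be reproduced with a small in-memory sketch. The document IDs, role names, and 3-dimensional vectors are invented for illustration. Passing `user_roles=None` models a search API with no authorization filter.

```python
import math

# Hypothetical corpus: each document carries an embedding and the roles allowed to read it.
DOCS = [
    {"id": "handbook", "roles": {"everyone"}, "vec": [0.1, 0.9, 0.1]},
    {"id": "exec-comp-plan", "roles": {"hr-leadership"}, "vec": [0.9, 0.2, 0.1]},
    {"id": "expense-policy", "roles": {"everyone", "accounting"}, "vec": [0.5, 0.5, 0.2]},
]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def search(query_vec, top_k=1, user_roles=None):
    """Return the ids of the top_k most similar documents.

    user_roles=None means no ACL filter is applied at all.
    """
    if user_roles is None:
        candidates = DOCS
    else:
        candidates = [d for d in DOCS if d["roles"] & set(user_roles)]
    ranked = sorted(candidates, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [d["id"] for d in ranked[:top_k]]


# Made-up embedding of "What is the executive compensation plan?"
query = [0.95, 0.15, 0.05]

unfiltered = search(query)
filtered = search(query, user_roles=["everyone", "accounting"])
print(unfiltered, filtered)
```

Without a filter, the restricted document is the top hit for a user who never knew it existed. With the role filter applied before ranking, it is never a candidate.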
Metadata Filtering: The Primary Defense
Every document in your vector database must be stored with authorization metadata, and every query must filter by that metadata:
import hashlib

from pinecone import Pinecone

pc = Pinecone(api_key="your-api-key")


def upsert_document_with_acl(
    doc_id: str,
    content: str,
    embedding: list[float],
    allowed_roles: list[str],
    classification: str,
    owner_id: str,
):
    """Store document with access control metadata."""
    index = pc.Index("knowledge-base")
    index.upsert(vectors=[{
        "id": doc_id,
        "values": embedding,
        "metadata": {
            "content": content,
            "allowed_roles": allowed_roles,
            "classification": classification,
            "owner_id": owner_id,
            # Hash of the content, usable later for integrity checks
            "content_hash": hashlib.sha256(content.encode()).hexdigest(),
        }
    }])


def search_with_authorization(
    query_embedding: list[float],
    user_id: str,
    user_roles: list[str],
    classification_ceiling: str,
    top_k: int = 5,
) -> list[dict]:
    """Perform similarity search with mandatory authorization filter."""
    CLASSIFICATION_ORDER = ["public", "internal", "confidential", "restricted"]
    max_class_idx = CLASSIFICATION_ORDER.index(classification_ceiling)
    allowed_classifications = CLASSIFICATION_ORDER[:max_class_idx + 1]

    # The user must match by ownership, role, or public status,
    # AND the document must sit at or below the user's classification ceiling.
    filter_condition = {
        "$and": [
            {
                "$or": [
                    {"owner_id": {"$eq": user_id}},
                    {"allowed_roles": {"$in": user_roles}},
                    {"classification": {"$eq": "public"}},
                ]
            },
            {"classification": {"$in": allowed_classifications}},
        ]
    }

    index = pc.Index("knowledge-base")
    results = index.query(
        vector=query_embedding,
        filter=filter_condition,
        top_k=top_k,
        include_metadata=True,
    )
    return [
        {
            "id": match.id,
            "score": match.score,
            "content": match.metadata.get("content"),
            "classification": match.metadata.get("classification"),
        }
        for match in results.matches
    ]
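As a defense-in-depth measure, the same rule can be re-evaluated locally on each returned match before content leaves the service. The sketch below mirrors the filter logic above in plain Python. `is_authorized` and `drop_unauthorized` are hypothetical helper names, and the metadata keys are assumed to match the ones written at upsert time.

```python
CLASSIFICATION_ORDER = ["public", "internal", "confidential", "restricted"]


def is_authorized(
    metadata: dict,
    user_id: str,
    user_roles: list[str],
    classification_ceiling: str,
) -> bool:
    """Local mirror of the query filter: ceiling check AND (owner OR role OR public)."""
    ceiling_idx = CLASSIFICATION_ORDER.index(classification_ceiling)
    allowed = CLASSIFICATION_ORDER[: ceiling_idx + 1]
    classification = metadata.get("classification")
    if classification not in allowed:
        return False
    return (
        metadata.get("owner_id") == user_id
        or bool(set(metadata.get("allowed_roles", [])) & set(user_roles))
        or classification == "public"
    )


def drop_unauthorized(matches_metadata, user_id, user_roles, classification_ceiling):
    """Keep only the metadata records the user is allowed to see."""
    return [
        m for m in matches_metadata
        if is_authorized(m, user_id, user_roles, classification_ceiling)
    ]


# Example records shaped like the metadata stored at upsert time.
comp_plan = {"owner_id": "u-ceo", "allowed_roles": ["hr-leadership"], "classification": "restricted"}
handbook = {"owner_id": "u-hr", "allowed_roles": [], "classification": "public"}
budget = {"owner_id": "u-cfo", "allowed_roles": ["accounting"], "classification": "internal"}

visible = drop_unauthorized([comp_plan, handbook, budget], "u-123", ["accounting"], "internal")
```

Because the predicate is a pure function, it can also be unit-tested independently of the vector database, which helps catch a filter that drifts out of sync with policy.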
Weaviate Authorization Example
In Weaviate, the same pattern applies: store authorization properties on each object and filter on them at query time:
import weaviate
import weaviate.classes as wvc

client = weaviate.connect_to_local()

# Create a collection with authorization metadata
client.collections.create(
    name="Documents",
    properties=[
        wvc.config.Property(name="content", data_type=wvc.config.DataType.TEXT),
        wvc.config.Property(name="allowed_roles", data_type=wvc.config.DataType.TEXT_ARRAY),
        wvc.config.Property(name="classification", data_type=wvc.config.DataType.TEXT),
        wvc.config.Property(name="owner_id", data_type=wvc.config.DataType.TEXT),
    ],
)

# Query with filter. user_id and user_roles must come from the authenticated
# session, never from client-supplied input. near_text also requires a
# vectorizer to be configured on the collection.
documents = client.collections.get("Documents")
results = documents.query.near_text(
    query="executive compensation",
    filters=(
        wvc.query.Filter.by_property("allowed_roles").contains_any(user_roles)
        | wvc.query.Filter.by_property("owner_id").equal(user_id)
        | wvc.query.Filter.by_property("classification").equal("public")
    ),
    limit=5,
)
pgvector Authorization (PostgreSQL)
For teams using pgvector, authorization is handled through standard PostgreSQL row-level security:
-- Enable row-level security on the embeddings table
ALTER TABLE document_embeddings ENABLE ROW LEVEL SECURITY;

-- Policy: users can see their own documents, public documents, and documents shared with them.
-- current_user_id() is not a PostgreSQL built-in; it must be an application-defined
-- function that returns the ID of the authenticated end user.
CREATE POLICY document_access_policy ON document_embeddings
    FOR SELECT
    USING (
        owner_id = current_user_id()
        OR classification = 'public'
        OR current_user_id() = ANY(allowed_user_ids)
    );

-- Semantic search query automatically applies RLS
SELECT
    id,
    content,
    1 - (embedding <=> $1::vector) AS similarity
FROM document_embeddings
WHERE 1 - (embedding <=> $1::vector) > 0.7
ORDER BY embedding <=> $1::vector
LIMIT 5;
-- RLS policy automatically filters to documents the current user can access
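PostgreSQL RLS is a strong defense because it is enforced at the database level, so application code cannot accidentally skip it. One caveat: superusers and roles with BYPASSRLS always bypass policies, and the table owner bypasses them unless the table is set to FORCE ROW LEVEL SECURITY. The application should therefore connect as a separate, unprivileged role.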
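The policy relies on `current_user_id()`, which the application has to supply. One common pattern, sketched here as an assumption rather than a prescribed setup, is to define that function to read a custom setting such as `current_setting('app.current_user_id')` and have the application set it per transaction with `set_config`. The setting name and the `search_as_user` helper are hypothetical. The cursor is assumed to be a DB-API cursor with `%s` placeholders (psycopg style), already inside a transaction.

```python
def search_as_user(cursor, user_id: str, query_vector: list[float], limit: int = 5):
    """Run the similarity query with the end user's identity visible to RLS."""
    # The third argument (true) scopes the setting to the current transaction,
    # so it cannot leak to the next request on a pooled connection.
    cursor.execute(
        "SELECT set_config('app.current_user_id', %s, true)",
        (user_id,),
    )
    # pgvector accepts a text literal like '[0.1,0.2]' cast to vector.
    vector_literal = "[" + ",".join(str(float(x)) for x in query_vector) + "]"
    cursor.execute(
        """
        SELECT id, content, 1 - (embedding <=> %s::vector) AS similarity
        FROM document_embeddings
        ORDER BY embedding <=> %s::vector
        LIMIT %s
        """,
        (vector_literal, vector_literal, limit),
    )
    return cursor.fetchall()
```

With this pattern the database trusts the application to set the user ID correctly, so `user_id` must come from the authenticated session and never from request parameters.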
Tenant Isolation in Multi-Tenant Vector Databases
For SaaS applications, tenant isolation in vector databases is critical. Pinecone namespaces, Weaviate multi-tenancy, and Qdrant collections each provide a mechanism for separating tenants' data.
Pinecone Namespaces
def get_namespace(tenant_id: str) -> str:
    """Deterministic namespace per tenant."""
    return f"tenant_{tenant_id}"


def index_document(tenant_id: str, doc_id: str, embedding: list[float], metadata: dict):
    index = pc.Index("shared-index")
    index.upsert(
        vectors=[{"id": doc_id, "values": embedding, "metadata": metadata}],
        namespace=get_namespace(tenant_id),
    )


def search(tenant_id: str, query_embedding: list[float], top_k: int = 5):
    index = pc.Index("shared-index")
    # Namespace isolation: only searches within this tenant's namespace
    return index.query(
        vector=query_embedding,
        namespace=get_namespace(tenant_id),
        top_k=top_k,
        include_metadata=True,
    )
Caveat: Pinecone namespace isolation is a soft boundary. It relies on the application passing the correct tenant's namespace on every call. A bug that omits the namespace parameter sends the query to the index's default namespace instead of the tenant's, and a bug that passes the wrong tenant ID searches another tenant's data. Defense in depth: combine namespace isolation with metadata filtering.
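One way to combine the two controls is a thin wrapper that is constructed once per tenant and makes it impossible to issue a call without both the namespace and a tenant filter. This is a sketch: `TenantScopedIndex` and the `tenant_id` metadata key are hypothetical names, and the wrapped object is assumed to expose `upsert` and `query` with the keyword arguments used in the Pinecone examples above.

```python
class TenantScopedIndex:
    """Wraps an index so every call carries both a namespace and a tenant filter."""

    def __init__(self, index, tenant_id: str):
        if not tenant_id:
            raise ValueError("tenant_id is required")
        self._index = index
        self._tenant_id = tenant_id
        self._namespace = f"tenant_{tenant_id}"

    def upsert(self, doc_id: str, embedding: list[float], metadata: dict):
        # Stamp the tenant into metadata so the query filter has something to match.
        stamped = {**metadata, "tenant_id": self._tenant_id}
        return self._index.upsert(
            vectors=[{"id": doc_id, "values": embedding, "metadata": stamped}],
            namespace=self._namespace,
        )

    def query(self, query_embedding: list[float], top_k: int = 5, extra_filter=None):
        tenant_filter = {"tenant_id": {"$eq": self._tenant_id}}
        # Any caller-supplied filter is ANDed with the tenant filter, never replaces it.
        combined = {"$and": [tenant_filter, extra_filter]} if extra_filter else tenant_filter
        return self._index.query(
            vector=query_embedding,
            namespace=self._namespace,
            filter=combined,
            top_k=top_k,
            include_metadata=True,
        )
```

Application code then receives a `TenantScopedIndex` built from the authenticated tenant and never touches the raw index, so forgetting the namespace is no longer a possible bug in request handlers.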
Weaviate Multi-Tenancy
Weaviate v1.20+ supports native multi-tenancy with hard isolation:
# Create a multi-tenant collection
client.collections.create(
    name="TenantDocuments",
    multi_tenancy_config=wvc.config.Configure.multi_tenancy(enabled=True),
)

# Add a tenant
collection = client.collections.get("TenantDocuments")
collection.tenants.create([wvc.tenants.Tenant(name=f"tenant_{tenant_id}")])

# All operations are scoped to a specific tenant
tenant_collection = collection.with_tenant(f"tenant_{tenant_id}")
tenant_collection.data.insert({"content": "...", "metadata": {}})
Weaviate's multi-tenancy stores each tenant's data in its own shard, providing stronger isolation than namespace-based approaches.
Securing Vector Database Deployments
Authentication and Network Controls
# docker-compose.yml for Weaviate with authentication
version: '3.8'
services:
  weaviate:
    # Pin a specific version tag in production instead of latest
    image: semitechnologies/weaviate:latest
    environment:
      # Enable API key authentication
      AUTHENTICATION_APIKEY_ENABLED: 'true'
      AUTHENTICATION_APIKEY_ALLOWED_KEYS: 'your-secret-key-1,your-secret-key-2'
      AUTHENTICATION_APIKEY_USERS: 'user1@example.com,user2@example.com'
      # Disable anonymous access
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'false'
      # Enable admin-list authorization
      AUTHORIZATION_ADMINLIST_ENABLED: 'true'
      AUTHORIZATION_ADMINLIST_USERS: 'admin@example.com'
    # Bind to the loopback interface only, so the port is not reachable from the public internet
    ports:
      - "127.0.0.1:8080:8080"
Pinecone Access Controls
# Use project-scoped API keys:
#   - a read-only key for query-only services
#   - a write key only for ingestion services
import os

from pinecone import Pinecone

# Ingestion service uses write key
ingestion_client = Pinecone(api_key=os.environ["PINECONE_WRITE_KEY"])

# Query service uses read-only key
query_client = Pinecone(api_key=os.environ["PINECONE_READ_KEY"])
Audit Logging for Vector Database Access
from datetime import datetime, timezone


class AuditedVectorDB:
    def __init__(self, vector_db, audit_logger):
        self.db = vector_db
        self.audit = audit_logger

    def query(self, query_embedding: list[float], user_id: str, filters: dict, top_k: int):
        results = self.db.query(
            vector=query_embedding,
            filter=filters,
            top_k=top_k,
            include_metadata=True,
        )
        # Log every query with user context
        self.audit.log({
            "event": "vector_search",
            "user_id": user_id,
            "filters_applied": filters,
            "results_count": len(results.matches),
            "result_ids": [m.id for m in results.matches],
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        return results
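The wrapper above only needs an object with a `log(dict)` method. A minimal sketch of such a logger, writing one JSON object per line, might look like the following. `JsonLinesAuditLogger` is a hypothetical name, and a production deployment would more likely ship events to an append-only or centrally managed log store.

```python
import io
import json
import sys


class JsonLinesAuditLogger:
    """Writes each audit event as a single JSON line to a text stream."""

    def __init__(self, stream=None):
        self._stream = stream if stream is not None else sys.stderr

    def log(self, event: dict) -> None:
        # sort_keys gives stable output; default=str covers non-JSON types such as datetime.
        self._stream.write(json.dumps(event, sort_keys=True, default=str) + "\n")
        self._stream.flush()


# Usage with an in-memory stream
buffer = io.StringIO()
logger = JsonLinesAuditLogger(buffer)
logger.log({"event": "vector_search", "user_id": "u-123", "results_count": 2})
logged_line = buffer.getvalue()
```

One line per event keeps the log easy to parse, so questions like "which users retrieved document X" can be answered with standard tooling.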
Summary
Securing vector databases requires addressing concerns that don't appear in traditional database security:
| Risk | Mitigation |
|---|---|
| Embedding inversion | Strict API authentication, encryption at rest |
| Cross-tenant data leakage | Namespace isolation + metadata filtering |
| Broken access control in RAG | Mandatory authorization filters on every query |
| Unauthenticated access | API key authentication, network controls |
| Missing audit trail | Log all queries with user context |
| Knowledge base poisoning | Restricted write access, content integrity checks |
Vector databases are not "just a storage layer." They contain dense representations of your most sensitive data and require the same — or greater — security attention as your primary database.