Container Runtime Security: seccomp, AppArmor, and Falco

Container images define what your application is. Container runtime security defines what it can do. Without runtime controls, a compromised container has full access to the Linux kernel's system call interface — the same access as any process running on the host. seccomp, AppArmor, and Falco are the three layers that restrict and monitor what containers actually do at runtime, independent of what the image contains.

Understanding the Threat Model

When an attacker gains code execution inside a container, they immediately try to expand their access. Common techniques include:

Using the ptrace syscall to attach to other processes
Calling mount to access host filesystems
Using unshare to escape namespace boundaries
Exploiting kernel vulnerabilities via unrestricted syscall access
Reading sensitive files from /proc or /sys

Each of these requires specific Linux syscalls. Restricting which syscalls a container can make dramatically reduces the attack surface for privilege escalation.

seccomp: Syscall Filtering

seccomp (Secure Computing Mode) filters the syscalls a process is allowed to make. Docker and Kubernetes both support seccomp profiles that specify an allow list (or block list) of syscalls.

Docker's Default seccomp Profile

Docker ships with a default seccomp profile that blocks around 44 syscalls, including reboot, mount, keyctl, and others commonly used in container escapes. ptrace is blocked as well, but only on hosts running kernels older than 4.8. You can verify that a profile is applied:

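# Confirm the daemon has seccomp enabled and see which profile is the default
docker info --format '{{ .SecurityOptions }}'
# HostConfig.SecurityOpt lists only explicit overrides; it stays empty under the default profile
docker inspect CONTAINER_ID | jq '.[0].HostConfig.SecurityOpt'
# Inside the container, "Seccomp: 2" means a filter is active
docker exec CONTAINER_ID grep Seccomp /proc/self/status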
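The same check can be made from inside a process. The sketch below is a minimal illustration, assuming a Linux host: it reads the Seccomp field that the kernel exposes in /proc/self/status and translates it to a name. The helper name seccomp_mode is made up for this example.

```python
from pathlib import Path

# Meaning of the "Seccomp:" field in /proc/<pid>/status, as documented in proc(5).
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text: str) -> str:
    """Return the seccomp mode named in the text of a /proc/<pid>/status file."""
    for line in status_text.splitlines():
        # "Seccomp_filters:" also exists on newer kernels; match the exact key only.
        if line.startswith("Seccomp:"):
            value = int(line.split(":", 1)[1].strip())
            return SECCOMP_MODES.get(value, f"unknown ({value})")
    return "not reported"  # the kernel did not expose a Seccomp field


if __name__ == "__main__":
    status_file = Path("/proc/self/status")
    if status_file.exists():  # only present on Linux
        print("seccomp mode:", seccomp_mode(status_file.read_text()))
```

Run inside a container started with Docker's default settings, this would be expected to report filter mode; run on the host in a plain shell, it usually reports disabled.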

The default profile is reasonable but not tight: it still permits several hundred syscalls. A typical Node.js web application has no use for calls like process_vm_readv or perf_event_open, so a custom profile can be significantly more restrictive.

Writing a Custom seccomp Profile

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": [
        "read", "write", "close", "fstat", "mmap", "mprotect",
        "munmap", "brk", "rt_sigaction", "rt_sigprocmask",
        "ioctl", "access", "pipe", "select", "sched_yield",
        "mremap", "madvise", "poll", "epoll_wait", "epoll_create",
        "epoll_ctl", "clone", "execve", "wait4", "kill",
        "getpid", "socket", "connect", "accept", "sendto",
        "recvfrom", "bind", "listen", "getsockname", "getpeername",
        "socketpair", "setsockopt", "getsockopt", "exit", "futex",
        "getcwd", "openat", "getdents64", "lstat", "stat",
        "open", "exit_group", "set_robust_list", "prlimit64",
        "getrandom", "sendmsg", "recvmsg"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

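This allowlist approach (defaultAction SCMP_ACT_ERRNO) permits only the listed syscalls and fails every other call with an error. Treat the list above as illustrative rather than complete: a real Node.js process also needs calls such as rt_sigreturn, epoll_pwait, and accept4, so build the list from what your application actually does. Generate a baseline by running your application under strace: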

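# -f follows child processes and threads; logged lines then start with a PID
strace -f -o syscalls.log node server.js
# strace prints bare syscall names, so extract the word before the opening parenthesis
grep -oP '^(?:\d+\s+)?\K\w+(?=\()' syscalls.log | sort -u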
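Turning that syscall list into a profile by hand is tedious, so it may help to script it. The following is a rough sketch, assuming a log in strace's default output format (no timestamp options). The function names syscalls_from_strace and build_profile are invented for this example, and the sample log lines are made up.

```python
import json
import re

# Matches the syscall name at the start of a strace line, with or without the
# PID prefix that `strace -f -o` adds, e.g. "1234 openat(AT_FDCWD, ...".
SYSCALL_RE = re.compile(r"^(?:\d+\s+)?([a-z_][a-z0-9_]*)\(")


def syscalls_from_strace(log_text: str) -> list[str]:
    """Return the sorted, de-duplicated syscall names seen in a strace log."""
    names = set()
    for line in log_text.splitlines():
        match = SYSCALL_RE.match(line)
        if match:  # signal, exit, and "<... resumed>" lines do not match
            names.add(match.group(1))
    return sorted(names)


def build_profile(names: list[str]) -> dict:
    """Build an allowlist profile using the same JSON layout as the example above."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [{"names": names, "action": "SCMP_ACT_ALLOW"}],
    }


if __name__ == "__main__":
    # Made-up sample lines in strace's default format.
    sample = "\n".join([
        '1234 openat(AT_FDCWD, "/app/server.js", O_RDONLY) = 3',
        '1235 futex(0x7f00, FUTEX_WAIT, 0, NULL <unfinished ...>',
        '1235 <... futex resumed>) = 0',
        '1234 +++ exited with 0 +++',
    ])
    print(json.dumps(build_profile(syscalls_from_strace(sample)), indent=2))
```

A trace only captures the code paths that were exercised, so a profile generated this way should be tested against realistic traffic before it is enforced.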

Apply the profile to a container:

docker run --security-opt seccomp=./profile.json node:20 node server.js
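Before rolling out a hand-written profile, it can be worth checking that it does not quietly allow the syscalls associated with the escape techniques listed earlier. The sketch below is a simplified check: the RISKY_SYSCALLS set is an illustrative selection rather than an authoritative list, the function names are invented for this example, and conditional rules (argument filters, capability-gated entries) are ignored.

```python
import json
import sys

# Illustrative selection based on the techniques named in this article; not exhaustive.
RISKY_SYSCALLS = {"ptrace", "mount", "unshare", "keyctl", "reboot"}


def allowed_syscalls(profile: dict) -> set[str]:
    """Collect every syscall name that the profile explicitly allows."""
    allowed = set()
    for rule in profile.get("syscalls", []):
        if rule.get("action") == "SCMP_ACT_ALLOW":
            allowed.update(rule.get("names", []))
    return allowed


def risky_allowed(profile: dict) -> list[str]:
    """Return the risky syscalls the profile permits, sorted for stable output."""
    if profile.get("defaultAction") == "SCMP_ACT_ALLOW":
        # Blocklist-style profile: anything not explicitly denied is allowed.
        denied = set()
        for rule in profile.get("syscalls", []):
            if rule.get("action") != "SCMP_ACT_ALLOW":
                denied.update(rule.get("names", []))
        return sorted(RISKY_SYSCALLS - denied)
    return sorted(RISKY_SYSCALLS & allowed_syscalls(profile))


if __name__ == "__main__":
    # Usage: python check_profile.py ./profile.json  (script name is a placeholder)
    if len(sys.argv) == 2:
        with open(sys.argv[1]) as fh:
            print("risky syscalls allowed:", risky_allowed(json.load(fh)))
```

An empty result does not prove a profile is safe. It only shows that none of the syscalls in this short list are permitted.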

Kubernetes seccomp

In Kubernetes 1.19+, seccomp profiles are generally available:

apiVersion: v1
kind: Pod
metadata:
  name: node-server  # placeholder name
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/node-server.json
  containers:
    - name: app
      image: myapp:latest

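The profile file must be present on every node that can run the pod, at /var/lib/kubelet/seccomp/profiles/node-server.json. The localhostProfile value is resolved relative to the kubelet's seccomp directory.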
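The mapping from localhostProfile to a file on the node is easy to get wrong, so here is a small sketch of it. It assumes the kubelet uses its default root directory (a custom --root-dir changes the base path), and node_profile_path is a hypothetical helper, not part of any Kubernetes client library.

```python
from pathlib import PurePosixPath

# Default kubelet seccomp directory; differs if the kubelet runs with a custom --root-dir.
KUBELET_SECCOMP_DIR = PurePosixPath("/var/lib/kubelet/seccomp")


def node_profile_path(localhost_profile: str) -> PurePosixPath:
    """Map a pod's localhostProfile value to the file expected on the node."""
    relative = PurePosixPath(localhost_profile)
    # The value must stay inside the seccomp directory.
    if relative.is_absolute() or ".." in relative.parts:
        raise ValueError("localhostProfile must be a relative path inside the seccomp directory")
    return KUBELET_SECCOMP_DIR / relative


if __name__ == "__main__":
    print(node_profile_path("profiles/node-server.json"))
```

A check like this could run in a deployment pipeline to confirm that the file distributed to nodes matches the path the pod spec will look for.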

AppArmor: Mandatory Access Control

AppArmor (Application Armor) enforces mandatory access control (MAC) policies at the kernel level. Where seccomp restricts syscalls, AppArmor restricts what files, network resources, and capabilities a process can access — regardless of file permissions.

Writing an AppArmor Profile for Docker

#include <tunables/global>

profile docker-node-app flags=(attach_disconnected, mediate_deleted) {
  #include <abstractions/base>
  #include <abstractions/nameservice>

  network inet tcp,
  network inet udp,
  network inet6 tcp,

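  # Allow read access to app files (/app/** already covers node_modules)
  /app/** r,
  # Allow executing the Node.js binary; this path assumes the official node image layout
  /usr/local/bin/node mrix,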

  # Allow write to temp and logs only
  /tmp/** rw,
  /var/log/app/** w,

  # Deny access to sensitive paths
  deny /etc/shadow r,
  deny /proc/sysrq-trigger w,
  deny /sys/** w,

  # Allow necessary capabilities
  capability net_bind_service,
  capability setuid,
  capability setgid,
}

Load and apply the profile:

sudo apparmor_parser -r -W /etc/apparmor.d/docker-node-app
docker run --security-opt apparmor=docker-node-app myapp:latest
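To confirm that the profile is actually in force, a process can read its own AppArmor label from /proc/self/attr/current. The sketch below parses that label and assumes the usual "profile-name (mode)" format. The helper name parse_apparmor_label is made up for this example.

```python
from pathlib import Path


def parse_apparmor_label(raw: str) -> tuple[str, str]:
    """Split a label such as 'docker-node-app (enforce)' into (profile, mode)."""
    label = raw.strip().rstrip("\x00")
    if not label:
        return ("unknown", "unknown")
    if label == "unconfined":
        return ("unconfined", "none")
    if label.endswith(")") and " (" in label:
        profile, mode = label.rsplit(" (", 1)
        return (profile, mode[:-1])  # drop the closing parenthesis
    return (label, "unknown")


if __name__ == "__main__":
    try:
        raw_label = Path("/proc/self/attr/current").read_text()
    except OSError:
        # The file is missing or unreadable when AppArmor is not enabled.
        raw_label = ""
    print(parse_apparmor_label(raw_label))
```

Inside a container started with the profile above, the expected result would be the docker-node-app profile in enforce mode. A result of unconfined suggests the --security-opt flag was not applied.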

Falco: Runtime Threat Detection

seccomp and AppArmor are preventive controls — they block things before they happen. Falco is a detective control: it monitors syscall activity in real time and fires alerts when it detects suspicious behavior patterns.

Installing Falco

# Install Falco on a Kubernetes cluster via Helm
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm install falco falcosecurity/falco \
  --set driver.kind=ebpf \
  --set falcosidekick.enabled=true \
  --set falcosidekick.config.slack.webhookurl=https://hooks.slack.com/...

The eBPF driver is preferred for modern kernels — it does not require a kernel module and is safer to deploy in production.

Writing Falco Rules

Falco rules use a YAML DSL. Each rule defines a condition (evaluated against syscall events) and an output (the alert message):

- rule: Shell Spawned in Container
  desc: Detect shell execution inside a container
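  # allowed_shell_runners is a list you define yourself; spawned_process,
  # container, and shell_binaries come from Falco's default rules
  condition: >
    spawned_process and
    container and
    proc.name in (shell_binaries) and
    not container.image.repository in (allowed_shell_runners)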
  output: >
    Shell spawned in container
    (user=%user.name container=%container.name
     image=%container.image.repository
     command=%proc.cmdline)
  priority: WARNING
  tags: [container, shell, mitre_execution]

- rule: Sensitive File Read in Container
  desc: Detect reads of sensitive files inside containers
  condition: >
    container and
    open_read and
    sensitive_files
  output: >
    Sensitive file read in container
    (file=%fd.name user=%user.name
     container=%container.name)
  priority: ERROR
  tags: [container, filesystem]

- rule: Container Running as Root
  desc: Detect container process running as UID 0
  # allowed_root_containers is a macro you define yourself
  condition: >
    spawned_process and
    container and
    proc.is_container_healthcheck = false and
    user.uid = 0 and
    not allowed_root_containers
  output: >
    Container running as root
    (container=%container.name image=%container.image.repository)
  priority: NOTICE

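Falco also ships with a maintained default ruleset covering common attack techniques including privilege escalation, data exfiltration, reverse shells, and credential access. The number of rules enabled out of the box depends on the Falco version, because recent releases split the rules into stable, incubating, and sandbox files.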
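Alerts are only useful if something consumes them. As a rough sketch, assuming Falco is configured with json_output enabled so that each alert is one JSON object per line, the following filters alerts by minimum priority. The function names are invented for this example, and the sample alerts are trimmed to the fields used here.

```python
import json

# Falco priorities from most to least severe.
PRIORITIES = ["emergency", "alert", "critical", "error",
              "warning", "notice", "informational", "debug"]
RANK = {name: index for index, name in enumerate(PRIORITIES)}


def at_least(alert: dict, minimum: str) -> bool:
    """True if the alert's priority is at least as severe as `minimum`."""
    priority = str(alert.get("priority", "debug")).lower()
    # Unknown priorities are ranked below every known one.
    return RANK.get(priority, len(PRIORITIES)) <= RANK[minimum.lower()]


def filter_alerts(lines, minimum="warning"):
    """Yield parsed alerts at or above `minimum` from an iterable of JSON lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip startup messages and other non-JSON lines
        alert = json.loads(line)
        if at_least(alert, minimum):
            yield alert


if __name__ == "__main__":
    # Made-up sample alerts using the rule names defined above.
    sample = [
        '{"rule": "Shell Spawned in Container", "priority": "Warning"}',
        '{"rule": "Container Running as Root", "priority": "Notice"}',
        '{"rule": "Sensitive File Read in Container", "priority": "Error"}',
    ]
    for alert in filter_alerts(sample, minimum="warning"):
        print(alert["priority"], alert["rule"])
```

In the Helm setup shown earlier, Falcosidekick already handles routing, so a script like this is mainly useful for ad hoc triage of captured logs.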

Hardening the Container Itself

Beyond these three controls, container security begins with the Dockerfile and pod spec:

# Pod security context
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 2000
  containers:
    - name: app
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
          add: ["NET_BIND_SERVICE"]  # only if needed
      volumeMounts:
        - mountPath: /tmp
          name: tmp-volume
  volumes:
    - name: tmp-volume
      emptyDir: {}

Running as a non-root user with readOnlyRootFilesystem: true and allowPrivilegeEscalation: false removes the most common privilege escalation paths. Mount a writable emptyDir volume for /tmp if your application needs to write temporary files.
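These settings are easy to omit from one container in a multi-container pod, so a small audit step can help. The sketch below checks an already-parsed pod spec (the mapping under the spec key, loaded from YAML or JSON by whatever tooling you use) for the four settings discussed above. audit_pod_spec is a hypothetical helper and covers only these checks.

```python
def audit_pod_spec(spec: dict) -> list[str]:
    """Return human-readable findings for missing hardening settings."""
    findings = []
    pod_ctx = spec.get("securityContext", {})
    for container in spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext", {})
        # A container-level runAsNonRoot takes precedence over the pod-level value.
        if not ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot", False)):
            findings.append(f"{name}: runAsNonRoot is not true")
        if ctx.get("allowPrivilegeEscalation", True):
            findings.append(f"{name}: allowPrivilegeEscalation is not false")
        if not ctx.get("readOnlyRootFilesystem", False):
            findings.append(f"{name}: readOnlyRootFilesystem is not true")
        if "ALL" not in ctx.get("capabilities", {}).get("drop", []):
            findings.append(f"{name}: capabilities.drop does not include ALL")
    return findings


if __name__ == "__main__":
    # An unhardened container produces one finding per missing setting.
    for finding in audit_pod_spec({"containers": [{"name": "app"}]}):
        print(finding)
```

In practice, admission-time enforcement such as Pod Security Admission is the more robust option. A script like this is better suited to quick checks in code review or CI.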

Together, seccomp restricts syscalls, AppArmor restricts file and network access, Falco detects anomalous behavior, and hardened pod specs enforce least-privilege execution. These controls stack — an attacker who bypasses one still faces the others.