Security Incident Response Playbook: Step-by-Step for SaaS Companies
A practical incident response playbook for SaaS companies covering detection, triage, containment, eradication, recovery, and post-mortem.
A security incident without a playbook is a fire drill without a fire escape plan. When an alert fires at 2 AM, the quality of your response depends almost entirely on the decisions you made before the incident — who owns what, what tools are available, and what steps to follow under pressure.
This playbook walks through six phases of incident response, adapted from the NIST SP 800-61 incident response lifecycle for SaaS companies operating cloud-native infrastructure.
Before the Incident: Preparation
Preparation is not a phase in the incident — it is the prerequisite for all other phases being effective.
Define your incident severity tiers:
| Severity | Definition | Response SLA |
|---|---|---|
| P1 – Critical | Data breach, ransomware, complete service outage | Immediate (24/7 on-call) |
| P2 – High | Partial breach, active exploitation, significant degradation | 1 hour |
| P3 – Medium | Suspicious activity, policy violation, limited impact | 4 hours (business hours) |
| P4 – Low | Informational alerts, no confirmed impact | Next business day |
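The tier definitions above can be encoded so that alert routing is consistent rather than ad hoc. A minimal sketch, assuming a set-of-indicators input (the indicator names here are illustrative, not a standard taxonomy):

```python
def classify_severity(indicators: set) -> str:
    """Map observed indicators to a severity tier, mirroring the table above.

    Indicator names are illustrative assumptions, not a standard taxonomy.
    An incident is classified by its most severe indicator.
    """
    if indicators & {"data_breach", "ransomware", "complete_outage"}:
        return "P1"
    if indicators & {"partial_breach", "active_exploitation", "significant_degradation"}:
        return "P2"
    if indicators & {"suspicious_activity", "policy_violation"}:
        return "P3"
    return "P4"
```

Encoding the tiers in code (or in your alerting tool's routing rules) means the 2 AM responder applies the same classification the team agreed on in daylight.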
Establish an on-call rotation with clear escalation paths. Every P1 incident needs a designated Incident Commander (IC) who owns coordination, not just a technical responder who owns investigation.
Set up your communication channels:
- A dedicated incident Slack channel (#incident-YYYY-MM-DD-N)
- A shared incident timeline document (Google Doc or Notion)
- Out-of-band communication (phone/Signal) in case your Slack is compromised
Ensure your tooling is ready:
- Centralized logging (SIEM) with at least 90 days of retention
- Endpoint detection and response (EDR) on all managed devices
- Cloud audit logs (CloudTrail, GCP Audit Logs) with alerting
- Runbooks for common attack scenarios (account compromise, data exfil, ransomware)
Phase 1: Detection and Initial Assessment
Incidents are detected through multiple channels: automated alerts, customer reports, third-party notifications, or routine log reviews.
Detection sources to monitor:
- SIEM correlation rules (brute force, impossible travel, privilege escalation)
- Cloud provider security services (AWS GuardDuty, GCP SCC, Microsoft Defender)
- Intrusion detection systems (network and host-based)
- Bug bounty submissions
- Customer support tickets mentioning unusual behavior
- Dark web monitoring alerts
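Several of these detection sources reduce to simple correlation logic. As an illustrative sketch (not any vendor's actual rule), an "impossible travel" check compares two logins for the same user against a plausible travel speed:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def impossible_travel(login_a, login_b, max_speed_kmh=900.0):
    """Flag two logins (timestamp, lat, lon) whose implied travel speed
    exceeds roughly an airliner's cruise speed. Threshold is an assumption."""
    t_a, lat_a, lon_a = login_a
    t_b, lat_b, lon_b = login_b
    distance = haversine_km(lat_a, lon_a, lat_b, lon_b)
    hours = abs((t_b - t_a).total_seconds()) / 3600
    if hours == 0:
        return distance > 1.0  # simultaneous logins from distinct places
    return distance / hours > max_speed_kmh
```

Real SIEM rules layer in VPN egress points and known corporate ranges to cut false positives, but the core heuristic is this speed check.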
When a potential incident is identified, the first responder performs an initial assessment:
- Is this a confirmed incident or a false positive?
- What systems, data, and users are potentially affected?
- Is the incident ongoing (active attacker) or historical (past breach)?
- What is the initial severity classification?
Declare an incident early rather than late. It is far easier to stand down from a P2 than to upgrade a P4 to P1 two hours in.
Phase 2: Triage and Investigation
Once an incident is declared, the IC assembles the response team and begins structured investigation.
Parallel workstreams:
- Technical investigation: What happened, how, and when? Trace attacker activity through logs.
- Scope assessment: What data was accessed, modified, or exfiltrated?
- Impact assessment: Who is affected (users, customers, employees)?
- Legal/compliance notification: Does this trigger breach notification requirements (GDPR 72-hour rule, HIPAA 60-day rule)?
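Because the notification clocks start when the organization becomes aware of the breach, it helps to compute the deadlines immediately at declaration. A hedged sketch (the windows below are the headline deadlines only; actual obligations depend on jurisdiction, data types, and counsel's assessment):

```python
from datetime import datetime, timedelta

# Headline deadlines only; legal counsel determines actual obligations.
NOTIFICATION_WINDOWS = {
    "GDPR (supervisory authority)": timedelta(hours=72),
    "HIPAA (HHS and individuals)": timedelta(days=60),
}

def notification_deadlines(awareness_time: datetime) -> dict:
    """Latest notification time per regime, measured from when the
    organization became aware of the breach."""
    return {regime: awareness_time + window
            for regime, window in NOTIFICATION_WINDOWS.items()}
```

Pinning these timestamps into the incident document at declaration time removes any later ambiguity about when the clocks started.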
Investigation techniques:
For cloud environments, start with IAM audit logs:
```bash
# AWS CloudTrail — look for unusual API calls
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=Username,AttributeValue=compromised-user \
  --start-time 2026-03-08T00:00:00Z \
  --end-time 2026-03-09T23:59:59Z
```
For application-level incidents, query your centralized logs for the affected user or IP:
```sql
SELECT timestamp, event_type, user_id, ip_address, resource
FROM audit_logs
WHERE (user_id = 'compromised-user' OR ip_address = '198.51.100.42')
  AND timestamp > NOW() - INTERVAL '7 days'
ORDER BY timestamp ASC;
```
Maintain a live timeline. Every finding, action taken, and decision made should be timestamped in the incident document. This is critical for post-mortem analysis and for legal/regulatory purposes.
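In practice the timeline lives in the shared incident document, but the discipline is the same wherever entries land: every entry carries a UTC timestamp, an author, and a category. A minimal in-memory sketch (field names are illustrative):

```python
from datetime import datetime, timezone

timeline = []  # in practice, the shared incident document

def log_entry(author: str, category: str, note: str, now=None) -> dict:
    """Append a timestamped entry. Categories might be
    'finding', 'action', or 'decision' (an assumed taxonomy)."""
    entry = {
        "timestamp": (now or datetime.now(timezone.utc)).isoformat(),
        "author": author,
        "category": category,
        "note": note,
    }
    timeline.append(entry)
    return entry
```

Logging decisions, not just findings and actions, is what makes the record defensible: it shows what was known at the moment each call was made.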
Phase 3: Containment
Containment stops the bleeding without destroying evidence.
Short-term containment (immediate):
- Disable compromised accounts — do not delete them (preserve evidence)
- Revoke and rotate exposed credentials, API keys, and tokens
- Block attacker IPs at the WAF/firewall level
- Isolate compromised EC2 instances / VMs from the network (snapshot first)
- Suspend suspicious OAuth grants
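The first two steps can be prepared ahead of time. A hedged sketch of a helper that assembles, for human review rather than automatic execution, the AWS CLI commands to disable a compromised IAM user's access (the command names are real AWS CLI; the workflow around them is an assumption):

```python
def containment_commands(user: str, access_key_ids: list, attacker_ip: str) -> list:
    """Build the AWS CLI commands to disable (not delete) a compromised
    IAM user's console and key access, preserving the account as evidence.
    Returned as strings so a responder reviews before running anything."""
    cmds = [
        # Removes the console password without deleting the user
        f"aws iam delete-login-profile --user-name {user}",
    ]
    cmds += [
        f"aws iam update-access-key --user-name {user} "
        f"--access-key-id {key} --status Inactive"
        for key in access_key_ids
    ]
    # IP blocking is environment-specific (WAF, security group, network
    # firewall), so it is left as a reminder rather than a command.
    cmds.append(f"# block {attacker_ip} at the WAF/firewall")
    return cmds
```

Generating the commands rather than running them directly keeps a human in the loop and leaves a copy-pasteable record for the incident timeline.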
Evidence preservation:
- Take memory dumps and disk snapshots before terminating compromised instances
- Export relevant log ranges to immutable storage (S3 with Object Lock)
- Screenshot any active attacker sessions before terminating them
Long-term containment:
- Deploy hotfixes for exploited vulnerabilities
- Increase monitoring sensitivity on affected systems
- Notify affected customers if there is active risk to their data
A key containment decision is whether to kick the attacker out immediately or monitor them. In rare cases (sophisticated APT, law enforcement involvement), you may be advised to maintain visibility on attacker activity. In most cases, especially with customer data at risk, contain immediately.
Phase 4: Eradication
Once the attacker's access is cut off, remove all persistence mechanisms and the root cause.
- Remove all backdoors, web shells, malware, and unauthorized SSH keys
- Audit all service accounts and IAM roles for unauthorized permissions added during the intrusion
- Rotate all credentials that were or may have been exposed — not just the compromised ones
- Patch the exploited vulnerability
- Rebuild compromised systems from known-good images rather than cleaning them in place
Verify eradication by running your detection tooling against the cleaned systems. A common scenario is an attacker who planted a secondary backdoor while responders were focused on the primary one.
Phase 5: Recovery
Recovery restores systems to normal operation while maintaining heightened monitoring.
- Restore services from clean backups or rebuild from infrastructure-as-code
- Re-enable accounts after password resets and MFA enrollment verification
- Gradually restore access, starting with least-privileged roles
- Monitor closely for 48-72 hours post-recovery for signs of re-entry
Customer and stakeholder communication follows your breach notification obligations:
- GDPR: notify supervisory authority within 72 hours, affected data subjects "without undue delay" when high risk
- HIPAA: notify affected individuals without unreasonable delay and no later than 60 days after discovery; notify HHS within 60 days for breaches affecting 500 or more individuals (smaller breaches may be reported annually)
- State laws (CCPA, SHIELD Act, etc.): vary by state and breach type
Communicate clearly and factually. Do not minimize the incident. Specify what happened, what data was involved, what you did, and what affected parties should do.
Phase 6: Post-Mortem
The post-mortem is not about blame — it is about systemic improvement.
Hold the post-mortem within 5 business days while the incident is fresh.
Post-mortem structure:
- Timeline of events (detection through recovery)
- Root cause analysis (use the "5 Whys" technique)
- Contributing factors
- What went well
- What could be improved
- Action items with owners and due dates
Example root cause chain (5 Whys):
- Why was there a breach? → An attacker gained access via a phished employee credential.
- Why did the credential give access to production data? → The employee had overly broad IAM permissions.
- Why were permissions overly broad? → No access review process exists.
- Why does no access review process exist? → It was never prioritized in the roadmap.
- Why was it never prioritized? → Security metrics are not tracked in sprint planning.
Root cause: Security requirements are not part of the engineering planning process.
The action item is not "revoke Bob's permissions" — it is "implement quarterly access reviews and add security metrics to sprint planning."
Document post-mortems in a searchable knowledge base. Patterns across incidents reveal systemic weaknesses that deserve investment.