Security Incident Response Playbook: Step-by-Step for SaaS Companies
A practical incident response playbook for SaaS companies covering detection, triage, containment, eradication, recovery, and post-mortem.
A security incident without a playbook is a fire drill without a fire escape plan. When an alert fires at 2 AM, the quality of your response depends almost entirely on the decisions you made before the incident — who owns what, what tools are available, and what steps to follow under pressure.
This playbook walks through six phases of incident response, adapted from the NIST SP 800-61 incident response lifecycle for SaaS companies operating cloud-native infrastructure.
Before the Incident: Preparation
Preparation is not a phase in the incident — it is the prerequisite for all other phases being effective.
Define your incident severity tiers:
| Severity | Definition | Response SLA |
|---|---|---|
| P1 – Critical | Data breach, ransomware, complete service outage | Immediate (24/7 on-call) |
| P2 – High | Partial breach, active exploitation, significant degradation | 1 hour |
| P3 – Medium | Suspicious activity, policy violation, limited impact | 4 hours (business hours) |
| P4 – Low | Informational alerts, no confirmed impact | Next business day |
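The tier definitions above can be encoded so that alert routing is consistent rather than ad hoc. A minimal sketch, assuming a set-of-indicators input (the indicator names here are illustrative, not a standard taxonomy):

```python
def classify_severity(indicators: set) -> str:
    """Map observed indicators to a severity tier, mirroring the table above.

    Indicator names are illustrative assumptions, not a standard taxonomy.
    An incident is classified by its most severe indicator.
    """
    if indicators & {"data_breach", "ransomware", "complete_outage"}:
        return "P1"
    if indicators & {"partial_breach", "active_exploitation", "significant_degradation"}:
        return "P2"
    if indicators & {"suspicious_activity", "policy_violation"}:
        return "P3"
    return "P4"
```

Encoding the tiers in code (or in your alerting tool's routing rules) means the 2 AM responder applies the same classification the team agreed on in daylight.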
Establish an on-call rotation with clear escalation paths. Every P1 incident needs a designated Incident Commander (IC) who owns coordination, not just a technical responder who owns investigation.
Set up your communication channels:
- A dedicated incident Slack channel (#incident-YYYY-MM-DD-N)
- A shared incident timeline document (Google Doc or Notion)
- Out-of-band communication (phone/Signal) in case your Slack is compromised
Ensure your tooling is ready:
- Centralized logging (SIEM) with at least 90 days of retention
- Endpoint detection and response (EDR) on all managed devices
- Cloud audit logs (CloudTrail, GCP Audit Logs) with alerting
- Runbooks for common attack scenarios (account compromise, data exfil, ransomware)
Phase 1: Detection and Initial Assessment
Incidents are detected through multiple channels: automated alerts, customer reports, third-party notifications, or routine log reviews.
Detection sources to monitor:
- SIEM correlation rules (brute force, impossible travel, privilege escalation)
- Cloud provider security services (AWS GuardDuty, GCP SCC, Microsoft Defender)
- Intrusion detection systems (network and host-based)
- Bug bounty submissions
- Customer support tickets mentioning unusual behavior
- Dark web monitoring alerts
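Several of these detection sources reduce to simple correlation logic. As an illustrative sketch (not any vendor's actual rule), an "impossible travel" check compares two logins for the same user against a plausible travel speed:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def impossible_travel(login_a, login_b, max_speed_kmh=900.0):
    """Flag two logins (timestamp, lat, lon) whose implied travel speed
    exceeds roughly an airliner's cruise speed. Threshold is an assumption."""
    t_a, lat_a, lon_a = login_a
    t_b, lat_b, lon_b = login_b
    distance = haversine_km(lat_a, lon_a, lat_b, lon_b)
    hours = abs((t_b - t_a).total_seconds()) / 3600
    if hours == 0:
        return distance > 1.0  # simultaneous logins from distinct places
    return distance / hours > max_speed_kmh
```

Real SIEM rules layer in VPN egress points and known corporate ranges to cut false positives, but the core heuristic is this speed check.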
When a potential incident is identified, the first responder performs an initial assessment:
- Is this a confirmed incident or a false positive?
- What systems, data, and users are potentially affected?
- Is the incident ongoing (active attacker) or historical (past breach)?
- What is the initial severity classification?
Declare an incident early rather than late. It is far easier to stand down from a P2 than to upgrade a P4 to P1 two hours in.
Phase 2: Triage and Investigation
Once an incident is declared, the IC assembles the response team and begins structured investigation.
Parallel workstreams:
- Technical investigation: What happened, how, and when? Trace attacker activity through logs.
- Scope assessment: What data was accessed, modified, or exfiltrated?
- Impact assessment: Who is affected (users, customers, employees)?
- Legal/compliance notification: Does this trigger breach notification requirements (GDPR 72-hour rule, HIPAA 60-day rule)?
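Because the notification clocks start when the organization becomes aware of the breach, it helps to compute the deadlines immediately at declaration. A hedged sketch (the windows below are the headline deadlines only; actual obligations depend on jurisdiction, data types, and counsel's assessment):

```python
from datetime import datetime, timedelta

# Headline deadlines only; legal counsel determines actual obligations.
NOTIFICATION_WINDOWS = {
    "GDPR (supervisory authority)": timedelta(hours=72),
    "HIPAA (HHS and individuals)": timedelta(days=60),
}

def notification_deadlines(awareness_time: datetime) -> dict:
    """Latest notification time per regime, measured from when the
    organization became aware of the breach."""
    return {regime: awareness_time + window
            for regime, window in NOTIFICATION_WINDOWS.items()}
```

Pinning these timestamps into the incident document at declaration time removes any later ambiguity about when the clocks started.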
Investigation techniques:
For cloud environments, start with IAM audit logs:
```bash
# AWS CloudTrail — look for unusual API calls
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=Username,AttributeValue=compromised-user \
  --start-time 2026-03-08T00:00:00Z \
  --end-time 2026-03-09T23:59:59Z
```
For application-level incidents, query your centralized logs for the affected user or IP:
```sql
SELECT timestamp, event_type, user_id, ip_address, resource
FROM audit_logs
WHERE (user_id = 'compromised-user' OR ip_address = '198.51.100.42')
  AND timestamp > NOW() - INTERVAL '7 days'
ORDER BY timestamp ASC;
```
Maintain a live timeline. Every finding, action taken, and decision made should be timestamped in the incident document. This is critical for post-mortem analysis and for legal/regulatory purposes.
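In practice the timeline lives in the shared incident document, but the discipline is the same wherever entries land: every entry carries a UTC timestamp, an author, and a category. A minimal in-memory sketch (field names are illustrative):

```python
from datetime import datetime, timezone

timeline = []  # in practice, the shared incident document

def log_entry(author: str, category: str, note: str, now=None) -> dict:
    """Append a timestamped entry. Categories might be
    'finding', 'action', or 'decision' (an assumed taxonomy)."""
    entry = {
        "timestamp": (now or datetime.now(timezone.utc)).isoformat(),
        "author": author,
        "category": category,
        "note": note,
    }
    timeline.append(entry)
    return entry
```

Logging decisions, not just findings and actions, is what makes the record defensible: it shows what was known at the moment each call was made.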
Phase 3: Containment
Containment stops the bleeding without destroying evidence.
Short-term containment (immediate):
- Disable compromised accounts — do not delete them (preserve evidence)
- Revoke and rotate exposed credentials, API keys, and tokens
- Block attacker IPs at the WAF/firewall level
- Isolate compromised EC2 instances / VMs from the network (snapshot first)
- Suspend suspicious OAuth grants
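The first two steps can be prepared ahead of time. A hedged sketch of a helper that assembles, for human review rather than automatic execution, the AWS CLI commands to disable a compromised IAM user's access (the command names are real AWS CLI; the workflow around them is an assumption):

```python
def containment_commands(user: str, access_key_ids: list, attacker_ip: str) -> list:
    """Build the AWS CLI commands to disable (not delete) a compromised
    IAM user's console and key access, preserving the account as evidence.
    Returned as strings so a responder reviews before running anything."""
    cmds = [
        # Removes the console password without deleting the user
        f"aws iam delete-login-profile --user-name {user}",
    ]
    cmds += [
        f"aws iam update-access-key --user-name {user} "
        f"--access-key-id {key} --status Inactive"
        for key in access_key_ids
    ]
    # IP blocking is environment-specific (WAF, security group, network
    # firewall), so it is left as a reminder rather than a command.
    cmds.append(f"# block {attacker_ip} at the WAF/firewall")
    return cmds
```

Generating the commands rather than running them directly keeps a human in the loop and leaves a copy-pasteable record for the incident timeline.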
Evidence preservation:
- Take memory dumps and disk snapshots before terminating compromised instances
- Export relevant log ranges to immutable storage (S3 with Object Lock)
- Screenshot any active attacker sessions before terminating them
Long-term containment:
- Deploy hotfixes for exploited vulnerabilities
- Increase monitoring sensitivity on affected systems
- Notify affected customers if there is active risk to their data
A key containment decision is whether to kick the attacker out immediately or monitor them. In rare cases (sophisticated APT, law enforcement involvement), you may be advised to maintain visibility on attacker activity. In most cases, especially with customer data at risk, contain immediately.
Phase 4: Eradication
Once the attacker's access is cut off, remove all persistence mechanisms and the root cause.
- Remove all backdoors, web shells, malware, and unauthorized SSH keys
- Audit all service accounts and IAM roles for unauthorized permissions added during the intrusion
- Rotate all credentials that were or may have been exposed — not just the compromised ones
- Patch the exploited vulnerability
- Rebuild compromised systems from known-good images rather than cleaning them in place
Verify eradication by running your detection tooling against the cleaned systems. A common scenario is an attacker who planted a secondary backdoor while responders were focused on the primary one.
Phase 5: Recovery
Recovery restores systems to normal operation while maintaining heightened monitoring.
- Restore services from clean backups or rebuild from infrastructure-as-code
- Re-enable accounts after password resets and MFA enrollment verification
- Gradually restore access, starting with least-privileged roles
- Monitor closely for 48-72 hours post-recovery for signs of re-entry
Customer and stakeholder communication follows your breach notification obligations:
- GDPR: notify supervisory authority within 72 hours, affected data subjects "without undue delay" when high risk
- HIPAA: notify affected individuals without unreasonable delay and no later than 60 days after discovery; notify HHS within 60 days for breaches affecting 500 or more individuals (smaller breaches may be reported annually)
- State laws (CCPA, SHIELD Act, etc.): vary by state and breach type
Communicate clearly and factually. Do not minimize the incident. Specify what happened, what data was involved, what you did, and what affected parties should do.
Phase 6: Post-Mortem
The post-mortem is not about blame — it is about systemic improvement.
Hold the post-mortem within 5 business days while the incident is fresh.
Post-mortem structure:
- Timeline of events (detection through recovery)
- Root cause analysis (use the "5 Whys" technique)
- Contributing factors
- What went well
- What could be improved
- Action items with owners and due dates
Example root cause chain (5 Whys):
- Why was there a breach? → An attacker gained access via a phished employee credential.
- Why did the credential give access to production data? → The employee had overly broad IAM permissions.
- Why were permissions overly broad? → No access review process exists.
- Why does no access review process exist? → It was never prioritized in the roadmap.
- Why was it never prioritized? → Security metrics are not tracked in sprint planning.
Root cause: Security requirements are not part of the engineering planning process.
The action item is not "revoke Bob's permissions" — it is "implement quarterly access reviews and add security metrics to sprint planning."
Document post-mortems in a searchable knowledge base. Patterns across incidents reveal systemic weaknesses that deserve investment.