Technical Craft 11 min read Apr 3, 2026

Incident Response for Ecommerce: The 3am Playbook

When the Store Goes Down and Revenue Stops

An ecommerce outage is not a technical event. It's a business event with a technical cause.

An ecommerce platform going down at 3am on a Friday before a sale weekend is not the same as a SaaS application going down. It's revenue stopping in real time. It's customers abandoning carts and not returning. It's orders not processing. The urgency is different, the stakes are different, and the response needs to be correspondingly faster and more structured.

Most ecommerce teams don't have an incident response process until they need one urgently. The first real incident is when the process gets improvised — which is the worst time to improvise. The runbook in this guide is the one I wish every team I've worked with had before their first P1 incident.

This playbook defines the process before the incident, so the team knows what to do at 3am without having to think through it from first principles.

Ecommerce Incidents Are Different

The key differences between ecommerce incidents and generic software incidents:

Revenue is the first metric. In a generic software incident, the primary metric is uptime. In ecommerce, it's revenue impact per minute. A fully down site and a broken checkout (with a browsable catalog) look very different technically, but economically they are nearly identical: both stop revenue. The response prioritization should reflect this.

The business is watching. Ecommerce outages are visible to the business in real time: through operations dashboards, customer service call volume, and revenue reports. The incident response team is communicating with a business audience that has financial stakes in the timeline, not just a technical audience.

Peak periods change the math. An incident at 2pm on a Tuesday with 50 concurrent sessions is very different from the same incident during a sale event with 500 concurrent sessions. Your severity model should account for traffic level, not just technical symptoms.

Partial failures are common. Magento has many integrations: payment gateway, shipping carriers, tax calculation, search, inventory. A failure in any of these produces a partial outage — the site is up, but checkout is broken, or search returns no results, or shipping options don't load. Partial failures are harder to detect automatically and harder to communicate clearly.

The Severity Model for Ecommerce

Define severity before an incident, not during one. Severity determines response speed, escalation path, and communication requirements.

P1 — Complete outage: The storefront is not loading, or checkout is completely broken, or orders cannot be processed. Revenue impact: 100% of normal rate. Response required: immediate, all hands, business notification within 5 minutes. Target resolution: 30 minutes or rollback decision.

P2 — Partial outage with revenue impact: Checkout works but with a broken payment method; search returns no results; specific product category is inaccessible; significant performance degradation (>5s checkout). Revenue impact: 20–70% of normal rate. Response: immediate, primary on-call, business notification within 15 minutes. Target resolution: 60 minutes.

P3 — Degraded functionality without direct revenue impact: Admin functionality broken, reporting unavailable, specific integration not syncing, minor frontend visual issues. Revenue impact: minimal or indirect. Response: business hours response acceptable. Target resolution: 24 hours.

P4 — Minor issues: Cosmetic issues, non-critical third-party service delays, minor admin UI bugs. Response: planned sprint work. Target resolution: next sprint.
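The P1–P4 model above can be sketched as a small classifier so severity is decided mechanically, not debated at 3am. This is an illustrative sketch: the input flags and the 20% revenue threshold are assumptions drawn from the descriptions above, not a standard.

```python
# Illustrative severity classifier for the P1-P4 model described above.
# The symptom flags and thresholds are assumptions for this sketch.

def classify_severity(storefront_up: bool, checkout_works: bool,
                      revenue_impact_pct: float) -> str:
    """Map observed symptoms to a severity level (P1 is highest)."""
    if not storefront_up or not checkout_works:
        return "P1"   # complete outage or completely broken checkout
    if revenue_impact_pct >= 20:
        return "P2"   # partial outage with direct revenue impact
    if revenue_impact_pct > 0:
        return "P3"   # degraded functionality, indirect impact
    return "P4"       # cosmetic or minor issues
```

Encoding the model this way also makes it easy to add a traffic multiplier for peak periods, per the point above that severity should account for traffic level.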

The First 15 Minutes

The first 15 minutes of a P1/P2 incident are the highest-stakes period. The actions taken (or not taken) in this window determine whether the incident resolves in 30 minutes or 3 hours.

Minute 0–2: Confirm the incident. Alerts can be false positives. Before waking anyone up, verify that the incident is real: check the monitoring dashboard, try to load the site manually, check recent deployments in the CI/CD system. A deployment in the last 30 minutes is the most likely cause of a sudden incident.
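The deploy-correlation check in minute 0–2 is mechanical enough to script. A minimal sketch, assuming deploy timestamps are available from your CI/CD system (the timestamps below are hypothetical); the 30-minute window follows the text above:

```python
from datetime import datetime, timedelta

def recent_deploys(deploy_times, now, window_minutes=30):
    """Return deploys inside the suspect window before the incident."""
    cutoff = now - timedelta(minutes=window_minutes)
    return [t for t in deploy_times if cutoff <= t <= now]

# Hypothetical data: incident confirmed at 03:00, two recent deploys known.
now = datetime(2026, 4, 3, 3, 0)
deploys = [datetime(2026, 4, 3, 2, 45), datetime(2026, 4, 2, 18, 0)]
suspects = recent_deploys(deploys, now)  # only the 02:45 deploy qualifies
```

Any hit in `suspects` makes "roll back first" the default decision in the minute 10–15 step.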

Minute 2–5: Assess scope and severity. Is checkout broken or is the whole site down? Is it affecting all users or a segment? Is it happening in all regions or a specific one? The scope determines whether the response is "roll back the last deploy" or "page the entire team."

Minute 5–10: Communicate. For P1/P2, notify the business stakeholder immediately, even before you have a diagnosis. "The checkout is down, we're investigating, you'll hear from us in 10 minutes." No business stakeholder wants to discover an outage before the technical team notifies them. This is a relationship-preserving action, not just a procedural one.

Minute 10–15: Decide: rollback or investigate. If there was a deployment in the last hour and the symptoms appeared after it, the default decision should be rollback first, investigate second. The cost of reverting a good deploy is one hour of work. The cost of spending 90 minutes investigating while the site is down is 90 minutes of revenue loss. Rollback wins the expected value calculation in almost all cases.
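The expected-value argument can be made concrete with arithmetic. The figures here are assumptions for illustration (engineer cost, a 90-minute investigation, an hour of rework to re-land a reverted deploy):

```python
def rollback_wins(revenue_per_minute: float,
                  investigate_minutes: float = 90,
                  rollback_rework_minutes: float = 60,
                  engineer_cost_per_minute: float = 2.0) -> bool:
    """Compare the cost of investigating while down against the cost
    of reverting a possibly-good deploy and re-landing it later."""
    cost_investigate = revenue_per_minute * investigate_minutes
    cost_rollback = engineer_cost_per_minute * rollback_rework_minutes
    return cost_rollback < cost_investigate

# Even at a modest $50/minute of revenue, rollback wins comfortably:
# $120 of rework versus $4,500 of lost revenue.
rollback_wins(50.0)
```

The break-even point is a revenue rate so low that an hour of engineering time costs more than the outage, which is rare for a store worth paging anyone about.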

Investigation and Mitigation Patterns

If rollback is not applicable (no recent deploy, or rollback has been ruled out), structured investigation is faster than unguided exploration.

Check in this order:

  1. Error logs: `var/log/system.log` and `var/log/exception.log` under the Magento root, plus the PHP error log. An exception log entry with a timestamp that correlates with the incident start is usually the fastest path to root cause.
  2. Infrastructure: Is the database responding? Is Redis up? Is Elasticsearch up? Is memory at capacity? A resource exhaustion issue appears as mysterious frontend failures without clear application errors.
  3. External dependencies: Is the payment gateway API responding? Is the shipping carrier API timing out? Is the tax service returning errors? Partial outages are often caused by a dependent service failure.
  4. Recent configuration changes: Admin configuration changes don't leave a code trail. Check the config audit log or ask whether anyone changed payment gateway credentials, shipping method configuration, or tax settings recently.
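The infrastructure checks in step 2 reduce, as a first pass, to "can I open a TCP connection to each dependency". A minimal sketch using only the standard library; the hosts and ports are assumptions, so adjust them to your topology:

```python
import socket

# Assumed local topology; replace with your real hosts and ports.
SERVICES = {
    "mysql": ("127.0.0.1", 3306),
    "redis": ("127.0.0.1", 6379),
    "elasticsearch": ("127.0.0.1", 9200),
}

def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def health_report(services=SERVICES):
    """One pass over all dependencies, suitable for pasting into the incident channel."""
    return {name: is_reachable(h, p) for name, (h, p) in services.items()}
```

Note the limitation: this only proves reachability. A service that accepts connections but is out of memory still needs its own ping (e.g. a Redis `PING` or an Elasticsearch cluster health call).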

Mitigation before root cause. If you've identified a likely cause but confirming root cause will take another 30 minutes, apply the mitigation first if it's safe to do so. Switch to a backup payment gateway. Disable the integration that's failing and fall back to a manual process. Restore the service, then diagnose why it broke.

Monitoring and observability setup: The investigation patterns above only work if you have monitoring in place before the incident. Setting up proper alerting, log aggregation, and health check dashboards is part of Magendoo's technical leadership engagements — including defining severity thresholds, configuring alert routing, and establishing the on-call rotation that makes the first 15 minutes work.

Communication During an Incident

Communication during an incident is as important as technical response. Bad communication during a well-resolved incident damages trust. Good communication during a long incident preserves it.

Business communication: Update every 15 minutes during a P1, every 30 minutes during a P2. The update format: current status (resolved/investigating/mitigated), what is known, what actions are being taken, next update in X minutes. Do not skip an update because there's nothing new to report — "still investigating, no new information, next update in 15 minutes" is better than silence.
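The four-field update format can be enforced with a template so updates are fast to compose under pressure. A sketch; the field names mirror the format above, and the status vocabulary is an assumption:

```python
def status_update(status: str, known: str, actions: str,
                  next_update_minutes: int) -> str:
    """Render a stakeholder update in the fixed four-field format."""
    assert status in ("investigating", "mitigated", "resolved")
    return (f"Status: {status}. "
            f"What we know: {known}. "
            f"Actions in progress: {actions}. "
            f"Next update in {next_update_minutes} minutes.")

# A "nothing new" update still goes out on cadence rather than staying silent:
msg = status_update("investigating", "no new information",
                    "reviewing exception logs", 15)
```

Fixing the format also keeps hypotheses out of updates by construction: there is no field for speculation, only for what is known and what is being done.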

Don't speculate in public communications. "We think it might be the payment gateway" becomes the public narrative and is damaging if it turns out to be wrong. Communicate facts: "We're investigating the checkout flow" rather than hypotheses.

Customer communication: For extended outages, the business owns customer communication. The technical team's job is to provide accurate status so the business can communicate it correctly, not to write customer-facing messages.

The Post-Mortem

Every P1 and P2 incident requires a post-mortem. Not because someone needs to be blamed — the useful post-mortem explicitly rejects blame — but because the incident is the clearest possible signal of a gap in the system, and that gap should be closed before it causes the next incident.

The post-mortem document has four sections:

  1. What happened: Timeline of the incident, actions taken, resolution. Factual, not interpretive.
  2. Root cause: The technical root cause and, importantly, the systemic cause. The root cause of a checkout failure is rarely "the code had a bug." The systemic cause is "this code wasn't covered by tests" or "this code path wasn't exercised in staging" or "this deploy happened without QA."
  3. What went well: What detection, communication, or response actions were effective. These should be reinforced, not taken for granted.
  4. Action items: Specific, assigned, time-bounded improvements. "Improve monitoring" is not an action item. "Add checkout success rate alert with 90-day baseline threshold by [date], owner: [name]" is an action item.

The post-mortem review should happen within 48 hours of resolution, when the incident is fresh. Action items from the previous post-mortem should be reviewed before every P1/P2 discussion — if the same root cause appears twice, the first post-mortem's action items weren't implemented.
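The "specific, assigned, time-bounded" rule for action items can be checked mechanically at review time. A sketch with assumed field names and an arbitrary minimum description length; the example items echo the ones in the text:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    description: str
    owner: str
    due: date

def is_actionable(item: ActionItem) -> bool:
    """Reject vague items: require an owner, a due date, and a
    description long enough to be specific (threshold is arbitrary)."""
    return (bool(item.owner.strip())
            and item.due is not None
            and len(item.description) > 20)

good = ActionItem("Add checkout success rate alert with 90-day baseline threshold",
                  "alice", date(2026, 5, 1))
vague = ActionItem("Improve monitoring", "", date(2026, 5, 1))
```

Running such a check before closing the post-mortem review is a cheap way to stop "improve monitoring" from being accepted as a done deal.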

When post-mortems reveal systemic issues: If your post-mortem action items keep pointing to the same areas — fragile integrations, untested code paths, performance bottlenecks under load — that's a signal the codebase needs a systematic review, not just point fixes. A Magento Audit turns recurring incident patterns into a prioritized remediation plan, so the same root cause stops appearing in post-mortems.

Incident Response Runbook

  • Severity levels defined and shared with both technical team and business stakeholders
  • On-call rotation established with clear coverage and escalation paths
  • P1/P2 notification list: who gets paged, in what order, through what channel
  • Rollback procedure documented and tested — not assumed to work
  • First 15 minutes checklist posted in the team communication channel
  • Business stakeholder notification template: consistent format, fast to compose
  • Log access: every team member knows where to find system.log, exception.log, and the PHP error log
  • Infrastructure health checks: monitoring covers the database, Redis, and Elasticsearch
  • External dependency status pages bookmarked: payment gateway, shipping, tax services
  • Update cadence rule: P1 every 15 min, P2 every 30 min — no silent periods
  • Post-mortem template prepared — used within 48h of every P1/P2 resolution
  • Action items from previous post-mortems tracked and reviewed regularly

Written by Florinel Chis — 22+ years in commerce engineering.

Need help applying this to your project?

These guides come from 22+ years and 50+ Magento projects. If your team is facing one of these challenges, I can help — through a focused platform audit, technical leadership engagement, or hands-on development.
