When the Store Goes Down and Revenue Stops
An ecommerce outage is not a technical event. It's a business event with a technical cause.
An ecommerce platform going down at 3am on a Friday before a sale weekend is not the same as a SaaS application going down. It's revenue stopping in real time. It's customers abandoning carts and not returning. It's orders not processing. The urgency is different, the stakes are different, and the response needs to be correspondingly faster and more structured.
Most ecommerce teams don't have an incident response process until they need one urgently. The first real incident is when the process gets improvised — which is the worst time to improvise. The runbook in this guide is the one I wish every team I've worked with had before their first P1 incident.
This playbook defines the process before the incident, so the team knows what to do at 3am without having to think through it from first principles.
The key differences between ecommerce incidents and generic software incidents:
Revenue is the first metric. In a generic software incident, the primary metric is uptime. In ecommerce, it's revenue impact per minute. A fully down site and a broken checkout (with a browseable catalog) are technically similar in severity but economically very different. The response prioritization should reflect this.
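To make "revenue impact per minute" concrete, here is a minimal sketch in Python. All the figures (order rate, average order value, the 0.4 loss fraction for a single broken payment method) are illustrative assumptions, not numbers from any real store:

```python
# Rough revenue-impact estimate for incident prioritization.
# All input figures below are illustrative assumptions.

def revenue_impact_per_minute(orders_per_hour: float,
                              average_order_value: float,
                              conversion_loss: float) -> float:
    """Estimate revenue lost per minute of an incident.

    conversion_loss: fraction of normal revenue blocked by the incident
    (1.0 for a full outage, ~0.4 if one payment method is down, etc.).
    """
    return orders_per_hour / 60 * average_order_value * conversion_loss

# Full outage vs. one broken payment method, at the same traffic level:
full_outage = revenue_impact_per_minute(120, 85.0, 1.0)  # whole site down
partial = revenue_impact_per_minute(120, 85.0, 0.4)      # one gateway down
```

The two numbers differ by the loss fraction alone, which is exactly the point: the same traffic and the same "checkout incident" label can hide very different burn rates.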
The business is watching. Ecommerce outages are visible to the business in real time — through operations dashboards, through customer service call volume, through revenue reports. The incident response team is communicating with a business audience that has financial stakes in the timeline, not just a technical audience.
Peak periods change the math. An incident at 2pm on a Tuesday with 50 concurrent sessions is very different from the same incident during a sale event with 500 concurrent sessions. Your severity model should account for traffic level, not just technical symptoms.
Partial failures are common. Magento has many integrations: payment gateway, shipping carriers, tax calculation, search, inventory. A failure in any of these produces a partial outage — the site is up, but checkout is broken, or search returns no results, or shipping options don't load. Partial failures are harder to detect automatically and harder to communicate clearly.
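Because partial failures return HTTP 200, an uptime probe won't catch them; a synthetic check has to assert on page content. Here is a minimal sketch in Python — the URLs and marker strings are placeholders, not real endpoints:

```python
# Sketch of a synthetic check that catches partial failures an uptime
# probe would miss: the page returns 200 but a critical element is absent.
# URLs and marker strings are placeholder assumptions.
import urllib.request

CHECKS = [
    # (url, substring that must appear in a healthy response)
    ("https://shop.example/checkout/cart", "Proceed to Checkout"),
    ("https://shop.example/catalogsearch/result/?q=shirt", "product-item"),
]

def run_checks(fetch=urllib.request.urlopen):
    """Return a list of (url, reason) tuples for every failing check."""
    failures = []
    for url, marker in CHECKS:
        try:
            body = fetch(url, timeout=10).read().decode("utf-8", "replace")
        except Exception as exc:
            failures.append((url, f"request failed: {exc}"))
            continue
        if marker not in body:
            # HTTP 200, but the feature is broken: a partial outage.
            failures.append((url, f"missing expected content: {marker!r}"))
    return failures
```

The `fetch` parameter is injectable so the check logic can be tested without a live site; in production it defaults to a plain `urllib` request.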
Define severity before an incident, not during one. Severity determines response speed, escalation path, and communication requirements.
P1 — Complete outage: The storefront is not loading, or checkout is completely broken, or orders cannot be processed. Revenue impact: 100% of normal rate. Response required: immediate, all hands, business notification within 5 minutes. Target resolution: 30 minutes or rollback decision.
P2 — Partial outage with revenue impact: Checkout works but with a broken payment method; search returns no results; specific product category is inaccessible; significant performance degradation (>5s checkout). Revenue impact: 20–70% of normal rate. Response: immediate, primary on-call, business notification within 15 minutes. Target resolution: 60 minutes.
P3 — Degraded functionality without direct revenue impact: Admin functionality broken, reporting unavailable, specific integration not syncing, minor frontend visual issues. Revenue impact: minimal or indirect. Response: business hours response acceptable. Target resolution: 24 hours.
P4 — Minor issues: Cosmetic issues, non-critical third-party service delays, minor admin UI bugs. Response: planned sprint work. Target resolution: next sprint.
The first 15 minutes of a P1/P2 incident are the highest-stakes period. The actions taken (or not taken) in this window determine whether the incident resolves in 30 minutes or 3 hours.
Minute 0–2: Confirm the incident. Alerts can be false positives. Before waking anyone up, verify that the incident is real: check the monitoring dashboard, try to load the site manually, check recent deployments in the CI/CD system. A deployment in the last 30 minutes is the most likely cause of a sudden incident.
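The deploy-correlation check in that step is mechanical enough to automate. A sketch, assuming deploy timestamps are pulled from your CI/CD system's API (the window and data shapes here are illustrative):

```python
# Sketch: correlate an alert with recent deployments. In practice the
# deploy timestamps would come from your CI/CD system; here they are
# plain datetimes. The 30-minute window mirrors the heuristic above.
from datetime import datetime, timedelta

def likely_deploy_cause(alert_time: datetime,
                        deploy_times: list,
                        window_minutes: int = 30):
    """Return the most recent deploy inside the suspect window, or None."""
    window = timedelta(minutes=window_minutes)
    candidates = [d for d in deploy_times
                  if timedelta(0) <= alert_time - d <= window]
    return max(candidates) if candidates else None
```

If this returns a deploy, the rollback-first logic in the minute 10–15 step applies; if it returns `None`, you are more likely looking at an infrastructure or third-party failure.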
Minute 2–5: Assess scope and severity. Is checkout broken or is the whole site down? Is it affecting all users or a segment? Is it happening in all regions or a specific one? The scope determines whether the response is "roll back the last deploy" or "page the entire team."
Minute 5–10: Communicate. For P1/P2, notify the business stakeholder immediately, even before you have a diagnosis. "The checkout is down, we're investigating, you'll hear from us in 10 minutes." No business stakeholder wants to discover an outage before the technical team notifies them. This is a relationship-preserving action, not just a procedural one.
Minute 10–15: Decide: rollback or investigate. If there was a deployment in the last hour and the symptoms appeared after it, the default decision should be rollback first, investigate second. The cost of reverting a good deploy is one hour of work. The cost of spending 90 minutes investigating while the site is down is 90 minutes of revenue loss. Rollback wins the expected value calculation in almost all cases.
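The expected-value argument can be made explicit with two lines of arithmetic. The burn rate, downtime estimates, and the cost of redoing a good deploy are all illustrative assumptions:

```python
# Expected-value sketch of rollback-vs-investigate. Every number here
# is an illustrative assumption; substitute your own store's figures.

def expected_cost(downtime_minutes: float, revenue_per_minute: float,
                  engineering_cost: float = 0.0) -> float:
    """Total cost of an option: revenue burned while down, plus rework."""
    return downtime_minutes * revenue_per_minute + engineering_cost

revenue_per_minute = 170.0  # assumed burn rate during the incident

# Rollback: ~10 min to restore, plus the cost of redoing a good deploy.
rollback = expected_cost(10, revenue_per_minute, engineering_cost=200.0)

# Investigate-first: potentially 90 min of revenue loss before a fix.
investigate = expected_cost(90, revenue_per_minute)
```

Even with a generous allowance for wasted engineering time, the rollback option dominates unless the investigation is nearly certain to finish in minutes.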
If rollback is not applicable (no recent deploy, or rollback has been ruled out), structured investigation is faster than unguided exploration.
Check in this order: the most recent deployment and configuration changes; infrastructure health (web servers, database, cache); third-party integrations (payment gateway, shipping, tax, search); then application logs for errors that correlate with the incident start time. Changes are the most likely cause, so they come first; logs are the slowest to sift, so they come last.
Mitigation before root cause. If you've identified a likely cause but confirming root cause will take another 30 minutes, apply the mitigation first if it's safe to do so. Switch to a backup payment gateway. Disable the integration that's failing and fall back to a manual process. Restore the service, then diagnose why it broke.
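The backup-gateway mitigation is essentially a fallback loop. A minimal sketch in Python — the gateway objects and their `charge` method are hypothetical stand-ins for whatever your payment integration exposes:

```python
# Sketch of mitigation-before-root-cause: try the primary gateway,
# fall back to a backup when it fails. Gateway classes and the charge()
# interface are hypothetical, not a real payment library's API.

def charge_with_fallback(order, gateways):
    """Attempt each gateway in order; restore service first, diagnose later."""
    errors = []
    for gateway in gateways:
        try:
            return gateway.charge(order)
        except Exception as exc:  # real code would catch gateway-specific errors
            errors.append((gateway.name, exc))
    # Only if every gateway fails does the incident stay customer-visible.
    raise RuntimeError(f"all gateways failed: {errors}")
```

The design point is that the failure of the primary gateway is recorded for later diagnosis but never blocks the order: the customer sees a working checkout while the team investigates why the primary broke.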
Communication during an incident is as important as technical response. Bad communication during a well-resolved incident damages trust. Good communication during a long incident preserves it.
Business communication: Update every 15 minutes during a P1, every 30 minutes during a P2. The update format: current status (resolved/investigating/mitigated), what is known, what actions are being taken, next update in X minutes. Do not skip an update because there's nothing new to report — "still investigating, no new information, next update in 15 minutes" is better than silence.
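Templating the update removes the temptation to improvise under pressure. A sketch of the four-field format described above (the field labels are my phrasing, not a standard):

```python
# Sketch of the four-field update format described above, so every
# update carries the same structure. Field labels are assumptions.

VALID_STATUSES = {"investigating", "mitigated", "resolved"}

def format_update(status: str, known: str, actions: str,
                  next_update_minutes: int) -> str:
    """Render a stakeholder update: status, facts, actions, next update."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    return (f"Status: {status}\n"
            f"What we know: {known}\n"
            f"Actions in progress: {actions}\n"
            f"Next update in {next_update_minutes} minutes")

# Even a "no news" update beats silence:
msg = format_update("investigating", "no new information",
                    "continuing diagnosis of the checkout flow", 15)
```

Note that the template has no field for hypotheses — by construction it communicates facts and actions, which is the discipline the next point asks for.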
Don't speculate in public communications. "We think it might be the payment gateway" becomes the public narrative and is damaging if it turns out to be wrong. Communicate facts: "We're investigating the checkout flow" rather than hypotheses.
Customer communication: For extended outages, the business owns customer communication. The technical team's job is to provide accurate status so the business can communicate it correctly, not to write customer-facing messages.
Every P1 and P2 incident requires a post-mortem. Not because someone needs to be blamed — the useful post-mortem explicitly rejects blame — but because the incident is the clearest possible signal of a gap in the system, and that gap should be closed before it causes the next incident.
The post-mortem document has four sections: a factual timeline of the incident, the root cause analysis, the business impact (revenue, orders, and customers affected), and the action items that will prevent recurrence.
The post-mortem review should happen within 48 hours of resolution, when the incident is fresh. Action items from the previous post-mortem should be reviewed before every P1/P2 discussion — if the same root cause appears twice, the first post-mortem's action items weren't implemented.
These guides come from 22+ years and 50+ Magento projects. If your team is facing one of these challenges, I can help — through a focused platform audit, technical leadership engagement, or hands-on development.