Ember

Your incident process is not broken. Your context is.

Most engineering teams don't lack tools. They lack connected context at the moment decisions are made. That's what makes incidents expensive.

The alert fires. Engineers land in the incident channel. Someone asks what changed. Someone else is already checking the deployment log. A third person is searching Slack.

Two minutes in, the thread has more questions than answers: Did something deploy this afternoon? Is this the same pattern as last month? Who reviewed the changes to the payment service? Should we roll back?

What makes the first twenty minutes expensive is rarely system complexity. It is time spent gathering context before anyone can act.

The data exists. The problem is where it lives.

Most engineering teams are not short on tools. They have pull requests and code review history, deployment records, monitoring dashboards, Slack or Teams threads where engineers discussed the change before it shipped, customer reports in a support queue, and postmortems filed away in Confluence or Notion.

All of that exists. None of it is connected.

Each system was built to solve one problem well. The deployment tool tracks what shipped. The monitoring platform shows current behaviour. The version control system holds the code and the review conversation. None of them were designed to answer the question a responder needs answered right now: given everything that changed and everything that is happening, where should I look first?

When an alert fires, the team does not open a unified picture of what led to this moment. They open six tabs and start building it themselves.

Context reconstruction is hidden incident toil

Before the team has a working hypothesis, there is a phase that is rarely broken out in postmortems or time-to-resolve metrics. Someone scrolls the deployment log. Someone else searches Slack. The on-call engineer messages the person who reviewed the relevant PR. Someone checks recent support tickets to see whether customers were already affected.

This is not incident response. It is incident preparation. And it consumes time that should go toward the actual investigation.

The people who move fastest through this phase carry context in their heads. They know which services are fragile, which recent changes felt risky in review, and which failure patterns have appeared before. When they are available, things move. When they are not, the team rebuilds the map from scratch under pressure, often anchoring to the most visible signal rather than the most relevant one.

The first job in an incident is not execution. It is orientation. And orientation under pressure, with scattered context, is where incidents quietly become expensive.

Dashboards tell you what, not how you got here

Monitoring is necessary. Without it, teams do not know something is wrong. But dashboards were built to answer one question: how is the system behaving right now?

What they do not show is the sequence that led to this moment.

A 4% error rate on the payments API does not tell you the rate started climbing at 14:23, eleven minutes after a deployment completed. It does not show that someone flagged a concern in the PR: "this path hasn't been load-tested, worth watching in production." It does not surface the customer report filed two hours earlier, or the March incident traced to the same connection pool.

All of that context exists. A dashboard does not connect it.
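
To make the timing relationship in the payments example concrete, here is a minimal sketch of the lookup a responder does by hand: given the moment the error rate started climbing, which deployments completed shortly before it? The `Deployment` type, the `deployments_before` function, the 30-minute window, the service names, and the calendar date are all assumptions made for illustration. Only the 14:23 onset and the eleven-minute gap come from the example above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deployment:
    """Hypothetical record of one completed deployment."""
    service: str
    completed_at: datetime


def deployments_before(onset, deployments, window=timedelta(minutes=30)):
    """Return (deployment, gap) pairs for deployments that completed
    within `window` before `onset`, smallest gap first."""
    hits = []
    for d in deployments:
        gap = onset - d.completed_at
        if timedelta(0) <= gap <= window:
            hits.append((d, gap))
    return sorted(hits, key=lambda pair: pair[1])


# Placeholder date; the times mirror the payments example.
onset = datetime(2024, 1, 15, 14, 23)
deploys = [
    Deployment("payments-api", datetime(2024, 1, 15, 14, 12)),
    Deployment("search", datetime(2024, 1, 15, 9, 40)),
]

for d, gap in deployments_before(onset, deploys):
    minutes = int(gap.total_seconds() // 60)
    print(f"{d.service} finished {minutes} min before onset")
```

The lookup itself is trivial. The cost in a real incident is that the onset time lives in the monitoring platform and the deployment record lives somewhere else, so someone has to fetch both before this comparison can happen at all.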

The missing layer

What sits between the alert and the team having enough to act is not another dashboard. It is a layer that connects the signals that already exist, so responders do not have to piece them together manually.

The deployment that shipped this afternoon. The PR review thread where someone flagged a concern. The error pattern that started climbing shortly after. The customer report that matches the symptoms. The postmortem from six months ago traced to the same service.

Each signal lives in a different system. Each is findable in isolation. Together, under pressure, they are slow to assemble and easy to get wrong. The missing layer makes them visible as a connected picture when the incident is active, with explicit confidence and the evidence behind each observation. It does not make decisions. It reduces how much context responders have to reconstruct before they can act.

Where Ember fits

Ember is being built to close this gap. Not by adding another dashboard, not by declaring root cause autonomously, but by connecting the signals engineering teams already have: deployments, pull requests and review conversations, telemetry, team discussion in Slack or Teams, and operational memory from past incidents.

When production behaves differently, Ember surfaces which recent changes are in the relevant window, whether any of them included a concern in review, whether a similar pattern appeared before, and what confidence applies to each observation.

The team can disagree with any of it, set a hypothesis aside, or follow a thread Ember did not surface. That judgment stays with the engineers. What changes is how much of the first twenty minutes goes toward finding information versus acting on it.
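
As a rough illustration of what "confidence and evidence behind each observation" could look like as data, the sketch below models an observation as a summary, a confidence value, and a list of evidence sources. This is not Ember's data model or API. The `Observation` type, the `rank` function, the 0-to-1 confidence scale, and every value in the sample data are hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """Hypothetical shape for one surfaced observation."""
    summary: str
    confidence: float  # assumed scale: 0.0 (weak) to 1.0 (strong)
    evidence: list = field(default_factory=list)  # where the claim comes from


def rank(observations):
    """Drop observations that cite no evidence, then sort the rest
    by confidence, highest first."""
    backed = [o for o in observations if o.evidence]
    return sorted(backed, key=lambda o: o.confidence, reverse=True)


# Sample data with made-up confidence values.
observations = [
    Observation("Similar pattern in a past incident", 0.4, ["postmortem"]),
    Observation(
        "Deployment completed shortly before error onset",
        0.7,
        ["deployment record", "error-rate graph"],
    ),
    Observation("Unrelated config drift", 0.9),  # no evidence attached
]

for o in rank(observations):
    print(f"{o.confidence:.1f}  {o.summary}  (evidence: {', '.join(o.evidence)})")
```

The design choice worth noting is that an observation without evidence is dropped no matter how high its confidence. Responders can only disagree with a claim, or set it aside, if they can see what it rests on.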

The right context, at the right moment

Most teams have the process. Runbooks, monitoring, postmortems. What breaks down is the context needed to run that process well, scattered across systems that do not talk to each other and partly held in the heads of engineers.

The teams that respond well are the ones that can orient quickly: what changed, whether this has happened before, and where to start. That orientation is not a process problem. It is a context problem.

The answer is not more alerts or more dashboards. It is the relevant context pulled together at the moment decisions are being made.
