The alert is rarely the beginning of an incident.
By the time latency spikes or customers start complaining, something has usually already changed. A deployment landed. A config value moved. A dependency shifted. A feature flag exposed a path that had not been tested under real traffic.
The team may not know which change matters yet, but they know the question to ask: what changed?
The question teams come back to
Production systems rarely fail in isolation. They fail because something shifted: a deployment, a config value, a dependency version, a feature flag, a traffic spike, a certificate expiry, a resource limit quietly crossed.
The first job in an incident is not to find someone to blame. It is to understand what made the system different.
Experienced on-call engineers often start here before they even open a dashboard. They check recent deploys, scan for config changes, and search Slack for anything that looks like a warning from the last few hours.
The instinct is sound. The execution is where things get difficult.
Why change is usually the best clue
If a service that was healthy at noon is degraded by 2pm, something changed between noon and 2pm.
The range of things that qualify as "change" is wider than most teams assume when the alert first fires:
- Code deployments and hotfixes
- Configuration changes, including feature flags and environment variables
- Infrastructure updates, scaling events, or certificate rotations
- Dependency updates, including transitive ones
- Data shape changes, including schema migrations, new customer usage patterns, and unexpected payload sizes
- External factors, including third-party APIs, CDN behaviour, and DNS changes
Any of these can interact with existing assumptions in ways that were not anticipated. The system was not fragile in isolation. It became fragile when a new change met an old assumption.
The data needed to answer "what changed?" usually exists. It is in version control, deployment logs, CI/CD pipelines, infrastructure audit trails, Slack threads, and ticket comments. The problem is that it is not connected when the team needs it.
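To make that concrete, here is a minimal sketch of what connecting those sources could look like once each record is normalised into a common shape. Everything in it is hypothetical: `ChangeEvent` and `changes_in_window` are illustrative names, not any particular tool's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ChangeEvent:
    """One change record, normalised from whichever system logged it:
    a deploy pipeline, a config audit trail, a flag service, and so on."""
    timestamp: datetime
    source: str    # hypothetical labels: "deploy", "config", "flag", "infra"
    service: str   # the component the change touched
    summary: str   # one-line description a responder can scan

def changes_in_window(events: list[ChangeEvent],
                      incident_start: datetime,
                      lookback: timedelta = timedelta(hours=4)) -> list[ChangeEvent]:
    """Every change that landed in the window before the incident,
    newest first, regardless of which system recorded it."""
    window_start = incident_start - lookback
    hits = [e for e in events if window_start <= e.timestamp <= incident_start]
    return sorted(hits, key=lambda e: e.timestamp, reverse=True)
```

The filter is trivial; the real work is normalising each source into that common shape, which is precisely the connection that is usually missing.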
Why manual correlation breaks
Experienced engineers carry a lot of context about which components are fragile and which changes tend to cause trouble. But that knowledge does not scale, and it breaks down in exactly the situations where it matters most.
Manual correlation becomes difficult when:
- several deployments land in the same window and the cause is not obvious
- symptoms appear in a different service from the one that changed
- a deploy looks healthy at first, then fails under real traffic hours later
- the person with the most context is unavailable
- ownership boundaries are unclear
When any of these conditions apply, reconstructing "what changed?" takes time the team does not have. Piecing it together manually under pressure is where incidents get expensive.
Correlation is not causation
It is worth being precise about what change correlation actually tells you.
A recent deployment is not proof. A risky PR is not guilt. But both can be useful evidence.
The goal is not to close the investigation early. It is to create a prioritised list of hypotheses worth investigating first. A change that touched the affected service, shipped in the relevant window, and has been associated with similar incidents before deserves early attention, even if it turns out to be unrelated.
Teams that treat change correlation as evidence rather than conclusion make better use of it. They investigate the most plausible hypotheses first, discard them when the evidence does not hold, and keep looking when the obvious candidate does not explain what they are seeing.
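One way to make "prioritised" concrete is a simple additive score over those three signals: did the change touch the affected service, did it ship in the relevant window, and has it been linked to similar incidents before. The function below is an illustrative sketch with made-up weights, not Ember's actual ranking.

```python
from datetime import datetime, timedelta

def score_change(touched_affected_service: bool,
                 landed_at: datetime,
                 incident_start: datetime,
                 linked_to_similar_incident: bool) -> float:
    """Rank-order heuristic: a higher score means 'investigate earlier'.
    Weights are illustrative, not calibrated against real incidents."""
    score = 0.0
    if touched_affected_service:
        score += 2.0   # the change touched the degraded service
    age = incident_start - landed_at
    if timedelta(0) <= age <= timedelta(hours=2):
        score += 1.5   # it shipped in the relevant window
    if linked_to_similar_incident:
        score += 1.0   # it has been associated with similar incidents before
    return score

# A deploy to the affected service, 20 minutes before the alert, with a
# history of similar incidents, scores 2.0 + 1.5 + 1.0 and tops the list.
top = score_change(True, datetime(2024, 5, 1, 13, 40),
                   datetime(2024, 5, 1, 14, 0), True)   # -> 4.5
```

An additive score is deliberately boring: it keeps the ranking explainable, so a responder can see why a change was surfaced and discard it when the evidence does not hold.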
Where Ember fits
Monitoring is necessary. Without it, teams do not know when to ask "what changed?" at all. Alerts, dashboards, and error tracking are foundational.
But monitoring tells you that something is wrong. It does not tell you why. That gap is where incidents become expensive: the time spent reconstructing context under pressure, without connected data, often without the right people available.
Ember is being built to connect the signals that already exist: recent changes, deployment history, team conversations, observability context, and past incident memory. The goal is to give responders a ranked, evidence-backed view of what changed and what is plausibly related.
Ember is not trying to declare root cause from a single signal. It is trying to reduce the amount of context responders have to rebuild manually while the incident is already moving.
In practice, that looks less like:
- "Root cause identified."
- "Incident resolved automatically."
And more like:
- "These deployments landed in the relevant window."
- "This change has been associated with similar incidents before."
- "Confidence in this hypothesis is moderate. The signal is present but not conclusive."
That kind of reasoning does not replace engineers. It gives them a better starting point.
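In data terms, each of those statements is a hypothesis that carries its own evidence and an explicit, hedged confidence level. One possible shape, again illustrative rather than Ember's actual model:

```python
from dataclasses import dataclass, field
from enum import Enum

class Confidence(Enum):
    LOW = "low"
    MODERATE = "moderate"
    HIGH = "high"

@dataclass
class Hypothesis:
    """An evidence-backed starting point for investigation, not a verdict."""
    summary: str                             # e.g. "These deployments landed in the window"
    evidence: list[str] = field(default_factory=list)
    confidence: Confidence = Confidence.LOW  # hedged by default

h = Hypothesis(
    summary="This change has been associated with similar incidents before",
    evidence=["landed 25 minutes before the alert",
              "touched the affected service"],
    confidence=Confidence.MODERATE,
)
```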
Teams that answer faster learn faster
That is what the question is really for.
Not blame. Not guesswork. Not a shortcut to root cause.
A faster route to useful context.
Teams that answer "what changed?" quickly recover faster, write better postmortems, and carry more learning into the next deployment.
That is the shift Ember is being built to support.