Ember

Why engineering incidents don't start at 'incident'

Incidents don't begin when alerts fire. They begin earlier, in small signals teams are trained to ignore because no tool treats them as meaningful.


The alert fires. The ticket gets created. The war room spins up.

This is what we call an incident.

But something was already wrong before any of that happened. The alert didn't cause the problem. It just made the problem undeniable.

By the time a system pages someone, the failure has already occurred. The customer has already been affected. The damage is done, and the team is in recovery mode.

Alerts and tickets are lagging indicators. They tell you what happened, not what was happening.

The incident didn't start when the alert fired. It started when something felt off and nobody wrote it down.

The signals teams learn to ignore

Every engineering team generates early warning signals. Most of them get dismissed.

Not maliciously. Not negligently. They get dismissed because nothing in the workflow treats them as meaningful. There's no place to put them, no system to record them, no process that acts on them.

So they pass without comment. And then, later, someone says: "We should have seen this coming."

PRs that feel risky but pass CI

A pull request touches three services, changes a database migration, and modifies a critical path. CI passes. Code review approves.

But something doesn't sit right.

Maybe the author hesitates before merging. Maybe a reviewer adds a comment that says "this should be fine, but keep an eye on it." Maybe someone requests a second reviewer, not because the code is wrong, but because the change feels consequential.

These moments carry signal. They indicate that experienced engineers perceive risk that automated checks can't see. But there's no field in the PR template for "gut feeling." No metric captures "uneasy approval."

The merge happens. The deploy goes out. And if nothing breaks, the signal is forgotten. If something does break, the team wonders why they didn't act on what they felt.
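
One way to keep that signal, sketched under assumptions rather than prescribed: let the reviewer (or a small bot) record why a passing PR still felt risky, so the note lives next to the PR instead of evaporating in review chat. The RiskNote shape, the PR number, and the reasons below are hypothetical illustrations, not a feature of any existing PR tool.

```python
# A minimal sketch of an "uneasy approval" captured as data rather than a
# passing remark. Everything here (the shape, the PR number, the reasons)
# is hypothetical and for illustration only.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class RiskNote:
    pr_number: int
    author: str
    reasons: list[str]
    noted_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# Hypothetical usage: stored as a structured comment alongside the PR.
note = RiskNote(
    pr_number=1234,  # hypothetical PR number
    author="reviewer-a",
    reasons=[
        "touches three services",
        "changes a database migration",
        "modifies a critical path",
    ],
)
```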

Extra reviewers added "just in case"

When a developer adds a third reviewer to a PR, they're saying something.

They might be saying the change is high-stakes. They might be saying they don't fully trust the tests. They might be saying this area of the codebase has hurt them before.

Whatever the reason, the act of seeking additional review is information. It reveals perceived risk.

But pull request systems don't interpret behaviour. They track approvals, not anxiety. They record who reviewed, not why the author felt they needed more eyes.

The pattern goes unnoticed. The next time that same component changes, the same unease reappears. And the cycle continues.
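
A minimal sketch of making that pattern visible, assuming you can export basic review metadata: flag PRs where the author requested more reviewers than the branch rules require, and note which components they touched. The PullRequest shape and the REQUIRED_REVIEWERS threshold are illustrative assumptions, not a real review API.

```python
# A sketch under assumptions: treat "more reviewers than required" as a
# perceived-risk signal worth recording per component. The data shape and
# threshold are illustrative, not taken from any real review system.
from dataclasses import dataclass

REQUIRED_REVIEWERS = 1  # hypothetical branch-protection minimum

@dataclass
class PullRequest:
    number: int
    reviewers_requested: int
    components_touched: list[str]

def extra_reviewer_signals(prs: list[PullRequest]) -> list[dict]:
    """Flag PRs where the author sought more eyes than the process demands."""
    signals = []
    for pr in prs:
        extra = pr.reviewers_requested - REQUIRED_REVIEWERS
        if extra > 0:
            signals.append({
                "pr": pr.number,
                "extra_reviewers": extra,
                "components": pr.components_touched,
            })
    return signals
```

Even a record this crude makes the cycle visible: the same components keep attracting extra eyes, review after review.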

Deploys delayed without a clear reason

The deploy is ready. Tests pass. Approvals are in. But it doesn't ship.

Sometimes the reason is explicit: "Let's wait until after the weekend." Sometimes it's implicit: the engineer just doesn't click the button yet.

Delayed deploys often reflect caution that can't be articulated. Something about the timing, the context, or the recent history of the system makes the engineer pause.

This is valuable signal. It suggests the team doesn't fully trust the deploy process or the code being shipped. But deployment pipelines don't track hesitation. They track success and failure.

If the deploy eventually ships and nothing breaks, the hesitation vanishes from the record. If something does break, the delay might be remembered, but the reasoning behind it is lost.
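
One hedged way to make that hesitation visible, assuming your pipeline can export when a change was approved and when it actually shipped: treat an unusually long gap as something to ask about, not a metric to punish. The record shape and the 24-hour cutoff below are assumptions for illustration.

```python
# A rough sketch: the gap between "ready to ship" and "shipped" as a crude
# proxy for hesitation. Timestamps, record shape, and cutoff are illustrative.
from datetime import datetime, timedelta

HESITATION_THRESHOLD = timedelta(hours=24)  # hypothetical cutoff

def hesitation_signals(deploys: list[dict]) -> list[dict]:
    """Flag deploys that sat ready for a long time before shipping."""
    flagged = []
    for d in deploys:
        approved = datetime.fromisoformat(d["approved_at"])
        shipped = datetime.fromisoformat(d["shipped_at"])
        lag = shipped - approved
        if lag > HESITATION_THRESHOLD:
            flagged.append({
                "deploy": d["id"],
                "lag_hours": round(lag.total_seconds() / 3600, 1),
            })
    return flagged

# Example: a release approved on a Friday morning that quietly ships on Monday.
print(hesitation_signals([{
    "id": "rel-42",  # hypothetical release id
    "approved_at": "2025-11-14T10:00:00",
    "shipped_at": "2025-11-17T09:30:00",
}]))
```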

Engineers expressing unease in chat

"I'm not sure why, but this feels fragile."

"Anyone else noticing weird behaviour in staging?"

"This works, but I don't love it."

Slack and Teams are full of these messages. They're not bug reports. They're not incident declarations. They're engineers thinking out loud, sharing intuition, testing whether their unease is shared.

Most of the time, these messages get a thumbs-up or a brief reply. The conversation moves on. The observation isn't captured anywhere structured.

But these informal expressions are often the earliest indicators that something is off. They appear hours or days before alerts fire. They represent distributed sensing: engineers noticing patterns their tools can't see.

And they disappear into chat history, unsearchable and unconnected to what happens next.
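
A deliberately crude sketch of catching those messages before they vanish, assuming chat history can be exported: match a handful of hedging phrases and put the hits somewhere searchable. The phrase list and message shape are illustrative; real unease is messier than keywords, so treat this as a starting point rather than a detector.

```python
# A crude sketch: scan exported chat messages for hedging language so the
# observations land somewhere searchable. Phrases and message shape are
# illustrative assumptions, not a recommended taxonomy of unease.
import re

UNEASE_PHRASES = [
    r"feels fragile",
    r"not sure why",
    r"weird behaviour",
    r"don't love (it|this)",
    r"keep an eye on",
]
UNEASE_RE = re.compile("|".join(UNEASE_PHRASES), re.IGNORECASE)

def unease_signals(messages: list[dict]) -> list[dict]:
    """Return messages that read like early, informal risk signals."""
    return [
        {"channel": m["channel"], "author": m["author"], "text": m["text"]}
        for m in messages
        if UNEASE_RE.search(m["text"])
    ]
```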

Flaky tests that persist

A test fails intermittently. It gets re-run. It passes the second time. The build proceeds.

This happens again. And again. The test is flaky. Everyone knows it. Someone might even comment: "That one's been flaky for weeks."

But the test stays flaky. Because it's not failing hard enough to block anyone. Because fixing it would take time nobody has. Because it's easier to re-run than to investigate.

Flaky tests are symptoms. They indicate non-determinism in the system: race conditions, timing dependencies, environmental sensitivity. They're signals that something isn't quite right.

But CI systems treat them as noise. They track pass/fail, not "passed on the third attempt after two unexplained failures."

The flakiness persists until something related breaks in production. Then the team realises the flaky test was trying to tell them something all along.
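
A small sketch of what "passed on the third attempt after two unexplained failures" could look like as recorded signal rather than discarded noise, assuming your CI can export one record per test attempt in chronological order. The attempt format below is an assumption.

```python
# A sketch under assumptions: flag tests that ultimately passed but only
# after one or more unexplained failures in the same build. Attempt records
# are assumed to be in chronological order per test.
from collections import defaultdict

def flaky_signals(attempts: list[dict]) -> list[dict]:
    """Flag tests that needed retries to pass."""
    by_test: dict[str, list[str]] = defaultdict(list)
    for a in attempts:  # e.g. {"test": "test_checkout_total", "outcome": "fail"}
        by_test[a["test"]].append(a["outcome"])
    flagged = []
    for test, outcomes in by_test.items():
        if outcomes[-1] == "pass" and "fail" in outcomes[:-1]:
            flagged.append({
                "test": test,
                "attempts": len(outcomes),
                "failures_before_pass": outcomes[:-1].count("fail"),
            })
    return flagged
```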

Why existing tools can't capture these signals

The problem isn't that teams lack tools. The problem is that the tools they have are optimised for different questions.

Monitoring only sees production

Observability platforms are sophisticated. They track latency, error rates, throughput, and resource utilisation. They correlate metrics across services. They surface anomalies.

But they only see what's running.

They can't see the PR that introduced risk. They can't see the engineer who hesitated. They can't see the chat message expressing concern. They can't connect upstream decisions to downstream consequences.

By the time monitoring detects a problem, the problem is already a problem.

Incident tools only see declared incidents

Incident management systems are designed to coordinate response. They track status, assign owners, log timelines, and generate post-mortems.

But they only see what gets declared.

The wobble that self-resolved. The near-miss that nobody reported. The degradation that stayed just below the alerting threshold. These don't exist in incident tooling because nobody created a ticket.

Incident tools are reactive by design. They're optimised for what happens after someone decides something is wrong.

Metrics only capture what's easily countable

DORA metrics, sprint velocity, cycle time, deployment frequency. These are useful. They provide aggregate signals about team performance.

But they're lagging indicators built from discrete events. They tell you what happened, not why. They tell you that velocity dropped, not that it dropped because three senior engineers spent the week debugging a subtle race condition.

Metrics are compressed. They sacrifice context for comparability. That compression loses the texture that explains the numbers.

The conceptual shift

Incidents aren't binary. They don't spring into existence the moment an alert fires.

Risk exists on a continuum. It accumulates gradually, in small decisions, quiet concerns, and ignored signals. The incident is just the moment when accumulated risk becomes undeniable.

This reframing changes what teams should be paying attention to.

Context before failure matters

Post-mortems are valuable. But they're reconstructive. They try to piece together what happened after the fact, often from incomplete records and fading memories.

The context that matters most is the context that existed before failure. The reasoning behind decisions. The concerns that were raised. The signals that were present but not acted upon.

Preserving this context, not just outcomes, changes what's learnable from incidents. It allows teams to see the pattern before the pattern repeats.

Preserving reasoning over outcomes

Outcomes are binary: the deploy succeeded or it didn't. The incident was resolved or it wasn't.

But reasoning is nuanced. Why did the engineer hesitate? What made the reviewer uneasy? Why did the team decide to deploy anyway?

When only outcomes are preserved, learning is limited to "that didn't work, don't do it again." When reasoning is preserved, learning extends to "we ignored signals that indicated this might not work, and here's how to recognise them next time."

Teams don't need better alerts. They need better memory.

What this means for engineering teams

If incidents don't start at "incident," then incident prevention doesn't start at alerting thresholds.

It starts earlier. In code review. In deployment decisions. In casual Slack conversations. In the moments when engineers sense risk but have no mechanism to record it.

The teams that reduce incident frequency aren't necessarily the ones with the best monitoring. They're the ones that have learned to notice, capture, and act on early signals.

This isn't about adding more tools. It's about recognising that valuable information is already being generated. It's just not being preserved.

The question isn't "how do we detect incidents faster?" It's "how do we notice when an incident is forming?"

From reaction to recognition

The gap between sensing risk and declaring an incident is where prevention lives.

Engineers sense things. They notice when code feels fragile. They hesitate before deploys that seem risky. They express concern in chat. They add extra reviewers.

These are acts of recognition. They indicate that something in the system is worth attention.

The challenge is that recognition without recording is temporary. The insight exists in someone's head, briefly, and then it's gone. The next person to encounter the same risk has no access to what the previous person sensed.

Making recognition durable, giving teams a way to preserve and connect early signals, is what shifts the work from reactive response to proactive awareness.

Not through automation. Not through more alerts. Through memory.

Because incidents don't start at "incident." And the teams that understand this are the ones that have fewer incidents to start with.
