GitHub's February 9 Incidents: What 'Mitigation' Leaves Out
GitHub's February availability report documents two incidents just over an hour apart with the same root cause. The first mitigation was correct for what the team could see, and it was not the fix. This is a reading of the public report through two lenses, one heuristic and one systems-theoretic.
GitHub’s availability report for February 2026 covers six incidents, and the two on February 9 are worth reviewing. They were two windows of degraded service an hour and fourteen minutes apart, and GitHub’s own investigation found they came from the same underlying cause. The first was declared mitigated before the second began. The mitigation GitHub applied was a correct response to what the team could see at the time. It was also not enough to prevent the second incident.
I have sat in enough incident channels to recognize the shape of this one. A team stops the bleeding, watches the graphs recover, posts the all-clear, and then watches the same graphs go bad again. The cache-layer specifics are not the part that is interesting. The pattern is, and it is not specific to GitHub.
What GitHub said happened
The detail is in GitHub’s February 2026 availability report. The short version:
A configuration change to a user-settings caching mechanism set off a large volume of cache rewrites all at once. In the first incident, 16:12 to 17:39 UTC, the asynchronous rewrites overwhelmed a shared component that coordinates background work, and that cascaded into connection exhaustion in the service proxying Git operations over HTTPS. GitHub stopped it by disabling the async cache rewrites and restarting the Git proxy across multiple datacenters.
Then the all-clear.
The second incident ran 18:53 to 20:09 UTC. A different source of cache updates, one the first mitigation had not touched, pushed a high volume of synchronous writes. That produced replication delays, the same kind of cascade, and connection exhaustion in the same Git HTTPS proxy. GitHub disabled that source too, and restarted the proxy again.
The same component exhausted, but a different thing pushing it over. Seventy-four minutes of apparent recovery in between.
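The intervals are easy to check from the timestamps in the report. A small sketch in Python follows; the helper names are mine, and the times are the ones GitHub published.

```python
from datetime import datetime, timedelta

# Incident windows on February 9, 2026 (UTC), as given in GitHub's report.
DAY = "2026-02-09"


def at(hhmm: str) -> datetime:
    """Parse an HH:MM timestamp on the incident day."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


def minutes(delta: timedelta) -> int:
    """Whole minutes in a time difference."""
    return int(delta.total_seconds() // 60)


incident_1 = (at("16:12"), at("17:39"))
incident_2 = (at("18:53"), at("20:09"))

duration_1 = minutes(incident_1[1] - incident_1[0])  # 87
duration_2 = minutes(incident_2[1] - incident_2[0])  # 76
gap = minutes(incident_2[0] - incident_1[1])         # 74
total_degraded = duration_1 + duration_2             # 163, i.e. 2 h 43 min
```

The 74-minute gap is measured from the end of the first window to the start of the second, which is the stretch during which the all-clear was in effect.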
What public reports do and don’t tell us
GitHub’s availability report is a public disclosure. It is bound by what GitHub chose to share, which is the sequence of events, the shared root cause, and a remediation list. I do not have the on-call channel transcripts, the dashboards the responders were watching, the team’s working mental model of the system going in, what the design review for the configuration change actually covered, or the small facts that would describe what the work of responding was like in the moment. The rest of this post is therefore a reading of that disclosure. It is a hypothesis about a class of mistake, illustrated by what GitHub published. It is not a verdict on what happened in the incident room.
Any alternative history I construct from these gaps cannot be tested. It also implicitly judges the incident by a standard available only in hindsight. I want to be transparent, and mark the places where I do not know whether the alternative I am imagining would have produced a better outcome. The system could have failed in other ways. The proposed fixes carry their own failure modes. I’ll do my best to name those as I go.
The heuristic response
None of this makes GitHub careless. Put yourself in the incident for a moment. The async cache rewrites were the visible cause. Disabling them stopped the cascade, the proxy recovered, service came back. Every step is one a competent on-call engineer would take, in the order they would take it.
This is the heuristic incident response process in action. Find the visible cause. Disable it. Restart what broke. Watch the graphs recover. Post the all-clear. It is the standard playbook in most engineering organizations, and it usually works. The dashboard agrees. The metric for time-to-resolution gets its number. The incident closes and the organization moves on, as if the word “mitigated” meant what its intuitive interpretation suggests.
The problem is not anything an individual did on February 9. It is the model the playbook runs on. The model assumes that disabling the visible cause returns the system to a safe state. That assumption holds when the visible cause is the only thing that can drive the system to the failure condition. It does not hold when something else can drive the system there too, and the heuristic playbook has no step that asks which of those two situations it is in.
Feb 9 incident 1 was a case where the assumption was wrong. Disabling the async cache rewrites stopped the cascade. The proxy recovered. The all-clear got posted. Seventy-four minutes later, a different source of writes pushed the same proxy into the same failure. Good-faith engineers working inside a flawed model produced mitigation theater, because the model usually works.
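The assumption can be written down as a check the playbook never runs. This is a minimal sketch, not anything GitHub uses: the `WriteSource` type and the source names are hypothetical, and the `can_saturate_proxy` flag stands in for knowledge that, in a real incident, is exactly the part nobody has on hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteSource:
    name: str
    can_saturate_proxy: bool  # could this source, under load, drive the failure?
    disabled: bool            # has the mitigation contained it?


def residual_drivers(sources: list[WriteSource]) -> list[str]:
    """Names of sources that can still drive the system to the failure condition."""
    return [s.name for s in sources if s.can_saturate_proxy and not s.disabled]


# State at the first all-clear, as the public report describes it: the async
# rewrites were disabled, and a second source of cache updates was untouched.
# The names are illustrative labels, not identifiers from GitHub's systems.
after_first_mitigation = [
    WriteSource("async_cache_rewrites", can_saturate_proxy=True, disabled=True),
    WriteSource("sync_cache_updates", can_saturate_proxy=True, disabled=False),
]

remaining = residual_drivers(after_first_mitigation)
safe_to_close = not remaining  # False: one driver is still live
```

The check itself is trivial. The hard part is filling in the list, and that is the step the heuristic playbook does not ask for.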
Heuristics vs Systems
The heuristic incident response is exactly that: a heuristic. It’s not a systems theory. Before going further, it is worth setting the two views side by side, because they read the same incident pair quite differently.
The heuristic view. Connection-pool exhaustion in the Git HTTPS proxy is a failure mode. Different write sources can drive the proxy to that state. The heuristic addresses the source that was active when the incident started. Widening the scope, alerting on saturation directly, and inventorying the other writers are extensions of the heuristic, not departures from it.
The workflow under that heuristic runs like this. The responder watches the dashboard, sees the visible cause stop misbehaving, and calls it mitigated. The dashboard agrees. Whatever made the proxy vulnerable to that load remains exposed to the next thing that reaches it.
There is a plainer reason this happens than “people are not careful enough.” The dashboard is in front of the responder and the rest of the system is not. When the visible problem stops, staying in the incident is expensive, and the question that might matter next, what else reaches this same vulnerability, points at code and dependencies the dashboard does not show. Closing the incident is easy. Doing the inventory is not. That is not a diligence problem. It is a problem with what the incident tooling puts in front of the responder.
GitHub’s own remediation list, written after both incidents, goes after the mechanism: optimize the caching mechanism so it stops amplifying writes, add self-throttling for bulk updates, and fix the connection-exhaustion behavior in the proxy so it recovers on its own instead of needing a manual restart. The team got to that scope eventually. Two incidents and two hours and forty-three minutes of degraded service were the price of getting there.
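To make “self-throttling for bulk updates” concrete, here is one common way to do it, a token bucket. The report does not say how GitHub implemented its throttle, so this is an illustration of the idea only; `TokenBucket` and `rewrite_in_bulk` are names I made up.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow at most `rate` operations per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; otherwise refuse without blocking."""
        now = self._clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


def rewrite_in_bulk(keys, bucket, rewrite):
    """Rewrite what the bucket allows right now; return the keys deferred."""
    deferred = []
    for key in keys:
        if bucket.try_acquire():
            rewrite(key)
        else:
            deferred.append(key)
    return deferred
```

The design point is that the writer limits itself at the source, so a configuration change that invalidates many entries at once produces a backlog to work through later, not a burst that lands on shared components.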
The systems view. STAMP, the safety framework Nancy Leveson developed and laid out in Engineering a Safer World, looks at the same incident pair and reframes the question. The proxy did not have a broken part waiting to be triggered. There was a control structure around the proxy, with controllers issuing actions and getting feedback through loops. They were supposed to keep the system inside its safety constraints. Several had gaps. The question STAMP asks is which control action or feedback signal, if it had been different, would have kept the system out of an unsafe state.
Let’s look at four entities in the model:
- The design reviewer. Whoever signed off on the design change before it shipped. The furthest upstream controller. They decide whether a change leaves the system safe, working from their own picture of how the proxy and the other write sources will behave. This was probably not a single person, though it could have been.
- The on-call responder. Whoever was paged when the proxy started failing at 16:12 UTC. Deciding in real time what is happening, what to do, and when to stop. Working from the dashboard, the alert that paged them, and their own picture of the system. This was probably not a single person either; in this case it is unlikely it was.
- The dashboard. Not a person, but the feedback channel the responder is working from. The dashboard decides which questions the responder can answer in real time and which ones they have to guess at.
- The alerting policy. Also not a person. The alerting policy decides what counts as the proxy being in trouble, what counts as fine, and what gets escalated. Its definitions are designed in advance and only updated when an incident proves them inadequate.
With those four in hand, here is how I imagine the control structure failed on February 9.
The design review released a configuration change. The interaction between the new write pattern and existing write sources under load was not surfaced in time to prevent the incident. The on-call responder went into incident 1 with a working assumption that disabling the visible path would return the system to a safe state. That was a reasonable read given what the dashboard showed, but the assumption did not match the actual system. The dashboard showed proxy health and observable load, and it did not show write-source coverage, so there was no signal that would have told the responder whether the system was actually in a safe state at 17:39 UTC. The alerting policy treated the visible path going quiet as “clear,” with no separate check that the proxy was safe under future load from elsewhere.
Each of those is a place where the larger system gave the people inside it no way to catch the problem in time. The failure was prepared in the control structure around the proxy, well before either incident window opened.
That is a different shape of analysis, and it points at different fixes. The heuristic view ends at the failure mode: alert on it, inventory the writers, scope the mitigation wider. The systems view says the design review is missing a check, the dashboard is missing a signal, the alerting policy is missing a verification step, and the responder is working from a mental picture of the system that no one has any way to update without an incident. A team can pick where to intervene based on where the leverage is highest and where it can afford to act.
A second case
GitHub also had an Actions incident on May 5, 2026. A routine scale-up of hosted-runner VMs in one region hit an internal rate limit while the VMs were pulling images from storage. From the incident disclosure: “Existing backoff logic was not triggered because of the response code returned in this case.” There was a backoff mechanism. It was correct. It did not fire, because it was watching for particular response codes, and the code that arrived was not one of them.
Using the entities modeled above again, here’s how I imagine the event emerged. Months before the scale-up, the design review for the backoff logic shipped a list of response codes the designer believed meant “rate-limited.” The list did not include the actual code the storage service returned in the May 5 case. The on-call responder paged during the scale-up went into the incident with the backoff already deployed as the safety mechanism for exactly this kind of overload. The working picture started with “the safety mechanism is in place,” so the actual failure, the safety mechanism itself being bypassed, sat outside the responder’s expectations. The dashboard showed runners failing to pull images, and it did not show whether the backoff was firing or what response codes the storage service was returning, so there was no quick signal that the safety mechanism itself was the source of the problem. The alerting policy was wired to the same response codes as the backoff, which meant nothing fired to flag that the backoff was misconfigured for the rate limit it was meeting.
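The failure is easy to reproduce in miniature. In the sketch below the status codes are hypothetical: the disclosure says which kind of thing went wrong, not which code the backoff watched for or which one the storage service returned. I use 429 and 503 purely as stand-ins.

```python
# Hypothetical codes. The disclosure does not name the code the backoff
# recognized or the code the storage service actually returned.
BACKOFF_CODES = frozenset({429})


def next_delay(status: int, attempt: int,
               base: float = 0.5, cap: float = 30.0) -> float:
    """Seconds to wait before retrying; 0.0 means retry immediately."""
    if status in BACKOFF_CODES:
        # Exponential backoff, capped: 0.5, 1, 2, 4, ... up to `cap` seconds.
        return min(cap, base * (2 ** attempt))
    # Any code outside the list is treated as "not rate-limited".
    return 0.0


recognized = next_delay(429, attempt=3)    # 4.0: the mechanism fires
unrecognized = next_delay(503, attempt=3)  # 0.0: same overload, no backoff
```

Nothing in `next_delay` is buggy. The logic is correct for the list it was given, and the list is where the designer’s model of the upstream lives.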
GitHub’s response widens across the three incidents. Feb 9 incident 1 was mitigated heuristically: disable the path, restart the proxy, declare mitigation. Then incident 2 happened. The remediation list GitHub published was wider and more systemic: optimize the caching mechanism so it stops amplifying writes, add self-throttling for bulk updates, fix the connection-exhaustion behavior in the proxy so it recovers on its own. The remediation for the May 5 incident reads more like that second move than the first, though it still leaves gaps.
GitHub’s remediation for May 5, reviewing rate limits end-to-end for similar operations, is a defense against unsafe control actions: in this case, the backoff system not providing the “back off” control action when the upstream is rate-limiting. Viewed through three defense lenses:
- Prevention. The end-to-end review makes the backoff logic’s model of “what counts as rate-limited” more complete across operations that share the risk. This is where GitHub’s remediation lands.
- Detection. Nothing in the remediation adds a signal that fires when the backoff should fire but doesn’t. The dashboard and the alerting policy were wired to the same response codes as the backoff, so all three were blind to the same gap and, unless that is addressed, might still be.
- Correction. Nothing describes a fallback if the backoff is bypassed again by some other unrecognized rate-limiting signal. The system remains exposed to the same class of failure on any upstream contract change the prevention layer has not yet caught up with.
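Here is one shape the missing detection and correction layers could take. It is a sketch under my own assumptions, not something drawn from GitHub’s remediation: `BackoffWatchdog`, its window size, and its threshold are all hypothetical. The point is that it watches outcomes, not response codes, so it does not share the backoff’s blind spot.

```python
from collections import deque


class BackoffWatchdog:
    """Track recent request outcomes independently of response codes."""

    def __init__(self, window: int = 20, failure_threshold: float = 0.5) -> None:
        self.failure_threshold = failure_threshold
        # Each entry is a (failed, backoff_engaged) pair for one request.
        self._outcomes = deque(maxlen=window)

    def record(self, failed: bool, backoff_engaged: bool) -> None:
        self._outcomes.append((failed, backoff_engaged))

    def _failure_rate(self) -> float:
        if not self._outcomes:
            return 0.0
        failures = sum(1 for failed, _ in self._outcomes if failed)
        return failures / len(self._outcomes)

    def backoff_bypassed(self) -> bool:
        """Detection: window is full, failures are high, backoff never engaged."""
        full = len(self._outcomes) == self._outcomes.maxlen
        engaged = any(b for _, b in self._outcomes)
        return (full
                and self._failure_rate() >= self.failure_threshold
                and not engaged)

    def fallback_delay(self, base: float = 1.0) -> float:
        """Correction: slow down on sustained failure, whatever the code was."""
        return base if self.backoff_bypassed() else 0.0
```

This carries its own failure modes. A watchdog like this can slow traffic during an outage that has nothing to do with rate limiting, and its thresholds are one more set of definitions designed in advance.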
Closing
The heuristic incident response process answers a particular question: what caused the visible failure, and what stopped it. It is a useful question, and most production engineering is organized around it. There is another question the heuristic does not naturally ask: what was the system designed to prevent, and what did the defenses around that prevention actually do. The Feb 9 pair and the May 5 backoff are two cases where the second question is the one the public report leaves on the table.
Both questions are legitimate. The heuristic is what runs during an incident, because the first question is the one the on-call engineer can answer in real time with the tools in front of them. The systems lens is what makes the second question answerable, applied before the change ships or after the incident closes. Knowing both is what lets a team see which kind of work it is doing.
We are building Revelara around the substrate the second question needs: a body of knowledge pulled from publicly disclosed incidents, through lenses that model the safety of the system, not just the failure mechanism each one describes.