It only takes four to six seconds for a troubled 4ESS switch to rid itself of all its calls, drop everything temporarily, and re-boot its software from scratch. Starting over from scratch will generally rid the switch of any software problems that may have developed in the course of running the system. Bugs that arise will be simply wiped out by this process. It is a clever idea. This process of automatically re-booting from scratch is known as the "normal fault recovery routine." Since AT&T's software is in fact exceptionally stable, systems rarely have to go into "fault recovery" in the first place; but AT&T has always boasted of its "real world" reliability, and this tactic is a belt-and-suspenders routine.
The 4ESS switch used its new software to monitor its fellow switches as they recovered from faults. As other switches came back on line after recovery, they would send their "OK" signals to the switch.
The switch would make a little note to that effect in its "status map,"
recognizing that the fellow switch was back and ready to go, and should be sent some calls and put back to regular work.
Unfortunately, while it was busy bookkeeping with the status map, the tiny flaw in the brand-new software came into play.
The flaw caused the 4ESS switch to interact, subtly but drastically, with incoming telephone calls from human users. If--and only if--two incoming phone-calls happened to hit the switch within a hundredth of a second, then a small patch of data would be garbled by the flaw.
But the switch had been programmed to monitor itself constantly for any possible damage to its data.
When the switch perceived that its data had been somehow garbled, then it too would go down, for swift repairs to its software.
It would signal its fellow switches not to send any more work.
It would go into the fault-recovery mode for four to six seconds.
And then the switch would be fine again, and would send out its "OK, ready for work" signal.
However, the "OK, ready for work" signal was the VERY THING THAT HAD CAUSED THE SWITCH TO GO DOWN IN THE FIRST PLACE. And ALL the System 7 switches had the same flaw in their status-map software.
As soon as they stopped to make the bookkeeping note that their fellow switch was "OK," then they too would become vulnerable to the slight chance that two phone-calls would hit them within a hundredth of a second.
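The chain of events just described can be caricatured in a few lines of code. What follows is only a toy sketch, not a model of the real 4ESS software: the function name simulate_cascade, the round-by-round timing, and the single probability p_hit (standing in for "two calls landed within a hundredth of a second while the status map was being updated") are all assumptions made for illustration.

```python
import random

def simulate_cascade(num_switches=20, p_hit=0.3, rounds=10, seed=0):
    """Toy model of the "OK" message chain reaction (illustrative only).

    Each round, every switch that has just recovered broadcasts "OK".
    Every other switch notes each "OK" in its status map, and each such
    note carries a chance p_hit (an assumed figure) of garbling its data
    and sending it into fault recovery for the next round.
    Returns the number of switches knocked down in each round.
    """
    rng = random.Random(seed)
    broadcasting = {0}            # switch 0 had the original, legitimate fault
    down_per_round = []
    for _ in range(rounds):
        newly_down = set()
        for switch in range(num_switches):
            if switch in broadcasting:
                continue          # this switch is the one announcing "OK"
            for _msg in broadcasting:
                # One bookkeeping note per "OK" heard, one chance of bad luck.
                if rng.random() < p_hit:
                    newly_down.add(switch)
                    break
        down_per_round.append(len(newly_down))
        # Whoever went down now recovers and broadcasts "OK" next round.
        broadcasting = newly_down
    return down_per_round

if __name__ == "__main__":
    print(simulate_cascade())
```

With p_hit set to zero nothing ever spreads; with any value above zero, each recovery gives the rest of the network a fresh chance to fall over, which is the heart of the trouble.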
At approximately 2:25 P.M. EST on Monday, January 15, one of AT&T's 4ESS toll switching systems in New York City had an actual, legitimate, minor problem. It went into fault recovery routines, announced "I'm going down," then announced, "I'm back, I'm OK." And this cheery message then blasted throughout the network to many of its fellow 4ESS switches.
Many of the switches, at first, completely escaped trouble.
These lucky switches were not hit by the coincidence of two phone calls within a hundredth of a second.
Their software did not fail--at first. But three switches--in Atlanta, St. Louis, and Detroit--were unlucky, and were caught with their hands full. And they went down.
And they came back up, almost immediately. And they too began to broadcast the lethal message that they, too, were "OK" again, activating the lurking software bug in yet other switches.
As more and more switches did have that bit of bad luck and collapsed, the call-traffic became more and more densely packed in the remaining switches, which were groaning to keep up with the load. And of course, as the calls became more densely packed, the switches were MUCH MORE LIKELY to be hit twice within a hundredth of a second.
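The arithmetic behind that "MUCH MORE LIKELY" can be sketched roughly. Suppose, purely as an assumption for illustration, that calls arrive at a switch at random as a Poisson stream; then the chance that a second call lands within a given window of the first is 1 - exp(-rate * window). The call rates below are made-up figures, not measurements from the 4ESS.

```python
import math

def p_second_call_within(window_s, calls_per_second):
    """Chance that the next call arrives within window_s seconds of a
    given call, assuming Poisson arrivals (an illustrative assumption)."""
    return 1.0 - math.exp(-calls_per_second * window_s)

if __name__ == "__main__":
    # Hypothetical loads on one switch, in calls per second.
    for rate in (1, 10, 100):
        chance = p_second_call_within(0.01, rate)
        print(f"{rate:>4} calls/sec -> {chance:.3f}")
```

Under these assumptions the risk climbs from about one in a hundred at one call per second to better than even odds at a hundred calls per second, so every switch that dropped out made its surviving neighbors likelier to follow.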
It only took four seconds for a switch to get well.
There was no PHYSICAL damage of any kind to the switches, after all. Physically, they were working perfectly.
This situation was "only" a software problem.
But the 4ESS switches were leaping up and down every four to six seconds, in a virulent spreading wave all over America, in utter, manic, mechanical stupidity. They kept KNOCKING one another down with their contagious "OK" messages.
It took about ten minutes for the chain reaction to cripple the network.
Even then, switches would periodically luck out and manage to resume their normal work. Many calls--millions of them--were managing to get through. But millions weren't.
The switching stations that used System 6 were not directly affected.
Thanks to these old-fashioned switches, AT&T's national system avoided complete collapse. This fact also made it clear to engineers that System 7 was at fault.
Bell Labs engineers, working feverishly in New Jersey, Illinois, and Ohio, first tried their entire repertoire of standard network remedies on the malfunctioning System 7. None of the remedies worked, of course, because nothing like this had ever happened to any phone system before.
By cutting out the backup safety network entirely, they were able to reduce the frenzy of "OK" messages by about half. The system then began to recover, as the chain reaction slowed. By 11:30 P.M. on Monday, January 15, sweating engineers on the midnight shift breathed a sigh of relief as the last switch cleared up.
By Tuesday they were pulling all the brand-new 4ESS software and replacing it with an earlier version of System 7.
If these had been human operators, rather than computers at work, someone would simply have eventually stopped screaming. It would have been OBVIOUS that the situation was not "OK," and common sense would have kicked in. Humans possess common sense--at least to some extent. Computers simply don't.
On the other hand, computers can handle hundreds of calls per second. Humans simply can't. If every single human being in America worked for the phone company, we couldn't match the performance of digital switches: direct-dialling, three-way calling, speed-calling, call-waiting, Caller ID, all the rest of the cornucopia of digital bounty. Replacing computers with operators is simply not an option any more.
And yet we still, anachronistically, expect humans to be running our phone system. It is hard for us to understand that we have sacrificed huge amounts of initiative and control to senseless yet powerful machines.
When the phones fail, we want somebody to be responsible.
We want somebody to blame.
When the Crash of January 15 happened, the American populace was simply not prepared to understand that enormous landslides in cyberspace, like the Crash itself, can happen, and can be nobody's fault in particular. It was easier to believe, maybe even in some odd way more reassuring to believe, that some evil person, or evil group, had done this to us.
"Hackers" had done it. With a virus. A trojan horse.
A software bomb. A dirty plot of some kind. People believed this, responsible people. In 1990, they were looking hard for evidence to confirm their heartfelt suspicions.
And they would look in a lot of places.
Come 1991, however, the outlines of an apparent new reality would begin to emerge from the fog.
On July 1 and 2, 1991, computer-software collapses in telephone switching stations disrupted service in Washington DC, Pittsburgh, Los Angeles and San Francisco.
Once again, seemingly minor maintenance problems had crippled the digital System 7. About twelve million people were affected in the Crash of July 1, 1991.
Said the New York Times Service: "Telephone company executives and federal regulators said they were not ruling out the possibility of sabotage by computer hackers, but most seemed to think the problems stemmed from some unknown defect in the software running the networks."
And sure enough, within the week, a red-faced software company, DSC Communications Corporation of Plano, Texas, owned up to "glitches" in the "signal transfer point" software that DSC had designed for Bell Atlantic and Pacific Bell.
The immediate cause of the July 1 Crash was a single mistyped character: one tiny typographical flaw in one single line of the software. One mistyped letter, in one single line, had deprived the nation's capital of phone service.
It was not particularly surprising that this tiny flaw had escaped attention: a typical System 7 station requires TEN MILLION lines of code.
On Tuesday, September 17, 1991, came the most spectacular outage yet.
This case had nothing to do with software failures--at least, not directly.
Instead, a group of AT&T's switching stations in New York City had simply run out of electrical power and shut down cold. Their back-up batteries had failed. Automatic warning systems were supposed to warn of the loss of battery power, but those automatic systems had failed as well.
This time, Kennedy, La Guardia, and Newark airports all had their voice and data communications cut.
This horrifying event was particularly ironic, as attacks on airport computers by hackers had long been a standard nightmare scenario, much trumpeted by computer-security experts who feared the computer underground. There had even been a Hollywood thriller about sinister hackers ruining airport computers--DIE HARD II.
Now AT&T itself had crippled airports with computer malfunctions--not just one airport, but three at once, some of the busiest in the world.
Air traffic came to a standstill throughout the Greater New York area, causing more than 500 flights to be cancelled, in a spreading wave all over America and even into Europe. Another 500 or so flights were delayed, affecting, all in all, about 85,000 passengers.
(One of these passengers was the chairman of the Federal Communications Commission.)
Stranded passengers in New York and New Jersey were further infuriated to discover that they could not even manage to make a long distance phone call, to explain their delay to loved ones or business associates. Thanks to the crash, about four and a half million domestic calls, and half a million international calls, failed to get through.
The September 17 NYC Crash, unlike the previous ones, involved not a whisper of "hacker" misdeeds. On the contrary, by 1991, AT&T itself was suffering much of the vilification that had formerly been directed at hackers. Congressmen were grumbling.
So were state and federal regulators. And so was the press.
For their part, ancient rival MCI took out snide full-page newspaper ads in New York, offering their own long-distance services for the "next time that AT&T goes down."
"You wouldn't find a classy company like AT&T using such advertising,"
protested AT&T Chairman Robert Allen, unconvincingly. Once again, out came the full-page AT&T apologies in newspapers, apologies for "an inexcusable culmination of both human and mechanical failure."
(This time, however, AT&T offered no discount on later calls.
Unkind critics suggested that AT&T was worried about setting any precedent for refunding the financial losses caused by telephone crashes.)
Industry journals asked publicly if AT&T was "asleep at the switch."
The telephone network, America's purported marvel of high-tech reliability, had gone down three times in 18 months. Fortune magazine listed the Crash of September 17 among the "Biggest Business Goofs of 1991,"
cruelly parodying AT&T's ad campaign in an article entitled "AT&T Wants You Back (Safely On the Ground, God Willing)."
Why had those New York switching systems simply run out of power?