
Spanning-Tree Stops Trains

12 years 10 months ago #36856 by TheBishop
Saw this today on Slashdot:

"The railway signaling failure which crippled Sydney on April 12 (some commuters reported trips of more than three hours) was caused by a failing LAN switch and software that couldn’t cope, an engineering report has found.
The switch, probably a Cisco device (Railcorp’s dominant LAN kit supplier) was part of the network in the Sydenham signaling station. That facility governs signaling for a large chunk of the Sydney rail network.
The guilty switch suffered partial failure of two electrolytic capacitors (probably the power supply). The switch is part of a dual redundant LAN which is supposed to be resilient to failure; however, the configuration couldn't handle an intermittent breakdown.
With the caps failing, the switch would shut down and try to re-start itself. This, the engineer’s report says, meant the Sydenham LAN was “caught in a cycle where it was continually trying to reconfigure itself to address the changing state of the network.”
It only took a little over ten minutes for technical staff to initiate a disaster recovery plan, but the procedure took more than an hour to complete. In that time, the software that governs the trains, known as ATRICS, was unable to cope with the flaky network. This led to a knock-on effect, taking out a system called Microloc at another station, Revesby.
With ATRICS and Microloc both failing, the rail network failed to a “safe state” in which the trains were halted where they were. Because of the hugely interdependent state of the Sydney rail network, 847 trains were delayed, 240 were cancelled, and it took the rest of April 12th for the system to recover."

Moral: We all depend on spanning-tree and assume we have resilient architectures, but spanning-tree doesn't do too well with repeated intermittent faults that come and go faster than its convergence time. The world will always need engineers...
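To put rough numbers on that last point, here is a toy sketch (not real spanning-tree, and nothing from the engineering report). It assumes every topology change restarts a fixed convergence timer and that nothing is forwarded until the timer expires. The 30-second figure is my assumption, based on the classic 802.1D defaults of 15 seconds listening plus 15 seconds learning; the report does not say which spanning-tree variant or timers were in use. The function name and parameters are made up for illustration.

```python
def forwarding_seconds(flap_period, convergence_time, duration):
    """Count the seconds in which a toy network is converged and forwarding.

    Toy model (an assumption, not real STP): a topology change occurs every
    `flap_period` seconds, and each change restarts a `convergence_time`
    timer during which no traffic is forwarded.
    """
    forwarding = 0
    since_change = 0  # seconds elapsed since the last topology change
    for t in range(duration):
        if t % flap_period == 0:
            since_change = 0  # the switch went down or came back: start over
        if since_change >= convergence_time:
            forwarding += 1  # the network has settled, so traffic flows
        since_change += 1
    return forwarding


if __name__ == "__main__":
    hour = 3600
    # A fault that recurs every 20 s never lets a 30 s timer expire.
    print(forwarding_seconds(flap_period=20, convergence_time=30, duration=hour))
    # A fault that recurs every 2 minutes still leaves most of the hour usable.
    print(forwarding_seconds(flap_period=120, convergence_time=30, duration=hour))
```

In this model, a switch that reboots every 20 seconds gives zero seconds of forwarding in an hour, while one that reboots every two minutes still leaves 2700 seconds. A clean, permanent failure would cost a single convergence period and nothing more, which is why the intermittent fault was so much worse than a dead switch.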
12 years 10 months ago #36870 by next_virus
Excellent article.
12 years 10 months ago #36871 by Arani
Replied by Arani on topic ...
Informative. Just trying to grasp the domino effect that leads from two failing electrolytic capacitors to 847 delays and 240 cancellations.

Picking pebbles on the shore of the networking ocean
12 years 10 months ago #36892 by S0lo
Replied by S0lo on topic Re: Spanning-Tree Stops Trains
Cisco. Lesson learned the hard way! Well, let's just hope that it's learned.

Studying CCNP...

Ammar Muqaddas
Forum Moderator
www.firewall.cx
12 years 10 months ago #36900 by jester
Very informative article
Accidents and failures make us change the rules.