Feb 2019
Importance of Technology Investments – Amherst Network Issues
Amherst College, a premier residential liberal arts college, lost network connectivity for almost a week. The college hosted almost all of its technology services locally, which meant that nearly everything was inaccessible for that period: email, the learning management system, the web site, administrative systems and so on. Faculty could not connect to the web from classrooms, and students needed to use their cell phones to connect to the outside world. As one of them tweeted, students who could not afford unlimited data plans were limited in doing even this. You can read about the details here.
The IT staff did a remarkable job given the circumstances and had the community's support throughout, based on what I have heard. I am so thankful to them for coming out and sharing their experiences openly with their colleagues. This is so important for the rest of us to learn from, not just the technology piece, but also how best to manage such a crisis.
What really happened? It is a complicated story on a lot of fronts, but the core issue behind this outage is a lack of investment in network hardware. Because they are still running on hardware that is pretty old, their network is configured as a “flat” network (Layer 2), where every device shares one large broadcast domain. Most modern networks are Layer 3 networks, where we can segment the network based on a variety of criteria, such as separate segments for particular buildings or for connections from classrooms.
Amherst suffered what is called a MAC flap storm. Each network device has a unique address, called the MAC address, and networks rely on this uniqueness to forward data to the appropriate device. Anything that breaks that assumption can “flood” the network, and it is especially bad in Layer 2 networks, where it can basically cripple the entire network. This can happen either because network cabling creates a loop or because of a misconfiguration, either of which can make the same MAC address appear on two or more ports. This is most probably what happened in Amherst's case. The worst thing about a MAC flap storm is that, on older hardware, there is no easy way to detect it!
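To make the idea of a "flap" concrete, here is a minimal sketch in Python of how a switch's MAC learning table might notice the same address bouncing between ports. This is a toy model, not how any particular vendor implements it: the `MacTable` class, the `flap_threshold` parameter and the sample addresses are all made up for illustration, and real switches typically count moves within a time window rather than over all time.

```python
from collections import defaultdict


class MacTable:
    """Toy model of a Layer 2 switch's MAC learning table (hypothetical)."""

    def __init__(self, flap_threshold=3):
        self.port_for_mac = {}              # MAC address -> port it was last seen on
        self.flap_count = defaultdict(int)  # MAC address -> number of port moves
        self.flap_threshold = flap_threshold

    def learn(self, mac, port):
        """Record that `mac` was seen as a source address on `port`.

        Returns True once this MAC has moved between ports at least
        `flap_threshold` times, which we treat as a sign of flapping.
        """
        previous = self.port_for_mac.get(mac)
        if previous is not None and previous != port:
            self.flap_count[mac] += 1
        self.port_for_mac[mac] = port
        return self.flap_count[mac] >= self.flap_threshold


# Made-up traffic: the first address keeps appearing on ports 1 and 7.
table = MacTable(flap_threshold=3)
frames = [
    ("aa:bb:cc:00:00:01", 1),
    ("aa:bb:cc:00:00:02", 2),
    ("aa:bb:cc:00:00:01", 7),  # first move
    ("aa:bb:cc:00:00:01", 1),  # second move
    ("aa:bb:cc:00:00:01", 7),  # third move, threshold reached
]
alerts = [mac for mac, port in frames if table.learn(mac, port)]
print(alerts)
```

In a storm, this kind of movement happens many times a second for many addresses at once, which is why the switch can no longer forward traffic reliably.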
Modern network hardware and changes to configurations in the past few years have helped many of us avoid this. Newer network switches can be configured to detect such anomalies and essentially “kill those connections”. Even without that, a segmented Layer 3 network will contain the storm to a single segment. This makes detecting the problem a little easier, and the rest of the network will work just fine. Of course, if the affected segment is where all your services run, then everyone will be down. Careful design of the network and building in redundancy can help avoid such situations.
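The containment argument can be sketched in a few lines of Python. This is only an illustration under simple assumptions: each segment is treated as its own broadcast domain, a storm reaches every host in the segment where it starts and nothing else, and the segment and host names are invented.

```python
def affected_hosts(segments, origin_host):
    """Return the hosts that share a segment with `origin_host`.

    `segments` maps a segment name to the list of hosts on it. Under the
    simplifying assumption above, these are the hosts a storm would reach.
    """
    for hosts in segments.values():
        if origin_host in hosts:
            return set(hosts)
    return set()


# Hypothetical campus: one flat network versus three segments.
flat = {"campus": ["mail", "lms", "web", "classroom-pc", "dorm-pc"]}
segmented = {
    "servers": ["mail", "lms", "web"],
    "classrooms": ["classroom-pc"],
    "dorms": ["dorm-pc"],
}

print(sorted(affected_hosts(flat, "dorm-pc")))       # every host on campus
print(sorted(affected_hosts(segmented, "dorm-pc")))  # only the dorm segment
print(sorted(affected_hosts(segmented, "mail")))     # the whole server segment
```

The last call shows the caveat mentioned above: a storm that starts in the server segment still takes down every service hosted there, even though classrooms and dorms stay up.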
The other strategy is to move services to the cloud. Doing so mitigates the risk, in that the services are generally spread out and the cloud vendors themselves think a lot about redundancy, so outages are minimal. Even if the campus network is down, those with cell phones or connections from home can still reach the services. While this helps, classrooms still cannot function!
Regardless of what we do, the network is fundamental to technology use on campus. Even if our strategy is to move to the cloud, we need the network on campus to perform without failure. This requires considerable investment to make sure that we keep the hardware refreshed and our network staff well trained. Given the financial situation in higher ed, this sometimes does not rise in priority. I wrote a blog post in 2016 on this subject titled “Technology – Lessons to Learn from Deferred Maintenance Mistakes”. I don’t want to repeat what I said there, but whatever I mentioned there applies today too.
One of the popular suggestions (including from me) is to create a “rainy day” fund that accumulates money annually, to be used when the time comes for a refresh. While this is one approach, I don’t think it is practical, for a variety of reasons. For one, given the current financial climate, institutions are unlikely to be willing to put away money annually, and everyone’s eyes will be on the fund once it grows to a sizable amount.
An alternative way to get network and other hardware refreshed is to get out of the ownership model and move to a lease model. If you negotiate the lease terms carefully, this eliminates the need for new funding requests at refresh time and turns the cost into a predictable annual commitment. And when the lease ends, you practically have no choice but to look at a refresh. Yes, you can buy back the old hardware, but cost considerations would tell you it is probably not a wise decision. We moved to this model about 3 years ago.
Setting aside the mechanics of exactly how to fund the infrastructure, senior administration needs to understand the importance of network infrastructure and allocate appropriate resources to keeping it current and refreshed. I know that many of my colleagues are sharing the Amherst story with their senior administrators to advocate for such a commitment. While it will yield results in some cases, in many cases I fear that people will shrug it off, say “what are the chances that this will happen to us?” and move on.
That is, until they become the next Amherst story!