Spring 2022 – Tech Issues on First Day of Classes
It will soon be two years since the COVID-19 crisis struck the United States, turning all of our lives upside down. Higher ed institutions faced disruptions that we never imagined would happen in our lifetimes. Despite the severity of the crisis and the disruption it caused, thanks to the wide availability of the internet and advances in technology, we did fairly well. However, it's fair to say that we have all lost two years of our lives to the crisis because we simply could not live the way we were used to. This is why, when we began planning for last fall, there was a sense of excitement that we might be inching toward a different world and more normalcy. That all turned south again with a second wave and then the havoc caused by the Omicron variant. And here we are, a semester later: the crisis seems to continue, though there are signs of it subsiding, and we never know what the future holds.
As the Wellesley operations team discussed plans for the spring semester in early January, it became increasingly clear that when the students returned to campus in mid-January, the Omicron surge was not going to be under control. At the same time, the administration was reluctant to move to a remote learning environment unless it was absolutely necessary. After a lot of deliberation, the decision was made that we would be remote just for the first week of the spring semester. The rationale was that the greatest risk we faced was related to students traveling from elsewhere and coming to campus. By asking the students to arrive as planned and ensuring they all received two negative tests before being released from in-room restrictions, the community could feel safe. And the way to accomplish this was to make the first week remote.
This made us scramble to make some adjustments to the network… and that is what resulted in an issue we wish had not happened!
When everyone was home and learning remotely, the on-campus network was not taxed at all, because all the Zoom traffic went from home networks directly to Zoom and back. However, with all students on campus, our network has to handle literally thousands of simultaneous Zoom sessions during class meeting times. When the crisis hit us in March 2020, we doubled our internet connectivity to 4 Gbps, but even after all the students returned to campus it was not used that heavily. And though we had 4 Gbps of internet bandwidth, we noticed that our firewall did not have the capacity to handle that level of traffic.
Though we had ordered and received a higher-capacity firewall, other priority projects, along with the fact that the current firewall was handling all of our traffic, made us wait a bit to install it. These installations are time consuming, and finding a stretch of time when the network can be taken down for a significant period is extremely hard. When we were fully remote for three days after Thanksgiving, we saw that the firewall was being heavily taxed, and we decided we needed to do something as soon as feasible. Unfortunately, the only window of time we had was mid-January.
When we heard the final decision to have one week of remote classes, we scrambled to identify a time to install the new firewall. Despite having an expert consultant and Cisco (our network vendor) supporting our excellent network team, it took us three attempts (yes, three!) to install it successfully. By the time this was completed, it was the Friday before the start of classes.
We monitored the network over the weekend, when usage was heavy because most students had arrived by then, and we felt good about how it behaved.
On Monday, when classes began, everything initially held up and we were happy. However, around 9:30 we started hearing about problems. We quickly brought in the consultant and Cisco. The firewall seemed to be working fine from the standpoint of the standard metrics, but our network engineer found that the issue was the firewall running out of resources for Network Address Translation (NAT).
NAT, in simple terms, is used to map addresses on a local network to one or more outgoing IP addresses. This is what happens in all of our home networks, where our devices get private 10.x (or similar) IP addresses and the internet router presents its own public IP address to the outside world as the one generating the traffic. For this to work, the device doing the translation needs to keep track of the mapping (that is, which internal computer sent which packet) so that when it receives a response, it can reverse the mapping and send it to the right computer.
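For those who like to see the bookkeeping, here is a minimal sketch of the kind of translation table a NAT device maintains. This is purely illustrative Python, not our firewall's actual logic; the addresses, port scheme, and limits are assumptions for the example.

```python
# Illustrative sketch of NAT/PAT bookkeeping (not the firewall's real implementation):
# each outbound flow consumes a translation entry, and the table is used in
# reverse to deliver replies to the right internal host.

class NatTable:
    def __init__(self, public_ip, max_port=65535):
        self.public_ip = public_ip
        self.next_port = 1024              # first translated port we hand out (assumed)
        self.max_port = max_port
        self.outbound = {}                 # (private_ip, private_port) -> public_port
        self.inbound = {}                  # public_port -> (private_ip, private_port)

    def translate_out(self, private_ip, private_port):
        """Map an internal flow to (public_ip, public_port); fail when ports run out."""
        key = (private_ip, private_port)
        if key not in self.outbound:
            if self.next_port > self.max_port:
                # Roughly the condition the firewall hit: no translation resources left.
                raise RuntimeError("NAT resources exhausted")
            self.outbound[key] = self.next_port
            self.inbound[self.next_port] = key
            self.next_port += 1
        return self.public_ip, self.outbound[key]

    def translate_in(self, public_port):
        """Reverse-map a reply back to the internal host that started the flow."""
        return self.inbound[public_port]


nat = NatTable("203.0.113.10")                   # documentation-range public IP
print(nat.translate_out("10.20.30.40", 51515))   # ('203.0.113.10', 1024)
print(nat.translate_in(1024))                    # ('10.20.30.40', 51515)
```

The point is that every active flow consumes an entry, and with port translation, a port on a public address. That finite pool is the resource the firewall ran out of.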
We had recently converted the College network to the 10.x address space, and when the students came back, they all received new IP addresses. Since everything worked fine during the weekend, we did not anticipate this would be an issue. But the additional faculty and staff on campus on Monday tipped the scale, and the firewall's ability to keep up with the mapping was being taxed.
Our network engineers had a quick solution: add more outgoing addresses to map to. But this did not work, which was frustrating. Despite involving the consultant and Cisco support, we could not figure out why it was not working. Around 2 PM we made the decision to change the IP address space for all of the residence halls, and by 3 PM the network was back.
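For context on why adding outgoing addresses is the natural first fix: with port translation, each public address can hold only on the order of 64,000 simultaneous translations, so the size of the outgoing pool caps the number of concurrent flows. The client and flow counts below are assumptions for illustration, not our actual figures.

```python
# Back-of-the-envelope NAT/PAT capacity estimate; all counts are illustrative assumptions.
USABLE_PORTS_PER_PUBLIC_IP = 64000   # roughly 65535 minus reserved/low ports
CONCURRENT_FLOWS_PER_CLIENT = 30     # Zoom, browser tabs, cloud sync, etc. (assumed)
CLIENTS = 6000                       # assumed devices online during class time

flows_needed = CLIENTS * CONCURRENT_FLOWS_PER_CLIENT
for pool_size in (1, 2, 4, 8):
    capacity = pool_size * USABLE_PORTS_PER_PUBLIC_IP
    status = "OK" if capacity >= flows_needed else "exhausted"
    print(f"{pool_size} public IP(s): {capacity:,} translations vs {flows_needed:,} needed -> {status}")
```

In principle, growing the pool multiplies capacity; in our case, a firewall misconfiguration kept the added addresses from taking effect, as we learned the next day.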
We escalated the issue further within Cisco, and the next morning a sharp engineer found an issue with the configuration on the firewall. If only we had had the fortune of finding him 20 hours earlier! With the exception of the Monday issue, the network has held up well.
Still, we wish we hadn't had that outage on Monday. The heroic efforts of the LTS networking staff and several others should be commended. We communicated often with the community and got the network back up as soon as we could. We learned a few lessons in the process.
Despite the temptation to assume that changing the internal IP addresses to the 10.x space was the cause of the problem, I want to assure you that it was not. The problem was caused by a configuration issue in the firewall that should have handled the NAT, and as I mentioned, even Cisco was unable to solve it until several hours later!
Now we look forward to a continued stable network so that we can turn our attention to other projects that benefit the community!