Two student employees looked up from their work to see who had approached the sliding window at the IT Help Desk. A slideshow played on the screen behind them. Ironically, a slide that read “crashes happen” appeared, just days after SMU’s IT department faced one of the largest IT outages in a while.
The help desk employees are student workers who did not handle the restoration of SMU’s servers over Fall Break. The main IT staff worked overtime throughout the outage to restore the servers quickly. Although IT sent email updates to faculty and staff about affected services throughout the outage, it first delivered a detailed explanation on the evening of Oct. 19 on inside.smu, which can be found here.
The outage began at 3:30 a.m. on Saturday, Oct. 10, after maintenance began at midnight on Friday, Oct. 9 to update one of the older storage arrays, a collection of hard drives that holds information for 140 different SMU servers. IT was working on this array on the advice of its vendor, which had voiced concerns about a few hard disk drives in the array.
These repairs usually occur several times a week without any hardware failures like the one SMU recently experienced, according to Rachel Murray, director of IT Customer Service.
Once IT was aware of the outage, it contacted the storage array vendor, which ran tests to rebuild the failed drives, a process that lasted until early Sunday morning. IT employees then came on site at 7 a.m. on Sunday, Oct. 11 to begin restoring the most popular lost services, including PerunaNet, Canvas, Student Email and my.SMU.
After long hours of repairs on Sunday, IT learned on Monday that several of the SMU servers needed to be rebuilt. The IT staff then dedicated their time to rebuilding 68 servers.
IT restored all major applications and servers on Friday, Oct. 16, and is still working on restoring other minor and duplicate systems. IT is also transferring the data from this older storage array onto one of the newer storage systems.
“The good news is that we do not typically encounter an outage of this scale,” wrote Murray in an email communication. “We work diligently to provide a stable infrastructure to avoid these types of interruptions.”
For additional information on the services impacted during this outage, read here.