Ethics/Month-in-the-life crash recovery

The following month-in-the-life scenario deals with various crashes that could occur on the clusters, their causes, and their effects on the people using the clusters. As part of the ethical analysis of the cluster, it has been useful in bringing forward problems we may encounter, as well as possible solutions. This scenario was written primarily by Elizabeth Jensen.

Administrator:
Both clusters were down for most of the afternoon today. A storm took out the power in the Old Music Hall around noon. Though the power came back on less than a half hour later, I was in class until 3pm, so I wasn't able to restart the clusters until after that. I was able to restart the production cluster from my room, through the gateway machine, which is on a UPS. The UPS won't last for more than a half hour, but the generators kick in if the power is out that long, so it doesn't matter much. It does allow enough time for me to properly shut down the gateway machine, if necessary, and it also makes it easier to restart the production cluster, since those machines can be remotely booted. I had to go in to manually restart the development cluster machines, though, since they can't be remotely booted, and we don't have a gateway machine for that cluster running yet. Oh well. A quickly solved problem and no one was using the clusters anyway, so it didn’t matter that I didn’t get the development cluster going again until this evening.

Administrator:
The introductory biology lab groups are using Blast for a project this week and the week after Thanksgiving break. Everything is running smoothly so far. They like being able to use Blast on our machines, because there’s less waiting than for the national site, since there are many fewer users. At last, proof that the Beowulf cluster is useful.

Administrator:
A physics professor started his program early this morning on the production cluster. He chose this time because it’s the start of Thanksgiving break, so no one will need the cluster for a few days and he can monopolize it. An hour after he started the program, the head node and three child nodes crashed. I didn’t get the message until I was already at home, late in the evening. I tried rebooting the machines through the gateway machine. They came back up, but weren’t running properly. I probably should re-image all the affected machines, but since it’s break, I’m going to put it off for a few days. I’m not sure what caused the crash, but I suspect it was something in the professor’s code. I sent the professor an email letting him know that his program wouldn’t run, and that there was likely a bug in it that had caused the crash. He’ll have to rework it before he tries it again.

Professor:
I’m not sure what went wrong with my program code. I’d been getting help in writing the program for the cluster from a computer science student, but apparently we both missed something. It’s break, so I can’t contact the student until next week, which means I won’t be able to try running my program again until the winter break.

Administrator:
I completely forgot about re-imaging the machines until noon, when I got an email from one of the biology professors telling me that his morning lab group couldn’t access the Blast program. I had to scramble to get things running again in time for the afternoon lab. The entire cluster is up and running again now.

Professor:
The morning lab group today got done very early because we couldn’t access Blast. None of the cluster administrators was around to fix things at the time, so I told the students to leave. Not much point in keeping them there when there’s nothing to do.

Student:
I have enough trouble with computers when they’re working properly. Today I spent half an hour trying to do something that worked fine two weeks ago, and it didn’t work at all. It was really frustrating. The only good thing was that we got done with lab two hours early because we couldn’t do any of our work.

Administrator:
I replaced a machine in the castaway cluster today. It had overheated again because the fan was malfunctioning. We have plenty of extra machines, so I took one, pulled out the unnecessary hardware, imaged it, and put it in the cluster. The old machine went on the junk pile. We’ll pull out any remaining useful pieces later.