Ethics/Day-in-the-life crash recovery

Administrator:
Both clusters were down for most of the afternoon today. A storm took out the power in the Old Music Hall, and it wasn’t restored until early evening. I was able to restart the production cluster from my room, through the gateway machine, which is on a UPS. We should get a UPS for the development cluster, since I had to go in and manually restart those machines. Oh well. It was a quickly solved problem, and no one was using the clusters anyway, so it didn’t matter that I didn’t get the development cluster going again until this evening.

Administrator:
The introductory biology lab groups are using Blast for a project this week and the week after Thanksgiving break. Everything is running smoothly so far. They like being able to use Blast on our machines, because there’s less waiting than on the national site, since we have far fewer users. At last, proof that the Beowulf cluster is useful.

Administrator:
A physics professor started his program early this morning on the production cluster. He chose this time because it’s the start of Thanksgiving break, so no one will need the cluster for a few days and he can monopolize it. An hour after he started the program the head node and three child nodes crashed. I didn’t get the message until I was already at home, late in the evening. I tried rebooting the machines through the gateway machine. They came back up, but weren’t running properly. I probably should re-image all the affected machines, but since it’s break, I’m going to put it off for a few days. I’m not sure what caused the crash, but I suspect it was something in the professor’s code. I sent the professor an email letting him know that his program wouldn’t run, and that there was likely a bug in it that had caused the crash. He’ll have to rework it before he tries it again.

Professor:
I’m not sure what went wrong with my program code. I’d been getting help in writing the program for the cluster from a computer science student, but apparently we both missed something. It’s break, so I can’t contact the student until next week, which means I won’t be able to try running my program again until the winter break.

Administrator:
I completely forgot about re-imaging the machines until noon, when I got an email from one of the biology professors telling me that his morning lab group couldn’t access the Blast program. I had to scramble to get things running again in time for the afternoon lab. The entire cluster is up and running again now.

Professor:
The morning lab group today got done very early because we couldn’t access Blast. None of the cluster administrators was around to fix things at the time, so I told the students to leave. Not much point in keeping them there when there’s nothing to do.

Student:
I have enough trouble with computers when they’re working properly. Today I spent half an hour trying to do something that worked fine two weeks ago, and it didn’t work at all. It was really frustrating. The only good thing was that we got done with lab two hours early because we couldn’t do any of our work.

Administrator:
I replaced a machine in the castaway cluster today. It had overheated again because the fan was malfunctioning. We have plenty of extra machines, so I just pulled the unnecessary hardware out of a spare, imaged it, and put it in the cluster. The old machine went on the junk pile. We’ll pull out any remaining useful pieces later.