Ethics/Day-in-the-life administrator

The following narrative describes what a typical week or so in the life of a cluster administrator is like. It covers new software installation, new node installation, old node repair, and reimaging machines. This narrative helped determine that problem resolution usually does not go as smoothly as one would hope, and that one should allot extra time to solve problems "just in case."

Installing New Software
I received an email this afternoon from a professor who is tackling a rather sophisticated statistics-related problem. The amount of data he has to work with is enormous, and this, coupled with the complexity of the problem, means that it takes forever to run on his or his research students’ desktop machines. He was wondering if we could help him out.

He mentioned that he did all of his calculations in the R statistics programming language. After a bit of digging, I discovered that one could use R in tandem with something called SNOW and Rmpi (as well as some MPI implementation).

After much tinkering and scratching of my head, I finally figured out how to make this all work together. We did a quick test using just two machines, and it almost cut computation time in half. Things are looking good. I’m going to test and document this software implementation some more before I push out the update to the other nodes.

Adding a User
Today in my math class, I overheard two physics students discussing some of their research. They had collected data on ice depth in the Antarctic, and were in the process of analyzing it. I mentioned to them that the Beowulf Cluster might be able to help them with their project, and if nothing else, that it would be a great learning opportunity. Intrigued, they took me up on my offer and want to explore the Beowulf Cluster a little bit. I wrote down the signup URL for them, as well as the URL of our wiki so that they could look around and learn a few things on their own before their accounts were created.

I added them to our master image, but I haven’t pushed out the final image just yet. I told them that they’d have to wait a few days, but that I’d email them immediately after I pushed out an image. It'd be nice if accounts were made a little quicker than this.

Adding a New Node
IIT brought over six new machines for the development cluster today. This means that we have about thirty working machines, and ten spares. Since we have a little room in our makeshift rack, I added four of these machines, leaving six for parts.

Preparing each machine involved gutting it, standardizing the internal components, labeling it, resetting the BIOS and then applying the appropriate settings, fetching the MAC address of the new machine, adding an entry to the gateway machine’s DHCPD configuration file, restarting DHCPD, and then turning the machine on. At first boot, each machine reimaged itself. When I got back to my dorm room, I checked ganglia, and sure enough, four new machines popped up.
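
The DHCPD step amounts to adding one host entry per machine. Here is a minimal sketch of generating such an entry, assuming the gateway runs ISC dhcpd and uses its host declaration syntax. The function name, host name, MAC address, and IP address are all made up for illustration and do not come from the actual cluster configuration.

```python
def dhcpd_host_entry(hostname: str, mac: str, ip: str) -> str:
    """Return an ISC dhcpd 'host' declaration for one node.

    Every statement inside the braces ends with a semicolon,
    which dhcpd requires.
    """
    return (
        f"host {hostname} {{\n"
        f"    hardware ethernet {mac.lower()};\n"
        f"    fixed-address {ip};\n"
        f"}}\n"
    )


if __name__ == "__main__":
    # Hypothetical values; substitute the real node name, MAC, and IP.
    print(dhcpd_host_entry("node31", "00:1A:2B:3C:4D:5E", "192.168.1.131"))
```

Generating the entry from a function rather than typing it by hand keeps the punctuation consistent, although the MAC address itself still has to be copied correctly from the machine.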

That's the script, anyway. But as we all know, scripts rarely get acted out unmodified. In reality, only two new machines showed up on ganglia. I decided to return to the CS lab to figure out what went wrong.

Both missing machines gave me one heck of a time. I eventually gave up on one; it will be used for parts. I replaced it with another machine.

The other one was just a matter of me getting ahead of myself. First, I forgot to remove the battery in the machine, so on first boot, when I was setting up BIOS, it asked for a password. I had to disconnect all of the appropriate cords, take out the node, and remove the battery for a moment. After this, I got the machine's MAC address, modified the DHCPD configuration file, restarted DHCPD, and booted up the machine once more. Unfortunately, this did not work. I forgot a semicolon in the DHCPD file, so when I restarted DHCPD, it complained about this and DHCPD didn't actually restart. I fixed this, and rebooted the machine. Still no luck.

At this point, I was at a loss for what to do. Eventually I figured out that I had typed the MAC address incorrectly in the DHCPD file. A simple typo basically cost me an hour and a half plus travel time. After fixing this, it reimaged just fine.
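
Both mistakes (the missing semicolon and the mistyped MAC address) are the kind that a quick check before restarting DHCPD might catch. The following is a rough sketch under stated assumptions, not a real dhcpd parser: it assumes colon-separated MAC addresses, treats everything after `#` as a comment, and expects every statement line to end in `{`, `}`, or `;`. The function names are hypothetical.

```python
import re
from typing import List

# Six colon-separated pairs of hex digits, e.g. 00:1a:2b:3c:4d:5e
MAC_RE = re.compile(r"^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$")


def is_valid_mac(mac: str) -> bool:
    """Return True if mac is well-formed (this cannot tell if it is the right one)."""
    return MAC_RE.match(mac) is not None


def lines_missing_semicolon(config_text: str) -> List[int]:
    """Return 1-based line numbers of statements that lack a trailing ';'."""
    bad = []
    for number, raw in enumerate(config_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue  # blank or comment-only line
        if not line.endswith(("{", "}", ";")):
            bad.append(number)
    return bad


if __name__ == "__main__":
    sample = (
        "host node31 {\n"
        "    hardware ethernet 00:1a:2b:3c:4d:5e\n"  # semicolon forgotten
        "    fixed-address 192.168.1.131;\n"
        "}\n"
    )
    print(lines_missing_semicolon(sample))
    print(is_valid_mac("00:1a:2b:3c:4d:5"))
```

A format check only catches malformed addresses; a well-formed address with two digits swapped would still pass, so comparing the entry against the label on the machine remains necessary. ISC dhcpd also provides a `-t` flag that tests the configuration file for syntax errors without starting the server.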

So much work to add just four nodes!

Replacing a Node
I returned to the dorms from the CS lab (I was doing grading), only to check ganglia and see that a node on the development cluster had gone down a few minutes prior. Normally, this wouldn’t bug me that much, but a biology class was going to do an assignment that required the use of the development cluster.

I decided that returning to the CS lab immediately to try to fix the machine would be a good idea; I had a few minutes to spare anyway. Upon arriving, I felt the machine that had given me trouble; it was hot! I also noticed that no air was coming through the power supply. I had seen this problem before--the fan on the power supply had gone bad.

I had two options: replace the power supply, or replace the machine. Since I didn’t have a screwdriver, I decided that replacing the machine was the best option. I took a pre-gutted spare machine, and went through the normal rigmarole. After setting up the new machine, I hit the power button and left. Thankfully, when I checked ganglia back in my dorm later that evening, the replacement machine had successfully come online.

Reimage Machines
Lots of changes had been made since the last time I reimaged the machines. More importantly, two new people wanted accounts, so it was imperative to push out the updates.

I had two options: reimage each machine from scratch using SystemImagerSuite, or issue a command that copied over just the differences to each machine. Since I had made significant changes, I decided that reimaging from scratch was the best option.

I navigated to the /tftpboot/pxelinux.cfg directory and made all of the appropriate changes. I then restarted each machine with a short script I wrote, and away things went.
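
The restart script itself is not shown here. As a rough idea of what such a script could look like, the sketch below assumes passwordless SSH access as root from the gateway to each node; the node names and function names are hypothetical. It defaults to a dry run that only prints the commands.

```python
import subprocess
from typing import List


def reboot_command(node: str) -> List[str]:
    """Build the ssh command that asks one node to reboot."""
    return ["ssh", f"root@{node}", "reboot"]


def reboot_nodes(nodes: List[str], dry_run: bool = True) -> List[List[str]]:
    """Reboot each node in turn; with dry_run=True, only print the commands."""
    commands = [reboot_command(node) for node in nodes]
    for command in commands:
        if dry_run:
            print(" ".join(command))
            continue
        try:
            # check=False: keep going even if one node returns an error.
            subprocess.run(command, check=False, timeout=30)
        except subprocess.TimeoutExpired:
            print(f"timed out: {' '.join(command)}")
    return commands


if __name__ == "__main__":
    # Hypothetical node names node01..node04; dry run by default.
    reboot_nodes([f"node{n:02d}" for n in range(1, 5)])
```

Running with the default dry run first makes it easy to confirm the node list before anything is actually rebooted.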

After everything was done, I emailed the two new users, informing them that they now had access to the development cluster, and that they should contact me if they had any questions.