Final Ethics Report

= Intro =

To fully understand the purpose of the Beowulf Clusters at St. Olaf, we decided to analyze the ethical nature of cluster use and administration. To do this, we explored the socio-technical system of the clusters by analyzing the cluster and its surroundings. We also conducted interviews with potential cluster users to gauge outside opinions of the cluster and its purpose. The following is a discussion of our findings.

= Questions and question areas =
 * Uptime guarantee (PRICE)
 * Notifications of downtime and slow periods (JENSEN)
 * Level of support, maintenance (Student workers, IT, department?, India) (LANDSTEINER)
 * Qualifications for use, computer literacy? (FEEHAN)
 * What people are looking for in the cluster (GOUDZWAARD)

= Socio-Technical System =

Before we could devise methods for data collection, we had to determine the cluster's Socio-Technical System. We brainstormed together and came up with a comprehensive model of it.

= Methods =

We used two main methods of data collection: interviews and day-in-the-life scenarios. Since we were able to use the wiki to plan and guide our entire ethics report, our methods are already well documented there, so we will link to information where relevant.

Interviews
We first generated a list of questions, using our original ethics proposal as a guide. After creating the list of interview questions, we identified potential interviewees from multiple relevant on-campus departments. After contacting each potential interviewee, we arranged meeting times based on their schedules and our own.

Interviews were done with one interviewee and two interviewers, with the exception of one interview, which was done with one interviewee and one interviewer.

After each interview was complete, the interview sheet was scanned and uploaded to the wiki, and the interview was then transcribed.

Day-in-the-life Scenarios
Day-in-the-life scenarios (where a "day" is any unit of time) were created for our second data collection method. These scenarios were meant to flesh out the details of various aspects of the cluster. This was accomplished by dealing with a series of "what ifs" in a narrative style. We identified and wrote up narratives for days-in-the-life of a:
 * Cluster User
 * Cluster Administrator
 * Cluster
 * Software Package
 * Crash Recovery

= Analysis (per question) =

Analysis of Interviews
There is a general consensus that a comprehensive schedule of periodic downtimes is necessary. This schedule would be published in easily accessible locations (both physical and digital) and released well ahead of time, ideally as much as a semester in advance, so that classes may plan accordingly. Special uptime requests should be honored whenever possible, although requests should be made in advance. It is absolutely essential that the cluster remain up (and stable) while classes are conducting projects; this usage is likely to occur in bursts near midterms and finals, and delays would have serious consequences for class scheduling.

Schade: "Some computing projects we would have could take weeks to complete. Therefore it would be desirable to have the cluster running for weeks at a time. Periodic downtimes are o.k. if nothing is scheduled to be running for extended lengths of time."

Hall-Holt: "Periodic downtimes are acceptable. I don't expect more than 50% uptime (he laughs)."

McKelvey: "A week of uptime is acceptable, with a published schedule; downtimes in the 3:00-5:00 AM range are much preferred. I don't expect any 6-day computations, as they would not be effective pedagogically."

Beussman: "Periodic downtimes are acceptable, but a daily/weekly schedule - published ahead of time - is useful."

Walter: "The most important thing is that classes planning to use the cluster set up schedules ahead of time, and that the schedules are honored (i.e.: no downtime during those periods). Other times more flexible."

The High Performance Computing Collaboratory
http://www.hpc.msstate.edu/

They currently run an average of about 75% utilization on a 586-processor (293-node) cluster. About one node per week crashes or hangs for various reasons.

They have occasional problems with memory leaks or PBS hangups which require large-scale reboots of the cluster. They have a PBS heartbeat script that restarts it automatically within a few minutes. A full reboot of the cluster occurs about every 3-4 months.
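The heartbeat mechanism described above can be sketched as a generic watchdog loop. This is a hypothetical illustration, not the actual MSU script; the liveness probe and restart callbacks (e.g. a `pgrep`-based check and a `pbs_server` restart command) are assumptions.

```python
import time

def watchdog_step(is_alive, restart):
    """One heartbeat check: trigger a restart if the liveness probe fails.
    Returns True when a restart was issued."""
    if not is_alive():
        restart()
        return True
    return False

def watchdog(is_alive, restart, interval=60, max_checks=None):
    """Poll the service every `interval` seconds, restarting it whenever
    the probe fails. `max_checks` bounds the loop (useful for testing);
    None means poll forever."""
    checks = 0
    while max_checks is None or checks < max_checks:
        watchdog_step(is_alive, restart)
        checks += 1
        time.sleep(interval)
```

In practice `is_alive` might shell out to something like `pgrep pbs_server` and `restart` might invoke the PBS init script; both commands are guesses at what such a script would do, not details from the Collaboratory.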

They get better results with their Sun servers and SGIs, but those systems are not as large.

SEAGF Grid
http://seagf.thaigrid.or.th

Thailand/Singapore/Malaysia

Guava 92% - 14 Intel Xeon CPUs, 2.40 GHz

Anatta 60% - Unknown

Mybiogrid 54% - Unknown

PRAGMA Grid
http://pragma-goc.rocksclusters.org/

Clusters: 26 | Mean Uptime: 82.92% | Std. Dev.: 16.40% | Min: 39% | Max: 99%



Research Support Group at the University of Alberta
http://www.ualberta.ca/cns/research/

Aurora (46 × 195 MHz CPUs - 12 GB total, or 0.25 GB/CPU)

Borealis (64 × 400 MHz CPUs - 16 GB total, or 0.25 GB/CPU)

Australis (64 × 400 MHz CPUs with fast interconnect - 32 GB total, or 0.5 GB/CPU)

The maximum walltime (physical elapsed time) for jobs on Aurora and Australis is 24 hours, while the maximum walltime for jobs on Borealis is 12 hours. More precisely, at certain times of the day, all running jobs will be stopped. These times are currently 11:45 AM on Aurora and Australis, and 11:45 AM and PM on Borealis. These restarts ensure that jobs in the queue will have a chance to start in a reasonable period and keep large parallel jobs from being shut out.

They encourage the use of checkpointing (saving the current state of your program before a restart) to avoid losing the results of calculations during a restart and wasting CPU cycles.
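Checkpointing as recommended above can be as simple as periodically serializing the job's state to disk and resuming from it after a restart. The sketch below is an illustrative minimal version; the file name, state layout, and checkpoint interval are all assumptions, not part of the University of Alberta documentation.

```python
import os
import pickle

def load_state(path, default=None):
    """Resume from the last checkpoint if one exists, else start fresh."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return default

def save_state(state, path):
    """Write the checkpoint atomically (write to a temp file, then rename)
    so a restart mid-write cannot leave a corrupt checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def run(total_steps, path="state.pkl", checkpoint_every=100):
    """A long computation that survives scheduled restarts: it resumes
    from wherever the last checkpoint left off."""
    state = load_state(path, default={"step": 0, "partial": 0})
    for i in range(state["step"], total_steps):
        state["partial"] += i            # stand-in for the real work
        state["step"] = i + 1
        if state["step"] % checkpoint_every == 0:
            save_state(state, path)
    return state["partial"]
```

Running `run` a second time after an interruption picks up at the last saved step and produces the same final result, which is exactly the property the scheduled 11:45 restarts require of user jobs.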

Analysis of Interviews
Most professors expect that a maintenance schedule would be published well in advance. If something came up that required taking the clusters down for a period of time, they would expect the downtime to be as short as possible. If student projects are running, or a class needs the cluster during that time, the administrators would be expected to delay the update, since the class schedule takes priority. It would therefore be important to have a schedule both of class projects and of predicted downtimes so that professors and students can plan accordingly. Even if the cluster must be taken down during a period with no scheduled projects, professors would want to be informed, so that they can pass the information on to their classes and to researchers, who may use the cluster even when it is not on the schedule.

Lit Search
NA

Discussion of Question
The administrators should schedule any class assignments and research project time first, then schedule maintenance and planned downtime around that. That schedule should be sent to all users at least one week in advance of the first scheduled item, if not longer, and updates should be sent as they become necessary. If something breaks, all users should be notified immediately, with a timeline for when the cluster will be back online and ready for use.

(Reid - some thoughts)

E-mail notifications of downtimes should be sent to the beowulf-users@stolaf.edu mailing list: one at the beginning of the semester, brief weekly ones listing downtimes in the next two weeks, and additional ones whenever changes occur.
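The weekly notice proposed above could be generated automatically from the downtime schedule. The following is a hypothetical helper that composes (but does not send) the message; the sender address and message layout are assumptions, while the list address comes from the note above.

```python
from email.message import EmailMessage

def downtime_notice(downtimes, list_addr="beowulf-users@stolaf.edu",
                    from_addr="beowulf-admin@stolaf.edu"):
    """Build the weekly downtime e-mail covering the next two weeks.

    `downtimes` is a list of (date, start, end, reason) tuples.
    The from address is a placeholder, not a real account.
    """
    msg = EmailMessage()
    msg["Subject"] = "Beowulf cluster: scheduled downtimes, next two weeks"
    msg["From"] = from_addr
    msg["To"] = list_addr
    lines = [f"{date}: {start}-{end} ({reason})"
             for date, start, end, reason in downtimes]
    msg.set_content("Scheduled downtimes:\n" + "\n".join(lines))
    return msg

# Actual delivery (assumption: a local mail relay exists) would be:
#   import smtplib
#   smtplib.SMTP("localhost").send_message(msg)
```

A cron job running this weekly against the published schedule would satisfy the "brief ones weekly" requirement with no manual effort.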

Downtimes in the 3-5AM range are much preferred (McKelvey).

Interview Analysis
Our interviews were enlightening to us in several areas:
 * None of the interviewees had parallel programming experience, and three of the five were familiar with a programming language relevant to the current iteration of the cluster (C, C++).
 * All interviewees expressed that the lack of a Computer Science support consultant (programming) would be a disincentive to using the cluster in some way. Some would outright not use the cluster, while others indicated that they wouldn't use the cluster as much or that they would not recommend most of their research students use the cluster unless particularly motivated.
 * Maintenance expectations varied greatly from interview to interview, from as short as one hour to as long as one week. Further, some interviewees indicated being flexible during breaks and finals, while others indicated that breaks and finals were probably high-use times.
 * One interviewee expressed interest in becoming more responsible for the well-being of the cluster for simple problems, such as power failure recovery.
 * Problems unfixed for more than a week during summer research would be devastating.
 * Reliability was considered the most important concern by three of the five interviewees. This makes sense given the expectations for generally quick problem resolution.
 * Strong documentation and a GUI are more important for casual users (students using the cluster for classes) than for more experienced users.

Day-in-the-Life Analysis
Our Day-in-the-Life Scenarios revealed important information from an administrator's view:
 * Problem resolution can take significantly longer than expected
 * A student's life tends to be quite busy
 * Not everything can be done from the comfort of a dorm room (i.e., complete remote management is not possible)
 * A cluster administrator needs to be time flexible (to meet the demands of professors, repairman schedules)
 * A power-failure procedure needs to be designed and implemented
 * Castaway nodes have a high initial failure rate

Discussion of Question
Comparing the expectations of potential users with the data from the Day-in-the-Life scenarios, we notice a disparity: for instance, students live busy lives and cannot always guarantee quick turnaround times. That said, the summer months seem to be the most crucial time to keep the cluster stable and running, and also an important time for quick problem resolution. Further, given most interviewees' lack of experience with clusters, providing some form of human support (such as a CS student programmer) is crucial to the uptake of cluster use.

Analysis of Interviews

 * Most interviewees (all but one) have some background in programming
 * All interviewees had a basic idea of what a cluster is and how it can be used
 * Most interviewees have, at one point or another, used older programming languages like BASIC and Fortran
 * Two of the interviewees have experience with modern object-oriented languages (C++ and Java)

Lit Search
NA

Discussion of Question
Although most interviewees have some basic knowledge about the cluster, we can safely say that it will be difficult for them to use and take full advantage of the cluster's capabilities unless a highly skilled student or professor, one who has used multiple parallel processing applications on the cluster in the past, can give their time and effort, at least initially, to assisting these potential users.

Analysis of Interviews
In general, the interviews revealed that users are looking to use the cluster for personal projects and research. Speed is requested most often, and ease of use and reliability are also important. Users mentioned that scheduling multiple projects on the cluster at once could prolong execution times during critical weeks in the semester, like midterms and finals. Security and up-to-date software are less of a priority, though this depends on the project: if protein databases were updated for BLAST, it would be desirable to pick up those updates as soon as possible. In general, however, the latest software is merely a bonus.

The History of the Beowulf Cluster
http://www.beowulf.org/overview/history.html

The Beowulf website describes, over the course of its history, the criteria for providing a Beowulf cluster as a service to others. They suggest that building a cluster that the research community can completely control and fully utilize results in a more effective, higher-performance computing platform. While learning to build and run a Beowulf cluster is a considerable investment, there are substantial benefits to not being tied to a proprietary solution. Other expectations are also discussed.

Discussion of Question
Because St. Olaf has full control of the Beowulf clusters, we have the ability to design them to serve a purpose that fits every user's ideals. Therefore, providing a happy medium between speed and ease of use would best suit the St. Olaf user environment. To do this, we would need to look further into scheduling processes so that multiple projects running at the same time do not interfere. Furthermore, providing sufficient documentation for users will greatly increase the ease of use of the clusters. If it is not too much of a hassle, we should also look into finishing the wwwBLAST interface for mpiBLAST and installing user interfaces for the other tools. This would act as an alternative to the command line for those who are interested.