CAP F07/Ethics Team 1

Data Deletion
MB
 * Currently, the plan is that data will be erased if need be. This is the primary data issue I am looking for feedback on. We realized we may need to delete data because future classes using the cluster (and heavy Beowulf cluster usage in general) could create lots of data that takes up space and may not need to be stored for long. Perhaps an email should be sent to the owner before their data is erased, or perhaps the user could set a date after which it would be OK for the data to be erased. Technical solutions have been suggested for this problem (such as buying more hard disk space, or using multiple databases if one becomes bogged down), but we should really think through this issue rather than relying on technical fixes alone. One of the main goals of the cluster is to provide services equally to all users, so the imposition of quotas or the deletion of data should be approached with that goal in mind.
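One way to picture the "user sets an erase date" idea is a small sketch like the following. The record layout, field names, and dates here are all hypothetical placeholders, not an agreed design; the point is just that owners would be identified and contacted before anything is actually deleted.

```python
import datetime

# Hypothetical dataset records: each has an owner and a user-chosen
# "OK to erase after" date, as suggested in the note above.
DATASETS = [
    {"name": "run_a.dat", "owner": "alice",
     "erase_after": datetime.date(2007, 6, 1)},
    {"name": "run_b.dat", "owner": "bob",
     "erase_after": datetime.date(2099, 1, 1)},
]

def deletion_candidates(datasets, today):
    """Return datasets whose user-set erase date has passed.
    Owners of these would be emailed before any actual deletion."""
    return [d for d in datasets if d["erase_after"] < today]

candidates = deletion_candidates(DATASETS, datetime.date(2007, 9, 1))
```

Here only run_a.dat is past its date, so only alice would get a warning email; run_b.dat stays untouched.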

Data Access
MB
 * The data is currently accessed through a web page that has no restrictions on who can see it. Someone might come across data generated from an inaccurate, experimental model and take it to be significant. Ideally that person would contact whoever generated the data (right now I do not know whether the data storage records who created each dataset; this might be something to add), but we may also need to post a disclaimer about the data someone can find on the page.

MB
 * Access to the data could be restricted, possibly in the fashion of the Unix security system, where different access privileges can be assigned to the owner of the data, to groups of users, and to the rest of the world.
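As a concrete reminder of how that Unix model works, here is a minimal sketch using the standard permission bits (owner / group / world). The file itself is just a throwaway temp file for illustration.

```python
import os
import stat
import tempfile

# Create a throwaway file to demonstrate on.
fd, path = tempfile.mkstemp()
os.close(fd)

# The Unix model: separate read/write bits for the owner,
# the group, and everyone else. Here the owner may read and
# write, the group may only read, and the world gets nothing.
os.chmod(path, stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP)

# Extract just the permission bits (0o640 on a Unix system).
mode = stat.S_IMODE(os.stat(path).st_mode)

os.remove(path)
```

Data access on the cluster could follow the same pattern: the data's creator gets full access, a class or research group gets read access, and the public-facing web page shows only what is explicitly world-readable.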

This summer
Todd and I thought a bit about this problem over the course of the summer. We tried to address these issues:
 * 1) Where should data be stored?
 * 2) How long should data be kept before being deleted?
 * 3) What is the fastest way to access [read/write] large data files?

We decided, for the most part, that each Beowulf problem would need its own, unique solution. But in general:
 * 1) Data should be stored either on the nodes or on the administration machine, depending on the size and usage of the file.  If an input file is small (only a few KB or MB), then it makes sense to pass the data through MPI or through a network-mounted file system.  If an input data file is large (several MB or GB), then it makes more sense to avoid network overhead and to send the data files to each node before executing an SGE job.
 * 2) We plan to eventually create a 'purging' program that will find large, unused files on the nodes and delete them.  We decided that is probably the best way to keep our cluster clean.  We will certainly accommodate faculty and students who need server space for extended periods of time, but we decided that they should petition for such space, if need be.
 * 3) Going back to (1), it makes sense to transfer large files with scp or some other file transfer program, while small files can be parsed and passed through MPI.  It is generally faster to explore data this way.
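The planned purging program could look something like the sketch below: walk a directory tree and report files that are both large and untouched for a long time. The 100 MB and 30-day thresholds are placeholders, not decided policy, and a real version would email owners (per the notes above) rather than delete anything outright.

```python
import os
import time

# Placeholder thresholds -- actual policy is still to be decided.
SIZE_LIMIT = 100 * 1024 * 1024   # bytes (100 MB)
AGE_LIMIT = 30 * 24 * 60 * 60    # seconds (30 days)

def find_purgeable(root, size_limit=SIZE_LIMIT, age_limit=AGE_LIMIT,
                   now=None):
    """Return paths under root that are larger than size_limit and
    have not been accessed within age_limit seconds."""
    now = time.time() if now is None else now
    purgeable = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            if st.st_size > size_limit and now - st.st_atime > age_limit:
                purgeable.append(path)
    return purgeable
```

A first deployment might only print this list (or mail it to the owners and administrators) so that people can petition to keep their space before anything is removed.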

We should talk in class a bit more about your specific problem, and find a solution together with Todd and Professor Brown.

-Taylor (TR)