MistRider recovery

To Do

 * Set up new MistRider installation [partial]
 * Install openMotif
 * Test SGE
 * Install Ganglia [partial]

Completed

 * Recover old MistRider disk
   * Direct Copy (on disk currently inside Gateway as /dev/sdc1)
     * Home Directories (/home.backup)
       * Meg
       * Macalester
       * Other
     * NameNode (/other.backup/var/hadoop)
     * Configuration (/other.backup/etc)
     * Applications in /opt (/other.backup/opt)


 * Set up new MistRider installation [partial]
   * Set up DHCP
   * Set up NIS/NFS
   * Install Hadoop
     * Download/install 0.18.1
     * Restore configuration
     * Restore NameNode
     * Bring up & test
   * Re-add users
     * Copy appropriate /etc/passwd entries
     * Restore /home directories
   * Install OpenMPI
   * Install Tomcat
   * Set up passwordless SSH from root to nodes [completed]
   * Install SGE


 * Make recovered home directories accessible on Helios (now in /home/MIST)
   * Re-assigned home directories to the appropriate users
   * Expanded user quotas on Helios

NameNode restoration
I've been working on setting up Hadoop, and I may have figured out how to restore the NameNode. From reading up on the configuration, there should be directories dedicated to storing the NameNode tables, temporary data, and DFS data. I just need to locate these on the Mist hard drive, provided it will mount.
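As a rough illustration of where to look: in Hadoop releases of this generation, those locations are normally set by the dfs.name.dir, dfs.data.dir, and hadoop.tmp.dir properties in hadoop-site.xml. The sketch below pulls them out of a config file. The sample config and its paths are made up for the example; only the property names are real.

```python
import xml.etree.ElementTree as ET

# Property names used by Hadoop releases of this era (single hadoop-site.xml).
STORAGE_PROPERTIES = ("dfs.name.dir", "dfs.data.dir", "hadoop.tmp.dir")


def storage_dirs(root):
    """Map each storage-related property found under `root` to its value."""
    found = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name in STORAGE_PROPERTIES:
            found[name] = (prop.findtext("value") or "").strip()
    return found


# Made-up sample config: the paths and host name are placeholders.
SAMPLE = """
<configuration>
  <property><name>dfs.name.dir</name><value>/example/dfs/name</value></property>
  <property><name>dfs.data.dir</name><value>/example/dfs/data</value></property>
  <property><name>fs.default.name</name><value>hdfs://example:9000</value></property>
</configuration>
""".strip()

dirs = storage_dirs(ET.fromstring(SAMPLE))
for name in STORAGE_PROPERTIES:
    # A property missing from the file falls back to Hadoop's built-in default.
    print(name, "->", dirs.get(name, "(not set, default applies)"))
```

For a real file, `storage_dirs(ET.parse(path).getroot())` does the same thing, where `path` points at the recovered hadoop-site.xml.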


 * Fortunately, the NameNode data was copied to the recovery HD. I apparently wasn't careful enough when typing the  commands, because all of the files were in weird places. I think I sorted everything out to how it was originally, under  . --Yates 14:45, 17 November 2009 (CST)


 * I copied the NameNode files into the correct place (on this install, /var/hadoop/dfs/name), but I don't think Hadoop will work until passwordless SSH is set up. I can't do that without knowing the old root password, which is still in use on the nodes. --Yates 16:47, 17 November 2009 (CST)
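Hadoop's start scripts reach the worker nodes over SSH, so it helps to have a quick way of confirming that key-based login works before bringing the cluster up. This is a minimal sketch, not the procedure actually used here. The node names passed to `report` are whatever the cluster uses, and `root` as the login user is an assumption taken from the to-do list above.

```python
import subprocess


def ssh_check_command(host, user="root"):
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",      # never fall back to an interactive prompt
        "-o", "ConnectTimeout=5",   # give up quickly on unreachable nodes
        f"{user}@{host}",
        "true",                     # trivial remote command; only the exit status matters
    ]


def passwordless_ok(host, user="root"):
    """Return True if login to `host` succeeds without any prompt."""
    result = subprocess.run(
        ssh_check_command(host, user),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def report(hosts, user="root"):
    """Print one status line per host."""
    for host in hosts:
        print(host, "ok" if passwordless_ok(host, user) else "FAILED")
```

Calling `report(["node01", "node02"])` with the real node names prints one line per node. A node whose host key has not been accepted yet will also show as FAILED, since batch mode cannot answer that prompt.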

Hadoop is working and the NameNode restoration was a success. :)


 * Patrick noticed the node VMs were eating up serious CPU at first, caused by the DataNode process. The problem seems to have leveled off--it was apparently due to rebalancing. Might want to watch it though. --Yates 00:33, 20 November 2009 (CST)

Recovered files on Helios
I doubled the quotas for all the users who had them earlier. This should get us by for now, and it is better than simply disabling quotas, in case of a runaway job, for instance. --Yates 15:29, 17 November 2009 (CST)
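For reference, doubling a quota could be scripted along these lines. This is only a sketch: the user name, the limits, and the /home filesystem are placeholders, and doubling the inode limits along with the block limits is an assumption. The argument order follows `setquota -u name block-soft block-hard inode-soft inode-hard filesystem`.

```python
def doubled_setquota(user, limits, filesystem):
    """Return a setquota argument list with every limit doubled.

    `limits` is (block_soft, block_hard, inode_soft, inode_hard).
    A limit of 0 means "no limit" and is still 0 after doubling.
    """
    return ["setquota", "-u", user] + [str(2 * n) for n in limits] + [filesystem]


# Placeholder values, not taken from Helios.
example = doubled_setquota("exampleuser", (1000000, 1200000, 0, 0), "/home")
print(" ".join(example))
```

Printing the commands first and running them by hand keeps a typo from silently changing the wrong user's limits.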

Ganglia
Ganglia is installed and (apparently) set up about how it was. However, it doesn't seem to be receiving information from any of the nodes. It's possible that the version mismatch (3.1.3 on admin, 3.0.7 on nodes) is the cause of the problem, but I don't see how they could be that different. The other possibility is that, since I just used the old config for reference rather than simply copying it over (the new default config and the old Mist config looked too different), I missed something. --Yates 23:30, 18 November 2009 (CST)
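One way to narrow this down is to ask gmond on the admin machine which hosts it currently knows about. The sketch below assumes gmond still has its default tcp_accept_channel on port 8649, where it writes an XML dump of the cluster state to any client that connects. The sample XML and host names are made up for the example.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Read the full XML dump gmond writes to a TCP client before closing."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            data = conn.recv(65536)
            if not data:  # gmond closes the connection after the dump
                break
            chunks.append(data)
    return b"".join(chunks)


def reporting_hosts(xml_bytes):
    """Return the sorted NAME attribute of every HOST element in a gmond dump."""
    root = ET.fromstring(xml_bytes)
    return sorted(host.get("NAME") for host in root.iter("HOST"))


# Made-up sample in the shape of a gmond dump.
SAMPLE = b"""<GANGLIA_XML VERSION="3.1.3" SOURCE="gmond">
<CLUSTER NAME="example" LOCALTIME="0" OWNER="unspecified" LATLONG="unspecified" URL="unspecified">
<HOST NAME="node02.example" IP="10.0.0.2" REPORTED="0"/>
<HOST NAME="node01.example" IP="10.0.0.1" REPORTED="0"/>
</CLUSTER>
</GANGLIA_XML>"""

print(reporting_hosts(SAMPLE))
```

Running `reporting_hosts(fetch_gmond_xml("localhost"))` on the admin machine would show whether only the admin host appears, which would suggest that the nodes' packets are not being accepted rather than that the web frontend is misconfigured.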