
1. Exploration of Parallel Stratagems
The first major obstacle in parallelization is choosing an appropriate paradigm for the problem. For most scientific problems or models, several parallel strategies can solve the problem correctly, and it is important to recognize the range of possible solutions. Conway's Game of Life is a good example. The game consists of a grid of 'cells' that live, die, or come back to life according to how many live neighbors each cell has. The most intuitive parallel approach is to have each node in your computing cluster represent a single cell. In each round, a cell reports its status to each of its neighbors and in turn receives status reports from all of them. The cell then decides whether to die, live, or regenerate based on the information its neighbors transmitted. This cycle repeats ad infinitum.
While this approach is intuitive and perhaps almost obvious, it is by no means the only possible parallel paradigm for this problem. Another feasible approach is to appoint one node in the cluster, perhaps the head node, as a 'gatekeeper'. This node is analogous to an air traffic controller at an airport, overseeing communication and shaping and directing traffic. All nodes report their status to the gatekeeper at the end of each round, and the gatekeeper in turn broadcasts the needed status information to all nodes.
A third possible solution is circular transmission. Once again each node acts as a 'cell', but the nodes are conceptualized as a ring topology. At the end of each round, a sequential transmission of status messages starts from two or more equidistant points in the ring. The information is passed around the ring, with each node receiving a cumulative message, extracting its own pertinent bits, and acting on them.
Regardless of the problem, it is likely that there will exist a plethora of possible parallel algorithms, and generating a diverse array of solutions increases the likelihood of finding a solution that fits both your problem and your hardware/software setup.
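To make the first strategy concrete, the following is a minimal single-process sketch in Python of one round of the one-node-per-cell approach. It assumes the standard Game of Life rules (a live cell survives with two or three live neighbors; a dead cell comes back to life with exactly three) and a bounded grid whose outside is treated as dead. The names `next_state`, `neighbors`, and `step` are illustrative and not taken from any library, and the `inbox` dictionary merely stands in for real network messages.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


def next_state(alive: bool, live_neighbors: int) -> bool:
    """Standard Game of Life rule for a single cell."""
    if alive:
        return live_neighbors in (2, 3)  # survives
    return live_neighbors == 3  # regenerates


def neighbors(cell: Cell, rows: int, cols: int) -> List[Cell]:
    """Up to eight adjacent cells; positions off the grid are skipped (assumed dead)."""
    r, c = cell
    return [
        (r + dr, c + dc)
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0) and 0 <= r + dr < rows and 0 <= c + dc < cols
    ]


def step(grid: List[List[bool]]) -> List[List[bool]]:
    """One round: every cell 'sends' its status to its neighbors, then decides."""
    rows, cols = len(grid), len(grid[0])
    # Communication stage: each cell's inbox collects its neighbors' statuses.
    inbox: Dict[Cell, List[bool]] = {
        (r, c): [] for r in range(rows) for c in range(cols)
    }
    for r in range(rows):
        for c in range(cols):
            for n in neighbors((r, c), rows, cols):
                inbox[n].append(grid[r][c])
    # Processing stage: each cell applies the rule to what it received.
    return [
        [next_state(grid[r][c], sum(inbox[(r, c)])) for c in range(cols)]
        for r in range(rows)
    ]
```

Calling `step` on a 3x3 grid whose middle column is alive yields a grid whose middle row is alive, and calling it again restores the original pattern. On a real cluster, the loop that fills `inbox` would be replaced by actual sends and receives between nodes.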

2. Cluster Assessment & Strategy Selection
As the end of the last section alluded to, the next major step in parallelization is choosing the parallel paradigm that best fits both your scientific problem and your existing cluster software and hardware. Hopefully, at this point in the process, you have several parallel algorithms capable of modeling your problem and are now faced with the mixed blessing of selection. The first consideration is an evaluation of your software and hardware. These are the two major factors that will influence your decision, and if optimization is crucial to your project, identifying the bottlenecks in these two components is essential. Parallel computing is a balancing act between processor speed on one side and network bandwidth and latency on the other. Diagnostic software for your cluster, such as Ganglia, will help you identify which of the two is likely to be more of a concern. Additionally, if you can rapidly prototype code for your scientific problem, you can use such software to pinpoint whether processing speed or network speed is your limiting factor. The diagnostic aspect of this step can be very time-consuming, and the amount of time and resources you allocate to it depends on your needs and the overall algorithmic complexity of the problem being addressed.
Once you have finished assessing your cluster's configuration, you are ready to select an algorithm. The precision of your algorithm selection will once again depend on your needs. Ideally, you would analyze each algorithm on your list and express its running time in Big-O notation. You would also want to do something similar for network usage. Although I'm not aware of an established analytical framework comparable to Big-O notation for network usage, you could analyze usage in terms of, for example, the number of communications per round.
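One hedged way to carry out the "communications per round" analysis is simply to count messages. The sketch below does this for two of the strategies from Section 1 under stated assumptions: every cell has eight neighbors (edge effects are ignored), every message is point-to-point, and the gatekeeper replies to each cell individually instead of using a hardware broadcast. The ring strategy is left out because its cost depends on details the description does not fix. The function names and returned fields are illustrative only.

```python
from typing import Dict


def neighbor_exchange_cost(num_cells: int, neighbors_per_cell: int = 8) -> Dict[str, int]:
    """Each cell sends its status to every neighbor and receives one from each."""
    return {
        "total_messages": num_cells * neighbors_per_cell,
        # The busiest node handles its own sends plus its own receives.
        "max_messages_on_one_node": 2 * neighbors_per_cell,
    }


def gatekeeper_cost(num_cells: int) -> Dict[str, int]:
    """Each cell reports once; the gatekeeper replies to each cell once."""
    return {
        "total_messages": 2 * num_cells,
        # The gatekeeper receives every report and sends every reply.
        "max_messages_on_one_node": 2 * num_cells,
    }
```

For 10,000 cells, neighbor exchange gives 80,000 messages in total but only 16 on any one node, while the gatekeeper model gives 20,000 messages in total, all of which pass through the gatekeeper. Under these assumptions total traffic grows linearly in both cases, but the gatekeeper model concentrates that linear growth on a single node, which matters if your diagnostics point to the network as the bottleneck.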
Finally, based on your analysis of your cluster's bottlenecks and of the computational and network complexity of each algorithm, you can choose an algorithm. An alternative path for this final step, if the implementation effort for each algorithm is relatively small, is to code solutions for multiple algorithms and simply compare the running times of these prototypes.
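The prototype-and-compare path can be supported by a small timing harness. The following is a sketch, not a full benchmarking tool: it assumes each prototype can be wrapped in a zero-argument Python callable (for a real cluster job, that callable would launch the job and wait for it to finish), and it reports the best wall-clock time over a few repeats. The name `compare_running_times` is hypothetical.

```python
import time
from typing import Callable, Dict


def compare_running_times(
    candidates: Dict[str, Callable[[], object]], repeats: int = 3
) -> Dict[str, float]:
    """Return the best wall-clock time in seconds observed for each candidate."""
    results: Dict[str, float] = {}
    for name, run in candidates.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run()
            best = min(best, time.perf_counter() - start)
        results[name] = best
    return results
```

Given the returned dictionary `timings`, the fastest prototype is `min(timings, key=timings.get)`. Taking the minimum over repeats reduces the influence of transient load on a shared cluster, though it does not replace the diagnostic work described above.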

3. Implementation
Once you've selected what appears to be the best-fit algorithm from your list of possibilities, the time has come to put your plan into action and code your solution. A wide variety of languages, packages, and APIs is available for parallel programming, such as Parallel Haskell, PVM, or Open MPI. Your choice of software package for a particular project will most likely come down to either what's pre-installed on your cluster or what you're most comfortable with. Regardless of the software package used, several commonalities arise in implementing your algorithm:

Role identification
Depending on your chosen algorithm, different nodes may play different roles in your computation. The most common example is the use of a director or gatekeeper node (the traffic controller) in our Conway's Game of Life example. When implementing these roles, you should develop a method for assigning roles to given nodes. This is especially important if you have specialized processing or network hardware for specific node types.

Stage analysis
Typically your algorithm will naturally separate into several stages, the number of which depends on the number of node roles you have and the nature of the algorithm itself. At the very least you will have one stage for communication between nodes and one stage for processing the information received. Often you will have multiple instances of each of these stage types, and it's important to delineate the stages both conceptually and practically in your code structures.

Node communication stages
The communication stages of your algorithm are the real essence of parallelism in your program, and this should manifest itself in your code. From the point of view of a single node, a communication stage can be broken down into a couple of component parts. The first is the conditional decision of whether communication is necessary this round: if no state change has occurred, a node may decide not to communicate. The second is picking out the recipients to send data to, unless a form of broadcast communication is being used.

Data processing stages
The data processing stages of your algorithm constitute the core of the scientific problem itself. Ideally they should contain no parallel code elements; however, these stages handle the data that is received and prepare the data to be sent in the next communication. The majority of this code will be quite similar to a non-parallel version of the algorithm.
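Role identification and the two parts of a communication stage can be sketched as plain Python functions. This is a single-process illustration under assumptions: nodes are identified by integer ranks (as in MPI-style packages), rank 0 is arbitrarily chosen as the gatekeeper, and a node that stays silent is taken by its recipients to be unchanged. `assign_role` and `outgoing_messages` are hypothetical names, not part of any package.

```python
from typing import List, Tuple

GATEKEEPER_RANK = 0  # assumption: the head node is rank 0


def assign_role(rank: int) -> str:
    """Role identification: map a node's rank to the role it plays."""
    return "gatekeeper" if rank == GATEKEEPER_RANK else "cell"


def outgoing_messages(
    rank: int, old_state: bool, new_state: bool, recipients: List[int]
) -> List[Tuple[int, int, bool]]:
    """Communication stage for one node, as (sender, recipient, state) tuples.

    First decide whether to communicate at all, then pick the recipients.
    """
    if new_state == old_state:
        return []  # nothing changed, so stay silent this round
    return [(rank, dest, new_state) for dest in recipients]
```

Skipping messages in this way only works if receivers remember the last status they heard from each sender; otherwise silence cannot be told apart from a lost message. If your cluster has specialized hardware, `assign_role` is the place where a mapping from hardware type to role would go.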

Example - Conway's Game of Life
To briefly illustrate the stages of implementation, consider our earlier example of Conway's Game of Life and suppose we decide to use the traffic controller model. For role identification, we first have the single director node, which we would likely assign to a machine with enhanced networking and processing capabilities if our nodes are heterogeneous. The remaining nodes act as cells in the simulation. The next step is delineating the stages of one round of the game. The first stage is a communication stage in which all cell nodes report their status to the director node. The second stage is a data processing stage in which the director node decides either to broadcast all status information to every cell node or to select the information needed by each cell node. The third stage is again a communication stage, in which the director node transmits the needed status information to the cell nodes. The fourth and final stage of a round is a data processing stage in which each cell node decides to live, die, or regenerate based on the received status information.
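The four stages described above can be sketched in a single process, with ordinary dictionaries standing in for network messages. The sketch assumes the standard Game of Life rules, a bounded grid with dead cells outside it, and the "select the information needed by each cell" option in stage two. All function names are illustrative.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


def collect_reports(states: Dict[Cell, bool]) -> Dict[Cell, bool]:
    """Stage 1 (communication): every cell node reports its status to the director."""
    return dict(states)


def select_per_cell(reports: Dict[Cell, bool]) -> Dict[Cell, List[bool]]:
    """Stage 2 (processing): the director picks the neighbor statuses each cell needs."""
    offsets = [
        (dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)
    ]
    return {
        (r, c): [reports.get((r + dr, c + dc), False) for dr, dc in offsets]
        for (r, c) in reports
    }


def deliver(outbox: Dict[Cell, List[bool]]) -> Dict[Cell, List[bool]]:
    """Stage 3 (communication): the director sends each cell its selected statuses."""
    return {cell: list(statuses) for cell, statuses in outbox.items()}


def decide(alive: bool, neighbor_statuses: List[bool]) -> bool:
    """Stage 4 (processing): a cell node decides to live, die, or regenerate."""
    live = sum(neighbor_statuses)
    return live in (2, 3) if alive else live == 3


def run_round(states: Dict[Cell, bool]) -> Dict[Cell, bool]:
    """One full round of the director (traffic controller) model."""
    reports = collect_reports(states)
    outbox = select_per_cell(reports)
    inbox = deliver(outbox)
    return {cell: decide(states[cell], inbox[cell]) for cell in states}
```

On a real cluster, stages 1 and 3 would be replaced by the send/receive or gather/scatter calls of whichever package you chose, while stages 2 and 4 would stay essentially as written, which is the separation between communication and processing code that the stage analysis recommends.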