Parametric DFG Mapping Method 1

This page shows one traditional method for describing parallel partitioning and mapping parametrically. Consider the corner-turn or butterfly operation, also known as matrix transpose, used for data-domain decomposition mappings. The corner-turn is a good example because it exhibits complex data interactions, occurs quite commonly in many applications, is very regular, and lends itself well to parametric description.

A parallel corner-turn requires a slice of data from every processor in one stage to be transferred to every processor in the next stage. The problem is further complicated by the fact that the number of processors in each stage may differ. Let's call these parameters Nstg1 and Nstg2. We need to ensure that the proper quantity of data is transferred and shows up on each processor, and that the proper amounts kick off the correct process tasks mapped to the right processors.
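
As a rough illustration of the bookkeeping involved, consider the following minimal sketch (plain Python, not CSIM syntax; the total data quantity N is an assumed example value):

    # Sketch only: each of the Nstg1 producers sends one slice to each
    # of the Nstg2 consumers, giving Nstg1 * Nstg2 transfers of
    # N / (Nstg1 * Nstg2) each.
    N = 6000                       # assumed total data per frame
    Nstg1, Nstg2 = 3, 2
    slice_size = N // (Nstg1 * Nstg2)
    transfers = [(src, dst, slice_size)
                 for src in range(Nstg1)
                 for dst in range(Nstg2)]
    assert sum(size for _, _, size in transfers) == N   # nothing lost

Each second-stage processor receives Nstg1 slices totaling N / Nstg2, no matter how the two parameters are set.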

Consider the typical butterfly-style DFG in Figure 1. It shows the processing of the first stage split into three parallel branches, and the second stage split into two parallel branches. Figure 2 shows the resulting time-line graph.


Figure 1 - Typical butterfly style DFG.

Figure 2 - Resulting time-line graph.

A method to express the parallelization parametrically is shown in Figure 3. The data amounts on the arcs become expressions of the parameters, which imply arbitrary parallelization or gathering for the following stages. Figure 4 shows the resulting time-line for the parameters Nstg1=3, Nstg2=2.
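
The flavor of such arc expressions can be sketched as follows (a hypothetical illustration in Python; D stands in for the base data quantity, and the function names are invented for this sketch):

    # Hypothetical arc-amount expressions as functions of the parameters.
    D = 6000                                 # assumed base data quantity
    def into_stage1(Nstg1):                  # amount on each Stg1 input arc
        return D / Nstg1
    def ct_to_stage2(Nstg1, Nstg2):          # amount on each corner-turn arc
        return D / (Nstg1 * Nstg2)

Because the amounts are expressions rather than constants, re-parallelizing the graph is a matter of re-evaluating them, not re-drawing arcs.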


Figure 3 - Parametric DFG.


Figure 4 - Resulting time-line graph for parametric mapping.
(Nstg1=3, Nstg2=2)

Note: The compute-time values are often parametric as well, but that is not the concern of this discussion. Likewise, the base data quantities are usually parameters too, but again we did not wish to add unnecessary complexity.

Next we show that by merely changing the two parameter values, we obtain new mappings without modifying the graph structure at all. Figures 5 through 9 show the mappings resulting from different parameter settings.


Figure 5 - Time-line for Nstg1=2, Nstg2=2.


Figure 6 - Time-line for Nstg1=4, Nstg2=8.


Figure 7 - Time-line for Nstg1=64, Nstg2=16.


Figure 8 - Time-line for Nstg1=12, Nstg2=48.

... And of course, we must show it still works for the trivial case:


Figure 9 - Time-line for Nstg1=1, Nstg2=1.

All of the above time-lines were made from the same DFG in Figure 3, merely by changing the parameters Nstg1 and Nstg2. The structure of the graph was never changed.
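
The same point can be made with the earlier sketch: sweeping the parameter pairs of Figures 5 through 9 through the same fixed expressions (hypothetical, reusing the into_stage1 and ct_to_stage2 sketches above):

    # Same expressions, different parameter settings (Figures 5-9).
    for Nstg1, Nstg2 in [(2, 2), (4, 8), (64, 16), (12, 48), (1, 1)]:
        print(Nstg1, Nstg2,
              into_stage1(Nstg1),             # per-branch Stg1 amount
              ct_to_stage2(Nstg1, Nstg2))     # per corner-turn amount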

One additional note: the node ct is essentially an artificial node for distributing the data in the proper ratios. Normally, its compute-time is set to zero (0.0), so it has no effect on the graph other than data distribution. Consequently, the arc from Stg1 to ct is essentially artificial as well, and to avoid the potential impact and clutter of the minimum 1-byte transfers, the arc's NoXfer attribute is set to true.



Mapping Control

The specific processor mappings of each stage can be controlled by group, by setting the node's mapping field to the list of processors in the group.

For example, some of the cases run above did not use any explicit mappings. (The mapping field of all nodes was blank, so the Scheduler selected processors for us.) We can force the tasks to run on specific processors by naming the processors explicitly in the mapping field of the first- and second-stage nodes. Figure 10 shows the time-line resulting from the DFG of Figure 3, but with the mapping field of node Stg2 set to: 0 2 (as in /PE_0 /PE_2). Note: The mapping field is a space-delimited list.
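
The effect of such a list can be modeled as follows (a hypothetical sketch of the behavior, not the Scheduler's actual code):

    # Hypothetical model of a space-delimited mapping list.
    mapping_field = "0 2"                    # as in /PE_0 /PE_2
    processors = mapping_field.split()       # -> ["0", "2"]
    # Successive parallel instances of Stg2 land on these processors:
    for instance, pe in enumerate(processors):
        print(f"Stg2 instance {instance} -> /PE_{pe}")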


Figure 10 - Time-line for explicit mapping.
  
The mapping/assignment list of the CSIM Scheduler is often underestimated. There are many things that can be done with it.

Some things everyone should know about mapping assignments:

  1. You can specify parallel assignments. For example, suppose you want a specific task to be mapped in parallel to three computers. Simply name them in the list: PE1 PE2 PE3

  2. You can specify sequences of parallel mappings. Suppose you want a specific task to be mapped in parallel first to three computers, next to three others, and then back again to the first three, and so on. Simply name them in the list: PE1 PE2 PE3   PE4 PE5 PE6   PE7 PE8 PE9. The list will repeat in a round-robin fashion each time the sequence is exhausted. (See the sketch after this list.)

  3. You can specify First-Available-From-a-Group mappings. That is, you can designate a group of processors that a set of tasks will map to, but in no specific order. To do this, set the task node type to GetAny, then list the processors.

  4. You can specify groups of processors as macros. This is especially handy when mapping large sections of DFGs (many nodes) to various groups. This way, you need only change one definition to alter the mapping of many tasks.

  5. Task groupings can be hierarchical. This provides very rapid remapping of your DFGs. For example, you can define subgroup1 = PE1 PE2, and subgroup2 = PE3 PE4, through subgroup40, etc. If you define GroupA to be the subgroups from 1-9, and GroupB to be the subgroups from 10-19, then you can conveniently map a highly parallel task to many processors by simply mapping it to GroupA GroupB.
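
Items 2 and 5 can be modeled with the following sketch (hypothetical Python, illustrating the behavior described above rather than the Scheduler's implementation):

    # Hypothetical model of round-robin sequences and group macros.
    import itertools
    # Item 5: hierarchical groups built from subgroups.
    subgroup1 = ["PE1", "PE2"]
    subgroup2 = ["PE3", "PE4"]
    GroupA = subgroup1 + subgroup2           # expands to PE1 PE2 PE3 PE4
    # Item 2: a sequence of parallel mappings repeats round-robin.
    sequence = [["PE1", "PE2", "PE3"],
                ["PE4", "PE5", "PE6"],
                ["PE7", "PE8", "PE9"]]
    rr = itertools.cycle(sequence)
    for invocation in range(4):              # 4th invocation wraps around
        print(f"invocation {invocation} -> {next(rr)}")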

Note: If using the actual device names in the mapping list, the devices must be spelled exactly as listed in the netinfo file; they must begin with a forward slash (/). The examples above assume that macros of the form PExx were defined for the actual device names. (Often a good idea.) Alternatively, the Scheduler can be used without a netinfo file by giving the total number of processors to consider on the Scheduler's command line with the -n option. In this case (as well as with a netinfo file), the processor logical ID numbers can be used in mapping assignment lists, beginning with 0.