Technical Details about Parallel Mapping:
-----------------------------------------

First, we define some clean terminology conventions to aid in
thinking about this topic.  At each stage of a parallel flow-graph
there are a couple of points of view, which could cause confusion
without good definitions.

To distribute a node, such as Stg1, multiply the produce amount (P)
on the input arc(s) of that stage by the number of parallel
branches (Nstg1).
Example:
	Input arc(s) of Stg1 Node:
		P = k * Nstg1
		T = k
		C = k
This gives rise to Nstg1 parallel copies of Stg1.
We'll call this "fan-out" from the prior stage.  It is easily
accomplished with basic capabilities.
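
The fan-out arithmetic can be sketched in a few lines of Python.
This is only an illustration, not the tool's own code.  It assumes
that T is the threshold amount and C the consume amount, and that a
consumer fires whenever its input arc holds at least T items and then
removes C of them.  The names Arc, distribute_input_arc and
firings_enabled are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Arc:
    produce: int     # P: amount placed on the arc per upstream firing
    threshold: int   # T: amount required before the consumer may fire (assumed meaning)
    consume: int     # C: amount removed per consumer firing (assumed meaning)
    tokens: int = 0  # amount currently held on the arc


def distribute_input_arc(k: int, n_branches: int) -> Arc:
    """Build the input arc of a distributed stage: P = k * N, T = k, C = k."""
    return Arc(produce=k * n_branches, threshold=k, consume=k)


def firings_enabled(arc: Arc) -> int:
    """Apply one upstream production, then count the consumer firings it enables."""
    arc.tokens += arc.produce
    count = 0
    while arc.tokens >= arc.threshold:
        arc.tokens -= arc.consume
        count += 1
    return count
```

Under these assumptions, one production on the input arc enables
exactly Nstg1 firings of Stg1, one per parallel branch.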

The next challenge is gathering the input from all "copies"
of Stg1.  Each node of Stg2 must get a distinct arc from each
node of Stg1.  We'll call this "fan-in".
It does not matter how many Stg2 branches there are (that is the
"fan-out" aspect).  Even if there is only one Stg2 branch,
it must get one arc from each "copy" of Stg1.
That is, a Stg2 node cannot fire until it gets input from
all Stg1 nodes, and it won't misfire by getting more from
any one Stg1 branch (e.g. from another wave of data).
Therefore, the output arc(s) of Stg1 must be replicated, so that
there are Nstg1 of them, for separate tracking of their outputs.

Each output arc copy of the Stg1 node is balanced as follows:
		P = m * Nstg2
		T = m
		C = m
That is, a given arc copy gets the full amount from Stg1, enough
for all Nstg2 nodes of Stg2.
In other words, when any Stg1 node produces, it frees all Stg2 nodes
from waiting for input from it, though they may still need input
from the other Stg1 arc copies.
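
The fan-in rule can be illustrated with a small Python sketch.  It
assumes a Stg2 firing is enabled only when every one of the Nstg1
arc copies holds at least T items, and that each firing removes C
from every copy.  The function names are hypothetical and the
per-copy amounts are kept in a plain list.

```python
def can_fire(arc_tokens, threshold):
    """A Stg2 firing is enabled only if every arc copy holds at least `threshold`."""
    return all(t >= threshold for t in arc_tokens)


def fire(arc_tokens, consume):
    """Remove `consume` from every arc copy and return the new amounts."""
    return [t - consume for t in arc_tokens]


def count_stg2_firings(arc_tokens, threshold, consume):
    """Fire Stg2 as often as the arc copies allow.

    Returns the number of firings and the amounts left on each copy.
    """
    count = 0
    while can_fire(arc_tokens, threshold):
        arc_tokens = fire(arc_tokens, consume)
        count += 1
    return count, arc_tokens
```

For example, with Nstg1 = 3, Nstg2 = 2 and m = 5, each arc copy has
P = 10.  If only two Stg1 branches have produced, the amounts are
[10, 10, 0] and no Stg2 node can fire.  A second wave from the first
branch gives [20, 10, 0], which still enables nothing.  Once the
third branch produces, [20, 10, 10] enables exactly two Stg2 firings
and leaves [10, 0, 0] for the next wave.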

In operation, each firing of a Stg1 node produces on only one of
the output arc copies (i.e. the serial production rule).  The choice
of arc copy becomes a state variable of the node, and is conveniently
accessed and controlled in the Scheduler code.
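
A minimal Python sketch of the serial production rule follows.  It
is not the Scheduler's actual code.  The class name, its fields, and
the round-robin order in which arc copies are selected are all
assumptions made for illustration.

```python
class DistributedNode:
    """Hypothetical model of a distributed node with replicated output arcs."""

    def __init__(self, n_copies: int, produce: int):
        self.n_copies = n_copies            # number of output arc copies (Nstg1)
        self.produce = produce              # P on each output arc copy
        self.next_copy = 0                  # state variable: copy used by the next firing
        self.arc_tokens = [0] * n_copies    # amount held on each arc copy

    def fire(self) -> int:
        """Produce on exactly one arc copy, advance the state, return the copy used."""
        used = self.next_copy
        self.arc_tokens[used] += self.produce
        # Round-robin advance is an assumed selection policy.
        self.next_copy = (used + 1) % self.n_copies
        return used
```

With three arc copies, four successive firings would use copies
0, 1, 2 and then 0 again, so each copy is tracked separately.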

This convention enables us to talk cleanly about the distribution
of exactly one stage, independent of the distribution of any other
stage.  Ultimately, it allows users to control the distribution
of a stage from a single point: the node being distributed
(plus the mapping of that node, of course).

This should also work across module boundaries, because all modules
(and bundles) are flattened prior to execution.

This probably restricts the distributed node to having only
replicated output arcs (no mixture with single arcs).
But what would the alternative be?  Only some nodes conditionally
producing output?

So that's how it works.