CSIM Performance Model Library


Learning to use Perf. Mod. Lib - A New User's Guide to Creating a Model (See this before your first try at a new model!)
Model Usage Notes (See these before first usage!!!)
Core Performance Models User Documentation
Model Reference Guide

CSIM Performance Models of Core Library Components

  • Processor Models
    • Generic Processor - Models any processor with a single external I/O port, e.g., i860, i960, PowerPC. May be used with either the static or dynamic scheduler.
    • Static Processor - Models any processor with a single external I/O port, e.g., i860, i960, PowerPC. May only be used with the static scheduler.
    • Sharc Processor - Processor model with six link ports and one parallel port. May be used with either the static or dynamic scheduler.
    • Static Sharc Processor - Processor model with six link ports and one parallel port. May only be used with the static scheduler.
    • C40 Processor - Processor model with six communication ports and one parallel port. May be used with either the static or dynamic scheduler.
    • Static C40 Processor - Processor model with six communication ports and one parallel port. May only be used with the static scheduler.
    • Multi-Priority Processor - Model of any processor with a single I/O port. Supports frame rate multitasking using the static scheduler; tasks in the .prog files are separated by TASK instructions.
    • Multi-Tasking Processor - Model of any processor with a single I/O port. Supports multitasking using the STIM static scheduler, i.e., multiple data flow graphs in the same simulation.
    • Reflexive Processor - Generic Processor with local memory traffic generation.
    • Computer with Middleware Layers, Preemptive Multi-tasking with Priorities.
  • Bus Models
    • Generic bus - Common bus model connecting multiple devices. Models almost any bus, e.g., VME, PCI, Ethernet.
    • Cascade Bus - A multiport cascadable bus element for bus extensions or bus hierarchies.
    • CBus NIC - The interface model between the processor bus and the cascade bus.
    • CBus Buffer - Use this model when cascading multiple cascade bus models that you want to isolate so that each can run concurrently at its maximum speed. Two such devices need to be used, connected back-to-back.
    • CBuffer Module - A module-level device made up of two CBus Buffer devices, used to isolate and buffer data transfers between two cascade bus networks.
  • Delay Models
  • Switch Models
    • Generic XBAR - An N-port generic crossbar switch for local multiple device interconnection.
    • Switcher - A generic 3-port zero-delay switch for local interconnection.
  • Raceway Models
    See: Raceway Model Features for additional information about the following models.
    • Raceway 1.0 Models
      • Raceway XBAR - A 6-port Raceway crossbar component that models the Raceway network protocol.
      • Raceway NIC - The Raceway network interface chip model that is used between the processor local bus and the Raceway network.
    • Raceway++ Models
      • Race++ XBAR - An 8-port Race++ crossbar component that models the Race++ network protocol.
      • Race++ NIC - The Race++ network interface chip model that is used between processor local buses and Race++ networks.
  • Myrinet Models
    • Myrinet Switch - A 16-port switch that models the Myrinet switch protocol at a token level.
    • Lanai - Performance model of the Myrinet Lanai component that interfaces between the processor local bus and the Myrinet network.
  • Simulation Control Model
    • Monitor - A simulation control monitor device that is required for all simulations using the core models library.
     

1. Performance Model Library - Core Models

1.1 Generic Processor Element - generic_pe.sim

This processor model may be used to represent the performance model of any processor with a single external I/O port. This is a task level model of a processor's performance and its communication with other processors through its external port. Tasks are modeled by their computation delay, and communication is modeled by Send and Receive instructions. The processor is modeled by two major concurrent processes: the Computation Agent and the Communication Agent.

Upon startup, this model reads an application program into its memory from a file called "pe_xx.prog", where xx is the logical processor number returned by the MY_ID subroutine. The application program consists of a sequence of Compute, Send and Receive instructions that have been generated by the CSIM Scheduler based on the application data flow graph and system architecture definition.

The Computation Agent interprets and executes the instructions in sequence. A Compute instruction causes the process to delay by the time specified for the task. A Send instruction causes a message to be queued in the processor's output queue and sent out the external io_port to its destination. A Receive instruction causes the processor to dequeue the number of data bytes of type MID (Message ID) from its input buffer. If not enough data of that type has been received, the Computation Agent will wait until that data has been received at its input port. When all instructions have been completed, the Computation Agent stops processing.

The Communication Agent consists of an Output Agent and an Input Agent. The Output Agent runs continuously and checks for any messages placed in the Output Queue by the Computation Agent. If there is a message in the Output Queue, the Output Agent sends it out the io_port to the external link. If the external link is full, the Output Agent will wait until the link becomes available for another message. If the link is blocked for an extended period of time, it may cause the Output Queue to be filled to capacity by the Computation Agent. When the Computation Agent tries to send a message out to a full Output Queue, it will also block and wait until there is room in the Output Queue.

The Input Agent also runs continuously, waiting for messages coming in at the io_port. When messages come in, they get placed in a mailbox identified by their message ID (MID). The number of bytes received for each MID gets recorded, as does the total number of bytes received for all messages. When the Computation Agent executes a Receive instruction, it checks for the number of bytes it needs for the specified MID. If it finds at least that number of bytes, it will dequeue that number of bytes from the input buffer. Otherwise, it will wait until sufficient data has been received. If the total number of bytes received by the Input Agent and not dequeued by the Computation Agent reaches the maximum size of the input buffer, the Input Agent will hold up any new messages coming in at the io_port, thus causing messages to back up at the sender side. This protocol allows the processor model to simulate real system behavior, where limitations in the processor's memory and input and output buffer sizes set constraints on the system performance.
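
The blocking behavior on the receive side can be pictured with a short C sketch. The names below are illustrative only and are not taken from the model source:

        /* Sketch of the Receive-side bookkeeping described above.
           All names are illustrative; this is not the model source. */
        #define MAX_MID 256

        long bytes_received[MAX_MID];  /* per-MID bytes seen by the Input Agent */
        long total_buffered;           /* bytes received but not yet dequeued   */

        /* Computation Agent side: block until enough data, then dequeue it. */
        void receive_instruction(int mid, long length, void (*wait_for_input)(void))
        {
            while (bytes_received[mid] < length)
                wait_for_input();           /* suspended until more data arrives */
            bytes_received[mid] -= length;
            total_buffered      -= length;  /* frees input-buffer space, which
                                               may unblock a held-up sender */
        }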

In dynamic scheduler mode (#define DYNAMIC), the processor gets its instructions from its command queue, which the scheduler loads dynamically during simulation. In dynamic mode, two instruction streams are processed concurrently: the usual command queue and a Send command queue. This allows Send commands that were held up earlier by the scheduler due to an unresolved destination to be processed immediately, even when another Compute instruction is already being processed concurrently. Only old Send commands are processed immediately; a Send command is always associated with a completed Compute task. An ID is associated with every Compute and Send command to allow comparison of Send commands in the queue with the most recent Compute task processed.
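
A minimal sketch of that gating logic, assuming each command simply carries the ID of its associated Compute task (all names invented for illustration):

        /* Sketch: deciding whether a queued Send command may run now.
           IDs and names are illustrative, not from the model source. */
        typedef struct { int id; /* ... message fields ... */ } SEND_CMD;

        int send_may_run_now(const SEND_CMD *cmd, int last_completed_compute_id)
        {
            /* An "old" Send belongs to a Compute task that has already
               completed, so it may be processed immediately. */
            return cmd->id <= last_completed_compute_id;
        }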

The generic_pe model is a behavioral model of a processor with a single I/O port. The generic model may be used to model almost any processor with a single I/O communication port. It is useful in modeling the performance of a processor at the task level. It uses a list of instructions that are generated automatically by the CSIM scheduler from a data flow graph Graphical User Interface.

The model supports the following set of instructions:

  • cecompute <time_delay> [label] [<a> attr_name = attr_value </a> ... <a> attr_name = attr_value </a>]
    - reserves the processor for <time_delay> time units.

  • recvmessg <Message_ID> <length_in_bytes> [comment] [<a> attr_name = attr_value </a> ... <a> attr_name = attr_value </a>]
    - keeps the processor waiting until <length_in_bytes> of <Message_ID> are received at its wired input port.

  • sendmessg <Message_ID> <dst_pe> <length> [priority] [comment] [<a> attr_name = attr_value </a> ... <a> attr_name = attr_value </a>]
    - sends <length> bytes of <Message_ID> out its wired port to <dst_pe>

  • pendmessg <Message_ID> <dst_pe> <length> [priority] [comment] [<a> attr_name = attr_value </a> ... <a> attr_name = attr_value </a>]
    - pends the message to be transmitted with a subsequent 'multimessg' instruction.

  • consume <Message_ID> <length_in_bytes> [comment] [<a> attr_name = attr_value </a> ... <a> attr_name = attr_value </a>]
    - similar to 'recvmessg', but the message is received at its wireless port.

  • produce <Message_ID> <dst_pe> <length> [priority] [comment] [<a> attr_name = attr_value </a> ... <a> attr_name = attr_value </a>]
    - similar to sendmessg, but the message is sent out its wireless port. The produce/consume instruction pair is generated when the SilentXfer = True attribute is set on the DFG arc.

  • readmessg <Message_ID> <dst_pe> <length> [priority] [comment] [<a> attr_name = attr_value </a> ... <a> attr_name = attr_value </a>]
    - used by the destination PE to read the data from the source NIC. The postmessg/readmessg instruction pair is generated when the XferType = Pull attribute is set on the DFG arc.

  • multimessg <Message_ID> [priority] [comment] [<a> attr_name = attr_value </a> ... <a> attr_name = attr_value </a>]
    - a multicast message that sends out all previously held 'pendmessg' instructions.

  • monotonic <scheduled_wake_time_delay> [label] [<a> attr_name = attr_value </a> ... <a> attr_name = attr_value </a>]
    - generates an initial message, and another every <scheduled_wake_time_delay> time units after the previous 'monotonic' instruction.

  • EventNote <Note>
    - stores the event time and <Note> into the EventHist.dat file.

  • loop [comment]
    - repeats instruction execution from the beginning of the instruction list.

  • END_OF_GRAPHS [label]
    - indicates completion of all subgraphs instruction execution.

  • progmdone [comment]
    - indicates completion of subgraph instruction execution.

The instructions may be generated by a static scheduler, in which case the list of instructions is generated and stored in a program (.prog) file, or they may be generated dynamically during simulation by a dynamic scheduler, which parses the instructions and delivers them to the processor in an input queue.
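
For illustration, a hypothetical pe_xx.prog fragment using these instruction formats might look like the following; the task times, message IDs, destinations, and lengths are invented for the example:

        cecompute 125.0 task_A
        sendmessg 17 3 4096 0 A_out_to_pe3
        recvmessg 18 8192 B_in_from_pe2
        cecompute 250.0 task_B
        progmdone end_of_subgraph
        END_OF_GRAPHS all_done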

The model provides a number of debugging tools and gathers simulation statistics for display.

Major events may be displayed during simulation by setting the Verbosity mode. Setting Verbose mode causes messages to be displayed in the terminal window. The Verbosity varies between 0 and 10; the higher the Verbosity, the more messages will be displayed. When initiating the simulation from the command line, type:
sim.exe -V n
where n is a number from 0 to 10.

The model supports animation during simulation. When a task is executed in the processor, the model highlights its box in the simulation GUI. Either the hardware architecture graph or the software DFG graph may be displayed during simulation. This may be switched by the environment variable SIM_GRAPH, setting it to the file of either the hardware graph or the software graph. The default setting is the hardware architecture file.

During simulation, a number of files are generated which may be used by a post-processor to show processor activity and communication timelines. The file ProcTline.dat will show the task activity timeline using XGRAPH. The file Spider.dat will show the interprocessor communication timeline. It should be used in conjunction with ProcTline.dat to show both the task and communication relationships, as in the command:
XGRAPH ProcTline.dat Spider.dat

A live display of the XGRAPH ProcTline and Spider plot can be generated during simulation by invoking:
sim.exe -S socket_number
where socket_number is optional and can be a value between 1000 and 16383. See Live XGRAPH Display feature document for a more detailed description of its use.

An event history file called EventHist.dat is generated for use with the TLPP display tool which allows greater flexibility for displaying the task and communication timelines.

The on-chip processor memory utilization timeline is captured in IQtrace.dat, OQtrace.dat and Mtrace.dat files to be used with XGRAPH, to show the processor's data input queue, data output queue and total memory utilization, respectively. See Processor Memory Tracing and Management feature document for a more detailed description of its use.

The model includes the ability to specify buffer limits. When no memory/buffer attributes are used, the generic_pe model uses the global parameters set in parameters.sim:

  • Infinite_Mem
  • In_Q_Size
  • Out_Q_Size
  • Total_M_Size
  • M_Range
If you want to override the settings of these parameters, use either global variables or instance attributes via the GUI with the following respective names:
  • generic_pe_infinite_mem
  • generic_pe_in_q_size
  • generic_pe_out_q_size
  • generic_pe_total_m_size
  • generic_pe_m_range
These attributes may be set at any level and follow the general rules for use of attributes in an architecture hierarchy.
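
For example, finite queue limits could be set through the global-variable route with declarations in the style of the DEFINE_GLOBAL block shown for reflexive_pe.sim below. The values here are invented, and the exact placement depends on how your model files are organized:

        /* Hypothetical buffer-limit overrides for generic_pe instances. */
        int generic_pe_infinite_mem = 0;      /* enforce finite buffers    */
        int generic_pe_in_q_size    = 65536;  /* input queue limit, bytes  */
        int generic_pe_out_q_size   = 65536;  /* output queue limit, bytes */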

The timeline displays and statistics may be generated within a specified simulation time window or by events in the data flow graph which specify the beginning and end of the window. See Utilization Time Window Setting document for a more detailed description of its use.

A summary.dat file is generated to collect processor utilization statistics.
 

1.2 Static Processor Element - static_pe.sim

This processor model may be used to represent the performance model of any processor with a single external I/O port. This is a task level model of a processor's performance and its communication with other processors through its external port. Tasks are modeled by their computation delay, and communication is modeled by Send and Receive instructions. The processor is modeled by two major concurrent processes: the Computation Agent and the Communication Agent.

Upon startup, this model reads an application program into its memory from a file called "pe_xx.prog", where xx is the logical processor number returned by the MY_ID subroutine. The application program consists of a sequence of Compute, Send and Receive instructions that have been generated by the CSIM Scheduler based on the application data flow graph and system architecture definition.

The Computation Agent interprets and executes the instructions in sequence. A Compute instruction causes the process to delay by the time specified for the task. A Send instruction causes a message to be queued in the processor's output queue and sent out the external io_port to its destination. A Receive instruction causes the processor to dequeue the number of data bytes of type MID (Message ID) from its input buffer. If not enough data of that type has been received, the Computation Agent will wait until that data has been received at its input port. When all instructions have been completed, the Computation Agent stops processing.

The Communication Agent consists of an Output Agent and an Input Agent. The Output Agent runs continuously and checks for any messages placed in the Output Queue by the Computation Agent. If there is a message in the Output Queue, the Output Agent sends it out the io_port to the external link. If the external link is full, the Output Agent will wait until the link becomes available for another message. If the link is blocked for an extended period of time, it may cause the Output Queue to be filled to capacity by the Computation Agent. When the Computation Agent tries to send a message out to a full Output Queue, it will also block and wait until there is room in the Output Queue.

The Input Agent also runs continuously, waiting for messages coming in at the io_port. When messages come in, they get placed in a mailbox identified by their message ID (MID). The number of bytes received for each MID gets recorded, as does the total number of bytes received for all messages. When the Computation Agent executes a Receive instruction, it checks for the number of bytes it needs for the specified MID. If it finds at least that number of bytes, it will dequeue that number of bytes from the input buffer. Otherwise, it will wait until sufficient data has been received. If the total number of bytes received by the Input Agent and not dequeued by the Computation Agent reaches the maximum size of the input buffer, the Input Agent will hold up any new messages coming in at the io_port, thus causing messages to back up at the sender side. This protocol allows the processor model to simulate real system behavior, where limitations in the processor's memory and input and output buffer sizes set constraints on the system performance.

This processor does not support the dynamic scheduling mode or some of the newer features. It is a scaled-down version of the generic_pe.sim model and may be used when a faster simulation is desired. For a full comparison of the features, check the revision table in the model source text.

The static_pe model is a behavioral model of a processor with a single I/O port. The model may be used to model almost any processor with a single I/O communication port. It is useful in modeling the performance of a processor at the task level. It uses a list of instructions that are generated automatically by the CSIM static scheduler from a data flow graph Graphical User Interface and stored in the program (.prog) files.

The model supports the following set of instructions:

  • cecompute <time_delay> [label]
  • recvmessg <Message_ID> <length_in_bytes> [comment]
  • sendmessg <Message_ID> <dst_pe> <length> [priority] [comment]
  • postmessg <Message_ID> <dst_pe> <length> [priority] [comment]
  • readmessg <Message_ID> <dst_pe> <length> [priority] [comment]
  • monotonic <scheduled_wake_time_delay> [label]
  • consume <Message_ID> <length> [comment]
  • produce <Message_ID> <dst_pe> <length> [priority]
  • loop [comment]
  • END_OF_GRAPHS [label]
  • progmdone [comment]

The static_pe ignores attributes that may be appended to the instructions.

The model provides a number of debugging tools and gathers simulation statistics for display.

Major events may be displayed during simulation by setting the Verbosity mode. Setting Verbose mode causes messages to be displayed in the terminal window. The Verbosity varies between 0 and 10; the higher the Verbosity, the more messages will be displayed. When initiating the simulation from the command line, type:
sim.exe -V n
where n is a number from 0 to 10.

The model supports animation during simulation. When a task is executed in the processor, the model highlights its box in the simulation GUI. Either the hardware architecture graph or the software DFG graph may be displayed during simulation. This may be switched by the environment variable SIM_GRAPH, setting it to the file of either the hardware graph or the software graph. The default setting is the hardware architecture file.

During simulation, a number of files are generated which may be used by a post-processor to show processor activity and communication timelines. The file ProcTline.dat will show the task activity timeline using XGRAPH. The file Spider.dat will show the interprocessor communication timeline. It should be used in conjunction with ProcTline.dat to show both the task and communication relationships, as in the command:
XGRAPH ProcTline.dat Spider.dat

An event history file called EventHist.dat is generated for use with the TLPP display tool which allows greater flexibility for displaying the task and communication timelines.

The on-chip processor memory utilization timeline is captured in IQtrace.dat, OQtrace.dat and Mtrace.dat files to be used with XGRAPH, to show the processor's data input queue, data output queue and total memory utilization, respectively. The model includes an ability to specify buffer limits. The model uses the global parameters set in parameters.sim:

  • Infinite_Mem
  • In_Q_Size
  • Out_Q_Size
  • Total_M_Size
  • M_Range
The timeline displays and statistics may be generated within a specified simulation time window or by events in the data flow graph which specify the beginning and end of the window. See Utilization Time Window Setting document for a more detailed description of its use.

A summary.dat file is generated to collect processor utilization statistics.
 

1.3 ADSP Sharc Processing Element (PE) - sharc.sim

This is a task level model of the Analog Devices Sharc processor's performance and its communication with other processors through its external ports. Tasks are modeled by their computation delay, and communication is modeled by Send and Receive instructions.

The Sharc has seven external I/O ports called "p0-p6" through which data flows; one is a parallel port and six are serial link ports. The Sharc receives data-messages on its I/O ports, and it simulates, by a time delay, the computation of the application tasks that would operate on the data. Result data from the computations are then modeled by sending data out the I/O ports.

The serial ports also serve as a routing mechanism for messages to traverse across a network of multiple Sharc processors. A message received on a serial port that is not for this processor gets forwarded out another port until it reaches its final destination. The routing path and the destination processor ID are carried along with each message.

Upon startup, this model reads an application program into its memory from a file called "pe_xx.prog", where xx is the logical processor number returned by the MY_ID subroutine. The application program consists of a sequence of Compute, Send and Receive instructions that have been generated by the CSIM Scheduler based on the application data flow graph and system architecture definition.

The Sharc processor is modeled by concurrent processes: a Computation Agent and Communication Agents for each of the I/O ports.

The Computation Agent interprets and executes the instructions in sequence. A Compute instruction causes the process to delay by the time specified for the task. A Send instruction causes a message to be queued in the processor's Output Queue and sent out one of the external I/O ports to its destination. A Receive instruction causes the processor to dequeue the number of data bytes of type MID (Message ID) from its input buffer. If not enough data of that type has been received, the Computation Agent will wait until that data has been received at its input ports. When all instructions have been completed, the Computation Agent stops processing.

The Communication Agent consists of an Input Agent for each of its ports and a common Output Agent for all its ports. The Output Agent runs continuously and checks for any messages placed in the Output Queue by the Computation Agent. If there is a message in the Output Queue, the Output Agent sends it out the selected I/O port to the external link. If the external link is full, the Output Agent will wait until the link becomes available for another message. If the link is blocked for an extended period of time, it may cause the Output Queue to be filled to capacity by the Computation Agent. When the Computation Agent tries to send a message out to a full Output Queue, it will also block and wait until there is room in the Output Queue.

The Input Agents for each of the ports also run continuously, waiting for messages coming in at their I/O ports. When a message comes in at its port, the Input Agent checks its destination. If it is not for this processor, the message gets forwarded out the I/O port specified by the next entry in the message routing path. If it is for this processor, the amount of data in the message gets placed in a mailbox identified by the message ID (MID). The number of bytes received for each MID gets recorded, as does the total number of bytes received for all messages. When the Computation Agent executes a Receive instruction, it checks for the number of bytes it needs for the specified MID. If it finds at least that number of bytes, it will dequeue that number of bytes from the input buffer. Otherwise, it will wait until sufficient data has been received. If the total number of bytes received by the Input Agents and not dequeued by the Computation Agent reaches the maximum size of the input buffer, the Input Agents will hold up any new messages coming in at the I/O ports, thus causing messages to back up at the sender side. This protocol allows the processor model to simulate real system behavior, where limitations in the processor's memory and input and output buffer sizes set constraints on the system performance.
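
The per-port forwarding decision can be summarized in a short C sketch; the structure fields and helper functions here are invented for illustration and do not appear in the model source:

        /* Illustrative sketch of an Input Agent's routing decision. */
        typedef struct {
            int  dest_pe;    /* final destination processor ID       */
            int  hop;        /* index of the next entry in the route */
            int  route[8];   /* output port to take at each hop      */
            int  mid;        /* message ID                           */
            long length;     /* message length in bytes              */
        } MESSAGE;

        void forward_out_port(int port, MESSAGE *m);   /* assumed helpers */
        void mailbox_deposit(int mid, long length);

        void input_agent_dispatch(MESSAGE *m, int my_id)
        {
            if (m->dest_pe != my_id) {
                /* Not for this processor: forward out the port named by
                   the next entry in the carried routing path. */
                forward_out_port(m->route[m->hop++], m);
            } else {
                /* For this processor: record the data under its MID. */
                mailbox_deposit(m->mid, m->length);
            }
        }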

In dynamic scheduler mode (#define DYNAMIC), the processor gets its instructions from its command queue, which the scheduler loads dynamically during simulation. In dynamic mode, two instruction streams are processed concurrently: the usual command queue and a Send command queue. This allows Send commands that were held up earlier by the scheduler due to an unresolved destination to be processed immediately, even when another Compute instruction is already being processed concurrently. Only old Send commands are processed immediately; a Send command is always associated with a completed Compute task. An ID is associated with every Compute and Send command to allow comparison of Send commands in the queue with the most recent Compute task processed.

This model has been generalized to handle any number of I/O ports.

1.4 ADSP Sharc Processing Element (PE) - static_sharc.sim

This is a task level model of the Analog Devices Sharc processor's performance and its communication with other processors through its external ports. Tasks are modeled by their computation delay, and communication is modeled by Send and Receive instructions.

The Sharc has seven external I/O ports called "p0-p6" through which data flows; one is a parallel port and six are serial link ports. The Sharc receives data-messages on its I/O ports, and it simulates, by a time delay, the computation of the application tasks that would operate on the data. Result data from the computations are then modeled by sending data out the I/O ports.

The serial ports also serve as a routing mechanism for messages to traverse across a network of multiple Sharc processors. A message received on a serial port that is not for this processor gets forwarded out another port until it reaches its final destination. The routing path and the destination processor ID are carried along with each message.

Upon startup, this model reads an application program into its memory from a file called "pe_xx.prog", where xx is the logical processor number returned by the MY_ID subroutine. The application program consists of a sequence of Compute, Send and Receive instructions that have been generated by the CSIM Scheduler based on the application data flow graph and system architecture definition.

The Sharc processor is modeled by concurrent processes: a Computation Agent and Communication Agents for each of the I/O ports.

The Computation Agent interprets and executes the instructions in sequence. A Compute instruction causes the process to delay by the time specified for the task. A Send instruction causes a message to be queued in the processor's Output Queue and sent out one of the external I/O ports to its destination. A Receive instruction causes the processor to dequeue the number of data bytes of type MID (Message ID) from its input buffer. If not enough data of that type has been received, the Computation Agent will wait until that data has been received at its input ports. When all instructions have been completed, the Computation Agent stops processing.

The Communication Agent consists of an Input Agent for each of its ports and a common Output Agent for all its ports. The Output Agent runs continuously and checks for any messages placed in the Output Queue by the Computation Agent. If there is a message in the Output Queue, the Output Agent sends it out the selected I/O port to the external link. If the external link is full, the Output Agent will wait until the link becomes available for another message. If the link is blocked for an extended period of time, it may cause the Output Queue to be filled to capacity by the Computation Agent. When the Computation Agent tries to send a message out to a full Output Queue, it will also block and wait until there is room in the Output Queue.

The Input Agents for each of the ports also run continuously, waiting for messages coming in at their I/O ports. When a message comes in at its port, the Input Agent checks its destination. If it is not for this processor, the message gets forwarded out the I/O port specified by the next entry in the message routing path. If it is for this processor, the amount of data in the message gets placed in a mailbox identified by the message ID (MID). The number of bytes received for each MID gets recorded, as does the total number of bytes received for all messages. When the Computation Agent executes a Receive instruction, it checks for the number of bytes it needs for the specified MID. If it finds at least that number of bytes, it will dequeue that number of bytes from the input buffer. Otherwise, it will wait until sufficient data has been received. If the total number of bytes received by the Input Agents and not dequeued by the Computation Agent reaches the maximum size of the input buffer, the Input Agents will hold up any new messages coming in at the I/O ports, thus causing messages to back up at the sender side. This protocol allows the processor model to simulate real system behavior, where limitations in the processor's memory and input and output buffer sizes set constraints on the system performance.

This processor does not support the dynamic scheduling mode. It is a scaled-down version of the sharc.sim model and may be used when a faster simulation is desired. For a full comparison of the features, check the revision table in the model source text.

1.5 TMS320C40 DSP Processing Element (PE) - c40.sim

This is a task level model of a c40 processor's performance and its communication with other processors through its external ports. Tasks are modeled by their computation delay, and communication is modeled by Send and Receive instructions.

The c40 has seven external I/O ports called "p0-p6" through which data flows; one is a parallel port and six are serial link ports. The c40 receives data-messages on its I/O ports, and it simulates, by a time delay, the computation of the application tasks that would operate on the data. Result data from the computations are then modeled by sending data out the I/O ports.

The serial ports also serve as a routing mechanism for messages to traverse across a network of multiple c40 processors. A message received on a serial port that is not for this processor gets forwarded out another port until it reaches its final destination. The routing path and the destination processor ID are carried along with each message.

Upon startup, this model reads an application program into its memory from a file called "pe_xx.prog", where xx is the logical processor number returned by the MY_ID subroutine. The application program consists of a sequence of Compute, Send and Receive instructions that have been generated by the CSIM Scheduler based on the application data flow graph and system architecture definition.

The c40 processor is modeled by concurrent processes: a Computation Agent and Communication Agents for each of the I/O ports.

The Computation Agent interprets and executes the instructions in sequence. A Compute instruction causes the process to delay by the time specified for the task. A Send instruction causes a message to be queued in the processor's Output Queue and sent out one of the external I/O ports to its destination. A Receive instruction causes the processor to dequeue the number of data bytes of type MID (Message ID) from its input buffer. If not enough data of that type has been received, the Computation Agent will wait until that data has been received at its input ports. When all instructions have been completed, the Computation Agent stops processing.

The Communication Agent consists of an Input Agent for each of its ports and a common Output Agent for all its ports. The Output Agent runs continuously and checks for any messages placed in the Output Queue by the Computation Agent. If there is a message in the Output Queue, the Output Agent sends it out the selected I/O port to the external link. If the external link is full, the Output Agent will wait until the link becomes available for another message. If the link is blocked for an extended period of time, it may cause the Output Queue to be filled to capacity by the Computation Agent. When the Computation Agent tries to send a message out to a full Output Queue, it will also block and wait until there is room in the Output Queue.

The Input Agents for each of the ports also run continuously, waiting for messages coming in at their I/O ports. When a message comes in at its port, the Input Agent checks its destination. If it is not for this processor, the message gets forwarded out the I/O port specified by the next entry in the message routing path. If it is for this processor, the amount of data in the message gets placed in a mailbox identified by the message ID (MID). The number of bytes received for each MID gets recorded, as does the total number of bytes received for all messages. When the Computation Agent executes a Receive instruction, it checks for the number of bytes it needs for the specified MID. If it finds at least that number of bytes, it will dequeue that number of bytes from the input buffer. Otherwise, it will wait until sufficient data has been received. If the total number of bytes received by the Input Agents and not dequeued by the Computation Agent reaches the maximum size of the input buffer, the Input Agents will hold up any new messages coming in at the I/O ports, thus causing messages to back up at the sender side. This protocol allows the processor model to simulate real system behavior, where limitations in the processor's memory and input and output buffer sizes set constraints on the system performance.

In dynamic scheduler mode (#define DYNAMIC), the processor gets its instructions from its command queue, which the scheduler loads dynamically during simulation. In dynamic mode, two instruction streams are processed concurrently: the usual command queue and a Send command queue. This allows Send commands that were held up earlier by the scheduler due to an unresolved destination to be processed immediately, even when another Compute instruction is already being processed concurrently. Only old Send commands are processed immediately; a Send command is always associated with a completed Compute task. An ID is associated with every Compute and Send command to allow comparison of Send commands in the queue with the most recent Compute task processed.

This model has been generalized to handle any number of I/O ports.

1.6 TMS320C40 DSP Processing Element (PE) - static_c40.sim

This is a task level model of a c40 processor's performance and its communication with other processors through its external ports. Tasks are modeled by their computation delay, and communication is modeled by Send and Receive instructions.

The c40 has seven external I/O ports called "p0-p6" through which data flows; one is a parallel port and six are serial link ports. The c40 receives data-messages on its I/O ports, and it simulates, by a time delay, the computation of the application tasks that would operate on the data. Result data from the computations are then modeled by sending data out the I/O ports.

The serial ports also serve as a routing mechanism for messages to traverse across a network of multiple c40 processors. A message received on a serial port that is not for this processor gets forwarded out another port until it reaches its final destination. The routing path and the destination processor ID are carried along with each message.

Upon startup, this model reads an application program into its memory from a file called "pe_xx.prog", where xx is the logical processor number returned by the MY_ID subroutine. The application program consists of a sequence of Compute, Send and Receive instructions that have been generated by the CSIM Scheduler based on the application data flow graph and system architecture definition.

The c40 processor is modeled by concurrent processes: a Computation Agent and Communication Agents for each of the I/O ports.

The Computation Agent interprets and executes the instructions in sequence. A Compute instruction causes the process to delay by the time specified for the task. A Send instruction causes a message to be queued in the processor's Output Queue and sent out one of the external I/O ports to its destination. A Receive instruction causes the processor to dequeue the number of data bytes of type MID (Message ID) from its input buffer. If not enough data of that type has been received, the Computation Agent will wait until that data has been received at its input ports. When all instructions have been completed, the Computation Agent stops processing.

The Communication Agent consists of an Input Agent for each of its ports and a common Output Agent for all its ports. The Output Agent runs continuously and checks for any messages placed in the Output Queue by the Computation Agent. If there is a message in the Output Queue, the Output Agent sends it out the selected I/O port to the external link. If the external link is full, the Output Agent will wait until the link becomes available for another message. If the link is blocked for an extended period of time, it may cause the Output Queue to be filled to capacity by the Computation Agent. When the Computation Agent tries to send a message out to a full Output Queue, it will also block and wait until there is room in the Output Queue.

The Input Agents for each of the ports also run continuously, waiting for messages coming in at their I/O ports. When a message comes in at its port, the Input Agent checks its destination. If it is not for this processor, the message gets forwarded out the I/O port specified by the next entry in the message routing path. If it is for this processor, the amount of data in the message gets placed in a mailbox identified by the message ID (MID). The number of bytes received for each MID gets recorded, as does the total number of bytes received for all messages. When the Computation Agent executes a Receive instruction, it checks for the number of bytes it needs for the specified MID. If it finds at least that number of bytes, it will dequeue that number of bytes from the input buffer. Otherwise, it will wait until sufficient data has been received. If the total number of bytes received by the Input Agents and not dequeued by the Computation Agent reaches the maximum size of the input buffer, the Input Agents will hold up any new messages coming in at the I/O ports, thus causing messages to back up at the sender side. This protocol allows the processor model to simulate real system behavior, where limitations in the processor's memory and input and output buffer sizes set constraints on the system performance.

This processor does not support the dynamic scheduling mode. It is a scaled-down version of the c40.sim model and may be used when a faster simulation is desired. For a full comparison of the features, check the revision table in the model source text.

1.7 Reflexive Processor w/Memory Traffic

The reflexive-PE is a variation of the "generic_pe" model. It differs as follows:
The reflexive-PE model automatically generates data traffic to a local memory every time a message is sent or received by the PE. The transfer is generated "reflexively", hence the name of this model. The reflexive transfer is made the same size as the actual message which triggers it.

This model is based on assumptions about the "actual-PE" that we are modeling, that:

  1. To send data, the actual PE must first write data into local memory, then the data flows from local memory to the local NIC (Network Interface).
  2. To receive data, the data first flows from the NIC into local memory, then the PE reads it out of local memory.
Thus the data must be transferred on the local bus twice each time. Because down-stream events must wait for both transfers, the modeled PE/Memory pair is made to handshake with an acknowledge to assure that the transfer to/from local memory has completed before continuing.
This file, reflexive_pe.sim, contains two models:
        reflexive_pe
        Memory
You should set your memory nodes to be of type "Memory".

This pair of models makes a very important assumption! It assumes that the memory will always be on port "2" of the local bus, and the PE on port "1". If you use other ports, this is easily changed. There is a DEFINE_GLOBAL block at the bottom of this file which initializes two arrays:

        /* Set default path to local memory. */
        /* Assumes local memory is on port "2" of local bus. */
        int TO_LOCAL_MEMORY[] = {2,-1};

        /* Set default path to local PE. */
        /* Assumes local PE is on port "1" of local bus. */
        int TO_LOCAL_PE[] = {1,-1};
Change these as needed.
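
For example, if the local memory sits on port "3" and the PE on port "0" of the local bus, the block would become:

        /* Local memory on port "3" of the local bus. */
        int TO_LOCAL_MEMORY[] = {3,-1};

        /* Local PE on port "0" of the local bus. */
        int TO_LOCAL_PE[] = {0,-1};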

1.8 Generic Bus Module - lbus.sim

Model of Generic Local Bus (LBUS) - This models a common bus by funneling all incoming packets one-at-a-time through a single internal queue. Each input port of the bus has a process waiting for a packet to arrive. When a packet arrives on the port, the process pushes the packet onto the internal funnel queue. Concurrently arriving packets pass through the funnel one-at-a-time and are routed out to their respective destination port. One process handles the internal funnel-queue by receiving messages on the queue and dispatching them to the appropriate output ports.
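
As a sketch of that funneling structure (illustrative C; not the model source):

        /* Sketch of the LBUS funnel: per-port readers push into one queue;
           a single dispatcher pops and routes. Names are illustrative. */
        typedef struct { int dst_port; long length; } PACKET;

        void queue_push(PACKET *p);        /* assumed funnel-queue helpers */
        PACKET *queue_pop_blocking(void);
        void send_out_port(int port, PACKET *p);

        void port_process(PACKET *arrived)  /* one instance per input port */
        {
            queue_push(arrived);            /* funnel all arrivals */
        }

        void dispatcher_process(void)       /* the single funnel handler */
        {
            for (;;) {
                PACKET *p = queue_pop_blocking();  /* one-at-a-time */
                send_out_port(p->dst_port, p);
            }
        }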

1.9 Cascadable Bus - cascade_bus.sim

This model implements a multi-port communication element that behaves as a bus. It is expandable to any number of ports. It allows data to transfer between only one pair of ports at a time. However, unlike the normal bus model, this bus can be connected to other buses. Transfers between ports on the individual buses will occur independently (without interference). However, transfers between buses will arbitrate for an open path across both (all) buses.

This model consists of a major process thread that is instantiated multiple times, once for each of the used ports on the bus. The port_handler implements the messaging protocol as described in the Bus_NIC model.

It waits for a control signal to arrive on its port. If the arriving control signal is a "REQuest", it checks to see if this bus is busy. If not, it sets the status to "busy" and forwards the request message out the requested output port. Otherwise, if the bus is busy, the behavior depends on whether this bus segment is the first in a cascade of buses encountered by the REQuest. If it is the first bus, the REQuest is queued to be serviced next in line. If it is not the first bus, the message is changed to a NACK and reflected back for retry at a later time.

If the arriving control signal is a "done" (ACK) message, then it forwards the "done" message back out the port to which it was assigned, and resets the bus-status flag.

Different bus types may be cascaded together. This device uses two Cascade Bus constants: CB_transfer_rate and CB_latency (transfer rate and latency). They are defined in the file "parameters.sim". These variables will be overwritten by local instance variables, if they exist. The corresponding instance variables are optionally defined as CSIM attributes or macros and are named, respectively, cb_transfer_rate and cb_latency. Each Cascade Bus element can have a different transfer rate and latency. Each Cascade Bus element sets the lowest transfer rate value by comparing its own rate to that previously set. The latencies get added up by each Cascade Bus element. The receiving Bus NIC gets the resultant transfer rate and latency and uses it to determine the total transfer delay.
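
A minimal sketch of that aggregation, assuming the REQuest carries a running rate and latency along the path (names invented for illustration):

        /* Each Cascade Bus segment the REQuest crosses updates the carried
           values: keep the lowest rate seen so far, accumulate every latency. */
        typedef struct {
            double rate;     /* bytes/uSec; lowest transfer rate on the path */
            double latency;  /* uSec; sum of per-segment latencies           */
        } PATH_INFO;

        void cascade_segment_update(PATH_INFO *p, double seg_rate, double seg_latency)
        {
            if (seg_rate < p->rate)
                p->rate = seg_rate;      /* a slower segment lowers the rate */
            p->latency += seg_latency;   /* latencies add up along the path  */
        }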
 

1.10 Cascadable Bus NIC- bus_nic.sim

This is a CSIM description of the Network Interface Chip for cascade bus models. The function served by this device is to translate from the local processor-bus protocol to that of the cascade-bus network, and vice-versa. At the local processor side, communication is done at the message level. At the remote or network side, communication is done at the packet level. A message is packetized into multiple packets, and vice versa. This model consists of two main processes; one handles the local bus side and the other handles the Cascade Bus or "remote" side.

Due to the nature of the blocking in the cascade bus models, the inherent transfer-delay mechanism of CSIM is not sufficient by itself to account for the link communication delays. (This is because we cannot know in general when a packet transfer actually begins moving data, due to possible blocking.) Therefore, we have implemented a protocol that accurately accounts for data movement using two control-signal types. The first, called "REQuest", opens a pathway through the switch network, and the second, called "ACK", returns backward through the pathway after the appropriate transfer delay. As it does, the ACK signal closes the pathway.

Because the data transfer delay for a packet is entirely accounted for with a time-delay statement within the NIC model before returning the "ACK" message, there must be no delays on the network links. Therefore, the data-rate of the links as specified in the CSIM topology table should be set to infinity (a very high number). In the network, data begins flowing into the destination NIC once the wormhole has been opened to it. Data then continues to flow for packet_length/transfer_rate seconds. Then the path is freed. Therefore, the time-delay in the NIC model for reflecting the "ACK" signal should be the packet_length (in bytes) divided by the transfer rate (CB_transfer_rate), plus a latency delay (CB_latency) for each bus element through which the signal has passed. An additional fixed transfer overhead factor (CB_transfer_ovrhd) is added to the delay time. This gives the time delay in uSec.
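
Put together, that delay calculation reduces to a few lines; as a C sketch with illustrative names:

        /* Total transfer delay, in uSec, computed before reflecting the ACK:
           packet bytes over the lowest rate on the path, plus the summed
           per-segment latencies, plus the fixed transfer overhead. */
        double nic_ack_delay(long packet_length_bytes,
                             double path_rate,       /* lowest CB_transfer_rate */
                             double path_latency,    /* summed CB_latency       */
                             double transfer_ovrhd)  /* CB_transfer_ovrhd       */
        {
            return (double)packet_length_bytes / path_rate
                   + path_latency
                   + transfer_ovrhd;
        }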

Different bus types may be cascaded together. Each Cascade Bus element can have a different transfer rate and latency. Each Cascade Bus element sets the lowest transfer rate value by comparing its own rate to that previously set. The latencies get added up by each Cascade Bus element. The receiving Bus NIC gets the resultant transfer rate and latency and uses it to determine the total transfer delay. The constants for the Cascade Bus transfer rate, latency, and overhead are defined as macros in the "parameters.sim" file. The macro names are:  CB_transfer_rate, CB_latency, and CB_transfer_ovrhd respectively. The constants PACKET_HEADER_SIZE and CB_PACKET_SIZE are also specified in the "parameters.sim" file. All five of these variables will be overwritten by local instance variables, if they exist.  The corresponding instance variables are optionally defined as CSIM variables or macros and have names: cb_packet_header_size, cb_packet_size, cb_transfer_rate, cb_latency, and cb_transfer_ovrhd.

The local side process handles outgoing messages one at a time. It waits for a message to come from the local side. When one does, it checks to see if the NIC is in the idle state. If it is, it sets its state to sending, forwards the message out the remote side, and goes to sleep until it is eventually reinstated when the returning "ACK" signal is received by the remote side process.

The remote side process watches for either an incoming "REQuest" or a returning "ACK" message. If a "request" is received, the remote side process spawns a sub-thread (called EOM) that delays for the appropriate transfer time, then awakens, sends the "ACK" signal back, and forwards the packet to the local_side port.

1.11 Cascadable Bus Buffer- bus_buffer.sim

This is a CSIM description of the inter-bus buffer for cascade bus models. The function served by this device is to isolate two cascade bus networks and to buffer data transfers between the buses. This also allows each bus to run at its highest transfer rate and not be slowed down by the other slower bus.

Two of these devices need to be used and configured back-to-back in a buffer module with a common local-side link. The function of the buffer module is to forward a message/packet  received on one Cascade Bus to the other Cascade Bus. It buffers it and forwards it in the same format and length as it receives it. No packetization is done by the module. It also handles the handshaking signals on each Cascade Bus.

The bus_buffer model consists of two main processes; one handles the local bus side and the other handles the CascadeBus or "remote" side.  The remote side behaves like a bus_nic model, but without packetization. Due to the nature of the blocking in the cascade bus models, the inherent transfer-delay mechanism of CSIM is not sufficient by itself to account for the link communication delays.  (This is because we cannot know in general when a packet transfer actually begins moving data due to possible blocking.)  Therefore, we have implemented a protocol that accurately accounts for data movement that uses two control-signal types.  The first, called "REQuest", opens a pathway through the bus network, and the second, called "ACK", returns backward through the pathway after the appropriate transfer delay.  As it does, the (ACK) signal closes the pathway.

Because the data transfer delay for a packet is entirely accounted for with a time-delay statement within the NIC model before returning the "ACK" message, there must be no delays on the network links.  Therefore, the data-rate of the links as specified in the CSIM topology table should be set to infinity (a very high number).  In the network, data begins flowing into the destination NIC once the wormhole has been opened to it. Data then continues to flow for packet_length/transfer_rate seconds. Then the path is freed.  Therefore, the time-delay in the NIC model for reflecting the "ACK" signal should be packet_length (in bytes) divided by the transfer rate (CB_transfer_rate) plus a latency delay (CB_latency) for each bus element through which the signal has passed.  An additional fixed transfer overhead factor (CB_transfer_ovrhd) is added to the delay time.  This gives the time delay in uSec. Each Cascade Bus element can have a different transfer rate and latency. Each Cascade Bus element sets the lowest transfer rate value by comparing its own rate to that previously set. The latencies get added up by each Cascade Bus element. The receiving Bus NIC gets the resultant transfer rate and latency and uses it to determine the total transfer delay.

The constants for the Cascade Bus transfer rate, latency, and overhead are defined as macros in the "parameters.sim" file. The macro names are:  CB_transfer_rate, CB_latency, and CB_transfer_ovrhd respectively. The constants PACKET_HEADER_SIZE and CB_PACKET_SIZE are also specified in the "parameters.sim" file. These variables will be overwritten by local instance variables, if they exist.  The corresponding instance variables are optionally defined as CSIM variables or macros and have names: cb_transfer_rate, cb_latency, cb_transfer_ovrhd, cb_packet_header_size and cb_packet_size.

The local side process handles outgoing messages one at a time. It waits for a message to come from the local side. When one does, it checks to see if the NIC is in the idle state. If it is, it sets its state to sending, forwards the message out the remote side, and goes to sleep until it is eventually reinstated when the returning "ACK" signal is received by the remote side process.

The remote side process watches for either an incoming "REQuest" or a returning "ACK" message. If a "request" is received, the remote side process spawns a sub-thread (called EOM) that delays for the appropriate transfer time, then awakens, sends the "ACK" signal back, and forwards the packet to the local_side port.

1.12 Cascadable Bus Buffer Module- cbuf_module.sim

This is a CSIM structural model of the Cascade Bus buffer module. This module is made up of two Bus Buffer models connected back-to-back at the local side. The function served by this device is to isolate two cascade bus networks and to buffer data transfers between the buses.

The function of the buffer module is to forward a message/packet received on one Cascade Bus to the other Cascade Bus. It buffers it and forwards it in the same format and length as it receives it. No packetization is done by this module. It also handles the handshaking signals on each Cascade Bus.
 

1.13 Simple Unidirectional Delay Element - latency.sim

This simple model functions as a unidirectional delay element. The output data stream begins at a specified latency after the beginning of the input stream.

1.14 Hardware Delay - delay_box.sim

This is a CSIM description of a hardware delay. It extracts the delay amount from the device's delay_amount attribute. The value must be an integer or real delay in simulation time units.

Devices of type "HW_delay_block" will block input until the specified delay. Devices of type "HW_delay_nonblock" act like a FIFO pipe and will merely delay messages by the specified amount of time: an arbitrary number of messages (specified by the parameter MAXDELAYB4BLOCK) can be held simultaneously by the device--after which the device blocks.

Both devices act in a full duplex manner. Both will delay messages coming from either port--either while blocking or passing--before releasing the message on the other port, INDEPENDENTLY of whatever is happening in the other direction. THIS DEVICE ACTS LIKE TWO SEPARATE DEVICES, each operating on data going in opposite directions. To make these devices act in a half duplex or simplex manner, manually set an arc attached to one of the sides to be half duplex or simplex.

1.15 Generic Crossbar Switch - generic_xbar.sim

This is a token based performance model of a generic crossbar switch. The ports connected to it may be full duplex, half duplex or simplex. Any number of ports may be connected, and each instance of the switch may be of a different size. It allows transfers from any port to any other port. Multiple concurrent transfers may proceed as long as each transfer is destined to a different output port. When multiple transfers are concurrently destined to the same output port, only one transfer gets through while the others are queued in a FIFO. Only one transfer may be queued from each input port. All messages passing through the switch are transferred at the same rate, set by a parameter called XBAR_SWITCH_RATE, in units of bytes/microsecond. The time delay, in microseconds, through the switch is a function of the message length, in bytes.

The XBAR_SWITCH_RATE constant is defined in "parameters.sim". An instance attribute called generic_xbar_rate, if set, overwrites the global XBAR_SWITCH_RATE value. An attribute called infinite_xbar_rate may be set to 1 to allow the generic_xbar to run at an infinite transfer rate, i.e., zero delay.
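
The resulting delay through the switch is simply message length over the effective rate; as a short sketch (illustrative):

        /* Delay through the generic_xbar, in uSec. An infinite rate
           (infinite_xbar_rate set to 1) makes the delay zero. */
        double xbar_delay(long msg_length_bytes, double rate_bytes_per_usec,
                          int infinite_rate)
        {
            return infinite_rate ? 0.0
                                 : (double)msg_length_bytes / rate_bytes_per_usec;
        }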

This switch should be used with all its ports connected to links with zero delay or a very high transfer rate. This device should normally be used as a local switch; it should generally not be used to form a network of multiple switches. If multiple switches are concatenated, be aware that the transfers will behave as in a store-and-forward network, where the entire message is held up at each switch for the duration of the transfer before being forwarded to the next switch.

This model is implemented as a major process thread, the port handler, that is instantiated multiple times, once for each of the ports on the crossbar switch. Each port handler waits for a message to arrive at its port. When it detects that a message has arrived, it checks whether the message's destination output port is available. If the output port is found to be in use by another message transfer, the message is queued at the output port's queue and waits until all messages ahead of it are transferred out. If it is available, the handler marks the port in-use and delays for a time equal to the message length divided by the switch rate. At the end of that time interval, it sends the message out the output port and releases the output port. The port handler does not process any new messages until its previous message has been sent out. Each output port has its own queue.
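
As a rough illustration of that port-handler logic, here is a minimal sketch in plain C rather than CSIM; the type and helper names, the port count, and the rate value are hypothetical:

        #include <stdio.h>

        #define NPORTS           4        /* hypothetical switch size        */
        #define XBAR_SWITCH_RATE 256.0    /* hypothetical rate, in bytes/us  */

        typedef struct { int dst_port; double length_bytes; } message;

        static int out_port_busy[NPORTS];              /* in-use flag per output */

        static void sim_delay(double us) { (void)us; } /* stand-in for the
                                                          simulator's time wait  */

        /* One port handler; the model instantiates one of these per port. */
        void port_handler(message *m)
        {
            if (out_port_busy[m->dst_port]) {
                /* destination in use: the message would be placed in the
                   output port's FIFO here and wait for its turn           */
                return;
            }
            out_port_busy[m->dst_port] = 1;                  /* claim port    */
            sim_delay(m->length_bytes / XBAR_SWITCH_RATE);   /* transfer time */
            printf("message sent out port %d\n", m->dst_port);
            out_port_busy[m->dst_port] = 0;                  /* release port  */
        }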

1.16 Switcher - switcher.sim

A three-port zero-delay switch that receives messages on any of its three ports, decodes and increments the next routing-address in the message's route-list, and forwards the message out the designated port. The message starts going out the designated port immediately upon the start of reception, if possible. Otherwise the message is queued on the outgoing port. This is not a store-and-forward device.
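
The decode-and-increment step can be pictured with the short plain-C sketch below; the route-list layout and field names are hypothetical, since the actual CSIM message format is defined elsewhere in the library:

        /* Hypothetical route-list layout: an index plus the list of output
           ports to take at each switch along the path.                    */
        typedef struct {
            int hop;          /* next route entry to consume               */
            int route[8];     /* output port to use at each switch         */
        } route_list;

        /* Decode the next routing address and increment past it, returning
           the port this switch should forward the message out of.         */
        int next_output_port(route_list *r)
        {
            return r->route[r->hop++];
        }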

1.17 Race_XBAR - race_xbar.sim

This is a model of a Raceway crossbar-switch. The model generically handles six ports, but is dynamically expandable to any number of ports. The model consists of a major process thread, port_handler, for each of the ports on the Raceway crossbar switch. The port_handler implements the messaging protocol as described in the Race_NIC model.

It waits for a control signal to arrive on its port. If the arriving control signal is a "REQuest", then it checks to see if the requested output port is available. If so, then it assigns the requested output port and this process's input port as being "a pair in-use", and forwards the request message out that requested output port. Otherwise, if the requested port is already in-use, then it changes the message to a NACK and reflects it back.

If a new request has a higher priority than an existing connection, then a "preempt" message is generated and sent out the forward port of the conflicting pair. Soon, a "done" (ACK) message will come back to release the conflicted ports, and when it does, then the preempting request is serviced.

If the arriving control signal is a "done" message, then it forwards the "done" message back out the port to which it was assigned, and de-assigns the port pair.
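
The control-signal handling described above can be summarized with the following plain-C sketch; the enum, data structures and helper names are illustrative stand-ins, not the CSIM model's real interface, and the DONE case is simplified to index by output port:

        typedef enum { REQ, NACK, PREEMPT, DONE } sig_t;

        typedef struct { sig_t kind; int in_port, out_port, priority; } ctl_msg;

        static int pair_in_use[6]  = { -1, -1, -1, -1, -1, -1 }; /* out -> in, -1 free */
        static int pair_priority[6];

        static void forward(ctl_msg *m, int port) { (void)m; (void)port; }
        static void send_preempt(int port)        { (void)port; }

        void handle_control(ctl_msg *m)
        {
            switch (m->kind) {
            case REQ:
                if (pair_in_use[m->out_port] < 0) {            /* port free     */
                    pair_in_use[m->out_port]   = m->in_port;   /* pair in-use   */
                    pair_priority[m->out_port] = m->priority;
                    forward(m, m->out_port);                   /* open the path */
                } else if (m->priority > pair_priority[m->out_port]) {
                    send_preempt(m->out_port);   /* evict the lower-priority
                                                    holder; this REQ is serviced
                                                    when its "done" comes back  */
                } else {
                    m->kind = NACK;                            /* busy: reflect */
                    forward(m, m->in_port);
                }
                break;
            case DONE:                           /* release the pathway         */
                forward(m, pair_in_use[m->out_port]);
                pair_in_use[m->out_port] = -1;
                break;
            default:
                break;
            }
        }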

1.18 Race_NIC - race_nic.sim

This model is a CSIM description of a RACEway Network Interface Chip (NIC). The function served by this device is to translate from the local processor-bus protocol to that of the Raceway, and vice-versa. Messages at the local side are packetized into smaller size packets at the Raceway side and vice-versa.

This model consists of two main processes: one handles the local bus side and the other handles the Raceway or "remote" side.

Due to the nature of the blocking protocol of the Raceway network, the default transfer-delay mechanism of CSIM-links is not sufficient by itself to account for the link communication delays. (This is because we cannot know in general when a packet transfer actually begins moving data due to possible blocking.) Therefore, we have implemented a protocol that accurately accounts for RACEway data movement that uses two control-signal types. The first, called "REQuest", opens a pathway through the switch network, and the second, called "ACK", returns backward through the pathway after the appropriate transfer delay. As it does, the (ACK) signal closes the pathway.

Because the data transfer delay for a packet is entirely accounted for with a time-delay statement within the NIC model before returning the "ACK" message, there must be no delays on the network links. Therefore, the data-rate of the links as specified in the CSIM topology table should be set to infinity (a very high number). In the RACEway network, data begins flowing into the destination NIC once the wormhole has been opened to it. Data then continues to flow for packet_length/transfer_rate microseconds. Then the path is freed. Therefore, the time-delay in the NIC model for reflecting the "ACK" signal should be packet_length (in bytes) divided by transfer_rate, plus the per-xbar latency. This gives the time delay in uSec, since the RACEway transfer rate is 160 Bytes/uSec (=160 MBytes/sec), and the per-xbar delay is 3 clock ticks, or 75 nS (=0.075 uS). An additional fixed transfer overhead factor is added to the delay time. The constants for the RACEway transfer rate and overhead are defined as macros in the "parameters.sim" file. The macro names are: RACE_transfer_rate and RACE_transfer_ovrhd, respectively.
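
As a worked example of that delay calculation, the small program below (plain C, not CSIM model code) evaluates the formula with the constants quoted above; the overhead value and the per-hop accumulation of crossbar latency are assumptions for illustration, since the real constants live in "parameters.sim":

        #include <stdio.h>

        #define RACE_transfer_rate   160.0   /* bytes per microsecond          */
        #define RACE_xbar_latency    0.075   /* 3 clock ticks = 75 nS per xbar */
        #define RACE_transfer_ovrhd  0.5     /* HYPOTHETICAL value, in uSec    */

        /* Delay, in microseconds, before the NIC reflects the ACK. */
        double ack_delay_us(double packet_len_bytes, int xbar_hops)
        {
            return packet_len_bytes / RACE_transfer_rate
                 + xbar_hops * RACE_xbar_latency
                 + RACE_transfer_ovrhd;
        }

        int main(void)
        {
            /* a 1024-byte packet crossing two crossbars:
               1024/160 + 2*0.075 + 0.5 = 6.4 + 0.15 + 0.5 = 7.05 uSec */
            printf("ACK delay = %.2f uSec\n", ack_delay_us(1024.0, 2));
            return 0;
        }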

The local side process handles outgoing messages one at a time. It waits for a message to come from the local side. When it does, it checks to see if the NIC is in the idle state. If it is, then it sets its state to sending, forwards the message out the remote side, and goes to sleep until it is eventually reinstated when the returning "ACK" signal is received by the remote side process.

The remote side process watches for either an incoming "REQuest" or a returning "ACK" message. If a "request" is received, then the remote side process spawns a sub-thread (called EOM), that delays for the appropriate transfer time, awakens and sends the "ACK" signal back and also forwards the packet to the local_side port. This must be done by a separate thread, because the main remote_side process must always be ready to receive "PREEMPT" messages.

1.19 RacePP_XBAR - racepp_xbar.sim

This is a model of a Raceway++ crossbar-switch. The model generically handles eight ports, but is dynamically expandable to any number of ports. The model consists of a major process thread, port_handler, for each of the ports on the Raceway++ crossbar switch. The port_handler implements the messaging protocol as described in the RacePP_NIC model.

It waits for a control signal to arrive on its port. If the arriving control signal is a "REQuest", then it checks to see if the requested output port is available. If so, then it assigns the requested output port and this process's input port as being "a pair in-use", and forwards the request message out that requested output port. Otherwise, if the requested port is already in-use, then it changes the message to a NACK and reflects it back.

If a new request has a higher priority than an existing connection, then a "preempt" message is generated and sent out the forward port of the conflicting pair. Soon, a "done" (ACK) message will come back to release the conflicted ports, and when it does, then the preempting request is serviced.

If the arriving control signal is a "done" message, then it forwards the "done" message back out the port to which it was assigned, and de-assigns the port pair.

Because the data transfer delay for a packet is entirely accounted for with a time-delay statement within the NIC model before returning the "ACK" message, there must be no delays on the Raceway++ network links.  Therefore, the data-rate of the links as specified in the CSIM topology table should be set to infinity (a very high number). To further understand how this Raceway transfer model operates, read the description of the RacePP_NIC model.

The Race++ crossbar model allows alternate (redundant) output ports to be used in successive packet transfers. The first output port that it tries is the port specified by the transfer routing list; if that port is busy, it attempts to route the data to an alternate port. The use of alternate output ports is specified by device attributes as follows:
        pX_alternate = Y
where X is the output port number as used by the route_list and Y is the alternate output port number to be attempted if X is busy. Any number of available output ports may be used. For example, to use ports 1, 2 and 3 as alternate output ports, set the following device attributes (a small fallback sketch follows the list):
        p1_alternate = 2
        p2_alternate = 3
        p3_alternate = 1
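
A minimal plain-C sketch of the resulting fallback search is shown below; the alternate[] table encodes the three attributes above, and port_busy() is a hypothetical stand-in for the model's port state:

        #include <stdio.h>

        static int alternate[9] = { 0, 2, 3, 1, 0, 0, 0, 0, 0 };  /* 0 = none  */
        static int port_busy(int p) { return p == 1; }  /* pretend port 1 busy */

        /* Follow the alternate chain from the routed port until a free port
           is found or the chain wraps around to the starting port.          */
        int choose_output(int routed_port)
        {
            int p = routed_port, tries = 0;
            do {
                if (!port_busy(p))
                    return p;                         /* free port found      */
                p = alternate[p] ? alternate[p] : p;  /* try the alternate    */
            } while (p != routed_port && p != 0 && ++tries < 8);
            return routed_port;   /* everything busy: fall back to original  */
        }

        int main(void)
        {
            printf("transfer routed to port %d\n", choose_output(1)); /* -> 2 */
            return 0;
        }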
       

1.20 RacePP_NIC - racepp_nic.sim

This model is a CSIM description of the RACEway++ Network Interface Chip (NIC). The function served by this device is to translate from the local processor-bus protocol to that of the Raceway++, and vice-versa. Messages at the local side are packetized into smaller packets at the Raceway++ side and vice-versa.

This model consists of two main processes: one handles the local bus side and the other handles the Raceway++ or "remote" side.

Due to the nature of the blocking protocol of the Raceway++ network, the default transfer-delay mechanism of CSIM-links is not sufficient by itself to account for the link communication delays. (This is because we cannot know in general when a packet transfer actually begins moving data due to possible blocking.) Therefore, we have implemented a protocol that accurately accounts for RACEway++ data movement that uses two control-signal types. The first, called "REQuest", opens a pathway through the switch network, and the second, called "ACK", returns backward through the pathway after the appropriate transfer delay. As it does, the (ACK) signal closes the pathway.

Because the data transfer delay for a packet is entirely accounted for with a time-delay statement within the NIC model before returning the "ACK" message, there must be no delays on the network links. Therefore, the data-rate of the links as specified in the CSIM topology table should be set to infinity (a very high number). In the RACEway++ network, data begins flowing into the destination NIC once the wormhole has been opened to it. Data then continues to flow for packet_length/transfer_rate microseconds. Then the path is freed. Therefore, the time-delay in the NIC model for reflecting the "ACK" signal should be packet_length (in bytes) divided by transfer_rate, plus the per-xbar latency. This gives the time delay in uSec, since the RACEway++ transfer rate is 320 Bytes/uSec (=320 MBytes/sec), and the per-xbar delay is 3 clock ticks, or 37.5 nS (=0.0375 uS). An additional fixed transfer overhead factor is added to the delay time.

The local side process handles outgoing messages one at a time. It waits for a message to come from the local side. When it does, it checks to see if the NIC is in the idle state. If it is, then it sets its state to sending, forwards the message out the remote side, and goes to sleep until it is eventually reinstated when the returning "ACK" signal is received by the remote side process.

The remote side process watches for either an incoming "REQuest" or a returning "ACK" message. If a "request" is received, then the remote side process spawns a sub-thread (called EOM), that delays for the appropriate transfer time, awakens and sends the "ACK" signal back and also forwards the packet to the local_side port. This must be done by a separate thread, because the main remote_side process must always be ready to receive "PREEMPT" messages.

The constants for the RACEway++ transfer rate, overhead and XBAR latency are defined as macros in the "parameters.sim" file, as are the packetization parameters, packet size and packet header size. The macro names are: RACEPP_transfer_rate, RACEPP_transfer_ovrhd, RACEPP_xbar_latency, PACKET_SIZE and PACKET_HEADER_SIZE, respectively. They may be overridden by instance attributes called racepp_transfer_rate, racepp_transfer_ovrhd, racepp_xbar_latency, racepp_packet_size and racepp_packet_header_size, respectively.
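
The precedence rule (instance attribute over global macro) amounts to the following plain-C sketch; the string-valued lookup result is a hypothetical stand-in for the simulator's attribute mechanism:

        #include <stdlib.h>

        #define RACEPP_transfer_rate 320.0   /* global default, bytes/uSec */

        /* Return the rate in effect for one device instance: the
           racepp_transfer_rate attribute, when present, wins over the
           RACEPP_transfer_rate macro from "parameters.sim".           */
        double effective_transfer_rate(const char *instance_attr)
        {
            if (instance_attr != NULL)
                return atof(instance_attr);  /* instance attribute wins */
            return RACEPP_transfer_rate;     /* fall back to the macro  */
        }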
 

1.23 Myrinet_Switch - myrinet_xbar.sim

This is a model of a Myrinet crossbar-switch. The model generically handles 16 ports, but is expandable to any number of ports. The model consists of a major process thread, port_handler, for each of the ports on the Myrinet crossbar switch. The port_handler implements the messaging protocol as described in the LANai model.

It waits for a control signal to arrive on its port. If the arriving control signal is a "REQ", then it checks to see if the requested output port is available. If so, then it assigns the requested output port and this process's input port as being "a pair in-use", and forwards the request-message out that requested output port. Otherwise, if the requested port is already in-use, then it queues the request on the port's waiting-queue.

If the arriving control signal is an ACK message, then it forwards the ACK message back out the port to which it was assigned, de-assigns the port pair, and checks to see if there are any waiting REQ messages in either of the port's waiting-queues. If there are, then it activates them as described above (as if they just arrived).
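
The ACK handling, including the revival of a queued request, can be sketched in plain C as follows; the queue layout and helper names are illustrative, not the model's real data structures:

        typedef struct req { struct req *next; int in_port, out_port; } req;

        static req *waiting[16];        /* FIFO of queued REQs per output port */
        static int  paired_with[16];    /* out-port -> paired in-port, -1 free */

        static void forward_ack(int port)  { (void)port; }
        static void handle_request(req *r) { (void)r; } /* as if newly arrived */

        void on_ack(int out_port)
        {
            forward_ack(paired_with[out_port]);  /* ACK goes back up the path  */
            paired_with[out_port] = -1;          /* de-assign the port pair    */
            if (waiting[out_port] != NULL) {     /* revive oldest queued REQ   */
                req *r = waiting[out_port];
                waiting[out_port] = r->next;
                handle_request(r);
            }
        }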

1.24 LANai Interface Chip - lanai.sim

This model is a CSIM description of the LANai interface chip. The function served by this device is to translate from the local processor bus protocol to that of the Myrinet, and vice-versa.

This model consists of two main processes: one handles the local processor bus side and the other handles the Myrinet or "remote" side. The remote side contains two states and uses a second thread (remote_side_b) to represent the second state.

Due to the nature of the blocking protocol of the Myrinet network, the inherent transfer-delay mechanism of CSIM is not sufficient by itself to account for the link communication delays. (This is because we cannot know in general when a packet transfer actually begins moving data due to possible blocking.) Therefore, we have implemented a protocol that accurately accounts for Myrinet data movement that uses two control-signal types. The first, called "REQ", opens a pathway through the switch network, and the second, called "ACK", returns backward through the pathway after the appropriate transfer delay. As it does, the ACK signal closes the pathway.

Because the data transfer delay for a packet is entirely accounted for with a time-delay statement within the LANai model before returning the ACK message, there must be no delays on the network links. Therefore, the data-rate of the links as specified in the CSIM topology table should be set to infinity (a very high number). In the Myrinet network, data begins flowing into the destination LANai once the wormhole has been opened to it. Data then continues to flow for packet_length/transfer_rate microseconds. Then the path is freed. An additional fixed transfer overhead factor is added to the delay time. The constants for the Myrinet transfer rate and overhead are defined as macros in the "parameters.sim" file. The macro names are: Mnet_transfer_rate and Mnet_transfer_ovrhd, respectively.

The LANai supports full duplex communication on the Myrinet side. The local side process handles outgoing messages one at a time. It waits for a message to come from the local side. When it does, it simply forwards it out the remote side, sets its state to pending_send, and goes to sleep until it is eventually reinstated when the returning ACK signal is received by the remote side process.

The remote side process watches for either an incoming REQ or a returning ACK message. If a REQ is received, then the remote side process spawns a sub-thread (called remote_side_b), that delays for the appropriate transfer time, awakens and sends the ACK signal back and also forwards the packet to the local_side port. This must be done by a separate thread, because the main remote_side process must always be ready to receive ACK messages in response to outgoing packets from the local side. When an ACK signal is encountered, the remote_side process simply re-triggers the local_side process which was pending.
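
The spawn-and-delay pattern can be rendered with ordinary threads, as in the plain-C sketch below; pthreads and usleep() stand in for the simulator's threading and time-delay primitives, so this mirrors only the structure, not the CSIM implementation:

        #include <pthread.h>
        #include <stdlib.h>
        #include <unistd.h>

        typedef struct { double delay_us; } xfer;

        /* The helper thread sleeps out the modelled transfer time, then
           would send the ACK back and forward the packet to local_side. */
        static void *remote_side_b(void *arg)
        {
            xfer *x = (xfer *)arg;
            usleep((useconds_t)x->delay_us);
            /* ... send ACK backward; forward packet to local_side ...   */
            free(x);
            return NULL;
        }

        /* Called by the main remote_side loop on each incoming REQ; by
           spawning, the loop stays free to field ACKs for outgoing
           packets in the meantime.                                      */
        void on_request(double delay_us)
        {
            pthread_t t;
            xfer *x = malloc(sizeof *x);
            x->delay_us = delay_us;
            pthread_create(&t, NULL, remote_side_b, x);
            pthread_detach(t);
        }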
 


(Questions, Comments, & Suggestions: cstrasbe@csim.com)