ECE 4750 Section 12: Networks

Author: Christopher Batten
Date: November 18, 2022

Table of Contents

Network Overview
Implementing and Testing the Route Unit
Implementing and Testing the Switch Unit
Implementing and Testing the Router

This discussion section serves as a basic introduction to networks which will help students implement a simple ring network for lab 4. You should log into the ecelinux servers using the remote access option of your choice and then source the setup script.

% source setup-ece4750.sh
% mkdir -p $HOME/ece4750
% cd $HOME/ece4750
% git clone git@github.com:cornell-ece4750/ece4750-sec12-net sec12
% cd sec12
% TOPDIR=$PWD
% mkdir $TOPDIR/build

Network Overview

In order to implement the multicore processor shown below, we will need to implement three networks: a cache network that interconnects each processor’s data memory interface to each of the four data cache banks, a memory network that interconnects each data cache’s memory interface to main memory, and a memory network that interconnects each instruction cache’s memory interface to main memory.

The cache network actually includes two networks: one that enables processors to send cache requests to the caches and one that enables caches to send cache responses back to the processors. We also need adapters at the network interfaces to convert to/from memory messages and network messages.

The memory networks also include two networks each. The primary difference is that there is only a single destination.

More generally, a network enables sending messages from a set of input terminals to a set of output terminals. Bus and crossbar networks use long global wires that every input terminal can write and every output terminal can read.

Bus topologies are simple but offer low throughput. Crossbar topologies enable higher throughput, but are also more expensive in terms of area and energy. Scalable networks use a set of smaller routers interconnected by shorter channels to create a network topology. Examples include butterfly and torus topologies. In lab 4, you will be implementing a simple 1D torus topology (i.e., a four node ring) which only uses nearest neighbor communication.

In addition to the network topology, a network microarchitecture will also need to implement a network routing algorithm (what path should we take to get from a given input terminal to a given output terminal?) and a network arbitration algorithm (how should we allocate resources like ports and buffers?).

We can use zero load latency and ideal terminal throughput to analyze the first-order performance of a network. The zero load latency is the number of cycles it takes for a message to go from an input terminal to an output terminal when there is no contention from other messages, assuming a specific traffic pattern. The ideal terminal throughput is the maximum throughput an input terminal can sustain assuming a specific traffic pattern, a perfect routing algorithm, and a perfect flow control scheme. We will analyze the zero load latency and ideal terminal throughput of our simple four node ring network topology together in the discussion section.
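As a rough illustration of this kind of first-order analysis, the following Python sketch enumerates every source/destination pair on a four node ring and reports the average hop count and the load on the busiest channel. Several things here are assumptions that the handout does not specify: a bidirectional ring, uniform random traffic (including messages a node sends to itself), shortest-path routing with ties broken clockwise, and the hypothetical helper names `route` and `analyze`. It is a sketch for building intuition, not the analysis done in the discussion section.

```python
from fractions import Fraction
from itertools import product

NUM_NODES = 4  # four node ring, as in lab 4


def route(src, dest, n=NUM_NODES):
    """Return the directed channels (from_node, to_node) on a shortest path.

    Assumption: ties between the two directions are broken clockwise.
    """
    cw_hops = (dest - src) % n
    ccw_hops = (src - dest) % n
    step = 1 if cw_hops <= ccw_hops else -1
    channels = []
    node = src
    while node != dest:
        next_node = (node + step) % n
        channels.append((node, next_node))
        node = next_node
    return channels


def analyze(n=NUM_NODES):
    """Return (average hop count, maximum channel load) for uniform random traffic."""
    pairs = list(product(range(n), repeat=2))
    total_hops = sum(len(route(s, d, n)) for s, d in pairs)
    avg_hops = Fraction(total_hops, len(pairs))

    # Assumption: each source sends 1/n of its messages to each destination.
    load = {}
    for s, d in pairs:
        for channel in route(s, d, n):
            load[channel] = load.get(channel, Fraction(0)) + Fraction(1, n)
    return avg_hops, max(load.values())


avg_hops, max_load = analyze()
print(avg_hops, max_load)
```

Under these assumptions the average path is one hop and the busiest channel carries 3/4 of a message for every message each terminal injects. The average hop count feeds into the zero load latency once a per-hop latency is chosen, and the maximum channel load bounds the ideal terminal throughput, since the busiest channel saturates first.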

Each router in our network will have three input streams and three output streams. All streams are latency insensitive using the val/rdy microprotocol. We will use the following router microarchitecture which includes three input queues, three route units, and three switch units. The route units implement the routing algorithm and determine which output stream a given input message should be sent to, while the switch units implement the arbitration algorithm and determine which input stream can send a message to an output stream on any given cycle.

All of our networks will work with network messages (also called packets) that use the following format.

 43  42 41  40 39    32 31            0
+------+------+--------+---------------+
| src  | dest | opaque |    payload    |
+------+------+--------+---------------+

This network message is shown with a payload of 32 bits, but our networks will actually be parameterized by the payload size so we can use a single network implementation in the cache request network, cache response network, memory request network, and memory response network.
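To make the field layout concrete, here is a small Python sketch that packs and unpacks a message using the bit positions in the diagram above. The function names `pack_msg` and `unpack_msg` are hypothetical, and the sketch assumes that the header fields stay directly above the payload when the payload size changes, which the handout does not state explicitly.

```python
SRC_NBITS = 2     # bits 43:42 in the 44-bit format
DEST_NBITS = 2    # bits 41:40
OPAQUE_NBITS = 8  # bits 39:32


def pack_msg(src, dest, opaque, payload, payload_nbits=32):
    """Pack the four fields into one integer, src in the most significant bits."""
    fields = ((src, SRC_NBITS), (dest, DEST_NBITS),
              (opaque, OPAQUE_NBITS), (payload, payload_nbits))
    msg = 0
    for value, nbits in fields:
        if not 0 <= value < (1 << nbits):
            raise ValueError(f"value {value} does not fit in {nbits} bits")
        msg = (msg << nbits) | value
    return msg


def unpack_msg(msg, payload_nbits=32):
    """Split a packed message back into (src, dest, opaque, payload)."""
    payload = msg & ((1 << payload_nbits) - 1)
    msg >>= payload_nbits
    opaque = msg & ((1 << OPAQUE_NBITS) - 1)
    msg >>= OPAQUE_NBITS
    dest = msg & ((1 << DEST_NBITS) - 1)
    msg >>= DEST_NBITS
    src = msg & ((1 << SRC_NBITS) - 1)
    return src, dest, opaque, payload


msg = pack_msg(src=1, dest=2, opaque=0x03, payload=0xDEADBEEF)
print(hex(msg), unpack_msg(msg))
```

With the default 32-bit payload the packed value is 44 bits wide, which matches the default `p_msg_nbits` used by the modules in this section.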

Implementing and Testing the Route Unit

We will start by implementing a very basic route unit. Take a look at the route unit in lab4_sys/NetRouterRouteUnit.v. The interface looks like this:

module lab4_sys_NetRouterRouteUnit
#(
  parameter p_msg_nbits = 44
)
(
  input  logic                   clk,
  input  logic                   reset,

  // Router id (which router is this in the network?)

  input  logic     [1:0]         router_id,

  // Input stream

  input  logic [p_msg_nbits-1:0] istream_msg,
  input  logic                   istream_val,
  output logic                   istream_rdy,

  // Output streams

  output logic [p_msg_nbits-1:0] ostream_msg [3],
  output logic                   ostream_val [3],
  input  logic                   ostream_rdy [3]
);

The route unit has one input stream interface and three output stream interfaces. Notice the [3] at the end of the output stream ports. This is new Verilog syntax for modeling an array of ports. The route unit we will implement in the discussion section will simply use the destination field of the network message to determine the output port. For the ring network, you will need to implement a more complicated route unit that picks an output port based on your desired routing algorithm and the current router’s id.

Go ahead and complete the implementation of the route unit. You want to first check to make sure the input stream is valid, check the destination field, and use the destination field to set the appropriate output stream valid signal and input stream ready signal. Here is a sketch of the logic you will need.

if ( istream_val ) begin
  if ( istream_msg_hdr.dest == 0 ) begin
    istream_rdy = ostream_rdy[0];
    ostream_val[0] = 1;
  end
  else if ( istream_msg_hdr.dest == 1 ) begin
    istream_rdy = ostream_rdy[1];
    ostream_val[1] = 1;
  end
  else if ( istream_msg_hdr.dest == 2 ) begin
    istream_rdy = ostream_rdy[2];
    ostream_val[2] = 1;
  end
end

You can also directly use the destination field to index into the output stream val/rdy port arrays. Once you have finished you can test your route unit like this:

% cd $TOPDIR/build
% pytest ../lab4_sys/test/NetRouterRouteUnit_test.py
% pytest ../lab4_sys/test/NetRouterRouteUnit_test.py -k stream_to_all] -s

Use the -k and -s command line options to view the line traces for specific test cases. Here is what the line trace looks like.

    src       d    sink0  sink1  sink2
 1r .      > ( ) >       |      |
 2r .      > ( ) >       |      |
    .      > ( ) >       |      |
    0>0:00 > (0) > 0>0:00|      |
    0>1:40 > (1) >       |0>1:40|
    0>2:80 > (2) >       |      |0>2:80
    0>0:01 > (0) > 0>0:01|      |
    0>1:41 > (1) >       |0>1:41|
    0>2:81 > (2) >       |      |0>2:81
    0>0:02 > (0) > 0>0:02|      |
    0>1:42 > (1) >       |0>1:42|
    0>2:82 > (2) >       |      |0>2:82
    0>0:03 > (0) > 0>0:03|      |
    0>1:43 > (1) >       |0>1:43|
    0>2:83 > (2) >       |      |0>2:83
    0>0:04 > (0) > 0>0:04|      |
    0>1:44 > (1) >       |0>1:44|

You can see the network messages are being sent to each of the three output ports based on the destination field. The d column indicates the destination field.
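The behavior visible in this trace can be captured in a short software model. The Python sketch below is only a behavioral illustration of the simple route unit, not the Verilog you need to write; the function name `route_unit` is hypothetical, and treating a destination of 3 as "no output selected" is an assumption, since the route unit has only three output streams.

```python
NUM_OUTPUTS = 3


def route_unit(istream_val, dest, ostream_rdy):
    """Model one cycle of the simple route unit.

    Returns (istream_rdy, ostream_val), where ostream_val is a list with
    one entry per output stream.
    """
    ostream_val = [False] * NUM_OUTPUTS
    istream_rdy = False
    # Assumption: a destination outside 0..2 selects no output stream.
    if istream_val and 0 <= dest < NUM_OUTPUTS:
        # Use the destination field directly as the output port index.
        ostream_val[dest] = True
        istream_rdy = ostream_rdy[dest]
    return istream_rdy, ostream_val


# A valid message for destination 1 when only output 1 is ready.
print(route_unit(True, 1, [False, True, False]))
```

Indexing by the destination field is equivalent to the if/else chain in the sketch: exactly one output valid signal is raised, and the input stream is ready only when the selected output stream is ready.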

Implementing and Testing the Switch Unit

Next we need to implement a very basic switch unit. Take a look at the switch unit in lab4_sys/NetRouterSwitchUnit.v. The interface looks like this:

module lab4_sys_NetRouterSwitchUnit
#(
  parameter p_msg_nbits = 44
)
(
  input  logic                   clk,
  input  logic                   reset,

  // Input streams

  input  logic [p_msg_nbits-1:0] istream_msg [3],
  input  logic                   istream_val [3],
  output logic                   istream_rdy [3],

  // Output stream

  output logic [p_msg_nbits-1:0] ostream_msg,
  output logic                   ostream_val,
  input  logic                   ostream_rdy
);

The switch unit has three input stream interfaces and one output stream interface. Again, notice the [3] at the end of the input stream ports, which is used for modeling an array of ports. The switch unit we will implement in the discussion section will simply use a fixed priority. If multiple input ports want to use a given output port, we give the highest priority to input stream 1, then input stream 2, and the lowest priority to input stream 0. We choose this priority because when we use this switch unit in the router we ideally want to give higher priority to messages already in the network (i.e., input streams 1 and 2) over messages that are waiting at the input terminal (i.e., input stream 0). This simple switch unit will actually work in the ring network, but it could perform poorly since it does not attempt to provide any kind of fair arbitration across the input ports.

Go ahead and complete the implementation of the switch unit. You want to check each of the input stream valid signals in priority order and as soon as you find a valid input stream set the output stream valid bit, output stream message, and input stream ready signal appropriately. Here is a sketch of the logic you will need.

if ( istream_val[1] ) begin
  selected_input = 1;
  istream_rdy[1] = ostream_rdy;
  ostream_val    = 1;
  ostream_msg    = istream_msg[1];
end
else if ( istream_val[2] ) begin
  selected_input = 2;
  istream_rdy[2] = ostream_rdy;
  ostream_val    = 1;
  ostream_msg    = istream_msg[2];
end
else if ( istream_val[0] ) begin
  selected_input = 0;
  istream_rdy[0] = ostream_rdy;
  ostream_val    = 1;
  ostream_msg    = istream_msg[0];
end

You can also make this logic more succinct by first determining the selected input based on the fixed priority and then using the selected input signal to directly index into the input stream val/rdy port arrays.

Once you have finished you can test your switch unit like this:

% cd $TOPDIR/build
% pytest ../lab4_sys/test/NetRouterSwitchUnit_test.py
% pytest ../lab4_sys/test/NetRouterSwitchUnit_test.py -k stream_from_all] -s

Use the -k and -s command line options to view the line traces for specific test cases. Here is what the line trace looks like.

    src0   src1   src2      a    sink
 1r .     |.     |.      > ( ) >
 2r .     |.     |.      > ( ) >
    .     |.     |.      > ( ) >
    #     |1>0:00|#      > (#) > 1>0:00
    #     |1>0:01|#      > (#) > 1>0:01
    #     |1>0:02|#      > (#) > 1>0:02
    #     |1>0:03|#      > (#) > 1>0:03
    #     |1>0:04|#      > (#) > 1>0:04
    #     |1>0:05|#      > (#) > 1>0:05
    #     |1>0:06|#      > (#) > 1>0:06
    #     |1>0:07|#      > (#) > 1>0:07
    #     |1>0:08|#      > (#) > 1>0:08
    #     |1>0:09|#      > (#) > 1>0:09
    #     |1>0:0a|#      > (#) > 1>0:0a
    #     |1>0:0b|#      > (#) > 1>0:0b
    #     |1>0:0c|#      > (#) > 1>0:0c
    #     |1>0:0d|#      > (#) > 1>0:0d
    #     |1>0:0e|#      > (#) > 1>0:0e
    #     |1>0:0f|#      > (#) > 1>0:0f
    #     |.     |2>0:00 > (:) > 2>0:00
    #     |.     |2>0:01 > (:) > 2>0:01
    #     |.     |2>0:02 > (:) > 2>0:02
    #     |.     |2>0:03 > (:) > 2>0:03
    #     |.     |2>0:04 > (:) > 2>0:04
    #     |.     |2>0:05 > (:) > 2>0:05
    #     |.     |2>0:06 > (:) > 2>0:06
    #     |.     |2>0:07 > (:) > 2>0:07
    #     |.     |2>0:08 > (:) > 2>0:08
    #     |.     |2>0:09 > (:) > 2>0:09
    #     |.     |2>0:0a > (:) > 2>0:0a
    #     |.     |2>0:0b > (:) > 2>0:0b
    #     |.     |2>0:0c > (:) > 2>0:0c
    #     |.     |2>0:0d > (:) > 2>0:0d
    #     |.     |2>0:0e > (:) > 2>0:0e
    #     |.     |2>0:0f > (:) > 2>0:0f
    0>0:00|.     |.      > (.) > 0>0:00
    0>0:01|.     |.      > (.) > 0>0:01
    0>0:02|.     |.      > (.) > 0>0:02
    0>0:03|.     |.      > (.) > 0>0:03
    0>0:04|.     |.      > (.) > 0>0:04

You can see input port 1 has the highest priority, so input port 2 does not have a chance to send any messages until input port 1 is finished. Input port 0 has the lowest priority, so it goes last. The a column indicates how many input ports want to send messages to this switch unit.

. = one input port has a valid input message
: = two input ports have valid input messages
# = three input ports have valid input messages

So # indicates there is congestion at this switch unit.
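The fixed-priority arbitration and the a column can also be modeled in a few lines of Python. This is a behavioral sketch rather than the Verilog implementation; the names `switch_unit` and `congestion_symbol` are hypothetical, and the blank symbol for "no valid inputs" is inferred from the first lines of the trace.

```python
PRIORITY = (1, 2, 0)  # highest to lowest priority input stream
CONGESTION_SYMBOL = {0: " ", 1: ".", 2: ":", 3: "#"}


def switch_unit(istream_val, istream_msg, ostream_rdy):
    """Model one cycle of the fixed-priority switch unit.

    Returns (selected_input, istream_rdy, ostream_val, ostream_msg).
    selected_input and ostream_msg are None when no input is valid.
    """
    istream_rdy = [False] * 3
    # Pick the first valid input stream in priority order.
    selected = next((p for p in PRIORITY if istream_val[p]), None)
    if selected is None:
        return None, istream_rdy, False, None
    # Only the selected input sees the ready signal of the output stream.
    istream_rdy[selected] = ostream_rdy
    return selected, istream_rdy, True, istream_msg[selected]


def congestion_symbol(istream_val):
    """Return the character shown in the a column for this cycle."""
    return CONGESTION_SYMBOL[sum(bool(v) for v in istream_val)]


vals = [True, False, True]
print(switch_unit(vals, ["m0", "m1", "m2"], True), congestion_symbol(vals))
```

Computing the selected input first and then indexing with it keeps the model short, which mirrors the more succinct Verilog structure suggested earlier in this section.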

Implementing and Testing the Router

Now that we have implemented and tested the route unit and switch unit, we can compose them with the input queues to implement the three-port router. Take a look at the router in lab4_sys/NetRouter.v. The interface looks like this:

module lab4_sys_NetRouter
#(
  parameter p_msg_nbits = 44
)
(
  input  logic                   clk,
  input  logic                   reset,

  // Router id (which router is this in the network?)

  input  logic     [1:0]         router_id,

  // Input streams

  input  logic [p_msg_nbits-1:0] istream_msg [3],
  input  logic                   istream_val [3],
  output logic                   istream_rdy [3],

  // Output streams

  output logic [p_msg_nbits-1:0] ostream_msg [3],
  output logic                   ostream_val [3],
  input  logic                   ostream_rdy [3]
);

The router has three input streams and three output streams. We have provided the composition for the router for you. Take a look at the implementation and notice the use of direct assignment to port arrays when instantiating the switch units:

lab4_sys_NetRouterSwitchUnit#(44) sunit0
(
  .clk          (clk),
  .reset        (reset),

  .istream_msg  ('{ runit0_ostream_msg[0], runit1_ostream_msg[0], runit2_ostream_msg[0] }),
  .istream_val  ('{ runit0_ostream_val[0], runit1_ostream_val[0], runit2_ostream_val[0] }),
  .istream_rdy  ('{ runit0_ostream_rdy[0], runit1_ostream_rdy[0], runit2_ostream_rdy[0] }),

  .ostream_msg  (ostream_msg[0]),
  .ostream_val  (ostream_val[0]),
  .ostream_rdy  (ostream_rdy[0])
);

The '{} syntax is similar to the standard Verilog concatenation operator {}, but the extra apostrophe indicates that we are creating an array of signals, not a single bit vector. This compact code takes the first output stream from each of the three route units and connects them to the first switch unit. You can test the router like this:

% cd $TOPDIR/build
% pytest ../lab4_sys/test/NetRouter_test.py
% pytest ../lab4_sys/test/NetRouter_test.py -k stream_all_to_dest0] -s

Use the -k and -s command line options to view the line traces for specific test cases. Here is what the line trace looks like.

    src0   src1   src2      qqq sss    sink0  sink1  sink2
 1r       |      |       > (   |   ) >       |.     |.
 2r       |      |       > (   |   ) >       |.     |.
          |      |       > (   |   ) >       |.     |.
    0>0:00|1>0:00|2>0:00 > (   |   ) >       |      |
    0>0:01|1>0:01|2>0:01 > (...|#  ) > 1>0:00|      |
    0>0:02|1>0:02|2>0:02 > (:.:|#  ) > 1>0:01|      |
    0>0:03|1>0:03|2>0:03 > (*.*|#  ) > 1>0:02|      |
    #     |1>0:04|#      > (#.#|#  ) > 1>0:03|      |
    #     |1>0:05|#      > (#.#|#  ) > 1>0:04|      |
    #     |1>0:06|#      > (#.#|#  ) > 1>0:05|      |
    #     |1>0:07|#      > (#.#|#  ) > 1>0:06|      |
    #     |1>0:08|#      > (#.#|#  ) > 1>0:07|      |
    #     |1>0:09|#      > (#.#|#  ) > 1>0:08|      |
    #     |1>0:0a|#      > (#.#|#  ) > 1>0:09|      |
    #     |1>0:0b|#      > (#.#|#  ) > 1>0:0a|      |
    #     |1>0:0c|#      > (#.#|#  ) > 1>0:0b|      |
    #     |1>0:0d|#      > (#.#|#  ) > 1>0:0c|      |
    #     |1>0:0e|#      > (#.#|#  ) > 1>0:0d|      |
    #     |1>0:0f|#      > (#.#|#  ) > 1>0:0e|      |
    #     |      |#      > (#.#|#  ) > 1>0:0f|      |
    #     |      |#      > (# #|:  ) > 2>0:00|      |
    #     |      |2>0:04 > (# *|:  ) > 2>0:01|      |
    #     |      |2>0:05 > (# *|:  ) > 2>0:02|      |
    #     |      |2>0:06 > (# *|:  ) > 2>0:03|      |
    #     |      |2>0:07 > (# *|:  ) > 2>0:04|      |
    #     |      |2>0:08 > (# *|:  ) > 2>0:05|      |
    #     |      |2>0:09 > (# *|:  ) > 2>0:06|      |
    #     |      |2>0:0a > (# *|:  ) > 2>0:07|      |
    #     |      |2>0:0b > (# *|:  ) > 2>0:08|      |
    #     |      |2>0:0c > (# *|:  ) > 2>0:09|      |
    #     |      |2>0:0d > (# *|:  ) > 2>0:0a|      |
    #     |      |2>0:0e > (# *|:  ) > 2>0:0b|      |
    #     |      |2>0:0f > (# *|:  ) > 2>0:0c|      |
    #     |      |       > (# *|:  ) > 2>0:0d|      |
    #     |      |       > (# :|:  ) > 2>0:0e|      |
    #     |      |       > (# .|:  ) > 2>0:0f|      |
    #     |      |       > (#  |.  ) > 0>0:00|      |
    0>0:04|      |       > (*  |.  ) > 0>0:01|      |
    0>0:05|      |       > (*  |.  ) > 0>0:02|      |
    0>0:06|      |       > (*  |.  ) > 0>0:03|      |
    0>0:07|      |       > (*  |.  ) > 0>0:04|      |
    0>0:08|      |       > (*  |.  ) > 0>0:05|      |
    0>0:09|      |       > (*  |.  ) > 0>0:06|      |
    0>0:0a|      |       > (*  |.  ) > 0>0:07|      |
    0>0:0b|      |       > (*  |.  ) > 0>0:08|      |
    0>0:0c|      |       > (*  |.  ) > 0>0:09|      |
    0>0:0d|      |       > (*  |.  ) > 0>0:0a|      |
    0>0:0e|      |       > (*  |.  ) > 0>0:0b|      |
    0>0:0f|      |       > (*  |.  ) > 0>0:0c|      |
          |      |       > (*  |.  ) > 0>0:0d|      |
          |      |       > (:  |.  ) > 0>0:0e|      |
          |      |       > (.  |.  ) > 0>0:0f|      |

You can see that input port 1 gets to send all of its messages first since it is given the highest priority, and then input port 2 is able to start sending its messages, followed by input port 0. The q columns indicate how many messages are in each of the three input queues:

. = one message in queue
: = two messages in queue
* = three messages in queue
# = four messages in queue

The s columns indicate the congestion at each of the three switch units.
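The delivery order in this trace can be reproduced with a very coarse Python model of the router in which all three sources send every message to output 0. The sketch makes several simplifying assumptions: the queue depth of four is inferred from the legend above, only the switch unit for output 0 is modeled, and cycle-level timing of the real queues is ignored. The function name `simulate` is hypothetical.

```python
from collections import deque

PRIORITY = (1, 2, 0)  # fixed priority of the switch unit
QUEUE_DEPTH = 4       # assumption inferred from "# = four messages in queue"


def simulate(num_msgs=16):
    """Return the order in which (src, seq) messages reach output 0."""
    pending = [deque((src, i) for i in range(num_msgs)) for src in range(3)]
    queues = [deque() for _ in range(3)]
    delivered = []
    while any(pending) or any(queues):
        # Switch unit: dequeue from the highest priority non-empty queue.
        for port in PRIORITY:
            if queues[port]:
                delivered.append(queues[port].popleft())
                break
        # Sources: each enqueues one message if its input queue has space.
        for port in range(3):
            if pending[port] and len(queues[port]) < QUEUE_DEPTH:
                queues[port].append(pending[port].popleft())
    return delivered


order = simulate()
print(order[:3], order[16:19], order[32:35])
```

Even this rough model shows the starvation effect of fixed priority: every message from input port 1 is delivered before any message from input port 2, and input port 0 waits until both are done.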