# ECE 4750 Section 1: RTL Design with Verilog
Table of Contents
- Verilog RTL for Single-Cycle Multiplier
- Verilog-Based Ad-Hoc Test for Single-Cycle Multiplier
- Python-Based Ad-Hoc Test for Single-Cycle Multiplier
- Verilog RTL for Single-Cycle Multiplier with Valid Bit
- Verilog RTL for Single-Cycle Multiplier with Streaming Interface
This discussion section serves as a gentle introduction to the basics of Verilog RTL design. You should start by logging into the ecelinux servers using the remote access option of your choice and then source the setup script.
% source setup-ece4750.sh
% mkdir -p $HOME/ece4750
% cd $HOME/ece4750
% git clone git@github.com:cornell-ece4750/2025F.git
% cd 2025F/sections/section1
% TOPDIR=$PWD
Verilog RTL for Single-Cycle Multiplier
We will start by implementing a simple single-cycle multiplier. Whenever implementing hardware, we always like to start with some kind of diagram. It could be a block diagram, datapath diagram, or finite-state-machine diagram. Here is a block diagram for our single-cycle multiplier. Notice how we are using registered inputs. In this course, if we want to include registers in a block we usually prefer registered inputs instead of registered outputs.
Here is the interface for our single-cycle multiplier.
module imul_IntMulScycleV1
(
  input  logic        clk,
  input  logic        reset,
  input  logic [31:0] in0,
  input  logic [31:0] in1,
  output logic [31:0] out
);
Our single-cycle multiplier takes two 32-bit input values and produces a 32-bit output value. Notice our coding conventions. We prefix all Verilog module names with the corresponding directory path, we use CamelCase for Verilog module names, and we align all port names. We can implement this single-cycle multiplier flat (i.e., directly use behavioral modeling without instantiating any child modules) or structurally (i.e., instantiate child modules). Here is what a flat implementation might look like:
//----------------------------------------------------------------------
// Input Registers (sequential logic)
//----------------------------------------------------------------------
logic [31:0] in0_reg;
logic [31:0] in1_reg;
always @( posedge clk ) begin
  if ( reset ) begin
    in0_reg <= 32'b0;
    in1_reg <= 32'b0;
  end
  else begin
    in0_reg <= in0;
    in1_reg <= in1;
  end
end
//----------------------------------------------------------------------
// Multiplication Logic (combinational logic)
//----------------------------------------------------------------------
always @(*) begin
  out = in0_reg * in1_reg;
end
Note that we are using an always @(posedge clk) to model sequential logic and an always @(*) to model combinational logic. Always be very explicit about what part of your design is sequential and what part is combinational. Always use non-blocking assignments (<=) in an always @(posedge clk) and always use blocking assignments (=) in an always @(*). At least when getting started, try to avoid including too much combinational logic in your sequential blocks. You can also include simple combinational logic directly in an assign statement. So we could replace the always @(*) with the following:
assign out = in0_reg * in1_reg;
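The rule about non-blocking assignments is easier to appreciate with a small experiment outside of Verilog. The following is a minimal Python sketch (not part of the course code) of two registers in a chain, a -> b -> c. The function names tick_nonblocking and tick_blocking are hypothetical and only mimic the two assignment styles.

```python
# Sketch of why sequential blocks use non-blocking assignments.
# Models two registers in a chain: a -> b -> c.

def tick_nonblocking(state, a):
    """All right-hand sides are sampled first, then every register updates."""
    next_b = a            # b <= a
    next_c = state["b"]   # c <= b  (uses the OLD value of b)
    return {"b": next_b, "c": next_c}

def tick_blocking(state, a):
    """Each assignment takes effect immediately, in program order."""
    b = a                 # b = a
    c = b                 # c = b  (uses the NEW value of b)
    return {"b": b, "c": c}

nb = {"b": 0, "c": 0}
bl = {"b": 0, "c": 0}
for a in [1, 2, 3]:       # one loop iteration per clock edge
    nb = tick_nonblocking(nb, a)
    bl = tick_blocking(bl, a)

print(nb)  # {'b': 3, 'c': 2}
print(bl)  # {'b': 3, 'c': 3}
```

The non-blocking version behaves like two back-to-back registers (c lags b by one cycle), while the blocking version collapses the chain so that c and b always hold the same value.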
Here is how we might implement the two registers structurally by instantiating the vc_ResetReg component which is provided in vc/regs.v.
logic [31:0] in0_reg;
vc_ResetReg#(32,0) in0_reg_
(
  .clk   (clk),
  .reset (reset),
  .d     (in0),
  .q     (in0_reg)
);
logic [31:0] in1_reg;
vc_ResetReg#(32,0) in1_reg_
(
  .clk   (clk),
  .reset (reset),
  .d     (in1),
  .q     (in1_reg)
);
We specify the register bitwidth and reset value as parameters using the #() syntax. To use the vc_ResetReg module you will need to add the following to the top of your Verilog RTL file:
`include "vc/regs.v"
Go ahead and implement our first single-cycle multiplier in imul/IntMulScycleV1.v. You can implement the registers either flat or structurally, and you can implement the actual multiplication logic either using an always @(*) or an assign statement.
Verilog-Based Ad-Hoc Test for Single-Cycle Multiplier
Now that we have implemented our single-cycle multiplier, we need to test it. Let’s start by trying the Verilog-based ad-hoc test harness located in imul/imul-v1-adhoc-test.v. Take a few minutes to look at this test harness.
`include "imul/IntMulScycleV1.v"
module top;
  // Clocking
  logic clk = 1;
  always #5 clk = ~clk;
  // Instantiate the design under test
  logic        reset = 1;
  logic [31:0] in0;
  logic [31:0] in1;
  logic [31:0] out;
  // Instantiate the multiplier
  imul_IntMulScycleV1 imul
  (
    .clk   (clk),
    .reset (reset),
    .in0   (in0),
    .in1   (in1),
    .out   (out)
  );
  // Simulate the integer multiplier
  initial begin
    // Dump waveforms
    $dumpfile("imul-v1-adhoc-test.vcd");
    $dumpvars;
    // Reset
    #11;
    reset = 1'b0;
    // Cycle 1
    in0 = 32'h02;
    in1 = 32'h03;
    #10;
    $display( " cycle = 1: in0 = %x, in1 = %x, out = %x", in0, in1, out );
    ...
    $finish;
  end
endmodule
The test harness includes some logic to generate a clock, instantiates the design under test, and uses an initial block to set the input signals and display the output signals. Icarus Verilog is an open-source Verilog simulator. You can compile and run this test harness along with our single-cycle multiplier using Icarus Verilog as follows:
% mkdir -p $TOPDIR/build
% cd $TOPDIR/build
% iverilog -g2012 -I .. -o imul-v1-adhoc-test ../imul/imul-v1-adhoc-test.v
% ./imul-v1-adhoc-test
Notice how we are building our simulator in a separate build directory to keep generated files separate from our source files.
Python-Based Ad-Hoc Test for Single-Cycle Multiplier
The Icarus Verilog simulator is quite slow, only supports a pretty old version of Verilog, and does not produce terribly helpful error messages. Furthermore, writing test benches in Verilog is very tedious and not particularly fun. In this course, we will be using the Verilator simulator which is very fast, supports some of SystemVerilog, and produces much better error messages. In addition we will be using Python to write all of our test benches. Python is fun!
We need to write a PyMTL3 wrapper for every component we want to test using Python. Take a look at the PyMTL3 wrapper for our single-cycle multiplier located in imul/IntMulScycleV1.py:
from pymtl3 import *
from pymtl3.passes.backends.verilog import *
class IntMulScycleV1( VerilogPlaceholder, Component ):
  def construct( s ):
    s.in0 = InPort ( 32 )
    s.in1 = InPort ( 32 )
    s.out = OutPort( 32 )
The PyMTL3 wrapper is a Python class that inherits from the VerilogPlaceholder and Component base classes. It includes one construct method which instantiates all of the ports. The port names and bitwidths in the wrapper should exactly match the port names and bitwidths in the Verilog module. PyMTL3 can figure out if there is a clk and reset port automatically.
Now let’s look at the Python test bench located in imul/imul-v1-adhoc-test.py:
from sys import argv
from pymtl3 import *
from pymtl3.passes.backends.verilog import *
from IntMulScycleV1 import IntMulScycleV1
# Get list of input values from command line
in0_values = [ int(x,0) for x in argv[1::2] ]
in1_values = [ int(x,0) for x in argv[2::2] ]
# Create and elaborate the model
model = IntMulScycleV1()
model.elaborate()
# Apply the Verilog import passes and the default pass group
model.apply( VerilogPlaceholderPass() )
model = VerilogTranslationImportPass()( model )
model.apply( DefaultPassGroup(linetrace=True,textwave=True,vcdwave="imul-v1-adhoc-test") )
# Reset simulator
model.sim_reset()
# Apply input values and display output values
for in0_value,in1_value in zip(in0_values,in1_values):
  # Write input value to input port
  model.in0 @= in0_value
  model.in1 @= in1_value
  model.sim_eval_combinational()
  # Tick simulator one cycle
  model.sim_tick()
# Tick simulator three more cycles and print text wave
model.sim_tick()
model.sim_tick()
model.sim_tick()
model.print_textwave()
The test bench gets some input values from the command line, instantiates the design under test, applies some PyMTL3 passes, and then runs a simulation by setting the input values and displaying the output value.
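The way the test bench turns a flat list of command-line arguments into two operand lists can be checked in isolation. Here is a minimal sketch which uses a hypothetical list named args as a stand-in for sys.argv so that it runs without any arguments.

```python
# Sketch of how the test bench splits command-line arguments into operands.
# `args` stands in for sys.argv; element 0 is the script name.
args = ["imul-v1-adhoc-test.py", "2", "3", "16", "2"]

in0_values = [int(x, 0) for x in args[1::2]]  # 1st, 3rd, ... arguments
in1_values = [int(x, 0) for x in args[2::2]]  # 2nd, 4th, ... arguments

print(in0_values)                         # [2, 16]
print(in1_values)                         # [3, 2]
print(list(zip(in0_values, in1_values)))  # [(2, 3), (16, 2)]

# Base 0 lets int() infer the radix from the prefix, so hexadecimal
# arguments such as 0x10 are accepted as well.
print(int("0x10", 0))                     # 16
```

Each pair produced by zip is applied to the multiplier on one cycle, so the arguments 2 3 16 2 result in the two multiplications 2*3 and 16*2.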
You can run this test harness along with our single-cycle multiplier as follows:
% cd $TOPDIR/build
% python ../imul/imul-v1-adhoc-test.py 2 3 16 2
Notice how we are still building our simulator in a separate build directory to keep generated files separate from our source files. The PyMTL3 framework takes care of using Verilator to compile your Verilog RTL before running the Python test bench. The ad-hoc test will display a line trace:
1r 00000000|00000000(00000000 00000000)00000000
2r 00000000|00000000(00000000 00000000)00000000
3: 00000002|00000003(00000000 00000000)00000000
4: 00000010|00000002(00000002 00000003)00000006
5: 00000010|00000002(00000010 00000002)00000020
6: 00000010|00000002(00000010 00000002)00000020
7: 00000010|00000002(00000010 00000002)00000020
A line trace shows the state of the design with fixed-width fields and exactly one line per cycle. You can see the input values 2 and 3 being sent to the multiplier on cycle 3 and the corresponding result (6) produced on cycle 4. Take a closer look at the line-tracing code in the Verilog RTL to see how this line trace is produced before continuing.
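To build some intuition for why the result shows up one cycle after the inputs, here is a minimal cycle-level Python sketch of the registered-input multiplier. It is not the PyMTL3 or Verilog implementation; the class IntMulScycleModel and the function simulate are hypothetical names, and the trace format is only an approximation of the one shown above (assuming two reset cycles and that the last inputs are held for three extra cycles, as in the ad-hoc test).

```python
MASK32 = 0xFFFFFFFF

class IntMulScycleModel:
    """Cycle-level sketch: registered inputs, then a combinational multiply."""

    def __init__(self):
        self.in0_reg = 0
        self.in1_reg = 0

    def out(self):
        # Combinational logic: product truncated to 32 bits
        return (self.in0_reg * self.in1_reg) & MASK32

    def tick(self, reset, in0, in1):
        # Sequential logic: registers update on the clock edge
        if reset:
            self.in0_reg, self.in1_reg = 0, 0
        else:
            self.in0_reg, self.in1_reg = in0 & MASK32, in1 & MASK32

def simulate(pairs, reset_cycles=2, extra_cycles=3):
    model = IntMulScycleModel()
    last = pairs[-1] if pairs else (0, 0)
    # One (reset, in0, in1) tuple per cycle
    stimulus = [(True, 0, 0)] * reset_cycles
    stimulus += [(False, a, b) for a, b in pairs]
    stimulus += [(False, last[0], last[1])] * extra_cycles  # hold last inputs
    lines = []
    for cycle, (reset, in0, in1) in enumerate(stimulus, start=1):
        marker = "r" if reset else ":"
        # Record the state seen during this cycle, before the clock edge
        lines.append(
            f"{cycle}{marker} {in0:08x}|{in1:08x}"
            f"({model.in0_reg:08x} {model.in1_reg:08x}){model.out():08x}"
        )
        model.tick(reset, in0, in1)
    return lines

trace = simulate([(2, 3), (16, 2)])
print("\n".join(trace))
```

On cycle 3 the inputs are 2 and 3 but the registers still hold zero, so the output is zero; on cycle 4 the registers hold 2 and 3 and the output is 6.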
The ad-hoc test also displays a text-based waveform. The line trace and text-based waveform are isomorphic; they are showing the same information in two different ways.
Finally, the ad-hoc test also generates a VCD file that enables much more detailed waveform visualization using gtkwave.
% cd $TOPDIR/build
% gtkwave imul-v1-adhoc-test.vcd
Verilog RTL for Single-Cycle Multiplier with Valid Bit
Now that we know the basics of Verilog RTL modeling and how to simulate these models, let’s improve our multiplier by adding support for a valid bit. A valid bit will enable the multiplier to know when valid data is available at the input and to pass this information on to the output. The hardware we wish to implement looks like this:
Here is the interface for our single-cycle multiplier with valid bit:
module imul_IntMulScycleV2
(
  input  logic        clk,
  input  logic        reset,
  input  logic        in_val,
  input  logic [31:0] in0,
  input  logic [31:0] in1,
  output logic        out_val,
  output logic [31:0] out
);
Copy your implementation from imul/IntMulScycleV1.v into imul/IntMulScycleV2.v and modify it so that it correctly implements the valid bit. Then use the following ad-hoc test to verify it functions correctly. Look carefully to ensure the valid bit at the output is only high when the output data is indeed valid.
% cd $TOPDIR/build
% python ../imul/imul-v2-adhoc-test.py 2 3 16 2
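When checking the output of this test, it can help to have a reference for the expected cycle-by-cycle behavior. The following Python sketch is one possible behavioral model, under the assumption that the valid bit is registered alongside the operands so that out_val goes high exactly one cycle after in_val. The class name IntMulScycleV2Model is hypothetical, and this is not a substitute for the Verilog you are asked to write.

```python
MASK32 = 0xFFFFFFFF

class IntMulScycleV2Model:
    """Behavioral sketch; ASSUMES the valid bit is registered with the operands."""

    def __init__(self):
        self.val_reg = 0
        self.in0_reg = 0
        self.in1_reg = 0

    def outputs(self):
        # Combinational outputs derived only from the registers
        out_val = self.val_reg
        out = (self.in0_reg * self.in1_reg) & MASK32
        return out_val, out

    def tick(self, reset, in_val, in0, in1):
        # Registers update on the clock edge
        if reset:
            self.val_reg, self.in0_reg, self.in1_reg = 0, 0, 0
        else:
            self.val_reg, self.in0_reg, self.in1_reg = in_val, in0, in1

model = IntMulScycleV2Model()
model.tick(reset=True, in_val=0, in0=0, in1=0)

# (in_val, in0, in1) per cycle: two valid inputs, then two invalid cycles
stimulus = [(1, 2, 3), (1, 16, 2), (0, 0, 0), (0, 0, 0)]
observed = []
for in_val, in0, in1 in stimulus:
    observed.append(model.outputs())   # outputs seen during this cycle
    model.tick(False, in_val, in0, in1)

print(observed)  # [(0, 0), (1, 6), (1, 32), (0, 0)]
```

The output valid bit is low on the first cycle (nothing has been registered yet), high for the two results, and low again once the input valid bit has been low for a cycle.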
Verilog RTL for Single-Cycle Multiplier with Streaming Interface
The V1 and V2 single-cycle multipliers are latency-sensitive. This means if we want to develop a component that uses these multipliers, our component’s logic would need to be hard-coded to expect the output value one cycle after we set the input values. If we wanted to swap in a different multiplier implementation (e.g., a multi-cycle iterative implementation or a pipelined implementation) we would need to rewrite all of the logic in our component. Testing components that are latency-sensitive is also cumbersome since we need to carefully verify the exact expected outputs every cycle.
In this course, we will make extensive use of latency-insensitive streaming interfaces. Such interfaces use a val/rdy microprotocol which will enable other logic to always function correctly regardless of how many cycles a component requires.
VAL/RDY MICROPROTOCOL: Assume we have a producer that wishes to send a message to a consumer using the val/rdy micro-protocol. At the beginning of the cycle, the producer determines if it has a new message to send to the consumer. If so, it sets the message bits appropriately and then sets the valid signal high. Also at the beginning of the cycle, the consumer determines if it is able to accept a new message from the producer. If so, it sets the ready signal high. At the end of the cycle, the producer and consumer can independently AND the valid and ready signals together; if both signals are true then the message is considered to have been sent from the producer to the consumer and both sides can update their internal state appropriately. Otherwise, we will try again on the next cycle. To avoid long combinational paths and/or combinational loops, we should avoid making the valid signal depend on the ready signal or the ready signal depend on the valid signal. If you absolutely must, you can make the ready signal depend on the valid signal (e.g., in an arbiter) but it is considered very bad practice to make the valid signal depend on the ready signal. As long as you adhere to this val/rdy microprotocol, composing modules via the stream interfaces should not cause significant timing issues.
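The handshake described above can be sketched as a small cycle-by-cycle simulation in plain Python. This is only an illustration of the protocol, not PyMTL3 code; the function simulate_val_rdy and its arguments are hypothetical. The consumer's readiness is given as a fixed per-cycle pattern so the effect of back-pressure is easy to see.

```python
def simulate_val_rdy(messages, rdy_pattern):
    """Return one (cycle, val, rdy, transferred_msg_or_None) tuple per cycle."""
    pending = list(messages)
    log = []
    for cycle, rdy in enumerate(rdy_pattern):
        # Producer: val depends only on its own state, never on rdy
        val = len(pending) > 0
        msg = pending[0] if val else None
        # Both sides independently AND val and rdy at the end of the cycle
        fire = val and rdy
        log.append((cycle, val, rdy, msg if fire else None))
        if fire:
            pending.pop(0)  # message was sent; producer moves to the next one
        # Otherwise the producer retries the same message next cycle
    return log

log = simulate_val_rdy(
    messages=[0xA, 0xB, 0xC],
    rdy_pattern=[True, False, False, True, True, True],
)
received = [msg for (_, _, _, msg) in log if msg is not None]

for entry in log:
    print(entry)
print(received)  # [10, 11, 12]
```

All three messages arrive in order even though the consumer stalls for two cycles; only the cycle on which each transfer happens changes. This is what makes the interface latency-insensitive.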
Here is how we can implement a single-cycle multiplier with a latency-insensitive streaming interface:
Here is the interface for this single-cycle multiplier:
module imul_IntMulScycleV3
(
  input  logic        clk,
  input  logic        reset,
  input  logic        istream_val,
  output logic        istream_rdy,
  input  logic [63:0] istream_msg,
  output logic        ostream_val,
  input  logic        ostream_rdy,
  output logic [31:0] ostream_msg
);
Notice that we have a valid signal (val), a ready signal (rdy), and a message (msg) associated with both the input stream and the output stream. We have provided the implementation of this multiplier for you. Take a look in imul/IntMulScycleV3.v. The PyMTL3 wrapper is located in imul/IntMulScycleV3.py and looks like this:
from pymtl3 import *
from pymtl3.passes.backends.verilog import *
from pymtl3.stdlib.stream.ifcs import IStreamIfc, OStreamIfc
class IntMulScycleV3( VerilogPlaceholder, Component ):
  def construct( s ):
    s.istream = IStreamIfc( Bits64 )
    s.ostream = OStreamIfc( Bits32 )
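The 64-bit input message has to carry both 32-bit operands. The sketch below shows one way the two operands could be packed into and unpacked from a single message. The field layout (first operand in the upper 32 bits, second operand in the lower 32 bits) is an assumption made for illustration, and the helper names pack_operands and unpack_operands are hypothetical; check imul/IntMulScycleV3.v for the layout the provided implementation actually uses.

```python
MASK32 = 0xFFFFFFFF

def pack_operands(in0, in1):
    # ASSUMPTION: in0 occupies bits [63:32] and in1 occupies bits [31:0]
    return ((in0 & MASK32) << 32) | (in1 & MASK32)

def unpack_operands(msg):
    # Inverse of pack_operands under the same assumed layout
    return (msg >> 32) & MASK32, msg & MASK32

msg = pack_operands(16, 2)
print(f"{msg:016x}")         # 0000001000000002
print(unpack_operands(msg))  # (16, 2)
```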
Once we start using streaming interfaces we can take advantage of stream sources to send messages to our design and stream sinks to accept messages from our design and check that they are correct. Here is our new test harness in imul/imul-v3-adhoc-test.py:
class TestHarness( Component ):
  def construct( s, imsgs, omsgs ):
    # Instantiate models
    s.src  = StreamSourceFL( Bits64, msgs=imsgs, initial_delay=0, interval_delay=0 )
    s.sink = StreamSinkFL  ( Bits32, msgs=omsgs, initial_delay=0, interval_delay=0 )
    s.imul = IntMulScycleV3()
    # Connect
    s.src.ostream  //= s.imul.istream
    s.imul.ostream //= s.sink.istream
  def done( s ):
    return s.src.done() and s.sink.done()
  def line_trace( s ):
    return s.src.line_trace() + " > " + s.imul.line_trace() + " > " + s.sink.line_trace()
The test harness instantiates a stream source, multiplier, and stream sink. It then hooks the streaming interfaces up. Let’s try it out.
% cd $TOPDIR/build
% python ../imul/imul-v3-adhoc-test.py 2 3 16 2
The line trace looks very similar to our previous version, but to really see the difference we need to introduce some back-pressure into our design where the consumer is not ready. You can do this by changing the initial_delay and interval_delay for the stream sink. Experiment with different values and observe how the back-pressure changes the line trace. For streaming interfaces, the line trace works like this:
- . = val/rdy interface is not valid and not ready
- # = val/rdy interface is valid but not ready
- space = val/rdy interface is not valid and ready
- message is shown when it is actually transferred across interface
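The four cases listed above can be captured in a few lines of Python. This is a sketch of the convention only, not the PyMTL3 line-tracing code; the function name stream_trace and the fixed field width of eight hexadecimal digits are assumptions.

```python
def stream_trace(val, rdy, msg, width=8):
    """Return a fixed-width line-trace field for one val/rdy interface."""
    if val and rdy:
        text = f"{msg:0{width}x}"  # message actually transferred this cycle
    elif val:
        text = "#"                 # valid but not ready (back-pressure)
    elif rdy:
        text = " "                 # ready but nothing valid to send
    else:
        text = "."                 # neither valid nor ready
    return text.ljust(width)       # pad so every cycle lines up

for val, rdy in [(True, True), (True, False), (False, True), (False, False)]:
    print(f"val={val!s:5} rdy={rdy!s:5} |{stream_trace(val, rdy, 6)}|")
```

Padding every field to the same width keeps the columns aligned from cycle to cycle, which is what makes a long line trace readable at a glance.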
You will be implementing two multi-cycle iterative multipliers in lab 1 which make use of stream interfaces. We will then be able to use these multipliers in the processor you design in lab 2. If the processor correctly adheres to the val/rdy micro-protocol then it will function correctly regardless of the latency of the multiplier.