The most important item on all homework is YOUR NAME! No name, no credit. ALSO, put last 4 digits of SS#. Staple or clip pages together.
Homework must be submitted when due. You loose 10%, one grade, the first day homework is late. Then 10% each week thereafter. Max 50% off. A zero really hurts your average! Paper or EMail to squire@cs.umbc.edu is acceptable. If I can not read or understand your homework, you do not get credit. Type or print if your handwriting is bad. Homework is always due on a scheduled class day within 15 minutes after the start of the class. If class is canceled then homework is due the next time the class meets.
EMail only plain text! No word processor formats. You may use a word processor or other software tools and print the results and turn in paper. Put CS611 and HW number in subject line.
The "submit" facility only works on the "irix.gl.umbc.edu" and linux.gl.umbc.edu machines. The student commands are: submit cs611 HW6 file puts your "file" into cs611 HW6 submitrm cs611 HW6 file removes your "file" from cs611 HW6 submitls cs611 HW6 lists your files in cs611 HW6 Note: "HW" is upper case a) you must have your userid registered for "submit" send mail from a gl machine to squire if your submit fails b) you have to be logged onto a gl machine, SSH or telnet are OK c) everything is case sensitive, remember the uppercase HW.
You must show your work, not just the answer. Book Page 60, Exercise 1.2 Book Page 62, Exercise 1.7
You must show your work, not just the answer. Book Page 121, Exercise 2.11 Book Page 122, Exercise 2.12
You must show your work, not just the answer. Book Page 214, Exercise 3.1 Book Page 219, Exercise 3.12
1. Does a DLX sequence of instructions exists that must stall in a "scoreboard" machine, Figure 4.3, Page 244, yet the same sequence will not stall in a Tomasulo machine, Figure 4.8, Page 253, ? (yes or no) 2. Does a DLX sequence of instructions exists that must stall in a Tomasulo machine, Figure 4.8, Page 253, yet the same sequence will not stall in a "scoreboard" machine, Figure 4.3, Page 244, ? (yes or no) 3. Using the paper "Combining Branch Predictors" by Scott McFarling, use the instruction trace given below and update a Bimodal Predictor, Figure 1, for the case: two PC bits, four entries in the count vector, two bit counts as described in the paper. Initialize all counts to 10 base 2 (different from paper) a) show the four count values at the end of the trace. b) keep a count of correct predictions as you update the counts in order to give the percent predicted correctly. The following trace, as decimal integers, represent the sequence of PC (low order bits) of conditional branches and the letter T following the PC value indicates the branch was taken. 1T, 1T, 1T, 1, 3T, 2, 1T, 1T, 1T, 1, 3, 2, 1T, 1T, 1T, 1, 0T, 0T, 0T, 0, 3T, 2, 1T, 1T, 1T, 1, 2, 3 (Sample output with dummy answers should look like: 0 00 <-- initially 10 in all cases 1 11 2 01 3 11 50%)
Draw the diagram and compute values for the cache system described below. The diagram can be drawn free hand yet needs to be neat enough to be read. Use similar level of detail as was on the handout in class on caches. Show at least the tag comparators and "and" gate with the valid bit. Show all four rows of the L1 cache, and about 8 rows of the 65,536 rows of the L2 cache. Use this diagram to hand simulate the caches action when running the address sequence below. L1 instruction cache for a DLX machine, 2-way associative, block size is 4 words (16 bytes), index field in PC is two bits (e.g. 4 blocks long) LRU (Least recently used) replacement policy. Thus PC bits are +--------------+-------+---------+----------+ | tag | index | word | byte | | | | select | select | +--------------+-------+---------+----------+ bit number 31 6 5 4 3 2 1 0 Timing: the instruction is delivered in 1 ns for a hit, a miss requires that a block be filled from the L2 cache. (The 1 ns is still used by the L1 cache, even on a miss.) L2 general cache, direct mapped, block size is 4 words (16 bytes) index field is 16 bits (i.e. 65,536 blocks long) Timing: four words are delivered to L1 in 8 ns for a hit, a miss requires that a block be filled from RAM. Assume the 8 ns includes time to get the address, put the four words on the bus into the L1 cache and raise the "L2 hit" signal. (The data from RAM flows through the L2 cache on the way to the L1 cache, thus the 8 ns is used by the L2 cache, even on a miss.) RAM, 128 bit bus, (16 bytes) (4 words) delivered to L2 in 20 ns Assume the 20 ns includes time to get the address, fetch the data, put the data on the bus and raise the "data_ready" signal for the L2 cache.. All "valid" bits are initially zero. From the above, the first instruction takes 1 + 8 + 20 = 29 ns. Other facts: The memory to L2 cache bus is 128 bits wide. The L2 to L1 bus is 128 bits wide, thus no word select multiplexer on the L2 cache. Given the sequence of PC addresses below, 1) What is the total time to deliver all instructions. (ns) 2) What is the average time to deliver all instructions. (ns) 3) What is the L1 cache miss rate. (fraction) 4) What is the L2 cache miss rate looking at only the L2 cache. (fraction) 5) Assume one clock per nanosecond (ns) What is the average CPI. (xx.xx clocks per instruction) 6) Show hit or miss on each cache for each PC. (Do this first, of course!) L1 L2 PC: 00000000 <--- show H for hit, M for miss 00000010 blank for unused. 00000020 00000030 for each address for L1 and L2 00000040 00000004 00000008 * change, was 6 00000080 00000044 00000008 440001F4 110003B8 * change, 8 was 6 00000038
Write the VHDL code to perform an IEEE 754 Floating point add. You are given two 32-bit floating point numbers that are to be added to produce a third floating point number. Simplifications you may use include: Input numbers are normalized. No overflow or underflow or denormalization will occur. No rounding is necessary (either use truncation or round toward zero.) Use VHDL add, subtract, shift and other operators as needed. You do not have to go to the gate level. Use fp_add_test.vhdl as a start. Fill in the architecture behavior of fp_add to do the IEEE floating point add. Choose some reasonable test data for "a" and "b" in the test bench. The handout in class shows the commands needed on sunserver1.cs.umbc.edu A previous handout shows commands for linux.gl.umbc.edu but you will have to delete a few words to make VHDL-87 rather than VHDL-93 Look at VHDL help for more information. Compile and run. When reasonably correct, on gl.umbc.edu do a submit submit cs611 HW6 fp_add_test.vhdl
Closed book. Short answer, Numeric problems and some Multiple choice. Numerical problems will be on CPI, Amdahl's Law, Pipelining, Branch Prediction, Cache and IEEE Floating Point. Exam covers book: 1.5, 1.6, 2.3, 2.8, 3.1-3.5, 3.7, 3.9, 4.2, 4.4 5.1-5.5 lectures: 1 through 14 excluding Introduction and VHDL homework: 1 through 5 papers: McFarling "Combining Branch predictors"
Based on textbook and lecture answer the following: Show your work. Q1. Given a PCI bus running at 66MHz, 64 bits wide, what is the maximum bandwidth in MHz? (I could have asked for Mb/sec, same number, yet MB/sec is wrong!) Q2. Given a Ultra SCSI 160MB/s controller and disk drive that spins at 10,000 rpm and has an average seek of 6ms, 160MB/s transfer rate, defragmented. How long does it take, in seconds, to transfer 3.2MB where each disk transfer is a 32KB block? a) assuming 1/3 average seek time for first block and average rotational delay for all blocks. b) like a) but assuming a 4MB internal disk buffer so that only the first block has a rotational delay penalty. All numbers in decimal, K=1,000, M=1,000,000 Defragmented disks will typically only pay the seek penalty on the first block, the internal disk buffer can prevent any rotational delay penalty assuming there is room for read-a-head. Assume ideal timing. Q3. How many raw bytes must be stored to have one hour of music played at 44.1 KHz with 16 bits coming from each of two channels? Q4. What bandwidth in MHz (one bit per clock) is needed to continuously read 4.7GB of digital data from a 12X DVD in 10 minutes? (ignore seek and rotational delay, G=1,000,000,000)
Given a queue/server model M/M/4 Given an average-arrival-rate 20 tasks per second Q1. Given a single-server-utilization of 80% a) What is the average-single-server-rate in tasks per second ? b) What is the average-time-in-queue for a task ? c) What is the average-time-in-system for a task ? d) What is the average-tasks-in-queue ? Q2. What is the maximum-tasks-in-queue ? Possible short answers: average-tasks-in-queue / single-server-utilization about ten times the average-tasks-in-queue unbounded Q3. Given that we want the average-tasks-in-queue to be 10 tasks, (Still M/M/4 and average-arrival-rate of 20 tasks per second) What single-server-utilization is needed? (Answer as a percentage within 2% gets full credit.) Comments: None should be needed and these may be redundant, yet, in order to prevent long lines asking questions: M/M/4 technically stands for First M means memory-less random distribution of arrivals Second M means memory-less random distribution of service times 4 means the single queue is feeding four servers For a server problem it is reasonable to assume an exponential probability distribution for each M. It is reasonable to assume the four servers equally divide the workload of one server that is four times as fast. It is reasonable to assume all servers have the same utilization. The equations in the textbook on page 509 and 510 apply. The equations given in class represent the same equations that are in the textbook. As always, do not plug numbers into randomly selected equations. Try to understand what equations apply to the problem and check if your answers are intuitively reasonable.
Not assigned in Fall 2000
Comprehensive, about 1/3 pre midterm, 2/3 post midterm True/False, multiple choice, short answer In range 25 to 50 questions. The exam covers: Lectures 3 - 13, 16 - 28 Homework 1 - 8 McFarling Paper through bimodal IEEE 754 paper, floating point add, sub, mul, div Textbook 1.5 3.2 - 3.7 4.1 , 4.2 , 4.4 , 4.8 5.2 - 5.7 6.2 - 6.5 7.2 , 7.5 8.2 - 8.6 A.3 - A.5 B.3 E.1
Last updated 12/12/00