## INTRODUCTION

Reza M. Rad UMBC

Based on pages 321-356 of "Nanoelectronics and Information Technology", Rainer Waser

## Outline

- Fundamental requirements for logic devices
- Physical limitations of computing
- Physical implementation concepts
- Major aspects of architectures
- Estimations of the performance of information processing systems
- The ultimate computer

#### Requirements for logic devices

 Logical states must be mapped to physical properties such as the voltage amplitude or the timing of pulses (fig 1)



Figure 1: As an example, the voltage level representation of 2-valued logical states is shown, as realized in conventional digital CMOS circuits. Since the high values (H) correspond to the "1" and the low values (L) to the "0", this is an example of positive logic. In negative logic, the relationships are mutually exchanged.

- Requirement 1: Non-linear characteristics
  - Required to maintain sufficient signal to noise ratio even in unlimited chains of gates
  - The output signal intervals are smaller than input signal intervals
  - Non-linearity of CMOS gates stems from characteristics of MOSFETs
  - In case of neurons, one important non-linearity is obtained by the threshold function (fig 2), (fig 3-1)

- Fig 2 : non-linear characteristic in a logic gate
- Fig 3 : non-linearity in CMOS (left) and biological neurons (right)





- Requirements for logic devices (cont)
  - Requirement 2: Power amplification
    - To maintain signal level in logic chains, power amplification is necessary
    - It is not sufficient to have only signal amplification
    - Gate output must be able to drive at least two inputs

- CMOS gates not only amplify the voltage but also drive current to charge and discharge the line and input capacitances
- In biological neurons power amplification is done through voltage triggered ion channels (electrochemical potential difference) (fig 3-2)

#### **Power Amplification**



- Requirements for logic devices (cont)
  - Requirement 3: Concatenability
    - Input and output signals must be compatible (based on the same physical property and range)
  - Requirement 4: Feedback Prevention
    - A directed flow of information is required
    - In CMOS feedback prevention is done by MOSFET
    - In biological neurons backward propagation is prevented due to refractory period of voltage gated Na+ channel

- Requirement 5: Complete set of boolean operators
  - A basic set of boolean operators is needed to realize a complete boolean algebra
  - A generic set consists of "OR and NOT" or "AND and NOT" or NOR or NAND gates
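As a small illustration of Requirement 5, the following sketch builds NOT, AND and OR from NAND gates only. The function names are illustrative and do not come from the source text.

```python
from itertools import product

def nand(a: int, b: int) -> int:
    """Two-input NAND on 0/1 values."""
    return 1 - (a & b)

# NOT, AND and OR composed only of NAND gates
def not_(a: int) -> int:
    return nand(a, a)

def and_(a: int, b: int) -> int:
    return not_(nand(a, b))

def or_(a: int, b: int) -> int:
    return nand(not_(a), not_(b))

def truth_table(gate, n_inputs: int) -> list:
    """Outputs of `gate` for all input combinations in counting order."""
    return [gate(*bits) for bits in product((0, 1), repeat=n_inputs)]

if __name__ == "__main__":
    for name, gate, n in (("NOT", not_, 1), ("AND", and_, 2), ("OR", or_, 2)):
        print(name, truth_table(gate, n))
```

Since NOT, AND and OR can all be derived this way, a technology that offers a NAND gate meeting the other requirements can in principle realize any Boolean function.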

### Dynamic properties of logic gates

 Fall time, rise time, propagation delay for high and low (fig 4)

#### Figure 4:

(a) Response  $V_{out}$  of a CMOS NOT gate upon an ideally rectangular input signal,  $V_{in}$ .  $t_F$ : fall time,  $t_R$ : rise time.

(b) Definition of the propagation delay times  $t_{pdL}$  and  $t_{pdH}$  of the gate. The average delay time is given by  $t_{pd} = (t_{pdL} + t_{pdH})/2$ .



#### Threshold gates

- Si-based CMOS gates and biological neurons can be linked based on the operation of threshold gates
- Threshold gates are the basis of neuromorphic logic
- Definition: a linear threshold gate is a logic device that has n two-valued inputs x1,x2, ..., xn and a single two-valued output y= f(x1,x2, ...,xn)

#### Threshold gates (cont)

- *f* is determined by weights w1,w2, ..., wn and the threshold value Θ, (fig 5)
- Every boolean function can be realized by threshold gates
- An AND gate is obtained with w1 = w2 = ... = wn = 1 and n-1 < Θ < n



Figure 5: Threshold gate with binary inputs  $x_1, x_2, ..., x_n$ , the weights  $w_1, w_2, ..., w_n$ , the weighted sum  $X$  and the threshold value  $\Theta$ .

$$X = \sum_{k=1}^{n} w_k x_k, \quad x_k \in \{0, 1\}$$

$$y = \operatorname{sign}(X - \Theta) = \begin{cases} 1 & \text{if } X > \Theta \\ 0 & \text{if } X < \Theta \end{cases}$$
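The definition above can be sketched in a few lines. The AND gate uses the weights and threshold range given on the slide; the OR threshold (0 < Θ < 1) is an assumed, commonly used choice that is not stated in the source.

```python
from itertools import product

def threshold_gate(inputs, weights, theta) -> int:
    """Return 1 if the weighted sum X of the inputs exceeds theta, else 0."""
    x_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if x_sum > theta else 0

def and_gate(inputs) -> int:
    n = len(inputs)
    # all weights 1, threshold chosen inside n-1 < theta < n
    return threshold_gate(inputs, [1] * n, n - 0.5)

def or_gate(inputs) -> int:
    # assumption: all weights 1, threshold chosen inside 0 < theta < 1
    return threshold_gate(inputs, [1] * len(inputs), 0.5)

if __name__ == "__main__":
    for bits in product((0, 1), repeat=3):
        print(bits, "AND:", and_gate(bits), "OR:", or_gate(bits))
```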

- Basic advantage of threshold logic compared to conventional logic: inherent parallel processing due to internal multiple valued computation of weighted sum
- A full-adder is shown in the figure (fig 6)



Figure 6: Full adder realized by threshold logic.
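A full adder can be built from two threshold gates. The sketch below shows one common textbook construction (carry as a majority gate, sum as a gate that also receives the carry with weight -2); it is an assumption that this matches the circuit drawn in Figure 6.

```python
from itertools import product

def threshold_gate(inputs, weights, theta) -> int:
    """Return 1 if the weighted sum of the inputs exceeds theta, else 0."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) > theta else 0

def full_adder(a: int, b: int, c_in: int):
    """Return (sum, carry) using two threshold gates."""
    # carry: at least two of the three inputs are 1 (1 < theta < 2)
    carry = threshold_gate((a, b, c_in), (1, 1, 1), 1.5)
    # sum: subtract 2 when the carry is set, then test for a remaining 1
    s = threshold_gate((a, b, c_in, carry), (1, 1, 1, -2), 0.5)
    return s, carry

if __name__ == "__main__":
    for a, b, c in product((0, 1), repeat=3):
        print(a, b, c, "->", full_adder(a, b, c))
```

Two threshold gates suffice here, which illustrates the multi-valued internal computation of the weighted sum mentioned above.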

- Three fundamental limits to performance of logic functions:
  - Thermodynamics
  - Quantum mechanics
  - Electromagnetism
- Also a hierarchy of limits given by materials, device type, circuit concept, etc.
- Major limiting parameters: time and energy

Typically, the performance limits of a device are illustrated in a diagram of average power dissipation ($P_d$) versus average delay ($t_d$)

The average energy dissipated during a logic operation is $W_d = P_d \cdot t_d$

Fundamental limit imposed by thermodynamics is the minimum energy required for a binary transition at a given operating temperature:

$$W_{TD,\min} = k_B T \ln 2 \approx 3 \times 10^{-21}\ \text{J/bOp} \quad (T = 300\ \text{K})$$

(minimum energy dissipated for each bit)
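The thermodynamic limit is easy to evaluate numerically. This is a minimal sketch; the function name is illustrative.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K

def w_td_min(temperature_k: float) -> float:
    """Minimum energy per binary operation, k_B * T * ln 2, in joules."""
    return K_B * temperature_k * math.log(2)

if __name__ == "__main__":
    print(f"W_TD,min at 300 K: {w_td_min(300.0):.3e} J/bOp")
```

At 300 K this gives roughly 2.9e-21 J, which the slide rounds to 3e-21 J per binary operation.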

The Heisenberg uncertainty principle of quantum mechanics imposes a second fundamental limit. The energy of a state with lifetime Δt can only be determined with a precision ΔW given by:

$$W_{QM,\min} = \Delta W \ge h / \Delta t$$

Figure shows thermodynamic and quantum mechanical limits and 1000, 100 and 10 nm CMOS gates and also estimated values for neurons and synapses (fig 7)



Figure 7: Average dissipated power per gate  $P_d$  versus transition delay time  $t_d$ . The red area is inaccessible due to fundamental limits of thermodynamics (boundary:  $W_{\text{TD,min}}$ ) at T=300 K and quantum mechanics (boundary:  $W_{\text{QM,min}}$ ). The device limits for CMOS gates of the 1000-nm, the 100-nm, and the projected 10-nm technology generations are illustrated. Furthermore, estimated values for biological neurons in the human brain and synapses of these neurons are shown.
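The two boundaries of the inaccessible region in Figure 7 can be sketched from the formulas above by dividing each minimum energy by the delay time: $P = W_{TD,min}/t_d$ and $P = h/t_d^2$. Identifying the state lifetime Δt with the delay $t_d$ is an assumption of this sketch.

```python
import math

K_B = 1.380649e-23   # Boltzmann constant in J/K
H = 6.62607015e-34   # Planck constant in J*s

def p_thermodynamic(t_d: float, temperature_k: float = 300.0) -> float:
    """Power at the thermodynamic boundary: k_B*T*ln2 / t_d."""
    return K_B * temperature_k * math.log(2) / t_d

def p_quantum(t_d: float) -> float:
    """Power at the quantum mechanical boundary: (h / t_d) / t_d."""
    return H / t_d ** 2

def min_power(t_d: float, temperature_k: float = 300.0) -> float:
    """The stricter of the two limits for a given delay time."""
    return max(p_thermodynamic(t_d, temperature_k), p_quantum(t_d))

if __name__ == "__main__":
    for t_d in (1e-9, 1e-12, 1e-15):
        print(f"t_d = {t_d:.0e} s: P_min = {min_power(t_d):.3e} W")
```

With these expressions the thermodynamic limit dominates for slow gates, while the quantum mechanical limit takes over for delays well below a picosecond.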

#### Physical limits to computation

Estimating power (and energy) for a CMOS inverter (fig 8, 9):

- Load capacitance: $C_L = C_{out} + C_{con} + C_{in}$
- Total dissipated power: $P_d = P_{dyn,CL} + P_{dyn,SC} + P_{stat}$
- $P_{dyn,CL} = \sigma f C_L V_{dd}^2$, with $0.25 \le \sigma \le 1.5$
- $P_{dyn,SC}$: dynamic power of CMOS dissipated during a transition, $P_{dyn,SC} \approx 10\%$ of $P_{dyn,CL}$
- $P_{stat}$: static power consumption caused by off-currents of the MOSFETs
  - For older CMOS, $P_{stat} \approx 1\%$ of $P_{dyn}$
  - $P_{stat}$ plays an increasingly important role in modern CMOS
- $W_d = P_d \cdot t_d$
- $t_{d,FET} = L_{ch} / v$, with $L_{ch}$: channel length, $v$: carrier velocity, $v_s \approx 3 \times 10^7$ cm/s for Si based devices
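The power estimate for the CMOS inverter can be sketched as follows. The example operating point (1 GHz, 1 fF, 1 V, σ = 0.5) is hypothetical, and taking $P_{stat}$ as 1% of the sum of both dynamic terms is an assumption about what $P_{dyn}$ means in the slide.

```python
def cmos_inverter_power(f_hz, c_load_f, v_dd, sigma=0.5,
                        sc_fraction=0.10, stat_fraction=0.01):
    """Return the three power contributions and their sum, in watts."""
    p_dyn_cl = sigma * f_hz * c_load_f * v_dd ** 2   # charging/discharging C_L
    p_dyn_sc = sc_fraction * p_dyn_cl                # short-circuit part, ~10 %
    # assumption: P_dyn = P_dyn,CL + P_dyn,SC; the 1 % figure is for older CMOS
    p_stat = stat_fraction * (p_dyn_cl + p_dyn_sc)
    return {"P_dyn_CL": p_dyn_cl, "P_dyn_SC": p_dyn_sc, "P_stat": p_stat,
            "P_d": p_dyn_cl + p_dyn_sc + p_stat}

def switching_energy(p_d: float, t_d: float) -> float:
    """W_d = P_d * t_d in joules."""
    return p_d * t_d

def fet_transit_delay(l_ch_cm: float, v_cm_s: float = 3e7) -> float:
    """t_d,FET = L_ch / v; the default velocity is the value quoted above."""
    return l_ch_cm / v_cm_s

if __name__ == "__main__":
    # hypothetical operating point: 1 GHz, C_L = 1 fF, V_dd = 1 V
    power = cmos_inverter_power(1e9, 1e-15, 1.0)
    print(power)
    print("transit delay for L_ch = 100 nm:", fet_transit_delay(1e-5), "s")
```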



Figure 9: Switching cycle for a CMOS inverter (Figure 8). The  $I_{dd}$ -t and the  $I_n$ -t curves show the changes during the transitions.  $Q_{SC}$  is the charge due to the transient conduction of both FETs,  $Q_L$ is the charge transferred onto  $C_L$  during the rising edge of  $V_{out}$  and further to ground during the falling edge of  $V_{out}$ .



Figure 8: CMOS inverter gate for illustrating the dynamic power dissipation, the load capacitance  $C_L$  comprises the output capacitance of the gate, the interconnect capacitance, and the input capacitance of the subsequent gate (shown in grey).

#### Electromagnetism limit

- Especially over long distances, signal propagation is limited by the electromagnetic character of signals and the finite speed of light (fig 11)
- Delay $\tau$ of a signal traveling via an interconnect of length L is:



 Delay is also determined by resistance (R) and capacitance (C) of the interconnect



Figure 11: Sketch of the geometries assumed for the model interconnect.

#### Electromagnetism limit (cont)

• Latency of a global interconnect with distributed RC:

 $\tau = \alpha RC = \alpha \, r c L^2$ 

r and c are the distributed resistance and capacitance, expressed as sheet quantities so that the interconnect width B cancels in the product RC

 $r \approx R \frac{B}{L} = \frac{\rho}{H_\rho}$, where $\rho / H_\rho$ is the conductor sheet resistance in  $\Omega$ 

 $c \approx C \frac{1}{BL} = \frac{\varepsilon}{H_{\varepsilon}}$, where $\frac{\varepsilon}{H_{\varepsilon}}$ is the sheet capacitance in  $\frac{F}{cm^2}$ 

 $\alpha \approx 0.5$  accounts for the distributed nature of the RC network

#### Electromagnetism limit (cont)

 The figure shows the speed limitation caused by the electromagnetic limit (fig 12)



Figure 12: Reciprocal interconnect length squared,  $L^{-2}$ , versus interconnect delay time  $\tau$ , assuming a copper-polymer technology ( $\varepsilon_r = 2, \rho = 1.7 \cdot 10^{-6} \Omega$ cm) (after [6] with modifications).
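The distributed RC latency can be evaluated with the material values quoted for Figure 12 ($\varepsilon_r = 2$, $\rho = 1.7 \cdot 10^{-6}\ \Omega$cm). The layer thicknesses of 1 µm used in the example are hypothetical, and the symbols for conductor and dielectric thickness follow the reading $\tau = \alpha\,(\rho/H_\rho)(\varepsilon/H_\varepsilon)\,L^2$.

```python
EPS_0 = 8.8541878128e-14  # vacuum permittivity in F/cm

def rc_delay(length_cm, h_rho_cm, h_eps_cm,
             rho_ohm_cm=1.7e-6, eps_r=2.0, alpha=0.5):
    """Latency tau = alpha * r * c * L^2 of a distributed RC interconnect."""
    r_sheet = rho_ohm_cm / h_rho_cm        # conductor sheet resistance in ohm
    c_sheet = eps_r * EPS_0 / h_eps_cm     # sheet capacitance in F/cm^2
    return alpha * r_sheet * c_sheet * length_cm ** 2

if __name__ == "__main__":
    # hypothetical geometry: 1 um thick conductor and dielectric
    for length in (0.1, 1.0, 2.0):
        print(f"L = {length} cm: tau = {rc_delay(length, 1e-4, 1e-4):.3e} s")
```

The quadratic dependence on L is what makes long global interconnects the critical case: doubling the length quadruples the delay.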

#### Classifications

- Logic states can be represented by (fig 13)
- Number of terminals
  - Two-terminal devices
  - Three-terminal devices



Figure 13: Examples of input and output signals of logic devices.

- Physical properties representing logic states must arise from a non-linear behavior:
  - Non-linearity of I-V function (FETs)
  - Discreteness of electrical charge (SETs)
  - Quantum mechanical discreteness of energy states (RTDs)
- Two terminal devices
  - Lower number of terminals reduces the huge interconnect problem significantly

- Reconfigurable molecular switches and resonant tunneling diodes (RTDs)
- RTDs show a negative differential resistance that can be used to provide power amplification
- RTDs can be employed to implement a generic set of logic functions
- Several clock signals and modulated voltage supplies are needed

#### Field effect devices

- Charging of a gate electrode creates an electric field in the channel between source and drain
- This field controls the conduction of the channel
- Challenges in reducing the size of MOSFETs and potential new materials are discussed in chapter 13

- Application of ferroelectrics as gate oxide is presented in chapter 14
- Carbon nanotube FETs explained in chapter 19
- Organic FETs (or organic thin-film transistors, TFTs) are not fast but are low-cost and can be fabricated on flexible substrates



#### Coulomb blockade devices

- Voltage and charge on a macroscopic conductor are related linearly according to $V = Q/C$
- Energy of this capacitor is given by

$$W = \frac{Q^2}{2C}$$

- For nanometer scale capacitors these change to non-linear relations due to the discreteness of charge, Q = ne
- To observe the non-linearities, the energy steps must be significantly larger than the thermal energy $k_B T$

• Energy step 
$$\Delta W = W_{n+1} - W_n = \frac{e^2}{2C}(2n+1)$$

• For a plate capacitor with area A and dielectric thickness d

$$C = \frac{\varepsilon_r \varepsilon_0 A}{d}$$

• For the first electron to charge the capacitor (from n=0)

$$\Delta W \gg k_B T \quad \Rightarrow \quad A \ll \frac{e^2 d}{2\varepsilon_r \varepsilon_0 k_B T}$$

- For T = 300 K and d = 3 nm, the bound evaluates to A ≈ 2.6e-16 m<sup>2</sup> (a 16x16 nm square)
- For nanosized capacitors energy levels are discrete and determined by quantum mechanics (fig 16)
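The area bound for observing single-electron charging can be checked numerically. The relative permittivity is not given on the slide; the value 3.9 (typical for SiO2) is an assumption that happens to reproduce the quoted 2.6e-16 m² to within a few percent.

```python
import math

E = 1.602176634e-19       # elementary charge in C
K_B = 1.380649e-23        # Boltzmann constant in J/K
EPS_0 = 8.8541878128e-12  # vacuum permittivity in F/m

def charging_energy_step(n: int, capacitance_f: float) -> float:
    """Delta W = W_(n+1) - W_n = e^2 / (2C) * (2n + 1), in joules."""
    return E ** 2 / (2.0 * capacitance_f) * (2 * n + 1)

def plate_capacitance(area_m2: float, d_m: float, eps_r: float) -> float:
    """Parallel-plate capacitance eps_r * eps_0 * A / d."""
    return eps_r * EPS_0 * area_m2 / d_m

def max_plate_area(d_m: float, temperature_k: float, eps_r: float) -> float:
    """Area at which the first charging step equals k_B * T."""
    return E ** 2 * d_m / (2.0 * eps_r * EPS_0 * K_B * temperature_k)

if __name__ == "__main__":
    area = max_plate_area(3e-9, 300.0, eps_r=3.9)  # eps_r is an assumption
    print(f"A = {area:.2e} m^2, side = {math.sqrt(area) * 1e9:.1f} nm")
```

The capacitor must be much smaller than this bound for the charging steps to stand out clearly against thermal fluctuations.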



Non-linear characteristic of nanosized capacitors is employed in

- Single electron transistors
- Nanowire memories
- Non-volatile nanodot memories

 Energy per operation versus minimum feature size (fig 17)



Figure 17: Dissipated switching energy  $W_d$  versus minimum feature size F for MOSFETs and SETs. The area below the horizontal boundary at  $10 \cdot W_{TD,min}$  is not accessible for reliable information processing at T = 300 K. Note: the line for FETs is calculated from the *device* data estimated in the ITRS 2001. Hence, the energy per switching operation is much lower than the energy in Figure 10, which refers to an average logic gate in a *circuit*. The number of electrons for FETs are the estimated excess electrons in the inversion channel.

#### Spintronics

- Besides electric charge, electrons have another fundamental property: spin
- The effect of spin orientation on electronic transport is observed in ferromagnetic/nonferromagnetic/ferromagnetic multilayers (giant magnetoresistance, GMR), used in read heads of hard disks
- The spin dependence of the tunneling current through ultra-thin insulating films (tunnel magnetoresistance, TMR) is used in magnetic RAMs
- Spintronics aims to exploit magnetoelectronic effects in active logic devices

#### Concepts of logic devices

Hypothetical spin FET (fig 18)

Figure 18: Hypothetical spin FET. Source and drain are oriented ferromagnets acting as polarizer and analyser of spin-polarized electrons. An electric field rotates the spin polarization direction of the electrons travelling in the channel with relativistic speeds.



- Spin transistors are based on three effects:
  - Electrons injected to the active region of transistor need to show a high degree of spin polarization
  - There must be a control signal which makes it possible to tune the spin polarization
  - The spin polarization must sustain the traveling time and distance in the active region

#### Quantum Cellular Automata

- Instead of flow of particles, logic states are implemented by means of discrete stationary states
- Basic idea of QCA is an elementary cell of two stable states representing 0 and 1, which can be toggled by fields emerging from neighboring cells
- An ideal QCA circuit operates near the thermodynamic limit of information processing

A QCA cell made with four quantum dots (fig 19)



Figure 19: Upper row: potential barriers for the electrons in a QCA cell made from four quantum dots. If the system is clocked, the potential barriers may be decreased in order to enable the transition of the polarization state. Lower row: electron probability distribution of polarization state "0" (left), "1" (center), and a delocalized state, e. g. by loading with four electrons or as time-average of an isolated QCA (right), shown by Goser and Pacha [47].



Figure 20: The two logic states of a QCA cell.

Logic gates can be easily implemented (fig 21)

Figure 21: QCA gates.

(a) a linear row of QCA cells for transferring a logic state,

(b) a majority gate,

(c) a NOT gate. The signal is first branched into two identical ones (upper and lower branch) before it affects the output cell, since the inversion needs to take place across a corner. The interaction across two corners makes the gate more reliable.
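On the logic level, the majority gate of Figure 21(b) together with the NOT gate is enough to build AND and OR. The sketch below models only the Boolean behavior, not the cell physics; fixing one majority input to a constant is a standard technique that is assumed here rather than taken from the slides.

```python
from itertools import product

def majority(a: int, b: int, c: int) -> int:
    """Output follows the value held by at least two of the three inputs."""
    return 1 if a + b + c >= 2 else 0

def qca_and(a: int, b: int) -> int:
    return majority(a, b, 0)   # third input fixed to logic 0

def qca_or(a: int, b: int) -> int:
    return majority(a, b, 1)   # third input fixed to logic 1

def qca_not(a: int) -> int:
    return 1 - a

if __name__ == "__main__":
    for a, b in product((0, 1), repeat=2):
        print(a, b, "AND:", qca_and(a, b), "OR:", qca_or(a, b))
```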



- QCAs have no power amplification
- Feedback is not prevented
- Low temperature operation for quantum dot based QCAs
- For room temperature operation dots need to be smaller than 5 nm and edge of cell less than 25 nm, fabrication precision of 1 nm

#### Quantum computing

- QCA circuits can be extended to quantum computers by replacing cells by so-called qubits
- A quantum computer processes all possible states of inputs at once
- Potential solutions for the complete set of possible input values are calculated concurrently

- Quantum computers would be especially suitable for non-deterministic polynomial (NP) problems such as the traveling salesman problem or factoring integers
- DNA computer
  - Encoding information on a DNA molecule by a sequence of DNA bases

- Techniques for manipulating DNA strands can be used to execute parallel computation by modification of information encoded in DNA
- DNA computation experimentally demonstrated by solving NP problems
- This approach is not likely to be useful in practice
- Specific applications might be able to take advantage of the DNA based computations

#### Flexibility of systems of information processing

- Classification
  - Free-programmable systems: an instruction flow fed into the system controls the sequence of operations
  - Reconfigurable systems: the hardware configuration can be changed by corresponding instructions
  - Hardwired systems: internal hardware structure is mainly fixed (fig 25)

Figure 25: Types of realization of information processing systems sketched in a flexibility - performance - power dissipation diagram (adapted from [19]). The flexibility indicates the effort (measured, for example, in time or costs) required to prepare the system to solve a new (arbitrarily selected) task. This axis spans approximately three orders of magnitude from changing an application program in the case of general purpose processors to designing and fabricating a completely new physically optimized integrated circuit. The abscissa illustrates the computational performance (for a comparable number of logic gates). Physically optimized integrated circuits can be approx. five orders of magnitude faster than general purpose processors performing the same dedicated task. The right ordinate illustrates the power dissipation for the same task. As described in Sec. 2, the power dissipation of physically optimized ICs can be up to five orders of magnitude lower than for general purpose processors.



#### Field programmable devices (fig 26, 27)



Figure 26: Example of a basic cell of a Field Programmable Device. In this variant, called Programmable Logic Array (PLA), the AND array as well as the OR array are configurable. There are other types, in which either the AND or the OR array is fixed to represent 1-of-n encoders or decoders. In the example shown here, the configuration represents Boolean functions according to:

 $y_0 = x_2 + \overline{x}_0 x_1$ 

 $y_1 = \overline{x}_0 x_1 + \overline{x}_0 \overline{x}_2 + x_0 \overline{x}_1 x_2$ 

Because the AND and OR arrays are configurable here, the right half of the arrays are unused and could be omitted. However, if the same Boolean functions were realized with either AND array or the OR array as 1-out-of-n decoders, the full arrays are required.
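The configured PLA of Figure 26 can be modeled as two small tables, one for the AND array (product terms) and one for the OR array (which product terms feed each output). The data layout below is an illustrative assumption; only the two Boolean functions are taken from the caption.

```python
from itertools import product

# AND array: each product term maps an input index to the value it requires
AND_PLANE = [
    {2: 1},              # x2
    {0: 0, 1: 1},        # NOT x0 AND x1
    {0: 0, 2: 0},        # NOT x0 AND NOT x2
    {0: 1, 1: 0, 2: 1},  # x0 AND NOT x1 AND x2
]

# OR array: indices of the product terms summed into each output
OR_PLANE = [
    [0, 1],     # y0 = x2 + ~x0 x1
    [1, 2, 3],  # y1 = ~x0 x1 + ~x0 ~x2 + x0 ~x1 x2
]

def pla(x):
    """Evaluate (y0, y1) for an input tuple x = (x0, x1, x2)."""
    products = [int(all(x[i] == v for i, v in term.items()))
                for term in AND_PLANE]
    return tuple(int(any(products[p] for p in row)) for row in OR_PLANE)

if __name__ == "__main__":
    for x in product((0, 1), repeat=3):
        print(x, "->", pla(x))
```

Reconfiguring the device corresponds to editing the two tables, which is the essence of a field programmable array: the hardware stays the same while the stored configuration changes.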

Figure 27: Illustration of the cross bar array of a Programmable Logic Device. The double array structure consists of an address array for storing the configuration information in the memory elements and for controlling the switches of the actual data array. In standard CMOS technology, the memory elements and the switch control consist of six transistors (adapted from Ref. [57]).




- Power amplification, feedback prevention and a complete set of boolean operators are missing in these arrays
- Often, periphery of the arrays provides these requirements
- A CMOS FPGA may consist of thousands or millions of programmable units
- Fabricating configurable logic systems is highly attractive for nanoelectronics

- Molecular switches are conceivable that are opened and closed at relatively high voltages and operate at much lower voltages
- It is possible to build defect tolerance into FPGA based logic
- Parallel processing and granularity
  - Performance can be improved by decreasing delay times

- However, there are physical limits to decreasing delay
- The other choice is to employ more than one processing unit (parallel processing)
- Classification:
  - SISD : single instruction single data
  - SIMD : single instruction multiple data
  - MISD : multiple instruction single data
  - MIMD : multiple instruction multiple data (fig 28)



Figure 28: Classification of single and parallel processing systems after Flynn. Abbreviations: SISD Single Instruction – Single Data, SIMD Single Instruction – Multiple Data, MISD Multiple Instruction – Single Data, MIMD Multiple Instruction – Multiple Data; CU: Control Unit, ALU Arithmetic-Logic Unit, Mem Memory.

 The degree to which the total computational power of a system is distributed over individual parallel processing units is called granularity

#### Teramac

With growing number of devices per chip:

- The required design effort strongly increases
  - Regular, repetitive structures show this to a much lower degree
- The probability of defects grows statistically with the number of components

- Defect probability for transistors is 10<sup>-7</sup> to 10<sup>-9</sup>
- For systems with more than a billion transistors, defects will hardly be avoidable
- Modern fabrication techniques may be cheaper but more defect prone
- Teramac : a reconfigurable and defect tolerant computer
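The claim that defects become unavoidable beyond a billion transistors can be made concrete with a simple estimate. The sketch assumes statistically independent defects, which is a simplification not stated in the source.

```python
import math

def expected_defects(n_components: float, p_defect: float) -> float:
    """Mean number of defective components."""
    return n_components * p_defect

def defect_free_probability(n_components: float, p_defect: float) -> float:
    """Probability that none of the components is defective: (1 - p)^N."""
    # log1p keeps the result accurate for very small p
    return math.exp(n_components * math.log1p(-p_defect))

if __name__ == "__main__":
    for p in (1e-9, 1e-8, 1e-7):
        n = 1e9
        print(f"p = {p:.0e}: expected defects = {expected_defects(n, p):.0f}, "
              f"P(defect free) = {defect_free_probability(n, p):.3e}")
```

Even at the optimistic end of the quoted range, a chip with a billion transistors would be free of defects only about one time in three under this assumption, which motivates defect tolerant architectures such as Teramac.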

- 1 million gates, 1 MHz clock
- 8 PCBs, 4 MCMs on each board, 27 FPGA on each MCM, 256 64 bit LUT in each FPGA
- Of the 27 FPGAs, 8 are used for logic and 19 for interconnects
- Interconnect structure follows the fat-tree concept (fig 31), a highly redundant structure

Figure 31: Tree-type interconnect structures of distributed computing systems.

regular tree



- Approximately 3% of all FPGAs have been defective
- For configuration, Teramac is connected to a workstation which runs a test program or loads a test program to Teramac
- Test program locates the defective components
- During configuration the compiler routes around defective components

- Teramac will not economically substitute general processors or DSPs because it needs an order of magnitude more transistors for the same task
- Cost effective implementations like molecular switches might change this situation

#### Basic binary operations

- Basic binary operation is defined as an approximation of the logic operation of a basic binary gate
- A half adder consists of an XOR and an AND gate (2 bOp)
- Binary addition of two 16 bit operands requires 16 full adders, i.e. 16 x 5 = 80 bOp
- A 64 bit floating point addition requires approx. 300 bOp, a multiplication approx. 16500 bOp

• A 4 input threshold gate is approx. 249 bOp (see the text for details of approximation)

#### Measures of performance

• One measure of performance for processors (MIPS) can be obtained as:

$$\text{Performance} = \frac{f_{clk}}{CPI} = \frac{1}{t_{clk} \cdot CPI}$$

CPI : Cycles per instruction

• A relative performance measure has been introduced that uses the VAX 11/780 as a reference processor:

$$\text{Performance} = \frac{f_{clk}}{CPI} \cdot n_{VAX}, \qquad n_{VAX} = \frac{\text{average number of VAX instructions}}{\text{average number of instructions on system M}}$$
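The two performance measures can be written as short functions. The clock frequency, CPI and instruction counts in the example are hypothetical values chosen only to show the arithmetic.

```python
def mips(f_clk_hz: float, cpi: float) -> float:
    """Performance f_clk / CPI, expressed in millions of instructions per second."""
    return f_clk_hz / cpi / 1e6

def n_vax(avg_vax_instructions: float, avg_system_instructions: float) -> float:
    """Ratio of VAX 11/780 instructions to instructions on system M for the same task."""
    return avg_vax_instructions / avg_system_instructions

def relative_mips(f_clk_hz: float, cpi: float, n_vax_ratio: float) -> float:
    """VAX-relative performance (f_clk / CPI) * n_VAX, in MIPS."""
    return mips(f_clk_hz, cpi) * n_vax_ratio

if __name__ == "__main__":
    # hypothetical processor: 100 MHz clock, 2 cycles per instruction,
    # needing 1.25 times as many instructions as the VAX for the same task
    ratio = n_vax(1.0, 1.25)
    print("MIPS:", mips(100e6, 2.0))
    print("relative MIPS:", relative_mips(100e6, 2.0, ratio))
```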

 To account for applications with high load of floating point operations FLOPS has been introduced

 Development of FLOPS for processors (fig 35)



#### Performance of information processing systems

- Processing capability of biological neurons
  - We use the threshold gate as a simplified model
    - Its output is binary
    - It contains no temporal information
    - It considers no noise and no stochastic processes
  - A 1 state at the output of a biological neuron is represented by a train of action potential spikes, sent down the axon at the maximum firing rate  $f_{max}$  (500 Hz)
  - Integrate and fire model used for neurons (fig 37)

- Studies result in an information content in range from 0.1 bits/spike to 3 bits/spike
- Ultimate computation
- Power dissipation limit
  - F: minimum feature size
  - Device density (DD) is inversely proportional to F<sup>2</sup>
  - $X_{aa}$  : average gate area measured in units of F<sup>2</sup>
  - Operation density (OD) can be measured based on DD and f<sub>max</sub>

$$DD = \frac{1}{X_{aa}F^2}, \quad OD = DD \cdot f_{max}$$

- Density of dissipated power PD is given by $PD = DD \cdot P_d$
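The three densities follow directly from the definitions above. The numbers in the example (F = 100 nm, an average gate area of 100 F², 1 GHz, 1 µW per gate) are hypothetical and serve only to show the orders of magnitude.

```python
def device_density(f_cm: float, x_aa: float) -> float:
    """DD = 1 / (X_aa * F^2), in gates per cm^2 for F given in cm."""
    return 1.0 / (x_aa * f_cm ** 2)

def operation_density(dd: float, f_max_hz: float) -> float:
    """OD = DD * f_max, in operations per second and cm^2."""
    return dd * f_max_hz

def power_density(dd: float, p_d_w: float) -> float:
    """PD = DD * P_d, in W/cm^2."""
    return dd * p_d_w

if __name__ == "__main__":
    dd = device_density(1e-5, 100.0)          # F = 100 nm = 1e-5 cm
    print(f"DD = {dd:.2e} gates/cm^2")
    print(f"OD = {operation_density(dd, 1e9):.2e} op/(s cm^2)")
    print(f"PD = {power_density(dd, 1e-6):.1f} W/cm^2")
```

With these hypothetical numbers the power density is already about 100 W/cm², so heat removal rather than feature size becomes the constraint.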

Figure reveals the scaling of the OD and PD (fig 39)

Figure 39: Operation density *OD*, device density *DD*, and density of the dissipated power *PD* for physically optimized, lowpower CMOS circuits versus minimum feature size *F* as derived from the trends of the frequency f and energy per operation  $W_d$ shown in Figure 10. Note: For  $F \leq 100$  nm, these are formal trends as no limit for *PD* is assumed (see text).



- Removal of the dissipated heat is a serious problem
- Power density (PD) up to 100 W/cm<sup>2</sup> is reasonable
- Figure (fig 40) demonstrates two different schemes of circuit parameter interdependencies:
  - (a) PD has not reached its limit, F controls the parameters
  - (b) PD is the second controlling parameter

The figure shows the changes in the circuit parameters versus F (fig 41)

**Figure 41:** Illustration of the circuit parameters versus the feature size *F*. Regime I is the *F* controlled regime, regimes II to V are controlled by the limit of dissipated power density *PD*. In all regimes, a scaling of the device density *DD* according to the available area is assumed (i.e.  $DD \propto F^{-2}$ ). Regime I: All characteristics are directly or indirectly determined by the min. feature size *F*. The scaling characteristics are a coarse approximation to the evolution of the CMOS technology in the *F* = 100 regime (see Figure 40).

Regime II: Since *PD* is limited, the frequency  $f_{\text{max}}$  can only grow as much as allowed by the scaling of  $W_{\text{d}}$  and of *DD*.

Regime III: If the scaling of  $W_d$  is weaker (e.g.  $\propto F^2$ ), then  $f_{\text{max}}$  may not scale at all anymore.

Regime IV: If the scaling of  $W_d$  is further reduced (e.g.  $\propto F^1$ ), then  $f_{\text{max}}$  must drop with decreasing F, to meet a constant *PD*.

Regime V: If  $W_d$  shows no scaling,  $f_{\text{max}}$  must scale  $\propto F^2$  to compensate for  $DD \propto F^{-2}$ .

On the route from regime I to V, the scaling of the operation density *OD* changes dramatically.
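The power-limited regimes II to V can be reproduced with a one-line scaling argument. The sketch assumes that every gate switches at $f_{max}$, so that $PD = DD \cdot W_d \cdot f_{max}$; with PD held constant and $DD \propto F^{-2}$, a switching energy $W_d \propto F^k$ forces $f_{max} \propto F^{2-k}$. The exponent k = 3 used for regime II is a hypothetical example of stronger scaling.

```python
def fmax_exponent(wd_exponent: float, dd_exponent: float = -2.0) -> float:
    """Exponent m in f_max ~ F^m when PD = DD * W_d * f_max is held constant."""
    return -(dd_exponent + wd_exponent)

def scale_fmax(f0_hz: float, shrink: float, wd_exponent: float) -> float:
    """f_max after scaling the feature size by `shrink` (e.g. 0.5 halves F)."""
    return f0_hz * shrink ** fmax_exponent(wd_exponent)

if __name__ == "__main__":
    regimes = {"II (W_d ~ F^3, assumed)": 3, "III (W_d ~ F^2)": 2,
               "IV (W_d ~ F^1)": 1, "V (W_d const)": 0}
    for name, k in regimes.items():
        print(f"Regime {name}: f_max ~ F^{fmax_exponent(k):+.0f}, "
              f"1 GHz becomes {scale_fmax(1e9, 0.5, k):.2e} Hz when F is halved")
```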







Figure 40: Interdependencies of the circuit parameters

(a) in the feature size controlled regime and (b) in the power dissipation controlled regime.

#### The ultimate computer

- Homogeneous arrays, which are relatively fine grained
- Parallelism at different hierarchical levels
- Emphasis on local interconnects
- Universal non-volatile memory
- Defect and fault tolerance
- In addition, these systems must be small and light, cheap, fast, robust, and work at room temperature