

# **Advanced VLSI**

## **Intro to Computer Arithmetic**

**2025**

## Arithmetic Circuits:

- **Core of every digital chip**

Arithmetic circuits are the heart of digital chips.

- **Determine the performance of the system**

The performance of digital chips is dictated by the arithmetic circuits. If the arithmetic circuit is optimized, performance improves.

- **Opportunities for improvement**

New algorithms require novel combinations of arithmetic circuits. There is always room for improvement.

# Adding Multiple Digits

$$\begin{array}{r} & 1 & 1 \\ & \downarrow & \downarrow \\ 9 & 1 & 8 \\ + & 4 & 3 & 7 \\ \hline 1 & 3 & 5 & 5 \end{array}$$

*Decimal*

$$\begin{array}{r} & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ & \downarrow \\ 0 & 1 & 1 & 1 & 1 & 0 & 0 & 1 & 0 \\ + & 0 & 0 & 1 & 1 & 0 & 1 & 1 & 0 \\ \hline 1 & 0 & 1 & 0 & 1 & 0 & 0 & 1 & 1 \end{array}$$

*Binary*

## Similar to Decimal Addition

- Starting from the Least Significant Digit, each digit is added.
- The carry of one digit is added to the digits on the right



The full adder is THE essential building block for almost all critical datapath operations!

$$S = A \oplus B \oplus C_i$$

$$= A\bar{B}\bar{C}_i + \bar{A}B\bar{C}_i + \bar{A}\bar{B}C_i + ABC_i$$

$$C_o = AB + BC_i + AC_i$$

$$S = ABC + (A+B+C) \bar{C}_o$$

useful for reducing circuit complexity !

# Ripple Carry Adder



# Curse Of The Carry

*The most significant output of the adder depends on the least significant inputs.*



## Carry Propagation

- The carry signal has to propagate from LSB to MSB
- **Fortunately, this case is rare**

# Carry-Propagate Adders

---

- Add two  $n$ -bit operands  $A$  and  $B$  and an optional carry-in  $c_{in}$  by performing **carry-propagation** [1, 2, 11]
- Sum  $(c_{out}, S)$  is *irredundant*  $(n + 1)$ -bit number

$$(c_{out}, S) = c_{out}2^n + S = A + B + c_{in}$$

$$\begin{aligned}2c_{i+1} + s_i &= a_i + b_i + c_i ; \\i &= 0, 1, \dots, n - 1 \\c_0 &= c_{in}, \quad c_{out} = c_n \quad (\text{r.m.a.})\end{aligned}$$



# Carry-Propagate Adders

---

## Ripple-carry adder (RCA)

- *Serial arrangement of  $n$  full-adders*
- *Simplest, smallest, and slowest CPA structure*

$$A = 7n, T = 2n, AT = 14n^2$$



# Carry-Propagate Adders

## Carry-propagation speed-up techniques

a) Concatenation of *partial CPAs* with fast  $c_{in} \rightarrow c_{out}$



Basic idea: try to do the carry propagation quickly, independently of the result for each bit and then enter the fast carry into partial adders which operate otherwise independently on parts of the sum

# Full Adder Revisited

## Carry Propagation

Consider the path  $C_i \rightarrow C_o$ :

- If **A=1** and **B=1**

A carry is **generated** no matter what the value of carry input ( $C_o$ ) is.

$$Generate = G = A \cdot B$$

- If only **A=1** or **B=1**

The value of the carry input ( $C_o$ ) is **propagated** to the carry output.

$$Propagate = P = A \oplus B$$





| $A$ | $B$ | $C_i$ | $S$ | $C_o$ | <i>Carry status</i> |
|-----|-----|-------|-----|-------|---------------------|
| 0   | 0   | 0     | 0   | 0     | delete              |
| 0   | 0   | 1     | 1   | 0     | delete              |
| 0   | 1   | 0     | 1   | 0     | propagate           |
| 0   | 1   | 1     | 0   | 1     | propagate           |
| 1   | 0   | 0     | 1   | 0     | propagate           |
| 1   | 0   | 1     | 0   | 1     | propagate           |
| 1   | 1   | 0     | 0   | 1     | generate            |
| 1   | 1   | 1     | 1   | 1     | generate            |

# Full Adder with Propagate and Generate

## Alternative Formulation

The Full Adder functionality can be expressed by using the Propagate (P) and Generate (G) signals:

$$G = A \cdot B$$

$$P = A \oplus B$$

$$S = P \oplus C_i$$

$$C_o = G + (P \cdot C_i)$$



## Area model:

3 simple gates (1x) and 2 complex (XOR) gates (2x) => 7

**Delay model:** 2 simple gates AB->CO => 2

# Carry-Propagate Adders

---



## Analysis of carry propagation

Sum digit in radix  $r$        $s_i = (x_i + y_i + c_i) \bmod r$

Special case of radix 2       $s_i = x_i \oplus y_i \oplus c_i$

Computing the carries is thus our central problem

For this, the actual operand digits are not important

What matters is whether in a given position a carry is

generated,      propagated,      or      annihilated (absorbed)

$$g_i = x_i \cdot y_i$$

$$p_i = x_i \oplus y_i$$

$$a_i = \overline{x_i \cdot y_i} = \overline{x_i} + \overline{y_i}$$

# Carry-Skip Adders



Idea: If ( $P_0$  and  $P_1$  and  $P_2$  and  $P_3 = 1$ )  
then  $C_{o,3} = C_0$ , else “kill” or “generate”.

# Carry-Skip Adders



$$t_{adder} = t_{setup} + M t_{carry} + \left( \frac{N}{M} - 1 \right) t_{bypass} + (M - 1) t_{carry} + t_{sum}$$

## Carry-skip adder (CSKA)

- Type a) : partial CPA with fast  $c_k \rightarrow c_i$

$$\boxed{c_i = \bar{P}_{i-1:k} c'_i + P_{i-1:k} c_k \text{ (bit group } (a_{i-1}, \dots, a_k))}$$
$$P_{i-1:k} = p_{i-1} p_{i-2} \cdots p_k \text{ (group propagate)}$$

- 1)  $P_{i-1:k} = 0$  :  $c_k \xrightarrow{'} c'_i$  and  $c'_i$  selected ( $c'_i \rightarrow c_i$ )  
2)  $P_{i-1:k} = 1$  :  $c_k \rightarrow c'_i$  but  $c'_i$  **skipped** ( $c'_i \xrightarrow{'} c_i$ )

$\Rightarrow$  path  $c_k \rightarrow c'_i \rightarrow c_i$  *never sensitized*  $\Rightarrow$  fast  $c_k \rightarrow c_i$

$\Rightarrow$  *false path*  $\Rightarrow$  inherent *logic redundancy*  $\Rightarrow$  problems in circuit optimization, timing analysis, and testing

- *Variable* group sizes (faster) : larger groups in the *middle* (minimize delays  $a_0 \rightarrow c_k \rightarrow s_{i-1}$  and  $a_k \rightarrow c_i \rightarrow s_{n-1}$ )

- Partial CPA typ. is RCA or CSKA ( $\Rightarrow$  *multilevel* CSKA)
- *Medium* speed-up at *small* hardware overhead  
(+ AND/bit + MUX/group)

$$A \approx 8n, T \approx 4n^{1/2}, AT \approx 32n^{3/2}$$



- Partial CPA typ. is RCA or CSKA ( $\Rightarrow$  *multilevel* CSKA)
- *Medium* speed-up at *small* hardware overhead  
(+ AND/bit + MUX/group)

$$A \approx 8n, T \approx 4n^{1/2}, AT \approx 32n^{3/2}$$



## Carry-select adder (CSLA)

$$A \approx 14n, T \approx 2.8n^{1/2}, AT \approx 39n^{3/2}$$



## Carry-select adder (CSLA)

- Type a) : partial CPA with fast  $c_k \rightarrow c_i$  and  $c_k \rightarrow s_{i-1:k}$

$$\boxed{s_{i-1:k} = \bar{c}_k s_{i-1:k}^0 + c_k s_{i-1:k}^1}$$
$$c_i = \bar{c}_k c_i^0 + c_k c_i^1$$

- *Two CPAs* compute two possible results ( $c_{in} = 0/1$ ), group carry-in  $c_k$  **selects** correct one afterwards
- *Variable* group sizes (faster) : larger groups at *end* (MSB) (balance delays  $a_0 \rightarrow c_k$  and  $a_k \rightarrow c_i^0$ )
- Part. CPA typ. is RCA, CSLA ( $\Rightarrow$  *multil.* CSLA), or CLA
- *High* speed-up at *high* hardware overhead (+ MUX/bit + (CPA + MUX)/group)

# Carry-Select Adders



## Analysis of carry propagation

$$\begin{aligned} c_i &= g_{i-1} + c_{i-1}p_{i-1} && \text{carry} \\ &= g_{i-1} + (g_{i-2} + c_{i-2}p_{i-2})p_{i-1} && \text{recurrence} \\ &= g_{i-1} + g_{i-2}p_{i-1} + c_{i-2}p_{i-2}p_{i-1} \\ &= g_{i-1} + g_{i-2}p_{i-1} + g_{i-3}p_{i-2}p_{i-1} + c_{i-3}p_{i-3}p_{i-2}p_{i-1} \\ &= g_{i-1} + g_{i-2}p_{i-1} + g_{i-3}p_{i-2}p_{i-1} + g_{i-4}p_{i-3}p_{i-2}p_{i-1} \\ &\quad + c_{i-4}p_{i-4}p_{i-3}p_{i-2}p_{i-1} \\ &= \dots \end{aligned}$$

|       |     |                                                             |
|-------|-----|-------------------------------------------------------------|
| $c_4$ | $=$ | $g_3 + g_2p_3 + g_1p_2p_3 + g_0p_1p_2p_3 + c_0p_0p_1p_2p_3$ |
| $c_3$ | $=$ | $g_2 + g_1p_2 + g_0p_1p_2 + c_0p_0p_1p_2$                   |
| $c_2$ | $=$ | $g_1 + g_0p_1 + c_0p_0p_1$                                  |
| $c_1$ | $=$ | $g_0 + c_0p_0$                                              |

# Carry-Lookahead Adder (CLA)



Four-bit carry network with full lookahead.

# Manchester Carry Chain Circuit



## CLA from combined (prefix) Generate Propagate Signals

- Define individual sub-adders each comprise multiple bits
- For each sub-adder define **propagate signal  $P_i$**  and a **generate signal  $G_i$**



$$G_0 = g_7 + p_7 g_6 + p_7 p_6 g_5 + \dots + p_7 p_6 p_5 p_4 p_3 p_2 p_1 g_0$$

$$c_8 = G_0 + P_0 c_0 \quad \begin{aligned} c_8 = & g_7 + p_7 g_6 + p_7 p_6 g_5 + p_7 p_6 p_5 g_4 + p_7 p_6 p_5 p_4 g_3 + p_7 p_6 p_5 p_4 p_3 g_2 \\ & + p_7 p_6 p_5 p_4 p_3 p_2 g_1 + p_7 p_6 p_5 p_4 p_3 p_2 p_1 g_0 + p_7 p_6 p_5 p_4 p_3 p_2 p_1 p_0 c_0 \end{aligned}$$

Unrolling keeps growing  
the fanin and fanout  $\otimes$  and  
leads to CLA adder

$$\begin{aligned} c_{16} &= G_1 + P_1 c_8 \\ &= G_1 + P_1 G_0 + P_1 P_0 c_0 \end{aligned} \quad \begin{aligned} P_1 &= p_8 p_9 p_{10} p_{11} p_{12} p_{13} p_{14} p_{15} \end{aligned}$$

## Carry-lookahead adder (CLA)

- High speed-up at *medium* hardware overhead

$$A \approx 14n, T \approx 4 \log n, AT \approx 56n \log n$$



## Hierarchical CLAs with two levels

- Objective: decompose the carry propagation into multiple levels of logic
- Keep the hierarchical structure in two levels: fanin and fanout still grow but more slowly with each level



## Carry-lookahead adder (CLA)



**Building a 64-bit carry-lookahead adder from 16 4-bit adders and 5 lookahead carry generators.**

# Prefix Operations

We define a set of signals

$$(G_{i:j}^k, P_{i:j}^k)$$

Which define the **combined** *Generate* and *Propagate* signals covering the bits from  $i$  to  $j$  at the  $k^{th}$  stage.

- The initial *Generate* and *Propagate* signals  $G_i$  and  $P_i$  will be written as

$$(G_{i:i}^0, P_{i:i}^0)$$

in this notation.

- The goal is to calculate

$$(G_{n-1:0}^k, P_{n-1:0}^k)$$

for all  $n$ , in any number of  $k$  stages.

# Parallel Prefix Adders

*Parallel Prefix Adders (PPA) represent a systematic approach to designing optimized adder architectures.*

## Principle

- Pre-processing ( $P$  and  $G$  generation)
- Carry Propagation Tree
- Post Processing ( $S$  generation)

## Merging Generate and Propagate Signals

- How to merge signals from two subsequent stages

$$\begin{aligned}(G_{i:i}^0, P_{i:i}^0) &= (g_i, p_i) \\ (G_{i:k}^l, P_{i:k}^l) &= (G_{i:j+1}^{l-1}, P_{i:j+1}^{l-1}) \bullet (G_{j:k}^{l-1}, P_{j:k}^{l-1}) ; \quad k \leq j \leq i \\ &= (G_{i:j+1}^{l-1} + P_{i:j+1}^{l-1} G_{j:k}^{l-1}, P_{i:j+1}^{l-1} P_{j:k}^{l-1}) \\ c_{i+1} &= G_{i:0}^m ; \quad i = 0, \dots, n-1, \quad l = 1, \dots, m\end{aligned}$$



# Main Parallel Prefix Operators



At each level we have two operations

- **Merge**

Merges the two  $(P, G)$  of **adjacent** bit ranges.

- **Feedthrough**

No operation, just copies the  $(P, G)$  signals to the next stage.

## Prefix problem

- Inputs  $(x_{n-1}, \dots, x_0)$ , outputs  $(y_{n-1}, \dots, y_0)$ , associative binary operator  $\bullet$  [11, 13]

$$(y_{n-1}, \dots, y_0) = (x_{n-1} \bullet \dots \bullet x_0, \dots, x_1 \bullet x_0, x_0) \text{ or}$$
$$y_0 = x_0, y_i = x_i \bullet y_{i-1}; i = 1, \dots, n-1 \text{ (r.m.a.)}$$

- Associativity of  $\bullet \Rightarrow$  tree structures for evaluation :

$$x_3 \bullet (x_2 \bullet (\underbrace{x_1 \bullet x_0}_y)) = (\underbrace{x_3 \bullet x_2}_z) \bullet (\underbrace{x_1 \bullet x_0}_y), \text{ but } y_2 ?$$
$$\underbrace{y_1 = Y_{1:0}^1}_y \qquad \underbrace{Y_{3:2}^1 \qquad y_1 = Y_{1:0}^1}_z \qquad y_3 = Y_{3:0}^2$$
$$y_2 = Y_{2:0}^2$$
$$y_3 = Y_{3:0}^3$$

- *Group variables*  $Y_{i:k}^l$  : covers bits  $(x_k, \dots, x_i)$  at level  $l$
- *Carry-propagation* is prefix problem :  $Y_{i:k}^l = (G_{i:k}^l, P_{i:k}^l)$

# Basic Structure of Parallel Prefix Adders



# Example: Ripple Carry Adder (PPA-RCA)



The following parameters define the performance

- **Number of Black Dots**

Determines the area, since only the **Merge** cells contain logic:

$$(G_{i:m}^k, P_{i:m}^k) = (G_{i:j}^{k-1} + P_{i:j}^{k-1} \cdot G_{j-1:m}^{k-1}, P_{i:j}^{k-1} \cdot P_{j-1:m}^{k-1})$$

- **Number of Stages**

Determines the critical path →  
fewer stages result in a faster the adder.

- **Maximum Fan Out**

Determines the per stage delay.  
Cells with a higher fan out will be slower.

# Parallel Prefix Adders: Sklansky (PPA-SK)



# Parallel Prefix Adders: Brent Kung (PPA-BK)



# Parallel Prefix Adders: Kogge Stone (PPA-KS)



# Parallel Prefix Adders: Han Carlson (PPA-HC)



# Parallel Prefix Adders: Carry Increment Adder (PPA-CIA)





# Performance Summary

| Adder | Area  | Speed | FO   | $\approx A_\bullet$  | $\approx T_\bullet$ | $\approx FO_{max}$ |
|-------|-------|-------|------|----------------------|---------------------|--------------------|
| RCA   | small | slow  | min  | $n - 1$              | $n - 1$             | 2                  |
| SK    | large | fast  | high | $\frac{n}{2} \log n$ | $\log n$            | $\frac{n}{2}$      |
| BK    | med   | med   | med  | $2n - \log n$        | $2 \log n - 2$      | $\log n$           |
| KS    | huge  | fast  | min  | $n \log n$           | $\log n$            | 2                  |
| HC    | large | fast  | min  | $\frac{n}{2} \log n$ | $\log n + 1$        | 2                  |
| CIA   | med   | med   | med  | $2n - 2\sqrt{n}$     | $2\sqrt{n}$         | $2\sqrt{n}$        |

**The formulas above are approximations**

Some adders (KS, HC) have long connections, that are sometimes hard to route, their post layout results differ.

## Some observations

- **RCA is a very efficient adder**

It is by far the smallest and simplest adder, especially for small bit sizes it is not too slow either.

- **BK offers good compromise**

Traditional Carry Look-Ahead Adder (CLA) is BK with 4-bit groups.

- **SK suffers from high fan-out**

In theory it is very fast, but the large fan out make it difficult to implement.

- **KS and HC are very fast, but suffer from routing**

These are very large and fast adders, routing is a serious problem.

# Simple Rule For Optimizing Parallel Prefix Adders



→

*Faster*

←

*Smaller*

