### Round Gating for Low Energy Block Ciphers

#### Subhadeep Banik, Andrey Bogdanov

DTU Compute, Technical University of Denmark, Lyngby, Denmark

#### Francesco Regazzoni

ALARI, University of Lugano, Switzerland

#### Takanori Isobe, Harunaga Hiwatari, Toru Akishita

Sony Corporation, Japan

**IEEE HOST 2016** 

McLean, VA



#### **DTU** Compute

Department of Applied Mathematics and Computer Science

#### Outline



- Preliminaries
- AES-128: A Case Study
- CMOS Energy Consumption Model
- Round Gating
- Conclusion

#### Preliminaries Lightweight Block Ciphers

#### State of the Art

- Numerous block ciphers since AES
- E.g.: Present, TWINE, Piccolo, KATAN, Prince, Simon/Speck, ...
- Low area and low power designs widely studied
- Low energy $\Rightarrow$ largely unexplored
- Kerckhof et al (CHES 2012), Batina et al (RFIDSec 2013)

#### Preliminaries Power vs Energy

#### **Power and Energy**

- Both are important lightweight design metrics
- Power is the rate of energy consumption
- Energy is the time integral of power

$$E = \int_t P \ dt$$

- Energy $\Rightarrow$ total electric work done by the system

#### Tradeoffs

- Designing for low power/energy can be quite different
- Example: serial architectures for block ciphers
- In general, lower hardware area implies lower power consumption
- More cycles per encryption $\Rightarrow$ energy optimality NOT guaranteed
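The tradeoff above follows directly from $E = \int P\,dt$. A minimal sketch, using purely illustrative power figures and an assumed 10 MHz clock (none of these numbers are measurements from this work), of how a low-power serial design can still spend more energy per encryption than a wider design:

```python
def energy_per_encryption(avg_power_w: float, cycles: int, clock_hz: float) -> float:
    """Energy in joules: average power times the duration of one encryption."""
    return avg_power_w * cycles / clock_hz


# Hypothetical figures, chosen only to illustrate the tradeoff.
CLOCK_HZ = 10e6
serial = energy_per_encryption(avg_power_w=80e-6, cycles=226, clock_hz=CLOCK_HZ)
round_based = energy_per_encryption(avg_power_w=320e-6, cycles=11, clock_hz=CLOCK_HZ)

print(f"serial:      {serial * 1e12:.1f} pJ")
print(f"round-based: {round_based * 1e12:.1f} pJ")
```

In this sketch the serial design draws a quarter of the power but needs roughly twenty times as many cycles, so its energy per encryption is higher.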

#### AES-128: A Case Study Effect of Frequency



#### **Frequency Dependence**

- $P_{dyn} \propto Freq \Rightarrow P_{dyn} = \frac{CONST}{T} \Rightarrow E_{dyn} = P_{dyn} T = CONST$
- $E_{stat} = \int_T P_{stat} \, dt$
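The two relations can be combined into a small numerical sketch. It assumes a constant static power over the encryption, so that $E_{stat} = P_{stat} \cdot T$; the dynamic energy, leakage power and cycle count below are hypothetical placeholders, not values from this work.

```python
def total_energy(e_dyn_j: float, p_stat_w: float, cycles: int, freq_hz: float) -> float:
    """Frequency-independent dynamic energy plus static energy P_stat * T."""
    encryption_time_s = cycles / freq_hz
    return e_dyn_j + p_stat_w * encryption_time_s


# Hypothetical values: 350 pJ dynamic energy, 1 uW leakage, 11 cycles.
for freq_hz in (1e4, 1e5, 1e6, 1e7):
    energy_j = total_energy(350e-12, 1e-6, 11, freq_hz)
    print(f"{freq_hz:>10.0f} Hz: {energy_j * 1e12:8.1f} pJ")
```

Under these assumptions the static term shrinks as the clock speeds up, so the total approaches the constant dynamic energy.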



Figure: Energy consumption for round-based AES-128 vs Clock frequency

#### Not Surprising

- Clock frequency: for a low-leakage process, it is not a factor at sufficiently high frequencies (up to $f_{max} = \frac{1}{\tau}$).
- Same conclusion reached by Kerckhof et al. (CHES 2012)


#### AES-128: A Case Study Energy Consumption: Case Study AES-128

#### Observation

- Serialization/Unrolling: Round-based designs are clearly best

| # | Design           | Area (GE) | #Cycles | Energy (pJ) | Energy/bit (pJ) |
|---|------------------|-----------|---------|-------------|-----------------|
| 1 | 8-bit            | 2722.0    | 226     | 1913.1      | 14.94           |
| 2 | 32-bit ($A_1$)   | 4069.7    | 94      | 1123.3      | 8.77            |
|   | 32-bit ($A_2$)   | 4061.8    | 54      | 819.2       | 6.40            |
|   | 32-bit ($A_3$)   | 5528.4    | 44      | 801.7       | 6.26            |
| 3 | 64-bit ($B_1$)   | 6380.9    | 52      | 1018.7      | 7.96            |
|   | 64-bit ($B_2$)   | 6362.6    | 32      |             | 6.79            |
|   | 64-bit ($B_3$)   | 7747.5    |         | 616.2       | 4.81            |
| 4 | Round based      | 12459.0   | 11      | 350.7       | 2.74            |
| 5 | 2-round          | 22842.3   | 6       | 593.6       | 4.64            |
| 6 | 3-round          | 32731.9   | 5       | 1043.0      | 8.15            |
| 7 | 4-round          | 43641.1   | 4       | 1416.5      | 11.07           |
| 8 | 5-round          | 53998.7   | 3       | 1634.4      | 12.77           |
| 9 | 10-round         | 101216.7  | 1       | 2129.5      | 16.64           |

#### Table: Area and energy figures for different AES-128 architectures

#### CMOS Energy Consumption Model

- Two major sources of power in CMOS circuits:
  - Dynamic dissipation due to the charging and discharging of load capacitances.
  - Static dissipation due to leakage current and other current drawn continuously from the power supply.
- In a given time interval, if the cell makes $n$ transitions, then

#### Observation

$$E_{dyn} = E \cdot n = \left(\frac{1}{2}C_L V_{DD}^2 + E_{int}\right) \cdot n$$
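As a worked instance of the formula above, the following sketch evaluates $E_{dyn}$ for a single cell. The load capacitance, supply voltage and internal energy are hypothetical values chosen for illustration; the point is only that every extra transition, including a glitch, adds one more unit of per-transition energy.

```python
def dynamic_energy(c_load_f: float, v_dd: float, e_int_j: float, n_transitions: int) -> float:
    """E_dyn = (0.5 * C_L * V_DD^2 + E_int) * n, in joules."""
    energy_per_transition = 0.5 * c_load_f * v_dd ** 2 + e_int_j
    return energy_per_transition * n_transitions


# Hypothetical cell: 2 fF load, 1.2 V supply, 1 fJ internal energy.
clean = dynamic_energy(2e-15, 1.2, 1e-15, n_transitions=1)
glitchy = dynamic_energy(2e-15, 1.2, 1e-15, n_transitions=5)
print(f"1 transition:  {clean * 1e15:.2f} fJ")
print(f"5 transitions: {glitchy * 1e15:.2f} fJ")
```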

#### CMOS Energy Consumption Model Case Study: Two Rijndael S-Boxes





Figure: Simulation waveform of the 8-bit signals S1xD, S2xD and S3xD (time axis in ps, range 199742 to 204426)

#### CMOS Energy Consumption Model Case Study: *n* Rijndael S-Boxes

- If we place $n$ Rijndael S-Boxes sequentially: $E_1, E_2, E_3, \ldots$ is an arithmetic sequence.
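A small sketch of what the arithmetic-sequence observation implies for the total energy of the chain. The first term and the common difference are hypothetical parameters (the slides do not give their values); the closed form is the standard sum of an arithmetic series.

```python
def sbox_energies(e_first: float, increment: float, n: int) -> list[float]:
    """Per-S-box energies E_i = E_1 + (i - 1) * d for i = 1..n."""
    return [e_first + i * increment for i in range(n)]


def chain_energy(e_first: float, increment: float, n: int) -> float:
    """Closed form of the sum: n * E_1 + d * n * (n - 1) / 2."""
    return n * e_first + increment * n * (n - 1) / 2


# Hypothetical units: E_1 = 1.0, d = 0.5, four S-boxes in sequence.
print(sbox_energies(1.0, 0.5, 4))
print(chain_energy(1.0, 0.5, 4))
```

The total grows quadratically in $n$, which is why long combinational chains become expensive.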



Figure: Actual and Predicted Energy consumptions per cycle  $E_i$ 






Figure: Block Cipher Architecture

#### **Energy consumptions**

- The energy consumed in the $RF_i$ and $RK_i$ blocks is in arithmetic progression.
- If the algorithm has $R$ rounds and $r$ rounds are unrolled per cycle, one encryption takes $1 + \left\lceil \frac{R}{r} \right\rceil$ clock cycles.

#### Total Energy per encryption

- Energy consumption per encryption, as shown in Banik et al. (SAC 2015):

$$\mathbf{E}_r = E_r \cdot \left(1 + \left\lceil \frac{R}{r} \right\rceil\right) = (Ar^2 + Br + C) \cdot \left(1 + \left\lceil \frac{R}{r} \right\rceil\right)$$

- For "light" round functions like PRESENT, TWINE, SIMON, MIDORI, $r = 2$ is the optimal configuration
- For "heavy" round functions like AES, LED, PICCOLO, NOEKEON, $r = 1$ is optimal
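The energy formula above can be explored numerically. In the sketch below the coefficients $A$, $B$, $C$ are hypothetical, picked only so that one set behaves like a "light" round function and the other like a "heavy" one; they are not the fitted values from the cited work.

```python
import math


def cycles(total_rounds: int, r: int) -> int:
    """Clock cycles per encryption: 1 + ceil(R / r)."""
    return 1 + math.ceil(total_rounds / r)


def energy_per_encryption(a: float, b: float, c: float, total_rounds: int, r: int) -> float:
    """(A r^2 + B r + C) * (1 + ceil(R / r))."""
    per_cycle = a * r * r + b * r + c
    return per_cycle * cycles(total_rounds, r)


def best_unrolling(a: float, b: float, c: float, total_rounds: int, max_r: int) -> int:
    """Degree of unrolling r in 1..max_r that minimizes energy per encryption."""
    return min(range(1, max_r + 1),
               key=lambda r: energy_per_encryption(a, b, c, total_rounds, r))


# Hypothetical coefficient sets (arbitrary energy units).
print(best_unrolling(a=2, b=4, c=10, total_rounds=32, max_r=8))   # "light"-style round
print(best_unrolling(a=10, b=10, c=5, total_rounds=10, max_r=5))  # "heavy"-style round
```

When the quadratic term is small relative to the constant term, unrolling one extra round pays off because the cycle count halves; when it dominates, $r = 1$ wins.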

#### Tradeoff on r

- Suitable value of $r$?
- High $r$
  - Low latency: critical in e.g. memory encryption
  - Lower energy required to update registers
  - More energy in later rounds due to compounding switching activity
- Low $r$
  - Lower energy consumed per cycle
  - Avoids compounding switching activity in later rounds
  - High latency!

#### Round Gating The Idea of Round Gating

- For high $r$ (unrolled designs), the switching of transient signals compounds across round functions
- This is primarily responsible for the high energy consumption
- What if transients are limited to one round?
- The idea is to present the output of RF<sub>i</sub> to the input of RF<sub>i+1</sub> only when the signal has stabilized
- Can lead to substantial energy savings for unrolled low-latency designs!
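A toy model of the idea, not the authors' measurement method: it assumes that without gating the energy of successive round functions grows in arithmetic progression, and that with gating every round costs the same base energy plus a fixed gating overhead. All three parameters are hypothetical.

```python
def ungated_energy(e_round: float, growth: float, r: int) -> float:
    """Round i (0-based) costs e_round + i * growth because glitches compound."""
    return sum(e_round + i * growth for i in range(r))


def gated_energy(e_round: float, gate_overhead: float, r: int) -> float:
    """Each round sees only stable inputs but pays for its gating logic."""
    return r * (e_round + gate_overhead)


# Hypothetical units: base round energy 10, growth 8 per round, gate overhead 3.
for r in (1, 2, 4, 10):
    print(r, ungated_energy(10, 8, r), gated_energy(10, 3, r))
```

In this model gating costs more than it saves at $r = 1$, and the saving grows quickly as more rounds are unrolled.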

#### Round Gating The Idea of Round Gating



- Construct a delay unit with delay $\tau_D > \tau_{RF}$, where $\tau_{RF}$ is the delay of the round function.
- The ENABLE signal is transmitted through a chain of delay units.
- The AND gate becomes active only when ENABLE goes high, after $\tau_D$ seconds.
- $RF_{i+1}$  gets input only when output of  $RF_i$  has become stable.
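The timing condition in the list above can be checked with an idealized sketch. It assumes the gate in front of $RF_{i+1}$ opens at time $i \cdot \tau_D$, ignores the delay of the AND gates themselves, and uses hypothetical delay values; the function names are illustrative only.

```python
def gated_chain_is_glitch_free(tau_rf: float, tau_d: float, r: int) -> bool:
    """True if every gate opens only after the preceding round output is stable."""
    stable_at = tau_rf  # RF_1 receives its input at t = 0
    for i in range(1, r):
        gate_opens = i * tau_d  # ENABLE has passed through i delay units
        if gate_opens < stable_at:
            return False
        stable_at = gate_opens + tau_rf  # RF_{i+1} settles tau_rf later
    return True


def gated_latency(tau_rf: float, tau_d: float, r: int) -> float:
    """Time until the last of r round functions has a stable output."""
    return (r - 1) * tau_d + tau_rf


# Hypothetical delays in ns for a 10-round unrolled design.
print(gated_chain_is_glitch_free(tau_rf=2.0, tau_d=2.5, r=10))
print(gated_chain_is_glitch_free(tau_rf=2.0, tau_d=1.5, r=10))
print(gated_latency(tau_rf=2.0, tau_d=2.5, r=10))
```

The check fails as soon as $\tau_D < \tau_{RF}$, which is why the delay unit must be slower than the round function, and the extra margin in $\tau_D$ is paid for in latency.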

#### Round Gating Implementation



- The  $EN_i$  signals are constructed by a network of OR gates.
- The delay units are made of buffers.

#### Round Gating Snapshot for Unrolled AES Circuit (10 Rounds)



Figure: Simulation waveforms of the round-function outputs, without round gating (left, signals CT1xD ... OutxD0) and with round gating (right, signals Eout1xD ... OutxD0); time axis in ps

- Waveforms for the fully unrolled AES-128 circuit (normal and round-gated)
- The waveforms listed are the output signals of each successive round function
- With round gating, compounding of switching is prevented

#### Round Gating Experimental Results for $1 \le r \le 4$





Figure: Normal and Round Gated Energy consumptions

#### Round Gating Conclusions

#### Tradeoff on r

- For lower degrees of unrolling ( $1 \le r \le 4$ ):
  - Round gating not always beneficial
  - The round gating circuit itself consumes some energy
  - For ciphers like **PRESENT**, incremental switching is negligible
  - Hence round gating does more harm than good
- For higher degrees of unrolling and fully unrolled designs:
  - Round gating is always beneficial
  - Huge energy savings (over 60%) with only minimal additional hardware
  - Latency approximately doubles

#### Round Gating Comparison of fully unrolled circuits for various ciphers



| #  | Cipher      | Blocksize/Keysize | Area, normal (GE) | Area, round gated (GE) | Area % change | Energy, normal (pJ) | Energy, round gated (pJ) | Energy % change | Latency, normal (ns) | Latency, round gated (ns) |
|----|-------------|-------------------|-------------------|------------------------|---------------|---------------------|--------------------------|-----------------|----------------------|---------------------------|
| 1  | AES-128     | 128/128           | 101217            | 105931                 | +4.7%         | 2129.5              | 707.7                    | -66.8%          | 28.5                 | 54.3                      |
| 2  | Noekeon     | 128/128           | 24538             | 27113                  | +10.5%        | 3631.2              | 650.0                    | -82.1%          | 35.5                 | 57.7                      |
| 3  | Midori128   | 128/128           | 21647             | 24109                  | +11.4%        | 1760.1              | 328.5                    | -81.3%          | 18.8                 | 37.9                      |
| 4  | Midori64    | 64/128            | 8416              | 9612                   | +14.2%        | 563.1               | 168.9                    | -70.0%          | 14.4                 | 30.9                      |
| 5  | LED 128     | 64/128            | 47257             | 52161                  | +10.4%        | 13526.5             | 705.8                    | -94.8%          | 121.3                | 229.3                     |
| 6  | Prince      | 64/128            | 7729              | 8567                   | +10.8%        | 369.5               | 137.3                    | -62.8%          | 11.5                 | 22.0                      |
| 7  | Present     | 64/80             | 16036             | 20596                  | +28.4%        | 982.8               | 261.4                    | -73.4%          | 20.2                 | 43.8                      |
| 8  | Piccolo     | 64/80             | 16132             | 18707                  | +16.0%        | 2617.7              | 350.7                    | -86.6%          | 45.1                 | 88.0                      |
| 9  | Twine       | 64/80             | 15399             | 21260                  | +38.1%        | 1987.3              | 294.6                    | -85.2%          | 43.1                 | 75.6                      |
| 10 | Simon 64/96 | 64/96             | 18403             | 25568                  | +38.9%        | 1459.9              | 282.0                    | -80.7%          | 15.6                 | 37.8                      |
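The percentage columns and the latency penalty can be recomputed from the raw figures. The sketch below does this for three rows copied from the table; the helper names are illustrative only.

```python
def percent_change(normal: float, gated: float) -> float:
    """Relative change from the normal to the round-gated design, in percent."""
    return 100.0 * (gated - normal) / normal


# (energy normal pJ, energy gated pJ, latency normal ns, latency gated ns)
rows = {
    "AES-128": (2129.5, 707.7, 28.5, 54.3),
    "LED 128": (13526.5, 705.8, 121.3, 229.3),
    "Prince": (369.5, 137.3, 11.5, 22.0),
}

for cipher, (e_norm, e_gated, lat_norm, lat_gated) in rows.items():
    saving = percent_change(e_norm, e_gated)
    latency_ratio = lat_gated / lat_norm
    print(f"{cipher:8s} energy {saving:+.1f}%  latency x{latency_ratio:.2f}")
```

For these rows the latency ratio comes out just under 2, in line with the statement that latency approximately doubles.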

#### Round Gating Energy reduction for fully unrolled circuits



Figure: Comparative energy reduction for fully unrolled implementations

#### Conclusion Some Inferences

#### **Energy optimality**

- Signal delay across rounds $\Rightarrow$ more switching activity in later rounds
- So $r = 1, 2$ is usually the optimal energy configuration
- However, higher $r$ may be required in specific applications (e.g. low-delay memory encryption)
- Simpler round functions tend to have smaller signal delay
- E.g.: Present, TWINE, Simon 64/96
- For low r, round gating does not improve energy performance

#### For fully unrolled ciphers

- Round gating is highly effective
- Substantial energy savings with minimal hardware overhead



# THANK YOU