# Low-Power Fanout Optimization using Multi Threshold Voltages and Multi Channel Lengths

Behnam Amelifard, Farzan Fallah, Senior Member, Massoud Pedram, Fellow, IEEE

Abstract: This paper addresses the problem of low-power fanout optimization for near-continuous size inverter libraries. It is demonstrated that, because they neglect short-circuit current, previous techniques proposed to optimize the area of a fanout tree may result in excessive power consumption. The paper describes how the problem of low-power fanout optimization can be reduced to an inverter chain optimization problem and formulates the minimization of the total power consumption of an inverter chain as a geometric program. Moreover, it describes an efficient method to minimize the total power consumption of a fanout tree by using multi channel length (multi-$L_{\text{Gate}}$) and multi threshold voltage (multi-V<sub>t</sub>) techniques. Experimental results show that the proposed technique can reduce the power consumption of the fanout trees by an average of 11.17% over the SIS fanout optimization program.

Index Terms: Low-power design, logic synthesis, technology mapping, fanout optimization, multiple threshold voltage, multiple channel length

#### I. INTRODUCTION

Very often in VLSI circuits, a signal needs to be distributed to several destinations under a required timing constraint at each destination. In practice, there may also be a limitation on the load that can be driven by the source signal. Fanout optimization is the problem of building an inverter tree topology between a source and some sinks and sizing the inverters so that the driving capacitance at the source is less than an upper bound and the timing constraints at the sinks are met, while an objective function is minimized [1-3]. Different objective functions have been considered for the fanout optimization problem, such as minimizing area [2, 4, 5], minimizing power consumption [4, 6], and minimizing load on the source [7].
Manuscript received August 10, 2007; revised February 6, 2008 and July 25, 2008. This research was sponsored in part by a grant from the National Science Foundation. Behnam Amelifard is with Qualcomm Inc., San Diego, CA 92121 USA (e-mail: behnama@qualcomm.com). Farzan Fallah is with Envis Corporation, Santa Clara, CA 95054 USA (e-mail: farzan@envis.com). Massoud Pedram is with the University of Southern California, Los Angeles, CA 90089 USA (e-mail: pedram@usc.edu).

Unlike buffer insertion, which is a back-end process performed after global routing when the interconnect information is available, fanout optimization is performed during logic synthesis, often interleaved with the technology mapping process, in order to provide the global placer with accurate information about the number and sizes of the logic gates in the netlist. The fanout optimization problem to achieve minimum area for libraries with discrete sizes has been proven to be NP-complete [1, 8]. However, it has been shown that using an inverter library with near-continuous sizes greatly simplifies the problem [9]. More precisely, the assumption of a near-continuous library allows one to model the problem as a mathematical optimization problem with continuous variables and solve it efficiently. With a near-continuous library, mapping the optimized continuous variables to the discrete sizes in the library results in a near-optimal solution.

Several techniques have been proposed to address the fanout optimization problem using simplified delay models. Reference [7], for example, introduced two transformations, namely "merging" and "splitting", used to convert any fanout tree to a set of inverter chains. It was shown that these transformations maintain the area, delay, and input capacitance.
Using the transformations introduced in [7], reference [2] proposed a logical effort-based fanout optimizer for area which attempts to minimize the total buffer area under the required time and input capacitance constraints.

Although much research has been done to address the fanout optimization problem, there is little work on low-power fanout optimization. More specifically, since both dynamic and leakage power dissipation of a fanout chain are proportional to its area, it has been widely accepted that power minimization of the fanout tree is equivalent to its area optimization [4, 6]. In this paper, however, we show that due to short-circuit power dissipation, minimizing area does not necessarily result in a minimized power dissipation solution. In particular, the solution obtained from an area-optimized fanout tree may dissipate excessive short-circuit power. We formulate the problem of minimizing the power dissipation of a fanout chain and show how to build a fanout tree out of these power-optimized chains. Additionally, to suppress the leakage power dissipation in a fanout tree, we use multi-L<sub>Gate</sub> [10, 11] and multi-V<sub>t</sub> techniques. In the presence of multi-L<sub>Gate</sub> and multi-V<sub>t</sub> options, we accurately model the delay and power dissipation of inverters as posynomials; therefore, our proposed problem formulation results in a convex mathematical program consisting of a posynomial objective function with posynomial inequality constraints.

When there is only one sink, the fanout tree is reduced to a chain of inverters between the source and sink, and the fanout optimization problem becomes that of finding the number and sizes of the inverters that satisfy the input capacitance and timing constraints while minimizing some objective function such as area or power dissipation.
For multiple sinks, on the other hand, by using the split and merge transformations [7] or by limiting the types of the fanout trees to the so-called LT-trees [1], a fanout tree can be constructed from the inverter chains. In this paper we use *fanout chain* to describe the fanout topology with one sink and *fanout tree* to describe it when there are multiple sinks.

The remainder of the paper is organized as follows. Section II describes the delay and power models that will be used throughout the paper. Section III investigates the problem of minimizing the area of a fanout chain and shows that a minimized-area fanout chain may dissipate excessive short-circuit power. Section IV formulates the problem of low-power fanout chain optimization (i.e., when there is only one sink). Section V shows how a low-power fanout tree can be constructed from the fanout chains. Simulation results and conclusions are given in Sections VI and VII, respectively.

#### II. DELAY AND POWER MODELS

## A. The Delay Model

The delay model we use in this paper is based on logical effort [12]. Logical effort is a technique for modeling and analyzing delay in CMOS circuits and has been widely used to solve a variety of synthesis problems including technology mapping [13, 14], gate sizing [15], and fanout optimization [2, 6, 7]. It has also been incorporated in some industry synthesis tools [16, 17]. Although the accuracy of the logical effort delay model is reduced for deep-submicron devices, the main advantage of this technique is that it is very simple, quite efficient, and exhibits high fidelity as far as the gate propagation delays are concerned. Therefore, it has found broad application in the early design stages, when the interconnect information is not available. By using this technique, the initial sizing of logic gates can be performed and the results provided to a global placer.
After doing the placement/routing and extracting interconnect information, more accurate models, e.g., non-linear delay models or lookup tables, may be used for delay analysis and resizing of the gates if needed. In this section we first review this model and then describe its extension to handle multi-V<sub>t</sub> and multi-L<sub>Gate</sub> techniques.

Using the notion of logical effort, the delay of a gate with input capacitance $C_{in}$, which drives the load capacitance $C_L$, is modeled as,

$$D = \tau_0(p + gh) \tag{1}$$

where $\tau_0$ is a conversion coefficient that characterizes the semiconductor process being used and converts the unitless part, $p+gh$, to a time unit. For the sake of simplicity, in the remainder of this paper, we set $\tau_0$ to one. Parameter $p$ denotes the parasitic delay of the gate. The major contributor to the parasitic delay is the capacitance of the source/drain regions of the transistors that drive the output. Parameter $g$ denotes the "logical effort" of the gate, which depends only on the topology of the gate and its relative ability to produce output current. Finally, parameter $h$ denotes the "electrical effort" of the gate and is defined as the ratio of the output capacitance of the gate to its input capacitance, i.e., $h = C_L/C_{in}$.

For an inverter, the value of the logical effort $g$ equals one, and it can be shown that $p$ is the ratio of the output diffusion capacitance to the input gate capacitance of the template inverter, denoted by $p_0 = C_{diff,T} / C_{in,T}$. Notice that since both the input gate and diffusion capacitances of an inverter scale linearly with the inverter's size, for a scaled inverter the ratio of diffusion-to-gate capacitance remains constant, i.e.,

$$C_{diff} / C_{in} = p_0 \tag{2}$$

where $C_{diff}$ is the diffusion capacitance at the output and $C_{in}$ is the gate capacitance at the input.
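As a small illustration of (1) applied stage by stage, the following Python sketch computes the delay of an inverter chain from the input capacitances of its inverters. The function names and the numeric values (including $p_0 = 1$) are assumptions chosen for illustration and are not taken from the paper; $\tau_0$ is set to one and $g = 1$, as in the text.

```python
def stage_delay(c_in, c_load, p=1.0, g=1.0, tau0=1.0):
    """Delay of one gate per equation (1): D = tau0 * (p + g * h)."""
    h = c_load / c_in  # electrical effort h = C_L / C_in
    return tau0 * (p + g * h)


def chain_delay(input_caps, c_load, p0=1.0):
    """Total delay of an inverter chain.

    input_caps[i] is the input capacitance of inverter i; each inverter
    drives the next one, and the last inverter drives c_load.
    """
    loads = list(input_caps[1:]) + [c_load]
    return sum(stage_delay(c, cl, p=p0) for c, cl in zip(input_caps, loads))


if __name__ == "__main__":
    # Hypothetical three-stage chain with equal electrical efforts of 4.
    print(chain_delay([1.0, 4.0, 16.0], 64.0, p0=1.0))
```

With these hypothetical values every stage has $h = 4$, so each stage contributes $p_0 + h = 5$ delay units.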
In the following, we show how to extend the concept of logical effort to handle multi-V<sub>t</sub> and multi-L<sub>Gate</sub> technologies. It is known that when the threshold voltage of a gate is changed, the new delay can be obtained from the alpha-power law [18] by the following equation,

$$d = d_0 \frac{(V_{dd} - V_{t0})^{\alpha}}{(V_{dd} - V_t)^{\alpha}} \tag{3}$$

where $\alpha$ is a technology parameter which is around 1.3 for short-channel devices, $V_{dd}$ is the supply voltage, $V_{t0}$ is the nominal threshold voltage, $d_0$ is the delay under the nominal threshold voltage, $V_t$ is an arbitrary threshold voltage, and $d$ is the delay under the arbitrary threshold voltage. Using equations (1) and (3) one can verify that in a multi-$V_t$ technology, the values of the logical effort and parasitic delay change as follows,

$$g_v = \frac{(V_{dd} - V_{t0})^{\alpha}}{(V_{dd} - v)^{\alpha}}, \quad p_v = p_0 \frac{(V_{dd} - V_{t0})^{\alpha}}{(V_{dd} - v)^{\alpha}} \tag{4}$$

where $g_v$ and $p_v$ are the logical effort and parasitic delay for an arbitrary threshold voltage $v$.

Equations (1) and (4) are based on the assumption that the channel length of the gate, $L$, is equal to the nominal channel length of the technology, $L_{nom}$. In a multi-$L_{Gate}$ technology, however, the delay of a logic gate is an increasing function of the channel length. Our SPICE simulations [19] show that when the channel length of an inverter is increased, the new delay can be obtained from the following equation,

$$d_l = d_0 l^{\beta_d} \tag{5}$$

where $l$ is the normalized channel length, i.e., $l = L_{Gate} / L_{nom}$, and $\beta_d$ is a fitting parameter. Moreover, $d_0$ is the delay under the nominal channel length, while $d_l$ is the delay of the gate with the normalized channel length $l$.
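The scaling relations (3) to (5) can be sketched in a few lines of Python. The function names are hypothetical, and the supply voltage, threshold voltages, and $\beta_d$ used in the example are placeholder values assumed for illustration; the paper obtains $\beta_d$ by fitting to SPICE data.

```python
def vt_scale(v, vdd, vt0, alpha=1.3):
    """Delay scaling factor of (3) when the threshold moves from vt0 to v."""
    return ((vdd - vt0) / (vdd - v)) ** alpha


def effort_multi_vt(v, vdd, vt0, p0, alpha=1.3):
    """Logical effort g_v and parasitic delay p_v of an inverter, per (4)."""
    s = vt_scale(v, vdd, vt0, alpha)
    return s, p0 * s


def length_scale(l, beta_d):
    """Delay scaling factor of (5) for normalized channel length l."""
    return l ** beta_d


if __name__ == "__main__":
    # Placeholder technology values (assumptions, not from the paper).
    vdd, vt0, p0, beta_d = 1.0, 0.2, 1.0, 1.5
    print(effort_multi_vt(0.3, vdd, vt0, p0))  # higher Vt: g and p grow
    print(length_scale(1.1, beta_d))           # longer channel: delay grows
```

At the nominal threshold voltage and nominal channel length both factors equal one, so the model falls back to the plain logical-effort delay of (1).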
Using equation (5), one can easily establish that in a multi-$L_{\text{Gate}}$ technology, the values of the logical effort and parasitic delay change as follows,

$$g_l = l^{\beta_d}, \quad p_l = p_0 l^{\beta_d} \tag{6}$$

# B. Power Dissipation Model

The power dissipation of a CMOS gate has three components: capacitive power, short-circuit power, and leakage power.

## 1) Capacitive Power Dissipation

The capacitive power dissipated in the inverter capacitances, i.e., the input gate capacitance and output diffusion capacitance, is equal to,

$$P_{dyn} = \alpha f V_{dd}^2 C \tag{7}$$

where $\alpha$ is the switching activity of the inverter, $f$ is the frequency, $V_{dd}$ is the supply voltage, and $C$ is the sum of the input gate capacitance and output diffusion capacitance of the inverter, i.e., $C = C_{diff} + C_{in}$. By using (2), equation (7) can be re-written as,

$$P_{dyn} = \alpha f V_{dd}^2 (1 + p_0) C_{in} = k_{dyn} C_{in} \tag{8}$$

In a multi-$L_{\text{Gate}}$ technology, the input gate capacitance of the inverter increases as a result of biasing the channel length, while the diffusion capacitance remains unchanged. Therefore, the capacitive power dissipation is obtained from,

$$P_{dyn,l} = k_{dyn} \frac{l + p_0}{1 + p_0} C_{in} \tag{9}$$

where $C_{in}$ denotes the input capacitance of the inverter under the nominal gate length.

## 2) Short-Circuit Power Dissipation

The second source of power dissipation in digital circuits is short-circuit current. If a circuit is *well-designed*, its short-circuit power dissipation is about 10%-20% of the capacitive power dissipation [20]. Several techniques have been proposed to address the problem of short-circuit power estimation [20], but due to their complexity, their use tends to be impractical during gate-level optimization.
In this paper, by observing the fact that the short-circuit power dissipation of an inverter is a linear function of its size and input transition time [20], and also the fact that the input transition time itself can be approximated as a linear function of the electrical effort of its fanin gate (see Fig. 1), the short-circuit power dissipation of the *i*<sup>th</sup> inverter in a chain is calculated as,

$$P_{sc} = \alpha A_{sc} h_{i-1} f V_{dd} C_{in} = k_{sc} h_{i-1} C_{in} \tag{10}$$

where $A_{sc}$ is the short-circuit factor, which is a technology-dependent parameter, $h_{i-1}$ is the electrical effort of the $(i-1)^{\rm th}$ inverter, and $C_{in}$ is the input capacitance of the $i^{\rm th}$ inverter. From Fig. 1 one can see that this technique, despite its simplicity, is accurate enough to be used in gate-level optimization. From equations (8) and (10), one can see that the ratio of the short-circuit to the dynamic power dissipation of an inverter can be expressed as,

$$\frac{P_{sc}}{P_{dyn}} = \frac{k_{sc}}{k_{dyn}} h_{i-1}. \tag{11}$$

For various values of $h_{i-1}$ this ratio is plotted in Fig. 1.

Fig. 1. The percentage ratio of the short-circuit power dissipation of the $i^{\rm th}$ inverter to its dynamic power dissipation, as a function of $h_{i-1}$.

It should be noted that in a multi-$V_t$ inverter chain, the short-circuit power dissipation, and consequently $k_{sc}$ of the $i^{\rm th}$ inverter (henceforth denoted as $k_{sc,i}$), is a function of the threshold voltages of the $i^{\rm th}$ inverter and its driver (i.e., the $(i-1)^{\rm th}$ inverter). If there are $m$ threshold voltages in the library, then there will be $m^2$ distinct values for the $k_{sc,i}$'s.

Utilizing a longer channel length for the PMOS and NMOS transistors in a CMOS inverter increases the threshold voltage of both transistors; therefore, the time during which both NMOS and PMOS transistors are ON during the output transition is decreased. Thus, the short-circuit power consumption of the inverter is reduced.
On the other hand, since the output slew time of an inverter increases when using a longer channel length, the short-circuit power of the fanout gate increases. Therefore, in an inverter chain, the short-circuit power dissipation of the $i^{\rm th}$ inverter is inversely proportional to the channel length of the inverter, i.e., $l_i$, and directly proportional to the channel length of its driver, i.e., $l_{i-1}$. Based on these observations, we model the short-circuit power dissipation of the $i^{\rm th}$ inverter in a chain as,

$$P_{sc} = k_{sc} h_{i-1} l_i^{-\beta_{sc1}} l_{i-1}^{\beta_{sc2}} C_{in} \tag{12}$$

where $\beta_{sc1}$ and $\beta_{sc2}$ are technology constants found by fitting (12) to data extracted from SPICE-level simulations. It should be mentioned that although the accuracy of the model is reduced for large $l_i$'s, the short-circuit power dissipation for these values of $l_i$ becomes quite small compared to the capacitive power, so the error in the total power consumption model remains small [19].

## 3) Leakage Power Dissipation

The third source of power dissipation is the leakage current. In present CMOS technologies, the major components of the leakage current are the sub-threshold and gate-tunneling currents [21]. The sub-threshold leakage is the drain-source current of a transistor operating in the weak inversion region, which can be expressed as [21],

$$I_{sub} = A_{sub}\mu_0 C_{ox} \left(\frac{w}{L_{eff}}\right) \exp\left(\frac{q}{n'kT} \left(V_{gs} - V_{t0} - \gamma' V_{sb} + \eta V_{ds}\right)\right) \times \left(1 - \exp\left(-qV_{ds} / kT\right)\right) \tag{13}$$

where $A_{sub} = (kT/q)^2 \exp(1.8)$, $\mu_0$ is the zero-bias mobility, $C_{ox}$ is the gate oxide capacitance per unit area, $w$ and $L_{eff}$ denote the width and effective length of the transistor, $k$ is the Boltzmann constant, $T$ is the absolute temperature, and $q$ is the electrical charge of an electron.
In addition, $V_{t0}$ is the zero-biased threshold voltage, $\gamma'$ is the linearized body-effect coefficient, $\eta$ denotes the Drain-Induced Barrier Lowering (DIBL) coefficient, and $n'$ is the sub-threshold swing coefficient of the transistor.

Let $C_N$ denote the input capacitance of an NMOS transistor. Since $V_{ds}$ of the OFF transistor is $V_{dd}$, which is more than a few $kT/q \approx 26\,\mathrm{mV}$, and noting that in an NMOS transistor $w_N = C_N/(L_{eff}C_{ox})$, the sub-threshold leakage power of an NMOS transistor can be written as,

$$P_{sub,N} = A'_{sub} C_N \mu_N e^{-\lambda V_{t0,n}} \tag{14}$$

where $\lambda=q/n'kT$ and $A'_{sub}=A_{sub}V_{dd}/L_{eff}^2\exp(\lambda\eta V_{dd})$ are technology constants. A similar formula can be derived for the sub-threshold leakage power of a PMOS transistor. From the sub-threshold leakage power expressions for the NMOS and PMOS transistors, the sub-threshold leakage power dissipation of an inverter, $P_{sub}$, can be written as,

$$P_{sub} = \rho P_{sub,P} + (1 - \rho) P_{sub,N} \tag{15}$$

where $\rho$ is the probability that the input of the inverter is at logic 1. If the ratio of the width of the PMOS transistor to that of the NMOS transistor is $\gamma$, i.e., $w_P/w_N=\gamma$, then by considering the fact that for an inverter $C_{in}=C_N+C_P$, (15) can be re-written as,

$$P_{sub} = \frac{A'_{sub}}{1+\gamma} \left( \rho \gamma \mu_P e^{-\lambda V_{t0,p}} + (1-\rho) \mu_N e^{-\lambda V_{t0,n}} \right) C_{in} = k_{sub} C_{in} \tag{16}$$

From (16) one can see that increasing the threshold voltage results in an exponential decrease in sub-threshold leakage current. Based on this observation, multi-$V_t$ and gate-length biasing techniques have been proposed to reduce the leakage power dissipation. Without loss of generality, we assume the threshold voltages of the NMOS and PMOS transistors are equal.
In this case, when the threshold voltage of an inverter is changed to $v$, the new sub-threshold leakage power consumption is obtained as,

$$P_{sub,h} = k_{sub} \exp(-\lambda(v - V_{t0})) C_{in} = k_{sub,h} C_{in} \tag{17}$$

Utilizing a longer channel length for an inverter increases the threshold voltage of both PMOS and NMOS transistors, which in turn reduces the sub-threshold leakage. Based on these observations, we model the sub-threshold power dissipation of the $i^{\rm th}$ inverter in an inverter chain as,

$$P_{sub,l} = k_{sub} l^{-\beta_{sub}} C_{in} \tag{18}$$

where $\beta_{sub}$ is a technology constant. SPICE simulations show that despite its simplicity, this model is quite accurate [19].

The other major source of leakage power dissipation is the gate-oxide tunneling current. If SiO<sub>2</sub> is used for the gate oxide, the main source of gate-oxide tunneling leakage in CMOS circuits is the gate-to-channel tunneling current of the ON NMOS transistors, which can be modeled as [21, 22],

$$I_{ox} = A_{ox} w_N L_{eff} \left(\frac{V_{ox}}{t_{ox}}\right)^2 e^{-B_{ox} \frac{t_{ox}}{V_{ox}}} \tag{19}$$

where $A_{ox}$ and $B_{ox}$ are technology constants, $t_{ox}$ is the oxide thickness, and $V_{ox}$ is the potential drop across the oxide. When the transistor is ON, $V_{ox} = V_{gs} - \psi_s$, where $\psi_s$ is the surface potential of the transistor. Ignoring the gate-tunneling leakage of the PMOS transistor, the gate-tunneling leakage power dissipation of an inverter, $P_{ox}$, can be calculated by,

$$P_{ox} = \frac{A'_{ox}}{1+\gamma} \rho C_{in} = k_{ox} C_{in} \tag{20}$$

where $A'_{ox} = A_{ox}V_{dd} (V_{dd} - \psi_s)^2 \exp(-B_{ox}t_{ox}/(V_{dd} - \psi_s))/(t_{ox}\varepsilon_0\varepsilon_{ox})$ is independent of the size and the threshold voltage of the inverter.
From (19) one can see that the gate-oxide tunneling leakage is proportional to the area of the gate; therefore, in a multi-$L_{Gate}$ technology, (20) should be modified as,

$$P_{ox,l} = k_{ox} l C_{in} \tag{21}$$

# III. MINIMUM AREA FANOUT CHAIN

In minimizing the area of a fanout chain, shown in Fig. 2, the goal is to find the number of inverters in the chain and their corresponding sizes so that the delay constraint for the sink and the load capacitance constraint for the source are satisfied, while the total area of the chain is minimized:

$$\begin{cases} Min & Area \\ s.t. & (i) \ Delay \le T \\ & (ii) \ C_1 \le C_{in,\max} \end{cases} \tag{22}$$

where $T$ is the required time at the sink, $C_1$ is the input capacitance of the first inverter, and $C_{in,\max}$ is the maximum tolerable load at the source.

Fig. 2. A fanout chain driving a lumped capacitance.

In [2], based on the fact that the area of an inverter chain is proportional to the sum of the input capacitances of the inverters in the chain, and noticing that in an inverter chain with $n$ inverters the input capacitance of the $i^{\rm th}$ inverter can be expressed as $C_i = C_L / \prod_{j=i}^n h_j$, it is shown that the problem of minimizing the area of the chain with $n$ inverters can be formulated in the logical effort notation as,

$$\begin{cases} Min & Area(\vec{h}) = \sum_{i=1}^{n} \frac{C_L}{\prod_{j=i}^{n} h_j} \\ s.t. & (i) \ \sum_{i=1}^{n} (p_0 + h_i) \le T \\ & (ii) \ H = \prod_{i=1}^{n} h_i \ge \frac{C_L}{C_{in,\max}} \end{cases} \tag{23}$$

where $C_L$ is the load capacitance and $\vec{h} = (h_1, ..., h_n)$. The problem stated in (23) is called the Fanout Chain Optimization for Area with $n$ inverters, FCOA(n). The minimized-area fanout chain can be found by solving FCOA(n) for different values of $n$. However, depending on the polarity of the sink, only even or odd values of $n$ should be considered.

On the other hand, it can be shown [2] that for a fixed number of inverters in the chain (i.e., a fixed $n$), (23) has a solution when $n\left(C_L/C_{in,\max}\right)^{1/n}+np_0\leq T$. This inequality defines a lower bound and an upper bound for the values of $n$ satisfying the constraints of (23) and limits the number of FCOA(n) instances that need to be solved to find the minimum-area fanout chain [2].

**Lemma 1:** In the optimum solution of FCOA(n), the delay of the fanout chain is exactly equal to the required time $T$, i.e., [2]

$$\sum_{i=1}^{n} (p_0 + h_i) = T. \tag{24}$$

## A. Convex Representation

In the following, we show one important property of FCOA(n) which guarantees that the problem of minimizing the area of a fanout chain can be solved optimally in polynomial time. More precisely, we show that with a slight modification, the problem shown in (23) is converted to a convex program. A convex optimization problem is one of the form [23],

$$\begin{cases} Min & f_0(\vec{x}) \\ s.t. & f_i(\vec{x}) \le b_i, \quad i = 1,...,m \end{cases} \tag{25}$$

where the functions $f_0,...,f_m: \Re^n \to \Re$ are convex, $b_1,...,b_m$ are some positive real numbers, and $\vec{x}=(x_1,...,x_n)$ is a vector. One important property of a convex optimization problem is that a locally optimal solution is also the globally optimal solution.

**Lemma 2:** The function $f$ defined as $f(\vec{x}) = 1/\prod_{i=1}^{m} x_i$ is convex on $dom(f) = \Re_{++}^{m}$.

**Proof:** The proof is omitted for brevity. The interested reader may refer to [19].

**Theorem 1:** By changing the second constraint of FCOA(n) to

$$\frac{1}{\prod_{i=1}^{n} h_i} \le \frac{C_{in,\max}}{C_L} \tag{26}$$

FCOA(n) becomes a convex optimization problem for all values of $n$.

**Proof:** According to Lemma 2 the objective function of FCOA(n) is a summation of convex functions, and because the summation operation preserves the convexity property [23], the objective function of the problem given by (23) is convex.
On the other hand, the first constraint of (23) is a linear function of the $h_i$'s; hence, it is convex. The function $f(\vec{x}) = \prod_{i=1}^n x_i$ is neither convex nor concave [23]. However, according to Lemma 2, by re-writing the second constraint as (26) it becomes convex. Since the objective function and constraints of (23) are convex on $\Re_{++}$, the mathematical problem stated in (23) is convex.

Since FCOA(n) is a convex program, it can be efficiently solved by using standard mathematical program solvers.

## B. Minimum Area versus Minimum Power Fanout Chain

Since both dynamic and leakage power dissipation of a fanout chain are proportional to its area, it has been widely accepted that power minimization of a fanout chain is equivalent to its area optimization [4, 6]. In the following, however, we show that due to short-circuit power dissipation, minimizing area does not necessarily result in a minimized power dissipation solution, and the solution obtained from an area optimization technique may dissipate excessive short-circuit power.

First, note that if the constraints of (23) do not intersect at any point, i.e., $n\left(C_L/C_{in,\max}\right)^{1/n}+np_0>T$, there is no solution to the problem. On the other hand, if the intersection of the constraints of (23) results in exactly one point, i.e., when $n\left(C_L/C_{in,\max}\right)^{1/n}+np_0=T$, the only solution to FCOA(n) is when all $h_i$'s are equal to $T/n-p_0$. In other cases the optimization problem (23) can be solved by using the Lagrangian relaxation technique. In this technique, the constraints are relaxed and added to the objective function after multiplying them by nonnegative coefficients, called the Lagrange multipliers. The new objective function is called the Lagrangian.
In FCOA(n), the Lagrangian is written as,

$$L(\vec{h}, \lambda_1, \lambda_2) = Area(\vec{h}) + \lambda_1\left(\sum_{i=1}^n h_i - T + np_0\right) + \lambda_2\left(H_0 - \prod_{i=1}^n h_i\right) \tag{27}$$

where $\lambda_1$ and $\lambda_2$ are non-negative Lagrange multipliers, $\vec{h}=(h_1,...,h_n)$, and $H_0=C_L/C_{in,\max}$. The set of Kuhn-Tucker conditions implies that at the optimal solution of FCOA(n),

$$\frac{\partial L}{\partial h_i} = 0 \qquad i = 1, ..., n \tag{28}$$

and

$$\lambda_1 \left( \sum_{i=1}^n h_i - T + n p_0 \right) = 0 \tag{29}$$

$$\lambda_2 \left( H_0 - \prod_{i=1}^n h_i \right) = 0. \tag{30}$$

Now, considering the first set of conditions shown in (28), from $\partial L/\partial h_1 = 0$ it is concluded that,

$$-\frac{1}{h_1\pi_1} + \lambda_1 - \frac{\pi_1}{h_1}\lambda_2 = 0 \tag{31}$$

where $\pi_i$ is defined as,

$$\pi_i = \prod_{j=i}^n h_j. \tag{32}$$

Similarly, because $\partial L/\partial h_i = \partial L/\partial h_{i+1} = 0$, we have $h_i \partial L/\partial h_i = h_{i+1} \partial L/\partial h_{i+1}$, which results in,

$$\lambda_1 h_i = \lambda_1 h_{i+1} - \frac{1}{\pi_{i+1}}. \tag{33}$$

One immediate result of (33) is that in the optimal solution of FCOA(n), the values of the $h_i$'s are increasing, i.e.,

$$h_1 \le h_2 \le \dots \le h_n. \tag{34}$$

Equality holds if and only if the required time and input capacitance constraints intersect at exactly one point.

Going back to the remaining Kuhn-Tucker conditions, from Lemma 1 one can see that (29) is already satisfied. The remaining condition, given in (30), implies that one of its factors is zero. If the input capacitance constraint of the optimization problem is "loose", i.e., in the optimal solution $H_0 < \prod_{i=1}^n h_i$, it is necessary that $\lambda_2 = 0$. In this case, (31) implies that $\lambda_1 = 1/(h_1\pi_1)$ and (33) may be rewritten as,
$$\frac{1}{h_1 \pi_1} h_i = \frac{1}{h_1 \pi_1} h_{i+1} - \frac{1}{\pi_{i+1}}. \tag{35}$$

Similarly,

$$\frac{1}{h_1 \pi_1} h_{i-1} = \frac{1}{h_1 \pi_1} h_i - \frac{1}{\pi_i} \tag{36}$$

and since $\pi_i = h_i \pi_{i+1}$, from (35) and (36) it is concluded that,

$$h_{i+1} = h_i(h_i - h_{i-1} + 1) \tag{37}$$

where $h_0 = 0$. Equation (37) is a recursive equation from which the values of all $h_i$'s may be found as functions of $h_1$. Plugging the values of the $h_i$'s as functions of $h_1$ into (24) and solving the polynomial equation, the value of $h_1$ which minimizes the objective function is found. To the best of our knowledge, there is no closed-form solution to (37); however, one important property of this recurrence equation may be expressed by the following lemma.

**Lemma 3:** In recurrence equation (37),

$$h_i > h_1^{2^{i-1}} \tag{38}$$

**Proof:** The proof is omitted for brevity. The interested reader may refer to [19].

From Lemma 3, one can see that when the input capacitance constraint of FCOA(n) is loose, in the optimal solution of (23) the values of the $h_i$'s grow exponentially and, based on (11) and Fig. 1, the ratio of short-circuit to dynamic power dissipation of the inverters grows accordingly.

#### IV. LOW-POWER FANOUT CHAINS

The discussion in Section III establishes that minimizing the area of a fanout chain will not minimize its power consumption. In this section, we generalize the problem and propose a mathematical program for low-power fanout chain design in multi-$V_t$ and multi-$L_{\rm Gate}$ technologies. More precisely, we assume $m$ discrete threshold voltages are available to be used in the inverters of the chain. In addition, we assume the channel length of the inverters can be increased up to $L_{\rm max}$. The objective is to find the optimal number of inverters and their corresponding threshold voltages, channel lengths, and sizes to achieve the minimum power consumption in the active mode.
When $m=1$ and $L_{\rm max}=L_{nom}$, this problem simply becomes that of finding the optimal number of inverters and their corresponding sizes.

#### A. Problem Formulation

A multi-$V_t$ and multi-$L_{\rm Gate}$ fanout chain is shown in Fig. 3. In this figure, the $h_i$'s denote the electrical efforts of the inverters, the $C_i$'s are the input capacitances, the $l_i$'s denote the channel lengths, and the $v_i$'s are the threshold voltages of the inverters. The goal is to find the number of inverters, $n$, and the $h_i$'s, $l_i$'s, and $v_i$'s that minimize the total power dissipation while meeting both a timing constraint and an input capacitance upper bound constraint. Moreover, there is an upper bound on the length of the channel, and the threshold voltage of each inverter should be selected from a given set of available threshold voltages. Since increasing the channel length increases the threshold voltage of a transistor as well, we do not consider increasing both the channel length and threshold voltage of an inverter because the delay penalty tends to be too high.

Moreover, we assume a multi-V<sub>t</sub> design is achieved by ion implantation in the channel of the gate. Since changing the channel doping has a negligible effect on the diffusion and gate capacitance, this assumption implies that the dynamic and gate-tunneling leakage power consumptions are not affected by changing threshold voltages. However, changing the threshold voltage of an inverter alters its delay and sub-threshold leakage according to equations (4) and (16). In addition, as discussed in Section II.B, this change also has an effect on the short-circuit power consumption of the fanout chain. Changing the channel length, on the other hand, alters the delay and all components of power dissipation, as described in Section II.B.

To simplify the equations, without loss of generality, we assume the driver and load of the chain are fixed-size inverters.
The driver is called the $0^{\rm th}$ inverter, while the load is called the $(n+1)^{\rm th}$ inverter. Using the formulation derived in Section II, the power dissipation of the $i^{\rm th}$ inverter in the chain with the normalized channel length $l_i$ can be expressed as,

$$P_{i} = \frac{C_{L}\left(\gamma_{i}k_{dyn} + k_{sub,i}l_i^{-\beta_{sub}} + k_{ox}l_{i} + k_{sc,i}h_{i-1}l_{i}^{-\beta_{sc1}}l_{i-1}^{\beta_{sc2}}\right)}{\prod_{j=i}^{n}h_{j}} \tag{39}$$

where $\gamma_i=(l_i+p_0)/(1+p_0)$. Moreover, $k_{sub,i}$ is obtained from equation (17) and $k_{sc,i}$ is the short-circuit factor for the $i^{\rm th}$ inverter. Therefore, the problem of optimizing the fanout chain for power dissipation becomes,

$$\begin{cases} Min & P(\vec{h}) = \sum_{i=1}^{n} P_i + k_{sc,n+1} h_n C_L \\ s.t. & (i) \ \sum_{i=1}^{n} (p_i + g_i h_i) l_i^{\beta_d} \leq T \\ & (ii) \ H = \prod_{i=1}^{n} h_i \geq \frac{1}{l_1} \frac{C_L}{C_{in,\max}} \\ & (iii) \ 1 \leq l_i \leq \frac{L_{\max}}{L_{nom}} \\ & (iv) \ v_i \in \{V_1, \dots, V_m\} \end{cases} \tag{40}$$

where $p_i$ and $g_i$ are the parasitic delay and logical effort of the $i^{\rm th}$ inverter, which operates with the threshold voltage $v_i$. The first two constraints in (40) are the delay and input capacitance constraints, while the third constraint imposes an upper bound on the length of the channels. Finally, the fourth constraint of (40) restricts the threshold voltages of the transistors of the inverters to the set of available threshold voltages $\{V_1,...,V_m\}$, where $V_1$ is the nominal threshold voltage and $V_1 \leq ... \leq V_m$.

The size and threshold voltage of the load are fixed; therefore, the dynamic and leakage power dissipations of the load inverter are constant.
However, the short-circuit power dissipation of the load inverter is a function of the electrical effort of the last stage in the chain, i.e., $h_n$; thus, we include the short-circuit power dissipation of the load in the objective function.

The problem stated in (40), which is the <u>F</u>anout <u>C</u>hain <u>O</u>ptimization for minimum <u>P</u>ower with $n$ inverters, $m$ threshold voltages, and an upper bound $L_{\max}$ on the channel length, will be called $FCOP(n,m,L_{\max})$ in the rest of this paper. To find the minimum-power fanout chain, $FCOP(n,m,L_{\max})$ should be solved for different values of $n$. Based on the polarity of the sink, only even or only odd values of $n$ need to be considered.

**Lemma 4:** In the $FCOP(n, m, L_{\max})$ problem, the total electrical effort, $H$, is maximized when all $v_i$'s are equal to $V_1$, all $l_i$'s are 1, and all $h_i$'s are equal.

**Proof:** The geometric mean of a set of positive numbers is less than or equal to their arithmetic mean, with equality if and only if all values are equal. From the first constraint it can be seen that

$$\begin{aligned} T &\ge \sum_{i=1}^{n} p_{i} l_{i}^{\beta_{d}} + \sum_{i=1}^{n} g_{i} h_{i} l_{i}^{\beta_{d}} \\ &\ge \sum_{i=1}^{n} p_{i} l_{i}^{\beta_{d}} + n \prod_{i=1}^{n} \left(g_{i} h_{i} l_{i}^{\beta_{d}}\right)^{1/n} \end{aligned} \tag{41}$$

From (41) it is concluded that, in order to have a solution to $FCOP(n,m,L_{\max})$, the following relation must hold:

$$\frac{T - \sum_{i=1}^{n} p_i l_i^{\beta_d}}{n \prod_{i=1}^{n} \left( g_i l_i^{\beta_d} \right)^{1/n}} \ge \prod_{i=1}^{n} (h_i)^{1/n} = H^{1/n} . \tag{42}$$

Since $p_i \geq p_0$, $l_i \geq 1$, and $g_i \geq 1$, the maximum of $H$ occurs when all $h_i$'s are equal, all $l_i$'s are equal to 1, and all $p_i$'s and $g_i$'s assume their minimum values of $p_0$ and 1, respectively. The latter condition implies that all $v_i$'s are equal to the nominal threshold voltage $V_1$. In this case, the maximum value of $H = \prod_{i=1}^n h_i$ is $H_{\max} = (T/n - p_0)^n$.
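As a quick numerical illustration of Lemma 4, the following minimal sketch evaluates $H_{\max}=(T/n-p_0)^n$ and compares the equal-effort chain with a skewed one that has the same total delay. It assumes the single-$V_t$, nominal-length case ($g_i=1$, $l_i=1$, $p_i=p_0$) and a delay normalized to $\tau_0$. The function names and the values of `T` and `n` are hypothetical; only $p_0$ is taken from Table I.

```python
import math

def h_max(T, n, p0):
    """Largest total electrical effort for n nominal inverters (Lemma 4).

    Assumes T is normalized to tau_0 and every stage has g_i = 1, l_i = 1,
    p_i = p0. Returns 0.0 when the parasitic delay alone already exceeds T.
    """
    per_stage = T / n - p0
    return per_stage ** n if per_stage > 0 else 0.0

def chain_delay(efforts, p0):
    """Delay of a nominal chain: sum of (p0 + h_i) over all stages."""
    return sum(p0 + h for h in efforts)

# Illustrative values only: p0 = 1.33 comes from Table I, T and n are made up.
T, n, p0 = 20.0, 3, 1.33
equal = [T / n - p0] * n
# Shift effort between two stages while keeping the total delay at T.
skewed = [equal[0] + 1.0, equal[1] - 1.0, equal[2]]

assert math.isclose(chain_delay(equal, p0), T)
assert math.isclose(chain_delay(skewed, p0), T)
# The equal split yields the larger product of efforts.
print(math.prod(equal), math.prod(skewed))
```

Both chains meet the delay bound with equality, yet the skewed chain reaches a smaller total effort, which is the arithmetic-geometric mean argument used in the proof.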
According to Lemma 4, there is a maximum value of $H$, denoted $H_{\rm max}$, for any given buffer count. On the other hand, since $l_1 \leq L_{\rm max}/L_{nom}$, the second constraint of $FCOP(n,m,L_{\rm max})$ implies that $H$ must be at least $C_L/C_{in,{\rm max}} \times L_{nom}/L_{\rm max}$. Therefore, the only feasible buffer counts are those for which $H_{\rm max}$ is not less than the ratio $C_L/C_{in,{\rm max}} \times L_{nom}/L_{\rm max}$.

Fig. 3. A multi-$V_t$ fanout chain.

One important property of $FCOP(n,m,L_{\max})$ is that, in its optimal solution, the delay of the fanout chain may not be equal to the specified required time $T$. To see why, notice that the objective function of $FCOP(n,m,L_{\max})$ is not a decreasing function of the $h_i$'s or $l_i$'s; therefore, increasing the $h_i$'s or $l_i$'s up to the point where $\sum_{i=1}^n (p_i + g_i h_i) l_i^{\beta_d} = T$ may not result in the minimum value of the objective function.

If the design is not multi-$L_{\rm Gate}$, i.e., $L_{\rm max} = L_{nom}$, then the third constraint in (40) is eliminated from the problem and all $l_i$'s become 1. Similarly, if the design is not multi-$V_t$, i.e., $m=1$, the fourth constraint in (40) is eliminated and all $p_i$'s and $g_i$'s become $p_0$ and 1, respectively. One can verify that the constraints of $FCOP(n,1,L_{nom})$ are the same as those of $FCOA(n)$.

If the design is multi-$V_t$, i.e., $m \geq 2$, then because of the discrete values of the $v_i$'s in $FCOP(n,m,L_{\max})$, a posynomial problem solver needs to enumerate all possible assignments of the threshold voltages, i.e., $m^n$ assignments, and solve the resulting mathematical program for each of them to find the minimum-power fanout chain by optimally selecting the $h_i$'s and $l_i$'s. Because its runtime is exponential in $n$, such an enumeration is not practical. Hence, we use the same approach as in [6] to assign the threshold voltages.
In this approach, the threshold voltages are assigned as follows: going from the source to the sink, the threshold voltages never decrease. This heuristic, called *monotone assignment* of the threshold voltages, greatly simplifies the problem and reduces the number of possible candidates to $nm$.

It is known that each additional threshold voltage needs one more mask layer in the fabrication process, which increases the fabrication cost. As a result, in many cases only two threshold voltages are used in the circuit. At the same time, there are studies that show the benefit of having more than two threshold voltages is small [24]. So, in the sequel we concentrate on the problem of 2-$V_t$ low-power fanout optimization, i.e., $FCOP(n, 2, L_{\max})$. The results can be extended to handle more threshold voltages.

It is worth mentioning that in the multi-$L_{\rm Gate}$ technique, typically a limited and discrete number of $L_{\rm Gate}$ values are chosen for leakage reduction. However, unlike the multi-$V_t$ technique, the number of discrete length values is not limited to two or three. This is because different channel lengths can be created by simply changing the geometry of the device while using only one mask. Discrete values of the channel length are nevertheless needed for mask production. That is why, in our results, after optimally sizing the channel lengths, we round them to the nearest 1nm.

The pseudo-code for the BestChain algorithm is provided in Fig. 4. First, using the result of Lemma 4, for a given $C_{in,\max}$, $C_L$, and $T$, the BestChain algorithm finds the lower and upper bounds of $n$. Based on the polarity of the sink node, only even or only odd numbers of inverters between these bounds are considered when searching for the optimum solution.
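The bound search described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: instead of solving $(T/n-p_0)^n = C_L/C_{in,\max}\cdot L_{nom}/L_{\max}$ for its two roots, it simply scans $n$ and keeps the counts that pass the Lemma 4 test and have the right parity. The function name, the `inverting` flag, the `max_n` cap, and the example values of the load, input bound, and `T` are all assumptions; the delay is taken to be normalized to $\tau_0$.

```python
def feasible_inverter_counts(c_load, c_in_max, T, p0, l_max_ratio,
                             inverting, max_n=64):
    """Return inverter counts n whose H_max (Lemma 4) can satisfy constraint (ii).

    T is normalized to tau_0; l_max_ratio stands for L_max / L_nom.
    `inverting` selects odd n (True) or even n (False).
    """
    required = (c_load / c_in_max) / l_max_ratio  # lower bound on H
    counts = []
    for n in range(1, max_n + 1):
        if (n % 2 == 1) != inverting:
            continue  # this n gives the wrong polarity at the sink
        per_stage = T / n - p0
        if per_stage > 0 and per_stage ** n >= required:
            counts.append(n)
    return counts

# Made-up instance: only p0 and L_max/L_nom are taken from Table I.
print(feasible_inverter_counts(c_load=100.0, c_in_max=1.0, T=20.0, p0=1.33,
                               l_max_ratio=1.1, inverting=False))
```

Because $(T/n-p_0)^n$ first rises and then falls with $n$, the surviving counts form a contiguous window, which is why the algorithm only needs a lower and an upper bound.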
For a given $n$, the BestChain algorithm first attempts the $FCOP(n, 2, L_{\max})$ problem with all threshold voltages set to $V_1$, i.e., the nominal threshold voltage. If there is no feasible solution, then the timing and/or input capacitance constraints are too tight for this $n$. Otherwise, the algorithm goes through a number of iterations, where in each iteration the threshold voltages of the last $m$ inverters in the chain are set to $V_2$. This process is repeated until we find $\widetilde{m}$ such that there exists a feasible solution to $FCOP(n,2,L_{\max})$ with $\widetilde{m}$ inverters at $V_2$, but not with $\widetilde{m} + 2$ inverters.

In the pseudo-code, the function FCOP-FV finds the optimum solution to the $FCOP(n,2,L_{\max})$ problem with known threshold voltage values, as captured by the assignment vector $\vec{v}$. More precisely, FCOP-FV finds the $l_i$'s of the first $n-m$ inverters, which have the nominal threshold voltage, and the $h_i$'s of all inverters. Note that, since the FCOP-FV function is called for a fixed $\vec{v}$, this optimization problem is the minimization of a posynomial function subject to posynomial inequality constraints. This posynomial formulation is translated into a convex one by the change of variables $h_i = \exp(x_i)$ and $l_i = \exp(y_i)$ and is solved in polynomial time [23].
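To make the change of variables concrete, the constraints of (40) for a fixed $\vec{v}$ can be rewritten with $x_i=\ln h_i$ and $y_i=\ln l_i$. This is a sketch of the standard geometric-programming transformation applied to (40), not a formula taken from the original text; $p_i$ and $g_i$ are constants once $\vec{v}$ is fixed.

$$\begin{aligned} (i)\quad & \ln\!\left(\sum_{i=1}^{n}\left(p_i e^{\beta_d y_i} + g_i e^{x_i+\beta_d y_i}\right)\right) \le \ln T \\ (ii)\quad & -\sum_{i=1}^{n} x_i - y_1 \le \ln\frac{C_{in,\max}}{C_L} \\ (iii)\quad & 0 \le y_i \le \ln\frac{L_{\max}}{L_{nom}} \end{aligned}$$

Constraint (i) is a log-sum-exp of affine functions of $(x, y)$ and is therefore convex, while (ii) and (iii) are linear. Every term of (39) is likewise a monomial in the $h_i$'s and $l_i$'s (the factor $\gamma_i$ contributes two monomials), so the logarithm of the objective is also a log-sum-exp of affine functions.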
```
BestChain(C_{in,max}, C_L, T, pol) {
    (\tilde{n}_1, \tilde{n}_2) = solution of (C_L/C_{in,max} \cdot L_{nom}/L_{max}) = (T/n - p_0)^n
    n_1 = \lfloor \tilde{n}_1 \rfloor or \lfloor \tilde{n}_1 \rfloor + 1   (depending on pol)
    (pwr^*, \vec{h}^*, \vec{l}^*, \vec{v}^*) = (+\infty, \varnothing, \varnothing, \varnothing)
    For n = n_1 to n_2 step 2 {
        // start with every inverter at the nominal threshold voltage
        For i = 1 to n
            \vec{v}(i) = V_1
        (\vec{h}, \vec{l}, pwr) = FCOP-FV(n, T, C_{in,max}, C_L, \vec{v})
        If \vec{h} = \varnothing continue          // no feasible chain for this n
        If pwr < pwr^*
            (pwr^*, \vec{h}^*, \vec{l}^*, \vec{v}^*) = (pwr, \vec{h}, \vec{l}, \vec{v})
        // raise threshold voltages one stage at a time, from the sink toward the source
        For m = n to 1 step -1 {
            \vec{v}(m) = V_2
            (\vec{h}, \vec{l}, pwr) = FCOP-FV(n, T, C_{in,max}, C_L, \vec{v})
            If pwr < pwr^*
                (pwr^*, \vec{h}^*, \vec{l}^*, \vec{v}^*) = (pwr, \vec{h}, \vec{l}, \vec{v})
        }
    }
    Return (pwr^*, \vec{h}^*, \vec{l}^*, \vec{v}^*)
}
```

Fig. 4. BestChain algorithm

#### V. BUILDING A FANOUT TREE

In this section we show how to build a fanout tree with more than one sink. Reference [7] introduced two transformations that can be performed on a fanout tree, namely merging and splitting, and showed that these transformations preserve the area, delay, and input capacitance of the fanout tree. We have extended the merging and splitting transformations to handle multi-$V_t$ and multi-$L_{\text{Gate}}$ fanout trees, as depicted in Fig. 5.

Fig. 5. Extended split/merge transformations for multi-$V_t$ and multi-$L_{\text{Gate}}$ inverters

**Theorem 2:** The extended split/merge transformations applied to a multi-$V_t$ and multi-$L_{\text{Gate}}$ fanout tree as depicted in Fig. 5 preserve the delay, input capacitance, and power dissipation values of the tree.

**Proof:** We provide the proof for the split transformation. Before splitting, the delay of the inverter is $(p_x + g_x h) l^{\beta_d}$ while the input capacitance is $(C_1 + C_2)/h$.
After splitting the original inverter into two inverters with equal electrical efforts $h$, equal channel lengths $l$, and equal threshold voltages $v_x$, the delay through the inverter in either branch is $(p_x + g_x h)l^{\beta_d}$, while the input capacitances are $C_1/h$ and $C_2/h$, which sum to $(C_1 + C_2)/h$. Therefore, this transformation preserves the delay and input capacitance values. Since this transformation does not change the input capacitance, the electrical effort of the previous stage, which characterizes the short-circuit power dissipation of the inverters involved in the transformation, does not change; it is also easy to see that the capacitive and leakage power consumption of the tree remains the same after the transformation. Moreover, since this transformation does not change the channel length of the inverter transistors, the short-circuit power dissipations of $C_1$ and $C_2$ remain the same. Hence, the total power dissipation of the fanout tree is the same before and after the split transformation.

Since the extended split/merge transformations preserve the delay, input capacitance, and power dissipation values, any fanout optimization problem with $m$ sink nodes can be converted by these transformations into $m$ fanout chain optimization problems whose total power dissipation is the same. To apply these transformations, two issues should be addressed. The first issue is the allocation of input capacitance to the different chains in a decomposed fanout tree, and the second issue is the validity of assuming a continuous-size inverter library. In the following we address these issues.

#### A. Input Capacitance Allocation

The Input Capacitance Allocation to achieve minimum Power (ICAP) problem is defined as follows: given a number of sinks, each with a required time, polarity, and capacitive load, and a total budget on input capacitance $C_{in,tot}$, allocate portions of $C_{in,tot}$ to each fanout chain such that the total power is minimized while the given constraints for all sinks are satisfied. In this section we show the ICAP problem is NP-complete, and we use a heuristic to allocate the input capacitance to the different chains in a decomposed fanout tree.

**Lemma 5:** For a fixed number of inverters in a multi-$V_t$ and multi-$L_{\text{Gate}}$ fanout chain, the power cost is a decreasing function of the input capacitance bound, $C_{in,\max}$.

**Proof:** From the second constraint in (40), it is seen that increasing the input capacitance bound of a fanout chain expands the feasible space of the optimization problem. Therefore, the enlarged problem has a solution with either lower power consumption or the same power consumption; that is, the power cost of a fanout chain never increases as the input capacitance bound grows.

**Theorem 3:** The ICAP problem is NP-Complete.

**Proof:** To prove that ICAP is NP-Complete, we show that the 0-1 Knapsack problem can be reduced to the ICAP problem. In the 0-1 Knapsack problem, there are some items, each with its own value and weight; the objective is to select some items such that the total value of the selected items is maximized while their total weight does not exceed a given budget. In the ICAP problem, however, the objective is to minimize power. To make ICAP a maximization problem, we consider the negative of power as the objective function. According to Lemma 5, the power cost is a decreasing function of the input capacitance constraint; therefore, the graph of the maximum of negative power over all inverter counts looks like Fig. 6.
Notice this graph exhibits a piecewise behavior because power is represented by different functions for different inverter counts. The piecewise nature of power versus input capacitance helps us to reduce the 0-1 Knapsack problem to the ICAP problem.

Fig. 6. Negative of power dissipation versus the input capacitance curve.

This reduction is similar to the reduction of the Knapsack problem to the problem of input capacitance allocation for minimum area; hence, it is omitted here. Interested readers may refer to [2] for details. Having established that ICAP is NP-Hard, we note that a candidate solution to the decision version of ICAP can be verified in polynomial time: one can add up the input capacitances of the branches and compare the sum with the input capacitance budget in linear time. Therefore, ICAP is in NP; since it is also NP-Hard, the ICAP problem is NP-Complete.

The heuristic we use for solving the ICAP problem is similar to that of [2]. It starts by allocating to each branch the minimum input capacitance required for that branch to have a feasible fanout chain solution. Next, the remaining input capacitance is divided between the chains in proportion to the positive slopes of $H_{\max,i}$ versus $n_i$ for each branch $i$.

#### B. Discrete-Size Inverter Library

The second issue to address is the assumption that a continuous-size inverter library is available. In reality, although many different inverter sizes are available in ASIC libraries, these sizes are discrete (there are typically 8-16 different inverter sizes in an industrial state-of-the-art ASIC library). So each inverter in the solution needs to be mapped onto one of the available inverters in the library. The main problem when rounding the inverter sizes is that it may result in significant errors.
To address this problem, reference [2] defined a constant $\varepsilon_h$ and merged two inverters on different chains if the difference between their electrical efforts was less than or equal to $\varepsilon_h$. Notice that, in general, two inverters are merged if the rounding error after merging is smaller than the sum of the rounding errors of the inverters before the merge operation. We adopt the same heuristic with the additional requirement that the two candidate inverters should also have the same threshold voltage and that the difference between $l_1$ and $l_2$ should be smaller than a constant $\varepsilon_l$. Merging starts at the source of the signal and proceeds toward the sinks.

#### VI. SIMULATION RESULTS

The technique proposed in Section IV, which we call *LPFO*, has been developed in the SIS framework [25]. The MOSEK convex optimization tool [26] has been used to solve the mathematical problems. To extract the parameters used in the optimization problems, we performed transistor-level simulation of devices in HSPICE [27] on a 65nm technology node [28]. The simulations have been done at a frequency of 1GHz, a supply voltage of 1.1V, and a die temperature of 100°C. Moreover, we assumed the switching activity of the source node is 5% and the probability of this node being at logic one is 0.5 in all circuits.

The parameters of this technology node are shown in Table I. In this table, $k_{sc,LH}$ is the short-circuit factor of an inverter whose threshold voltage is high while the threshold voltage of its driver is low. $k_{sc,LL}$, $k_{sc,HL}$, and $k_{sc,HH}$ are defined similarly. The values of the short-circuit factors as well as $k_{sub,low}$, $k_{sub,high}$, and $k_{ox}$ are normalized with respect to $k_{dyn}$. In this set of experiments, a standard cell library consisting of sixteen different inverters was used to map the fanout trees.
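The merge test of Section V.B and the mapping of a continuous size onto a discrete library can be sketched as follows. This is a minimal illustration under stated assumptions: the `Stage` record, the function names, the tolerance values, and the example library sizes are hypothetical, since the text does not list the actual $\varepsilon_h$, $\varepsilon_l$, or library sizes that were used.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    h: float   # electrical effort
    l: float   # normalized channel length (L / L_nom)
    vt: float  # threshold voltage in volts

def can_merge(a, b, eps_h, eps_l):
    """Merge test: same V_t, efforts within eps_h, channel lengths within eps_l."""
    return (a.vt == b.vt
            and abs(a.h - b.h) <= eps_h
            and abs(a.l - b.l) < eps_l)

def nearest_size(size, library):
    """Map a continuous inverter size onto the closest size in a discrete library."""
    return min(library, key=lambda s: abs(s - size))

# Hypothetical tolerances and library; the threshold voltages follow Table I.
low_a = Stage(h=3.0, l=1.00, vt=0.2)
low_b = Stage(h=3.1, l=1.05, vt=0.2)
high = Stage(h=3.0, l=1.00, vt=0.3)

print(can_merge(low_a, low_b, eps_h=0.2, eps_l=0.1))  # same V_t, close h and l
print(can_merge(low_a, high, eps_h=0.2, eps_l=0.1))   # different V_t
print(nearest_size(5.3, [1, 2, 4, 8, 16]))
```

A finer library shrinks the distance between a continuous size and its nearest discrete neighbor, which is the effect quantified later in Table IV.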
TABLE I TECHNOLOGY PARAMETERS USED IN SIMULATIONS

| Parameter | Value | Parameter | Value |
|----------------|---------|--------------------|-------|
| $V_{t,low}$ | 0.2V | $k_{sc,LL}$ | 0.069 |
| $V_{t,high}$ | 0.3V | $k_{sc,LH}$ | 0.006 |
| $\gamma$ | 3.5 | $k_{sc,HL}$ | 0.099 |
| $\tau_0$ | 8.6e-12 | $k_{sc,HH}$ | 0.014 |
| $p_0$ | 1.33 | $\beta_{sub}$ | 7.4 |
| $k_{dyn}$ | 1.000 | $\beta_{sc1}$ | 22.5 |
| $k_{sub,low}$ | 0.343 | $\beta_{sc2}$ | 4.4 |
| $k_{sub,high}$ | 0.078 | $\beta_d$ | 1.6 |
| $k_{ox}$ | 0.096 | $L_{\max}/L_{nom}$ | 1.1 |

To study the efficiency of our technique in reducing the power consumption of the fanout trees, we conducted two sets of experiments. In the first set of experiments, whose results are shown in Table II, we assumed the options of multi-$V_t$ and multi-$L_{\text{Gate}}$ are not available in the library and compared the results of LPFO with the results of low-area fanout optimization (LEOPARD) [2] for a few random problems in the form of fanout chains. In this table, $C_{in,\max}$ denotes the maximum allowed capacitance at the input of the fanout chain, $C_{out}$ is the load capacitance, and *pol* is the polarity of the sink. In each fanout chain, first the path delay was minimized using the technique proposed in [12]. Next, each chain was given some additional slack, and either the LPFO or the LEOPARD algorithm was invoked to minimize the power dissipation or the area of the fanout chain, respectively. Each optimized chain was mapped to the library of inverters, and detailed SPICE simulation was carried out on the circuit to measure the power consumption.

From Table II one can see that minimizing the area of the fanout chains in many cases increases the total power consumption. On the other hand, when the fanout chains are optimized for power, by increasing the available slack in the chain, the power reduction saturates at some point.
From the table, the power consumption of the minimum power fanout chains is not always a decreasing function of available slack. This is due to round-off error in mapping the continuous-size inverters to discrete-size inverters in the library. The second set of experimental results compares LPFO with LEOPARD and the SIS fanout optimization program for a set of problems in the form of fanout trees. SIS runs different fanout optimization algorithms, namely Two-Level, Bottom-Up, Balanced, LT-Tree, and reports the best one [1]. In this set of experiments, the same standard cell library used for LPFO and LEOPARD has been utilized as the SIS library. For each inverter $\tau_{intrinsic}$ and $R_{out}$ were specified for the SIS library delay model and $p_0$ and $\tau_0$ were specified for the logical effort delay model. A very close match between the SIS delay and logical effort delay model values was enforced. The fanout optimization programs of SIS were first used to perform fanout optimization for a set of problems. Next the delay and input capacitance resulting from SIS were used as constraints for LPFO and LEOPARD. After performing the fanout optimization, the SPICE netlist for each circuit was generated and detailed HSPICE simulation was performed to measure the delay and the power consumption of the circuit. The results of these experiments are reported in Table III. The first column is the name of the problem instance, the second column denotes the number of sinks in the fanout problem, columns 3 and 4 respectively show the area and power consumption of each fanout problem achieved by running the SIS fanout optimization and the remaining columns show the area and power reduction of LEOPARD and LPFO algorithms over corresponding values of SIS program. From Table III one can see fanout trees resulting from LEOPARD, on average, consume 11.79% more power than those achieved by SIS. 
Utilizing LPFO, on the other hand, reduces not only the power consumption of fanout trees by an average of 11.17% but also their area by an average of 29.64%.

Our last set of experimental results demonstrates how the size of the inverter library affects the quality of results in the proposed technique (the size of a library is defined as the number of gates in it). Table IV shows the average and maximum error in the power consumption of the fanout chains (shown in Table II) as a result of mapping continuous inverter sizes to discrete values in inverter libraries of different sizes. From this table one can see that with an inverter library size of 10 or more, the mapping error becomes quite negligible.

Note that in our problem setup and in the simulation results, we ignored the interconnect power dissipation and delay costs. The reason is that we do the fanout optimization during logic synthesis and prior to generating the layout. Therefore, the locations of the source and the sinks are not known, and as a result the interconnect delay cannot be accurately modeled. It is thus reasonable to assume the expected values of delay and power dissipation per wire in the inverter chain or the fanout tree are nearly the same. This constant contribution can, thus, be taken out of the problem formulation by properly adjusting the required time constraints on the sinks and adding a constant term to the total power equation.

TABLE II THE COMPARISON OF THE TOTAL POWER CONSUMPTION IN MINIMUM DELAY FANOUT CHAINS, LEOPARD, AND LPFO

| Circuit | $C_{in,\max}$ | $C_{out}$ | pol | Min Delay Circuit Org Power (µW) | Min Delay Circuit Org Delay (ps) | LEOPARD Power Reduction (%), Slack 10% | LEOPARD Power Reduction (%), Slack 20% | LEOPARD Power Reduction (%), Slack 30% | LEOPARD Power Reduction (%), Slack 40% | LPFO Power Reduction (%), Slack 10% | LPFO Power Reduction (%), Slack 20% | LPFO Power Reduction (%), Slack 30% | LPFO Power Reduction (%), Slack 40% |
|---------|----|-----|---|------|-------|-------|--------|--------|--------|-------|-------|-------|-------|
| FC1 | 1 | 64 | + | 20.9 | 140 | 5.94 | -31.51 | -55.9 | -55.9 | 10.3 | 10.17 | 7.10 | 7.10 |
| FC2 | 1 | 100 | + | 14.3 | 129.8 | -2.54 | -12.85 | -41.79 | -72.3 | 3.81 | 4.52 | 2.57 | 2.59 |
| FC3 | 20 | 100 | + | 23.9 | 61.2 | 13.13 | 16.44 | 16.68 | 15.95 | 13.25 | 17.2 | 18.04 | 18.62 |
| FC4 | 30 | 80 | + | 7.5 | 36.9 | 21.11 | 28.5 | 33.49 | 35.14 | 21.61 | 28.77 | 33.58 | 36.14 |
| FC5 | 50 | 200 | + | 7.6 | 52.3 | 17.16 | 24.2 | 28.37 | 29.84 | 18.52 | 25.65 | 29.75 | 31.33 |
| FC6 | 20 | 50 | - | 9.4 | 69 | 5.02 | 7.32 | 7.98 | 7.70 | 5.02 | 7.32 | 7.98 | 7.70 |
| FC7 | 15 | 200 | - | 22.5 | 65.2 | 15.04 | 14.72 | 12.69 | -27.62 | 15.92 | 17.84 | 18.32 | 18.05 |
| FC8 | 2 | 100 | - | 48.4 | 94.6 | -7.23 | -20.06 | -35.59 | -47.64 | 0 | 0 | 0 | 0 |
| FC9 | 8 | 50 | - | 7.5 | 115.2 | -7.06 | -17.61 | -33.83 | -33.83 | 0 | 0 | 0 | 0 |
| FC10 | 10 | 150 | - | 19.1 | 42.2 | 13.48 | 12.17 | 9.46 | 5.00 | 13.87 | 15.85 | 17.27 | 18.25 |
| Average | | | | | | 7.40 | 2.13 | -5.84 | -14.37 | 10.23 | 12.73 | 13.46 | 13.98 |

TABLE III COMPARISON OF SIS, LEOPARD, AND LPFO FANOUT OPTIMIZATION ALGORITHMS

| Circuit | Sinks | SIS Area | SIS Power (µW) | LEOPARD Area Reduction over SIS (%) | LEOPARD Power Reduction over SIS (%) | LPFO Area Reduction over SIS (%) | LPFO Power Reduction over SIS (%) |
|---------|----|------|-------|-------|--------|-------|-------|
| FT1 | 5 | 304 | 14.4 | 47.70 | 11.81 | 43.09 | 16.67 |
| FT2 | 7 | 1082 | 119.0 | 62.38 | -16.81 | 9.89 | 6.72 |
| FT3 | 8 | 1026 | 63.3 | 48.34 | -18.17 | 42.01 | 12.48 |
| FT4 | 10 | 1139 | 68.3 | 79.54 | -16.40 | 53.99 | 13.47 |
| FT5 | 20 | 1347 | 105.0 | 54.94 | -28.57 | 18.63 | 2.76 |
| FT6 | 12 | 928 | 64.4 | 45.37 | -8.07 | 26.51 | 12.73 |
| FT7 | 14 | 1490 | 109.1 | 67.92 | -22.82 | 45.97 | 17.60 |
| FT8 | 14 | 838 | 86.3 | 34.01 | -9.04 | -7.28 | 9.15 |
| FT9 | 25 | 2853 | 150.0 | 78.48 | -18.00 | 56.78 | 15.33 |
| FT10 | 30 | 2496 | 160.0 | 60.10 | -15.63 | 27.92 | 6.88 |
| FT11 | 10 | 715 | 46.7 | 52.73 | -0.86 | 30.91 | 13.49 |
| FT12 | 12 | 1465 | 73.4 | 59.73 | 3.00 | 50.17 | 13.62 |
| FT13 | 15 | 1218 | 92.8 | 38.83 | -11.31 | 16.67 | 13.15 |
| FT14 | 16 | 1099 | 94.1 | 38.31 | -7.76 | 8.64 | 8.29 |
| FT15 | 22 | 1334 | 115.0 | 48.20 | -18.26 | 20.69 | 5.22 |
| Average | | | | 54.44 | -11.79 | 29.64 | 11.17 |

TABLE IV MAPPING ERROR AS A FUNCTION OF INVERTER LIBRARY SIZE

| Inverter Library Size | Average Error (%) | Maximum Error (%) |
|----|------|------|
| 4 | 15.5 | 57.3 |
| 6 | 4.1 | 8.7 |
| 8 | 3.4 | 7.3 |
| 10 | 1.8 | 7.3 |
| 12 | 0.8 | 2.1 |
| 14 | 0.9 | 2.1 |

#### VII. CONCLUSION

In this paper we showed that fanout optimization for area and fanout optimization for power are not the same problem, and that a fanout tree optimized for area may dissipate excessive short-circuit power. By modeling all components of power dissipation, i.e., dynamic, short-circuit, sub-threshold leakage, and gate-tunneling leakage, we formulated the fanout optimization problem as a geometric program for a circuit with one sink. To reduce the leakage power consumption, we proposed using multi-$V_t$ and multi-$L_{\text{Gate}}$ inverters in the fanout trees. Experimental results show the proposed technique is effective in reducing the total power consumption of fanout trees.

#### REFERENCES

- [1] H. Touati, "Performance-oriented technology mapping," Ph.D. dissertation, University of California, Berkeley, 1990.
- [2] P. Rezvani and M.
Pedram, "A fanout optimization algorithm based on the effort delay model," *IEEE Trans. on Computer Aided Design*, vol. 22, no. 12, Dec. 2003, pp. 1671-1678.
- [3] B. Amelifard, F. Fallah, and M. Pedram, "Low-power fanout optimization using MTCMOS and multi-Vt techniques," in *Proc. of International Symposium on Low Power Electronics and Design*, 2006, pp. 334-337.
- [4] D. Zhou and X. Liu, "Minimization of chip size and power consumption of high-speed VLSI buffers," in *Proc. of International Symposium on Physical Design*, 1997, pp. 186-191.
- [5] K. J. Singh and A. Sangiovanni-Vincentelli, "A heuristic algorithm for the fanout problem," in *Proc. of Design Automation Conference*, 1990, pp. 357-360.
- [6] B. Amelifard, F. Fallah, and M. Pedram, "Low-power fanout optimization using multiple threshold voltage inverters," in *Proc. of International Symposium on Low Power Electronics and Design*, 2005, pp. 95-98.
- [7] D. S. Kung, "A fast fanout optimization algorithm for near-continuous buffer libraries," in *Proc. of Design Automation Conference*, 1998, pp. 352-355.
- [8] C. L. Berman, J. L. Carter, and K. F. Day, "The fanout problem: from theory to practice," in *Proc. of Decennial Caltech Conference Advanced Research in VLSI*, 1989, pp. 69-99.
- [9] K. Kodandapani, J. Grodstein, A. Domic, and H. Touati, "A simple algorithm for fanout optimization using high-performance buffer libraries," in *Proc. of International Conference on Computer-Aided Design*, 1993, pp. 466-471.
- [10] N. Sirisantana, L. Wei, and K. Roy, "High performance low power CMOS circuits using multiple channel length and multiple oxide thickness," in *Proc. of International Conference on Computer Design*, 2000, pp. 227-232.
- [11] P. Gupta, A. B. Kahng, P. Sharma, and D. Sylvester, "Selective gate-length biasing for cost-effective runtime leakage control," in *Proc. of Design Automation Conference*, 2004, pp. 327-330.
- [12] I. Sutherland, B. Sproull, and D. Harris, *Logical Effort: Designing Fast CMOS Circuits*. San Francisco, CA: Morgan Kaufmann, 1999.
- [13] B. Hu, Y. Watanabe, A. Kondratyev, and M. Marek-Sadowska, "Gain-based technology mapping for discrete-size cell libraries," in *Proc. of Design Automation Conference*, 2003, pp. 574-579.
- [14] S. Karandikar and S. Sapatnekar, "Logical effort based technology mapping," in *Proc. of International Conference on Computer-Aided Design*, 2004, pp. 419-422.
- [15] W. Chen, C. Hsieh, and M. Pedram, "Simultaneous gate sizing and fanout optimization," in *Proc. of International Conference on Computer-Aided Design*, 2000, pp. 374-378.
- [16] Magma Design Automation. *Gain Based Synthesis: Speeding RTL to Silicon*, 2002.
- [17] L. Stok, D. S. Kung, D. Brand, and A. D. Drumm, "BooleDozer: logic synthesis for ASICs," *IBM Journal of Research and Development*, vol. 40, no. 4, Jul. 1996, pp. 407-430.
- [18] T. Sakurai and A. R. Newton, "Alpha-power law MOSFET model and its applications to CMOS inverter delay and other formulas," *IEEE Journal of Solid-State Circuits*, vol. 25, no. 2, Apr. 1990, pp. 584-594.
- [19] B. Amelifard, "Power efficient design of SRAM arrays and optimal design of signal and power distribution networks in VLSI circuits," Ph.D. dissertation, University of Southern California, 2007.
- [20] M. Pedram, "Power minimization in IC design: principles and applications," *ACM Trans. on Design Automation of Electronic Systems*, vol. 1, no. 1, Jan. 1996, pp. 3-56.
- [21] V. De, A. Keshavarzi, S. Narendra, and J. Kao, "Techniques for leakage power reduction," in *Design of High-Performance Microprocessor Circuits*, A. Chandrakasan, W. J. Bowhill, and F. Fox, Eds. Piscataway, NJ: IEEE, 2001.
- [22] D. Lee, D. Blaauw, and D. Sylvester, "Gate oxide leakage current analysis and reduction for VLSI circuits," *IEEE Trans. on Very Large Scale Integration (VLSI) Systems*, vol. 12, no. 2, Feb. 2004, pp. 155-166.
- [23] S. Boyd and L. Vandenberghe, *Convex Optimization*. Cambridge, UK: Cambridge University Press, 2003.
- [24] A. Srivastava, "Simultaneous Vt selection and assignment for leakage optimization," in *Proc. of International Symposium on Low Power Electronics and Design*, 2003, pp. 146-151.
- [25] E. M. Sentovich, K. J. Singh, L. Lavagno, C. Moon, R. Murgai, A. Saldanha, and H. Savoj, "SIS: A System for Sequential Circuit Synthesis," University of California, Berkeley, Report M92/41, May 1992.
- [26] MOSEK Optimization Software, [online] http://www.mosek.com
- [27] HSPICE: The gold standard for accurate circuit simulation, [online] http://www.synopsys.com/products/mixedsignal/hspice/hspice.html
- [28] Predictive Technology Model, [online] http://www.eas.asu.edu/~ptm/

**Behnam Amelifard** received the B.Sc. degree from Sharif University of Technology, Tehran, Iran, in 2001, the M.Sc. degree from University of Tehran, Iran, in 2003, both in electrical and electronics engineering, and the Ph.D. degree in electrical engineering from the University of Southern California, Los Angeles, CA, in December 2007. During the summer of 2005, he was with Fujitsu Laboratories of America, where he worked on low-power SRAM design. He worked on runtime and litho-aware leakage power optimization techniques at Magma Design Automation during the summer of 2006. Since 2008 he has been with Qualcomm Inc. as a Senior Hardware Engineer. His primary research interests are power analysis and optimization, signal and power delivery integrity, and semiconductor memory design. Dr. Amelifard was the recipient of the Honorable Mention Award at the 2008 International Symposium on Low Power Electronics and Design.

Farzan Fallah received his Ph.D. in Electrical Engineering and Computer Science from MIT in 1999. From 1999 to 2008, he worked at Fujitsu Laboratories of America, where he led the low power design project. He is currently the engineering director at Envis Corporation in Santa Clara, California.
His primary research interests are low power design and verification. He has authored and co-authored over 60 papers on these topics and has received a number of awards including two best paper awards at the Design Automation Conference and the International Conference on VLSI Design. Dr. Fallah has served on the technical program committees of DATE, ICCD, ISLPED, HLDVT, ISQED, ICESS, ICSICT and ALPS as well as the organizing committees of ISLPED and ISQED. He is currently the cochair of the Low Power Technical Committee of ACM SIGDA and an associate editor of the ACM Transactions on Design Automation of Electronic Systems. Massoud Pedram, a professor of Electrical Engineering since 1991 and current Chair of the Computer Engineering at the University of Southern California, is an IEEE Fellow and a Board member of the ACM Special Interest Group on Design Automation. Dr. Pedram cofounded and served as the Technical Co-chair and General Co-chair of the International Symposium on Low Power Electronics and Design in 1996 and 1997, respectively. He was also the Technical Program Chair and the General Chair of the 2002 and 2003 International Symposium on Physical Design. Dr. Pedram has published four books, nine book chapters, and more than 300 journal and conference papers. His research has received a number of awards including two DAC Best Paper Awards, a Distinguished Paper Citation from ICCAD, two ICCD Best Paper Awards, and two IEEE Transactions Best Paper Awards. He is a recipient of the NSF's Young Investigator Award (1994) and the Presidential Faculty Fellows Award (a.k.a. PECASE Award) (1996). Dr. Pedram was a member of the Board of Governors of the IEEE Circuits and Systems Society from 2000 to 2002, Chair of the Distinguished Lecturer Program of the IEEE Circuits and Systems Society (CASS) for 2003 and 2004, and the CASS VP of Publications in 2005 and 2006. Dr. 
Pedram currently serves as the Editor-in-Chief of the ACM Transactions on Design Automation of Electronic Systems (TODAES). Dr. Pedram earned a Bachelor of Science degree in Electrical Engineering from the California Institute of Technology in 1986, and M.S. and Ph.D. degrees in Electrical Engineering and Computer Sciences from the University of California, Berkeley in 1989 and 1991, respectively. His current work focuses on developing design methodologies and techniques for low power CMOS VLSI circuits as well as dynamic power and thermal management in various computing platforms including chip multiprocessors and embedded systems.