Energy-aware resource allocation for multicores with per-core frequency scaling
- Xinghui Zhao^{1} and
- Nadeem Jamali^{2}
DOI: 10.1186/s13174-014-0009-x
© Zhao and Jamali; licensee Springer. 2014
Received: 16 April 2014
Accepted: 18 July 2014
Published: 28 September 2014
Abstract
With the growing ubiquity of computer systems, the energy consumption of these systems is of increasing concern. Multicore architectures offer a potential opportunity for energy conservation by allowing cores to operate at lower frequencies when processor demand is low. Until recently, this has meant operating all cores at the same frequency, and research on analyzing power consumption of multicores has assumed that all cores run at the same frequency. However, emerging technologies such as fast voltage scaling and Turbo Boost promise to allow cores on a chip to operate at different frequencies.
This paper presents an energy-aware resource management model, DREAM-MCP, which provides a flexible way to analyze energy consumption of multicores operating at non-uniform frequencies. This information can then be used to generate a fine-grained energy-efficient schedule for execution of the computations – as well as a schedule of frequency changes on a per-core basis – while satisfying performance requirements of computations. To evaluate our approach, we have carried out two case studies, one involving a problem with static workload (Gravitational N-Body Problem), and another involving a problem with dynamic workload (Adaptive Quadrature). Experimental results show that for both problems, the energy savings achieved using this approach far outweigh the energy consumed in the reasoning required for generating the schedules.
Keywords
Energy conservation; Resource management; Performance; Frequency scheduling
1 Introduction
With growing concerns about the carbon footprint of computers – computers currently produce 2–3% of greenhouse gas emissions related to human activities – there is ever greater interest in power conservation and efficient use of computational resources. The relationship between a processor’s speed and its power requirement emerged as a significant concern: the dynamic power required by a CMOS-based processor is proportional to the product of the square of its operating voltage and its clock frequency; and for these processors, the operating voltage is also proportional to the clock frequency. Consequently, the dynamic power consumed by a CMOS processor is (typically) proportional to the cube of its frequency [1]. This motivated the general shift away from faster processors to multicore processors for delivering more processor cycles to applications with ever increasing demands.
At the same time, another opportunity lay in the fact that not all computations always have to be carried out at the quickest possible speed. Dynamic voltage and frequency scaling (DVFS) can be used to deliver only the required amount of speed for such computations.
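To make the trade-off concrete, the following minimal sketch assumes the normalized power model used later in this paper (dynamic power of a core at frequency f equal to f^α, with α = 3) and a fixed amount of work measured in cycles. The function name and the normalized units are illustrative assumptions rather than part of the model.

```python
ALPHA = 3  # exponent of the power model; 3 is the value adopted in this paper


def dynamic_energy(cycles: float, freq: float, alpha: float = ALPHA) -> float:
    """Dynamic energy to execute `cycles` at constant frequency `freq`.

    Assumes normalized dynamic power freq**alpha, so that
    energy = power * time = freq**alpha * (cycles / freq).
    """
    if freq <= 0:
        raise ValueError("frequency must be positive")
    busy_time = cycles / freq
    return freq ** alpha * busy_time


full = dynamic_energy(cycles=1.0, freq=1.0)   # run at full speed
half = dynamic_energy(cycles=1.0, freq=0.5)   # run at half speed
```

Under this assumption, halving the frequency doubles the execution time but reduces the dynamic energy for the same work to one quarter, which is why slowing down computations that have slack can pay off.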
Existing analytical models for power consumption of multicores typically assume that all cores operate at the same frequency [2]-[4]. Although this is correct for current processors which use off-chip voltage regulators (i.e., a single regulator for all cores on the same chip), which set all sibling cores to the same voltage level [5], it does not fully capture the range of control opportunities available. For instance, in a multi-chip system, off-chip regulators can be used for per-chip frequency control [6] which enables a finer-grained control by allowing each chip’s cores to operate at a different frequency. Even in the absence of the ability to control chip frequencies at a fine-grain, there is often a way to temporarily boost the frequency of cores. For example, Turbo Boost [7] provides flexibility of frequency control by boosting all cores to a higher frequency to achieve better performance when necessary and possible. Note that the frequency can be increased only when the processor is otherwise operating below rated power, temperature, and current specification limits.
Beyond these opportunities, the most recent advances in on-chip switching regulators [8] will enable cores on the same chip to operate at different frequencies, promising far greater flexibility for frequency scaling. Studies have shown that per-core voltage control can provide significant energy-saving opportunities compared to traditional off-chip regulators [9]. Furthermore, it has been shown recently [10] that an on-chip multicore voltage regulator (MCVR) can be implemented in hardware. Essentially a DC-DC converter, the MCVR can take a 2.4 V input and scale it down to voltages ranging from 0.4 V to 1.4 V. To support efficient scaling, MCVR uses fast voltage scaling to rapidly cut power according to CPU demands. Specifically, it can increase or decrease the output by 1 V in under 20 nanoseconds.
To fully exploit the potential of these technologies, a finer-grained model for power consumption and management is required. Because the frequency of a core represents the available CPU resources in time (cycles/second), it can naturally be treated as a computational resource, which makes it possible to address the problem of power consumption from the perspective of resource management. In this paper, we present a model for reasoning about energy consumed by concurrent computations executing on multicore processors, and mechanisms involved in creating schedules – of resource usage as well as frequencies at which processor cores should execute – for completing computation in an energy-efficient manner.
The rest of the paper is organized as follows. We review related work in Section 2; to better motivate our work, in Section 3, we take two frequency scaling technologies as examples to illustrate the effect of these technologies on energy consumption; Section 4 presents our DREAM-MCP model for multicore resource management and energy analysis; results from our experimental evaluation involving two problems with different characteristics are presented in Section 5; Section 6 concludes the paper.
2 Related work
Although Moore’s Law has long predicted the advance in processing speeds, the exponential increase in corresponding power requirements (sometimes referred to as the power wall) presented significant challenges in delivering the processing power on a single processor. Multicore architectures emerged as a promising solution [11]. Since then, power management on multicore architectures has received increasing attention [12], and power consumption has become a major concern for both hardware and software design for multicore.
Li et al. were among the first to propose an analytical model [2] which brought together efficiency, granularity of parallelism, and voltage/frequency scaling, and to establish a formal relationship between the performance of parallel code running on multicore processors and the power they would consume. They established that by choosing granularity and voltage/frequency levels judiciously, parallel computing can bring significant power savings while meeting a given performance target.
Wang et al. have analyzed the performance-energy trade-off [3]. Specifically, they have proposed different ways to deploy the computations on the processors, in order to achieve various performance-energy objectives, such as energy or performance constraints. However, their analysis is based on a particular application (matrix multiplication) running on a specific hardware (FPGA based mixed-mode chip multiprocessors). A more general quantitative analysis has been proposed by Korthikanti et al. [4], which is not limited to any application or hardware. They propose a methodology for evaluating energy scalability of parallel algorithms while satisfying performance requirements. In particular, for a given problem instance and a fixed performance requirement, the optimal number of cores along with their frequencies can be calculated, which minimize energy consumption for the problem instance. This methodology has then been used to analyze the energy-performance trade-off [13] and reduce energy waste in executing applications [14].
These analytical studies make an assumption that all cores operate at the same frequency because of the hardware limitation of traditional off-chip regulators – a limitation that is about to be removed by recent advances.
There are a number of scenarios where finer grained control is possible. Even when off-chip regulators are used, if there are multiple chips, cores on different chips can be operating at different frequencies. For example, Zhang et al. have proposed a per-chip adaptive frequency scaling, which partitions applications among multiple multicore chips by grouping applications with similar frequency-to-performance effects, and sets a chip-wide desirable frequency level for each chip. It has been shown that for 12 SPECCPU2000 benchmarks and two server-style applications, per-chip frequency scaling can save approximately 20 watts of CPU power while maintaining performance within a specified bound of the original system.
However, two recent advances in hardware design promise even greater opportunities. The first of these is Turbo Boost [7], which can dynamically and quickly change the frequency at which the cores on a chip are operating during execution. Specifically, depending on the performance requirements of the applications, Turbo Boost automatically allows processor cores to run faster than the base operating frequency if they are operating below power, current, and temperature specification limits. Turbo Boost is already available on Intel’s new processors (codename Nehalem). The second, and perhaps more important, is the emergence of on-chip switching regulators [8]. Using these regulators, the different cores on the same chip can operate at different frequencies. Studies [9] have shown that the energy savings made possible by using on-chip regulators far outweigh the overhead of having these regulators on the chip.
As for commercial hardware, the first generation of multicore processors which support per-core frequency selection are the AMD family 10h processors [15], but the energy savings on these processors are limited, because they still maintain the highest voltage level required for all cores. Most recently, it has been shown that the on-chip multicore voltage regulator together with the fast voltage scaling can be efficiently implemented in hardware [10], which can rapidly cut power supply according to CPU demand, and perform voltage transition within tens of nanoseconds.
These new technologies provide opportunities for energy savings on multicore architectures. However, a flexible analytical model is required to analyze power consumption on multicores with non-uniform frequency settings. Cho et al. addressed part of the problem in [16] by proposing an analysis which can be used to derive optimal frequencies allocated to the serial and parallel regions in an application, i.e., non-uniform frequency over time. Specifically, for a given computation which involves a sequential portion and a parallel portion, the optimal frequencies for the two portions can be derived, which can achieve minimum power consumption while maintaining the same performance as running the computation sequentially on a single core. However, this work is a coarse-grained analysis, and it does not consider non-uniform frequencies for different cores.
Besides theoretical models and analysis, significant work has been done to optimize power consumption at run-time through software-controlled mechanisms, or knobs. Approaches include dynamic concurrency throttling (DCT) [17], which adapts the level of concurrency at runtime based on execution properties, dynamic voltage and frequency scaling (DVFS) [18], or a combination of the two [19]. Among these, [18] is particularly interesting, because it considers per-core frequency. Specifically, a global multicore power manager is employed which incorporates per-core frequency scaling. Several power management policies are proposed to monitor and control the per-core power and performance state of the chip at periodic intervals, and set the operating power level of each core to enforce adherence to known chip-level power budgets. However, the focus of this work is on passively monitoring power consumption, rather than modelling power and resource consumption at a fine grain and actively deploying computations power-efficiently.
In this paper, we address the problem from a different perspective, that of resource management. First, we model resources and computations at a fine grain, and the evolution of the system as the process of resource consumption; second, we model energy consumption as the cost/consequence of a specific CPU resource allocation; third, the model is energy-aware, and can be used to generate an energy-efficient resource allocation plan for any given computations.
3 Effect of frequency scaling on energy consumption
where N is the number of cores, and α is the exponential factor of power consumption (we use the value of 3 for α, as is typical in the literature). In other words, the power consumption of a core running at frequency f is proportional to f^{ α }.
In this section, we illustrate the effects of non-uniform frequency scaling on multicore energy consumption. Particularly, we extend the analysis in [16] to consider two specific technologies: per-core frequency, and Turbo Boost.
3.1 Per-core frequency
where T_{ busy } is the time during which the computation is carried out, λ is a hardware constant which represents the ratio of the static power consumption to the dynamic power consumption at the maximum processor speed. The first term in the formula corresponds to energy consumed for carrying out the computation (dynamic power), and the second term represents energy for the static power consumption during the entire period of execution. Processor temperature is not considered; therefore, energy for static power consumption is only related to λ and T.
Obviously, the frequency of the core executing the sequential part of the computation remains unchanged regardless of whether uniform or non-uniform frequencies are employed. We assume that the same core carries out the heavier of the two uneven workloads to be carried out in parallel. Any energy savings to be achieved from non-uniform frequency scaling are therefore on the other core operating at a lower frequency.
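The comparison described above can be sketched numerically. The sketch assumes the normalized energy form that the surrounding text describes (dynamic energy T_{busy} × f^α per core plus static energy λ × T per core); the workload values, the value of λ, and all function names are hypothetical.

```python
ALPHA = 3      # exponent of the power model, as used in this section
LAMBDA = 0.1   # hypothetical ratio of static to maximum dynamic power


def core_energy(work, freq, period, alpha=ALPHA, lam=LAMBDA):
    """Energy of one core: dynamic part while busy plus static part over `period`."""
    busy = work / freq if work > 0 else 0.0
    assert busy <= period + 1e-12, "core cannot finish its work in time"
    return busy * freq ** alpha + lam * period


def parallel_energy(works, freqs, period):
    """Total energy of the parallel portion over all cores."""
    return sum(core_energy(w, f, period) for w, f in zip(works, freqs))


works = [1.0, 0.5]   # uneven parallel workloads (normalized cycles)
period = 1.0         # time allowed for the parallel portion

# Uniform: both cores run at the frequency the heavier workload needs.
f_uniform = max(works) / period
uniform = parallel_energy(works, [f_uniform, f_uniform], period)

# Non-uniform: each core runs just fast enough to finish at the deadline.
non_uniform = parallel_energy(works, [w / period for w in works], period)
saving = 1 - non_uniform / uniform
```

With these made-up numbers the lightly loaded core runs at half speed for the whole period instead of at full speed for half of it, and the total energy of the parallel portion drops from 1.7 to 1.325 normalized units.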
3.2 Turbo boost
Our analysis thus far has shown that energy savings can be achieved by using non-uniform frequency technologies. However, the scenario in the analysis is simple: only one computation is considered, and the workload and structure of the computation are well known. Next we address the problem of finding the optimal frequency schedule for a complex computation, with frequencies varying multiple times over the course of the computation’s execution.
4 Reasoning about multicore energy consumption
In our previous work, we have constructed DREAM^{a} (Distributed Resource Estimation and Allocation Model) [20] and related mechanisms [21] for reasoning about scheduling of deadline-constrained concurrent computations over parallel and distributed execution environments. In the most recent work [22], this approach has been repurposed to achieve dynamic load balancing for computations which are not constrained by deadlines. Fundamental to this work is a fine-grained accounting of available resources, as well as the resources required by computations. Here, we connect the use of resources by computations to the energy consumed in their use, leading to a specialized model, called DREAM-MCP (DREAM for Multicore Power). DREAM-MCP defines resources over time and space, and represents them using resource terms. A resource term specifies values for attributes defining a resource: specifically, the maximum available frequency, the time interval during which the resource is available, and the location of existence for the resource, i.e., the core id. Computations are represented in terms of the resources they require. System state at a specific instant of time is captured by the resources available at that instant and the computations which are being accommodated. We use labeled transition rules to represent progress in the system, and an energy cost function is associated with each transition rule to indicate the energy required for carrying out the transition.
4.1 Resource representation
Multicore processor resources are represented using resource terms, each defined by three attributes: $\mathfrak{f}$, the maximum available frequency of the specific core (in cycles/time); τ, the time interval during which the resource is available ($\mathfrak{f}\times \tau $ is the number of CPU cycles over interval τ); and ξ, the location of the available resource, which is the id of the specific core.
Possible relations between time intervals τ _{ 1 } and τ _{ 2 }
Relation | Inverse relation | Interpretation |
---|---|---|
τ_{1} < τ_{2} | τ_{2} > τ_{1} | τ_{1} before τ_{2} |
τ_{1} m τ_{2} | τ_{2} mi τ_{1} | τ_{1} meets τ_{2} |
τ_{1} = τ_{2} | τ_{2} = τ_{1} | τ_{1} equal τ_{2} |
τ_{1} d τ_{2} | τ_{2} di τ_{1} | τ_{1} during τ_{2} |
τ_{1} o τ_{2} | τ_{2} oi τ_{1} | τ_{1} overlaps τ_{2} |
τ_{1} s τ_{2} | τ_{2} si τ_{1} | τ_{1} starts τ_{2} |
τ_{1} f τ_{2} | τ_{2} fi τ_{1} | τ_{1} finishes τ_{2} |
Each time interval τ has a start time t_{ start }, and an end time t_{ end }. In this paper, we also use (t_{ start },t_{ end }) as an alternative notation for time interval τ. Furthermore, binary operations on sets, such as union (∪), intersection (∩), relative complement (∖) are also available for time intervals.
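As an illustration of the interval relations listed in the table, the following sketch classifies a pair of intervals given in the (t_{start}, t_{end}) notation. The function name and the use of plain tuples are assumptions made for the example; the returned strings are the relation symbols from the table.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (t_start, t_end) with t_start < t_end


def relation(a: Interval, b: Interval) -> str:
    """Return the relation of interval `a` to interval `b`, using the table's symbols."""
    (s1, e1), (s2, e2) = a, b
    if e1 < s2:
        return "<"            # a before b
    if e1 == s2:
        return "m"            # a meets b
    if s1 == s2 and e1 == e2:
        return "="            # a equal b
    if s1 > s2 and e1 < e2:
        return "d"            # a during b
    if s1 == s2 and e1 < e2:
        return "s"            # a starts b
    if e1 == e2 and s1 > s2:
        return "f"            # a finishes b
    if s1 < s2 < e1 < e2:
        return "o"            # a overlaps b
    # Otherwise `a` relates to `b` by the inverse of one of the cases above.
    inverse = {"<": ">", "m": "mi", "d": "di", "s": "si", "f": "fi", "o": "oi"}
    return inverse[relation(b, a)]
```

For example, `relation((1, 2), (0, 3))` yields `"d"` because the first interval lies strictly inside the second, and swapping the arguments yields the inverse relation `"di"`.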
Resources in a multicore system can be represented by a set of resource terms. If two resource terms in a resource set have the same location and overlapping time intervals, they can be combined by a process of simplification, where for any interval for which they overlap, their frequencies are added, and for remaining intervals, they are represented separately in the set.
The simplification essentially aggregates resources available simultaneously at the same core, which can lead to a larger number of terms. Resource terms can reduce in number if two collocated resources with identical rates have time intervals that meet.
Note that if the time interval of a resource term is empty, the value of the resource term is 0, or null. In other words, resources are only defined during non-empty time intervals.
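The simplification process can be sketched as follows. This is a minimal illustration under our own assumptions about data layout (a `ResourceTerm` tuple of frequency, start time, end time, and core id), not the prototype's implementation: it splits each core's timeline at every interval boundary, sums the frequencies of the terms covering each segment, and then merges neighbouring segments that meet and have identical rates.

```python
from collections import defaultdict
from typing import List, NamedTuple


class ResourceTerm(NamedTuple):
    freq: float   # available frequency (cycles per unit time)
    start: float  # start of the availability interval
    end: float    # end of the availability interval
    core: int     # id of the core where the resource exists


def simplify(terms: List[ResourceTerm]) -> List[ResourceTerm]:
    """Aggregate co-located terms: add frequencies where intervals overlap,
    then merge neighbours that meet and have identical rates."""
    by_core = defaultdict(list)
    for t in terms:
        if t.end > t.start:          # terms over empty intervals are null
            by_core[t.core].append(t)

    result = []
    for core in sorted(by_core):
        group = by_core[core]
        points = sorted({p for t in group for p in (t.start, t.end)})
        merged = []
        for lo, hi in zip(points, points[1:]):
            rate = sum(t.freq for t in group if t.start <= lo and hi <= t.end)
            if rate == 0:
                continue             # gap between terms: nothing available
            if merged and merged[-1].end == lo and merged[-1].freq == rate:
                merged[-1] = merged[-1]._replace(end=hi)
            else:
                merged.append(ResourceTerm(rate, lo, hi, core))
        result.extend(merged)
    return result
```

Two terms on the same core with frequencies 2 over (0, 10) and 1 over (5, 15) simplify to three terms with frequencies 2, 3 and 1 over (0, 5), (5, 10) and (10, 15), which shows how simplification can increase the number of terms; two terms with the same frequency over (0, 5) and (5, 8) collapse into one term over (0, 8).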
The notion of negative resource terms is not meaningful in this context; so, resource terms cannot be negative. We define an inequality operator to compare two resource terms, from the perspective of a computation’s potential use of them. We say that a resource term is greater than another if a computation that requires the latter, can instead use the former, with some to spare. We specifically state it as follows:
A resource term with frequency ${\mathfrak{f}}_{1}$, time interval τ_{1} and location ξ_{1} is greater than a resource term with frequency ${\mathfrak{f}}_{2}$, time interval τ_{2} and location ξ_{2} if and only if ξ_{1}=ξ_{2}, ${\mathfrak{f}}_{1}>{\mathfrak{f}}_{2}$, and τ_{2} d τ_{1}. Note that it is not necessarily enough for the total amount of resource available over the course of an interval to be greater. Consider a computation that is able to utilize needed resources only during interval τ_{2}; if additional resources are available outside of τ_{2}, but not enough during τ_{2}, it does not help satisfy the computation.
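The three conditions of the inequality can be checked directly. In the sketch below, a term is written as a hypothetical tuple (frequency, (start, end), core id), and "during" is taken in the strict sense of the interval table, i.e., τ_{2} lies strictly inside τ_{1}; both choices are assumptions of the example.

```python
def is_greater(term1, term2) -> bool:
    """Return True if term1 > term2; each term is (freq, (start, end), core)."""
    f1, (s1, e1), core1 = term1
    f2, (s2, e2), core2 = term2
    same_core = core1 == core2
    higher_rate = f1 > f2
    during = s1 < s2 and e2 < e1   # tau_2 d tau_1 (strictly inside)
    return same_core and higher_rate and during
```

A term offering frequency 3 over (0, 10) is greater than one requiring frequency 2 over (2, 8) on the same core, whereas a term offering frequency 3 over (0, 5) is not greater than one requiring frequency 2 over (4, 8), even though it supplies more cycles in total.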
The relative complement of two resource sets Θ_{1}∖Θ_{2} is defined only when, for each resource term in Θ_{2}, there exists a resource term in Θ_{1} that dominates it under the ordering defined above. The relative complement of two resource sets is defined as follows:
Union and relative complement operations on resource sets allow modeling of resources that join or leave the system dynamically, as typically happens in open distributed systems such as the Internet.
4.2 Computation representation
A computation consumes resources at every step of its execution. We abstract away what a distributed computation does and represent it by the sequence of its resource requirements for each step of execution. The idea is inspired by CyberOrgs [24],[25], which is a model for resource acquisition and control in resource-bounded multi-agent systems.
Note that for a computation which is composed of sequential and parallel portions, its resource requirement can be represented by several simple resource requirements which would need to be simultaneously satisfied.
4.3 DREAM-MCP
For a computation that can be accommodated, different scheduling schemes result in different levels of energy consumption. To model all possible system evolution paths and the effects they have on overall energy consumption, we developed the DREAM-MCP model. DREAM-MCP models system evolution as a sequence of states connected by labeled transition rules specifying multicore resource allocation, and represents energy consumption as a cost function associated with each transition rule.
where ξ is a core, f is the utilized frequency for core ξ, and Γ is a computation. The transition rule specifies that the utilization of CPU resource on core ξ – which is operating at frequency f – for computation Γ makes the system progress from its current state to the next state. Here u(ξ, f)_{ Γ } denotes the resource utilization. If we replace the states in the above transition rule with the detailed (Θ,ρ,t) format, the transition rule would alternatively be written as:
Note that f, the frequency at which core ξ is operating, may be different from the maximum available frequency $\mathfrak{f}$ ($f \le \mathfrak{f}$). This enables cores to operate at lower frequencies for saving power.
where the first term on the right-hand side represents energy for dynamic power consumption and the second represents energy for static power consumption, where λ is a hardware constant.
Note that if a certain resource becomes available, yet no computations require that type of resource, the resource expires. The resource expiration rule is defined as follows:
where u(ξ)_{ ϕ } represents that core ξ is idle, i.e., it is not utilized by any computation.
The energy consumption for an expired resource only includes static power: e=λ×Δ t.
If there are multiple cores in the system, and during a time interval (t,t+Δ t), some resources are consumed, while others expire, we use a more general concurrent transition rule to represent this scenario:
Note that in this scenario, there are m cores and n computations. To simplify the notation, we number the cores and corresponding resources by the numbers of the computations that are utilizing them. As a result, when there are n computations, the n cores serving them are named ξ_{1} through ξ_{ n } respectively, and the rest are named ξ_{n+1} and beyond.
where the first term on the right-hand side represents energy for dynamic power consumption, and the second represents energy for static power consumption. Note that non-uniform frequency scaling allows f_{ i } to have different values for different cores, where uniform frequency requires them to be the same.
DREAM-MCP represents all possible evolutions of the system as sequences of system states connected by transition rules. Energy consumption of an evolution path can be calculated using the energy cost functions associated with the transition rules on that path; consumptions of these paths can then be compared to find the optimal schedule. In addition to exploring heuristic options, our ongoing work is also aimed at explicitly balancing the cost of reasoning against the quality of solution (See Section 6).
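The comparison of evolution paths can be illustrated with a small sketch. It assumes, consistently with the descriptions above, that a transition over an interval Δt costs f_{i}^α × Δt of dynamic energy for each utilized core plus λ × Δt of static energy for every core, idle or not; the exact cost functions, the value of λ, the example schedules, and all names are assumptions of this sketch.

```python
ALPHA = 3     # exponent of the power model (value used in this paper)
LAMBDA = 0.1  # hypothetical static power constant per core


def transition_energy(freqs, dt, alpha=ALPHA, lam=LAMBDA):
    """Energy of one concurrent transition over an interval of length dt.

    `freqs` holds one entry per core: the utilized frequency, or 0 for a
    core whose resource expires (idle). Static energy is charged per core.
    """
    dynamic = sum(f ** alpha for f in freqs) * dt
    static = lam * dt * len(freqs)
    return dynamic + static


def path_energy(path):
    """Total energy of an evolution path given as [(freqs, dt), ...]."""
    return sum(transition_energy(freqs, dt) for freqs, dt in path)


def cheapest(paths):
    """Pick the label of the path with the lowest total energy."""
    return min(paths, key=lambda name: path_energy(paths[name]))


# Two hypothetical schedules for the same work on two cores:
# core 0 needs 2 cycles and core 1 needs 1 cycle, within 2 time units.
paths = {
    "uniform": [([1.0, 1.0], 1.0), ([1.0, 0.0], 1.0)],
    "per_core": [([1.0, 0.5], 2.0)],
}
best = cheapest(paths)
```

Here the uniform schedule runs both cores at full frequency and lets the second core idle once it finishes (3.4 normalized energy units), while the per-core schedule runs the second core at half frequency throughout (2.65 units), so the latter is selected. Enumerating all paths this way grows quickly with the number of cores and steps, which is why the cost of reasoning matters.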
5 Experimental results
A prototype of DREAM-MCP has been implemented for multicore processor resource management and energy consumption analysis. The prototype is implemented by extending ActorFoundry [26], which is an efficient JVM-based framework for Actors [27], a model for concurrency. A key component of DREAM-MCP is the Reasoner, which takes as parameters the resource requirements of a computation and its deadline, and decides whether the computation can be accommodated using resources available in the system. For computations which can be accommodated, the Reasoner generates a fine-grained schedule, as well as a frequency schedule which instructs the system to perform corresponding frequency scaling.
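The Reasoner's two decisions (whether a computation can be accommodated, and at which frequencies the cores should run) can be illustrated with a deliberately simplified stand-in. The sketch below is not the prototype's algorithm: it assumes one block of work per core, a single shared deadline, and hypothetical names throughout.

```python
from typing import Dict, Optional


def plan_frequencies(
    required_cycles: Dict[int, float],
    max_freq: Dict[int, float],
    deadline: float,
) -> Optional[Dict[int, float]]:
    """Return a per-core frequency schedule, or None if the deadline cannot be met.

    Each core runs at the lowest constant frequency that finishes its
    share of the work exactly at the deadline.
    """
    if deadline <= 0:
        return None
    schedule = {}
    for core, cycles in required_cycles.items():
        needed = cycles / deadline
        if needed > max_freq.get(core, 0.0):
            return None          # computation cannot be accommodated
        schedule[core] = needed
    return schedule
```

For instance, with two cores rated at 2.8 GHz and a one-second deadline, workloads of 2.8 × 10^9 and 1.4 × 10^9 cycles yield frequencies of 2.8 GHz and 1.4 GHz, whereas a workload of 5.6 × 10^9 cycles on one core is rejected. Because f^α is convex for α > 1, a constant frequency that just meets the deadline uses no more dynamic energy than a varying frequency doing the same work in the same time.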
To evaluate our prototype, we have implemented two applications, the Gravitational N-Body Problem (GNBP) and Adaptive Quadrature, as two case studies. We evaluated our approach as follows. We first carried out the computations on two systems, DREAM-MCP and an unextended version of ActorFoundry (AF). Note that in these experiments, we ran the processors at the maximum frequency, because processors with per-core frequency scaling are not yet available. Specifically, we measured the execution time of a computation on DREAM-MCP, and the time taken for carrying out the same computation on AF. We treat the difference as the overhead of using DREAM-MCP mechanisms.
Although DREAM-MCP introduces overhead, it helps conserve energy by generating a per-core frequency schedule for the computation. We then calculated the energy consumption for the two systems, with the assumption that in DREAM-MCP the cores can be operated at non-uniform frequency as our frequency schedule specifies. We then compared the energy consumption of the two systems, and also calculated the portion of the energy cost due to the overhead introduced by DREAM-MCP.
For both case studies, the hardware we used to carry out the experiments is an Xserve with 2×Quad-Core Intel Xeon processors (8 cores) @ 2.8 GHz, 8 GB memory and 12 MB L2 cache. The experimental results are presented in the following sections.
5.1 Case study I: gravitational N-body problem
GNBP is a simulation problem which aims to predict the motion of a group of celestial objects which exert a gravitational pull on each other. We implement GNBP as follows. A manager actor sends the information about all bodies to the worker actors (one for each body), which use the information to calculate the forces, velocities, and new positions for their bodies, and then send their updated information to the manager. This computation has a sequential portion, in which the manager gathers all information about the bodies and sends it to all worker actors, and a parallel portion, in which each individual body calculates its new position and sends a reply message to the manager.
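One simulation step of this manager/worker structure can be sketched as below. The sketch replaces actors and messages with plain function calls, and the two-dimensional setting, the normalized gravitational constant, the dictionary layout, and the simple Euler-style update are all assumptions of the example rather than details of our implementation.

```python
import math

G = 1.0  # gravitational constant in normalized units (assumption)


def worker_update(i, bodies, dt):
    """Compute the new state of body i from the states of all bodies.

    Each body is a dict with mass `m`, position `(x, y)` and velocity `(vx, vy)`.
    """
    me = bodies[i]
    ax = ay = 0.0
    for j, other in enumerate(bodies):
        if j == i:
            continue
        dx = other["x"] - me["x"]
        dy = other["y"] - me["y"]
        dist = math.hypot(dx, dy)            # bodies must not coincide
        acc = G * other["m"] / dist ** 2     # acceleration magnitude on body i
        ax += acc * dx / dist
        ay += acc * dy / dist
    vx, vy = me["vx"] + ax * dt, me["vy"] + ay * dt
    return {"m": me["m"], "x": me["x"] + vx * dt, "y": me["y"] + vy * dt,
            "vx": vx, "vy": vy}


def manager_step(bodies, dt):
    """Sequential portion: hand the full state to every worker and gather replies.
    The per-body updates are independent and could run in parallel."""
    return [worker_update(i, bodies, dt) for i in range(len(bodies))]
```

Since every call to `worker_update` reads only the previous state, the per-body work is independent, which is what allows one worker per body to run on a separate core.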
Execution time at maximum frequency (8-Body)
System | Sequential portion (ms) | Parallel portion (ms) | Overhead (%) |
---|---|---|---|
DREAM-MCP | 70 | 85 | 11.5 |
AF | 54 | 85 | 0 |
Execution time at maximum frequency (12-Body)
System | Sequential portion (ms) | Parallel portion (ms) | Overhead (%) |
---|---|---|---|
DREAM-MCP | 79 | 168 | 9.3 |
AF | 58 | 169 | 0 |
Note that the experimental results on energy savings only indicate dynamic power consumption. Since the reasoning increases the total execution time of the computation, energy for static power consumption also increases. From Equation 3 in Section 3 (assuming we ignore processor temperature), it is only related to λ (hardware constant) and T (execution time), i.e. E_{ static }=λ×T. Because the computational overhead of using DREAM-MCP is 11.5% for the case when the computation can be evenly distributed, and 9.3% for the case when it cannot be evenly distributed, the extra energy for static power consumption is also 11.5% and 9.3% of the total static energy required by the computation, respectively. Because different hardware chips have different λ values, given a λ, the total energy saving by using DREAM-MCP for a specific hardware chip, including both dynamic and static power consumption, can be calculated. Previous studies show that the static power for the current generation of CMOS technologies is on the order of 10% of the total chip power [28]. Therefore, the extra static power of our approach is approximately 1% of the total power, which is negligible.
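The closing estimate can be checked with a short calculation, assuming (as stated above) that static power accounts for roughly 10% of total power and that the extra static energy scales with the measured time overhead:

```latex
\begin{align*}
\text{even distribution (11.5\% overhead):}\quad & 0.10 \times 0.115 \approx 0.012 \\
\text{uneven distribution (9.3\% overhead):}\quad & 0.10 \times 0.093 \approx 0.009
\end{align*}
```

Both values are close to 1% of the total power, in line with the estimate given above.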
5.2 Case study II: adaptive quadrature
where c is any point between a and b. To calculate the integral value, we assume that within a predefined fault tolerance, ε, the area of the trapezoid (a,b,f(b),f(a)) can be used as an estimation of the integral.
As should be obvious, the recursive nature of adaptive quadrature makes it an inherently different type of problem than GNBP. Particularly, the number of subproblems is not known in advance, making the workload dynamic.
We implement a concurrent version of adaptive quadrature as an actor system. Initially we create an actor to calculate the value of adaptive quadrature of f(x) in the interval [a,b]. We then divide the interval [a,b] into two subintervals: [a,m] and [m,b], where m is the midpoint of [a,b], and calculate the difference between the area of the trapezoid (a,b,f(b),f(a)) and the sum of the areas of the two trapezoids in the two subintervals. If the difference is less than ε, the area of the trapezoid will be reported as the estimation of the integral for the interval. On the other hand, if the difference is greater than the predefined fault tolerance ε, the actor then creates two child actors, each of which is responsible for calculating the integral value on a subinterval. The original actor waits for the results from its child actors, and once they arrive, adds them.
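The splitting logic can be sketched sequentially, with recursive calls standing in for child actors. The function names are ours, and treating a difference exactly equal to ε as acceptable is an assumption of the sketch, since the description above leaves that case open.

```python
def trapezoid(f, a, b):
    """Area of the trapezoid (a, b, f(b), f(a))."""
    return (f(a) + f(b)) * (b - a) / 2.0


def adaptive_quadrature(f, a, b, eps):
    """Estimate the integral of f over [a, b] by recursive interval splitting.

    Recursive calls stand in for the child actors of the concurrent version.
    """
    m = (a + b) / 2.0
    whole = trapezoid(f, a, b)
    halves = trapezoid(f, a, m) + trapezoid(f, m, b)
    if abs(whole - halves) <= eps:
        return whole                  # close enough: report the trapezoid area
    # Otherwise delegate each subinterval (a child actor in the actor system)
    left = adaptive_quadrature(f, a, m, eps)
    right = adaptive_quadrature(f, m, b, eps)
    return left + right               # parent adds the two partial results


estimate = adaptive_quadrature(lambda x: x * x, 0.0, 1.0, 1e-6)
```

How deep the recursion goes depends on the shape of f in each subinterval, so the number of subproblems (and hence actors) is only discovered while the computation runs.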
Adaptive quadrature: execution time at maximum frequency
System | Sequential portion (ms) | Parallel portion (ms) | Overhead (%) |
---|---|---|---|
DREAM-MCP | 416 | 1404 | 27 |
AF | 20 | 1404 | 0 |
5.2.1 Discussion
The Gravitational N-Body Problem and Adaptive Quadrature represent two different types of computations. The workload of the N-Body Problem is static, whereas that of Adaptive Quadrature is dynamically generated at run-time. As a result, more reasoning is required in Adaptive Quadrature in order to calculate the frequency schedules for the cores. In the N-Body Problem, for both the cases where the workload is evenly and unevenly distributed among the cores, our approach can save a significant amount of energy. In Adaptive Quadrature, although the overhead caused by the reasoning is relatively high, at an extra 3.5% of the energy required by the actual computation, the savings achieved by DREAM-MCP are higher at 13.6%.
Note that our approach presented here is based on the assumption that per-core frequency scaling on a single chip is available. This is a finer-grained frequency scaling than the ones that are generally available, e.g., per-chip frequency scaling. Our approach can be generalized to support per-chip frequency scaling in a multi-chip context, by restricting the frequencies for the cores on the same chip to be uniform. However, this analysis is beyond the scope of this paper.
6 Conclusion
Power consumption of multicore architectures is becoming important in both hardware and software design. Existing power analysis approaches have assumed that all cores on a chip must execute at the same frequency. However, emerging hardware technologies, such as fast voltage scaling and Turbo Boost, offer finer-grained opportunities for control and consequently energy conservation by allowing selection of different frequencies for individual cores on a chip. Deciding what these frequencies should be – the next challenge – is non-trivial.
Here, we first analyze the energy conservation opportunities presented by these two important hardware advances, and then build on our previous work on fine-grained resource scheduling in order to support reasoning about energy consumption. This reasoning enables creation of fine-grained schedules for the frequencies at which the cores should operate for energy-efficient execution of concurrent computations, without compromising on performance requirements. Our experimental evaluation shows that the cost of the reasoning is well worth it: it requires only a fraction of the energy it helps save.
Work is ongoing in a number of directions. First, instead of first building a processor schedule based on computations’ processor requirements and then translating it into a frequency schedule, we are working on an approach to build the schedules directly aiming for energy conservation; this would essentially pick the schedule with the best energy consumption profile from a number of schedules that are equally good from the processor scheduling perspective. Second, we hope to generalize our approach to make it applicable to distributed systems, mobile devices and systems involving them, each of which presents different challenges. For instance, although our approach would apply to multicore mobile devices in principle, mobile applications can have very different characteristics from the types of problems on which we have evaluated our approach in this paper. In that direction, the first author’s group has made efforts toward profiling the power consumption of different types of functionalities, and developing power-aware scheduling for mobile applications [29]. Finally, although the computational overhead of reasoning in the system is far below the benefit of doing it, we want to explore opportunities for explicitly balancing the overhead involved in reasoning against the quality of the schedule required. We hope to build on our previous work implementing a tuner facility for balancing the computational cost of creating fine-grained processor schedules against the cost of carrying out the actual computations [21]. The tuner carries out meta-level resource balancing between the reasoning and the computations being reasoned about; its parameters can be set manually or be set to self-tune at run-time in response to observations about the ongoing computation. We plan to adapt the approach to DREAM-MCP to enable a similar facility in terms of energy consumption.
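As an illustration of the first direction, the following minimal sketch picks the candidate with the lowest energy from a set of schedules that complete the same work by the same deadline. It is a sketch under stated assumptions rather than the planned implementation: the schedule representation (per-core lists of frequency and duration segments), the function names and the example values are hypothetical, and energy is estimated from the cubic dynamic power relationship alone, ignoring static power and the cost of changing frequencies.

```python
def relative_energy(schedule):
    """Estimate dynamic energy, up to a constant factor, as the sum of
    f**3 * duration over every (frequency, duration) segment of every core."""
    return sum(f ** 3 * dt for core in schedule for f, dt in core)


def total_cycles(schedule):
    """Work delivered by a schedule: sum of frequency * duration."""
    return sum(f * dt for core in schedule for f, dt in core)


def pick_most_efficient(candidates):
    """Among schedules assumed to be equally good for processor
    scheduling, return the one with the lowest estimated energy."""
    return min(candidates, key=relative_energy)


# Two hypothetical two-core schedules, each delivering 4 Gcycles in 2 s.
balanced = [[(1.0, 2.0)], [(1.0, 2.0)]]   # both cores at 1.0 GHz for 2 s
skewed = [[(2.0, 1.5)], [(0.5, 2.0)]]     # 2.0 GHz for 1.5 s, 0.5 GHz for 2 s

best = pick_most_efficient([skewed, balanced])
```

Both candidates deliver the same number of cycles, but the balanced schedule has the lower estimated energy because the cubic term penalizes the high-frequency segment of the skewed schedule.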
7 Endnote
^{a} Previously called ROTA (Resource Oriented Temporal logic for Agents) model [30].
References
- Burd TD, Brodersen RW: Energy efficient CMOS microprocessor design. In Proceedings of the 28th Hawaii international conference on system sciences, vol. 1. IEEE Computer Society, Washington DC; 1995:288–297.
- Li J, Martínez JF: Power-performance considerations of parallel computing on chip multiprocessors. ACM Trans Archit Code Optim 2005, 2: 397–422. 10.1145/1113841.1113844
- Wang X, Ziavras SG: Performance-energy tradeoffs for matrix multiplication on FPGA-based mixed-mode chip multiprocessors. In Proceedings of the 8th international symposium on quality electronic design. IEEE Computer Society, Washington, DC; 2007:386–391.
- Korthikanti VA, Agha G: Analysis of parallel algorithms for energy conservation in scalable multicore architectures. In Proceedings of the 38th international conference on parallel processing. IEEE Computer Society, Washington, DC; 2009:212–219.
- Naveh A, Rotem E, Mendelson A, Gochman S, Chabukswar R, Krishnan K, Kumar A: Power and thermal management in the Intel Core Duo processor. Intel Technol J 2006, 10(2):109–122.
- Zhang X, Shen K, Dwarkadas S, Zhong R: An evaluation of per-chip nonuniform frequency scaling on multicores. In Proceedings of the 2010 USENIX conference on USENIX annual technical conference. USENIX Association, Berkeley; 2010.
- (2008) Intel Turbo Boost Technology in Intel Core Microarchitecture (Nehalem) Based Processors. White paper, Intel. http://www.intel.com/technology/turboboost/. Accessed 16 Apr 2014.
- Kim W, Gupta MS, Wei G-Y, Brooks DM: Enabling OnChip switching regulators for multi-core processors using current staggering. In Proceedings of the workshop on architectural support for Gigascale integration. IEEE Computer Society, San Diego, CA, USA; 2007.
- Kim W, Gupta MS, Wei G-Y, Brooks D: System level analysis of fast, per-core DVFS using on-chip switching regulators. In Proceedings of the 14th IEEE international symposium on high performance computer architecture. IEEE Computer Society, Salt Lake City, UT, USA; 2008:123–134.
- Kim W, Brooks D, Wei G-Y: A fully-integrated 3-Level DC/DC converter for nanosecond-scale DVS with fast shunt regulation. In Proceedings of the IEEE international solid-state circuits conference. IEEE Computer Society, San Francisco, CA, USA; 2011.
- Agerwala T, Chatterjee S: Computer architecture: challenges and opportunities for the next decade. IEEE Micro 2005, 25: 58–69. 10.1109/MM.2005.45
- Kant K: Toward a science of power management. Computer 2009, 42: 99–101.
- Korthikanti VA, Agha G (2010) Energy-performance trade-off analysis of parallel algorithms. In: USENIX workshop on hot topics in parallelism. USENIX Association, Berkeley, CA.
- Korthikanti V, Agha G: Avoiding energy wastage in parallel applications. In Proceedings of the international conference on green computing. IEEE Computer Society, Washington, DC; 2010:149–163.
- (2009) AMD BIOS and kernel developers guide (BKDG) for AMD family 10h processors. http://developer.amd.com/wordpress/media/2012/10/31116.pdf. Accessed 16 Apr 2014.
- Cho S, Melhem RG: Corollaries to Amdahl’s law for energy. Comput Architect Lett 2008, 7(1):s25-s28.
- Chakraborty K (2007) A case for an over-provisioned multicore system: energy efficient processing of multithreaded programs. Technical report, Department of Computer Sciences, University of Wisconsin-Madison.
- Isci C, Buyuktosunoglu A, Cher C-Y, Bose P, Martonosi M: An analysis of efficient multi-core global power management policies: maximizing performance for a given power budget. In Proceedings of the 39th annual IEEE/ACM international symposium on microarchitecture. IEEE Computer Society, Washington, DC; 2006:347–358.
- Curtis-Maury M, Shah A, Blagojevic F, Nikolopoulos DS, de Supinski BR, Schulz M: Prediction models for multi-dimensional power-performance optimization on many cores. In Proceedings of the 17th international conference on parallel architectures and compilation techniques. ACM, New York; 2008.
- Zhao X (2012) Coordinating resource use in open distributed systems. PhD thesis, University of Saskatchewan.
- Zhao X, Jamali N: Supporting deadline constrained distributed computations on grids. In Proceedings of the 12th IEEE/ACM international conference on grid computing. IEEE Computer Society, Washington DC, Lyon, France; 2011:165–172.
- Zhao X, Jamali N: Load balancing non-uniform parallel computations. In ACM SIGPLAN notices: proceedings of the 3rd international ACM SIGPLAN workshop on programming based on actors, agents and decentralized control (AGERE! at SPLASH 2013). ACM, Indianapolis; 2013:1–12.
- Allen JF: Maintaining knowledge about temporal intervals. Commun ACM 1983, 26(11):832–843. 10.1145/182.358434
- Jamali N, Zhao X: A scalable approach to multi-agent resource acquisition and control. In Proceedings of the 4th international joint conference on Autonomous Agents and Multi-Agent Systems (AAMAS 2005). ACM Press, Utrecht; 2005:868–875.
- Jamali N, Zhao X: Hierarchical resource usage coordination for large-scale multi-agent systems. In Lecture notes in artificial intelligence: massively multi-agent systems I vol. 3446. Edited by: Ishida T, Gasser L, Nakashima H. Springer, Berlin Heidelberg; 2005:40–54.
- Karmani RK, Shali A, Agha G: Actor frameworks for the JVM platform: a comparative analysis. In Proceedings of the 7th international conference on the principles and practice of programming in Java. ACM, New York, NY, Calgary, Alberta, Canada; 2009.
- Agha GA: Actors: a model of concurrent computation in distributed systems. MIT Press, Cambridge; 1986.
- Su H, Liu F, Devgan A, Acar E, Nassif S: Full chip leakage estimation considering power supply and temperature variations. In Proceedings of the 2003 international symposium on low power electronics and design. ISLPED ’03. ACM, New York; 2003:78–83.
- Wang B, Zhao X, Chiu D: Poster: a power-aware mobile app for field scientists. In Proceedings of the 12th annual international conference on mobile systems, applications, and services. MobiSys ’14. ACM, New York; 2014:383–383.
- Zhao X, Jamali N: Temporal reasoning about resources for deadline assurance in distributed systems. In Proceedings of the 9th international Workshop on Assurance in Distributed Systems and Networks (ADSN 2010), at the 30th International Conference on Distributed Computing Systems (ICDCS 2010). IEEE Computer Society, Washington DC, Genoa, Italy; 2010.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.