The evaluation of the proposed scheduling algorithm was done empirically, through simulation and measurement experiments. Our simulation experiments follow a full factorial design with two factors: the scheduling policy and the infrastructure size. The former has two levels: the proposed QoS-driven scheduler and a reference priority-based scheduler. The latter has three levels. Changes in the infrastructure size affect the level of contention for resources in the system: the larger the infrastructure, the smaller the resource contention. We performed simulation experiments considering 10 different workloads. Measurement experiments were also performed to validate the simulation models.
In this section we present the materials and methods applied to evaluate the QoS-driven scheduler, including the simulation models, the workload and infrastructure samples used in the experiments, the prototype implementation, and details on the validation of the simulation models.
4.1 Simulation models
We implemented event-driven simulation models in Erlang, on top of the Sim-Diasca simulation framework, for our proposed QoS-driven scheduler and for a state-of-practice priority-based scheduler. The structure of the two simulation models is essentially the same. They differ only in how they sort the pending queue, in which conditions preemption is allowed, in which instances to preempt when performing feasibility checks, and in how they rank multiple feasible hosts. Both simulation models receive three input files: a workload trace, an infrastructure description, and a set of allocation overheads.
4.1.1 Input data
The workload trace is a file with information regarding the requests to be processed. Each request consists of the amount of CPU and memory required by its instance, the service class, optionally some placement constraints, and the time needed to complete it (i.e. for how long resources must be allocated to the instance associated with the request). We note that the latter is used to drive the simulator, but is unknown to the scheduler.
The infrastructure description provides information about the hosts that form the datacenter. Each host in the infrastructure description is defined by its CPU and memory capacities, and by a set of attributes in the “key=value” format. The latter are used to match the placement constraints that requests may carry. The CPU and memory demands of a request are specified in the same units as the CPU and memory capacities of the hosts.
The set of allocation overheads gives a range of allocation times to be considered while simulating the allocation of an instance in a host. We recall that this overhead represents the time required to prepare a host to continue the execution of an instance, and that it depends on whether or not the instance has already run in that host. For this reason, each allocation overhead in the set is classified as either hot, representing the overhead of allocating an instance that has previously run in the host where it is going to be allocated, or cold, representing the overhead when the instance has never run in that host. Whenever an instance is allocated, the simulator randomly selects an allocation overhead from this set (according to the type of the allocation), allowing the simulation to take this overhead into account appropriately.
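To make the structure of these inputs concrete, the sketch below shows one possible in-memory representation of the three inputs and of the overhead sampling step; the field names and values are illustrative assumptions, not the simulators' actual format (the simulators themselves are written in Erlang).

```python
import random
from dataclasses import dataclass, field

@dataclass
class Request:
    cpu: float                  # CPU demand, same unit as host capacities
    memory: float               # memory demand, same unit as host capacities
    service_class: str          # e.g. "gold", "silver" or "bronze"
    duration: float             # seconds; drives the simulation, hidden from the scheduler
    constraints: dict = field(default_factory=dict)   # optional "key=value" placement constraints

@dataclass
class Host:
    cpu_capacity: float
    memory_capacity: float
    attributes: dict = field(default_factory=dict)    # matched against request constraints

# Allocation overheads (in seconds), split by allocation type; the values are placeholders.
allocation_overheads = {"hot": [0.8, 0.9, 1.1], "cold": [3.7, 4.2, 5.0]}

def sample_allocation_overhead(has_run_on_host: bool) -> float:
    """Randomly pick an overhead of the appropriate type when an instance is allocated."""
    kind = "hot" if has_run_on_host else "cold"
    return random.choice(allocation_overheads[kind])
```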
In this work, both the workload trace and the infrastructure description are obtained from a real cluster usage trace shared by Google. This trace is used to create workload and infrastructure samples of Google’s data, as described in Sections 4.2 and 4.3, respectively. Additionally, the set of allocation overheads was obtained from measurement experiments in a Kubernetes cluster. In these experiments, we measured the time a Kubernetes instance (called a pod) took to start, considering both hot and cold allocations. We used the nginx web server as the application running in the pods created. For each node in the infrastructure, a pod was created, the allocation overhead was measured, and then the pod was terminated. For each type of allocation, we repeated these steps for 1 h (with a 1 s interval between two consecutive measurements), to gather a large and representative set of allocation overheads. In the case of the cold allocation measurements, we made sure that the local repository of the nodes was cleaned before a new measurement was made.
4.1.2 The priority-based simulation model
For comparison purposes, we developed a simulator that models the default priority-based scheduler of Kubernetes [14]. It was chosen as a reference due to its popularity, and because it is open source, which allowed us to implement the simulation model exactly as the actual system is implemented. This scheduler assigns priorities to the instances according to the service classes to which their associated requests were submitted. These priorities are set in such a way that the higher the QoS expectation (SLO) of a service class, the higher the priority assigned to the request, and, as a consequence, to the instance associated with it.
Requests in the pending queue are sorted in decreasing order of priority, and requests with the same priority are sorted in increasing order of their respective admission times.
Preemptions of lower priority instances may occur only for the benefit of higher priority ones. The scheduler first preempts the instances with the lowest priorities, and when choosing among several instances with the same priority, the most recently admitted instances are selected. This naturally limits the overhead due to preemptions, since an allocated instance can only be preempted by the arrival of a new higher priority request.
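A minimal sketch of these two ordering rules, assuming each request or instance record carries a priority and an admission timestamp (attribute names are hypothetical):

```python
# Pending queue: higher priority first; among equals, earlier admission first.
def pending_queue_order(pending_requests):
    return sorted(pending_requests, key=lambda r: (-r.priority, r.admission_time))

# Preemption victims: lowest priority first; among equals, most recently admitted first.
def victim_order(allocated_instances):
    return sorted(allocated_instances, key=lambda i: (i.priority, -i.admission_time))
```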
As discussed before, when preemptions are not needed, the scheduler uses an allocation scoring function to rank feasible hosts. In Kubernetes, this function is a combination of priority functions. In our simulator we have used the two default priority functions available: (i) Least Requested Priority, which favors hosts with more available resources (to avoid leftovers that are too small), and (ii) Balanced Resource Allocation, which favors hosts with a more balanced resource usage rate (to avoid resource stranding). We use these two functions to compute two scores for each host and take the arithmetic mean of these scores as the score of the host. The host with the largest score is selected; a random choice is applied in case of ties.
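The sketch below follows the general form of these two priority functions in Kubernetes (scores in the 0 to 10 range, higher is better); the exact formulas may vary slightly between Kubernetes versions, so it should be read as an approximation of the behaviour described above rather than as the actual implementation.

```python
def least_requested_score(host) -> float:
    """Favors hosts with more free CPU and memory (0 = fully requested, 10 = empty)."""
    cpu_free = (host.cpu_capacity - host.cpu_requested) / host.cpu_capacity
    mem_free = (host.memory_capacity - host.memory_requested) / host.memory_capacity
    return 10 * (cpu_free + mem_free) / 2

def balanced_allocation_score(host) -> float:
    """Favors hosts whose CPU and memory usage fractions are similar, to avoid stranding."""
    cpu_frac = host.cpu_requested / host.cpu_capacity
    mem_frac = host.memory_requested / host.memory_capacity
    return 10 * (1 - abs(cpu_frac - mem_frac))

def allocation_score(host) -> float:
    """Arithmetic mean of the two scores; the host with the largest value is selected."""
    return (least_requested_score(host) + balanced_allocation_score(host)) / 2
```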
If preemptions are needed, then the preemption cost scoring function used to rank the hosts favors the hosts that need the minimum number of preemptions of the highest-priority instances. Ties are broken by the minimum number of preemptions of instances of the other service classes, considered in decreasing order of priority. If a tie persists, then we use the allocation scoring function described above to select one of the tied hosts. In other words, this preemption cost scoring function favors the hosts in which the smallest number of preemptions of the most QoS-demanding instances is needed.
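One way to realize this rule is to score each feasible host with a tuple of preemption counts, one per priority level from the highest priority down; lexicographic comparison of these tuples then selects the host that sacrifices the fewest high-priority instances first. This is a sketch under the assumption that the set of victims per host is already known; `victims_of` is a hypothetical helper.

```python
def priority_preemption_score(victims, priorities_high_to_low):
    """Tuple of preemption counts per priority level, highest priority first;
    smaller tuples (compared lexicographically) mean cheaper preemptions."""
    counts = {p: 0 for p in priorities_high_to_low}
    for victim in victims:
        counts[victim.priority] += 1
    return tuple(counts[p] for p in priorities_high_to_low)

# Host selection (remaining ties fall back to the allocation score):
# best_host = min(feasible_hosts,
#                 key=lambda h: priority_preemption_score(victims_of(h), priorities_high_to_low))
```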
4.1.3 The QoS-driven scheduler simulation model
The simulator implements a QoS-driven scheduler that operates exactly as described in Section 3. Recall that the QoS-driven scheduler requires some configuration, namely: (i) the threshold for the preemption overhead limitation mechanism; (ii) the watchdog timeout that triggers periodic executions of the scheduler; (iii) the safety margins for the different service classes; (iv) the allocation scoring function; and (v) the preemption cost scoring function.
Threshold configuration. We have set the threshold for preemption of requests of class i to be 1−σi. Recall that σi is the availability target for service class i; thus, these thresholds essentially limit the accumulated overhead due to preemptions to the maximum fraction of time that an instance can stay in the pending queue without violating its SLO.
Watchdog timeout. The watchdog timeout, which defines the maximum period between two sequential executions of the scheduler (see line 29 of Algorithm 1), was set to 10 seconds. This value was defined empirically by observing Kubernetes in action.
Safety margins. The safety margin (ϕi) was also set to 10 seconds for all classes.
Allocation scoring function. Since in this step neither priorities nor QoS metrics are involved, we have used the same allocation scoring function used for the priority-based simulation model.
Preemption cost scoring function. The cost of a preemption cannot be easily modeled, since it involves anticipating the impact that the preemption would have on the QoS delivered by the system. Thus, we need to resort to a heuristic that estimates this cost, as was done for the priority-based scheduler.
The rationale of the heuristic used is the following. We consider that instances that are very close to violating, or are already violating, their SLOs have the highest preemption cost, while instances that are far from violating their SLOs have lower preemption costs. Moreover, among the instances with higher preemption costs, we also consider their importance, with more important instances having even higher preemption costs.
Let Ph(t) be the non-empty set of instances that need to be preempted in host h to enable the allocation of an instance j at time t. We divide Ph(t) into two disjoint subsets, Ph+(t) and Ph−(t). Ph+(t) is the set of instances that need to be preempted and that are not close to violating their SLOs. Formally:
$$P_{h+}(t) = \left\{ k \in P_{h}(t); \mathcal{Q}_{k}(t) - \sigma_{i_{k}} \geq 0 \right\},$$
where \(\sigma _{i_{k}}\) is the availability target (SLO) of the service class ik to which instance k is associated. We compute a partial preemption cost score s+ as follows:
$$s_{+} = \frac{1}{\sum_{k \in P_{h+}(t)} \left(\mathcal{Q}_{k}(t) - \sigma_{i_{k}}\right)},$$
Then, we further divide the set of instances that are already violating or are close to violating their SLOs (Ph−(t)=Ph(t)−Ph+(t)) into m disjoint sets, \(P_{h-}^{i}(t), 1 \leq i \leq m\), one for each of the m service classes offered. \(P_{h-}^{i}(t)\) contains only the instances of class i belonging to Ph−(t) that need to be preempted in h to accommodate j at time t. We compute m partial preemption cost scores \(s_{-}^{i}, 1 \leq i \leq m\), as follows:
$$s_{-}^{i} = \frac{1}{\sum_{k \in P_{h-}^{i}(t)} (\mathcal{Q}_{k}(t) - \sigma_{i_{k}})}.$$
The lower the value of i, the more important the class. Thus, the preemption cost score of a host is given by a tuple of partial scores \(S=\langle s_{-}^{1}, s_{-}^{2},..., s_{-}^{m}, s_{+}\rangle\). Recall that when preemptions are needed, the scheduler selects the host with the smallest preemption cost score. A score S is smaller than a score S′ (S<S′) if there is an index x such that the xth element of S is smaller than that of S′, and all elements at positions y < x have the same value in both S and S′. Equation 9 formalizes this relation.
$$ S < S' \iff \exists x \in [1, m+1] : S[x] < S'[x] \;\land\; \forall y \in [1, x), \; S[y] = S'[y]. \tag{9} $$
As with the priority-based scheduler, when two or more hosts have the same smallest value for their preemption cost scores, the host chosen is the one with the largest allocation score value among those that tied.
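Because the preemption cost score is compared lexicographically, a host's score can be represented directly as a tuple, and Python's built-in tuple ordering implements exactly the relation of Eq. 9. In the sketch below the partial scores are taken as given (computed as defined above); `partial_minus` and `partial_plus` are hypothetical helpers.

```python
def qos_preemption_score(s_minus, s_plus):
    """s_minus: the partial scores (s_-^1, ..., s_-^m), ordered from the most to the
    least important class; s_plus: the partial score of the instances that are far
    from violating their SLOs. Tuples are compared lexicographically, so the host
    with the smallest tuple has the smallest preemption cost (Eq. 9)."""
    return tuple(s_minus) + (s_plus,)

# Host selection; remaining ties are broken by the allocation score:
# best_host = min(feasible_hosts,
#                 key=lambda h: qos_preemption_score(partial_minus(h), partial_plus(h)))
```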
It is important to mention that this is just one of the heuristics that could be used. Although it has produced good results for the scenarios evaluated in Section 5, other heuristics could perform even better. However, the evaluation of different heuristics is beyond the scope of this paper.
4.2 Workload details
The workloads used in the simulation experiments come from a trace of a production cloud at Google. This trace spans 29 days in May 2011, and comprises more than 25 million allocation requests for the resources of a cluster.
Google’s trace has information on the jobs that have been submitted, including their CPU and memory demands, and their duration (i.e. for how long they need to run). Jobs may also have placement constraints, and may comprise multiple tasks, typically with the same resource requirements, duration, and placement constraints (if any). The trace also includes the resource capacities of the hosts where the requests were executed. These capacities are normalized as a percentage of the capacity of the most powerful host (whose absolute capacity has not been disclosed). Thus, the request demands present in the trace use the same normalized scale to describe the amount of CPU and memory that an instance requires.
For simplicity, we consider each task belonging to a job as an independent request submitted to the system. Each request consists of the amount of CPU and memory required by its instance, the request duration, and, possibly, some placement constraints. We recall that the duration of a request is not considered by the schedulers when making their decisions, and is used simply to drive the simulations.
Requests in the trace may be classified according to the 12 different priorities that can be assigned to them (from 0 to 11). We use these priorities to define the different service classes considered in our simulation experiments. Based on the description of the trace [31] and on previous works [16, 32], it is possible to group the requests into three service classes. The availability SLO established for each service class is the same as that used by Carvalho et al. [32], who used this trace to evaluate an admission control mechanism. The service classes considered in this work and their respective SLOs are described below.
1. The class gold consists of requests with a priority higher than 8. This class encompasses critical monitoring tasks and interactive user-facing applications, which require very high availability [7]. These are the most important requests in the workload, since their instances are never supposed to be preempted. For this reason, this class promises an SLO of 100% availability. As this is the most demanding class in terms of QoS, the priority-based scheduler assigns the highest priority to this class, while the QoS-driven one treats it as the most important one (i.e. i=1);
2. The class silver consists of requests with intermediate priorities (higher than 1 and lower than 9, i.e. in [2,8]). It includes applications that can cope with a slightly degraded QoS. This is the case of non-interactive user-facing applications, as well as some critical batch applications that can accommodate some downtime, but have strict deadlines to meet [7]. In our experiments we arbitrated an SLO of 90% availability for this class. The priority-based scheduler assigns the second highest priority to this class, while the QoS-driven scheduler sets an intermediate value for its importance (i=2);
3. The class bronze is the least demanding in terms of QoS, with a promised SLO of 50% availability. Requests with a priority lower than 2 are classified as bronze requests. Instances of this class are often preempted for the benefit of instances associated with higher priority requests [16]. This is the lowest priority class according to the priority-based scheduler, and the least important one according to the QoS-driven scheduler (i=3). This class is targeted at best-effort batch applications. We note that this is the kind of workload that is currently being executed on opportunistic resources in public clouds. Providing an SLO for this class, even if it is a low one, allows users to have some predictability for the running time of their applications, which is not the case when no guarantees are offered. Some housekeeping tasks common in large infrastructures, including logging services and file-system cleanup, also fall in this class. Although these tasks can run at the lowest possible priority, they cannot starve [7]. The priority-to-class mapping is summarized in the sketch below.
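The mapping from trace priorities to the three service classes and their SLOs can be summarized as follows (an illustrative sketch, not the simulators' actual code):

```python
AVAILABILITY_SLO = {"gold": 1.0, "silver": 0.9, "bronze": 0.5}

def service_class(priority: int) -> str:
    """Maps a Google-trace priority (0-11) to the service class used in the experiments."""
    if priority > 8:
        return "gold"     # critical monitoring and interactive user-facing tasks
    if priority > 1:      # priorities in [2, 8]
        return "silver"   # applications that tolerate slightly degraded QoS
    return "bronze"       # priorities 0 and 1: best-effort batch work
```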
More examples on how multiple service classes are used in production cloud environments, and how this can benefit applications with different SLA requirements, can be found in the literature [7, 33].
Simulating the whole Google trace was too expensive in terms of processing time. Thus, we generated ten different workload samples from Google’s trace. For each treatment of the two factors discussed before (scheduling policy and infrastructure size), we have executed experiments with these ten samples, leading to 60 different scenarios tested.
The workload samples were generated as follows. Firstly, we conducted a clustering analysis on the Google users, applying the well-known k-means clustering algorithm and taking into account, for each user, the number of requests submitted by the user, and the variance of the CPU demand, memory demand, and duration of these requests. This analysis led to six groups of users. In order to generate a sample workload, we randomly selected 10% of the users in each group. The resulting sample workload consists of all the requests submitted by the selected users. Figures 1 and 2 present, respectively, the amount of CPU and memory allocated over time (measured at 1-min intervals) for each workload, when submitted to a hypothetical infrastructure with infinite CPU and RAM capacities. In these graphs we differentiate the workloads by the three service classes.
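An illustrative sketch of this sampling procedure, assuming the trace has already been loaded into a table with one row per request; the column names and the use of pandas and scikit-learn are assumptions about tooling, not a description of the authors' scripts:

```python
import pandas as pd
from sklearn.cluster import KMeans

def sample_workload(requests: pd.DataFrame, frac: float = 0.10, seed: int = 0) -> pd.DataFrame:
    """requests: one row per request, with columns 'user', 'cpu', 'memory', 'duration'."""
    # Per-user features: number of requests and variance of CPU, memory and duration.
    features = requests.groupby("user").agg(
        n_requests=("cpu", "size"),
        cpu_var=("cpu", "var"),
        mem_var=("memory", "var"),
        dur_var=("duration", "var"),
    ).fillna(0.0)

    # Cluster the users into six groups, as in the analysis described above.
    features["group"] = KMeans(n_clusters=6, random_state=seed).fit_predict(features)

    # Randomly select 10% of the users of each group and keep all of their requests.
    selected_users = features.groupby("group").sample(frac=frac, random_state=seed).index
    return requests[requests["user"].isin(selected_users)]
```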
The ten workloads have requests for all service classes, and differ substantially from each other; their shapes, request mix per service class, peak demands and intensities are different. This workload heterogeneity comes from the fact that different subsets of real users lead to different bundles of requests. This variability is important to analyze the scheduler under different (yet realistic) workloads.
4.3 Infrastructure
Changes in the infrastructure size affect the level of contention for resources in the system: the larger the infrastructure, the smaller the resource contention. Resource contention is also affected by the demand imposed on the system. Since each of the ten workload samples generated leads to a different demand, the infrastructure used to allocate the workloads should also vary from workload to workload, so that the experiments are comparable across different workloads. To achieve this goal, we consider an infrastructure size N, defined according to the peak demand for resources of each workload. This is the first level of the infrastructure size factor. The other two levels are set to 0.9N and 0.8N, which correspond to infrastructures that are smaller by 10% and 20%, respectively. The value of N is established as follows:
1. Given a workload sample, we simulate the allocation of the workload considering a hypothetical infrastructure comprising a single host with infinite CPU and memory capacities. It does not matter which scheduler is used in this setup experiment, because all the requests in the workload are always allocated straight away, without any queuing delays or preemptions.
2. Then, we evaluate the results of the simulation done in step (1) to identify the maximum amount of CPU and memory ever used to process the workload. Let these maximum quantities be Nc and Nm, respectively. N is set to the maximum of Nc and Nm, and the generation of the infrastructure is driven by the resource with the largest peak, i.e. CPU if Nc>Nm, or memory if Nm>Nc.
To generate an infrastructure of size N, we randomly sampled hosts from Google’s trace, adding them one at a time until the aggregate capacity of the infrastructure reached N for the resource with the largest peak. Infrastructures of sizes 0.9N and 0.8N were generated by randomly removing one host at a time from the infrastructure of size N previously generated, until the desired size was reached.
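A sketch of this sizing procedure, assuming the peak demands Nc and Nm from the setup simulation and the candidate host descriptions from the trace are available (the host attribute names are the same illustrative ones used earlier):

```python
import random

def build_infrastructures(hosts, n_cpu_peak, n_mem_peak, seed=0):
    """hosts: candidate host descriptions from Google's trace; n_cpu_peak and
    n_mem_peak are the peaks Nc and Nm from the infinite-capacity simulation."""
    rng = random.Random(seed)
    driving = "cpu_capacity" if n_cpu_peak > n_mem_peak else "memory_capacity"
    n = max(n_cpu_peak, n_mem_peak)

    # Size N: add randomly chosen hosts until the aggregate capacity of the
    # driving resource reaches N.
    pool = list(hosts)
    rng.shuffle(pool)
    infra_n, total = [], 0.0
    while total < n and pool:
        host = pool.pop()
        infra_n.append(host)
        total += getattr(host, driving)

    # Sizes 0.9N and 0.8N: randomly remove hosts from the size-N infrastructure
    # until the desired aggregate capacity is reached.
    def shrink(target):
        infra = list(infra_n)
        capacity = sum(getattr(h, driving) for h in infra)
        while capacity > target:
            removed = infra.pop(rng.randrange(len(infra)))
            capacity -= getattr(removed, driving)
        return infra

    return infra_n, shrink(0.9 * n), shrink(0.8 * n)
```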
4.3.1 Evaluation metrics
The QoS-driven scheduler was compared with the priority-based one considering different metrics. The basic metric used in this assessment is the QoS (i.e. the availability) that is delivered to the requests served, which is computed using Eq. 1, previously defined. We also measure the QoS deficit experienced by requests whose respective SLOs were violated. The QoS deficit is computed as the difference between the SLO and the actual availability delivered to these requests.
Finally, we compute the SLO fulfillment metric. This is simply the ratio between the number of requests that had their SLO fulfilled, i.e. received a QoS at or above the promised target, and the total number of requests served. All these metrics are computed separately for the three service classes considered.
We also evaluate how fairly the schedulers share the resources among the instances; in particular, we want to evaluate the equity of the QoS delivered to requests of the same class that were active at approximately the same time. To evaluate fairness, we compute the Gini coefficient [34], a well-known measure of inequality among the subjects of a population/sample. The Gini coefficient varies in the interval [0,1], where 0 corresponds to perfect income equality (i.e. everyone has the same income) and 1 corresponds to perfect income inequality (i.e. one person has all the income, while everyone else has none).
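For reference, a common estimator of the Gini coefficient over a sample of availabilities is shown below (a sketch; the paper does not state which estimator was used).

```python
import numpy as np

def gini(values) -> float:
    """Gini coefficient of a sample: 0 means perfect equality, values near 1 mean
    that a few subjects concentrate almost everything."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    if n == 0 or x.sum() == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, for the sorted sample x_1 <= ... <= x_n
    return float(2 * np.sum(ranks * x) / (n * x.sum()) - (n + 1) / n)

# Example: gini([0.9, 0.9, 0.9]) == 0.0 (perfectly even), while gini([1.0, 1.0, 0.0]) > 0.
```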
4.4 Validation of the simulation models
Since the main results of this research come from simulation experiments, it is of utmost importance to validate our simulation models. The validation of the simulation models was carried out by comparing the results of paired measurement and simulation experiments using actual implementations of the schedulers, and our simulators, under the same environment conditions — infrastructure and scheduler configuration — and workload. For these experiments, the metric of interest was the final availability of the instances.
4.4.1 Proof of concept implementation
The implementation of the priority-based scheduler is the default scheduler available in Kubernetes. From an architectural perspective, a Kubernetes cluster comprises two types of nodes: master and worker. The master node runs the Control Plane services that control and orchestrate the Kubernetes cluster, such as (i) the API server, which provides endpoints to process RESTful API calls to regulate and manage the cluster, (ii) the Scheduler, which assigns physical resources to instances, called pods, (iii) the Replication Controller, which manages pods within the cluster, and (iv) the Node Controller, which detects and responds when nodes go down or come up. A worker node handles the runtime environment of the pods, which is based on containers. Each worker node runs a Kubelet agent that takes care of the containers running in the pods assigned to that node, and periodically reports the health status of pods and nodes to the Control Plane in the master node. A Kubernetes cluster has at least one worker node, but in production environments it usually contains multiple worker nodes.
In addition to being popular and open source, Kubernetes is also easy to modify. Its modular design facilitates replacing parts of the system without affecting other parts. We implemented a proof of concept (PoC) of the QoS-driven scheduler for Kubernetes by simply changing the appropriate parts of the default priority-based scheduler to incorporate the features described in Section 3. Our approach to implementing a QoS-driven scheduler for Kubernetes was to be as non-intrusive as possible. Thus, we started from the code of Kubernetes version 1.9, which was the latest stable version at the time the coding took place, and simply modified the default scheduler to incorporate the required changes. In particular, we mainly changed the preemption logic of the scheduler and the pending queue sorting algorithm.
In order to allow the QoS-driven scheduler to compute the required QoS metrics, while keeping changes to the original scheduler to a minimum, we have used two additional external services: Kubewatch and Prometheus. Kubewatch is responsible for monitoring the pod events in the system, such as creation, allocation, preemption, and deletion. Whenever one of these events happens, Kubewatch collects and updates the data related to the pod involved. In the case of a pod creation event, the creation timestamp is registered, allowing the scheduler to infer the amount of time that the pod has been pending. In an allocation event, Kubewatch registers the amount of time that the pod has been pending and the allocation timestamp, which allows the scheduler to calculate the amount of time that the pod has been running. In a preemption event, the service registers the amount of time that the pod has been running. Lastly, in a deletion event, it registers the amount of time that the pod has been pending or running since it was created. These data are required by the QoS-driven scheduler to calculate \(\mathcal {Q}_{j}(t)\) of an instance (i.e. a pod) j at some time t. Since we deploy Kubewatch and the scheduler service on the same node (the master node), both services use the same clock when calculating the pending or running times of instances, and there is no need to run a clock synchronization protocol. Prometheus, on the other hand, is responsible for storing the data collected by Kubewatch and making them available to the scheduler. Whenever the scheduler runs, it retrieves the required data from Prometheus and calculates the QoS metrics accordingly.
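For illustration, a sketch of how the QoS metric could be derived from the pending and running times that Kubewatch accumulates; the exact definition of \(\mathcal {Q}_{j}(t)\) is given by Eq. 1 earlier in the paper, so the formula below (time spent running over total time since admission) is an assumption made only for this example.

```python
def availability(running_time: float, pending_time: float) -> float:
    """Assumed form of Q_j(t): fraction of the time since admission that the pod
    has spent running (see Eq. 1 in the paper for the actual definition)."""
    total = running_time + pending_time
    return running_time / total if total > 0 else 1.0
```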
Figure 3 shows a sketch of the PoC architecture. In summary, the QoS-driven scheduler, Prometheus and Kubewatch are the new components deployed on the master node. Whenever a pod event occurs, the Control Plane registers the event; some of these events are reported by the Kubelet agents running on the worker nodes. Kubewatch monitors pod events in the Control Plane, computes the metrics of interest, and registers them in Prometheus. Whenever the scheduler runs, it retrieves the required data from Prometheus and calculates the QoS metrics of the pods. The scheduler then generates the allocation plan taking these QoS metrics into account. Finally, based on the allocation plan, the Control Plane instructs the appropriate Kubelet agents to allocate or preempt pods on the worker nodes.
4.4.2 Experimental design of the validation tests
The validation was performed in two tests, with the execution of two synthetic workloads over the same infrastructure. In both cases, the infrastructure consisted of a Kubernetes cluster with 20 homogeneous hosts — virtual machines on an OpenStack cloud — each with 4 Gbytes of RAM, and 4 vCPUs. In this deployment, Kubernetes used approximately 0.25 Gbytes of the memory made available in each host. Both schedulers were configured in the same way in both the simulation and the measurement experiments, following what was detailed in Section 4.1.
The workloads were designed so that the expected behaviour of the system could be anticipated, and so that they could stress the system in different ways. In both cases, all requests were submitted at the beginning of the test, with a one-second interval between the submission of two subsequent requests. The tests ran for one hour, and all requests remained active until the end of the tests, when the availabilities were computed. All requests required the same amount of CPU and memory (0.375 Gbyte of RAM and 0.375 vCPUs), allowing 10 instances of any request class to be simultaneously allocated on each host. In the two tests, the maximum acceptable overhead configured for both the simulation and the measurement experiments was 1−σi, i.e. 0%, 10% and 50% for the gold, silver and bronze classes, respectively.
In the first test, the synthetic workload consisted of 256 requests, with 80 requests for gold instances, 80 requests for silver instances and 96 requests for bronze instances. The order in which these requests appear in the workload was randomly defined. The expected behavior is that the priority-based scheduler will provide 100% availability for all requests of classes gold and silver, and 56 requests of class bronze will have an availability close to 0%, while the other 40 will have availability close to 100%. On the other hand, the QoS-driven scheduler will provide availabilities for all requests very close to the SLO of their respective classes (small differences are expected due to the preemption and scheduling overheads involved). The goal of this experiment is to assess the impact of the simplifications made in the simulation models. In particular, the main simplification is the fact that the simulators do not consider the overhead involved in the processing of the pending queue. Thus, the experiment was designed in such a way that a reasonable number of requests were always present in the pending queue. More specifically, soon after the experiment is started, there are always 56 requests in the pending queue, which corresponds to 22% of the whole workload.
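The expected numbers follow directly from the capacity of the cluster:

$$ 20 \text{ hosts} \times 10 \text{ instances/host} = 200 \text{ slots}, \qquad 200 - (80_{\text{gold}} + 80_{\text{silver}}) = 40 \text{ bronze instances allocated}, $$
$$ 96 - 40 = 56 \text{ bronze requests pending}, \qquad 56/256 \approx 22\% \text{ of the workload}. $$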
The second validation test aimed at exercising the mechanism adopted to limit the number of preemptions made by the QoS-driven scheduler. To accomplish this, the synthetic workload used consisted of 221 silver requests. Since the silver class has a high SLO (90%), and all active requests have the same importance, preemptions would soon become very frequent, and the mechanism to limit preemptions was more likely to be triggered. In this case, the expected behavior is that the priority-based scheduler will allocate the first 200 requests, and leave the other 21 requests in the pending queue. Thus, 200 requests will have an availability of 100%, while 21 will have availability of 0%. For the QoS-driven scheduler, all requests will have a chance to run, and will achieve a QoS that is close to their respective SLOs. Again, some requests are expected to have small QoS deficits due to the overheads involved.
4.4.3 Results of the validation tests
In Fig. 4 we plot the final availabilities calculated for the instances in the workload of the first test. In purple we show the final availabilities of the instances calculated in the simulation experiments, while the final availabilities calculated in the measurement experiments are shown in green. On the left-hand side we have the results for the priority-based scheduler, while on the right-hand side we have those for the QoS-driven scheduler.
As expected, the priority-based scheduler keeps the availabilities of the instances of the highest priority classes at 100%. In addition, 40 bronze instances received 100% availability, because they arrived before other instances of higher priority classes and were never selected for preemption. The remaining bronze instances have availabilities below their SLO, close to 0%. The bronze instances that violated their SLOs were submitted when the infrastructure was already fully utilized, or were preempted when other requests of higher priority classes were submitted. In turn, the QoS-driven scheduler delivers availabilities for all instances that are very close to their respective SLOs.
Figure 5 shows the results for the second test, using the same notation as Fig. 4. We observe that both schedulers work as expected. The priority-based scheduler keeps the availabilities of the instances that were allocated at 100%, while the 21 last requests submitted received 0% availability. As each node is able to allocate 10 requests, these last requests were submitted when the infrastructure was already fully utilized, and they were never allocated. In turn, the QoS-driven scheduler delivers availabilities that are very close to the instances’ QoS targets.
Finally, in both tests, the final availabilities computed from the measurement and the simulation experiments are very close to each other. The ranges of availabilities were wider in the measurements than in the simulations, due to the less controlled environment of the measurement experiments. However, a t-test reveals that there is no significant difference between their results.