We present Tesseract, a system that directly and transparently addresses the double-paging problem. Tesseract tracks when guest and hypervisor I/O operations are redundant and modifies these I/Os to create indirections to existing disk blocks containing the page contents. Although our focus is on reconciling I/Os between the guest disks and hypervisor swap, our technique is general and can reconcile, or deduplicate, I/Os for guest pages read or written by the VM.
Deduplication of disk blocks for file contents accessed in a common manner is well-understood. One challenge that our approach faces is that the locality of guest I/Os (reflecting the guest's notion of disk layout) often differs from that of the blocks in the hypervisor swap. This loss of locality through indirection results in significant performance loss on subsequent guest reads. We propose two alternatives to recovering this lost locality, each based on the idea of asynchronously reorganizing the indirected blocks in persistent storage.
We evaluate our system and show that it can significantly reduce the costs of double-paging. We focus our experiments on a synthetic benchmark designed to highlight its effects. In our experiments we observe Tesseract can improve our benchmark's throughput by as much as 200% when using traditional disks and by as much as 30% when using SSD. At the same time worst case application responsiveness can be improved by a factor of 5. This paper presents virtual asymmetric multiprocessor, a new scheme of virtual desktop scheduling on multi-core processors for user-interactive performance. The proposed scheme enables virtual CPUs to be dynamically performance-asymmetric based on their hosted workloads. To enhance user experience on consolidated desktops, our scheme provides interactive workloads with fast virtual CPUs, which have more computing power than those hosting background workloads in the same virtual machine. To this end, we devise a hypervisor extension that transparently classifies background tasks from potentially interactive workloads. In addition, we introduce a guest extension that manipulates the scheduling policy of an operating system in favor of our hypervisor-level scheme so that interactive performance can be further improved. Our evaluation shows that the proposed scheme significantly improves interactive performance of application launch, Web browsing, and video playback applications when CPU-intensive workloads highly disturb the interactive workloads. Multi-tier applications are widely deployed in today's virtualized cloud computing environments. At the same time, management operations in these virtualized environments, such as load balancing, hardware maintenance, workload consolidation, etc., often make use of live virtual machine (VM) migration to control the placement of VMs. Although existing solutions are able to migrate a single VM efficiently, little attention has been devoted to migrating related VMs in multi-tier applications. Ignoring the relatedness of VMs during migration can lead to serious application performance degradation.
This paper formulates the multi-tier application migration problem, and presents a new communication-impact-driven coordinated approach, as well as a system called COMMA that realizes this approach. Through extensive testbed experiments, numerical analyses, and a demonstration of COMMA on Amazon EC2, we show that this approach is highly effective in minimizing migration's impact on multi-tier applications' performance. Cross-ISA system-mode emulation has many important applications. For example, Cross-ISA system-mode emulation helps computer architects and OS developers trace and debug kernel execution-flow efficiently by emulating a slower platform (such as ARM) on a more powerful plat-form (such as an x86 machine). Cross-ISA system-mode emulation also enables workload consolidation in data centers with platforms of different instruction-set architectures (ISAs). However, system-mode emulation is much slower. One major overhead in system-mode emulation is the multi-level memory address translation that maps guest virtual address to host physical address. Shadow page tables (SPT) have been used to reduce such overheads, but primarily for same-ISA virtualization. In this paper we propose a novel approach called embedded shadow page tables (ESPT). EPST embeds a shadow page table into the address space of a cross-ISA dynamic binary translation (DBT) and uses hardware memory management unit in the CPU to translate memory addresses, instead of software translation in a current DBT emulator like QEMU. We also use the larger address space on modern 64-bit CPUs to accommodate our DBT emulator so that it will not interfere with the guest operating system. We incorporate our new scheme into QEMU, a popular, retargetable cross-ISA system emulator. SPEC CINT2006 benchmark results indicate that our technique achieves an average speedup of 1.51 times in system mode when emulating ARM on x86, and a 1.59 times speedup for emulating IA32 on x86_64. Efficient and secure networking between virtual machines is crucial in a time where a large share of the services on the Internet and in private datacenters run in virtual machines. To achieve this efficiency, virtualization solutions, such as Qemu/KVM, move towards a monolithic system architecture in which all performance critical functionality is implemented directly in the hypervisor in privileged mode. This is an attack surface in the hypervisor that can be used from compromised VMs to take over the virtual machine host and all VMs running on it.
We show that it is possible to implement an efficient network switch for virtual machines as an unprivileged userspace component running in the host system including the driver for the upstream network adapter. Our network switch relies on functionality already present in the KVM hypervisor and requires no changes to Linux, the host operating system, and the guest.
Our userspace implementation compares favorably to the existing in-kernel implementation with respect to throughput and latency. We reduced per-packet overhead by using a run-to-completion model and are able to outperform the unmodified system for VM-to-VM traffic by a large margin when packet rates are high. Data center servers are typically overprovisioned, leaving spare memory and CPU capacity idle to handle unpredictable workload bursts by the virtual machines running on them. While this allows for fast hotspot mitigation, it is also wasteful. Unfortunately, making use of spare capacity without impacting active applications is particularly difficult for memory since it typically must be allocated in coarse chunks over long timescales. In this work we propose repurposing the poorly utilized memory in a data center to store a volatile data store that is managed by the hypervisor. We present two uses for our Mortar framework: as a cache for prefetching disk blocks, and as an application-level dis- tributed cache that follows the memcached protocol. Both prototypes use the framework to ask the hypervisor to store useful, but recoverable data within its free memory pool. This allows the hypervisor to control eviction policies and prioritize access to the cache. We demonstrate the benefits of our prototypes using realistic web applications and disk benchmarks, as well as memory traces gathered from live servers in our university’s IT department. By expanding and contracting the data store size based on the free memory available, Mortar improves average response time of a web application by up to 35% compared to a fixed size mem- cached deployment, and improves overall video streaming performance by 45% through prefetching. Limited main memory size is considered as one of the major bottlenecks in virtualization environments. Content-Based Page Sharing (CBPS) is an efficient memory deduplication technique to reduce server memory requirements, in which pages with same content are detected and shared into a single copy. As the widely used implementation of CBPS, Kernel Samepage Merging (KSM) maintains the whole memory pages into two global comparison trees (a stable tree and an unstable tree). To detect page sharing opportunities, each tracked page needs to be compared with pages already in these two large global trees. However since the vast majority of compared pages have different content with it, that will induce massive futility comparisons and thus heavy overhead.
In this paper, we propose a lightweight page Classification-based Memory Deduplication approach named CMD to reduce futile page comparison overhead meanwhile to detect page sharing opportunities efficiently. The main innovation of CMD is that pages are grouped into different classifications based on page access characteristics. Pages with similar access characteristics are suggested to have higher possibility with same content, thus they are grouped into the same classification. In CMD, the large global comparison trees are divided into multiple small trees with dedicated local ones in each page classification. Page comparisons are performed just in the same classification, and pages from different classifications are never compared (since they probably result in futile comparisons). The experimental results show that CMD can efficiently reduce page comparisons (by about 68.5%) meanwhile detect nearly the same (by more than 98%) or even more page sharing opportunities. Dynamic languages have been gaining popularity to the point that their performance is starting to matter. The effort required to develop a production-quality, high-performance runtime is, however, staggering and the expertise required to do so is often out of reach of the community maintaining a particular language. Many domain specific languages remain stuck with naive implementations, as they are easy to write and simple to maintain for domain scientists. In this paper, we try to see how far one can push a naive implementation while remaining portable and not requiring expertise in compilers and runtime systems. We choose the R language, a dynamic language used in statistics, as the target of our experiment and adopt the simplest possible implementation strategy, one based on evaluation of abstract syntax trees. We build our interpreter on top of a Java virtual machine and use only facilities available to all Java programmers. We compare our results to other implementations of R. Multi- and many-core processors are becoming increasingly popular in embedded systems. Many of these processors now feature hardware virtualization capabilities, such as the ARM Cortex A15, and x86 processors with Intel VT-x or AMD-V support. Hardware virtualization offers opportunities to partition physical resources, including processor cores, memory and I/O devices amongst guest virtual machines. Mixed criticality systems and services can then co-exist on the same platform in separate virtual machines. However, traditional virtual machine systems are too expensive because of the costs of trapping into hypervisors to multiplex and manage machine physical resources on behalf of separate guests. For example, hypervisors are needed to schedule separate VMs on physical processor cores. In this paper, we discuss the design of the Quest-V separation kernel, which partitions services of different criticalities in separate virtual machines, or sandboxes. Each sandbox encapsulates a subset of machine physical resources that it manages without requiring intervention of a hypervisor. Moreover, a hypervisor is not needed for normal operation, except to bootstrap the system and establish communication channels between sandboxes. This paper addresses the problem of efficiently supporting parallelism within a managed runtime. A popular approach for exploiting software parallelism on parallel hardware is task parallelism, where the programmer explicitly identifies potential parallelism and the runtime then schedules the work. Work-stealing is a promising scheduling strategy that a runtime may use to keep otherwise idle hardware busy while relieving overloaded hardware of its burden. However, work stealing comes with substantial overheads. Recent work identified sequential overheads of work-stealing, those that occur even when no stealing takes place, as a significant source of overhead. That work was able to reduce sequential overheads to just 15%.
In this work, we turn to dynamic overheads, those that occur each time a steal takes place. We show that the dynamic overhead is dominated by introspection of the victim’s stack when a steal takes place. We exploit the idea of a low overhead return barrier to reduce the dynamic overhead by approximately half, resulting in total performance improvements of as much as 20%. Because, unlike prior work, we attack the overheads directly due to stealing and therefore attack the overheads that grow as parallelism grows, we improve the scalability of work-stealing applications. This result is complementary to recent work addressing the sequential overheads of work-stealing. This work therefore substantially relieves work-stealing of the increasing pressure due to increasing intranode hardware parallelism. Abstract Program instrumentation techniques form the basis of many recent software security defenses, including defenses against common exploits and security policy enforcement. As compared to source-code instrumentation, binary instrumentation is easier to use and more broadly applicable due to the ready availability of binary code. Two key features needed for security instrumentations are (a) it should be applied to all application code, including code contained in various system and application libraries, and (b) it should be non-bypassable. So far, dynamic binary instrumentation (DBI) techniques have provided these features, whereas static binary instrumentation (SBI) techniques have lacked them. These features, combined with ease of use, have made DBI the de facto choice for security instrumentations. However, DBI techniques can incur high overheads in several common usage scenarios, such as application startups, system-calls, and many real-world applications. We therefore develop a new platform for secure static binary instrumentation (PSI) that overcomes these drawbacks of DBI techniques, while retaining the security, robustness and ease-of-use features. We illustrate the versatility of PSI by developing several instrumentation applications: basic block counting, shadow stack defense against control-flow hijack and return-oriented programming attacks, and system call and library policy enforcement. While being competitive with the best DBI tools on CPU-intensive SPEC 2006 benchmark, PSI provides an order of magnitude reduction in overheads on a collection of real-world applications. We are interested in implementing dynamic language runtimes on top of language-level virtual machines. Type specialization is a critical optimization for dynamic language runtimes: generic code that handles any type of data is replaced with specialized code for particular types observed during execution. However, types can change, and the runtime must recover whenever unexpected types are encountered. The state-of-the-art recovery mechanism is called deoptimization. Deoptimization is a well-known technique for dynamic language runtimes implemented in low-level languages like C. However, no dynamic language runtime implemented on top of a virtual machine such as the Common Language Runtime (CLR) or the Java Virtual Machine (JVM) uses deoptimization, because the implementation thereof used in low-level languages is not possible.
We created Stackdb, a debugging library with VMI support that allows one to monitor and control a whole system through multiple, coordinated targets. A target corresponds to a particular level of the system's software stack; multiple targets allow a user to observe a VM guest at several levels of abstraction simultaneously. For example, with Stackdb, one can observe a PHP script running in a Linux process in a Xen VM via three coordinated targets at the language, process, and kernel levels. Within Stackdb, higher-level targets are components that utilize lower-level targets; a key contribution of Stackdb is its API that supports multi-level and flexible "stacks" of targets. This paper describes the challenges we faced in creating Stackdb, presents the solutions we devised, and evaluates Stackdb through its application to a security-focused, whole-system case study. Dynamic Binary Instrumentation (DBI) is a core technology for building debugging and profiling tools for application executables. Most state-of-the-art DBI systems have focused on the same instruction set architecture (ISA) where the guest binary and the host binary have the same ISA. It is uncommon to have a cross-ISA DBI system, such as a system that instruments ARM executables to run on x86 machines. We believe cross-ISA DBI systems are increasingly more important, since ARM executables could be more productively analyzed on x86 based machines such as commonly available PCs and servers.
In this paper, we present DBILL, a cross-ISA and re- targetable dynamic binary instrumentation framework that builds on both QEMU and LLVM. The DBILL framework enables LLVM-based static instrumentation tools to become DBI ready, and deployable to different target architectures. Using address sanitizer and memory sanitizer as implementation examples, we show DBILL is an efficient, versatile and easy to use cross-ISA retargetable DBI framework. Welcome to the 10th ACM SIGPLAN/SIGOPS Conference on Virtual Execution Environments (VEE'14). We are happy to present the community with a strong program covering a wide range of virtualization topics.
This year, authors registered 56 papers, of which 49 were finalized as complete submissions. The program committee (PC) consisted of 2 chairs and 18 researchers active in virtualization-related aspects of programming languages and operating systems. Members were allowed to submit papers; the co-chairs chose not to submit anything. Reviewing was double-blind and was done almost entirely by the committee, with a little assistance from outsiders with special expertise. All submissions received 4--5 reviews, and authors were given the opportunity for rebuttal before the PC meeting.
The program committee meeting was held in January at the IBM T.J. Watson Research Center in New York. Most of the committee members were present in person. In an 8-hour session, we individually discussed all papers but those that were marked as early rejects due to receiving only negative reviews. We followed conventional rules for conflict of interest, with conflicted members (including co-chairs) leaving the room during discussion of the conflicted papers. In the end, we accepted 18 papers for presentation at the conference, of which about half were shepherded by PC members.
In addition to the 18 accepted papers, the VEE'14 program includes two keynote presentations by Galen Hunt and Jan Vitek. We hope that the resulting proceedings will serve as a valuable reference for researchers and practitioners in the area of virtualization. The Microsoft Research Drawbridge Project began with a simple question: Is it possible to achieve the benefits of hardware virtual machines without the overheads? Following that question, we have built a line of exploratory prototypes. These prototypes range from an ARM-based phone that runs x86 Windows binaries to new forms of secure computation. In this talk, I'll briefly describe our various prototypes and the evidence we have accumulated that our first question can be answered in the affirmative. Computer systems research spans sub-disciplines that include embedded systems, programming languages, networking, and operating systems. In this talk my contention is that a number of structural factors inhibit quality systems research. Symptoms of the problem include unrepeatable and unreproduced results as well as results that are either devoid of meaning or that measure the wrong thing. I will illustrate the impact of these issues on our research output with examples from the development and empirical evaluation of the Schism real-time garbage collection algorithm that is shipped with the FijiVM - a Java virtual machine for embedded and mobile devices. I will argue that our field should foster: repetition of results, independent reproduction, as well as rigorous evaluation. I will outline some baby steps taken by several computer conferences. In particular I will focus on the introduction of Artifact Evaluation Committees or AECs to ECOOP, OOPLSA, PLDI and soon POPL. The goal of the AECs is to encourage authors to package the software artifacts that they used to support the claims made in their paper and to submit these artifacts for evaluation. AECs were carefully designed to provide positive feedback to the authors that take the time to create repeatable research. Joint work with Kalibera, Pizlo, Hosking, Blanton and Ziarek.