VM Performance vs. Implementation Complexity

by chad

Dummy VM examples; works on x86-64 Linux

I was working on a post about recent advances in CHDL when I realized that there was another topic I have been meaning to cover in this venue for years: the results of a simple exploration of VM performance vs. implementation complexity. In a recent demo for the Linux Users’ Group at Georgia Tech, I explained an FPGA development toolchain using CHDL and briefly demoed a CPU implementation I had thrown together the night before, using an accumulator instruction set which happened to already have an implementation of my favorite test algorithm due to this earlier work.

Platform VMs, Briefly

Virtual machines implementing instruction sets different from those of the host machine must provide a way to execute their guest instruction sets. When the guest instruction set happens to be that of another physical machine, the term “emulator” is typically used. When the instruction set is not that of a physical machine, the appellation is usually the more generic “platform VM,” everyone’s go-to example of which is probably the JVM.

My personal favorite example is the execution engine of QEMU. Considered in its entirety, QEMU is an emulator, emulating the instruction sets of physical machines that have been implemented in hardware. Its current approach to emulation, however, has two steps: guest code is first translated to an intermediate language, which is then translated to native operations by the Tiny Code Generator. The TCG’s input can be thought of as a virtual instruction set and QEMU’s execution engine as a platform VM executing this instruction set.

A Dummy VM Architecture

If you were to naively set out comparing the platform VM implementations in use in real software today, you would be faced with a pile of incomparable implementations, each targeting a different architecture with different goals. If you have an application that requires a VM, it is useful to have some a priori knowledge about the difficulties and advantages of the different implementation approaches. Toward this end, I defined a very simple architecture and implemented it seven times using different techniques.

The architecture is built around a single accumulator, a memory space, and a set of fifteen instructions, each of which may or may not take a single immediate operand, shown below with their names and functional descriptions. In the functional descriptions, “mem” represents the entire array of virtualized memory locations, “a” is the accumulator, and “pc” represents the location of the next instruction, by analogy to the program counter in a hardware architecture:

HALT Stop fetching new instructions.
LOAD a = mem[imm];
INDIRECT_LOAD a = mem[mem[imm]];
STORE mem[imm] = a;
INDIRECT_STORE mem[mem[imm]] = a;
BRANCH_ALWAYS pc = imm;
BRANCH_ZERO if (!a) pc = imm;
BRANCH_NOT_ZERO if (a) pc = imm;
MEM_ADD a += mem[imm];
MEM_SUB a -= mem[imm];
MEM_NAND a = ~(a & mem[imm]);
IMM_ADD a += imm;
IMM_SUB a -= imm;
IMM_NAND a = ~(a & imm);
IMM_LOAD a = imm;
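
To make the encoding concrete (an illustrative sketch, not necessarily the exact representation in the attached file), each instruction can be stored as a type tag plus an immediate field that some operations simply ignore:

    #include <cstdint>

    // One tag per instruction in the table above.
    enum OpType {
      HALT, LOAD, INDIRECT_LOAD, STORE, INDIRECT_STORE,
      BRANCH_ALWAYS, BRANCH_ZERO, BRANCH_NOT_ZERO,
      MEM_ADD, MEM_SUB, MEM_NAND,
      IMM_ADD, IMM_SUB, IMM_NAND, IMM_LOAD
    };

    struct Insn {
      OpType   type;
      uint32_t imm;   // Unused by HALT; an address or a constant otherwise.
    };

    // a = mem[0] + mem[1]; mem[2] = a; stop.
    const Insn example[] = {
      {LOAD, 0}, {MEM_ADD, 1}, {STORE, 2}, {HALT, 0}
    };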

In the attached file, you can see the same program expressed in each implementation: the Sieve of Eratosthenes yet again, this time operating over a much larger array.

Seven Implementations

Here I have enumerated the implementations, in increasing order of complexity:

Polymorphic
Uses C++ polymorphism as the basis of the interpreter. Each operation is a subclass of Op that provides its own execute() function. This is probably what software engineers would call the “cleanest” design.
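
A minimal sketch of the shape of this design (my reconstruction, with the VM state held in globals named as in the table above):

    #include <cstdint>
    #include <vector>

    static uint32_t a, pc, mem[1 << 20];   // accumulator, program counter, memory
    static bool running = true;

    struct Op {
      virtual ~Op() {}
      virtual void execute() = 0;
    };

    struct Halt : Op {
      void execute() override { running = false; }
    };

    struct ImmAdd : Op {
      uint32_t imm;
      explicit ImmAdd(uint32_t i) : imm(i) {}
      void execute() override { a += imm; }
    };
    // ... one subclass per remaining instruction ...

    void run(const std::vector<Op*> &program) {
      while (running) program[pc++]->execute();
    }

Every executed guest instruction pays for a virtual call here, which is the price of the clean design.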
Big Switch Statement
The Op class was made monomorphic and given a type field. This type field is the argument to a switch statement in the execute function. Perhaps easier to read than the polymorphic implementation, if less extensible. For this simple example, it benefits from the fact that each of the fifteen operations’ implementations fits on a single line in the switch statement, producing code almost as readable as the previous section’s table.
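
Sketched in the same style, reusing the OpType enum and global VM state from the sketches above:

    struct Op { OpType type; uint32_t imm; };   // monomorphic: a tag plus an immediate

    void execute(const Op &op) {
      switch (op.type) {
      case HALT:        running = false;     break;
      case IMM_ADD:     a += op.imm;         break;
      case MEM_ADD:     a += mem[op.imm];    break;
      case BRANCH_ZERO: if (!a) pc = op.imm; break;
      // ... one one-line case per remaining instruction ...
      default:          break;
      }
    }

    void run(const std::vector<Op> &program) {
      while (running) execute(program[pc++]);
    }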
Dispatch Table
Instead of using a switch statement, this version uses a table of function pointers, one for each type of operation. Because their prototypes must be identical, all of the functions take an immediate argument, even the ones which do not use it.
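
A sketch with hypothetical handler names, again reusing the global state and monomorphic Op from above:

    // Identical prototypes: every handler takes the immediate, used or not.
    static void op_halt(uint32_t)            { running = false; }
    static void op_imm_add(uint32_t imm)     { a += imm; }
    static void op_mem_add(uint32_t imm)     { a += mem[imm]; }
    static void op_branch_zero(uint32_t imm) { if (!a) pc = imm; }
    // ... one handler per remaining instruction ...

    // Indexed by OpType; entries must appear in enum order (only a few shown).
    static void (*const dispatch_table[])(uint32_t) = {
      op_halt, op_imm_add, op_mem_add, op_branch_zero /* ... */
    };

    void run(const std::vector<Op> &program) {
      while (running) {
        const Op &op = program[pc++];
        dispatch_table[op.type](op.imm);
      }
    }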
Dispatch, No Table
The “dispatch” implementation suggests an obvious improvement: the pointers to the implementation functions themselves, rather than an enum indexing a table, can be used to identify operations. This has the drawback of making Op objects difficult to identify, and the advantage of removing a layer of abstraction.
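
In sketch form, the Op now carries its handler directly:

    struct Op {
      void (*fn)(uint32_t);   // The handler pointer itself identifies the operation.
      uint32_t imm;
    };

    void run(const std::vector<Op> &program) {
      while (running) {
        const Op &op = program[pc++];
        op.fn(op.imm);
      }
    }

One less indirection per executed instruction, though, as the results below show, that does not automatically translate into a win.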
Threaded Code
The “threaded” code mentioned here has nothing to do with parallelism. This is “threaded code” in the sense of the original implementation of the B programming language at Bell Labs. Instead of direct translations of the guest instructions, this implementation fills a code buffer with a series of calls to the functions implementing the operations. As in the plain “translation” implementation below, there is no direct jumping between basic blocks; control returns to a dispatch loop at the end of each block.
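
The heart of such an implementation, sketched for x86-64 Linux (the byte sequences are genuine x86-64 encodings; emit_block is my name, and the handler table is assumed to hold the void(uint32_t) handlers from the dispatch sketch):

    #include <cstdint>
    #include <cstring>
    #include <sys/mman.h>

    struct Op { int type; uint32_t imm; };      // monomorphic op, as before
    extern void (*const handler[])(uint32_t);   // per-op handlers, as in the dispatch sketch

    // For each guest op, emit "mov edi, imm32 ; call handler" -- edi is the
    // first argument register in the SysV ABI -- then end the block with a
    // "ret" that returns control to the dispatch loop.
    uint8_t *emit_block(const Op *ops, size_t n) {
      uint8_t *buf = (uint8_t *)mmap(nullptr, 4096,
                                     PROT_READ | PROT_WRITE | PROT_EXEC,
                                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (buf == MAP_FAILED) return nullptr;
      uint8_t *p = buf;
      *p++ = 0x48; *p++ = 0x83; *p++ = 0xEC; *p++ = 0x08;   // sub rsp, 8 (ABI stack alignment)
      for (size_t i = 0; i < n; ++i) {
        *p++ = 0xBF;                                        // mov edi, imm32
        memcpy(p, &ops[i].imm, 4); p += 4;
        *p++ = 0xE8;                                        // call rel32
        int32_t rel = (int32_t)((uint8_t *)handler[ops[i].type] - (p + 4));
        memcpy(p, &rel, 4); p += 4;
      }
      *p++ = 0x48; *p++ = 0x83; *p++ = 0xC4; *p++ = 0x08;   // add rsp, 8
      *p = 0xC3;                                            // ret
      return buf;
    }

    // Branch handlers just update the guest pc; the outer loop then finds
    // (or emits) the buffer for the block at the new pc and calls it.
    void run_block(uint8_t *code) { ((void (*)())code)(); }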
Translation
The “translation” version improves on the threaded-code version by implementing the most common operations directly in x86 machine code.
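
A sketch of the per-operation emitter for two hot operations, reusing the OpType enum from the encoding sketch (the byte sequences are genuine x86-64 encodings; keeping the accumulator in eax and the base of mem[] in rbx is my assumed register assignment, and emit_call is a hypothetical fall-back emitting the mov/call pair from the threaded-code sketch):

    struct Op { OpType type; uint32_t imm; };
    uint8_t *emit_call(uint8_t *p, const Op &op);   // fall back to "mov edi, imm ; call handler"

    // eax holds the accumulator; rbx holds the base of mem[] (set up by a
    // block prologue, not shown). The call path would also have to spill
    // and reload eax around the handler call (likewise not shown).
    uint8_t *emit_op(uint8_t *p, const Op &op) {
      switch (op.type) {
      case IMM_ADD:                 // add eax, imm32
        *p++ = 0x05;
        memcpy(p, &op.imm, 4); p += 4;
        break;
      case MEM_ADD: {               // add eax, dword [rbx + imm*4]
        *p++ = 0x03; *p++ = 0x83;   // 03 /r with modrm selecting eax, [rbx]+disp32
        uint32_t disp = op.imm * 4;
        memcpy(p, &disp, 4); p += 4;
        break;
      }
      default:
        p = emit_call(p, op);
      }
      return p;
    }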
Translation With Basic Block Chaining
Basic block chaining is a technique in which, instead of returning to a separate dispatch loop at the end of each basic block, the guest’s basic blocks are linked to one another directly by jumps in the host code.
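
In sketch form, chaining amounts to back-patching: each translated block ends with a “jmp rel32” aimed at the dispatch loop, and once the successor block has been translated, the jump’s target field is overwritten to point straight at the successor’s native code (the function name is mine):

    #include <cstdint>
    #include <cstring>

    // Re-aim the 5-byte "jmp rel32" (opcode 0xE9) at jmp_insn so that it
    // lands on target; rel32 is relative to the end of the jump itself.
    void chain(uint8_t *jmp_insn, const uint8_t *target) {
      int32_t rel = (int32_t)(target - (jmp_insn + 5));
      memcpy(jmp_insn + 1, &rel, 4);
    }

Once every edge inside a hot loop has been patched this way, the loop runs entirely in translated code without ever returning to the dispatch loop, which presumably accounts for the gap between the last two rows of the results below.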

The Results

These numbers suffer, of course, from excessive precision in their reporting:

Impl.            Run Time (s)   Guest MIPS
Poly                    31.96         94.95
Switch                  26.74        113.49
Disp                    28.93        104.90
Disp, no table          34.86         87.05
Threaded                20.52        147.89
Translation             21.82        139.08
Chained                 10.02        302.86

For this simple application, at least, there is a clear winner in the translator with basic block chaining. Surprisingly, the removal of the lookup table actually hurt performance for the “dispatch”-based VM.