Dummy VM -- an experiment in clean C++ virtual machine APIs as part of the
             ongoing planning for a finalized QSim.

Features from Dummy VM will ultimately be incorporated into the micro-operation
processing stack of QSim/CEMU.

The program (Sieve of Eratosthenes):
#define i mem[0]
#define j mem[1]
#define END 1<<30
for (i = 2; i < 1024; i++) mem[i] = i;
i = 1; // So the scan will catch 2. mem[1] will be ignored.
while (1) {
  // Scan to next prime
  do {i++; if (i == 1024) goto done; } while (mem[i] == 0);
  // i is now prime. Zero all of its multiples.
  for (j = 2*i; j < 1024; j+=i) mem[j] = 0;
}
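
As a host-side sanity check, the guest program above transcribes directly to
C++ under the stated memory layout (i in mem[0], j in mem[1], the table in
mem[2..1023]); this is a reference transcription, not part of Dummy VM itself:

```cpp
#include <cstdint>
#include <vector>

// Direct transcription of the guest program: after it runs, mem[p] == p for
// every prime p in 2..1023 and mem[c] == 0 for every composite c.
std::vector<uint32_t> run_sieve() {
    std::vector<uint32_t> mem(1024, 0);
    uint32_t &i = mem[0], &j = mem[1];     // i and j live inside mem itself
    for (i = 2; i < 1024; i++) mem[i] = i;
    i = 1;                                  // So the scan will catch 2.
    for (;;) {
        // Scan to next prime; stop at the end of the table.
        do { i++; if (i == 1024) return mem; } while (mem[i] == 0);
        // i is now prime. Zero all of its multiples.
        for (j = 2 * i; j < 1024; j += i) mem[j] = 0;
    }
}
```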

Summary of the experiments (best throughput / line count):
  I    - 146 MIPS 196 Lines
  II   - 225 MIPS 108 Lines
  III  - 124 MIPS 114 Lines
  IIIa - 142 MIPS 103 Lines
  IV   - 187 MIPS 166 Lines
  V    - 336 MIPS 226 Lines
  Va   - 631 MIPS 242 Lines


The experiments in detail:

  I - Interpretation of operations using polymorphism and a base Op class.
 
      Much more "clean OO" -- makes writing algorithms to process collections
      of instructions easier.

      1. With successor map:     6MIPS (~1/200 of 1GIPS)
      2. With successor field:  40MIPS (~1/25  of 1GIPS)
      3. " -O3:                130MIPS (~1/8   of 1GIPS) 

      194 (non comment-or-blank) lines of code. 
      Cleanest design, but classes for each instruction type need a lot of 
      boilerplate.
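
The general shape of such a design might look like the following sketch. The
op names and State layout are hypothetical, not taken from the Dummy VM
source; it illustrates the base-class-plus-virtual-execute style with a
successor field (variant 2 above):

```cpp
#include <cstdint>

// Machine state shared by all operations.
struct State { uint32_t mem[1024] = {}; bool running = true; };

// Base class: one virtual call per executed instruction.
struct Op {
    Op *next = nullptr;                 // successor field, not a successor map
    virtual Op *execute(State &s) = 0;  // returns the op to run next
    virtual ~Op() = default;
};

struct Const : Op {                     // mem[dst] = value
    unsigned dst; uint32_t value;
    Const(unsigned d, uint32_t v) : dst(d), value(v) {}
    Op *execute(State &s) override { s.mem[dst] = value; return next; }
};

struct Add : Op {                       // mem[dst] = mem[a] + mem[b]
    unsigned dst, a, b;
    Add(unsigned d, unsigned x, unsigned y) : dst(d), a(x), b(y) {}
    Op *execute(State &s) override {
        s.mem[dst] = s.mem[a] + s.mem[b];
        return next;
    }
};

struct Halt : Op {
    Op *execute(State &s) override { s.running = false; return nullptr; }
};

uint32_t run(Op *pc, State &s) {
    while (s.running && pc) pc = pc->execute(s);
    return s.mem[0];
}
```

The per-class constructor and override boilerplate visible even in this tiny
sketch is exactly the cost the note above refers to.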

  II - Interpretation of operations using an enum type and a big case statement.

      Simpler design. Possibly less powerful algorithmically, but the code is
      very compact and the in-memory representation is far more forgiving.

      1. With successor field:        40MIPS (~1/25 of 1GIPS)
      2. With successor computation:  43MIPS (~1/25 of 1GIPS)
      3. " -O3:                      180MIPS (~1/5  of 1GIPS)

      106 (non comment-or-blank) lines of code.
      Second shortest implementation and probably the easiest to grok, though 
      the design makes less use of language features than I and so would not be
      a good choice for an LLVM-style compiler infrastructure that needs to be
      able to handle arbitrary collections of operations and perform generic
      compilation algorithms on them.
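
A sketch of the enum-and-switch style (hypothetical op set, not the actual
Dummy VM encoding), here with the successor computed from pc rather than
stored:

```cpp
#include <cstdint>
#include <vector>

enum OpType { CONST, ADD, JNZ, HALT };

// One plain struct covers every operation; unused fields are simply ignored.
struct Op { OpType type; unsigned dst, a, b; uint32_t imm; };

uint32_t run(const std::vector<Op> &prog) {
    uint32_t mem[1024] = {};
    uint32_t pc = 0;
    while (pc < prog.size()) {
        const Op &op = prog[pc];
        switch (op.type) {
        case CONST: mem[op.dst] = op.imm;                pc++; break;
        case ADD:   mem[op.dst] = mem[op.a] + mem[op.b]; pc++; break;
        case JNZ:   pc = mem[op.a] ? op.imm : pc + 1;          break;
        case HALT:  return mem[0];
        }
    }
    return mem[0];
}
```

Because every op is the same POD struct, programs are just flat vectors,
which is what makes the in-memory representation so forgiving.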

  III - Dispatch-to-function-pointer-in-execute.
      1. -O3: 110MIPS (~1/8 of 1GIPS)
      
      112 (non comment-or-blank) lines of code.
      Almost as succinct as the big case statement; requires extra lines for
      the dispatch table. 

      It would be possible to use function pointers instead of enum values for
      the operation type, thereby eliminating the need for a dispatch table.
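
A sketch of the dispatch-table style (hypothetical names; the real Dummy VM
table is larger): the enum indexes an array of handler functions, so execute
is one indirect call:

```cpp
#include <cstdint>
#include <vector>

struct VM; struct Op;
typedef void (*Handler)(VM &, const Op &);

enum OpType { CONST, ADD, HALT };
struct Op { OpType type; unsigned dst, a, b; uint32_t imm; };
struct VM { uint32_t mem[1024] = {}; uint32_t pc = 0; bool running = true; };

static void do_const(VM &vm, const Op &op) { vm.mem[op.dst] = op.imm; vm.pc++; }
static void do_add(VM &vm, const Op &op) {
    vm.mem[op.dst] = vm.mem[op.a] + vm.mem[op.b];
    vm.pc++;
}
static void do_halt(VM &vm, const Op &) { vm.running = false; }

// The extra lines this design costs: one table entry per OpType, kept in
// enum order.
static const Handler dispatch[] = { do_const, do_add, do_halt };

uint32_t run(const std::vector<Op> &prog) {
    VM vm;
    while (vm.running) dispatch[prog[vm.pc].type](vm, prog[vm.pc]);
    return vm.mem[0];
}
```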

  IIIa - Dispatch without table
      1. -O3: 120MIPS (~1/8 of 1GIPS)

      101 (non comment-or-blank) lines of code.
      Shortest implementation. Eliminates the execute function from the op
      class and makes the operation type field a function pointer to the
      operation's implementation. Because these function names are global they
      are just as useful as enum values in uniquely identifying operations
      while also eliminating a lookup (resulting in a slight speedup).
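
Concretely, the change from III might look like this sketch (hypothetical
names): the type field is now the handler pointer itself, so the dispatch
table and the lookup disappear:

```cpp
#include <cstdint>
#include <vector>

struct VM; struct Op;
typedef void (*Handler)(VM &, const Op &);

// The "type" field IS the implementation; a global function address
// identifies an operation as uniquely as an enum value would.
struct Op { Handler type; unsigned dst, a, b; uint32_t imm; };
struct VM { uint32_t mem[1024] = {}; uint32_t pc = 0; bool running = true; };

void op_const(VM &vm, const Op &op) { vm.mem[op.dst] = op.imm; vm.pc++; }
void op_add(VM &vm, const Op &op) {
    vm.mem[op.dst] = vm.mem[op.a] + vm.mem[op.b];
    vm.pc++;
}
void op_halt(VM &vm, const Op &) { vm.running = false; }

uint32_t run(const std::vector<Op> &prog) {
    VM vm;
    while (vm.running) prog[vm.pc].type(vm, prog[vm.pc]);  // no table lookup
    return vm.mem[0];
}
```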

  IV - Direct-threaded code.

      Pain in the ass to debug and maintain; probably only just a bit less
      awful than direct binary translation. When you accidentally save and
      restore a value to different registers, no one's there to point out
      your mistake.

      1. Returning to loop after each basic block (no opt): 84MIPS
      2. " -O3:                                            310MIPS

      164 (non comment-or-blank) lines of code.
      Takes III (dispatch with a table) and translates the dispatch to a series
      of call instructions, thereby increasing performance by almost a factor
      of three.
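
The core idea can be sketched with the GNU labels-as-values extension
(GCC/Clang only; this is an illustration of the technique, not the actual
Dummy VM translator, which chains real call instructions): each decoded op
stores the address of its handler, and every handler jumps directly to the
next op's handler with no central dispatch loop.

```cpp
#include <cstdint>
#include <vector>

// label initially holds a small opcode index (0=const, 1=add, 2=halt);
// translation patches it to the handler's address.
struct Op { void *label; unsigned dst, a, b; uint32_t imm; };

uint32_t run_threaded(std::vector<Op> &prog) {
    static void *labels[] = { &&op_const, &&op_add, &&op_halt };
    for (Op &op : prog) op.label = labels[(uintptr_t)op.label];  // "translate"

    uint32_t mem[1024] = {};
    const Op *pc = prog.data();
    goto *pc->label;                       // enter the threaded code
op_const:
    mem[pc->dst] = pc->imm;
    pc++; goto *pc->label;                 // fall straight into the next op
op_add:
    mem[pc->dst] = mem[pc->a] + mem[pc->b];
    pc++; goto *pc->label;
op_halt:
    return mem[0];
}
```

Note there is no bounds or opcode checking anywhere, which is also why a
mistake in this style fails silently instead of being caught in a dispatch
loop.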

  V - Binary translation.

      Even more of a pain than direct threading; only an overall 1.12x speedup
      without basic block chaining (and with only 6 operations implemented 
      with translation).

      1. Returning to loop after each basic block: 110MIPS
      2. " -O3:                                    350MIPS

      224 (non comment-or-blank) lines of code.
      Adds to IV inline translations of some of the functions. The code is
      unfortunately bulky because x86-64's pc-relative addressing mode is
      useless when the code lives in an mmap'd region more than 2^31 bytes
      away from the data structures being modified. Reserving a register to
      use as the accumulator may help, but that approach does not extend to
      simulating machines with large register counts.
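
The mechanism in miniature (x86-64 Linux assumed; a toy emitter, not the
Dummy VM translator): write machine code bytes into an executable mapping
and call them. The emitted sequence here, mov eax, imm32; ret, touches no
memory, so it deliberately sidesteps the pc-relative problem described
above; real translated ops that load and store guest state are where the
2^31 limit bites.

```cpp
#include <cstdint>
#include <cstring>
#include <sys/mman.h>

typedef uint32_t (*Fn)();

// Emit "mov eax, imm32; ret" (B8 id C3, imm32 little-endian) into a fresh
// read/write/execute page and return it as a callable function.
// The page is intentionally leaked; a real translator manages a code cache.
Fn emit_const(uint32_t imm) {
    uint8_t code[6] = { 0xB8, 0, 0, 0, 0, 0xC3 };
    std::memcpy(code + 1, &imm, 4);
    void *buf = mmap(nullptr, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return nullptr;
    std::memcpy(buf, code, sizeof code);
    return (Fn)buf;
}
```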


Options III (or IIIa) through V are related; a VM designed as III would be easy
to convert (by adding a little extra code) to IV, which would then be easy to
convert to V. This inherent open-endedness in performance optimization makes
the dispatch-table approach a sensible starting point.

I->V is also a reasonable jump to make, since each operation class could have a
"generate code" virtual member function (this is probably what LLVM does). The
lack of an easily implementable threaded-code option along that path, due to
the not-easily-threadable nature of virtual member functions, makes I a less
attractive starting point for a simulator front end.