☮ Chad D. Kersey ♡

A Weblog

The Busyboard: My New Favorite Toy

When I was perhaps ten years old my dad pointed out to me a dingy little white box laying out on a tarp among various other pieces of junk at a flea market. For five dollars, I got my very own Pencilbox LD-1 Logic Designer, a member of a class of artifacts with which I was already somewhat familiar. A solderless breadboard is fastened down to a panel surrounded by various electronic miscellany, including prehaps most importantly an integrated power supply. I’d been putting quite a few hours at the kitchen table on my father’s Heathkit ET-3100, an analog-oriented device of the same class, but my interests were more digital.

The Pencilbox, even moreso than the Heathkit, was perfect for a certain class of play and study:

  • A sample component is installed into the breadboard. Its interface is understood by the experimenter, but she has no first-hand experience employing it in a design.
  • Its power pins are connected to the supply rails; its signal pins are connected to the various I/O options, perhaps tri-state capable DIP switches, LEDs, and debounced pushbutton switches, available.
  • Power is applied and the experimenter simply plays with the device’s pins, asserting various inputs and observing the outputs.

The principal advantage I see in this is that, at a very low cost, the experimenter gains confidence in her understanding of the component before using it in a design. There is also some use of these for prototyping small designs consisting of multiple components, but the procedure is the same. The components are assembled, switches are flipped, and the design is interactively explored to the satisfaction of the experimenter.

Recapturing the Spirit

This was both a severely limited and intrinsically enjoyable way to approach learning about, or more often simply playing with electronics. I recently found myself in need of an EEPROM programmer and a demonstration vehicle for libpcb and thought it would be as good a time as any to attempt to update the concept for my present needs, and this meant replacing the switches and buttons with digital I/O attached to a host computer.

I decided it was better to sacrifice speed for the number of I/O pins available by using a series of shift registers for both input and output. A final shift register was added to the output chain to provide output enable signals, allowing individual 8-bit ports to be placed in input or output mode. The final design had six such ports, limited by the 54-pin breadboard-style terminal strip that was available.


In the finished busyboard, the top row of 6 ICs is the set of input shift registers. The bottom row is the set of main output shift registers. To the right of these is a final output shift register used to provide the output enable signals for the rest.

Building the Board

I initially coded up the board design using CHDL for the digital components and connectors and a separate handwritten netlist for the bypass capacitors and LEDs. This used the CHDL submodule feature to produce a netlist containing all of the devices as submodules. This was post-processed to produce a netlist in a more standard one-net-per-line “pin instance pin instance…” format. This worked as a proof of concept, but future board-level designs in CHDL will include some sort of additional state to include the passives in the CHDL code as well as perform the netlist generation (and simulation) within the same binary. This design was considered so simple that no simulation was performed.

If there are future revisions of the busyboard, some of this simplicity will be discarded for functionality. A microcontroller, almost certainly itself programmed using the current generation busyboard, will be added to provide some basic initialization and a better communication protocol. The current board design, in a wonderful display of anachronism for the sake of simplicity, contains both a USB connector for power and a 36-pin Centronics parallel port for data.

With a netlist I was reasonably confident in, it was time to lay out the PCB. By this time my Digkey order had arrived, so I could actually physically measure the components to ensure that my footprints were reasonable. At least once I verified the zoom level with a piece of Letter paper then physically pressed the Centronics-36 connector I had purchased against the screen to check the locations of the pins and screw holes.

Placement and routing was performed manually, using gerbv for periodic visual checks. This led to source code that looked largely like the following (units in inches):

  // Carry to U7                                                                
  (new track(0, 0.01))->
  (new via(point(6.6,0.175),0.06,0.035));
  (new track(1,0.01))->

What makes this tolerable compared to writing straight Gerber files is the ability to add higher-level constructs like device footprints and text. What makes this tolerable compared to visual editors like KiCAD is the same set of things that make HDLs appealing when compared to schematic capture. The busyboard design looks like page after page of meaningless numbers, but the framework allows for generators, so classes of designs can be written instead of point solutions, and managed, tested, and developed as source code, with all of the advantages in productivity that come along with that.

PCB in gerbv

gerbv was used to manually inspect placement and routes.

Perhaps, it’ll only be when automatic routing and generation of advanced structures like distributed element filters and differential interconnects show up that libpcb will be a truly attractive alternative, but those types of features will have to wait for future projects.

The Host Software

The semantics of the busyboard are very simple (read, write, set input/output) and so is the API, written in C and based entirely around a single structure:

  /* Busyboard control structure. */
  struct busyboard {
    int fd; /* Parallel port file descriptor. */
    unsigned trimask; /* One bit per I/O byte, 1=out 0=Hi-Z */
    unsigned char out_state[BUSYBOARD_N_PORTS],

All interaction with the board is by manipulating trimask, out_state, and in_state. This allows for future revisions of the board to change the interface used between the host machine and board without the need to change host-side source code. The state of this structure is read from and written to the board with busyboard_in(struct busyboard *bb) and busyboard_out(struct busyboard *bb) respectively. These, and an init_busyboard() and close_busyboard() function, are the whole of the API.

The initial test was a persistence-of-vision based raster display, spelling out CHDL on 8 LEDs to quickly moving eyes (or camera). This was quickly followed by interfacing with a simple 128kB SPI SRAM in an 8-pin DIP package, and of course a 32kB EEPROM, the reason this was built.

"CHDL" displayed on busyboard

Initial test– a persistence-of-vision based raster display.

I won’t spare too many words detailing all of the other devices that have been interfaced with the busyboard, but these include:

  • A 65c02 CPU, with memory contents stored on the host machine.
  • A 512kB parallel SRAM; the largest currently available in a DIP package.
  • A 2×16 character module.
  • An SPI analog-to-digital converter.
z80 in busyboard

Z80 CPU installed in busyboard.

z80 sieve screenshot

Result of Sieve of Eratosthenes run on Z80 processor installed in Busyboard.

65c02 in busyboard

65C02 CPU installed in busyboard.

busyboard 6502 sieve

Result of Sieve of Eratosthenes run on busyboard, with memory state provided by host program.

Build Your Own!

Want to hack together your own busyboard? I still have some spare PCBs; just shoot me an email. I’ll give you the unpopulated board if you promise to share what you do with it. If you’d like to play with or improve this rather simple design, the entire source (including this article) is available on GitHub:

chdl needed for the netlist generator.
libpcb used by board layout generator.
busyboard source, including netlist and board layout generator.

CHDL Tutorial Slides

Approximately annually, I have been giving brief but in-depth introductions to CHDL. The slide carousel I use for this is now self-contained enough that I am comfortable posting it to the Web for general consumption:

Spring 2015 CHDL Tutorial Slides

FAT32 Filesystem Recovery for Agents of Chaos

It was EverQuest night, about 9:30pm and my housemate’s desktop was beginning to stutter and seize, announcing its arrival at the first way station on its return trip to the star dust from which it was hewn. Desperate to head it off at the pass and preserve the illusion of permanence, he announced his intent to perform a memory test. Hoping to enlarge my own part in the unwinnable war against entropy, I offered to copy memtest86+ onto a USB stick.

A few minutes in and the tests seemed to be passing. This energy’s time as DRAM had not yet ended. He booted his machine once more to check his hard drives and asked, “You didn’t erase what was on that USB stick, did you?” Of course I did. I cp‘d the image right over the device node. What else would I have done? Realization bled into my consciousness. This time, entropy had made me her agent.

The solution was obvious to anyone who has stared with sufficient desire at an empty directory where there should be files, photorec. I’d just let it run and place its output into a directory for easy perusal later. Having already been an instrument of chaos, I discussed the likely results with my friend. “That’s not very useful. The filenames and directories are kind of important.”

There doesn’t seem to be a free tool in existence that extracts files from FAT32 filesystems with their names and directory hierarchy intact after the first 150 megabytes of that filesystem, along with the partition tables, have been overwritten with another. It was around 10pm by this point, and the time had come to get creative.

Within 30 minutes, I had about 75 lines in bffr.c, the Big FAT File Recoverer. Despite its hastily assembled name, it is in retrospect more of a directory entry recoverer. Based on the Wikipedia description of FAT directories, this extracted filenames, timestamps, and attributes from FAT32 directory entries. Despite the terse nature of old-school 8.3 filenames, this provided my housemate some comfort; evidence that these directory entries still existed, like the following examples from what appears to be a copy of Minecraft:

  === SECTOR 0004bb80 ===
  Candidate at 0x0097700c0: MUSIC    [DIR] attrib=10   2014-11-24
  Starts at cluster 000025fe; 0 bytes
   4d 55 53 49 43 20 20 20 20 20 20 10 08 64 2b a7
   78 45 78 45 00 00 11 bd f6 42 fe 25 00 00 00 00
  Candidate at 0x0097700e0: RECORDS  [DIR] attrib=10   2014-11-24
  Starts at cluster 00002aa4; 0 bytes
   52 45 43 4f 52 44 53 20 20 20 20 10 08 7b 2c a7
   78 45 78 45 00 00 11 bd f6 42 a4 2a 00 00 00 00
  Candidate at 0x009778040: ICON_1~1PNG attrib=20 2014-11-24
  Starts at cluster 000024f5; 3665 bytes
   49 43 4f 4e 5f 31 7e 31 50 4e 47 20 00 88 2a a7
   78 45 89 46 00 00 06 bd f6 42 f5 24 51 0e 00 00

Now 2 additional pieces of data were needed; the offset from the start of the disk of the start of the filesystem, and the filesystem cluster size. Since directories contain subdirectories as 1-cluster pseudo-files, these were fairly easy to acquire by matching up subdirectory names with their likely contents. The Minecraft music directory was an obvious choice, starting at cluster 0x25fe. Its presumed contents were spotted at 0x9ba0020, which would fit a cluster size of 32 512-byte blocks and an offset of 0x3a8000 bytes.

With this information, a strategy for recovering the directory hierarchy began to emerge, but first it had to be determined whether any of the files on the device could be recovered in full. Filesystems tend to have a linked structure. Unix-like filesystems intersperse “indirection nodes” or “i-nodes” throughout the partition, spreading out accesses across the disk. DOS filesystems had a single table, the FAT, which contained a “next cluster” pointer for each cluster in the filesystem. This had certainly been overwritten, but the directory entries provide a start cluster and a size for each file. If they had been written contiguously (a safe assumption when files aren’t appended to and the filesystem hasn’t been filled up) this would be enough.

I made bffr.c emit dd commands:

  === SECTOR 0004bfc0 ===
  Candidate at 0x0097f8000: CALM1   OGG attrib=20 2014-11-24
  Starts at cluster 000025ff; 2530812 bytes
   $ dd if=/home/chad/usr_chad/joshrec/drive.img of="CALM1.OGG"\
     bs=512 skip=318752 count=4943
   43 41 4c 4d 31 20 20 20 4f 47 47 20 18 66 2b a7
   78 45 89 46 00 00 0f bd f6 42 ff 25 fc 9d 26 00
  Candidate at 0x0097f8020: CALM2   OGG attrib=20 2014-11-24
  Starts at cluster 0000269a; 1976731 bytes
   $ dd if=/home/chad/usr_chad/joshrec/drive.img of="CALM2.OGG"\
     bs=512 skip=323712 count=3861
   43 41 4c 4d 32 20 20 20 4f 47 47 20 18 82 2b a7
   78 45 89 46 00 00 06 bd f6 42 9a 26 9b 29 1e 00

Trying a few of these out made it clear that many files remained intact. The battle raged on.

He wouldn’t settle for 8.3 filenames. The most complex part of bffr.c by far was the handling of long filenames. These appear in reverse order, 13 characters at a time, in hidden directory entries prior to the file they describe, scattered haphazardly thorugh these directory entries to carefully avoid breaking backward compatibility.

  LFN: fallsmall.ogg
  Candidate at 0x00bba0040: FALLSM~1OGG attrib=20 2014-11-24
  Starts at cluster 00002eea; 5232 bytes
   $ dd if=/home/chad/usr_chad/joshrec/drive.img \
     of="fallsmall.ogg" bs=512 skip=391808 count=11
   46 41 4c 4c 53 4d 7e 31 4f 47 47 20 00 0a 2e a7
   78 45 89 46 00 00 05 bd f6 42 ea 2e 70 14 00 00

By 1:40am it was time to switch gears. Files could be recovered, including their long file names. Directories and files and the cluster addresses at which they started. Two more output lines per entry were added to bffr.c, made to be easily filtered by grep into separate files. These represented directory entries and files, providing directory cluster number, starting cluster number, size (for files), and filename:

   @ 9458 9459 561 "READ_ME_I_AM_VERY_IMPORTANT"
   # 9458 9460 "ICONS   "
   # 9458 9471 "LANG   "
   # 9458 9726 "MUSIC   "
   # 9458 10916 "RECORDS   "
   # 9458 11954 "SOUND   "
   @ 9460 9461 3665 "icon_16x16.png"
   @ 9460 9462 5362 "icon_32x32.png"
   @ 9460 9463 114786 "minecraft.icns"
   @ 9471 9472 50026 "af_ZA.lang"
   @ 9471 9476 59508 "ar_SA.lang"
   @ 9471 9480 66536 "bg_BG.lang"

C++ was chosen for a second program which parsed this input, and assembled it into a forest of trees in memory. The language switch was made in order to take advantage of the C++ standard library. The following data structure was used to represent all files and directories, with all maps and sets indexed by cluster number and dirs indexed by filename with cluster numbers as values. s is the set of roots, initially set to all keys from f and then pruned down by removing all entries which are also contained in directories.

  typedef map<string, unsigned> dir;

  // Forest of dirs.
  set<unsigned> s;
  map<unsigned, dir*> f;
  map<unsigned, unsigned> sz;

This data structure was ultimately traversed to produce a script in the following format:

  dd if="/home/chad/usr_chad/joshrec/drive.img" \
    of="READ_ME_I_AM_VERY_IMPORTANT" bs=512 count=2 skip=310176
  mkdir "RECORDS   "
  cd "RECORDS   "
  # PULL A 591997 BYTE FILE CALLED "11.OGG" FROM 10917
  dd if="/home/chad/usr_chad/joshrec/drive.img" \
    of="11.OGG" bs=512 count=1157 skip=356832
  # PULL A 1071028 BYTE FILE CALLED "13.OGG" FROM 10954
  dd if="/home/chad/usr_chad/joshrec/drive.img" \
    of="13.OGG" bs=512 count=2092 skip=358016

This created a root directory with a lot of numerically-named directories, each containing its own part of the hierarchy. By 3am this is what I had. Thermodynamics will, inevitably, have the last laugh, but I for one will not go down without a fight.

Frame from video file found with sound card drivers in recovered filesystem.

Frame from video file found with sound card drivers in recovered filesystem.

Feel free to download the code, but keep in mind that the code itself is not a product. It was hastily assembled late at night to battle entropy, and as such should only be read by the curious, and never run.

CHDL Tutorial 1: Conway’s Game of Life

This is the second in a series of articles introducing the CHDL C++ Hardware Desgin Libary. This time around, we’ll be looking at some of the basic features of CHDL, such as vectors of bits, and some of the features that make CHDL unique, including the use of C++ template metaprogramming. To illustrate these, we will implement a popular cellular automaton and software toy, Conway’s Game of Life.

bvecs: Collections of Nodes

Before we dive into the example, let’s build a very basic circuit, a 4-bit binary counter, using CHDL bit vectors:

bvec<4> ctr;
ctr = Reg(ctr + Lit(1));

We can wrap this in a function, yielding:

bvec<4> Ctr() {
  bvec<4> c;
  c = Reg(c + Lit<4>(1));
  return c;

This demonstrates several features of the bvec type. Just like Lit(0) and Lit(1) provide literal 0 and 1 values for nodes, Lit<N>(x) provides literal 2’s-complement integer values for bvec<N>s. A register function is provided and operates in exactly the same way, creating a register for each node in the bvec. Assignment, just like with nodes, is retroactive. The initial value of c is overwritten when it is assigned with the output of a register, and it is a function of the output of that register (and not the original contents of c) that serves as the input to the same register. The addition operator (as well as subtraction, multiplication, and division) are overloaded.

Exercise: Create a program that instantiates the 4-bit counter shown here and simulates it for 16 cycles, dumping the simulation waveform to a .vcd file. View this result in GTKWave to verify the correct behavior.

Indexing bvecs

CHDL bvecs can be indexed in two ways, one safer than the other. Arbitrary integers can be used to index CHDL arrays, allowing their assignment from within loops. For example, the following loop initializes a 256×8 array with literal binary numbers from 0 through 255:

vec<256, bvec<8> > x;
for (unsigned i = 0; i < 256; ++i)
  x[i] = Lit<8>(i);

Because of the nature of integral types in C++ it is impossible to catch the following error at compile time:

bvec<2> v;
v[2] = Lit(0);

Instead, this leads to a run-time error that causes the program to abort. Such errors are easy to diagnose in a debugger like GDB.

The other way to index arrays in CHDL leads to errors that can be found at compile time and allows for ranges of values to be selected:

bvec<32> addr;
bvec<20> tag = addr[range<12,31>()];
bvec<8> idx = addr[range<4,11>()];
bvec<4> offset = addr[range<0,3>()];

In this example, a 32-bit address is divided into a 20-bit tag, 8-bit index, and a 4-bit offset for use by a cache with 16-byte blocks and 4kB sets. Because they are actually part of the type’s identity, the template arguments of range must be compile time constant. This means that the following mistake:

bvec offset = addr[range<1,3>()];

leads to a compile time error.

Multi-Dimensional Arrays

The bvec type itself is really just a convenient shorthand. An equivalent for bvec<N&gt is vec<N, node>. The actual basic template here is not bvec, but an even more general vec template that acts as an extension to the usual fixed-length C array. There is no requirement that vec types contain only CHDL types, but they are intended to be used as such. In order to create multi-dimensional arrays, vecs can contain other vecs:

vec<8, bvec > matrix;

This can then be addressed by a multiplexer:

bvec sel;
bvec byte = Mux(sel, matrix);

Conway’s Game of Life

Before we implement Conway’s Game of Life, we should probably spend some time talking about what it is. Conway’s Game of Life is best described as a cellular automaton. It is not a game by any lay usage of the word. It is instead a system; a way for state to evolve over time based on certain rules. It just happens that despite the simplicity of the rules, some very fascinating and complex patterns emerge in Life when it is initialized with random input. When it is initialized with carefully-designed input, wonderfully complex creations have been constructed, as shown here, here, and here.

The game board is a 2D grid of cells with one of two states: set or cleared. All of the cells advance to their next state at the same time and the next state of any cell depends only on the current state of the cell itself and its eight neighbors (four in the cardinal directions and four to the corners). The rules are simple: any cell with exactly three set neighbors becomes set. Any cell with less than two neighbors set is cleared. Any cell with more than three neighbors set is cleared.

Implementing Conway’s Game of Life

A Quick Review of C++ Template Functions

Most of the features of CHDL are designed to interoperate with C++ template metaprogramming. Because of the reliance on C++ features, most errors in the type semantics of CHDL designs can be caught at compile time, and the resulting object code is optimized for relatively quick design elaboration. Since our design for a universal population count module relies on C++ template functions, we will now briefly review the concepts used. A detailed introduction C++ templates can be found in the C++ FAQ here.

C++ templates give us a way to define, in header files, functions and structs that have certain parameters. These parameters can be types, integers, functions, and other templates. The vec<int N, typename T> template we’ve already seen in this tutorial is a templated type that takes an integer and a type argument and uses these to describe a vector of length N containing objects of type T. In addition to templated types like vec, there are also template functions. An example of a template function we’ve already seen is the bvec version of Reg, as used above. Its signature is:

template<unsigned N> bvec<N> Reg(const bvec<N> &d);

Another example if Lit<N>, whose signature is:

template<unsigned N> bvec<N> Lit(unsigned long val);

Because it is easy to infer the length of the register from its parameter, it is not necessary to specify a length for Reg(). For Lit(), the length of the return type cannot be inferred from the arguments, so a template parameter is needed. This is why, in the counter example, the expression for the value of the counter is Reg(c + Lit<4>(1)).

A useful feature of templates is template specialization. If, say, I had a really well-optimzied implementation of a function like Reg that only worked for bvecs of length 8, I could declare it as:

template<> bvec Reg(const bvec &d);

This could live alongside my default implementation, only being used when N is 8.

constexpr Functions and CLOG2

Since C++11, C++ has offered a feature that allows certain simple functions to be evaluated at compile time with the use of the constexpr keyword. The results are constant as long as the inpts remain constant. This ensures type safety and succinct error messages.

In CHDL, constexpr functions are provided as utility functions for common arithmetic performed when evaluating the dimensions of hardware designs: especially the base-2 logarithm, with both the integer floor and integer ceiling operator provided.

The CHDL CLOG2() constexpr function provides a convenient way to evaluate the number of bits needed to uniquely identify N elements. The signature of the Mux function seen above, for example, is:

  T Mux(const bvec<CLOG2(N)> &sel, const vec<N, T> &in);

In previous versions of CHDL, CLOG2 was implemented as a preprocessor macro. In addition to providing no type checking, this lengthy bit of combinational logic was expanded in error messages by G++.

Population Count

An operation that comes up occasionally in computing; frequently enough, in fact that it should probably be in the CHDL standard library but currently is not, is the following: How many bits in a given word are set? This is known as population count, and is popular enough in real software that it is often featured as a processor instruction. (It is POPCNT on x86 and VCNT on ARM.)

The population count operation needed to implement Conway’s Game of Life could be implemented in a very rigid, fixed way, supporting exactly the 8 requisite neighboring bits, no more, no less. In the following code, Zext provides a zero extension operation. It pads the upper bits of a word with zeroes until it is the length of Zext‘s template argument:

bvec<4> PopCount(bvec<8> in) {
  vec<4, bvec<2> > count2;
  for (unsigned i = 0; i < 4; ++i)
    count2[i] = Zext<2>(in[2*i]) + Zext<2>(in[2*i + 1]);
  vec<2, bvec<3> > count4;
  for (unsigned i = 0; i < 2; ++i)
    count4[i] = Zext<3>(count2[2*i]) + Zext<3>(count2[2*i + 1]);

  return Zext<4>(count4[0]) + Zext<4>(count4[1]);

This code is ugly: It’s verbose, and it solves a problem so specific it will almost never happen. This is code that yearns for a recursive solution. The same basic pattern repeats itself three times, in a borderline violation of the DRY principle. If we tried to simply implement PopCount as a recursive function, we would run into a problem: the size of the input and output bvecs would remain the same, leading to quite a bit of wasted space. It is still possible to do:

bvec<4> PopCount(bvec<8> in, unsigned level = 0) {
  if (level < 3) {
    bvec<4> a, b;
    Cat(a, b) = in;
    return PopCount(Zext<8>(a), level+1) + PopCount(Zext<8>(b), level+1);
  } else {
    return in;

This is perhaps a little better, and not as bad as it seems; all of those additional zeroes will ultimately be reclaimed by the optimizer. Still, unnecessary zero extensions abound, and the level parameter is shoehorned in to provide a base case. If we didn’t care about type safety and wanted to use the C++ STL vector instead of CHDL vec, we could write (assuming we had proper operator overloads for arithmetic available):

vector<node> PopCount(vector<node> in) {
  if (in.size() == 1) return in;
  else {
    // Split input into two vectors
    vector<node> a(in.size()/2), b(in.size() - a.size());
    for (unsigned i = 0; i < in.size(); ++i)
      if (i < a.size()) a[i] = in[i];
      else b[i - a.size()] = in[i];

    // Get population count of each vector.
    return PopCount(a) + PopCount(b);

This is also ungainly in its own way; C++ vectors were not designed to be easily divisible into subvectors so we use loops to manually copy the input bit-by-bit.

Exercise: Why might we avoid using the insert() operator on a C++ vector of CHDL nodes? Hint: insert() will use the assignment operator to move the contents of the vector over by one position to make room for the new element.

None of the preceding examples represent the preferred way of implementing population count in CHDL. They represent designs with their own trade-offs that may be used as called for by this situation or that, but which are wholly unnecessary for an operation as basic as population count.

Recursively-defined functions in CHDL are typically implemented using a template with the base case represented by a specialization:

template<unsigned N> bvec<CLOG2(N+1)> PopCount(bvec<N> x) {
  return Zext<CLOG2(N+1)>(PopCount(x[range<0,N/2-1>()])) +

template<> bvec<1> PopCount<1>(bvec<1> x) { return x; }

This is the CHDL way. This implementation uses the same algorithm as the previous three examples and ultimately reduces to the same hardware, but it is general, succinct, and type safe.

Next State

For computing the next state of a cell, we will use a feature of CHDL we have not discussed before, the conditional assignment or Cassign() function. Cassign() provides an unusual syntax for solving the problem of computing future state, as shown in the following example:

template<unsigned long X> node Pulse() {
  const unsigned N(CLOG2(X));
  bvec<N> next_ctr, ctr(Reg(next_ctr));
  node p(ctr == Lit<N>(X-1));

    IF(p, Lit<N>(0)).
    ELSE(ctr + Lit<N>(1));

  return p;

This function emits a 1-cycle-long pulse once every X cycles. The Cassign() function is used to determine the next value of the counter, with the output p acting like a reset signal. Assuming that a population count, count, for all neighboring cells is available, finding the next state next_alive for a given cell with current value alive is trivial:

  IF(count < Lit<4>(2), Lit(0)).
  IF(count > Lit<4>(3), Lit(0)).
  IF(count == Lit<4>(3), Lit(1)).

Using this in combination with our previously-defined PopCount() function, we can create our function for a single cell:

node LifeCell(bvec<8> neighbors, bool init = false) {
  bvec<4> count(PopCount(neighbors));
  node next_alive, alive(Reg(next_alive, init));

    IF(count < Lit<4>(2), Lit(0)).
    IF(count > Lit<4>(3), Lit(0)).
    IF(count == Lit<4>(3), Lit(1)).

  return alive;

The init argument to this function represents the initial state of the cell. The Reg function has a second argument with a default value of 0 representing the initial value of the registers. This is true for both Reg(bvec<N>) and, as shown here, Reg(node).

Getting to Know Your Neighbors

Conway’s Game of Life can be thought of as a simple stencil operation. The same function is applied to every element of an array and used to determine that element’s next value. The only inputs used to determine each element’s value come from the array and have the same shape at each point, translated by that point’s position. For Conway’s game of life, the shape of the input stencil is that of a node and its neighbors. In the following image, the red element’s next value depends on its own value and those of all of the blue elements. The highlighted region can be shifted throughout the region and the operation can be repeated:


This is a very straightforward computation but with one major caveat: what do we do at the edges and corners? The sensible options are to assume that every cell beyond the edge has the value 0 or to assume that the edges wrap around, giving the board a toroidal topology. To accomplish this, we create a function called Get which performs the indexing operation:

template <bool T, unsigned X, unsigned Y>
  node Get(vec<Y, bvec<X> > &g, int i, int j)
  if (T) {
    while (i < 0) i += X;
    while (j < 0) j += Y;
    while (i >= X) i -= X;
    while (j >= Y) j -= Y;

  if (i < 0 || i >= X || j < 0 || j >= Y) return Lit(0);
  else return g[i][j];

By making whether the space is toroidal or surrounded by zeroes a template parameter T, we make it possible to refer to one version or the other of Get through a function pointer in our Neighbors function. We define our function pointer G to avoid having to repeatedly call Get<T> for each of the eight neighboring points.

template <bool T, unsigned X, unsigned Y>
  bvec<8> Neighbors(vec<Y, bvec<X> > &g, unsigned i, unsigned j)
  auto G(Get<T, X, Y>);

  return bvec{
    G(g, i-1, j-1), G(g, i-1,   j), G(g, i-1, j+1), G(g,   i, j-1),
    G(g,   i, j+1), G(g, i+1, j-1), G(g, i+1,   j), G(g, i+1, j+1)

This introduces another not-previously-mentioned-in-these-tutorials feature, the list constructor for CHDL vecs. It allows a comma-separated list of nodes to be used to initialize a bvec, or for that matter any comma-separated list of objects of type T to be used to initialize a vec<T, N>.


Once we have our Neighbors function and our LifeCell function, we can implement the entire grid of cells. Once again, we punt on the decision of how the space should be layed out, by making it a bool template parameter T:

template <bool T, unsigned X, unsigned Y>
  vec<Y, bvec<X> > LifeGrid(bool *init)
  vec<Y, bvec<X> > g;

  for (unsigned i = 0; i < Y; ++i)
    for (unsigned j = 0; j < X; ++j)
      g[i][j] = LifeCell(Neighbors<T>(g, i, j), init[j * X + i]);

  return g;

In this function, init is a pointer to an X*Y-element array of bools representing the initial state of the grid. The 2D array of nodes g is declared and then assigned point-by-point with LifeCell functions. The return value of this function is this same array.

A main function using this is simple:

int main() {
  const unsigned X(16), Y(16);
  bool init[X*Y];

  for (unsigned i = 0; i < X*Y; ++i) init[i] = (rand()%4 == 0);

  vec<Y, bvec<X> > g = LifeGrid<1, X, Y>(init);



  ofstream vcd("life.vcd");
  run(vcd, 1000);

  return 0;

This function initializes our initial state array to random values and calls LifeGrid(). In this case, the toroidal cell space is used, but with a 0 instead of a 1 as the first template argument to LifeGrid(), we could use a board with no active cells beyond the boundaries instead.

Improved Output Using Egress

It would be nice if we could, instead of just calling run() and looking at the output file, see the results of this simulation spatially, in the terminal. In general, it would be nice if simulated hardware had flexible I/O so that it could interface with our C++ code. CHDL provides several ways to do this, the most common of which are the Ingress() and Egress() interfaces for getting data into and out of the CHDL simulator. Let’s add #include <chdl/egress.h> to the top of our source file and replace our call to run() with the folliowing:

bool val[Y][X];
for (unsigned i = 0; i < Y; ++i)
  for (unsigned j = 0; j < X; ++j)
    Egress(val[i][j], g[i][j]);

for (unsigned cyc = 0; cyc < 10000; ++cyc) {
  for (unsigned i = 0; i < Y; ++i) {
    for (unsigned j = 0; j < X; ++j) {
      cout << (val[i][j]?"[]":"  ");
    cout << endl;

The val array is now set, at the beginning of each cycle, with the values of each of the nodes on our board.
The node states can now be viewed directly. Try using the less command in a 17-line-tall terminal and holding spacebar to view the output of this program. The patterns seem far more interesting when a toroidal space is used, as in the following screenshot:


CHDL Internals 0: node, nodeimpl, and tickable

As a companion to the CHDL tutorial series, I would like to expose some of the details of the CHDL design in a series of in-depth technical articles. The intended audience of the tutorials consists of those wanting to use CHDL for their own hardware designs. The internals articles is for those wishing to extend, alter, or simply understand CHDL at a lower level.

There are three basic datatypes upon which all of the rest of CHDL is built:

  1. The nodeimpl, handling the simulation and output behavior of
  2. the node, the user-facing interface for manipulating nodes, and
  3. the tickable, representing objects which react to clock signals.

In this article, we will focus on these types and some of the more obvious types built around them, leaving the higher-level details for next week. Because the codebase we are discussing is a constantly-changing work in progress, I will not cite data structures and functions by line number, instead referring to them by name and file.

node and nodeimpl

In a CHDL design, you type:

node x(!y);

What actually happens? Let’s trace the function calls. First, we get to operator!(node y) in gateops.h. This is merely a wrapper for Inv(y). Inv(y) is declared in gates.h and defined in gates.cpp. All of the combinational logic functions in CHDL are implemented as combinations of (for now; there’s some desire to move to the industry standard And/Inverter Graph format) nand gates and inverters. The Inv() function, therefore, represents one of these fundamental gate types. Its implementation, then, will hopefully shed some light on the implementation of CHDL nodes. The meat of the implementation of Inv() in gates.cpp is:

node r((new invimpl(in))->id);

This seems odd. We’re creating this thing called an “invimpl” by allocating it on the heap, but despite the fact that we’re allocating it here, we’re not keeping the pointer here. Instead, we’re simply taking this “id” value and passing it to the node constructor. This reveals something about the true nature of node objects: they don’t so much represent nodes themselves as much as identify a nodeimpl. Why call them “node” and not “noderef” or similar? This is entirely for the benefit of users of the CHDL API. To the hardware implementer they certainly do represent nodes.

The nodes[] Vector

So, why aren’t we keeping a pointer to our node and what’s this “id” field about? The astute reader will have read the title of this section and understood. Pointers to all nodeimpl objects are stored in a global C++ vector<nodeimpl> called nodes, declared in nodeimpl.h. Indices into this vector have their own type, too, nodeid_t, an integer type typedef found in node.h. What’s the difference between nodeid_t and node, then? Consider the following:

node x = Foo(); // x has node id 100
node y = Reg(x); // y has node id 101
node z = Bar(); // z has node id 200
x = z; // now x and z both have node id 200

So, will y be the time-shifted-by-one-cycle output of Foo() or Bar()? Remember from the tutorial and the documentation that node assignments are transitive. All of the previous things given the value of x will now have its new value. This is useful because it means that CHDL designs can rewrite themselves. This is how optimization is implemented. It is also useful simply because hardware designs are not acyclic. Combinational logic may be, but as soon as registers are involved it will sometimes be necessary to use values before they are defined.

This behavior is implemented with a data structure called the node directory or node_dir, local to nodeimpl.cpp. The node directory keeps track of the node objects associated with each node ID. The node constructor adds a node to this directory and the node destructor removes it. The node assignment operator consults the node directory and updates all of the nodes to point to the new node ID. This means that node assignments necessarily, by design, completely remove all users of a given node ID. This is why the dead node elimination optimization is vital.

Finally, we have the issue of node sources. If I create an Inv(x), once this function has returned, a new invimpl object will have been created with x as the source. How do we store the source in such a way that, if x is overwritten later, the inverter’s source will be updated to match. We simply make the src field of nodeimpl a std::vector. This leads to the expected result without introducing a cyclic type dependency; remember, nodes don’t directly point to nodeimpls. They only contain indices.


We’ve dealt with the problem of how nodes are created, linked together, and evaluated, but this only allows us to create structures which evaluate Boolean functions, not systems which evolve over time. CHDL was not designed to evaluate low-level designs including implementations of sequential logic at the gate level, including stateful asynchronous logic. Cycles in the node graph lead to undefined behavior in both simulation and optimization, and can be detected with the cycdet() function in chdl/analysis.h.

The tickable type is completely unrelated to node and provides an orthogonal feature to node. It is only later, when we get to the register; the most useful instance of a tickable, that these two concepts are merged. In general, the node interface provides a mechanism for implementing arbitrary Boolean functions that may depend on other nodes or other parts of the program, and the tickable interface provides a mechanism for implementing time stepped simulations. The combination of these two behaviors leads to the CHDL simulation model.

The class tickable itself is a virtual class that defines four member functions:


During simulation, each of these is executed once per clock cycle. During each clock cycle, first all of the pre_tick() functions are called, then all of the tick() functions, etc. This somewhat ungainly state of affairs arose because the original simulator demanded both a tick() and tock() function, so that register state would not be updated until all register inputs had been read. Subsequent additions to CHDL required yet more priority levels to handle the updating of values both prior to and following the execution of user code when the simulation was being advanced one step at a time.

tickable is not a pure virtual class. It can itself be instantiated, although there is no practical reason to do so. Any combination of the four tick functions can be overridden or omitted from a tickable subclass. Most override only one or two.


The example from Tutorial 0:

node x;
x = Reg(!x);

makes use of two types of node: the register node and the inverter node. The implementation of the inverter node type, invimpl, was explained in the previous section. So what about x itself, a register node. Its node type is regimpl, a type which inherits from both nodeimpl and tickable. Its tick() function reads the state of its input (a variable unique to regimpl, q, not src[0]; remember, the node graph made using src should not have any cycles). The tock() function writes the value read from q to the output, another bool variable that is returned by the eval() function.

What’s Next?

With only nandimpl, invimpl, and regimpl, a world of practical and useful digital logic circuits could be constructed and simulated. A ton of software would have to be written to make this practical. This software, of course, has already been written and represents the rest of CHDL and its companion library the CHDL STL. There has been some serious thought about moving the non-fundamental software out of CHDL and into a separate library, but for the sake of the user, this would only make sense once CHDL has stabilized and packages are provided for major distributions.

CHDL Tutorial 0: Das Blinkenlights

CHDL and its associated standard library has grown by leaps and bounds over the past year, thanks in part to a decision by my employers to adopt it as a research vehicle. CHDL code now looks quite different than it did a year ago. There is a conditional assignment system that simplifies the creation of state machines. The chdl/statemachine.h API has been deprecated, there is now basic support for buses and tri-state drivers, and simulation speed is soon to receive a huge boost.

Because CHDL development has arrived at a point where I want to actively encourage others to try it, use it, and hopefully contribute to it, I have decided to immediately begin releasing tutorial and technical information related to CHDL in particular and digital electronics deisn in general through this web site. By doing this I hope to educate visitors interested in digital electronics, promote CHDL as a platform for hardware development, and thoroughly document the CHDL interfaces and internals.

This first set of articles, which I hope to make available in rotation with at least two other sets, is devoted entirely to the use of CHDL; written to an audience expected to consist primarily of amateur electical engineers, but with nothing that an experienced professional or engineering student would not be expected to understand. Throughout this tutorial I have included exercises for the reader. Most of these will require the consultation of other references to complete. The truly interested reader is likely to ditch these and do something they find fun instead, but I think of them as milestones. If you think you would be unable to complete an exercise, reading past it may be unwise.

Getting CHDL

It is important to note that CHDL is released under the GNU LGPL. This means that any derivative of CHDL must remain under this license, but does not extend to software linking (statically or dynamically) against the library. This means that, if you want to use CHDL on a commercial hardware project, binary simulation applications and libraries can be released under whatever license you feel like. This will satisfy all of the electrical engineers in your company with a list of patents on their CVs.

Previous articles included some instructions for getting and installing CHDL, but for the sake of completeness, slightly more detailed instructions will be included here. It is expected that anybody reading this is running a relatively recent Linux or Mac OS and has some level of comfort obtaining and using software. The only requirements are a recent G++ or Clang compiler that supports C++ 11 and GNU Make, but a waveform viewer program like GTKWave would be useful too, for viewing output. A large number of engineers primarily use Windows. CHDL will likely compile with a little coaxing in CygWin, which also provides a reasonable development environment.

CHDL is a library, and using it well requires some knowledge of how libraries work on your platform of choice. For Linux, there are many places to start, but see this guide for a tutorial.

Exercise: Write a library in C++ that contains a single function, “say_hi” that prints “Hello, world!” to the screen. Compile and link it as a shared object and write a program, in a separate directory, that calls this function once and exits. If you feel you can do this, you are certainly well-prepared to install and use CHDL!

Downloading the Source

The most recent CHDL source can be obtained from GitHub, either by using the Git client:

$ git clone https://github.com/cdkersey/chdl.git

or as a .zip archive.

Building CHDL

In the CHDL source directory, type make. This will invoke GNU Make, which will read the Makefile and automatically build the source files and then link them into libchdl.so. This is the CHDL library, ready to use. If make errors out, please send a full bug report to chad@cdkersey.com.

Installing CHDL (optional)

To make CHDL available system-wide, run sudo make install. This is not strictly necessary, but if /usr/local/lib is in your library search path, linking against an installed CHDL becomes somewhat more straightforward for any user on the system.

Building and Running the Tests

CHDL includes both a set of tests and a set of example designs. As a first step, just to make sure everything has built correctly, run some tests. They provide an automated way to verify that CHDL is running correctly.

chdl$ cd test
test$ make
g++ -std=c++11 -O2 test_adder.cpp test_common.h -lchdl -o test_adder
g++ -std=c++11 -O2 test_mult.cpp test_common.h -lchdl -o test_mult
g++ -std=c++11 -O2 test_logic.cpp test_common.h -lchdl -o test_logic
g++ -std=c++11 -O2 test_shift.cpp test_common.h -lchdl -o test_shift
test$ LD_LIBRARY_PATH=../ make run
./test_adder > test_adder.run 2> test_adder.opt
./test_mult > test_mult.run 2> test_mult.opt
./test_logic > test_logic.run 2> test_logic.opt
./test_shift > test_shift.run 2> test_shift.opt

This last command will take a few seconds but should complete with no errors.

Exercise: Build and install CHDL on your machine, and run all of the tests.

Creating a CHDL Project

Now that we have verified that CHDL is working, it’s time to build something! We will start with the simplest sequential circuit imaginable: a 1-bit toggle that alternates values on each clock cycle. This will allow us to focus our attention on the fundamentals of CHDL without being distracted by the complexity of the design.

Starting a Project

CHDL designs are just C++ programs, so feel free to use whatever environment you normally use for editing and compiling C++ and take the following as a detailed step-by-step suggestion.

We start by creating a new directory for our project. Somewhere in the filesystem:

$ mkdir 0_blinkenlights
$ cd 0_blinkenlights

We then create our code file and a Makefile:

0_blinkenlights$ touch Makefile blinkenlights.cpp

Makefile should be initialized with the following text:

CXXFLAGS ?= -std=c++11
LDLIBS ?= -lchdl
blinkenlights: blinkenlights.cpp
        rm -f blinkenlights

Remember that in makefiles, tab characters have syntactic meaning, so be sure to indent using tabs instead of spaces wherever indentation is needed.

Now initialize blinkenlights.cpp with:

#include <iostream>
#include <fstream>
#include <chdl/chdl.h>
using namespace std;
using namespace chdl;
int main() {
  return 0;

This is the most basic, empty C++ program using CHDL. It includes the proper header files, brings the CHDL namespace into file scope and then does nothing. Before you build or run the test program, make sure the CHDL library is in both your library and header search path.

A test build/run should complete successfully but do nothing interesting:

0_blinkenlights$ make
g++ -std=c++11 blinkenlights.cpp -lchdl -o blinkenlights
0_blinkenlights$ ./blinkenlights

Building and Simulating a Simple Design

Now that we have our boilerplate code in place, it’s time to build something! To the top of the main() function, add the following code:

node x;
x = Reg(!x);
ofstream vcd("blinkenlights.vcd");
run(vcd, 100);

The first three statements create a signal called x and connect it to the output of a D flip-flop. The input of this D flip-flop is the complement of its output. This is our strobe. This signal is then tapped, making it visible as simulation output. The next pair of statements create an output file called “blinkenlights.vcd” and then run a simulation. This file can be viewed in GTKWave or another waveform viewer.

Compile and run the program again. There will again be no output to the console, but instead of doing nothing, the program will generate an output file called “blinkenlights.vcd”.

0_blinkenlights$ make
g++ -std=c++11 blinkenlights.cpp -lchdl -o blinkenlights
0_blinkenlights$ ./blinkenlights
0_blinkenlights$ ls
blinkenlights blinkenlights.cpp blinkenlights.vcd Makefile
0_blinkenlights$ gtkwave blinkenlights.vcd

Unless otherwise specified, the register is initialized to 0, leading to the following waveform:

This circuit can be thought of as a frequency divider dividing the (implicit, global) clock signal’s frequency by two. It is easy to start imagining a multitude of circuits that could be built and simulated in a similar fashion, but the reader is cautioned to read on first to get a better understanding of what this code does.

An Analysis of the Example

Only three lines of the example program can be thought of as an actual “design”:

node x;
x = Reg(!x);

Let’s examine these lines in depth. (If you would like even more depth, consider reading the companion CHDL Internals articles.) The first line declares a variable named x of type node. This is similar to a node in the circuit sense. It is a Boolean logic signal that can take on only the values ‘0’ or ‘1’. It has a default value, and the value in a node can be overwritten.

The second statement overwrites the initial value of the node x with the output of a register. It also highlights a few important things about CHDL syntax and style. Remember that, in C++, arguments are evaluated before the function is called, and the function is called before its result is assigned. So how can !x mean “not this register’s output” when this register’s output has not been assigned yet. The reason for this is simple. Assignments to nodes are effective globally. If two objects of type node refer to the same signal, when the source of that signal changes, it changes for all node objects that pointed to the same signal. After all, they represent the same node in the circuit!

The second statement also demonstrates a few other things. All of the usual C++ logical and equality operators (||, &&, ==, !=, !) have been overloaded to work with CHDL nodes, creating a notation for implementing combinational logic. Also, it points out a characteristic of function names in CHDL. Uppercase initial characters (usually followed by camelCase) are reserved for functions that implement hardware. In this example, Reg() is a function that takes a node argument and returns a node result. Finally, it demonstrates an important fact about sequential logic in CHDL. Clock signals are implicit. All registers have an implied clock input that is connected to a global clock signal.

The third line of the design simply illustrates the easiest way to make signals visible in simulation. Signals are non-visible by default and must be explicitly exposed as taps. The TAP(x) preprocessor macro creates a tap with the same name as the signal passed to it. Because this is used to name the signal, the argument should only be a C++ identifier. If you would like to create taps on expressions, either give the expression a name first or rename it by using the tap() function instead:

tap("new_name_for_x", x);

This also offers an example of the capitalization scheme. Because tap(), like run() represents a utility function and not a piece of hardware, it is given a lowercase name. Compare to the function version of the ! operator, Inv() in the following alternate way to write the second line:

x = Reg(Inv(x));

Because a introduction to CHDL would be criminally incomplete without mentioning literals, let’s examine a third way to implement this:

x = Reg(Xor(x, Lit(1)));

Lit(1) creates a node that has the fixed value of 1, and Lit(0) does the same thing for 0. (The Xor() function here was chosen instead of its equivalent != operator for the sake of clarity.)

Those are the basics. There’s quite a bit you can do with only this information, and in fact this can be used as the basis for a quite complex simulation system. The only limitation being that taps are single bit signals. All of the extra functionality provided to the designer in CHDL could be implemented using only these basic elements, and indeed, moving it to a library separate from the CHDL core has been considered.

Exercise: Create a second node, y, that blinks half as fast as x.

Netlists and Optimization

Simulation is useful, but it would be nice to be able to fabricate these designs or at least load them into an FPGA. In order to do this there has to be a way to dump the design out of CHDL. Indeed there is. Add the following lines somewhere after the design and before the return statement in your main function:

ofstream netl(“blinkenlights.nand”);

When you build and run your program again, you will receive the following output in a file called “blinkenlights.nand”:

  x 2
  litX 0
  inv 2 1
  reg 1 2

Because it was tapped, x is listed as a 1-bit output. Each line in the design section corresponds to a single gate. Its inputs are listed first followed by its output, as node indices. This one should make sense for the most part. The register’s output is inverted and fed back into the register’s input. The output is also tapped as x. But what’s this litX here on node 0. Its value is unused. If you guess that it must be the leftover initial value of x before it was reassigned, you would be correct. But it’s not used. Shouldn’t we be able to get rid of it?

Of course we should, and we can. A set of simple optimizations is built into CHDL to handle cases of unused (dead) nodes, duplicated logic, and simple constant folding. To invoke it, call the optimize() function with no arguments. If we add the following line:


just after the last line of our design, we get the following output (on stderr) when we run the program:

 Before optimization: 3
 After dead node elimination: 2
 After contraction: 2
 After combining literals: 2
 After redundant expression elimination: 2
 After tri-state merge: 2

The numbers represent the number of nodes in the circuit after each optimization step. Because of the simplicity of our design, only the single dead node elimination was necessary. Our netlist is now a little more compact:

  x 1
  inv 1 0
  reg 0 1

Verilog Output

This ad-hoc netlist format, while easy enough to parse by software, is of no use to us if we would like to actually, say, run this design on an FPGA. For that we are going to need to use CHDL’s Verilog output feature. Somewhere after the call to optimize() and before return, add the following lines to your file:

ofstream verilog("blinkenlights.v");
print_verilog("blink", verilog);

The first argument to print_verilog() is the name of the Verilog module being output, as a string. Compile, run, and inspect the output. Nothing too terribly interesting, all of the nodes are wires or registers whose names start with __x. Something is missing, though. If you look at the top of the Verilog output file, you will notice that the module has a single input called phi for the clock signal and no output. Instead, x is made an internal signal so that it can be easily viewed in simulation. How do we create an output, then? Luckily, there’s a macro for that. Simply replace TAP(x) with OUTPUT(x). The variable x is now a module output. Success!

Moving Forward

This tutorial has hopefully provided a starting point for anybody interested in using CHDL to develop hardware. We’re still a long way from building floating point units and microprocessors, but perhaps not as far away as the reader expects. Next time, we will use a more complex design to explore CHDL’s other hardware design features. Until then, there is plenty to try!

Exercise: Get CHDL-generated Verilog to compile as a top-level design in your FPGA toolchain of choice.

Exercise: Build a simple ripple-carry adder that takes two arrays of nodes as input and returns an array of nodes as output.

Exercise: Use the adder from the previous exercise to build a binary counter.


FPGAs are fun. They provide an execution environment for RTL models quick enough to rival custom hardware, at a tiny fraction of the initial investment. This makes them highly appealing to builders of prototypes, those needing to use high-frequency interfaces, and hobbyists, who build copies of historically significant computers from the mid 1980’s that run slower than emulators on modern hardware because why not (http://www.bigmessowires.com/plus-too/).

So of course CHDL code can be run on FPGAs. Just not directly. The configuration formats for FPGAs are proprietary and technology mapping to FPGA lookup tables has not yet been implemented. But why bother to generate an FPGA configuration bitstream for one product by one vendor when you can generate synthesizable Verilog and let the vendor’s proprietary tools do the translation and optimization?

This is what I’ve done. The result is satisfying:


So, of course, if you do FPGA development at all, this is something you should do as well. Here’s how:

  • Get CHDL. (https://github.com/cdkersey/chdl)
  • Create a design…
  • Compile the design with an ordinary C++ compiler.
  • Write a program that:
    • Instantiates the design.
    • Calls optimize().
    • Calls print_verilog().
  • Place this Verilog code in a file called “chdl_design.v” and import it into the FPGA toolchain of your choice.

The rest is your standard FPGA workflow, for which there are plenty of tutorials on the Internet. Among the advantages of the CHDL workflow is that it pushes the proprietary tools out to the margins of the design flow. They are still responsible for much of the optimization, technology mapping, pin assignment, and the like, but as long as they speak a simple synthesizable subset of Verilog, they become completely interchangeable.

A Bit About the Demo

The demo code is a simple VGA terminal meant to be clocked at 50 MHz, with a simple parallel interface. The character ROM is stored in human-readable format in the file FONT, from which it is converted to hex by the simple program in font2hex.cpp. Attached to this VGA controller is a text generator, which outputs characters at a human-readable rate from a ROM, whose contents are initialized based on the file TEXT, which is converted into hex in the makefile using a simple hexdump command.

The entire design uses 1691 LUTs and 29k bits of block RAM on an Altera Cyclone II. Somewhat surprising is that this is more than the total number of nodes in the design after optimization by the CHDL toolchain, but this is easily explained away by duplication for the purpose of performance, since area is plentiful (on the demo board, this is ~10% of available resources).

VM Performance vs. Implementation Complexity

Dummy VM examples, works on x86-64 Linux

I was working on a post about recent advances in CHDL and realized that there was another topic I have been meaning to cover in this venue for years, the results of a simple exporation of VM performance vs. implementation complexity. In a recent demo for the Linux Users’ Group at Georgia Tech, I explained an FPGA development toolchain using CHDL and briefely demoed a CPU implementation I had thrown together last night using an accumulator instruction set which happened to already have an implementation of my favoite test algorithm due to this earlier work.

Platform VMs, Briefly

Virtual machines implementing instruction sets different from those of the host machine must provide a way to execute their guest instruction sets. When the instruction set happens to be that of another physical machine, the term emulator is typically used. When the instruction set is not that of a physical machine, the appelation is usually the more generic “platform VM,” everyones’ go-to example of which is probably the JVM.

My personal favorite example is the execution engine of QEMU. While it is, considered in its entirety, an emulator, emulating the instruction sets of some physical machines that have been implemented in hardware, its current approach to emulation uses a two-step approach to emulation, translating first to an intermediate language which is then translated to native operations by the Tiny Code Generator. The TCG’s input can be thought of as a virtual instruction set and QEMU’s execution engine as a platform VM executing this instruction set.

A Dummy VM Architecture

If you were to naively set out comparing platform VMs implementations in use in real software today, you would be faced with a ton of incomparable implementations implementing different architectures with different goals. If you have an application that requires a VM, it is useful to have some a priori knowledge about the difficulties and advantages of different implementations. Toward this end, I defined a very simple architecture and implemented it seven times using different techniques.

The architecture is built around a memory space and a set of fifteen instructions which may or may not take a single immediate operand, shown below with their names and functional descriptions. In the functional descriptions, “mem” is used to represent the entire array of virtualized memory locations, “a” is the accumulator, and “pc” represents the location of the next instruction, by analogy to the program counter in a hardware architecture:

HALT Stop fetching new instructions.
LOAD a = mem[imm];
INDIRECT_LOAD a = mem[mem[imm]];
STORE mem[imm] = a;
INDIRECT_STORE mem[mem[imm]] = a;
BRANCH_ZERO if (!a) pc = imm;
BRANCH_NOT_ZERO if (a) pc = imm;
MEM_ADD a += mem[imm];
MEM_SUB a -= mem[imm];
MEM_NAND a = ~(a & mem[imm]);
IMM_ADD a += imm;
IMM_SUB a -= imm;
IMM_NAND a = ~(a & imm);
IMM_LOAD a = imm;

In the attached file, you can see the identical program expressed in each implementation, the Sieve of Eratosthenes yet again, this time operating over a much larger array.

Seven Implementations

Here I have enumerated the implementations, in increasing order of complexity:

Uses C++ polymorphism as the basis of the interpreter. Each operation is a subclass of Op, having an execute() function. This is probably what software engineers would call the “cleanest” design.
Big Switch Statement
The Op class was made monomorphic and given a type field. This type field is the argument to a switch statement in the execute function. Perhaps easier to read than the polymorphic implementation, if less extensible. For this simple example it is benefitted by the fact each of the fifteen possible operations’ implementations fits on a single line in the switch statement, producing code almost as readable as the previous Section’s table.
Dispatch Table
Instead of using a switch statement, uses a table of function pointers, one for each type of operation. Because their prototypes must be identical, all take an immediate argument, even the noes which do not use it.
Dispatch, No Table
The “dispatch” implementation suggests the obvious improvement that the pointers to the implementation functions themselves, instead of an enum indexing a table, could be used to identify operations. This has the drawback of making Op objects difficult to identify, and the advantage of removing a layer of abstraction.
Threaded Code
The “threaded” code mentioned here has nothing to do with parallelsim. This is “threaded code” in the sense of the original implementation of the B programming language at Bell Labs. This fills a code buffer with, instead of direct translations of instructions, a series of calls to functions
implementing the functions. Like the initial “translated” code, there is no direct jumping between basic blocks.
The “translation” version improves on the threaded code version by implementing the most common operations in x86 machine code directly.
Translation With Basic Block Chaining
Basic block chaining is a technique where, instead returning to a separate loop at the end of a basic block, basic blocks in the guest are actually
linked to one another by jumps in the host.

The Results

These numbers suffer, of course, from excessive precision in their reporting:

Impl. Run Time(s) Guest MIPS
Poly 31.96 94.95
Switch 26.74 113.49
Disp 28.93 104.90
Disp-notable 34.86 87.05
Threaded 20.52 147.89
Translation 21.82 139.08
Chained 10.02 302.86

For this simple application, at least, there is a clear winner in the translator with basic block chaining. Surprisingly, the removal of the lookup table actually hurt performance for the “dispatch” based VM.

Running the CHDL CPU Example

CHDL example 6, a simulated MIPS-esque CPU with a 5-stage pipeline running a Sieve of Eratosthenes program finally works. This example can be loaded and simulated and will surely provide an important test case for future modifications to CHDL. For those of you interested in seeing CHDL run for yourself, I will provide a (very) brief tutorial on building and running the examples.

Step 1: Getting CHDL

CHDL is hosted on github, which (among other things) means that it must be accessed using a git client, like so:

  $ git clone git://github.com/cdkersey/chdl.git

Step 2: Building CHDL

Once you have downloaded CHDL, building it should just be a matter of running make in the root source directory. There are a lot of compilation units, so if you’re on a multicore machine don’t forget to use the -j option to speed up the build.

  $ cd chdl
  $ make -j 8

Step 3: Building and Running the Examples

Once you have built the core CHDL library, you can build the examples in the examples/ directory.

  $ cd examples
  $ make -j 8

Once this finishes, you will have a set of files called example[i].vcd, for i in 1 through 7. These are waveform files, containing the state of every node (or bit vector) tapped with the “TAP()” macro, and viewable in waveform viewers like the free gtkwave. If you do not have a waveform viewer installed, go ahead and obtain gtkwave.

  $ sudo aptitude install gtkwave

If you’re not on Debian, and/or do not have aptitude installed, you’re beyond help.

Step 4: Viewing the Waveforms

The waveform files can then be viewed with gtkwave:

  $ gtkwave example6.vcd

The TAP()s from the source code are listed along the left side of the window and can be dragged to the viewing area.


In the above figure, the fetch program counter is being viewed in “analog” mode, showing the path taken through the program memory over time. Note that after the program is finished executing, the processor enters a tight loop. The output of the program is the series of prime numbers in register 12 (a good test case because it’s a simple program whose results are hard to produce accidentally). Happy hacking!

Announcing CHDL

Strange things happen as a consequence of the lack of freedom in the hardware world. Consider the case of the man who made a CPU out of discrete transistors because he was uncomfortable with FPGA vendor lock-in. (http://www.6502.org/users/dieter/mt15/mt15.htm) I do not pretend to understand the EDA community enough to make any claims about the tools they do or do not have, open source or otherwise. I have experienced a lack, but it may just be the rarified air of the field when compared to software-oriented disciplines. Other tools exist. I have not found the ones I have encountered particularly well-suited to my personal needs, so I have built another.

CHDL (call it “the” CHDL at your peril) is two things: a C++-based hardware description language and a C++ hardware design library. The former fills a perceived need for radical generality and simplicity in gate-level design specification and the latter fills a need for a free software toolchain for realizing these designs.

CHDL (the language) can be used to specify abstract digital logic designs with uncommonly terse syntax. Designs specified in this way can then be subject to optimizations and simulated directly or written out to netlists. It is the processing of these netlists that is the domain of CHDL (the library) and the related utilities. The netlist files may be translated to other HDLs (like ones supported by FPGA vendor toolchains), translated to C and simulated more quickly, or technology mapped and physically implemented.

What does CHDL code look like? Here is a a nontrivial design from the standard library: a Kogge-Stone adder:

 template  bvec Adder(bvec a, bvec b, node cin = Lit(0)) {
    vector<bvec<N+1>> g(log2(N)+3), p(log2(N)+3), i(log2(N)+3);
    bvec s;

    g[0][0] = cin;
    p[0][0] = Lit(0);

    for (unsigned j = 0; j < N; ++j) p[0][j+1] = Xor(a[j], b[j]);
    for (unsigned j = 0; j < N; ++j) g[0][j+1] = a[j] && b[j];

    unsigned k;
    for (k = 0; (1l<<k) < 2*N; ++k) {
      for (unsigned j = 0; j < N+1; ++j) {
        if (j < (1l<<l)) {
          g[k+1][j] = g[k][j];
          p[k+1][j] = p[k][j];
        } else {
          i[k+1][j] = p[k][j] && g[k][j - (1l<<k)];
          g[k+1][j] = i[k+1][j] || g[k][j];
          p[k+1][j] = p[k][j] && p[k][j - (1l<<k)];

    for (unsigned j = 0; j < N; ++j) s[j] = Xor(p[0][j+1], g[k][j]);
    return s;

This template function, when called, instantiates an adder of the given size. It will instantiate one of these adders each time it is called. Note that all of the loops are run at design instantiation time. The function’s goal is to create some gates. Once it has returned, those gates occupy some global state, where they can then be simulated, optimized, or written out to a file.

I have placed the very-much-in-development CHDL on GitHub (https://github.com/cdkersey/chdl) along with the hope that I am not alone in my ambition, and that likeminded individuals will find value in these manic machinations.