☮ Chad D. Kersey ♡

A Weblog

Tag: hacks

The Busyboard: My New Favorite Toy

When I was perhaps ten years old my dad pointed out to me a dingy little white box lying out on a tarp among various other pieces of junk at a flea market. For five dollars, I got my very own Pencilbox LD-1 Logic Designer, a member of a class of artifacts with which I was already somewhat familiar. A solderless breadboard is fastened down to a panel surrounded by various electronic miscellany, including perhaps most importantly an integrated power supply. I’d been putting in quite a few hours at the kitchen table on my father’s Heathkit ET-3100, an analog-oriented device of the same class, but my interests were more digital.

The Pencilbox, even more so than the Heathkit, was perfect for a certain class of play and study:

  • A sample component is installed into the breadboard. Its interface is understood by the experimenter, but she has no first-hand experience employing it in a design.
  • Its power pins are connected to the supply rails; its signal pins are connected to whichever of the available I/O options suit them, perhaps tri-state capable DIP switches, LEDs, or debounced pushbutton switches.
  • Power is applied and the experimenter simply plays with the device’s pins, asserting various inputs and observing the outputs.

The principal advantage I see in this is that, at a very low cost, the experimenter gains confidence in her understanding of the component before using it in a design. There is also some use of these for prototyping small designs consisting of multiple components, but the procedure is the same. The components are assembled, switches are flipped, and the design is interactively explored to the satisfaction of the experimenter.

Recapturing the Spirit

This was both a severely limited and intrinsically enjoyable way to approach learning about, or more often simply playing with, electronics. I recently found myself in need of an EEPROM programmer and a demonstration vehicle for libpcb, and thought it would be as good a time as any to update the concept for my present needs. This meant replacing the switches and buttons with digital I/O attached to a host computer.

I decided it was better to sacrifice speed for the number of I/O pins available by using a series of shift registers for both input and output. A final shift register was added to the output chain to provide output enable signals, allowing individual 8-bit ports to be placed in input or output mode. The final design had six such ports, limited by the 54-pin breadboard-style terminal strip that was available.

busyboard

In the finished busyboard, the top row of 6 ICs is the set of input shift registers. The bottom row is the set of main output shift registers. To the right of these is a final output shift register used to provide the output enable signals for the rest.

Building the Board

I initially coded up the board design using CHDL for the digital components and connectors and a separate handwritten netlist for the bypass capacitors and LEDs. This used the CHDL submodule feature to produce a netlist containing all of the devices as submodules. This was post-processed to produce a netlist in a more standard one-net-per-line “pin instance pin instance…” format. This worked as a proof of concept, but future board-level designs in CHDL will include some sort of additional state to include the passives in the CHDL code as well as perform the netlist generation (and simulation) within the same binary. This design was considered so simple that no simulation was performed.

If there are future revisions of the busyboard, some of this simplicity will be discarded for functionality. A microcontroller, almost certainly itself programmed using the current generation busyboard, will be added to provide some basic initialization and a better communication protocol. The current board design, in a wonderful display of anachronism for the sake of simplicity, contains both a USB connector for power and a 36-pin Centronics parallel port for data.

With a netlist I was reasonably confident in, it was time to lay out the PCB. By this time my Digi-Key order had arrived, so I could actually physically measure the components to ensure that my footprints were reasonable. At least once I verified the zoom level with a piece of Letter paper then physically pressed the Centronics-36 connector I had purchased against the screen to check the locations of the pins and screw holes.

Placement and routing were performed manually, using gerbv for periodic visual checks. This led to source code that looked largely like the following (units in inches):

  // Carry to U7                                                                
  (new track(0, 0.01))->
    add_point(5.7,0.3).add_point(6.225,0.3).
    add_point(6.35,0.175).add_point(6.6,0.175);
  (new via(point(6.6,0.175),0.06,0.035));
  (new track(1,0.01))->
    add_point(6.6,0.175).add_point(6.6,0.5);

What makes this tolerable compared to writing straight Gerber files is the ability to add higher-level constructs like device footprints and text. What makes this tolerable compared to visual editors like KiCad is the same set of things that make HDLs appealing when compared to schematic capture. The busyboard design looks like page after page of meaningless numbers, but the framework allows for generators, so classes of designs can be written instead of point solutions, and managed, tested, and developed as source code, with all of the advantages in productivity that come along with that.

PCB in gerbv

gerbv was used to manually inspect placement and routes.

Perhaps it will only be when automatic routing and generation of advanced structures like distributed element filters and differential interconnects show up that libpcb becomes a truly attractive alternative, but those types of features will have to wait for future projects.

The Host Software

The semantics of the busyboard are very simple (read, write, set input/output) and so is the API, written in C and based entirely around a single structure:

  /* Busyboard control structure. */
  struct busyboard {
    int fd; /* Parallel port file descriptor. */
    unsigned trimask; /* One bit per I/O byte, 1=out 0=Hi-Z */
    unsigned char out_state[BUSYBOARD_N_PORTS],
                  in_state[BUSYBOARD_N_PORTS];
  };

All interaction with the board is by manipulating trimask, out_state, and in_state. This allows for future revisions of the board to change the interface used between the host machine and board without the need to change host-side source code. The state of this structure is read from and written to the board with busyboard_in(struct busyboard *bb) and busyboard_out(struct busyboard *bb) respectively. These, and the init_busyboard() and close_busyboard() functions, are the whole of the API.

The initial test was a persistence-of-vision based raster display, spelling out CHDL on 8 LEDs to quickly moving eyes (or camera). This was quickly followed by interfacing with a simple 128kB SPI SRAM in an 8-pin DIP package, and of course a 32kB EEPROM, the reason this was built.

"CHDL" displayed on busyboard

Initial test: a persistence-of-vision based raster display.

I won’t spend too many words detailing all of the other devices that have been interfaced with the busyboard, but these include:

  • A 65c02 CPU, with memory contents stored on the host machine.
  • A 512kB parallel SRAM, the largest currently available in a DIP package.
  • A 2×16 character module.
  • An SPI analog-to-digital converter.

z80 in busyboard

Z80 CPU installed in busyboard.

z80 sieve screenshot

Result of Sieve of Eratosthenes run on Z80 processor installed in Busyboard.

65c02 in busyboard

65C02 CPU installed in busyboard.

busyboard 6502 sieve

Result of Sieve of Eratosthenes run on busyboard, with memory state provided by host program.

Build Your Own!

Want to hack together your own busyboard? I still have some spare PCBs; just shoot me an email. I’ll give you the unpopulated board if you promise to share what you do with it. If you’d like to play with or improve this rather simple design, the entire source (including this article) is available on GitHub:

chdl needed for the netlist generator.
libpcb used by board layout generator.
busyboard source, including netlist and board layout generator.

CHDL on FPGA

FPGAs are fun. They provide an execution environment for RTL models quick enough to rival custom hardware, at a tiny fraction of the initial investment. This makes them highly appealing to builders of prototypes, those needing to use high-frequency interfaces, and hobbyists, who build copies of historically significant computers from the mid-1980s that run slower than emulators on modern hardware because why not (http://www.bigmessowires.com/plus-too/).

So of course CHDL code can be run on FPGAs. Just not directly. The configuration formats for FPGAs are proprietary and technology mapping to FPGA lookup tables has not yet been implemented. But why bother to generate an FPGA configuration bitstream for one product by one vendor when you can generate synthesizable Verilog and let the vendor’s proprietary tools do the translation and optimization?

This is what I’ve done. The result is satisfying:

shot0001

So, of course, if you do FPGA development at all, this is something you should do as well. Here’s how:

  • Get CHDL. (https://github.com/cdkersey/chdl)
  • Create a design…
  • Compile the design with an ordinary C++ compiler.
  • Write a program that:
    • Instantiates the design.
    • Calls optimize().
    • Calls print_verilog().
  • Place this Verilog code in a file called “chdl_design.v” and import it into the FPGA toolchain of your choice.

The rest is your standard FPGA workflow, for which there are plenty of tutorials on the Internet. Among the advantages of the CHDL workflow is that it pushes the proprietary tools out to the margins of the design flow. They are still responsible for much of the optimization, technology mapping, pin assignment, and the like, but as long as they speak a simple synthesizable subset of Verilog, they become completely interchangeable.

A Bit About the Demo

The demo code is a simple VGA terminal meant to be clocked at 50 MHz, with a simple parallel interface. The character ROM is stored in human-readable format in the file FONT, from which it is converted to hex by the simple program in font2hex.cpp. Attached to this VGA controller is a text generator, which outputs characters at a human-readable rate from a ROM, whose contents are initialized based on the file TEXT, which is converted into hex in the makefile using a simple hexdump command.

The entire design uses 1691 LUTs and 29k bits of block RAM on an Altera Cyclone II. Somewhat surprising is that this is more than the total number of nodes in the design after optimization by the CHDL toolchain, but this is easily explained away by duplication for the purpose of performance, since area is plentiful (on the demo board, this is ~10% of available resources).

VM Performance vs. Implementation Complexity

Dummy VM examples, works on x86-64 Linux

I was working on a post about recent advances in CHDL and realized that there was another topic I have been meaning to cover in this venue for years: the results of a simple exploration of VM performance vs. implementation complexity. In a recent demo for the Linux Users’ Group at Georgia Tech, I explained an FPGA development toolchain using CHDL and briefly demoed a CPU implementation I had thrown together the night before, using an accumulator instruction set which happened to already have an implementation of my favorite test algorithm due to this earlier work.

Platform VMs, Briefly

Virtual machines implementing instruction sets different from those of the host machine must provide a way to execute their guest instruction sets. When the instruction set happens to be that of another physical machine, the term emulator is typically used. When the instruction set is not that of a physical machine, the appellation is usually the more generic “platform VM,” everyone’s go-to example of which is probably the JVM.

My personal favorite example is the execution engine of QEMU. Considered in its entirety it is an emulator, since it emulates the instruction sets of physical machines that have been implemented in hardware. Its current approach, however, works in two steps: guest code is translated first to an intermediate language, which the Tiny Code Generator then translates to native operations. The TCG’s input can be thought of as a virtual instruction set and QEMU’s execution engine as a platform VM executing this instruction set.

A Dummy VM Architecture

If you were to naively set out comparing platform VM implementations in use in real software today, you would be faced with a ton of incomparable implementations implementing different architectures with different goals. If you have an application that requires a VM, it is useful to have some a priori knowledge about the difficulties and advantages of different implementations. Toward this end, I defined a very simple architecture and implemented it seven times using different techniques.

The architecture is built around a memory space and a set of fifteen instructions which may or may not take a single immediate operand, shown below with their names and functional descriptions. In the functional descriptions, “mem” is used to represent the entire array of virtualized memory locations, “a” is the accumulator, and “pc” represents the location of the next instruction, by analogy to the program counter in a hardware architecture:

Name             Functional description
HALT             Stop fetching new instructions.
LOAD             a = mem[imm];
INDIRECT_LOAD    a = mem[mem[imm]];
STORE            mem[imm] = a;
INDIRECT_STORE   mem[mem[imm]] = a;
BRANCH_ALWAYS    pc = imm;
BRANCH_ZERO      if (!a) pc = imm;
BRANCH_NOT_ZERO  if (a) pc = imm;
MEM_ADD          a += mem[imm];
MEM_SUB          a -= mem[imm];
MEM_NAND         a = ~(a & mem[imm]);
IMM_ADD          a += imm;
IMM_SUB          a -= imm;
IMM_NAND         a = ~(a & imm);
IMM_LOAD         a = imm;

In the attached file, you can see the identical program expressed in each implementation, the Sieve of Eratosthenes yet again, this time operating over a much larger array.

Seven Implementations

Here I have enumerated the implementations, in increasing order of complexity:

Polymorphic
Uses C++ polymorphism as the basis of the interpreter. Each operation is a subclass of Op, having an execute() function. This is probably what software engineers would call the “cleanest” design.
Big Switch Statement
The Op class was made monomorphic and given a type field. This type field is the argument to a switch statement in the execute function. Perhaps easier to read than the polymorphic implementation, if less extensible. For this simple example it benefits from the fact that each of the fifteen possible operations’ implementations fits on a single line in the switch statement, producing code almost as readable as the previous section’s table.
Dispatch Table
Instead of using a switch statement, uses a table of function pointers, one for each type of operation. Because their prototypes must be identical, all take an immediate argument, even the ones which do not use it.
Dispatch, No Table
The “dispatch” implementation suggests the obvious improvement that the pointers to the implementation functions themselves, instead of an enum indexing a table, could be used to identify operations. This has the drawback of making Op objects difficult to identify, and the advantage of removing a layer of abstraction.
Threaded Code
The “threaded” code mentioned here has nothing to do with parallelism. This is “threaded code” in the sense of the original implementation of the B programming language at Bell Labs. Instead of direct translations of instructions, the code buffer is filled with a series of calls to functions implementing those instructions. Like the initial “translated” code, there is no direct jumping between basic blocks.
Translation
The “translation” version improves on the threaded code version by implementing the most common operations in x86 machine code directly.
Translation With Basic Block Chaining
Basic block chaining is a technique where, instead of returning to a separate loop at the end of a basic block, basic blocks in the guest are actually linked to one another by jumps in the host.

The Results

These numbers suffer, of course, from excessive precision in their reporting:

Impl.         Run Time (s)  Guest MIPS
Poly                 31.96       94.95
Switch               26.74      113.49
Disp                 28.93      104.90
Disp-notable         34.86       87.05
Threaded             20.52      147.89
Translation          21.82      139.08
Chained              10.02      302.86

For this simple application, at least, there is a clear winner in the translator with basic block chaining. Surprisingly, the removal of the lookup table actually hurt performance for the “dispatch” based VM.