It was EverQuest night, about 9:30pm and my housemate’s desktop was beginning to stutter and seize, announcing its arrival at the first way station on its return trip to the star dust from which it was hewn. Desperate to head it off at the pass and preserve the illusion of permanence, he announced his intent to perform a memory test. Hoping to enlarge my own part in the unwinnable war against entropy, I offered to copy memtest86+ onto a USB stick.
A few minutes in and the tests seemed to be passing. This energy’s time as DRAM had not yet ended. He booted his machine once more to check his hard drives and asked, “You didn’t erase what was on that USB stick, did you?” Of course I did. I cp‘d the image right over the device node. What else would I have done? Realization bled into my consciousness. This time, entropy had made me her agent.
The solution was obvious to anyone who has stared with sufficient desire at an empty directory where there should be files, photorec. I’d just let it run and place its output into a directory for easy perusal later. Having already been an instrument of chaos, I discussed the likely results with my friend. “That’s not very useful. The filenames and directories are kind of important.”
There doesn’t seem to be a free tool in existence that extracts files from FAT32 filesystems with their names and directory hierarchy intact after the first 150 megabytes of that filesystem, along with the partition tables, have been overwritten with another. It was around 10pm by this point, and the time had come to get creative.
Within 30 minutes, I had about 75 lines in bffr.c, the Big FAT File Recoverer. Despite its hastily assembled name, it is in retrospect more of a directory entry recoverer. Based on the Wikipedia description of FAT directories, this extracted filenames, timestamps, and attributes from FAT32 directory entries. Despite the terse nature of old-school 8.3 filenames, this provided my housemate some comfort; evidence that these directory entries still existed, like the following examples from what appears to be a copy of Minecraft:
=== SECTOR 0004bb80 === Candidate at 0x0097700c0: MUSIC [DIR] attrib=10 2014-11-24 Starts at cluster 000025fe; 0 bytes 4d 55 53 49 43 20 20 20 20 20 20 10 08 64 2b a7 78 45 78 45 00 00 11 bd f6 42 fe 25 00 00 00 00 Candidate at 0x0097700e0: RECORDS [DIR] attrib=10 2014-11-24 Starts at cluster 00002aa4; 0 bytes 52 45 43 4f 52 44 53 20 20 20 20 10 08 7b 2c a7 78 45 78 45 00 00 11 bd f6 42 a4 2a 00 00 00 00 Candidate at 0x009778040: ICON_1~1PNG attrib=20 2014-11-24 Starts at cluster 000024f5; 3665 bytes 49 43 4f 4e 5f 31 7e 31 50 4e 47 20 00 88 2a a7 78 45 89 46 00 00 06 bd f6 42 f5 24 51 0e 00 00
Now 2 additional pieces of data were needed; the offset from the start of the disk of the start of the filesystem, and the filesystem cluster size. Since directories contain subdirectories as 1-cluster pseudo-files, these were fairly easy to acquire by matching up subdirectory names with their likely contents. The Minecraft music directory was an obvious choice, starting at cluster 0x25fe. Its presumed contents were spotted at 0x9ba0020, which would fit a cluster size of 32 512-byte blocks and an offset of 0x3a8000 bytes.
With this information, a strategy for recovering the directory hierarchy began to emerge, but first it had to be determined whether any of the files on the device could be recovered in full. Filesystems tend to have a linked structure. Unix-like filesystems intersperse “indirection nodes” or “i-nodes” throughout the partition, spreading out accesses across the disk. DOS filesystems had a single table, the FAT, which contained a “next cluster” pointer for each cluster in the filesystem. This had certainly been overwritten, but the directory entries provide a start cluster and a size for each file. If they had been written contiguously (a safe assumption when files aren’t appended to and the filesystem hasn’t been filled up) this would be enough.
I made bffr.c emit dd commands:
=== SECTOR 0004bfc0 === Candidate at 0x0097f8000: CALM1 OGG attrib=20 2014-11-24 Starts at cluster 000025ff; 2530812 bytes $ dd if=/home/chad/usr_chad/joshrec/drive.img of="CALM1.OGG"\ bs=512 skip=318752 count=4943 43 41 4c 4d 31 20 20 20 4f 47 47 20 18 66 2b a7 78 45 89 46 00 00 0f bd f6 42 ff 25 fc 9d 26 00 Candidate at 0x0097f8020: CALM2 OGG attrib=20 2014-11-24 Starts at cluster 0000269a; 1976731 bytes $ dd if=/home/chad/usr_chad/joshrec/drive.img of="CALM2.OGG"\ bs=512 skip=323712 count=3861 43 41 4c 4d 32 20 20 20 4f 47 47 20 18 82 2b a7 78 45 89 46 00 00 06 bd f6 42 9a 26 9b 29 1e 00
Trying a few of these out made it clear that many files remained intact. The battle raged on.
He wouldn’t settle for 8.3 filenames. The most complex part of bffr.c by far was the handling of long filenames. These appear in reverse order, 13 characters at a time, in hidden directory entries prior to the file they describe, scattered haphazardly thorugh these directory entries to carefully avoid breaking backward compatibility.
LFN: fallsmall.ogg Candidate at 0x00bba0040: FALLSM~1OGG attrib=20 2014-11-24 Starts at cluster 00002eea; 5232 bytes $ dd if=/home/chad/usr_chad/joshrec/drive.img \ of="fallsmall.ogg" bs=512 skip=391808 count=11 46 41 4c 4c 53 4d 7e 31 4f 47 47 20 00 0a 2e a7 78 45 89 46 00 00 05 bd f6 42 ea 2e 70 14 00 00
By 1:40am it was time to switch gears. Files could be recovered, including their long file names. Directories and files and the cluster addresses at which they started. Two more output lines per entry were added to bffr.c, made to be easily filtered by grep into separate files. These represented directory entries and files, providing directory cluster number, starting cluster number, size (for files), and filename:
@ 9458 9459 561 "READ_ME_I_AM_VERY_IMPORTANT" # 9458 9460 "ICONS " # 9458 9471 "LANG " # 9458 9726 "MUSIC " # 9458 10916 "RECORDS " # 9458 11954 "SOUND " @ 9460 9461 3665 "icon_16x16.png" @ 9460 9462 5362 "icon_32x32.png" @ 9460 9463 114786 "minecraft.icns" @ 9471 9472 50026 "af_ZA.lang" @ 9471 9476 59508 "ar_SA.lang" @ 9471 9480 66536 "bg_BG.lang"
C++ was chosen for a second program which parsed this input, and assembled it into a forest of trees in memory. The language switch was made in order to take advantage of the C++ standard library. The following data structure was used to represent all files and directories, with all maps and sets indexed by cluster number and dirs indexed by filename with cluster numbers as values. s is the set of roots, initially set to all keys from f and then pruned down by removing all entries which are also contained in directories.
typedef map<string, unsigned> dir; // Forest of dirs. set<unsigned> s; map<unsigned, dir*> f; map<unsigned, unsigned> sz;
This data structure was ultimately traversed to produce a script in the following format:
dd if="/home/chad/usr_chad/joshrec/drive.img" \ of="READ_ME_I_AM_VERY_IMPORTANT" bs=512 count=2 skip=310176 mkdir "RECORDS " cd "RECORDS " # PULL A 591997 BYTE FILE CALLED "11.OGG" FROM 10917 dd if="/home/chad/usr_chad/joshrec/drive.img" \ of="11.OGG" bs=512 count=1157 skip=356832 # PULL A 1071028 BYTE FILE CALLED "13.OGG" FROM 10954 dd if="/home/chad/usr_chad/joshrec/drive.img" \ of="13.OGG" bs=512 count=2092 skip=358016
This created a root directory with a lot of numerically-named directories, each containing its own part of the hierarchy. By 3am this is what I had. Thermodynamics will, inevitably, have the last laugh, but I for one will not go down without a fight.
Feel free to download the code, but keep in mind that the code itself is not a product. It was hastily assembled late at night to battle entropy, and as such should only be read by the curious, and never run.