P.Mean: Dealing with a large text file that crashes your computer (created 2010-04-02)

At a meeting, a colleague was describing a text file that he had received that had crashed his system. No way, I thought, could a simple text file crash your system. I offered to investigate and he was right. The text file crashed my system too, and repeatedly. Here's what I did to figure out how a simple text file could crash your computer.

I got the file, which was 2,185,579 kilobytes, on a portable USB drive. First I tried to open it in R using the read.table function. That crashed my system. I held down the on-off button to reset my computer. I re-ran read.table with the argument nrows=100. Surely the first hundred rows of any text file should be manageable. My system crashed again. Time for another reset.

Next I tried to read the file into a text editor that I use (not Notepad, of course, as that text editor has serious limitations on file size). System crash and a third reset.

Okay, now it's time to look for help on the Internet. I went to www.nonags.com, a website that offers free software without the typical shareware restrictions (no limited time window and no reduced feature set). I looked for text editors that specifically said that they could handle large text files. I tried two of these and neither one worked, though these did not crash my system. They just couldn't display the file.

At this point I had to start theorizing. Maybe the text file wasn't a text file, but rather a binary file that was mislabeled with a .txt extension. I went back to nonags and downloaded a hex editor, a program that allows you to view and modify binary files. This also ended in failure.
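A hex editor is not the only way to test the "mislabeled binary file" theory. Here is a rough sketch in Python (not a tool I used at the time) that reads only the first few hundred bytes, so the size of the rest of the file does not matter. The function names `peek`, `looks_like_text`, and `hex_dump` are made up for illustration, and the "printable ASCII" test is only a heuristic.

```python
import string


def peek(path, n_bytes=256):
    """Return the first n_bytes of a file without reading the rest."""
    with open(path, "rb") as f:
        return f.read(n_bytes)


def looks_like_text(sample):
    """Heuristic: True if every byte is printable ASCII or common whitespace."""
    allowed = set(string.printable.encode("ascii"))
    return all(b in allowed for b in sample)


def hex_dump(sample, width=16):
    """Format bytes as rows of offset, hex values, and printable characters."""
    rows = []
    for offset in range(0, len(sample), width):
        row = sample[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in row).ljust(width * 3 - 1)
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in row)
        rows.append(f"{offset:08x}  {hex_part}  {text_part}")
    return "\n".join(rows)
```

Calling `print(hex_dump(peek("bigfile.txt")))` (the file name is a placeholder) would show whether the start of the file is readable text or binary data.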

So what else could be the problem? Well, the size of the file is roughly two gigabytes and perhaps that's the size of a file where ordinary software tends to break down just like the laws of physics break down near a black hole. Back to nonags to download a file splitter.

I split the file into 100 pieces and used my regular text editor to look at the second piece (not the first piece, in case there was a "poison pill" at the start of the file that made most programs gag). The file opened! I could see a structure that looked somewhat rational.
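For anyone without a file splitter handy, splitting by size can be sketched in a few lines of Python. This is not the program I downloaded, just an illustration of the idea. The function name `split_into_pieces` and the `.part001` naming scheme are assumptions. It copies a fixed-size buffer at a time, so it never needs to hold the whole file (or a whole line) in memory.

```python
import os


def split_into_pieces(path, n_pieces=100, buffer_size=1024 * 1024):
    """Split a file into about n_pieces files of roughly equal byte size.

    Output names (path + ".part001", ...) are a hypothetical convention.
    At most buffer_size bytes are held in memory at any time.
    """
    total = os.path.getsize(path)
    piece_size = max(1, -(-total // n_pieces))  # ceiling division
    out_paths = []
    with open(path, "rb") as src:
        while True:
            first = src.read(min(buffer_size, piece_size))
            if not first:
                break  # nothing left to copy
            out_path = f"{path}.part{len(out_paths) + 1:03d}"
            with open(out_path, "wb") as dst:
                dst.write(first)
                remaining = piece_size - len(first)
                while remaining > 0:
                    chunk = src.read(min(buffer_size, remaining))
                    if not chunk:
                        break
                    dst.write(chunk)
                    remaining -= len(chunk)
            out_paths.append(out_path)
    return out_paths
```

Splitting by byte count cuts wherever the boundary happens to fall, so a piece can begin or end in the middle of a field. That is fine for a first look at the structure.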

5941|W0935500936|W|||08/10/2007|08/10/2007|0620|08/10/2007|1| $56.00|LEVEL 3 ESTAB PATIENT|88102|

This excerpt uses fake numbers but has the same general structure as the file. There were individual data elements, each separated by the vertical bar character, |. The dates and the monetary values were two things that helped me see this.

This was progress. Most files use a comma to delimit fields but this one used a vertical bar. The other interesting thing was that there was only one line in this file, one very very long line. Then I noticed a semicolon that appeared every so often, always three or four fields away from what looked like a name field.

|*S-SIMON|M|WHITE|MARRIED|OPR;28248|W09355000095

This excerpt is also fake but with the same general structure.
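I spotted the vertical bars and semicolons by eye, but a simple count of candidate delimiter characters would tell the same story. The Python sketch below is an illustration under my own assumptions (the function name `count_bytes` and the list of candidate characters are made up). It scans the file a buffer at a time, so a file with no line breaks is not a problem.

```python
def count_bytes(path, targets=b"\n;|,\t", buffer_size=1024 * 1024):
    """Count how often each candidate delimiter byte occurs in a file."""
    counts = {bytes([t]): 0 for t in targets}
    with open(path, "rb") as f:
        while True:
            chunk = f.read(buffer_size)
            if not chunk:
                break
            for key in counts:
                counts[key] += chunk.count(key)
    return counts
```

A count of zero for the newline byte alongside large counts for `;` and `|` would point to a one-line file that uses semicolons between records and vertical bars between fields.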

There were also a lot of repeating elements, as this file seemed to have multiple records per patient. So the S-SIMON and other variables that were probably medical record numbers and such seemed to appear over and over again at regular intervals.

Time to look at the first file in the split. It was just like the others, but it started with something that looked like variable names. So this looked not too far from a simple text file. There were three big differences:

this file had semicolons where most files might have line breaks,
this file had vertical bars where most files might have commas,
and this file was extremely large.
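The first two differences are easy to handle once you know about them. As a minimal sketch in Python, assuming a piece of the file small enough to hold in memory, you can treat the semicolon as the record separator and the vertical bar as the field separator. The sample string and its variable names are invented for illustration and are not taken from the real file.

```python
def parse_records(text, record_sep=";", field_sep="|"):
    """Split text into records, then split each record into fields."""
    return [record.split(field_sep)
            for record in text.split(record_sep)
            if record]  # skip the empty string after a trailing separator


# Invented sample: one header record followed by two data records.
sample = "MRN|NAME|SEX|STATUS;28248|S-SIMON|M|MARRIED;28249|J-DOE|F|SINGLE;"

records = parse_records(sample)
header, rows = records[0], records[1:]
```

This only works for a piece that fits in memory, which is why the third difference, the size, is the real obstacle.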

The file splitting program had an option of splitting after a specific number of occurrences of a particular character. So I decided to split the file after each millionth semicolon. This created 7 files, but each individual file was still too big to handle. This time, though, my text editor produced a revealing message: it offered to split the very long line into shorter lines.

Eureka! The problem wasn't the size of the files, it was the size of the lines. I clicked on the CANCEL button because now I knew exactly what happened and what to do. The file was a single line. In the original file it was about 2.1 billion characters long. In each of the files split into 100 pieces it was about 21 million characters long. In each of the files split into 7 pieces it was about 300 million characters long. My text editor wanted to split those into lines about 30 million characters long. Maybe there is an upper limit for most text editors on line length of about 30 million characters. My text editor knew enough to offer to split lines when a line was only one order of magnitude bigger than the upper limit, but couldn't get far enough along to even make this offer when a line was two orders of magnitude bigger. That also explained why using nrows=100 in R did not work: with no line breaks in the file, the first "row" was the entire file.
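In hindsight, the line length is something you can measure directly rather than infer from an error message. Here is a sketch in Python, with a made-up function name `longest_line_length`, that assumes lines end with a single newline byte. It reads fixed-size chunks instead of whole lines, which is the key point: anything that reads "one line at a time" has to swallow the entire file when the file is a single line.

```python
def longest_line_length(path, newline=b"\n", buffer_size=1024 * 1024):
    """Return the length in bytes of the longest line in a file.

    The file is scanned in fixed-size chunks, so memory use stays small
    even if one line is billions of bytes long. newline must be a single
    byte so that it cannot straddle a chunk boundary.
    """
    longest = 0
    current = 0  # length of the line currently being scanned
    with open(path, "rb") as f:
        while True:
            chunk = f.read(buffer_size)
            if not chunk:
                break
            parts = chunk.split(newline)
            # The first part continues the line left open by the last chunk.
            current += len(parts[0])
            for part in parts[1:]:
                # A newline came before this part, so the open line ended.
                longest = max(longest, current)
                current = len(part)
    return max(longest, current)
```

For a file with no line breaks at all, the result equals the size of the file.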

Now I was ready to rock and roll. I split the file after every 100,000 semicolons. This created 70 files. I downloaded a program from nonags.com that makes text substitutions using regular expressions across a range of files. I changed each ; to \r\n (the escape sequence for a carriage return followed by a line feed, which is the Windows line break). Now I had 70 files, each easily readable by a text editor and each easy to import into any statistical software program. It would be tedious to read in those 70 files and patch them all together, but it could be done. Ideally, some of the extraneous data could be discarded in the process.
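The two steps, splitting after every 100,000 semicolons and turning each semicolon into a line break, could also be done in a single pass. The Python sketch below is an illustration of that idea and not the utilities I actually used. The names `iter_pieces` and `split_records` and the output naming scheme are assumptions. It assumes the record separator is a single byte.

```python
def iter_pieces(src, record_sep, buffer_size):
    """Yield (data, ends_record) pairs; data never contains record_sep."""
    while True:
        chunk = src.read(buffer_size)
        if not chunk:
            return
        parts = chunk.split(record_sep)
        for part in parts[:-1]:
            yield part, True  # a separator followed this part
        if parts[-1]:
            yield parts[-1], False  # record continues in the next chunk


def split_records(path, out_prefix, records_per_file=100_000,
                  record_sep=b";", line_break=b"\r\n",
                  buffer_size=1024 * 1024):
    """Write records to numbered files, one record per line.

    Each record_sep (a single byte) is replaced by line_break, and a new
    output file is started after every records_per_file records.
    """
    out_paths = []
    dst = None
    records_in_file = 0
    try:
        with open(path, "rb") as src:
            for data, ends_record in iter_pieces(src, record_sep, buffer_size):
                if dst is None:
                    out_path = f"{out_prefix}{len(out_paths) + 1:03d}.txt"
                    dst = open(out_path, "wb")
                    out_paths.append(out_path)
                    records_in_file = 0
                dst.write(data)
                if ends_record:
                    dst.write(line_break)
                    records_in_file += 1
                    if records_in_file == records_per_file:
                        dst.close()
                        dst = None
    finally:
        if dst is not None:
            dst.close()
    return out_paths
```

Each output file can then be read with an ordinary delimited-file import, using the vertical bar as the separator. Only the first file carries the record with the variable names.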

I did have a few revelations as I did this. First, files of this size probably need a lot more computer memory than we might need for other programs. It might make sense to get a laptop computer with 8 gigabytes of memory (my current laptop is maxed out at 4 gigabytes). Second, a 64-bit operating system is critical, since a 32-bit operating system can address at most 4 gigabytes and in practice typically leaves only about 3 gigabytes of memory usable. Third, I need some empty thumb drives larger than 2 gigabytes if I want to help work with files of this size. I didn't talk about this, but simply getting the file onto my computer was trickier than it should have been. I have an 8 gigabyte and a 16 gigabyte thumb drive on order.

Why the extra memory? Well, the hex editor should not have been fazed by a single line of 2.1 billion characters. It didn't work (I think) because the file itself was too big for the available memory. Maybe with more memory that wouldn't have been a problem. I'll be ordering a new laptop as soon as my taxes are done. I was just about ready to upgrade anyway, and now I have a solid reason to back up this choice.

Finally, I need to get far more comfortable with tools like grep, which allow you to search and filter files using regular expressions, and with companion tools like sed, which can change them. If I knew these tools really well, I bet I wouldn't have needed to search for and download all those other utilities.
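To give a flavor of what a grep-style search looks like, here is a small Python sketch that prints nothing by itself but yields every line matching a regular expression across a list of files. It is only an illustration and not a replacement for grep. The function name `grep_lines` is made up, and the latin-1 encoding is an assumption chosen because it can decode any byte.

```python
import re


def grep_lines(paths, pattern, encoding="latin-1"):
    """Yield (path, line_number, line) for each line matching pattern."""
    regex = re.compile(pattern)
    for path in paths:
        # newline="" keeps the original line endings so they can be stripped
        # explicitly, whether the files use \r\n or \n.
        with open(path, "r", encoding=encoding, newline="") as f:
            for number, line in enumerate(f, start=1):
                if regex.search(line):
                    yield path, number, line.rstrip("\r\n")
```

This reads one line at a time, so it is only useful after the semicolons have been turned into line breaks.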