Predicting the Length of File Names Using Math! (Part 1)

Lets say you're building a database that lets you search for file names.  During the planning phase, we want to figure out the average file name length, so that you can optimize our algorithms.  If the average length of your files is say, 12, you optimize your database for strings of length twelve.  Fairly simple.  I wrote a simple console program to recurse through directories and calculate the average file length.  On my Vista system, the average length on my system drive is twenty-seven.  The naive approach is to use this number, and optimize everything for strings of length 27.

For argument, you were able to make your string comparison algorithm faster for lengths of 20 or greater, but slower for anything shorter.  Given the average, this seems like a good decision.  (You pre-alloced the arrays or something).  However, given the more data, this would actually result in a decrease of performance. I've made a program that gives more exact numbers.  I've uploaded the C# code for here.

What the code does is create a file, named lengths.csv with the number of files of every length, from length 0 to 254.  Example output from my system, clipped because it's 256 rows.

File Name Length Number of Files
1 0
2 9
3 24
4 36
5 338
6 836
7 1868
8 3115
9 5468
10 8865
11 10214
12 17114
13 4546
14 3427
... -
49 207
50 1068
51 209
... -

We can see some irregularities.  The length of files seems increase rapidly until you hit 12 (the length of 8.3 compliant), then decrease.  There are some irregularities in the output, with some file lengths.  Files of length 160, 107, 108, 99, 50, 42 (hah!), and 32 are shown to have higher numbers of files.  I've uploaded the full csv file here.

Now, with this CSV document, we can do some basic analysis using Excel (or whatever the Open Office equivalent is).  To predict the changes of a file being any one length, we can simply calculate the percentage of each file being a certain length.  Since that sentence was wonky, i'll do it with math: 

Files of Length Y / Total Number of Files = Chance of File X being Length Y

Or, in Excel, the formula =B2/SUM(B$2/B$256).  Given our input data, we can calculate that any given file has an 8% chance of a filename length of 11, and a 14% chance of it being 12.  Now, going back to our algorithm example, and out sample input, we can tell that 48002 files would benefit from our optimization, and 72273 would have a decrease in performance.  So on any given set of input, there is a 60% chance a performance decrease, and a 40% of performance increase.  That's a giant boatload of fail for any optimization algorithm.

But this method imprecise.  For instance, given our limited data, there is a 0% chance of a file having a length 166 and greater.  This simply isn't the case, but due to our limited sample size we don't see any evidence of it.  We'll address this in Part 2.

Print | posted on Monday, August 25, 2008 12:11 AM

Feedback

No comments posted yet.
Title  
Name
Email (never displayed)
Url
Comments   
Please add 1 and 2 and type the answer here: