Netflix Prize (I tried to resist, but...)

Posted on 2006-10-15 in Lisp
11 comments

I'll admit it, I've got a weak spot for programming competitions. One year I scheduled my vacations so that I would have a full free week for all of the official Perl Golf tournaments scheduled that summer. (It was worth it too, I won one of them).

Even so, I wasn't going to enter the Netflix Prize, no matter how cool it sounds. That much money at stake means that it's almost certainly going to be dominated by teams of people with real understanding of the problem domain, probably PhDs in statistics, and more computing resources than a single desktop PC. And while winning might not be the point, I want to at least have a good shot at getting a respectable result.

But after reading the ongoing comp.lang.lisp thread about the competition ("Subject: massive data analysis with lisp"), and disagreeing with most of what was written there, I needed to validate my opinions on how the data should be represented internally, by writing at least the data-handling parts of the program.

  • In-memory database, which each rating stored as a separate structure instance? That's going to take gigabytes of memory due to the headers and the inefficient struct slot packing.

  • Using an SQL database? That seems like a horrible idea, unless you intend to do all of your processing using SQL functions. Once you want to do anything even moderately complex on the lisp-side, you'll need to transfer the data over, and get buried in the overhead.

  • Storing the data as flat files, and using READ for reading them? AARGH!

My gut feeling – supported by complete lack of knowledge of the problem domain – is that there's simply the wrong amount of data for any of these to be feasible, compared to an efficient in-memory representation.

The plan was to pack each datapoint into a x86-64 fixnum (22 bits for the user info, 15 for the movie, and 3 for the rating, lots of padding), write the fixnums into a file with a suitable SBCL array header as a prefix, and then mmap the file. This has several nice properties: it loads instantly, can be accessed at memory speed, is reasonably memory-efficient, doesn't stress the GC, has good paging characteristics (on-demand, copy-on-write), and looks like a normal Lisp vector. mmap two copies with the entries sorted in a different order, and you can also do easy and quick searches and joins bon both movies and uids.

On the minus side, this does take 2*800MB of memory and disk space. Luckily that's just right for my machine, so the program will still run without any disk trashing (the whole working set won't really fit into memory at once, but the accesses have a pretty good locality).

The exact specifics of the data layout are obviously not crucial, just something that works for me. Somebody else might make different tradeoffs, like omitting the movie from the rating, and inferring it from the array index to get it to fit into less memory / 29-bit fixnums. Or you might store the different properties (movie / uid / rating) in separate arrays, to get better cache / page locality.

Initially this was all that I was going to do. But once the the data was available, I couldn't resist playing around with it a little. Pretty soon I had a working implementation of an unsophisticated brute-force algorithm, which I expected would be barely better than the naive approach of just predicting averages. But actually it ended up getting a slightly better score on the probe dataset than the current Cinematch algorithm does. The most likely reason was that I didn't bother separating the probe set from the training data, which would skew the results.

So I did the qualifying dataset too, and submitted it for scoring. As expected the score on that one is worse, but it's not nearly as bad as I feared. Rather it's 7th on the leaderboard, a mere 1.6% behind Cinematch. I'm fairly convinced that if I'd been using any other sort of data representation, I would've been forced to abandon this algorithm as too slow. Even as-is, it'll take several hours to process the full qualifying set.

There are a few morals to this story:

  • It's not very hard to get reasonably good results for this task (I spent about 10 hours coding, starting from scratch). And I've forgotten the little I learned in the intro to statistics course, and never knew anything about implementing databases in the first place. So if the problem looks interesting, but you don't think you're qualified... Just give it a shot.

  • Lisp is an excellent tool for these kinds of problems. Interactive image-based development and a compiler with reasonable performance is a great combination. Especially if you dig into the implementation-specific features for the exotic operations, instead of requiring absolute portability. I would still be coding the first version in C++, I probably wouldn't have even imported the data efficiently into Java, and my heart is bleeding for the poor souls who are apparently trying to use PHP and MySQL for this.

  • Don't submit your first result ;-) Now I have to wait for a week to see whether the simple changes that improve the probe result by about 0.015 RMSE will also work properly on the qualifying set.

Next Evil Compiler Engineer (2006-10-20)
Previous Cusp, a Lisp plugin for Eclipse (2006-10-14)

Comments

By Juho on 2006-10-16

>> Can you expand on the movie/uid/rating in separate arrays bit?

Hmm... I'm not sure what there is to expand on. You'd have three arrays (one for each of movie, uid, stars), all sorted in the same order. You identify a data point by its index (since the arrays are sorted in the same order, the index of rating X will be the same in all three arrays). If you need to pass a rating from one function to another, you pass the index, and the function will read the data from the appropriate arrays. No explicit connection needed.

>> (also, inferring movieid seems difficult? You've got to keep which movies a user have rated stored somewhere.)

I'm thinking of something along the lines of having an array sorted by movie, and keeping a separate table of the boundaries between movies. So to determine the movie of the rating at index X, you just consult the other table. Or more likely, you'll first consult the other table for the start/end indexes of the movie, and then process all of its ratings at once.

If you want to keep track of what movies a user has rated, you'd store that data elsewhere. For example in an array of ratings sort by uid, which only contains the movie and the rating, and has an auxiliary array for the start/end indexes for each uid.

>> I found that not everyone agrees that mmap is the best/only solution

The issue isn't just about speed, but about being able to access the data transparently Lisp-side. Having to manually keeping seeking / read-sequencing doesn't sound like fun.

>> Keep it up, you're down to 11th. :-)

This isn't a very fast-paced competition, since you can only submit once a week... :-) And I'm really more interested in the absolute score that I can get than in the exact position on the leaderboard anyway.

By bhauth on 2006-10-16

Just as a reference point for difficulty vs effectiveness, what was the brute-force method that you you to -1.64% ?

You're pretty much right about the arrays; consider yourself lucky about not having to use on-disk techniques. Been there, done that.

By pixelglow on 2006-10-16

Parallel evolution. I came to a similiar conclusion, an index of (customer,count) and separate arrays of ratings, dates etc. sorted in the same order, all mmap'ed into memory. But I used C++ and STL, which proved to be very productive. Sadly my score isn't in the leaderboard :-( ...

By Joel on 2006-10-16

I'm interested in the technique of mmap'ing arrays in memory. Do you have a short example anywhere of doing this? I don't mind if its sbcl specific or not, although I'm wondering if cffi could be used. Hmmm, probably not, since each CL implementation probably has its own array header layout. Anyhow, thanks for the very interesting post!

By Juho on 2006-10-17

To bhauth:

I don't want to spoil the competition, so I'll wait until that score is no longer competitive before posting the method. I expect it won't take very long.

To Joel:

You can try adapting from the following code:

http://paste.lisp.org/display/27891

Basically you need one header word with the widetag (the array om the example is specialized on positive fixnums aka. (unsigned-byte 60)), and another with the length of the array. Write the headers + the data (remembering to tag the words with the proper lowtag, if needed) into a file, and then just call SB-POSIX:MMAP on the file.

By Your mom on 2006-10-18

Dude, you can fit all the data into 400MB in memory. Always listen to your mom.

By Asbjørn Bjørnstad on 2006-10-18

Gotcha on the 3 arrays, brainfart on my part, thought about arrays of unique movieids/userids, which was a rather silly.

I ended up removing the movieid from the array and instead having a separate movieid array pointing into where the data for each movie start. That way I only need 4 bytes for each datapoint.

I also remapped all the userids so they are sequential from 0 to 480188 (one-time operation, don't need the hash in memory), and did the same thing with a userid array pointing into where movieid/rating data starts. Only need 3 bytes for each datapoint there, so both arrays together becomes 700MB and I've got indexes to cut down the search space when I need to search for a user/movie rating. My machine only got 1GB memory...

LW doesn't like arrays with 100 million entries, though, so I use seek/read in files instead of mmap. I see very little disk io when running, and with a (fref file filepos) function access is almost as easy as for an array.

Thanks for the post, finally kicked me into (looking at) learning binary file usage. Now, to actually do something useful with the ratings, that seems like a harder problem...

By Bob Carpenter on 2006-10-30

I used about 304MB this way, indexing by user, not film.

Set up two parallel arrays as long as the rating set (100M ratings) of films and ratings (2 bytes and 1 bytes, for a total of 300MB). Then an array as long as the number of users (500K) pointing into the first rating by that user (500K * 4 bytes for a total of 2MB). Then another array as long as the number of users containing the actual user IDs (500K * 4 bytes).

By Jmosk on 2006-11-03

What are the specs on the computers that people are using to process their algorithms? How long is it taking to process the qualifying dataset? Even with efficient data representations, there are 2.8M customer IDs to be processed. This would take 32.4 days at one entry per second, 3.2 days at 100ms per entry and 7 hours at 10ms per entry.

By Juho on 2006-11-04

My code takes 2.5-3 hours for the full qualifying set, with no advance calculation or training. This is on an Athlon x2 3800+.

Name
Message

As an antispam measure, you need to write a super-secret password below. Today's password is "xyzzy" (without the quotes).

Password

Formatting guidelines for comments: No tags, two newlines can be used for separating paragraphs, http://... is automatically turned into a link.