Could you expand on how prediction.io would handle a real-world data set containing a few million items/users?
How long would it take to generate a single user<->user recommendation at this scale? Does prediction.io require keeping the whole dataset in main memory, and if so, how much memory would I need?
I'm asking because for us (dawanda.com, one of the biggest e-commerce platforms in Germany) most of the development effort on our soon-to-be-open-sourced recommendation engine was spent on scaling the CF up from a few thousand test records to a 150-million-record production data set.
In the first iteration we also built it completely in Scala, but as we put more and more data into it, memory usage exploded. We realized that boxed types had too much overhead and that we had to implement the whole sparse rating/similarity matrix in C [1]. We also decided to go for a hybrid memory/disk approach, which allowed us to process 80GB datasets on a machine with only 64GB of main memory.
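To illustrate the boxing issue mentioned above: a minimal CSR-style sketch of storing ratings in parallel primitive arrays, which avoids the per-entry object headers and boxed keys/values that a nested `Map[Int, Map[Int, Double]]` incurs on the JVM. All names here (`RatingsCSR`, `rowPtr`, etc.) are illustrative and not taken from libsmatrix, whose actual layout is implemented in C.

```scala
// Illustrative sketch (not the libsmatrix layout): CSR-style sparse rating
// storage in primitive arrays, so no per-entry boxing occurs on the JVM.
final class RatingsCSR(
    val rowPtr: Array[Int],   // per-user offsets into colIdx/values (length = users + 1)
    val colIdx: Array[Int],   // item ids, stored unboxed
    val values: Array[Float]  // ratings, 4 bytes each instead of a boxed object
) {
  // All (item, rating) pairs for one user, read straight out of the arrays.
  def ratingsOf(user: Int): Iterator[(Int, Float)] =
    (rowPtr(user) until rowPtr(user + 1)).iterator.map(i => (colIdx(i), values(i)))
}

object Demo {
  def main(args: Array[String]): Unit = {
    // Two users: user 0 rated items 3 and 7; user 1 rated item 3.
    val m = new RatingsCSR(Array(0, 2, 3), Array(3, 7, 3), Array(5.0f, 2.5f, 4.0f))
    m.ratingsOf(0).foreach { case (item, r) => println(s"$item -> $r") }
  }
}
```

The same idea generalizes to the similarity matrix: one contiguous primitive buffer per structure, rather than millions of small heap objects.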
How did you manage to solve the memory consumption issue for prediction.io in Scala? Did you use Java raw memory access, or did you also swap data out to disk/SSD?
PredictionIO is a serving and evaluation framework on top of a collection of algorithms; currently the majority of them come from the Apache Mahout library [1].
Computation time and resource requirements depend on the choice of technology. If a non-distributed implementation is chosen through the framework, the rule of thumb from Apache [2] is a good guideline. For distributed implementations based on Hadoop, the 10M MovieLens data set [3] finishes training on a single m1.large AWS instance (7.5GB RAM) within 30 minutes. Although we do not have an accurate account of how much computation time and what resources would be required at your production data set's scale, one user reported running his own production data set of similar size, with 2M users, and finished training in about an hour on Amazon EMR.
That said, PredictionIO does not do anything special about memory consumption, nor does it have a special memory access model. It really depends on the underlying libraries that do the actual work.
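As a rough illustration of why the boxing concern raised above matters at the 150-million-record scale mentioned, here is a back-of-envelope estimate. The per-entry sizes are coarse assumptions about a 64-bit JVM (roughly 48 bytes per boxed map entry for headers, references, and boxed key/value), not measurements:

```scala
// Back-of-envelope: memory for 150 million rating triples (user: Int,
// item: Int, rating: Float). Per-entry sizes are rough assumptions.
object MemoryEstimate {
  val ratings: Long = 150L * 1000 * 1000

  // Packed primitive arrays: two Ints + one Float = 12 bytes per rating.
  def unboxedBytes: Long = ratings * (4 + 4 + 4)

  // Boxed map entries: assume ~48 bytes/entry for object headers,
  // references, and boxed key/value objects (a coarse 64-bit JVM figure).
  def boxedBytes: Long = ratings * 48

  def main(args: Array[String]): Unit =
    println(f"unboxed ~${unboxedBytes / 1e9}%.1f GB, boxed ~${boxedBytes / 1e9}%.1f GB")
}
```

Even the boxed figure only counts per-entry payload; real heap usage with hash-map bucket arrays and load factors is typically higher still, which is consistent with an 80GB dataset not fitting in 64GB of RAM without a disk-backed approach.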
Judging from your spec, we imagine your project requires a much faster turnaround time, which is an interesting application to us as well.
[1] http://github.com/paulasmuth/libsmatrix