Could you expand on how prediction.io would handle a real-world data set containing a few million items/users?
How long would it take to generate a single user<->user recommendation at this scale? Does prediction.io require keeping the whole dataset in main memory, and if so, how much memory would I need?
I'm asking because for us (dawanda.com, one of the biggest e-commerce platforms in Germany) most of the development effort on our soon-to-be-open-sourced recommendation engine was spent on scaling the CF up from a few thousand test records to a 150-million-record production data set.
In the first iteration we also built it completely in Scala, but as we put more and more data into it, memory usage exploded. We realized that boxed types had too much overhead and that we had to implement the whole sparse rating/similarity matrix in C [1]. We also decided to go for a hybrid memory/disk approach, which allowed us to process 80GB datasets on a machine with only 64GB of main memory.
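To illustrate the boxing issue mentioned above: a minimal CSR-style sketch of storing ratings in parallel primitive arrays, which avoids the per-entry object headers and boxed keys/values that a nested `Map[Int, Map[Int, Double]]` incurs on the JVM. All names here (`RatingsCSR`, `rowPtr`, etc.) are illustrative and not taken from libsmatrix, whose actual layout is implemented in C.

```scala
// Illustrative sketch (not the libsmatrix layout): CSR-style sparse rating
// storage in primitive arrays, so no per-entry boxing occurs on the JVM.
final class RatingsCSR(
    val rowPtr: Array[Int],   // per-user offsets into colIdx/values (length = users + 1)
    val colIdx: Array[Int],   // item ids, stored unboxed
    val values: Array[Float]  // ratings, 4 bytes each instead of a boxed object
) {
  // All (item, rating) pairs for one user, read straight out of the arrays.
  def ratingsOf(user: Int): Iterator[(Int, Float)] =
    (rowPtr(user) until rowPtr(user + 1)).iterator.map(i => (colIdx(i), values(i)))
}

object Demo {
  def main(args: Array[String]): Unit = {
    // Two users: user 0 rated items 3 and 7; user 1 rated item 3.
    val m = new RatingsCSR(Array(0, 2, 3), Array(3, 7, 3), Array(5.0f, 2.5f, 4.0f))
    m.ratingsOf(0).foreach { case (item, r) => println(s"$item -> $r") }
  }
}
```

The same idea generalizes to the similarity matrix: one contiguous primitive buffer per structure, rather than millions of small heap objects.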
How did you manage to solve the memory consumption issue for prediction.io in Scala? Did you use Java raw memory access, or did you also swap data out to disk/SSD?
PredictionIO is a serving and evaluation framework on top of a collection of algorithms; currently the majority of them come from the Apache Mahout library [1].
Computation time and resource requirements depend on the choice of technology. If a non-distributed implementation is chosen through the framework, the rule of thumb from Apache [2] is a good guideline. For distributed implementations based on Hadoop, the 10M MovieLens data set [3] finishes training on a single m1.large AWS instance (7.5GB RAM) within 30 minutes. Although we do not have an accurate account of how much computation time and what resources would be required at your production data set's scale, one user reported running his own production data set of similar size, with 2M users, and finished training in about an hour on Amazon EMR.
That said, PredictionIO does not do anything special about memory consumption, nor does it have a special memory access model. It really depends on the underlying libraries that do the actual work.
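As a rough illustration of why the boxing concern raised above matters at the 150-million-record scale mentioned, here is a back-of-envelope estimate. The per-entry sizes are coarse assumptions about a 64-bit JVM (roughly 48 bytes per boxed map entry for headers, references, and boxed key/value), not measurements:

```scala
// Back-of-envelope: memory for 150 million rating triples (user: Int,
// item: Int, rating: Float). Per-entry sizes are rough assumptions.
object MemoryEstimate {
  val ratings: Long = 150L * 1000 * 1000

  // Packed primitive arrays: two Ints + one Float = 12 bytes per rating.
  def unboxedBytes: Long = ratings * (4 + 4 + 4)

  // Boxed map entries: assume ~48 bytes/entry for object headers,
  // references, and boxed key/value objects (a coarse 64-bit JVM figure).
  def boxedBytes: Long = ratings * 48

  def main(args: Array[String]): Unit =
    println(f"unboxed ~${unboxedBytes / 1e9}%.1f GB, boxed ~${boxedBytes / 1e9}%.1f GB")
}
```

Even the boxed figure only counts per-entry payload; real heap usage with hash-map bucket arrays and load factors is typically higher still, which is consistent with an 80GB dataset not fitting in 64GB of RAM without a disk-backed approach.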
Judging from your spec, we imagine your project requires a much faster turnaround time, which is an interesting application to us as well.
[1] http://github.com/paulasmuth/libsmatrix