ApacheCon US 2009 Session

Apache Mahout - Going from raw data to information

It has become very easy to create, publish, and collect data in digital form. The volume of structured and unstructured data is increasing at a tremendous pace. This has led to a whole new set of applications that can be built to solve the problem of turning raw data into valuable information. Possible applications include everything from discovering new trends in a stream of weblog entries to automatic learning approaches that supplement market research processes for new products. Machine learning provides tools for building these applications. A large community of researchers has been working on the topic of learning from data. Although solutions to common problems are publicly available, scaling these solutions into the range of terabytes and petabytes is an open issue. To scale algorithms to such dimensions it is vital to distribute data as well as computation. The mission of the Apache Mahout project is to build a suite of scalable machine-learning algorithms that can cope with today's quantities of data. Mahout is built on top of Apache Hadoop. This talk provides a beginner-friendly introduction to the topic of machine learning. It presents a broad set of applications that benefit from machine learning, as well as a high-level overview of Mahout. It also covers the types of tasks that can be solved with each algorithm, and the pitfalls to look out for along the way.