We at Netflix strive to deliver maximum enjoyment and entertainment to our millions of members across the world. We do so by having great content and by constantly innovating on our product. Key to optimizing both is following a data-driven approach. Data allows us to come up with optimal approaches to applications such as content buying or our renowned personalization algorithms. But in order to learn from this data, we need to be smart about the algorithms we use and how we apply them. In this talk I will describe some of the machine learning algorithms that power our product. I will also describe the kinds of data and features we use, as well as our optimization approach, which includes offline experimentation as well as online A/B testing.
Xavier Amatriain (PhD) is Director of Algorithms Engineering at Netflix. He leads a team of researchers and engineers designing the next wave of machine learning approaches to power the Netflix product. Prior to this, he was a researcher in recommender systems and neighboring areas such as data mining, machine learning, information retrieval, and multimedia. He has authored more than 50 publications, including book chapters, journal articles, and papers at international conferences. He has also lectured at several universities, including the University of California, Santa Barbara and UPF in Barcelona, Spain.
Intel is working hard to build datacenter software, from the silicon up, that supports a wide range of advanced analytics on Apache Hadoop. The Graph Analytics Operation within Intel Labs is helping to transform Hadoop into a full-blown “knowledge discovery platform” that can deftly process a wide range of data models, from simple tables to multi-property graphs, using sophisticated machine learning algorithms and data mining techniques. But the analysis cannot start until features are engineered, a task that takes a lot of time and effort today. In this talk, I will describe some of the Hadoop-based tools we are developing to make it easier for data scientists to deal with data quality issues and construct features for scalable machine learning, including graph-based approaches.
Ted Willke is Principal Engineer and GM of the Graph Analytics Operation in Intel Labs. He leads a team of engineers and product experts developing new commercial tools for cluster-scale machine learning and data mining, with a keen focus on graph-based approaches. Previously, he led a Labs team that researched cluster computing systems. Before joining Intel Labs in 2010, Ted spent 12 years developing server I/O technologies and standards within Intel’s product organizations. He holds a Doctorate in electrical engineering from Columbia University, where he graduated with Distinction. He has authored over 25 papers in book chapters, journals, and conferences, and he holds 10 patents. He won the MASCOTS 2013 Best Paper Award for his work on Hadoop MapReduce performance modeling.
Personalization in general, and recommender systems in particular, are typically framed in terms of collaborative filtering: users of the system and items of content within it are both represented merely as opaque unique identifiers, and the collective record of interactions linking the two, whether boolean flags (user X “clicked on” link A) or numeric affinity weights (like movie ratings), is used to predict the interest users will have in new or as-yet-unseen items. While this has been amazingly effective, the content of the items (the text of the webpage, the description of the movie, etc.) and the metadata about the user (like the user’s profile on a social network) provide an alternative data set for personalization and recommender system inputs, both in conjunction with, and independently of, the raw user-item interaction matrix. In this talk, we will discuss how this kind of approach can ameliorate the cold-start problem, as well as provide some smoothing over the extreme sparsity found in traditional CF approaches.
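The content-based side of this idea can be sketched in a few lines of plain Python. The item descriptions below are hypothetical, and plain bag-of-words cosine similarity stands in for whatever Twitter uses in production; the point is only that a brand-new user or item can be matched without any interaction history:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    common = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical item descriptions: content features, no interaction data needed.
items = {
    "movie_a": "space adventure epic sci-fi",
    "movie_b": "romantic comedy wedding",
    "movie_c": "alien sci-fi space thriller",
}
profiles = {k: Counter(v.split()) for k, v in items.items()}

# A cold-start user who liked movie_a: rank unseen items by content similarity.
liked = profiles["movie_a"]
scores = {k: cosine(liked, p) for k, p in profiles.items() if k != "movie_a"}
best = max(scores, key=scores.get)
print(best)  # movie_c: shares "space" and "sci-fi" with movie_a
```

In a hybrid system, scores like these would be blended with, or used to smooth, the predictions coming out of the user-item interaction matrix.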
Jake Mannix is a Staff Software Engineer on Twitter’s Personalization and Recommender Systems team, working on user interest modeling. Prior to joining Twitter, he worked at LinkedIn first on search, and then was a founding member of LinkedIn’s Recommender Engine team. Jake tends to operate at the intersection of distributed systems work and applied machine learning, especially in scaling algorithms to distributed environments, and has been a contributor to Apache Mahout for over five years, where he served as the Chair of the Project Management Committee from 2012 through August 2013.
Joseph Gonzalez, Co-Founder at GraphLab: Large-Scale Machine Learning on Graphs
From social networks to protein molecules and the web, graphs encode structure and context, enable advanced machine learning, and are rapidly becoming the future of big data. In this talk we will present the next generation of GraphLab, an open-source platform and machine learning framework designed to process graphs with hundreds of billions of vertices and edges on hardware ranging from a single Mac mini to the cloud.
We will present the GraphLab programming abstraction, which blends vertex- and edge-centric views of computation to enable users to express algorithms that can be efficiently executed on hardware ranging from multi-core machines to the cloud. We will describe some of the technical innovations that form the foundation of the GraphLab runtime and enable unprecedented scaling performance. Using PageRank as a running example, we will show how to design, implement, and execute graph analytics on real-world Twitter-scale graphs. Finally, we will present the GraphLab machine learning frameworks and demonstrate how they can be used to identify communities and important individuals, target customers, and extract meaning from text data.
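The vertex-centric pattern behind the PageRank running example can be sketched in plain Python. This is not the GraphLab API itself, just the synchronous gather/apply update on a toy three-vertex graph:

```python
# Toy directed graph: out-edges per vertex.
graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
vertices = list(graph)
# Reverse adjacency for the "gather" phase: each vertex's in-neighbors.
in_nbrs = {v: [u for u in graph if v in graph[u]] for v in vertices}

d = 0.85  # damping factor
rank = {v: 1.0 / len(vertices) for v in vertices}

for _ in range(50):  # synchronous supersteps until (approximate) convergence
    new_rank = {}
    for v in vertices:
        # Gather: sum each in-neighbor's rank spread over its out-degree.
        total = sum(rank[u] / len(graph[u]) for u in in_nbrs[v])
        # Apply: damped PageRank update.
        new_rank[v] = (1 - d) / len(vertices) + d * total
    rank = new_rank

print({v: round(r, 3) for v, r in sorted(rank.items())})
```

A graph engine like GraphLab runs essentially this per-vertex update, but partitions the vertices across machines and handles scheduling and communication.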
Joseph Gonzalez is a co-founder of GraphLab Inc. and a postdoc in the UC Berkeley AMPLab. Joseph graduated from the Machine Learning Department at Carnegie Mellon University where he worked with Carlos Guestrin on parallel algorithms and abstractions for scalable probabilistic machine learning. Joseph is a recipient of the AT&T Labs Graduate Fellowship and the NSF Graduate Research Fellowship.
Yelp has a clear long-term interest in bringing machine learning tools to bear on an array of problems including recommendation of businesses, inference of missing data, and validation of existing data. In the past year, we have advanced this effort by releasing our first business recommendation feature on our mobile apps. This talk will discuss the process of building this system, particularly focusing on Yelp’s domain-specific concerns and the challenges of building the first system like this at a company.
Scott Triglia is a Search and Data Mining Engineer at Yelp. He leads a team that uses machine learning to augment local search with a focus on automated data quality improvements. He was one of the engineers responsible for Yelp’s first deep foray into recommendation systems: the new mobile Nearby page. Prior to joining Yelp, he earned an MS researching probabilistic modeling under Padhraic Smyth at UC Irvine.
Online learning techniques, such as Stochastic Gradient Descent (SGD), are powerful when applied to risk minimization and convex games on large problems. However, their sequential design prevents them from taking advantage of newer distributed frameworks such as Hadoop/MapReduce. In this session, we will take a look at how we parallelize parameter estimation for linear models using the next-generation YARN-based framework IterativeReduce and the parallel machine learning library Metronome.
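The parameter-averaging pattern behind this style of parallelization can be sketched in plain Python. This is a toy one-parameter least-squares model with two simulated "workers"; IterativeReduce and Metronome are of course far more general:

```python
import random

def sgd_epoch(w, data, lr=0.05):
    """One sequential SGD pass over a shard for a 1-D model y ~ w*x."""
    for x, y in data:
        grad = 2 * (w * x - y) * x  # gradient of (w*x - y)^2
        w -= lr * grad
    return w

random.seed(0)
# Synthetic data from y = 2x plus a little noise.
data = [(0.1 * i, 2.0 * (0.1 * i) + random.gauss(0, 0.01)) for i in range(1, 41)]
shards = [data[0::2], data[1::2]]   # partition across two "workers"

w = 0.0
for _ in range(20):                 # each superstep of the iterative reduce
    local = [sgd_epoch(w, shard) for shard in shards]  # map: local SGD per shard
    w = sum(local) / len(local)                        # reduce: average parameters

print(round(w, 2))
```

Each superstep plays the role of one map/reduce round: workers run SGD independently on their shard, and the reduce step averages the resulting parameter vectors before broadcasting them back out.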
Josh Patterson is currently the Principal at Patterson Consulting. He was recently VP of Services at Continuuity, and before that spent three years as a Principal Solutions Architect at Cloudera, helping Fortune 100 companies build out their Hadoop and machine learning pipelines. Prior to joining Cloudera, he was responsible for bringing Hadoop into the smart grid through his involvement in the openPDC project; his focus there, with Hadoop and HBase, was using machine learning to discover and index anomalies in time series data. Josh is a graduate of the University of Tennessee at Chattanooga with a Bachelor's in Business Management and a Master's in Computer Science, with a thesis titled “TinyTermite: A Secure Routing Algorithm” in which he worked on mesh networks and social insect swarm algorithms. Josh has over 15 years in software development and continues to contribute to open source projects such as Apache Mahout, Metronome, IterativeReduce, openPDC, and JMotif.
Data scientists know how hard it is to collect, categorize, and label vast amounts of data. But some smart data scientists are effectively leveraging the human intelligence of the crowd to solve these problems, resulting in better training of machine learning models and improved system performance. Building on years of working directly with data scientists and machine learning experts at LinkedIn, Twitter, Walmart, and others, this session describes where and how crowdsourcing can improve results, and what is still infeasible. Using big data effectively almost always involves large amounts of cleaning and processing, and proper categorization and attribute labels are essential. In many cases some of the steps can only be done manually, making crowdsourcing a crucial tool for data scientists. This talk will describe microtasking, where it fits in the crowdsourcing landscape, and how data scientists and developers can most effectively tap into the crowd to collect and process their data sets. Several real-world cases will be used to illustrate the possibilities, including tweet analysis, social profile mining, and pre-processing satellite imagery for big data queries. This session will also cover some predictions as to where the crowdsourcing industry is heading.
- Collecting large data sets using the crowd
- Augmenting, labeling and categorizing using microtasking
- Conducting big data experiments using the crowd
- Training machine learning models using results from the crowd
- Real-world examples of tweet and sentiment analysis, satellite imagery, and social profile mining
Lukas Biewald is CEO and co-founder of CrowdFlower, a leading crowdsourcing and microtasking platform vendor. Prior to co-founding CrowdFlower, Biewald was a senior scientist and manager within the Ranking and Management Team at Powerset, Inc., a natural language search technology company later acquired by Microsoft, and also led the Search Relevance Team for Yahoo! Japan. He graduated from Stanford University with a BS in Mathematics and an MS in Computer Science. Biewald is also an expert-level Go player.
Deep learning and unsupervised feature learning offer the potential to transform many domains such as vision, speech, and natural language processing. However, these methods have been fundamentally limited by our computational abilities, and typically applied to small-sized problems. In this talk, I describe the key ideas that enabled scaling deep learning algorithms to train a large model on a cluster of 16,000 CPU cores (2,000 machines). This network has 1.15 billion parameters, more than 100x larger than the next largest network reported in the literature. Such a network, applied at this huge scale, is able to learn abstract concepts in a much more general manner than previously demonstrated. Specifically, we find that by training on 10 million unlabeled images, the network produces features that are very selective for high-level concepts such as human faces and cats. Using these features, we also obtain breakthrough performance gains on several large-scale computer vision tasks. Thanks to its scalability and insensitivity to modalities, our framework is also used in other domains with Web-scale data, such as speech recognition and natural language understanding, to achieve significant performance leaps.
Quoc Le is a software engineer at Google and will become an assistant professor at Carnegie Mellon University in Fall 2014. At Google, Quoc works on large-scale brain simulation using unsupervised feature learning and deep learning. His work focuses on object recognition, speech recognition, and language understanding. Quoc obtained his PhD at Stanford, earned his undergraduate degree with First Class Honours as a Distinguished Scholar at the Australian National University, and has been a researcher at National ICT Australia, Microsoft Research, and the Max Planck Institute for Biological Cybernetics. Quoc won the best paper award at ECML 2007.
Machine listening is a field that encompasses research on a wide range of tasks, including speech recognition, audio content recognition, audio-based search, and content-based music analysis. In this talk, I will start by introducing some of the ways in which machine learning enables computers to process and understand audio in a meaningful way. Then I will draw on some specific examples from my dissertation showing techniques for automated analysis of live drum performances. Specifically, I will focus on my work on drum detection, which uses gamma mixture models and a variant of non-negative matrix factorization, and drum pattern analysis, which uses deep neural networks to infer high-level rhythmic and stylistic information about a performance.
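The NMF idea behind the drum detection work can be sketched on a toy "spectrogram" built from two hypothetical spectral templates (say, kick vs. snare). This uses the standard multiplicative updates for the Frobenius objective, not the gamma-mixture variant from the dissertation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "spectrogram" V: two spectral templates and their activations
# across 8 time frames (both entirely made up for illustration).
templates = np.array([[1.0, 0.1],
                      [0.2, 1.0],
                      [0.0, 0.8]])                          # frequency x component
activations = np.array([[1, 0, 1, 0, 1, 0, 0, 0],
                        [0, 1, 0, 0, 0, 1, 0, 1]], float)   # component x time
V = templates @ activations + 0.01                          # keep strictly positive

# Standard multiplicative updates minimizing ||V - W H||_F,
# which keep W and H non-negative by construction.
k = 2
W = rng.random((V.shape[0], k)) + 0.1
H = rng.random((k, V.shape[1])) + 0.1
for _ in range(500):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)

err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(round(err, 3))
```

In the drum-detection setting, the columns of W play the role of learned drum templates and the rows of H are their onset activations over time, which downstream models (such as the deep networks used for pattern analysis) can then interpret.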
Eric Battenberg recently received his PhD from the EECS department at UC Berkeley, where he was advised by Professors David Wessel and Nelson Morgan. At Berkeley, Eric carried out research on music information retrieval and audio processing applications as a member of the Center for New Music and Audio Technologies (CNMAT) and the Parallel Computing Laboratory (Par Lab). Eric currently works at Gracenote in Emeryville, where he develops algorithms for audio content recognition applications.
With more than 200 million registered listeners, more than a billion hours of music streamed every month, and more than 30 billion thumb ratings since launch, to say Pandora has an enormous amount of data would be an understatement. The popular personalized radio service was built on musicological data from The Music Genome Project and data plays an instrumental role today in determining what music plays and when on each individual listener’s stations. In this session, Pandora’s Chief Scientist will share how his team of music analysts, curators, engineers and data scientists all work in concert to figure out the perfect balance of familiarity, discovery, repetition and relevance for every individual listener. He will discuss how Pandora is equal parts man and machine, Pandora’s algorithmic approach, how to make sense of massive data sets, where the challenges are and how all of this ultimately impacts the future of music.
Eric Bieschke is Pandora’s Chief Scientist and runs playlist engineering for the leading internet radio service. As the second employee to join Pandora in 2000, he built the initial prototypes for the playlist algorithms, which are now deployed to provide the best personalized radio experience to more than 200 million registered users. With a powerful combination of more than 13 years of human music analysis in the Music Genome Project and proprietary playlist technology, Eric leads a team of data scientists, recommendation system specialists and engineers to take these billions of musicological data points and listener insights to figure out how to deliver the right track to the right listener at the right time. With immense scale and a colossal pool of data, Eric and his team optimally combine content-based recommendations, collective intelligence, and large scale machine learning to deliver the best personalized radio experience.
At most companies, advanced analytics expertise is contained in a lab environment: a small team of analysts sitting at their computers and churning out reports and insights to support business decisions. But the real potential for advanced analytics lies in building models that make real-time decisions within production workflows. We will discuss how to use the ecosystem of technologies around Hadoop to support bringing models out of the lab and into the factory, with a focus on strategies for data integration, large-scale machine learning, and experimentation.
Josh Wills is Cloudera’s Senior Director of Data Science, working with customers and engineers to develop Hadoop-based solutions across a wide range of industries. He is the founder and VP of the Apache Crunch project for creating optimized MapReduce pipelines in Java, and the lead developer of Cloudera ML, a set of open-source libraries and command-line tools for building machine learning models on Hadoop. Prior to joining Cloudera, Josh worked at Google, where he worked on the ad auction system and then led the development of the analytics infrastructure used in Google+.
Bio: Abhijit Bose heads data science and engineering at American Express, with responsibilities for large-scale modeling, algorithm R&D for recommendation and personalization, and Hadoop platform engineering. Prior to joining American Express, Abhijit worked at Google in Mountain View from 2010 to 2012, developing analytical models and pipelines for Google+. From 2007 to 2012, he was a researcher at the IBM T. J. Watson Research Center. Abhijit has a PhD in Aerospace Engineering & Engineering Mechanics (University of Texas at Austin) and a PhD in Computer Science & Engineering (University of Michigan, Ann Arbor).
Bio: Henry Yuan led the development of the merchant recommender system for American Express ‘My Offers’. He has been with American Express for over 10 years. Henry holds a Master's degree in Economics from Concordia University in Montreal, Canada.
Huiming Qu, Sr. Data Scientist, Risk and Information Management, American Express
Bio: Huiming Qu leads model building for next-generation customer relationship marketing. She was previously with the IBM T.J. Watson Research Center, working on services research, and with Distillery (formerly Media6Degrees), working on digital targeting. Huiming has a PhD in Computer Science from the University of Pittsburgh.
Unlike in e-commerce and online advertising, recommender systems are a relatively new concept in the financial services industry, which also faces several constraints that other domains do not. First, strict privacy guidelines govern the usage of customer transaction data, in addition to considerations such as banking regulations and brand reputation. Second, financial institutions have only recently started to deploy the kind of large-scale distributed computing infrastructure, such as Hadoop, needed to build and serve recommendations to millions of customers. In this talk, we will share our experience building one of the most sophisticated recommendation platforms in the financial industry: blending different recommendation algorithms while adhering to the above constraints, and incorporating both our legacy and newer (Hadoop) infrastructure into an end-to-end recommendation platform architecture. We will discuss how we use customer purchase history to derive meaningful insights and deliver merchant and other offers, while dealing with cold-start, customer preferences, and other issues.
If learning methods are to scale to the massive sizes of modern datasets, it is essential for the field of machine learning to embrace parallel and distributed computing. Inspired by the recent development of matrix factorization methods with rich theory but poor computational complexity and by the relative ease of mapping matrices onto distributed architectures, we introduce Divide-Factor-Combine (DFC), a scalable divide-and-conquer framework for noisy matrix factorization. Our experiments with collaborative filtering, video background modeling, subspace segmentation, graph-based semi-supervised learning and simulated data demonstrate the near-linear to super-linear speed-ups attainable with our approach. Moreover, our analysis shows that DFC enjoys high-probability recovery guarantees comparable to those of its base algorithm.
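A rough sketch of the Divide-Factor-Combine idea, using truncated SVD as the base low-rank algorithm and the column-projection style of combine step (in the real framework the per-block solves run in parallel, and the base algorithm is typically a noisy matrix factorization method):

```python
import numpy as np

rng = np.random.default_rng(0)

def rank_k(A, k):
    """Best rank-k approximation via truncated SVD (the 'base algorithm')."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] * s[:k] @ Vt[:k]

# Low-rank ground truth plus a little noise.
k, m, n = 3, 60, 80
M = rng.standard_normal((m, k)) @ rng.standard_normal((k, n)) \
    + 0.01 * rng.standard_normal((m, n))

# Divide: split the columns into t submatrices.
# Factor: run the base algorithm on each block independently
# (these solves are embarrassingly parallel).
t = 4
blocks = np.array_split(M, t, axis=1)
factored = [rank_k(B, k) for B in blocks]

# Combine (column projection): project every factored block onto the
# column space of the first one, then stitch the pieces back together.
Q = np.linalg.svd(factored[0], full_matrices=False)[0][:, :k]
estimate = np.hstack([Q @ (Q.T @ B) for B in factored])

err = np.linalg.norm(M - estimate) / np.linalg.norm(M)
print(round(err, 3))
```

The divide step is where the speed-up comes from: each block is a fraction of the original size, so the expensive factorization cost is paid on small problems, while the cheap combine step restores a globally consistent low-rank estimate.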
Ameet Talwalkar is an NSF post-doctoral fellow in the Computer Science Division at UC Berkeley. His work focuses on devising scalable machine learning algorithms, and more recently, on interdisciplinary approaches for connecting advances in machine learning to large-scale problems in science and technology. He obtained a bachelor’s degree from Yale University and a Ph.D. from the Courant Institute at New York University.
Large-scale machine learning offers a considerable advantage, but often the challenge with big data is not just the learning itself, but the preprocessing and cleaning needed to get data into a learnable form. Users are forced to mix a variety of tools (e.g. Hadoop, SQL, Pig) for preprocessing, which not only increases complexity but can take longer to execute than the learning algorithm itself. Apache Spark is a general-purpose cluster computing engine that can solve this problem by efficiently supporting *both* data processing tasks and machine learning. Spark runs a generalization of the MapReduce model that supports both iterative algorithms (and hence most common ML tasks) and multi-stage data processing (e.g. SQL), outperforming MapReduce by as much as 100x in both cases. In addition, these computations can efficiently be combined, sharing data through memory. Using Spark, it’s possible to write an end-to-end ML workflow in one program that will not only outperform a traditional one but be easier to iterate on and maintain. Finally, Spark offers high-level APIs in Scala, Java and Python and supports interactive use, making it possible to use one tool for both initial prototyping / exploration and large-scale deployment on a cluster.
Matei Zaharia recently finished his PhD at UC Berkeley, where he worked on large-scale data processing systems. He created the Apache Spark project and developed code and algorithms that have also been incorporated into other popular projects, like Hadoop. In 2014, Matei will start an assistant professor position at MIT. Meanwhile, he is CTO at Databricks, a new company commercializing Spark.
Motivated by problems in large-scale data analysis, randomized algorithms for matrix problems such as regression and low-rank matrix approximation have been the focus of a great deal of attention in recent years. These algorithms exploit novel random sampling and random projection methods; and implementations of these algorithms have already proven superior to traditional state-of-the-art algorithms, as implemented in LAPACK and high-quality scientific computing software, for moderately-large problems stored in RAM on a single machine. Here, we describe the extension of these methods to computing high-precision solutions in parallel and distributed environments that are more common in very large-scale data analysis applications. In particular, we consider both the Least Squares Approximation problem and the Least Absolute Deviation problem, and we develop and implement randomized algorithms that take advantage of modern computer architectures in order to achieve improved communication profiles on, e.g., MapReduce and on clusters that have high communication costs such as on an Amazon Elastic Compute Cloud cluster.
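The core random-projection idea for least squares can be sketched as follows. A dense Gaussian sketch stands in here for the faster structured projections and the high-precision iterative refinements the talk describes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Overdetermined least-squares problem: minimize ||A x - b||_2.
n, d = 5000, 20
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.1 * rng.standard_normal(n)

# Random projection: compress the n rows down to s << n sketched rows,
# then solve the much smaller least-squares problem.
s = 200
S = rng.standard_normal((s, n)) / np.sqrt(s)
x_sketch = np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]

# Compare against the exact solution of the full problem.
x_exact = np.linalg.lstsq(A, b, rcond=None)[0]
res_exact = np.linalg.norm(A @ x_exact - b)
res_sketch = np.linalg.norm(A @ x_sketch - b)
ratio = res_sketch / res_exact
print(round(ratio, 3))  # close to 1: the sketched solution is near-optimal
```

The win is that the expensive solve runs on an s x d system instead of an n x d one; the theory of subspace embeddings guarantees the sketched residual is within a small factor of the optimum with high probability.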
Michael Mahoney is at Stanford University. His research interests center around algorithms for very large-scale statistical data analysis, including large-scale machine learning; approximate computation and regularization methods for large informatics graphs; applications to community detection, clustering, and information dynamics in large social and information networks; and randomized matrix algorithms. He has been a faculty member at Yale University and a researcher at Yahoo, and his PhD, from Yale University, is in computational statistical mechanics.