Principal Component Analysis vs Exploratory Factor Analysis

Recently, exploratory factor analysis (EFA) came up in some work I was doing, and I put some effort into understanding how it is similar to, and different from, principal component analysis (PCA). Finding clear and explicit references on EFA turned out to be hard, but I can recommend taking a look at this book and this Cross Validated question. Here, I will review both PCA and EFA, and compare and contrast them. Both are techniques for finding which combinations of features matter most in your data, and for reducing the dimensionality of your feature space.
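To make the "combinations of features" idea concrete, here is a minimal sketch of the PCA half only, written with NumPy's SVD. The toy data, the `pca` helper, and its return values are illustrative assumptions, not code from the post; EFA differs in that it also models a separate noise term for each feature, which this sketch does not attempt.

```python
import numpy as np

def pca(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are unit-length principal directions, ordered by singular value.
    _, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = s ** 2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    scores = X_centered @ components.T
    return scores, components, ratio[:n_components]

# Hypothetical toy data: two features driven by one latent signal,
# plus a third feature that is pure low-variance noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 1))
X = np.hstack([
    latent + 0.1 * rng.normal(size=(500, 1)),
    2.0 * latent + 0.1 * rng.normal(size=(500, 1)),
    0.1 * rng.normal(size=(500, 1)),
])

scores, components, ratio = pca(X, n_components=1)
print("first principal direction:", components[0])
print("share of variance explained:", ratio[0])
```

In this toy setup the first component should load almost entirely on the two correlated features, so a single dimension captures nearly all of the variance.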

Implicit Recommender Systems - Biased Matrix Factorization

In today's post, we will explain Alternating Least Squares (ALS), an algorithm for fitting matrix factorization models for recommender systems (there are others, for example ones based on stochastic gradient descent). We will go through the basic ALS algorithm, as well as how one can modify it to incorporate user and item biases.

We will also go through the ALS algorithm for implicit feedback, and then explain how to modify that to incorporate user and item biases. The basic ALS model and the version for implicit feedback are discussed in many places (both online and in freely available research papers), but we aren't aware of any good source for implicit ALS with biases. Hence, this post.
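As a rough illustration of the starting point, the sketch below implements only the basic ALS update for explicit ratings, with no biases and no implicit-feedback weighting. The function names, the regularization strength `reg`, and the synthetic ratings are all assumptions made for this example, not the post's own code.

```python
import numpy as np

def objective(R, mask, X, Y, reg):
    """Squared error on observed entries plus L2 penalty on both factors."""
    err = (R - X @ Y.T)[mask]
    return float(err @ err + reg * ((X ** 2).sum() + (Y ** 2).sum()))

def als_explicit(R, mask, rank=2, reg=0.1, n_iters=10, seed=0):
    """Basic ALS: fix item factors and solve for users, then swap."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    X = rng.normal(scale=0.1, size=(n_users, rank))  # user factors
    Y = rng.normal(scale=0.1, size=(n_items, rank))  # item factors
    ridge = reg * np.eye(rank)
    history = []
    for _ in range(n_iters):
        for u in range(n_users):
            obs = mask[u]            # items rated by user u
            Yo = Y[obs]
            # Ridge-regression solution for this user's factor vector.
            X[u] = np.linalg.solve(Yo.T @ Yo + ridge, Yo.T @ R[u, obs])
        for i in range(n_items):
            obs = mask[:, i]         # users who rated item i
            Xo = X[obs]
            Y[i] = np.linalg.solve(Xo.T @ Xo + ridge, Xo.T @ R[obs, i])
        history.append(objective(R, mask, X, Y, reg))
    return X, Y, history

# Hypothetical toy data: a rank-2 ratings matrix with about 60% of entries observed.
rng = np.random.default_rng(1)
R = rng.normal(size=(30, 2)) @ rng.normal(size=(20, 2)).T
mask = rng.random(R.shape) < 0.6

X, Y, history = als_explicit(R, mask)
print("objective after each sweep:", np.round(history, 3))
```

Each inner solve is an exact ridge-regression minimizer with the other factor matrix held fixed, so the regularized objective cannot increase from one sweep to the next. Adding bias terms or confidence weights changes these per-row linear systems.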

Unsupervised Anomaly Detection: SOD vs One-class SVM

In this article we test two algorithms that detect anomalies in high-dimensional data. For our purposes, "high-dimensional" means tens to hundreds of dimensions. Our results might generalize to "very high-dimensional" (100,000 dimensions is commonly seen in NLP and image processing), but we have not carefully experimented there.

Since we rarely have training data (or even know what we are looking for), we are only interested in unsupervised algorithms.

The first algorithm, Subspace Outlier Degree (SOD) [kriegel2009], is an unsupervised local anomaly detector. "Local" means that points are compared against their nearest neighbors (not against the entire dataset). The appeal of SOD is that it is designed to overcome the curse of dimensionality that plagues distance- and density-based algorithms such as Local Outlier Factor. For our experiments we use the implementation found in the ELKI framework.

The second algorithm, One-Class Support Vector Machine [scholkopf2001], is a semi-supervised global anomaly detector (i.e., we need a training set that contains only the "normal" class). However, since SVM decision boundaries are soft, it can be used unsupervised as well.

We experiment with this unsupervised usage, as well as with two variants ("eta" and "robust") that were recently proposed [amer2013] explicitly for unsupervised applications. We use the implementation in RapidMiner (which leverages a modified version of libsvm underneath).
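For readers who want to try the unsupervised use of a one-class SVM without RapidMiner, here is a minimal sketch using scikit-learn's `OneClassSVM` as a stand-in. This is not the setup used in our experiments: the synthetic data, the choice `nu=0.1`, and the RBF kernel settings are assumptions for illustration, and the "eta" and "robust" variants and SOD are not covered.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Hypothetical data: 200 "normal" points plus 5 isolated outliers,
# each outlier placed far out along a different axis.
inliers = rng.normal(size=(200, 5))
outliers = 8.0 * np.eye(5) * np.array([1, -1, 1, -1, 1])
X = np.vstack([inliers, outliers])

# Unsupervised use: fit on the contaminated data itself.
# nu upper-bounds the fraction of training points treated as outliers.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
model.fit(X)

labels = model.predict(X)            # +1 for inliers, -1 for outliers
scores = model.decision_function(X)  # lower means more anomalous

print("points flagged as anomalies:", int((labels == -1).sum()))
print("mean score, planted outliers:", scores[200:].mean())
print("mean score, normal points:", scores[:200].mean())
```

Because the model is fit on data that already contains the anomalies, the soft boundary is what lets it leave some training points outside. How well that works depends heavily on `nu` and on how isolated the anomalies are.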


Welcome to the Activision Game Science Blog!

I’m pleased to welcome you to the unofficial and (mostly) unsanctioned Game Science Blog from the team here at Activision. We’re a group of data scientists and software engineers working at the bleeding edge of real-time data at scale in support of Activision’s plethora of game studios and business operations.

We are passionate advocates of data science and data-informed decisions, whether those decisions are one-off human-made business or design decisions, or machine-made decisions happening thousands of times per second. Activision is full of extremely bright, extremely talented people, and we are lucky to both learn from them and inform them of what is possible in the burgeoning data-driven world. Hopefully, we come together to bring you the best interactive entertainment experiences we can.

As we carve inroads into the data-tech future, which involves working through many pain points and staying on top of the latest tech, we have learned (and are constantly learning) many lessons which are likely useful to a larger audience. Hence, this blog.

Expect to see posts from a variety of members of our team. Our goal is to share lessons learned: some posts will be informal and expository, but most will probably be more technical. None of the posts will be Activision-specific or non-reproducible reports on proprietary data; we will link to code on GitHub and to public datasets. We hope you’ll find our posts both informative and useful.

Activision has put a lot of effort into building a strong, vibrant, prototype-to-production analytics team and given us the space and resources to do awesome things, and we are excited to start sharing some of what we do with you.

–Will Kirwin, Activision Game Science Team