Assessing Stability of K-Means Clusterings

In data analysis we often use unsupervised methods to assess the structure of unlabeled data. In doing so, we create models that allow us to segment players based on their behavior and treat them differently. Unfortunately, many of these methods share a problem: how do we determine the number of segments without prior knowledge of the data? While at first we may look for the number of segments that fits the data best, this approach can lead to unstable results.

Guided by new academic research, this post explores the concept of instability in K-means clustering (a common unsupervised method), our treatment for this issue, and further insight into arriving reliably at the same segmentation, which ultimately forms a more solid foundation for conclusions about our data.
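To make the idea of instability concrete, here is a minimal sketch of one way to probe it. It is not the method from the post: it assumes that "stability" means agreement between K-means runs that differ only in their random initialization, and it measures that agreement with the Rand index. The names kmeans_labels, rand_index, stability and DEMO_POINTS are hypothetical, chosen for illustration.

```python
import random
from itertools import combinations


def kmeans_labels(points, k, seed, max_iter=100):
    """Plain Lloyd's algorithm on tuples of floats; returns one label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initialization drawn from the data
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree (same vs. different cluster)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def stability(points, k, seeds=range(5)):
    """Mean pairwise Rand index across runs that differ only in their seed."""
    runs = [kmeans_labels(points, k, seed) for seed in seeds]
    scores = [rand_index(x, y) for x, y in combinations(runs, 2)]
    return sum(scores) / len(scores)


# Toy data (an assumption for illustration): two tight, well-separated groups.
DEMO_POINTS = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
]

if __name__ == "__main__":
    for k in (2, 3):
        print(k, stability(DEMO_POINTS, k))
```

Under these assumptions, a score near 1 for a given number of segments suggests that repeated runs arrive at the same segmentation, while a lower score suggests the result depends on initialization.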


Python Kafka Client Benchmarking

Kafka is an incredibly powerful service that can help you process huge streams of data. It is written in Scala and has been undergoing lots of changes. Historically, the JVM clients have been better supported than those in the Python ecosystem. However, this doesn't need to be the case! The Kafka binary protocol has largely solidified, and many people in the open source community are working to provide first-class support for non-JVM languages.

In this post we will benchmark the three main Python Kafka clients: pykafka, kafka-python and the newest arrival, confluent-kafka-client. This will also be useful as an introduction to the different clients in the Python ecosystem and their high-level APIs.
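As a rough idea of what such a benchmark measures, here is a minimal client-agnostic sketch. It is not the benchmark code from the post and it calls no Kafka client API: measure_throughput is a hypothetical helper that times any callable named send, and the example below uses a plain list as a stand-in sink. With a real client, send would wrap that client's produce call, and any buffered messages would need to be flushed before the clock stops.

```python
import time


def measure_throughput(send, messages):
    """Time `send` over every message and report simple throughput figures.

    `send` is any callable that accepts one bytes payload (an assumption made
    for illustration; it stands in for a real client's produce call).
    """
    total_bytes = 0
    start = time.perf_counter()
    for message in messages:
        send(message)
        total_bytes += len(message)
    # Guard against a zero reading on very coarse clocks.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return {
        "messages": len(messages),
        "seconds": elapsed,
        "msgs_per_sec": len(messages) / elapsed,
        "mb_per_sec": total_bytes / elapsed / 1e6,
    }


if __name__ == "__main__":
    sink = []  # stand-in for a producer; appending is the "send"
    payloads = [b"x" * 100] * 10000
    print(measure_throughput(sink.append, payloads))
```

Running the same harness against each client with identical payloads keeps the comparison fair, since only the send callable changes between runs.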


Pheidippides (Part 1)

Besides all the math and modeling, the Activision Game Science team also does a bunch of big data engineering. At this year’s MesosCon, we’ll be giving a talk on an application configuration and deployment tool we built called Pheidippides (named after the apocryphal Marathon-to-Athens runner). Pheidippides was initially designed to help us more easily deploy our services into Marathon, but with use it has evolved into something more.

Part 1 of this series is an overview of our issues with configuration management, how we chose to solve these issues, and the resulting Pheidippides specification. No code yet, but in Part 2 we’ll continue with how exactly we implemented the Pheidippides specification and also provide some code that we use to glue it all together.

Problem Statement

In any given software system, configuration is an important component of the successful operation of that system. This is especially true in a service-oriented system that is dominated by micro-services. In these types of systems, micro-services can proliferate very rapidly and, as they proliferate, the total amount of configuration across the system can grow exponentially.


IPython Parallel Introduction

The members of the Analytics Services team here at Activision rely heavily on Mesos and Marathon to deploy and manage services on our clusters. We are also huge fans of Python and the Jupyter project.

The Jupyter project was recently reorganized from IPython, in a move referred to as "the split". One component that was originally part of IPython (IPython.parallel) was split off into a separate project, ipyparallel. This powerful component of the IPython ecosystem is generally overlooked.

In this post I will give a quick introduction to the ipyparallel project and then introduce a new launcher we have open sourced to deploy IPython clusters into Mesos clusters. While we have published this notebook in HTML, please feel free to download the original to follow along.


GDC Follow-Up: Analyzing Churn Using the Survivor Model

I recently gave a talk at GDC on analyzing churn using a survivor model. If you attended my talk and would like to view the source code and data, please follow this link.