Predict customer churn using raw data from Yandex Clickhouse

Though this post will primarily interest my Russian-speaking readers who use Yandex Clickhouse, I hope it will also be of service to those who rely on BigQuery for their daily analytics needs. That is because I’m going to demonstrate all the techniques used in this post with raw Google Analytics data in some of my future publications.

Now, let us delve into the subject: predicting customer churn using Yandex Clickhouse. But before that, let me say something about this post.

On its own, the ML approach used in this work is not unique or worth a prolonged discussion. Its primary purpose is to show that you can use the ordinary data you work with every single day for something that goes beyond the scope of a web analyst’s tasks.

[Image: pic. by sendpulse]

Before we begin

The whole task is as follows:

Predict the customer churn rate for a client’s mobile app using raw data stored in AppMetrica.

As you’ve already guessed, this guide involves some techniques that are not commonly used by web analysts. Besides time-honoured Clickhouse SQL, we will need at least basic knowledge of machine learning, math (linear algebra in particular) and, of course, Python.

Moreover, I presume you already have Clickhouse running with all data accessible through SQL queries.
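
To make the task a bit more concrete, here is a minimal sketch of how a churn label could be derived once you know each user’s last activity date. Everything in it is an assumption made for illustration: the `label_churn` helper, the 30-day inactivity window, the device ids and the dates are all made up, and the notebook may define churn differently.

```python
from datetime import date, timedelta


def label_churn(last_seen, cutoff, inactivity_days=30):
    """Return {user_id: 1 if churned else 0}.

    Assumption: a user counts as churned when their last event is
    more than `inactivity_days` days before the `cutoff` date.
    """
    threshold = cutoff - timedelta(days=inactivity_days)
    return {user: int(seen < threshold) for user, seen in last_seen.items()}


# Hypothetical last-activity dates, e.g. the result of a
# "max(event_date) GROUP BY device" style query.
last_seen = {
    "device_a": date(2018, 9, 1),
    "device_b": date(2018, 10, 25),
}

labels = label_churn(last_seen, cutoff=date(2018, 10, 31))
print(labels)
```

The window length is a business decision rather than a technical one, so treat 30 days as a placeholder and pick a value that matches how often people normally use the app.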


What is AppMetrica?

AppMetrica is a marketing platform for app install attribution, push campaigns, and app analytics. More than that, the tool lets you track all kinds of ad campaigns, get insights with user-centric analytics, and communicate with your users. You can learn more about it here.

The solution relies on data accessible through the Yandex Logs API, which lets you work with the non-aggregated information stored in AppMetrica. You can use this data to build your own reports and/or create custom audiences for remarketing.

Why do it?

If you’re reading this post, I believe you understand why it is crucial for almost any kind of business to be able to predict customer churn. Otherwise, here are a few tips.

On BigQuery ML

You can start making your own predictions using GBQ. As you may know, Google has recently rolled out BigQuery ML, a product “allowing users to create and execute machine learning models using standard SQL queries”. There are a couple of nice usage examples: for data analysts and for data scientists.

Finally, now that I’m done with all the preparatory paragraphs, let’s look at the solution!


Yeah, I know! Jupyter notebooks look so ugly when inserted in an iframe. So, here are links to Jupyter NB Viewer: English and Russian versions.

A few remarks on the offered solution

  1. As you can see, almost half of the notebook is dedicated to data gathering/structuring. This is done to demonstrate how you can turn raw analytics data into something applicable to machine learning. If you are only interested in feature engineering/model application, you can jump right to the cell № ????
  2. There are at least 3 models used in the notebook. Though all of them deliver almost the same results, for this very task I’d pick LightGBM (Light Gradient Boosting Machine).
  3. This is my first attempt to use raw analytics data for ML, so, please, don’t be too strict 😉
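
To illustrate the first remark, here is a rough sketch of how raw event rows might be turned into one feature row per user. It is not the notebook’s code: the `build_features` helper, the `(device_id, event_date, event_name)` row layout and the four features are hypothetical, chosen only to show the general idea of aggregating non-aggregated data.

```python
from collections import defaultdict
from datetime import date


def build_features(events, cutoff):
    """Aggregate raw (device_id, event_date, event_name) rows per user.

    Events after `cutoff` are ignored so that the features never
    peek into the period used for the churn label.
    """
    per_user = defaultdict(list)
    for device_id, event_date, event_name in events:
        if event_date <= cutoff:
            per_user[device_id].append((event_date, event_name))

    features = {}
    for device_id, rows in per_user.items():
        days = {day for day, _ in rows}
        features[device_id] = {
            "events_total": len(rows),
            "active_days": len(days),
            "distinct_events": len({name for _, name in rows}),
            "days_since_last": (cutoff - max(days)).days,
        }
    return features


# Made-up rows standing in for raw app events.
events = [
    ("device_a", date(2018, 10, 1), "app_open"),
    ("device_a", date(2018, 10, 1), "purchase"),
    ("device_a", date(2018, 10, 5), "app_open"),
    ("device_b", date(2018, 10, 20), "app_open"),
    ("device_b", date(2018, 11, 2), "app_open"),  # after the cutoff, ignored
]

features = build_features(events, cutoff=date(2018, 10, 31))
for device_id, row in sorted(features.items()):
    print(device_id, row)
```

With real volumes you would most likely do this aggregation in Clickhouse SQL and only pull the per-user rows into Python; the pure-Python version is here just to keep the example self-contained.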

(Instead of a) Conclusion

I hope this work won’t go amiss for those interested in how to predict customer churn using Yandex Clickhouse. As I stated right at the beginning of the post, I’m going to demonstrate the very same approach using raw Google Analytics data. So, stay tuned. As always, feel free to ask your questions in the comment section below!