What are the challenges Hengam faces?
Like most machine learning systems, Hengam faces a number of challenges when it is applied in the real world. The most basic one is obtaining clean data, which usually means handling noise and missing values. Handling imbalanced data and weighing the trade-off between model complexity and sample complexity are also important in our domain. The following sections discuss these challenges.
1. Missing Values
Every app has a mix of normal users, new users (those who installed the app within the last week), and idle users (those who have opened the app fewer than 10 times). Gathering data on normal users is usually not a problem. New and idle users, however, show little or no behavior within the app, so their behavior-related data either cannot be collected or is too sparse to work with.
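The three user groups can be told apart with a simple rule. The following is a minimal sketch, not Hengam's actual implementation: the function name `segment_user`, its parameters, and the choice to check "new" before "idle" are assumptions made for illustration. Only the one-week and 10-opens thresholds come from the text.

```python
from datetime import date, timedelta

# Thresholds taken from the definitions in the text.
NEW_USER_WINDOW = timedelta(days=7)  # installed within the last week
IDLE_OPEN_THRESHOLD = 10             # opened the app fewer than 10 times


def segment_user(install_date: date, open_count: int, today: date) -> str:
    """Return 'new', 'idle', or 'normal' for a single user.

    Assumption: a user who is both new and idle is reported as 'new'.
    """
    if today - install_date <= NEW_USER_WINDOW:
        return "new"
    if open_count < IDLE_OPEN_THRESHOLD:
        return "idle"
    return "normal"


today = date(2024, 1, 31)
print(segment_user(date(2024, 1, 28), 2, today))   # new
print(segment_user(date(2023, 11, 1), 4, today))   # idle
print(segment_user(date(2023, 11, 1), 57, today))  # normal
```

Users labeled "new" or "idle" are the ones whose behavioral features are likely to be missing or too sparse.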
2. Imbalanced Classification
Hengam works by recognizing patterns in users’ data. However, the number of churning users is very low compared to the total number of users, so there are few churn examples to learn from and base predictions on. This creates an imbalanced classification problem.
Classification is a predictive modeling problem that involves assigning a class label to each observation. An imbalanced classification problem is a classification problem in which the distribution of examples across the known classes is biased or skewed. Imbalanced classification is a challenge for predictive modeling because most machine learning algorithms used for classification were designed around the assumption of a roughly equal number of examples in each class. The result is models with poor predictive performance, specifically for the minority class. This matters because the minority class is typically the more important one, so the problem is more sensitive to classification errors on the minority class than on the majority class.
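A small example shows why imbalance hurts the minority class. In the sketch below, the 5% churn rate and the label encoding (1 = churned, 0 = retained) are made-up values for illustration, not figures from Hengam. A model that always predicts the majority class looks accurate yet never finds a single churner.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive examples that were predicted positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Illustrative labels: 5 churned users and 95 retained users (assumed ratio).
y_true = [1] * 5 + [0] * 95
# A degenerate model that always predicts "retained".
y_pred = [0] * 100

print(accuracy(y_true, y_pred))  # 0.95
print(recall(y_true, y_pred))    # 0.0
```

This is why accuracy alone is a poor guide on imbalanced data, and why a metric focused on the minority class, such as recall, is worth tracking.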
Two approaches can be taken towards this problem:
- Up-sampling the minority class
- Down-sampling (sub-sampling) the majority class
Up-sampling is the process of randomly duplicating observations from the minority class in order to reinforce its signal. There are several heuristics for doing so, but the most common way is to simply resample with replacement.
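Resampling with replacement can be sketched using only the Python standard library. The function name `upsample_minority`, the fixed seed, and the toy records are assumptions for illustration. The sketch also assumes the majority class is at least as large as the minority class.

```python
import random


def upsample_minority(majority, minority, seed=0):
    """Draw len(majority) minority examples with replacement.

    Returns a balanced list: all majority examples plus the resampled
    minority examples. Some minority examples will appear more than once.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    resampled = rng.choices(minority, k=len(majority))
    return majority + resampled


# Toy records: (user_id, churned)
majority = [(i, 0) for i in range(8)]
minority = [(100, 1), (101, 1)]

balanced = upsample_minority(majority, minority)
n_churned = sum(label for _, label in balanced)
print(len(balanced), n_churned)  # 16 8
```

Because the duplicated rows carry no new information, up-sampling should be applied to the training split only, after any train/test split, so that copies of the same user do not end up on both sides.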
Down-sampling involves randomly removing observations from the majority class to prevent its signal from dominating the learning algorithm. The most common heuristic for doing so is resampling without replacement. Hengam uses down-sampling, on the assumption that the number of active users is large enough to create a balanced training set.
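Resampling the majority class without replacement can be sketched as follows. This is an illustration under stated assumptions, not Hengam's code: the function name `downsample_majority`, the fixed seed, and the toy records are hypothetical.

```python
import random


def downsample_majority(majority, minority, seed=0):
    """Keep a random subset of the majority class, the same size as the minority.

    Sampling is done without replacement, so no majority example is repeated.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    kept = rng.sample(majority, len(minority))
    return kept + minority


# Toy records: (user_id, churned)
majority = [(i, 0) for i in range(8)]
minority = [(100, 1), (101, 1)]

balanced = downsample_majority(majority, minority)
n_churned = sum(label for _, label in balanced)
print(len(balanced), n_churned)  # 4 2
```

The balanced set is only as large as twice the minority class, which is why this approach depends on having enough churned users to begin with.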
3. Lack of Data
For apps that have been newly released or have a small user base, the assumption that the number of active users is enough to create balanced training data fails. There is not enough information available to learn about the users of these apps. To solve this problem, Hengam applies a pre-trained model to the app and tests it there. Based on the information gathered from these tests, the model is tuned to the characteristics of the app.
The pre-trained model is a general churn prediction model built by aggregating the users of a handful of applications. It is then evaluated on the new application.
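The pool-then-evaluate idea can be illustrated with a deliberately tiny stand-in model. The text does not say which model or features Hengam uses, so everything here is an assumption: the single feature (sessions in the last week), the threshold rule, the function names, and the data are all hypothetical.

```python
def evaluate(threshold, users):
    """Accuracy of the rule 'predict churn when sessions <= threshold'."""
    return sum((s <= threshold) == bool(c) for s, c in users) / len(users)


def fit_threshold(users):
    """Pick the session-count threshold with the best accuracy on `users`."""
    candidates = sorted({s for s, _ in users})
    return max(candidates, key=lambda th: evaluate(th, users))


# Hypothetical per-app data: (sessions_last_week, churned)
app_a = [(0, 1), (1, 1), (5, 0), (8, 0)]
app_b = [(0, 1), (2, 1), (6, 0), (9, 0)]
new_app = [(1, 1), (2, 0), (3, 1), (7, 0)]

# "Pre-train" on the pooled users of the existing apps.
pooled = app_a + app_b
threshold = fit_threshold(pooled)

# Evaluate the general model on the new app before any tuning.
print(threshold, evaluate(threshold, new_app))  # 2 0.5
```

A gap between performance on the pooled data and on the new app, as in this toy case, is the kind of signal that would motivate tuning the model to the new app's characteristics.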
4. Privacy and Access Permissions
Apps have different levels of access to users’ data. Moreover, users can set constraints on apps and deny them access to personal data. This is also one of the reasons that missing values appear in users’ data.