What recommendations can I give to a telecom company? A business case with customer churn.

12 min readNov 18, 2021

A very common business problem is to understand why customers leave and what to do about it. Knowing why customers do not want to use the service anymore, you can change some business processes to stop the outflow and save the company money. And maybe even multiply it.

Today I will try to investigate the dataset from a telecommunications company and give some recommendations on how to deal with the customer’s churn. This is old, but the real dataset was taken here: https://www.kaggle.com/becksddf/churn-in-telecoms-dataset.

He is quite old and has already been studied enough in the circles of data scientists.

The full code for this article is available on the Github repository here

The first look

Now my main task is to explore a dataset where the “churn” column will act as the target variable. So, what does the dataset consist of:

I have written a special class for analyzing a data set. It can be viewed at this link. It generates data tables and distribution graphs. I think this is enough to conclude what data structure the dataset has and how many missing values it has. I will call this class and pass it a data set.

# Building a dataset reportdata = pd.read_csv('../data/raw/dataset.csv', sep = ',')def build_anylize_report():
    DatsetAnalyzer(data, 'Customer data of the telecom company', ['international plan', 'voice mail plan', 'churn'], 'churn')
    
build_anylize_report()

Now I have received the very first data report.

1. The data

2. Information about columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   int64  
 15  total night charge      3333 non-null   float64
 16  total intl minutes      3333 non-null   float64
 17  total intl calls        3333 non-null   int64  
 18  total intl charge       3333 non-null   float64
 19  customer service calls  3333 non-null   int64  
 20  churn                   3333 non-null   bool   
dtypes: bool(1), float64(8), int64(8), object(4)
memory usage: 524.2+ KB
None

3. Missing values

The dataset contains 21 columns. Total 0 columns with missing values.

4. Correlation

5. Histograms of distributions

6. Categorical values

There is more information in the report in the notebook, but I reflected only the most necessary info here.

What do I see in the raw data as a result? In total, we have 3333 records, and there are no missing values and duplicates, which will facilitate further data preprocessing and model creation. The average total number of calls during the day is 179 minutes (in the future it makes sense to calculate the average number of calls per day). The percentage of loyal visitors in this data set is 85.5% (respectively, the outflow is 14.5%). 9.6% of customers have an international tariff plan connected, and only 27% of users use voicemail. The total time of calls and the customer’s payment has the highest degree of correlation, which is normal in principle. There is a small relationship between customer care, contacting customer support, the number of minutes used during the day, and payment for minutes used. Based on the histograms of the distributions, it can be seen that there is a distribution offset for Customer service calls, Total int calls, and Number vmail messages (peak at zero). These properties need to be given closer attention in the analysis since these offsets can affect the outflow. The offset of the distribution in the Area Code field is most likely not important, but simply indicates the territorial location of users. Data preprocessing is not required, but it may be worth processing a column with a phone number, and it makes sense to turn columns with Yes/No values either into a boolean or numeric type.

Exploration data analysis

All the code will be available in the repository, and below I will analyze the graphs and tables that I received during the exploration of the dataset.

First, let’s look at the general information about the company to have an idea of what points it makes sense to pay attention to when further investigating the causes of outflow.

General information about the service provided by the company

In which states are the most and least used of the service provided? There is a possibility that this will show how the telecom company has advanced in the market in the past. States with more activity may have more traffic.

The most active / inactive states in minutes: WV(60789.7), CA(20091.0)

Let’s look at the number of customers

States with maximum and minimum values in days: WV(106), CA(34)

The above reports show that there is no uniform distribution of the number of customers among the states and also show a possible historical commercial expansion. It makes sense to look at how the average usage time of the service is distributed.

States with maximum and minimum mean values in minutes: TN(8.0), CA(1.9)

If I build a “boxplot” and a histogram for the average time, I can see that some states have statistical outliers. But in general, the average time of using the service per day does not correlate with the activity of users for all time. On average, users used the connection for about 3 minutes per day.

Customer churn

Exploration of customer churn depending on location

Unbalanced data usually refers to a problem with classification problems when classes are not represented equally. In our case, let’s explore the target Churn class for balancing. Let’s look at the unique number of values across the entire set to have a general idea (in fact, we have already seen this when analyzing the general information about the set):

The distribution of the target class values in the complete dataset has a ratio of 1:6, which indicates an imbalance. So we have a class imbalance for the target variable, which can lead to predictive models that are biased towards the majority. Let’s look at the imbalance by state:

There is an imbalance by state. The minimum customer churn in HI and AK states is 6%. The maximum customer churn in NJ and CA is 26%. The imbalance by the state should be taken into account when further building the model. Let’s see the outflow of by phone code:

There is an imbalance in all codes, but at the same time the churn is evenly distributed between telephone codes in the range of 14% — 15%

Exploration of customer churn depending on international calls

In a preliminary analysis of the data, we have already seen that only 10% of customers have a tariff for international calls. It is worth checking out the idea that customers who make a lot of international calls purchase an international plan.

After the above report, it can be clearly stated that 90% of customers call abroad and at the same time do not have a suitable tariff plan. With roaming enabled, the share of those who left is higher, which probably indicates that spending is higher and customers do not like it.

After 14 international calls, there is a sharp increase in customer churn. This is either a normal phenomenon, or there is the influence of some factors that increase the outflow (for example, an increase in the cost).

Exploration of customer churn depending on voicemail

There is not a single message sent via voicemail for users who do not have this option included in the plan. Probably the company does not provide such an opportunity at the technical level. It is worth noting that not all users who have this option use it (only 28% of customers with the corresponding tariff plan are used). Voicemail is a rarely used service from this operator.

After 40 voice messages, the percentage of churn increases. This is a rather strange dependency that displays either statistical outliers or an increase in prices for sending messages, or this is a normal process. Let’s just remember this fact.

Exploration of the customer churn depending on the customer support service

81% of the customers who contacted the support service have left. At the same time, 79% who are still customers have also contacted support at least once. At first glance, there is no direct dependence on whether the client asked for help or not, but in accordance with the average number of calls to support, it can be argued that people who left turned to support more often than those who stayed.

Customers who call more than 3 times in support have a higher percentage of churn. Probably these customers have serious problems and leave the company because of poor help. The company should pay attention to this fact.

Exploration of customer churn depending on the customer’s lifetime

The average lifetime of the client is approximately the same in the churn/not churn groups.

After about 6 months, there is an increase in the percentage of customer churn. This is probably due to a change in the price of services. Probably users were connected by promo action. It is worth checking when evaluating the cost of services.

Exploration of customer churn depending on the cost of services

The graphs show that there is an outflow of customers increases at an average price of 10 cents per minute during the day, including roaming. The drop in outflow on the right chart at a price of more than 12 cents for roaming calls is due to the fact that there is not enough data in this range and it makes sense to neglect this information.

Visualization of factors affecting customer churn

Let’s build a t-SNE representation of customer churn data. The t-SNE algorithm is a technique of nonlinear dimensionality reduction and visualization of multidimensional variables that will help us look at how different data correlate.

The blue color indicates customers who have left. The graphs show that there is a small group of customers who have left the company, who do not have roaming and voicemail (intersection of clusters). Most of the customers who left were not connected to the voice mail tariff, but they had an international call service-connected.

Hypotheses

Checking distributions

Let’s check whether this data set is a Gaussian sample or is Gaussian enough to use standard methods (Pearson test) by column (“NOK “ value means “Not Ok”).

Null hypothesis: the data is normally distributed.

Alternative hypothesis: the data is distributed abnormally

NOK:  0.032 account_length
NOK:  0.0 area_code
NOK:  0.0 international_plan
NOK:  0.0 voice_mail_plan
NOK:  0.0 number_vmail_messages
OK:   0.775 total_day_minutes
NOK:  0.001 total_day_calls
OK:   0.775 total_day_charge
OK:   0.805 total_eve_minutes
NOK:  0.033 total_eve_calls
OK:   0.806 total_eve_charge
OK:   0.584 total_night_minutes
OK:   0.523 total_night_calls
OK:   0.585 total_night_charge
NOK:  0.0 total_intl_minutes
NOK:  0.0 total_intl_calls
NOK:  0.0 total_intl_charge
NOK:  0.0 customer_service_calls
NOK:  0.0 phone_number_prefix
NOK:  0.0 phone_number_suffix
NOK:  0.0 churn
OK:   0.598 total_charge
OK:   0.591 total_minutes

Costs for the service of users with roaming and without

In the previous hypothesis, we confirmed that the total_charge field is normally distributed. This means that we can form hypotheses about user spending on different tariffs using the Student’s test.

Null hypothesis: the costs of users with roaming and without roaming are equal.

Alternative hypothesis: the costs of users with roaming and without roaming are not equal

Let’s take alpha = .05

p-value: 0.0051790877246315516
alpha: 0.05

Conclusion: We reject the Null hypothesis

Conclusion

The telecom company provides the service to its customers in all states of the USA. The average duration of calls (2.86 minutes per day) is distributed evenly among all states, excluding 3 states (IA, TN, SC) — there is an increased average conversation duration per day (6.5–7.9 minutes per day). Most of the clients are in the WV staff (106 subscribers), the least in CA (34 subscribers), respectively, the total time of conversations is distributed in the same way (where there are more clients, there are more people talking).

From the provided data set with the size of 3333 records, it can be concluded that the customer churn is 14.49%. The minimum customer churn in HI and AK states is 6%. The maximum — in NJ and CA is 26%, while the outflow is relatively evenly distributed between telephone codes ranging from 14.3% to 14.9%.

90% of customers call abroad and do not have a suitable tariff plan (International Plans option).

Probably the company does not provide the ability to send voice messages at the technical level for those who do not have such an option enabled (vmail option). It is worth noting that not all users who have this option use it (only 28% of customers with the corresponding tariff plan are used).

Customers who call more than 3 times to technical support have a sharply high percentage of churn (more than 40%). Such customers should be tracked and try to solve the problems as quickly as possible or include them in an additional retention marketing plan.

After 6 months of using the service, there is an increase in the percentage of customer churn. Probably, customers have joined with the help of a promo campaign.

Customer churn increases at an average price of 10 cents per minute during the day, including roaming. This is affected by the number of calls and the length of conversations. With a decrease in the cost of a minute during a long conversation, it will probably be possible to reduce the percentage of outflow. The customers who left spent the most, who talked a lot and did not have roaming connected.

It makes sense to add several tariff plans for roaming to reduce customer churn.

When examining the graphs of the ratio of parameters, it can be concluded that there is a small group of customers who have left the company, who have neither roaming nor voice mail. It is worth paying attention to such clients. Most of the customers who left were not connected to the voice mail tariff, but they had an international call service-connected.

References:

Repository:

notebooks/telecom at main · koav/notebooks

Contribute to koav/notebooks development by creating an account on GitHub.

github.com

notebooks/00_EDA_eng.ipynb at main · koav/notebooks

Contribute to koav/notebooks development by creating an account on GitHub.

github.com

Algorithms:

t-distributed stochastic neighbor embedding - Wikipedia

t-distributed stochastic neighbor embedding ( t-SNE) is a statistical method for visualizing high-dimensional data by…

en.wikipedia.org

Pearson's chi-squared test - Wikipedia

Pearson's chi-squared test () is a statistical test applied to sets of categorical data to evaluate how likely it is…

en.wikipedia.org

What recommendations can I give to a telecom company? A business case with customer churn.

The first look

1. The data

2. Information about columns

3. Missing values

4. Correlation

5. Histograms of distributions

6. Categorical values

Exploration data analysis

General information about the service provided by the company

Customer churn

Exploration of customer churn depending on location

Exploration of customer churn depending on international calls

Exploration of customer churn depending on voicemail

Exploration of the customer churn depending on the customer support service

Exploration of customer churn depending on the customer’s lifetime

Exploration of customer churn depending on the cost of services

Visualization of factors affecting customer churn

Hypotheses

Checking distributions

Costs for the service of users with roaming and without

Conclusion

References:

notebooks/telecom at main · koav/notebooks

Contribute to koav/notebooks development by creating an account on GitHub.

notebooks/00_EDA_eng.ipynb at main · koav/notebooks

Contribute to koav/notebooks development by creating an account on GitHub.

t-distributed stochastic neighbor embedding - Wikipedia

t-distributed stochastic neighbor embedding ( t-SNE) is a statistical method for visualizing high-dimensional data by…

Pearson's chi-squared test - Wikipedia

Pearson's chi-squared test () is a statistical test applied to sets of categorical data to evaluate how likely it is…

Written by Andrey Koleda