Analysis of Customer Churn in the Telco Customer Churn Dataset

Matthew Unrue, October 2019

Western Governors University Data Mining and Analytics II Project

Academic%20Excellence%20Award.png

The goal of this data mining and analysis project is to examine customer attrition patterns and to determine what indicators can be used to make better business decisions to prevent loss. Furthermore, a model will be produced in order to predict whether or not a given customer might discontinue business based on a model built from this analysis.

This project is done in Python using the various libraries in the scientific Python ecosystem of scikit-learn with a few other libraries for various uses. Specifically, this involves using pandas for data wrangling, manipulation, and cleaning, matplotlib and Seaborn for data visualization, and scikit-learn itself for model creation and analysis.

Like R and unlike SAS, all of these are easily available, free, and open-source. These methods have been chosen over R for ease of explanation, as Python code is often understood more readily than R, and because of the potential of integrating this project directly into a program or software for future use. While R is highly specialized for statistics and mathematics, Python is a general-purpose programming language with specialized libraries for the needed tools, and this facilitates project expansion in the future.

The primary descriptive method used in this project is FAMD, or Factor Analysis of Mixed Data. FAMD is performed to both examine the variables in order to determine which are the most important, and to reduce the number of variables to examine without overly reducing the explained variance so that the predictive method will be easier to create and use.

The primary predictive method is logistic regression, and this technique is used to build the predictive model. Logistic regression is appropriate because the dependent variable is Boolean and is easily predicted using the probability of binary results that logistic regression provides. Additionally, three of the variables are continuous, which factor easily into any type of linear regression, and the remaining are nominal categorical, which can easily factor into the regression by encoding them in dummy variables. Finally, the coefficients of each parameter that the model will provide can be used to see how much that variable affects the chance of churn. The result is a model that is robust, easy to understand, and is simple to use in predicting if a customer will churn or not.

The essential criteria and phenomenon to be predicted are what factors are present with customers who are leaving, indicated by the Churn column, which can be detected by finding which independent variables have a high positive correlation with the Churn column. Additionally, finding the variables that are associated with a lower churn rate is useful for customer retention, and is an equally important goal. Beyond this, it is important to find what combination of all these factors is most associated with Churn rate so that smart business decisions can be made in the future.

The target and dependent variable in this data is the Churn column, which indicates whether or not that given customer has churned and ceased business. It is a nominal categorical variable that is originally loaded as a string or object dtype with yes/no values but is then tidied into a proper Boolean with True/False values.

There are many potential independent variables that can be used as predictor variables in this dataset. Two of the most important independent variables overall are the MonthlyCharges and the Tenure columns, which contains the continuous quantitative values that represent how much each customer is paying by month according to contract, and how long that customer has had a business relationship with this company. These are among the most important because of how strongly financial figures factor in to customers’ decisions. A higher MonthlyCharges value may easy be representative of more pressure to leave to a cheaper competitor as soon as a contract is over, or in the case of a month by month contract, to end it sooner rather than later. Meanwhile, a higher Tenure value is indicative of a customer who has invested more thus far and is less likely to leave based on any single factor.

For other independent variables, PhoneService and InternetService do an excellent job of grouping customer type, and the various columns of data that require one of these grouping columns to have a value indicating that they’re paying for that type of service can indicate that a disproportionate number of customers that are churning are all a very specific subset of customer. An example of a possibility is that if all customers with internet service, specifically DSL service, through this company are leaving and choosing to not renew contracts after previous ones are up, then reasons as to why this is needed to specifically be looked into. Additionally, that DSL customers could be prime targets for customer relations programs, or improved sales, because of this relation. A competitor may be offering a sale far better than the company currently offers or is willing to match.

The first goal and requirement in manipulating the data is to have a dataset that can be considered both tidy and clean. For data to be tidy, each variable must have its own column, each observation must have its own row, and each value must have its own cell. (Wickham & Grolemund, 2017, 12.2) For data to be clean, it needs to be free from obvious errors, be formatted into the correct form of data, be devoid of missing values, have irrelevant columns removed, and to be consistent within itself. Additionally, the data will be broken down so that dummy encodings (One-Hot Encoding) primarily represent the categorical data as opposed to the string values that exist in the initial dataset so that it can easily be implemented in an appropriate model. Beyond this, outliers and interactions must be found and removed if any exist to ensure the integrity of the resulting analyses and predictive model.

In [1]:
# Import all modules and libraries, as well as set matplotlib plotting to occur in the notebook.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as scs
import prince
import pickle
from sklearn import preprocessing

%matplotlib inline

This dataset has 21 columns, with a number of which may be useful for predicting attrition, which is represented by the final column "Churn." Furthermore, it appears that none of the values are missing, or more specifically, NaN, but there is still potential for outliers or incorrect data.

In [2]:
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
customerID          7043 non-null object
gender              7043 non-null object
SeniorCitizen       7043 non-null int64
Partner             7043 non-null object
Dependents          7043 non-null object
tenure              7043 non-null int64
PhoneService        7043 non-null object
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null object
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7043 non-null object
Churn               7043 non-null object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
In [3]:
df.head(10)
Out[3]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No ... No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes ... Yes No No No One year No Mailed check 56.95 1889.5 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes ... No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes ... Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No ... No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
5 9305-CDSKC Female 0 No No 8 Yes Yes Fiber optic No ... Yes No Yes Yes Month-to-month Yes Electronic check 99.65 820.5 Yes
6 1452-KIOVK Male 0 No Yes 22 Yes Yes Fiber optic No ... No No Yes No Month-to-month Yes Credit card (automatic) 89.10 1949.4 No
7 6713-OKOMC Female 0 No No 10 No No phone service DSL Yes ... No No No No Month-to-month No Mailed check 29.75 301.9 No
8 7892-POOKP Female 0 Yes No 28 Yes Yes Fiber optic No ... Yes Yes Yes Yes Month-to-month Yes Electronic check 104.80 3046.05 Yes
9 6388-TABGU Male 0 No Yes 62 Yes No DSL Yes ... No No No No One year No Bank transfer (automatic) 56.15 3487.95 No

10 rows × 21 columns

To begin the data cleaning process after the initial loading and examining of the dataframe structure, each column is examined and any of the columns that have null values are flagged.

In [4]:
for col in df.columns.values:
    print(col + ": " + str(df[col].isnull().values.any()))
customerID: False
gender: False
SeniorCitizen: False
Partner: False
Dependents: False
tenure: False
PhoneService: False
MultipleLines: False
InternetService: False
OnlineSecurity: False
OnlineBackup: False
DeviceProtection: False
TechSupport: False
StreamingTV: False
StreamingMovies: False
Contract: False
PaperlessBilling: False
PaymentMethod: False
MonthlyCharges: False
TotalCharges: False
Churn: False
In [5]:
n_rows = len(df)
n_rows
Out[5]:
7043

The column customerID appears to be a structured string simply used for identifying customers. While this column is extremely important in relational databases and would be useful if we planned on any dataframe splits and subsequent joins, it is not useful for analysis and will be removed early on to simplify later steps.

In [6]:
df.drop('customerID', axis = 1, inplace = True)
In [7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
gender              7043 non-null object
SeniorCitizen       7043 non-null int64
Partner             7043 non-null object
Dependents          7043 non-null object
tenure              7043 non-null int64
PhoneService        7043 non-null object
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null object
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7043 non-null object
Churn               7043 non-null object
dtypes: float64(1), int64(2), object(17)
memory usage: 1.1+ MB

As the customerID column is the only one that will need removed at this stage of the data exploration, the remainder of this first step is to go through each column, one by one, and do the most basic cleaning and explorative tasks. This includes:

  • Inferring the purpose of each column in the larger perspective of the business in order to interpret the information contained within properly.
  • Changing the column titles to be consistent.
  • Changing categorical values to be consistent like with Boolean values to True and False.
  • Changing the data types, or dtypes, of some dataframe columns to ensure the data is handled appropriately.
  • Examining the numbers and proportion of each categorical value in categorical columns.
  • Examining the distribution and calculate summary statistics like the mean, median, and quartiles of the values of the quantitative columns.
  • Finding and filling in any NaN or null values.

gender is a binary column and has a nearly perfect split between the values. 50.48% of customers are male and the remaining 49.52% are female. To make everything consistant, the column title will be updated.

In [8]:
df = df.rename(columns = {"gender": "Gender"})
In [9]:
df['Gender'].value_counts()
Out[9]:
Male      3555
Female    3488
Name: Gender, dtype: int64
In [10]:
df['Gender'].value_counts(normalize = True)
Out[10]:
Male      0.504756
Female    0.495244
Name: Gender, dtype: float64

SeniorCitizen appears to be a boolean value indicating whether or not the row corresponds to a customer that is a senior citizen, but what exact age that represents is unknown. 16.21% of customers are senior citizens in this dataset. This is the first example of a boolean column in this dataset, and they overall fail to follow a single convention. The boolean-like object columns will all be changed to an actual bool dtype as they come up.

In [11]:
df['SeniorCitizen'] = df['SeniorCitizen'].astype('bool')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
Gender              7043 non-null object
SeniorCitizen       7043 non-null bool
Partner             7043 non-null object
Dependents          7043 non-null object
tenure              7043 non-null int64
PhoneService        7043 non-null object
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null object
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7043 non-null object
Churn               7043 non-null object
dtypes: bool(1), float64(1), int64(1), object(17)
memory usage: 1.0+ MB
In [12]:
df['SeniorCitizen'].value_counts()
Out[12]:
False    5901
True     1142
Name: SeniorCitizen, dtype: int64
In [13]:
df['SeniorCitizen'].value_counts(normalize = True)
Out[13]:
False    0.837853
True     0.162147
Name: SeniorCitizen, dtype: float64

The Partner column may relate to a business partnership or some similar idea, but its position in between age related columns and dependants when the columns seem ordered by topic lead the belief of it indicating the existance of a spouse or established relationship partner instead in boolean form. 48.3% of customers in this dataset have a partner. It is another column that needs changed to the bool dtype.

In [14]:
df['Partner'] = df['Partner'].replace({"No":False, "Yes":True})
In [15]:
df['Partner'] = df['Partner'].astype('bool')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
Gender              7043 non-null object
SeniorCitizen       7043 non-null bool
Partner             7043 non-null bool
Dependents          7043 non-null object
tenure              7043 non-null int64
PhoneService        7043 non-null object
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null object
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7043 non-null object
Churn               7043 non-null object
dtypes: bool(2), float64(1), int64(1), object(16)
memory usage: 1004.3+ KB
In [16]:
df['Partner'].value_counts()
Out[16]:
False    3641
True     3402
Name: Partner, dtype: int64
In [17]:
df['Partner'].value_counts(normalize = True)
Out[17]:
False    0.516967
True     0.483033
Name: Partner, dtype: float64

Likewise, the Dependents column appears to be a boolean value of whether children or other legal dependents exist under the customer a given row represents that also needs converted to bool. 29.96% of customers in this dataset have dependents.

In [18]:
df['Dependents'] = df['Dependents'].replace({"No":False, "Yes":True})
df['Dependents'] = df['Dependents'].astype('bool')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
Gender              7043 non-null object
SeniorCitizen       7043 non-null bool
Partner             7043 non-null bool
Dependents          7043 non-null bool
tenure              7043 non-null int64
PhoneService        7043 non-null object
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null object
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7043 non-null object
Churn               7043 non-null object
dtypes: bool(3), float64(1), int64(1), object(15)
memory usage: 956.2+ KB
In [19]:
df['Dependents'].value_counts()
Out[19]:
False    4933
True     2110
Name: Dependents, dtype: int64
In [20]:
df['Dependents'].value_counts(normalize = True)
Out[20]:
False    0.700412
True     0.299588
Name: Dependents, dtype: float64

The tenure column, whose name notably does not follow the column naming conventions, seems to be the business meaning of the term, or the number of years or some other time period that the customer has been with the company. Interestingly, the numbers range from 0 to 72. The highest value, 72, is the second most common while 0, is the lowest value and the least common. To make everything consistant, the column title will be updated.

In [21]:
df = df.rename(columns = {"tenure": "Tenure"})
In [22]:
df['Tenure'].value_counts()
Out[22]:
1     613
72    362
2     238
3     200
4     176
     ... 
28     57
39     56
44     51
36     50
0      11
Name: Tenure, Length: 73, dtype: int64

PhoneServices is a boolean value indicating whether or not a customer is paying for phone service or not, with 90.32% of customers at this telecommunications company actively having a phone plan. Like others, it needs converted to the bool dtype.

In [23]:
df['PhoneService'] = df['PhoneService'].replace({"No":False, "Yes":True})
df['PhoneService'] = df['PhoneService'].astype('bool')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
Gender              7043 non-null object
SeniorCitizen       7043 non-null bool
Partner             7043 non-null bool
Dependents          7043 non-null bool
Tenure              7043 non-null int64
PhoneService        7043 non-null bool
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null object
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7043 non-null object
Churn               7043 non-null object
dtypes: bool(4), float64(1), int64(1), object(14)
memory usage: 908.0+ KB
In [24]:
df['PhoneService'].value_counts()
Out[24]:
True     6361
False     682
Name: PhoneService, dtype: int64
In [25]:
df['PhoneService'].value_counts(normalize = True)
Out[25]:
True     0.903166
False    0.096834
Name: PhoneService, dtype: float64

MultipleLines is a string value that indicates whether the given customer has multiple lines on their account, has only a single line on their account, or if this column isn't applicable for the customer, that they do not have phone service through this company. 42.18% of total customers in this dataset have multiple lines, and 46.71% of customers with phone service in this dataset have multiple lines.

In [26]:
df['MultipleLines'].value_counts()
Out[26]:
No                  3390
Yes                 2971
No phone service     682
Name: MultipleLines, dtype: int64
In [27]:
n_multiple_lines = df['MultipleLines'].value_counts()[1]
lines_proportion = n_multiple_lines / n_rows
print("{0:.4f}".format(lines_proportion) + "%")
0.4218%
In [28]:
lines_proportion2 = n_multiple_lines / (n_rows - df['MultipleLines'].value_counts()[2])
print("{0:.4f}".format(lines_proportion2) + "%")
0.4671%

InternetService is a string value indicating whether or not a customer is paying for internet service or not, and whether or not this service is for a fiber optic plan or a DSL plan. 90.32% of customers in this data service have an internet plan. 43.96% and 34.37% of total customers in this dataset have a fiber or DSL plan respectively, and 56.12% and 43.88% of customers with internet service have fiber or DSL plans respectively.

Many of the columns in this dataset have a number of possibilities that signify whether or not a service is included in the given customer's plan. Many of these columns would be boolean, but have a third value indicating that they do not have the required service plan for it to be a possibility. While these may be converted into bool dtype columns for later analysis, the boolean-like yes and no values will be changed to True or False on this first pass.

In [29]:
df['InternetService'] = df['InternetService'].replace({"No":"False"})
In [30]:
df['InternetService'].value_counts()
Out[30]:
Fiber optic    3096
DSL            2421
False          1526
Name: InternetService, dtype: int64
In [31]:
n_internet = n_rows - df['InternetService'].value_counts()[2]
internet_proportion = n_internet / n_rows
print("{0:.4f}".format(internet_proportion) + "%")
0.7833%
In [32]:
n_fiber = df['InternetService'].value_counts()[0]
n_dsl = df['InternetService'].value_counts()[1]
fiber_proportion = n_fiber / n_rows
dsl_proportion = n_dsl / n_rows
print("Fiber: {0:.4f}".format(fiber_proportion) + "%\nDSL: {0:.4f}".format(dsl_proportion) + "%")
Fiber: 0.4396%
DSL: 0.3437%
In [33]:
fiber_proportion2 = n_fiber / n_internet
dsl_proportion2 = n_dsl / n_internet
print("Fiber: {0:.4f}".format(fiber_proportion2) + "%\nDSL: {0:.4f}".format(dsl_proportion2) + "%")
Fiber: 0.5612%
DSL: 0.4388%

OnlineSecurity is a string value that indicates whether the given customer has an online security package in addition to their internet service, did not opt for the online security package, or if this column isn't applicable for the customer, that they do not have internet service through this company. 28.67% of total customers in this dataset have the online security package, and 36.6% of customers with internet service in this dataset have the online security package. Its values will be updated to the more boolean-like format.

In [34]:
df['OnlineSecurity'] = df['OnlineSecurity'].replace({"No":"False", "Yes":"True"})
In [35]:
df['OnlineSecurity'].value_counts()
Out[35]:
False                  3498
True                   2019
No internet service    1526
Name: OnlineSecurity, dtype: int64
In [36]:
n_security = df['OnlineSecurity'].value_counts()[1]
security_proportion = n_security / n_rows
print("{0:.4f}".format(security_proportion) + "%")
0.2867%
In [37]:
security_proportion2 = n_security / (n_rows - df['OnlineSecurity'].value_counts()[2])
print("{0:.4f}".format(security_proportion2) + "%")
0.3660%

OnlineBackup is a string value that indicates whether the given customer has an online backup package in addition to their internet service, did not opt for the online backup package, or if this column isn't applicable for the customer, that they do not have internet service through this company. 34.49% of total customers in this dataset have the online security package, and 44.03% of customers with internet service in this dataset have the online security package. Its values will be updated to the more boolean-like format.

In [38]:
df['OnlineBackup'] = df['OnlineBackup'].replace({"No":"False", "Yes":"True"})
In [39]:
df['OnlineBackup'].value_counts()
Out[39]:
False                  3088
True                   2429
No internet service    1526
Name: OnlineBackup, dtype: int64
In [40]:
n_backup = df['OnlineBackup'].value_counts()[1]
backup_proportion = n_backup / n_rows
print("{0:.4f}".format(backup_proportion) + "%")
0.3449%
In [41]:
backup_proportion2 = n_backup / (n_rows - df['OnlineBackup'].value_counts()[2])
print("{0:.4f}".format(backup_proportion2) + "%")
0.4403%

DeviceProtection is a string value that indicates whether the given customer has a device protection plan in addition to their internet service, did not opt for any protection plan, or if this column isn't applicable for the customer, that they do not have internet service through this company. 34.39% of total customers in this dataset have a device protection plan, and 43.90% of customers with internet service in this dataset have a device protection plan. Its values will be updated to the more boolean-like format.

In [42]:
df['DeviceProtection'] = df['DeviceProtection'].replace({"No":"False", "Yes":"True"})
In [43]:
df['DeviceProtection'].value_counts()
Out[43]:
False                  3095
True                   2422
No internet service    1526
Name: DeviceProtection, dtype: int64
In [44]:
n_protection = df['DeviceProtection'].value_counts()[1]
protection_proportion = n_protection / n_rows
print("{0:.4f}".format(protection_proportion) + "%")
0.3439%
In [45]:
protection_proportion2 = n_protection / (n_rows - df['DeviceProtection'].value_counts()[2])
print("{0:.4f}".format(protection_proportion2) + "%")
0.4390%

OnlineBackup and DeviceProtection appear to be the first potential major issue for future analysis. Both require that the customer have a value other than no in the InternetService column, and also have extremely similar numbers, 2429 and 2422, 44.03% and 43.90% respectively. The correlation between the types of column values, namely the No internet service ones, are going to prevent proper testing and model building if tested together.

In [46]:
contingency_table = pd.crosstab(df['OnlineBackup'], df['DeviceProtection'])
contingency_table
Out[46]:
DeviceProtection False No internet service True
OnlineBackup
False 1984 0 1104
No internet service 0 1526 0
True 1111 0 1318

As we will see shortly, a number of the columns require internet service to have meaningful values and are in the same situation as the above section, so the data will have to be restructured to account for this.

TechSupport is a string value that indicates whether the given customer has a tech support package in addition to their internet service, did not opt for any tech support package, or if this column isn't applicable for the customer, that they do not have internet service through this company. 29.02% of total customers in this dataset have a tech support package, and 37.05% of customers with internet service in this dataset have a tech support package. Its values will be updated to the more boolean-like format.

In [47]:
df['TechSupport'] = df['TechSupport'].replace({"No":"False", "Yes":"True"})
In [48]:
df['TechSupport'].value_counts()
Out[48]:
False                  3473
True                   2044
No internet service    1526
Name: TechSupport, dtype: int64
In [49]:
n_support = df['TechSupport'].value_counts()[1]
support_proportion = n_support / n_rows
print("{0:.4f}".format(support_proportion) + "%")
0.2902%
In [50]:
support_proportion2 = n_support / (n_rows - df['TechSupport'].value_counts()[2])
print("{0:.4f}".format(support_proportion2) + "%")
0.3705%

StreamingTV is a string value that indicates whether the given customer has a TV streaming package in addition to their internet service, did not opt for any TV streaming package, or if this column isn't applicable for the customer, that they do not have internet service through this company. 38.44% of total customers in this dataset have a TV streaming package, and 49.07% of customers with internet service in this dataset have a TV streaming package. Its values will be updated to the more boolean-like format.

In [51]:
df['StreamingTV'] = df['StreamingTV'].replace({"No":"False", "Yes":"True"})
In [52]:
df['StreamingTV'].value_counts()
Out[52]:
False                  2810
True                   2707
No internet service    1526
Name: StreamingTV, dtype: int64
In [53]:
n_tv = df['StreamingTV'].value_counts()[1]
tv_proportion = n_tv / n_rows
print("{0:.4f}".format(tv_proportion) + "%")
0.3844%
In [54]:
tv_proportion2 = n_tv / (n_rows - df['StreamingTV'].value_counts()[2])
print("{0:.4f}".format(tv_proportion2) + "%")
0.4907%

StreamingMovies is a string value that indicates whether the given customer has a movie streaming package in addition to their internet service, did not opt for any movie streaming package, or if this column isn't applicable for the customer, that they do not have internet service through this company. 38.79% of total customers in this dataset have a movie streaming package, and 49.52% of customers with internet service in this dataset have a movie streaming package. Its values will be updated to the more boolean-like format.

In [55]:
df['StreamingMovies'] = df['StreamingMovies'].replace({"No":"False", "Yes":"True"})
In [56]:
df['StreamingMovies'].value_counts()
Out[56]:
False                  2785
True                   2732
No internet service    1526
Name: StreamingMovies, dtype: int64
In [57]:
n_movie = df['StreamingMovies'].value_counts()[1]
movie_proportion = n_movie / n_rows
print("{0:.4f}".format(movie_proportion) + "%")
0.3879%
In [58]:
movie_proportion2 = n_movie / (n_rows - df['StreamingMovies'].value_counts()[2])
print("{0:.4f}".format(movie_proportion2) + "%")
0.4952%

Contract is a string value that indicates what type of contract the given customer is under. The contract type values include Month-to-month, One year, and two year. 55.02% of customers are under a month to month contract, 24.07% are under a single year contract, and the remaining 20.91% are under a two year contract.

In [59]:
df['Contract'].value_counts()
Out[59]:
Month-to-month    3875
Two year          1695
One year          1473
Name: Contract, dtype: int64
In [60]:
df['Contract'].value_counts(normalize = True)
Out[60]:
Month-to-month    0.550192
Two year          0.240664
One year          0.209144
Name: Contract, dtype: float64

PaperlessBilling is a boolean value indicating whether or not the given customer has opted for paperless billing or not, with 59.22% of customers at this telecommunications company actively enrolled in paperless billing. As a true boolean column, it will be completely changed to the boolean dtype.

In [61]:
df['PaperlessBilling'] = df['PaperlessBilling'].replace({"No":False, "Yes":True})
df['PaperlessBilling'] = df['PaperlessBilling'].astype('bool')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
Gender              7043 non-null object
SeniorCitizen       7043 non-null bool
Partner             7043 non-null bool
Dependents          7043 non-null bool
Tenure              7043 non-null int64
PhoneService        7043 non-null bool
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null bool
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7043 non-null object
Churn               7043 non-null object
dtypes: bool(5), float64(1), int64(1), object(13)
memory usage: 859.9+ KB
In [62]:
df['PaperlessBilling'].value_counts()
Out[62]:
True     4171
False    2872
Name: PaperlessBilling, dtype: int64
In [63]:
df['PaperlessBilling'].value_counts(normalize = True)
Out[63]:
True     0.592219
False    0.407781
Name: PaperlessBilling, dtype: float64

PaymentMethod is a string value that indicates what type of payment method the given customer uses. The payment method values include electronic checks, traditional checks via mail, automatic bank transfer, and automatic payment with a credit card. 33.58% of customers pay via electronic check, 22.89% pay via mailed check, 21.92% pay automatically via bank transfer, and the remaining 21.61% pay automatically via credit card.

In [64]:
df['PaymentMethod'].value_counts()
Out[64]:
Electronic check             2365
Mailed check                 1612
Bank transfer (automatic)    1544
Credit card (automatic)      1522
Name: PaymentMethod, dtype: int64
In [65]:
df['PaymentMethod'].value_counts(normalize = True)
Out[65]:
Electronic check             0.335794
Mailed check                 0.228880
Bank transfer (automatic)    0.219225
Credit card (automatic)      0.216101
Name: PaymentMethod, dtype: float64

MonthlyCharges is a float value indicating the required monthly payment of a given customer. The mean value is 64.76, the mode value 20.05, the minimum 18.25, and the maximum value 118.75. This distribution is extremely skewed to the right.

In [66]:
df['MonthlyCharges'].value_counts()
Out[66]:
20.05     61
19.85     45
19.95     44
19.90     44
20.00     43
          ..
114.75     1
103.60     1
113.40     1
57.65      1
113.30     1
Name: MonthlyCharges, Length: 1585, dtype: int64
In [67]:
df['MonthlyCharges'].describe()
Out[67]:
count    7043.000000
mean       64.761692
std        30.090047
min        18.250000
25%        35.500000
50%        70.350000
75%        89.850000
max       118.750000
Name: MonthlyCharges, dtype: float64
In [68]:
plt.hist(df['MonthlyCharges'])
plt.title('Distribution of MonthlyCharges')
plt.xlabel('Monthly Charge')
plt.ylabel('Monthly Charge Frequency')
plt.grid(True)
plt.show()

TotalCharges appears to be a column that represents the total amount a given customer has paid throughout all transactions of the business-customer relationship. However, the column is loaded as the object dtype, or a string, instead of a float. This will have to be changed to examine the data effectively.

In [69]:
df['TotalCharges'].describe()
Out[69]:
count     7043
unique    6531
top           
freq        11
Name: TotalCharges, dtype: object

These 11 rows have missing values for their TotalCharges column, but this appears to be the only missing column. Notably, these customers' Tenure values are all 0, and none of them have churned. It seems likely that these customers don't have a TotalCharges value simply because they haven't been charged yet. As such, their TotalCharges value will be set to 0 to allow for later analysis.

In [70]:
df.loc[pd.to_numeric(df['TotalCharges'], errors='coerce').isnull()]
Out[70]:
Gender SeniorCitizen Partner Dependents Tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
488 Female False True True 0 False No phone service DSL True False True True True False Two year True Bank transfer (automatic) 52.55 No
753 Male False False True 0 True No False No internet service No internet service No internet service No internet service No internet service No internet service Two year False Mailed check 20.25 No
936 Female False True True 0 True No DSL True True True False True True Two year False Mailed check 80.85 No
1082 Male False True True 0 True Yes False No internet service No internet service No internet service No internet service No internet service No internet service Two year False Mailed check 25.75 No
1340 Female False True True 0 False No phone service DSL True True True True True False Two year False Credit card (automatic) 56.05 No
3331 Male False True True 0 True No False No internet service No internet service No internet service No internet service No internet service No internet service Two year False Mailed check 19.85 No
3826 Male False True True 0 True Yes False No internet service No internet service No internet service No internet service No internet service No internet service Two year False Mailed check 25.35 No
4380 Female False True True 0 True No False No internet service No internet service No internet service No internet service No internet service No internet service Two year False Mailed check 20.00 No
5218 Male False True True 0 True No False No internet service No internet service No internet service No internet service No internet service No internet service One year True Mailed check 19.70 No
6670 Female False True True 0 True Yes DSL False True True True True False Two year False Mailed check 73.35 No
6754 Male False False True 0 True Yes DSL True True False True False False Two year True Bank transfer (automatic) 61.90 No
In [71]:
tofix = [488, 753, 936, 1082, 1340, 3331, 3826, 4380, 5218, 6670, 6754]

df.loc[tofix, 'TotalCharges'] = 0
In [72]:
df.iloc[tofix]['TotalCharges']
Out[72]:
488     0
753     0
936     0
1082    0
1340    0
3331    0
3826    0
4380    0
5218    0
6670    0
6754    0
Name: TotalCharges, dtype: object

Now we can change the column to the float dtype and then examine the data. The mean TotalCharges value is 2283.30, the minimum value is 18.80, and the maximum is 8684.80. Overall, the column is very right-skewed.

In [73]:
# Correct the TotalCharges column's dtype from object to float.
df['TotalCharges'] = df['TotalCharges'].astype('float')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
Gender              7043 non-null object
SeniorCitizen       7043 non-null bool
Partner             7043 non-null bool
Dependents          7043 non-null bool
Tenure              7043 non-null int64
PhoneService        7043 non-null bool
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null bool
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7043 non-null float64
Churn               7043 non-null object
dtypes: bool(5), float64(2), int64(1), object(12)
memory usage: 859.9+ KB
In [74]:
df['TotalCharges'].describe()
Out[74]:
count    7043.000000
mean     2279.734304
std      2266.794470
min         0.000000
25%       398.550000
50%      1394.550000
75%      3786.600000
max      8684.800000
Name: TotalCharges, dtype: float64
In [75]:
plt.hist(df['TotalCharges'])
plt.title('Distribution of TotalCharges')
plt.xlabel('Total Charge')
plt.ylabel('Total Charge Frequency')
plt.grid(True)
plt.show()

Finally, the last column of the dataset is the Churn column, which is a boolean value indicating on whether or not the customer has ceased business. 26.54% customers in this dataset have churned. As a true boolean column, it will be completely changed to the boolean dtype.

In [76]:
df['Churn'] = df['Churn'].replace({"No":False, "Yes":True})
df['Churn'] = df['Churn'].astype('bool')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
Gender              7043 non-null object
SeniorCitizen       7043 non-null bool
Partner             7043 non-null bool
Dependents          7043 non-null bool
Tenure              7043 non-null int64
PhoneService        7043 non-null bool
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null bool
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7043 non-null float64
Churn               7043 non-null bool
dtypes: bool(6), float64(2), int64(1), object(11)
memory usage: 811.7+ KB
In [77]:
df['Churn'].value_counts()
Out[77]:
False    5174
True     1869
Name: Churn, dtype: int64
In [78]:
df['Churn'].value_counts(normalize = True)
Out[78]:
False    0.73463
True     0.26537
Name: Churn, dtype: float64
In [79]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
Gender              7043 non-null object
SeniorCitizen       7043 non-null bool
Partner             7043 non-null bool
Dependents          7043 non-null bool
Tenure              7043 non-null int64
PhoneService        7043 non-null bool
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null bool
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7043 non-null float64
Churn               7043 non-null bool
dtypes: bool(6), float64(2), int64(1), object(11)
memory usage: 811.7+ KB

In this dataset, the three continuous variable columns do not have any outliers. None are present based on IQR range, as seen on the boxplots, and none are present based on the z-score with a threshold of 3 standard deviations.

Tenure

In [80]:
sns.boxplot(df['Tenure'])
Out[80]:
<matplotlib.axes._subplots.AxesSubplot at 0x20574365b38>
In [81]:
tenure_z = np.abs(scs.zscore(df['Tenure']))
tenure_z
Out[81]:
array([1.27744458, 0.06632742, 1.23672422, ..., 0.87024095, 1.15528349,
       1.36937906])
In [82]:
print(np.where(tenure_z > 3))
(array([], dtype=int64),)

MonthlyCharges

In [83]:
sns.boxplot(df['MonthlyCharges'])
Out[83]:
<matplotlib.axes._subplots.AxesSubplot at 0x205743ba080>
In [84]:
monthly_z = np.abs(scs.zscore(df['MonthlyCharges']))
monthly_z
Out[84]:
array([1.16032292, 0.25962894, 0.36266036, ..., 1.1686319 , 0.32033821,
       1.35896134])
In [85]:
print(np.where(monthly_z > 3))
(array([], dtype=int64),)

TotalCharges

In [86]:
sns.boxplot(df['TotalCharges'])
Out[86]:
<matplotlib.axes._subplots.AxesSubplot at 0x20574438a90>
In [87]:
total_z = np.abs(scs.zscore(df['TotalCharges']))
total_z
Out[87]:
array([0.99261052, 0.17216471, 0.9580659 , ..., 0.85293201, 0.87051315,
       2.01389665])
In [88]:
print(np.where(total_z > 3))
(array([], dtype=int64),)

Contingency tables will be calculated for all variables in order to see how they appear to affect the churn rate before using more intensive methods.

However, this is also where the categorical variables become an issue for analyzing the data set. The next step in the overall process is to transform the data into the structure that the models need. As logistic regression will be performed, the categorical columns will need to be encoded into dummy variables, which are expressed in Boolean values.

In addition to the information encoding needing changed, a number of these variables are inherently correlated with each other; e.g. MultipleLines with PhoneService, and OnlineSecurity with InternetService. To do meaningful tests of significance, this intercorrelation needs to be removed before the tests are run. Specifically, the problem is that the values of "no phone service" and "no internet service" are also factoring into the tests. There are a few approaches available in order deal with this.

The first is to split the dataframe into three distinct dataframes, one for customers with only phone service, one for customers with only internet service, and the last for customers with both phone and internet service through this company. This is less than ideal for several reasons. Rather than a single dataframe with single calculations, this would then create three dataframes and require three tests for each calculation, further complicating how all of the tests would be combined.

In [89]:
phone_count = df.query("PhoneService == True and InternetService == 'False'").shape[0]
phone_count
Out[89]:
1526
In [90]:
internet_count = df.query("PhoneService == False and InternetService != 'False'").shape[0]
internet_count
Out[90]:
682
In [91]:
both_count = df.query("PhoneService == True and InternetService != 'False'").shape[0]
both_count
Out[91]:
4835
In [92]:
(phone_count + internet_count + both_count) == n_rows
Out[92]:
True

The second option is to simply remove a subsection of the rows, effectively removing columns of data in the process. This would technically work but it would also remove a significant amount of information in the process.

The remaining option is to manipulate the encoding and representation of the data. The categories that these all involve can be split into multiple boolean columns representing the "positive" values of the original columns. As an example, the MultipleLines column would be changed to be a boolean True/False option, with all rows that do not have phone service also being represented as False. Ultimately, the multiple logistic regression model is only going to adjust the calculation with a parameter if the value is true, and to create the model, these dummy variables have to be created regardless.

In [93]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
Gender              7043 non-null object
SeniorCitizen       7043 non-null bool
Partner             7043 non-null bool
Dependents          7043 non-null bool
Tenure              7043 non-null int64
PhoneService        7043 non-null bool
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null bool
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7043 non-null float64
Churn               7043 non-null bool
dtypes: bool(6), float64(2), int64(1), object(11)
memory usage: 811.7+ KB
In [94]:
original_df = df.copy(deep = True)

Gender

In [95]:
df['Gender'].value_counts()
Out[95]:
Male      3555
Female    3488
Name: Gender, dtype: int64
In [96]:
df['Gender'] = df['Gender'].replace({"Male":False, "Female":True})
df['Gender'] = df['Gender'].astype('bool')
df['Gender'].value_counts()
Out[96]:
False    3555
True     3488
Name: Gender, dtype: int64

MultipleLines

In [97]:
df['MultipleLines'].value_counts()
Out[97]:
No                  3390
Yes                 2971
No phone service     682
Name: MultipleLines, dtype: int64
In [98]:
df.loc[df['MultipleLines'] == "No phone service", 'MultipleLines'] = np.NaN
In [99]:
df['MultipleLines'] = df['MultipleLines'].replace({"No":False, "Yes":True})
df['MultipleLines'] = df['MultipleLines'].fillna(False).astype('bool')
df['MultipleLines'].value_counts()
Out[99]:
False    4072
True     2971
Name: MultipleLines, dtype: int64

FiberOpticService

In [100]:
df['InternetService'].value_counts()
Out[100]:
Fiber optic    3096
DSL            2421
False          1526
Name: InternetService, dtype: int64
In [101]:
df['FiberOpticService'] = False
In [102]:
df.loc[df['InternetService'] == "Fiber optic", 'FiberOpticService'] = True
In [103]:
df['FiberOpticService'].value_counts()
Out[103]:
False    3947
True     3096
Name: FiberOpticService, dtype: int64

DSLService

In [104]:
df['DSLService'] = False
In [105]:
df.loc[df['InternetService'] == "DSL", 'DSLService'] = True
In [106]:
df['DSLService'].value_counts()
Out[106]:
False    4622
True     2421
Name: DSLService, dtype: int64
In [107]:
df.drop('InternetService', axis = 1, inplace = True)

OnlineSecurity

In [108]:
df['OnlineSecurity'].value_counts()
Out[108]:
False                  3498
True                   2019
No internet service    1526
Name: OnlineSecurity, dtype: int64
In [109]:
df.loc[df['OnlineSecurity'] == "No internet service", 'OnlineSecurity'] = np.NaN
In [110]:
df['OnlineSecurity'] = df['OnlineSecurity'].replace({"False":False, "True":True})
df['OnlineSecurity'] = df['OnlineSecurity'].fillna(False).astype('bool')
df['OnlineSecurity'].value_counts()
Out[110]:
False    5024
True     2019
Name: OnlineSecurity, dtype: int64

OnlineBackup

In [111]:
df['OnlineBackup'].value_counts()
Out[111]:
False                  3088
True                   2429
No internet service    1526
Name: OnlineBackup, dtype: int64
In [112]:
df.loc[df['OnlineBackup'] == "No internet service", 'OnlineBackup'] = np.NaN
In [113]:
df['OnlineBackup'] = df['OnlineBackup'].replace({"False":False, "True":True})
df['OnlineBackup'] = df['OnlineBackup'].fillna(False).astype('bool')
df['OnlineBackup'].value_counts()
Out[113]:
False    4614
True     2429
Name: OnlineBackup, dtype: int64

DeviceProtection

In [114]:
df['DeviceProtection'].value_counts()
Out[114]:
False                  3095
True                   2422
No internet service    1526
Name: DeviceProtection, dtype: int64
In [115]:
df.loc[df['DeviceProtection'] == "No internet service", 'DeviceProtection'] = np.NaN
In [116]:
df['DeviceProtection'] = df['DeviceProtection'].replace({"False":False, "True":True})
df['DeviceProtection'] = df['DeviceProtection'].fillna(False).astype('bool')
df['DeviceProtection'].value_counts()
Out[116]:
False    4621
True     2422
Name: DeviceProtection, dtype: int64

TechSupport

In [117]:
df['TechSupport'].value_counts()
Out[117]:
False                  3473
True                   2044
No internet service    1526
Name: TechSupport, dtype: int64
In [118]:
df.loc[df['TechSupport'] == "No internet service", 'TechSupport'] = np.NaN
In [119]:
df['TechSupport'] = df['TechSupport'].replace({"False":False, "True":True})
df['TechSupport'] = df['TechSupport'].fillna(False).astype('bool')
df['TechSupport'].value_counts()
Out[119]:
False    4999
True     2044
Name: TechSupport, dtype: int64

StreamingTV

In [120]:
df['StreamingTV'].value_counts()
Out[120]:
False                  2810
True                   2707
No internet service    1526
Name: StreamingTV, dtype: int64
In [121]:
df.loc[df['StreamingTV'] == "No internet service", 'StreamingTV'] = np.NaN
In [122]:
df['StreamingTV'] = df['StreamingTV'].replace({"False":False, "True":True})
df['StreamingTV'] = df['StreamingTV'].fillna(False).astype('bool')
df['StreamingTV'].value_counts()
Out[122]:
False    4336
True     2707
Name: StreamingTV, dtype: int64

StreamingMovies

In [123]:
df['StreamingMovies'].value_counts()
Out[123]:
False                  2785
True                   2732
No internet service    1526
Name: StreamingMovies, dtype: int64
In [124]:
df.loc[df['StreamingMovies'] == "No internet service", 'StreamingMovies'] = np.NaN
In [125]:
df['StreamingMovies'] = df['StreamingMovies'].replace({"False":False, "True":True})
df['StreamingMovies'] = df['StreamingMovies'].fillna(False).astype('bool')
df['StreamingMovies'].value_counts()
Out[125]:
False    4311
True     2732
Name: StreamingMovies, dtype: int64

Contract

The most common type of contract is the Month-to-month contract, so it will be the reference level, having both the new one year and two year contract columns as False.

In [126]:
df['Contract'].value_counts()
Out[126]:
Month-to-month    3875
Two year          1695
One year          1473
Name: Contract, dtype: int64
In [127]:
df['OneYearContract'] = False
df['TwoYearContract'] = False
In [128]:
df.loc[df['Contract'] == "One year", 'OneYearContract'] = True
df.loc[df['Contract'] == "Two year", 'TwoYearContract'] = True
In [129]:
df['OneYearContract'].value_counts()
Out[129]:
False    5570
True     1473
Name: OneYearContract, dtype: int64
In [130]:
df['TwoYearContract'].value_counts()
Out[130]:
False    5348
True     1695
Name: TwoYearContract, dtype: int64
In [131]:
df.drop('Contract', axis = 1, inplace = True)

PaymentMethod

The most common type of payment method is by electronic check, so it will be the reference level, having the new columns mailed check, bank transfer, and credit card as False.

In [132]:
df['PaymentMethod'].value_counts()
Out[132]:
Electronic check             2365
Mailed check                 1612
Bank transfer (automatic)    1544
Credit card (automatic)      1522
Name: PaymentMethod, dtype: int64
In [133]:
df['MailedCheckPayment'] = False
df['BankTransferPayment'] = False
df['CreditCardPayment'] = False
In [134]:
df.loc[df['PaymentMethod'] == "Mailed check", 'MailedCheckPayment'] = True
df.loc[df['PaymentMethod'] == "Bank transfer (automatic)", 'BankTransferPayment'] = True
df.loc[df['PaymentMethod'] == "Credit card (automatic)", 'CreditCardPayment'] = True
In [135]:
df['MailedCheckPayment'].value_counts()
Out[135]:
False    5431
True     1612
Name: MailedCheckPayment, dtype: int64
In [136]:
df['BankTransferPayment'].value_counts()
Out[136]:
False    5499
True     1544
Name: BankTransferPayment, dtype: int64
In [137]:
df['CreditCardPayment'].value_counts()
Out[137]:
False    5521
True     1522
Name: CreditCardPayment, dtype: int64
In [138]:
df.drop('PaymentMethod', axis = 1, inplace = True)

Finally, reorganize all of the columns back to the order that the original dataset was in.

In [139]:
cols = df.columns
cols
Out[139]:
Index(['Gender', 'SeniorCitizen', 'Partner', 'Dependents', 'Tenure',
       'PhoneService', 'MultipleLines', 'OnlineSecurity', 'OnlineBackup',
       'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
       'PaperlessBilling', 'MonthlyCharges', 'TotalCharges', 'Churn',
       'FiberOpticService', 'DSLService', 'OneYearContract', 'TwoYearContract',
       'MailedCheckPayment', 'BankTransferPayment', 'CreditCardPayment'],
      dtype='object')
In [140]:
# Reorders the columns back to how the original dataframe had them.
df = df[['Gender', 'SeniorCitizen', 'Partner', 'Dependents', 'Tenure',
         'PhoneService', 'MultipleLines', 'FiberOpticService', 'DSLService',
         'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
         'StreamingTV', 'StreamingMovies', 'OneYearContract', 'TwoYearContract',
         'PaperlessBilling', 'MailedCheckPayment', 'BankTransferPayment',
         'CreditCardPayment', 'MonthlyCharges', 'TotalCharges', 'Churn']]

This section culminates into the dataframe below. All categorical variables have been recoded into Boolean values for the future regression modeling and the continuous variables are still intact as integer for the Tenure column, and floating point for the two charges columns.

In [141]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 24 columns):
Gender                 7043 non-null bool
SeniorCitizen          7043 non-null bool
Partner                7043 non-null bool
Dependents             7043 non-null bool
Tenure                 7043 non-null int64
PhoneService           7043 non-null bool
MultipleLines          7043 non-null bool
FiberOpticService      7043 non-null bool
DSLService             7043 non-null bool
OnlineSecurity         7043 non-null bool
OnlineBackup           7043 non-null bool
DeviceProtection       7043 non-null bool
TechSupport            7043 non-null bool
StreamingTV            7043 non-null bool
StreamingMovies        7043 non-null bool
OneYearContract        7043 non-null bool
TwoYearContract        7043 non-null bool
PaperlessBilling       7043 non-null bool
MailedCheckPayment     7043 non-null bool
BankTransferPayment    7043 non-null bool
CreditCardPayment      7043 non-null bool
MonthlyCharges         7043 non-null float64
TotalCharges           7043 non-null float64
Churn                  7043 non-null bool
dtypes: bool(21), float64(2), int64(1)
memory usage: 309.6 KB

As all of the distributions have been examined in the data exploration and cleaning stage, this section acts as a recap and a point at which to actually represent them graphically. The following code reformats the cleaned dataframe into its long form so that the resulting graphs are generated together in a single chart with shared axes. This should allow for easy comparison of proportions. The large chart of all categorical variables is followed by the summary statistics and a histogram showing the distribution of each of the continuous variables.

Categorical Variable Distributions

In [142]:
viz_df = df.drop(['Tenure', 'MonthlyCharges', 'TotalCharges'], axis = 1)
viz_df.columns
Out[142]:
Index(['Gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService',
       'MultipleLines', 'FiberOpticService', 'DSLService', 'OnlineSecurity',
       'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
       'StreamingMovies', 'OneYearContract', 'TwoYearContract',
       'PaperlessBilling', 'MailedCheckPayment', 'BankTransferPayment',
       'CreditCardPayment', 'Churn'],
      dtype='object')
In [143]:
melted = viz_df.melt()
In [144]:
melted
Out[144]:
variable value
0 Gender True
1 Gender False
2 Gender False
3 Gender False
4 Gender True
... ... ...
147898 Churn False
147899 Churn False
147900 Churn False
147901 Churn True
147902 Churn False

147903 rows × 2 columns

In [289]:
g = sns.FacetGrid(
    melted,
    col = 'variable',
    hue = 'value',
    sharey = 'row',
    sharex = 'col',
    col_wrap = 7,
    legend_out = True,
)

g = g.map(sns.countplot, 'value', order = [False, True]).add_legend()

plt.subplots_adjust(top = 0.9)
g.fig.suptitle('Univariate Categorical Variable Distributions')

# Add the percentage proportion value for each column in each variable plot.
for i, ax in enumerate(g.axes):
    # Set each plot's title to its corresponding column name.
    ax.set_title(viz_df.columns[i])
    for p in ax.patches:
        percentage = "{0:.2f}".format((p.get_height() / n_rows) * 100) + "%"
        
        ax.annotate(percentage, (p.get_x() + p.get_width() / 2., p.get_height()),
                   ha = 'center', va = 'center', xytext = (0, 7), textcoords = 'offset points')

        
g.savefig("Charts/Univariate Distributions.svg")
g
Out[289]:
<seaborn.axisgrid.FacetGrid at 0x20576d9dac8>

Continuous Variable Distributions

In [146]:
df['Tenure'].describe()
Out[146]:
count    7043.000000
mean       32.371149
std        24.559481
min         0.000000
25%         9.000000
50%        29.000000
75%        55.000000
max        72.000000
Name: Tenure, dtype: float64
In [147]:
plt.hist(df['Tenure'])
plt.title('Distribution of Tenure')
plt.xlabel('Tenure')
plt.ylabel('Tenure Frequency')
plt.grid(True)
plt.show()
In [148]:
df['MonthlyCharges'].describe()
Out[148]:
count    7043.000000
mean       64.761692
std        30.090047
min        18.250000
25%        35.500000
50%        70.350000
75%        89.850000
max       118.750000
Name: MonthlyCharges, dtype: float64
In [149]:
plt.hist(df['MonthlyCharges'])
plt.title('Distribution of MonthlyCharges')
plt.xlabel('Monthly Charge')
plt.ylabel('Monthly Charge Frequency')
plt.grid(True)
plt.show()
In [150]:
df['TotalCharges'].describe()
Out[150]:
count    7043.000000
mean     2279.734304
std      2266.794470
min         0.000000
25%       398.550000
50%      1394.550000
75%      3786.600000
max      8684.800000
Name: TotalCharges, dtype: float64
In [151]:
plt.hist(df['TotalCharges'])
plt.title('Distribution of TotalCharges')
plt.xlabel('Total Charge')
plt.ylabel('Total Charge Frequency')
plt.grid(True)
plt.show()

Bivariate Distribution Visualization

For ease of interpretation and usage given the 2x2 relationships between churn and the other categorical variables, simple contingency tables were used to visualize the categorical bivariate distributions. Bar, pie, and mosaic plots were all considered, but all seemed to overcomplicate the presentation. In each of the following sections, the first table is the raw numbers of the categorical variables, and the second table contains the same information but normalized into proportions. The two columns Contract and PaymentMethod have more values than just True/False. These are visualized with bar plots grouped by the rows’ Churn values.

In [152]:
gender_contingency = pd.crosstab(df["Gender"], df["Churn"])
gender_contingency
Out[152]:
Churn False True
Gender
False 2625 930
True 2549 939
In [153]:
pd.crosstab(df["Gender"], df["Churn"], normalize = "index")
Out[153]:
Churn False True
Gender
False 0.738397 0.261603
True 0.730791 0.269209

The $\chi^2$ test does not show significance, so this column will not be examined in more detail later.

In [154]:
scs.chi2_contingency(gender_contingency)
Out[154]:
(0.4840828822091383,
 0.48657873605618596,
 1,
 array([[2611.61010933,  943.38989067],
        [2562.38989067,  925.61010933]]))
In [155]:
senior_contingency = pd.crosstab(df["SeniorCitizen"], df["Churn"])
senior_contingency
Out[155]:
Churn False True
SeniorCitizen
False 4508 1393
True 666 476
In [156]:
pd.crosstab(df["SeniorCitizen"], df["Churn"], normalize = "index")
Out[156]:
Churn False True
SeniorCitizen
False 0.763938 0.236062
True 0.583187 0.416813

The $\chi^2$ test shows significance, so this column should be examined in more detail later.

In [157]:
scs.chi2_contingency(senior_contingency)
Out[157]:
(159.42630036838742,
 1.510066805092378e-36,
 1,
 array([[4335.05239245, 1565.94760755],
        [ 838.94760755,  303.05239245]]))
In [158]:
partner_contingency = pd.crosstab(df["Partner"], df["Churn"])
partner_contingency
Out[158]:
Churn False True
Partner
False 2441 1200
True 2733 669
In [159]:
pd.crosstab(df["Partner"], df["Churn"], normalize = "index")
Out[159]:
Churn False True
Partner
False 0.670420 0.329580
True 0.803351 0.196649

The $\chi^2$ test shows significance, so this column should be examined in more detail later.

In [160]:
scs.chi2_contingency(partner_contingency)
Out[160]:
(158.7333820309922,
 2.1399113440759935e-36,
 1,
 array([[2674.78830044,  966.21169956],
        [2499.21169956,  902.78830044]]))
In [161]:
dependents_contingency = pd.crosstab(df["Dependents"], df["Churn"])
dependents_contingency
Out[161]:
Churn False True
Dependents
False 3390 1543
True 1784 326
In [162]:
pd.crosstab(df["Dependents"], df["Churn"], normalize = "index")
Out[162]:
Churn False True
Dependents
False 0.687209 0.312791
True 0.845498 0.154502

The $\chi^2$ test shows significance, so this column should be examined in more detail later.

In [163]:
scs.chi2_contingency(dependents_contingency)
Out[163]:
(189.12924940423474,
 4.9249216612154196e-43,
 1,
 array([[3623.93042737, 1309.06957263],
        [1550.06957263,  559.93042737]]))
In [164]:
phone_contingency = pd.crosstab(df["PhoneService"], df["Churn"])
phone_contingency
Out[164]:
Churn False True
PhoneService
False 512 170
True 4662 1699
In [165]:
pd.crosstab(df["PhoneService"], df["Churn"], normalize = "index")
Out[165]:
Churn False True
PhoneService
False 0.750733 0.249267
True 0.732904 0.267096

The $\chi^2$ test does not show significance, so this column will not be examined in more detail later.

In [166]:
scs.chi2_contingency(phone_contingency)
Out[166]:
(0.9150329892546948,
 0.3387825358066928,
 1,
 array([[ 501.01774812,  180.98225188],
        [4672.98225188, 1688.01774812]]))
In [167]:
lines_contingency = pd.crosstab(df["MultipleLines"], df["Churn"])
lines_contingency
Out[167]:
Churn False True
MultipleLines
False 3053 1019
True 2121 850
In [168]:
pd.crosstab(df["MultipleLines"], df["Churn"], normalize = "index")
Out[168]:
Churn False True
MultipleLines
False 0.749754 0.250246
True 0.713901 0.286099

The $\chi^2$ test shows significance, so this column should be examined in more detail later. It may be that a greater number of lines increases the probability of churn as a result of the higher cost overall.

In [169]:
scs.chi2_contingency(lines_contingency)
Out[169]:
(11.143251001456251,
 0.0008433795342472428,
 1,
 array([[2991.41388613, 1080.58611387],
        [2182.58611387,  788.41388613]]))
In [170]:
fiber_contingency = pd.crosstab(df["FiberOpticService"], df["Churn"])
fiber_contingency
Out[170]:
Churn False True
FiberOpticService
False 3375 572
True 1799 1297
In [171]:
pd.crosstab(df["FiberOpticService"], df["Churn"], normalize = "index")
Out[171]:
Churn False True
FiberOpticService
False 0.855080 0.144920
True 0.581072 0.418928

The $\chi^2$ test shows significance, so this column should be examined in more detail later. Customers who had been purchasing fiber optic services in this data set have a substantial difference in churn rate and should be examined in much greater detail.

In [172]:
scs.chi2_contingency(fiber_contingency)
Out[172]:
(666.8080208747958,
 4.940476033744708e-147,
 1,
 array([[2899.58511998, 1047.41488002],
        [2274.41488002,  821.58511998]]))
In [173]:
dsl_contingency = pd.crosstab(df["DSLService"], df["Churn"])
dsl_contingency
Out[173]:
Churn False True
DSLService
False 3212 1410
True 1962 459
In [174]:
pd.crosstab(df["DSLService"], df["Churn"], normalize = "index")
Out[174]:
Churn False True
DSLService
False 0.694937 0.305063
True 0.810409 0.189591

The $\chi^2$ test shows significance, so this column should be examined in more detail later. Customers who had not been purchasing DSL services in this data set have a substantial difference in churn rate and should be examined in much greater detail.

In [175]:
scs.chi2_contingency(dsl_contingency)
Out[175]:
(108.07545746651793,
 2.587376621108007e-25,
 1,
 array([[3395.46045719, 1226.53954281],
        [1778.53954281,  642.46045719]]))
In [176]:
security_contingency = pd.crosstab(df["OnlineSecurity"], df["Churn"])
security_contingency
Out[176]:
Churn False True
OnlineSecurity
False 3450 1574
True 1724 295
In [177]:
pd.crosstab(df["OnlineSecurity"], df["Churn"], normalize = "index")
Out[177]:
Churn False True
OnlineSecurity
False 0.686704 0.313296
True 0.853888 0.146112

The $\chi^2$ test shows significance, so this column should be examined in more detail later. Customers who had been purchasing internet service but not the online security package in this data set have a substantial difference in churn rate and should be examined in much greater detail.

In [178]:
scs.chi2_contingency(security_contingency)
Out[178]:
(205.63310416062058,
 1.2320984831180024e-46,
 1,
 array([[3690.78176913, 1333.21823087],
        [1483.21823087,  535.78176913]]))
In [179]:
backup_contingency = pd.crosstab(df["OnlineBackup"], df["Churn"])
backup_contingency
Out[179]:
Churn False True
OnlineBackup
False 3268 1346
True 1906 523
In [180]:
pd.crosstab(df["OnlineBackup"], df["Churn"], normalize = "index")
Out[180]:
Churn False True
OnlineBackup
False 0.708279 0.291721
True 0.784685 0.215315

The $\chi^2$ test shows significance, so this column should be examined in more detail later.

In [181]:
scs.chi2_contingency(backup_contingency)
Out[181]:
(47.260854003612764,
 6.214092807254819e-12,
 1,
 array([[3389.58341616, 1224.41658384],
        [1784.41658384,  644.58341616]]))
In [182]:
dev_protection_contingency = pd.crosstab(df["DeviceProtection"], df["Churn"])
dev_protection_contingency
Out[182]:
Churn False True
DeviceProtection
False 3297 1324
True 1877 545
In [183]:
pd.crosstab(df["DeviceProtection"], df["Churn"], normalize = "index")
Out[183]:
Churn False True
DeviceProtection
False 0.713482 0.286518
True 0.774979 0.225021

The $\chi^2$ test shows significance, so this column should be examined in more detail later.

In [184]:
scs.chi2_contingency(dev_protection_contingency)
Out[184]:
(30.513394539261306,
 3.315693222362861e-08,
 1,
 array([[3394.72582706, 1226.27417294],
        [1779.27417294,  642.72582706]]))
In [185]:
support_contingency = pd.crosstab(df["TechSupport"], df["Churn"])
support_contingency
Out[185]:
Churn False True
TechSupport
False 3440 1559
True 1734 310
In [186]:
pd.crosstab(df["TechSupport"], df["Churn"], normalize = "index")
Out[186]:
Churn False True
TechSupport
False 0.688138 0.311862
True 0.848337 0.151663

The $\chi^2$ test shows significance, so this column should be examined in more detail later. It appears that customers who has a tech support package had a significantly lower churn rate.

In [187]:
scs.chi2_contingency(support_contingency)
Out[187]:
(190.16684201526067,
 2.9235674453140758e-43,
 1,
 array([[3672.4160159, 1326.5839841],
        [1501.5839841,  542.4160159]]))
In [188]:
tv_contingency = pd.crosstab(df["StreamingTV"], df["Churn"])
tv_contingency
Out[188]:
Churn False True
StreamingTV
False 3281 1055
True 1893 814
In [189]:
pd.crosstab(df["StreamingTV"], df["Churn"], normalize = "index")
Out[189]:
Churn False True
StreamingTV
False 0.756688 0.243312
True 0.699298 0.300702

The $\chi^2$ test shows significance, so this column should be examined in more detail later.

In [190]:
scs.chi2_contingency(tv_contingency)
Out[190]:
(27.862522274233417,
 1.3024835736732686e-07,
 1,
 array([[3185.35624024, 1150.64375976],
        [1988.64375976,  718.35624024]]))
In [191]:
movies_contingency = pd.crosstab(df["StreamingMovies"], df["Churn"])
movies_contingency
Out[191]:
Churn False True
StreamingMovies
False 3260 1051
True 1914 818
In [192]:
pd.crosstab(df["StreamingMovies"], df["Churn"], normalize = "index")
Out[192]:
Churn False True
StreamingMovies
False 0.756205 0.243795
True 0.700586 0.299414

The $\chi^2$ test shows significance, so this column should be examined in more detail later.

In [193]:
scs.chi2_contingency(movies_contingency)
Out[193]:
(26.25133601003847,
 2.9974738476267514e-07,
 1,
 array([[3166.99048701, 1144.00951299],
        [2007.00951299,  724.99048701]]))
In [194]:
contract_contingency = pd.crosstab(original_df["Contract"], original_df["Churn"])
contract_contingency
Out[194]:
Churn False True
Contract
Month-to-month 2220 1655
One year 1307 166
Two year 1647 48
In [195]:
pd.crosstab(original_df["Contract"], original_df["Churn"], normalize = "index")
Out[195]:
Churn False True
Contract
Month-to-month 0.572903 0.427097
One year 0.887305 0.112695
Two year 0.971681 0.028319
In [196]:
sns.countplot(x = original_df["Contract"], hue = original_df["Churn"])
Out[196]:
<matplotlib.axes._subplots.AxesSubplot at 0x205767ce390>

The $\chi^2$ test shows significance, so this column should be examined in more detail later. Customers who had been in a month to month contract in this data set have a substantial difference in churn rate, likely because of the lack of legal agreements preventing ease of churn, and should be examined in much greater detail.

In [197]:
scs.chi2_contingency(contract_contingency)
Out[197]:
(1184.5965720837926,
 5.863038300673391e-258,
 2,
 array([[2846.69175067, 1028.30824933],
        [1082.11018032,  390.88981968],
        [1245.198069  ,  449.801931  ]]))
In [198]:
paperless_contingency = pd.crosstab(df["PaperlessBilling"], df["Churn"])
paperless_contingency
Out[198]:
Churn False True
PaperlessBilling
False 2403 469
True 2771 1400
In [199]:
pd.crosstab(df["PaperlessBilling"], df["Churn"], normalize = "index")
Out[199]:
Churn False True
PaperlessBilling
False 0.836699 0.163301
True 0.664349 0.335651

The $\chi^2$ test shows significance, so this column should be examined in more detail later. Customers who had been enrolled in paperless billing have a substantial difference in churn rate and should be examined in more detail later.

In [200]:
scs.chi2_contingency(paperless_contingency)
Out[200]:
(258.27764906707307,
 4.073354668665985e-58,
 1,
 array([[2109.85773108,  762.14226892],
        [3064.14226892, 1106.85773108]]))
In [201]:
payment_contingency = pd.crosstab(original_df["PaymentMethod"], original_df["Churn"])
payment_contingency
Out[201]:
Churn False True
PaymentMethod
Bank transfer (automatic) 1286 258
Credit card (automatic) 1290 232
Electronic check 1294 1071
Mailed check 1304 308
In [202]:
pd.crosstab(original_df["PaymentMethod"], original_df["Churn"], normalize = "index")
Out[202]:
Churn False True
PaymentMethod
Bank transfer (automatic) 0.832902 0.167098
Credit card (automatic) 0.847569 0.152431
Electronic check 0.547146 0.452854
Mailed check 0.808933 0.191067
In [203]:
fig = sns.countplot(x = original_df["PaymentMethod"], hue = original_df["Churn"])
fig.set_xticklabels(fig.get_xticklabels(), rotation = 45)
Out[203]:
[Text(0, 0, 'Electronic check'),
 Text(0, 0, 'Mailed check'),
 Text(0, 0, 'Bank transfer (automatic)'),
 Text(0, 0, 'Credit card (automatic)')]

The $\chi^2$ test shows significance, so this column should be examined in more detail later. Customers who had been paying via electronic check have a substantial difference in churn rate and should be examined in much greater detail later.

In [204]:
scs.chi2_contingency(paperless_contingency)
Out[204]:
(258.27764906707307,
 4.073354668665985e-58,
 1,
 array([[2109.85773108,  762.14226892],
        [3064.14226892, 1106.85773108]]))

Continuous Variable Churn Distribution

The three continuous variables, Tenure, MonthlyCharges, and TotalCharges, are visualized through Seaborn’s distplot function, which overlays a histogram of the data with a kernel density estimate. The dataframe is divided by the Churn column, and for each continuous column, both dataframes’ distributions are superimposed for direct comparison |via histograms.

In [205]:
churn_df = df[df['Churn'] == True]
churn_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1869 entries, 2 to 7041
Data columns (total 24 columns):
Gender                 1869 non-null bool
SeniorCitizen          1869 non-null bool
Partner                1869 non-null bool
Dependents             1869 non-null bool
Tenure                 1869 non-null int64
PhoneService           1869 non-null bool
MultipleLines          1869 non-null bool
FiberOpticService      1869 non-null bool
DSLService             1869 non-null bool
OnlineSecurity         1869 non-null bool
OnlineBackup           1869 non-null bool
DeviceProtection       1869 non-null bool
TechSupport            1869 non-null bool
StreamingTV            1869 non-null bool
StreamingMovies        1869 non-null bool
OneYearContract        1869 non-null bool
TwoYearContract        1869 non-null bool
PaperlessBilling       1869 non-null bool
MailedCheckPayment     1869 non-null bool
BankTransferPayment    1869 non-null bool
CreditCardPayment      1869 non-null bool
MonthlyCharges         1869 non-null float64
TotalCharges           1869 non-null float64
Churn                  1869 non-null bool
dtypes: bool(21), float64(2), int64(1)
memory usage: 96.7 KB
In [206]:
not_churn_df = df[df['Churn'] == False]
not_churn_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5174 entries, 0 to 7042
Data columns (total 24 columns):
Gender                 5174 non-null bool
SeniorCitizen          5174 non-null bool
Partner                5174 non-null bool
Dependents             5174 non-null bool
Tenure                 5174 non-null int64
PhoneService           5174 non-null bool
MultipleLines          5174 non-null bool
FiberOpticService      5174 non-null bool
DSLService             5174 non-null bool
OnlineSecurity         5174 non-null bool
OnlineBackup           5174 non-null bool
DeviceProtection       5174 non-null bool
TechSupport            5174 non-null bool
StreamingTV            5174 non-null bool
StreamingMovies        5174 non-null bool
OneYearContract        5174 non-null bool
TwoYearContract        5174 non-null bool
PaperlessBilling       5174 non-null bool
MailedCheckPayment     5174 non-null bool
BankTransferPayment    5174 non-null bool
CreditCardPayment      5174 non-null bool
MonthlyCharges         5174 non-null float64
TotalCharges           5174 non-null float64
Churn                  5174 non-null bool
dtypes: bool(21), float64(2), int64(1)
memory usage: 267.8 KB

In regards to the customers who churned and those that did not in this data set, there is a very substantial difference in Tenure. As expected, the lower the tenure, the higher the relative percentage of churn rate.

In [207]:
sns.distplot(not_churn_df['Tenure'])
sns.distplot(churn_df['Tenure'])
plt.grid(True)
plt.legend(["Not Churned", "Churned"])
Out[207]:
<matplotlib.legend.Legend at 0x205768b92e8>
In [208]:
ax = sns.scatterplot(x = 'Tenure', y = 'Churn', data = df, hue = 'Churn', alpha = 0.0075)

In looking at the MonthlyCharges and TotalCharges values, it becomes clear that the churned and not churned distribution differ on these columns. The churn rate is much higher when a customer has a higher MonthlyCharges value. This is to be expected since a higher monthly bill causes more financial pressure on the customer and gives a greater performance expectation on the end of the service provider. It appears that over 35% of customers who have not churned have 20-25 as their MonthlyCharges value.

In [209]:
sns.distplot(not_churn_df['MonthlyCharges'])
sns.distplot(churn_df['MonthlyCharges'])
plt.grid(True)
plt.legend(["Not Churned", "Churned"])
Out[209]:
<matplotlib.legend.Legend at 0x20576a0a240>
In [210]:
ax = sns.scatterplot(x = 'MonthlyCharges', y = 'Churn', data = df, hue = 'Churn', alpha = 0.05)

With the TotalCharges column, the distribution in churned and not churned is less of a difference than in the MonthlyCharges column, but is still substantial. The lower the TotalCharges, the higher the churn rate. It appears that over 10% of all customers who have churned had less than 400 TotalCharges.

In [211]:
sns.distplot(not_churn_df['TotalCharges'])
sns.distplot(churn_df['TotalCharges'])
plt.grid(True)
plt.legend(["Not Churned", "Churned"])
Out[211]:
<matplotlib.legend.Legend at 0x20576b52438>
In [212]:
ax = sns.scatterplot(x = 'TotalCharges', y = 'Churn', data = df, hue = 'Churn', alpha = 0.05)

A demographic seems to be forming by combining all of these insights. The type of person that is most likely to have churned in this dataset are those that have indicators of a greater knowledge of computers, like opting for fiber optic speeds, paying with e-checks, utilizing paperless billing and had been doing everything online as a result. Other factors that seem to be strongly affecting churn rate are having a low monthly charge, being a senior citizen, perhaps as a result of death, and ease of attrition, like being on a short term, month to month contract rather than a long term one.

The two columns that did not have significance will be removed to simplify later calculations.

In [213]:
df.drop(['Gender', 'PhoneService'], axis = 1, inplace = True)
In [214]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 22 columns):
SeniorCitizen          7043 non-null bool
Partner                7043 non-null bool
Dependents             7043 non-null bool
Tenure                 7043 non-null int64
MultipleLines          7043 non-null bool
FiberOpticService      7043 non-null bool
DSLService             7043 non-null bool
OnlineSecurity         7043 non-null bool
OnlineBackup           7043 non-null bool
DeviceProtection       7043 non-null bool
TechSupport            7043 non-null bool
StreamingTV            7043 non-null bool
StreamingMovies        7043 non-null bool
OneYearContract        7043 non-null bool
TwoYearContract        7043 non-null bool
PaperlessBilling       7043 non-null bool
MailedCheckPayment     7043 non-null bool
BankTransferPayment    7043 non-null bool
CreditCardPayment      7043 non-null bool
MonthlyCharges         7043 non-null float64
TotalCharges           7043 non-null float64
Churn                  7043 non-null bool
dtypes: bool(19), float64(2), int64(1)
memory usage: 295.9 KB

Correlation

For multivariate visualization, a heatmap of the dataframe’s correlation coefficient matrix is created and is used to look for correlation between the columns. Given the scenario, the most important correlations to look for are those pertaining to the Churn column, which is the focus in prediction. This visualization is very important to look for interactions that need removed, like the extremely high correlation coefficient that exists between the Tenure and TotalCharges columns.

In [215]:
sns.heatmap(
    df.corr(),
    xticklabels = df.columns,
    yticklabels = df.columns
)
Out[215]:
<matplotlib.axes._subplots.AxesSubplot at 0x20576ca84e0>