Western Governors University Data Mining and Analytics II Project
The goal of this data mining and analysis project is to examine customer attrition patterns and to determine which indicators can be used to make better business decisions that prevent loss. Furthermore, a predictive model will be built from this analysis to estimate whether or not a given customer is likely to discontinue business.
This project is done in Python using libraries from the scientific Python ecosystem, with a few other libraries for various uses. Specifically, this involves using pandas for data wrangling, manipulation, and cleaning; matplotlib and Seaborn for data visualization; and scikit-learn for model creation and analysis.
Like R and unlike SAS, all of these are easily available, free, and open-source. Python has been chosen over R for ease of explanation, as Python code is often understood more readily than R, and because of the potential for integrating this project directly into a program or other software in the future. While R is highly specialized for statistics and mathematics, Python is a general-purpose programming language with specialized libraries for the needed tools, which facilitates future project expansion.
The primary descriptive method used in this project is FAMD, or Factor Analysis of Mixed Data. FAMD is performed both to examine the variables and determine which are the most important, and to reduce the number of variables without overly reducing the explained variance, so that the predictive method will be easier to create and use.
The primary predictive method is logistic regression, and this technique is used to build the predictive model. Logistic regression is appropriate because the dependent variable is Boolean, and logistic regression directly models the probability of a binary outcome. Additionally, three of the variables are continuous, which fit naturally into any type of regression, and the remainder are nominal categorical, which can be brought into the regression by encoding them as dummy variables. Finally, the coefficient that the model provides for each parameter can be used to see how much that variable affects the chance of churn. The result is a model that is robust, easy to understand, and simple to use in predicting whether or not a customer will churn.
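For reference, the fitted model estimates the probability of churn as the logistic function of a linear combination of the predictors, $P(\text{Churn}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}$, where each coefficient $\beta_i$ represents the change in the log-odds of churn per unit increase in its predictor $x_i$.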
The essential phenomenon to be predicted is which factors are present among customers who are leaving, as indicated by the Churn column; these can be detected by finding which independent variables have a high positive correlation with Churn. Additionally, finding the variables associated with a lower churn rate is useful for customer retention and is an equally important goal. Beyond this, it is important to find what combination of all these factors is most associated with churn so that smart business decisions can be made in the future.
The target and dependent variable in this dataset is the Churn column, which indicates whether or not the given customer has churned and ceased business. It is a nominal categorical variable that is originally loaded as a string (object dtype) with yes/no values but is then tidied into a proper Boolean with True/False values.
There are many potential independent variables that can be used as predictor variables in this dataset. Two of the most important independent variables overall are the MonthlyCharges and Tenure columns, which contain the continuous quantitative values representing how much each customer pays per month according to contract and how long that customer has had a business relationship with this company. These are among the most important because of how strongly financial figures factor into customers' decisions. A higher MonthlyCharges value may well represent more pressure to leave for a cheaper competitor as soon as a contract is over or, in the case of a month-to-month contract, to end it sooner rather than later. Meanwhile, a higher Tenure value is indicative of a customer who has invested more thus far and is less likely to leave based on any single factor.
For other independent variables, PhoneService and InternetService do an excellent job of grouping customer type, and the various columns that require one of these grouping columns to indicate a paid service can reveal that a disproportionate number of churning customers belong to a very specific subset. For example, if all customers with internet service through this company, specifically DSL service, are leaving and choosing not to renew contracts after previous ones are up, then the reasons why need to be looked into specifically. Additionally, DSL customers could be prime targets for customer relations programs or improved sales because of this relation; a competitor may be offering a sale far better than the company currently offers or is willing to match.
The first goal and requirement in manipulating the data is to have a dataset that can be considered both tidy and clean. For data to be tidy, each variable must have its own column, each observation must have its own row, and each value must have its own cell (Wickham & Grolemund, 2017, 12.2). For data to be clean, it needs to be free from obvious errors, formatted into the correct form of data, devoid of missing values, stripped of irrelevant columns, and consistent within itself. Additionally, the data will be broken down so that dummy encodings (one-hot encoding) represent the categorical data rather than the string values in the initial dataset, so that it can easily be used in an appropriate model. Beyond this, outliers and interactions must be found and removed if any exist, to ensure the integrity of the resulting analyses and predictive model.
# Import all modules and libraries, as well as set matplotlib plotting to occur in the notebook.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as scs
import prince
import pickle
from sklearn import preprocessing
%matplotlib inline
This dataset has 21 columns, a number of which may be useful for predicting attrition, which is represented by the final column "Churn." Furthermore, it appears that no values are missing, or more specifically, NaN, but there is still potential for outliers or incorrect data.
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
df.info()
df.head(10)
To begin the data cleaning process after the initial loading and examining of the dataframe structure, each column is examined and any of the columns that have null values are flagged.
for col in df.columns.values:
    print(col + ": " + str(df[col].isnull().values.any()))
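A more concise, vectorized version of the same null check:

df.isnull().any()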
n_rows = len(df)
n_rows
The column customerID appears to be a structured string simply used for identifying customers. While this column is extremely important in relational databases and would be useful if we planned on any dataframe splits and subsequent joins, it is not useful for analysis and will be removed early on to simplify later steps.
df.drop('customerID', axis = 1, inplace = True)
df.info()
As the customerID column is the only one that will need to be removed at this stage of the data exploration, the remainder of this first step is to go through each column, one by one, and perform the most basic cleaning and exploratory tasks: renaming columns to a consistent convention, converting boolean-like columns to the bool dtype, and examining each column's value counts and proportions.
df = df.rename(columns = {"gender": "Gender"})
df['Gender'].value_counts()
df['Gender'].value_counts(normalize = True)
SeniorCitizen appears to be a boolean value indicating whether or not the row corresponds to a customer who is a senior citizen, though what exact age that represents is unknown. 16.21% of customers in this dataset are senior citizens. This is the first example of a boolean column in this dataset, and these columns do not follow a single convention overall. The boolean-like object columns will all be changed to an actual bool dtype as they come up.
df['SeniorCitizen'] = df['SeniorCitizen'].astype('bool')
df.info()
df['SeniorCitizen'].value_counts()
df['SeniorCitizen'].value_counts(normalize = True)
The Partner column may relate to a business partnership or some similar idea, but its position between the age-related columns and Dependents, when the columns seem ordered by topic, leads to the belief that it instead indicates the existence of a spouse or established relationship partner in boolean form. 48.3% of customers in this dataset have a partner. It is another column that needs to be changed to the bool dtype.
df['Partner'] = df['Partner'].replace({"No":False, "Yes":True})
df['Partner'] = df['Partner'].astype('bool')
df.info()
df['Partner'].value_counts()
df['Partner'].value_counts(normalize = True)
df['Dependents'] = df['Dependents'].replace({"No":False, "Yes":True})
df['Dependents'] = df['Dependents'].astype('bool')
df.info()
df['Dependents'].value_counts()
df['Dependents'].value_counts(normalize = True)
The tenure column, whose name notably does not follow the column naming conventions, seems to carry the business meaning of the term: the number of years or some other time period that the customer has been with the company. Interestingly, the values range from 0 to 72. The highest value, 72, is the second most common, while 0, the lowest value, is the least common. To make everything consistent, the column title will be updated.
df = df.rename(columns = {"tenure": "Tenure"})
df['Tenure'].value_counts()
df['PhoneService'] = df['PhoneService'].replace({"No":False, "Yes":True})
df['PhoneService'] = df['PhoneService'].astype('bool')
df.info()
df['PhoneService'].value_counts()
df['PhoneService'].value_counts(normalize = True)
MultipleLines is a string value that indicates whether the given customer has multiple lines on their account, has only a single line on their account, or if this column isn't applicable for the customer, that they do not have phone service through this company. 42.18% of total customers in this dataset have multiple lines, and 46.71% of customers with phone service in this dataset have multiple lines.
df['MultipleLines'].value_counts()
n_multiple_lines = df['MultipleLines'].value_counts()['Yes']
lines_proportion = n_multiple_lines / n_rows
print("{0:.2%}".format(lines_proportion))
lines_proportion2 = n_multiple_lines / (n_rows - df['MultipleLines'].value_counts()['No phone service'])
print("{0:.2%}".format(lines_proportion2))
InternetService is a string value indicating whether or not a customer is paying for internet service, and if so, whether the service is a fiber optic plan or a DSL plan. 78.33% of customers in this dataset have an internet plan. 43.96% and 34.37% of total customers in this dataset have a fiber or DSL plan respectively, and 56.12% and 43.88% of customers with internet service have fiber or DSL plans respectively.
Many of the columns in this dataset signify whether or not a service is included in the given customer's plan. These columns would be boolean, but have a third value indicating that the customer does not have the service plan required for the column to apply. While these may be converted into bool dtype columns for later analysis, the boolean-like yes and no values will be changed to True or False on this first pass.
df['InternetService'] = df['InternetService'].replace({"No":"False"})
df['InternetService'].value_counts()
n_internet = n_rows - df['InternetService'].value_counts()['False']
internet_proportion = n_internet / n_rows
print("{0:.2%}".format(internet_proportion))
n_fiber = df['InternetService'].value_counts()['Fiber optic']
n_dsl = df['InternetService'].value_counts()['DSL']
fiber_proportion = n_fiber / n_rows
dsl_proportion = n_dsl / n_rows
print("Fiber: {0:.2%}\nDSL: {1:.2%}".format(fiber_proportion, dsl_proportion))
fiber_proportion2 = n_fiber / n_internet
dsl_proportion2 = n_dsl / n_internet
print("Fiber: {0:.2%}\nDSL: {1:.2%}".format(fiber_proportion2, dsl_proportion2))
OnlineSecurity is a string value that indicates whether the given customer has an online security package in addition to their internet service, did not opt for the online security package, or if this column isn't applicable for the customer, that they do not have internet service through this company. 28.67% of total customers in this dataset have the online security package, and 36.6% of customers with internet service in this dataset have the online security package. Its values will be updated to the more boolean-like format.
df['OnlineSecurity'] = df['OnlineSecurity'].replace({"No":"False", "Yes":"True"})
df['OnlineSecurity'].value_counts()
n_security = df['OnlineSecurity'].value_counts()['True']
security_proportion = n_security / n_rows
print("{0:.2%}".format(security_proportion))
security_proportion2 = n_security / (n_rows - df['OnlineSecurity'].value_counts()['No internet service'])
print("{0:.2%}".format(security_proportion2))
OnlineBackup is a string value that indicates whether the given customer has an online backup package in addition to their internet service, did not opt for the online backup package, or, if this column isn't applicable for the customer, that they do not have internet service through this company. 34.49% of total customers in this dataset have the online backup package, and 44.03% of customers with internet service in this dataset have the online backup package. Its values will be updated to the more boolean-like format.
df['OnlineBackup'] = df['OnlineBackup'].replace({"No":"False", "Yes":"True"})
df['OnlineBackup'].value_counts()
n_backup = df['OnlineBackup'].value_counts()['True']
backup_proportion = n_backup / n_rows
print("{0:.2%}".format(backup_proportion))
backup_proportion2 = n_backup / (n_rows - df['OnlineBackup'].value_counts()['No internet service'])
print("{0:.2%}".format(backup_proportion2))
DeviceProtection is a string value that indicates whether the given customer has a device protection plan in addition to their internet service, did not opt for any protection plan, or if this column isn't applicable for the customer, that they do not have internet service through this company. 34.39% of total customers in this dataset have a device protection plan, and 43.90% of customers with internet service in this dataset have a device protection plan. Its values will be updated to the more boolean-like format.
df['DeviceProtection'] = df['DeviceProtection'].replace({"No":"False", "Yes":"True"})
df['DeviceProtection'].value_counts()
n_protection = df['DeviceProtection'].value_counts()['True']
protection_proportion = n_protection / n_rows
print("{0:.2%}".format(protection_proportion))
protection_proportion2 = n_protection / (n_rows - df['DeviceProtection'].value_counts()['No internet service'])
print("{0:.2%}".format(protection_proportion2))
OnlineBackup and DeviceProtection present the first potential major issue for future analysis. Both require that the customer have a value other than No in the InternetService column, and both have extremely similar counts: 2429 and 2422, or 44.03% and 43.90% respectively. The correlation between the columns' values, particularly the shared No internet service rows, is going to prevent proper testing and model building if they are tested together.
contingency_table = pd.crosstab(df['OnlineBackup'], df['DeviceProtection'])
contingency_table
As we will see shortly, a number of the columns require internet service to have meaningful values and are in the same situation as the above section, so the data will have to be restructured to account for this.
TechSupport is a string value that indicates whether the given customer has a tech support package in addition to their internet service, did not opt for any tech support package, or if this column isn't applicable for the customer, that they do not have internet service through this company. 29.02% of total customers in this dataset have a tech support package, and 37.05% of customers with internet service in this dataset have a tech support package. Its values will be updated to the more boolean-like format.
df['TechSupport'] = df['TechSupport'].replace({"No":"False", "Yes":"True"})
df['TechSupport'].value_counts()
n_support = df['TechSupport'].value_counts()['True']
support_proportion = n_support / n_rows
print("{0:.2%}".format(support_proportion))
support_proportion2 = n_support / (n_rows - df['TechSupport'].value_counts()['No internet service'])
print("{0:.2%}".format(support_proportion2))
StreamingTV is a string value that indicates whether the given customer has a TV streaming package in addition to their internet service, did not opt for any TV streaming package, or if this column isn't applicable for the customer, that they do not have internet service through this company. 38.44% of total customers in this dataset have a TV streaming package, and 49.07% of customers with internet service in this dataset have a TV streaming package. Its values will be updated to the more boolean-like format.
df['StreamingTV'] = df['StreamingTV'].replace({"No":"False", "Yes":"True"})
df['StreamingTV'].value_counts()
n_tv = df['StreamingTV'].value_counts()['True']
tv_proportion = n_tv / n_rows
print("{0:.2%}".format(tv_proportion))
tv_proportion2 = n_tv / (n_rows - df['StreamingTV'].value_counts()['No internet service'])
print("{0:.2%}".format(tv_proportion2))
StreamingMovies is a string value that indicates whether the given customer has a movie streaming package in addition to their internet service, did not opt for any movie streaming package, or if this column isn't applicable for the customer, that they do not have internet service through this company. 38.79% of total customers in this dataset have a movie streaming package, and 49.52% of customers with internet service in this dataset have a movie streaming package. Its values will be updated to the more boolean-like format.
df['StreamingMovies'] = df['StreamingMovies'].replace({"No":"False", "Yes":"True"})
df['StreamingMovies'].value_counts()
n_movie = df['StreamingMovies'].value_counts()['True']
movie_proportion = n_movie / n_rows
print("{0:.2%}".format(movie_proportion))
movie_proportion2 = n_movie / (n_rows - df['StreamingMovies'].value_counts()['No internet service'])
print("{0:.2%}".format(movie_proportion2))
Contract is a string value that indicates what type of contract the given customer is under. The contract type values are Month-to-month, One year, and Two year. 55.02% of customers are under a month-to-month contract, 24.07% are under a one year contract, and the remaining 20.91% are under a two year contract.
df['Contract'].value_counts()
df['Contract'].value_counts(normalize = True)
df['PaperlessBilling'] = df['PaperlessBilling'].replace({"No":False, "Yes":True})
df['PaperlessBilling'] = df['PaperlessBilling'].astype('bool')
df.info()
df['PaperlessBilling'].value_counts()
df['PaperlessBilling'].value_counts(normalize = True)
PaymentMethod is a string value that indicates what type of payment method the given customer uses. The payment method values include electronic checks, traditional checks via mail, automatic bank transfer, and automatic payment with a credit card. 33.58% of customers pay via electronic check, 22.89% pay via mailed check, 21.92% pay automatically via bank transfer, and the remaining 21.61% pay automatically via credit card.
df['PaymentMethod'].value_counts()
df['PaymentMethod'].value_counts(normalize = True)
df['MonthlyCharges'].value_counts()
df['MonthlyCharges'].describe()
plt.hist(df['MonthlyCharges'])
plt.title('Distribution of MonthlyCharges')
plt.xlabel('Monthly Charge')
plt.ylabel('Monthly Charge Frequency')
plt.grid(True)
plt.show()
TotalCharges appears to be a column that represents the total amount a given customer has paid throughout all transactions of the business-customer relationship. However, the column is loaded as the object dtype, or a string, instead of a float. This will have to be changed to examine the data effectively.
df['TotalCharges'].describe()
These 11 rows have missing values in their TotalCharges column, which appears to be the only column with missing values. Notably, these customers' Tenure values are all 0, and none of them have churned. It seems likely that these customers don't have a TotalCharges value simply because they haven't been charged yet. As such, their TotalCharges value will be set to 0 to allow for later analysis.
df.loc[pd.to_numeric(df['TotalCharges'], errors='coerce').isnull()]
tofix = [488, 753, 936, 1082, 1340, 3331, 3826, 4380, 5218, 6670, 6754]
df.loc[tofix, 'TotalCharges'] = 0
df.iloc[tofix]['TotalCharges']
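Equivalently, the same repair can be written without hardcoding row labels, which protects against the row positions shifting if the source file changes; a small sketch:

# Flag the rows whose TotalCharges string cannot be parsed as a number, then zero them out.
blank_totals = pd.to_numeric(df['TotalCharges'], errors = 'coerce').isnull()
df.loc[blank_totals, 'TotalCharges'] = 0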
Now we can change the column to the float dtype and then examine the data. With the 11 zero-filled rows included, the minimum value is 0, the maximum is 8684.80, and the mean is roughly 2280. Overall, the column is very right-skewed.
# Correct the TotalCharges column's dtype from object to float.
df['TotalCharges'] = df['TotalCharges'].astype('float')
df.info()
df['TotalCharges'].describe()
plt.hist(df['TotalCharges'])
plt.title('Distribution of TotalCharges')
plt.xlabel('Total Charge')
plt.ylabel('Total Charge Frequency')
plt.grid(True)
plt.show()
df['Churn'] = df['Churn'].replace({"No":False, "Yes":True})
df['Churn'] = df['Churn'].astype('bool')
df.info()
df['Churn'].value_counts()
df['Churn'].value_counts(normalize = True)
df.info()
In this dataset, the three continuous variable columns do not have any outliers. None are present based on IQR range, as seen on the boxplots, and none are present based on the z-score with a threshold of 3 standard deviations.
sns.boxplot(df['Tenure'])
tenure_z = np.abs(scs.zscore(df['Tenure']))
tenure_z
print(np.where(tenure_z > 3))
sns.boxplot(df['MonthlyCharges'])
monthly_z = np.abs(scs.zscore(df['MonthlyCharges']))
monthly_z
print(np.where(monthly_z > 3))
sns.boxplot(df['TotalCharges'])
total_z = np.abs(scs.zscore(df['TotalCharges']))
total_z
print(np.where(total_z > 3))
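For completeness, the IQR criterion mentioned above can also be checked programmatically rather than read off the boxplots; a short sketch:

# Count values falling outside 1.5 * IQR of the quartiles for each continuous column.
for col in ['Tenure', 'MonthlyCharges', 'TotalCharges']:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    n_outliers = ((df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)).sum()
    print(col + ": " + str(n_outliers) + " IQR outliers")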
Contingency tables will be calculated for all variables in order to see how they appear to affect the churn rate before using more intensive methods.
However, this is also where the categorical variables become an issue for analyzing the data set. The next step in the overall process is to transform the data into the structure that the models need. As logistic regression will be performed, the categorical columns will need to be encoded into dummy variables, which are expressed in Boolean values.
In addition to the information encoding needing to be changed, a number of these variables are inherently correlated with each other; e.g. MultipleLines with PhoneService, and OnlineSecurity with InternetService. To do meaningful tests of significance, this intercorrelation needs to be removed before the tests are run. Specifically, the problem is that the values of "no phone service" and "no internet service" are also factoring into the tests. There are a few approaches available to deal with this.
The first is to split the dataframe into three distinct dataframes: one for customers with only phone service, one for customers with only internet service, and the last for customers with both phone and internet service through this company. This is less than ideal for several reasons. Rather than a single dataframe with a single set of calculations, this would create three dataframes and require three tests for each calculation, further complicating how all of the tests would be combined.
phone_count = df.query("PhoneService == True and InternetService == 'False'").shape[0]
phone_count
internet_count = df.query("PhoneService == False and InternetService != 'False'").shape[0]
internet_count
both_count = df.query("PhoneService == True and InternetService != 'False'").shape[0]
both_count
(phone_count + internet_count + both_count) == n_rows
The second option is to simply remove the subsection of rows that lack the prerequisite service, effectively removing the not-applicable values in the process. This would technically work, but it would also discard a significant amount of information.
The remaining option is to manipulate the encoding and representation of the data. The categories that these all involve can be split into multiple boolean columns representing the "positive" values of the original columns. As an example, the MultipleLines column would be changed to be a boolean True/False option, with all rows that do not have phone service also being represented as False. Ultimately, the multiple logistic regression model is only going to adjust the calculation with a parameter if the value is true, and to create the model, these dummy variables have to be created regardless.
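The cells that follow perform this recoding column by column. As a compact alternative, a reusable helper along the lines of the sketch below (collapse_to_bool is a hypothetical name) would express the same transformation:

# Collapse a yes/no/not-applicable service column into a plain boolean,
# treating the not-applicable value as False.
def collapse_to_bool(series, na_value):
    mapping = {"No": False, "Yes": True, "False": False, "True": True, na_value: False}
    return series.replace(mapping).astype('bool')
# Hypothetical usage: df['OnlineSecurity'] = collapse_to_bool(df['OnlineSecurity'], "No internet service")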
df.info()
original_df = df.copy(deep = True)
df['Gender'].value_counts()
df['Gender'] = df['Gender'].replace({"Male":False, "Female":True})
df['Gender'] = df['Gender'].astype('bool')
df['Gender'].value_counts()
df['MultipleLines'].value_counts()
df.loc[df['MultipleLines'] == "No phone service", 'MultipleLines'] = np.NaN
df['MultipleLines'] = df['MultipleLines'].replace({"No":False, "Yes":True})
df['MultipleLines'] = df['MultipleLines'].fillna(False).astype('bool')
df['MultipleLines'].value_counts()
df['InternetService'].value_counts()
df['FiberOpticService'] = False
df.loc[df['InternetService'] == "Fiber optic", 'FiberOpticService'] = True
df['FiberOpticService'].value_counts()
df['DSLService'] = False
df.loc[df['InternetService'] == "DSL", 'DSLService'] = True
df['DSLService'].value_counts()
df.drop('InternetService', axis = 1, inplace = True)
df['OnlineSecurity'].value_counts()
df.loc[df['OnlineSecurity'] == "No internet service", 'OnlineSecurity'] = np.NaN
df['OnlineSecurity'] = df['OnlineSecurity'].replace({"False":False, "True":True})
df['OnlineSecurity'] = df['OnlineSecurity'].fillna(False).astype('bool')
df['OnlineSecurity'].value_counts()
df['OnlineBackup'].value_counts()
df.loc[df['OnlineBackup'] == "No internet service", 'OnlineBackup'] = np.NaN
df['OnlineBackup'] = df['OnlineBackup'].replace({"False":False, "True":True})
df['OnlineBackup'] = df['OnlineBackup'].fillna(False).astype('bool')
df['OnlineBackup'].value_counts()
df['DeviceProtection'].value_counts()
df.loc[df['DeviceProtection'] == "No internet service", 'DeviceProtection'] = np.NaN
df['DeviceProtection'] = df['DeviceProtection'].replace({"False":False, "True":True})
df['DeviceProtection'] = df['DeviceProtection'].fillna(False).astype('bool')
df['DeviceProtection'].value_counts()
df['TechSupport'].value_counts()
df.loc[df['TechSupport'] == "No internet service", 'TechSupport'] = np.NaN
df['TechSupport'] = df['TechSupport'].replace({"False":False, "True":True})
df['TechSupport'] = df['TechSupport'].fillna(False).astype('bool')
df['TechSupport'].value_counts()
df['StreamingTV'].value_counts()
df.loc[df['StreamingTV'] == "No internet service", 'StreamingTV'] = np.NaN
df['StreamingTV'] = df['StreamingTV'].replace({"False":False, "True":True})
df['StreamingTV'] = df['StreamingTV'].fillna(False).astype('bool')
df['StreamingTV'].value_counts()
df['StreamingMovies'].value_counts()
df.loc[df['StreamingMovies'] == "No internet service", 'StreamingMovies'] = np.NaN
df['StreamingMovies'] = df['StreamingMovies'].replace({"False":False, "True":True})
df['StreamingMovies'] = df['StreamingMovies'].fillna(False).astype('bool')
df['StreamingMovies'].value_counts()
The most common type of contract is the Month-to-month contract, so it will be the reference level, having both the new one year and two year contract columns as False.
df['Contract'].value_counts()
df['OneYearContract'] = False
df['TwoYearContract'] = False
df.loc[df['Contract'] == "One year", 'OneYearContract'] = True
df.loc[df['Contract'] == "Two year", 'TwoYearContract'] = True
df['OneYearContract'].value_counts()
df['TwoYearContract'].value_counts()
df.drop('Contract', axis = 1, inplace = True)
The most common type of payment method is by electronic check, so it will be the reference level, having the new columns mailed check, bank transfer, and credit card as False.
df['PaymentMethod'].value_counts()
df['MailedCheckPayment'] = False
df['BankTransferPayment'] = False
df['CreditCardPayment'] = False
df.loc[df['PaymentMethod'] == "Mailed check", 'MailedCheckPayment'] = True
df.loc[df['PaymentMethod'] == "Bank transfer (automatic)", 'BankTransferPayment'] = True
df.loc[df['PaymentMethod'] == "Credit card (automatic)", 'CreditCardPayment'] = True
df['MailedCheckPayment'].value_counts()
df['BankTransferPayment'].value_counts()
df['CreditCardPayment'].value_counts()
df.drop('PaymentMethod', axis = 1, inplace = True)
Finally, reorganize all of the columns back to the order that the original dataset was in.
cols = df.columns
cols
# Reorders the columns back to how the original dataframe had them.
df = df[['Gender', 'SeniorCitizen', 'Partner', 'Dependents', 'Tenure',
'PhoneService', 'MultipleLines', 'FiberOpticService', 'DSLService',
'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
'StreamingTV', 'StreamingMovies', 'OneYearContract', 'TwoYearContract',
'PaperlessBilling', 'MailedCheckPayment', 'BankTransferPayment',
'CreditCardPayment', 'MonthlyCharges', 'TotalCharges', 'Churn']]
This section culminates in the dataframe below. All categorical variables have been recoded into Boolean values for the future regression modeling, and the continuous variables remain intact: integer for the Tenure column and floating point for the two charges columns.
df.info()
As all of the distributions have been examined in the data exploration and cleaning stage, this section acts as a recap and a point at which to represent them graphically. The following code reformats the cleaned dataframe into its long form so that the resulting graphs are generated together in a single chart with shared axes, which should allow for easy comparison of proportions. The large chart of all categorical variables is followed by the summary statistics and a histogram showing the distribution of each continuous variable.
viz_df = df.drop(['Tenure', 'MonthlyCharges', 'TotalCharges'], axis = 1)
viz_df.columns
melted = viz_df.melt()
melted
g = sns.FacetGrid(
melted,
col = 'variable',
hue = 'value',
sharey = 'row',
sharex = 'col',
col_wrap = 7,
legend_out = True,
)
g = g.map(sns.countplot, 'value', order = [False, True]).add_legend()
plt.subplots_adjust(top = 0.9)
g.fig.suptitle('Univariate Categorical Variable Distributions')
# Add the percentage proportion value for each column in each variable plot.
for i, ax in enumerate(g.axes):
# Set each plot's title to its corresponding column name.
ax.set_title(viz_df.columns[i])
for p in ax.patches:
percentage = "{0:.2f}".format((p.get_height() / n_rows) * 100) + "%"
ax.annotate(percentage, (p.get_x() + p.get_width() / 2., p.get_height()),
ha = 'center', va = 'center', xytext = (0, 7), textcoords = 'offset points')
g.savefig("Charts/Univariate Distributions.svg")
g
df['Tenure'].describe()
plt.hist(df['Tenure'])
plt.title('Distribution of Tenure')
plt.xlabel('Tenure')
plt.ylabel('Tenure Frequency')
plt.grid(True)
plt.show()
df['MonthlyCharges'].describe()
plt.hist(df['MonthlyCharges'])
plt.title('Distribution of MonthlyCharges')
plt.xlabel('Monthly Charge')
plt.ylabel('Monthly Charge Frequency')
plt.grid(True)
plt.show()
df['TotalCharges'].describe()
plt.hist(df['TotalCharges'])
plt.title('Distribution of TotalCharges')
plt.xlabel('Total Charge')
plt.ylabel('Total Charge Frequency')
plt.grid(True)
plt.show()
For ease of interpretation and usage given the 2x2 relationships between churn and the other categorical variables, simple contingency tables were used to visualize the categorical bivariate distributions. Bar, pie, and mosaic plots were all considered, but all seemed to overcomplicate the presentation. In each of the following sections, the first table is the raw numbers of the categorical variables, and the second table contains the same information but normalized into proportions. The two columns Contract and PaymentMethod have more values than just True/False. These are visualized with bar plots grouped by the rows’ Churn values.
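Before stepping through each variable individually, a compact loop like this sketch could report every $\chi^2$ p-value at once; the sections that follow show the full tables and tests column by column.

# Run the same chi-squared independence test against Churn for every recoded categorical column.
for col in df.columns.drop(['Tenure', 'MonthlyCharges', 'TotalCharges', 'Churn']):
    table = pd.crosstab(df[col], df['Churn'])
    chi2, p, dof, expected = scs.chi2_contingency(table)
    print(col + ": p = " + "{0:.4g}".format(p))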
gender_contingency = pd.crosstab(df["Gender"], df["Churn"])
gender_contingency
pd.crosstab(df["Gender"], df["Churn"], normalize = "index")
The $\chi^2$ test does not show significance, so this column will not be examined in more detail later.
scs.chi2_contingency(gender_contingency)
senior_contingency = pd.crosstab(df["SeniorCitizen"], df["Churn"])
senior_contingency
pd.crosstab(df["SeniorCitizen"], df["Churn"], normalize = "index")
The $\chi^2$ test shows significance, so this column should be examined in more detail later.
scs.chi2_contingency(senior_contingency)
partner_contingency = pd.crosstab(df["Partner"], df["Churn"])
partner_contingency
pd.crosstab(df["Partner"], df["Churn"], normalize = "index")
The $\chi^2$ test shows significance, so this column should be examined in more detail later.
scs.chi2_contingency(partner_contingency)
dependents_contingency = pd.crosstab(df["Dependents"], df["Churn"])
dependents_contingency
pd.crosstab(df["Dependents"], df["Churn"], normalize = "index")
The $\chi^2$ test shows significance, so this column should be examined in more detail later.
scs.chi2_contingency(dependents_contingency)
phone_contingency = pd.crosstab(df["PhoneService"], df["Churn"])
phone_contingency
pd.crosstab(df["PhoneService"], df["Churn"], normalize = "index")
The $\chi^2$ test does not show significance, so this column will not be examined in more detail later.
scs.chi2_contingency(phone_contingency)
lines_contingency = pd.crosstab(df["MultipleLines"], df["Churn"])
lines_contingency
pd.crosstab(df["MultipleLines"], df["Churn"], normalize = "index")
The $\chi^2$ test shows significance, so this column should be examined in more detail later. It may be that a greater number of lines increases the probability of churn as a result of the higher cost overall.
scs.chi2_contingency(lines_contingency)
fiber_contingency = pd.crosstab(df["FiberOpticService"], df["Churn"])
fiber_contingency
pd.crosstab(df["FiberOpticService"], df["Churn"], normalize = "index")
The $\chi^2$ test shows significance, so this column should be examined in more detail later. Customers who had been purchasing fiber optic services in this data set have a substantial difference in churn rate and should be examined in much greater detail.
scs.chi2_contingency(fiber_contingency)
dsl_contingency = pd.crosstab(df["DSLService"], df["Churn"])
dsl_contingency
pd.crosstab(df["DSLService"], df["Churn"], normalize = "index")
The $\chi^2$ test shows significance, so this column should be examined in more detail later. Customers who had not been purchasing DSL services in this data set have a substantial difference in churn rate and should be examined in much greater detail.
scs.chi2_contingency(dsl_contingency)
security_contingency = pd.crosstab(df["OnlineSecurity"], df["Churn"])
security_contingency
pd.crosstab(df["OnlineSecurity"], df["Churn"], normalize = "index")
The $\chi^2$ test shows significance, so this column should be examined in more detail later. Customers who had been purchasing internet service but not the online security package in this data set have a substantial difference in churn rate and should be examined in much greater detail.
scs.chi2_contingency(security_contingency)
backup_contingency = pd.crosstab(df["OnlineBackup"], df["Churn"])
backup_contingency
pd.crosstab(df["OnlineBackup"], df["Churn"], normalize = "index")
The $\chi^2$ test shows significance, so this column should be examined in more detail later.
scs.chi2_contingency(backup_contingency)
dev_protection_contingency = pd.crosstab(df["DeviceProtection"], df["Churn"])
dev_protection_contingency
pd.crosstab(df["DeviceProtection"], df["Churn"], normalize = "index")
The $\chi^2$ test shows significance, so this column should be examined in more detail later.
scs.chi2_contingency(dev_protection_contingency)
support_contingency = pd.crosstab(df["TechSupport"], df["Churn"])
support_contingency
pd.crosstab(df["TechSupport"], df["Churn"], normalize = "index")
The $\chi^2$ test shows significance, so this column should be examined in more detail later. It appears that customers who have a tech support package had a significantly lower churn rate.
scs.chi2_contingency(support_contingency)
tv_contingency = pd.crosstab(df["StreamingTV"], df["Churn"])
tv_contingency
pd.crosstab(df["StreamingTV"], df["Churn"], normalize = "index")
The $\chi^2$ test shows significance, so this column should be examined in more detail later.
scs.chi2_contingency(tv_contingency)
movies_contingency = pd.crosstab(df["StreamingMovies"], df["Churn"])
movies_contingency
pd.crosstab(df["StreamingMovies"], df["Churn"], normalize = "index")
The $\chi^2$ test shows significance, so this column should be examined in more detail later.
scs.chi2_contingency(movies_contingency)
contract_contingency = pd.crosstab(original_df["Contract"], original_df["Churn"])
contract_contingency
pd.crosstab(original_df["Contract"], original_df["Churn"], normalize = "index")
sns.countplot(x = original_df["Contract"], hue = original_df["Churn"])
The $\chi^2$ test shows significance, so this column should be examined in more detail later. Customers on a month-to-month contract in this data set have a substantially different churn rate, likely because no contractual commitment stands in the way of leaving, and should be examined in much greater detail.
scs.chi2_contingency(contract_contingency)
paperless_contingency = pd.crosstab(df["PaperlessBilling"], df["Churn"])
paperless_contingency
pd.crosstab(df["PaperlessBilling"], df["Churn"], normalize = "index")
The $\chi^2$ test shows significance, so this column should be examined in more detail later. Customers who had been enrolled in paperless billing have a substantial difference in churn rate and should be examined in more detail later.
scs.chi2_contingency(paperless_contingency)
payment_contingency = pd.crosstab(original_df["PaymentMethod"], original_df["Churn"])
payment_contingency
pd.crosstab(original_df["PaymentMethod"], original_df["Churn"], normalize = "index")
fig = sns.countplot(x = original_df["PaymentMethod"], hue = original_df["Churn"])
fig.set_xticklabels(fig.get_xticklabels(), rotation = 45)
The $\chi^2$ test shows significance, so this column should be examined in more detail later. Customers who had been paying via electronic check have a substantial difference in churn rate and should be examined in much greater detail later.
scs.chi2_contingency(payment_contingency)
The three continuous variables, Tenure, MonthlyCharges, and TotalCharges, are visualized through Seaborn’s distplot function, which overlays a histogram of the data with a kernel density estimate. The dataframe is divided by the Churn column, and for each continuous column, both dataframes’ distributions are superimposed via histograms for direct comparison.
churn_df = df[df['Churn'] == True]
churn_df.info()
not_churn_df = df[df['Churn'] == False]
not_churn_df.info()
Between the customers who churned and those who did not in this data set, there is a very substantial difference in Tenure. As expected, the lower the tenure, the higher the relative churn rate.
sns.distplot(not_churn_df['Tenure'])
sns.distplot(churn_df['Tenure'])
plt.grid(True)
plt.legend(["Not Churned", "Churned"])
ax = sns.scatterplot(x = 'Tenure', y = 'Churn', data = df, hue = 'Churn', alpha = 0.0075)
Looking at the MonthlyCharges and TotalCharges values, it becomes clear that the churned and not-churned distributions differ on these columns. The churn rate is much higher when a customer has a higher MonthlyCharges value. This is to be expected, since a higher monthly bill puts more financial pressure on the customer and sets a greater performance expectation for the service provider. It appears that over 35% of customers who have not churned have 20-25 as their MonthlyCharges value.
sns.distplot(not_churn_df['MonthlyCharges'])
sns.distplot(churn_df['MonthlyCharges'])
plt.grid(True)
plt.legend(["Not Churned", "Churned"])
ax = sns.scatterplot(x = 'MonthlyCharges', y = 'Churn', data = df, hue = 'Churn', alpha = 0.05)
With the TotalCharges column, the difference between the churned and not-churned distributions is smaller than for the MonthlyCharges column, but it is still substantial. The lower the TotalCharges, the higher the churn rate. It appears that over 10% of all customers who have churned had less than 400 in TotalCharges.
sns.distplot(not_churn_df['TotalCharges'])
sns.distplot(churn_df['TotalCharges'])
plt.grid(True)
plt.legend(["Not Churned", "Churned"])
ax = sns.scatterplot(x = 'TotalCharges', y = 'Churn', data = df, hue = 'Churn', alpha = 0.05)
A demographic seems to be forming from the combination of all of these insights. The customers most likely to have churned in this dataset are those with indicators of greater computer knowledge: opting for fiber optic speeds, paying with e-checks, utilizing paperless billing, and doing everything online as a result. Other factors that seem to strongly affect churn rate are a high monthly charge, being a senior citizen, perhaps as a result of death, and ease of attrition, such as being on a short-term, month-to-month contract rather than a long-term one.
The two columns that did not have significance will be removed to simplify later calculations.
df.drop(['Gender', 'PhoneService'], axis = 1, inplace = True)
df.info()
For multivariate visualization, a heatmap of the dataframe’s correlation coefficient matrix is created and used to look for correlation between the columns. Given the scenario, the most important correlations to look for are those pertaining to the Churn column, which is the focus of prediction. This visualization is also important for finding interactions that need to be removed, like the extremely high correlation coefficient between the Tenure and TotalCharges columns.
sns.heatmap(
df.corr(),
xticklabels = df.columns,
yticklabels = df.columns
)
Most of this correlation heatmap makes intuitive sense: people who pay for movie streaming are more likely to also pay for TV streaming; a higher monthly rate leads to a higher total amount paid; those that have partners also have more dependents. Additionally, StreamingTV and StreamingMovies are correlated with higher MonthlyCharges. However, a few important things jump out. Tenure and TotalCharges are extremely correlated, possibly enough to skew the entire model; TwoYearContract is correlated with higher Tenure; and, perhaps most importantly, FiberOpticService is the variable most correlated with Churn. FiberOpticService is also correlated with a higher MonthlyCharges value, perhaps partially explaining the link to Churn.
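To make the heatmap's most important row explicit, the correlation of each variable with Churn can also be ranked directly:

# Rank every column's correlation coefficient with the Churn column.
df.corr()['Churn'].drop('Churn').sort_values(ascending = False)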
Now the data can be examined using data mining techniques. To continue exploring the data and to find the most important variables for the predictive analysis, MCA, or Multiple Correspondence Analysis, will be used. Low-order interactions will be sought by plotting and examining the results of the MCA.
import warnings
warnings.filterwarnings('ignore')
mca_df = df[['SeniorCitizen', 'Partner', 'Dependents', 'MultipleLines',
'FiberOpticService', 'DSLService', 'OnlineSecurity', 'OnlineBackup',
'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
'OneYearContract', 'TwoYearContract', 'PaperlessBilling',
'MailedCheckPayment', 'BankTransferPayment', 'CreditCardPayment']]
mca = prince.MCA(
n_components = 18,
n_iter = 3,
copy = True
)
mca = mca.fit(mca_df)
ax = mca.plot_coordinates(
X = mca_df,
ax = None,
figsize=(20, 20),
show_row_points=True,
row_points_size=10,
show_row_labels=False,
show_column_points=True,
column_points_size=30,
show_column_labels=False,
legend_n_cols=1
)
plt.savefig('Charts/MCA With Rows.svg')
mca_eigenvalues = mca.eigenvalues_
mca_eigenvalues
plt.plot(np.arange(18), mca_eigenvalues, 'ro-')
plt.title("Scree Plot")
plt.xlabel("Principal Component")
plt.ylabel("Eigenvalue")
plt.show()
Beyond the eigenvalues, the MCA coordinate plot shows a number of distinct clusters of columns and their discrimination in the dataset. Although the MCA plots every column twice, once for each of its boolean values, several columns are clustered together on both values.
Some, like StreamingMovies and StreamingTV, not only have an almost negligible difference from one another, shown by the lack of angle between their vectors from the center, but also have virtually the same discrimination, represented by their distance from the center; their points actually overlap in this visualization. This, coupled with their relatively high correlation in the correlation heatmap, means that the final model would likely be improved by removing one of these columns. As StreamingMovies appears to be slightly more discriminating, StreamingTV will be removed.
Other clusters, like DeviceProtection and OnlineBackup, are very closely related to each other but have differing discrimination. DeviceProtection is more discriminating in both value cases and would be kept over OnlineBackup if one of these needed to be removed, but their direct correlation with each other is fairly average based on the correlation coefficient heatmap. These columns and the StreamingTV/StreamingMovies columns make sense as possible interactions because, within each cluster, customers who get one service are significantly more likely to get the other.
Finally, two other natural clusters draw the eye: the first is TechSupport_False, OnlineSecurity_False, and Partner_False, and the second is BankTransferPayment_False, CreditCardPayment_False, and OneYearContract_False. These are interesting, but much less of a concern than the overlapping StreamingTV/StreamingMovies columns, as their True value counterparts are considerably spread out, both by discrimination and by similarity with each other.
mca = prince.MCA(
n_components = 3,
n_iter = 3,
copy = True
)
mca = mca.fit(mca_df)
ax = mca.plot_coordinates(
X = mca_df,
ax = None,
figsize=(20, 20),
show_row_points=False,
show_row_labels=False,
show_column_points=True,
column_points_size=30,
show_column_labels=True,
legend_n_cols=3
)
plt.savefig('Charts/MCA.svg')
pca_df = df[['Tenure', 'MonthlyCharges', 'TotalCharges']]
pca_df = preprocessing.MinMaxScaler().fit_transform(pca_df)
pca_df
# Standardize the continuous columns in the base dataframe for the upcoming analyses.
df['Tenure'] = (df['Tenure'] - df['Tenure'].mean()) / df['Tenure'].std()
df['MonthlyCharges'] = (df['MonthlyCharges'] - df['MonthlyCharges'].mean()) / df['MonthlyCharges'].std()
df['TotalCharges'] = (df['TotalCharges'] - df['TotalCharges'].mean()) / df['TotalCharges'].std()
pca = prince.PCA(
n_components=3,
n_iter=3,
copy=True,
check_input=True,
)
pca = pca.fit(pca_df)
ax = pca.plot_row_coordinates(
pca_df,
ax = None,
figsize = (6, 6),
x_component = 0,
y_component = 1
)
pca.explained_inertia_
pca.column_correlations(pca_df)
In the process of looking into principal components and variable interactions, it is worth looking at the continuous variable columns and reconsidering what they may represent. Tenure may be better described as the number of billing periods a customer has had, and if that is the case, then TotalCharges may be nothing more than Tenure * MonthlyCharges.
As seen in the test below, this isn't exactly the case, but the correlations among Tenure, MonthlyCharges, and TotalCharges are very high, with a Pearson correlation coefficient of 0.83 between Tenure and TotalCharges. Given this high correlation, TotalCharges will be removed from the dataframe before the logistic regression is performed.
df[['Tenure', 'MonthlyCharges', 'TotalCharges']].corr()
# Copy to avoid modifying a view of original_df (prevents a SettingWithCopyWarning).
test_df = original_df[['Tenure', 'MonthlyCharges', 'TotalCharges']].copy()
test_df['TenureMonthlyCharges'] = test_df['Tenure'] * test_df['MonthlyCharges']
test_df['Difference'] = test_df['TenureMonthlyCharges'] - test_df['TotalCharges']
test_df.describe()
sns.distplot(test_df['Difference'])
plt.grid(True)
Now the columns identified as interactions or flagged for removal with the MCA, PCA, and correlation tests can be removed.
df.drop(['StreamingTV', 'TotalCharges'], axis = 1, inplace = True)
To continue exploring the data and to find the most important variables for the predictive analysis, FAMD, or Factor Analysis of Mixed Data will be used.
famd = prince.FAMD(
n_components = 20,
n_iter = 3,
copy = True
)
famd = famd.fit(df)
With the factor analysis performed, we can examine the eigenvalues of the dataset and determine how many factors to retain.
eigenvalues = famd.eigenvalues_
eigenvalues
The scree plot below shows that 4 factors are an adequate number to use for modeling.
plt.plot(np.arange(20), eigenvalues, 'ro-')
plt.title("Scree Plot")
plt.xlabel("Principal Component")
plt.ylabel("Eigenvalue")
plt.show()
famd = prince.FAMD(
n_components = 4,
n_iter = 3,
copy = True
)
famd = famd.fit(df)
eigenvalues = famd.eigenvalues_
eigenvalues
famd.explained_inertia_
However, this equates to only 76.4% of the dataset's variance explained. A 5th factor may be necessary for proper accuracy.
sum(famd.explained_inertia_)
famd.column_correlations(df)
plot = famd.plot_partial_row_coordinates(
df,
ax=None,
figsize=(6, 6),
x_component=0,
y_component=1,
alpha = 0.1
)
plot2 = famd.plot_partial_row_coordinates(
df,
ax=None,
figsize=(6, 6),
x_component=1,
y_component=2,
alpha = 0.1
)
plot3 = famd.plot_partial_row_coordinates(
df,
ax=None,
figsize=(6, 6),
x_component=0,
y_component=2,
alpha = 0.1
)
three_dimension_fa = famd.row_contributions(df)
three_dimension_fa
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection = '3d')
x = three_dimension_fa[0]
y = three_dimension_fa[1]
z = three_dimension_fa[2]
ax.scatter(x, y, z, alpha = 0.1)
three_dimension_fa_churn = three_dimension_fa.copy(deep = True)
three_dimension_fa_churn['Churn'] = df['Churn']
fig = plt.figure()
ax = fig.add_subplot(111, projection = '3d')
x = three_dimension_fa_churn[0]
y = three_dimension_fa_churn[1]
z = three_dimension_fa_churn[2]
color = three_dimension_fa_churn['Churn']
ax.scatter(x, y, z, c = color, alpha = 0.1)
pickle.dump(fig, open('3D FAMD Churn.fig.pickle', 'wb'))
fig.savefig("Charts/FAMD Churn.svg")
Two models will be created using logistic regression: one with the basic cleaned dataframe, and another using the Factor Analysis of Mixed Data results.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, classification_report, roc_curve, auc
x_df = df.drop('Churn', axis = 1)
# Random_state is set to allow exact reproducibility.
x_train, x_test, y_train, y_test = train_test_split(x_df, df['Churn'],
test_size = 0.1, random_state = 1)
regression = LogisticRegression()
regression.fit(x_train, y_train)
regression.intercept_
regression.coef_
This list of coefficients shows the detection of the phenomena that prompted this analysis. Here we see which factors are most associated with customer churn, and from there we can infer what is most likely causing customers to churn. As True indicates that a customer churned, the positive coefficients are the ones that push the predicted probability toward churn, and the negative coefficients push it toward not churning.
coefs = pd.concat([pd.DataFrame(df.columns.drop('Churn')),
pd.DataFrame(np.transpose(regression.coef_))], axis = 1)
coefs.columns = ["Column", "Coefficient"]
coefs
The biggest issue is with customers with FiberOpticService; it is over twice as influential on the probability of churn as the second highest, DSLService. Together these show that the Internet services of this telecommunications business are the areas that need the most attention. Beyond that, customers who pay for streaming movies (and streaming TV, given the correlation that required StreamingTV's removal from the model) are at the next highest risk of churn. Finally, customers enrolled in paperless billing, senior citizens, and customers with multiple phone lines are also groups to pay attention to.
coefs.where(coefs['Coefficient'] > 0).sort_values(['Coefficient'], ascending = False).dropna()
coefs.where(coefs['Coefficient'] < 0).sort_values(['Coefficient']).dropna()
These show that the strongest positive predictor variables are having Internet service, particularly Fiber optic service, paying for movie or TV streaming, and being enrolled in paperless billing. Additionally, these also show the strongest negative predictors of churn are having a non-month-to-month contract, particularly a two year contract, having a higher tenure value, paying for an online security package, and having any payment method other than electronic check.
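Because logistic regression coefficients are log-odds, exponentiating them yields odds ratios, which are often easier to communicate; a coefficient of about 0.69, for example, corresponds to roughly doubling the odds of churn, all else being equal.

# Express each log-odds coefficient as an odds ratio for easier interpretation.
coefs['OddsRatio'] = np.exp(coefs['Coefficient'])
coefs.sort_values('OddsRatio', ascending = False)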
In testing the model, we get an accuracy of 81.13%. This is 31.13 percentage points better than a 50/50 guess, and also better than the majority-class baseline computed below.
accuracy = regression.score(x_test, y_test)
accuracy
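For context, the accuracy of a trivial model that always predicts the majority no-churn class on the same test split can be computed directly:

# Majority-class baseline: the proportion of the test split that did not churn.
1 - y_test.mean()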
predictions = regression.predict(x_test)
actual = y_test
confusion = confusion_matrix(actual, predictions)
confusion
precision = precision_score(actual, predictions)
precision
Though the accuracy is high, the precision and recall on true churns could use some improvement and indicate that the model could be adjusted for better results. A larger proportion of churned customers in the training dataset would help considerably, and in this situation such data would be easier to obtain than in many other domains.
print(classification_report(actual, predictions))
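One low-effort way to address this class imbalance, short of gathering more churned customers, is scikit-learn's class_weight option, which reweights the training loss in inverse proportion to class frequency; a hedged sketch:

# Hypothetical variant: reweight the classes rather than resampling the data.
balanced_regression = LogisticRegression(class_weight = 'balanced')
balanced_regression.fit(x_train, y_train)
print(classification_report(y_test, balanced_regression.predict(x_test)))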
probabilities = regression.predict_proba(x_test)
predictions = probabilities[:,1]
false_positive_rate, true_positive_rate, threshold =\
roc_curve(y_test, predictions)
Finally, by calculating the ROC and AUC, we get an AUC of 0.84, which indicates an excellent model relative to chance.
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc
plt.title('Receiver Operating Characteristic Curve')
plt.plot(false_positive_rate, true_positive_rate, 'blue',
label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/ROC AUC.svg')
plt.show()
famd_df = df.copy(deep = True)
# Random_state is set to allow exact reproducibility.
famd_x_train, famd_x_test, famd_y_train, famd_y_test = train_test_split(famd_df.drop(['Churn'], axis = 1), famd_df['Churn'],
test_size = 0.1, random_state = 1)
famd = prince.FAMD(
n_components = 5,
n_iter = 3,
copy = True
)
famd = famd.fit(famd_x_train)
famd.column_correlations(famd_df)
famd_x = famd.transform(famd_x_train)
famd_x
This regression will be fitted with the FAMD transformed version of the training split data instead of the base split.
famd_regression = LogisticRegression()
famd_regression.fit(famd_x, famd_y_train)
famd_regression.coef_
famd_accuracy = famd_regression.score(famd.transform(famd_x_test), famd_y_test)
famd_accuracy
famd_predictions = famd_regression.predict(famd.transform(famd_x_test))
famd_actual = famd_y_test
famd_confusion = confusion_matrix(famd_actual, famd_predictions)
famd_confusion
famd_precision = precision_score(famd_actual, famd_predictions)
famd_precision
These results can now be compared directly with those of the model trained on the base dataset.
print(classification_report(famd_actual, famd_predictions))
famd_probabilities = famd_regression.predict_proba(famd.transform(famd_x_test))
famd_predictions = famd_probabilities[:,1]
famd_false_positive_rate, famd_true_positive_rate, famd_threshold =\
roc_curve(famd_y_test, famd_predictions)
Finally, the ROC and AUC are calculated so that this model can be compared against the base model's AUC of 0.84.
famd_roc_auc = auc(famd_false_positive_rate, famd_true_positive_rate)
famd_roc_auc
plt.title('FAMD Based Receiver Operating Characteristic Curve')
plt.plot(famd_false_positive_rate, famd_true_positive_rate, 'blue',
label = 'FAMD AUC = %0.2f' % famd_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('FAMD True Positive Rate')
plt.xlabel('FAMD False Positive Rate')
plt.savefig('Charts/FAMD ROC AUC.svg')
plt.show()
df.to_csv("Cleaned Data.csv", index = False)
To recap the most important findings: fiber optic internet service is by far the strongest indicator of churn, followed by DSL service, streaming subscriptions, paperless billing, senior citizen status, and multiple phone lines, while long-term contracts, higher tenure, the online security package, and payment methods other than electronic check are the strongest indicators of retention. The Internet services of this telecommunications business are the areas that need the most attention.
Ways to continue this analysis:
Wickham, H., & Grolemund, G. (2017). R for Data Science. Retrieved from https://r4ds.had.co.nz/