Western Governors University Data Mining and Analytics II Project
The goal of this data mining and analysis project is to examine customer attrition patterns and to determine which indicators can be used to make better business decisions that prevent loss. Furthermore, a predictive model will be built from this analysis to estimate whether or not a given customer is likely to discontinue business.
This project is done in Python using libraries from the scientific Python ecosystem, with a few other libraries for various uses. Specifically, this involves using pandas for data wrangling, manipulation, and cleaning; matplotlib and Seaborn for data visualization; and scikit-learn for model creation and analysis.
Like R and unlike SAS, all of these are easily available, free, and open-source. Python has been chosen over R for ease of explanation, as Python code is often understood more readily than R, and because of the potential for integrating this project directly into a program or other software in the future. While R is highly specialized for statistics and mathematics, Python is a general-purpose programming language with specialized libraries for the needed tools, which facilitates future project expansion.
The primary descriptive method used in this project is FAMD, or Factor Analysis of Mixed Data. FAMD is performed both to examine the variables and determine which are the most important, and to reduce the number of variables without overly reducing the explained variance, so that the predictive method will be easier to create and use.
The primary predictive method is logistic regression, and this technique is used to build the predictive model. Logistic regression is appropriate because the dependent variable is Boolean, and logistic regression directly models the probability of a binary outcome. Additionally, three of the variables are continuous, which fit naturally into any type of regression, and the remainder are nominal categorical, which can be brought into the regression by encoding them as dummy variables. Finally, the coefficient that the model provides for each parameter can be used to see how much that variable affects the chance of churn. The result is a model that is robust, easy to understand, and simple to use in predicting whether or not a customer will churn.
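For reference, the fitted model estimates the probability of churn as the logistic function of a linear combination of the predictors, $P(\text{Churn}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}$, where each coefficient $\beta_i$ represents the change in the log-odds of churn per unit increase in its predictor $x_i$.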
The essential phenomenon to be predicted is which factors are present among customers who are leaving, as indicated by the Churn column; these can be detected by finding which independent variables have a high positive correlation with Churn. Additionally, finding the variables associated with a lower churn rate is useful for customer retention and is an equally important goal. Beyond this, it is important to find what combination of all these factors is most associated with churn so that smart business decisions can be made in the future.
The target and dependent variable in this dataset is the Churn column, which indicates whether or not the given customer has churned and ceased business. It is a nominal categorical variable that is originally loaded as a string (object dtype) with yes/no values but is then tidied into a proper Boolean with True/False values.
There are many potential independent variables that can be used as predictor variables in this dataset. Two of the most important independent variables overall are the MonthlyCharges and Tenure columns, which contain the continuous quantitative values representing how much each customer pays per month according to contract and how long that customer has had a business relationship with this company. These are among the most important because of how strongly financial figures factor into customers' decisions. A higher MonthlyCharges value may well represent more pressure to leave for a cheaper competitor as soon as a contract is over or, in the case of a month-to-month contract, to end it sooner rather than later. Meanwhile, a higher Tenure value is indicative of a customer who has invested more thus far and is less likely to leave based on any single factor.
For other independent variables, PhoneService and InternetService do an excellent job of grouping customer type, and the various columns that require one of these grouping columns to indicate a paid service can reveal that a disproportionate number of churning customers belong to a very specific subset. For example, if all customers with internet service through this company, specifically DSL service, are leaving and choosing not to renew contracts after previous ones are up, then the reasons why need to be looked into specifically. Additionally, DSL customers could be prime targets for customer relations programs or improved sales because of this relation; a competitor may be offering a sale far better than the company currently offers or is willing to match.
The first goal and requirement in manipulating the data is to have a dataset that can be considered both tidy and clean. For data to be tidy, each variable must have its own column, each observation must have its own row, and each value must have its own cell (Wickham & Grolemund, 2017, 12.2). For data to be clean, it needs to be free from obvious errors, formatted into the correct form of data, devoid of missing values, stripped of irrelevant columns, and consistent within itself. Additionally, the data will be broken down so that dummy encodings (one-hot encoding) represent the categorical data rather than the string values in the initial dataset, so that it can easily be used in an appropriate model. Beyond this, outliers and interactions must be found and removed if any exist, to ensure the integrity of the resulting analyses and predictive model.
# Import all modules and libraries, as well as set matplotlib plotting to occur in the notebook.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as scs
import prince
import pickle
from sklearn import preprocessing
%matplotlib inline
This dataset has 21 columns, a number of which may be useful for predicting attrition, which is represented by the final column "Churn." Furthermore, it appears that no values are missing, or more specifically, NaN, but there is still potential for outliers or incorrect data.
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
df.info()
df.head(10)
To begin the data cleaning process after the initial loading and examining of the dataframe structure, each column is examined and any of the columns that have null values are flagged.
for col in df.columns.values:
    print(col + ": " + str(df[col].isnull().values.any()))
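A more concise, vectorized version of the same null check:

df.isnull().any()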
n_rows = len(df)
n_rows
The column customerID appears to be a structured string simply used for identifying customers. While this column is extremely important in relational databases and would be useful if we planned on any dataframe splits and subsequent joins, it is not useful for analysis and will be removed early on to simplify later steps.
df.drop('customerID', axis = 1, inplace = True)
df.info()
As the customerID column is the only one that will need to be removed at this stage of the data exploration, the remainder of this first step is to go through each column, one by one, and perform the most basic cleaning and exploratory tasks: renaming columns to a consistent convention, converting boolean-like columns to the bool dtype, and examining each column's value counts and proportions.
df = df.rename(columns = {"gender": "Gender"})
df['Gender'].value_counts()
df['Gender'].value_counts(normalize = True)
SeniorCitizen appears to be a boolean value indicating whether or not the row corresponds to a customer who is a senior citizen, though what exact age that represents is unknown. 16.21% of customers in this dataset are senior citizens. This is the first example of a boolean column in this dataset, and these columns do not follow a single convention overall. The boolean-like object columns will all be changed to an actual bool dtype as they come up.
df['SeniorCitizen'] = df['SeniorCitizen'].astype('bool')
df.info()
df['SeniorCitizen'].value_counts()
df['SeniorCitizen'].value_counts(normalize = True)
The Partner column may relate to a business partnership or some similar idea, but its position between the age-related columns and Dependents, when the columns seem ordered by topic, leads to the belief that it instead indicates the existence of a spouse or established relationship partner in boolean form. 48.3% of customers in this dataset have a partner. It is another column that needs to be changed to the bool dtype.
df['Partner'] = df['Partner'].replace({"No":False, "Yes":True})
df['Partner'] = df['Partner'].astype('bool')
df.info()
df['Partner'].value_counts()
df['Partner'].value_counts(normalize = True)
df['Dependents'] = df['Dependents'].replace({"No":False, "Yes":True})
df['Dependents'] = df['Dependents'].astype('bool')
df.info()
df['Dependents'].value_counts()
df['Dependents'].value_counts(normalize = True)
The tenure column, whose name notably does not follow the column naming conventions, seems to carry the business meaning of the term: the number of years or some other time period that the customer has been with the company. Interestingly, the values range from 0 to 72. The highest value, 72, is the second most common, while 0, the lowest value, is the least common. To make everything consistent, the column title will be updated.
df = df.rename(columns = {"tenure": "Tenure"})
df['Tenure'].value_counts()
df['PhoneService'] = df['PhoneService'].replace({"No":False, "Yes":True})
df['PhoneService'] = df['PhoneService'].astype('bool')
df.info()
df['PhoneService'].value_counts()
df['PhoneService'].value_counts(normalize = True)
MultipleLines is a string value that indicates whether the given customer has multiple lines on their account, has only a single line on their account, or if this column isn't applicable for the customer, that they do not have phone service through this company. 42.18% of total customers in this dataset have multiple lines, and 46.71% of customers with phone service in this dataset have multiple lines.
df['MultipleLines'].value_counts()
n_multiple_lines = df['MultipleLines'].value_counts()['Yes']
lines_proportion = n_multiple_lines / n_rows
print("{0:.2%}".format(lines_proportion))
lines_proportion2 = n_multiple_lines / (n_rows - df['MultipleLines'].value_counts()['No phone service'])
print("{0:.2%}".format(lines_proportion2))
InternetService is a string value indicating whether or not a customer is paying for internet service, and if so, whether the service is a fiber optic plan or a DSL plan. 78.33% of customers in this dataset have an internet plan. 43.96% and 34.37% of total customers in this dataset have a fiber or DSL plan respectively, and 56.12% and 43.88% of customers with internet service have fiber or DSL plans respectively.
Many of the columns in this dataset signify whether or not a service is included in the given customer's plan. These columns would be boolean, but have a third value indicating that the customer does not have the service plan required for the column to apply. While these may be converted into bool dtype columns for later analysis, the boolean-like yes and no values will be changed to True or False on this first pass.
df['InternetService'] = df['InternetService'].replace({"No":"False"})
df['InternetService'].value_counts()
n_internet = n_rows - df['InternetService'].value_counts()['False']
internet_proportion = n_internet / n_rows
print("{0:.2%}".format(internet_proportion))
n_fiber = df['InternetService'].value_counts()['Fiber optic']
n_dsl = df['InternetService'].value_counts()['DSL']
fiber_proportion = n_fiber / n_rows
dsl_proportion = n_dsl / n_rows
print("Fiber: {0:.2%}\nDSL: {1:.2%}".format(fiber_proportion, dsl_proportion))
fiber_proportion2 = n_fiber / n_internet
dsl_proportion2 = n_dsl / n_internet
print("Fiber: {0:.2%}\nDSL: {1:.2%}".format(fiber_proportion2, dsl_proportion2))
OnlineSecurity is a string value that indicates whether the given customer has an online security package in addition to their internet service, did not opt for the online security package, or if this column isn't applicable for the customer, that they do not have internet service through this company. 28.67% of total customers in this dataset have the online security package, and 36.6% of customers with internet service in this dataset have the online security package. Its values will be updated to the more boolean-like format.
df['OnlineSecurity'] = df['OnlineSecurity'].replace({"No":"False", "Yes":"True"})
df['OnlineSecurity'].value_counts()
n_security = df['OnlineSecurity'].value_counts()['True']
security_proportion = n_security / n_rows
print("{0:.2%}".format(security_proportion))
security_proportion2 = n_security / (n_rows - df['OnlineSecurity'].value_counts()['No internet service'])
print("{0:.2%}".format(security_proportion2))
OnlineBackup is a string value that indicates whether the given customer has an online backup package in addition to their internet service, did not opt for the online backup package, or, if this column isn't applicable for the customer, that they do not have internet service through this company. 34.49% of total customers in this dataset have the online backup package, and 44.03% of customers with internet service in this dataset have the online backup package. Its values will be updated to the more boolean-like format.
df['OnlineBackup'] = df['OnlineBackup'].replace({"No":"False", "Yes":"True"})
df['OnlineBackup'].value_counts()
n_backup = df['OnlineBackup'].value_counts()['True']
backup_proportion = n_backup / n_rows
print("{0:.2%}".format(backup_proportion))
backup_proportion2 = n_backup / (n_rows - df['OnlineBackup'].value_counts()['No internet service'])
print("{0:.2%}".format(backup_proportion2))
DeviceProtection is a string value that indicates whether the given customer has a device protection plan in addition to their internet service, did not opt for any protection plan, or if this column isn't applicable for the customer, that they do not have internet service through this company. 34.39% of total customers in this dataset have a device protection plan, and 43.90% of customers with internet service in this dataset have a device protection plan. Its values will be updated to the more boolean-like format.
df['DeviceProtection'] = df['DeviceProtection'].replace({"No":"False", "Yes":"True"})
df['DeviceProtection'].value_counts()
n_protection = df['DeviceProtection'].value_counts()['True']
protection_proportion = n_protection / n_rows
print("{0:.2%}".format(protection_proportion))
protection_proportion2 = n_protection / (n_rows - df['DeviceProtection'].value_counts()['No internet service'])
print("{0:.2%}".format(protection_proportion2))
OnlineBackup and DeviceProtection present the first potential major issue for future analysis. Both require that the customer have a value other than No in the InternetService column, and both have extremely similar counts: 2429 and 2422, or 44.03% and 43.90% respectively. The correlation between the columns' values, particularly the shared No internet service rows, is going to prevent proper testing and model building if they are tested together.
contingency_table = pd.crosstab(df['OnlineBackup'], df['DeviceProtection'])
contingency_table
As we will see shortly, a number of the columns require internet service to have meaningful values and are in the same situation as the above section, so the data will have to be restructured to account for this.
TechSupport is a string value that indicates whether the given customer has a tech support package in addition to their internet service, did not opt for any tech support package, or if this column isn't applicable for the customer, that they do not have internet service through this company. 29.02% of total customers in this dataset have a tech support package, and 37.05% of customers with internet service in this dataset have a tech support package. Its values will be updated to the more boolean-like format.
df['TechSupport'] = df['TechSupport'].replace({"No":"False", "Yes":"True"})
df['TechSupport'].value_counts()
n_support = df['TechSupport'].value_counts()['True']
support_proportion = n_support / n_rows
print("{0:.2%}".format(support_proportion))
support_proportion2 = n_support / (n_rows - df['TechSupport'].value_counts()['No internet service'])
print("{0:.2%}".format(support_proportion2))
StreamingTV is a string value that indicates whether the given customer has a TV streaming package in addition to their internet service, did not opt for any TV streaming package, or if this column isn't applicable for the customer, that they do not have internet service through this company. 38.44% of total customers in this dataset have a TV streaming package, and 49.07% of customers with internet service in this dataset have a TV streaming package. Its values will be updated to the more boolean-like format.
df['StreamingTV'] = df['StreamingTV'].replace({"No":"False", "Yes":"True"})
df['StreamingTV'].value_counts()
n_tv = df['StreamingTV'].value_counts()['True']
tv_proportion = n_tv / n_rows
print("{0:.2%}".format(tv_proportion))
tv_proportion2 = n_tv / (n_rows - df['StreamingTV'].value_counts()['No internet service'])
print("{0:.2%}".format(tv_proportion2))
StreamingMovies is a string value that indicates whether the given customer has a movie streaming package in addition to their internet service, did not opt for any movie streaming package, or if this column isn't applicable for the customer, that they do not have internet service through this company. 38.79% of total customers in this dataset have a movie streaming package, and 49.52% of customers with internet service in this dataset have a movie streaming package. Its values will be updated to the more boolean-like format.
df['StreamingMovies'] = df['StreamingMovies'].replace({"No":"False", "Yes":"True"})
df['StreamingMovies'].value_counts()
n_movie = df['StreamingMovies'].value_counts()['True']
movie_proportion = n_movie / n_rows
print("{0:.2%}".format(movie_proportion))
movie_proportion2 = n_movie / (n_rows - df['StreamingMovies'].value_counts()['No internet service'])
print("{0:.2%}".format(movie_proportion2))
Contract is a string value that indicates what type of contract the given customer is under. The contract type values are Month-to-month, One year, and Two year. 55.02% of customers are under a month-to-month contract, 24.07% are under a one year contract, and the remaining 20.91% are under a two year contract.
df['Contract'].value_counts()
df['Contract'].value_counts(normalize = True)
df['PaperlessBilling'] = df['PaperlessBilling'].replace({"No":False, "Yes":True})
df['PaperlessBilling'] = df['PaperlessBilling'].astype('bool')
df.info()
df['PaperlessBilling'].value_counts()
df['PaperlessBilling'].value_counts(normalize = True)
PaymentMethod is a string value that indicates what type of payment method the given customer uses. The payment method values include electronic checks, traditional checks via mail, automatic bank transfer, and automatic payment with a credit card. 33.58% of customers pay via electronic check, 22.89% pay via mailed check, 21.92% pay automatically via bank transfer, and the remaining 21.61% pay automatically via credit card.
df['PaymentMethod'].value_counts()
df['PaymentMethod'].value_counts(normalize = True)
df['MonthlyCharges'].value_counts()
df['MonthlyCharges'].describe()
plt.hist(df['MonthlyCharges'])
plt.title('Distribution of MonthlyCharges')
plt.xlabel('Monthly Charge')
plt.ylabel('Monthly Charge Frequency')
plt.grid(True)
plt.show()
TotalCharges appears to be a column that represents the total amount a given customer has paid throughout all transactions of the business-customer relationship. However, the column is loaded as the object dtype, or a string, instead of a float. This will have to be changed to examine the data effectively.
df['TotalCharges'].describe()
These 11 rows have missing values in their TotalCharges column, which appears to be the only column with missing values. Notably, these customers' Tenure values are all 0, and none of them have churned. It seems likely that these customers don't have a TotalCharges value simply because they haven't been charged yet. As such, their TotalCharges value will be set to 0 to allow for later analysis.
df.loc[pd.to_numeric(df['TotalCharges'], errors='coerce').isnull()]
tofix = [488, 753, 936, 1082, 1340, 3331, 3826, 4380, 5218, 6670, 6754]
df.loc[tofix, 'TotalCharges'] = 0
df.iloc[tofix]['TotalCharges']
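Equivalently, the same repair can be written without hardcoding row labels, which protects against the row positions shifting if the source file changes; a small sketch:

# Flag the rows whose TotalCharges string cannot be parsed as a number, then zero them out.
blank_totals = pd.to_numeric(df['TotalCharges'], errors = 'coerce').isnull()
df.loc[blank_totals, 'TotalCharges'] = 0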
Now we can change the column to the float dtype and then examine the data. With the 11 zero-filled rows included, the minimum value is 0, the maximum is 8684.80, and the mean is roughly 2280. Overall, the column is very right-skewed.
# Correct the TotalCharges column's dtype from object to float.
df['TotalCharges'] = df['TotalCharges'].astype('float')
df.info()
df['TotalCharges'].describe()
plt.hist(df['TotalCharges'])
plt.title('Distribution of TotalCharges')
plt.xlabel('Total Charge')
plt.ylabel('Total Charge Frequency')
plt.grid(True)
plt.show()
df['Churn'] = df['Churn'].replace({"No":False, "Yes":True})
df['Churn'] = df['Churn'].astype('bool')
df.info()
df['Churn'].value_counts()
df['Churn'].value_counts(normalize = True)
df.info()
In this dataset, the three continuous variable columns do not have any outliers. None are present based on IQR range, as seen on the boxplots, and none are present based on the z-score with a threshold of 3 standard deviations.
sns.boxplot(df['Tenure'])
tenure_z = np.abs(scs.zscore(df['Tenure']))
tenure_z
print(np.where(tenure_z > 3))
sns.boxplot(df['MonthlyCharges'])
monthly_z = np.abs(scs.zscore(df['MonthlyCharges']))
monthly_z
print(np.where(monthly_z > 3))
sns.boxplot(df['TotalCharges'])
total_z = np.abs(scs.zscore(df['TotalCharges']))
total_z
print(np.where(total_z > 3))
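For completeness, the IQR criterion mentioned above can also be checked programmatically rather than read off the boxplots; a short sketch:

# Count values falling outside 1.5 * IQR of the quartiles for each continuous column.
for col in ['Tenure', 'MonthlyCharges', 'TotalCharges']:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    n_outliers = ((df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)).sum()
    print(col + ": " + str(n_outliers) + " IQR outliers")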
Contingency tables will be calculated for all variables in order to see how they appear to affect the churn rate before using more intensive methods.
However, this is also where the categorical variables become an issue for analyzing the data set. The next step in the overall process is to transform the data into the structure that the models need. As logistic regression will be performed, the categorical columns will need to be encoded into dummy variables, which are expressed in Boolean values.
In addition to the information encoding needing to be changed, a number of these variables are inherently correlated with each other; e.g. MultipleLines with PhoneService, and OnlineSecurity with InternetService. To do meaningful tests of significance, this intercorrelation needs to be removed before the tests are run. Specifically, the problem is that the values of "no phone service" and "no internet service" are also factoring into the tests. There are a few approaches available to deal with this.
The first is to split the dataframe into three distinct dataframes: one for customers with only phone service, one for customers with only internet service, and the last for customers with both phone and internet service through this company. This is less than ideal for several reasons. Rather than a single dataframe with a single set of calculations, this would create three dataframes and require three tests for each calculation, further complicating how all of the tests would be combined.
phone_count = df.query("PhoneService == True and InternetService == 'False'").shape[0]
phone_count
internet_count = df.query("PhoneService == False and InternetService != 'False'").shape[0]
internet_count
both_count = df.query("PhoneService == True and InternetService != 'False'").shape[0]
both_count
(phone_count + internet_count + both_count) == n_rows
The second option is to simply remove the subsection of rows that lack the prerequisite service, effectively removing the not-applicable values in the process. This would technically work, but it would also discard a significant amount of information.
The remaining option is to manipulate the encoding and representation of the data. The categories that these all involve can be split into multiple boolean columns representing the "positive" values of the original columns. As an example, the MultipleLines column would be changed to be a boolean True/False option, with all rows that do not have phone service also being represented as False. Ultimately, the multiple logistic regression model is only going to adjust the calculation with a parameter if the value is true, and to create the model, these dummy variables have to be created regardless.
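The cells that follow perform this recoding column by column. As a compact alternative, a reusable helper along the lines of the sketch below (collapse_to_bool is a hypothetical name) would express the same transformation:

# Collapse a yes/no/not-applicable service column into a plain boolean,
# treating the not-applicable value as False.
def collapse_to_bool(series, na_value):
    mapping = {"No": False, "Yes": True, "False": False, "True": True, na_value: False}
    return series.replace(mapping).astype('bool')
# Hypothetical usage: df['OnlineSecurity'] = collapse_to_bool(df['OnlineSecurity'], "No internet service")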
df.info()
original_df = df.copy(deep = True)
df['Gender'].value_counts()
df['Gender'] = df['Gender'].replace({"Male":False, "Female":True})
df['Gender'] = df['Gender'].astype('bool')
df['Gender'].value_counts()
df['MultipleLines'].value_counts()
df.loc[df['MultipleLines'] == "No phone service", 'MultipleLines'] = np.NaN
df['MultipleLines'] = df['MultipleLines'].replace({"No":False, "Yes":True})
df['MultipleLines'] = df['MultipleLines'].fillna(False).astype('bool')
df['MultipleLines'].value_counts()
df['InternetService'].value_counts()
df['FiberOpticService'] = False
df.loc[df['InternetService'] == "Fiber optic", 'FiberOpticService'] = True
df['FiberOpticService'].value_counts()
df['DSLService'] = False
df.loc[df['InternetService'] == "DSL", 'DSLService'] = True
df['DSLService'].value_counts()
df.drop('InternetService', axis = 1, inplace = True)
df['OnlineSecurity'].value_counts()
df.loc[df['OnlineSecurity'] == "No internet service", 'OnlineSecurity'] = np.NaN
df['OnlineSecurity'] = df['OnlineSecurity'].replace({"False":False, "True":True})
df['OnlineSecurity'] = df['OnlineSecurity'].fillna(False).astype('bool')
df['OnlineSecurity'].value_counts()
df['OnlineBackup'].value_counts()
df.loc[df['OnlineBackup'] == "No internet service", 'OnlineBackup'] = np.NaN
df['OnlineBackup'] = df['OnlineBackup'].replace({"False":False, "True":True})
df['OnlineBackup'] = df['OnlineBackup'].fillna(False).astype('bool')
df['OnlineBackup'].value_counts()
df['DeviceProtection'].value_counts()
df.loc[df['DeviceProtection'] == "No internet service", 'DeviceProtection'] = np.NaN
df['DeviceProtection'] = df['DeviceProtection'].replace({"False":False, "True":True})
df['DeviceProtection'] = df['DeviceProtection'].fillna(False).astype('bool')
df['DeviceProtection'].value_counts()
df['TechSupport'].value_counts()
df.loc[df['TechSupport'] == "No internet service", 'TechSupport'] = np.NaN
df['TechSupport'] = df['TechSupport'].replace({"False":False, "True":True})
df['TechSupport'] = df['TechSupport'].fillna(False).astype('bool')
df['TechSupport'].value_counts()
df['StreamingTV'].value_counts()
df.loc[df['StreamingTV'] == "No internet service", 'StreamingTV'] = np.NaN
df['StreamingTV'] = df['StreamingTV'].replace({"False":False, "True":True})
df['StreamingTV'] = df['StreamingTV'].fillna(False).astype('bool')
df['StreamingTV'].value_counts()
df['StreamingMovies'].value_counts()
df.loc[df['StreamingMovies'] == "No internet service", 'StreamingMovies'] = np.NaN
df['StreamingMovies'] = df['StreamingMovies'].replace({"False":False, "True":True})
df['StreamingMovies'] = df['StreamingMovies'].fillna(False).astype('bool')
df['StreamingMovies'].value_counts()
The most common type of contract is the Month-to-month contract, so it will be the reference level, having both the new one year and two year contract columns as False.
df['Contract'].value_counts()
df['OneYearContract'] = False
df['TwoYearContract'] = False
df.loc[df['Contract'] == "One year", 'OneYearContract'] = True
df.loc[df['Contract'] == "Two year", 'TwoYearContract'] = True
df['OneYearContract'].value_counts()
df['TwoYearContract'].value_counts()
df.drop('Contract', axis = 1, inplace = True)
The most common type of payment method is by electronic check, so it will be the reference level, having the new columns mailed check, bank transfer, and credit card as False.
df['PaymentMethod'].value_counts()
df['MailedCheckPayment'] = False
df['BankTransferPayment'] = False
df['CreditCardPayment'] = False
df.loc[df['PaymentMethod'] == "Mailed check", 'MailedCheckPayment'] = True
df.loc[df['PaymentMethod'] == "Bank transfer (automatic)", 'BankTransferPayment'] = True
df.loc[df['PaymentMethod'] == "Credit card (automatic)", 'CreditCardPayment'] = True
df['MailedCheckPayment'].value_counts()
df['BankTransferPayment'].value_counts()
df['CreditCardPayment'].value_counts()
df.drop('PaymentMethod', axis = 1, inplace = True)
Finally, reorganize all of the columns back to the order that the original dataset was in.
cols = df.columns
cols
# Reorders the columns back to how the original dataframe had them.
df = df[['Gender', 'SeniorCitizen', 'Partner', 'Dependents', 'Tenure',
'PhoneService', 'MultipleLines', 'FiberOpticService', 'DSLService',
'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
'StreamingTV', 'StreamingMovies', 'OneYearContract', 'TwoYearContract',
'PaperlessBilling', 'MailedCheckPayment', 'BankTransferPayment',
'CreditCardPayment', 'MonthlyCharges', 'TotalCharges', 'Churn']]
This section culminates in the dataframe below. All categorical variables have been recoded into Boolean values for the future regression modeling, and the continuous variables remain intact: integer for the Tenure column and floating point for the two charges columns.
df.info()
As all of the distributions have been examined in the data exploration and cleaning stage, this section acts as a recap and a point at which to represent them graphically. The following code reformats the cleaned dataframe into its long form so that the resulting graphs are generated together in a single chart with shared axes, which should allow for easy comparison of proportions. The large chart of all categorical variables is followed by the summary statistics and a histogram showing the distribution of each continuous variable.
viz_df = df.drop(['Tenure', 'MonthlyCharges', 'TotalCharges'], axis = 1)
viz_df.columns
melted = viz_df.melt()
melted
g = sns.FacetGrid(
melted,
col = 'variable',
hue = 'value',
sharey = 'row',
sharex = 'col',
col_wrap = 7,
legend_out = True,
)
g = g.map(sns.countplot, 'value', order = [False, True]).add_legend()
plt.subplots_adjust(top = 0.9)
g.fig.suptitle('Univariate Categorical Variable Distributions')
# Add the percentage proportion value for each column in each variable plot.
for i, ax in enumerate(g.axes):
# Set each plot's title to its corresponding column name.
ax.set_title(viz_df.columns[i])
for p in ax.patches:
percentage = "{0:.2f}".format((p.get_height() / n_rows) * 100) + "%"
ax.annotate(percentage, (p.get_x() + p.get_width() / 2., p.get_height()),
ha = 'center', va = 'center', xytext = (0, 7), textcoords = 'offset points')
g.savefig("Charts/Univariate Distributions.svg")
g
df['Tenure'].describe()
plt.hist(df['Tenure'])
plt.title('Distribution of Tenure')
plt.xlabel('Tenure')
plt.ylabel('Tenure Frequency')
plt.grid(True)
plt.show()
df['MonthlyCharges'].describe()
plt.hist(df['MonthlyCharges'])
plt.title('Distribution of MonthlyCharges')
plt.xlabel('Monthly Charge')
plt.ylabel('Monthly Charge Frequency')
plt.grid(True)
plt.show()
df['TotalCharges'].describe()
plt.hist(df['TotalCharges'])
plt.title('Distribution of TotalCharges')
plt.xlabel('Total Charge')
plt.ylabel('Total Charge Frequency')
plt.grid(True)
plt.show()
For ease of interpretation and usage given the 2x2 relationships between churn and the other categorical variables, simple contingency tables were used to visualize the categorical bivariate distributions. Bar, pie, and mosaic plots were all considered, but all seemed to overcomplicate the presentation. In each of the following sections, the first table is the raw numbers of the categorical variables, and the second table contains the same information but normalized into proportions. The two columns Contract and PaymentMethod have more values than just True/False. These are visualized with bar plots grouped by the rows’ Churn values.
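Before stepping through each variable individually, a compact loop like this sketch could report every $\chi^2$ p-value at once; the sections that follow show the full tables and tests column by column.

# Run the same chi-squared independence test against Churn for every recoded categorical column.
for col in df.columns.drop(['Tenure', 'MonthlyCharges', 'TotalCharges', 'Churn']):
    table = pd.crosstab(df[col], df['Churn'])
    chi2, p, dof, expected = scs.chi2_contingency(table)
    print(col + ": p = " + "{0:.4g}".format(p))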
gender_contingency = pd.crosstab(df["Gender"], df["Churn"])
gender_contingency
pd.crosstab(df["Gender"], df["Churn"], normalize = "index")
The $\chi^2$ test does not show significance, so this column will not be examined in more detail later.
scs.chi2_contingency(gender_contingency)
senior_contingency = pd.crosstab(df["SeniorCitizen"], df["Churn"])
senior_contingency
pd.crosstab(df["SeniorCitizen"], df["Churn"], normalize = "index")
The $\chi^2$ test shows significance, so this column should be examined in more detail later.
scs.chi2_contingency(senior_contingency)
partner_contingency = pd.crosstab(df["Partner"], df["Churn"])
partner_contingency
pd.crosstab(df["Partner"], df["Churn"], normalize = "index")
The $\chi^2$ test shows significance, so this column should be examined in more detail later.
scs.chi2_contingency(partner_contingency)
dependents_contingency = pd.crosstab(df["Dependents"], df["Churn"])
dependents_contingency
pd.crosstab(df["Dependents"], df["Churn"], normalize = "index")
The $\chi^2$ test shows significance, so this column should be examined in more detail later.
scs.chi2_contingency(dependents_contingency)
phone_contingency = pd.crosstab(df["PhoneService"], df["Churn"])
phone_contingency
pd.crosstab(df["PhoneService"], df["Churn"], normalize = "index")
The $\chi^2$ test does not show significance, so this column will not be examined in more detail later.
scs.chi2_contingency(phone_contingency)
lines_contingency = pd.crosstab(df["MultipleLines"], df["Churn"])
lines_contingency
pd.crosstab(df["MultipleLines"], df["Churn"], normalize = "index")
The $\chi^2$ test shows significance, so this column should be examined in more detail later. It may be that a greater number of lines increases the probability of churn as a result of the higher cost overall.
scs.chi2_contingency(lines_contingency)
fiber_contingency = pd.crosstab(df["FiberOpticService"], df["Churn"])
fiber_contingency
pd.crosstab(df["FiberOpticService"], df["Churn"], normalize = "index")
The $\chi^2$ test shows significance, so this column should be examined in more detail later. Customers who had been purchasing fiber optic services in this data set have a substantial difference in churn rate and should be examined in much greater detail.
scs.chi2_contingency(fiber_contingency)
dsl_contingency = pd.crosstab(df["DSLService"], df["Churn"])
dsl_contingency
pd.crosstab(df["DSLService"], df["Churn"], normalize = "index")
The $\chi^2$ test shows significance, so this column should be examined in more detail later. Customers who had not been purchasing DSL services in this data set have a substantial difference in churn rate and should be examined in much greater detail.
scs.chi2_contingency(dsl_contingency)
security_contingency = pd.crosstab(df["OnlineSecurity"], df["Churn"])
security_contingency
pd.crosstab(df["OnlineSecurity"], df["Churn"], normalize = "index")
The $\chi^2$ test shows significance, so this column should be examined in more detail later. Customers who had been purchasing internet service but not the online security package in this data set have a substantial difference in churn rate and should be examined in much greater detail.
scs.chi2_contingency(security_contingency)
backup_contingency = pd.crosstab(df["OnlineBackup"], df["Churn"])
backup_contingency
pd.crosstab(df["OnlineBackup"], df["Churn"], normalize = "index")
The $\chi^2$ test shows significance, so this column should be examined in more detail later.
scs.chi2_contingency(backup_contingency)
dev_protection_contingency = pd.crosstab(df["DeviceProtection"], df["Churn"])
dev_protection_contingency
pd.crosstab(df["DeviceProtection"], df["Churn"], normalize = "index")
The $\chi^2$ test shows significance, so this column should be examined in more detail later.
scs.chi2_contingency(dev_protection_contingency)
support_contingency = pd.crosstab(df["TechSupport"], df["Churn"])
support_contingency
pd.crosstab(df["TechSupport"], df["Churn"], normalize = "index")
The $\chi^2$ test shows significance, so this column should be examined in more detail later. It appears that customers who have a tech support package had a significantly lower churn rate.
scs.chi2_contingency(support_contingency)
tv_contingency = pd.crosstab(df["StreamingTV"], df["Churn"])
tv_contingency
pd.crosstab(df["StreamingTV"], df["Churn"], normalize = "index")
The $\chi^2$ test shows significance, so this column should be examined in more detail later.
scs.chi2_contingency(tv_contingency)
movies_contingency = pd.crosstab(df["StreamingMovies"], df["Churn"])
movies_contingency
pd.crosstab(df["StreamingMovies"], df["Churn"], normalize = "index")
The $\chi^2$ test shows significance, so this column should be examined in more detail later.
scs.chi2_contingency(movies_contingency)
contract_contingency = pd.crosstab(original_df["Contract"], original_df["Churn"])
contract_contingency
pd.crosstab(original_df["Contract"], original_df["Churn"], normalize = "index")
sns.countplot(x = original_df["Contract"], hue = original_df["Churn"])
The $\chi^2$ test shows significance, so this column should be examined in more detail later. Customers on a month-to-month contract in this data set have a substantially different churn rate, likely because no contractual commitment stands in the way of leaving, and should be examined in much greater detail.
scs.chi2_contingency(contract_contingency)
paperless_contingency = pd.crosstab(df["PaperlessBilling"], df["Churn"])
paperless_contingency
pd.crosstab(df["PaperlessBilling"], df["Churn"], normalize = "index")
The $\chi^2$ test shows significance, so this column should be examined in more detail later. Customers who had been enrolled in paperless billing have a substantial difference in churn rate and should be examined in more detail later.
scs.chi2_contingency(paperless_contingency)
payment_contingency = pd.crosstab(original_df["PaymentMethod"], original_df["Churn"])
payment_contingency
pd.crosstab(original_df["PaymentMethod"], original_df["Churn"], normalize = "index")
fig = sns.countplot(x = original_df["PaymentMethod"], hue = original_df["Churn"])
fig.set_xticklabels(fig.get_xticklabels(), rotation = 45)
The $\chi^2$ test shows significance, so this column should be examined in more detail later. Customers who had been paying via electronic check have a substantial difference in churn rate and should be examined in much greater detail later.
scs.chi2_contingency(payment_contingency)
The three continuous variables, Tenure, MonthlyCharges, and TotalCharges, are visualized through Seaborn’s distplot function, which overlays a histogram of the data with a kernel density estimate. The dataframe is divided by the Churn column, and for each continuous column, both dataframes’ distributions are superimposed via histograms for direct comparison.
churn_df = df[df['Churn'] == True]
churn_df.info()
not_churn_df = df[df['Churn'] == False]
not_churn_df.info()
Between the customers who churned and those who did not in this data set, there is a very substantial difference in Tenure. As expected, the lower the tenure, the higher the relative churn rate.
sns.distplot(not_churn_df['Tenure'])
sns.distplot(churn_df['Tenure'])
plt.grid(True)
plt.legend(["Not Churned", "Churned"])
ax = sns.scatterplot(x = 'Tenure', y = 'Churn', data = df, hue = 'Churn', alpha = 0.0075)
Looking at the MonthlyCharges and TotalCharges values, it becomes clear that the churned and not-churned distributions differ on these columns. The churn rate is much higher when a customer has a higher MonthlyCharges value. This is to be expected, since a higher monthly bill puts more financial pressure on the customer and sets a greater performance expectation for the service provider. It appears that over 35% of customers who have not churned have 20-25 as their MonthlyCharges value.
sns.distplot(not_churn_df['MonthlyCharges'])
sns.distplot(churn_df['MonthlyCharges'])
plt.grid(True)
plt.legend(["Not Churned", "Churned"])
ax = sns.scatterplot(x = 'MonthlyCharges', y = 'Churn', data = df, hue = 'Churn', alpha = 0.05)
With the TotalCharges column, the difference between the churned and not-churned distributions is smaller than for the MonthlyCharges column, but it is still substantial. The lower the TotalCharges, the higher the churn rate. It appears that over 10% of all customers who have churned had less than 400 in TotalCharges.
sns.distplot(not_churn_df['TotalCharges'])
sns.distplot(churn_df['TotalCharges'])
plt.grid(True)
plt.legend(["Not Churned", "Churned"])
ax = sns.scatterplot(x = 'TotalCharges', y = 'Churn', data = df, hue = 'Churn', alpha = 0.05)
A demographic seems to be forming from the combination of all of these insights. The customers most likely to have churned in this dataset are those with indicators of greater computer knowledge: opting for fiber optic speeds, paying with e-checks, utilizing paperless billing, and doing everything online as a result. Other factors that seem to strongly affect churn rate are a high monthly charge, being a senior citizen, perhaps as a result of death, and ease of attrition, such as being on a short-term, month-to-month contract rather than a long-term one.
The two columns that did not have significance will be removed to simplify later calculations.
df.drop(['Gender', 'PhoneService'], axis = 1, inplace = True)
df.info()
For multivariate visualization, a heatmap of the dataframe’s correlation coefficient matrix is created and used to look for correlation between the columns. Given the scenario, the most important correlations to look for are those pertaining to the Churn column, which is the focus of prediction. This visualization is also important for finding interactions that need to be removed, like the extremely high correlation coefficient between the Tenure and TotalCharges columns.
sns.heatmap(
df.corr(),
xticklabels = df.columns,
yticklabels = df.columns
)
Most of this correlation heatmap makes intuitive sense: people who pay for movie streaming are more likely to also pay for TV streaming; a higher monthly rate leads to a higher total amount paid; those that have partners also have more dependents. Additionally, StreamingTV and StreamingMovies are correlated with higher MonthlyCharges. However, a few important things jump out. Tenure and TotalCharges are extremely correlated, possibly enough to skew the entire model; TwoYearContract is correlated with higher Tenure; and, perhaps most importantly, FiberOpticService is the variable most correlated with Churn. FiberOpticService is also correlated with a higher MonthlyCharges value, perhaps partially explaining the link to Churn.
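To make the heatmap's most important row explicit, the correlation of each variable with Churn can also be ranked directly:

# Rank every column's correlation coefficient with the Churn column.
df.corr()['Churn'].drop('Churn').sort_values(ascending = False)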
Now the data can be examined using data mining techniques. To continue exploring the data and to find the most important variables for the predictive analysis, MCA, or Multiple Correspondence Analysis, will be used. Low-order interactions will be sought by plotting and examining the results of the MCA.
import warnings
warnings.filterwarnings('ignore')
mca_df = df[['SeniorCitizen', 'Partner', 'Dependents', 'MultipleLines',
'FiberOpticService', 'DSLService', 'OnlineSecurity', 'OnlineBackup',
'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
'OneYearContract', 'TwoYearContract', 'PaperlessBilling',
'MailedCheckPayment', 'BankTransferPayment', 'CreditCardPayment']]
mca = prince.MCA(
n_components = 18,
n_iter = 3,
copy = True
)
mca = mca.fit(mca_df)
ax = mca.plot_coordinates(
X = mca_df,
ax = None,
figsize=(20, 20),
show_row_points=True,
row_points_size=10,
show_row_labels=False,
show_column_points=True,
column_points_size=30,
show_column_labels=False,
legend_n_cols=1
)
plt.savefig('Charts/MCA With Rows.svg')
mca_eigenvalues = mca.eigenvalues_
mca_eigenvalues
plt.plot(np.arange(18), mca_eigenvalues, 'ro-')
plt.title("Scree Plot")
plt.xlabel("Principal Component")
plt.ylabel("Eigenvalue")
plt.show()
Beyond the eigenvalues, the MCA coordinate plot shows a number of distinct clusters of columns and their discrimination in the dataset. Although the MCA plots every column twice, once for each of its boolean values, several columns are clustered together on both values.
Some, like StreamingMovies and StreamingTV, not only have an almost negligible difference from one another, shown by the lack of angle between their vectors from the center, but also have virtually the same discrimination, represented by their distance from the center; their points actually overlap in this visualization. This, coupled with their relatively high correlation in the correlation heatmap, means that the final model would likely be improved by removing one of these columns. As StreamingMovies appears to be slightly more discriminating, StreamingTV will be removed.
Other clusters, like DeviceProtection and OnlineBackup, are very closely related to each other but have differing discrimination. DeviceProtection is more discriminating in both value cases and would be kept over OnlineBackup if one of these needed to be removed, but their direct correlation with each other is fairly average based on the correlation coefficient heatmap. These columns and the StreamingTV/StreamingMovies columns make sense as possible interactions because, within each cluster, customers who get one service are significantly more likely to get the other.
Finally, two other natural clusters draw the eye: the first is TechSupport_False, OnlineSecurity_False, and Partner_False, and the second is BankTransferPayment_False, CreditCardPayment_False, and OneYearContract_False. These are interesting, but much less of a concern than the overlapping StreamingTV/StreamingMovies columns, as their True value counterparts are considerably spread out, both by discrimination and by similarity with each other.
mca = prince.MCA(
n_components = 3,
n_iter = 3,
copy = True
)
mca = mca.fit(mca_df)
ax = mca.plot_coordinates(
X = mca_df,
ax = None,
figsize=(20, 20),
show_row_points=False,
show_row_labels=False,
show_column_points=True,
column_points_size=30,
show_column_labels=True,
legend_n_cols=3
)
plt.savefig('Charts/MCA.svg')
pca_df = df[['Tenure', 'MonthlyCharges', 'TotalCharges']]
pca_df = preprocessing.MinMaxScaler().fit_transform(pca_df)
pca_df
# Standardize the continuous columns in the base dataframe for the upcoming analyses.
df['Tenure'] = (df['Tenure'] - df['Tenure'].mean()) / df['Tenure'].std()
df['MonthlyCharges'] = (df['MonthlyCharges'] - df['MonthlyCharges'].mean()) / df['MonthlyCharges'].std()
df['TotalCharges'] = (df['TotalCharges'] - df['TotalCharges'].mean()) / df['TotalCharges'].std()
pca = prince.PCA(
n_components=3,
n_iter=3,
copy=True,
check_input=True,
)
pca = pca.fit(pca_df)
ax = pca.plot_row_coordinates(
pca_df,
ax = None,
figsize = (6, 6),
x_component = 0,
y_component = 1
)
pca.explained_inertia_
pca.column_correlations(pca_df)
In the process of looking into principal components and variable interactions, it is worth looking at the continuous variable columns and reconsidering what they may represent. Tenure may be better described as the number of billing periods a customer has had, and if that is the case, then TotalCharges may be nothing more than Tenure * MonthlyCharges.
As seen in the test below, this isn't exactly the case, but the correlations among Tenure, MonthlyCharges, and TotalCharges are very high, with a Pearson correlation coefficient of 0.83 between Tenure and TotalCharges. Given this high correlation, TotalCharges will be removed from the dataframe before the logistic regression is performed.
df[['Tenure', 'MonthlyCharges', 'TotalCharges']].corr()
# Copy to avoid modifying a view of original_df (prevents a SettingWithCopyWarning).
test_df = original_df[['Tenure', 'MonthlyCharges', 'TotalCharges']].copy()
test_df['TenureMonthlyCharges'] = test_df['Tenure'] * test_df['MonthlyCharges']
test_df['Difference'] = test_df['TenureMonthlyCharges'] - test_df['TotalCharges']
test_df.describe()
sns.distplot(test_df['Difference'])
plt.grid(True)
Now the columns identified as interactions or flagged for removal with the MCA, PCA, and correlation tests can be removed.
df.drop(['StreamingTV', 'TotalCharges'], axis = 1, inplace = True)
To continue exploring the data and to find the most important variables for the predictive analysis, FAMD, or Factor Analysis of Mixed Data will be used.
famd = prince.FAMD(
n_components = 20,
n_iter = 3,
copy = True
)
famd = famd.fit(df)
With the factor analysis performed, we can examine the eigenvalues of the dataset and determine how many factors to retain.
eigenvalues = famd.eigenvalues_
eigenvalues
The scree plot below shows that 4 factors are an adequate number to use for modeling.
plt.plot(np.arange(20), eigenvalues, 'ro-')
plt.title("Scree Plot")
plt.xlabel("Principal Component")
plt.ylabel("Eigenvalue")
plt.show()
famd = prince.FAMD(
n_components = 4,
n_iter = 3,
copy = True
)
famd = famd.fit(df)
eigenvalues = famd.eigenvalues_
eigenvalues
famd.explained_inertia_
However, this equates to only 76.4% of the dataset's variance explained. A 5th factor may be necessary for proper accuracy.
sum(famd.explained_inertia_)
famd.column_correlations(df)
plot = famd.plot_partial_row_coordinates(
df,
ax=None,
figsize=(6, 6),
x_component=0,
y_component=1,
alpha = 0.1
)
plot2 = famd.plot_partial_row_coordinates(
df,
ax=None,
figsize=(6, 6),
x_component=1,
y_component=2,
alpha = 0.1
)
plot3 = famd.plot_partial_row_coordinates(
df,
ax=None,
figsize=(6, 6),
x_component=0,
y_component=2,
alpha = 0.1
)
three_dimension_fa = famd.row_contributions(df)
three_dimension_fa
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection = '3d')
x = three_dimension_fa[0]
y = three_dimension_fa[1]
z = three_dimension_fa[2]
ax.scatter(x, y, z, alpha = 0.1)
three_dimension_fa_churn = three_dimension_fa.copy(deep = True)
three_dimension_fa_churn['Churn'] = df['Churn']
fig = plt.figure()
ax = fig.add_subplot(111, projection = '3d')
x = three_dimension_fa_churn[0]
y = three_dimension_fa_churn[1]
z = three_dimension_fa_churn[2]
color = three_dimension_fa_churn['Churn']
ax.scatter(x, y, z, c = color, alpha = 0.1)
pickle.dump(fig, open('3D FAMD Churn.fig.pickle', 'wb'))
fig.savefig("Charts/FAMD Churn.svg")
Two models will be created using logistic regression: one with the basic cleaned dataframe, and another using the Factor Analysis of Mixed Data results.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, classification_report, roc_curve, auc
x_df = df.drop('Churn', axis = 1)
# Random_state is set to allow exact reproducibility.
x_train, x_test, y_train, y_test = train_test_split(x_df, df['Churn'],
test_size = 0.1, random_state = 1)
regression = LogisticRegression()
regression.fit(x_train, y_train)
regression.intercept_
regression.coef_
This list of coefficients shows the detection of the phenomena that prompted this analysis. Here we see which factors are most associated with customer churn, and from there we can infer what is most likely causing customers to churn. As True indicates that a customer churned, the positive coefficients are the ones that push the predicted probability toward churn, and the negative coefficients push it toward not churning.
coefs = pd.concat([pd.DataFrame(df.columns.drop('Churn')),
pd.DataFrame(np.transpose(regression.coef_))], axis = 1)
coefs.columns = ["Column", "Coefficient"]
coefs
The biggest issue is with customers with FiberOpticService; it is over twice as influential on the probability of churn as the second highest, DSLService. Together these show that the Internet services of this telecommunications business are the areas that need the most attention. Beyond that, customers who pay for streaming movies (and streaming TV, given the correlation that required StreamingTV's removal from the model) are at the next highest risk of churn. Finally, customers enrolled in paperless billing, senior citizens, and customers with multiple phone lines are also groups to pay attention to.
coefs.where(coefs['Coefficient'] > 0).sort_values(['Coefficient'], ascending = False).dropna()
coefs.where(coefs['Coefficient'] < 0).sort_values(['Coefficient']).dropna()
These show that the strongest positive predictor variables are having Internet service, particularly Fiber optic service, paying for movie or TV streaming, and being enrolled in paperless billing. Additionally, these also show the strongest negative predictors of churn are having a non-month-to-month contract, particularly a two year contract, having a higher tenure value, paying for an online security package, and having any payment method other than electronic check.
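Because logistic regression coefficients are log-odds, exponentiating them yields odds ratios, which are often easier to communicate; a coefficient of about 0.69, for example, corresponds to roughly doubling the odds of churn, all else being equal.

# Express each log-odds coefficient as an odds ratio for easier interpretation.
coefs['OddsRatio'] = np.exp(coefs['Coefficient'])
coefs.sort_values('OddsRatio', ascending = False)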
In testing the model, we get an accuracy of 81.13%. This is 31.13 percentage points better than a 50/50 guess, and also better than the majority-class baseline computed below.
accuracy = regression.score(x_test, y_test)
accuracy
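For context, the accuracy of a trivial model that always predicts the majority no-churn class on the same test split can be computed directly:

# Majority-class baseline: the proportion of the test split that did not churn.
1 - y_test.mean()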
predictions = regression.predict(x_test)
actual = y_test
confusion = confusion_matrix(actual, predictions)
confusion
precision = precision_score(actual, predictions)
precision
Though the accuracy is high, the precision and recall on true churns could use some improvement and indicate that the model could be adjusted for better results. A larger proportion of churned customers in the training dataset would help considerably, and in this situation such data would be easier to obtain than in many other domains.
print(classification_report(actual, predictions))
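One low-effort way to address this class imbalance, short of gathering more churned customers, is scikit-learn's class_weight option, which reweights the training loss in inverse proportion to class frequency; a hedged sketch:

# Hypothetical variant: reweight the classes rather than resampling the data.
balanced_regression = LogisticRegression(class_weight = 'balanced')
balanced_regression.fit(x_train, y_train)
print(classification_report(y_test, balanced_regression.predict(x_test)))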
probabilities = regression.predict_proba(x_test)
predictions = probabilities[:,1]
false_positive_rate, true_positive_rate, threshold =\
roc_curve(y_test, predictions)
Finally, by calculating the ROC and AUC, we get an AUC of 0.84, which indicates an excellent model relative to chance.
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc
plt.title('Receiver Operating Characteristic Curve')
plt.plot(false_positive_rate, true_positive_rate, 'blue',
label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/ROC AUC.svg')
plt.show()
famd_df = df.copy(deep = True)
# Random_state is set to allow exact reproducibility.
famd_x_train, famd_x_test, famd_y_train, famd_y_test = train_test_split(famd_df.drop(['Churn'], axis = 1), famd_df['Churn'],
test_size = 0.1, random_state = 1)
famd = prince.FAMD(
n_components = 5,
n_iter = 3,
copy = True
)
famd = famd.fit(famd_x_train)
famd.column_correlations(famd_df)
famd_x = famd.transform(famd_x_train)
famd_x
This regression will be fitted with the FAMD transformed version of the training split data instead of the base split.
famd_regression = LogisticRegression()
famd_regression.fit(famd_x, famd_y_train)
famd_regression.coef_
famd_accuracy = famd_regression.score(famd.transform(famd_x_test), famd_y_test)
famd_accuracy
famd_predictions = famd_regression.predict(famd.transform(famd_x_test))
famd_actual = famd_y_test
famd_confusion = confusion_matrix(famd_actual, famd_predictions)
famd_confusion
famd_precision = precision_score(famd_actual, famd_predictions)
famd_precision
These results can now be compared directly with those of the model trained on the base dataset.
print(classification_report(famd_actual, famd_predictions))
famd_probabilities = famd_regression.predict_proba(famd.transform(famd_x_test))
famd_predictions = famd_probabilities[:,1]
famd_false_positive_rate, famd_true_positive_rate, famd_threshold =\
roc_curve(famd_y_test, famd_predictions)
Finally, the ROC and AUC are calculated so that this model can be compared against the base model's AUC of 0.84.
famd_roc_auc = auc(famd_false_positive_rate, famd_true_positive_rate)
famd_roc_auc
plt.title('FAMD Based Receiver Operating Characteristic Curve')
plt.plot(famd_false_positive_rate, famd_true_positive_rate, 'blue',
label = 'FAMD AUC = %0.2f' % famd_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('FAMD True Positive Rate')
plt.xlabel('FAMD False Positive Rate')
plt.savefig('Charts/FAMD ROC AUC.svg')
plt.show()
df.to_csv("Cleaned Data.csv", index = False)
To recap the most important findings: fiber optic internet service is by far the strongest indicator of churn, followed by DSL service, streaming subscriptions, paperless billing, senior citizen status, and multiple phone lines, while long-term contracts, higher tenure, the online security package, and payment methods other than electronic check are the strongest indicators of retention. The Internet services of this telecommunications business are the areas that need the most attention.
Ways to continue this analysis:
Wickham, H., & Grolemund, G. (2017). R for Data Science. Retrieved from https://r4ds.had.co.nz/