This notebook is so large and works with so much data that it was run in multiple sessions, with the kernel reset each time for memory management. As such, the code cells have execution numbers that are not perfectly in order. Though these do not match up perfectly in this version, the code was, and should only be, executed from top to bottom.
Data helps businesses solve problems, make better decisions, and understand consumers, but a great deal of data must be stored and kept available to enable these benefits. Hard drive failure is the most common form of data loss, one of the most impactful problems that businesses can experience today, as simple drive recovery can cost up to $7,500 per drive (Painchaud, 2018). For cloud-based data centers, keeping multitudes of businesses’ data intact for their own operations is crucial. Being able to predict which hard drives are at the highest risk of failure, based on an understanding of combinations of routine diagnostic test results, is an ideal solution: failing drives can be backed up and replaced before the data is lost.
The dataset used is Backblaze’s 4th quarter data from 2019 (Backblaze, 2020). All of the needed data is contained within the .zip file that Backblaze provides to the public as .csv files split by day.
The dataset contains .csv files for each day of its corresponding quarter, from 2019-10-01 to 2019-12-31. As an example, the subsection of the dataset for 2019-10-01 contains 115,259 rows of data. However, as this data contains recorded readings from a live data center, the number of hard drives, and thus rows, changed daily as failed drives were taken out and new drives were installed. The 129 column attributes are Date, Serial Number, Model Number, Capacity, Failure, 62 Self-Monitoring, Analysis and Reporting Technology (SMART) test results, and 62 normalized values of those SMART test results. The Failure attribute is the dependent variable of this study and is a qualitative binary categorical variable. The Date, Serial Number, and Model are nominal qualitative independent variables. Finally, the SMART value columns are continuous quantitative independent variables.
As stated in Backblaze’s Hard Drive Data and Stats page (Backblaze, n.d.), this dataset is free for any use as long as Backblaze is cited as the data source, users accept that they are solely responsible for how the data is used, and the data is not sold to anybody, as it is publicly available.
Python, pandas, and the scikit-learn stack are extensively used for the loading, tidying, manipulation, and analysis of the datasets. PyTorch is used for all neural network related tasks of the analysis and model production. Matplotlib and seaborn are used to create charts and graphics for analysis and presentation of project findings. A needed algorithm, namely Fisher's exact test for contingency tables larger than 2x2, is unavailable in the scikit-learn ecosystem; R's stats package is used for this instead, with rpy2 embedding the R code in the Python process. Prince is used for factor analysis, and imbalanced-learn is used for the implementation of SMOTE.
Like R and unlike SAS, all of these packages are easily available, free, and open-source with Python. These methods have been chosen over R for ease of explanation, as Python code is often understood more readily than R, and because of the potential of integrating this project directly into a program or software for future use. While R is highly specialized for statistics and mathematics, Python is a general-purpose programming language with specialized libraries for the needed tools, and this facilitates project expansion in the future.
Synthetic Minority Over-Sampling Technique (SMOTE) is used specifically to handle the imbalanced classes for training and testing splits. PCA is used for dimensionality reduction. Predictor variables are examined through correlation coefficients and Fisher's exact test, as well as graphed univariate and bivariate distributions. A logistic regression model and a decision tree model are examined along with the results of the PCA to find predictor variables as well. For building a predictive model for future use, the logistic regression model, a random forest ensemble model, and neural networks are compared to determine which can produce the most useful model.
As HDD failure is an extremely rare event, the dependent variable class is extremely imbalanced and failing to control for the imbalance through techniques like boosting or oversampling would lead to ineffective models. As the dependent variable is a Boolean value, this task is a binary classification task. Logistic regression is an ideal predictive model for binary classification tasks that gives a probability for classification while also having a simplistic interpretation of coefficients that can be used for feature selection. Decision trees are also simple to understand and work well for classification tasks. Given the complexity of the various fields in the dataset, a more complicated model may work better for predictive power. Random forests and neural networks work very well for classification tasks under these circumstances.
The key project outcomes are a deep understanding of the risk of hard drive failure based on the results of SMART test values regardless of manufacturer, and predictive models that will be able to flag hard drives that are at high risk of failing. The understanding of the risk of failure based on test values will empower better business decisions by optimizing the choice of storage used based on projected lifetime. The predictive models will allow the business to proactively back up data onto new storage devices before failure, while also allowing hard drives to continue working closer to their end of life, minimizing the waste of constantly replacing hard drives before replacement is needed. The combination of these two products will also enable the future creation of a more automated system that protects data from hard drive failure.
The dataset provided by Backblaze is made up of 92 .csv files, 1 for each day in the 2019 4th quarter, totaling 3.13GB of text data. As hard drive failure is an extremely rare event, all of these days will need to be considered together in order to have enough failures to draw conclusions. The project begins by combining all parts of the dataset from their .csv files into a single file.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import glob
import os
import csv
import scipy.stats as scs
import gc
import pickle
# Jupyter magic commands for displaying plot objects in the notebook and
# setting float display limits.
%matplotlib inline
%precision %.10g
sns.set_style("dark")
if not os.path.isfile('q4_combined.csv'):
    # Create a list of dataset files in the current working directory.
    files = glob.glob(os.path.join(os.getcwd(), "2019-*.csv"))
    # Combine the files into a single file, writing the column header from
    # only the first .csv file.
    header_written = False
    with open('q4_combined.csv', 'w') as combined:
        for file in files:
            with open(file, 'r') as part:
                if not header_written:
                    for row in part:
                        combined.write(row)
                    header_written = True
                else:
                    # Skip the header row of every subsequent file.
                    next(part)
                    for row in part:
                        combined.write(row)
with open('q4_combined.csv') as file:
    for (count, _) in enumerate(file):
        pass
# count ends at the zero-based index of the last line, which equals the
# number of data rows once the header line is discounted.
row_count = count
print("Rows: " + str(row_count))
df = pd.read_csv('q4_combined.csv')
df.info()
Out of 10,991,209 hard drive days, there were only 678 failures, which gives a failure rate of 0.006169%.
df['failure'].value_counts()
nonfailed, failed = df['failure'].value_counts()
failure_rate = failed / nonfailed
print("Failure Rate: " + str("{:.6f}".format(failure_rate * 100)) + "%")
Weiss (2013) defined the imbalance ratio as the ratio between majority and minority classes with a modestly imbalanced dataset having an imbalance ratio of 10:1, and extremely imbalanced datasets as having an imbalance ratio of 1000:1 or greater (pg. 15). This dataset has an imbalance ratio of approximately 16,210:1 and as such will require very careful cultivation in order for any predictive model to successfully learn from. The rarity of the positive failure cases is also the reason that the entire 4th quarter dataset is required.
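The imbalance ratio itself follows directly from the counts reported above; the sketch below uses the dataset's stated totals of 10,991,209 drive days and 678 failures:

```python
# Compute the majority:minority imbalance ratio (Weiss, 2013) from the
# stated totals of this dataset.
total_days = 10_991_209
failed = 678
nonfailed = total_days - failed
imbalance_ratio = nonfailed / failed
print("Imbalance ratio: " + "{:,.0f}".format(imbalance_ratio) + ":1")
```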
Unfortunately, this combined file requires too much memory to load all at once given current hardware constraints: it needs 13.5GB for the data alone, not including the memory needed for the OS and other software, nor working memory for calculations.
# Return the summed memory usage of each column in bytes.
memory_usage = sum(df.memory_usage(deep=True))
memory_usage
print(str(memory_usage / 1000) + "KB")
print(str("{:.2f}".format(memory_usage / 1000000)) + "MB")
print(str("{:.2f}".format(memory_usage / 1000000000)) + "GB")
As this dataset contains both raw and normalized values for all of the SMART values, a simple way to deal with the memory issues is to divide the dataset into a raw form and a normalized form.
list(df.columns.values)
raw_cols = []
for col in df.columns.values:
    if "normalized" not in col:
        raw_cols.append(col)
print(raw_cols)
norm_cols = []
for col in df.columns.values:
    if "raw" not in col:
        norm_cols.append(col)
print(norm_cols)
if not os.path.isfile('q4_raw.csv'):
df.to_csv('q4_raw.csv', columns = raw_cols, index=False)
if not os.path.isfile('q4_normalized.csv'):
df.to_csv('q4_normalized.csv', columns = norm_cols, index=False)
try:
    del [df, nonfailed, failed, failure_rate, memory_usage, raw_cols, norm_cols]
    print("Memory cleared successfully.")
except NameError:
    pass
The considerably smaller raw value subset of data is the main dataset of this project. As with nearly all real-world datasets, this one needs considerable cleaning and tidying before it can be used for analysis.
df = pd.read_csv('q4_raw.csv')
df.info()
null_values = df.isna().sum().sum()
null_values
len(df.columns)
n_rows = len(df)
n_rows
n_values = n_rows * len(df.columns)
n_values
null_values / n_values
# Calculate the number of values in the full dataset (raw and normalized columns).
n_rows * 129
df.head(30)
# Return the memory usage of each column in bytes.
print(df.memory_usage(deep=True))
# Total number of failures
df.failure.sum()
# Average number of failures per day
df.failure.sum() / len(df.date.unique())
All SMART test columns have null values in some rows. The dataset notes state that this stems from differing manufacturers' standards, despite the standardized nature of SMART tests.
for col in df.columns.values:
    print(col + ": " + str(df[col].isnull().values.any()))
Deriving the manufacturer from the model column will allow the dataset to be easily divided by manufacturer.
df.model.unique()
The "DELLBOSS VD" model value seems to be the only value potentially out of place.
df.loc[(df['model'] == "DELLBOSS VD") &
(df['date'] == "2019-10-01")]
None of the SMART values exist for this hard drive model, but 60 of the drives have this model value. Additionally, no failures for this model exist in the dataset. Any row with this model value should be removed from the training data before any predictive analysis. Some searching online suggests that it may be a RAID controller. (https://www.dell.com/support/manuals/au/en/aubsd1/boss-s-1/boss_s1_ug_publication/overview?guid=guid-b20ef25b-b7e3-40f2-b7cd-e497358cd10a&lang=en-us)
df.loc[(df['model'] == "DELLBOSS VD") &
(df['failure'] == 1)]
Additionally, the "Seagate SSD" model seems to be missing information. Like the "DELLBOSS VD" rows, these also contain no failures and will need to be removed before predictive analysis is performed.
df.loc[(df['model'] == "Seagate SSD") &
(df['date'] == "2019-10-01")]
df.loc[(df['model'] == "Seagate SSD") &
(df['failure'] == 1)]
The rows not appropriate for analysis are deleted.
df.drop(df[(df['model'] == "DELLBOSS VD") | \
(df['model'] == "Seagate SSD")].index, axis = 0, inplace = True)
n_rows = len(df)
n_rows
# model: ["Manufacturer", "New Model"]
manufacturer_dict = {
'ST4000DM000': ["Seagate", "ST4000DM000"],
'ST12000NM0007': ["Seagate", "ST12000NM0007"],
'HGST HMS5C4040ALE640': ["HGST", "HMS5C4040ALE640"],
'ST8000NM0055': ["Seagate", "ST8000NM0055"],
'ST8000DM002': ["Seagate", "ST8000DM002"],
'HGST HMS5C4040BLE640': ["HGST", "HMS5C4040BLE640"],
'HGST HUH721212ALN604': ["HGST", "HUH721212ALN604"],
'TOSHIBA MG07ACA14TA': ["Toshiba", "MG07ACA14TA"],
'HGST HUH721212ALE600': ["HGST", "HUH721212ALE600"],
'TOSHIBA MQ01ABF050': ["Toshiba", "MQ01ABF050"],
'ST500LM030': ["Seagate", "ST500LM030"],
'ST6000DX000': ["Seagate", "ST6000DX000"],
'ST10000NM0086': ["Seagate", "ST10000NM0086"],
'DELLBOSS VD': ["Dell", "DELLBOSS VD"],
'TOSHIBA MQ01ABF050M': ["Toshiba", "MQ01ABF050M"],
'WDC WD5000LPVX': ["Western Digital", "WD5000LPVX"],
'ST500LM012 HN': ["Seagate", "ST500LM012 HN"],
'HGST HUH728080ALE600': ["HGST", "HUH728080ALE600"],
'TOSHIBA MD04ABA400V': ["Toshiba", "MD04ABA400V"],
'TOSHIBA HDWF180': ["Toshiba", "HDWF180"],
'ST8000DM005': ["Seagate", "ST8000DM005"],
'Seagate SSD': ["Seagate", "Seagate SSD"],
    'HGST HUH721010ALE600': ["HGST", "HUH721010ALE600"],
'ST4000DM005': ["Seagate", "ST4000DM005"],
'WDC WD5000LPCX': ["Western Digital", "WD5000LPCX"],
'HGST HDS5C4040ALE630': ["HGST", "HDS5C4040ALE630"],
'ST500LM021': ["Seagate", "ST500LM021"],
'Hitachi HDS5C4040ALE630': ["HGST", "HDS5C4040ALE630"],
'HGST HUS726040ALE610': ["HGST", "HUS726040ALE610"],
'Seagate BarraCuda SSD ZA500CM10002': ["Seagate", "ZA500CM10002"],
'ST12000NM0117': ["Seagate", "ST12000NM0117"],
'Seagate BarraCuda SSD ZA2000CM10002': ["Seagate", "ZA2000CM10002"],
'Seagate BarraCuda SSD ZA250CM10002': ["Seagate", "ZA250CM10002"],
'TOSHIBA HDWE160': ["Toshiba", "HDWE160"],
'WDC WD5000BPKT': ["Western Digital", "WD5000BPKT"],
'ST6000DM001': ["Seagate", "ST6000DM001"],
'WDC WD60EFRX': ["Western Digital", "WD60EFRX"],
'ST8000DM004': ["Seagate", "ST8000DM004"],
'HGST HMS5C4040BLE641': ["HGST", "HMS5C4040BLE641"],
    'ST1000LM024 HN': ["Seagate", "ST1000LM024 HN"],
'ST6000DM004': ["Seagate", "ST6000DM004"],
'ST12000NM0008': ["Seagate", "ST12000NM0008"],
'ST16000NM001G': ["Seagate", "ST16000NM001G"]
}
# Change the model column into Manufacturer and Model columns.
df['model_temp'] = df['model']
df['manufacturer'] = ''
df['manufacturer'] = df['model_temp'].map(lambda x: manufacturer_dict[x][0])
df['model'] = df['model_temp'].map(lambda x: manufacturer_dict[x][1])
df.drop(['model_temp'], axis=1, inplace=True)
df.head()
Given the size of the dataset, a few minor changes to the columns may free up a considerable amount of memory. The date and capacity_bytes columns are two easy places to improve.
# date
df['date'].value_counts()
df['date'][0:5]
before_mem = df['date'].memory_usage()
before_mem
df['date'] = df['date'].str[-5:]
df.head()
df['date'] = df['date'].astype('category')
df['date'][0:5]
after_mem = df['date'].memory_usage()
after_mem
memory_saved = before_mem - after_mem
print("Memory saved on the date column: " + str(np.around((memory_saved / 1024 ** 2), 2)) + "MB")
# model
before_mem = df['model'].memory_usage()
df['model'] = df['model'].astype('category')
after_mem = df['model'].memory_usage()
memory_saved = before_mem - after_mem
print("Memory saved on the model column: " + str(np.around((memory_saved / 1024 ** 2), 2)) + "MB")
# failure
before_mem = df['failure'].memory_usage(deep = True)
df['failure'] = df['failure'].astype('bool')
after_mem = df['failure'].memory_usage(deep = True)
memory_saved = before_mem - after_mem
print("Memory saved on the failure column: " + str(np.around((memory_saved / 1024 ** 2), 2)) + "MB")
# capacity_bytes
before_memory = df['capacity_bytes'].memory_usage(deep = True)
before_memory
Here we can see that 1108 drive days have an error value (-1) rather than their actual capacity. These rows may need to be removed, but the error may also be an excellent signal of a failing drive.
df.loc[df["capacity_bytes"] == -1]["manufacturer"].value_counts()
sns.countplot(x = df.loc[df["capacity_bytes"] == -1]["capacity_bytes"], \
hue = df["failure"])
Unfortunately, none of the drives experiencing this error went on to fail, so the error cannot serve as a failure signal and may introduce problems in the final model. As it affects only 0.01% of the dataset, removing the affected rows seems best.
# Calculate the percentage of the dataset that is affected by this error.
error_rows = (df['capacity_bytes'] == -1).sum()
str(np.around(((error_rows / n_rows) * 100), 2)) + "%"
df.drop(df[(df['capacity_bytes'] == -1)].index, axis = 0, inplace = True)
n_rows = len(df)
n_rows
df['capacity_bytes'].value_counts()
The capacity_bytes column is converted from bytes to terabytes to condense the information on disk.
df['capacity_TB'] = np.around((df['capacity_bytes']/(1000*1000*1000*1000)), \
decimals = 2)
df.head()
df['capacity_TB'].value_counts()
df['capacity_TB'] = df['capacity_TB'].astype('category')
after_mem = df['capacity_TB'].memory_usage()
memory_saved = before_memory - after_mem
print("Memory saved on the capacity column: " + \
str(np.around((memory_saved / 1024 ** 2), 2)) + "MB")
df.drop(['capacity_bytes'], axis=1, inplace=True)
df.head()
fail_df = pd.crosstab(df["manufacturer"], df["failure"])
fail_df
fail_df['Rate'] = fail_df[1] / (fail_df[0] + fail_df[1])
fail_df
corr_df = df.corr()
corr_df['failure']
With these things finished, the univariate distributions can be examined to gain a better sense of the data.
The first column, date, shows some sort of testing or operational failure on November 5th.
plt.figure(figsize = (20, 10))
plt.title('Number of Drives in Operation per Day (Q4 2019)')
g = sns.countplot(df['date'], data = df)
g.set_xticklabels(g.get_xticklabels(), rotation = 90)
g.figure.savefig("Charts/Date Distribution.png")
g.figure.savefig("Charts/Date Distribution.svg")
Drive capacities are mostly 4, 8, and 12 TB, likely coinciding with large investments in new drives for the datacenter and possibly alongside the price lowering of specific models.
plt.figure(figsize = (5, 5))
plt.title('Capacity of Drives')
g = sns.countplot(df['capacity_TB'], data = df)
g.set_xticklabels(g.get_xticklabels(), rotation = 90)
for p in g.patches:
    percentage = "{0:.2f}".format((p.get_height() / n_rows) * 100) + "%"
    g.annotate(percentage, (p.get_x() + p.get_width() / 2., p.get_height()),
               ha = 'center', va = 'center', xytext = (0, 7),
               textcoords = 'offset points')
g.figure.savefig("Charts/Capacity Distribution.svg")
g.figure.savefig("Charts/Capacity Distribution.png")
The manufacturer of the most drives in this dataset is Seagate at 72.59%. HGST is the second highest at 24.24%. Western Digital is the least represented manufacturer in the dataset with only 0.23%, but as HGST was acquired by Western Digital in 2012 (Sanders, 2018), the drives in this dataset will likely be quite similar between the two manufacturers given the seven-year timespan between then and the time of dataset recording and creation. Finally, Toshiba is the other manufacturer, with 2.94% of the dataset. This amount is quite low and may make it difficult to accurately predict their drives in comparison.
plt.figure(figsize = (5, 5))
plt.title('Manufacturers of Drives')
g = sns.countplot(df['manufacturer'], data = df)
g.set_xticklabels(g.get_xticklabels(), rotation = 90)
for p in g.patches:
    percentage = "{0:.2f}".format((p.get_height() / n_rows) * 100) + "%"
    g.annotate(percentage, (p.get_x() + p.get_width() / 2., p.get_height()),
               ha = 'center', va = 'center', xytext = (0, 7),
               textcoords = 'offset points')
g.figure.savefig("Charts/Manufacturer Distribution.svg")
g.figure.savefig("Charts/Manufacturer Distribution.png")
The SMART values vary greatly across the many different types of drives in this dataset. Before the columns can be graphed appropriately, the NaN/null values need to be examined. The missing data is most likely related to the hard drive's manufacturer or model.
sns.distplot(df['smart_1_raw'])
plt.grid(True)
plt.show()
# Pandas styling function
def highlight_nans(val):
    color = 'red' if val == True or val > 0 else 'black'
    return 'color: %s' % color
Every single SMART figure column has null values.
pd.set_option('display.max_rows', 70)
pd.set_option('display.max_columns', 75)
df.isna().any()
manu_nan_df = pd.DataFrame()
for manu in df['manufacturer'].unique():
    manu_nan_df[manu] = df.loc[df['manufacturer'] == manu].isna().sum()
manu_nan_df.style.applymap(highlight_nans)
model_nan_df = pd.DataFrame()
for model in df['model'].unique():
    model_nan_df[model] = df.loc[df['model'] == model].isna().sum()
model_nan_df.style.applymap(highlight_nans)
model_nan_percent_df = pd.DataFrame()
for model in df['model'].unique():
    model_nan_percent_df[model] = (df.loc[df['model'] == model].isna().sum()) \
        / len(df.loc[df['model'] == model])
model_nan_percent_df
plt.figure(figsize = (20, 20))
plt.title('Model NaN Value Proportion by Hard Drive Model')
g = sns.heatmap(model_nan_percent_df, linewidths=0.2)
g.figure.savefig("Charts/Model NaN Heatmap.svg")
g.figure.savefig("Charts/Model NaN Heatmap.png")
description_df = df.describe()
description_df
The count row is equivalent to the number of non-null values. If a column has a count of 0, every single value in it is NaN or null, and should be deleted.
description_df.iloc[0]
smart_13_raw, smart_15_raw, smart_179_raw, smart_181_raw, smart_182_raw, smart_201_raw, smart_250_raw, smart_251_raw, smart_252_raw, and smart_255_raw are all empty in this dataset, as all rows have NaN values in these columns.
count_df = pd.DataFrame()
count_df['count'] = description_df.iloc[0]
count_df
# Pandas styling function
def highlight_count_nans1(val):
    if val >= 66.6:
        color = 'green'
    elif val >= 33.3 and val < 66.6:
        color = 'yellow'
    else:
        color = 'red'
    return 'color: %s' % color
# Pandas styling function
def highlight_count_nans2(val):
    green = int((val * 255) / 100)
    red = int(255 - green)
    rgb = (red, green, 0)
    # Convert to hexadecimal for pandas styling
    color = '#%02x%02x%02x' % rgb
    return 'color: %s' % color
count_df['perc_not_nan'] = (count_df['count'] / n_rows) * 100
count_df
count_df.style.applymap(highlight_count_nans1, subset = ['perc_not_nan'])
count_df['bar'] = count_df['perc_not_nan']
count_df.style.\
applymap(highlight_count_nans2, subset = ['perc_not_nan']).\
bar(subset=['bar'], color='#d65f5f')
empty_columns = []
columns_to_examine = []
for row in count_df.iterrows():
    if row[1][0] == 0.0:
        empty_columns.append(row[0])
    elif row[1][0] < (0.8 * n_rows):
        columns_to_examine.append(row[0])
empty_columns
columns_to_examine
before_mem = df.memory_usage(deep=True).sum()
df.drop(empty_columns, axis=1, inplace=True)
after_mem = df.memory_usage(deep=True).sum()
memory_saved = before_mem - after_mem
print("Memory saved on empty column removal: " + \
str(np.around((memory_saved / 1024 ** 2), 2)) + "MB")
# Free up memory for the next computation.
try:
    del [empty_columns, manufacturer_dict, before_mem, after_mem, memory_saved,
         fail_df, corr_df]
    print("Memory successfully cleared.")
except NameError:
    pass
# Save the current form of the dataframe for restoration after the following calculations are performed.
if not os.path.isfile('pre_viz_df.csv'):
    df.to_csv("pre_viz_df.csv", index = False)
df = pd.read_csv('pre_viz_df.csv')
viz_df = df.drop(['date', 'serial_number', 'failure', 'model', 'manufacturer', 'capacity_TB'], axis = 1)
viz_df.columns
# Free up memory for the next computation.
try:
    del df
    print("Memory successfully cleared.")
except NameError:
    pass
# Melt the df in chunks as df.melt() will take far too much memory.
pivot_list = list()
chunk_size = 250000
for i in range(0, len(viz_df), chunk_size):
    row_pivot = viz_df.iloc[i: i + chunk_size].melt()
    pivot_list.append(row_pivot)
melted = pd.concat(pivot_list)
del pivot_list
melted[0:30]
# Free up memory for the next computation.
try:
    del viz_df
    print("Memory successfully cleared.")
except NameError:
    pass
gc.collect()
g = sns.FacetGrid(
melted,
col = 'variable',
hue = 'value',
sharey = 'row',
sharex = 'col',
col_wrap = 7,
legend_out = True,
)
g = g.map(sns.distplot).add_legend()
plt.subplots_adjust(top = 0.9)
g.fig.suptitle('Univariate Continuous Variable Distributions')
g.savefig("Charts/Univariate Distributions.svg")
g
Unfortunately, this operation takes too much memory to do in this manner. Each column will have to be graphed separately and then the graphs combined into a single graphic for the same effect.
# Reset to the dataframes and memory allocations from before the graphing attempts.
try:
    del melted
except NameError:
    pass
df = pd.read_csv('pre_viz_df.csv')
df.head()
sns.distplot(df['smart_1_raw'])
sns.distplot(df['smart_2_raw'])
sns.distplot(df['smart_3_raw'], kde = False)
sns.distplot(df['smart_4_raw'])
sns.distplot(df['smart_5_raw'], kde = False)
sns.distplot(df['smart_7_raw'])
sns.distplot(df['smart_8_raw'])
sns.distplot(df['smart_9_raw'])
sns.distplot(df['smart_10_raw'], kde = False)
sns.distplot(df['smart_11_raw'])
sns.distplot(df['smart_12_raw'])
sns.distplot(df['smart_16_raw'])
sns.distplot(df['smart_17_raw'])
sns.distplot(df['smart_18_raw'], kde = False)
sns.distplot(df['smart_22_raw'], kde = False)
sns.distplot(df['smart_23_raw'])
sns.distplot(df['smart_24_raw'])
sns.distplot(df['smart_168_raw'])
sns.distplot(df['smart_170_raw'])
sns.distplot(df['smart_173_raw'], kde = False)
sns.distplot(df['smart_174_raw'])
sns.distplot(df['smart_177_raw'])
sns.distplot(df['smart_183_raw'], kde_kws={'bw':0.1})
sns.distplot(df['smart_184_raw'], kde = False)
sns.distplot(df['smart_187_raw'], kde = False)
sns.distplot(df['smart_188_raw'], kde = False)
sns.distplot(df['smart_189_raw'], kde = False)
sns.distplot(df['smart_190_raw'])
sns.distplot(df['smart_191_raw'])
sns.distplot(df['smart_192_raw'])
sns.distplot(df['smart_193_raw'])
sns.distplot(df['smart_194_raw'], kde = False)
sns.distplot(df['smart_195_raw'], kde = False)
sns.distplot(df['smart_196_raw'], kde = False)
sns.distplot(df['smart_197_raw'], kde = False)
sns.distplot(df['smart_198_raw'], kde = False)
sns.distplot(df['smart_199_raw'], kde = False)
sns.distplot(df['smart_200_raw'], kde = False)
sns.distplot(df['smart_218_raw'], kde = False)
sns.distplot(df['smart_220_raw'])
sns.distplot(df['smart_222_raw'])
sns.distplot(df['smart_223_raw'], kde = False)
sns.distplot(df['smart_224_raw'], kde = False)
sns.distplot(df['smart_225_raw'], kde = False)
sns.distplot(df['smart_226_raw'])
sns.distplot(df['smart_231_raw'], kde = False)
sns.distplot(df['smart_232_raw'])
sns.distplot(df['smart_233_raw'])
sns.distplot(df['smart_235_raw'],kde = False)
sns.distplot(df['smart_240_raw'])
sns.distplot(df['smart_241_raw'], kde = False)
sns.distplot(df['smart_242_raw'], kde = False)
sns.distplot(df['smart_254_raw'], kde = False)
fig, axes = plt.subplots(7, 8, figsize = (50, 40))
row = 0
col = 0
for df_col in ['smart_1_raw', 'smart_2_raw', 'smart_3_raw', 'smart_4_raw',
               'smart_5_raw', 'smart_7_raw', 'smart_8_raw', 'smart_9_raw',
               'smart_10_raw', 'smart_11_raw', 'smart_12_raw', 'smart_16_raw',
               'smart_17_raw', 'smart_18_raw', 'smart_22_raw', 'smart_23_raw',
               'smart_24_raw', 'smart_168_raw', 'smart_170_raw', 'smart_173_raw',
               'smart_174_raw', 'smart_177_raw', 'smart_183_raw', 'smart_184_raw',
               'smart_187_raw', 'smart_188_raw', 'smart_189_raw', 'smart_190_raw',
               'smart_191_raw', 'smart_192_raw', 'smart_193_raw', 'smart_194_raw',
               'smart_195_raw', 'smart_196_raw', 'smart_197_raw', 'smart_198_raw',
               'smart_199_raw', 'smart_200_raw', 'smart_218_raw', 'smart_220_raw',
               'smart_222_raw', 'smart_223_raw', 'smart_224_raw', 'smart_225_raw',
               'smart_226_raw', 'smart_231_raw', 'smart_232_raw', 'smart_233_raw',
               'smart_235_raw', 'smart_240_raw', 'smart_241_raw', 'smart_242_raw',
               'smart_254_raw']:
    if col == 8:
        row += 1
        col = 0
    sns.distplot(df[df_col], ax = axes[row, col],
                 kde = False, norm_hist = False)
    col += 1
axes[6, 5].set_axis_off()
axes[6, 6].set_axis_off()
axes[6, 7].set_axis_off()
plt.subplots_adjust(top = 0.90)
fig.suptitle("Distribution of Raw SMART Values", fontsize = 96, y = 0.95)
fig.savefig("Charts/SMART Distributions.svg")
fig.savefig("Charts/SMART Distributions.png")
fig, axes = plt.subplots(7, 8, figsize = (50, 40))
row = 0
col = 0
for df_col in ['smart_1_raw', 'smart_2_raw', 'smart_3_raw', 'smart_4_raw',
               'smart_5_raw', 'smart_7_raw', 'smart_8_raw', 'smart_9_raw',
               'smart_10_raw', 'smart_11_raw', 'smart_12_raw', 'smart_16_raw',
               'smart_17_raw', 'smart_18_raw', 'smart_22_raw', 'smart_23_raw',
               'smart_24_raw', 'smart_168_raw', 'smart_170_raw', 'smart_173_raw',
               'smart_174_raw', 'smart_177_raw', 'smart_183_raw', 'smart_184_raw',
               'smart_187_raw', 'smart_188_raw', 'smart_189_raw', 'smart_190_raw',
               'smart_191_raw', 'smart_192_raw', 'smart_193_raw', 'smart_194_raw',
               'smart_195_raw', 'smart_196_raw', 'smart_197_raw', 'smart_198_raw',
               'smart_199_raw', 'smart_200_raw', 'smart_218_raw', 'smart_220_raw',
               'smart_222_raw', 'smart_223_raw', 'smart_224_raw', 'smart_225_raw',
               'smart_226_raw', 'smart_231_raw', 'smart_232_raw', 'smart_233_raw',
               'smart_235_raw', 'smart_240_raw', 'smart_241_raw', 'smart_242_raw',
               'smart_254_raw']:
    if col == 8:
        row += 1
        col = 0
    try:
        sns.distplot(df[df_col], ax = axes[row][col], norm_hist = True)
    except:
        sns.distplot(df[df_col], kde_kws = {'bw': 0.1}, ax = axes[row][col],
                     norm_hist = True)
    col += 1
axes[6, 5].set_axis_off()
axes[6, 6].set_axis_off()
axes[6, 7].set_axis_off()
plt.subplots_adjust(top = 0.90)
fig.suptitle("Distribution and KDE of Raw SMART Values", fontsize = 96, y = 0.95)
fig.savefig("Charts/SMART Distributions KDE.svg")
fig.savefig("Charts/SMART Distributions KDE.png")
# Free up memory for the next section.
try:
    del [highlight_nans, manu_nan_df, model_nan_df, model_nan_percent_df,
         description_df, count_df, highlight_count_nans1,
         highlight_count_nans2, empty_columns]
    print("Memory successfully cleared.")
except NameError:
    pass
gc.collect()
With some dataset tidying complete, the final major adjustment needed before analysis can be performed is dealing with the NaN values. The rows or columns containing them can be removed, or the missing values can be filled in through interpolation or estimation.
columns_to_examine = ['smart_13_raw', 'smart_15_raw', 'smart_179_raw',
'smart_181_raw', 'smart_182_raw', 'smart_201_raw',
'smart_250_raw', 'smart_251_raw', 'smart_252_raw',
'smart_255_raw']
columns_to_examine
#### Memory Management and Reloading Checkpoint
df = pd.read_csv('pre_viz_df.csv')
n_rows = len(df)
df['date'] = df['date'].astype('category')
df['model'] = df['model'].astype('category')
df['failure'] = df['failure'].astype('bool')
df['manufacturer'] = df['manufacturer'].astype('category')
df.isnull().sum().sort_values()
The first five mostly complete columns all have two NaNs, the result of two rows that have no raw SMART values at all. Both drives failed, making them quite important for predicting future failure; however, the lack of data makes the rows useless in their current form.
The most likely scenario is that both drives failed just before diagnostics were collected. As such, these two rows will be deleted, and each drive's row from the previous day will be updated to mark the failure on that day.
df.loc[df['smart_1_raw'].isnull() & df['smart_192_raw'].isnull() & \
df['smart_9_raw'].isnull() & df['smart_12_raw'].isnull() & \
df['smart_194_raw'].isnull()]
df.iloc[4632946]
df.iloc[4797700]
df.loc[(df['serial_number'] == 'ZJV00DR4') & (df['date'] == '11-09')]
df.at[4514189, 'failure'] = 1
df.iloc[4514189]
df.loc[(df['serial_number'] == 'ZHZ3M097') & (df['date'] == '11-10')]
df.at[4678156, 'failure'] = 1
df.iloc[4678156]
n_rows
df.drop(df.index[[4797700, 4632946]], inplace = True)
df.iloc[[4797700, 4632946]]
The next group of columns all have 8792 rows with NaNs, ignoring the 2 rows just removed. Notably, all of these columns share the same problematic rows.
df_8794 = df.loc[df['smart_3_raw'].isnull() & df['smart_4_raw'].isnull() & \
df['smart_5_raw'].isnull() & df['smart_7_raw'].isnull() & \
df['smart_10_raw'].isnull() & df['smart_197_raw'].isnull() & \
df['smart_198_raw'].isnull() & df['smart_199_raw'].isnull()]
This subset of drives is all manufactured by Seagate and consists of 3 size variations of the same model line. There is no updated model from this line in the dataset to interpolate values from.
df_8794['manufacturer'].value_counts()
df_8794['model'].value_counts()
df_8794['capacity_TB'].value_counts()
df_8794['serial_number'].value_counts()
df_8794['serial_number'].value_counts().mean()
df_8794['failure'].value_counts()
[item for i, item in enumerate(df['model'].unique()) if "ZA" in item]
Interpolating mean values from the same manufacturer, Seagate, and the models' respective capacity_TB categories would be a good way to estimate the missing values if enough data exists.
Additionally, creating a boolean column to flag interpolated data as missing may help the predictive models account for it.
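A minimal sketch of that flag idea, on a toy column (the `_was_missing` suffix is this example's naming convention, not a column from the dataset):

```python
import pandas as pd
import numpy as np

df_example = pd.DataFrame({'smart_3_raw': [1150.0, np.nan, 1132.0, np.nan]})

# Flag the rows *before* interpolation, so the models can still see
# that the reading was originally absent.
df_example['smart_3_was_missing'] = df_example['smart_3_raw'].isnull()

# Interpolate afterwards; the flag column is unaffected.
df_example['smart_3_raw'] = df_example['smart_3_raw'].fillna(
    df_example['smart_3_raw'].median())
```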
For the smart_3_raw data, the median for the manufacturer and drive capacity will be used for the second model. The median for the manufacturer without regard for drive capacity will be used for the first and third models of drive, as there are no appropriate rows with values to interpolate from.
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_3_raw'].notnull()) & \
(df['capacity_TB'] == 0.25)]['smart_3_raw'].mean()
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_3_raw'].notnull()) & \
(df['capacity_TB'] == 0.25)]['smart_3_raw']
smart_3_median_specialized = df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_3_raw'].notnull()) & \
(df['capacity_TB'] == 0.50)]['smart_3_raw'].median()
smart_3_median_specialized
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_3_raw'].notnull()) & \
(df['capacity_TB'] == 0.50)]['smart_3_raw'])
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_3_raw'].notnull()) & \
(df['capacity_TB'] == 0.50)]['smart_3_raw']
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_3_raw'].notnull()) & \
(df['capacity_TB'] == 2.00)]['smart_3_raw'].mean()
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_3_raw'].notnull()) & \
(df['capacity_TB'] == 2.00)]['smart_3_raw']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_3_raw'].notnull())]['smart_3_raw'], kde = False)
smart_3_median = df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_3_raw'].notnull())]['smart_3_raw'].median()
smart_3_median
# Use the median to fill the capacity category that can be calculated.
df.loc[(df['smart_3_raw'].isnull()) & \
(df['capacity_TB'] == 0.50), 'smart_3_raw'] = smart_3_median_specialized
# Use the median to fill the capacity categories that cannot be calculated.
df.loc[df['smart_3_raw'].isnull(), 'smart_3_raw'] = smart_3_median
df['smart_3_raw'].isnull().sum()
For the smart_4_raw data, the mean for the manufacturer and drive capacity will be used for the second model. The mean for the manufacturer without regard for drive capacity will be used for the first and third models of drive, as there are no appropriate rows with values to interpolate from.
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_4_raw'].notnull()) & (df['capacity_TB'] == 0.25)]['smart_4_raw']
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_4_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_4_raw']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_4_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_4_raw'])
smart_4_mean_specialized = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_4_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_4_raw'].mean()
smart_4_mean_specialized
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_4_raw'].notnull()) & (df['capacity_TB'] == 2.00)]['smart_4_raw']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_4_raw'].notnull())]['smart_4_raw'])
smart_4_mean = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_4_raw'].notnull())]['smart_4_raw'].mean()
smart_4_mean
# Use the mean to fill the capacity category that can be calculated.
df.loc[(df['smart_4_raw'].isnull()) & (df['capacity_TB'] == 0.50), 'smart_4_raw'] = smart_4_mean_specialized
# Use the mean to fill the capacity categories that cannot be calculated.
df.loc[df['smart_4_raw'].isnull(), 'smart_4_raw'] = smart_4_mean
df['smart_4_raw'].isnull().sum()
For the smart_5_raw data, the median for the manufacturer and drive capacity will be used for the second model. The median for the manufacturer without regard for drive capacity will be used for the first and third models of drive, as there are no appropriate rows with values to interpolate from.
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_5_raw'].notnull()) & (df['capacity_TB'] == 0.25)]['smart_5_raw']
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_5_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_5_raw']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_5_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_5_raw'], kde = False)
smart_5_median_specialized = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_5_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_5_raw'].median()
smart_5_median_specialized
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_5_raw'].notnull()) & (df['capacity_TB'] == 2.00)]['smart_5_raw']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_5_raw'].notnull())]['smart_5_raw'], kde = False)
smart_5_median = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_5_raw'].notnull())]['smart_5_raw'].median()
smart_5_median
# Use the median to fill the capacity category that can be calculated.
df.loc[(df['smart_5_raw'].isnull()) & (df['capacity_TB'] == 0.50), 'smart_5_raw'] = smart_5_median_specialized
# Use the median to fill the capacity categories that cannot be calculated.
df.loc[df['smart_5_raw'].isnull(), 'smart_5_raw'] = smart_5_median
df['smart_5_raw'].isnull().sum()
For the smart_7_raw data, the median for the manufacturer and drive capacity will be used for the second model. The median for the manufacturer without regard for drive capacity will be used for the first and third models of drive, as there are no appropriate rows with values to interpolate from.
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_7_raw'].notnull()) & (df['capacity_TB'] == 0.25)]['smart_7_raw']
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_7_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_7_raw']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_7_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_7_raw'])
smart_7_median_specialized = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_7_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_7_raw'].median()
smart_7_median_specialized
mean = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_7_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_7_raw'].mean()
mean
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_7_raw'].notnull()) & (df['capacity_TB'] == 2.00)]['smart_7_raw']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_7_raw'].notnull())]['smart_7_raw'])
smart_7_median = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_7_raw'].notnull())]['smart_7_raw'].median()
smart_7_median
# Use the median to fill the capacity category that can be calculated.
df.loc[(df['smart_7_raw'].isnull()) & (df['capacity_TB'] == 0.50), 'smart_7_raw'] = smart_7_median_specialized
# Use the median to fill the capacity categories that cannot be calculated.
df.loc[df['smart_7_raw'].isnull(), 'smart_7_raw'] = smart_7_median
df['smart_7_raw'].isnull().sum()
For the smart_10_raw data, the median for the manufacturer will be used to fill the NaN values.
df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_10_raw'].notnull())]['smart_10_raw']
df['smart_10_raw'].value_counts()
df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_10_raw'].notnull())]['smart_10_raw'].value_counts()
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_10_raw'].notnull())]['smart_10_raw'])
smart_10_median = df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_10_raw'].notnull())]['smart_10_raw'].median()
smart_10_median
df.loc[df['smart_10_raw'].isnull(), 'smart_10_raw'] = smart_10_median
df['smart_10_raw'].isnull().sum()
For the smart_197_raw data, the median for the manufacturer and drive capacity will be used for the second model. The median for the manufacturer without regard for drive capacity will be used for the first and third models of drive, as there are no appropriate rows with values to interpolate from.
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_197_raw'].notnull()) & \
(df['capacity_TB'] == 0.25)]['smart_197_raw']
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_197_raw'].notnull()) & \
(df['capacity_TB'] == 0.50)]['smart_197_raw']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_197_raw'].notnull()) & \
(df['capacity_TB'] == 0.50)]['smart_197_raw'], kde = False)
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_197_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_197_raw'].value_counts()
smart_197_median_specialized = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_197_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_197_raw'].median()
smart_197_median_specialized
smart_197_mean_specialized = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_197_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_197_raw'].mean()
smart_197_mean_specialized
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_197_raw'].notnull()) & (df['capacity_TB'] == 2.00)]['smart_197_raw']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_197_raw'].notnull())]['smart_197_raw'], kde = False)
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_197_raw'].notnull())]['smart_197_raw'].value_counts()
smart_197_median = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_197_raw'].notnull())]['smart_197_raw'].median()
smart_197_median
# Use the median to fill the capacity category that can be calculated.
df.loc[(df['smart_197_raw'].isnull()) & (df['capacity_TB'] == 0.50), 'smart_197_raw'] = smart_197_median_specialized
# Use the median to fill the capacity categories that cannot be calculated.
df.loc[df['smart_197_raw'].isnull(), 'smart_197_raw'] = smart_197_median
df['smart_197_raw'].isnull().sum()
For the smart_198_raw data, the median for the manufacturer and drive capacity will be used for the second model. The median for the manufacturer without regard for drive capacity will be used for the first and third models of drive, as there are no appropriate rows with values to interpolate from.
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_198_raw'].notnull()) & (df['capacity_TB'] == 0.25)]['smart_198_raw']
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_198_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_198_raw']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_198_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_198_raw'], kde = False)
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_198_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_198_raw'].value_counts()
smart_198_median_specialized = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_198_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_198_raw'].median()
smart_198_median_specialized
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_198_raw'].notnull()) & (df['capacity_TB'] == 2.00)]['smart_198_raw']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_198_raw'].notnull())]['smart_198_raw'], kde = False)
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_198_raw'].notnull())]['smart_198_raw'].value_counts()
smart_198_median = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_198_raw'].notnull())]['smart_198_raw'].median()
smart_198_median
# Use the median to fill the capacity category that can be calculated.
df.loc[(df['smart_198_raw'].isnull()) & (df['capacity_TB'] == 0.50), 'smart_198_raw'] = smart_198_median_specialized
# Use the median to fill the capacity categories that cannot be calculated.
df.loc[df['smart_198_raw'].isnull(), 'smart_198_raw'] = smart_198_median
df['smart_198_raw'].isnull().sum()
For the smart_199_raw data, the median for the manufacturer and drive capacity will be used for the second model. The median for the manufacturer without regard for drive capacity will be used for the first and third models of drive, as there are no appropriate rows with values to interpolate from.
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_199_raw'].notnull()) & (df['capacity_TB'] == 0.25)]['smart_199_raw']
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_199_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_199_raw']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_199_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_199_raw'], kde = False)
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_199_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_199_raw'].value_counts()
smart_199_median_specialized = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_199_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_199_raw'].median()
smart_199_median_specialized
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_199_raw'].notnull()) & (df['capacity_TB'] == 2.00)]['smart_199_raw']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_199_raw'].notnull())]['smart_199_raw'], kde = False)
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_199_raw'].notnull())]['smart_199_raw'].value_counts()
smart_199_median = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_199_raw'].notnull())]['smart_199_raw'].median()
smart_199_median
# Use the median to fill the capacity category that can be calculated.
df.loc[(df['smart_199_raw'].isnull()) & (df['capacity_TB'] == 0.50), 'smart_199_raw'] = smart_199_median_specialized
# Use the median to fill the capacity categories that cannot be calculated.
df.loc[df['smart_199_raw'].isnull(), 'smart_199_raw'] = smart_199_median
df['smart_199_raw'].isnull().sum()
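The smart_3 through smart_199 fills above all repeat the same two-pass pattern; it could be factored into a reusable helper. This is a sketch under the section's own assumptions (the function name and signature are this example's, not the notebook's):

```python
import pandas as pd
import numpy as np

def fill_by_capacity(df, col, manufacturer, capacity, stat='median'):
    """Fill NaNs in `col` in two passes: first with the statistic for one
    manufacturer/capacity group, then with the manufacturer-wide statistic
    for capacity groups that had no values to compute from."""
    valid = df.loc[(df['manufacturer'] == manufacturer) & df[col].notnull()]
    specialized = getattr(valid.loc[valid['capacity_TB'] == capacity, col], stat)()
    overall = getattr(valid[col], stat)()
    df.loc[df[col].isnull() & (df['capacity_TB'] == capacity), col] = specialized
    df.loc[df[col].isnull(), col] = overall
    return df

# Toy frame standing in for the real data.
toy = pd.DataFrame({
    'manufacturer': ['Seagate'] * 5,
    'capacity_TB': [0.50, 0.50, 0.50, 0.25, 2.00],
    'smart_3_raw': [100.0, 200.0, np.nan, np.nan, np.nan],
})
toy = fill_by_capacity(toy, 'smart_3_raw', 'Seagate', 0.50)
```

Passing `stat='mean'` would reproduce the smart_4_raw variant of the pattern.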
The smart_193_raw column presents a different problem from the last group of columns. It has 53985 rows with NaN values, which is still low enough in this large dataset to interpolate without major ill effects, but it still requires caution.
An important note here is that some manufacturers use different SMART attributes to represent the same information. Most Seagate and some Western Digital and Hitachi drives actually use 225 rather than 193 to store the Load/Unload Cycle Count value (Acronis, Knowledge Base 9128; Acronis, Knowledge Base 9152). We can see here that no row has both 193 and 225 values.
df.loc[(df['smart_193_raw'].notnull()) & \
(df['smart_225_raw'].notnull())][['smart_193_raw', 'smart_225_raw']]
df_193 = df.loc[df['smart_193_raw'].isnull()]
df_193
df_193['manufacturer'].value_counts()
df_193['model'].value_counts()
df_193.loc[df_193['smart_193_raw'] != \
df_193['smart_225_raw']][['smart_193_raw', 'smart_225_raw']]
df_193.loc[(df_193['smart_193_raw'].notnull()) & \
(df_193['smart_225_raw'].notnull())][['smart_193_raw', 'smart_225_raw']]
The only rows that have neither value are exactly the same rows as the last group; those will need to be interpolated if the rows are to be kept. The other 45193 rows can be filled by combining the two columns that represent the same information.
df_193.loc[(df_193['smart_193_raw'].isnull()) & \
(df_193['smart_225_raw'].isnull())][['smart_193_raw', 'smart_225_raw']]
df_193.loc[(df_193['smart_193_raw'].isnull()) & \
(df_193['smart_225_raw'].isnull())]['model'].value_counts()
The smart_193_raw and smart_225_raw columns will be combined into a new smart_193_225 column and then the remaining values filled as in previous columns.
df['smart_193_225'] = df['smart_193_raw']
df['smart_193_225'].fillna(df['smart_225_raw'], inplace = True)
df[['smart_193_raw', 'smart_225_raw', 'smart_193_225']].isna().sum()
df.drop(['smart_193_raw', 'smart_225_raw'], axis=1, inplace=True)
df.head()
Now that the two columns have been merged, the same process of interpolation by model and capacity can be used on the remaining group.
df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_193_225'].notnull()) & \
(df['capacity_TB'] == 0.25)]['smart_193_225']
df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_193_225'].notnull()) & \
(df['capacity_TB'] == 0.50)]['smart_193_225']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_193_225'].notnull()) & \
(df['capacity_TB'] == 0.50)]['smart_193_225'])
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_193_225'].notnull()) & \
(df['capacity_TB'] == 0.50)]['smart_193_225'].value_counts()
smart_193_225_median_specialized = df.loc[(df['manufacturer'] == "Seagate") &\
(df['smart_193_225'].notnull()) & \
(df['capacity_TB'] == 0.50)]['smart_193_225'].median()
smart_193_225_median_specialized
smart_193_225_mean_specialized = df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_193_225'].notnull()) & \
(df['capacity_TB'] == 0.50)]['smart_193_225'].mean()
smart_193_225_mean_specialized
df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_193_225'].notnull()) & \
(df['capacity_TB'] == 2.00)]['smart_193_225']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_193_225'].notnull())]['smart_193_225'])
df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_193_225'].notnull())]['smart_193_225'].value_counts()
smart_193_225_median = df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_193_225'].notnull())]['smart_193_225'].median()
smart_193_225_median
# Use the median to fill the capacity category that can be calculated.
df.loc[(df['smart_193_225'].isnull()) & \
(df['capacity_TB'] == 0.50), 'smart_193_225'] = \
smart_193_225_median_specialized
# Use the median to fill the capacity categories that cannot be calculated.
df.loc[df['smart_193_225'].isnull(), 'smart_193_225'] = smart_193_225_median
df['smart_193_225'].isnull().sum()
The remaining columns to examine each have over 2 million rows with NaN values. This level of missing data causes interpolation to skew results far more than in the previous groups. The following group of columns each still retains at least 70% of its values.
Column | NaN Count |
---|---|
smart_240_raw | 2733254 |
smart_241_raw | 2909319 |
smart_242_raw | 2909319 |
smart_187_raw | 3062543 |
smart_188_raw | 3062543 |
smart_190_raw | 3062543 |
df.loc[df['failure'] == 0]['smart_240_raw'].isnull().value_counts()
df.loc[df['failure'] == 1]['smart_240_raw'].isnull().value_counts()
df.loc[df['smart_240_raw'].isnull()].head()
Notably, none of the HGST drives have a value for the smart_240_raw column. Additionally, the drives missing the smart_241_raw data appear to be the same drives missing the smart_242_raw data, as both columns have an identical NaN count.
Seagate drives have enough filled values to use and Toshiba drives have no missing values, but the HGST and Western Digital drives do not have enough values to interpolate from. As such, all missing values will be filled in with the mean.
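That overlap can be checked directly by comparing the two missingness masks. A sketch on a toy frame (the same lines apply unchanged to the real df):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    'smart_241_raw': [1.0, np.nan, np.nan, 4.0],
    'smart_242_raw': [2.0, np.nan, np.nan, 8.0],
})

# If both columns are missing on exactly the same rows, the masks are
# equal and the crosstab has no mixed (True, False) cells.
mask_241 = toy['smart_241_raw'].isnull()
mask_242 = toy['smart_242_raw'].isnull()
overlap = pd.crosstab(mask_241, mask_242)
same_rows = mask_241.equals(mask_242)
```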
df.loc[df['smart_240_raw'].isnull()]['manufacturer'].value_counts()
df.loc[df['smart_240_raw'].notnull()]['manufacturer'].value_counts()
df.loc[df['smart_241_raw'].isnull()]['manufacturer'].value_counts()
df.loc[df['smart_241_raw'].notnull()]['manufacturer'].value_counts()
df.loc[df['smart_242_raw'].isnull()]['manufacturer'].value_counts()
df.loc[df['smart_242_raw'].notnull()]['manufacturer'].value_counts()
print("Not Failed Mean: " + str(df.loc[df['failure'] == 0]['smart_240_raw'].mean()))
print("Not Failed Median: " + str(df.loc[df['failure'] == 0]['smart_240_raw'].median()))
print("Failed Mean: " + str(df.loc[df['failure'] == 1]['smart_240_raw'].mean()))
print("Failed Median: " + str(df.loc[df['failure'] == 1]['smart_240_raw'].median()))
print("Not Failed Mean: " + str(df.loc[df['failure'] == 0]['smart_241_raw'].mean()))
print("Not Failed Median: " + str(df.loc[df['failure'] == 0]['smart_241_raw'].median()))
print("Failed Mean: " + str(df.loc[df['failure'] == 1]['smart_241_raw'].mean()))
print("Failed Median: " + str(df.loc[df['failure'] == 1]['smart_241_raw'].median()))
print("Not Failed Mean: " + str(df.loc[df['failure'] == 0]['smart_242_raw'].mean()))
print("Not Failed Median: " + str(df.loc[df['failure'] == 0]['smart_242_raw'].median()))
print("Failed Mean: " + str(df.loc[df['failure'] == 1]['smart_242_raw'].mean()))
print("Failed Median: " + str(df.loc[df['failure'] == 1]['smart_242_raw'].median()))
smart_240_mean = df.loc[df['smart_240_raw'].notnull()]['smart_240_raw'].mean()
smart_240_mean
df['smart_240_raw'].fillna(smart_240_mean, inplace = True)
df['smart_240_raw'].isnull().sum()
smart_241_mean = df.loc[df['smart_241_raw'].notnull()]['smart_241_raw'].mean()
smart_241_mean
df['smart_241_raw'].fillna(smart_241_mean, inplace = True)
df['smart_241_raw'].isnull().sum()
smart_242_mean = df.loc[df['smart_242_raw'].notnull()]['smart_242_raw'].mean()
smart_242_mean
df['smart_242_raw'].fillna(smart_242_mean, inplace = True)
df['smart_242_raw'].isnull().sum()
df.loc[df['failure'] == 0]['smart_187_raw'].isnull().value_counts()
df.loc[df['failure'] == 1]['smart_187_raw'].isnull().value_counts()
df.loc[df['smart_187_raw'].isnull()].head()
The smart_187_raw, smart_188_raw, and smart_190_raw columns are split cleanly by manufacturer: all Seagate drives have the values and none of the other manufacturers' drives do.
df.loc[df['smart_187_raw'].isnull()]['manufacturer'].value_counts()
df.loc[df['smart_187_raw'].notnull()]['manufacturer'].value_counts()
sns.distplot(df.loc[df['smart_187_raw'].notnull()]['smart_187_raw'], \
kde = False)
df.loc[df['smart_188_raw'].isnull()]['manufacturer'].value_counts()
df.loc[df['smart_188_raw'].notnull()]['manufacturer'].value_counts()
sns.distplot(df.loc[df['smart_188_raw'].notnull()]['smart_188_raw'], \
kde = False)
df.loc[df['smart_190_raw'].isnull()]['manufacturer'].value_counts()
df.loc[df['smart_190_raw'].notnull()]['manufacturer'].value_counts()
sns.distplot(df.loc[df['smart_190_raw'].notnull()]['smart_190_raw'])
Given the column distributions, the smart_187_raw and smart_188_raw NaNs will be filled with the medians, and the smart_190_raw NaNs will be filled with the mean.
smart_187_median = df.loc[df['smart_187_raw'].notnull()]['smart_187_raw'].median()
smart_187_median
df['smart_187_raw'].fillna(smart_187_median, inplace = True)
df['smart_187_raw'].isnull().sum()
smart_188_median = df.loc[df['smart_188_raw'].notnull()]['smart_188_raw'].median()
smart_188_median
df['smart_188_raw'].fillna(smart_188_median, inplace = True)
df['smart_188_raw'].isnull().sum()
smart_190_mean = df.loc[df['smart_190_raw'].notnull()]['smart_190_raw'].mean()
smart_190_mean
df['smart_190_raw'].fillna(smart_190_mean, inplace = True)
df['smart_190_raw'].isnull().sum()
if not os.path.isfile('pre_195_df.csv'):
df.to_csv('pre_195_df.csv', index=False)
These remaining columns have over 30% of their values missing, and an individualized approach will be taken with each of them. In some cases, binning the existing values into categories, with NaN as its own category, may preserve whatever predictive information remains.
Column | NaN Count |
---|---|
smart_195_raw | 4806304 |
smart_191_raw | 6402847 |
smart_184_raw | 6781011 |
smart_189_raw | 6781011 |
smart_200_raw | 7076978 |
smart_196_raw | 7921364 |
smart_8_raw | 7946665 |
smart_2_raw | 7946665 |
smart_183_raw | 9132294 |
smart_22_raw | 9741975 |
smart_223_raw | 10463678 |
smart_18_raw | 10651999 |
smart_224_raw | 10652391 |
smart_220_raw | 10652391 |
smart_222_raw | 10652391 |
smart_226_raw | 10652391 |
smart_23_raw | 10742991 |
smart_24_raw | 10742991 |
smart_11_raw | 10904619 |
smart_225_raw | 10929920 |
smart_254_raw | 10948140 |
smart_235_raw | 10966321 |
smart_233_raw | 10966321 |
smart_232_raw | 10966321 |
smart_168_raw | 10966321 |
smart_170_raw | 10966321 |
smart_218_raw | 10966321 |
smart_174_raw | 10966321 |
smart_16_raw | 10966321 |
smart_17_raw | 10966321 |
smart_173_raw | 10966321 |
smart_231_raw | 10966321 |
smart_177_raw | 10966321 |
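A quick way to triage columns like these is by their missing fraction. A sketch on a toy frame; the 0.30 threshold mirrors the 30% figure above, and the column names and values here are illustrative only:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    'smart_5_raw':   [1.0, 2.0, 3.0, 4.0, np.nan],           # 20% missing
    'smart_195_raw': [1.0, np.nan, np.nan, np.nan, np.nan],  # 80% missing
})

# Fraction of NaNs per column, smallest first.
missing_frac = toy.isnull().mean().sort_values()

# Columns under the threshold are candidates for interpolation; the rest
# need an individualized approach (category columns or dropping).
interpolate_cols = missing_frac[missing_frac <= 0.30].index.tolist()
individual_cols = missing_frac[missing_frac > 0.30].index.tolist()
```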
The smart_195_raw column only has values for a single manufacturer's drives, and even then only 77% of them. There appears to be virtually no difference in the column's distribution by failure status. Filling in NaNs with this information would only create collinearity between this column and the manufacturer column, so it will be dropped from the dataframe.
sns.distplot(df.loc[df['smart_195_raw'].notnull()]['smart_195_raw'])
sns.distplot(df.loc[df['failure'] == 0]['smart_195_raw'])
sns.distplot(df.loc[df['failure'] == 1]['smart_195_raw'])
plt.grid(True)
plt.title("smart_195_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df.loc[df['smart_195_raw'].notnull()]['manufacturer'].value_counts()
len(df.loc[df['smart_195_raw'].isnull()])
df['manufacturer'].value_counts()
df[['smart_195_raw', 'failure']].corr()
df.drop(['smart_195_raw'], axis=1, inplace=True)
df.head()
The smart_191_raw column is not split along manufacturer lines like many others, but it still has a large percentage of missing values. A categorical column smart_191_cat will be created with the following categories and values.
Value | Representation |
---|---|
0 | Value NaN |
1 | Below Average |
2 | Above Average |
The original smart_191_raw column will then be dropped.
sns.distplot(df.loc[df['smart_191_raw'].notnull()]['smart_191_raw'])
sns.distplot(df.loc[df['failure'] == 0]['smart_191_raw'])
sns.distplot(df.loc[df['failure'] == 1]['smart_191_raw'])
plt.grid(True)
plt.title("smart_191_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
sns.distplot(df.loc[(df['failure'] == 0) & \
(df['smart_191_raw'] != 0.0)]['smart_191_raw'])
sns.distplot(df.loc[(df['failure'] == 1) & \
(df['smart_191_raw'] != 0.0)]['smart_191_raw'])
plt.grid(True)
plt.title("smart_191_raw Nonzero Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df.loc[df['smart_191_raw'].notnull()]['manufacturer'].value_counts()
len(df.loc[df['smart_191_raw'].isnull()])
smart_191_mean = df.loc[df['smart_191_raw'].notnull()]['smart_191_raw'].mean()
smart_191_mean
df['smart_191_cat'] = 0
df.loc[(df['smart_191_raw'] < smart_191_mean), 'smart_191_cat'] = 1
df.loc[(df['smart_191_raw'] > smart_191_mean), 'smart_191_cat'] = 2
df['smart_191_cat'] = df['smart_191_cat'].astype('category')
df['smart_191_cat'].dtype
df['smart_191_cat'].value_counts()
df['smart_191_cat'].isnull().sum()
df.drop(['smart_191_raw'], axis=1, inplace=True)
df.head()
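The NaN / below-average / above-average encoding just applied to smart_191_raw recurs for several later columns as well; as a sketch, it could be factored into a helper. The function name is this example's, and note that it counts a value exactly equal to the mean as above average, a slight difference from the inline cells, which leave such a value at 0:

```python
import pandas as pd
import numpy as np

def mean_split_category(series):
    # 0 = NaN, 1 = below the column mean, 2 = at or above it.
    mean = series.mean()                 # .mean() skips NaNs
    cat = pd.Series(0, index=series.index)
    cat[series < mean] = 1               # NaN comparisons are False,
    cat[series >= mean] = 2              # so NaN rows stay 0
    return cat.astype('category')

toy = pd.Series([np.nan, 10.0, 30.0, np.nan])
encoded = mean_split_category(toy)
```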
The smart_184_raw column very rarely has any value other than 0 when it is available. However, whenever it is available and nonzero, it shows a disproportionate ratio of failures to non-failures, making it a very useful measure for predicting failure. A categorical column smart_184_cat will be created with the following categories and values.
Value | Representation |
---|---|
0 | Value is 0 or NaN |
1 | Value is above 0 |
The original smart_184_raw column will then be dropped.
sns.distplot(df.loc[df['smart_184_raw'].notnull()]['smart_184_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 0]['smart_184_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 1]['smart_184_raw'], kde = False)
plt.grid(True)
plt.title("smart_184_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df.loc[df['smart_184_raw'].notnull()]['smart_184_raw'].value_counts()
df.loc[(df['smart_184_raw'] != 0) & \
(df['smart_184_raw'].notnull())][['smart_184_raw', 'failure']]
df.loc[df['smart_184_raw'].notnull()]['manufacturer'].value_counts()
len(df.loc[df['smart_184_raw'].isnull()])
df['smart_184_cat'] = 0
df.loc[(df['smart_184_raw'] > 0), 'smart_184_cat'] = 1
df['smart_184_cat'] = df['smart_184_cat'].astype('category')
df['smart_184_cat'].dtype
df['smart_184_cat'].value_counts()
df['smart_184_cat'].isnull().sum()
df.drop(['smart_184_raw'], axis=1, inplace=True)
df.head()
The smart_189_raw column only has values for a single manufacturer's drives, and even then only 38% of them. There is also little correlation between this column and the failure rate. Filling in NaNs could likewise create collinearity with the manufacturer column, so it will be dropped from the dataframe without a category column.
sns.distplot(df.loc[df['smart_189_raw'].notnull()]['smart_189_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 0]['smart_189_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 1]['smart_189_raw'], kde = False)
plt.grid(True)
plt.title("smart_189_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df['smart_189_raw'].value_counts()
df.loc[df['smart_189_raw'].notnull()]['manufacturer'].value_counts()
len(df.loc[df['smart_189_raw'].isnull()])
df[['smart_189_raw', 'failure']].corr()
df.drop(['smart_189_raw'], axis=1, inplace=True)
df.head()
The smart_200_raw column is not entirely split along manufacturer lines like many others, but it still has a large percentage of missing values. Given the reasonably large correlation between a higher value and a higher failure rate, a categorical column smart_200_cat will be created with the following categories and values.
Value | Representation |
---|---|
0 | Value NaN |
1 | Below Average |
2 | Above Average |
The original smart_200_raw column will then be dropped.
sns.distplot(df.loc[df['smart_200_raw'].notnull()]['smart_200_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 0]['smart_200_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 1]['smart_200_raw'], kde = False)
plt.grid(True)
plt.title("smart_200_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df['smart_200_raw'].value_counts()
sns.distplot(df.loc[(df['failure'] == 0) & (df['smart_200_raw'] != 0.0)]['smart_200_raw'])
sns.distplot(df.loc[(df['failure'] == 1) & (df['smart_200_raw'] != 0.0)]['smart_200_raw'])
plt.grid(True)
plt.title("smart_200_raw Nonzero Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df.loc[df['smart_200_raw'].notnull()]['manufacturer'].value_counts()
len(df.loc[df['smart_200_raw'].isnull()])
df[['smart_200_raw', 'failure']].corr()
smart_200_mean = df.loc[df['smart_200_raw'].notnull()]['smart_200_raw'].mean()
smart_200_mean
df.loc[(df['failure'] == 0) & (df['smart_200_raw'].notnull())]['smart_200_raw'].mean()
df.loc[(df['failure'] == 1) & (df['smart_200_raw'].notnull())]['smart_200_raw'].mean()
df['smart_200_cat'] = 0
df.loc[(df['smart_200_raw'] < smart_200_mean), 'smart_200_cat'] = 1
df.loc[(df['smart_200_raw'] > smart_200_mean), 'smart_200_cat'] = 2
df['smart_200_cat'] = df['smart_200_cat'].astype('category')
df['smart_200_cat'].dtype
df['smart_200_cat'].value_counts()
df['smart_200_cat'].isnull().sum()
df.drop(['smart_200_raw'], axis=1, inplace=True)
df.head()
The smart_196_raw column is not split along manufacturer lines at all, but it still has a large percentage of missing values. Given the reasonably large correlation between a higher value and a higher failure rate, a categorical column smart_196_cat will be created with the following categories and values.
Value | Representation |
---|---|
0 | Value NaN |
1 | Below Average |
2 | Above Average |
The original smart_196_raw column will then be dropped.
sns.distplot(df.loc[df['smart_196_raw'].notnull()]['smart_196_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 0]['smart_196_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 1]['smart_196_raw'], kde = False)
plt.grid(True)
plt.title("smart_196_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df['smart_196_raw'].value_counts()
sns.distplot(df.loc[(df['failure'] == 0) & (df['smart_196_raw'] != 0.0)]['smart_196_raw'])
sns.distplot(df.loc[(df['failure'] == 1) & (df['smart_196_raw'] != 0.0)]['smart_196_raw'])
plt.grid(True)
plt.title("smart_196_raw Nonzero Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df.loc[df['smart_196_raw'].notnull()]['manufacturer'].value_counts()
len(df.loc[df['smart_196_raw'].isnull()])
df[['smart_196_raw', 'failure']].corr()
smart_196_mean = df.loc[df['smart_196_raw'].notnull()]['smart_196_raw'].mean()
smart_196_mean
df.loc[(df['failure'] == 0) & (df['smart_196_raw'].notnull())]['smart_196_raw'].mean()
df.loc[(df['failure'] == 1) & (df['smart_196_raw'].notnull())]['smart_196_raw'].mean()
df['smart_196_cat'] = 0
df.loc[(df['smart_196_raw'] < smart_196_mean), 'smart_196_cat'] = 1
df.loc[(df['smart_196_raw'] > smart_196_mean), 'smart_196_cat'] = 2
df['smart_196_cat'] = df['smart_196_cat'].astype('category')
df['smart_196_cat'].dtype
df['smart_196_cat'].value_counts()
df['smart_196_cat'].isnull().sum()
df.drop(['smart_196_raw'], axis=1, inplace=True)
df.head()
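The mean-split encoding applied to smart_196_raw above is repeated for several other SMART columns in this notebook. A hypothetical helper (mean_split_cat is not part of the notebook; it is a sketch of the same pattern) could express it once:

```python
import pandas as pd

def mean_split_cat(df, raw_col, cat_col):
    """Encode raw_col as 0 = NaN, 1 = below mean, 2 = above mean,
    then drop raw_col. NaN comparisons are False, so NaNs stay 0,
    matching the notebook's behavior."""
    mean = df.loc[df[raw_col].notnull(), raw_col].mean()
    df[cat_col] = 0
    df.loc[df[raw_col] < mean, cat_col] = 1
    df.loc[df[raw_col] > mean, cat_col] = 2
    df[cat_col] = df[cat_col].astype('category')
    return df.drop(columns=[raw_col])
```

A drive whose raw value equals the mean exactly would also stay 0 under this sketch, a rare edge case the per-column cells share.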
This column is not strongly split along manufacturer lines, but it still has a large percentage of missing values. Given the reasonably strong negative correlation between higher values and the failure rate, a categorical column smart_8_cat will be created with the following categories and values.
Value | Representation |
---|---|
0 | Value NaN |
1 | Below Average |
2 | Above Average |
The original smart_8_raw column will then be dropped.
sns.distplot(df.loc[df['smart_8_raw'].notnull()]['smart_8_raw'])
sns.distplot(df.loc[df['failure'] == 0]['smart_8_raw'])
sns.distplot(df.loc[df['failure'] == 1]['smart_8_raw'])
plt.grid(True)
plt.title("smart_8_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df['smart_8_raw'].value_counts()
df.loc[df['smart_8_raw'].notnull()]['manufacturer'].value_counts()
len(df.loc[df['smart_8_raw'].isnull()])
df[['smart_8_raw', 'failure']].corr()
smart_8_mean = df.loc[df['smart_8_raw'].notnull()]['smart_8_raw'].mean()
smart_8_mean
df.loc[(df['failure'] == 0) & (df['smart_8_raw'].notnull())]['smart_8_raw'].mean()
df.loc[(df['failure'] == 1) & (df['smart_8_raw'].notnull())]['smart_8_raw'].mean()
df['smart_8_cat'] = 0
df.loc[(df['smart_8_raw'] < smart_8_mean), 'smart_8_cat'] = 1
df.loc[(df['smart_8_raw'] > smart_8_mean), 'smart_8_cat'] = 2
df['smart_8_cat'] = df['smart_8_cat'].astype('category')
df['smart_8_cat'].dtype
df['smart_8_cat'].value_counts()
df['smart_8_cat'].isnull().sum()
df.drop(['smart_8_raw'], axis=1, inplace=True)
df.head()
This column is not strongly split along manufacturer lines, but it still has a large percentage of missing values. Given the reasonably strong negative correlation between higher values and the failure rate, a categorical column smart_2_cat will be created with the following categories and values.
Value | Representation |
---|---|
0 | Value NaN |
1 | Below Average |
2 | Above Average |
The original smart_2_raw column will then be dropped.
sns.distplot(df.loc[df['smart_2_raw'].notnull()]['smart_2_raw'])
sns.distplot(df.loc[df['failure'] == 0]['smart_2_raw'])
sns.distplot(df.loc[df['failure'] == 1]['smart_2_raw'])
plt.grid(True)
plt.title("smart_2_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df['smart_2_raw'].value_counts()
df.loc[df['smart_2_raw'].notnull()]['manufacturer'].value_counts()
len(df.loc[df['smart_2_raw'].isnull()])
df[['smart_2_raw', 'failure']].corr()
smart_2_mean = df.loc[df['smart_2_raw'].notnull()]['smart_2_raw'].mean()
smart_2_mean
df.loc[(df['failure'] == 0) & (df['smart_2_raw'].notnull())]['smart_2_raw'].mean()
df.loc[(df['failure'] == 1) & (df['smart_2_raw'].notnull())]['smart_2_raw'].mean()
df['smart_2_cat'] = 0
df.loc[(df['smart_2_raw'] < smart_2_mean), 'smart_2_cat'] = 1
df.loc[(df['smart_2_raw'] > smart_2_mean), 'smart_2_cat'] = 2
df['smart_2_cat'] = df['smart_2_cat'].astype('category')
df['smart_2_cat'].dtype
df['smart_2_cat'].value_counts()
df['smart_2_cat'].isnull().sum()
df.drop(['smart_2_raw'], axis=1, inplace=True)
df.head()
This column has values only for a single manufacturer's drives, and even then for only 23% of them. There is also little correlation between this column and the failure rate. Filling in the NaNs with this information could additionally introduce collinearity with the manufacturer column, so the column will be dropped from the dataframe without a category column.
sns.distplot(df.loc[df['smart_183_raw'].notnull()]['smart_183_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 0]['smart_183_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 1]['smart_183_raw'], kde = False)
plt.grid(True)
plt.title("smart_183_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df['smart_183_raw'].value_counts()
sns.distplot(df.loc[(df['failure'] == 0) & (df['smart_183_raw'] != 0.0)]['smart_183_raw'])
sns.distplot(df.loc[(df['failure'] == 1) & (df['smart_183_raw'] != 0.0)]['smart_183_raw'])
plt.grid(True)
plt.title("smart_183_raw Nonzero Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df.loc[df['smart_183_raw'].notnull()]['manufacturer'].value_counts()
len(df.loc[df['smart_183_raw'].isnull()])
df[['smart_183_raw', 'failure']].corr()
df.drop(['smart_183_raw'], axis=1, inplace=True)
df.head()
if not os.path.isfile('pre_22_df.csv'):
df.to_csv('pre_22_df.csv', index=False)
df = pd.read_csv('pre_22_df.csv')
n_rows = len(df)
df['date'] = df['date'].astype('category')
df['model'] = df['model'].astype('category')
df['failure'] = df['failure'].astype('bool')
df['manufacturer'] = df['manufacturer'].astype('category')
df['smart_191_cat'] = df['smart_191_cat'].astype('category')
df['smart_184_cat'] = df['smart_184_cat'].astype('category')
df['smart_200_cat'] = df['smart_200_cat'].astype('category')
df['smart_196_cat'] = df['smart_196_cat'].astype('category')
df['smart_8_cat'] = df['smart_8_cat'].astype('category')
This column has values only for a single manufacturer's drives, as it indicates the helium level sealed inside certain HGST drives (Klein, 2015). Given this, it would make no sense to fill the column's NaN values for drives from other manufacturers. Beyond that, the dataset contains no failures with abnormal helium levels, so this column could actually hurt the real-world effectiveness of a predictive model. Given that risk, the risk of collinearity with the manufacturer column, and the column's low correlation with the failure rate, it will be dropped from the dataframe without a category column to simplify the models.
sns.distplot(df.loc[df['smart_22_raw'].notnull()]['smart_22_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 0]['smart_22_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 1]['smart_22_raw'], kde = False)
plt.grid(True)
plt.title("smart_22_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df['smart_22_raw'].value_counts()
sns.distplot(df.loc[(df['failure'] == 0) & (df['smart_22_raw'] != 100.0)]['smart_22_raw'], kde = False)
sns.distplot(df.loc[(df['failure'] == 1) & (df['smart_22_raw'] != 100.0)]['smart_22_raw'], kde = False)
plt.grid(True)
plt.title("smart_22_raw Non100 Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df.loc[df['failure'] == 1]['smart_22_raw'].value_counts()
df.loc[df['smart_22_raw'].notnull()]['manufacturer'].value_counts()
len(df.loc[df['smart_22_raw'].isnull()])
df[['smart_22_raw', 'failure']].corr()
df.drop(['smart_22_raw'], axis=1, inplace=True)
df.head()
This column is not strongly split along manufacturer lines, but it still has a large percentage of missing values. Given the reasonably strong negative correlation between higher values and the failure rate, a categorical column smart_223_cat will be created with the following categories and values.
Value | Representation |
---|---|
0 | Value NaN |
1 | Below Average |
2 | Above Average |
The original smart_223_raw column will then be dropped.
sns.distplot(df.loc[df['smart_223_raw'].notnull()]['smart_223_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 0]['smart_223_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 1]['smart_223_raw'], kde = False)
plt.grid(True)
plt.title("smart_223_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df['smart_223_raw'].value_counts()
sns.distplot(df.loc[(df['failure'] == 0) & (df['smart_223_raw'] != 0.0)]['smart_223_raw'])
sns.distplot(df.loc[(df['failure'] == 1) & (df['smart_223_raw'] != 0.0)]['smart_223_raw'])
plt.grid(True)
plt.title("smart_223_raw Nonzero Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df.loc[df['failure'] == 1]['smart_223_raw'].value_counts()
df.loc[df['smart_223_raw'].notnull()]['manufacturer'].value_counts()
len(df.loc[df['smart_223_raw'].isnull()])
df[['smart_223_raw', 'failure']].corr()
smart_223_mean = df.loc[df['smart_223_raw'].notnull()]['smart_223_raw'].mean()
smart_223_mean
df.loc[(df['failure'] == 0) & (df['smart_223_raw'].notnull())]['smart_223_raw'].mean()
df.loc[(df['failure'] == 1) & (df['smart_223_raw'].notnull())]['smart_223_raw'].mean()
df['smart_223_cat'] = 0
df.loc[(df['smart_223_raw'] < smart_223_mean), 'smart_223_cat'] = 1
df.loc[(df['smart_223_raw'] > smart_223_mean), 'smart_223_cat'] = 2
df['smart_223_cat'] = df['smart_223_cat'].astype('category')
df['smart_223_cat'].dtype
df['smart_223_cat'].value_counts()
df['smart_223_cat'].isnull().sum()
df.drop(['smart_223_raw'], axis=1, inplace=True)
df.head()
if not os.path.isfile('pre_18_df.csv'):
df.to_csv('pre_18_df.csv', index=False)
df = pd.read_csv('pre_18_df.csv')
n_rows = len(df)
df['date'] = df['date'].astype('category')
df['model'] = df['model'].astype('category')
df['failure'] = df['failure'].astype('bool')
df['manufacturer'] = df['manufacturer'].astype('category')
df['smart_191_cat'] = df['smart_191_cat'].astype('category')
df['smart_184_cat'] = df['smart_184_cat'].astype('category')
df['smart_200_cat'] = df['smart_200_cat'].astype('category')
df['smart_196_cat'] = df['smart_196_cat'].astype('category')
df['smart_8_cat'] = df['smart_8_cat'].astype('category')
df['smart_2_cat'] = df['smart_2_cat'].astype('category')
df['smart_223_cat'] = df['smart_223_cat'].astype('category')
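Each checkpoint reload repeats the same dtype restoration, since a to_csv/read_csv round trip loses category and bool dtypes. A hypothetical helper (restore_dtypes is not part of the notebook) could centralize the repetition:

```python
import pandas as pd

def restore_dtypes(df, cat_cols, bool_cols=('failure',)):
    """Re-apply category and bool dtypes lost in a CSV round trip."""
    for col in cat_cols:
        df[col] = df[col].astype('category')
    for col in bool_cols:
        df[col] = df[col].astype('bool')
    return df
```

Each checkpoint cell could then pass its current list of categorical columns, keeping the list in one place per reload.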
This column is not only missing 97% of its values but also has no variance whatsoever, making it useless for analysis.
sns.distplot(df.loc[df['smart_18_raw'].notnull()]['smart_18_raw'], kde = False)
df['smart_18_raw'].value_counts()
df.loc[df['failure'] == 1]['smart_18_raw'].value_counts()
df.loc[df['smart_18_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_18_raw'], axis=1, inplace=True)
df.head()
This column is not only missing 97% of its values but also has no variance whatsoever, making it useless for analysis.
sns.distplot(df.loc[df['smart_224_raw'].notnull()]['smart_224_raw'], kde = False)
df['smart_224_raw'].value_counts()
df.loc[df['failure'] == 1]['smart_224_raw'].value_counts()
df.loc[df['smart_224_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_224_raw'], axis=1, inplace=True)
df.head()
This column is entirely split along manufacturer lines and has a large percentage of missing values, but it appears to be one of the few predictors available for Toshiba drives. Given the relatively strong negative correlation between higher values and the failure rate, a categorical column smart_220_cat will be created with the following categories and values.
Value | Representation |
---|---|
0 | Value NaN |
1 | Below Median |
2 | Above Median |
The original smart_220_raw column will then be dropped.
sns.distplot(df.loc[df['smart_220_raw'].notnull()]['smart_220_raw'])
sns.distplot(df.loc[df['failure'] == 0]['smart_220_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 1]['smart_220_raw'], kde = False)
plt.grid(True)
plt.title("smart_220_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df['smart_220_raw'].value_counts()
sns.distplot(df.loc[(df['failure'] == 0) & (df['smart_220_raw'] != 0.0)]['smart_220_raw'])
sns.distplot(df.loc[(df['failure'] == 1) & (df['smart_220_raw'] != 0.0)]['smart_220_raw'])
plt.grid(True)
plt.title("smart_220_raw Nonzero Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df.loc[df['failure'] == 1]['smart_220_raw'].value_counts()
len(df.loc[df['smart_220_raw'].isnull()])
df.loc[df['smart_220_raw'].notnull()]['manufacturer'].value_counts()
df[['smart_220_raw', 'failure']].corr()
smart_220_median = df.loc[df['smart_220_raw'].notnull()]['smart_220_raw'].median()
smart_220_median
df.loc[(df['failure'] == 0) & (df['smart_220_raw'].notnull())]['smart_220_raw'].mean()
df.loc[(df['failure'] == 1) & (df['smart_220_raw'].notnull())]['smart_220_raw'].mean()
df['smart_220_cat'] = 0
df.loc[(df['smart_220_raw'] < smart_220_median), 'smart_220_cat'] = 1
df.loc[(df['smart_220_raw'] > smart_220_median), 'smart_220_cat'] = 2
df['smart_220_cat'] = df['smart_220_cat'].astype('category')
df['smart_220_cat'].dtype
df['smart_220_cat'].value_counts()
df['smart_220_cat'].isnull().sum()
df.drop(['smart_220_raw'], axis=1, inplace=True)
df.head()
Although it is available only on Toshiba drives, this column has the strongest correlation with the failure rate yet. A categorical column smart_222_cat will be created with the following categories and values.
Value | Representation |
---|---|
0 | Value NaN |
1 | Below Average |
2 | Above Average |
The original smart_222_raw column will then be dropped.
sns.distplot(df.loc[df['smart_222_raw'].notnull()]['smart_222_raw'])
sns.distplot(df.loc[df['failure'] == 0]['smart_222_raw'])
sns.distplot(df.loc[df['failure'] == 1]['smart_222_raw'])
plt.grid(True)
plt.title("smart_222_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df['smart_222_raw'].value_counts()
df.loc[df['failure'] == 1]['smart_222_raw'].value_counts()
len(df.loc[df['smart_222_raw'].isnull()])
df.loc[df['smart_222_raw'].notnull()]['manufacturer'].value_counts()
df[['smart_222_raw', 'failure']].corr()
smart_222_mean = df.loc[df['smart_222_raw'].notnull()]['smart_222_raw'].mean()
smart_222_mean
df.loc[(df['failure'] == 0) & (df['smart_222_raw'].notnull())]['smart_222_raw'].mean()
df.loc[(df['failure'] == 1) & (df['smart_222_raw'].notnull())]['smart_222_raw'].mean()
df['smart_222_cat'] = 0
df.loc[(df['smart_222_raw'] < smart_222_mean), 'smart_222_cat'] = 1
df.loc[(df['smart_222_raw'] > smart_222_mean), 'smart_222_cat'] = 2
df['smart_222_cat'] = df['smart_222_cat'].astype('category')
df['smart_222_cat'].dtype
df['smart_222_cat'].value_counts()
df['smart_222_cat'].isnull().sum()
df.drop(['smart_222_raw'], axis=1, inplace=True)
df.head()
Although it is available only on Toshiba drives, this column has the strongest negative correlation with the failure rate yet. A categorical column smart_226_cat will be created with the following categories and values.
Value | Representation |
---|---|
0 | Value NaN |
1 | Below Average |
2 | Above Average |
The original smart_226_raw column will then be dropped.
sns.distplot(df.loc[df['smart_226_raw'].notnull()]['smart_226_raw'])
sns.distplot(df.loc[df['failure'] == 0]['smart_226_raw'])
sns.distplot(df.loc[df['failure'] == 1]['smart_226_raw'])
plt.grid(True)
plt.title("smart_226_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df['smart_226_raw'].value_counts()
df.loc[df['failure'] == 1]['smart_226_raw'].value_counts()
len(df.loc[df['smart_226_raw'].isnull()])
df.loc[df['smart_226_raw'].notnull()]['manufacturer'].value_counts()
df[['smart_226_raw', 'failure']].corr()
smart_226_mean = df.loc[df['smart_226_raw'].notnull()]['smart_226_raw'].mean()
smart_226_mean
df.loc[(df['failure'] == 0) & (df['smart_226_raw'].notnull())]['smart_226_raw'].mean()
df.loc[(df['failure'] == 1) & (df['smart_226_raw'].notnull())]['smart_226_raw'].mean()
df['smart_226_cat'] = 0
df.loc[(df['smart_226_raw'] < smart_226_mean), 'smart_226_cat'] = 1
df.loc[(df['smart_226_raw'] > smart_226_mean), 'smart_226_cat'] = 2
df['smart_226_cat'] = df['smart_226_cat'].astype('category')
df['smart_226_cat'].dtype
df['smart_226_cat'].value_counts()
df['smart_226_cat'].isnull().sum()
df.drop(['smart_226_raw'], axis=1, inplace=True)
df.head()
This column is not only missing 98% of its values but also has no variance whatsoever, making it useless for analysis.
sns.distplot(df.loc[df['smart_23_raw'].notnull()]['smart_23_raw'])
df['smart_23_raw'].value_counts()
df.loc[df['smart_23_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_23_raw'], axis=1, inplace=True)
df.head()
This column is not only missing 98% of its values but also has no variance whatsoever, making it useless for analysis.
sns.distplot(df.loc[df['smart_24_raw'].notnull()]['smart_24_raw'])
df['smart_24_raw'].value_counts()
df.loc[df['smart_24_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_24_raw'], axis=1, inplace=True)
df.head()
Although it has values for only 0.64% of drives, this column has the strongest correlation with the failure rate yet. A categorical column smart_11_cat will be created with the following categories and values.
Value | Representation |
---|---|
0 | Value NaN |
1 | Below Average |
2 | Above Average |
The original smart_11_raw column will then be dropped.
sns.distplot(df.loc[df['smart_11_raw'].notnull()]['smart_11_raw'])
sns.distplot(df.loc[df['failure'] == 0]['smart_11_raw'])
sns.distplot(df.loc[df['failure'] == 1]['smart_11_raw'])
plt.grid(True)
plt.title("smart_11_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df['smart_11_raw'].value_counts()
df.loc[df['failure'] == 1]['smart_11_raw'].value_counts()
df.loc[df['smart_11_raw'].notnull()]['manufacturer'].value_counts()
len(df.loc[df['smart_11_raw'].isnull()])
df[['smart_11_raw', 'failure']].corr()
smart_11_mean = df.loc[df['smart_11_raw'].notnull()]['smart_11_raw'].mean()
smart_11_mean
df.loc[(df['failure'] == 0) & (df['smart_11_raw'].notnull())]['smart_11_raw'].mean()
df.loc[(df['failure'] == 1) & (df['smart_11_raw'].notnull())]['smart_11_raw'].mean()
df['smart_11_cat'] = 0
df.loc[(df['smart_11_raw'] < smart_11_mean), 'smart_11_cat'] = 1
df.loc[(df['smart_11_raw'] > smart_11_mean), 'smart_11_cat'] = 2
df['smart_11_cat'] = df['smart_11_cat'].astype('category')
df['smart_11_cat'].dtype
df['smart_11_cat'].value_counts()
df['smart_11_cat'].isnull().sum()
df.drop(['smart_11_raw'], axis=1, inplace=True)
df.head()
This column is not only missing 99.75% of its values but also has no variance whatsoever, making it useless for analysis.
sns.distplot(df.loc[df['smart_254_raw'].notnull()]['smart_254_raw'])
df['smart_254_raw'].value_counts()
df.loc[df['smart_254_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_254_raw'], axis=1, inplace=True)
df.head()
This column holds an interesting report: the first 3 bytes of the value are the drive's good block count, while the last 2 bytes are its bad block count. However, the column is missing 99.92% of its values, making it useless for this type of predictive analysis.
sns.distplot(df.loc[df['smart_235_raw'].notnull()]['smart_235_raw'])
df['smart_235_raw'].value_counts()
df.loc[df['smart_235_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_235_raw'], axis=1, inplace=True)
df.head()
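If smart_235 had enough values to be usable, the packed report could be decoded with bit masks. A sketch, assuming the good block count occupies the upper 3 bytes and the bad block count the lower 2 bytes of a 5-byte little-endian value (the byte order here is an assumption for illustration, not verified against a drive specification):

```python
def decode_smart_235(raw):
    """Split a packed 5-byte SMART 235 value into
    (good_blocks, bad_blocks). Byte layout is assumed:
    upper 3 bytes = good count, lower 2 bytes = bad count."""
    raw = int(raw)
    good_blocks = (raw >> 16) & 0xFFFFFF  # upper 3 bytes
    bad_blocks = raw & 0xFFFF             # lower 2 bytes
    return good_blocks, bad_blocks
```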
No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.
sns.distplot(df.loc[df['smart_233_raw'].notnull()]['smart_233_raw'])
df.loc[(df['smart_233_raw'].notnull()) & (df['failure'] == 1)]
df.loc[df['smart_233_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_233_raw'], axis=1, inplace=True)
df.head()
No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.
sns.distplot(df.loc[df['smart_232_raw'].notnull()]['smart_232_raw'])
df.loc[(df['smart_232_raw'].notnull()) & (df['failure'] == 1)]
df.loc[df['smart_232_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_232_raw'], axis=1, inplace=True)
df.head()
if not os.path.isfile('pre_168_df.csv'):
df.to_csv('pre_168_df.csv', index=False)
df = pd.read_csv('pre_168_df.csv')
n_rows = len(df)
df['date'] = df['date'].astype('category')
df['model'] = df['model'].astype('category')
df['failure'] = df['failure'].astype('bool')
df['manufacturer'] = df['manufacturer'].astype('category')
df['smart_191_cat'] = df['smart_191_cat'].astype('category')
df['smart_184_cat'] = df['smart_184_cat'].astype('category')
df['smart_200_cat'] = df['smart_200_cat'].astype('category')
df['smart_196_cat'] = df['smart_196_cat'].astype('category')
df['smart_8_cat'] = df['smart_8_cat'].astype('category')
df['smart_2_cat'] = df['smart_2_cat'].astype('category')
df['smart_223_cat'] = df['smart_223_cat'].astype('category')
df['smart_220_cat'] = df['smart_220_cat'].astype('category')
df['smart_222_cat'] = df['smart_222_cat'].astype('category')
df['smart_226_cat'] = df['smart_226_cat'].astype('category')
df['smart_11_cat'] = df['smart_11_cat'].astype('category')
No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.
sns.distplot(df.loc[df['smart_168_raw'].notnull()]['smart_168_raw'])
df.loc[(df['smart_168_raw'].notnull()) & (df['failure'] == 1)]
df.loc[df['smart_168_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_168_raw'], axis=1, inplace=True)
df.head()
No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.
sns.distplot(df.loc[df['smart_170_raw'].notnull()]['smart_170_raw'])
df.loc[(df['smart_170_raw'].notnull()) & (df['failure'] == 1)]
df.loc[df['smart_170_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_170_raw'], axis=1, inplace=True)
df.head()
No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.
sns.distplot(df.loc[df['smart_218_raw'].notnull()]['smart_218_raw'])
df.loc[(df['smart_218_raw'].notnull()) & (df['failure'] == 1)]
df.loc[df['smart_218_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_218_raw'], axis=1, inplace=True)
df.head()
No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.
sns.distplot(df.loc[df['smart_174_raw'].notnull()]['smart_174_raw'])
df.loc[(df['smart_174_raw'].notnull()) & (df['failure'] == 1)]
df.loc[df['smart_174_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_174_raw'], axis=1, inplace=True)
df.head()
No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.
sns.distplot(df.loc[df['smart_16_raw'].notnull()]['smart_16_raw'])
df.loc[(df['smart_16_raw'].notnull()) & (df['failure'] == 1)]
df.loc[df['smart_16_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_16_raw'], axis=1, inplace=True)
df.head()
No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.
sns.distplot(df.loc[df['smart_17_raw'].notnull()]['smart_17_raw'])
df.loc[(df['smart_17_raw'].notnull()) & (df['failure'] == 1)]
df.loc[df['smart_17_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_17_raw'], axis=1, inplace=True)
df.head()
#### Memory Management and Reloading Checkpoint
if not os.path.isfile('pre_173_df.csv'):
df.to_csv('pre_173_df.csv', index=False)
df = pd.read_csv('pre_173_df.csv')
n_rows = len(df)
df['date'] = df['date'].astype('category')
df['model'] = df['model'].astype('category')
df['failure'] = df['failure'].astype('bool')
df['manufacturer'] = df['manufacturer'].astype('category')
df['smart_191_cat'] = df['smart_191_cat'].astype('category')
df['smart_184_cat'] = df['smart_184_cat'].astype('category')
df['smart_200_cat'] = df['smart_200_cat'].astype('category')
df['smart_196_cat'] = df['smart_196_cat'].astype('category')
df['smart_8_cat'] = df['smart_8_cat'].astype('category')
df['smart_2_cat'] = df['smart_2_cat'].astype('category')
df['smart_223_cat'] = df['smart_223_cat'].astype('category')
df['smart_220_cat'] = df['smart_220_cat'].astype('category')
df['smart_222_cat'] = df['smart_222_cat'].astype('category')
df['smart_226_cat'] = df['smart_226_cat'].astype('category')
df['smart_11_cat'] = df['smart_11_cat'].astype('category')
No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.
sns.distplot(df.loc[df['smart_173_raw'].notnull()]['smart_173_raw'])
df.loc[(df['smart_173_raw'].notnull()) & (df['failure'] == 1)]
df.loc[df['smart_173_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_173_raw'], axis=1, inplace=True)
df.head()
No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.
sns.distplot(df.loc[df['smart_231_raw'].notnull()]['smart_231_raw'])
df.loc[(df['smart_231_raw'].notnull()) & (df['failure'] == 1)]
df.loc[df['smart_231_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_231_raw'], axis=1, inplace=True)
df.head()
sns.distplot(df.loc[df['smart_177_raw'].notnull()]['smart_177_raw'])
df.loc[(df['smart_177_raw'].notnull()) & (df['failure'] == 1)]
df.loc[df['smart_177_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_177_raw'], axis=1, inplace=True)
df.head()
if not os.path.isfile('explorative_df.csv'):
df.to_csv('explorative_df.csv', index = False)
df = pd.read_csv('explorative_df.csv')
n_rows = len(df)
df['date'] = df['date'].astype('category')
df['model'] = df['model'].astype('category')
df['failure'] = df['failure'].astype('bool')
df['manufacturer'] = df['manufacturer'].astype('category')
df['smart_191_cat'] = df['smart_191_cat'].astype('category')
df['smart_184_cat'] = df['smart_184_cat'].astype('category')
df['smart_200_cat'] = df['smart_200_cat'].astype('category')
df['smart_196_cat'] = df['smart_196_cat'].astype('category')
df['smart_8_cat'] = df['smart_8_cat'].astype('category')
df['smart_2_cat'] = df['smart_2_cat'].astype('category')
df['smart_223_cat'] = df['smart_223_cat'].astype('category')
df['smart_220_cat'] = df['smart_220_cat'].astype('category')
df['smart_222_cat'] = df['smart_222_cat'].astype('category')
df['smart_226_cat'] = df['smart_226_cat'].astype('category')
df['smart_11_cat'] = df['smart_11_cat'].astype('category')
fig, axes = plt.subplots(6, 6, figsize = (30, 25))
row = 0
col = 0
for df_col in ['date', 'model', 'failure', 'smart_1_raw',
'smart_3_raw', 'smart_4_raw', 'smart_5_raw', 'smart_7_raw',
'smart_9_raw', 'smart_10_raw', 'smart_12_raw', 'smart_187_raw',
'smart_188_raw', 'smart_190_raw', 'smart_192_raw', 'smart_194_raw',
'smart_197_raw', 'smart_198_raw', 'smart_199_raw', 'smart_240_raw',
'smart_241_raw', 'smart_242_raw', 'manufacturer', 'capacity_TB',
'smart_193_225', 'smart_191_cat', 'smart_184_cat', 'smart_200_cat',
'smart_196_cat', 'smart_8_cat', 'smart_2_cat', 'smart_223_cat',
'smart_220_cat', 'smart_222_cat', 'smart_226_cat', 'smart_11_cat']:
if col == 6:
row += 1
col = 0
# Histograms
if df[df_col].dtype.name == 'float64':
if df_col in ['smart_1_raw', 'smart_3_raw', 'smart_4_raw',
'smart_5_raw', 'smart_7_raw', 'smart_10_raw',
'smart_12_raw', 'smart_187_raw', 'smart_188_raw',
'smart_192_raw', 'smart_197_raw', 'smart_198_raw',
'smart_199_raw', 'smart_242_raw', 'smart_193_225']:
ax = sns.distplot(df[df_col], ax = axes[row, col], kde = False)
ax.set_yscale('log')
else:
ax = sns.distplot(df[df_col], ax = axes[row, col], kde = False)
# Countplots
elif df[df_col].dtype.name == 'category' or \
df[df_col].dtype.name == 'bool':
if df_col == "date":
ax = sns.countplot(df[df_col], ax = axes[row, col])
ax.set(xticklabels = [])
elif df_col == "model":
ax = sns.countplot(df[df_col], ax = axes[row, col])
ax.set(xticklabels = [])
ax.set_yscale('log')
elif df_col in ['smart_184_cat', 'smart_11_cat']:
ax = sns.countplot(df[df_col], ax = axes[row, col])
ax.set_yscale('log')
else:
sns.countplot(df[df_col], ax = axes[row, col])
else:
print("Unknown column dtype")
col += 1
plt.subplots_adjust(top = 0.90)
fig.suptitle("Distribution of Dataframe Columns", fontsize = 54, y = 0.95)
fig.savefig("Charts/Dataframe Distributions.svg")
fig.savefig("Charts/Dataframe Distributions.png")
fig, axes = plt.subplots(6, 6, figsize = (30, 25))
row = 0
col = 0
for df_col in ['date', 'model', 'failure', 'smart_1_raw',
'smart_3_raw', 'smart_4_raw', 'smart_5_raw', 'smart_7_raw',
'smart_9_raw', 'smart_10_raw', 'smart_12_raw', 'smart_187_raw',
'smart_188_raw', 'smart_190_raw', 'smart_192_raw', 'smart_194_raw',
'smart_197_raw', 'smart_198_raw', 'smart_199_raw', 'smart_240_raw',
'smart_241_raw', 'smart_242_raw', 'manufacturer', 'capacity_TB',
'smart_193_225', 'smart_191_cat', 'smart_184_cat', 'smart_200_cat',
'smart_196_cat', 'smart_8_cat', 'smart_2_cat', 'smart_223_cat',
'smart_220_cat', 'smart_222_cat', 'smart_226_cat', 'smart_11_cat']:
if col == 6:
row += 1
col = 0
# Histograms
if df[df_col].dtype.name == 'float64':
if df_col in ['smart_1_raw', 'smart_3_raw', 'smart_4_raw',
'smart_5_raw', 'smart_7_raw', 'smart_10_raw',
'smart_12_raw', 'smart_187_raw', 'smart_188_raw',
'smart_192_raw', 'smart_197_raw', 'smart_198_raw',
'smart_199_raw', 'smart_242_raw', 'smart_193_225']:
ax = sns.distplot(df.loc[df['failure'] == 1][df_col], ax = axes[row, col], kde = False)
ax.set_yscale('log')
else:
ax = sns.distplot(df.loc[df['failure'] == 1][df_col], ax = axes[row, col], kde = False)
# Countplots
elif df[df_col].dtype.name == 'category' or df[df_col].dtype.name == 'bool':
if df_col == "date":
ax = sns.countplot(df.loc[df['failure'] == 1][df_col], ax = axes[row, col])
ax.set(xticklabels = [])
elif df_col == "model":
ax = sns.countplot(df.loc[df['failure'] == 1][df_col], ax = axes[row, col])
ax.set(xticklabels = [])
ax.set_yscale('log')
elif df_col in ['smart_184_cat', 'smart_11_cat']:
ax = sns.countplot(df.loc[df['failure'] == 1][df_col], ax = axes[row, col])
else:
sns.countplot(df.loc[df['failure'] == 1][df_col], ax = axes[row, col])
else:
print("Unknown column dtype")
col += 1
plt.subplots_adjust(top = 0.90)
fig.suptitle("Distribution of Dataframe Failure", fontsize = 54, y = 0.95)
fig.savefig("Charts/Dataframe Failure Distributions.svg")
fig.savefig("Charts/Dataframe Failure Distributions.png")
With all NaN values interpolated or their columns removed, correlations can be determined between the columns.
corr_df = df.corr(method = 'pearson')
corr_df
A few of the relationships between columns need closer examination based on these correlation coefficients.
The most prominent feature is smart_9_raw, the column with the most extreme correlations with the other columns. This is understandable given that SMART attribute 9 records the total number of hours the drive has been in a power-on state (Acronis, Knowledge Base 9109), and most issues worth measuring are likely to accumulate with drive age and amount of operation. The column may also be a powerful predictor within the models, as an older drive is generally more likely to wear down and fail suddenly than a newer one, even when no other warning values are present. Conversely, even when other predictors of failure are present in an instance, a drive with an average or below-average smart_9_raw value may fail far sooner than the average length of time to failure.
smart_240_raw also has quite high correlations with other independent variables.
smart_197_raw and smart_198_raw are almost perfectly collinear with each other and only weakly correlated with any other column. smart_198_raw will be dropped, as it has the lower correlation with the dependent variable failure.
Finally, smart_190_raw and smart_194_raw are highly collinear with each other and only weakly correlated with any other column, so one of the two likely needs to be removed.
The dataset may be large enough that multicollinearity does not reduce the predictive power of the models, but the redundant information may still skew the results.
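One way to quantify the redundancy among predictors is the variance inflation factor (VIF), where values above roughly 10 are commonly flagged as problematic. A minimal sketch with a hypothetical vif helper (computed from first principles rather than the notebook's dataframe):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples, n_features).
    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j
    on the remaining columns (with an intercept)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = resid @ resid
        ss_tot = ((y - y.mean()) ** 2).sum()
        r2 = 1 - ss_res / ss_tot
        out.append(1.0 / (1.0 - r2) if r2 < 1 else np.inf)
    return out
```

Applied to the numeric SMART columns, near-duplicate pairs such as the ones noted above would produce very large VIFs, while independent columns stay near 1.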
Among the potential predictors of failure, smart_5_raw and smart_197_raw have the highest positive correlations with failure, at 4.4% and 2.7% respectively. SMART attribute 5 is the drive's reallocated sector count, which increments when a read, write, or verification error forces a sector to be remapped (Acronis, Knowledge Base 9105). SMART attribute 197 is the current pending sector count, the number of unstable sectors awaiting remapping (Acronis, Knowledge Base 9133). That value decreases as sectors are remapped, but it would remain consistently high if the sectors cannot be remapped. Both attributes make complete sense as the values most correlated with failure and will likely be the most important predictor variables for HDD failure.
df[['smart_197_raw', 'smart_198_raw', 'failure']].corr()
df.drop('smart_198_raw', axis = 1, inplace = True)
corr_df = df.corr(method = 'pearson')
fig, ax = plt.subplots(figsize = (30, 23))
sns.heatmap(
corr_df,
ax = ax,
annot = True,
fmt = ".1%",
vmin = -1, vmax = 1, center = 0,
linewidths = 3,
linecolor = "white",
xticklabels = corr_df.columns,
yticklabels = corr_df.columns,
square = True,
cbar = True
)
plt.title("Dataframe Correlation Heatmap", fontsize = 54)
fig.savefig("Charts/Corr Heatmap.svg")
fig.savefig("Charts/Corr Heatmap.png")
df.columns
from sklearn.feature_selection import chi2
import scipy.stats as scs
# Display the results of a Chi-Squared test on a contingency table
# in a tabular format
def chi2_output(contingency: pd.core.frame.DataFrame):
chi2, p, dof, expected = scs.chi2_contingency(contingency)
print("χ2-Coefficient: \t" + str(chi2))
print("P-Value: \t\t" + str(p))
print("Degrees of Freedom: \t" + str(dof))
# Access the index names of the contingency table dataframe
ax_1 = str(contingency.axes[1][0])
ax_2 = str(contingency.axes[1][1])
ax_title = ax_1 + ":\t\t" + ax_2 + ":\t\t" + ax_1 + ":\t\t" + ax_2 + ":"
print("Expected Values:\n\t\t\tExpected:\t\t\tActual:")
print("\t" + contingency.axes[1].name + ":\t" + ax_title)
print(contingency.axes[0].name + ":")
# Map the indexes to string values to ensure numeric indexes
# don't cause type errors
contingency.index = contingency.index.map(str)
for i, j in enumerate(contingency.index):
expected_false = "{:.3f}".format(expected[i][0])
actual_false = str(contingency[0][i])
expected_true = "{:.3f}".format(expected[i][1])
actual_true = str(contingency[1][i])
# Tabular spacing adjustments on the assumption that 1 tab = 8 spaces
index_text = " " + j + ": \t"
if len(j) < 3:
index_text += "\t\t"
elif len(j) < 9:
index_text += "\t"
if len(expected_false) < 7:
expected_false += "\t"
if len(actual_false) < 7:
actual_false += "\t"
if len(expected_true) < 7:
expected_true += "]\t"
else:
expected_true += "]"
expected_text = "[" + expected_false + "\t" + expected_true
if len(expected_text) < 16:
expected_text = expected_text + "\t"
actual_text = "[" + actual_false + "\t" + actual_true + "]"
line = expected_text + "\t" + actual_text
print(index_text + line)
import os
import sys
from cffi import FFI
# cffi is used here only to force-load the R DLLs so that rpy2 can find them.
FFI_ = FFI()
def prepend_path_env(added_paths, to_env='PATH'):
    path_sep = ';'
    prior_path_env = os.environ.get(to_env, '')
    prior_paths = prior_path_env.split(path_sep)
    # Only add paths that actually exist on this machine
    added_paths = [x for x in added_paths if os.path.exists(x)]
    new_paths = prior_paths + added_paths
    new_env_val = path_sep.join(new_paths)
    return new_env_val
libs_path = r"C:\Users\aedri\Anaconda3\envs\tf1\lib\R\bin\x64"
libs_path2 = r"C:\Users\aedri\Anaconda3\envs\tf1\Lib\R\library\stats\libs\x64"
# The DLLs are resolved by name through the PATH entries added below
dll_path = "R.dll"
dll_path2 = "stats.dll"
to_env = 'PATH'
if(sys.platform == 'win32'):
os.environ[to_env] = prepend_path_env([libs_path], to_env)
os.environ[to_env] = prepend_path_env([libs_path2], to_env)
LIB = FFI_.dlopen(dll_path, 1) # 1 for Lazy loading
dir(LIB)
LIB2 = FFI_.dlopen(dll_path2, 1) # 1 for Lazy loading
dir(LIB2)
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
import rpy2.robjects as ro
rpy2.robjects.numpy2ri.activate()
rstats = importr('stats')
# Display the formatted results of the R stats Fisher_Test,
# using Monte Carlo Simulation
def r_fisher_output(dataframe):
results = rstats.fisher_test(dataframe.to_numpy(), \
simulate_p_value = True)
# Convert the listvector object returned from R stats to
# a list of string values
d = [key + "_" + str(results.rx2(key)[0]) for key in results.names]
d2 = []
for i in d:
d2.append("".join(i.replace("\t", "").splitlines()))
    # Replicate the tabular data formatting
for line in d2:
if len(line.split("_")[0]) < 8:
print(line.replace("_", "\t\t"))
else:
print(line.replace("_", "\t"))
manufacturer_contingency = pd.crosstab(df['manufacturer'], df['failure'])
manufacturer_contingency
pd.crosstab(df['manufacturer'], df['failure'], normalize = "index")
chi2_output(manufacturer_contingency)
r_fisher_output(manufacturer_contingency)
The model column will ultimately be dropped, even after all of the work that went into cleaning its data. Its large number of categories adds substantial complexity to the models without a corresponding improvement. The manufacturer column, while less specific, captures much of the same variation with only four categories. Additionally, many of the models do not record a single failure, and many more represent only a few thousand hard-drive-days. Leaving the column in for predictive modeling and analysis would only hurt the overall results, so it is removed.
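The cardinality argument above can be illustrated on a tiny hypothetical table (model names and counts are made up, not taken from the Backblaze data):

```python
import pandas as pd

# Hypothetical drive-day records; model names are made up for illustration
records = pd.DataFrame({
    "model":        ["A1", "A1", "A2", "B7", "B7", "B7", "C3", "C3"],
    "manufacturer": ["A",  "A",  "A",  "B",  "B",  "B",  "C",  "C"],
    "failure":      [0,    1,    0,    0,    0,    0,    0,    0],
})

# Per-model support and failure counts: most levels carry no failure signal
per_model = records.groupby("model")["failure"].agg(["count", "sum"])
print(per_model)

# The manufacturer column captures the same grouping with far fewer levels
print(records["model"].nunique(), "models vs",
      records["manufacturer"].nunique(), "manufacturers")
```

Levels with zero failures and tiny support contribute nothing a model can learn from, while each extra level still costs a dummy column after encoding.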
model_contingency = pd.crosstab(df['model'], df['failure'])
model_contingency
pd.crosstab(df['model'], df['failure'], normalize = "index")
chi2_output(model_contingency)
r_fisher_output(model_contingency)
df.drop('model', axis = 1, inplace = True)
capacity_contingency = pd.crosstab(df['capacity_TB'], df['failure'])
capacity_contingency
pd.crosstab(df['capacity_TB'], df['failure'], normalize = "index")
chi2_output(capacity_contingency)
r_fisher_output(capacity_contingency)
This column has the highest p-value of all the categorical columns. While still statistically significant, that is likely a consequence of the dataset's size rather than a genuine relationship; smart_191_cat is unlikely to be a good predictor variable.
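The size effect noted above, where a negligible association becomes "significant" simply because n is huge, can be demonstrated with two contingency tables that share the same weak effect (the counts are invented for illustration):

```python
import numpy as np
from scipy import stats

# Two contingency tables with the same weak association (1.00% vs 1.10%
# failure rates); the second simply has 100x the sample size
small = np.array([[9_900, 100],
                  [9_890, 110]])
big = small * 100

p_small = stats.chi2_contingency(small)[1]
p_big = stats.chi2_contingency(big)[1]
print(p_small, p_big)  # the p-value shrinks with n even though the effect is fixed
```

With roughly 10.8 million drive-day rows in the quarter, even trivial differences in failure rate will clear the 0.05 threshold, so effect size matters more than the p-value alone.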
smart_191_contingency = pd.crosstab(df['smart_191_cat'], df['failure'])
smart_191_contingency
pd.crosstab(df['smart_191_cat'], df['failure'], normalize = "index")
chi2_output(smart_191_contingency)
r_fisher_output(smart_191_contingency)
With a reported p-value of 0.0 (i.e., below floating-point precision), this is likely the strongest relationship with failure in the dataset.
smart_184_contingency = pd.crosstab(df['smart_184_cat'], df['failure'])
smart_184_contingency
pd.crosstab(df['smart_184_cat'], df['failure'], normalize = "index")
chi2_output(smart_184_contingency)
r_fisher_output(smart_184_contingency)
smart_200_contingency = pd.crosstab(df['smart_200_cat'], df['failure'])
smart_200_contingency
pd.crosstab(df['smart_200_cat'], df['failure'], normalize = "index")
chi2_output(smart_200_contingency)
r_fisher_output(smart_200_contingency)
smart_196_contingency = pd.crosstab(df['smart_196_cat'], df['failure'])
smart_196_contingency
pd.crosstab(df['smart_196_cat'], df['failure'], normalize = "index")
chi2_output(smart_196_contingency)
r_fisher_output(smart_196_contingency)
smart_8_contingency = pd.crosstab(df['smart_8_cat'], df['failure'])
smart_8_contingency
pd.crosstab(df['smart_8_cat'], df['failure'], normalize = "index")
chi2_output(smart_8_contingency)
r_fisher_output(smart_8_contingency)
smart_2_contingency = pd.crosstab(df['smart_2_cat'], df['failure'])
smart_2_contingency
pd.crosstab(df['smart_2_cat'], df['failure'], normalize = "index")
chi2_output(smart_2_contingency)
r_fisher_output(smart_2_contingency)
smart_223_contingency = pd.crosstab(df['smart_223_cat'], df['failure'])
smart_223_contingency
pd.crosstab(df['smart_223_cat'], df['failure'], normalize = "index")
chi2_output(smart_223_contingency)
r_fisher_output(smart_223_contingency)
smart_220_contingency = pd.crosstab(df['smart_220_cat'], df['failure'])
smart_220_contingency
pd.crosstab(df['smart_220_cat'], df['failure'], normalize = "index")
chi2_output(smart_220_contingency)
r_fisher_output(smart_220_contingency)
smart_222_contingency = pd.crosstab(df['smart_222_cat'], df['failure'])
smart_222_contingency
pd.crosstab(df['smart_222_cat'], df['failure'], normalize = "index")
chi2_output(smart_222_contingency)
r_fisher_output(smart_222_contingency)
smart_226_contingency = pd.crosstab(df['smart_226_cat'], df['failure'])
smart_226_contingency
pd.crosstab(df['smart_226_cat'], df['failure'], normalize = "index")
chi2_output(smart_226_contingency)
r_fisher_output(smart_226_contingency)
smart_11_contingency = pd.crosstab(df['smart_11_cat'], df['failure'])
smart_11_contingency
pd.crosstab(df['smart_11_cat'], df['failure'], normalize = "index")
chi2_output(smart_11_contingency)
r_fisher_output(smart_11_contingency)
To begin the factor analysis, the dataset needs to be prepared through standardization and normalization, as well as the train, test, and validation splits. Performing these steps before the PCA ensures that the training data is not contaminated by any influence from the testing and validation data.
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
y_df = df['failure']
x_df = df.drop('failure', axis = 1)
x_df.drop(['date', 'serial_number'], axis = 1, inplace = True)
del df
The first split is 80% Train and 20% Test, stratified on the y_df / failure series.
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, \
test_size = 0.2, random_state = 13, stratify = y_df)
Verify the stratified splitting.
y_train.value_counts()
y_train.value_counts()[1] / y_train.value_counts()[0]
y_test.value_counts()
Note that while the ratio is not exact, it is the closest possible given whole-number counts.
y_test.value_counts()[1] / y_test.value_counts()[0]
(y_test.value_counts()[1] - 1) / y_test.value_counts()[0]
(y_test.value_counts()[1] + 1) / y_test.value_counts()[0]
The second split is 87.5% Train and 12.5% Validation, stratified on the y_df / failure series, to result in 70% Train and 10% Validation overall.
x_train, x_valid, y_train, y_valid = train_test_split(x_train, y_train, \
test_size = 0.125, random_state = 13, stratify = y_train)
y_train.value_counts()
y_train.value_counts()[1] / y_train.value_counts()[0]
y_valid.value_counts()
y_valid.value_counts()[1] / y_valid.value_counts()[0]
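The arithmetic of the two-stage split (80/20, then 87.5/12.5 of the remainder, yielding 70/10/20 overall) and the preservation of the failure ratio can be checked on toy data with a 1% positive rate (sizes chosen so the stratified counts come out exact):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10,000 toy rows with a 1% "failure" rate
y = np.array([1] * 100 + [0] * 9_900)
X = np.arange(len(y)).reshape(-1, 1)

# Stage 1: hold out 20% as test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=13, stratify=y)
# Stage 2: 12.5% of the remaining 80% becomes validation (10% overall)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tr, y_tr, test_size=0.125, random_state=13, stratify=y_tr)

print(len(y_tr), len(y_va), len(y_te))     # 7000 1000 2000
print(y_tr.sum(), y_va.sum(), y_te.sum())  # 70 10 20 failures
```

Stratification keeps the 1% failure rate in all three partitions, which is exactly the property verified with value_counts() above.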
A scaler fit to the training data is created to standardize the continuous columns for model training. This avoids any contamination of the training data by ensuring that the test and validation datasets do not influence the training data at all.
scaler = preprocessing.StandardScaler()
x_train.columns
cont_cols = [
'smart_1_raw', 'smart_3_raw', 'smart_4_raw', 'smart_5_raw',
'smart_7_raw', 'smart_9_raw', 'smart_10_raw', 'smart_12_raw',
'smart_187_raw', 'smart_188_raw', 'smart_190_raw', 'smart_192_raw',
'smart_194_raw', 'smart_197_raw', 'smart_199_raw', 'smart_240_raw',
'smart_241_raw', 'smart_242_raw', 'smart_193_225', 'capacity_TB'
]
This fits the scaler to the continuous columns of the training data. The fit scaler will then be used to scale the testing and validation datasets.
x_train[cont_cols] = scaler.fit_transform(x_train[cont_cols])
A mean as close to zero as the data allows and a standard deviation of 1 indicate a successful standardization.
x_train[cont_cols].describe()
x_test[cont_cols] = scaler.transform(x_test[cont_cols])
x_valid[cont_cols] = scaler.transform(x_valid[cont_cols])
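The fit-on-train, transform-everywhere pattern used above can be sketched in isolation (the numbers are toy values, not the notebook's data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[10.0]])   # an outlier that must not influence the scaling

scaler = StandardScaler().fit(train)   # statistics come from train only
train_z = scaler.transform(train)
test_z = scaler.transform(test)

print(train_z.mean(), train_z.std())   # ~0.0 and 1.0 on the training data
print(test_z[0, 0])                    # scaled with train's mean/std: ~6.7
```

Because the test value never touches `fit`, it cannot leak its distribution into the training data; it is simply expressed in the training data's units.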
import prince
pca = prince.PCA(
n_components = len(cont_cols),
n_iter = 3 ,
copy = True,
check_input = True,
random_state = 13
)
pca = pca.fit(x_train[cont_cols])
ax = pca.plot_row_coordinates(
x_train[cont_cols],
ax = None,
figsize = (6, 6),
x_component = 0,
y_component = 1
)
# No .svg file will be saved for this plot as it takes up
# 1.07 GB (1,158,481,389 bytes).
#plt.savefig("Charts/PCA.svg")
plt.savefig("Charts/PCA.png")
pca_results_df = pca.column_correlations(x_train[cont_cols])
pca_results_df
fig, ax = plt.subplots(figsize = (30, 23))
sns.heatmap(
pca_results_df,
ax = ax,
annot = True,
fmt = ".1%",
vmin = -1, vmax = 1, center = 0,
linewidths = 3,
linecolor = "white",
xticklabels = pca_results_df.columns,
yticklabels = pca_results_df.index,
square = True,
cbar = True
)
plt.title("PCA Results Heatmap", fontsize = 54)
plt.savefig("Charts/PCA Heatmap.svg")
plt.savefig("Charts/PCA Heatmap.png")
pca_eigenvalues = pca.eigenvalues_
pca_eigenvalues
pca.explained_inertia_
plt.plot(np.arange(len(cont_cols)), pca_eigenvalues, 'ro-')
plt.title("PCA Scree Plot")
plt.xlabel("Principal Component")
plt.ylabel("Eigenvalue")
plt.xticks(range(0, len(cont_cols)))
plt.grid(b = True, which = 'major', color = 'w', linewidth = 1.0)
plt.grid(b = True, which = 'minor', color = 'w', linewidth = 0.5)
plt.savefig("Charts/PCA Scree Plot.svg")
plt.savefig("Charts/PCA Scree Plot.png")
cum_inertia = [0]
for i, e in enumerate(pca_eigenvalues):
cum_inertia.append(sum(pca_eigenvalues[0:i+1]) / sum(pca_eigenvalues))
cum_inertia
sum(pca_eigenvalues[0:13]) / sum(pca_eigenvalues)
plt.plot(range(0, len(cum_inertia)), cum_inertia)
plt.title("Inertia by Principal Components Kept")
plt.xlabel("Number of Principal Components")
plt.ylabel("Inertia")
plt.xticks(range(0, len(cum_inertia)))
plt.grid(b=True, which = 'major', color = 'w', linewidth = 1.0)
plt.grid(b=True, which = 'minor', color = 'w', linewidth = 0.5)
plt.savefig("Charts/PCA Inertia Plot.svg")
plt.savefig("Charts/PCA Inertia Plot.png")
The eigenvalues and explained inertia were used to create a scree plot, which, together with the cumulative inertia, indicates that 13 principal components are an appropriate degree of dimensionality reduction: these components capture 82.37% of the dataset's inertia using only 13 of the 20 components (65%).
PCA as a form of dimensionality reduction ensures that as little information, in the form of inertia, as possible is lost for a given number of dimensions removed. Because this dataset is quite large, any amount of dimensionality reduction will greatly improve the speed and the chance of proper convergence of the predictive models to come. The result reduces the dimensionality of the data by 35% while losing only 17.63% of the information, roughly a 2-for-1 trade.
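The same component-count selection can be sketched with scikit-learn's PCA on synthetic data (the notebook itself uses prince; the 5-factor toy data and the 80% threshold here are illustrative assumptions only):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 20 features driven by 5 latent factors plus small noise
rng = np.random.default_rng(13)
latent = rng.normal(size=(1_000, 5))
X = latent @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(1_000, 20))

pca = PCA(n_components=20).fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components that reaches an 80% inertia threshold
k = int(np.searchsorted(cum, 0.80)) + 1
print(k, float(cum[k - 1]))
```

Because only a handful of latent factors drive the toy data, a small k clears the threshold, mirroring how 13 of 20 components sufficed above.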
pca = prince.PCA(
n_components = 13,
n_iter = 3,
copy = True,
check_input = True,
random_state = 13
)
pca = pca.fit(x_train[cont_cols])
pca.explained_inertia_
pca_df = pca.transform(x_train[cont_cols])
pca_df = pca_df.add_prefix('pca_component_')
pca_df
pca_df.info()
# Replace the columns that factored in the PCA with
# the reduced-dimension PCA results.
x_train.drop(cont_cols, axis = 1, inplace = True)
x_train = x_train.join(pca_df)
x_train.info()
x_train.head()
pca_df = pca.transform(x_test[cont_cols])
pca_df = pca_df.add_prefix('pca_component_')
pca_df
# Replace the columns that factored in the PCA with
# the reduced-dimension PCA results.
x_test.drop(cont_cols, axis = 1, inplace = True)
x_test = x_test.join(pca_df)
x_test.info()
pca_df = pca.transform(x_valid[cont_cols])
pca_df = pca_df.add_prefix('pca_component_')
pca_df
# Replace the columns that factored in the PCA with
# the reduced-dimension PCA results.
x_valid.drop(cont_cols, axis = 1, inplace = True)
x_valid = x_valid.join(pca_df)
x_valid.info()
if not os.path.isfile('pca_x_train.csv'):
x_train.to_csv('pca_x_train.csv', index = False)
if not os.path.isfile('y_train.csv'):
y_train.to_csv('y_train.csv', index = False, header = True)
if not os.path.isfile('pca_x_test.csv'):
x_test.to_csv('pca_x_test.csv', index = False)
if not os.path.isfile('y_test.csv'):
y_test.to_csv('y_test.csv', index = False, header = True)
if not os.path.isfile('pca_x_valid.csv'):
x_valid.to_csv('pca_x_valid.csv', index = False)
if not os.path.isfile('y_valid.csv'):
y_valid.to_csv('y_valid.csv', index = False, header = True)
reload_pca = False
if reload_pca:
    x_train = pd.read_csv('pca_x_train.csv')
    y_train = pd.read_csv('y_train.csv')
    x_test = pd.read_csv('pca_x_test.csv')
    y_test = pd.read_csv('y_test.csv')
    x_valid = pd.read_csv('pca_x_valid.csv')
    y_valid = pd.read_csv('y_valid.csv')
n_rows = len(x_train)
#df['manufacturer'] = df['manufacturer'].astype('category')
#df['smart_191_cat'] = df['smart_191_cat'].astype('category')
#df['smart_184_cat'] = df['smart_184_cat'].astype('category')
#df['smart_200_cat'] = df['smart_200_cat'].astype('category')
#df['smart_196_cat'] = df['smart_196_cat'].astype('category')
#df['smart_8_cat'] = df['smart_8_cat'].astype('category')
#df['smart_2_cat'] = df['smart_2_cat'].astype('category')
#df['smart_223_cat'] = df['smart_223_cat'].astype('category')
#df['smart_220_cat'] = df['smart_220_cat'].astype('category')
#df['smart_222_cat'] = df['smart_222_cat'].astype('category')
#df['smart_226_cat'] = df['smart_226_cat'].astype('category')
#df['smart_11_cat'] = df['smart_11_cat'].astype('category')
To begin the MCA, the categorical columns need to be converted to boolean dummy-encoded columns.
cat_cols = [
'manufacturer', 'smart_191_cat', 'smart_184_cat',
'smart_200_cat', 'smart_196_cat', 'smart_8_cat',
'smart_2_cat', 'smart_223_cat', 'smart_220_cat',
'smart_222_cat', 'smart_226_cat', 'smart_11_cat'
]
x_train_cat = pd.get_dummies(x_train[cat_cols], \
columns = cat_cols, dtype = bool)
x_test_cat = pd.get_dummies(x_test[cat_cols], \
columns = cat_cols, dtype = bool)
x_valid_cat = pd.get_dummies(x_valid[cat_cols], \
columns = cat_cols, dtype = bool)
x_train_cat.columns
# The MCA is fit on the training-set encodings only
cat_df = x_train_cat
cat_df.dtypes
cat_df.memory_usage().sum()
mca = prince.MCA(
n_components = 13,
n_iter = 3,
copy = True,
random_state = 13
)
mca = mca.fit(cat_df)
ax = mca.plot_coordinates(
X = cat_df,
ax = None,
figsize=(20, 20),
show_row_points = True,
row_points_size = 10,
show_row_labels = False,
show_column_points = True,
column_points_size = 30,
show_column_labels = False,
legend_n_cols = 1
)
plt.savefig('Charts/MCA With Rows.png')
mca_eigenvalues = mca.eigenvalues_
mca_eigenvalues
plt.plot(np.arange(len(mca_eigenvalues)), mca_eigenvalues, 'ro-')
plt.title("Scree Plot")
plt.xlabel("Principal Component")
plt.ylabel("Eigenvalue")
plt.show()
ax = mca.plot_coordinates(
X = cat_df,
ax = None,
figsize = (20, 20),
show_row_points = False,
show_row_labels = False,
show_column_points = True,
column_points_size = 30,
show_column_labels = True,
legend_n_cols = 3
)
plt.savefig('Charts/MCA.svg')
plt.savefig('Charts/MCA.png')
# Replace the categorical columns with their encoded representation columns.
x_train.drop(cat_cols, axis = 1, inplace = True)
x_train = x_train.join(x_train_cat)
x_train.info()
# Replace the categorical columns with their encoded representation columns.
x_test.drop(cat_cols, axis = 1, inplace = True)
x_test = x_test.join(x_test_cat)
# Replace the categorical columns with their encoded representation columns.
x_valid.drop(cat_cols, axis = 1, inplace = True)
x_valid = x_valid.join(x_valid_cat)
if not os.path.isfile('cat_x_train.csv'):
x_train.to_csv('cat_x_train.csv', index = False)
if not os.path.isfile('cat_x_test.csv'):
x_test.to_csv('cat_x_test.csv', index = False)
if not os.path.isfile('cat_x_valid.csv'):
x_valid.to_csv('cat_x_valid.csv', index = False)
reload_cat = True
if reload_cat:
x_train = pd.read_csv('cat_x_train.csv')
y_train = pd.read_csv('y_train.csv')
x_test = pd.read_csv('cat_x_test.csv')
y_test = pd.read_csv('y_test.csv')
x_valid = pd.read_csv('cat_x_valid.csv')
y_valid = pd.read_csv('y_valid.csv')
n_rows = len(x_train)
While Factor Analysis of Mixed Data (FAMD) would have been ideal for dimensionality reduction, the current hardware requirements and software availability do not allow for it with such a large dataset.
Traditional training fails because hard drive failure is an extremely rare occurrence: the model learns to predict only non-failure, making it useless for actually predicting failures. This is why rebalancing the training data, by oversampling the failures and/or undersampling the non-failures, improves the training of the predictive models.
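The core idea behind the SMOTE oversampling used below can be sketched in plain NumPy (this is only the interpolation intuition, not imblearn's exact algorithm, which samples among k=5 nearest neighbors; the data here is synthetic):

```python
import numpy as np

# Synthetic rare "failure" rows standing in for the minority class
rng = np.random.default_rng(13)
minority = rng.normal(loc=3.0, size=(20, 4))

def smote_like(X_min, n_new, rng):
    """Create n_new synthetic rows by interpolating between a random
    minority point and its nearest minority neighbor."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf                      # exclude the point itself
        j = int(np.argmin(d))
        gap = rng.random()                 # random spot on the segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

new_rows = smote_like(minority, n_new=960, rng=rng)
print(new_rows.shape)   # the minority class grown from 20 to 980 rows
```

Because every synthetic row sits on a segment between two real minority points, the new samples stay inside the minority class's region of feature space rather than being naive duplicates.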
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, \
classification_report, roc_curve, auc
regression = LogisticRegression(solver = 'sag', n_jobs = -1)
regression.fit(x_train, y_train.values.ravel())
regression.intercept_
regression.coef_
coefs = pd.concat([pd.DataFrame(x_train.columns),
pd.DataFrame(np.transpose(regression.coef_))], axis = 1)
coefs.columns = ["Column", "Coefficient"]
coefs
coefs.where(coefs['Coefficient'] > 0).sort_values(['Coefficient'], \
ascending = False).dropna()
coefs.where(coefs['Coefficient'] < 0).sort_values(['Coefficient']).dropna()
accuracy = regression.score(x_test, y_test)
accuracy
predictions = regression.predict(x_test)
actual = y_test
confusion = confusion_matrix(actual, predictions)
confusion
precision = precision_score(actual, predictions)
precision
print(classification_report(actual, predictions))
sm = SMOTE(random_state = 13)
x_train, y_train = sm.fit_resample(x_train, y_train)
y_train['failure'].value_counts()
if not os.path.isfile('smote_x_df.pkl'):
x_train.to_pickle('smote_x_df.pkl')
if not os.path.isfile('smote_y_df.pkl'):
y_train.to_pickle('smote_y_df.pkl')
reload_smote = True
if reload_smote:
x_train = pd.read_pickle('smote_x_df.pkl')
y_train = pd.read_pickle('smote_y_df.pkl')
x_test = pd.read_csv('cat_x_test.csv')
y_test = pd.read_csv('y_test.csv')
x_valid = pd.read_csv('cat_x_valid.csv')
y_valid = pd.read_csv('y_valid.csv')
n_rows = len(x_train)
regression = LogisticRegression(solver = 'liblinear')
regression.fit(x_train, y_train)
regression_accuracy = regression.score(x_test, y_test)
regression_accuracy
regression_predictions = regression.predict(x_test)
actual = y_test
regression_confusion = confusion_matrix(actual, regression_predictions)
regression_confusion
regression_precision = precision_score(actual, regression_predictions)
regression_precision
print(classification_report(actual, regression_predictions))
regression_probabilities = regression.predict_proba(x_test)
predictions = regression_probabilities[:,1]
regression_false_positive_rate, regression_true_positive_rate, threshold =\
roc_curve(y_test, predictions)
regression_roc_auc = auc(regression_false_positive_rate, regression_true_positive_rate)
regression_roc_auc
plt.title('Liblinear Logistic Regression Receiver Operating Characteristic Curve')
plt.plot(regression_false_positive_rate, regression_true_positive_rate, 'blue',
label = 'AUC = %0.2f' % regression_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/Liblinear Logistic ROC AUC.svg')
plt.show()
# Save the model to disk
pickle.dump(regression, open('Models/Logistic Regression Liblinear.sav', 'wb'))
regression = LogisticRegression(solver = 'sag', n_jobs = -1)
regression.fit(x_train, y_train)
regression_accuracy = regression.score(x_test, y_test)
regression_accuracy
regression_predictions = regression.predict(x_test)
actual = y_test
regression_confusion = confusion_matrix(actual, regression_predictions)
regression_confusion
regression_precision = precision_score(actual, regression_predictions)
regression_precision
print(classification_report(actual, regression_predictions))
regression_probabilities = regression.predict_proba(x_test)
predictions = regression_probabilities[:,1]
regression_false_positive_rate, regression_true_positive_rate, threshold =\
roc_curve(y_test, predictions)
regression_roc_auc = auc(regression_false_positive_rate, regression_true_positive_rate)
regression_roc_auc
plt.title('SAG Logistic Regression Receiver Operating Characteristic Curve')
plt.plot(regression_false_positive_rate, regression_true_positive_rate, 'blue',
label = 'AUC = %0.2f' % regression_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/Logistic SAG ROC AUC.svg')
plt.show()
# Save the model to disk
pickle.dump(regression, open('Models/Logistic Regression SAG.sav', 'wb'))
regression = LogisticRegression(solver = 'saga', n_jobs = -1)
regression.fit(x_train, y_train)
regression_accuracy = regression.score(x_test, y_test)
regression_accuracy
regression_predictions = regression.predict(x_test)
actual = y_test
regression_confusion = confusion_matrix(actual, regression_predictions)
regression_confusion
regression_precision = precision_score(actual, regression_predictions)
regression_precision
print(classification_report(actual, regression_predictions))
regression_probabilities = regression.predict_proba(x_test)
predictions = regression_probabilities[:,1]
regression_false_positive_rate, regression_true_positive_rate, threshold =\
roc_curve(y_test, predictions)
regression_roc_auc = auc(regression_false_positive_rate, regression_true_positive_rate)
regression_roc_auc
plt.title('SAGA Logistic Regression Receiver Operating Characteristic Curve')
plt.plot(regression_false_positive_rate, regression_true_positive_rate, 'blue',
label = 'AUC = %0.2f' % regression_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/Logistic SAGA ROC AUC.svg')
plt.show()
# Save the model to disk
pickle.dump(regression, open('Models/Logistic Regression SAGA.sav', 'wb'))
regression = LogisticRegression(solver = 'lbfgs', \
max_iter = 10000, n_jobs = 1)
regression.fit(x_train, y_train.values.ravel())
regression_accuracy = regression.score(x_test, y_test)
regression_accuracy
regression_predictions = regression.predict(x_test)
actual = y_test
regression_confusion = confusion_matrix(actual, regression_predictions)
regression_confusion
regression_precision = precision_score(actual, regression_predictions)
regression_precision
print(classification_report(actual, regression_predictions))
regression_probabilities = regression.predict_proba(x_test)
predictions = regression_probabilities[:,1]
regression_false_positive_rate, regression_true_positive_rate, threshold =\
roc_curve(y_test, predictions)
regression_roc_auc = auc(regression_false_positive_rate, \
regression_true_positive_rate)
regression_roc_auc
plt.title('LBFGS Logistic Regression Receiver Operating Characteristic Curve')
plt.plot(regression_false_positive_rate, regression_true_positive_rate, 'blue',
label = 'AUC = %0.2f' % regression_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/LBFGS Logistic ROC AUC.svg')
plt.savefig('Charts/LBFGS Logistic ROC AUC.png')
plt.show()
# Save the model to disk
pickle.dump(regression, open('Models/Logistic Regression lbfgs.sav', 'wb'))
coefs = pd.concat([pd.DataFrame(x_train.columns),
pd.DataFrame(np.transpose(regression.coef_))], axis = 1)
coefs.columns = ["Column", "Coefficient"]
coefs
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
tree = DecisionTreeClassifier(max_depth = 20, splitter = 'best', \
random_state = 13)
tree.fit(x_train, y_train)
tree_accuracy = tree.score(x_test, y_test)
tree_accuracy
tree_predictions = tree.predict(x_test)
actual = y_test
tree_confusion = confusion_matrix(actual, tree_predictions)
tree_confusion
tree_precision = precision_score(actual, tree_predictions)
tree_precision
print(classification_report(actual, tree_predictions))
tree_probabilities = tree.predict_proba(x_test)
predictions = tree_probabilities[:,1]
tree_false_positive_rate, tree_true_positive_rate, threshold =\
roc_curve(y_test, predictions)
tree_roc_auc = auc(tree_false_positive_rate, tree_true_positive_rate)
tree_roc_auc
plt.title('Decision Tree Receiver Operating Characteristic Curve')
plt.plot(tree_false_positive_rate, tree_true_positive_rate, 'blue',
label = 'AUC = %0.2f' % tree_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/Tree ROC AUC.svg')
plt.savefig('Charts/Tree ROC AUC.png')
plt.show()
fig, ax = plt.subplots(figsize=(40, 20))
plot_tree(tree, fontsize = 6, max_depth = 3, class_names = True, \
feature_names = x_train.columns)
plt.savefig('Charts/Decision Tree.svg', dpi=100)
plt.savefig('Charts/Decision Tree.png', dpi=100)
# Save the model to disk
pickle.dump(tree, open('Models/Decision Tree.sav', 'wb'))
tree = pickle.load(open('Models/Decision Tree.sav', 'rb'))
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(max_depth = 20, verbose = 1, \
random_state = 13, n_jobs = -1)
forest.fit(x_train, y_train.values.ravel())
forest_accuracy = forest.score(x_test, y_test)
forest_accuracy
forest_predictions = forest.predict(x_test)
actual = y_test
forest_confusion = confusion_matrix(actual, forest_predictions)
forest_confusion
forest_precision = precision_score(actual, forest_predictions)
forest_precision
print(classification_report(actual, forest_predictions))
forest_probabilities = forest.predict_proba(x_test)
predictions = forest_probabilities[:,1]
forest_false_positive_rate, forest_true_positive_rate, threshold =\
roc_curve(y_test, predictions)
forest_roc_auc = auc(forest_false_positive_rate, forest_true_positive_rate)
forest_roc_auc
plt.title('Random Forest Receiver Operating Characteristic Curve')
plt.plot(forest_false_positive_rate, forest_true_positive_rate, 'blue',
label = 'AUC = %0.2f' % forest_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/Forest ROC AUC.png')
plt.savefig('Charts/Forest ROC AUC.svg')
plt.show()
# Save the model to disk
pickle.dump(forest, open('Models/Random Forest Ensemble.sav', 'wb'))
reload_forest = False
if reload_forest:
forest = pickle.load(open('Models/Random Forest Ensemble.sav', 'rb'))
In an attempt to train the ensemble to prioritize correctly classifying the actual failure cases, this version of the random forest weights failures as twice as important as non-failures.
weighted_forest = RandomForestClassifier(max_depth = 20, verbose = 1, \
random_state = 13, n_jobs = -1, class_weight = {0: 1, 1: 2})
weighted_forest.fit(x_train, y_train.values.ravel())
weighted_forest_accuracy = weighted_forest.score(x_test, y_test)
weighted_forest_accuracy
weighted_forest_predictions = weighted_forest.predict(x_test)
actual = y_test
Compared to the unweighted Random Forest ensemble, this weighted ensemble correctly classifies another 6 actual failures, raising the failure detection rate from 36% to 40%, but it also gains 40,687 false positive classifications, for 0.028% instead of 0.0097% false positives.
weighted_forest_confusion = confusion_matrix(actual, \
weighted_forest_predictions)
weighted_forest_confusion
weighted_forest_precision = precision_score(actual, \
weighted_forest_predictions)
weighted_forest_precision
print(classification_report(actual, weighted_forest_predictions))
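The rates discussed above come straight from the confusion matrix; as a sketch with made-up counts (not the notebook's actual results):

```python
import numpy as np

# Invented counts, in sklearn's confusion_matrix layout:
# rows = actual class, columns = predicted class
cm = np.array([[140_000, 4_000],   # actual non-failure: [TN, FP]
               [     60,     40]]) # actual failure:     [FN, TP]

tn, fp, fn, tp = cm.ravel()
failure_detection_rate = tp / (tp + fn)   # recall on the failure class
false_positive_rate = fp / (fp + tn)
print(failure_detection_rate, false_positive_rate)  # 0.4 and ~0.0278
```

Class weighting trades these two rates against each other: detecting a few more failures typically costs many more false alarms when failures are this rare.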
weighted_forest_probabilities = weighted_forest.predict_proba(x_test)
weighted_predictions = weighted_forest_probabilities[:,1]
weighted_forest_false_positive_rate, weighted_forest_true_positive_rate, \
threshold = roc_curve(y_test, weighted_predictions)
weighted_forest_roc_auc = auc(weighted_forest_false_positive_rate, \
weighted_forest_true_positive_rate)
weighted_forest_roc_auc
plt.title('Weighted Random Forest Receiver Operating Characteristic Curve')
plt.plot(weighted_forest_false_positive_rate, \
weighted_forest_true_positive_rate, \
'blue', label = 'AUC = %0.2f' % weighted_forest_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/Weighted Forest ROC AUC.png')
plt.savefig('Charts/Weighted Forest ROC AUC.svg')
plt.show()
# Save the model to disk
pickle.dump(weighted_forest, open('Models/Random Forest Ensemble Weighted.sav', 'wb'))
reload_weighted_forest = False
if reload_weighted_forest:
    weighted_forest = pickle.load(open('Models/Random Forest Ensemble Weighted.sav', 'rb'))
import torch
from torch import nn, optim
import torch.utils.data as data_utils
torch.manual_seed(13)
PyTorch requires the boolean values to be converted to floating point, so these dtypes will be changed before the neural network is defined.
x_train.dtypes
for col in x_train:
    if x_train[col].dtype == "bool":
        x_train[col] = x_train[col].astype(float)
        x_test[col] = x_test[col].astype(float)
x_train.dtypes
x_train.isna().sum().sum()
x_test.dtypes
x_test.isna().sum().sum()
y_train = y_train.astype(float)
y_test = y_test.astype(float)
train_label = torch.tensor(y_train.values)
trainset = torch.tensor(x_train.values)
train_tensor = data_utils.TensorDataset(trainset, train_label)
trainloader = data_utils.DataLoader(dataset = train_tensor, \
batch_size = 512, shuffle = True)
test_label = torch.tensor(y_test.values)
testset = torch.tensor(x_test.values)
test_tensor = data_utils.TensorDataset(testset, test_label)
testloader = data_utils.DataLoader(dataset = test_tensor, \
batch_size = 512, shuffle = True)
torch.backends.cudnn.enabled
torch.cuda.is_available()
print(torch.version.cuda)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device
class nn_Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(49, 24)
        self.act1 = nn.LeakyReLU()
        self.fc2 = nn.Linear(24, 12)
        self.act2 = nn.LeakyReLU()
        self.fc3 = nn.Linear(12, 1)

    def forward(self, x):
        # Make sure the input tensor is flattened
        x = x.view(-1, 49)
        x = self.act1(self.fc1(x))
        x = self.act2(self.fc2(x))
        x = torch.sigmoid(self.fc3(x))
        return x
neural_network = nn_Classifier()
criterion = nn.BCELoss()
optimizer = optim.Adam(neural_network.parameters(), lr = 1e-7, \
weight_decay = 1e-5)
def init_weights(m):
    if type(m) == nn.Linear:
        torch.nn.init.xavier_uniform_(m.weight)
        m.bias.data.fill_(0.01)
neural_network.apply(init_weights)
n_train = len(x_train)
epochs = 10
neural_network.to(device);
train_losses = []
test_losses = []
current = 0
test_loss_min = np.Inf
for e in range(epochs):
    neural_network.train()
    running_loss = 0
    for row, target in trainloader:
        row = row.to(device)
        target = target.to(device)
        optimizer.zero_grad()
        output = neural_network(row.float())
        loss = criterion(output, target.float())
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        # Reporting
        print(str(current) + " / " + str(n_train) + "({:.3f}%)".format((current / n_train) * 100), end = "\r", flush = True)
        current += len(row)
    else:
        neural_network.eval()
        test_loss = 0
        accuracy = 0
        # Turn off gradients for validation, saves memory and computations
        with torch.no_grad():
            for row, target in testloader:
                row = row.to(device)
                target = target.to(device)
                output = neural_network(row.float())
                test_loss += criterion(output, target.float())
        current = 0
        # Calculate average losses
        train_losses.append(running_loss / len(trainloader))
        valid_loss = test_loss / len(testloader)
        test_losses.append(valid_loss)
        # Print validation statistics
        print("Epoch: {}/{}.. ".format(e + 1, epochs),
              "Training Loss: {:.2f}.. ".format(running_loss / len(trainloader)),
              "Test Loss: {:.2f}.. ".format(valid_loss))
        # Save the model if test loss has decreased
        if valid_loss <= test_loss_min:
            print('Test loss decreased ({:.4f} --> {:.4f}). Saving model ...'.format(
                test_loss_min, valid_loss))
            torch.save(neural_network.state_dict(), 'Models/Neural Network 1.pt')
            test_loss_min = valid_loss
While it may eventually improve with enough training, this network architecture is most likely too simple for the problem at hand. A more complex one will be built next.
for param in neural_network.parameters():
    print(param.data)
plt.figure(figsize = (12, 5))
train_ax, = plt.plot(np.arange(epochs), train_losses, 'r--', \
label = "Training Loss")
test_ax, = plt.plot(np.arange(epochs), test_losses, 'b--', \
label = "Test Loss")
plt.title("Neural Network 1 Losses")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.xticks(range(0, epochs))
plt.grid(b = True, which = 'major', color = 'w', linewidth = 1.0)
plt.grid(b = True, which = 'minor', color = 'w', linewidth = 0.5)
plt.legend(handles = [train_ax, test_ax])
plt.savefig("Charts/NN1 Loss Plot.svg")
plt.savefig("Charts/NN1 Loss Plot.png")
neural_network.eval()
output = []
pred_targets = []
with torch.no_grad():
    for rows, targets in testloader:
        rows = rows.to(device)
        output += neural_network(rows.float())
        pred_targets += targets
output[:30]
nn1_predictions = []
actual = []
for i, x in enumerate(output):
    if output[i].item() <= 0.5:
        nn1_predictions.append(0)
    else:
        nn1_predictions.append(1)
    if pred_targets[i].item() == 0.0:
        actual.append(0)
    elif pred_targets[i].item() == 1.0:
        actual.append(1)
nn1_confusion = confusion_matrix(actual, nn1_predictions)
nn1_confusion
nn1_precision = precision_score(actual, nn1_predictions)
nn1_precision
print(classification_report(actual, nn1_predictions))
nn1_false_positive_rate, nn1_true_positive_rate, threshold =\
roc_curve(actual, nn1_predictions)
nn1_roc_auc = auc(nn1_false_positive_rate, nn1_true_positive_rate)
nn1_roc_auc
plt.title('Neural Network 1 Receiver Operating Characteristic Curve')
plt.plot(nn1_false_positive_rate, nn1_true_positive_rate, 'blue',
label = 'AUC = %0.2f' % nn1_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/NN1 ROC AUC.png')
plt.savefig('Charts/NN1 ROC AUC.svg')
plt.show()
class nn_Classifier2(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(49, 98)
        self.act1 = nn.LeakyReLU()
        self.fc2 = nn.Linear(98, 72)
        self.act2 = nn.LeakyReLU()
        self.fc3 = nn.Linear(72, 36)
        self.act3 = nn.LeakyReLU()
        self.fc4 = nn.Linear(36, 9)
        self.act4 = nn.LeakyReLU()
        self.fc5 = nn.Linear(9, 1)

    def forward(self, x):
        # Make sure the input tensor is flattened
        x = x.view(-1, 49)
        x = self.act1(self.fc1(x))
        x = self.act2(self.fc2(x))
        x = self.act3(self.fc3(x))
        x = self.act4(self.fc4(x))
        x = torch.sigmoid(self.fc5(x))
        return x
neural_network2 = nn_Classifier2()
criterion = nn.BCELoss()
optimizer = optim.Adam(neural_network2.parameters(), lr = 1e-7, weight_decay = 1e-5)
def init_weights2(m):
    if type(m) == nn.Linear:
        torch.nn.init.xavier_uniform_(m.weight)
        m.bias.data.fill_(0.01)
neural_network2.apply(init_weights2)
n_train = len(x_train)
epochs = 50
neural_network2.to(device);
train_losses = []
test_losses = []
current = 0
test_loss_min = np.Inf
for e in range(epochs):
    neural_network2.train()
    running_loss = 0
    for row, target in trainloader:
        row = row.to(device)
        target = target.to(device)
        optimizer.zero_grad()
        output = neural_network2(row.float())
        loss = criterion(output, target.float())
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        # Reporting
        print(str(current) + " / " + str(n_train) + "({:.3f}%)".format((current / n_train) * 100), end = "\r", flush = True)
        current += len(row)
    else:
        neural_network2.eval()
        test_loss = 0
        # Turn off gradients for validation, saves memory and computations
        with torch.no_grad():
            for row, target in testloader:
                row = row.to(device)
                target = target.to(device)
                output = neural_network2(row.float())
                test_loss += criterion(output, target.float())
        current = 0
        # Calculate average losses
        train_losses.append(running_loss / len(trainloader))
        valid_loss = test_loss / len(testloader)
        test_losses.append(valid_loss)
        # Print validation statistics
        print("Epoch: {}/{}.. ".format(e + 1, epochs),
              "Training Loss: {:.2f}.. ".format(running_loss / len(trainloader)),
              "Test Loss: {:.2f}.. ".format(valid_loss))
        # Save the model if test loss has decreased
        if valid_loss <= test_loss_min:
            print('Test loss decreased ({:.4f} --> {:.4f}). Saving model ...'.format(
                test_loss_min, valid_loss))
            torch.save(neural_network2.state_dict(), 'Models/Neural Network 2.pt')
            test_loss_min = valid_loss
It may seem odd that the test loss is consistently lower than the training loss. Here, the sheer size of the training set is the likely cause: the reported training loss is a running average over the entire epoch, during which the model is still improving with every batch, while the test loss is computed only after all of the epoch's batches have already updated the model.
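A toy numeric illustration of this effect, with made-up per-batch losses: because the model improves within the epoch, the epoch-averaged training loss ends up above the loss level of the finished model that the test set sees.

```python
# Hypothetical per-batch losses over one epoch of a steadily improving model
batch_losses = [1.0, 0.8, 0.6, 0.4, 0.2]

# Training loss as reported: a running average over the whole epoch
train_loss = sum(batch_losses) / len(batch_losses)

# The test loss is measured only after the final batch has updated the
# model, so it sits near the end-of-epoch loss level
test_loss = batch_losses[-1]

print(train_loss, test_loss)  # the averaged training loss is higher
```

The effect fades once the model stops improving much within a single epoch, at which point the two curves typically converge.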
plt.figure(figsize = (12, 5))
train_ax, = plt.plot(np.arange(epochs), train_losses, 'r--', label = "Training Loss")
test_ax, = plt.plot(np.arange(epochs), test_losses, 'b--', label = "Test Loss")
plt.title("Neural Network 2 Training and Test Losses")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.xticks(range(0, epochs))
plt.grid(b = True, which = 'major', color = 'w', linewidth = 1.0)
plt.grid(b = True, which = 'minor', color = 'w', linewidth = 0.5)
plt.legend(handles = [train_ax, test_ax])
plt.savefig("Charts/NN2 Loss Plot.svg")
plt.savefig("Charts/NN2 Loss Plot.png")
train_losses
with open('Models/NN2_train_losses.txt', 'w') as loss:
    for epoch in train_losses:
        loss.write(str(epoch) + "\n")
for epoch in test_losses:
    print(epoch.item())
with open('Models/NN2_test_losses.txt', 'w') as loss:
    for epoch in test_losses:
        loss.write(str(epoch.item()) + "\n")
neural_network2.eval()
output = []
pred_targets = []
with torch.no_grad():
    for rows, targets in testloader:
        rows = rows.to(device)
        output += neural_network2(rows.float())
        pred_targets += targets
output[:30]
nn2_predictions = []
actual = []
for i, x in enumerate(output):
    if output[i].item() <= 0.5:
        nn2_predictions.append(0)
    else:
        nn2_predictions.append(1)
    if pred_targets[i].item() == 0.0:
        actual.append(0)
    elif pred_targets[i].item() == 1.0:
        actual.append(1)
nn2_confusion = confusion_matrix(actual, nn2_predictions)
nn2_confusion
nn2_precision = precision_score(actual, nn2_predictions)
nn2_precision
print(classification_report(actual, nn2_predictions))
nn2_false_positive_rate, nn2_true_positive_rate, threshold =\
roc_curve(actual, nn2_predictions)
nn2_roc_auc = auc(nn2_false_positive_rate, nn2_true_positive_rate)
nn2_roc_auc
plt.title('Neural Network 2 Receiver Operating Characteristic Curve at 50 Epochs')
plt.plot(nn2_false_positive_rate, nn2_true_positive_rate, 'blue',
label = 'AUC = %0.2f' % nn2_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/NN2 ROC AUC at 50 Epochs.png')
plt.savefig('Charts/NN2 ROC AUC at 50 Epochs.svg')
plt.show()
While 50 epochs were originally planned, the test loss was still consistently decreasing at the 50th epoch. Additionally, compared to the other models, this neural network produces a very high number of true positive (failure) predictions and a moderately low number of false positives. Additional training may result in a model that outperforms even the LBFGS-solved logistic regression model for this task.
for e in range(20):
    neural_network2.train()
    running_loss = 0
    for row, target in trainloader:
        row = row.to(device)
        target = target.to(device)
        optimizer.zero_grad()
        output = neural_network2(row.float())
        loss = criterion(output, target.float())
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        # Reporting
        print(str(current) + " / " + str(n_train) + "({:.3f}%)".format((current / n_train) * 100), end = "\r", flush = True)
        current += len(row)
    else:
        neural_network2.eval()
        test_loss = 0
        # Turn off gradients for validation, saves memory and computations
        with torch.no_grad():
            for row, target in testloader:
                row = row.to(device)
                target = target.to(device)
                output = neural_network2(row.float())
                test_loss += criterion(output, target.float())
        current = 0
        # Calculate average losses
        train_losses.append(running_loss / len(trainloader))
        valid_loss = test_loss / len(testloader)
        test_losses.append(valid_loss)
        # Print validation statistics
        print("Epoch: {}/{}.. ".format(e + 51, epochs + 20),
              "Training Loss: {:.2f}.. ".format(running_loss / len(trainloader)),
              "Test Loss: {:.2f}.. ".format(valid_loss))
        # Save the model if test loss has decreased
        if valid_loss <= test_loss_min:
            print('Test loss decreased ({:.4f} --> {:.4f}). Saving model ...'.format(
                test_loss_min, valid_loss))
            torch.save(neural_network2.state_dict(), 'Models/Neural Network 2.pt')
            test_loss_min = valid_loss
plt.figure(figsize = (18, 5))
train_ax, = plt.plot(np.arange(epochs + 20), train_losses, 'r--', \
label = "Training Loss")
test_ax, = plt.plot(np.arange(epochs + 20), test_losses, 'b--',\
label = "Test Loss")
plt.title("Neural Network 2 Training and Test Losses")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.xticks(range(0, epochs + 20))
plt.grid(b = True, which = 'major', color = 'w', linewidth = 1.0)
plt.grid(b = True, which = 'minor', color = 'w', linewidth = 0.5)
plt.legend(handles = [train_ax, test_ax])
plt.savefig("Charts/NN2 Loss Plot 2.svg")
plt.savefig("Charts/NN2 Loss Plot 2.png")
train_losses
with open('Models/NN2_train_losses.txt', 'w') as loss:
    for epoch in train_losses:
        loss.write(str(epoch) + "\n")
for epoch in test_losses:
    print(epoch.item())
with open('Models/NN2_test_losses.txt', 'w') as loss:
    for epoch in test_losses:
        loss.write(str(epoch.item()) + "\n")
neural_network2.eval()
output = []
pred_targets = []
with torch.no_grad():
    for rows, targets in testloader:
        rows = rows.to(device)
        output += neural_network2(rows.float())
        pred_targets += targets
nn2_predictions = []
actual = []
for i, x in enumerate(output):
    if output[i].item() <= 0.5:
        nn2_predictions.append(0)
    else:
        nn2_predictions.append(1)
    if pred_targets[i].item() == 0.0:
        actual.append(0)
    elif pred_targets[i].item() == 1.0:
        actual.append(1)
nn2_confusion = confusion_matrix(actual, nn2_predictions)
nn2_confusion
nn2_precision = precision_score(actual, nn2_predictions)
nn2_precision
print(classification_report(actual, nn2_predictions))
nn2_false_positive_rate, nn2_true_positive_rate, threshold =\
roc_curve(actual, nn2_predictions)
nn2_roc_auc = auc(nn2_false_positive_rate, nn2_true_positive_rate)
nn2_roc_auc
plt.title('Neural Network 2 Receiver Operating Characteristic Curve at 70 Epochs')
plt.plot(nn2_false_positive_rate, nn2_true_positive_rate, 'blue',
label = 'AUC = %0.2f' % nn2_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/NN2 ROC AUC at 70 Epochs.png')
plt.savefig('Charts/NN2 ROC AUC at 70 Epochs.svg')
plt.show()
regression = pickle.load(open('Models/Logistic Regression lbfgs.sav', 'rb'))
regression_accuracy = regression.score(x_valid, y_valid)
regression_accuracy
regression_predictions = regression.predict(x_valid)
actual = y_valid
regression_confusion = confusion_matrix(actual, regression_predictions)
regression_confusion
regression_precision = precision_score(actual, regression_predictions)
regression_precision
print(classification_report(actual, regression_predictions))
regression_probabilities = regression.predict_proba(x_valid)
predictions = regression_probabilities[:,1]
regression_false_positive_rate, regression_true_positive_rate, threshold =\
roc_curve(y_valid, predictions)
regression_roc_auc = auc(regression_false_positive_rate, \
regression_true_positive_rate)
regression_roc_auc
plt.title('LBFGS Logistic Regression Validation ROC Curve')
plt.plot(regression_false_positive_rate, regression_true_positive_rate, \
'blue', label = 'AUC = %0.2f' % regression_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/LBFGS Logistic Validation ROC AUC.svg')
plt.savefig('Charts/LBFGS Logistic Validation ROC AUC.png')
plt.show()
Table 1
HDD Failure Predictive Model Testing Results
Model | Sensitivity | Specificity | Precision | False Positive Rate | ROC AUC |
---|---|---|---|---|---|
Logistic Regression | 0.6397 | 0.9732 | 1.1478e-3 | 2.68% | 0.8729 |
Decision Tree | 0.4412 | 0.9690 | 0.8829e-3 | 3.10% | 0.6900 |
Random Forest | 0.3603 | 0.9903 | 2.2900e-3 | 0.98% | 0.7974 |
Class-Weighted Random Forest | 0.4044 | 0.9717 | 0.8858e-3 | 2.83% | 0.7998 |
Simple DNN | 0.6176 | 0.9185 | 0.4696e-3 | 8.15% | 0.7681 |
Complex DNN | 0.7132 | 0.9364 | 0.6946e-3 | 6.36% | 0.8248 |
A few limitations of this project exist. First, a very large share of the dataset was made up of missing values. A second limitation that deserves caution is that the ratios of drives made by each manufacturer in the dataset are very imbalanced; no assumptions about the value or reliability of the four manufacturers included in the dataset should be made from this data. A third limitation is that the dataset was extremely imbalanced between the minority (failure) and majority (non-failure) classes. Though SMOTE succeeded exceptionally well at allowing predictive models to learn from the imbalanced data, it does introduce bias, as the synthetically created instances of the minority class overrepresent its information in the analysis. Finally, working computer memory was a great limitation throughout the project because the dataset is so large. This limitation prevented factor analysis of mixed data from being performed, and PCA had to be selected as the alternative.
It is highly recommended that either the logistic regression model or the more complex DNN model be added to the daily HDD diagnostics checks and backup procedure pipeline. The complex DNN will successfully flag 71.3% of drives expected to fail that day, and the logistic regression 64%, allowing for total backup and retirement of a drive before the failure occurs. Note that while more sensitive to detecting failure, the DNN also has a higher false positive rate, at 6.36%, than the more conservative logistic regression at 2.68%. Until this can be implemented, special care should be given to drives with higher values of SMART attributes 5, 197, and 9 to reduce data loss and complications arising from HDD failure.
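In the interim, the attribute-watch suggestion could be scripted as a simple threshold flag over the daily diagnostics. A sketch below uses the dataset's `smart_<id>_raw` column naming, but the cutoff values are purely illustrative assumptions that would need tuning against fleet history:

```python
import pandas as pd

# Hypothetical daily diagnostics snapshot; cutoffs below are illustrative
snapshot = pd.DataFrame({
    "serial_number": ["A1", "B2", "C3"],
    "smart_5_raw":   [0, 120, 8],           # reallocated sectors count
    "smart_197_raw": [0, 4, 0],             # current pending sector count
    "smart_9_raw":   [1000, 42000, 31000],  # power-on hours
})

# Flag drives exceeding any cutoff for priority backup and replacement
at_risk = snapshot[
    (snapshot["smart_5_raw"] > 100)
    | (snapshot["smart_197_raw"] > 0)
    | (snapshot["smart_9_raw"] > 40000)
]
print(at_risk["serial_number"].tolist())
```

Drive B2 trips two of the illustrative cutoffs here, so it alone would be queued for backup under this rule.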
Once implemented, an ensemble approach between the two models should be tested to further reduce the false positive rate. Furthermore, additional research is warranted beyond the scope and limitations of this project. Taking a recurrent neural network (RNN) approach to the data tidying and predictive modeling would likely improve the results significantly, as RNNs are specifically designed for time-series data such as this.
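The proposed ensemble could start as simply as requiring agreement between the two models before raising an alert, which can only lower the false positive count relative to either model alone (at some cost in sensitivity). A sketch with hypothetical per-drive prediction arrays standing in for the two models' outputs:

```python
import numpy as np

# Hypothetical binary failure predictions from the two recommended models
logistic_preds = np.array([1, 0, 1, 1, 0])
dnn_preds      = np.array([1, 1, 0, 1, 0])

# Conservative ensemble: flag a drive only when both models predict failure
ensemble_preds = logistic_preds & dnn_preds
print(ensemble_preds.tolist())
```

Here only the drives both models agree on remain flagged; a disjunctive (OR) rule would instead maximize sensitivity at the cost of more false positives.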
Acronis. Knowledge Base 9105. S.M.A.R.T. Attribute: Reallocated Sectors Count | Knowledge Base. https://kb.acronis.com/content/9105.
Acronis. Knowledge Base 9109. S.M.A.R.T. Attribute: Power-On Hours (POH) | Knowledge Base. https://kb.acronis.com/content/9109.
Acronis. Knowledge Base 9128. S.M.A.R.T. Attribute: Load Cycle Count; Load/Unload Cycle Count | Knowledge Base. https://kb.acronis.com/content/9128.
Acronis. Knowledge Base 9133. S.M.A.R.T. Attribute: Current Pending Sector Count | Knowledge Base. https://kb.acronis.com/content/9133.
Acronis. Knowledge Base 9152. S.M.A.R.T. Attribute: Load/Unload Cycle Count | Knowledge Base. https://kb.acronis.com/content/9152.
Backblaze. (2020). data_Q4_2019. San Mateo, CA; Backblaze.
Klein, A. (2015, April 16). SMART Hard Drive Attributes: SMART 22 is a Gas Gas Gas. Backblaze Blog | Cloud Storage & Cloud Backup. https://www.backblaze.com/blog/smart-22-is-a-gas-gas-gas/.
Painchaud, A. (2018, October 31). 8 Reasons on How Data Loss Can Negatively Impact Your Business. https://www.sherweb.com/blog/security/statistics-on-data-loss/.
Sanders, J. (2018, November 13). Western Digital spins down HGST and Tegile brands in hard disk market shuffle. TechRepublic. https://www.techrepublic.com/article/western-digital-spins-down-hgst-and-tegile-brands-in-hard-disk-market-shuffle/.
Weiss, G. M. (2013). Foundations of Imbalanced Learning. Imbalanced Learning, 13–41. https://doi.org/10.1002/9781118646106.ch2