Statistical Analysis of HDD Failure

Matthew Unrue, Spring 2020

Western Govenors University MSDA Capstone


Website Version Note:

This notebook is so large and works with so much that data that it was run in multiple settings with the kernel reset for memory management each time. As such, the code cell blocks have execution numbers that are not perfectly in order. Though these do not match up perfectly in this version, the code was and should only be excecuted from top to bottom.

Additional Resources:

The 5-page project Executive Summary can be found here.
The reveal.js based multimedia presentation notes can be found here.
The 87-page report write-up of this project can be found here.


Introduction

What factors indicate impending hard disk drive failure?

H0: Study factors do not significantly indicate impending hard disk failure.
H1: Study factors do significantly indicate impending hard disk failure.

Context

Data helps businesses solve problems, make better decisions, and understand consumers, but a lot of data needs to be stored and available to enable these benefits. Hard drive failure is the most common form of data loss, which is one of the most impactful problems that businesses can experience today as simple drive recovery can cost up to $7,500 per drive (Painchaud, 2018). For cloud-based data centers, keeping multitudes of businesses’ data intact for their own operations is crucial. Being able to predict which hard drives are at the highest risk of failure based on understanding of the combinations of routine diagnostics test results is an ideal solution to backup and replace failing drives before the data is lost.

Data

The dataset used is Backblaze’s 4th quarter data from 2019 (Backblaze, 2020). All of the needed data is contained within the .zip file that Backblaze provides to the public as .csv files split by day.

The dataset contains .csv files for each day of its corresponding quarter, from 2019-10-01 to 2019-12-31. As an example, the subsection of the dataset for 2019-10-01 contains 115,259 rows of data. However, as this data contains recorded readings from a live data center, the number of hard drives and thus rows, changed daily as failed drives were taken out and new drives were installed. The 129 column attributes are Date, Serial Number, Model Number, Capacity, Failure, 62 Self-Monitoring, Analysis and Reporting Technology (SMART) test results, and 62 normalized values of the SMART test values. The Failure attribute is the dependent variable of this study and is a qualitative binary categorical variable. The Date, Serial Number, and Model are nominal qualitative independent variables. Finally, and the SMART value columns are continuous quantitative independent variables.

As stated in Backblaze’s Hard Drive Data and Stats page (Backblaze, n.d.), this dataset is free for any use as long as Backblaze is cited as the data source, that users accept that they are solely responsible for how the data is used, and that the data cannot be sold to anybody as it publicly available.

Data Analytics Tools and Techniques

Python, pandas, and the scikit-learn stack are extensively used for the loading, tidying, manipulation, and analysis of the datasets. PyTorch is used for all neural network related tasks of the analysis and model production. Matplotlib and seaborn are used to create charts and graphics for analysis and presentation of project findings. A needed algorithm, namely Fisher's Exact test for contingency tables greater than 2x2 dimensions, is unavailable in the scikit-learn ecosystem, and R.stats is used for this by using rpy2 to run the R code by embedding it in the Python process. Prince is used for factor analysis, and imbalanced-learn is used for the implementation of SMOTE.

Like R and unlike SAS, all of these packages are easily available, free, and open-source with Python. These methods have been chosen over R for ease of explanation, as Python code is often understood more readily than R, and because of the potential of integrating this project directly into a program or software for future use. While R is highly specialized for statistics and mathematics, Python is a general-purpose programming language with specialized libraries for the needed tools, and this facilitates project expansion in the future.

Synthetic Minority Over-Sampling Technique (SMOTE) is used specifically to handle the imbalanced classes for training and testing splits. PCA is used for dimensionality reduction. Predictor variables are examined through correlation coefficients and Fisher's exact test, as well as graphed univariate and bivariate distributions. A logistic regression model and a decision tree model are examined along with the results of the PCA to find predictor variables as well. For building a predictive model for future use, the logistic regression model, a random forest ensemble model, and neural networks are compared to determine which can produce the most useful model.

As HDD failure is an extremely rare event, the dependent variable class is extremely imbalanced and failing to control for the imbalance through techniques like boosting or oversampling would lead to ineffective models. As the dependent variable is a Boolean value, this task is a binary classification task. Logistic regression is an ideal predictive model for binary classification tasks that gives a probability for classification while also having a simplistic interpretation of coefficients that can be used for feature selection. Decision trees are also simple to understand and work well for classification tasks. Given the complexity of the various fields in the dataset, a more complicated model may work better for predictive power. Random forests and neural networks work very well for classification tasks under these circumstances.

Project Outcomes

The key project outcomes are a deep understanding of the risk of hard drive failure based on the results of SMART test values regardless of manufacturer and predictive models that will be able to flag hard drives that are at high risk of failing. The understanding of the risk of failure based on test values will empower better business decisions by optimizing the choice of storage used based on projected lifetime. The predictive models will allow the business to proactively backup data from storage onto new storage devices before failure while also allowing hard drives to continue working closer to their end of life, minimizing waste from constantly replacing hard drives before it is needed. The combination of these two products will also enable the future creation of a more automated system that protects data from hard drive failure.

Dataset Preparation

The dataset provided by Backblaze is made up of 92 .csv files, 1 for each day in the 2019 4th quarter, totaling 3.13GB of text data. As hard drive failure is an extremely rare event, all of these days will need to be considered together in order to have enough failures to draw conclusions. The project begins by combining all parts of the dataset from their .csv files into a single file.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import glob
import os
import csv
import scipy.stats as scs
import gc
import pickle
In [2]:
# Jupyter magic commands for displaying plot objects in the notebook and
# setting float display limits.
%matplotlib inline
%precision %.10g
sns.set_style("dark")
In [3]:
if not os.path.isfile('q4_combined.csv'):
    # Create a generator of dataset files in the current working directory.
    files = glob.glob(os.path.join(os.getcwd(), "2019-*.csv"))

    # Combine the fields into a single file, writing the column index from
    # only the first .csv file.
    index = False
    with open('q4_combined.csv', 'w') as combined:
        for file in files:
            with open(file, 'r') as part:

                if not index:
                    for row in part:
                        combined.write(row)
                    index = True

                else:
                    next(part)
                    for row in part:
                        combined.write(row)
                
In [4]:
with open('q4_combined.csv') as file:
    for (count, _) in enumerate(file, 0):
        pass
    
row_count = count
print("Rows: " + str(row_count))
Rows: 10991209
In [5]:
df = pd.read_csv('q4_combined.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10991209 entries, 0 to 10991208
Columns: 131 entries, date to smart_255_raw
dtypes: float64(126), int64(2), object(3)
memory usage: 10.7+ GB

Out of 10,991,209 hard drive days, there were only 678 failures, which gives a failure rate of 0.006169%.

In [6]:
df['failure'].value_counts()
Out[6]:
0    10990531
1         678
Name: failure, dtype: int64
In [7]:
nonfailed, failed = df['failure'].value_counts()
failure_rate = failed / nonfailed
print("Failure Rate: " + str("{:.6f}".format(failure_rate * 100)) + "%")
Failure Rate: 0.006169%

Weiss (2013) defined the imbalance ratio as the ratio between majority and minority classes with a modestly imbalanced dataset having an imbalance ratio of 10:1, and extremely imbalanced datasets as having an imbalance ratio of 1000:1 or greater (pg. 15). This dataset has an imbalance ratio of approximately 16,210:1 and as such will require very careful cultivation in order for any predictive model to successfully learn from. The rarity of the positive failure cases is also the reason that the entire 4th quarter dataset is required.

Unfortunately, this combined file requires too much memory to load all at once for current hardware restraints. It needs 13.5GB for just the data, not including the memory needed for the OS and other software, nor memory for calculations.

In [8]:
# Return the summed memory usage of each column in bytes.
memory_usage = sum(df.memory_usage(deep=True))
memory_usage
Out[8]:
13499129713
In [9]:
print(str(memory_usage / 1000) + "KB")
print(str("{:.2f}".format(memory_usage / 1000000)) + "MB")
print(str("{:.2f}".format(memory_usage / 1000000000)) + "GB")
13499129.713KB
13499.13MB
13.50GB

As this dataset contains both raw and normalized values for all of the SMART values, a simple way to deal with the memory issues is to divide the dataset into a raw form and a normalized form.

In [10]:
list(df.columns.values)
Out[10]:
['date',
 'serial_number',
 'model',
 'capacity_bytes',
 'failure',
 'smart_1_normalized',
 'smart_1_raw',
 'smart_2_normalized',
 'smart_2_raw',
 'smart_3_normalized',
 'smart_3_raw',
 'smart_4_normalized',
 'smart_4_raw',
 'smart_5_normalized',
 'smart_5_raw',
 'smart_7_normalized',
 'smart_7_raw',
 'smart_8_normalized',
 'smart_8_raw',
 'smart_9_normalized',
 'smart_9_raw',
 'smart_10_normalized',
 'smart_10_raw',
 'smart_11_normalized',
 'smart_11_raw',
 'smart_12_normalized',
 'smart_12_raw',
 'smart_13_normalized',
 'smart_13_raw',
 'smart_15_normalized',
 'smart_15_raw',
 'smart_16_normalized',
 'smart_16_raw',
 'smart_17_normalized',
 'smart_17_raw',
 'smart_18_normalized',
 'smart_18_raw',
 'smart_22_normalized',
 'smart_22_raw',
 'smart_23_normalized',
 'smart_23_raw',
 'smart_24_normalized',
 'smart_24_raw',
 'smart_168_normalized',
 'smart_168_raw',
 'smart_170_normalized',
 'smart_170_raw',
 'smart_173_normalized',
 'smart_173_raw',
 'smart_174_normalized',
 'smart_174_raw',
 'smart_177_normalized',
 'smart_177_raw',
 'smart_179_normalized',
 'smart_179_raw',
 'smart_181_normalized',
 'smart_181_raw',
 'smart_182_normalized',
 'smart_182_raw',
 'smart_183_normalized',
 'smart_183_raw',
 'smart_184_normalized',
 'smart_184_raw',
 'smart_187_normalized',
 'smart_187_raw',
 'smart_188_normalized',
 'smart_188_raw',
 'smart_189_normalized',
 'smart_189_raw',
 'smart_190_normalized',
 'smart_190_raw',
 'smart_191_normalized',
 'smart_191_raw',
 'smart_192_normalized',
 'smart_192_raw',
 'smart_193_normalized',
 'smart_193_raw',
 'smart_194_normalized',
 'smart_194_raw',
 'smart_195_normalized',
 'smart_195_raw',
 'smart_196_normalized',
 'smart_196_raw',
 'smart_197_normalized',
 'smart_197_raw',
 'smart_198_normalized',
 'smart_198_raw',
 'smart_199_normalized',
 'smart_199_raw',
 'smart_200_normalized',
 'smart_200_raw',
 'smart_201_normalized',
 'smart_201_raw',
 'smart_218_normalized',
 'smart_218_raw',
 'smart_220_normalized',
 'smart_220_raw',
 'smart_222_normalized',
 'smart_222_raw',
 'smart_223_normalized',
 'smart_223_raw',
 'smart_224_normalized',
 'smart_224_raw',
 'smart_225_normalized',
 'smart_225_raw',
 'smart_226_normalized',
 'smart_226_raw',
 'smart_231_normalized',
 'smart_231_raw',
 'smart_232_normalized',
 'smart_232_raw',
 'smart_233_normalized',
 'smart_233_raw',
 'smart_235_normalized',
 'smart_235_raw',
 'smart_240_normalized',
 'smart_240_raw',
 'smart_241_normalized',
 'smart_241_raw',
 'smart_242_normalized',
 'smart_242_raw',
 'smart_250_normalized',
 'smart_250_raw',
 'smart_251_normalized',
 'smart_251_raw',
 'smart_252_normalized',
 'smart_252_raw',
 'smart_254_normalized',
 'smart_254_raw',
 'smart_255_normalized',
 'smart_255_raw']
In [11]:
raw_cols = []
for col in df.columns.values:
    if "normalized" not in col:
        raw_cols.append(col)

print(raw_cols)
['date', 'serial_number', 'model', 'capacity_bytes', 'failure', 'smart_1_raw', 'smart_2_raw', 'smart_3_raw', 'smart_4_raw', 'smart_5_raw', 'smart_7_raw', 'smart_8_raw', 'smart_9_raw', 'smart_10_raw', 'smart_11_raw', 'smart_12_raw', 'smart_13_raw', 'smart_15_raw', 'smart_16_raw', 'smart_17_raw', 'smart_18_raw', 'smart_22_raw', 'smart_23_raw', 'smart_24_raw', 'smart_168_raw', 'smart_170_raw', 'smart_173_raw', 'smart_174_raw', 'smart_177_raw', 'smart_179_raw', 'smart_181_raw', 'smart_182_raw', 'smart_183_raw', 'smart_184_raw', 'smart_187_raw', 'smart_188_raw', 'smart_189_raw', 'smart_190_raw', 'smart_191_raw', 'smart_192_raw', 'smart_193_raw', 'smart_194_raw', 'smart_195_raw', 'smart_196_raw', 'smart_197_raw', 'smart_198_raw', 'smart_199_raw', 'smart_200_raw', 'smart_201_raw', 'smart_218_raw', 'smart_220_raw', 'smart_222_raw', 'smart_223_raw', 'smart_224_raw', 'smart_225_raw', 'smart_226_raw', 'smart_231_raw', 'smart_232_raw', 'smart_233_raw', 'smart_235_raw', 'smart_240_raw', 'smart_241_raw', 'smart_242_raw', 'smart_250_raw', 'smart_251_raw', 'smart_252_raw', 'smart_254_raw', 'smart_255_raw']
In [12]:
norm_cols = []
for col in df.columns.values:
    if "raw" not in col:
        norm_cols.append(col)

print(norm_cols)
['date', 'serial_number', 'model', 'capacity_bytes', 'failure', 'smart_1_normalized', 'smart_2_normalized', 'smart_3_normalized', 'smart_4_normalized', 'smart_5_normalized', 'smart_7_normalized', 'smart_8_normalized', 'smart_9_normalized', 'smart_10_normalized', 'smart_11_normalized', 'smart_12_normalized', 'smart_13_normalized', 'smart_15_normalized', 'smart_16_normalized', 'smart_17_normalized', 'smart_18_normalized', 'smart_22_normalized', 'smart_23_normalized', 'smart_24_normalized', 'smart_168_normalized', 'smart_170_normalized', 'smart_173_normalized', 'smart_174_normalized', 'smart_177_normalized', 'smart_179_normalized', 'smart_181_normalized', 'smart_182_normalized', 'smart_183_normalized', 'smart_184_normalized', 'smart_187_normalized', 'smart_188_normalized', 'smart_189_normalized', 'smart_190_normalized', 'smart_191_normalized', 'smart_192_normalized', 'smart_193_normalized', 'smart_194_normalized', 'smart_195_normalized', 'smart_196_normalized', 'smart_197_normalized', 'smart_198_normalized', 'smart_199_normalized', 'smart_200_normalized', 'smart_201_normalized', 'smart_218_normalized', 'smart_220_normalized', 'smart_222_normalized', 'smart_223_normalized', 'smart_224_normalized', 'smart_225_normalized', 'smart_226_normalized', 'smart_231_normalized', 'smart_232_normalized', 'smart_233_normalized', 'smart_235_normalized', 'smart_240_normalized', 'smart_241_normalized', 'smart_242_normalized', 'smart_250_normalized', 'smart_251_normalized', 'smart_252_normalized', 'smart_254_normalized', 'smart_255_normalized']
In [13]:
if not os.path.isfile('q4_raw.csv'):
    df.to_csv('q4_raw.csv', columns = raw_cols, index=False)
In [14]:
if not os.path.isfile('q4_normalized.csv'):
    df.to_csv('q4_normalized.csv', columns = norm_cols, index=False)
In [15]:
try:
    del [df, nonfailed, failed, failure_rate, memory_usage, raw_cols, norm_cols]
    print("Memory cleared successfully.")
except:
    pass
Memory cleared successfully.

The considerably smaller raw value subset of data is the main dataset of this project. As with nearly all real-world datasets, this one needs considerable cleaning and tidying in order to use for analysis.

In [3]:
df = pd.read_csv('q4_raw.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10991209 entries, 0 to 10991208
Data columns (total 68 columns):
 #   Column          Dtype  
---  ------          -----  
 0   date            object 
 1   serial_number   object 
 2   model           object 
 3   capacity_bytes  int64  
 4   failure         int64  
 5   smart_1_raw     float64
 6   smart_2_raw     float64
 7   smart_3_raw     float64
 8   smart_4_raw     float64
 9   smart_5_raw     float64
 10  smart_7_raw     float64
 11  smart_8_raw     float64
 12  smart_9_raw     float64
 13  smart_10_raw    float64
 14  smart_11_raw    float64
 15  smart_12_raw    float64
 16  smart_13_raw    float64
 17  smart_15_raw    float64
 18  smart_16_raw    float64
 19  smart_17_raw    float64
 20  smart_18_raw    float64
 21  smart_22_raw    float64
 22  smart_23_raw    float64
 23  smart_24_raw    float64
 24  smart_168_raw   float64
 25  smart_170_raw   float64
 26  smart_173_raw   float64
 27  smart_174_raw   float64
 28  smart_177_raw   float64
 29  smart_179_raw   float64
 30  smart_181_raw   float64
 31  smart_182_raw   float64
 32  smart_183_raw   float64
 33  smart_184_raw   float64
 34  smart_187_raw   float64
 35  smart_188_raw   float64
 36  smart_189_raw   float64
 37  smart_190_raw   float64
 38  smart_191_raw   float64
 39  smart_192_raw   float64
 40  smart_193_raw   float64
 41  smart_194_raw   float64
 42  smart_195_raw   float64
 43  smart_196_raw   float64
 44  smart_197_raw   float64
 45  smart_198_raw   float64
 46  smart_199_raw   float64
 47  smart_200_raw   float64
 48  smart_201_raw   float64
 49  smart_218_raw   float64
 50  smart_220_raw   float64
 51  smart_222_raw   float64
 52  smart_223_raw   float64
 53  smart_224_raw   float64
 54  smart_225_raw   float64
 55  smart_226_raw   float64
 56  smart_231_raw   float64
 57  smart_232_raw   float64
 58  smart_233_raw   float64
 59  smart_235_raw   float64
 60  smart_240_raw   float64
 61  smart_241_raw   float64
 62  smart_242_raw   float64
 63  smart_250_raw   float64
 64  smart_251_raw   float64
 65  smart_252_raw   float64
 66  smart_254_raw   float64
 67  smart_255_raw   float64
dtypes: float64(63), int64(2), object(3)
memory usage: 5.6+ GB
In [4]:
null_values = df.isna().sum().sum()
null_values
Out[4]:
452576024
In [5]:
len(df.columns)
Out[5]:
68
In [6]:
n_rows = len(df)
n_rows
Out[6]:
10991209
In [7]:
n_values = n_rows * len(df.columns)
n_values
Out[7]:
747402212
In [8]:
null_values / n_values
Out[8]:
0.6055320906649926
In [9]:
# Calculate the number of values in the total dataset
n_rows * 131
Out[9]:
1439848379
In [10]:
df.head(30)
Out[10]:
date serial_number model capacity_bytes failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw ... smart_233_raw smart_235_raw smart_240_raw smart_241_raw smart_242_raw smart_250_raw smart_251_raw smart_252_raw smart_254_raw smart_255_raw
0 2019-10-01 Z305B2QN ST4000DM000 4000787030016 0 97236416.0 NaN 0.0 13.0 0.0 ... NaN NaN 33009.0 5.063798e+10 1.623458e+11 NaN NaN NaN NaN NaN
1 2019-10-01 ZJV0XJQ4 ST12000NM0007 12000138625024 0 4665536.0 NaN 0.0 3.0 0.0 ... NaN NaN 9533.0 5.084775e+10 1.271356e+11 NaN NaN NaN NaN NaN
2 2019-10-01 ZJV0XJQ3 ST12000NM0007 12000138625024 0 92892872.0 NaN 0.0 1.0 0.0 ... NaN NaN 6977.0 4.920827e+10 4.658787e+10 NaN NaN NaN NaN NaN
3 2019-10-01 ZJV0XJQ0 ST12000NM0007 12000138625024 0 231702544.0 NaN 0.0 6.0 0.0 ... NaN NaN 10669.0 5.341374e+10 9.427903e+10 NaN NaN NaN NaN NaN
4 2019-10-01 PL1331LAHG1S4H HGST HMS5C4040ALE640 4000787030016 0 0.0 103.0 436.0 9.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 2019-10-01 ZA16NQJR ST8000NM0055 8001563222016 0 117053872.0 NaN 0.0 7.0 0.0 ... NaN NaN 21190.0 5.861349e+10 1.380783e+11 NaN NaN NaN NaN NaN
6 2019-10-01 ZJV02XWG ST12000NM0007 12000138625024 0 194975656.0 NaN 0.0 8.0 0.0 ... NaN NaN 12038.0 5.206555e+10 9.974091e+10 NaN NaN NaN NaN NaN
7 2019-10-01 ZJV1CSVX ST12000NM0007 12000138625024 0 121918904.0 NaN 0.0 19.0 0.0 ... NaN NaN 10444.0 5.417592e+10 1.400380e+11 NaN NaN NaN NaN NaN
8 2019-10-01 ZJV02XWA ST12000NM0007 12000138625024 0 22209920.0 NaN 0.0 7.0 0.0 ... NaN NaN 12130.0 6.002246e+10 1.372655e+11 NaN NaN NaN NaN NaN
9 2019-10-01 ZA18CEBS ST8000NM0055 8001563222016 0 119880096.0 NaN 0.0 2.0 0.0 ... NaN NaN 18159.0 5.162341e+10 1.326167e+11 NaN NaN NaN NaN NaN
10 2019-10-01 Z305DEMG ST4000DM000 4000787030016 0 161164360.0 NaN 0.0 4.0 0.0 ... NaN NaN 31207.0 4.454928e+10 1.502931e+11 NaN NaN NaN NaN NaN
11 2019-10-01 ZA130TTW ST8000DM002 8001563222016 0 40241952.0 NaN 0.0 2.0 0.0 ... NaN NaN 26265.0 6.771851e+10 1.653885e+11 NaN NaN NaN NaN NaN
12 2019-10-01 ZJV5HJQF ST12000NM0007 12000138625024 0 41766200.0 NaN 0.0 2.0 0.0 ... NaN NaN 93.0 6.804080e+08 3.379383e+08 NaN NaN NaN NaN NaN
13 2019-10-01 ZJV1CSVV ST12000NM0007 12000138625024 0 90869464.0 NaN 0.0 3.0 0.0 ... NaN NaN 7121.0 4.144846e+10 6.582024e+10 NaN NaN NaN NaN NaN
14 2019-10-01 ZA18CEBF ST8000NM0055 8001563222016 0 206980416.0 NaN 0.0 5.0 0.0 ... NaN NaN 18174.0 5.177096e+10 1.458424e+11 NaN NaN NaN NaN NaN
15 2019-10-01 ZJV02XWV ST12000NM0007 12000138625024 0 122003344.0 NaN 0.0 3.0 0.0 ... NaN NaN 12123.0 5.917276e+10 1.325998e+11 NaN NaN NaN NaN NaN
16 2019-10-01 PL2331LAG9TEEJ HGST HMS5C4040ALE640 4000787030016 0 0.0 98.0 449.0 13.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
17 2019-10-01 PL2331LAH3WYAJ HGST HMS5C4040BLE640 4000787030016 0 0.0 106.0 539.0 5.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18 2019-10-01 2AGN81UY HGST HUH721212ALN604 12000138625024 0 0.0 96.0 0.0 1.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
19 2019-10-01 PL1331LAHG53YH HGST HMS5C4040BLE640 4000787030016 0 0.0 104.0 440.0 7.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
20 2019-10-01 88Q0A0LGF97G TOSHIBA MG07ACA14TA 14000519643136 0 0.0 0.0 7795.0 2.0 0.0 ... NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN
21 2019-10-01 PL2331LAHDUVVJ HGST HMS5C4040BLE640 4000787030016 0 0.0 100.0 0.0 4.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
22 2019-10-01 ZA10JDYK ST8000DM002 8001563222016 0 144780968.0 NaN 0.0 5.0 0.0 ... NaN NaN 29378.0 5.207358e+10 1.730103e+11 NaN NaN NaN NaN NaN
23 2019-10-01 2AGN03VY HGST HUH721212ALN604 12000138625024 0 0.0 96.0 0.0 1.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
24 2019-10-01 2AGNBDDY HGST HUH721212ALN604 12000138625024 0 0.0 96.0 0.0 2.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25 2019-10-01 ZA18CEBT ST8000NM0055 8001563222016 0 44530656.0 NaN 0.0 5.0 0.0 ... NaN NaN 18163.0 5.212663e+10 1.302842e+11 NaN NaN NaN NaN NaN
26 2019-10-01 PL1331LAHD252H HGST HMS5C4040BLE640 4000787030016 0 0.0 103.0 432.0 9.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
27 2019-10-01 PL1331LAHD1HTH HGST HMS5C4040BLE640 4000787030016 0 0.0 103.0 432.0 6.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
28 2019-10-01 8CGDP8AH HGST HUH721212ALE600 12000138625024 0 0.0 96.0 384.0 14.0 0.0 ... NaN NaN NaN 3.766782e+10 4.292351e+10 NaN NaN NaN NaN NaN
29 2019-10-01 ZCH0EBLP ST12000NM0007 12000138625024 0 119494112.0 NaN 0.0 9.0 0.0 ... NaN NaN 12020.0 5.211059e+10 9.888794e+10 NaN NaN NaN NaN NaN

30 rows × 68 columns

In [11]:
# Return the memory usage of each column in bytes.
print(df.memory_usage(deep=True))
Index                   128
date              736411003
serial_number     724524693
model             783195873
capacity_bytes     87929672
                    ...    
smart_250_raw      87929672
smart_251_raw      87929672
smart_252_raw      87929672
smart_254_raw      87929672
smart_255_raw      87929672
Length: 69, dtype: int64
In [12]:
# Total number of failures
df.failure.sum()
Out[12]:
678
In [13]:
# Average number of failures per day
df.failure.sum() / len(df.date.unique())
Out[13]:
7.369565217391305

All SMART test columns have null values in some rows. The dataset notes state that this comes from differing manufacturer's standards despite the standardized nature of SMART tests.

In [14]:
for col in df.columns.values:
    print(col + ": " + str(df[col].isnull().values.any()))
date: False
serial_number: False
model: False
capacity_bytes: False
failure: False
smart_1_raw: True
smart_2_raw: True
smart_3_raw: True
smart_4_raw: True
smart_5_raw: True
smart_7_raw: True
smart_8_raw: True
smart_9_raw: True
smart_10_raw: True
smart_11_raw: True
smart_12_raw: True
smart_13_raw: True
smart_15_raw: True
smart_16_raw: True
smart_17_raw: True
smart_18_raw: True
smart_22_raw: True
smart_23_raw: True
smart_24_raw: True
smart_168_raw: True
smart_170_raw: True
smart_173_raw: True
smart_174_raw: True
smart_177_raw: True
smart_179_raw: True
smart_181_raw: True
smart_182_raw: True
smart_183_raw: True
smart_184_raw: True
smart_187_raw: True
smart_188_raw: True
smart_189_raw: True
smart_190_raw: True
smart_191_raw: True
smart_192_raw: True
smart_193_raw: True
smart_194_raw: True
smart_195_raw: True
smart_196_raw: True
smart_197_raw: True
smart_198_raw: True
smart_199_raw: True
smart_200_raw: True
smart_201_raw: True
smart_218_raw: True
smart_220_raw: True
smart_222_raw: True
smart_223_raw: True
smart_224_raw: True
smart_225_raw: True
smart_226_raw: True
smart_231_raw: True
smart_232_raw: True
smart_233_raw: True
smart_235_raw: True
smart_240_raw: True
smart_241_raw: True
smart_242_raw: True
smart_250_raw: True
smart_251_raw: True
smart_252_raw: True
smart_254_raw: True
smart_255_raw: True

Deriving the manufacturer from the model column will allow the dataset to be easily divided by manufacturer.

In [15]:
df.model.unique()
Out[15]:
array(['ST4000DM000', 'ST12000NM0007', 'HGST HMS5C4040ALE640',
       'ST8000NM0055', 'ST8000DM002', 'HGST HMS5C4040BLE640',
       'HGST HUH721212ALN604', 'TOSHIBA MG07ACA14TA',
       'HGST HUH721212ALE600', 'TOSHIBA MQ01ABF050', 'ST500LM030',
       'ST6000DX000', 'ST10000NM0086', 'DELLBOSS VD',
       'TOSHIBA MQ01ABF050M', 'WDC WD5000LPVX', 'ST500LM012 HN',
       'HGST HUH728080ALE600', 'TOSHIBA MD04ABA400V', 'TOSHIBA HDWF180',
       'ST8000DM005', 'Seagate SSD', 'HGST HUH721010ALE600',
       'ST4000DM005', 'WDC WD5000LPCX', 'HGST HDS5C4040ALE630',
       'ST500LM021', 'Hitachi HDS5C4040ALE630', 'HGST HUS726040ALE610',
       'Seagate BarraCuda SSD ZA500CM10002', 'ST12000NM0117',
       'Seagate BarraCuda SSD ZA2000CM10002',
       'Seagate BarraCuda SSD ZA250CM10002', 'TOSHIBA HDWE160',
       'WDC WD5000BPKT', 'ST6000DM001', 'WDC WD60EFRX', 'ST8000DM004',
       'HGST HMS5C4040BLE641', 'ST1000LM024 HN', 'ST6000DM004',
       'ST12000NM0008', 'ST16000NM001G'], dtype=object)

The "DELLBOSS VD" model value seems the be the only value potentially out of place.

In [16]:
df.loc[(df['model'] == "DELLBOSS VD") &
       (df['date'] == "2019-10-01")]
Out[16]:
date serial_number model capacity_bytes failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw ... smart_233_raw smart_235_raw smart_240_raw smart_241_raw smart_242_raw smart_250_raw smart_251_raw smart_252_raw smart_254_raw smart_255_raw
162 2019-10-01 1747287481d20010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1169 2019-10-01 a79d077c55d30010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1666 2019-10-01 8583f658cd680010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3224 2019-10-01 22ecf3ea21150010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3922 2019-10-01 9ac75f2107cc0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4150 2019-10-01 3b8f38bf6bc90010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4337 2019-10-01 29bae1bef9ad0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4567 2019-10-01 c3bea4912a060010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8696 2019-10-01 5bd1f7cc48910010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
12851 2019-10-01 c1858f02677a0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
13299 2019-10-01 b160b38dd1370010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
15676 2019-10-01 ef29e2d545380010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
16477 2019-10-01 7b7ec52d10240010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
19564 2019-10-01 eef069c94dfb0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
20934 2019-10-01 a79beabda2020010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
23008 2019-10-01 10ca0ecb78690010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
26018 2019-10-01 5c2f968553650010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
31033 2019-10-01 350901195c7b0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
35634 2019-10-01 6866178485f00010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
38044 2019-10-01 2d30418626330010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
38758 2019-10-01 17eddeea3c620010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
38973 2019-10-01 128cfa8eabec0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
40819 2019-10-01 e4d24bb6b3290010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
41068 2019-10-01 a23b0568a5f60010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
41224 2019-10-01 a941d1eaf0160010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
41966 2019-10-01 13a2651c44500010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
42935 2019-10-01 b86976e2b7490010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
43813 2019-10-01 45f3334ff98c0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
48684 2019-10-01 37e5a52d44600010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
54392 2019-10-01 49a73bc7c27d0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
57720 2019-10-01 f2907e144db40010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
58040 2019-10-01 312feea327f30010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
59544 2019-10-01 dc85a3f97d6f0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
60703 2019-10-01 f7acc8a7d9220010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
62021 2019-10-01 cce2cfe98b7f0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
62201 2019-10-01 c295df982e020010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
62319 2019-10-01 7818d2d7bc260010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
65014 2019-10-01 b9f8a9fe5d910010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
65271 2019-10-01 af70ef0319310010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
66849 2019-10-01 56cc876a649c0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
73313 2019-10-01 22d96dd0f90c0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
74417 2019-10-01 4d03b7d534ea0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
77924 2019-10-01 c3b6042ce1d70010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
83725 2019-10-01 5de287ae7c050010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
85238 2019-10-01 507b941884d90010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
85320 2019-10-01 1f157071f4590010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
85872 2019-10-01 421ceb5dd0720010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
91051 2019-10-01 9dd00e2a06080010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
91959 2019-10-01 eeb700c6e4960010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
92330 2019-10-01 76db3b83c3b30010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
92831 2019-10-01 bff106d793020010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
93365 2019-10-01 d83f152970950010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
96150 2019-10-01 ccddfe2489d50010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
96190 2019-10-01 e2cfef5b9de50010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
96893 2019-10-01 ad6def546aea0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
98881 2019-10-01 3c8f79f4ce9b0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
101456 2019-10-01 d2830942e1ca0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
101692 2019-10-01 98a871fbf9de0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
102576 2019-10-01 826fc283ec560010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
111088 2019-10-01 2e591a197fd00010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

60 rows × 68 columns

None of the SMART values exist for this hard drive model, but 60 of the drives have this model value. Additionally, no failures for this model exist in the dataset. Any row with this model value should be removed from the training data before any predictive analysis. Some searching online leads to the belief that it may be a RAID controller. (https://www.dell.com/support/manuals/au/en/aubsd1/boss-s-1/boss_s1_ug_publication/overview?guid=guid-b20ef25b-b7e3-40f2-b7cd-e497358cd10a&lang=en-us)

In [17]:
df.loc[(df['model'] == "DELLBOSS VD") &
       (df['failure'] == 1)]
Out[17]:
date serial_number model capacity_bytes failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw ... smart_233_raw smart_235_raw smart_240_raw smart_241_raw smart_242_raw smart_250_raw smart_251_raw smart_252_raw smart_254_raw smart_255_raw

0 rows × 68 columns

Additionally the "Seagate SSD" model seems to be missing information. Like the "DELLBOSS VD" model rows, this one also does not have any failures and will need to be removed before predictive analysis is performed.

In [18]:
df.loc[(df['model'] == "Seagate SSD") &
       (df['date'] == "2019-10-01")]
Out[18]:
date serial_number model capacity_bytes failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw ... smart_233_raw smart_235_raw smart_240_raw smart_241_raw smart_242_raw smart_250_raw smart_251_raw smart_252_raw smart_254_raw smart_255_raw
1113 2019-10-01 NB1206GH Seagate SSD 250059350016 0 0.0 NaN NaN NaN NaN ... 869.0 1.823282e+09 NaN 1399.0 307.0 NaN NaN NaN NaN NaN
1482 2019-10-01 NB120KH2 Seagate SSD 250059350016 0 0.0 NaN NaN NaN NaN ... 15439.0 3.237942e+10 NaN 7769.0 4309.0 NaN NaN NaN NaN NaN
1507 2019-10-01 NB120KHJ Seagate SSD 250059350016 0 0.0 NaN NaN NaN NaN ... 14427.0 3.025683e+10 NaN 7588.0 4182.0 NaN NaN NaN NaN NaN
4724 2019-10-01 NB120H6H Seagate SSD 250059350016 0 0.0 NaN NaN NaN NaN ... 2848.0 5.972913e+09 NaN 1562.0 911.0 NaN NaN NaN NaN NaN
4749 2019-10-01 NB120H66 Seagate SSD 250059350016 0 0.0 NaN NaN NaN NaN ... 2549.0 5.346440e+09 NaN 1686.0 695.0 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
109521 2019-10-01 NB120G0J Seagate SSD 250059350016 0 0.0 NaN NaN NaN NaN ... 7765.0 1.628449e+10 NaN 4844.0 2382.0 NaN NaN NaN NaN NaN
109891 2019-10-01 NB120AKM Seagate SSD 250059350016 0 0.0 NaN NaN NaN NaN ... 17371.0 3.643071e+10 NaN 7951.0 4602.0 NaN NaN NaN NaN NaN
109901 2019-10-01 NB120AKR Seagate SSD 250059350016 0 0.0 NaN NaN NaN NaN ... 12588.0 2.639919e+10 NaN 7887.0 4337.0 NaN NaN NaN NaN NaN
113784 2019-10-01 NB120KY9 Seagate SSD 250059350016 0 0.0 NaN NaN NaN NaN ... 12661.0 2.655319e+10 NaN 8001.0 4103.0 NaN NaN NaN NaN NaN
114644 2019-10-01 NB120HRB Seagate SSD 250059350016 0 0.0 NaN NaN NaN NaN ... 752.0 1.578609e+09 NaN 1189.0 18.0 NaN NaN NaN NaN NaN

96 rows × 68 columns

In [19]:
df.loc[(df['model'] == "Seagate SSD") &
       (df['failure'] == 1)]
Out[19]:
date serial_number model capacity_bytes failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw ... smart_233_raw smart_235_raw smart_240_raw smart_241_raw smart_242_raw smart_250_raw smart_251_raw smart_252_raw smart_254_raw smart_255_raw

0 rows × 68 columns

The rows not appropriate for analysis are deleted.

In [4]:
df.drop(df[(df['model'] == "DELLBOSS VD") | \
           (df['model'] == "Seagate SSD")].index, axis = 0, inplace = True)
In [5]:
n_rows = len(df)
n_rows
Out[5]:
10976221
In [6]:
# model: ["Manufacturer", "New Model"]
manufacturer_dict = {
    'ST4000DM000': ["Seagate", "ST4000DM000"],
    'ST12000NM0007': ["Seagate", "ST12000NM0007"],
    'HGST HMS5C4040ALE640': ["HGST", "HMS5C4040ALE640"],
    'ST8000NM0055': ["Seagate", "ST8000NM0055"],
    'ST8000DM002': ["Seagate", "ST8000DM002"],
    'HGST HMS5C4040BLE640': ["HGST", "HMS5C4040BLE640"],
    'HGST HUH721212ALN604': ["HGST", "HUH721212ALN604"],
    'TOSHIBA MG07ACA14TA': ["Toshiba", "MG07ACA14TA"],
    'HGST HUH721212ALE600': ["HGST", "HUH721212ALE600"],
    'TOSHIBA MQ01ABF050': ["Toshiba", "MQ01ABF050"],
    'ST500LM030': ["Seagate", "ST500LM030"],
    'ST6000DX000': ["Seagate", "ST6000DX000"],
    'ST10000NM0086': ["Seagate", "ST10000NM0086"],
    'DELLBOSS VD': ["Dell", "DELLBOSS VD"],
    'TOSHIBA MQ01ABF050M': ["Toshiba", "MQ01ABF050M"],
    'WDC WD5000LPVX': ["Western Digital", "WD5000LPVX"],
    'ST500LM012 HN': ["Seagate", "ST500LM012 HN"],
    'HGST HUH728080ALE600': ["HGST", "HUH728080ALE600"],
    'TOSHIBA MD04ABA400V': ["Toshiba", "MD04ABA400V"],
    'TOSHIBA HDWF180': ["Toshiba", "HDWF180"],
    'ST8000DM005': ["Seagate", "ST8000DM005"],
    'Seagate SSD': ["Seagate", "Seagate SSD"],
    'HGST HUH721010ALE600': ["HGST", "Seagate SSD"],
    'ST4000DM005': ["Seagate", "ST4000DM005"],
    'WDC WD5000LPCX': ["Western Digital", "WD5000LPCX"],
    'HGST HDS5C4040ALE630': ["HGST", "HDS5C4040ALE630"],
    'ST500LM021': ["Seagate", "ST500LM021"],
    'Hitachi HDS5C4040ALE630': ["HGST", "HDS5C4040ALE630"],
    'HGST HUS726040ALE610': ["HGST", "HUS726040ALE610"],
    'Seagate BarraCuda SSD ZA500CM10002': ["Seagate", "ZA500CM10002"],
    'ST12000NM0117': ["Seagate", "ST12000NM0117"],
    'Seagate BarraCuda SSD ZA2000CM10002': ["Seagate", "ZA2000CM10002"],
    'Seagate BarraCuda SSD ZA250CM10002': ["Seagate", "ZA250CM10002"],
    'TOSHIBA HDWE160': ["Toshiba", "HDWE160"],
    'WDC WD5000BPKT': ["Western Digital", "WD5000BPKT"],
    'ST6000DM001': ["Seagate", "ST6000DM001"],
    'WDC WD60EFRX': ["Western Digital", "WD60EFRX"],
    'ST8000DM004': ["Seagate", "ST8000DM004"],
    'HGST HMS5C4040BLE641': ["HGST", "HMS5C4040BLE641"],
    'ST1000LM024 HN': ["Seagate", "ST1000LM024 HN'"],
    'ST6000DM004': ["Seagate", "ST6000DM004"],
    'ST12000NM0008': ["Seagate", "ST12000NM0008"],
    'ST16000NM001G': ["Seagate", "ST16000NM001G"]
}
In [7]:
# Change the model column into Manufacturer and Model columns.
df['model_temp'] = df['model']
df['manufacturer'] = ''

df['manufacturer'] = df['model_temp'].map(lambda x: manufacturer_dict[x][0])
df['model'] = df['model_temp'].map(lambda x: manufacturer_dict[x][1])

df.drop(['model_temp'], axis=1, inplace=True)
In [8]:
df.head()
Out[8]:
date serial_number model capacity_bytes failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw ... smart_235_raw smart_240_raw smart_241_raw smart_242_raw smart_250_raw smart_251_raw smart_252_raw smart_254_raw smart_255_raw manufacturer
0 2019-10-01 Z305B2QN ST4000DM000 4000787030016 0 97236416.0 NaN 0.0 13.0 0.0 ... NaN 33009.0 5.063798e+10 1.623458e+11 NaN NaN NaN NaN NaN Seagate
1 2019-10-01 ZJV0XJQ4 ST12000NM0007 12000138625024 0 4665536.0 NaN 0.0 3.0 0.0 ... NaN 9533.0 5.084775e+10 1.271356e+11 NaN NaN NaN NaN NaN Seagate
2 2019-10-01 ZJV0XJQ3 ST12000NM0007 12000138625024 0 92892872.0 NaN 0.0 1.0 0.0 ... NaN 6977.0 4.920827e+10 4.658787e+10 NaN NaN NaN NaN NaN Seagate
3 2019-10-01 ZJV0XJQ0 ST12000NM0007 12000138625024 0 231702544.0 NaN 0.0 6.0 0.0 ... NaN 10669.0 5.341374e+10 9.427903e+10 NaN NaN NaN NaN NaN Seagate
4 2019-10-01 PL1331LAHG1S4H HMS5C4040ALE640 4000787030016 0 0.0 103.0 436.0 9.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN HGST

5 rows × 69 columns

Given the size of the dataset, a few minor changes to the columns may free up a considerable amount of memory. The date and capacity_bytes columns are two easy places to improve.

In [25]:
# date
df['date'].value_counts()
Out[25]:
2019-12-23    124853
2019-12-24    124853
2019-12-25    124853
2019-12-22    124851
2019-12-26    124850
               ...  
2019-10-09    115102
2019-10-04    115101
2019-10-07    115100
2019-10-03    115099
2019-11-05     55837
Name: date, Length: 92, dtype: int64
In [26]:
df['date'][0:5]
Out[26]:
0    2019-10-01
1    2019-10-01
2    2019-10-01
3    2019-10-01
4    2019-10-01
Name: date, dtype: object
In [27]:
before_mem = df['date'].memory_usage()
before_mem
Out[27]:
175619536
In [28]:
df['date'] = df['date'].str[-5:]
df.head()
Out[28]:
date serial_number model capacity_bytes failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw ... smart_235_raw smart_240_raw smart_241_raw smart_242_raw smart_250_raw smart_251_raw smart_252_raw smart_254_raw smart_255_raw manufacturer
0 10-01 Z305B2QN ST4000DM000 4000787030016 0 97236416.0 NaN 0.0 13.0 0.0 ... NaN 33009.0 5.063798e+10 1.623458e+11 NaN NaN NaN NaN NaN Seagate
1 10-01 ZJV0XJQ4 ST12000NM0007 12000138625024 0 4665536.0 NaN 0.0 3.0 0.0 ... NaN 9533.0 5.084775e+10 1.271356e+11 NaN NaN NaN NaN NaN Seagate
2 10-01 ZJV0XJQ3 ST12000NM0007 12000138625024 0 92892872.0 NaN 0.0 1.0 0.0 ... NaN 6977.0 4.920827e+10 4.658787e+10 NaN NaN NaN NaN NaN Seagate
3 10-01 ZJV0XJQ0 ST12000NM0007 12000138625024 0 231702544.0 NaN 0.0 6.0 0.0 ... NaN 10669.0 5.341374e+10 9.427903e+10 NaN NaN NaN NaN NaN Seagate
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 4000787030016 0 0.0 103.0 436.0 9.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN HGST

5 rows × 69 columns

In [29]:
df['date'] = df['date'].astype('category')
df['date'][0:5]
Out[29]:
0    10-01
1    10-01
2    10-01
3    10-01
4    10-01
Name: date, dtype: category
Categories (92, object): [10-01, 10-02, 10-03, 10-04, ..., 12-28, 12-29, 12-30, 12-31]
In [30]:
after_mem = df['date'].memory_usage()
after_mem
Out[30]:
98789285
In [31]:
memory_saved = before_mem - after_mem
print("Memory saved on the date column: " + str(np.around((memory_saved / 1024 ** 2), 2)) + "MB")
Memory saved on the date column: 73.27MB
In [32]:
# model
before_mem = df['model'].memory_usage()
df['model'] = df['model'].astype('category')
after_mem = df['model'].memory_usage()
memory_saved = before_mem - after_mem
print("Memory saved on the model column: " + str(np.around((memory_saved / 1024 ** 2), 2)) + "MB")
Memory saved on the model column: 73.27MB
In [33]:
# failure
before_mem = df['failure'].memory_usage(deep = True)
df['failure'] = df['failure'].astype('bool')
after_mem = df['failure'].memory_usage(deep = True)
memory_saved = before_mem - after_mem
print("Memory saved on the failure column: " + str(np.around((memory_saved / 1024 ** 2), 2)) + "MB")
Memory saved on the failure column: 73.27MB
In [9]:
# capacity_bytes
before_memory = df['capacity_bytes'].memory_usage(deep = True)
before_memory
Out[9]:
175619536

Here we can see that 1108 drive days have an error value rather than their actual capacity. These rows may need to be removed, but may also be an excellent signal for a failing drive.

In [10]:
df.loc[df["capacity_bytes"] == -1]["manufacturer"].value_counts()
Out[10]:
Seagate            759
HGST               299
Toshiba             48
Western Digital      2
Name: manufacturer, dtype: int64
In [11]:
sns.countplot(x = df.loc[df["capacity_bytes"] == -1]["capacity_bytes"], \
              hue = df["failure"])
Out[11]:
<AxesSubplot:xlabel='capacity_bytes', ylabel='count'>

Unfortunately, all drives experiencing this error do not fail and this can introduce problems in the final model. As it only affects 0.01% of the dataset, removing the affected rows seems best.

In [12]:
# Calculate the percentage of the dataset that is affected by this error.
str(np.around(((1008/n_rows) * 100), 2)) + "%"
Out[12]:
'0.01%'
In [13]:
df.drop(df[(df['capacity_bytes'] == -1)].index, axis = 0, inplace = True)
In [14]:
n_rows = len(df)
n_rows
Out[14]:
10975113
In [15]:
df['capacity_bytes'].value_counts()
Out[15]:
12000138625024    4855875
4000787030016     3197457
8001563222016     2309775
14000519643136     232122
500107862016       177166
10000831348736     110993
6001175126016       82595
250059350016         6844
16000900661248       1840
2000398934016         355
1000204886016          91
Name: capacity_bytes, dtype: int64

The capacity_bytes column is converted from bytes to terabytes to condense the information on disk.

In [16]:
df['capacity_TB'] = np.around((df['capacity_bytes']/(1000*1000*1000*1000)), \
                              decimals = 2)
df.head()
Out[16]:
date serial_number model capacity_bytes failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw ... smart_240_raw smart_241_raw smart_242_raw smart_250_raw smart_251_raw smart_252_raw smart_254_raw smart_255_raw manufacturer capacity_TB
0 2019-10-01 Z305B2QN ST4000DM000 4000787030016 0 97236416.0 NaN 0.0 13.0 0.0 ... 33009.0 5.063798e+10 1.623458e+11 NaN NaN NaN NaN NaN Seagate 4.0
1 2019-10-01 ZJV0XJQ4 ST12000NM0007 12000138625024 0 4665536.0 NaN 0.0 3.0 0.0 ... 9533.0 5.084775e+10 1.271356e+11 NaN NaN NaN NaN NaN Seagate 12.0
2 2019-10-01 ZJV0XJQ3 ST12000NM0007 12000138625024 0 92892872.0 NaN 0.0 1.0 0.0 ... 6977.0 4.920827e+10 4.658787e+10 NaN NaN NaN NaN NaN Seagate 12.0
3 2019-10-01 ZJV0XJQ0 ST12000NM0007 12000138625024 0 231702544.0 NaN 0.0 6.0 0.0 ... 10669.0 5.341374e+10 9.427903e+10 NaN NaN NaN NaN NaN Seagate 12.0
4 2019-10-01 PL1331LAHG1S4H HMS5C4040ALE640 4000787030016 0 0.0 103.0 436.0 9.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN HGST 4.0

5 rows × 70 columns

In [17]:
df['capacity_TB'].value_counts()
Out[17]:
12.00    4855875
4.00     3197457
8.00     2309775
14.00     232122
0.50      177166
10.00     110993
6.00       82595
0.25        6844
16.00       1840
2.00         355
1.00          91
Name: capacity_TB, dtype: int64
In [19]:
df['capacity_TB'] = df['capacity_TB'].astype('category')
after_mem = df['capacity_TB'].memory_usage()
memory_saved = before_memory - after_mem
print("Memory saved on the capacity column: " + \
      str(np.around((memory_saved / 1024 ** 2), 2)) + "MB")
Memory saved on the capacity column: 73.28MB
In [20]:
df.drop(['capacity_bytes'], axis=1, inplace=True)
df.head()
Out[20]:
date serial_number model failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw ... smart_240_raw smart_241_raw smart_242_raw smart_250_raw smart_251_raw smart_252_raw smart_254_raw smart_255_raw manufacturer capacity_TB
0 2019-10-01 Z305B2QN ST4000DM000 0 97236416.0 NaN 0.0 13.0 0.0 704304346.0 ... 33009.0 5.063798e+10 1.623458e+11 NaN NaN NaN NaN NaN Seagate 4.0
1 2019-10-01 ZJV0XJQ4 ST12000NM0007 0 4665536.0 NaN 0.0 3.0 0.0 422822971.0 ... 9533.0 5.084775e+10 1.271356e+11 NaN NaN NaN NaN NaN Seagate 12.0
2 2019-10-01 ZJV0XJQ3 ST12000NM0007 0 92892872.0 NaN 0.0 1.0 0.0 936518450.0 ... 6977.0 4.920827e+10 4.658787e+10 NaN NaN NaN NaN NaN Seagate 12.0
3 2019-10-01 ZJV0XJQ0 ST12000NM0007 0 231702544.0 NaN 0.0 6.0 0.0 416687782.0 ... 10669.0 5.341374e+10 9.427903e+10 NaN NaN NaN NaN NaN Seagate 12.0
4 2019-10-01 PL1331LAHG1S4H HMS5C4040ALE640 0 0.0 103.0 436.0 9.0 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN HGST 4.0

5 rows × 69 columns

In [21]:
fail_df = pd.crosstab(df["manufacturer"], df["failure"])
fail_df
Out[21]:
failure 0 1
manufacturer
HGST 2660507 26
Seagate 7965951 606
Toshiba 322682 40
Western Digital 25295 6
In [22]:
fail_df['Rate'] = fail_df[1] / (fail_df[0] + fail_df[1])
fail_df
Out[22]:
failure 0 1 Rate
manufacturer
HGST 2660507 26 0.000010
Seagate 7965951 606 0.000076
Toshiba 322682 40 0.000124
Western Digital 25295 6 0.000237
In [23]:
corr_df = df.corr()
In [24]:
corr_df['failure']
Out[24]:
failure          1.000000
smart_1_raw      0.002183
smart_2_raw     -0.003998
smart_3_raw     -0.000161
smart_4_raw      0.001086
                   ...   
smart_250_raw         NaN
smart_251_raw         NaN
smart_252_raw         NaN
smart_254_raw         NaN
smart_255_raw         NaN
Name: failure, Length: 64, dtype: float64

With these things finished, the univariate distributions can be examined to gain a better sense of the data.

The first column, date shows some sort of testing or operational failure on November 5th.

In [25]:
plt.figure(figsize = (20, 10))
plt.title('Number of Drives in Operation per Day (Q4 2019)')
g = sns.countplot(df['date'], data = df)
g.set_xticklabels(g.get_xticklabels(), rotation = 90)
g.figure.savefig("Charts/Date Distribution.png")
g.figure.savefig("Charts/Date Distribution.svg")

Drive capacities are mostly 4, 8, and 12 TB, likely coinciding with large investments in new drives for the datacenter and possibly alongside the price lowering of specific models.

In [26]:
plt.figure(figsize = (5, 5))
plt.title('Capacity of Drives')
g = sns.countplot(df['capacity_TB'], data = df)
g.set_xticklabels(g.get_xticklabels(), rotation = 90)
for p in g.patches:
    percentage = "{0:.2f}".format((p.get_height() / n_rows) * 100) + "%"
    g.annotate(percentage, (p.get_x() + p.get_width() / 2., p.get_height()),
    ha = 'center', va = 'center', xytext = (0, 7), textcoords = 'offset points')

g.figure.savefig("Charts/Capacity Distribution.svg")
g.figure.savefig("Charts/Capacity Distribution.svg")

The manufacturer of the most drives in this dataset is Seagate at 72.59%. HGST is the second highest at 24.24%. Western Digital is the least represented manufacturer in the dataset with only 0.23%, but as HGST was acquired by Western Digital in 2012 (Sanders, 2018), the drives in this dataset will likely be quite similar between the two manufacturers given the seven-year timespan between then and the time of dataset recording and creation. Finally, Toshiba is the other manufacturer, with 2.94% of the dataset. This amount is quite low and may make it difficult to accurately predict their drives in comparison.

In [27]:
plt.figure(figsize = (5, 5))
plt.title('Manufacturers of Drives')
g = sns.countplot(df['manufacturer'], data = df)
g.set_xticklabels(g.get_xticklabels(), rotation = 90)
for p in g.patches:
    percentage = "{0:.2f}".format((p.get_height() / n_rows) * 100) + "%"
    g.annotate(percentage, (p.get_x() + p.get_width() / 2., p.get_height()),
    ha = 'center', va = 'center', xytext = (0, 7), textcoords = 'offset points')

g.figure.savefig("Charts/Manufacturer Distribution.svg")
g.figure.savefig("Charts/Manufacturer Distribution.png")

The SMART values vary greatly from the number of different types of drives that exist in this dataset. Before the columns can be graphed appropriately, the NaN/null values need to be examined. It's most likely that the missing data is most related to the hard drive's manufacturer or model.

In [61]:
sns.distplot(df['smart_1_raw'])
plt.grid(True)
plt.show()
In [62]:
# Pandas styling function
def highlight_nans(val):
    color = 'red' if val == True or val > 0 else 'black'
    return 'color: %s' % color

Every single SMART figure column has null values.

In [63]:
pd.set_option('display.max_rows', 70)
pd.set_option('display.max_columns', 75)
df.isna().any()
Out[63]:
date             False
serial_number    False
model            False
failure          False
smart_1_raw       True
smart_2_raw       True
smart_3_raw       True
smart_4_raw       True
smart_5_raw       True
smart_7_raw       True
smart_8_raw       True
smart_9_raw       True
smart_10_raw      True
smart_11_raw      True
smart_12_raw      True
smart_13_raw      True
smart_15_raw      True
smart_16_raw      True
smart_17_raw      True
smart_18_raw      True
smart_22_raw      True
smart_23_raw      True
smart_24_raw      True
smart_168_raw     True
smart_170_raw     True
smart_173_raw     True
smart_174_raw     True
smart_177_raw     True
smart_179_raw     True
smart_181_raw     True
smart_182_raw     True
smart_183_raw     True
smart_184_raw     True
smart_187_raw     True
smart_188_raw     True
smart_189_raw     True
smart_190_raw     True
smart_191_raw     True
smart_192_raw     True
smart_193_raw     True
smart_194_raw     True
smart_195_raw     True
smart_196_raw     True
smart_197_raw     True
smart_198_raw     True
smart_199_raw     True
smart_200_raw     True
smart_201_raw     True
smart_218_raw     True
smart_220_raw     True
smart_222_raw     True
smart_223_raw     True
smart_224_raw     True
smart_225_raw     True
smart_226_raw     True
smart_231_raw     True
smart_232_raw     True
smart_233_raw     True
smart_235_raw     True
smart_240_raw     True
smart_241_raw     True
smart_242_raw     True
smart_250_raw     True
smart_251_raw     True
smart_252_raw     True
smart_254_raw     True
smart_255_raw     True
manufacturer     False
capacity_TB      False
dtype: bool
In [64]:
manu_nan_df = pd.DataFrame()
for manu in df['manufacturer'].unique():
    manu_nan_df[manu] = df.loc[df['manufacturer'] == manu].isna().sum()
In [65]:
manu_nan_df.style.applymap(highlight_nans)
Out[65]:
Seagate HGST Toshiba Western Digital
date 0 0 0 0
serial_number 0 0 0 0
model 0 0 0 0
failure 0 0 0 0
smart_1_raw 2 0 0 0
smart_2_raw 7921364 0 0 25301
smart_3_raw 8794 0 0 0
smart_4_raw 8794 0 0 0
smart_5_raw 8794 0 0 0
smart_7_raw 8794 0 0 0
smart_8_raw 7921364 0 0 25301
smart_9_raw 2 0 0 0
smart_10_raw 8794 0 0 0
smart_11_raw 7921364 2660533 322722 0
smart_12_raw 2 0 0 0
smart_13_raw 7966557 2660533 322722 25301
smart_15_raw 7966557 2660533 322722 25301
smart_16_raw 7957765 2660533 322722 25301
smart_17_raw 7957765 2660533 322722 25301
smart_18_raw 7643443 2660533 322722 25301
smart_22_raw 7966557 1427395 322722 25301
smart_23_raw 7966557 2660533 90600 25301
smart_24_raw 7966557 2660533 90600 25301
smart_168_raw 7957765 2660533 322722 25301
smart_170_raw 7957765 2660533 322722 25301
smart_173_raw 7957765 2660533 322722 25301
smart_174_raw 7957765 2660533 322722 25301
smart_177_raw 7957765 2660533 322722 25301
smart_179_raw 7966557 2660533 322722 25301
smart_181_raw 7966557 2660533 322722 25301
smart_182_raw 7966557 2660533 322722 25301
smart_183_raw 6123738 2660533 322722 25301
smart_184_raw 3772455 2660533 322722 25301
smart_187_raw 53987 2660533 322722 25301
smart_188_raw 53987 2660533 322722 25301
smart_189_raw 3772455 2660533 322722 25301
smart_190_raw 53987 2660533 322722 25301
smart_191_raw 3727262 2660533 0 15052
smart_192_raw 2 0 0 0
smart_193_raw 53987 0 0 0
smart_194_raw 2 0 0 0
smart_195_raw 1797748 2660533 322722 25301
smart_196_raw 7921364 0 0 0
smart_197_raw 8794 0 0 0
smart_198_raw 8794 0 0 0
smart_199_raw 8794 0 0 0
smart_200_raw 4093723 2660533 322722 0
smart_201_raw 7966557 2660533 322722 25301
smart_218_raw 7957765 2660533 322722 25301
smart_220_raw 7966557 2660533 0 25301
smart_222_raw 7966557 2660533 0 25301
smart_223_raw 7921364 2517013 0 25301
smart_224_raw 7966557 2660533 0 25301
smart_225_raw 7921364 2660533 322722 25301
smart_226_raw 7966557 2660533 0 25301
smart_231_raw 7957765 2660533 322722 25301
smart_232_raw 7957765 2660533 322722 25301
smart_233_raw 7957765 2660533 322722 25301
smart_235_raw 7957765 2660533 322722 25301
smart_240_raw 53987 2660533 0 18734
smart_241_raw 45195 2517013 322722 24389
smart_242_raw 45195 2517013 322722 24389
smart_250_raw 7966557 2660533 322722 25301
smart_251_raw 7966557 2660533 322722 25301
smart_252_raw 7966557 2660533 322722 25301
smart_254_raw 7940496 2660533 322722 24389
smart_255_raw 7966557 2660533 322722 25301
manufacturer 0 0 0 0
capacity_TB 0 0 0 0
In [66]:
model_nan_df = pd.DataFrame()
for model in df['model'].unique():
    model_nan_df[model] = df.loc[df['model'] == model].isna().sum()
In [67]:
model_nan_df.style.applymap(highlight_nans)
Out[67]:
ST4000DM000 ST12000NM0007 HMS5C4040ALE640 ST8000NM0055 ST8000DM002 HMS5C4040BLE640 HUH721212ALN604 MG07ACA14TA HUH721212ALE600 MQ01ABF050 ST500LM030 ST6000DX000 ST10000NM0086 MQ01ABF050M WD5000LPVX ST500LM012 HN HUH728080ALE600 MD04ABA400V HDWF180 ST8000DM005 Seagate SSD ST4000DM005 WD5000LPCX HDS5C4040ALE630 ST500LM021 HUS726040ALE610 ZA500CM10002 ST12000NM0117 ZA2000CM10002 ZA250CM10002 HDWE160 WD5000BPKT ST6000DM001 WD60EFRX ST8000DM004 HMS5C4040BLE641 ST1000LM024 HN' ST6000DM004 ST12000NM0008 ST16000NM001G
date 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
serial_number 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
model 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
failure 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
smart_1_raw 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
smart_2_raw 1757498 3394893 0 1316386 896946 0 0 0 0 0 23025 81493 109173 0 19187 0 0 0 0 2257 0 3555 4928 0 3036 0 1593 462 355 6844 0 912 368 274 273 0 0 92 321275 1840
smart_3_raw 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1593 0 355 6844 0 0 0 0 0 0 0 0 1 0
smart_4_raw 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1593 0 355 6844 0 0 0 0 0 0 0 0 1 0
smart_5_raw 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1593 0 355 6844 0 0 0 0 0 0 0 0 1 0
smart_7_raw 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1593 0 355 6844 0 0 0 0 0 0 0 0 1 0
smart_8_raw 1757498 3394893 0 1316386 896946 0 0 0 0 0 23025 81493 109173 0 19187 0 0 0 0 2257 0 3555 4928 0 3036 0 1593 462 355 6844 0 912 368 274 273 0 0 92 321275 1840
smart_9_raw 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
smart_10_raw 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1593 0 355 6844 0 0 0 0 0 0 0 0 1 0
smart_11_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 0 0 92073 9009 1840 2257 1820 3555 0 2484 3036 2570 1593 462 355 6844 368 0 368 0 273 91 0 92 321275 1840
smart_12_raw 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
smart_13_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 1593 462 355 6844 368 912 368 274 273 91 91 92 321275 1840
smart_15_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 1593 462 355 6844 368 912 368 274 273 91 91 92 321275 1840
smart_16_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 0 462 0 0 368 912 368 274 273 91 91 92 321275 1840
smart_17_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 0 462 0 0 368 912 368 274 273 91 91 92 321275 1840
smart_18_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 1593 462 355 6844 368 912 368 274 273 91 91 92 1 0
smart_22_raw 1757498 3394893 253758 1316386 896946 1168492 0 232122 0 42565 23025 81493 109173 36818 19187 45102 0 9009 1840 2257 0 3555 4928 2484 3036 2570 1593 462 355 6844 368 912 368 274 273 91 91 92 321275 1840
smart_23_raw 1757498 3394893 253758 1316386 896946 1168492 995725 0 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 1593 462 355 6844 368 912 368 274 273 91 91 92 321275 1840
smart_24_raw 1757498 3394893 253758 1316386 896946 1168492 995725 0 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 1593 462 355 6844 368 912 368 274 273 91 91 92 321275 1840
smart_168_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 0 462 0 0 368 912 368 274 273 91 91 92 321275 1840
smart_170_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 0 462 0 0 368 912 368 274 273 91 91 92 321275 1840
smart_173_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 0 462 0 0 368 912 368 274 273 91 91 92 321275 1840
smart_174_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 0 462 0 0 368 912 368 274 273 91 91 92 321275 1840
smart_177_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 0 462 0 0 368 912 368 274 273 91 91 92 321275 1840
smart_179_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 1593 462 355 6844 368 912 368 274 273 91 91 92 321275 1840
smart_181_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 1593 462 355 6844 368 912 368 274 273 91 91 92 321275 1840
smart_182_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 1593 462 355 6844 368 912 368 274 273 91 91 92 321275 1840
smart_183_raw 0 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 0 109173 36818 19187 45102 92073 9009 1840 2257 1820 0 4928 2484 3036 2570 1593 462 355 6844 368 912 368 274 0 91 91 92 321275 1840
smart_184_raw 0 3394893 253758 0 0 1168492 995725 232122 143520 42565 0 0 0 36818 19187 45102 92073 9009 1840 0 1820 0 4928 2484 0 2570 1593 462 355 6844 368 912 0 274 0 91 91 0 321275 1840
smart_187_raw 0 1 253758 0 0 1168492 995725 232122 143520 42565 0 0 0 36818 19187 45102 92073 9009 1840 0 1820 0 4928 2484 0 2570 1593 0 355 6844 368 912 0 274 0 91 91 0 1 0
smart_188_raw 0 1 253758 0 0 1168492 995725 232122 143520 42565 0 0 0 36818 19187 45102 92073 9009 1840 0 1820 0 4928 2484 0 2570 1593 0 355 6844 368 912 0 274 0 91 91 0 1 0
smart_189_raw 0 3394893 253758 0 0 1168492 995725 232122 143520 42565 0 0 0 36818 19187 45102 92073 9009 1840 0 1820 0 4928 2484 0 2570 1593 462 355 6844 368 912 0 274 0 91 91 0 321275 1840
smart_190_raw 0 1 253758 0 0 1168492 995725 232122 143520 42565 0 0 0 36818 19187 45102 92073 9009 1840 0 1820 0 4928 2484 0 2570 1593 0 355 6844 368 912 0 274 0 91 91 0 1 0
smart_191_raw 0 3394893 253758 0 0 1168492 995725 0 143520 0 0 0 0 0 9850 0 92073 0 0 0 1820 0 4928 2484 0 2570 1593 462 355 6844 0 0 0 274 0 91 0 0 321275 1840
smart_192_raw 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
smart_193_raw 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 45102 0 0 0 0 0 0 0 0 0 0 1593 0 355 6844 0 0 0 0 0 0 91 0 1 0
smart_194_raw 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
smart_195_raw 1757498 1 253758 0 0 1168492 995725 232122 143520 42565 23025 0 0 36818 19187 0 92073 9009 1840 0 1820 3555 4928 2484 3036 2570 1593 0 355 6844 368 912 0 274 0 91 0 0 1 1840
smart_196_raw 1757498 3394893 0 1316386 896946 0 0 0 0 0 23025 81493 109173 0 0 0 0 0 0 2257 0 3555 0 0 3036 0 1593 462 355 6844 0 0 368 0 273 0 0 92 321275 1840
smart_197_raw 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1593 0 355 6844 0 0 0 0 0 0 0 0 1 0
smart_198_raw 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1593 0 355 6844 0 0 0 0 0 0 0 0 1 0
smart_199_raw 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1593 0 355 6844 0 0 0 0 0 0 0 0 1 0
smart_200_raw 1757498 1 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 0 36818 0 0 92073 9009 1840 2257 1820 3555 0 2484 3036 2570 1593 0 355 6844 368 0 368 0 273 91 0 92 1 0
smart_201_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 1593 462 355 6844 368 912 368 274 273 91 91 92 321275 1840
smart_218_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 0 462 0 0 368 912 368 274 273 91 91 92 321275 1840
smart_220_raw 1757498 3394893 253758 1316386 896946 1168492 995725 0 143520 0 23025 81493 109173 0 19187 45102 92073 0 0 2257 1820 3555 4928 2484 3036 2570 1593 462 355 6844 0 912 368 274 273 91 91 92 321275 1840
smart_222_raw 1757498 3394893 253758 1316386 896946 1168492 995725 0 143520 0 23025 81493 109173 0 19187 45102 92073 0 0 2257 1820 3555 4928 2484 3036 2570 1593 462 355 6844 0 912 368 274 273 91 91 92 321275 1840
smart_223_raw 1757498 3394893 253758 1316386 896946 1168492 995725 0 0 0 23025 81493 109173 0 19187 0 92073 0 0 2257 1820 3555 4928 2484 3036 2570 1593 462 355 6844 0 912 368 274 273 91 0 92 321275 1840
smart_224_raw 1757498 3394893 253758 1316386 896946 1168492 995725 0 143520 0 23025 81493 109173 0 19187 45102 92073 0 0 2257 1820 3555 4928 2484 3036 2570 1593 462 355 6844 0 912 368 274 273 91 91 92 321275 1840
smart_225_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 0 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 1593 462 355 6844 368 912 368 274 273 91 0 92 321275 1840
smart_226_raw 1757498 3394893 253758 1316386 896946 1168492 995725 0 143520 0 23025 81493 109173 0 19187 45102 92073 0 0 2257 1820 3555 4928 2484 3036 2570 1593 462 355 6844 0 912 368 274 273 91 91 92 321275 1840
smart_231_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 0 462 0 0 368 912 368 274 273 91 91 92 321275 1840
smart_232_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 0 462 0 0 368 912 368 274 273 91 91 92 321275 1840
smart_233_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 0 462 0 0 368 912 368 274 273 91 91 92 321275 1840
smart_235_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 0 462 0 0 368 912 368 274 273 91 91 92 321275 1840
smart_240_raw 0 1 253758 0 0 1168492 995725 0 143520 0 0 0 0 0 13532 45102 92073 0 0 0 1820 0 4928 2484 0 2570 1593 0 355 6844 0 0 0 274 0 91 91 0 1 0
smart_241_raw 0 1 253758 0 0 1168492 995725 232122 0 42565 0 0 0 36818 19187 45102 92073 9009 1840 0 1820 0 4928 2484 0 2570 0 0 0 0 368 0 0 274 0 91 91 0 1 0
smart_242_raw 0 1 253758 0 0 1168492 995725 232122 0 42565 0 0 0 36818 19187 45102 92073 9009 1840 0 1820 0 4928 2484 0 2570 0 0 0 0 368 0 0 274 0 91 91 0 1 0
smart_250_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 1593 462 355 6844 368 912 368 274 273 91 91 92 321275 1840
smart_251_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 1593 462 355 6844 368 912 368 274 273 91 91 92 321275 1840
smart_252_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 1593 462 355 6844 368 912 368 274 273 91 91 92 321275 1840
smart_254_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 0 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 0 2570 1593 462 355 6844 368 0 368 274 273 91 91 92 321275 1840
smart_255_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 1593 462 355 6844 368 912 368 274 273 91 91 92 321275 1840
manufacturer 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
capacity_TB 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
In [68]:
model_nan_percent_df = pd.DataFrame()
for model in df['model'].unique():
    model_nan_percent_df[model] = (df.loc[df['model'] == model].isna().sum())\
        /len(df.loc[df['model'] == model])
In [69]:
model_nan_percent_df
Out[69]:
ST4000DM000 ST12000NM0007 HMS5C4040ALE640 ST8000NM0055 ST8000DM002 HMS5C4040BLE640 HUH721212ALN604 MG07ACA14TA HUH721212ALE600 MQ01ABF050 ST500LM030 ST6000DX000 ST10000NM0086 MQ01ABF050M WD5000LPVX ST500LM012 HN HUH728080ALE600 MD04ABA400V HDWF180 ST8000DM005 Seagate SSD ST4000DM005 WD5000LPCX HDS5C4040ALE630 ST500LM021 HUS726040ALE610 ZA500CM10002 ST12000NM0117 ZA2000CM10002 ZA250CM10002 HDWE160 WD5000BPKT ST6000DM001 WD60EFRX ST8000DM004 HMS5C4040BLE641 ST1000LM024 HN' ST6000DM004 ST12000NM0008 ST16000NM001G
date 0.0 0.000000e+00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0
serial_number 0.0 0.000000e+00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0
model 0.0 0.000000e+00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0
failure 0.0 0.000000e+00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0
smart_1_raw 0.0 2.945601e-07 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000003 0.0
smart_2_raw 1.0 1.000000e+00 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 0.0 1.000000 0.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 1.000000 1.0
smart_3_raw 0.0 2.945601e-07 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000003 0.0
smart_4_raw 0.0 2.945601e-07 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000003 0.0
smart_5_raw 0.0 2.945601e-07 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000003 0.0
smart_7_raw 0.0 2.945601e-07 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000003 0.0
smart_8_raw 1.0 1.000000e+00 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 0.0 1.000000 0.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 1.000000 1.0
smart_9_raw 0.0 2.945601e-07 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000003 0.0
smart_10_raw 0.0 2.945601e-07 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000003 0.0
smart_11_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.000000 0.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 1.000000 1.0
smart_12_raw 0.0 2.945601e-07 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000003 0.0
smart_13_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
smart_15_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
smart_16_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
smart_17_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
smart_18_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.000003 0.0
smart_22_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 0.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0 0.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
smart_23_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
smart_24_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
smart_168_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
smart_170_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
smart_173_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
smart_174_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
smart_177_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
smart_179_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
smart_181_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
smart_182_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
smart_183_raw 0.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0 1.000000 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0 1.000000 1.0
smart_184_raw 0.0 1.000000e+00 1.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 1.0 1.000000 1.0 1.0 1.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 1.000000 1.0
smart_187_raw 0.0 2.945601e-07 1.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 1.0 1.000000 1.0 1.0 1.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 0.000003 0.0
smart_188_raw 0.0 2.945601e-07 1.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 1.0 1.000000 1.0 1.0 1.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 0.000003 0.0
smart_189_raw 0.0 1.000000e+00 1.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 1.0 1.000000 1.0 1.0 1.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 1.000000 1.0
smart_190_raw 0.0 2.945601e-07 1.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 1.0 1.000000 1.0 1.0 1.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 0.000003 0.0
smart_191_raw 0.0 1.000000e+00 1.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.513368 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1.000000 1.0
smart_192_raw 0.0 2.945601e-07 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000003 0.0
smart_193_raw 0.0 2.945601e-07 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.000003 0.0
smart_194_raw 0.0 2.945601e-07 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000003 0.0
smart_195_raw 1.0 2.945601e-07 1.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 1.000000 0.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.000003 1.0
smart_196_raw 1.0 1.000000e+00 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 0.0 0.000000 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 1.000000 1.0
smart_197_raw 0.0 2.945601e-07 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000003 0.0
smart_198_raw 0.0 2.945601e-07 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000003 0.0
smart_199_raw 0.0 2.945601e-07 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000003 0.0
smart_200_raw 1.0 2.945601e-07 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 0.000000 0.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 0.000003 0.0
smart_201_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
smart_218_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
smart_220_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0 1.0 1.0 1.0 0.0 1.000000 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
smart_222_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0 1.0 1.0 1.0 0.0 1.000000 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
smart_223_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 1.0 0.0 1.000000 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 1.000000 1.0
smart_224_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0 1.0 1.0 1.0 0.0 1.000000 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
smart_225_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 1.000000 1.0
smart_226_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0 1.0 1.0 1.0 0.0 1.000000 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
smart_231_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
smart_232_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
smart_233_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
smart_235_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
smart_240_raw 0.0 2.945601e-07 1.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.705269 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 0.000003 0.0
smart_241_raw 0.0 2.945601e-07 1.0 0.0 0.0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 1.000000 1.0 1.0 1.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 0.000003 0.0
smart_242_raw 0.0 2.945601e-07 1.0 0.0 0.0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 1.000000 1.0 1.0 1.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 0.000003 0.0
smart_250_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
smart_251_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
smart_252_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
smart_254_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0 1.000000 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
smart_255_raw 1.0 1.000000e+00 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.000000 1.0
manufacturer 0.0 0.000000e+00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0
capacity_TB 0.0 0.000000e+00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0
In [70]:
plt.figure(figsize = (20, 20))
plt.title('Model NaN Value Proportion by Hard Drive Model')
g = sns.heatmap(model_nan_percent_df, linewidths=0.2)
g.figure.savefig("Charts/Model NaN Heatmap.svg")
g.figure.savefig("Charts/Model NaN Heatmap.png")
In [71]:
description_df = df.describe()
description_df
Out[71]:
smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_8_raw smart_9_raw smart_10_raw smart_11_raw smart_12_raw smart_13_raw smart_15_raw smart_16_raw smart_17_raw smart_18_raw smart_22_raw smart_23_raw smart_24_raw smart_168_raw smart_170_raw smart_173_raw smart_174_raw smart_177_raw smart_179_raw smart_181_raw smart_182_raw smart_183_raw smart_184_raw smart_187_raw smart_188_raw smart_189_raw smart_190_raw smart_191_raw smart_192_raw smart_193_raw smart_194_raw smart_195_raw smart_196_raw smart_197_raw smart_198_raw smart_199_raw smart_200_raw smart_201_raw smart_218_raw smart_220_raw smart_222_raw smart_223_raw smart_224_raw smart_225_raw smart_226_raw smart_231_raw smart_232_raw smart_233_raw smart_235_raw smart_240_raw smart_241_raw smart_242_raw smart_250_raw smart_251_raw smart_252_raw smart_254_raw smart_255_raw
count 1.097511e+07 3.028448e+06 1.096632e+07 1.096632e+07 1.096632e+07 1.096632e+07 3.028448e+06 1.097511e+07 1.096632e+07 70494.000000 1.097511e+07 0.0 0.0 8792.000000 8792.000000 323114.0 1.233138e+06 232122.0 232122.0 8792.0 8792.000000 8.792000e+03 8792.000000 8792.000000 0.0 0.0 0.0 1.842819e+06 4.194102e+06 7.912570e+06 7.912570e+06 4.194102e+06 7.912570e+06 4.572266e+06 1.097511e+07 1.092113e+07 1.097511e+07 6.168809e+06 3.053749e+06 1.096632e+07 1.096632e+07 1.096632e+07 3.898135e+06 0.0 8792.0 3.227220e+05 322722.000000 511435.000000 322722.0 4.519300e+04 322722.000000 8.792000e+03 8.792000e+03 8792.000000 8.792000e+03 8.241859e+06 8.065794e+06 8.065794e+06 0.0 0.0 0.0 26973.0 0.0
mean 8.802956e+07 6.700108e+01 2.421066e+02 9.593508e+00 2.241630e+01 1.387124e+09 2.320426e+01 1.956301e+04 3.230188e+01 676.745822 6.697014e+00 NaN NaN 93.155823 93.155823 0.0 9.999898e+01 0.0 0.0 0.0 243.410714 5.897619e+09 2.550728 4.982712 NaN NaN NaN 1.467631e+00 1.025249e-05 1.110177e+00 1.804494e+08 5.889907e+00 2.822723e+01 1.409016e+04 2.194653e+02 1.122817e+04 2.834573e+01 1.212009e+08 6.639966e-01 1.255862e-01 1.079844e-01 4.217150e-01 3.494200e+03 NaN 0.0 6.048954e+07 9188.189922 113.526147 0.0 3.241115e+05 458.527302 1.099512e+14 3.975942e+11 6209.844177 1.302403e+10 1.947989e+04 5.326082e+10 1.400945e+11 NaN NaN NaN 0.0 NaN
std 8.118987e+07 4.816420e+01 1.136119e+03 1.325780e+02 6.353874e+02 6.481567e+10 1.881317e+01 1.144213e+04 1.956962e+03 1010.676314 1.120958e+01 NaN NaN 19.106895 19.106895 0.0 1.346234e-01 0.0 0.0 0.0 312.487277 5.567485e+09 2.129612 3.346507 NaN NaN NaN 2.245870e+01 7.642989e-03 1.966771e+02 4.255038e+09 4.787608e+02 6.096319e+00 4.832565e+04 9.425638e+02 2.625259e+04 5.733469e+00 7.112637e+07 3.029599e+01 9.629527e+00 9.419638e+00 3.114347e+01 4.109417e+04 NaN 0.0 8.105917e+07 9011.095797 1232.030152 0.0 4.091719e+05 143.387204 1.522909e+00 3.333572e+10 5098.537468 1.069241e+10 1.154593e+04 1.592336e+10 1.668076e+11 NaN NaN NaN 0.0 NaN
min 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000 0.000000e+00 NaN NaN 44.000000 44.000000 0.0 5.800000e+01 0.0 0.0 0.0 50.000000 4.294967e+09 0.000000 0.000000 NaN NaN NaN 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.400000e+01 0.000000e+00 0.000000e+00 1.000000e+00 1.200000e+01 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 NaN 0.0 0.000000e+00 0.000000 0.000000 0.0 3.480200e+04 160.000000 1.099512e+14 2.705829e+11 242.000000 5.086516e+08 0.000000e+00 0.000000e+00 1.000000e+00 NaN NaN NaN 0.0 NaN
25% 0.000000e+00 0.000000e+00 0.000000e+00 3.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 9.310000e+03 0.000000e+00 0.000000 3.000000e+00 NaN NaN 83.000000 83.000000 0.0 1.000000e+02 0.0 0.0 0.0 101.000000 4.295426e+09 1.000000 2.000000 NaN NaN NaN 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 2.300000e+01 0.000000e+00 1.000000e+00 3.950000e+02 2.400000e+01 5.952610e+07 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 NaN 0.0 0.000000e+00 826.000000 0.000000 0.0 9.649900e+04 370.000000 1.099512e+14 3.865471e+11 1716.000000 3.600462e+09 1.129700e+04 4.930868e+10 1.131444e+11 NaN NaN NaN 0.0 NaN
50% 7.481462e+07 9.800000e+01 0.000000e+00 6.000000e+00 0.000000e+00 3.903726e+08 1.800000e+01 1.901100e+04 0.000000e+00 336.000000 5.000000e+00 NaN NaN 93.000000 93.000000 0.0 1.000000e+02 0.0 0.0 0.0 154.000000 4.296475e+09 2.000000 5.000000 NaN NaN NaN 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 2.800000e+01 2.300000e+01 8.000000e+01 1.079000e+03 2.800000e+01 1.214947e+08 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 NaN 0.0 1.756365e+07 9324.000000 0.000000 0.0 1.515510e+05 534.000000 1.099512e+14 4.080219e+11 5171.000000 1.084498e+10 1.847300e+04 5.513324e+10 1.545293e+11 NaN NaN NaN 0.0 NaN
75% 1.599356e+08 1.020000e+02 0.000000e+00 9.000000e+00 0.000000e+00 8.594137e+08 4.200000e+01 2.920900e+04 0.000000e+00 1089.000000 9.000000e+00 NaN NaN 102.000000 102.000000 0.0 1.000000e+02 0.0 0.0 0.0 287.000000 4.297917e+09 3.000000 8.000000 NaN NaN NaN 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 3.200000e+01 1.241300e+04 2.470000e+02 1.124700e+04 3.200000e+01 1.831201e+08 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 NaN 0.0 1.019740e+08 11160.000000 0.000000 0.0 3.400660e+05 537.000000 1.099512e+14 4.209068e+11 9380.250000 1.967301e+10 2.938600e+04 6.324645e+10 1.742499e+11 NaN NaN NaN 0.0 NaN
max 1.126895e+09 1.061000e+03 1.114200e+04 2.514100e+04 6.552800e+04 1.457377e+13 4.500000e+01 5.939300e+04 3.276800e+05 13377.000000 8.690000e+02 NaN NaN 171.000000 171.000000 0.0 1.000000e+02 0.0 0.0 0.0 1979.000000 8.590406e+10 18.000000 11.000000 NaN NaN NaN 2.053000e+03 9.000000e+00 6.553500e+04 6.013057e+11 6.553500e+04 6.700000e+01 4.881618e+06 2.638800e+04 1.123031e+06 7.500000e+01 2.441406e+08 5.248000e+03 3.576000e+03 3.576000e+03 7.670000e+03 1.383907e+06 NaN 0.0 2.874409e+08 40503.000000 65536.000000 0.0 2.400769e+06 647.000000 1.099512e+14 4.380867e+11 23751.000000 4.981004e+10 5.897000e+04 2.053460e+11 2.811011e+13 NaN NaN NaN 0.0 NaN

The count row is equivalent to the number of non-null values. If a column has a count of 0, every single value in it is NaN or null, and should be deleted.

In [72]:
description_df.iloc[0]
Out[72]:
smart_1_raw      10975111.0
smart_2_raw       3028448.0
smart_3_raw      10966319.0
smart_4_raw      10966319.0
smart_5_raw      10966319.0
smart_7_raw      10966319.0
smart_8_raw       3028448.0
smart_9_raw      10975111.0
smart_10_raw     10966319.0
smart_11_raw        70494.0
smart_12_raw     10975111.0
smart_13_raw            0.0
smart_15_raw            0.0
smart_16_raw         8792.0
smart_17_raw         8792.0
smart_18_raw       323114.0
smart_22_raw      1233138.0
smart_23_raw       232122.0
smart_24_raw       232122.0
smart_168_raw        8792.0
smart_170_raw        8792.0
smart_173_raw        8792.0
smart_174_raw        8792.0
smart_177_raw        8792.0
smart_179_raw           0.0
smart_181_raw           0.0
smart_182_raw           0.0
smart_183_raw     1842819.0
smart_184_raw     4194102.0
smart_187_raw     7912570.0
smart_188_raw     7912570.0
smart_189_raw     4194102.0
smart_190_raw     7912570.0
smart_191_raw     4572266.0
smart_192_raw    10975111.0
smart_193_raw    10921126.0
smart_194_raw    10975111.0
smart_195_raw     6168809.0
smart_196_raw     3053749.0
smart_197_raw    10966319.0
smart_198_raw    10966319.0
smart_199_raw    10966319.0
smart_200_raw     3898135.0
smart_201_raw           0.0
smart_218_raw        8792.0
smart_220_raw      322722.0
smart_222_raw      322722.0
smart_223_raw      511435.0
smart_224_raw      322722.0
smart_225_raw       45193.0
smart_226_raw      322722.0
smart_231_raw        8792.0
smart_232_raw        8792.0
smart_233_raw        8792.0
smart_235_raw        8792.0
smart_240_raw     8241859.0
smart_241_raw     8065794.0
smart_242_raw     8065794.0
smart_250_raw           0.0
smart_251_raw           0.0
smart_252_raw           0.0
smart_254_raw       26973.0
smart_255_raw           0.0
Name: count, dtype: float64

smart_13_raw, smart_15_raw, smart_179_raw, smart_181_raw, smart_182_raw, smart_201_raw, smart_250_raw, smart_251_raw, smart_252_raw, and smart_255_raw are all empty in this dataset, as all rows have NaN values in these columns.

In [73]:
count_df = pd.DataFrame()
count_df['count'] = description_df.iloc[0]
count_df
Out[73]:
count
smart_1_raw 10975111.0
smart_2_raw 3028448.0
smart_3_raw 10966319.0
smart_4_raw 10966319.0
smart_5_raw 10966319.0
smart_7_raw 10966319.0
smart_8_raw 3028448.0
smart_9_raw 10975111.0
smart_10_raw 10966319.0
smart_11_raw 70494.0
smart_12_raw 10975111.0
smart_13_raw 0.0
smart_15_raw 0.0
smart_16_raw 8792.0
smart_17_raw 8792.0
smart_18_raw 323114.0
smart_22_raw 1233138.0
smart_23_raw 232122.0
smart_24_raw 232122.0
smart_168_raw 8792.0
smart_170_raw 8792.0
smart_173_raw 8792.0
smart_174_raw 8792.0
smart_177_raw 8792.0
smart_179_raw 0.0
smart_181_raw 0.0
smart_182_raw 0.0
smart_183_raw 1842819.0
smart_184_raw 4194102.0
smart_187_raw 7912570.0
smart_188_raw 7912570.0
smart_189_raw 4194102.0
smart_190_raw 7912570.0
smart_191_raw 4572266.0
smart_192_raw 10975111.0
smart_193_raw 10921126.0
smart_194_raw 10975111.0
smart_195_raw 6168809.0
smart_196_raw 3053749.0
smart_197_raw 10966319.0
smart_198_raw 10966319.0
smart_199_raw 10966319.0
smart_200_raw 3898135.0
smart_201_raw 0.0
smart_218_raw 8792.0
smart_220_raw 322722.0
smart_222_raw 322722.0
smart_223_raw 511435.0
smart_224_raw 322722.0
smart_225_raw 45193.0
smart_226_raw 322722.0
smart_231_raw 8792.0
smart_232_raw 8792.0
smart_233_raw 8792.0
smart_235_raw 8792.0
smart_240_raw 8241859.0
smart_241_raw 8065794.0
smart_242_raw 8065794.0
smart_250_raw 0.0
smart_251_raw 0.0
smart_252_raw 0.0
smart_254_raw 26973.0
smart_255_raw 0.0
In [74]:
# Pandas styling function
def highlight_count_nans1(val):
    if val >= 66.6:
        color = 'green'
    elif val >= 33.3 and val < 66.6:
        color = 'yellow'
    else:
        color = 'red'
        
    return 'color: %s' % color
In [75]:
# Pandas styling function
def highlight_count_nans2(val):
    green = int((val * 255) / 100)
    red = int(255 - green)
    rgb = (red, green, 0)

    # Convert to hexadecimal for pandas styling
    color = '#%02x%02x%02x' % rgb
    
    return 'color: %s' % color
In [76]:
count_df['perc_not_nan'] = (count_df['count'] / n_rows) * 100
count_df
Out[76]:
count perc_not_nan
smart_1_raw 10975111.0 99.999982
smart_2_raw 3028448.0 27.593775
smart_3_raw 10966319.0 99.919873
smart_4_raw 10966319.0 99.919873
smart_5_raw 10966319.0 99.919873
smart_7_raw 10966319.0 99.919873
smart_8_raw 3028448.0 27.593775
smart_9_raw 10975111.0 99.999982
smart_10_raw 10966319.0 99.919873
smart_11_raw 70494.0 0.642308
smart_12_raw 10975111.0 99.999982
smart_13_raw 0.0 0.000000
smart_15_raw 0.0 0.000000
smart_16_raw 8792.0 0.080109
smart_17_raw 8792.0 0.080109
smart_18_raw 323114.0 2.944061
smart_22_raw 1233138.0 11.235766
smart_23_raw 232122.0 2.114985
smart_24_raw 232122.0 2.114985
smart_168_raw 8792.0 0.080109
smart_170_raw 8792.0 0.080109
smart_173_raw 8792.0 0.080109
smart_174_raw 8792.0 0.080109
smart_177_raw 8792.0 0.080109
smart_179_raw 0.0 0.000000
smart_181_raw 0.0 0.000000
smart_182_raw 0.0 0.000000
smart_183_raw 1842819.0 16.790889
smart_184_raw 4194102.0 38.214659
smart_187_raw 7912570.0 72.095567
smart_188_raw 7912570.0 72.095567
smart_189_raw 4194102.0 38.214659
smart_190_raw 7912570.0 72.095567
smart_191_raw 4572266.0 41.660309
smart_192_raw 10975111.0 99.999982
smart_193_raw 10921126.0 99.508096
smart_194_raw 10975111.0 99.999982
smart_195_raw 6168809.0 56.207248
smart_196_raw 3053749.0 27.824306
smart_197_raw 10966319.0 99.919873
smart_198_raw 10966319.0 99.919873
smart_199_raw 10966319.0 99.919873
smart_200_raw 3898135.0 35.517949
smart_201_raw 0.0 0.000000
smart_218_raw 8792.0 0.080109
smart_220_raw 322722.0 2.940489
smart_222_raw 322722.0 2.940489
smart_223_raw 511435.0 4.659952
smart_224_raw 322722.0 2.940489
smart_225_raw 45193.0 0.411777
smart_226_raw 322722.0 2.940489
smart_231_raw 8792.0 0.080109
smart_232_raw 8792.0 0.080109
smart_233_raw 8792.0 0.080109
smart_235_raw 8792.0 0.080109
smart_240_raw 8241859.0 75.095892
smart_241_raw 8065794.0 73.491672
smart_242_raw 8065794.0 73.491672
smart_250_raw 0.0 0.000000
smart_251_raw 0.0 0.000000
smart_252_raw 0.0 0.000000
smart_254_raw 26973.0 0.245765
smart_255_raw 0.0 0.000000
In [77]:
count_df.style.applymap(highlight_count_nans1, subset = ['perc_not_nan'])
Out[77]:
count perc_not_nan
smart_1_raw 1.09751e+07 100
smart_2_raw 3.02845e+06 27.5938
smart_3_raw 1.09663e+07 99.9199
smart_4_raw 1.09663e+07 99.9199
smart_5_raw 1.09663e+07 99.9199
smart_7_raw 1.09663e+07 99.9199
smart_8_raw 3.02845e+06 27.5938
smart_9_raw 1.09751e+07 100
smart_10_raw 1.09663e+07 99.9199
smart_11_raw 70494 0.642308
smart_12_raw 1.09751e+07 100
smart_13_raw 0 0
smart_15_raw 0 0
smart_16_raw 8792 0.0801085
smart_17_raw 8792 0.0801085
smart_18_raw 323114 2.94406
smart_22_raw 1.23314e+06 11.2358
smart_23_raw 232122 2.11499
smart_24_raw 232122 2.11499
smart_168_raw 8792 0.0801085
smart_170_raw 8792 0.0801085
smart_173_raw 8792 0.0801085
smart_174_raw 8792 0.0801085
smart_177_raw 8792 0.0801085
smart_179_raw 0 0
smart_181_raw 0 0
smart_182_raw 0 0
smart_183_raw 1.84282e+06 16.7909
smart_184_raw 4.1941e+06 38.2147
smart_187_raw 7.91257e+06 72.0956
smart_188_raw 7.91257e+06 72.0956
smart_189_raw 4.1941e+06 38.2147
smart_190_raw 7.91257e+06 72.0956
smart_191_raw 4.57227e+06 41.6603
smart_192_raw 1.09751e+07 100
smart_193_raw 1.09211e+07 99.5081
smart_194_raw 1.09751e+07 100
smart_195_raw 6.16881e+06 56.2072
smart_196_raw 3.05375e+06 27.8243
smart_197_raw 1.09663e+07 99.9199
smart_198_raw 1.09663e+07 99.9199
smart_199_raw 1.09663e+07 99.9199
smart_200_raw 3.89814e+06 35.5179
smart_201_raw 0 0
smart_218_raw 8792 0.0801085
smart_220_raw 322722 2.94049
smart_222_raw 322722 2.94049
smart_223_raw 511435 4.65995
smart_224_raw 322722 2.94049
smart_225_raw 45193 0.411777
smart_226_raw 322722 2.94049
smart_231_raw 8792 0.0801085
smart_232_raw 8792 0.0801085
smart_233_raw 8792 0.0801085
smart_235_raw 8792 0.0801085
smart_240_raw 8.24186e+06 75.0959
smart_241_raw 8.06579e+06 73.4917
smart_242_raw 8.06579e+06 73.4917
smart_250_raw 0 0
smart_251_raw 0 0
smart_252_raw 0 0
smart_254_raw 26973 0.245765
smart_255_raw 0 0
In [78]:
count_df['bar'] = count_df['perc_not_nan']
count_df.style.\
    applymap(highlight_count_nans2, subset = ['perc_not_nan']).\
    bar(subset=['bar'], color='#d65f5f')
Out[78]:
count perc_not_nan bar
smart_1_raw 1.09751e+07 100 100
smart_2_raw 3.02845e+06 27.5938 27.5938
smart_3_raw 1.09663e+07 99.9199 99.9199
smart_4_raw 1.09663e+07 99.9199 99.9199
smart_5_raw 1.09663e+07 99.9199 99.9199
smart_7_raw 1.09663e+07 99.9199 99.9199
smart_8_raw 3.02845e+06 27.5938 27.5938
smart_9_raw 1.09751e+07 100 100
smart_10_raw 1.09663e+07 99.9199 99.9199
smart_11_raw 70494 0.642308 0.642308
smart_12_raw 1.09751e+07 100 100
smart_13_raw 0 0 0
smart_15_raw 0 0 0
smart_16_raw 8792 0.0801085 0.0801085
smart_17_raw 8792 0.0801085 0.0801085
smart_18_raw 323114 2.94406 2.94406
smart_22_raw 1.23314e+06 11.2358 11.2358
smart_23_raw 232122 2.11499 2.11499
smart_24_raw 232122 2.11499 2.11499
smart_168_raw 8792 0.0801085 0.0801085
smart_170_raw 8792 0.0801085 0.0801085
smart_173_raw 8792 0.0801085 0.0801085
smart_174_raw 8792 0.0801085 0.0801085
smart_177_raw 8792 0.0801085 0.0801085
smart_179_raw 0 0 0
smart_181_raw 0 0 0
smart_182_raw 0 0 0
smart_183_raw 1.84282e+06 16.7909 16.7909
smart_184_raw 4.1941e+06 38.2147 38.2147
smart_187_raw 7.91257e+06 72.0956 72.0956
smart_188_raw 7.91257e+06 72.0956 72.0956
smart_189_raw 4.1941e+06 38.2147 38.2147
smart_190_raw 7.91257e+06 72.0956 72.0956
smart_191_raw 4.57227e+06 41.6603 41.6603
smart_192_raw 1.09751e+07 100 100
smart_193_raw 1.09211e+07 99.5081 99.5081
smart_194_raw 1.09751e+07 100 100
smart_195_raw 6.16881e+06 56.2072 56.2072
smart_196_raw 3.05375e+06 27.8243 27.8243
smart_197_raw 1.09663e+07 99.9199 99.9199
smart_198_raw 1.09663e+07 99.9199 99.9199
smart_199_raw 1.09663e+07 99.9199 99.9199
smart_200_raw 3.89814e+06 35.5179 35.5179
smart_201_raw 0 0 0
smart_218_raw 8792 0.0801085 0.0801085
smart_220_raw 322722 2.94049 2.94049
smart_222_raw 322722 2.94049 2.94049
smart_223_raw 511435 4.65995 4.65995
smart_224_raw 322722 2.94049 2.94049
smart_225_raw 45193 0.411777 0.411777
smart_226_raw 322722 2.94049 2.94049
smart_231_raw 8792 0.0801085 0.0801085
smart_232_raw 8792 0.0801085 0.0801085
smart_233_raw 8792 0.0801085 0.0801085
smart_235_raw 8792 0.0801085 0.0801085
smart_240_raw 8.24186e+06 75.0959 75.0959
smart_241_raw 8.06579e+06 73.4917 73.4917
smart_242_raw 8.06579e+06 73.4917 73.4917
smart_250_raw 0 0 0
smart_251_raw 0 0 0
smart_252_raw 0 0 0
smart_254_raw 26973 0.245765 0.245765
smart_255_raw 0 0 0
In [79]:
empty_columns = []
columns_to_examine = []

for row in count_df.iterrows():
    if row[1][0] == 0.0:
        empty_columns.append(row[0])
        
    elif row[1][0] < (0.8 * n_rows):
        columns_to_examine.append(row[0])
        
        
empty_columns
Out[79]:
['smart_13_raw',
 'smart_15_raw',
 'smart_179_raw',
 'smart_181_raw',
 'smart_182_raw',
 'smart_201_raw',
 'smart_250_raw',
 'smart_251_raw',
 'smart_252_raw',
 'smart_255_raw']
In [80]:
columns_to_examine
Out[80]:
['smart_2_raw',
 'smart_8_raw',
 'smart_11_raw',
 'smart_16_raw',
 'smart_17_raw',
 'smart_18_raw',
 'smart_22_raw',
 'smart_23_raw',
 'smart_24_raw',
 'smart_168_raw',
 'smart_170_raw',
 'smart_173_raw',
 'smart_174_raw',
 'smart_177_raw',
 'smart_183_raw',
 'smart_184_raw',
 'smart_187_raw',
 'smart_188_raw',
 'smart_189_raw',
 'smart_190_raw',
 'smart_191_raw',
 'smart_195_raw',
 'smart_196_raw',
 'smart_200_raw',
 'smart_218_raw',
 'smart_220_raw',
 'smart_222_raw',
 'smart_223_raw',
 'smart_224_raw',
 'smart_225_raw',
 'smart_226_raw',
 'smart_231_raw',
 'smart_232_raw',
 'smart_233_raw',
 'smart_235_raw',
 'smart_240_raw',
 'smart_241_raw',
 'smart_242_raw',
 'smart_254_raw']
In [81]:
before_mem = df.memory_usage(deep=True).sum()
df.drop(empty_columns, axis=1, inplace=True)
after_mem = df.memory_usage(deep=True).sum()
memory_saved = before_mem - after_mem
print("Memory saved on empty column removal: " + \
      str(np.around((memory_saved / 1024 ** 2), 2)) + "MB")
Memory saved on empty column removal: 837.33MB
In [90]:
# Free up memory for the next computation.
try:
    del [empty_columns, manufacturing_dict, before_mem, after_mem, memory_saved, fail_df, corr_df]
    print("Memory successfully cleared.")
except:
    pass
In [82]:
# Save the current form of the dataframe for restoration after the following calculations are performed.
if not os.path.isfile('pre_viz_df.csv'):
    df.to_csv("pre_viz_df.csv", index = False)
In [3]:
df = pd.read_csv('pre_viz_df.csv')
In [4]:
viz_df = df.drop(['date', 'serial_number', 'failure', 'model', 'manufacturer', 'capacity_TB'], axis = 1)
viz_df.columns
Out[4]:
Index(['smart_1_raw', 'smart_2_raw', 'smart_3_raw', 'smart_4_raw',
       'smart_5_raw', 'smart_7_raw', 'smart_8_raw', 'smart_9_raw',
       'smart_10_raw', 'smart_11_raw', 'smart_12_raw', 'smart_16_raw',
       'smart_17_raw', 'smart_18_raw', 'smart_22_raw', 'smart_23_raw',
       'smart_24_raw', 'smart_168_raw', 'smart_170_raw', 'smart_173_raw',
       'smart_174_raw', 'smart_177_raw', 'smart_183_raw', 'smart_184_raw',
       'smart_187_raw', 'smart_188_raw', 'smart_189_raw', 'smart_190_raw',
       'smart_191_raw', 'smart_192_raw', 'smart_193_raw', 'smart_194_raw',
       'smart_195_raw', 'smart_196_raw', 'smart_197_raw', 'smart_198_raw',
       'smart_199_raw', 'smart_200_raw', 'smart_218_raw', 'smart_220_raw',
       'smart_222_raw', 'smart_223_raw', 'smart_224_raw', 'smart_225_raw',
       'smart_226_raw', 'smart_231_raw', 'smart_232_raw', 'smart_233_raw',
       'smart_235_raw', 'smart_240_raw', 'smart_241_raw', 'smart_242_raw',
       'smart_254_raw'],
      dtype='object')
In [5]:
# Free up memory for the next computation.
try:
    del df
    print("Memory successfully cleared.")
except:
    pass
Memory successfully cleared.
In [6]:
# Melt the df in chunks as df.melt() will take far too much memory.
pivot_list = list()
chunk_size = 250000

for i in range(0, len(viz_df), chunk_size):
    row_pivot = viz_df.iloc[i: i + chunk_size].melt()
    pivot_list.append(row_pivot)

melted = pd.concat(pivot_list)
del pivot_list
In [7]:
melted[0:30]
Out[7]:
variable value
0 smart_1_raw 97236416.0
1 smart_1_raw 4665536.0
2 smart_1_raw 92892872.0
3 smart_1_raw 231702544.0
4 smart_1_raw 0.0
5 smart_1_raw 117053872.0
6 smart_1_raw 194975656.0
7 smart_1_raw 121918904.0
8 smart_1_raw 22209920.0
9 smart_1_raw 119880096.0
10 smart_1_raw 161164360.0
11 smart_1_raw 40241952.0
12 smart_1_raw 41766200.0
13 smart_1_raw 90869464.0
14 smart_1_raw 206980416.0
15 smart_1_raw 122003344.0
16 smart_1_raw 0.0
17 smart_1_raw 0.0
18 smart_1_raw 0.0
19 smart_1_raw 0.0
20 smart_1_raw 0.0
21 smart_1_raw 0.0
22 smart_1_raw 144780968.0
23 smart_1_raw 0.0
24 smart_1_raw 0.0
25 smart_1_raw 44530656.0
26 smart_1_raw 0.0
27 smart_1_raw 0.0
28 smart_1_raw 0.0
29 smart_1_raw 119494112.0
In [8]:
# Free up memory for the next computation.
try:
    del viz_df
    print("Memory successfully cleared.")
except:
    pass

gc.collect()
Memory successfully cleared.
Out[8]:
40
In [9]:
g = sns.FacetGrid(
    melted,
    col = 'variable',
    hue = 'value',
    sharey = 'row',
    sharex = 'col',
    col_wrap = 7,
    legend_out = True,
)

g = g.map(sns.distplot).add_legend()

plt.subplots_adjust(top = 0.9)
g.fig.suptitle('Univariate Continuous Variable Distributions')

        
g.savefig("Charts/Univariate Distributions.svg")
g
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-9-e4bf2b072853> in <module>
      6     sharex = 'col',
      7     col_wrap = 7,
----> 8     legend_out = True,
      9 )
     10 

~\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\axisgrid.py in __init__(self, data, row, col, hue, col_wrap, sharex, sharey, height, aspect, palette, row_order, col_order, hue_order, hue_kws, dropna, legend_out, despine, margin_titles, xlim, ylim, subplot_kws, gridspec_kws, size)
    250             hue_names = utils.categorical_order(data[hue], hue_order)
    251 
--> 252         colors = self._get_palette(data, hue, hue_order, palette)
    253 
    254         # Set up the lists of names for the row and column facet variables

~\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\axisgrid.py in _get_palette(self, data, hue, hue_order, palette)
    163                 current_palette = utils.get_color_cycle()
    164                 if n_colors > len(current_palette):
--> 165                     colors = color_palette("husl", n_colors)
    166                 else:
    167                     colors = color_palette(n_colors=n_colors)

~\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\palettes.py in color_palette(palette, n_colors, desat)
    242     try:
    243         palette = map(mpl.colors.colorConverter.to_rgb, palette)
--> 244         palette = _ColorPalette(palette)
    245     except ValueError:
    246         raise ValueError("Could not generate a palette for %s" % str(palette))

~\Anaconda3\envs\pytorch2\lib\site-packages\matplotlib\colors.py in to_rgb(c)
    343 def to_rgb(c):
    344     """Convert *c* to an RGB color, silently dropping the alpha channel."""
--> 345     return to_rgba(c)[:3]
    346 
    347 

~\Anaconda3\envs\pytorch2\lib\site-packages\matplotlib\colors.py in to_rgba(c, alpha)
    183         rgba = None
    184     if rgba is None:  # Suppress exception chaining of cache lookup failure.
--> 185         rgba = _to_rgba_no_colorcycle(c, alpha)
    186         try:
    187             _colors_full_map.cache[c, alpha] = rgba

~\Anaconda3\envs\pytorch2\lib\site-packages\matplotlib\colors.py in _to_rgba_no_colorcycle(c, alpha)
    275     if alpha is not None:
    276         c = c[:3] + (alpha,)
--> 277     if any(elem < 0 or elem > 1 for elem in c):
    278         raise ValueError("RGBA values should be within 0-1 range")
    279     return c

~\Anaconda3\envs\pytorch2\lib\site-packages\matplotlib\colors.py in <genexpr>(.0)
    275     if alpha is not None:
    276         c = c[:3] + (alpha,)
--> 277     if any(elem < 0 or elem > 1 for elem in c):
    278         raise ValueError("RGBA values should be within 0-1 range")
    279     return c

KeyboardInterrupt: 

Unfortunately, this operation takes too much memory to do in this manner. Each column will have to be graphed separately and then the graphs combined into a single graphic for the same effect.

In [10]:
# Reset to the dataframes and memory allocations from before the graphing attempts.
try:
    del melted
except:
    pass

df = pd.read_csv('pre_viz_df.csv')
In [11]:
df.head()
Out[11]:
date serial_number model failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw ... smart_231_raw smart_232_raw smart_233_raw smart_235_raw smart_240_raw smart_241_raw smart_242_raw smart_254_raw manufacturer capacity_TB
0 10-01 Z305B2QN ST4000DM000 0 97236416.0 NaN 0.0 13.0 0.0 704304346.0 ... NaN NaN NaN NaN 33009.0 5.063798e+10 1.623458e+11 NaN Seagate 4.0
1 10-01 ZJV0XJQ4 ST12000NM0007 0 4665536.0 NaN 0.0 3.0 0.0 422822971.0 ... NaN NaN NaN NaN 9533.0 5.084775e+10 1.271356e+11 NaN Seagate 12.0
2 10-01 ZJV0XJQ3 ST12000NM0007 0 92892872.0 NaN 0.0 1.0 0.0 936518450.0 ... NaN NaN NaN NaN 6977.0 4.920827e+10 4.658787e+10 NaN Seagate 12.0
3 10-01 ZJV0XJQ0 ST12000NM0007 0 231702544.0 NaN 0.0 6.0 0.0 416687782.0 ... NaN NaN NaN NaN 10669.0 5.341374e+10 9.427903e+10 NaN Seagate 12.0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 0 0.0 103.0 436.0 9.0 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN HGST 4.0

5 rows × 59 columns

In [12]:
sns.distplot(df['smart_1_raw'])
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c7de8f8c08>
In [13]:
sns.distplot(df['smart_2_raw'])
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c82fd69ec8>
In [14]:
sns.distplot(df['smart_3_raw'], kde = False)
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c83200e888>
In [15]:
sns.distplot(df['smart_4_raw'])
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f32a4788>
In [16]:
sns.distplot(df['smart_5_raw'], kde = False)
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f324fe08>
In [17]:
sns.distplot(df['smart_7_raw'])
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f340f3c8>
In [18]:
sns.distplot(df['smart_8_raw'])
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f340f148>
In [19]:
sns.distplot(df['smart_9_raw'])
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f3505048>
In [20]:
sns.distplot(df['smart_10_raw'], kde = False)
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f374e0c8>
In [21]:
sns.distplot(df['smart_11_raw'])
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f37dab88>
In [22]:
sns.distplot(df['smart_12_raw'])
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f386bbc8>
In [23]:
sns.distplot(df['smart_16_raw'])
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f38ef688>
In [24]:
sns.distplot(df['smart_17_raw'])
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f3926c88>
In [25]:
sns.distplot(df['smart_18_raw'], kde = False)
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f37ac948>
In [26]:
sns.distplot(df['smart_22_raw'], kde = False)
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f3d0ef88>
In [27]:
sns.distplot(df['smart_23_raw'])
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\distributions.py:283: UserWarning: Data must have variance to compute a kernel density estimate.
  warnings.warn(msg, UserWarning)
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f3a7d748>
In [28]:
sns.distplot(df['smart_24_raw'])
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\distributions.py:283: UserWarning: Data must have variance to compute a kernel density estimate.
  warnings.warn(msg, UserWarning)
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f49886c8>
In [29]:
sns.distplot(df['smart_168_raw'])
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\distributions.py:283: UserWarning: Data must have variance to compute a kernel density estimate.
  warnings.warn(msg, UserWarning)
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f3b5f708>
In [30]:
sns.distplot(df['smart_170_raw'])
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f3bdbd08>
In [32]:
sns.distplot(df['smart_173_raw'], kde = False)
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f3c77048>
In [33]:
sns.distplot(df['smart_174_raw'])
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f4a4e388>
In [34]:
sns.distplot(df['smart_177_raw'])
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f4a66588>
In [35]:
sns.distplot(df['smart_183_raw'], kde_kws={'bw':0.1})
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f4b44148>
In [36]:
sns.distplot(df['smart_184_raw'], kde = False)
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c7d6fb7a08>
In [37]:
sns.distplot(df['smart_187_raw'], kde = False)
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f4c34188>
In [38]:
sns.distplot(df['smart_188_raw'], kde = False)
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f4c8be08>
In [39]:
sns.distplot(df['smart_189_raw'], kde = False)
Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f4d64d88>
In [40]:
sns.distplot(df['smart_190_raw'])
Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f4d7fd08>
In [41]:
sns.distplot(df['smart_191_raw'])
Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f4e7fcc8>
In [42]:
sns.distplot(df['smart_192_raw'])
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f4ed6448>
In [43]:
sns.distplot(df['smart_193_raw'])
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f4f73648>
In [44]:
sns.distplot(df['smart_194_raw'], kde = False)
Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f4f73108>
In [45]:
sns.distplot(df['smart_195_raw'], kde = False)
Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f508d308>
In [46]:
sns.distplot(df['smart_196_raw'], kde = False)
Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f510fb48>
In [47]:
sns.distplot(df['smart_197_raw'], kde = False)
Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f518f708>
In [48]:
sns.distplot(df['smart_198_raw'], kde = False)
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f522b348>
In [49]:
sns.distplot(df['smart_199_raw'], kde = False)
Out[49]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f522bbc8>
In [50]:
sns.distplot(df['smart_200_raw'], kde = False)
Out[50]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f5344348>
In [51]:
sns.distplot(df['smart_218_raw'], kde = False)
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\distributions.py:283: UserWarning: Data must have variance to compute a kernel density estimate.
  warnings.warn(msg, UserWarning)
Out[51]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f53c5b48>
In [52]:
sns.distplot(df['smart_220_raw'])
Out[52]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f544c708>
In [53]:
sns.distplot(df['smart_222_raw'])
Out[53]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f53fdd08>
In [54]:
sns.distplot(df['smart_223_raw'], kde = False)
Out[54]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f54a4d08>
In [55]:
sns.distplot(df['smart_224_raw'], kde = False)
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\distributions.py:283: UserWarning: Data must have variance to compute a kernel density estimate.
  warnings.warn(msg, UserWarning)
Out[55]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f558d5c8>
In [56]:
sns.distplot(df['smart_225_raw'], kde = False)
Out[56]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f56167c8>
In [57]:
sns.distplot(df['smart_226_raw'])
Out[57]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f5696848>
In [58]:
sns.distplot(df['smart_231_raw'], kde = False)
Out[58]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f5775388>
In [59]:
sns.distplot(df['smart_232_raw'])
Out[59]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f57ce248>
In [60]:
sns.distplot(df['smart_233_raw'])
Out[60]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f5813b48>
In [62]:
sns.distplot(df['smart_235_raw'],kde = False)
Out[62]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f592f8c8>
In [63]:
sns.distplot(df['smart_240_raw'])
Out[63]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f59a4b08>
In [65]:
sns.distplot(df['smart_241_raw'], kde = False)
Out[65]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f5aa8988>
In [66]:
sns.distplot(df['smart_242_raw'], kde = False)
Out[66]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f5b025c8>
In [67]:
sns.distplot(df['smart_254_raw'], kde = False)
Out[67]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c6f5b959c8>
In [68]:
fig, axes = plt.subplots(7, 8, figsize = (50, 40))

row = 0
col = 0
for df_col in ['smart_1_raw', 'smart_2_raw', 'smart_3_raw', 'smart_4_raw',
            'smart_5_raw', 'smart_7_raw', 'smart_8_raw', 'smart_9_raw',
            'smart_10_raw', 'smart_11_raw', 'smart_12_raw', 'smart_16_raw',
            'smart_17_raw', 'smart_18_raw', 'smart_22_raw', 'smart_23_raw',
            'smart_24_raw', 'smart_168_raw', 'smart_170_raw', 'smart_173_raw',
            'smart_174_raw', 'smart_177_raw', 'smart_183_raw', 'smart_184_raw',
            'smart_187_raw', 'smart_188_raw', 'smart_189_raw', 'smart_190_raw',
            'smart_191_raw', 'smart_192_raw', 'smart_193_raw', 'smart_194_raw',
            'smart_195_raw', 'smart_196_raw', 'smart_197_raw', 'smart_198_raw',
            'smart_199_raw', 'smart_200_raw', 'smart_218_raw', 'smart_220_raw',
            'smart_222_raw', 'smart_223_raw', 'smart_224_raw', 'smart_225_raw',
            'smart_226_raw', 'smart_231_raw', 'smart_232_raw', 'smart_233_raw',
            'smart_235_raw', 'smart_240_raw', 'smart_241_raw', 'smart_242_raw',
            'smart_254_raw']:
    
    if col == 8:
        row += 1
        col = 0
        
    sns.distplot(df[df_col], ax = axes[row, col], \
                 kde = False, norm_hist = False)
       
    col += 1
    

axes[6, 5].set_axis_off()
axes[6, 6].set_axis_off()
axes[6, 7].set_axis_off()

plt.subplots_adjust(top = 0.90)
fig.suptitle("Distribution of Raw SMART Values", fontsize = 96, y = 0.95)
fig.savefig("Charts/SMART Distributions.svg")
fig.savefig("Charts/SMART Distributions.png")
In [70]:
fig, axes = plt.subplots(7, 8, figsize = (50, 40))

row = 0
col = 0
for df_col in ['smart_1_raw', 'smart_2_raw', 'smart_3_raw', 'smart_4_raw',
            'smart_5_raw', 'smart_7_raw', 'smart_8_raw', 'smart_9_raw',
            'smart_10_raw', 'smart_11_raw', 'smart_12_raw', 'smart_16_raw',
            'smart_17_raw', 'smart_18_raw', 'smart_22_raw', 'smart_23_raw',
            'smart_24_raw', 'smart_168_raw', 'smart_170_raw', 'smart_173_raw',
            'smart_174_raw', 'smart_177_raw', 'smart_183_raw', 'smart_184_raw',
            'smart_187_raw', 'smart_188_raw', 'smart_189_raw', 'smart_190_raw',
            'smart_191_raw', 'smart_192_raw', 'smart_193_raw', 'smart_194_raw',
            'smart_195_raw', 'smart_196_raw', 'smart_197_raw', 'smart_198_raw',
            'smart_199_raw', 'smart_200_raw', 'smart_218_raw', 'smart_220_raw',
            'smart_222_raw', 'smart_223_raw', 'smart_224_raw', 'smart_225_raw',
            'smart_226_raw', 'smart_231_raw', 'smart_232_raw', 'smart_233_raw',
            'smart_235_raw', 'smart_240_raw', 'smart_241_raw', 'smart_242_raw',
            'smart_254_raw']:
    
    if col == 8:
        row += 1
        col = 0
        
    try:
        sns.distplot(df[df_col], ax = axes[row][col], norm_hist = True)
    except:
        sns.distplot(df[df_col], kde_kws = {'bw': 0.1}, ax = axes[row][col], norm_hist = True)
        
    col += 1
    

axes[6, 5].set_axis_off()
axes[6, 6].set_axis_off()
axes[6, 7].set_axis_off()

plt.subplots_adjust(top = 0.90)
fig.suptitle("Distribution and KDE of Raw SMART Values", fontsize = 96, y = 0.95)
fig.savefig("Charts/SMART Distributions KDE.svg")
fig.savefig("Charts/SMART Distributions KDE.png")
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\distributions.py:369: UserWarning: Default bandwidth for data is 0; skipping density estimation.
  warnings.warn(msg, UserWarning)
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\distributions.py:369: UserWarning: Default bandwidth for data is 0; skipping density estimation.
  warnings.warn(msg, UserWarning)
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\distributions.py:369: UserWarning: Default bandwidth for data is 0; skipping density estimation.
  warnings.warn(msg, UserWarning)
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\distributions.py:283: UserWarning: Data must have variance to compute a kernel density estimate.
  warnings.warn(msg, UserWarning)
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\distributions.py:369: UserWarning: Default bandwidth for data is 0; skipping density estimation.
  warnings.warn(msg, UserWarning)
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\distributions.py:283: UserWarning: Data must have variance to compute a kernel density estimate.
  warnings.warn(msg, UserWarning)
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\distributions.py:283: UserWarning: Data must have variance to compute a kernel density estimate.
  warnings.warn(msg, UserWarning)
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\distributions.py:283: UserWarning: Data must have variance to compute a kernel density estimate.
  warnings.warn(msg, UserWarning)
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\distributions.py:369: UserWarning: Default bandwidth for data is 0; skipping density estimation.
  warnings.warn(msg, UserWarning)
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\distributions.py:369: UserWarning: Default bandwidth for data is 0; skipping density estimation.
  warnings.warn(msg, UserWarning)
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\distributions.py:369: UserWarning: Default bandwidth for data is 0; skipping density estimation.
  warnings.warn(msg, UserWarning)
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\distributions.py:369: UserWarning: Default bandwidth for data is 0; skipping density estimation.
  warnings.warn(msg, UserWarning)
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\distributions.py:369: UserWarning: Default bandwidth for data is 0; skipping density estimation.
  warnings.warn(msg, UserWarning)
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\distributions.py:369: UserWarning: Default bandwidth for data is 0; skipping density estimation.
  warnings.warn(msg, UserWarning)
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\distributions.py:369: UserWarning: Default bandwidth for data is 0; skipping density estimation.
  warnings.warn(msg, UserWarning)
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\distributions.py:369: UserWarning: Default bandwidth for data is 0; skipping density estimation.
  warnings.warn(msg, UserWarning)
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\distributions.py:369: UserWarning: Default bandwidth for data is 0; skipping density estimation.
  warnings.warn(msg, UserWarning)
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\distributions.py:369: UserWarning: Default bandwidth for data is 0; skipping density estimation.
  warnings.warn(msg, UserWarning)
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\distributions.py:283: UserWarning: Data must have variance to compute a kernel density estimate.
  warnings.warn(msg, UserWarning)
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\distributions.py:369: UserWarning: Default bandwidth for data is 0; skipping density estimation.
  warnings.warn(msg, UserWarning)
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\distributions.py:283: UserWarning: Data must have variance to compute a kernel density estimate.
  warnings.warn(msg, UserWarning)
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\seaborn\distributions.py:283: UserWarning: Data must have variance to compute a kernel density estimate.
  warnings.warn(msg, UserWarning)
In [71]:
# Free up memory for the next section.
try:
    del [highlight_nans, manu_nan_df, model_nan_df, model_nan_percent_df,
         description_df, count_df, highlight_count_nans1,
         highlight_counts_nans2, empty_columns]
    print("Memory successfully cleared.")
except:
    pass

gc.collect()
Out[71]:
185

With some dataset tidying complete, the final major dataset adjustments that need to be made before analysis can be performed is that the NaN values need dealt with. The rows or columns with them can be removed, or they can be filled in through interpolation or estimation.

In [73]:
columns_to_examine = ['smart_13_raw', 'smart_15_raw', 'smart_179_raw',
                      'smart_181_raw', 'smart_182_raw', 'smart_201_raw',
                      'smart_250_raw', 'smart_251_raw', 'smart_252_raw',
                      'smart_255_raw']

columns_to_examine
Out[73]:
['smart_13_raw',
 'smart_15_raw',
 'smart_179_raw',
 'smart_181_raw',
 'smart_182_raw',
 'smart_201_raw',
 'smart_250_raw',
 'smart_251_raw',
 'smart_252_raw',
 'smart_255_raw']
In [4]:
#### Memory Management and Reloading Checkpoint
df = pd.read_csv('pre_viz_df.csv')
n_rows = len(df)
df['date'] = df['date'].astype('category')
df['model'] = df['model'].astype('category')
df['failure'] = df['failure'].astype('bool')
df['manufacturer'] = df['manufacturer'].astype('category')
In [74]:
df.isnull().sum().sort_values()
Out[74]:
date                    0
manufacturer            0
failure                 0
capacity_TB             0
model                   0
serial_number           0
smart_1_raw             2
smart_192_raw           2
smart_9_raw             2
smart_12_raw            2
smart_194_raw           2
smart_3_raw          8794
smart_4_raw          8794
smart_5_raw          8794
smart_7_raw          8794
smart_10_raw         8794
smart_199_raw        8794
smart_198_raw        8794
smart_197_raw        8794
smart_193_raw       53987
smart_240_raw     2733254
smart_242_raw     2909319
smart_241_raw     2909319
smart_187_raw     3062543
smart_190_raw     3062543
smart_188_raw     3062543
smart_195_raw     4806304
smart_191_raw     6402847
smart_184_raw     6781011
smart_189_raw     6781011
smart_200_raw     7076978
smart_196_raw     7921364
smart_8_raw       7946665
smart_2_raw       7946665
smart_183_raw     9132294
smart_22_raw      9741975
smart_223_raw    10463678
smart_18_raw     10651999
smart_224_raw    10652391
smart_220_raw    10652391
smart_222_raw    10652391
smart_226_raw    10652391
smart_23_raw     10742991
smart_24_raw     10742991
smart_11_raw     10904619
smart_225_raw    10929920
smart_254_raw    10948140
smart_235_raw    10966321
smart_233_raw    10966321
smart_232_raw    10966321
smart_168_raw    10966321
smart_170_raw    10966321
smart_218_raw    10966321
smart_174_raw    10966321
smart_16_raw     10966321
smart_17_raw     10966321
smart_173_raw    10966321
smart_231_raw    10966321
smart_177_raw    10966321
dtype: int64

The first five mostly complete columns all have two NaNs, which are the result of two rows that have no raw smart values at all. Both drives failed, making them quite important for predicting future failure. However, the lack of data makes them useless for predicting future failure in their current form.

The most likely scenario is that both drives failed just before the diagnostics were collected. As such, these two rows will be deleted and their associated row for the date before their currently marked failures will be updated to have failed that day.

In [75]:
df.loc[df['smart_1_raw'].isnull() & df['smart_192_raw'].isnull() & \
       df['smart_9_raw'].isnull() & df['smart_12_raw'].isnull() & \
       df['smart_194_raw'].isnull()]
Out[75]:
date serial_number model failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw ... smart_231_raw smart_232_raw smart_233_raw smart_235_raw smart_240_raw smart_241_raw smart_242_raw smart_254_raw manufacturer capacity_TB
4632946 11-10 ZJV00DR4 ST12000NM0007 1 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN Seagate 12.0
4797700 11-11 ZHZ3M097 ST12000NM0008 1 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN Seagate 12.0

2 rows × 59 columns

In [76]:
df.iloc[4632946]
Out[76]:
date                     11-10
serial_number         ZJV00DR4
model            ST12000NM0007
failure                      1
smart_1_raw                NaN
smart_2_raw                NaN
smart_3_raw                NaN
smart_4_raw                NaN
smart_5_raw                NaN
smart_7_raw                NaN
smart_8_raw                NaN
smart_9_raw                NaN
smart_10_raw               NaN
smart_11_raw               NaN
smart_12_raw               NaN
smart_16_raw               NaN
smart_17_raw               NaN
smart_18_raw               NaN
smart_22_raw               NaN
smart_23_raw               NaN
smart_24_raw               NaN
smart_168_raw              NaN
smart_170_raw              NaN
smart_173_raw              NaN
smart_174_raw              NaN
smart_177_raw              NaN
smart_183_raw              NaN
smart_184_raw              NaN
smart_187_raw              NaN
smart_188_raw              NaN
smart_189_raw              NaN
smart_190_raw              NaN
smart_191_raw              NaN
smart_192_raw              NaN
smart_193_raw              NaN
smart_194_raw              NaN
smart_195_raw              NaN
smart_196_raw              NaN
smart_197_raw              NaN
smart_198_raw              NaN
smart_199_raw              NaN
smart_200_raw              NaN
smart_218_raw              NaN
smart_220_raw              NaN
smart_222_raw              NaN
smart_223_raw              NaN
smart_224_raw              NaN
smart_225_raw              NaN
smart_226_raw              NaN
smart_231_raw              NaN
smart_232_raw              NaN
smart_233_raw              NaN
smart_235_raw              NaN
smart_240_raw              NaN
smart_241_raw              NaN
smart_242_raw              NaN
smart_254_raw              NaN
manufacturer           Seagate
capacity_TB                 12
Name: 4632946, dtype: object
In [77]:
df.iloc[4797700]
Out[77]:
date                     11-11
serial_number         ZHZ3M097
model            ST12000NM0008
failure                      1
smart_1_raw                NaN
smart_2_raw                NaN
smart_3_raw                NaN
smart_4_raw                NaN
smart_5_raw                NaN
smart_7_raw                NaN
smart_8_raw                NaN
smart_9_raw                NaN
smart_10_raw               NaN
smart_11_raw               NaN
smart_12_raw               NaN
smart_16_raw               NaN
smart_17_raw               NaN
smart_18_raw               NaN
smart_22_raw               NaN
smart_23_raw               NaN
smart_24_raw               NaN
smart_168_raw              NaN
smart_170_raw              NaN
smart_173_raw              NaN
smart_174_raw              NaN
smart_177_raw              NaN
smart_183_raw              NaN
smart_184_raw              NaN
smart_187_raw              NaN
smart_188_raw              NaN
smart_189_raw              NaN
smart_190_raw              NaN
smart_191_raw              NaN
smart_192_raw              NaN
smart_193_raw              NaN
smart_194_raw              NaN
smart_195_raw              NaN
smart_196_raw              NaN
smart_197_raw              NaN
smart_198_raw              NaN
smart_199_raw              NaN
smart_200_raw              NaN
smart_218_raw              NaN
smart_220_raw              NaN
smart_222_raw              NaN
smart_223_raw              NaN
smart_224_raw              NaN
smart_225_raw              NaN
smart_226_raw              NaN
smart_231_raw              NaN
smart_232_raw              NaN
smart_233_raw              NaN
smart_235_raw              NaN
smart_240_raw              NaN
smart_241_raw              NaN
smart_242_raw              NaN
smart_254_raw              NaN
manufacturer           Seagate
capacity_TB                 12
Name: 4797700, dtype: object
In [78]:
df.loc[(df['serial_number'] == 'ZJV00DR4') & (df['date'] == '11-09')]
Out[78]:
date serial_number model failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw ... smart_231_raw smart_232_raw smart_233_raw smart_235_raw smart_240_raw smart_241_raw smart_242_raw smart_254_raw manufacturer capacity_TB
4514189 11-09 ZJV00DR4 ST12000NM0007 0 118859320.0 NaN 0.0 6.0 24.0 183150144.0 ... NaN NaN NaN NaN 15995.0 6.726393e+10 1.721129e+11 NaN Seagate 12.0

1 rows × 59 columns

In [79]:
df.at[4514189, 'failure'] = 1
df.iloc[4514189]
Out[79]:
date                     11-09
serial_number         ZJV00DR4
model            ST12000NM0007
failure                      1
smart_1_raw        1.18859e+08
smart_2_raw                NaN
smart_3_raw                  0
smart_4_raw                  6
smart_5_raw                 24
smart_7_raw         1.8315e+08
smart_8_raw                NaN
smart_9_raw              16406
smart_10_raw                 0
smart_11_raw               NaN
smart_12_raw                 5
smart_16_raw               NaN
smart_17_raw               NaN
smart_18_raw               NaN
smart_22_raw               NaN
smart_23_raw               NaN
smart_24_raw               NaN
smart_168_raw              NaN
smart_170_raw              NaN
smart_173_raw              NaN
smart_174_raw              NaN
smart_177_raw              NaN
smart_183_raw              NaN
smart_184_raw              NaN
smart_187_raw                0
smart_188_raw                0
smart_189_raw              NaN
smart_190_raw               29
smart_191_raw              NaN
smart_192_raw              221
smart_193_raw             1399
smart_194_raw               29
smart_195_raw      1.18859e+08
smart_196_raw              NaN
smart_197_raw                0
smart_198_raw                0
smart_199_raw                0
smart_200_raw                0
smart_218_raw              NaN
smart_220_raw              NaN
smart_222_raw              NaN
smart_223_raw              NaN
smart_224_raw              NaN
smart_225_raw              NaN
smart_226_raw              NaN
smart_231_raw              NaN
smart_232_raw              NaN
smart_233_raw              NaN
smart_235_raw              NaN
smart_240_raw            15995
smart_241_raw      6.72639e+10
smart_242_raw      1.72113e+11
smart_254_raw              NaN
manufacturer           Seagate
capacity_TB                 12
Name: 4514189, dtype: object
In [80]:
df.loc[(df['serial_number'] == 'ZHZ3M097') & (df['date'] == '11-10')]
Out[80]:
date serial_number model failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw ... smart_231_raw smart_232_raw smart_233_raw smart_235_raw smart_240_raw smart_241_raw smart_242_raw smart_254_raw manufacturer capacity_TB
4678156 11-10 ZHZ3M097 ST12000NM0008 0 196597768.0 NaN 0.0 1.0 0.0 32428024.0 ... NaN NaN NaN NaN 229.0 4.255310e+09 5.706962e+09 NaN Seagate 12.0

1 rows × 59 columns

In [81]:
df.at[4678156, 'failure'] = 1
df.iloc[4678156]
Out[81]:
date                     11-10
serial_number         ZHZ3M097
model            ST12000NM0008
failure                      1
smart_1_raw        1.96598e+08
smart_2_raw                NaN
smart_3_raw                  0
smart_4_raw                  1
smart_5_raw                  0
smart_7_raw         3.2428e+07
smart_8_raw                NaN
smart_9_raw                375
smart_10_raw                 0
smart_11_raw               NaN
smart_12_raw                 1
smart_16_raw               NaN
smart_17_raw               NaN
smart_18_raw                 0
smart_22_raw               NaN
smart_23_raw               NaN
smart_24_raw               NaN
smart_168_raw              NaN
smart_170_raw              NaN
smart_173_raw              NaN
smart_174_raw              NaN
smart_177_raw              NaN
smart_183_raw              NaN
smart_184_raw              NaN
smart_187_raw                0
smart_188_raw                0
smart_189_raw              NaN
smart_190_raw               29
smart_191_raw              NaN
smart_192_raw                0
smart_193_raw              572
smart_194_raw               29
smart_195_raw      1.96598e+08
smart_196_raw              NaN
smart_197_raw                0
smart_198_raw                0
smart_199_raw                0
smart_200_raw                0
smart_218_raw              NaN
smart_220_raw              NaN
smart_222_raw              NaN
smart_223_raw              NaN
smart_224_raw              NaN
smart_225_raw              NaN
smart_226_raw              NaN
smart_231_raw              NaN
smart_232_raw              NaN
smart_233_raw              NaN
smart_235_raw              NaN
smart_240_raw              229
smart_241_raw      4.25531e+09
smart_242_raw      5.70696e+09
smart_254_raw              NaN
manufacturer           Seagate
capacity_TB                 12
Name: 4678156, dtype: object
In [83]:
n_rows
Out[83]:
10975113
In [84]:
df.drop(df.index[[4797700, 4632946]], inplace = True)
In [85]:
df.iloc[[4797700, 4632946]]
Out[85]:
date serial_number model failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw ... smart_231_raw smart_232_raw smart_233_raw smart_235_raw smart_240_raw smart_241_raw smart_242_raw smart_254_raw manufacturer capacity_TB
4797702 11-11 ZCH0CCK9 ST12000NM0007 0 151625632.0 NaN 0.0 14.0 0.0 362901516.0 ... NaN NaN NaN NaN 15873.0 6.539592e+10 2.142969e+11 NaN Seagate 12.0
4632947 11-10 Z304JWB3 ST4000DM000 0 146071592.0 NaN 0.0 9.0 0.0 992849367.0 ... NaN NaN NaN NaN 35761.0 5.668042e+10 1.810822e+11 NaN Seagate 4.0

2 rows × 59 columns

The next section of columns all have 8792 rows with NaNs, ignoring the 2 rows just removed. Coincidentally, all of these columns share the same problematic rows.

In [86]:
df_8794 = df.loc[df['smart_3_raw'].isnull() & df['smart_4_raw'].isnull() & \
                 df['smart_5_raw'].isnull() & df['smart_7_raw'].isnull() & \
                 df['smart_10_raw'].isnull() & df['smart_197_raw'].isnull() & \
                 df['smart_198_raw'].isnull() & df['smart_199_raw'].isnull()]
Out[86]:
date serial_number model failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw ... smart_231_raw smart_232_raw smart_233_raw smart_235_raw smart_240_raw smart_241_raw smart_242_raw smart_254_raw manufacturer capacity_TB
8416 10-01 7M00027W ZA500CM10002 0 0.0 NaN NaN NaN NaN NaN ... 1.099512e+14 4.080219e+11 6433.0 1.349139e+10 NaN 6647.0 1885.0 NaN Seagate 0.50
13012 10-01 7M200214 ZA2000CM10002 0 0.0 NaN NaN NaN NaN NaN ... 1.099512e+14 4.123169e+11 4340.0 9.102103e+09 NaN 14440.0 1928.0 NaN Seagate 2.00
13676 10-01 7LZ01GHG ZA250CM10002 0 0.0 NaN NaN NaN NaN NaN ... 1.099512e+14 4.123169e+11 5944.0 1.246611e+10 NaN 4555.0 2327.0 NaN Seagate 0.25
13682 10-01 7LZ01GH2 ZA250CM10002 0 0.0 NaN NaN NaN NaN NaN ... 1.099512e+14 4.123169e+11 5782.0 1.212574e+10 NaN 4557.0 2347.0 NaN Seagate 0.25
13683 10-01 7LZ01GH1 ZA250CM10002 0 0.0 NaN NaN NaN NaN NaN ... 1.099512e+14 4.037269e+11 3382.0 7.092588e+09 NaN 2166.0 1279.0 NaN Seagate 0.25
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10970045 12-31 7LZ01ENF ZA250CM10002 0 0.0 NaN NaN NaN NaN NaN ... 1.099512e+14 4.123169e+11 5952.0 1.248291e+10 NaN 4570.0 2192.0 NaN Seagate 0.25
10971251 12-31 7LZ01K94 ZA250CM10002 0 0.0 NaN NaN NaN NaN NaN ... 1.099512e+14 4.123169e+11 5324.0 1.116620e+10 NaN 4170.0 2154.0 NaN Seagate 0.25
10972074 12-31 7LZ0232K ZA250CM10002 0 0.0 NaN NaN NaN NaN NaN ... 1.099512e+14 3.693672e+11 801.0 1.680505e+09 NaN 869.0 193.0 NaN Seagate 0.25
10973819 12-31 7LZ01GH7 ZA250CM10002 0 0.0 NaN NaN NaN NaN NaN ... 1.099512e+14 4.166118e+11 8910.0 1.868588e+10 NaN 6795.0 2979.0 NaN Seagate 0.25
10975048 12-31 7LZ01N9E ZA250CM10002 0 0.0 NaN NaN NaN NaN NaN ... 1.099512e+14 4.252018e+11 12680.0 2.659358e+10 NaN 7605.0 4363.0 NaN Seagate 0.25

8792 rows × 59 columns

This subset of drives are all manfactured by Seagate, and are 3 size variations of the same model line. There is not an updated model from this line in this dataset to interpolate values from.

In [87]:
df_8794['manufacturer'].value_counts()
Out[87]:
Seagate    8792
Name: manufacturer, dtype: int64
In [88]:
df_8794['model'].value_counts()
Out[88]:
ZA250CM10002     6844
ZA500CM10002     1593
ZA2000CM10002     355
Name: model, dtype: int64
In [89]:
df_8794['capacity_TB'].value_counts()
Out[89]:
0.25    6844
0.50    1593
2.00     355
Name: capacity_TB, dtype: int64
In [90]:
df_8794['serial_number'].value_counts()
Out[90]:
7M200214    92
7M00020R    92
7LZ01GH1    92
7LZ01GH2    92
7M0002A6    92
            ..
7LZ026L4     1
7LZ026L2     1
7LZ0249A     1
7LZ0249E     1
7LZ0249F     1
Name: serial_number, Length: 179, dtype: int64
In [91]:
df_8794['serial_number'].value_counts().mean()
Out[91]:
49.11731843575419
In [92]:
df_8794['failure'].value_counts()
Out[92]:
0    8792
Name: failure, dtype: int64
In [93]:
[item for i, item in enumerate(df['model'].unique()) if "ZA" in item]
Out[93]:
['ZA500CM10002', 'ZA2000CM10002', 'ZA250CM10002']

Interpolating mean values from the same manufacturer, Seagate, and the models' respective capacity_TB categories would be a good way to estimate the missing values if enough data exists.

Additionally, creating a boolean column to flag interpolated data as missing may help the predictive models account for it.

smart_3_raw

For the smart_3_raw data, the median for the manufacturer without regard for the drive capacity will be used for the first and third model of drive, as there are no appropriate rows with values to interpolate from.

In [9]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_3_raw'].notnull()) & \
       (df['capacity_TB'] == 0.25)]['smart_3_raw'].mean()
Out[9]:
nan
In [10]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_3_raw'].notnull()) & \
       (df['capacity_TB'] == 0.25)]['smart_3_raw']
Out[10]:
Series([], Name: smart_3_raw, dtype: float64)
In [11]:
smart_3_median_specialized = df.loc[(df['manufacturer'] == "Seagate") & \
                        (df['smart_3_raw'].notnull()) & \
                        (df['capacity_TB'] == 0.50)]['smart_3_raw'].median()
smart_3_median_specialized
Out[11]:
1816.0
In [12]:
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & \
                    (df['smart_3_raw'].notnull()) & \
                    (df['capacity_TB'] == 0.50)]['smart_3_raw'])
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x16be98c0448>
In [13]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_3_raw'].notnull()) & \
       (df['capacity_TB'] == 0.50)]['smart_3_raw']
Out[13]:
134            0.0
246         2044.0
714            0.0
1006        1989.0
1502        1801.0
             ...  
10974512       0.0
10974685       0.0
10974769       0.0
10974840       0.0
10974960       0.0
Name: smart_3_raw, Length: 71163, dtype: float64
In [14]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_3_raw'].notnull()) & \
       (df['capacity_TB'] == 2.00)]['smart_3_raw'].mean()
Out[14]:
nan
In [15]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_3_raw'].notnull()) & \
       (df['capacity_TB'] == 2.00)]['smart_3_raw']
Out[15]:
Series([], Name: smart_3_raw, dtype: float64)
In [16]:
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & \
                    (df['smart_3_raw'].notnull())]['smart_3_raw'], kde = False)
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x16ba74e4dc8>
In [17]:
smart_3_median = df.loc[(df['manufacturer'] == "Seagate") & \
                        (df['smart_3_raw'].notnull())]['smart_3_raw'].median()
smart_3_median
Out[17]:
0.0
In [18]:
# Use the median to fill the capacity category that can be calculated.
df.loc[(df['smart_3_raw'].isnull()) & \
       (df['capacity_TB'] == 0.50), 'smart_3_raw'] = smart_3_median_specialized
In [19]:
# Use the median to fill the capacity categories that cannot be calculated.
df.loc[df['smart_3_raw'].isnull(), 'smart_3_raw'] = smart_3_median
In [20]:
df['smart_3_raw'].isnull().sum()
Out[20]:
0

smart_4_raw

For the smart_4_raw data, the mean for the manufacturer and drive capacity will be used for the second model. The mean for the manufacturer without regard for the drive capacity will be used for the first and third model of drive, as there are no appropriate rows with values to interpolate from.

In [22]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_4_raw'].notnull()) & (df['capacity_TB'] == 0.25)]['smart_4_raw']
Out[22]:
Series([], Name: smart_4_raw, dtype: float64)
In [23]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_4_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_4_raw']
Out[23]:
134          5.0
246         14.0
714         17.0
1006        13.0
1502        22.0
            ... 
10974512     6.0
10974685    13.0
10974769    10.0
10974840     5.0
10974960     6.0
Name: smart_4_raw, Length: 71163, dtype: float64
In [24]:
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_4_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_4_raw'])
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x16ba74e4cc8>
In [25]:
smart_4_mean_specialized = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_4_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_4_raw'].mean()
smart_4_mean_specialized
Out[25]:
12.311720978598428
In [26]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_4_raw'].notnull()) & (df['capacity_TB'] == 2.00)]['smart_4_raw']
Out[26]:
Series([], Name: smart_4_raw, dtype: float64)
In [27]:
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_4_raw'].notnull())]['smart_4_raw'])
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x16ab3fb7c88>
In [28]:
smart_4_mean = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_4_raw'].notnull())]['smart_4_raw'].mean()
smart_4_mean
Out[28]:
8.63607498740538
In [29]:
# Use the mean to fill the capacity category that can be calculated.
df.loc[(df['smart_4_raw'].isnull()) & (df['capacity_TB'] == 0.50), 'smart_4_raw'] = smart_4_mean_specialized
In [30]:
# Use the mean to fill the capacity categories that cannot be calculated.
df.loc[df['smart_4_raw'].isnull(), 'smart_4_raw'] = smart_4_mean
In [31]:
df['smart_4_raw'].isnull().sum()
Out[31]:
0

smart_5_raw

For the smart_5_raw data, the median for the manufacturer and drive capacity will be used for the second model. The median for the manufacturer without regard for the drive capacity will be used for the first and third model of drive, as there are no appropriate rows with values to interpolate from.

In [36]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_5_raw'].notnull()) & (df['capacity_TB'] == 0.25)]['smart_5_raw']
Out[36]:
Series([], Name: smart_5_raw, dtype: float64)
In [37]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_5_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_5_raw']
Out[37]:
134         0.0
246         0.0
714         0.0
1006        0.0
1502        0.0
           ... 
10974512    0.0
10974685    0.0
10974769    0.0
10974840    0.0
10974960    0.0
Name: smart_5_raw, Length: 71163, dtype: float64
In [38]:
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_5_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_5_raw'], kde = False)
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x16ab402cd08>
In [39]:
smart_5_median_specialized = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_5_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_5_raw'].median()
smart_5_median_specialized
Out[39]:
0.0
In [40]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_5_raw'].notnull()) & (df['capacity_TB'] == 2.00)]['smart_5_raw']
Out[40]:
Series([], Name: smart_5_raw, dtype: float64)
In [41]:
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_5_raw'].notnull())]['smart_5_raw'], kde = False)
Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x16ab6d25c08>
In [42]:
smart_5_median = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_5_raw'].notnull())]['smart_5_raw'].median()
smart_5_median
Out[42]:
0.0
In [43]:
# Use the median to fill the capacity category that can be calculated.
df.loc[(df['smart_5_raw'].isnull()) & (df['capacity_TB'] == 0.50), 'smart_5_raw'] = smart_5_median_specialized
In [44]:
# Use the median to fill the capacity categories that cannot be calculated.
df.loc[df['smart_5_raw'].isnull(), 'smart_5_raw'] = smart_5_median
In [45]:
df['smart_5_raw'].isnull().sum()
Out[45]:
0

smart_7_raw

For the smart_7_raw data, the median for the manufacturer and drive capacity will be used for the second model. The median for the manufacturer without regard for the drive capacity will be used for the first and third model of drive, as there are no appropriate rows with values to interpolate from.

In [46]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_7_raw'].notnull()) & (df['capacity_TB'] == 0.25)]['smart_7_raw']
Out[46]:
Series([], Name: smart_7_raw, dtype: float64)
In [47]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_7_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_7_raw']
Out[47]:
134         901104176.0
246                 0.0
714          29927613.0
1006                0.0
1502                0.0
               ...     
10974512    309799114.0
10974685    331746418.0
10974769    694324598.0
10974840    623555853.0
10974960    216864131.0
Name: smart_7_raw, Length: 71163, dtype: float64
In [48]:
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_7_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_7_raw'])
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x16ab6e038c8>
In [49]:
smart_7_median_specialized = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_7_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_7_raw'].median()
smart_7_median_specialized
Out[49]:
0.0
In [50]:
mean = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_7_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_7_raw'].mean()
mean
Out[50]:
160253636.9962902
In [51]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_7_raw'].notnull()) & (df['capacity_TB'] == 2.00)]['smart_7_raw']
Out[51]:
Series([], Name: smart_7_raw, dtype: float64)
In [52]:
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_7_raw'].notnull())]['smart_7_raw'])
Out[52]:
<matplotlib.axes._subplots.AxesSubplot at 0x16ab6eecd08>
In [53]:
smart_7_median = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_7_raw'].notnull())]['smart_7_raw'].median()
smart_7_median
Out[53]:
696801289.0
In [54]:
# Use the median to fill the capacity category that can be calculated.
df.loc[(df['smart_7_raw'].isnull()) & (df['capacity_TB'] == 0.50), 'smart_7_raw'] = smart_7_median_specialized
In [55]:
# Use the median to fill the capacity categories that cannot be calculated.
df.loc[df['smart_7_raw'].isnull(), 'smart_7_raw'] = smart_7_median
In [56]:
df['smart_7_raw'].isnull().sum()
Out[56]:
0

smart_10_raw

For the smart_10_raw data, the median for the manufacturer will be used to fill the NaN values.

In [57]:
df.loc[(df['manufacturer'] == "Seagate") & \
       (df['smart_10_raw'].notnull())]['smart_10_raw']
Out[57]:
0           0.0
1           0.0
2           0.0
3           0.0
5           0.0
           ... 
10975104    0.0
10975105    0.0
10975108    0.0
10975109    0.0
10975112    0.0
Name: smart_10_raw, Length: 7957763, dtype: float64
In [62]:
df['smart_10_raw'].value_counts()
Out[62]:
0.0         10961730
1.0             8518
65536.0         2958
2.0              729
131072.0         549
65537.0          182
131073.0         170
327680.0          92
262144.0          91
3.0               91
196608.0           1
Name: smart_10_raw, dtype: int64
In [58]:
df.loc[(df['manufacturer'] == "Seagate") & \
       (df['smart_10_raw'].notnull())]['smart_10_raw'].value_counts()
Out[58]:
0.0    7957763
Name: smart_10_raw, dtype: int64
In [59]:
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_10_raw'].notnull())]['smart_10_raw'])
C:\Users\aedri\Anaconda3\envs\tf1\lib\site-packages\seaborn\distributions.py:283: UserWarning: Data must have variance to compute a kernel density estimate.
  warnings.warn(msg, UserWarning)
Out[59]:
<matplotlib.axes._subplots.AxesSubplot at 0x16c0b3a8908>
In [60]:
smart_10_median = df.loc[(df['manufacturer'] == "Seagate") & \
                    (df['smart_10_raw'].notnull())]['smart_10_raw'].median()
smart_10_median
Out[60]:
0.0
In [61]:
df.loc[df['smart_10_raw'].isnull(), 'smart_10_raw'] = smart_10_median
In [63]:
df['smart_10_raw'].isnull().sum()
Out[63]:
0

smart_197_raw

For the smart_197_raw data, the median for the manufacturer and drive capacity will be used for the second model. The median for the manufacturer without regard for the drive capacity will be used for the first and third model of drive, as there are no appropriate rows with values to interpolate from.

In [64]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_197_raw'].notnull()) & \
       (df['capacity_TB'] == 0.25)]['smart_197_raw']
Out[64]:
Series([], Name: smart_197_raw, dtype: float64)
In [65]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_197_raw'].notnull()) & \
       (df['capacity_TB'] == 0.50)]['smart_197_raw']
Out[65]:
134         0.0
246         0.0
714         0.0
1006        0.0
1502        0.0
           ... 
10974512    0.0
10974685    0.0
10974769    0.0
10974840    0.0
10974960    0.0
Name: smart_197_raw, Length: 71163, dtype: float64
In [66]:
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & \
                    (df['smart_197_raw'].notnull()) & \
                    (df['capacity_TB'] == 0.50)]['smart_197_raw'], kde = False)
Out[66]:
<matplotlib.axes._subplots.AxesSubplot at 0x16c0b4bc1c8>
In [67]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_197_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_197_raw'].value_counts()
Out[67]:
0.0    70905
2.0      122
1.0      119
3.0        9
4.0        8
Name: smart_197_raw, dtype: int64
In [68]:
smart_197_median_specialized = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_197_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_197_raw'].median()
smart_197_median_specialized
Out[68]:
0.0
In [69]:
smart_197_mean_specialized = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_197_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_197_raw'].mean()
smart_197_mean_specialized
Out[69]:
0.0059300479181597174
In [70]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_197_raw'].notnull()) & (df['capacity_TB'] == 2.00)]['smart_197_raw']
Out[70]:
Series([], Name: smart_197_raw, dtype: float64)
In [71]:
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_197_raw'].notnull())]['smart_197_raw'], kde = False)
Out[71]:
<matplotlib.axes._subplots.AxesSubplot at 0x16c0b5ac0c8>
In [72]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_197_raw'].notnull())]['smart_197_raw'].value_counts()
Out[72]:
0.0      7892657
8.0        46134
16.0       11240
24.0        3308
32.0        1400
          ...   
600.0          1
400.0          1
432.0          1
520.0          1
776.0          1
Name: smart_197_raw, Length: 74, dtype: int64
In [73]:
smart_197_median = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_197_raw'].notnull())]['smart_197_raw'].median()
smart_197_median
Out[73]:
0.0
In [74]:
# Use the median to fill the capacity category that can be calculated.
df.loc[(df['smart_197_raw'].isnull()) & (df['capacity_TB'] == 0.50), 'smart_197_raw'] = smart_197_median_specialized
In [75]:
# Use the median to fill the capacity categories that cannot be calculated.
df.loc[df['smart_197_raw'].isnull(), 'smart_197_raw'] = smart_197_median
In [76]:
df['smart_197_raw'].isnull().sum()
Out[76]:
0

smart_198_raw

For the smart_198_raw data, the median for the manufacturer and drive capacity will be used for the second model. The median for the manufacturer without regard for the drive capacity will be used for the first and third model of drive, as there are no appropriate rows with values to interpolate from.

In [77]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_198_raw'].notnull()) & (df['capacity_TB'] == 0.25)]['smart_198_raw']
Out[77]:
Series([], Name: smart_198_raw, dtype: float64)
In [78]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_198_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_198_raw']
Out[78]:
134         0.0
246         0.0
714         0.0
1006        0.0
1502        0.0
           ... 
10974512    0.0
10974685    0.0
10974769    0.0
10974840    0.0
10974960    0.0
Name: smart_198_raw, Length: 71163, dtype: float64
In [79]:
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_198_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_198_raw'], kde = False)
Out[79]:
<matplotlib.axes._subplots.AxesSubplot at 0x16c0b670688>
In [80]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_198_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_198_raw'].value_counts()
Out[80]:
0.0    71163
Name: smart_198_raw, dtype: int64
In [81]:
smart_198_median_specialized = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_198_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_198_raw'].median()
smart_198_median_specialized
Out[81]:
0.0
In [82]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_198_raw'].notnull()) & (df['capacity_TB'] == 2.00)]['smart_198_raw']
Out[82]:
Series([], Name: smart_198_raw, dtype: float64)
In [83]:
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_198_raw'].notnull())]['smart_198_raw'], kde = False)
Out[83]:
<matplotlib.axes._subplots.AxesSubplot at 0x16c0b791488>
In [84]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_198_raw'].notnull())]['smart_198_raw'].value_counts()
Out[84]:
0.0       7892915
8.0         46134
16.0        11240
24.0         3308
32.0         1400
           ...   
1808.0          1
1840.0          1
2416.0          1
2808.0          1
600.0           1
Name: smart_198_raw, Length: 70, dtype: int64
In [85]:
smart_198_median = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_198_raw'].notnull())]['smart_198_raw'].median()
smart_198_median
Out[85]:
0.0
In [86]:
# Use the median to fill the capacity category that can be calculated.
df.loc[(df['smart_198_raw'].isnull()) & (df['capacity_TB'] == 0.50), 'smart_198_raw'] = smart_198_median_specialized
In [87]:
# Use the median to fill the capacity categories that cannot be calculated.
df.loc[df['smart_198_raw'].isnull(), 'smart_198_raw'] = smart_198_median
In [88]:
df['smart_198_raw'].isnull().sum()
Out[88]:
0

smart_199_raw

For the smart_199_raw data, the median for the manufacturer and drive capacity will be used for the second model. The median for the manufacturer without regard for the drive capacity will be used for the first and third model of drive, as there are no appropriate rows with values to interpolate from.

In [90]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_199_raw'].notnull()) & (df['capacity_TB'] == 0.25)]['smart_199_raw']
Out[90]:
Series([], Name: smart_199_raw, dtype: float64)
In [91]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_199_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_199_raw']
Out[91]:
134         0.0
246         4.0
714         0.0
1006        0.0
1502        0.0
           ... 
10974512    0.0
10974685    0.0
10974769    0.0
10974840    0.0
10974960    0.0
Name: smart_199_raw, Length: 71163, dtype: float64
In [92]:
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_199_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_199_raw'], kde = False)
Out[92]:
<matplotlib.axes._subplots.AxesSubplot at 0x16c0b8a38c8>
In [93]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_199_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_199_raw'].value_counts()
Out[93]:
0.0       69076
1.0         438
6.0         275
3.0         274
5.0         273
7.0          92
123.0        92
4.0          92
9.0          92
29.0         92
135.0        92
1296.0       92
2.0          92
12.0         91
Name: smart_199_raw, dtype: int64
In [94]:
smart_199_median_specialized = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_199_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_199_raw'].median()
smart_199_median_specialized
Out[94]:
0.0
In [95]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_199_raw'].notnull()) & (df['capacity_TB'] == 2.00)]['smart_199_raw']
Out[95]:
Series([], Name: smart_199_raw, dtype: float64)
In [96]:
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_199_raw'].notnull())]['smart_199_raw'], kde = False)
Out[96]:
<matplotlib.axes._subplots.AxesSubplot at 0x16c0b979888>
In [99]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_199_raw'].notnull())]['smart_199_raw'].value_counts()
Out[99]:
0.0      7890277
1.0        12442
2.0         6416
3.0         4587
4.0         3485
          ...   
147.0          1
145.0          1
220.0          1
223.0          1
378.0          1
Name: smart_199_raw, Length: 389, dtype: int64
In [100]:
smart_199_median = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_199_raw'].notnull())]['smart_199_raw'].median()
smart_199_median
Out[100]:
0.0
In [101]:
# Use the median to fill the capacity category that can be calculated.
df.loc[(df['smart_199_raw'].isnull()) & (df['capacity_TB'] == 0.50), 'smart_199_raw'] = smart_199_median_specialized
In [102]:
# Use the median to fill the capacity categories that cannot be calculated.
df.loc[df['smart_199_raw'].isnull(), 'smart_199_raw'] = smart_199_median
In [103]:
df['smart_199_raw'].isnull().sum()
Out[103]:
0

smart_193_raw and smart_225_raw

The smart_193_raw column is a different problem than the last group of columns. This group has 53985 rows with NaN values, which is still low enough in this large dataset to interpolate values without majorly ill effects, but still requires caution.

An important note here is that some manufacturers use different SMART attributes to represent the same information. Most Seagate and some Western Digital and Hitachi drives actually use 225 rather than 193 to store the Load/Unload Cycle Count value (Acronis, Knowledge Base 9128; Acronis, Knowledge Base 9152). We can see here that no row has both 193 and 225 values.

In [104]:
df.loc[(df['smart_193_raw'].notnull()) & \
       (df['smart_225_raw'].notnull())][['smart_193_raw', 'smart_225_raw']]
Out[104]:
smart_193_raw smart_225_raw
In [105]:
df_193 = df.loc[df['smart_193_raw'].isnull()]
df_193
Out[105]:
date serial_number model failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw ... smart_231_raw smart_232_raw smart_233_raw smart_235_raw smart_240_raw smart_241_raw smart_242_raw smart_254_raw manufacturer capacity_TB
246 10-01 S2ZYJ9FG405092 ST500LM012 HN False 7.0 0.0 2044.0 14.000000 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN Seagate 0.50
1006 10-01 S2ZYJ9HF707975 ST500LM012 HN False 4.0 0.0 1989.0 13.000000 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN Seagate 0.50
1502 10-01 S2ZYJ9DG700888 ST500LM012 HN False 320.0 0.0 1801.0 22.000000 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN Seagate 0.50
2090 10-01 S2ZYJ9KG303900 ST500LM012 HN False 37055.0 0.0 2129.0 14.000000 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN Seagate 0.50
2365 10-01 S2ZYJ9DG701035 ST500LM012 HN False 13.0 0.0 1986.0 6.000000 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN Seagate 0.50
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10974017 12-31 S2ZYJ9KG303897 ST500LM012 HN False 46.0 0.0 1786.0 13.000000 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN Seagate 0.50
10974019 12-31 S2ZYJ9KG303892 ST500LM012 HN False 2.0 0.0 2149.0 10.000000 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN Seagate 0.50
10974020 12-31 S2ZYJ9KG303893 ST500LM012 HN False 2.0 0.0 1828.0 13.000000 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN Seagate 0.50
10974375 12-31 S2ZYJ9GGB01626 ST500LM012 HN False 110.0 0.0 1814.0 14.000000 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN Seagate 0.50
10975048 12-31 7LZ01N9E ZA250CM10002 False 0.0 NaN 0.0 8.636075 0.0 696801289.0 ... 1.099512e+14 4.252018e+11 12680.0 2.659358e+10 NaN 7605.0 4363.0 NaN Seagate 0.25

53985 rows × 59 columns

In [106]:
df_193['manufacturer'].value_counts()
Out[106]:
Seagate            53985
Western Digital        0
Toshiba                0
HGST                   0
Name: manufacturer, dtype: int64
In [107]:
df_193['model'].value_counts()
Out[107]:
ST500LM012 HN      45102
ZA250CM10002        6844
ZA500CM10002        1593
ZA2000CM10002        355
ST1000LM024 HN'       91
HDWE160                0
HDWF180                0
ST12000NM0007          0
ST10000NM0086          0
MQ01ABF050M            0
MQ01ABF050             0
MG07ACA14TA            0
MD04ABA400V            0
HUS726040ALE610        0
ST12000NM0008          0
HUH721212ALN604        0
HUH721212ALE600        0
HMS5C4040BLE641        0
HMS5C4040BLE640        0
HMS5C4040ALE640        0
HUH728080ALE600        0
ST16000NM001G          0
ST12000NM0117          0
ST4000DM000            0
WD60EFRX               0
WD5000LPVX             0
WD5000LPCX             0
WD5000BPKT             0
Seagate SSD            0
ST8000NM0055           0
ST8000DM005            0
ST8000DM004            0
ST8000DM002            0
ST6000DX000            0
ST6000DM004            0
ST6000DM001            0
ST500LM030             0
ST500LM021             0
ST4000DM005            0
HDS5C4040ALE630        0
Name: model, dtype: int64
In [108]:
df_193.loc[df_193['smart_193_raw'] != \
           df_193['smart_225_raw']][['smart_193_raw', 'smart_225_raw']]
Out[108]:
smart_193_raw smart_225_raw
246 NaN 310513.0
1006 NaN 72805.0
1502 NaN 72944.0
2090 NaN 159087.0
2365 NaN 59938.0
... ... ...
10974017 NaN 128760.0
10974019 NaN 412141.0
10974020 NaN 132761.0
10974375 NaN 692196.0
10975048 NaN NaN

53985 rows × 2 columns

In [109]:
df_193.loc[(df_193['smart_193_raw'].notnull()) & \
    (df_193['smart_225_raw'].notnull())][['smart_193_raw', 'smart_225_raw']]
Out[109]:
smart_193_raw smart_225_raw

The only rows that do not have either value are the exact same rows as the last group. These will need interpolated if the rows are to be kept. The 45193 other rows can be filled by combining the two columns that represent the same information.

In [110]:
df_193.loc[(df_193['smart_193_raw'].isnull()) & \
    (df_193['smart_225_raw'].isnull())][['smart_193_raw', 'smart_225_raw']]
Out[110]:
smart_193_raw smart_225_raw
8416 NaN NaN
13012 NaN NaN
13676 NaN NaN
13682 NaN NaN
13683 NaN NaN
... ... ...
10970045 NaN NaN
10971251 NaN NaN
10972074 NaN NaN
10973819 NaN NaN
10975048 NaN NaN

8792 rows × 2 columns

In [111]:
df_193.loc[(df_193['smart_193_raw'].isnull()) & \
           (df_193['smart_225_raw'].isnull())]['model'].value_counts()
Out[111]:
ZA250CM10002       6844
ZA500CM10002       1593
ZA2000CM10002       355
MD04ABA400V           0
ST12000NM0008         0
ST12000NM0007         0
ST1000LM024 HN'       0
ST10000NM0086         0
MQ01ABF050M           0
MQ01ABF050            0
MG07ACA14TA           0
HUS726040ALE610       0
HDWE160               0
ST12000NM0117         0
HUH721212ALN604       0
HUH721212ALE600       0
HMS5C4040BLE641       0
HMS5C4040BLE640       0
HMS5C4040ALE640       0
HDWF180               0
HUH728080ALE600       0
ST16000NM001G         0
ST4000DM000           0
ST4000DM005           0
WD60EFRX              0
WD5000LPVX            0
WD5000LPCX            0
WD5000BPKT            0
Seagate SSD           0
ST8000NM0055          0
ST8000DM005           0
ST8000DM004           0
ST8000DM002           0
ST6000DX000           0
ST6000DM004           0
ST6000DM001           0
ST500LM030            0
ST500LM021            0
ST500LM012 HN         0
HDS5C4040ALE630       0
Name: model, dtype: int64

The smart_193_raw and smart_225_raw columns will be combined into a new smart_193_225 column and then the remaining values filled as in previous columns.

In [112]:
df['smart_193_225'] = df['smart_193_raw']
In [113]:
df['smart_193_225'].fillna(df['smart_225_raw'], inplace = True)
In [114]:
df[['smart_193_raw', 'smart_225_raw', 'smart_193_225']].isna().sum()
Out[114]:
smart_193_raw       53985
smart_225_raw    10929918
smart_193_225        8792
dtype: int64
In [115]:
df.drop(['smart_193_raw', 'smart_225_raw'], axis=1, inplace=True)
df.head()
Out[115]:
date serial_number model failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw ... smart_232_raw smart_233_raw smart_235_raw smart_240_raw smart_241_raw smart_242_raw smart_254_raw manufacturer capacity_TB smart_193_225
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 NaN 0.0 13.0 0.0 704304346.0 ... NaN NaN NaN 33009.0 5.063798e+10 1.623458e+11 NaN Seagate 4.0 34188.0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 NaN 0.0 3.0 0.0 422822971.0 ... NaN NaN NaN 9533.0 5.084775e+10 1.271356e+11 NaN Seagate 12.0 2426.0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 NaN 0.0 1.0 0.0 936518450.0 ... NaN NaN NaN 6977.0 4.920827e+10 4.658787e+10 NaN Seagate 12.0 1881.0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 NaN 0.0 6.0 0.0 416687782.0 ... NaN NaN NaN 10669.0 5.341374e+10 9.427903e+10 NaN Seagate 12.0 970.0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 103.0 436.0 9.0 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN HGST 4.0 356.0

5 rows × 58 columns

Now that the two columns have been merged, the same process of interpolation by model and capacity can be used on the remaining group.

In [116]:
df.loc[(df['manufacturer'] == "Seagate") & \
       (df['smart_193_225'].notnull()) & \
       (df['capacity_TB'] == 0.25)]['smart_193_225']
Out[116]:
Series([], Name: smart_193_225, dtype: float64)
In [117]:
df.loc[(df['manufacturer'] == "Seagate") & \
       (df['smart_193_225'].notnull()) & \
       (df['capacity_TB'] == 0.50)]['smart_193_225']
Out[117]:
134            266.0
246         310513.0
714             62.0
1006         72805.0
1502         72944.0
              ...   
10974512       651.0
10974685        27.0
10974769        27.0
10974840        19.0
10974960        13.0
Name: smart_193_225, Length: 71163, dtype: float64
In [118]:
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & \
                    (df['smart_193_225'].notnull()) & \
                    (df['capacity_TB'] == 0.50)]['smart_193_225'])
Out[118]:
<matplotlib.axes._subplots.AxesSubplot at 0x16c0ba5a488>
In [119]:
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_193_225'].notnull()) & \
       (df['capacity_TB'] == 0.50)]['smart_193_225'].value_counts()
Out[119]:
7.0         2001
11.0        1405
15.0        1301
14.0        1059
13.0        1033
            ... 
75245.0        1
75246.0        1
150495.0       1
75248.0        1
179911.0       1
Name: smart_193_225, Length: 11182, dtype: int64
In [120]:
smart_193_225_median_specialized = df.loc[(df['manufacturer'] == "Seagate") &\
                        (df['smart_193_225'].notnull()) & \
                        (df['capacity_TB'] == 0.50)]['smart_193_225'].median()
smart_193_225_median_specialized
Out[120]:
90136.0
In [121]:
smart_193_225_mean_specialized = df.loc[(df['manufacturer'] == "Seagate") & \
                        (df['smart_193_225'].notnull()) & \
                        (df['capacity_TB'] == 0.50)]['smart_193_225'].mean()
smart_193_225_mean_specialized
Out[121]:
205361.6568019898
In [122]:
df.loc[(df['manufacturer'] == "Seagate") & \
       (df['smart_193_225'].notnull()) & \
       (df['capacity_TB'] == 2.00)]['smart_193_225']
Out[122]:
Series([], Name: smart_193_225, dtype: float64)
In [123]:
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & \
                    (df['smart_193_225'].notnull())]['smart_193_225'])
Out[123]:
<matplotlib.axes._subplots.AxesSubplot at 0x16ab62d0a48>
In [124]:
df.loc[(df['manufacturer'] == "Seagate") & \
       (df['smart_193_225'].notnull())]['smart_193_225'].value_counts()
Out[124]:
67.0         7191
68.0         6614
66.0         6069
65.0         5823
1008.0       5694
             ... 
1405860.0       1
1405857.0       1
1405849.0       1
1405844.0       1
131069.0        1
Name: smart_193_225, Length: 117352, dtype: int64
In [125]:
smart_193_225_median = df.loc[(df['manufacturer'] == "Seagate") & \
                    (df['smart_193_225'].notnull())]['smart_193_225'].median()
smart_193_225_median
Out[125]:
3694.0
In [126]:
# Use the median to fill the capacity category that can be calculated.
df.loc[(df['smart_193_225'].isnull()) & \
       (df['capacity_TB'] == 0.50), 'smart_193_225'] = \
        smart_193_225_median_specialized
In [127]:
# Use the median to fill the capacity categories that cannot be calculated.
df.loc[df['smart_193_225'].isnull(), 'smart_193_225'] = smart_193_225_median
In [128]:
df['smart_193_225'].isnull().sum()
Out[128]:
0

The remaining columns to examine all have over 2 million NaN value rows each. This level of missing data causes interpolation to skew results far more than the previous groups' levels of missing data. The following grouping of columns have at least 70% of their values.

Column            NaN Count
smart_240_raw     2733254
smart_241_raw     2909319
smart_242_raw     2909319
smart_187_raw     3062543
smart_188_raw     3062543
smart_190_raw     3062543

smart_240_raw, smart_241_raw, and smart_242_raw

In [129]:
df.loc[df['failure'] == 0]['smart_240_raw'].isnull().value_counts()
Out[129]:
False    8241226
True     2733207
Name: smart_240_raw, dtype: int64
In [130]:
df.loc[df['failure'] == 1]['smart_240_raw'].isnull().value_counts()
Out[130]:
False    633
True      45
Name: smart_240_raw, dtype: int64
In [131]:
df.loc[df['smart_240_raw'].isnull()].head()
Out[131]:
date serial_number model failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw ... smart_232_raw smart_233_raw smart_235_raw smart_240_raw smart_241_raw smart_242_raw smart_254_raw manufacturer capacity_TB smart_193_225
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 103.0 436.0 9.0 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN HGST 4.0 356.0
16 10-01 PL2331LAG9TEEJ HMS5C4040ALE640 False 0.0 98.0 449.0 13.0 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN HGST 4.0 283.0
17 10-01 PL2331LAH3WYAJ HMS5C4040BLE640 False 0.0 106.0 539.0 5.0 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN HGST 4.0 308.0
18 10-01 2AGN81UY HUH721212ALN604 False 0.0 96.0 0.0 1.0 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN HGST 12.0 193.0
19 10-01 PL1331LAHG53YH HMS5C4040BLE640 False 0.0 104.0 440.0 7.0 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN HGST 4.0 290.0

5 rows × 58 columns

Notably, none of the HGST drives have a value for the smart_240_raw column. Additionally, the drives that are missing the smart_241_raw data are also likely the drives missing the smart_242_raw data.

Seagate drives have enough filled values to use and Toshiba drives have no missing values, but the HGST and Western Digital drives do not have enough values to interpolate from. As such, all missing values will be filled in with the mean.

In [132]:
df.loc[df['smart_240_raw'].isnull()]['manufacturer'].value_counts()
Out[132]:
HGST               2660533
Seagate              53985
Western Digital      18734
Toshiba                  0
Name: manufacturer, dtype: int64
In [133]:
df.loc[df['smart_240_raw'].notnull()]['manufacturer'].value_counts()
Out[133]:
Seagate            7912570
Toshiba             322722
Western Digital       6567
HGST                     0
Name: manufacturer, dtype: int64
In [134]:
df.loc[df['smart_241_raw'].isnull()]['manufacturer'].value_counts()
Out[134]:
HGST               2517013
Toshiba             322722
Seagate              45193
Western Digital      24389
Name: manufacturer, dtype: int64
In [135]:
df.loc[df['smart_241_raw'].notnull()]['manufacturer'].value_counts()
Out[135]:
Seagate            7921362
HGST                143520
Western Digital        912
Toshiba                  0
Name: manufacturer, dtype: int64
In [136]:
df.loc[df['smart_242_raw'].isnull()]['manufacturer'].value_counts()
Out[136]:
HGST               2517013
Toshiba             322722
Seagate              45193
Western Digital      24389
Name: manufacturer, dtype: int64
In [137]:
df.loc[df['smart_242_raw'].notnull()]['manufacturer'].value_counts()
Out[137]:
Seagate            7921362
HGST                143520
Western Digital        912
Toshiba                  0
Name: manufacturer, dtype: int64
In [138]:
print("Not Failed Mean: " + str(df.loc[df['failure'] == 0]['smart_240_raw'].mean()))
print("Not Failed Median: " + str(df.loc[df['failure'] == 0]['smart_240_raw'].median()))
print("Failed Mean: " + str(df.loc[df['failure'] == 1]['smart_240_raw'].mean()))
print("Failed Median: " + str(df.loc[df['failure'] == 1]['smart_240_raw'].median()))
Not Failed Mean: 19480.03062808859
Not Failed Median: 18474.0
Failed Mean: 17624.443917851502
Failed Median: 16374.0
In [139]:
print("Not Failed Mean: " + str(df.loc[df['failure'] == 0]['smart_241_raw'].mean()))
print("Not Failed Median: " + str(df.loc[df['failure'] == 0]['smart_241_raw'].median()))
print("Failed Mean: " + str(df.loc[df['failure'] == 1]['smart_241_raw'].mean()))
print("Failed Median: " + str(df.loc[df['failure'] == 1]['smart_241_raw'].median()))
Not Failed Mean: 53260637298.32403
Not Failed Median: 55133127528.0
Failed Mean: 55750164828.86027
Failed Median: 57636526188.0
In [140]:
print("Not Failed Mean: " + str(df.loc[df['failure'] == 0]['smart_242_raw'].mean()))
print("Not Failed Median: " + str(df.loc[df['failure'] == 0]['smart_242_raw'].median()))
print("Failed Mean: " + str(df.loc[df['failure'] == 1]['smart_242_raw'].mean()))
print("Failed Median: " + str(df.loc[df['failure'] == 1]['smart_242_raw'].median()))
Not Failed Mean: 140094197957.17606
Not Failed Median: 154529207824.0
Failed Mean: 144369181046.50168
Failed Median: 156081138388.5
In [141]:
smart_240_mean = df.loc[df['smart_240_raw'].notnull()]['smart_240_raw'].mean()
smart_240_mean
Out[141]:
19479.888113349185
In [142]:
df['smart_240_raw'].fillna(smart_240_mean, inplace = True)
In [143]:
df['smart_240_raw'].isnull().sum()
Out[143]:
0
In [144]:
smart_241_mean = df.loc[df['smart_241_raw'].notnull()]['smart_241_raw'].mean()
smart_241_mean
Out[144]:
53260820637.912384
In [145]:
df['smart_241_raw'].fillna(smart_241_mean, inplace = True)
In [146]:
df['smart_241_raw'].isnull().sum()
Out[146]:
0
In [147]:
smart_242_mean = df.loc[df['smart_242_raw'].notnull()]['smart_242_raw'].mean()
smart_242_mean
Out[147]:
140094512785.44415
In [148]:
df['smart_242_raw'].fillna(smart_242_mean, inplace = True)
In [149]:
df['smart_242_raw'].isnull().sum()
Out[149]:
0

smart_187_raw, smart_188_raw, and smart_190_raw

In [150]:
df.loc[df['failure'] == 0]['smart_187_raw'].isnull().value_counts()
Out[150]:
False    7911977
True     3062456
Name: smart_187_raw, dtype: int64
In [151]:
df.loc[df['failure'] == 1]['smart_187_raw'].isnull().value_counts()
Out[151]:
False    593
True      85
Name: smart_187_raw, dtype: int64
In [152]:
df.loc[df['smart_187_raw'].isnull()].head()
Out[152]:
date serial_number model failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw ... smart_232_raw smart_233_raw smart_235_raw smart_240_raw smart_241_raw smart_242_raw smart_254_raw manufacturer capacity_TB smart_193_225
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 103.0 436.0 9.0 0.0 0.0 ... NaN NaN NaN 19479.888113 5.326082e+10 1.400945e+11 NaN HGST 4.0 356.0
16 10-01 PL2331LAG9TEEJ HMS5C4040ALE640 False 0.0 98.0 449.0 13.0 0.0 0.0 ... NaN NaN NaN 19479.888113 5.326082e+10 1.400945e+11 NaN HGST 4.0 283.0
17 10-01 PL2331LAH3WYAJ HMS5C4040BLE640 False 0.0 106.0 539.0 5.0 0.0 0.0 ... NaN NaN NaN 19479.888113 5.326082e+10 1.400945e+11 NaN HGST 4.0 308.0
18 10-01 2AGN81UY HUH721212ALN604 False 0.0 96.0 0.0 1.0 0.0 0.0 ... NaN NaN NaN 19479.888113 5.326082e+10 1.400945e+11 NaN HGST 12.0 193.0
19 10-01 PL1331LAHG53YH HMS5C4040BLE640 False 0.0 104.0 440.0 7.0 0.0 0.0 ... NaN NaN NaN 19479.888113 5.326082e+10 1.400945e+11 NaN HGST 4.0 290.0

5 rows × 58 columns

The group of the smart_187_raw, smart_188_raw, and smart_190_raw columns are divided by manufacturer, with all Seagate drives having the values and none of the other drive manufacturers having the values.

In [153]:
df.loc[df['smart_187_raw'].isnull()]['manufacturer'].value_counts()
Out[153]:
HGST               2660533
Toshiba             322722
Seagate              53985
Western Digital      25301
Name: manufacturer, dtype: int64
In [154]:
df.loc[df['smart_187_raw'].notnull()]['manufacturer'].value_counts()
Out[154]:
Seagate            7912570
Western Digital          0
Toshiba                  0
HGST                     0
Name: manufacturer, dtype: int64
In [155]:
sns.distplot(df.loc[df['smart_187_raw'].notnull()]['smart_187_raw'], \
             kde = False)
Out[155]:
<matplotlib.axes._subplots.AxesSubplot at 0x16ab6c04988>
In [156]:
df.loc[df['smart_188_raw'].isnull()]['manufacturer'].value_counts()
Out[156]:
HGST               2660533
Toshiba             322722
Seagate              53985
Western Digital      25301
Name: manufacturer, dtype: int64
In [157]:
df.loc[df['smart_188_raw'].notnull()]['manufacturer'].value_counts()
Out[157]:
Seagate            7912570
Western Digital          0
Toshiba                  0
HGST                     0
Name: manufacturer, dtype: int64
In [158]:
sns.distplot(df.loc[df['smart_188_raw'].notnull()]['smart_188_raw'], \
             kde = False)
Out[158]:
<matplotlib.axes._subplots.AxesSubplot at 0x16c0bdc4708>
In [159]:
df.loc[df['smart_190_raw'].isnull()]['manufacturer'].value_counts()
Out[159]:
HGST               2660533
Toshiba             322722
Seagate              53985
Western Digital      25301
Name: manufacturer, dtype: int64
In [160]:
df.loc[df['smart_190_raw'].notnull()]['manufacturer'].value_counts()
Out[160]:
Seagate            7912570
Western Digital          0
Toshiba                  0
HGST                     0
Name: manufacturer, dtype: int64
In [161]:
sns.distplot(df.loc[df['smart_190_raw'].notnull()]['smart_190_raw'])
Out[161]:
<matplotlib.axes._subplots.AxesSubplot at 0x16c1069ec08>

Given the column distributions, the smart_187_raw and smart_188_raw NaNs will be filled with the medians, and the smart_190_raw NaNs will be filled with the mean.

In [162]:
smart_187_median = df.loc[df['smart_187_raw'].notnull()]['smart_187_raw'].median()
smart_187_median
Out[162]:
0.0
In [163]:
df['smart_187_raw'].fillna(smart_187_median, inplace = True)
In [164]:
df['smart_187_raw'].isnull().sum()
Out[164]:
0
In [165]:
smart_188_median = df.loc[df['smart_188_raw'].notnull()]['smart_188_raw'].median()
smart_188_median
Out[165]:
0.0
In [166]:
df['smart_188_raw'].fillna(smart_188_median, inplace = True)
In [167]:
df['smart_188_raw'].isnull().sum()
Out[167]:
0
In [168]:
smart_190_mean = df.loc[df['smart_190_raw'].notnull()]['smart_190_raw'].mean()
smart_190_mean
Out[168]:
28.227229585330683
In [169]:
df['smart_190_raw'].fillna(smart_190_mean, inplace = True)
In [170]:
df['smart_190_raw'].isnull().sum()
Out[170]:
0

Memory Management and Reloading Checkpoint

In [171]:
if not os.path.isfile('pre_195_df.csv'):
    df.to_csv('pre_195_df.csv', index=False)

The Remaining Columns

These remaining columns have over 30% of their values missing, and an individualized approach will be taken with each of them. In some cases, categories of existing values may be helpful to preserve the potential for information with NaNs being their own category.

Row              NaN Count
smart_195_raw     4806304
smart_191_raw     6402847
smart_184_raw     6781011
smart_189_raw     6781011
smart_200_raw     7076978
smart_196_raw     7921364
smart_8_raw       7946665
smart_2_raw       7946665
smart_183_raw     9132294
smart_22_raw      9741975
smart_223_raw    10463678
smart_18_raw     10651999
smart_224_raw    10652391
smart_220_raw    10652391
smart_222_raw    10652391
smart_226_raw    10652391
smart_23_raw     10742991
smart_24_raw     10742991
smart_11_raw     10904619
smart_225_raw    10929920
smart_254_raw    10948140
smart_235_raw    10966321
smart_233_raw    10966321
smart_232_raw    10966321
smart_168_raw    10966321
smart_170_raw    10966321
smart_218_raw    10966321
smart_174_raw    10966321
smart_16_raw     10966321
smart_17_raw     10966321
smart_173_raw    10966321
smart_231_raw    10966321
smart_177_raw    10966321

smart_195_raw

This column only has values in a single manufacturer's drives, and even then only 77% of them. There appears to be virtually no difference in the column's distribution by failure status. Filling in NaNs with this information would only result in collinearity between the column and the manufacturer column, so it will be dropped from the dataframe.

In [247]:
sns.distplot(df.loc[df['smart_195_raw'].notnull()]['smart_195_raw'])
Out[247]:
<matplotlib.axes._subplots.AxesSubplot at 0x1721850e388>
In [253]:
sns.distplot(df.loc[df['failure'] == 0]['smart_195_raw'])
sns.distplot(df.loc[df['failure'] == 1]['smart_195_raw'])
plt.grid(True)
plt.title("smart_195_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
Out[253]:
<matplotlib.legend.Legend at 0x17218898f08>
In [248]:
df.loc[df['smart_195_raw'].notnull()]['manufacturer'].value_counts()
Out[248]:
Seagate    6168809
Name: manufacturer, dtype: int64
In [249]:
len(df.loc[df['smart_195_raw'].isnull()])
Out[249]:
4806302
In [250]:
df['manufacturer'].value_counts()
Out[250]:
Seagate            7966555
HGST               2660533
Toshiba             322722
Western Digital      25301
Name: manufacturer, dtype: int64
In [77]:
df[['smart_195_raw', 'failure']].corr()
Out[77]:
smart_195_raw failure
smart_195_raw 1.000000 0.000181
failure 0.000181 1.000000
In [172]:
df.drop(['smart_195_raw'], axis=1, inplace=True)
df.head()
Out[172]:
date serial_number model failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw ... smart_232_raw smart_233_raw smart_235_raw smart_240_raw smart_241_raw smart_242_raw smart_254_raw manufacturer capacity_TB smart_193_225
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 NaN 0.0 13.0 0.0 704304346.0 ... NaN NaN NaN 33009.000000 5.063798e+10 1.623458e+11 NaN Seagate 4.0 34188.0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 NaN 0.0 3.0 0.0 422822971.0 ... NaN NaN NaN 9533.000000 5.084775e+10 1.271356e+11 NaN Seagate 12.0 2426.0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 NaN 0.0 1.0 0.0 936518450.0 ... NaN NaN NaN 6977.000000 4.920827e+10 4.658787e+10 NaN Seagate 12.0 1881.0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 NaN 0.0 6.0 0.0 416687782.0 ... NaN NaN NaN 10669.000000 5.341374e+10 9.427903e+10 NaN Seagate 12.0 970.0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 103.0 436.0 9.0 0.0 0.0 ... NaN NaN NaN 19479.888113 5.326082e+10 1.400945e+11 NaN HGST 4.0 356.0

5 rows × 57 columns

smart_191_raw

This column is not split along manufacturer lines like many others, but still has a large percentage of missing values. A categorical column smart_191_cat will be created with the following categories and values.

Value Representation
0 Value NaN
1 Below Average
2 Above Average

The original smart_191_raw column will then be dropped.

In [255]:
sns.distplot(df.loc[df['smart_191_raw'].notnull()]['smart_191_raw'])
Out[255]:
<matplotlib.axes._subplots.AxesSubplot at 0x17218a02dc8>
In [256]:
sns.distplot(df.loc[df['failure'] == 0]['smart_191_raw'])
sns.distplot(df.loc[df['failure'] == 1]['smart_191_raw'])
plt.grid(True)
plt.title("smart_191_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
Out[256]:
<matplotlib.legend.Legend at 0x17218b18908>
In [79]:
sns.distplot(df.loc[(df['failure'] == 0) & \
                    (df['smart_191_raw'] != 0.0)]['smart_191_raw'])
sns.distplot(df.loc[(df['failure'] == 1) & \
                    (df['smart_191_raw'] != 0.0)]['smart_191_raw'])
plt.grid(True)
plt.title("smart_191_raw Nonzero Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
Out[79]:
<matplotlib.legend.Legend at 0x2638a69f108>
In [257]:
df.loc[df['smart_191_raw'].notnull()]['manufacturer'].value_counts()
Out[257]:
Seagate            4239295
Toshiba             322722
Western Digital      10249
Name: manufacturer, dtype: int64
In [258]:
len(df.loc[df['smart_191_raw'].isnull()])
Out[258]:
6402845
In [173]:
smart_191_mean = df.loc[df['smart_191_raw'].notnull()]['smart_191_raw'].mean()
smart_191_mean
Out[173]:
14090.159100979688
In [174]:
df['smart_191_cat'] = 0
In [175]:
df.loc[(df['smart_191_raw'] < smart_191_mean), 'smart_191_cat'] = 1
df.loc[(df['smart_191_raw'] > smart_191_mean), 'smart_191_cat'] = 2
In [176]:
df['smart_191_cat'] = df['smart_191_cat'].astype('category')
df['smart_191_cat'].dtype
Out[176]:
CategoricalDtype(categories=[0, 1, 2], ordered=False)
In [177]:
df['smart_191_cat'].value_counts()
Out[177]:
0    6402845
1    3563568
2    1008698
Name: smart_191_cat, dtype: int64
In [178]:
df['smart_191_cat'].isnull().sum()
Out[178]:
0
In [179]:
df.drop(['smart_191_raw'], axis=1, inplace=True)
df.head()
Out[179]:
date serial_number model failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw ... smart_233_raw smart_235_raw smart_240_raw smart_241_raw smart_242_raw smart_254_raw manufacturer capacity_TB smart_193_225 smart_191_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 NaN 0.0 13.0 0.0 704304346.0 ... NaN NaN 33009.000000 5.063798e+10 1.623458e+11 NaN Seagate 4.0 34188.0 1
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 NaN 0.0 3.0 0.0 422822971.0 ... NaN NaN 9533.000000 5.084775e+10 1.271356e+11 NaN Seagate 12.0 2426.0 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 NaN 0.0 1.0 0.0 936518450.0 ... NaN NaN 6977.000000 4.920827e+10 4.658787e+10 NaN Seagate 12.0 1881.0 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 NaN 0.0 6.0 0.0 416687782.0 ... NaN NaN 10669.000000 5.341374e+10 9.427903e+10 NaN Seagate 12.0 970.0 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 103.0 436.0 9.0 0.0 0.0 ... NaN NaN 19479.888113 5.326082e+10 1.400945e+11 NaN HGST 4.0 356.0 0

5 rows × 57 columns

smart_184_raw

This column very rarely has any value other than 0 when it is available. However, whenever it is available and not 0, it has a disproportionate ratio of failures to nonfailures, making it a very useful measure for predicting failure. A categorical column smart_184_cat will be created with the following categories and values.

Value Representation
0 Value is 0 or NaN
1 Value is above 0

The original smart_184_raw column will then be dropped.

In [180]:
sns.distplot(df.loc[df['smart_184_raw'].notnull()]['smart_184_raw'], kde = False)
Out[180]:
<matplotlib.axes._subplots.AxesSubplot at 0x16c1079f4c8>
In [181]:
sns.distplot(df.loc[df['failure'] == 0]['smart_184_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 1]['smart_184_raw'], kde = False)
plt.grid(True)
plt.title("smart_184_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
Out[181]:
<matplotlib.legend.Legend at 0x16c10895808>
In [182]:
df.loc[df['smart_184_raw'].notnull()]['smart_184_raw'].value_counts()
Out[182]:
0.0    4194090
1.0          5
5.0          3
9.0          1
8.0          1
4.0          1
2.0          1
Name: smart_184_raw, dtype: int64
In [183]:
df.loc[(df['smart_184_raw'] != 0) & \
       (df['smart_184_raw'].notnull())][['smart_184_raw', 'failure']]
Out[183]:
smart_184_raw failure
99758 8.0 True
2613651 9.0 True
4849813 1.0 True
5931498 2.0 True
8943626 4.0 False
9066037 5.0 False
9189214 5.0 True
9836420 1.0 False
9961273 1.0 False
10086127 1.0 True
10771703 1.0 False
10896354 5.0 False
In [184]:
df.loc[df['smart_184_raw'].notnull()]['manufacturer'].value_counts()
Out[184]:
Seagate            4194102
Western Digital          0
Toshiba                  0
HGST                     0
Name: manufacturer, dtype: int64
In [185]:
len(df.loc[df['smart_184_raw'].isnull()])
Out[185]:
6781009
In [186]:
df['smart_184_cat'] = 0
In [187]:
df.loc[(df['smart_184_raw'] > 0), 'smart_184_cat'] = 1
In [188]:
df['smart_184_cat'] = df['smart_184_cat'].astype('category')
df['smart_184_cat'].dtype
Out[188]:
CategoricalDtype(categories=[0, 1], ordered=False)
In [189]:
df['smart_184_cat'].value_counts()
Out[189]:
0    10975099
1          12
Name: smart_184_cat, dtype: int64
In [190]:
df['smart_184_cat'].isnull().sum()
Out[190]:
0
In [191]:
df.drop(['smart_184_raw'], axis=1, inplace=True)
df.head()
Out[191]:
date serial_number model failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw ... smart_235_raw smart_240_raw smart_241_raw smart_242_raw smart_254_raw manufacturer capacity_TB smart_193_225 smart_191_cat smart_184_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 NaN 0.0 13.0 0.0 704304346.0 ... NaN 33009.000000 5.063798e+10 1.623458e+11 NaN Seagate 4.0 34188.0 1 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 NaN 0.0 3.0 0.0 422822971.0 ... NaN 9533.000000 5.084775e+10 1.271356e+11 NaN Seagate 12.0 2426.0 0 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 NaN 0.0 1.0 0.0 936518450.0 ... NaN 6977.000000 4.920827e+10 4.658787e+10 NaN Seagate 12.0 1881.0 0 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 NaN 0.0 6.0 0.0 416687782.0 ... NaN 10669.000000 5.341374e+10 9.427903e+10 NaN Seagate 12.0 970.0 0 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 103.0 436.0 9.0 0.0 0.0 ... NaN 19479.888113 5.326082e+10 1.400945e+11 NaN HGST 4.0 356.0 0 0

5 rows × 57 columns

smart_189_raw

This column only has values in a single manufacturer's drives, and even then only 38% of them. There is also little correlation between this column and the failure rate. Filling in NaNs with this information could result in collinearity between the column and the manufacturer column as well, so it will be dropped from the dataframe without a category column.

In [295]:
sns.distplot(df.loc[df['smart_189_raw'].notnull()]['smart_189_raw'], kde = False)
Out[295]:
<matplotlib.axes._subplots.AxesSubplot at 0x17242fb9808>
In [296]:
sns.distplot(df.loc[df['failure'] == 0]['smart_189_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 1]['smart_189_raw'], kde = False)
plt.grid(True)
plt.title("smart_189_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
Out[296]:
<matplotlib.legend.Legend at 0x17218d76d88>
In [304]:
df['smart_189_raw'].value_counts()
Out[304]:
0.0       4032998
3.0          8714
6.0          8320
5.0          7783
7.0          7039
           ...   
6726.0          1
6721.0          1
6715.0          1
6711.0          1
2334.0          1
Name: smart_189_raw, Length: 818, dtype: int64
In [297]:
df.loc[df['smart_189_raw'].notnull()]['manufacturer'].value_counts()
Out[297]:
Seagate    4194102
Name: manufacturer, dtype: int64
In [298]:
len(df.loc[df['smart_189_raw'].isnull()])
Out[298]:
6781009
In [299]:
df[['smart_189_raw', 'failure']].corr()
Out[299]:
smart_189_raw failure
smart_189_raw 1.000000 -0.000079
failure -0.000079 1.000000
In [192]:
df.drop(['smart_189_raw'], axis=1, inplace=True)
df.head()
Out[192]:
date serial_number model failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw ... smart_235_raw smart_240_raw smart_241_raw smart_242_raw smart_254_raw manufacturer capacity_TB smart_193_225 smart_191_cat smart_184_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 NaN 0.0 13.0 0.0 704304346.0 ... NaN 33009.000000 5.063798e+10 1.623458e+11 NaN Seagate 4.0 34188.0 1 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 NaN 0.0 3.0 0.0 422822971.0 ... NaN 9533.000000 5.084775e+10 1.271356e+11 NaN Seagate 12.0 2426.0 0 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 NaN 0.0 1.0 0.0 936518450.0 ... NaN 6977.000000 4.920827e+10 4.658787e+10 NaN Seagate 12.0 1881.0 0 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 NaN 0.0 6.0 0.0 416687782.0 ... NaN 10669.000000 5.341374e+10 9.427903e+10 NaN Seagate 12.0 970.0 0 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 103.0 436.0 9.0 0.0 0.0 ... NaN 19479.888113 5.326082e+10 1.400945e+11 NaN HGST 4.0 356.0 0 0

5 rows × 56 columns

smart_200_raw

This column is not entirely split along manufacturer lines like many others, but still has a large percentage of missing values. Given the reasonably large correlation between a higher value and a higher failure rate, a categorical column smart_200_cat will be created with the following categories and values.

Value Representation
0 Value NaN
1 Below Average
2 Above Average

The original smart_200_raw column will then be dropped.

In [307]:
sns.distplot(df.loc[df['smart_200_raw'].notnull()]['smart_200_raw'], kde = False)
Out[307]:
<matplotlib.axes._subplots.AxesSubplot at 0x1722987a148>
In [308]:
sns.distplot(df.loc[df['failure'] == 0]['smart_200_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 1]['smart_200_raw'], kde = False)
plt.grid(True)
plt.title("smart_200_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
Out[308]:
<matplotlib.legend.Legend at 0x17229970308>
In [309]:
df['smart_200_raw'].value_counts()
Out[309]:
0.0         3852942
11894.0          93
292551.0         92
230037.0         92
128448.0         92
             ...   
402402.0          1
402415.0          1
402423.0          1
402469.0          1
295843.0          1
Name: smart_200_raw, Length: 30531, dtype: int64
In [320]:
sns.distplot(df.loc[(df['failure'] == 0) & (df['smart_200_raw'] != 0.0)]['smart_200_raw'])
sns.distplot(df.loc[(df['failure'] == 1) & (df['smart_200_raw'] != 0.0)]['smart_200_raw'])
plt.grid(True)
plt.title("smart_200_raw Nonzero Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
Out[320]:
<matplotlib.legend.Legend at 0x17240d80e08>
In [322]:
df.loc[df['smart_200_raw'].notnull()]['manufacturer'].value_counts()
Out[322]:
Seagate            3872834
Western Digital      25301
Name: manufacturer, dtype: int64
In [323]:
len(df.loc[df['smart_200_raw'].isnull()])
Out[323]:
7076976
In [310]:
df[['smart_200_raw', 'failure']].corr()
Out[310]:
smart_200_raw failure
smart_200_raw 1.000000 0.002178
failure 0.002178 1.000000
In [193]:
smart_200_mean = df.loc[df['smart_200_raw'].notnull()]['smart_200_raw'].mean()
smart_200_mean
Out[193]:
3494.200142375777
In [194]:
df.loc[(df['failure'] == 0) & (df['smart_200_raw'].notnull())]['smart_200_raw'].mean()
Out[194]:
3493.295720311555
In [195]:
df.loc[(df['failure'] == 1) & (df['smart_200_raw'].notnull())]['smart_200_raw'].mean()
Out[195]:
12351.484924623115
In [196]:
df['smart_200_cat'] = 0
In [197]:
df.loc[(df['smart_200_raw'] < smart_200_mean), 'smart_200_cat'] = 1
df.loc[(df['smart_200_raw'] > smart_200_mean), 'smart_200_cat'] = 2
In [198]:
df['smart_200_cat'] = df['smart_200_cat'].astype('category')
df['smart_200_cat'].dtype
Out[198]:
CategoricalDtype(categories=[0, 1, 2], ordered=False)
In [199]:
df['smart_200_cat'].value_counts()
Out[199]:
0    7076976
1    3853796
2      44339
Name: smart_200_cat, dtype: int64
In [200]:
df['smart_200_cat'].isnull().sum()
Out[200]:
0
In [201]:
df.drop(['smart_200_raw'], axis=1, inplace=True)
df.head()
Out[201]:
date serial_number model failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw ... smart_240_raw smart_241_raw smart_242_raw smart_254_raw manufacturer capacity_TB smart_193_225 smart_191_cat smart_184_cat smart_200_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 NaN 0.0 13.0 0.0 704304346.0 ... 33009.000000 5.063798e+10 1.623458e+11 NaN Seagate 4.0 34188.0 1 0 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 NaN 0.0 3.0 0.0 422822971.0 ... 9533.000000 5.084775e+10 1.271356e+11 NaN Seagate 12.0 2426.0 0 0 1
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 NaN 0.0 1.0 0.0 936518450.0 ... 6977.000000 4.920827e+10 4.658787e+10 NaN Seagate 12.0 1881.0 0 0 1
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 NaN 0.0 6.0 0.0 416687782.0 ... 10669.000000 5.341374e+10 9.427903e+10 NaN Seagate 12.0 970.0 0 0 1
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 103.0 436.0 9.0 0.0 0.0 ... 19479.888113 5.326082e+10 1.400945e+11 NaN HGST 4.0 356.0 0 0 0

5 rows × 56 columns

smart_196_raw

This column is not split along manufacturer lines whatsoever, but still has a large percentage of missing values. Given the reasonably large correlation between a higher value and a higher failure rate, a categorical column smart_196_cat will be created with the following categories and values.

Value Representation
0 Value NaN
1 Below Average
2 Above Average

The original smart_196_raw column will then be dropped.

In [333]:
sns.distplot(df.loc[df['smart_196_raw'].notnull()]['smart_196_raw'], kde = False)
Out[333]:
<matplotlib.axes._subplots.AxesSubplot at 0x172424a0608>
In [334]:
sns.distplot(df.loc[df['failure'] == 0]['smart_196_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 1]['smart_196_raw'], kde = False)
plt.grid(True)
plt.title("smart_196_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
Out[334]:
<matplotlib.legend.Legend at 0x172431350c8>
In [335]:
df['smart_196_raw'].value_counts()
Out[335]:
0.0       3042208
1.0          2466
2.0           472
6.0           353
7.0           317
           ...   
1054.0          1
1056.0          1
1057.0          1
375.0           1
5085.0          1
Name: smart_196_raw, Length: 505, dtype: int64
In [336]:
sns.distplot(df.loc[(df['failure'] == 0) & (df['smart_196_raw'] != 0.0)]['smart_196_raw'])
sns.distplot(df.loc[(df['failure'] == 1) & (df['smart_196_raw'] != 0.0)]['smart_196_raw'])
plt.grid(True)
plt.title("smart_196_raw Nonzero Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
Out[336]:
<matplotlib.legend.Legend at 0x1724324c2c8>
In [337]:
df.loc[df['smart_196_raw'].notnull()]['manufacturer'].value_counts()
Out[337]:
HGST               2660533
Toshiba             322722
Seagate              45193
Western Digital      25301
Name: manufacturer, dtype: int64
In [338]:
len(df.loc[df['smart_196_raw'].isnull()])
Out[338]:
7921362
In [339]:
df[['smart_196_raw', 'failure']].corr()
Out[339]:
smart_196_raw failure
smart_196_raw 1.000000 0.009671
failure 0.009671 1.000000
In [202]:
smart_196_mean = df.loc[df['smart_196_raw'].notnull()]['smart_196_raw'].mean()
smart_196_mean
Out[202]:
0.6639966153079379
In [203]:
df.loc[(df['failure'] == 0) & (df['smart_196_raw'].notnull())]['smart_196_raw'].mean()
Out[203]:
0.662450747691953
In [204]:
df.loc[(df['failure'] == 1) & (df['smart_196_raw'].notnull())]['smart_196_raw'].mean()
Out[204]:
56.2
In [205]:
df['smart_196_cat'] = 0
In [206]:
df.loc[(df['smart_196_raw'] < smart_196_mean), 'smart_196_cat'] = 1
df.loc[(df['smart_196_raw'] > smart_196_mean), 'smart_196_cat'] = 2
In [207]:
df['smart_196_cat'] = df['smart_196_cat'].astype('category')
df['smart_196_cat'].dtype
Out[207]:
CategoricalDtype(categories=[0, 1, 2], ordered=False)
In [208]:
df['smart_196_cat'].value_counts()
Out[208]:
0    7921362
1    3042208
2      11541
Name: smart_196_cat, dtype: int64
In [209]:
df['smart_196_cat'].isnull().sum()
Out[209]:
0
In [210]:
df.drop(['smart_196_raw'], axis=1, inplace=True)
df.head()
Out[210]:
date serial_number model failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw ... smart_241_raw smart_242_raw smart_254_raw manufacturer capacity_TB smart_193_225 smart_191_cat smart_184_cat smart_200_cat smart_196_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 NaN 0.0 13.0 0.0 704304346.0 ... 5.063798e+10 1.623458e+11 NaN Seagate 4.0 34188.0 1 0 0 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 NaN 0.0 3.0 0.0 422822971.0 ... 5.084775e+10 1.271356e+11 NaN Seagate 12.0 2426.0 0 0 1 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 NaN 0.0 1.0 0.0 936518450.0 ... 4.920827e+10 4.658787e+10 NaN Seagate 12.0 1881.0 0 0 1 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 NaN 0.0 6.0 0.0 416687782.0 ... 5.341374e+10 9.427903e+10 NaN Seagate 12.0 970.0 0 0 1 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 103.0 436.0 9.0 0.0 0.0 ... 5.326082e+10 1.400945e+11 NaN HGST 4.0 356.0 0 0 0 1

5 rows × 56 columns

smart_8_raw

This column is not strongly split along manufacturer lines, but still has a large percentage of missing values. Given the reasonably large negative correlation between a higher value and failure rate, a categorical column smart_8_cat will be created with the following categories and values.

Value Representation
0 Value NaN
1 Below Average
2 Above Average

The original smart_8_raw column will then be dropped.

In [349]:
sns.distplot(df.loc[df['smart_8_raw'].notnull()]['smart_8_raw'])
Out[349]:
<matplotlib.axes._subplots.AxesSubplot at 0x17243340c88>
In [350]:
sns.distplot(df.loc[df['failure'] == 0]['smart_8_raw'])
sns.distplot(df.loc[df['failure'] == 1]['smart_8_raw'])
plt.grid(True)
plt.title("smart_8_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
Out[350]:
<matplotlib.legend.Legend at 0x1724344c088>
In [351]:
df['smart_8_raw'].value_counts()
Out[351]:
42.0    1188731
0.0     1016768
18.0     511993
43.0     139862
41.0      70099
15.0      40426
17.0      19545
16.0      16678
44.0      13437
40.0       9087
45.0       1822
Name: smart_8_raw, dtype: int64
In [353]:
df.loc[df['smart_8_raw'].notnull()]['manufacturer'].value_counts()
Out[353]:
HGST       2660533
Toshiba     322722
Seagate      45193
Name: manufacturer, dtype: int64
In [354]:
len(df.loc[df['smart_8_raw'].isnull()])
Out[354]:
7946663
In [355]:
df[['smart_8_raw', 'failure']].corr()
Out[355]:
smart_8_raw failure
smart_8_raw 1.000000 -0.003561
failure -0.003561 1.000000
In [211]:
smart_8_mean = df.loc[df['smart_8_raw'].notnull()]['smart_8_raw'].mean()
smart_8_mean
Out[211]:
23.204262381259312
In [212]:
df.loc[(df['failure'] == 0) & (df['smart_8_raw'].notnull())]['smart_8_raw'].mean()
Out[212]:
23.20460452474583
In [213]:
df.loc[(df['failure'] == 1) & (df['smart_8_raw'].notnull())]['smart_8_raw'].mean()
Out[213]:
10.08860759493671
In [214]:
df['smart_8_cat'] = 0
In [215]:
df.loc[(df['smart_8_raw'] < smart_8_mean), 'smart_8_cat'] = 1
df.loc[(df['smart_8_raw'] > smart_8_mean), 'smart_8_cat'] = 2
In [216]:
df['smart_8_cat'] = df['smart_8_cat'].astype('category')
df['smart_8_cat'].dtype
Out[216]:
CategoricalDtype(categories=[0, 1, 2], ordered=False)
In [217]:
df['smart_8_cat'].value_counts()
Out[217]:
0    7946663
1    1605410
2    1423038
Name: smart_8_cat, dtype: int64
In [218]:
df['smart_8_cat'].isnull().sum()
Out[218]:
0
In [219]:
df.drop(['smart_8_raw'], axis=1, inplace=True)
df.head()
Out[219]:
date serial_number model failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw ... smart_242_raw smart_254_raw manufacturer capacity_TB smart_193_225 smart_191_cat smart_184_cat smart_200_cat smart_196_cat smart_8_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 NaN 0.0 13.0 0.0 704304346.0 ... 1.623458e+11 NaN Seagate 4.0 34188.0 1 0 0 0 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 NaN 0.0 3.0 0.0 422822971.0 ... 1.271356e+11 NaN Seagate 12.0 2426.0 0 0 1 0 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 NaN 0.0 1.0 0.0 936518450.0 ... 4.658787e+10 NaN Seagate 12.0 1881.0 0 0 1 0 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 NaN 0.0 6.0 0.0 416687782.0 ... 9.427903e+10 NaN Seagate 12.0 970.0 0 0 1 0 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 103.0 436.0 9.0 0.0 0.0 ... 1.400945e+11 NaN HGST 4.0 356.0 0 0 0 1 2

5 rows × 56 columns

smart_2_raw

This column is not strongly split along manufacturer lines, but still has a large percentage of missing values. Given the reasonably large negative correlation between a higher value and failure rate, a categorical column smart_2_cat will be created with the following categories and values.

Value Representation
0 Value NaN
1 Below Average
2 Above Average

The original smart_2_raw column will then be dropped.

In [368]:
sns.distplot(df.loc[df['smart_2_raw'].notnull()]['smart_2_raw'])
Out[368]:
<matplotlib.axes._subplots.AxesSubplot at 0x17255978f48>
In [369]:
sns.distplot(df.loc[df['failure'] == 0]['smart_2_raw'])
sns.distplot(df.loc[df['failure'] == 1]['smart_2_raw'])
plt.grid(True)
plt.title("smart_2_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
Out[369]:
<matplotlib.legend.Legend at 0x1731879a0c8>
In [370]:
df['smart_2_raw'].value_counts()
Out[370]:
0.0      1016766
100.0     391933
96.0      348559
104.0     310452
103.0     213528
          ...   
70.0           3
161.0          1
67.0           1
64.0           1
62.0           1
Name: smart_2_raw, Length: 72, dtype: int64
In [371]:
df.loc[df['smart_2_raw'].notnull()]['manufacturer'].value_counts()
Out[371]:
HGST       2660533
Toshiba     322722
Seagate      45193
Name: manufacturer, dtype: int64
In [372]:
len(df.loc[df['smart_2_raw'].isnull()])
Out[372]:
7946663
In [373]:
df[['smart_2_raw', 'failure']].corr()
Out[373]:
smart_2_raw failure
smart_2_raw 1.000000 -0.003998
failure -0.003998 1.000000
In [220]:
smart_2_mean = df.loc[df['smart_2_raw'].notnull()]['smart_2_raw'].mean()
smart_2_mean
Out[220]:
67.00107943078434
In [376]:
df.loc[(df['failure'] == 0) & (df['smart_2_raw'].notnull())]['smart_2_raw'].mean()
Out[376]:
100.86548962821233
In [377]:
df.loc[(df['failure'] == 1) & (df['smart_2_raw'].notnull())]['smart_2_raw'].mean()
Out[377]:
100.65217391304348
In [221]:
df['smart_2_cat'] = 0
In [222]:
df.loc[(df['smart_2_raw'] < smart_2_mean), 'smart_2_cat'] = 1
df.loc[(df['smart_2_raw'] > smart_2_mean), 'smart_2_cat'] = 2
In [223]:
df['smart_2_cat'] = df['smart_2_cat'].astype('category')
df['smart_2_cat'].dtype
Out[223]:
CategoricalDtype(categories=[0, 1, 2], ordered=False)
In [116]:
df['smart_2_cat'].value_counts()
Out[116]:
0    7946663
2    2010939
1    1017509
Name: smart_2_cat, dtype: int64
In [117]:
df['smart_2_cat'].isnull().sum()
Out[117]:
0
In [224]:
df.drop(['smart_2_raw'], axis=1, inplace=True)
df.head()
Out[224]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_254_raw manufacturer capacity_TB smart_193_225 smart_191_cat smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 0.0 13.0 0.0 704304346.0 33261.0 ... NaN Seagate 4.0 34188.0 1 0 0 0 0 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 0.0 3.0 0.0 422822971.0 10298.0 ... NaN Seagate 12.0 2426.0 0 0 1 0 0 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 0.0 1.0 0.0 936518450.0 7329.0 ... NaN Seagate 12.0 1881.0 0 0 1 0 0 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 0.0 6.0 0.0 416687782.0 10898.0 ... NaN Seagate 12.0 970.0 0 0 1 0 0 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 436.0 9.0 0.0 0.0 23334.0 ... NaN HGST 4.0 356.0 0 0 0 1 2 2

5 rows × 56 columns

smart_183_raw

This column only has values in a single manufacturer's drives, and even then only 23% of them. There is also little correlation between this column and the failure rate. Filling in NaNs with this information could result in collinearity between the column and the manufacturer column as well, so it will be dropped from the dataframe without a category column.

In [397]:
sns.distplot(df.loc[df['smart_183_raw'].notnull()]['smart_183_raw'], kde = False)
Out[397]:
<matplotlib.axes._subplots.AxesSubplot at 0x1723729f608>
In [399]:
sns.distplot(df.loc[df['failure'] == 0]['smart_183_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 1]['smart_183_raw'], kde = False)
plt.grid(True)
plt.title("smart_183_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
Out[399]:
<matplotlib.legend.Legend at 0x173188c32c8>
In [400]:
df['smart_183_raw'].value_counts()
Out[400]:
0.0     1526248
2.0      160993
1.0       66360
4.0       17906
3.0       16077
         ...   
86.0         75
43.0         57
87.0         52
68.0         39
58.0          2
Name: smart_183_raw, Length: 116, dtype: int64
In [401]:
sns.distplot(df.loc[(df['failure'] == 0) & (df['smart_183_raw'] != 0.0)]['smart_183_raw'])
sns.distplot(df.loc[(df['failure'] == 1) & (df['smart_183_raw'] != 0.0)]['smart_183_raw'])
plt.grid(True)
plt.title("smart_183_raw Nonzero Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
Out[401]:
<matplotlib.legend.Legend at 0x172291088c8>
In [402]:
df.loc[df['smart_183_raw'].notnull()]['manufacturer'].value_counts()
Out[402]:
Seagate    1842819
Name: manufacturer, dtype: int64
In [403]:
len(df.loc[df['smart_183_raw'].isnull()])
Out[403]:
9132292
In [405]:
df[['smart_183_raw', 'failure']].corr()
Out[405]:
smart_183_raw failure
smart_183_raw 1.000000 0.000605
failure 0.000605 1.000000
In [225]:
df.drop(['smart_183_raw'], axis=1, inplace=True)
df.head()
Out[225]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_254_raw manufacturer capacity_TB smart_193_225 smart_191_cat smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 0.0 13.0 0.0 704304346.0 33261.0 ... NaN Seagate 4.0 34188.0 1 0 0 0 0 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 0.0 3.0 0.0 422822971.0 10298.0 ... NaN Seagate 12.0 2426.0 0 0 1 0 0 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 0.0 1.0 0.0 936518450.0 7329.0 ... NaN Seagate 12.0 1881.0 0 0 1 0 0 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 0.0 6.0 0.0 416687782.0 10898.0 ... NaN Seagate 12.0 970.0 0 0 1 0 0 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 436.0 9.0 0.0 0.0 23334.0 ... NaN HGST 4.0 356.0 0 0 0 1 2 2

5 rows × 55 columns

Memory Management and Reloading Checkpoint

In [233]:
if not os.path.isfile('pre_22_df.csv'):
    df.to_csv('pre_22_df.csv', index=False)
In [3]:
df = pd.read_csv('pre_22_df.csv')
n_rows = len(df)
df['date'] = df['date'].astype('category')
df['model'] = df['model'].astype('category')
df['failure'] = df['failure'].astype('bool')
df['manufacturer'] = df['manufacturer'].astype('category')
df['smart_191_cat'] = df['smart_191_cat'].astype('category')
df['smart_184_cat'] = df['smart_184_cat'].astype('category')
df['smart_200_cat'] = df['smart_200_cat'].astype('category')
df['smart_196_cat'] = df['smart_196_cat'].astype('category')
df['smart_8_cat'] = df['smart_8_cat'].astype('category')

smart_22_raw

This column only has values in a single manufacturer's drives, as it is an indication of helium levels encased in certain HGST drives (Klein, 2015). Given this, it would make no sense to fill this column's NaN values in rows of drives from other manufacturers. Beyond that, the dataset does not have any failures with abnormal levels, making this column potentially a negative impact to the real-world effectiveness of a predictive model. Given this risk, the risk of collinearity with the manufacturer column, and the low correlation of this column and the failure rate, this column will be dropped from the dataframe without a category column for the simplification of the models.

In [413]:
sns.distplot(df.loc[df['smart_22_raw'].notnull()]['smart_22_raw'], kde = False)
Out[413]:
<matplotlib.axes._subplots.AxesSubplot at 0x17218e47a88>
In [123]:
sns.distplot(df.loc[df['failure'] == 0]['smart_22_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 1]['smart_22_raw'], kde = False)
plt.grid(True)
plt.title("smart_22_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
Out[123]:
<matplotlib.legend.Legend at 0x26243e3ea08>
In [418]:
df['smart_22_raw'].value_counts()
Out[418]:
100.0    1232992
98.0          23
97.0          23
94.0          13
99.0          12
96.0          11
95.0           8
92.0           6
91.0           5
93.0           5
88.0           4
71.0           3
81.0           2
73.0           2
74.0           2
75.0           2
76.0           2
79.0           2
80.0           2
84.0           2
83.0           2
87.0           2
89.0           2
90.0           2
82.0           1
77.0           1
85.0           1
86.0           1
70.0           1
69.0           1
66.0           1
61.0           1
58.0           1
Name: smart_22_raw, dtype: int64
In [124]:
sns.distplot(df.loc[(df['failure'] == 0) & (df['smart_22_raw'] != 100.0)]['smart_22_raw'], kde = False)
sns.distplot(df.loc[(df['failure'] == 1) & (df['smart_22_raw'] != 100.0)]['smart_22_raw'], kde = False)
plt.grid(True)
plt.title("smart_22_raw Non100 Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
C:\Users\aedri\Anaconda3\envs\tf1\lib\site-packages\seaborn\distributions.py:200: RuntimeWarning: Mean of empty slice.
  line, = ax.plot(a.mean(), 0)
C:\Users\aedri\AppData\Local\Programs\Python\Python37\Lib\site-packages\numpy\core\_methods.py:161: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Out[124]:
<matplotlib.legend.Legend at 0x2638c946508>
In [132]:
df.loc[df['failure'] == 1]['smart_22_raw'].value_counts()
Out[132]:
100.0    10
Name: smart_22_raw, dtype: int64
In [133]:
df.loc[df['smart_22_raw'].notnull()]['manufacturer'].value_counts()
Out[133]:
HGST    1233138
Name: manufacturer, dtype: int64
In [134]:
len(df.loc[df['smart_22_raw'].isnull()])
Out[134]:
9741973
In [135]:
df[['smart_22_raw', 'failure']].corr()
Out[135]:
smart_22_raw failure
smart_22_raw 1.000000 0.000022
failure 0.000022 1.000000
In [4]:
df.drop(['smart_22_raw'], axis=1, inplace=True)
df.head()
Out[4]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_254_raw manufacturer capacity_TB smart_193_225 smart_191_cat smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 0.0 13.0 0.0 704304346.0 33261.0 ... NaN Seagate 4.0 34188.0 1 0 0 0 0 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 0.0 3.0 0.0 422822971.0 10298.0 ... NaN Seagate 12.0 2426.0 0 0 1 0 0 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 0.0 1.0 0.0 936518450.0 7329.0 ... NaN Seagate 12.0 1881.0 0 0 1 0 0 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 0.0 6.0 0.0 416687782.0 10898.0 ... NaN Seagate 12.0 970.0 0 0 1 0 0 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 436.0 9.0 0.0 0.0 23334.0 ... NaN HGST 4.0 356.0 0 0 0 1 2 2

5 rows × 54 columns

smart_223_raw

This column is not strongly split along manufacturer lines, but still has a large percentage of missing values. Given the reasonably large negative correlation between a higher value and failure rate, a categorical column smart_223_cat will be created with the following categories and values.

Value Representation
0 Value NaN
1 Below Average
2 Above Average

The original smart_223_raw column will then be dropped.

In [138]:
sns.distplot(df.loc[df['smart_223_raw'].notnull()]['smart_223_raw'], kde = False)
Out[138]:
<matplotlib.axes._subplots.AxesSubplot at 0x2627455f688>
In [139]:
sns.distplot(df.loc[df['failure'] == 0]['smart_223_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 1]['smart_223_raw'], kde = False)
plt.grid(True)
plt.title("smart_223_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
Out[139]:
<matplotlib.legend.Legend at 0x262746c9d08>
In [140]:
df['smart_223_raw'].value_counts()
Out[140]:
0.0       466047
164.0        208
196.0        202
654.0        191
484.0        191
           ...  
2587.0         1
2588.0         1
2594.0         1
2606.0         1
1852.0         1
Name: smart_223_raw, Length: 3560, dtype: int64
In [141]:
sns.distplot(df.loc[(df['failure'] == 0) & (df['smart_223_raw'] != 0.0)]['smart_223_raw'])
sns.distplot(df.loc[(df['failure'] == 1) & (df['smart_223_raw'] != 0.0)]['smart_223_raw'])
plt.grid(True)
plt.title("smart_223_raw Nonzero Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
Out[141]:
<matplotlib.legend.Legend at 0x262747d2208>
In [142]:
df.loc[df['failure'] == 1]['smart_223_raw'].value_counts()
Out[142]:
0.0        41
911.0       1
331.0       1
4590.0      1
266.0       1
1160.0      1
1215.0      1
5115.0      1
2609.0      1
2102.0      1
841.0       1
13377.0     1
221.0       1
184.0       1
Name: smart_223_raw, dtype: int64
In [143]:
df.loc[df['smart_223_raw'].notnull()]['manufacturer'].value_counts()
Out[143]:
Toshiba    322722
HGST       143520
Seagate     45193
Name: manufacturer, dtype: int64
In [144]:
len(df.loc[df['smart_223_raw'].isnull()])
Out[144]:
10463676
In [145]:
df[['smart_223_raw', 'failure']].corr()
Out[145]:
smart_223_raw failure
smart_223_raw 1.000000 0.004138
failure 0.004138 1.000000
In [5]:
smart_223_mean = df.loc[df['smart_223_raw'].notnull()]['smart_223_raw'].mean()
smart_223_mean
Out[5]:
113.52614701770509
In [147]:
df.loc[(df['failure'] == 0) & (df['smart_223_raw'].notnull())]['smart_223_raw'].mean()
Out[147]:
1278.8611129476585
In [148]:
df.loc[(df['failure'] == 1) & (df['smart_223_raw'].notnull())]['smart_223_raw'].mean()
Out[148]:
2532.4615384615386
In [6]:
df['smart_223_cat'] = 0
In [7]:
df.loc[(df['smart_223_raw'] < smart_223_mean), 'smart_223_cat'] = 1
df.loc[(df['smart_223_raw'] > smart_223_mean), 'smart_223_cat'] = 2
In [8]:
df['smart_223_cat'] = df['smart_223_cat'].astype('category')
df['smart_223_cat'].dtype
Out[8]:
CategoricalDtype(categories=[0, 1, 2], ordered=False)
In [9]:
df['smart_223_cat'].value_counts()
Out[9]:
0    10463676
1      469343
2       42092
Name: smart_223_cat, dtype: int64
In [10]:
df['smart_223_cat'].isnull().sum()
Out[10]:
0
In [11]:
df.drop(['smart_223_raw'], axis=1, inplace=True)
df.head()
Out[11]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... manufacturer capacity_TB smart_193_225 smart_191_cat smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 0.0 13.0 0.0 704304346.0 33261.0 ... Seagate 4.0 34188.0 1 0 0 0 0 0 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 0.0 3.0 0.0 422822971.0 10298.0 ... Seagate 12.0 2426.0 0 0 1 0 0 0 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 0.0 1.0 0.0 936518450.0 7329.0 ... Seagate 12.0 1881.0 0 0 1 0 0 0 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 0.0 6.0 0.0 416687782.0 10898.0 ... Seagate 12.0 970.0 0 0 1 0 0 0 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 436.0 9.0 0.0 0.0 23334.0 ... HGST 4.0 356.0 0 0 0 1 2 2 0

5 rows × 54 columns

Memory Management and Reloading Checkpoint

In [12]:
if not os.path.isfile('pre_18_df.csv'):
    df.to_csv('pre_18_df.csv', index=False)
In [5]:
df = pd.read_csv('pre_18_df.csv')
n_rows = len(df)
df['date'] = df['date'].astype('category')
df['model'] = df['model'].astype('category')
df['failure'] = df['failure'].astype('bool')
df['manufacturer'] = df['manufacturer'].astype('category')
df['smart_191_cat'] = df['smart_191_cat'].astype('category')
df['smart_184_cat'] = df['smart_184_cat'].astype('category')
df['smart_200_cat'] = df['smart_200_cat'].astype('category')
df['smart_196_cat'] = df['smart_196_cat'].astype('category')
df['smart_8_cat'] = df['smart_8_cat'].astype('category')
df['smart_2_cat'] = df['smart_2_cat'].astype('category')
df['smart_223_cat'] = df['smart_223_cat'].astype('category')

smart_18_raw

This column is not only missing 97% of its values, it also has no variance whatsoever, making it useless for analysis.

In [6]:
sns.distplot(df.loc[df['smart_18_raw'].notnull()]['smart_18_raw'], kde = False)
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x1dd4e11b108>
In [7]:
df['smart_18_raw'].value_counts()
Out[7]:
0.0    323114
Name: smart_18_raw, dtype: int64
In [8]:
df.loc[df['failure'] == 1]['smart_18_raw'].value_counts()
Out[8]:
0.0    10
Name: smart_18_raw, dtype: int64
In [9]:
df.loc[df['smart_18_raw'].notnull()]['manufacturer'].value_counts()
Out[9]:
Seagate            323114
Western Digital         0
Toshiba                 0
HGST                    0
Name: manufacturer, dtype: int64
In [13]:
df.drop(['smart_18_raw'], axis=1, inplace=True)
df.head()
Out[13]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... manufacturer capacity_TB smart_193_225 smart_191_cat smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 0.0 13.0 0.0 704304346.0 33261.0 ... Seagate 4.0 34188.0 1 0 0 0 0 0 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 0.0 3.0 0.0 422822971.0 10298.0 ... Seagate 12.0 2426.0 0 0 1 0 0 0 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 0.0 1.0 0.0 936518450.0 7329.0 ... Seagate 12.0 1881.0 0 0 1 0 0 0 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 0.0 6.0 0.0 416687782.0 10898.0 ... Seagate 12.0 970.0 0 0 1 0 0 0 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 436.0 9.0 0.0 0.0 23334.0 ... HGST 4.0 356.0 0 0 0 1 2 2 0

5 rows × 53 columns

smart_224_raw

This column is not only missing 97% of its values, it also has no variance whatsoever, making it useless for analysis.

In [12]:
sns.distplot(df.loc[df['smart_224_raw'].notnull()]['smart_224_raw'], kde = False)
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x1de874b18c8>
In [13]:
df['smart_224_raw'].value_counts()
Out[13]:
0.0    322722
Name: smart_224_raw, dtype: int64
In [15]:
df.loc[df['failure'] == 1]['smart_224_raw'].value_counts()
Out[15]:
0.0    40
Name: smart_224_raw, dtype: int64
In [16]:
df.loc[df['smart_224_raw'].notnull()]['manufacturer'].value_counts()
Out[16]:
Toshiba            322722
Western Digital         0
Seagate                 0
HGST                    0
Name: manufacturer, dtype: int64
In [14]:
df.drop(['smart_224_raw'], axis=1, inplace=True)
df.head()
Out[14]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... manufacturer capacity_TB smart_193_225 smart_191_cat smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 0.0 13.0 0.0 704304346.0 33261.0 ... Seagate 4.0 34188.0 1 0 0 0 0 0 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 0.0 3.0 0.0 422822971.0 10298.0 ... Seagate 12.0 2426.0 0 0 1 0 0 0 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 0.0 1.0 0.0 936518450.0 7329.0 ... Seagate 12.0 1881.0 0 0 1 0 0 0 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 0.0 6.0 0.0 416687782.0 10898.0 ... Seagate 12.0 970.0 0 0 1 0 0 0 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 436.0 9.0 0.0 0.0 23334.0 ... HGST 4.0 356.0 0 0 0 1 2 2 0

5 rows × 52 columns

smart_220_raw

This column is entirely split along manufacturer lines and has a large percentage of missing values, but it seems to be one of the few predictors available for Toshiba drives. Given the relatively large negative correlation between a higher value and failure rate, a categorical column smart_220_cat will be created with the following categories and values.

Value Representation
0 Value NaN
1 Below Median
2 Above Median

The original smart_220_raw column will then be dropped.

In [8]:
sns.distplot(df.loc[df['smart_220_raw'].notnull()]['smart_220_raw'])
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ce586d9e48>
In [9]:
sns.distplot(df.loc[df['failure'] == 0]['smart_220_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 1]['smart_220_raw'], kde = False)
plt.grid(True)
plt.title("smart_220_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
Out[9]:
<matplotlib.legend.Legend at 0x1cd48449348>
In [10]:
df['smart_220_raw'].value_counts()
Out[10]:
0.0            91799
2097152.0       3664
2097153.0       3215
1835008.0       2602
2228224.0       2455
               ...  
393222.0           1
393221.0           1
52035592.0         1
219545607.0        1
286130185.0        1
Name: smart_220_raw, Length: 2581, dtype: int64
In [11]:
sns.distplot(df.loc[(df['failure'] == 0) & (df['smart_220_raw'] != 0.0)]['smart_220_raw'])
sns.distplot(df.loc[(df['failure'] == 1) & (df['smart_220_raw'] != 0.0)]['smart_220_raw'])
plt.grid(True)
plt.title("smart_220_raw Nonzero Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
Out[11]:
<matplotlib.legend.Legend at 0x1cde0ecf1c8>
In [12]:
df.loc[df['failure'] == 1]['smart_220_raw'].value_counts()
Out[12]:
0.0            33
235536386.0     1
218234882.0     1
1048579.0       1
2097155.0       1
18612226.0      1
134217729.0     1
1835009.0       1
Name: smart_220_raw, dtype: int64
In [14]:
len(df.loc[df['smart_220_raw'].isnull()])
Out[14]:
10652389
In [13]:
df.loc[df['smart_220_raw'].notnull()]['manufacturer'].value_counts()
Out[13]:
Toshiba            322722
Western Digital         0
Seagate                 0
HGST                    0
Name: manufacturer, dtype: int64
In [15]:
df[['smart_220_raw', 'failure']].corr()
Out[15]:
smart_220_raw failure
smart_220_raw 1.000000 -0.006208
failure -0.006208 1.000000
In [15]:
smart_220_median = df.loc[df['smart_220_raw'].notnull()]['smart_220_raw'].median()
smart_220_median
Out[15]:
17563650.0
In [18]:
df.loc[(df['failure'] == 0) & (df['smart_220_raw'].notnull())]['smart_220_raw'].mean()
Out[18]:
60495145.75764994
In [19]:
df.loc[(df['failure'] == 1) & (df['smart_220_raw'].notnull())]['smart_220_raw'].mean()
Out[19]:
15289549.15
In [16]:
df['smart_220_cat'] = 0
In [17]:
df.loc[(df['smart_220_raw'] < smart_220_median), 'smart_220_cat'] = 1
df.loc[(df['smart_220_raw'] > smart_220_median), 'smart_220_cat'] = 2
In [18]:
df['smart_220_cat'] = df['smart_220_cat'].astype('category')
df['smart_220_cat'].dtype
Out[18]:
CategoricalDtype(categories=[0, 1, 2], ordered=False)
In [19]:
df['smart_220_cat'].value_counts()
Out[19]:
0    10652643
1      161341
2      161127
Name: smart_220_cat, dtype: int64
In [20]:
df['smart_220_cat'].isnull().sum()
Out[20]:
0
In [21]:
df.drop(['smart_220_raw'], axis=1, inplace=True)
df.head()
Out[21]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... capacity_TB smart_193_225 smart_191_cat smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 0.0 13.0 0.0 704304346.0 33261.0 ... 4.0 34188.0 1 0 0 0 0 0 0 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 0.0 3.0 0.0 422822971.0 10298.0 ... 12.0 2426.0 0 0 1 0 0 0 0 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 0.0 1.0 0.0 936518450.0 7329.0 ... 12.0 1881.0 0 0 1 0 0 0 0 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 0.0 6.0 0.0 416687782.0 10898.0 ... 12.0 970.0 0 0 1 0 0 0 0 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 436.0 9.0 0.0 0.0 23334.0 ... 4.0 356.0 0 0 0 1 2 2 0 0

5 rows × 52 columns

smart_222_raw

Although only available on the Toshiba drives, this is the highest correlation to failure rates yet. A categorical column smart_222_cat will be created with the following categories and values.

Value Representation
0 Value NaN
1 Below Average
2 Above Average

The original smart_222_raw column will then be dropped.

In [28]:
sns.distplot(df.loc[df['smart_222_raw'].notnull()]['smart_222_raw'])
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ce3f5c4548>
In [29]:
sns.distplot(df.loc[df['failure'] == 0]['smart_222_raw'])
sns.distplot(df.loc[df['failure'] == 1]['smart_222_raw'])
plt.grid(True)
plt.title("smart_222_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
Out[29]:
<matplotlib.legend.Legend at 0x1cd0f4e9908>
In [30]:
df['smart_222_raw'].value_counts()
Out[30]:
77.0       525
89.0       375
90.0       333
94.0       304
58.0       301
          ... 
30889.0      1
28043.0      1
26603.0      1
26609.0      1
26583.0      1
Name: smart_222_raw, Length: 27393, dtype: int64
In [32]:
df.loc[df['failure'] == 1]['smart_222_raw'].value_counts()
Out[32]:
25099.0    2
23812.0    1
24778.0    1
17444.0    1
16776.0    1
25250.0    1
15603.0    1
25711.0    1
22518.0    1
17055.0    1
24416.0    1
20610.0    1
23714.0    1
9565.0     1
25217.0    1
336.0      1
218.0      1
99.0       1
23203.0    1
15095.0    1
9199.0     1
15775.0    1
22333.0    1
17532.0    1
17755.0    1
14413.0    1
6384.0     1
16092.0    1
27990.0    1
19802.0    1
20436.0    1
22484.0    1
25076.0    1
28690.0    1
25807.0    1
806.0      1
1045.0     1
18607.0    1
21792.0    1
Name: smart_222_raw, dtype: int64
In [34]:
len(df.loc[df['smart_222_raw'].isnull()])
Out[34]:
10652389
In [33]:
df.loc[df['smart_222_raw'].notnull()]['manufacturer'].value_counts()
Out[33]:
Toshiba            322722
Western Digital         0
Seagate                 0
HGST                    0
Name: manufacturer, dtype: int64
In [35]:
df[['smart_222_raw', 'failure']].corr()
Out[35]:
smart_222_raw failure
smart_222_raw 1.000000 0.010691
failure 0.010691 1.000000
In [22]:
smart_222_mean = df.loc[df['smart_222_raw'].notnull()]['smart_222_raw'].mean()
smart_222_mean
Out[22]:
9188.18992197619
In [23]:
df.loc[(df['failure'] == 0) & (df['smart_222_raw'].notnull())]['smart_222_raw'].mean()
Out[23]:
9187.117322937132
In [24]:
df.loc[(df['failure'] == 1) & (df['smart_222_raw'].notnull())]['smart_222_raw'].mean()
Out[24]:
17840.9
In [25]:
df['smart_222_cat'] = 0
In [26]:
df.loc[(df['smart_222_raw'] < smart_222_mean), 'smart_222_cat'] = 1
df.loc[(df['smart_222_raw'] > smart_222_mean), 'smart_222_cat'] = 2
In [27]:
df['smart_222_cat'] = df['smart_222_cat'].astype('category')
df['smart_222_cat'].dtype
Out[27]:
CategoricalDtype(categories=[0, 1, 2], ordered=False)
In [28]:
df['smart_222_cat'].value_counts()
Out[28]:
0    10652389
2      168101
1      154621
Name: smart_222_cat, dtype: int64
In [29]:
df['smart_222_cat'].isnull().sum()
Out[29]:
0
In [30]:
df.drop(['smart_222_raw'], axis=1, inplace=True)
df.head()
Out[30]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_193_225 smart_191_cat smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 0.0 13.0 0.0 704304346.0 33261.0 ... 34188.0 1 0 0 0 0 0 0 0 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 0.0 3.0 0.0 422822971.0 10298.0 ... 2426.0 0 0 1 0 0 0 0 0 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 0.0 1.0 0.0 936518450.0 7329.0 ... 1881.0 0 0 1 0 0 0 0 0 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 0.0 6.0 0.0 416687782.0 10898.0 ... 970.0 0 0 1 0 0 0 0 0 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 436.0 9.0 0.0 0.0 23334.0 ... 356.0 0 0 0 1 2 2 0 0 0

5 rows × 52 columns

smart_226_raw

Although only available on the Toshiba drives, this is the highest negative correlation to failure rates yet. A categorical column smart_226_cat will be created with the following categories and values.

Value Representation
0 Value NaN
1 Below Average
2 Above Average

The original smart_226_raw column will then be dropped.

In [45]:
sns.distplot(df.loc[df['smart_226_raw'].notnull()]['smart_226_raw'])
Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cd99227848>
In [46]:
sns.distplot(df.loc[df['failure'] == 0]['smart_226_raw'])
sns.distplot(df.loc[df['failure'] == 1]['smart_226_raw'])
plt.grid(True)
plt.title("smart_226_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
Out[46]:
<matplotlib.legend.Legend at 0x1cdcb246648>
In [47]:
df['smart_226_raw'].value_counts()
Out[47]:
535.0    27872
536.0    24618
534.0    24428
533.0    21772
537.0    18981
         ...  
629.0        2
300.0        2
295.0        2
242.0        1
634.0        1
Name: smart_226_raw, Length: 232, dtype: int64
In [48]:
df.loc[df['failure'] == 1]['smart_226_raw'].value_counts()
Out[48]:
533.0    3
180.0    2
258.0    2
168.0    2
173.0    2
182.0    2
243.0    1
179.0    1
248.0    1
249.0    1
592.0    1
277.0    1
272.0    1
183.0    1
540.0    1
250.0    1
176.0    1
184.0    1
537.0    1
177.0    1
586.0    1
261.0    1
257.0    1
251.0    1
265.0    1
187.0    1
181.0    1
269.0    1
262.0    1
167.0    1
270.0    1
263.0    1
186.0    1
Name: smart_226_raw, dtype: int64
In [50]:
len(df.loc[df['smart_226_raw'].isnull()])
Out[50]:
10652389
In [49]:
df.loc[df['smart_226_raw'].notnull()]['manufacturer'].value_counts()
Out[49]:
Toshiba            322722
Western Digital         0
Seagate                 0
HGST                    0
Name: manufacturer, dtype: int64
In [51]:
df[['smart_226_raw', 'failure']].corr()
Out[51]:
smart_226_raw failure
smart_226_raw 1.000000 -0.014187
failure -0.014187 1.000000
In [31]:
smart_226_mean = df.loc[df['smart_226_raw'].notnull()]['smart_226_raw'].mean()
smart_226_mean
Out[31]:
458.5273021362039
In [32]:
df.loc[(df['failure'] == 0) & (df['smart_226_raw'].notnull())]['smart_226_raw'].mean()
Out[32]:
458.54995010567677
In [33]:
df.loc[(df['failure'] == 1) & (df['smart_226_raw'].notnull())]['smart_226_raw'].mean()
Out[33]:
275.825
In [34]:
df['smart_226_cat'] = 0
In [35]:
df.loc[(df['smart_226_raw'] < smart_226_mean), 'smart_226_cat'] = 1
df.loc[(df['smart_226_raw'] > smart_226_mean), 'smart_226_cat'] = 2
In [36]:
df['smart_226_cat'] = df['smart_226_cat'].astype('category')
df['smart_226_cat'].dtype
Out[36]:
CategoricalDtype(categories=[0, 1, 2], ordered=False)
In [37]:
df['smart_226_cat'].value_counts()
Out[37]:
0    10652389
2      234337
1       88385
Name: smart_226_cat, dtype: int64
In [38]:
df['smart_226_cat'].isnull().sum()
Out[38]:
0
In [39]:
df.drop(['smart_226_raw'], axis=1, inplace=True)
df.head()
Out[39]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_191_cat smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat smart_226_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 0.0 13.0 0.0 704304346.0 33261.0 ... 1 0 0 0 0 0 0 0 0 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 0.0 3.0 0.0 422822971.0 10298.0 ... 0 0 1 0 0 0 0 0 0 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 0.0 1.0 0.0 936518450.0 7329.0 ... 0 0 1 0 0 0 0 0 0 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 0.0 6.0 0.0 416687782.0 10898.0 ... 0 0 1 0 0 0 0 0 0 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 436.0 9.0 0.0 0.0 23334.0 ... 0 0 0 1 2 2 0 0 0 0

5 rows × 52 columns

smart_23_raw

This column is not only missing 98% of its values, it also has no variance whatsoever, making it useless for analysis.

In [62]:
sns.distplot(df.loc[df['smart_23_raw'].notnull()]['smart_23_raw'])
C:\Users\aedri\Anaconda3\envs\tf1\lib\site-packages\seaborn\distributions.py:288: UserWarning: Data must have variance to compute a kernel density estimate.
  warnings.warn(msg, UserWarning)
Out[62]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cd9a403e88>
In [63]:
df['smart_23_raw'].value_counts()
Out[63]:
0.0    232122
Name: smart_23_raw, dtype: int64
In [64]:
df.loc[df['smart_23_raw'].notnull()]['manufacturer'].value_counts()
Out[64]:
Toshiba            232122
Western Digital         0
Seagate                 0
HGST                    0
Name: manufacturer, dtype: int64
In [40]:
df.drop(['smart_23_raw'], axis=1, inplace=True)
df.head()
Out[40]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_191_cat smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat smart_226_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 0.0 13.0 0.0 704304346.0 33261.0 ... 1 0 0 0 0 0 0 0 0 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 0.0 3.0 0.0 422822971.0 10298.0 ... 0 0 1 0 0 0 0 0 0 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 0.0 1.0 0.0 936518450.0 7329.0 ... 0 0 1 0 0 0 0 0 0 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 0.0 6.0 0.0 416687782.0 10898.0 ... 0 0 1 0 0 0 0 0 0 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 436.0 9.0 0.0 0.0 23334.0 ... 0 0 0 1 2 2 0 0 0 0

5 rows × 51 columns

smart_24_raw

This column is not only missing 98% of its values, it also has no variance whatsoever, making it useless for analysis.

In [67]:
sns.distplot(df.loc[df['smart_24_raw'].notnull()]['smart_24_raw'])
Out[67]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ce2c004cc8>
In [68]:
df['smart_24_raw'].value_counts()
Out[68]:
0.0    232122
Name: smart_24_raw, dtype: int64
In [69]:
df.loc[df['smart_24_raw'].notnull()]['manufacturer'].value_counts()
Out[69]:
Toshiba            232122
Western Digital         0
Seagate                 0
HGST                    0
Name: manufacturer, dtype: int64
In [41]:
df.drop(['smart_24_raw'], axis=1, inplace=True)
df.head()
Out[41]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_191_cat smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat smart_226_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 0.0 13.0 0.0 704304346.0 33261.0 ... 1 0 0 0 0 0 0 0 0 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 0.0 3.0 0.0 422822971.0 10298.0 ... 0 0 1 0 0 0 0 0 0 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 0.0 1.0 0.0 936518450.0 7329.0 ... 0 0 1 0 0 0 0 0 0 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 0.0 6.0 0.0 416687782.0 10898.0 ... 0 0 1 0 0 0 0 0 0 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 436.0 9.0 0.0 0.0 23334.0 ... 0 0 0 1 2 2 0 0 0 0

5 rows × 50 columns

smart_11_raw

Although only available on 0.64% of drives, this is the highest correlation to failure rates yet. A categorical column smart_11_cat will be created with the following categories and values.

Value Representation
0 Value NaN
1 Below Average
2 Above Average

The original smart_11_raw column will then be dropped.

In [72]:
sns.distplot(df.loc[df['smart_11_raw'].notnull()]['smart_11_raw'])
Out[72]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ce2ae293c8>
In [73]:
sns.distplot(df.loc[df['failure'] == 0]['smart_11_raw'])
sns.distplot(df.loc[df['failure'] == 1]['smart_11_raw'])
plt.grid(True)
plt.title("smart_11_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
Out[73]:
<matplotlib.legend.Legend at 0x1ce86d97f48>
In [74]:
df['smart_11_raw'].value_counts()
Out[74]:
0.0       25301
164.0       208
196.0       202
484.0       191
654.0       191
          ...  
894.0         1
2786.0        1
2787.0        1
2790.0        1
2763.0        1
Name: smart_11_raw, Length: 3558, dtype: int64
In [75]:
df.loc[df['failure'] == 1]['smart_11_raw'].value_counts()
Out[75]:
0.0        6
911.0      1
331.0      1
4590.0     1
266.0      1
1160.0     1
1215.0     1
5115.0     1
2609.0     1
2102.0     1
841.0      1
13377.0    1
221.0      1
184.0      1
Name: smart_11_raw, dtype: int64
In [76]:
df.loc[df['smart_11_raw'].notnull()]['manufacturer'].value_counts()
Out[76]:
Seagate            45193
Western Digital    25301
Toshiba                0
HGST                   0
Name: manufacturer, dtype: int64
In [77]:
len(df.loc[df['smart_11_raw'].isnull()])
Out[77]:
10904617
In [78]:
df[['smart_11_raw', 'failure']].corr()
Out[78]:
smart_11_raw failure
smart_11_raw 1.000000 0.017156
failure 0.017156 1.000000
In [42]:
smart_11_mean = df.loc[df['smart_11_raw'].notnull()]['smart_11_raw'].mean()
smart_11_mean
Out[42]:
676.7458223394899
In [43]:
df.loc[(df['failure'] == 0) & (df['smart_11_raw'].notnull())]['smart_11_raw'].mean()
Out[43]:
676.4611280595956
In [44]:
df.loc[(df['failure'] == 1) & (df['smart_11_raw'].notnull())]['smart_11_raw'].mean()
Out[44]:
1732.7368421052631
In [45]:
df['smart_11_cat'] = 0
In [46]:
df.loc[(df['smart_11_raw'] < smart_11_mean), 'smart_11_cat'] = 1
df.loc[(df['smart_11_raw'] > smart_11_mean), 'smart_11_cat'] = 2
In [47]:
df['smart_11_cat'] = df['smart_11_cat'].astype('category')
df['smart_11_cat'].dtype
Out[47]:
CategoricalDtype(categories=[0, 1, 2], ordered=False)
In [48]:
df['smart_11_cat'].value_counts()
Out[48]:
0    10904617
1       45469
2       25025
Name: smart_11_cat, dtype: int64
In [49]:
df['smart_11_cat'].isnull().sum()
Out[49]:
0
In [50]:
df.drop(['smart_11_raw'], axis=1, inplace=True)
df.head()
Out[50]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat smart_226_cat smart_11_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 0.0 13.0 0.0 704304346.0 33261.0 ... 0 0 0 0 0 0 0 0 0 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 0.0 3.0 0.0 422822971.0 10298.0 ... 0 1 0 0 0 0 0 0 0 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 0.0 1.0 0.0 936518450.0 7329.0 ... 0 1 0 0 0 0 0 0 0 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 0.0 6.0 0.0 416687782.0 10898.0 ... 0 1 0 0 0 0 0 0 0 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 436.0 9.0 0.0 0.0 23334.0 ... 0 0 1 2 2 0 0 0 0 0

5 rows × 50 columns

smart_254_raw

This column is not only missing 99.75% of its values, it also has no variance whatsoever, making it useless for analysis.

In [51]:
sns.distplot(df.loc[df['smart_254_raw'].notnull()]['smart_254_raw'])
C:\Users\aedri\Anaconda3\envs\tf1\lib\site-packages\seaborn\distributions.py:283: UserWarning: Data must have variance to compute a kernel density estimate.
  warnings.warn(msg, UserWarning)
Out[51]:
<matplotlib.axes._subplots.AxesSubplot at 0x20238c92f48>
In [92]:
df['smart_254_raw'].value_counts()
Out[92]:
0.0    26973
Name: smart_254_raw, dtype: int64
In [93]:
df.loc[df['smart_254_raw'].notnull()]['manufacturer'].value_counts()
Out[93]:
Seagate            26061
Western Digital      912
Toshiba                0
HGST                   0
Name: manufacturer, dtype: int64
In [52]:
df.drop(['smart_254_raw'], axis=1, inplace=True)
df.head()
Out[52]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat smart_226_cat smart_11_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 0.0 13.0 0.0 704304346.0 33261.0 ... 0 0 0 0 0 0 0 0 0 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 0.0 3.0 0.0 422822971.0 10298.0 ... 0 1 0 0 0 0 0 0 0 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 0.0 1.0 0.0 936518450.0 7329.0 ... 0 1 0 0 0 0 0 0 0 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 0.0 6.0 0.0 416687782.0 10898.0 ... 0 1 0 0 0 0 0 0 0 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 436.0 9.0 0.0 0.0 23334.0 ... 0 0 1 2 2 0 0 0 0 0

5 rows × 49 columns

smart_235_raw

This column represents an interesting report, in that the first 3 bytes of it is the drive's good block count, while the last 2 bytes is the drive's bad block count, but this column is missing 99.92% of its values, making it useless for this type of predictive analysis.

In [96]:
sns.distplot(df.loc[df['smart_235_raw'].notnull()]['smart_235_raw'])
Out[96]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ce86f55648>
In [97]:
df['smart_235_raw'].value_counts()
Out[97]:
1.722229e+09    2
2.630864e+09    2
1.656523e+09    2
1.666588e+09    2
2.332312e+10    2
               ..
1.247958e+09    1
1.249292e+09    1
1.250461e+09    1
1.251153e+09    1
8.591649e+09    1
Name: smart_235_raw, Length: 8744, dtype: int64
In [98]:
df.loc[df['smart_235_raw'].notnull()]['manufacturer'].value_counts()
Out[98]:
Seagate            8792
Western Digital       0
Toshiba               0
HGST                  0
Name: manufacturer, dtype: int64
In [53]:
df.drop(['smart_235_raw'], axis=1, inplace=True)
df.head()
Out[53]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat smart_226_cat smart_11_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 0.0 13.0 0.0 704304346.0 33261.0 ... 0 0 0 0 0 0 0 0 0 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 0.0 3.0 0.0 422822971.0 10298.0 ... 0 1 0 0 0 0 0 0 0 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 0.0 1.0 0.0 936518450.0 7329.0 ... 0 1 0 0 0 0 0 0 0 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 0.0 6.0 0.0 416687782.0 10898.0 ... 0 1 0 0 0 0 0 0 0 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 436.0 9.0 0.0 0.0 23334.0 ... 0 0 1 2 2 0 0 0 0 0

5 rows × 48 columns

smart_233_raw

No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.

In [101]:
sns.distplot(df.loc[df['smart_233_raw'].notnull()]['smart_233_raw'])
Out[101]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ce86f91f88>
In [104]:
df.loc[(df['smart_233_raw'].notnull()) & (df['failure'] == 1)]
Out[104]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat smart_226_cat smart_11_cat

0 rows × 48 columns

In [102]:
df.loc[df['smart_233_raw'].notnull()]['manufacturer'].value_counts()
Out[102]:
Seagate            8792
Western Digital       0
Toshiba               0
HGST                  0
Name: manufacturer, dtype: int64
In [54]:
df.drop(['smart_233_raw'], axis=1, inplace=True)
df.head()
Out[54]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat smart_226_cat smart_11_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 0.0 13.0 0.0 704304346.0 33261.0 ... 0 0 0 0 0 0 0 0 0 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 0.0 3.0 0.0 422822971.0 10298.0 ... 0 1 0 0 0 0 0 0 0 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 0.0 1.0 0.0 936518450.0 7329.0 ... 0 1 0 0 0 0 0 0 0 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 0.0 6.0 0.0 416687782.0 10898.0 ... 0 1 0 0 0 0 0 0 0 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 436.0 9.0 0.0 0.0 23334.0 ... 0 0 1 2 2 0 0 0 0 0

5 rows × 47 columns

smart_232_raw

No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.

In [106]:
sns.distplot(df.loc[df['smart_232_raw'].notnull()]['smart_232_raw'])
Out[106]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ce86f85b88>
In [107]:
df.loc[(df['smart_232_raw'].notnull()) & (df['failure'] == 1)]
Out[107]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat smart_226_cat smart_11_cat

0 rows × 47 columns

In [108]:
df.loc[df['smart_232_raw'].notnull()]['manufacturer'].value_counts()
Out[108]:
Seagate            8792
Western Digital       0
Toshiba               0
HGST                  0
Name: manufacturer, dtype: int64
In [55]:
df.drop(['smart_232_raw'], axis=1, inplace=True)
df.head()
Out[55]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat smart_226_cat smart_11_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 0.0 13.0 0.0 704304346.0 33261.0 ... 0 0 0 0 0 0 0 0 0 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 0.0 3.0 0.0 422822971.0 10298.0 ... 0 1 0 0 0 0 0 0 0 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 0.0 1.0 0.0 936518450.0 7329.0 ... 0 1 0 0 0 0 0 0 0 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 0.0 6.0 0.0 416687782.0 10898.0 ... 0 1 0 0 0 0 0 0 0 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 436.0 9.0 0.0 0.0 23334.0 ... 0 0 1 2 2 0 0 0 0 0

5 rows × 46 columns

Memory Management and Reloading Checkpoint

In [58]:
if not os.path.isfile('pre_168_df.csv'):
    df.to_csv('pre_168_df.csv', index=False)
In [4]:
df = pd.read_csv('pre_168_df.csv')
n_rows = len(df)
df['date'] = df['date'].astype('category')
df['model'] = df['model'].astype('category')
df['failure'] = df['failure'].astype('bool')
df['manufacturer'] = df['manufacturer'].astype('category')
df['smart_191_cat'] = df['smart_191_cat'].astype('category')
df['smart_184_cat'] = df['smart_184_cat'].astype('category')
df['smart_200_cat'] = df['smart_200_cat'].astype('category')
df['smart_196_cat'] = df['smart_196_cat'].astype('category')
df['smart_8_cat'] = df['smart_8_cat'].astype('category')
df['smart_2_cat'] = df['smart_2_cat'].astype('category')
df['smart_223_cat'] = df['smart_223_cat'].astype('category')
df['smart_220_cat'] = df['smart_220_cat'].astype('category')
df['smart_222_cat'] = df['smart_222_cat'].astype('category')
df['smart_226_cat'] = df['smart_226_cat'].astype('category')
df['smart_11_cat'] = df['smart_11_cat'].astype('category')

smart_168_raw

No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.

In [110]:
sns.distplot(df.loc[df['smart_168_raw'].notnull()]['smart_168_raw'])
Out[110]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ce8710c808>
In [111]:
df.loc[(df['smart_168_raw'].notnull()) & (df['failure'] == 1)]
Out[111]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat smart_226_cat smart_11_cat

0 rows × 46 columns

In [112]:
df.loc[df['smart_168_raw'].notnull()]['manufacturer'].value_counts()
Out[112]:
Seagate            8792
Western Digital       0
Toshiba               0
HGST                  0
Name: manufacturer, dtype: int64
In [5]:
df.drop(['smart_168_raw'], axis=1, inplace=True)
df.head()
Out[5]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat smart_226_cat smart_11_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 0.0 13.0 0.0 704304346.0 33261.0 ... 0 0 0 0 0 0 0 0 0 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 0.0 3.0 0.0 422822971.0 10298.0 ... 0 1 0 0 0 0 0 0 0 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 0.0 1.0 0.0 936518450.0 7329.0 ... 0 1 0 0 0 0 0 0 0 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 0.0 6.0 0.0 416687782.0 10898.0 ... 0 1 0 0 0 0 0 0 0 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 436.0 9.0 0.0 0.0 23334.0 ... 0 0 1 2 2 0 0 0 0 0

5 rows × 45 columns

smart_170_raw

No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.

In [114]:
sns.distplot(df.loc[df['smart_170_raw'].notnull()]['smart_170_raw'])
Out[114]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ce8729a608>
In [115]:
df.loc[(df['smart_170_raw'].notnull()) & (df['failure'] == 1)]
Out[115]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat smart_226_cat smart_11_cat

0 rows × 45 columns

In [116]:
df.loc[df['smart_170_raw'].notnull()]['manufacturer'].value_counts()
Out[116]:
Seagate            8792
Western Digital       0
Toshiba               0
HGST                  0
Name: manufacturer, dtype: int64
In [6]:
df.drop(['smart_170_raw'], axis=1, inplace=True)
df.head()
Out[6]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat smart_226_cat smart_11_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 0.0 13.0 0.0 704304346.0 33261.0 ... 0 0 0 0 0 0 0 0 0 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 0.0 3.0 0.0 422822971.0 10298.0 ... 0 1 0 0 0 0 0 0 0 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 0.0 1.0 0.0 936518450.0 7329.0 ... 0 1 0 0 0 0 0 0 0 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 0.0 6.0 0.0 416687782.0 10898.0 ... 0 1 0 0 0 0 0 0 0 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 436.0 9.0 0.0 0.0 23334.0 ... 0 0 1 2 2 0 0 0 0 0

5 rows × 44 columns

smart_218_raw

No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.

In [118]:
sns.distplot(df.loc[df['smart_218_raw'].notnull()]['smart_218_raw'])
Out[118]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d05c6ffcc8>
In [119]:
df.loc[(df['smart_218_raw'].notnull()) & (df['failure'] == 1)]
Out[119]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat smart_226_cat smart_11_cat

0 rows × 44 columns

In [120]:
df.loc[df['smart_218_raw'].notnull()]['manufacturer'].value_counts()
Out[120]:
Seagate            8792
Western Digital       0
Toshiba               0
HGST                  0
Name: manufacturer, dtype: int64
In [7]:
df.drop(['smart_218_raw'], axis=1, inplace=True)
df.head()
Out[7]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat smart_226_cat smart_11_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 0.0 13.0 0.0 704304346.0 33261.0 ... 0 0 0 0 0 0 0 0 0 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 0.0 3.0 0.0 422822971.0 10298.0 ... 0 1 0 0 0 0 0 0 0 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 0.0 1.0 0.0 936518450.0 7329.0 ... 0 1 0 0 0 0 0 0 0 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 0.0 6.0 0.0 416687782.0 10898.0 ... 0 1 0 0 0 0 0 0 0 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 436.0 9.0 0.0 0.0 23334.0 ... 0 0 1 2 2 0 0 0 0 0

5 rows × 43 columns

smart_174_raw

No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.

In [122]:
sns.distplot(df.loc[df['smart_174_raw'].notnull()]['smart_174_raw'])
Out[122]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d05c735e88>
In [123]:
df.loc[(df['smart_174_raw'].notnull()) & (df['failure'] == 1)]
Out[123]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat smart_226_cat smart_11_cat

0 rows × 43 columns

In [124]:
df.loc[df['smart_174_raw'].notnull()]['manufacturer'].value_counts()
Out[124]:
Seagate            8792
Western Digital       0
Toshiba               0
HGST                  0
Name: manufacturer, dtype: int64
In [8]:
df.drop(['smart_174_raw'], axis=1, inplace=True)
df.head()
Out[8]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat smart_226_cat smart_11_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 0.0 13.0 0.0 704304346.0 33261.0 ... 0 0 0 0 0 0 0 0 0 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 0.0 3.0 0.0 422822971.0 10298.0 ... 0 1 0 0 0 0 0 0 0 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 0.0 1.0 0.0 936518450.0 7329.0 ... 0 1 0 0 0 0 0 0 0 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 0.0 6.0 0.0 416687782.0 10898.0 ... 0 1 0 0 0 0 0 0 0 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 436.0 9.0 0.0 0.0 23334.0 ... 0 0 1 2 2 0 0 0 0 0

5 rows × 42 columns

smart_16_raw

No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.

In [126]:
sns.distplot(df.loc[df['smart_16_raw'].notnull()]['smart_16_raw'])
Out[126]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d05c7b3408>
In [127]:
df.loc[(df['smart_16_raw'].notnull()) & (df['failure'] == 1)]
Out[127]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat smart_226_cat smart_11_cat

0 rows × 42 columns

In [128]:
df.loc[df['smart_16_raw'].notnull()]['manufacturer'].value_counts()
Out[128]:
Seagate            8792
Western Digital       0
Toshiba               0
HGST                  0
Name: manufacturer, dtype: int64
In [9]:
df.drop(['smart_16_raw'], axis=1, inplace=True)
df.head()
Out[9]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat smart_226_cat smart_11_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 0.0 13.0 0.0 704304346.0 33261.0 ... 0 0 0 0 0 0 0 0 0 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 0.0 3.0 0.0 422822971.0 10298.0 ... 0 1 0 0 0 0 0 0 0 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 0.0 1.0 0.0 936518450.0 7329.0 ... 0 1 0 0 0 0 0 0 0 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 0.0 6.0 0.0 416687782.0 10898.0 ... 0 1 0 0 0 0 0 0 0 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 436.0 9.0 0.0 0.0 23334.0 ... 0 0 1 2 2 0 0 0 0 0

5 rows × 41 columns

smart_17_raw

No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.

In [130]:
sns.distplot(df.loc[df['smart_17_raw'].notnull()]['smart_17_raw'])
Out[130]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d05c8a8dc8>
In [131]:
df.loc[(df['smart_17_raw'].notnull()) & (df['failure'] == 1)]
Out[131]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat smart_226_cat smart_11_cat

0 rows × 41 columns

In [132]:
df.loc[df['smart_17_raw'].notnull()]['manufacturer'].value_counts()
Out[132]:
Seagate            8792
Western Digital       0
Toshiba               0
HGST                  0
Name: manufacturer, dtype: int64
In [10]:
df.drop(['smart_17_raw'], axis=1, inplace=True)
df.head()
Out[10]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat smart_226_cat smart_11_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 0.0 13.0 0.0 704304346.0 33261.0 ... 0 0 0 0 0 0 0 0 0 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 0.0 3.0 0.0 422822971.0 10298.0 ... 0 1 0 0 0 0 0 0 0 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 0.0 1.0 0.0 936518450.0 7329.0 ... 0 1 0 0 0 0 0 0 0 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 0.0 6.0 0.0 416687782.0 10898.0 ... 0 1 0 0 0 0 0 0 0 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 436.0 9.0 0.0 0.0 23334.0 ... 0 0 1 2 2 0 0 0 0 0

5 rows × 40 columns

In [11]:
#### Memory Management and Reloading Checkpoint
if not os.path.isfile('pre_173_df.csv'):
    df.to_csv('pre_173_df.csv', index=False)
In [4]:
df = pd.read_csv('pre_173_df.csv')
n_rows = len(df)
df['date'] = df['date'].astype('category')
df['model'] = df['model'].astype('category')
df['failure'] = df['failure'].astype('bool')
df['manufacturer'] = df['manufacturer'].astype('category')
df['smart_191_cat'] = df['smart_191_cat'].astype('category')
df['smart_184_cat'] = df['smart_184_cat'].astype('category')
df['smart_200_cat'] = df['smart_200_cat'].astype('category')
df['smart_196_cat'] = df['smart_196_cat'].astype('category')
df['smart_8_cat'] = df['smart_8_cat'].astype('category')
df['smart_2_cat'] = df['smart_2_cat'].astype('category')
df['smart_223_cat'] = df['smart_223_cat'].astype('category')
df['smart_220_cat'] = df['smart_220_cat'].astype('category')
df['smart_222_cat'] = df['smart_222_cat'].astype('category')
df['smart_226_cat'] = df['smart_226_cat'].astype('category')
df['smart_11_cat'] = df['smart_11_cat'].astype('category')

smart_173_raw

No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.

In [5]:
sns.distplot(df.loc[df['smart_173_raw'].notnull()]['smart_173_raw'])
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x1b0ccd78e48>
In [6]:
df.loc[(df['smart_173_raw'].notnull()) & (df['failure'] == 1)]
Out[6]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat smart_226_cat smart_11_cat

0 rows × 40 columns

In [7]:
df.loc[df['smart_173_raw'].notnull()]['manufacturer'].value_counts()
Out[7]:
Seagate            8792
Western Digital       0
Toshiba               0
HGST                  0
Name: manufacturer, dtype: int64
In [12]:
df.drop(['smart_173_raw'], axis=1, inplace=True)
df.head()
Out[12]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat smart_226_cat smart_11_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 0.0 13.0 0.0 704304346.0 33261.0 ... 0 0 0 0 0 0 0 0 0 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 0.0 3.0 0.0 422822971.0 10298.0 ... 0 1 0 0 0 0 0 0 0 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 0.0 1.0 0.0 936518450.0 7329.0 ... 0 1 0 0 0 0 0 0 0 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 0.0 6.0 0.0 416687782.0 10898.0 ... 0 1 0 0 0 0 0 0 0 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 436.0 9.0 0.0 0.0 23334.0 ... 0 0 1 2 2 0 0 0 0 0

5 rows × 39 columns

smart_231_raw

No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.

In [9]:
sns.distplot(df.loc[df['smart_231_raw'].notnull()]['smart_231_raw'])
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x1b0feb75588>
In [10]:
df.loc[(df['smart_231_raw'].notnull()) & (df['failure'] == 1)]
Out[10]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat smart_226_cat smart_11_cat

0 rows × 39 columns

In [11]:
df.loc[df['smart_231_raw'].notnull()]['manufacturer'].value_counts()
Out[11]:
Seagate            8792
Western Digital       0
Toshiba               0
HGST                  0
Name: manufacturer, dtype: int64
In [13]:
df.drop(['smart_231_raw'], axis=1, inplace=True)
df.head()
Out[13]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat smart_226_cat smart_11_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 0.0 13.0 0.0 704304346.0 33261.0 ... 0 0 0 0 0 0 0 0 0 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 0.0 3.0 0.0 422822971.0 10298.0 ... 0 1 0 0 0 0 0 0 0 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 0.0 1.0 0.0 936518450.0 7329.0 ... 0 1 0 0 0 0 0 0 0 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 0.0 6.0 0.0 416687782.0 10898.0 ... 0 1 0 0 0 0 0 0 0 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 436.0 9.0 0.0 0.0 23334.0 ... 0 0 1 2 2 0 0 0 0 0

5 rows × 38 columns

smart_177_raw

In [13]:
sns.distplot(df.loc[df['smart_177_raw'].notnull()]['smart_177_raw'])
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x1b11152b188>
In [14]:
df.loc[(df['smart_177_raw'].notnull()) & (df['failure'] == 1)]
Out[14]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat smart_226_cat smart_11_cat

0 rows × 38 columns

In [15]:
df.loc[df['smart_177_raw'].notnull()]['manufacturer'].value_counts()
Out[15]:
Seagate            8792
Western Digital       0
Toshiba               0
HGST                  0
Name: manufacturer, dtype: int64
In [14]:
df.drop(['smart_177_raw'], axis=1, inplace=True)
df.head()
Out[14]:
date serial_number model failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw ... smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat smart_226_cat smart_11_cat
0 10-01 Z305B2QN ST4000DM000 False 97236416.0 0.0 13.0 0.0 704304346.0 33261.0 ... 0 0 0 0 0 0 0 0 0 0
1 10-01 ZJV0XJQ4 ST12000NM0007 False 4665536.0 0.0 3.0 0.0 422822971.0 10298.0 ... 0 1 0 0 0 0 0 0 0 0
2 10-01 ZJV0XJQ3 ST12000NM0007 False 92892872.0 0.0 1.0 0.0 936518450.0 7329.0 ... 0 1 0 0 0 0 0 0 0 0
3 10-01 ZJV0XJQ0 ST12000NM0007 False 231702544.0 0.0 6.0 0.0 416687782.0 10898.0 ... 0 1 0 0 0 0 0 0 0 0
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 False 0.0 436.0 9.0 0.0 0.0 23334.0 ... 0 0 1 2 2 0 0 0 0 0

5 rows × 37 columns

In [15]:
if not os.path.isfile('explorative_df.csv'):
    df.to_csv('explorative_df.csv', index = False)

Analysis of Potential Predictors

Memory Management and Reloading Checkpoint

In [10]:
df = pd.read_csv('explorative_df.csv')
n_rows = len(df)
df['date'] = df['date'].astype('category')
df['model'] = df['model'].astype('category')
df['failure'] = df['failure'].astype('bool')
df['manufacturer'] = df['manufacturer'].astype('category')
df['smart_191_cat'] = df['smart_191_cat'].astype('category')
df['smart_184_cat'] = df['smart_184_cat'].astype('category')
df['smart_200_cat'] = df['smart_200_cat'].astype('category')
df['smart_196_cat'] = df['smart_196_cat'].astype('category')
df['smart_8_cat'] = df['smart_8_cat'].astype('category')
df['smart_2_cat'] = df['smart_2_cat'].astype('category')
df['smart_223_cat'] = df['smart_223_cat'].astype('category')
df['smart_220_cat'] = df['smart_220_cat'].astype('category')
df['smart_222_cat'] = df['smart_222_cat'].astype('category')
df['smart_226_cat'] = df['smart_226_cat'].astype('category')
df['smart_11_cat'] = df['smart_11_cat'].astype('category')
In [16]:
fig, axes = plt.subplots(6, 6, figsize = (30, 25))

row = 0
col = 0
for df_col in ['date', 'model', 'failure', 'smart_1_raw',
       'smart_3_raw', 'smart_4_raw', 'smart_5_raw', 'smart_7_raw',
       'smart_9_raw', 'smart_10_raw', 'smart_12_raw', 'smart_187_raw',
       'smart_188_raw', 'smart_190_raw', 'smart_192_raw', 'smart_194_raw',
       'smart_197_raw', 'smart_198_raw', 'smart_199_raw', 'smart_240_raw',
       'smart_241_raw', 'smart_242_raw', 'manufacturer', 'capacity_TB',
       'smart_193_225', 'smart_191_cat', 'smart_184_cat', 'smart_200_cat',
       'smart_196_cat', 'smart_8_cat', 'smart_2_cat', 'smart_223_cat',
       'smart_220_cat', 'smart_222_cat', 'smart_226_cat', 'smart_11_cat']:
    
    if col == 6:
        row += 1
        col = 0
        
    # Histograms
    if df[df_col].dtype.name == 'float64':
        if df_col in ['smart_1_raw', 'smart_3_raw', 'smart_4_raw',
                      'smart_5_raw', 'smart_7_raw', 'smart_10_raw',
                      'smart_12_raw', 'smart_187_raw', 'smart_188_raw',
                      'smart_192_raw', 'smart_197_raw', 'smart_198_raw',
                      'smart_199_raw', 'smart_242_raw', 'smart_193_225']:
            ax = sns.distplot(df[df_col], ax = axes[row, col], kde = False)
            ax.set_yscale('log')
            
        else:
            ax = sns.distplot(df[df_col], ax = axes[row, col], kde = False)
        
    # Countplots
    elif df[df_col].dtype.name == 'category' or \
                df[df_col].dtype.name == 'bool':
        if df_col == "date":
            ax = sns.countplot(df[df_col], ax = axes[row, col])
            ax.set(xticklabels = [])
            
        elif df_col == "model":
            ax = sns.countplot(df[df_col], ax = axes[row, col])
            ax.set(xticklabels = [])
            ax.set_yscale('log')
            
        elif df_col in ['smart_184_cat', 'smart_11_cat']:
            ax = sns.countplot(df[df_col], ax = axes[row, col])
            ax.set_yscale('log')
            
        else:
            sns.countplot(df[df_col], ax = axes[row, col])
    
    else:
        print("Unknown column dtype")
    
    col += 1
    

plt.subplots_adjust(top = 0.90)
fig.suptitle("Distribution of Dataframe Columns", fontsize = 54, y = 0.95)
fig.savefig("Charts/Dataframe Distributions.svg")
fig.savefig("Charts/Dataframe Distributions.png")
In [17]:
fig, axes = plt.subplots(6, 6, figsize = (30, 25))

row = 0
col = 0
for df_col in ['date', 'model', 'failure', 'smart_1_raw',
       'smart_3_raw', 'smart_4_raw', 'smart_5_raw', 'smart_7_raw',
       'smart_9_raw', 'smart_10_raw', 'smart_12_raw', 'smart_187_raw',
       'smart_188_raw', 'smart_190_raw', 'smart_192_raw', 'smart_194_raw',
       'smart_197_raw', 'smart_198_raw', 'smart_199_raw', 'smart_240_raw',
       'smart_241_raw', 'smart_242_raw', 'manufacturer', 'capacity_TB',
       'smart_193_225', 'smart_191_cat', 'smart_184_cat', 'smart_200_cat',
       'smart_196_cat', 'smart_8_cat', 'smart_2_cat', 'smart_223_cat',
       'smart_220_cat', 'smart_222_cat', 'smart_226_cat', 'smart_11_cat']:
    
    if col == 6:
        row += 1
        col = 0
        
    # Histograms
    if df[df_col].dtype.name == 'float64':
        if df_col in ['smart_1_raw', 'smart_3_raw', 'smart_4_raw',
                      'smart_5_raw', 'smart_7_raw', 'smart_10_raw',
                      'smart_12_raw', 'smart_187_raw', 'smart_188_raw',
                      'smart_192_raw', 'smart_197_raw', 'smart_198_raw',
                      'smart_199_raw', 'smart_242_raw', 'smart_193_225']:
            ax = sns.distplot(df.loc[df['failure'] == 1][df_col], ax = axes[row, col], kde = False)
            ax.set_yscale('log')
            
        else:
            ax = sns.distplot(df.loc[df['failure'] == 1][df_col], ax = axes[row, col], kde = False)
        
    # Countplots
    elif df[df_col].dtype.name == 'category' or df[df_col].dtype.name == 'bool':
        if df_col == "date":
            ax = sns.countplot(df.loc[df['failure'] == 1][df_col], ax = axes[row, col])
            ax.set(xticklabels = [])
            
        elif df_col == "model":
            ax = sns.countplot(df.loc[df['failure'] == 1][df_col], ax = axes[row, col])
            ax.set(xticklabels = [])
            ax.set_yscale('log')
            
        elif df_col in ['smart_184_cat', 'smart_11_cat']:
            ax = sns.countplot(df.loc[df['failure'] == 1][df_col], ax = axes[row, col])
            
        else:
            sns.countplot(df.loc[df['failure'] == 1][df_col], ax = axes[row, col])
    
    else:
        print("Unknown column dtype")
    
    col += 1
    

plt.subplots_adjust(top = 0.90)
fig.suptitle("Distribution of Dataframe Failure", fontsize = 54, y = 0.95)
fig.savefig("Charts/Dataframe Failure Distributions.svg")
fig.savefig("Charts/Dataframe Failure Distributions.png")

With all NaN values interpolated or their columns removed, correlations can be determined between the columns.

In [4]:
corr_df = df.corr(method = 'pearson')
corr_df
Out[4]:
failure smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw smart_10_raw smart_12_raw smart_187_raw ... smart_192_raw smart_194_raw smart_197_raw smart_198_raw smart_199_raw smart_240_raw smart_241_raw smart_242_raw capacity_TB smart_193_225
failure 1.000000 0.002200 -1.664119e-04 0.001083 0.044413 0.000082 -0.000190 -1.296866e-04 0.001259 0.006324 ... -0.000087 0.000137 0.027410 0.020870 0.000506 -1.360845e-03 1.255774e-03 2.058487e-04 0.000602 0.000636
smart_1_raw 0.002200 1.000000 -2.311701e-01 -0.008473 0.014507 0.008715 0.082018 -1.788949e-02 0.000917 0.001963 ... -0.154805 -0.023544 0.002662 0.004631 -0.002304 7.928208e-02 4.301937e-02 1.381072e-02 0.110230 0.057691
smart_3_raw -0.000166 -0.231170 1.000000e+00 0.007011 -0.007325 -0.004563 -0.140497 4.258443e-03 -0.020797 -0.001022 ... -0.032340 -0.024982 -0.001506 -0.002443 -0.000193 -3.036262e-01 -1.130124e-02 -3.669415e-03 0.109833 0.002603
smart_4_raw 0.001083 -0.008473 7.010971e-03 1.000000 -0.000145 0.000296 0.024426 1.154425e-03 0.110635 0.000031 ... -0.005423 0.010887 -0.000235 -0.000232 0.000167 4.539439e-03 1.314098e-02 6.288057e-03 -0.036918 0.088046
smart_5_raw 0.044413 0.014507 -7.325152e-03 -0.000145 1.000000 -0.000127 -0.006007 -5.818652e-04 0.003318 0.022669 ... -0.000367 -0.004287 0.005990 0.006118 -0.000259 -7.569510e-03 2.646947e-02 4.199105e-03 0.024436 -0.006208
smart_7_raw 0.000082 0.008715 -4.562669e-03 0.000296 -0.000127 1.000000 0.018988 -3.530821e-04 0.004242 0.000151 ... -0.003937 -0.007435 -0.000100 -0.000063 -0.000170 2.125667e-02 1.115787e-02 4.814785e-03 -0.006759 0.022182
smart_9_raw -0.000190 0.082018 -1.404970e-01 0.024426 -0.006007 0.018988 1.000000 1.448785e-02 0.151401 -0.001826 ... -0.048660 -0.255554 -0.000854 -0.000627 0.003222 8.263827e-01 4.055059e-01 2.403070e-01 -0.821823 0.298264
smart_10_raw -0.000130 -0.017889 4.258443e-03 0.001154 -0.000582 -0.000353 0.014488 1.000000e+00 0.016557 -0.000079 ... 0.003794 -0.001671 -0.000215 -0.000189 -0.000205 2.373800e-15 -5.687530e-16 -1.244273e-14 -0.020359 -0.004730
smart_12_raw 0.001259 0.000917 -2.079667e-02 0.110635 0.003318 0.004242 0.151401 1.655722e-02 1.000000 0.001363 ... -0.016338 -0.008522 -0.000003 0.000001 0.004982 1.093496e-01 5.200587e-02 5.729262e-02 -0.125083 0.075448
smart_187_raw 0.006324 0.001963 -1.021893e-03 0.000031 0.022669 0.000151 -0.001826 -7.907920e-05 0.001363 1.000000 ... -0.000212 -0.002019 0.000312 0.000328 -0.000060 -2.124657e-03 2.980454e-03 1.679313e-05 0.003999 -0.001180
smart_188_raw 0.004512 0.015110 -7.675596e-03 0.002522 0.023388 0.000958 -0.014084 -5.939758e-04 0.014031 0.004907 ... -0.004477 0.009208 -0.000030 0.000036 0.119021 -1.636407e-02 5.057671e-03 -6.395375e-03 -0.010661 -0.004182
smart_190_raw -0.000015 -0.001256 1.052630e-14 0.004690 -0.004044 -0.007757 -0.239176 -1.730456e-16 -0.027309 -0.002127 ... 0.002582 0.902827 -0.000895 -0.000915 -0.000305 -2.727695e-01 -1.667955e-03 -5.554090e-02 0.193054 -0.091489
smart_192_raw -0.000087 -0.154805 -3.234025e-02 -0.005423 -0.000367 -0.003937 -0.048660 3.793984e-03 -0.016338 -0.000212 ... 1.000000 0.019913 0.001547 -0.001284 0.000290 -3.025350e-02 2.727831e-02 -3.325593e-03 0.020102 -0.042788
smart_194_raw 0.000137 -0.023544 -2.498233e-02 0.010887 -0.004287 -0.007435 -0.255554 -1.671127e-03 -0.008522 -0.002019 ... 0.019913 1.000000 -0.000758 -0.001036 -0.001245 -2.456049e-01 -3.040466e-02 -5.864715e-02 0.169691 -0.083569
smart_197_raw 0.027410 0.002662 -1.506073e-03 -0.000235 0.005990 -0.000100 -0.000854 -2.150969e-04 -0.000003 0.000312 ... 0.001547 -0.000758 1.000000 0.978249 0.000040 -2.082457e-03 -2.641448e-03 -9.483053e-04 -0.000096 -0.000288
smart_198_raw 0.020870 0.004631 -2.443247e-03 -0.000232 0.006118 -0.000063 -0.000627 -1.890706e-04 0.000001 0.000328 ... -0.001284 -0.001036 0.978249 1.000000 0.000066 -8.091369e-04 -2.572496e-03 -9.330611e-04 0.001548 0.000012
smart_199_raw 0.000506 -0.002304 -1.928865e-04 0.000167 -0.000259 -0.000170 0.003222 -2.052890e-04 0.004982 -0.000060 ... 0.000290 -0.001245 0.000040 0.000066 1.000000 7.865792e-04 3.307712e-03 1.030953e-03 -0.004438 -0.001160
smart_240_raw -0.001361 0.079282 -3.036262e-01 0.004539 -0.007570 0.021257 0.826383 2.373800e-15 0.109350 -0.002125 ... -0.030253 -0.245605 -0.002082 -0.000809 0.000787 1.000000e+00 4.086996e-01 2.579818e-01 -0.641924 0.284585
smart_241_raw 0.001256 0.043019 -1.130124e-02 0.013141 0.026469 0.011158 0.405506 -5.687530e-16 0.052006 0.002980 ... 0.027278 -0.030405 -0.002641 -0.002572 0.003308 4.086996e-01 1.000000e+00 2.355348e-01 -0.072924 0.054042
smart_242_raw 0.000206 0.013811 -3.669415e-03 0.006288 0.004199 0.004815 0.240307 -1.244273e-14 0.057293 0.000017 ... -0.003326 -0.058647 -0.000948 -0.000933 0.001031 2.579818e-01 2.355348e-01 1.000000e+00 -0.146782 0.075975
capacity_TB 0.000602 0.110230 1.098329e-01 -0.036918 0.024436 -0.006759 -0.821823 -2.035881e-02 -0.125083 0.003999 ... 0.020102 0.169691 -0.000096 0.001548 -0.004438 -6.419243e-01 -7.292402e-02 -1.467825e-01 1.000000 -0.265300
smart_193_225 0.000636 0.057691 2.603227e-03 0.088046 -0.006208 0.022182 0.298264 -4.730102e-03 0.075448 -0.001180 ... -0.042788 -0.083569 -0.000288 0.000012 -0.001160 2.845854e-01 5.404226e-02 7.597500e-02 -0.265300 1.000000

22 rows × 22 columns

A few of the columns' relations will need to be examined well based on these correlation coefficients.

A prominent feature is smart_9_raw as the column with the most extreme correlations with other columns, which is understandable given that SMART attribute 9 represents the total count of hours the drive has been in a power-on state (Acronis, Knowledge Base 9109). Most other issues worth measuring are likely correlated with the drive age and amount of operation. This column may also be a powerful predictor within predictive models as an older drive is more likely to wear down to failure suddenly than a newer drive in general even if other values are not present. Even if other predictors of failure are present in an instance, a drive with an average or lower smart_9_raw value may represent a drive that will fail far sooner than the average length of time to failure.

smart_240_raw also has quite high correlations with other independent variables.

smart_197_raw and smart_198_raw have a nearly perfect degree of collinearity with each other and little in comparison with any other column. smart_198_raw will be dropped as it has a lower correlation with the dependent variable failure.

Finally, smart_190_raw and smart_194_raw have a very high degree of collinearity with each other and little in comparison with any other column. One likely needs removed.

The dataset may be large enough to not need to worry about the multicollinearity affecting the predictive power of the models, but the redundancy of information may skew the results.

For potential predictors for failure, smart_5_raw and smart_197_raw have the highest positive correlations with failure, at 4.4% and 2.7%. SMART attribute 5 is the reallocated sectors count of drives, which triggers when a read, write, or verification error occurs (Acronis, Knowledge Base 9105). SMART attribute 197 is the current pending sector count, which is the count of unstable sectors that are awaiting remapping (Acronis, Knowledge Base 9133). This value decreases as sectors are remapped, but the value would remain consistently high if these sectors are unable to be remapped. Both columns make complete sense as the highest correlation with failure and will likely be the most important predictor variables for HDD failure.

In [5]:
df[['smart_197_raw', 'smart_198_raw', 'failure']].corr()
Out[5]:
smart_197_raw smart_198_raw failure
smart_197_raw 1.000000 0.978249 0.02741
smart_198_raw 0.978249 1.000000 0.02087
failure 0.027410 0.020870 1.00000
In [11]:
df.drop('smart_198_raw', axis = 1, inplace = True)
In [30]:
corr_df = df.corr(method = 'pearson')
In [31]:
fig, ax = plt.subplots(figsize = (30, 23))

sns.heatmap(
    corr_df,
    ax = ax,
    annot = True,
    fmt = ".1%",
    vmin = -1, vmax = 1, center = 0,
    linewidths = 3,
    linecolor = "white",
    xticklabels = corr_df.columns,
    yticklabels = corr_df.columns,
    square = True,
    cbar = True
)

plt.title("Dataframe Correlation Heatmap", fontsize = 54)
fig.savefig("Charts/Corr Heatmap.svg")
fig.savefig("Charts/Corr Heatmap.png")
In [32]:
df.columns
Out[32]:
Index(['date', 'serial_number', 'model', 'failure', 'smart_1_raw',
       'smart_3_raw', 'smart_4_raw', 'smart_5_raw', 'smart_7_raw',
       'smart_9_raw', 'smart_10_raw', 'smart_12_raw', 'smart_187_raw',
       'smart_188_raw', 'smart_190_raw', 'smart_192_raw', 'smart_194_raw',
       'smart_197_raw', 'smart_199_raw', 'smart_240_raw', 'smart_241_raw',
       'smart_242_raw', 'manufacturer', 'capacity_TB', 'smart_193_225',
       'smart_191_cat', 'smart_184_cat', 'smart_200_cat', 'smart_196_cat',
       'smart_8_cat', 'smart_2_cat', 'smart_223_cat', 'smart_220_cat',
       'smart_222_cat', 'smart_226_cat', 'smart_11_cat'],
      dtype='object')
In [3]:
from sklearn.feature_selection import chi2
In [13]:
# Display the results of a Chi Squared on a contingency table
# in a tabular format
def chi2_output(contingency: pd.core.frame.DataFrame):
    chi2, p, dof, expected = scs.chi2_contingency(contingency)
    print("χ2-Coefficient: \t" + str(chi2))
    print("P-Value: \t\t" + str(p))
    print("Degrees of Freedom: \t" + str(dof))
    
    # Access the index names of the contingency table dataframe
    ax_1 = str(contingency.axes[1][0])
    ax_2 = str(contingency.axes[1][1])
    ax_title = ax_1 + ":\t\t" + ax_2 + ":\t\t" + ax_1 + ":\t\t" + ax_2 + ":"
            
    print("Expected Values:\n\t\t\tExpected:\t\t\tActual:")
    print("\t" + contingency.axes[1].name + ":\t" + ax_title)
    print(contingency.axes[0].name + ":")
    
    # Map the indexes to string values to ensure numeric indexes
    # don't cause type errors
    contingency.index = contingency.index.map(str)
    
    for i, j in enumerate(contingency.index):
        expected_false = "{:.3f}".format(expected[i][0])
        actual_false = str(contingency[0][i])
        expected_true = "{:.3f}".format(expected[i][1])
        actual_true = str(contingency[1][i])
        
        # Tabular spacing adjustments on the assumption that 1 tab = 8 spaces
        index_text = "    " + j + ": \t"
        
        if len(j) < 3:
            index_text += "\t\t"
        elif len(j) < 9:
            index_text += "\t"
        
        if len(expected_false) < 7:
            expected_false += "\t"
        if len(actual_false) < 7:
            actual_false += "\t"
        if len(expected_true) < 7:
            expected_true += "]\t"
        else:
            expected_true += "]"
        
        expected_text = "[" + expected_false + "\t" + expected_true
        if len(expected_text) < 16:
            expected_text = expected_text + "\t"
        actual_text = "[" + actual_false + "\t" + actual_true + "]"

        line = expected_text + "\t" + actual_text
                   
        print(index_text + line)
In [6]:
import os
import sys

from cffi import FFI

FFI_ = FFI()
FFI_.cdef('extern void* CreateSubarea(char * modelId, double areaKm2);')
FFI_.cdef('extern char** GetSubareaNames(void* simulation, int* size);')
FFI_.cdef('extern char** GetNodeIdentifiers(void* simulation, int* size);')
FFI_.cdef('extern char** GetNodeNames(void* simulation, int* size);')

def prepend_path_env (added_paths, to_env='PATH'):
    path_sep = ';'
    prior_path_env = os.environ.get(to_env)
    prior_paths = prior_path_env.split(path_sep)
    added_paths = [x for x in added_paths if os.path.exists(x)]
    new_paths = prior_paths + added_paths
    new_env_val = path_sep.join(new_paths)
    return new_env_val

libs_path = r"C:\Users\aedri\Anaconda3\envs\tf1\lib\R\bin\x64"
dll_path = os.path.join("R.dll")
libs_path2 = r"C:\Users\aedri\Anaconda3\envs\tf1\Lib\R\library\stats\libs\x64"
dll_path2 = os.path.join("stats.dll")

to_env = 'PATH'
if(sys.platform == 'win32'):
    os.environ[to_env] = prepend_path_env([libs_path], to_env)
    os.environ[to_env] = prepend_path_env([libs_path2], to_env)

LIB = FFI_.dlopen(dll_path, 1) # 1 for Lazy loading
dir(LIB)
LIB2 = FFI_.dlopen(dll_path2, 1) # 1 for Lazy loading
dir(LIB2)
Out[6]:
['CreateSubarea', 'GetNodeIdentifiers', 'GetNodeNames', 'GetSubareaNames']
In [7]:
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
import rpy2.robjects as ro
rpy2.robjects.numpy2ri.activate()
rstats = importr('stats')
In [9]:
# Display the formatted results of the R stats Fisher_Test, 
# using Monte Carlo Simulation
def r_fisher_output(dataframe):
    results = rstats.fisher_test(dataframe.to_numpy(), \
                                 simulate_p_value = True)
    
    # Convert the listvector object returned from R stats to
    # a list of string values
    d = [key + "_" + str(results.rx2(key)[0]) for key in results.names]
    d2 = []
    for i in d:
        d2.append("".join(i.replace("\t", "").splitlines()))
    
    # Replicate the tabluar data formatting
    for line in d2:
        if len(line.split("_")[0]) < 8:
            print(line.replace("_", "\t\t"))
        else:
            print(line.replace("_", "\t"))

manufacturer

In [15]:
manufacturer_contingency = pd.crosstab(df['manufacturer'], df['failure'])
manufacturer_contingency
Out[15]:
failure False True
manufacturer
HGST 2660507 26
Seagate 7965949 606
Toshiba 322682 40
Western Digital 25295 6
In [16]:
pd.crosstab(df['manufacturer'], df['failure'], normalize = "index")
Out[16]:
failure False True
manufacturer
HGST 0.999990 0.000010
Seagate 0.999924 0.000076
Toshiba 0.999876 0.000124
Western Digital 0.999763 0.000237
In [17]:
chi2_output(manufacturer_contingency)
χ2-Coefficient: 	175.608940580931
P-Value: 		7.828322902603545e-38
Degrees of Freedom: 	3
Expected Values:
			Expected:			Actual:
	failure:	False:		True:		False:		True:
manufacturer:
    HGST: 		[2660368.643	164.357]	[2660507	26]
    Seagate: 		[7966062.857	492.143]	[7965949	606]
    Toshiba: 		[322702.063	19.937]		[322682		40]
    Western Digital: 	[25299.437	1.563]		[25295		6]
In [18]:
r_fisher_output(manufacturer_contingency)
p.value		0.0004997501249375312
alternative	two.sided
method		Fisher's Exact Test for Count Data with simulated p-value (based on 2000 replicates)
data.name1	structure(c(2660507L, 7965949L, 322682L, 25295L, 26L, 606L, 40L, 
data.name2	6L), .Dim = c(4L, 2L))

model

The model column will ultimately be dropped even after all of the work that went into cleaning its data. The large amount of categories in it substantially adds complexity to the model while not improving enough. The manufacturer column, while less specific, contains all of the same variation with only four categories. Additionally, many of the models do not have a single failure, and even more have only thousands of hard drive days that they represent. Leaving the column in for predictive modeling and analysis will only hurt the overall results, and as such, is removed.

In [10]:
model_contingency = pd.crosstab(df['model'], df['failure'])
model_contingency
Out[10]:
failure False True
model
HDS5C4040ALE630 2484 0
HDWE160 368 0
HDWF180 1840 0
HMS5C4040ALE640 253754 4
HMS5C4040BLE640 1168480 12
HMS5C4040BLE641 91 0
HUH721212ALE600 143519 1
HUH721212ALN604 995718 7
HUH728080ALE600 92071 2
HUS726040ALE610 2570 0
MD04ABA400V 9009 0
MG07ACA14TA 232115 7
MQ01ABF050 42542 23
MQ01ABF050M 36808 10
ST10000NM0086 109168 5
ST1000LM024 HN' 91 0
ST12000NM0007 3394528 364
ST12000NM0008 321264 10
ST12000NM0117 462 0
ST16000NM001G 1840 0
ST4000DM000 1757379 119
ST4000DM005 3555 0
ST500LM012 HN 45089 13
ST500LM021 3036 0
ST500LM030 23019 6
ST6000DM001 368 0
ST6000DM004 92 0
ST6000DX000 81492 1
ST8000DM002 896911 35
ST8000DM004 273 0
ST8000DM005 2257 0
ST8000NM0055 1316333 53
Seagate SSD 1820 0
WD5000BPKT 912 0
WD5000LPCX 4928 0
WD5000LPVX 19181 6
WD60EFRX 274 0
ZA2000CM10002 355 0
ZA250CM10002 6844 0
ZA500CM10002 1593 0
In [11]:
pd.crosstab(df['model'], df['failure'], normalize = "index")
Out[11]:
failure False True
model
HDS5C4040ALE630 1.000000 0.000000
HDWE160 1.000000 0.000000
HDWF180 1.000000 0.000000
HMS5C4040ALE640 0.999984 0.000016
HMS5C4040BLE640 0.999990 0.000010
HMS5C4040BLE641 1.000000 0.000000
HUH721212ALE600 0.999993 0.000007
HUH721212ALN604 0.999993 0.000007
HUH728080ALE600 0.999978 0.000022
HUS726040ALE610 1.000000 0.000000
MD04ABA400V 1.000000 0.000000
MG07ACA14TA 0.999970 0.000030
MQ01ABF050 0.999460 0.000540
MQ01ABF050M 0.999728 0.000272
ST10000NM0086 0.999954 0.000046
ST1000LM024 HN' 1.000000 0.000000
ST12000NM0007 0.999893 0.000107
ST12000NM0008 0.999969 0.000031
ST12000NM0117 1.000000 0.000000
ST16000NM001G 1.000000 0.000000
ST4000DM000 0.999932 0.000068
ST4000DM005 1.000000 0.000000
ST500LM012 HN 0.999712 0.000288
ST500LM021 1.000000 0.000000
ST500LM030 0.999739 0.000261
ST6000DM001 1.000000 0.000000
ST6000DM004 1.000000 0.000000
ST6000DX000 0.999988 0.000012
ST8000DM002 0.999961 0.000039
ST8000DM004 1.000000 0.000000
ST8000DM005 1.000000 0.000000
ST8000NM0055 0.999960 0.000040
Seagate SSD 1.000000 0.000000
WD5000BPKT 1.000000 0.000000
WD5000LPCX 1.000000 0.000000
WD5000LPVX 0.999687 0.000313
WD60EFRX 1.000000 0.000000
ZA2000CM10002 1.000000 0.000000
ZA250CM10002 1.000000 0.000000
ZA500CM10002 1.000000 0.000000
In [14]:
chi2_output(model_contingency)
χ2-Coefficient: 	519.3493099524433
P-Value: 		3.025361166876539e-85
Degrees of Freedom: 	39
Expected Values:
			Expected:			Actual:
	failure:	False:		True:		False:		True:
model:
    HDS5C4040ALE630: 	[2483.847	0.153]		[2484		0]
    HDWE160: 		[367.977	0.023]		[368		0]
    HDWF180: 		[1839.886	0.114]		[1840		0]
    HMS5C4040ALE640: 	[253742.324	15.676]		[253754		4]
    HMS5C4040BLE640: 	[1168419.815	72.185]		[1168480	12]
    HMS5C4040BLE641: 	[90.994		0.006]		[91		0]
    HUH721212ALE600: 	[143511.134	8.866]		[143519		1]
    HUH721212ALN604: 	[995663.488	61.512]		[995718		7]
    HUH728080ALE600: 	[92067.312	5.688]		[92071		2]
    HUS726040ALE610: 	[2569.841	0.159]		[2570		0]
    MD04ABA400V: 	[9008.443	0.557]		[9009		0]
    MG07ACA14TA: 	[232107.660	14.340]		[232115		7]
    MQ01ABF050: 	[42562.370	2.630]		[42542		23]
    MQ01ABF050M: 	[36815.726	2.274]		[36808		10]
    ST10000NM0086: 	[109166.256	6.744]		[109168		5]
    ST1000LM024 HN': 	[90.994		0.006]		[91		0]
    ST12000NM0007: 	[3394682.277	209.723]	[3394528	364]
    ST12000NM0008: 	[321254.153	19.847]		[321264		10]
    ST12000NM0117: 	[461.971	0.029]		[462		0]
    ST16000NM001G: 	[1839.886	0.114]		[1840		0]
    ST4000DM000: 	[1757389.429	108.571]	[1757379	119]
    ST4000DM005: 	[3554.780	0.220]		[3555		0]
    ST500LM012 HN: 	[45099.214	2.786]		[45089		13]
    ST500LM021: 	[3035.812	0.188]		[3036		0]
    ST500LM030: 	[23023.578	1.422]		[23019		6]
    ST6000DM001: 	[367.977	0.023]		[368		0]
    ST6000DM004: 	[91.994		0.006]		[92		0]
    ST6000DX000: 	[81487.966	5.034]		[81492		1]
    ST8000DM002: 	[896890.590	55.410]		[896911		35]
    ST8000DM004: 	[272.983	0.017]		[273		0]
    ST8000DM005: 	[2256.861	0.139]		[2257		0]
    ST8000NM0055: 	[1316304.679	81.321]		[1316333	53]
    Seagate SSD: 	[1819.888	0.112]		[1820		0]
    WD5000BPKT: 	[911.944	0.056]		[912		0]
    WD5000LPCX: 	[4927.696	0.304]		[4928		0]
    WD5000LPVX: 	[19185.815	1.185]		[19181		6]
    WD60EFRX: 		[273.983	0.017]		[274		0]
    ZA2000CM10002: 	[354.978	0.022]		[355		0]
    ZA250CM10002: 	[6843.577	0.423]		[6844		0]
    ZA500CM10002: 	[1592.902	0.098]		[1593		0]
In [15]:
r_fisher_output(model_contingency)
p.value		0.0004997501249375312
alternative	two.sided
method		Fisher's Exact Test for Count Data with simulated p-value (based on 2000 replicates)
data.name1	structure(c(2484L, 368L, 1840L, 253754L, 1168480L, 91L, 143519L, 
data.name2	995718L, 92071L, 2570L, 9009L, 232115L, 42542L, 36808L, 109168L, 
data.name3	91L, 3394528L, 321264L, 462L, 1840L, 1757379L, 3555L, 45089L, 
data.name4	3036L, 23019L, 368L, 92L, 81492L, 896911L, 273L, 2257L, 1316333L, 
data.name5	1820L, 912L, 4928L, 19181L, 274L, 355L, 6844L, 1593L, 0L, 0L, 
data.name6	0L, 4L, 12L, 0L, 1L, 7L, 2L, 0L, 0L, 7L, 23L, 10L, 5L, 0L, 364L, 
data.name7	10L, 0L, 0L, 119L, 0L, 13L, 0L, 6L, 0L, 0L, 1L, 35L, 0L, 0L, 
data.name8	53L, 0L, 0L, 0L, 6L, 0L, 0L, 0L, 0L), .Dim = c(40L, 2L))
In [12]:
df.drop('model', axis = 1, inplace = True)

capacity_TB

In [182]:
capacity_contingency = pd.crosstab(df['capacity_TB'], df['failure'])
capacity_contingency
Out[182]:
failure False True
capacity_TB
0.25 6844 0
0.50 177108 58
1.00 91 0
2.00 355 0
4.00 3197322 135
6.00 82594 1
8.00 2309685 90
10.00 110988 5
12.00 4855491 382
14.00 232115 7
16.00 1840 0
In [183]:
pd.crosstab(df['capacity_TB'], df['failure'], normalize = "index")
Out[183]:
failure False True
capacity_TB
0.25 1.000000 0.000000
0.50 0.999673 0.000327
1.00 1.000000 0.000000
2.00 1.000000 0.000000
4.00 0.999958 0.000042
6.00 0.999988 0.000012
8.00 0.999961 0.000039
10.00 0.999955 0.000045
12.00 0.999921 0.000079
14.00 0.999970 0.000030
16.00 1.000000 0.000000
In [335]:
chi2_output(capacity_contingency)
χ2-distribution: 	272.1246384142239
P-Value: 		1.192577692516227e-52
Degrees of Freedom: 	10
Expected Values:
			Expected:			Actual:
	failure:	False:		True:		False:		True:
capacity_TB:
    0.25: 		[6843.577	0.423]		[6844		0]
    0.5: 		[177155.055	10.945]		[177108		58]
    1.0: 		[90.994		0.006]		[91		0]
    2.0: 		[354.978	0.022]		[355		0]
    4.0: 		[3197259.473	197.527]	[3197322	135]
    6.0: 		[82589.898	5.102]		[82594		1]
    8.0: 		[2309632.311	142.689]	[2309685	90]
    10.0: 		[110986.143	6.857]		[110988		5]
    12.0: 		[4855573.023	299.977]	[4855491	382]
    14.0: 		[232107.660	14.340]		[232115		7]
    16.0: 		[1839.886	0.114]		[1840		0]
In [527]:
r_fisher_output(capacity_contingency)
p.value		0.0004997501249375312
alternative	two.sided
method		Fisher's Exact Test for Count Data with simulated p-value (based on 2000 replicates)
data.name1	structure(c(6844L, 177108L, 91L, 355L, 3197322L, 82594L, 2309685L, 
data.name2	110988L, 4855491L, 232115L, 1840L, 0L, 58L, 0L, 0L, 135L, 1L, 
data.name3	90L, 5L, 382L, 7L, 0L), .Dim = c(11L, 2L))

smart_191_cat

This column has the highest p-value out of all of the category columns. While still statistically significant, this is likely from the size of the dataset and not out of pure correlation. smart_191_cat is not likely to be a good predictor variable.

In [292]:
smart_191_contingency = pd.crosstab(df['smart_191_cat'], df['failure'])
smart_191_contingency
Out[292]:
failure False True
smart_191_cat
0 6402442 403
1 3563330 238
2 1008661 37
In [293]:
pd.crosstab(df['smart_191_cat'], df['failure'], normalize = "index")
Out[293]:
failure False True
smart_191_cat
0 0.999937 0.000063
1 0.999933 0.000067
2 0.999963 0.000037
In [336]:
chi2_output(smart_191_contingency)
χ2-distribution: 	11.872750463233979
P-Value: 		0.002641587460833803
Degrees of Freedom: 	2
Expected Values:
			Expected:			Actual:
	failure:	False:		True:		False:		True:
smart_191_cat:
    0: 			[6402449.457	395.543]	[6402442	403]
    1: 			[3563347.857	220.143]	[3563330	238]
    2: 			[1008635.687	62.313]		[1008661	37]
In [528]:
r_fisher_output(smart_191_contingency)
p.value		0.0014992503748125937
alternative	two.sided
method		Fisher's Exact Test for Count Data with simulated p-value (based on 2000 replicates)
data.name	structure(c(6402442L, 3563330L, 1008661L, 403L, 238L, 37L), .Dim = 3:2)

smart_184_cat

With a p-value of 0.0, this is likely the strongest relation to failure in the dataset.

In [337]:
smart_184_contingency = pd.crosstab(df['smart_184_cat'], df['failure'])
smart_184_contingency
Out[337]:
failure False True
smart_184_cat
0 10974427 672
1 6 6
In [338]:
pd.crosstab(df['smart_184_cat'], df['failure'], normalize = "index")
Out[338]:
failure False True
smart_184_cat
0 0.999939 0.000061
1 0.500000 0.500000
In [339]:
chi2_output(smart_184_contingency)
χ2-distribution: 	40797.503188875104
P-Value: 		0.0
Degrees of Freedom: 	1
Expected Values:
			Expected:			Actual:
	failure:	False:		True:		False:		True:
smart_184_cat:
    0: 			[10974421.001	677.999]	[10974427	672]
    1: 			[11.999		0.001]		[6		6]
In [529]:
r_fisher_output(smart_184_contingency)
p.value		5.0214144599400225e-23
conf.int	4201.0256410217235
estimate	16382.987859030574
null.value	1.0
alternative	two.sided
method		Fisher's Exact Test for Count Data
data.name	structure(c(10974427L, 6L, 672L, 6L), .Dim = c(2L, 2L))

smart_200_cat

In [340]:
smart_200_contingency = pd.crosstab(df['smart_200_cat'], df['failure'])
smart_200_contingency
Out[340]:
failure False True
smart_200_cat
0 7076696 280
1 3853411 385
2 44326 13
In [341]:
pd.crosstab(df['smart_200_cat'], df['failure'], normalize = "index")
Out[341]:
failure False True
smart_200_cat
0 0.999960 0.000040
1 0.999900 0.000100
2 0.999707 0.000293
In [342]:
chi2_output(smart_200_contingency)
χ2-distribution: 	185.64260160941936
P-Value: 		4.877769316310266e-41
Degrees of Freedom: 	2
Expected Values:
			Expected:			Actual:
	failure:	False:		True:		False:		True:
smart_200_cat:
    0: 			[7076538.812	437.188]	[7076696	280]
    1: 			[3853557.927	238.073]	[3853411	385]
    2: 			[44336.261	2.739]		[44326		13]
In [530]:
r_fisher_output(smart_200_contingency)
p.value		0.0004997501249375312
alternative	two.sided
method		Fisher's Exact Test for Count Data with simulated p-value (based on 2000 replicates)
data.name	structure(c(7076696L, 3853411L, 44326L, 280L, 385L, 13L), .Dim = 3:2)

smart_196_cat

In [346]:
smart_196_contingency = pd.crosstab(df['smart_196_cat'], df['failure'])
smart_196_contingency
Out[346]:
failure False True
smart_196_cat
0 7920769 593
1 3042132 76
2 11532 9
In [347]:
pd.crosstab(df['smart_196_cat'], df['failure'], normalize = "index")
Out[347]:
failure False True
smart_196_cat
0 0.999925 0.000075
1 0.999975 0.000025
2 0.999220 0.000780
In [348]:
chi2_output(smart_196_contingency)
χ2-distribution: 	184.9589741070181
P-Value: 		6.865451182810317e-41
Degrees of Freedom: 	2
Expected Values:
			Expected:			Actual:
	failure:	False:		True:		False:		True:
smart_196_cat:
    0: 			[7920872.649	489.351]	[7920769	593]
    1: 			[3042020.064	187.936]	[3042132	76]
    2: 			[11540.287	0.713]		[11532		9]
In [532]:
r_fisher_output(smart_196_contingency)
p.value		0.0004997501249375312
alternative	two.sided
method		Fisher's Exact Test for Count Data with simulated p-value (based on 2000 replicates)
data.name	structure(c(7920769L, 3042132L, 11532L, 593L, 76L, 9L), .Dim = 3:2)

smart_8_cat

In [349]:
smart_8_contingency = pd.crosstab(df['smart_8_cat'], df['failure'])
smart_8_contingency
Out[349]:
failure False True
smart_8_cat
0 7946064 599
1 1605347 63
2 1423022 16
In [350]:
pd.crosstab(df['smart_8_cat'], df['failure'], normalize = "index")
Out[350]:
failure False True
smart_8_cat
0 0.999925 0.000075
1 0.999961 0.000039
2 0.999989 0.000011
In [351]:
chi2_output(smart_8_contingency)
χ2-distribution: 	95.8211076517963
P-Value: 		1.5585145046397941e-21
Degrees of Freedom: 	2
Expected Values:
			Expected:			Actual:
	failure:	False:		True:		False:		True:
smart_8_cat:
    0: 			[7946172.086	490.914]	[7946064	599]
    1: 			[1605310.824	99.176]		[1605347	63]
    2: 			[1422950.090	87.910]		[1423022	16]
In [533]:
r_fisher_output(smart_8_contingency)
p.value		0.0004997501249375312
alternative	two.sided
method		Fisher's Exact Test for Count Data with simulated p-value (based on 2000 replicates)
data.name	structure(c(7946064L, 1605347L, 1423022L, 599L, 63L, 16L), .Dim = 3:2)

smart_2_cat

In [352]:
smart_2_contingency = pd.crosstab(df['smart_2_cat'], df['failure'])
smart_2_contingency
Out[352]:
failure False True
smart_2_cat
0 7946064 599
1 1017453 56
2 2010916 23
In [353]:
pd.crosstab(df['smart_2_cat'], df['failure'], normalize = "index")
Out[353]:
failure False True
smart_2_cat
0 0.999925 0.000075
1 0.999945 0.000055
2 0.999989 0.000011
In [354]:
chi2_output(smart_2_contingency)
χ2-distribution: 	107.03867719554249
P-Value: 		5.712767796083653e-24
Degrees of Freedom: 	2
Expected Values:
			Expected:			Actual:
	failure:	False:		True:		False:		True:
smart_2_cat:
    0: 			[7946172.086	490.914]	[7946064	599]
    1: 			[1017446.142	62.858]		[1017453	56]
    2: 			[2010814.772	124.228]	[2010916	23]
In [534]:
r_fisher_output(smart_2_contingency)
p.value		0.0004997501249375312
alternative	two.sided
method		Fisher's Exact Test for Count Data with simulated p-value (based on 2000 replicates)
data.name	structure(c(7946064L, 1017453L, 2010916L, 599L, 56L, 23L), .Dim = 3:2)

smart_223_cat

In [355]:
smart_223_contingency = pd.crosstab(df['smart_223_cat'], df['failure'])
smart_223_contingency
Out[355]:
failure False True
smart_223_cat
0 10463052 624
1 469302 41
2 42079 13
In [356]:
pd.crosstab(df['smart_223_cat'], df['failure'], normalize = "index")
Out[356]:
failure False True
smart_223_cat
0 0.999940 0.000060
1 0.999913 0.000087
2 0.999691 0.000309
In [357]:
chi2_output(smart_223_contingency)
χ2-distribution: 	47.3441011733582
P-Value: 		5.240335045424587e-11
Degrees of Freedom: 	2
Expected Values:
			Expected:			Actual:
	failure:	False:		True:		False:		True:
smart_223_cat:
    0: 			[10463029.594	646.406]	[10463052	624]
    1: 			[469314.006	28.994]		[469302		41]
    2: 			[42089.400	2.600]		[42079		13]
In [535]:
r_fisher_output(smart_223_contingency)
p.value		0.0004997501249375312
alternative	two.sided
method		Fisher's Exact Test for Count Data with simulated p-value (based on 2000 replicates)
data.name	structure(c(10463052L, 469302L, 42079L, 624L, 41L, 13L), .Dim = 3:2)

smart_220_cat

In [540]:
smart_220_contingency = pd.crosstab(df['smart_220_cat'], df['failure'])
smart_220_contingency
Out[540]:
failure False True
smart_220_cat
0 10652005 638
1 161305 36
2 161123 4
In [359]:
pd.crosstab(df['smart_220_cat'], df['failure'], normalize = "index")
Out[359]:
failure False True
smart_220_cat
0 0.999940 0.000060
1 0.999777 0.000223
2 0.999975 0.000025
In [360]:
chi2_output(smart_220_contingency)
χ2-distribution: 	72.17414321356232
P-Value: 		2.1261012025679754e-16
Degrees of Freedom: 	2
Expected Values:
			Expected:			Actual:
	failure:	False:		True:		False:		True:
smart_220_cat:
    0: 			[10651984.921	658.079]	[10652005	638]
    1: 			[161331.033	9.967]		[161305		36]
    2: 			[161117.046	9.954]		[161123		4]
In [541]:
r_fisher_output(smart_220_contingency)
p.value		0.0004997501249375312
alternative	two.sided
method		Fisher's Exact Test for Count Data with simulated p-value (based on 2000 replicates)
data.name	structure(c(10652005L, 161305L, 161123L, 638L, 36L, 4L), .Dim = 3:2)

smart_222_cat

In [362]:
smart_222_contingency = pd.crosstab(df['smart_222_cat'], df['failure'])
smart_222_contingency
Out[362]:
failure False True
smart_222_cat
0 10651751 638
1 154615 6
2 168067 34
In [363]:
pd.crosstab(df['smart_222_cat'], df['failure'], normalize = "index")
Out[363]:
failure False True
smart_222_cat
0 0.999940 0.000060
1 0.999961 0.000039
2 0.999798 0.000202
In [364]:
chi2_output(smart_222_contingency)
χ2-Coefficient: 	55.638904842943205
P-Value: 		8.28257398176623e-13
Degrees of Freedom: 	2
Expected Values:
			Expected:			Actual:
	failure:	False:		True:		False:		True:
smart_222_cat:
    0: 			[10651730.937	658.063]	[10651751	638]
    1: 			[154611.448	9.552]		[154615		6]
    2: 			[168090.615	10.385]		[168067		34]
In [537]:
r_fisher_output(smart_222_contingency)
p.value		0.0004997501249375312
alternative	two.sided
method		Fisher's Exact Test for Count Data with simulated p-value (based on 2000 replicates)
data.name	structure(c(10651751L, 154615L, 168067L, 638L, 6L, 34L), .Dim = 3:2)

smart_226_cat

In [365]:
smart_226_contingency = pd.crosstab(df['smart_226_cat'], df['failure'])
smart_226_contingency
Out[365]:
failure False True
smart_226_cat
0 10651751 638
1 88352 33
2 234330 7
In [366]:
pd.crosstab(df['smart_226_cat'], df['failure'], normalize = "index")
Out[366]:
failure False True
smart_226_cat
0 0.999940 0.000060
1 0.999627 0.000373
2 0.999970 0.000030
In [367]:
chi2_output(smart_226_contingency)
χ2-Coefficient: 	143.3893719250644
P-Value: 		7.301187551981327e-32
Degrees of Freedom: 	2
Expected Values:
			Expected:			Actual:
	failure:	False:		True:		False:		True:
smart_226_cat:
    0: 			[10651730.937	658.063]	[10651751	638]
    1: 			[88379.540	5.460]		[88352		33]
    2: 			[234322.524	14.476]		[234330		7]
In [538]:
r_fisher_output(smart_226_contingency)
p.value		0.0004997501249375312
alternative	two.sided
method		Fisher's Exact Test for Count Data with simulated p-value (based on 2000 replicates)
data.name	structure(c(10651751L, 88352L, 234330L, 638L, 33L, 7L), .Dim = 3:2)

smart_11_cat

In [368]:
smart_11_contingency = pd.crosstab(df['smart_11_cat'], df['failure'])
smart_11_contingency
Out[368]:
failure False True
smart_11_cat
0 10903958 659
1 45459 10
2 25016 9
In [369]:
pd.crosstab(df['smart_11_cat'], df['failure'], normalize = "index")
Out[369]:
failure False True
smart_11_cat
0 0.99994 0.00006
1 0.99978 0.00022
2 0.99964 0.00036
In [370]:
chi2_output(smart_11_contingency)
χ2-Coefficient: 	54.67278414829015
P-Value: 		1.3426282073300716e-12
Degrees of Freedom: 	2
Expected Values:
			Expected:			Actual:
	failure:	False:		True:		False:		True:
smart_11_cat:
    0: 			[10903943.355	673.645]	[10903958	659]
    1: 			[45466.191	2.809]		[45459		10]
    2: 			[25023.454	1.546]		[25016		9]
In [539]:
r_fisher_output(smart_11_contingency)
p.value		0.0004997501249375312
alternative	two.sided
method		Fisher's Exact Test for Count Data with simulated p-value (based on 2000 replicates)
data.name	structure(c(10903958L, 45459L, 25016L, 659L, 10L, 9L), .Dim = 3:2)

To begin performing factor analysis, the dataset will need to be prepared through standardization and normalization, as well as the test, train, and validation splits. Doing these before the PCA ensures that no data is contaminated with the influence of the testing and validation data.

Dataset Splitting: Train, Test, and Validation

In [13]:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
In [14]:
y_df = df['failure']
x_df = df.drop('failure', axis = 1)
In [15]:
x_df.drop(['date', 'serial_number'], axis = 1, inplace = True)
In [16]:
del df

The first split is 80% Train and 20% Test, stratified on the y_df / failure series.

In [36]:
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, \
                        test_size = 0.2, random_state = 13, stratify = y_df)

Verify the stratified splitting.

In [196]:
y_train.value_counts()
Out[196]:
False    8779546
True         542
Name: failure, dtype: int64
In [102]:
y_train.value_counts()[1] / y_train.value_counts()[0]
Out[102]:
6.173439947805957e-05
In [197]:
y_test.value_counts()
Out[197]:
False    2194887
True         136
Name: failure, dtype: int64

Note that while the ratio is not exact, it is the closest possible.

In [104]:
y_test.value_counts()[1] / y_test.value_counts()[0]
Out[104]:
6.196218757503233e-05
In [105]:
(y_test.value_counts()[1] - 1) / y_test.value_counts()[0]
Out[105]:
6.150658325462769e-05
In [106]:
(y_test.value_counts()[1] + 1) / y_test.value_counts()[0]
Out[106]:
6.241779189543698e-05

The second split is 87.5% Train and 12.5% Validation, stratified on the y_df / failure series, to result in 70% Train and 10% Validation overall.

In [37]:
x_train, x_valid, y_train, y_valid = train_test_split(x_train, y_train, \
                    test_size = 0.125, random_state = 13, stratify = y_train)
In [126]:
y_train.value_counts()
Out[126]:
False    7682103
True         474
Name: failure, dtype: int64
In [109]:
y_train.value_counts()[1] / y_train.value_counts()[0]
Out[109]:
6.170185429692885e-05
In [199]:
y_valid.value_counts()
Out[199]:
False    1097443
True          68
Name: failure, dtype: int64
In [111]:
y_valid.value_counts()[1] / y_valid.value_counts()[0]
Out[111]:
6.196221580528556e-05

Continuous Variable Standardization

A scaler fit to the training data is created to standardize the continuous columns for model training. This avoids any contamination of the training data by ensuring that the test and validation datasets do not influence the training data at all.

In [148]:
scaler = preprocessing.StandardScaler()
In [154]:
x_train.columns
Out[154]:
Index(['smart_1_raw', 'smart_3_raw', 'smart_4_raw', 'smart_5_raw',
       'smart_7_raw', 'smart_9_raw', 'smart_10_raw', 'smart_12_raw',
       'smart_187_raw', 'smart_188_raw', 'smart_190_raw', 'smart_192_raw',
       'smart_194_raw', 'smart_197_raw', 'smart_199_raw', 'smart_240_raw',
       'smart_241_raw', 'smart_242_raw', 'manufacturer', 'capacity_TB',
       'smart_193_225', 'smart_191_cat', 'smart_184_cat', 'smart_200_cat',
       'smart_196_cat', 'smart_8_cat', 'smart_2_cat', 'smart_223_cat',
       'smart_220_cat', 'smart_222_cat', 'smart_226_cat', 'smart_11_cat'],
      dtype='object')
In [155]:
cont_cols = [
    'smart_1_raw', 'smart_3_raw', 'smart_4_raw', 'smart_5_raw',
    'smart_7_raw', 'smart_9_raw', 'smart_10_raw', 'smart_12_raw',
    'smart_187_raw', 'smart_188_raw', 'smart_190_raw', 'smart_192_raw',
    'smart_194_raw', 'smart_197_raw', 'smart_199_raw', 'smart_240_raw',
    'smart_241_raw', 'smart_242_raw', 'smart_193_225', 'capacity_TB'
]

This fits the scaler to the continuous columns of the training data. The fit scaler will then be used to scale the testing and validation datasets.

In [200]:
x_train[cont_cols] = scaler.fit_transform(x_train[cont_cols])

A mean as close to zero as possible given the dataset and a standard deviation of 1 is a successful standardization.

In [157]:
x_train[cont_cols].describe()
Out[157]:
smart_1_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw smart_9_raw smart_10_raw smart_12_raw smart_187_raw smart_188_raw smart_190_raw smart_192_raw smart_194_raw smart_197_raw smart_199_raw smart_240_raw smart_241_raw smart_242_raw smart_193_225 capacity_TB
count 7.682577e+06 7.682577e+06 7.682577e+06 7.682577e+06 7.682577e+06 7.682577e+06 7.682577e+06 7.682577e+06 7.682577e+06 7.682577e+06 7.682577e+06 7.682577e+06 7.682577e+06 7.682577e+06 7.682577e+06 7.682577e+06 7.682577e+06 7.682577e+06 7.682577e+06 7.682577e+06
mean 6.559957e-17 1.272166e-17 -3.058378e-17 -6.428810e-18 -2.783875e-18 -3.501949e-17 7.288019e-19 2.408561e-17 1.202338e-18 -5.216298e-18 -1.113003e-14 -9.474425e-18 2.056072e-16 8.952795e-19 2.092068e-18 -9.409711e-16 1.715198e-15 -2.116430e-16 -1.771507e-17 -1.632683e-16
std 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
min -1.084328e+00 -2.132400e-01 -6.419432e-02 -3.532305e-02 -2.138032e-02 -1.709926e+00 -1.664074e-02 -6.012170e-01 -4.769771e-03 -3.586239e-02 -2.748210e+00 -2.328147e-01 -2.850415e+00 -1.324463e-02 -1.351191e-02 -1.947329e+00 -3.902050e+00 -9.574821e-01 -2.973543e-01 -2.305760e+00
25% -1.084328e+00 -2.132400e-01 -4.929733e-02 -3.532305e-02 -2.138032e-02 -8.959860e-01 -1.664074e-02 -3.318594e-01 -4.769771e-03 -3.586239e-02 -6.234342e-01 -2.317537e-01 -7.578256e-01 -1.324463e-02 -1.351191e-02 -4.645074e-01 -4.604450e-02 -1.503821e-02 -2.879344e-01 -1.271726e+00
50% -1.624969e-01 -2.132400e-01 -2.695186e-02 -3.532305e-02 -1.536102e-02 -4.831020e-02 -1.664074e-02 -1.522877e-01 -4.769771e-03 -3.586239e-02 -5.786530e-05 -1.479354e-01 -6.029569e-02 -1.324463e-02 -1.351191e-02 -9.956174e-05 -6.576903e-05 -2.941769e-04 -2.715625e-01 -1.687557e-01
75% 8.856612e-01 -2.132400e-01 -4.606390e-03 -3.532305e-02 -8.171134e-03 8.429774e-01 -1.664074e-02 2.068557e-01 -4.769771e-03 -3.586239e-02 3.423730e-01 2.925006e-02 6.372342e-01 -1.324463e-02 -1.351191e-02 4.459378e-01 5.464831e-01 1.809309e-01 -2.633865e-02 9.342143e-01
max 7.025662e+00 9.594751e+00 1.871909e+02 1.028342e+02 2.235101e+02 3.480919e+00 1.662685e+02 7.742269e+01 3.789315e+02 1.666108e+02 7.103023e+00 2.776461e+01 8.135681e+00 3.798481e+02 2.459751e+02 3.947373e+00 1.114197e+01 1.911032e+02 5.666689e+01 2.037184e+00
In [201]:
x_test[cont_cols] = scaler.transform(x_test[cont_cols])
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
C:\Users\aedri\AppData\Local\Programs\Python\Python37\Lib\site-packages\pandas\core\indexing.py:494: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s
In [202]:
x_valid[cont_cols] = scaler.transform(x_valid[cont_cols])

PCA

In [160]:
import prince
In [204]:
pca = prince.PCA(
    n_components = len(cont_cols),
    n_iter = 3 ,
    copy = True,
    check_input = True,
    random_state = 13
)
In [162]:
pca = pca.fit(x_train[cont_cols])
In [163]:
ax = pca.plot_row_coordinates(
    x_train[cont_cols],
    ax = None,
    figsize = (6, 6),
    x_component = 0,
    y_component = 1
)

# No .svg file will be saved for this plot as it takes up
# 1.07 GB (1,158,481,389 bytes).
#plt.savefig("Charts/PCA.svg")
plt.savefig("Charts/PCA.png")
No handles with labels found to put in legend.
In [166]:
pca_results_df = pca.column_correlations(x_train[cont_cols])
pca_results_df
Out[166]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
smart_1_raw 0.078377 0.099782 -0.801482 0.078892 -0.026641 0.190968 -0.014460 -0.105466 -0.034428 0.011094 0.037297 0.073143 0.109396 -0.024232 -0.212629 -0.134207 -0.448838 0.018779 0.026033 -0.021886
smart_3_raw -0.221081 -0.226726 0.591213 0.054313 0.047426 0.393056 -0.342183 0.205728 -0.046383 0.054827 -0.027654 -0.045865 0.022211 -0.031782 -0.071179 -0.228396 -0.391735 -0.102950 0.011255 -0.011746
smart_4_raw 0.048660 0.072129 0.154954 0.551078 -0.057755 0.284603 0.296320 -0.329400 0.110936 -0.040330 -0.032100 0.092808 0.427198 -0.057928 0.403422 0.094201 -0.053534 -0.012757 -0.002279 0.000573
smart_5_raw -0.006278 -0.001843 -0.081180 -0.078453 0.205831 0.287805 0.477535 0.287489 -0.008891 -0.001993 -0.021501 -0.728418 0.012169 0.126458 0.006132 0.060036 -0.019917 -0.002544 0.000348 -0.001090
smart_7_raw 0.029846 0.008393 -0.028766 0.033013 -0.005494 0.061966 -0.092231 0.189596 0.555230 -0.395619 0.690174 0.015728 -0.057464 0.016059 0.053744 0.028973 -0.010901 0.002984 0.000545 -0.000641
smart_9_raw 0.914468 0.216561 0.056618 -0.011368 -0.016741 -0.073360 -0.047572 0.082071 -0.022034 0.012941 -0.017792 -0.037410 -0.028725 0.009031 0.098398 -0.079393 -0.175235 0.047671 -0.125063 0.175988
smart_10_raw 0.013026 0.005765 0.095699 0.041742 -0.003344 -0.084262 0.137335 0.030474 -0.704358 0.094477 0.654839 0.039760 0.157354 -0.024904 -0.032661 -0.005514 0.006891 -0.005280 0.000779 -0.000430
smart_12_raw 0.194149 0.131716 0.141257 0.432546 0.017109 0.215855 0.269566 -0.270411 -0.037091 -0.005874 0.062788 0.090480 -0.662569 0.138942 -0.244017 -0.112547 0.033331 -0.005667 -0.000968 -0.006064
smart_187_raw -0.001460 -0.005599 -0.021650 -0.016887 0.080140 0.128587 0.422809 0.626073 -0.075976 -0.191259 -0.187202 0.572773 0.006948 -0.001307 0.001157 0.000681 -0.001755 -0.000492 -0.000030 -0.000112
smart_188_raw -0.010913 0.025422 -0.060009 0.188750 0.710788 -0.112109 -0.053136 0.009293 0.001901 -0.001210 0.000090 -0.033541 -0.101379 -0.653558 0.013578 0.016314 0.012988 -0.011107 -0.002864 0.003416
smart_190_raw -0.497725 0.827636 0.062881 0.012931 -0.026880 -0.041007 -0.055579 0.077399 -0.007290 0.007396 -0.012731 -0.028185 0.007987 0.004576 0.004503 -0.005350 -0.052993 0.007085 -0.198862 -0.089894
smart_192_raw -0.050711 -0.012418 0.359853 -0.342269 0.086477 -0.352633 0.485804 -0.321156 0.213810 -0.107886 0.022458 0.059658 0.192683 -0.060343 -0.335626 -0.100792 -0.233609 -0.008731 -0.004174 -0.000321
smart_194_raw -0.490057 0.829796 0.074658 0.030969 -0.035558 -0.076954 -0.029637 0.070888 0.001589 0.002541 -0.015770 -0.030898 -0.005057 0.009443 -0.003068 0.023080 -0.032537 -0.057000 0.190803 0.095273
smart_197_raw -0.001079 -0.002520 -0.011483 0.002369 0.006971 -0.000050 0.139321 0.155128 0.344853 0.882787 0.200694 0.132922 -0.001450 -0.006854 0.007565 -0.004703 0.004413 -0.000767 -0.000051 -0.000029
smart_199_raw 0.003781 0.006615 -0.012864 0.153402 0.698261 -0.157618 -0.162203 -0.040339 0.010049 0.012121 -0.002605 0.109544 0.162326 0.630061 -0.016127 -0.005706 -0.004494 0.000095 0.000475 -0.000275
smart_240_raw 0.881871 0.205289 -0.068262 -0.098361 -0.008913 -0.094303 0.014801 0.007841 0.007242 -0.005509 -0.010820 -0.006907 0.007021 0.008348 0.083749 -0.036939 0.044719 -0.375409 0.020113 -0.067021
smart_241_raw 0.410374 0.306128 0.049650 -0.419610 0.172929 0.409955 -0.016359 -0.160246 -0.001752 0.002760 0.044892 0.084822 0.086429 -0.037152 0.132310 -0.479829 0.215578 0.141245 0.044802 -0.034395
smart_242_raw 0.332081 0.172154 0.113607 -0.312483 0.128006 0.448299 -0.107209 -0.154212 -0.028217 0.033677 0.028729 0.142897 0.020378 -0.020922 -0.216429 0.651921 -0.025424 0.019587 0.000311 0.000631
smart_193_225 0.397954 0.103282 0.066931 0.396076 -0.119475 0.039030 -0.123916 0.181213 0.084530 -0.012458 -0.058806 -0.099335 0.386078 -0.075974 -0.587553 -0.088894 0.275922 0.030176 0.000386 -0.000895
capacity_TB -0.771975 -0.165620 -0.216108 -0.147941 0.065255 0.312972 0.041428 -0.180356 0.019113 -0.013801 0.052656 0.093126 0.080721 -0.003217 -0.128277 -0.131585 0.230846 -0.211757 -0.094550 0.110517
In [167]:
fig, ax = plt.subplots(figsize = (30, 23))

sns.heatmap(
    pca_results_df,
    ax = ax,
    annot = True,
    fmt = ".1%",
    vmin = -1, vmax = 1, center = 0,
    linewidths = 3,
    linecolor = "white",
    xticklabels = pca_results_df.columns,
    yticklabels = pca_results_df.index,
    square = True,
    cbar = True
)

plt.title("PCA Results Heatmap", fontsize = 54)
plt.savefig("Charts/PCA Heatmap.svg")
plt.savefig("Charts/PCA Heatmap.png")
In [169]:
pca_eigenvalues = pca.eigenvalues_
pca_eigenvalues
Out[169]:
[24843306.51,
 13129034.19,
 9758841.017,
 8826913.796,
 8626783.65,
 8304944.847,
 7926688.784,
 7744995.316,
 7689628.569,
 7679972.369,
 7671021.412,
 7440170.605,
 6917334.239,
 6736046.078,
 6395419.9,
 6110364.178,
 4819698.087,
 1720733.771,
 797311.7685,
 512330.9103]
In [164]:
pca.explained_inertia_
Out[164]:
[0.16168602354268868,
 0.08544681158976027,
 0.06351280968257833,
 0.05744761032724548,
 0.05614511673603315,
 0.05405051486588958,
 0.051588736334488725,
 0.05040623293370864,
 0.0500458932548495,
 0.049983048454989215,
 0.04992479354032559,
 0.04842236273611431,
 0.04501962192457163,
 0.04383975635902524,
 0.041622881877629705,
 0.039767672864061826,
 0.03136771741602234,
 0.011198936056775211,
 0.005189090643217338,
 0.003334368860027106]
In [170]:
plt.plot(np.arange(len(cont_cols)), pca_eigenvalues, 'ro-')
plt.title("PCA Scree Plot")
plt.xlabel("Principal Component")
plt.ylabel("Eigenvalue")
plt.xticks(range(0, len(cont_cols)))
plt.grid(b = True, which = 'major', color = 'w', linewidth = 1.0)
plt.grid(b = True, which = 'minor', color = 'w', linewidth = 0.5)
plt.savefig("Charts/PCA Scree Plot.svg")
plt.savefig("Charts/PCA Scree Plot.png")
In [173]:
cum_inertia = [0]

for i, e in enumerate(pca_eigenvalues):
    cum_inertia.append(sum(pca_eigenvalues[0:i+1]) / sum(pca_eigenvalues))
    
cum_inertia
Out[173]:
[0,
 0.1616860235,
 0.2471328351,
 0.3106456448,
 0.3680932551,
 0.4242383719,
 0.4782888867,
 0.5298776231,
 0.580283856,
 0.6303297493,
 0.6803127977,
 0.7302375913,
 0.778659954,
 0.8236795759,
 0.8675193323,
 0.9091422142,
 0.948909887,
 0.9802776044,
 0.9914765405,
 0.9966656311,
 1]
In [174]:
sum(pca_eigenvalues[0:13]) / sum(pca_eigenvalues)
Out[174]:
0.8236795759
In [175]:
plt.plot(range(0, len(cum_inertia)), cum_inertia)
plt.title("Inertia by Principal Components Kept")
plt.xlabel("Number of Principal Components")
plt.ylabel("Inertia")
plt.xticks(range(0, len(cum_inertia)))
plt.grid(b=True, which = 'major', color = 'w', linewidth = 1.0)
plt.grid(b=True, which = 'minor', color = 'w', linewidth = 0.5)
plt.savefig("Charts/PCA Inertia Plot.svg")
plt.savefig("Charts/PCA Inertia Plot.png")

The eigenvalues and explained inertia were used to create a scree plot, and this plot is then used alongside the cumulative inertia to determine that 13 principal components an appropriate amount of dimensionality reduction to use as these components made up 82.37% of the inertia of the dataset in only 13 out of the 20, or 65%, of the total components.

PCA as a form of dimensionality reduction ensures that as little information, in the form of inertia, is lost as possible for the given number of dimensions reduced. As this dataset is quite large, any amount of dimensionality reduction will greatly affect the speed and chance of proper convergence in the predictive models to come. The result is reducing of the data by 35% while only losing 17.63% of the information, a 2-for-1 trade.

In [207]:
pca = prince.PCA(
    n_components = 13,
    n_iter = 3,
    copy = True,
    check_input = True,
    random_state = 13
)
In [208]:
pca = pca.fit(x_train[cont_cols])
In [209]:
pca.explained_inertia_
Out[209]:
[0.16168602354268732,
 0.08544681158975947,
 0.0635128096825781,
 0.0574476103272455,
 0.056145116736033195,
 0.05405051486588986,
 0.051588736334489044,
 0.05040623293370882,
 0.05004589325484982,
 0.04998304845498967,
 0.04992479354032566,
 0.04842236273611464,
 0.045019621924571526]

Training set PCA transformation

In [211]:
pca_df = pca.transform(x_train[cont_cols])
In [212]:
pca_df = pca_df.add_prefix('pca_component_')
pca_df
Out[212]:
pca_component_0 pca_component_1 pca_component_2 pca_component_3 pca_component_4 pca_component_5 pca_component_6 pca_component_7 pca_component_8 pca_component_9 pca_component_10 pca_component_11 pca_component_12
3450686 2.588802 -0.256104 0.363015 -0.194650 -0.027685 -0.374874 -0.079665 0.163281 -0.092533 0.037941 -0.097018 -0.063204 -0.364961
7599474 4.400617 2.545297 3.722155 9.092851 -1.929462 2.098228 0.360051 0.414306 1.651123 -0.335092 -1.139049 -0.840110 6.994877
3323950 1.016396 1.945011 -0.376139 0.035172 0.042050 0.652551 0.011721 -0.268730 -0.061039 0.005665 0.088317 0.159699 -0.360805
4727599 2.440267 1.156597 -0.146680 0.295595 -0.118076 -0.055855 -0.023809 0.021113 -0.125542 0.048756 -0.049074 0.016417 -0.580271
8582613 -0.869126 2.660505 -0.760537 -0.089982 -0.177652 -0.033002 -0.394537 0.278956 -0.085833 0.050159 -0.045416 -0.020975 0.339985
... ... ... ... ... ... ... ... ... ... ... ... ... ...
3241991 -1.011622 2.492260 0.803247 0.086082 -0.155075 -0.444901 -0.163175 0.383642 -0.010255 0.005194 -0.093564 -0.167842 -0.274331
9697415 1.091494 1.392319 -1.121297 -0.437001 0.016222 0.545738 -0.344713 -0.012948 -0.044460 0.022871 0.043195 0.104584 0.589097
1525039 3.254400 -0.734202 -1.408422 0.272469 -0.208944 -0.032483 -0.251709 0.138698 -0.075531 0.041390 -0.069626 -0.010925 0.452372
8962312 -1.078743 -0.454942 0.457822 -0.630959 0.110766 -0.152421 0.232372 -0.239497 0.143571 -0.083654 0.008128 0.086702 0.205345
9385528 -2.571192 0.481604 -0.905770 0.892347 -0.368219 -0.440418 0.128963 0.053498 -0.015385 -0.031840 -0.004730 -0.035442 -0.150307

7682577 rows × 13 columns

In [186]:
pca_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7682577 entries, 3450686 to 9385528
Data columns (total 13 columns):
pca_component_0     float64
pca_component_1     float64
pca_component_2     float64
pca_component_3     float64
pca_component_4     float64
pca_component_5     float64
pca_component_6     float64
pca_component_7     float64
pca_component_8     float64
pca_component_9     float64
pca_component_10    float64
pca_component_11    float64
pca_component_12    float64
dtypes: float64(13)
memory usage: 820.6 MB
In [213]:
# Replace the columns that factored in the PCA with
# the reduced-dimension PCA results.
x_train.drop(cont_cols, axis = 1, inplace = True)
In [214]:
x_train = x_train.join(pca_df)
x_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7682577 entries, 3450686 to 9385528
Data columns (total 25 columns):
manufacturer        category
smart_191_cat       category
smart_184_cat       category
smart_200_cat       category
smart_196_cat       category
smart_8_cat         category
smart_2_cat         category
smart_223_cat       category
smart_220_cat       category
smart_222_cat       category
smart_226_cat       category
smart_11_cat        category
pca_component_0     float64
pca_component_1     float64
pca_component_2     float64
pca_component_3     float64
pca_component_4     float64
pca_component_5     float64
pca_component_6     float64
pca_component_7     float64
pca_component_8     float64
pca_component_9     float64
pca_component_10    float64
pca_component_11    float64
pca_component_12    float64
dtypes: category(12), float64(13)
memory usage: 1.2 GB
In [189]:
x_train.head()
Out[189]:
manufacturer smart_191_cat smart_184_cat smart_200_cat smart_196_cat smart_8_cat smart_2_cat smart_223_cat smart_220_cat smart_222_cat ... pca_component_3 pca_component_4 pca_component_5 pca_component_6 pca_component_7 pca_component_8 pca_component_9 pca_component_10 pca_component_11 pca_component_12
3450686 Seagate 1 0 0 0 0 0 0 0 0 ... 3.866352e-11 4.344533e-10 1.634889e-08 2.984379e-09 3.583231e-11 5.123224e-09 1.308675e-08 5.061791e-08 2.452409e-08 5.690374e-08
7599474 Western Digital 1 0 1 1 0 0 0 0 0 ... 1.364831e-09 4.317408e-08 5.819838e-07 1.480315e-07 1.752998e-15 3.629931e-07 6.217520e-08 7.128673e-08 6.204078e-07 1.484787e-06
3323950 Seagate 2 0 0 0 0 0 0 0 0 ... 2.010200e-15 2.823538e-10 3.050107e-08 2.883427e-09 2.322707e-10 1.022783e-09 4.633041e-09 6.725898e-08 6.051673e-09 2.219027e-08
4727599 Seagate 1 0 0 0 0 0 0 0 0 ... 1.077629e-10 6.118397e-10 1.040523e-08 6.354446e-09 6.307566e-13 2.605017e-09 1.071095e-08 1.458014e-09 2.410012e-08 3.769567e-08
8582613 Seagate 1 0 0 0 0 0 0 0 0 ... 1.435735e-10 5.794786e-10 2.621771e-08 4.247093e-09 8.370350e-10 3.828430e-08 5.115769e-08 3.508412e-07 6.707833e-10 2.986888e-08

5 rows × 25 columns

Test set PCA transformation

In [215]:
pca_df = pca.transform(x_test[cont_cols])
In [216]:
pca_df = pca_df.add_prefix('pca_component_')
pca_df
Out[216]:
pca_component_0 pca_component_1 pca_component_2 pca_component_3 pca_component_4 pca_component_5 pca_component_6 pca_component_7 pca_component_8 pca_component_9 pca_component_10 pca_component_11 pca_component_12
3013725 0.435791 -0.684580 -0.283102 0.075296 -0.039540 0.113908 -0.040595 -0.065597 -0.034287 0.004561 -0.002883 0.057713 -0.023175
9172422 0.263702 -0.277320 0.387045 -0.015873 -0.038166 -0.178170 -0.018477 0.074526 -0.003965 -0.009490 -0.038771 -0.033708 -0.182556
743251 2.445029 -0.802264 -1.245554 0.046790 -0.151859 -0.262728 -0.136119 0.086480 -0.128149 0.041046 -0.042140 0.006269 0.096354
3621925 3.280192 -0.808463 0.640664 -0.465401 0.037590 -0.279781 -0.217491 0.253497 -0.032683 0.032800 -0.130896 -0.098988 0.032410
111857 2.787889 -2.080358 -1.217044 0.045168 -0.112905 -0.224616 -0.064388 -0.035409 -0.133254 0.040308 -0.025953 0.069686 -0.006195
... ... ... ... ... ... ... ... ... ... ... ... ... ...
843336 0.181057 0.657489 1.166916 -0.040047 -0.080168 -0.652888 -0.104184 0.389101 -0.012324 0.013322 -0.140614 -0.198347 -0.312721
31386 -2.001845 -0.630380 -0.412659 0.206235 -0.157499 -0.283062 0.032125 0.039428 0.024065 -0.039794 0.000278 -0.018526 0.129898
3968575 -0.748217 -0.424364 -0.575875 -0.404148 0.161604 0.676065 0.116033 -0.505856 -0.013727 -0.020468 0.122741 0.284857 0.033014
9172105 0.832329 2.323726 0.039687 -0.420184 0.030308 0.301889 -0.282565 0.119185 -0.044276 0.035200 -0.046526 -0.004071 0.193693
8124260 -1.378419 -0.486651 -1.231650 0.368690 -0.085083 0.380618 0.188240 -0.423705 -0.057614 -0.016030 0.108468 0.229276 -0.155963

2195023 rows × 13 columns

In [217]:
# Replace the columns that factored in the PCA with 
# the reduced-dimension PCA results.
x_test.drop(cont_cols, axis = 1, inplace = True)
C:\Users\aedri\AppData\Local\Programs\Python\Python37\Lib\site-packages\pandas\core\frame.py:4102: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
In [218]:
x_test = x_test.join(pca_df)
x_test.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2195023 entries, 3013725 to 8124260
Data columns (total 25 columns):
manufacturer        category
smart_191_cat       category
smart_184_cat       category
smart_200_cat       category
smart_196_cat       category
smart_8_cat         category
smart_2_cat         category
smart_223_cat       category
smart_220_cat       category
smart_222_cat       category
smart_226_cat       category
smart_11_cat        category
pca_component_0     float64
pca_component_1     float64
pca_component_2     float64
pca_component_3     float64
pca_component_4     float64
pca_component_5     float64
pca_component_6     float64
pca_component_7     float64
pca_component_8     float64
pca_component_9     float64
pca_component_10    float64
pca_component_11    float64
pca_component_12    float64
dtypes: category(12), float64(13)
memory usage: 339.6 MB

Validation set PCA transformation

In [ ]:
pca_df = pca.transform(x_valid[cont_cols])
pca_df = pca_df.add_prefix('pca_component_')

# Replace the columns that factored in the PCA with 
# the reduced-dimension PCA results.
x_valid.drop(cont_cols, axis = 1, inplace = True)
x_valid = x_valid.join(pca_df)
In [220]:
pca_df = pca.transform(x_valid[cont_cols])
In [221]:
pca_df = pca_df.add_prefix('pca_component_')
pca_df
Out[221]:
pca_component_0 pca_component_1 pca_component_2 pca_component_3 pca_component_4 pca_component_5 pca_component_6 pca_component_7 pca_component_8 pca_component_9 pca_component_10 pca_component_11 pca_component_12
3193456 1.271755 -0.775623 -0.623198 0.296651 0.027323 0.591244 0.176000 -0.441434 -0.065563 0.001002 0.083200 0.218362 -0.388535
7003193 -0.685598 -1.210003 -1.726288 0.055368 0.033821 0.742198 0.185452 -0.637200 -0.060003 -0.016007 0.164665 0.353169 0.098202
10701024 -1.255734 -0.729666 -1.092088 0.295359 -0.069536 0.307147 0.197465 -0.395259 -0.039760 -0.024364 0.097766 0.202955 -0.117044
389431 -0.636825 0.245152 -0.790628 -0.498172 0.192483 0.831852 0.114385 -0.569126 -0.035362 -0.011972 0.137146 0.316239 0.065401
4658269 -2.051428 -1.250410 -1.459096 0.951363 -0.366477 -0.393068 0.188514 -0.074730 -0.021462 -0.039309 0.024574 0.024430 -0.022981
... ... ... ... ... ... ... ... ... ... ... ... ... ...
3727837 1.785107 2.579171 -1.057921 0.130318 -0.227866 -0.145953 -0.319911 0.273555 -0.147570 0.071437 -0.073102 -0.047995 0.064569
69531 0.198868 1.223442 -0.966392 -0.124466 -0.090448 0.253455 -0.307763 0.082173 -0.063683 0.035205 -0.011270 0.048517 0.447177
3740929 -1.186275 3.395984 -1.217923 0.144728 -0.255995 -0.006366 -0.377732 0.262114 -0.109902 0.054291 -0.029303 -0.011575 0.316286
4836536 -2.306185 0.722970 -1.358860 0.315354 -0.222531 -0.032178 -0.043685 -0.041335 -0.044087 -0.007520 0.032242 0.066171 0.262547
773013 -0.715691 -1.095640 0.316101 -0.270242 0.131994 0.250370 0.217861 -0.346740 0.035531 -0.046744 0.062864 0.147693 -0.224468

1097511 rows × 13 columns

In [222]:
# Replace the columns that factored in the PCA with 
# the reduced-dimension PCA results.
x_valid.drop(cont_cols, axis = 1, inplace = True)
In [223]:
x_valid = x_valid.join(pca_df)
x_valid.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1097511 entries, 3193456 to 773013
Data columns (total 25 columns):
manufacturer        1097511 non-null category
smart_191_cat       1097511 non-null category
smart_184_cat       1097511 non-null category
smart_200_cat       1097511 non-null category
smart_196_cat       1097511 non-null category
smart_8_cat         1097511 non-null category
smart_2_cat         1097511 non-null category
smart_223_cat       1097511 non-null category
smart_220_cat       1097511 non-null category
smart_222_cat       1097511 non-null category
smart_226_cat       1097511 non-null category
smart_11_cat        1097511 non-null category
pca_component_0     1097511 non-null float64
pca_component_1     1097511 non-null float64
pca_component_2     1097511 non-null float64
pca_component_3     1097511 non-null float64
pca_component_4     1097511 non-null float64
pca_component_5     1097511 non-null float64
pca_component_6     1097511 non-null float64
pca_component_7     1097511 non-null float64
pca_component_8     1097511 non-null float64
pca_component_9     1097511 non-null float64
pca_component_10    1097511 non-null float64
pca_component_11    1097511 non-null float64
pca_component_12    1097511 non-null float64
dtypes: category(12), float64(13)
memory usage: 169.8 MB
In [38]:
if not os.path.isfile('pca_x_train.csv'):
    x_train.to_csv('pca_x_train.csv', index = False)
if not os.path.isfile('y_train.csv'):
    y_train.to_csv('y_train.csv', index = False, header = True)
    
if not os.path.isfile('pca_x_test.csv'):
    x_test.to_csv('pca_x_test.csv', index = False)
if not os.path.isfile('y_test.csv'):
    y_test.to_csv('y_test.csv', index = False, header = True)
    
if not os.path.isfile('pca_x_valid.csv'):
    x_valid.to_csv('pca_x_valid.csv', index = False)
if not os.path.isfile('y_valid.csv'):
    y_valid.to_csv('y_valid.csv', index = False, header = True)
In [4]:
reload_pca = False
if reload_pca:
    x_train = pd.read_csv('pca_x_train.csv')
    y_train = pd.read_csv('pca_y_train.csv')
    x_test = pd.read_csv('pca_x_test.csv')
    y_test = pd.read_csv('pca_y_test.csv')
    x_valid = pd.read_csv('pca_x_valid.csv')
    y_valid = pd.read_csv('pca_y_valid.csv')
    
    n_rows = len(df)

    
    #df['manufacturer'] = df['manufacturer'].astype('category')
    #df['smart_191_cat'] = df['smart_191_cat'].astype('category')
    #df['smart_184_cat'] = df['smart_184_cat'].astype('category')
    #df['smart_200_cat'] = df['smart_200_cat'].astype('category')
    #df['smart_196_cat'] = df['smart_196_cat'].astype('category')
    #df['smart_8_cat'] = df['smart_8_cat'].astype('category')
    #df['smart_2_cat'] = df['smart_2_cat'].astype('category')
    #df['smart_223_cat'] = df['smart_223_cat'].astype('category')
    #df['smart_220_cat'] = df['smart_220_cat'].astype('category')
    #df['smart_222_cat'] = df['smart_222_cat'].astype('category')
    #df['smart_226_cat'] = df['smart_226_cat'].astype('category')
    #df['smart_11_cat'] = df['smart_11_cat'].astype('category')

MCA

To begin doing MCA, the categorical columns need converted to boolean encoding columns.

In [225]:
cat_cols = [
    'manufacturer',  'smart_191_cat', 'smart_184_cat',
    'smart_200_cat', 'smart_196_cat', 'smart_8_cat',
    'smart_2_cat', 'smart_223_cat', 'smart_220_cat',
    'smart_222_cat', 'smart_226_cat', 'smart_11_cat'
]
In [226]:
x_train_cat = pd.get_dummies(x_train[cat_cols], \
                             columns = cat_cols, dtype = bool)
x_test_cat = pd.get_dummies(x_test[cat_cols], \
                            columns = cat_cols, dtype = bool)
x_valid_cat = pd.get_dummies(x_valid[cat_cols], \
                             columns = cat_cols, dtype = bool)
In [227]:
x_train_cat.columns
Out[227]:
Index(['manufacturer_HGST', 'manufacturer_Seagate', 'manufacturer_Toshiba',
       'manufacturer_Western Digital', 'smart_191_cat_0', 'smart_191_cat_1',
       'smart_191_cat_2', 'smart_184_cat_0', 'smart_184_cat_1',
       'smart_200_cat_0', 'smart_200_cat_1', 'smart_200_cat_2',
       'smart_196_cat_0', 'smart_196_cat_1', 'smart_196_cat_2',
       'smart_8_cat_0', 'smart_8_cat_1', 'smart_8_cat_2', 'smart_2_cat_0',
       'smart_2_cat_1', 'smart_2_cat_2', 'smart_223_cat_0', 'smart_223_cat_1',
       'smart_223_cat_2', 'smart_220_cat_0', 'smart_220_cat_1',
       'smart_220_cat_2', 'smart_222_cat_0', 'smart_222_cat_1',
       'smart_222_cat_2', 'smart_226_cat_0', 'smart_226_cat_1',
       'smart_226_cat_2', 'smart_11_cat_0', 'smart_11_cat_1',
       'smart_11_cat_2'],
      dtype='object')
In [7]:
cat_df.dtypes
Out[7]:
manufacturer_HGST               bool
manufacturer_Seagate            bool
manufacturer_Toshiba            bool
manufacturer_Western Digital    bool
smart_191_cat_0                 bool
smart_191_cat_1                 bool
smart_191_cat_2                 bool
smart_184_cat_0                 bool
smart_184_cat_1                 bool
smart_200_cat_0                 bool
smart_200_cat_1                 bool
smart_200_cat_2                 bool
smart_196_cat_0                 bool
smart_196_cat_1                 bool
smart_196_cat_2                 bool
smart_8_cat_0                   bool
smart_8_cat_1                   bool
smart_8_cat_2                   bool
smart_2_cat_0                   bool
smart_2_cat_1                   bool
smart_2_cat_2                   bool
smart_223_cat_0                 bool
smart_223_cat_1                 bool
smart_223_cat_2                 bool
smart_220_cat_0                 bool
smart_220_cat_1                 bool
smart_220_cat_2                 bool
smart_222_cat_0                 bool
smart_222_cat_1                 bool
smart_222_cat_2                 bool
smart_226_cat_0                 bool
smart_226_cat_1                 bool
smart_226_cat_2                 bool
smart_11_cat_0                  bool
smart_11_cat_1                  bool
smart_11_cat_2                  bool
dtype: object
In [8]:
cat_df.memory_usage().sum()
Out[8]:
395104124
In [28]:
mca = prince.MCA(
    n_components = 13,
    n_iter = 3,
    copy = True,
    random_state = 13
)
In [29]:
mca = mca.fit(cat_df)
C:\Users\aedri\AppData\Local\Programs\Python\Python37\Lib\site-packages\prince\one_hot.py:35: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  default_fill_value=0
C:\Users\aedri\AppData\Local\Programs\Python\Python37\Lib\site-packages\pandas\core\sparse\frame.py:257: FutureWarning: SparseSeries is deprecated and will be removed in a future version.
Use a Series with sparse values instead.

    >>> series = pd.Series(pd.SparseArray(...))

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  sparse_index=BlockIndex(N, blocs, blens),
C:\Users\aedri\AppData\Local\Programs\Python\Python37\Lib\site-packages\pandas\core\frame.py:3456: FutureWarning: SparseSeries is deprecated and will be removed in a future version.
Use a Series with sparse values instead.

    >>> series = pd.Series(pd.SparseArray(...))

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  return klass(values, index=self.index, name=items, fastpath=True)
C:\Users\aedri\AppData\Local\Programs\Python\Python37\Lib\site-packages\pandas\core\ops\__init__.py:1641: FutureWarning: SparseSeries is deprecated and will be removed in a future version.
Use a Series with sparse values instead.

    >>> series = pd.Series(pd.SparseArray(...))

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  return self._constructor(new_values, index=self.index, name=self.name)
C:\Users\aedri\AppData\Local\Programs\Python\Python37\Lib\site-packages\pandas\core\sparse\frame.py:339: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  default_fill_value=self.default_fill_value,
C:\Users\aedri\AppData\Local\Programs\Python\Python37\Lib\site-packages\pandas\core\generic.py:6289: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  return self._constructor(new_data).__finalize__(self)
C:\Users\aedri\AppData\Local\Programs\Python\Python37\Lib\site-packages\pandas\core\generic.py:5884: FutureWarning: SparseSeries is deprecated and will be removed in a future version.
Use a Series with sparse values instead.

    >>> series = pd.Series(pd.SparseArray(...))

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  return self._constructor(new_data).__finalize__(self)
In [30]:
ax = mca.plot_coordinates(
    X = cat_df,
    ax = None,
    figsize=(20, 20),
    show_row_points = True,
    row_points_size = 10,
    show_row_labels = False,
    show_column_points = True,
    column_points_size = 30,
    show_column_labels = False,
    legend_n_cols = 1
)

plt.savefig('Charts/MCA With Rows.png')
C:\Users\aedri\AppData\Local\Programs\Python\Python37\Lib\site-packages\prince\one_hot.py:35: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  default_fill_value=0
C:\Users\aedri\AppData\Local\Programs\Python\Python37\Lib\site-packages\pandas\core\sparse\frame.py:257: FutureWarning: SparseSeries is deprecated and will be removed in a future version.
Use a Series with sparse values instead.

    >>> series = pd.Series(pd.SparseArray(...))

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  sparse_index=BlockIndex(N, blocs, blens),
C:\Users\aedri\AppData\Local\Programs\Python\Python37\Lib\site-packages\pandas\core\frame.py:3456: FutureWarning: SparseSeries is deprecated and will be removed in a future version.
Use a Series with sparse values instead.

    >>> series = pd.Series(pd.SparseArray(...))

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  return klass(values, index=self.index, name=items, fastpath=True)
C:\Users\aedri\AppData\Local\Programs\Python\Python37\Lib\site-packages\prince\one_hot.py:35: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  default_fill_value=0
C:\Users\aedri\AppData\Local\Programs\Python\Python37\Lib\site-packages\pandas\core\sparse\frame.py:257: FutureWarning: SparseSeries is deprecated and will be removed in a future version.
Use a Series with sparse values instead.

    >>> series = pd.Series(pd.SparseArray(...))

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  sparse_index=BlockIndex(N, blocs, blens),
C:\Users\aedri\AppData\Local\Programs\Python\Python37\Lib\site-packages\pandas\core\frame.py:3456: FutureWarning: SparseSeries is deprecated and will be removed in a future version.
Use a Series with sparse values instead.

    >>> series = pd.Series(pd.SparseArray(...))

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  return klass(values, index=self.index, name=items, fastpath=True)
In [31]:
mca_eigenvalues = mca.eigenvalues_
mca_eigenvalues
Out[31]:
[0.278968505,
 0.1813786161,
 0.1017688289,
 0.07509976733,
 0.05613721872,
 0.05382373409,
 0.04912671174,
 0.04781589597,
 0.03590523611,
 0.0300406049,
 0.0262677525,
 0.01661171385,
 0.0161819015]
In [33]:
plt.plot(np.arange(len(mca_eigenvalues)), mca_eigenvalues, 'ro-')
plt.title("Scree Plot")
plt.xlabel("Principal Component")
plt.ylabel("Eigenvalue")
plt.show()
In [34]:
ax = mca.plot_coordinates(
    X = cat_df,
    ax = None,
    figsize = (20, 20),
    show_row_points = False,
    show_row_labels = False,
    show_column_points = True,
    column_points_size = 30,
    show_column_labels = True,
    legend_n_cols = 3
)

plt.savefig('Charts/MCA.svg')
plt.savefig('Charts/MCA.png')
C:\Users\aedri\AppData\Local\Programs\Python\Python37\Lib\site-packages\prince\one_hot.py:35: FutureWarning: SparseDataFrame is deprecated and will be removed in a future version.
Use a regular DataFrame whose columns are SparseArrays instead.

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  default_fill_value=0
C:\Users\aedri\AppData\Local\Programs\Python\Python37\Lib\site-packages\pandas\core\sparse\frame.py:257: FutureWarning: SparseSeries is deprecated and will be removed in a future version.
Use a Series with sparse values instead.

    >>> series = pd.Series(pd.SparseArray(...))

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  sparse_index=BlockIndex(N, blocs, blens),
C:\Users\aedri\AppData\Local\Programs\Python\Python37\Lib\site-packages\pandas\core\frame.py:3456: FutureWarning: SparseSeries is deprecated and will be removed in a future version.
Use a Series with sparse values instead.

    >>> series = pd.Series(pd.SparseArray(...))

See http://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating for more.

  return klass(values, index=self.index, name=items, fastpath=True)
In [230]:
# Replace the categorical columns with their encoded representation columns.
x_train.drop(cat_cols, axis = 1, inplace = True)
x_train = x_train.join(x_train_cat)
x_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7682577 entries, 3450686 to 9385528
Data columns (total 49 columns):
pca_component_0                 float64
pca_component_1                 float64
pca_component_2                 float64
pca_component_3                 float64
pca_component_4                 float64
pca_component_5                 float64
pca_component_6                 float64
pca_component_7                 float64
pca_component_8                 float64
pca_component_9                 float64
pca_component_10                float64
pca_component_11                float64
pca_component_12                float64
manufacturer_HGST               bool
manufacturer_Seagate            bool
manufacturer_Toshiba            bool
manufacturer_Western Digital    bool
smart_191_cat_0                 bool
smart_191_cat_1                 bool
smart_191_cat_2                 bool
smart_184_cat_0                 bool
smart_184_cat_1                 bool
smart_200_cat_0                 bool
smart_200_cat_1                 bool
smart_200_cat_2                 bool
smart_196_cat_0                 bool
smart_196_cat_1                 bool
smart_196_cat_2                 bool
smart_8_cat_0                   bool
smart_8_cat_1                   bool
smart_8_cat_2                   bool
smart_2_cat_0                   bool
smart_2_cat_1                   bool
smart_2_cat_2                   bool
smart_223_cat_0                 bool
smart_223_cat_1                 bool
smart_223_cat_2                 bool
smart_220_cat_0                 bool
smart_220_cat_1                 bool
smart_220_cat_2                 bool
smart_222_cat_0                 bool
smart_222_cat_1                 bool
smart_222_cat_2                 bool
smart_226_cat_0                 bool
smart_226_cat_1                 bool
smart_226_cat_2                 bool
smart_11_cat_0                  bool
smart_11_cat_1                  bool
smart_11_cat_2                  bool
dtypes: bool(36), float64(13)
memory usage: 1.4 GB
In [231]:
# Replace the categorical columns with their encoded representation columns.
x_test.drop(cat_cols, axis = 1, inplace = True)
x_test = x_test.join(x_test_cat)
In [232]:
# Replace the categorical columns with their encoded representation columns.
x_valid.drop(cat_cols, axis = 1, inplace = True)
x_valid = x_valid.join(x_valid_cat)
In [233]:
if not os.path.isfile('cat_x_train.csv'):
    x_train.to_csv('cat_x_train.csv', index = False)
    
if not os.path.isfile('cat_x_test.csv'):
    x_test.to_csv('cat_x_test.csv', index = False)
    
if not os.path.isfile('cat_x_valid.csv'):
    x_valid.to_csv('cat_x_valid.csv', index = False)
In [3]:
reload_cat = True
if reload_cat:
    x_train = pd.read_csv('cat_x_train.csv')
    y_train = pd.read_csv('y_train.csv')
    x_test = pd.read_csv('cat_x_test.csv')
    y_test = pd.read_csv('y_test.csv')
    x_valid = pd.read_csv('cat_x_valid.csv')
    y_valid = pd.read_csv('y_valid.csv')
    n_rows = len(x_train)

While Factor Analysis of Mixed Data (FAMD) would have been ideal for dimensionality reduction, the current hardware requirements and software availability do not allow for it with such a large dataset.

Why SMOTE (Synthetic Minority Oversampling Technique) is Needed

Traditional training fails as hard drive failure is an extremely rare occurence. The model learns to only predict non-failure, making it useless for actually predicting failure. This is why a combination of undersampling the non-failures and oversampling the failures will improve the training and production of the predictive models.

In [28]:
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, \
                            classification_report, roc_curve, auc
In [7]:
regression = LogisticRegression(solver = 'sag', n_jobs = -1)
regression.fit(x_train, y_train.values.ravel())
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\sklearn\linear_model\_sag.py:330: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
  "the coef_ did not converge", ConvergenceWarning)
Out[7]:
LogisticRegression(n_jobs=-1, solver='sag')
In [8]:
regression.intercept_
Out[8]:
array([-1.09223837])
In [9]:
regression.coef_
Out[9]:
array([[ 1.18621790e-01,  1.92844770e-01, -4.87481695e-01,
        -2.14837069e-02,  2.54997364e-03, -1.07725167e-01,
         2.24330184e-01, -1.15211496e-01,  3.71445683e-02,
        -1.62009855e-02,  2.17645436e-02,  1.05120126e-04,
         1.57435397e-02, -4.60059511e-01, -4.92139451e-01,
        -1.40594384e-01,  3.81862959e-03, -5.73064084e-01,
        -3.69431658e-01, -1.46478974e-01, -1.09739378e+00,
         8.41906555e-03, -9.57929442e-01, -1.32230176e-01,
         1.18490140e-03, -4.92988130e-01, -6.07767236e-01,
         1.17806493e-02, -4.89169500e-01, -3.63043862e-01,
        -2.36761353e-01, -4.89169500e-01, -2.62213597e-01,
        -3.37591619e-01, -9.17877465e-01, -1.73561541e-01,
         2.46428982e-03, -9.48460309e-01, -7.29937015e-02,
        -6.75207055e-02, -9.48380332e-01, -6.88172542e-02,
        -7.17771298e-02, -9.48380332e-01, -4.38120894e-02,
        -9.67822946e-02, -1.09364202e+00, -9.17299916e-04,
         5.58460842e-03]])
In [11]:
coefs = pd.concat([pd.DataFrame(x_train.columns),
                   pd.DataFrame(np.transpose(regression.coef_))], axis = 1)

coefs.columns = ["Column", "Coefficient"]
coefs
Out[11]:
Column Coefficient
0 pca_component_0 0.118622
1 pca_component_1 0.192845
2 pca_component_2 -0.487482
3 pca_component_3 -0.021484
4 pca_component_4 0.002550
5 pca_component_5 -0.107725
6 pca_component_6 0.224330
7 pca_component_7 -0.115211
8 pca_component_8 0.037145
9 pca_component_9 -0.016201
10 pca_component_10 0.021765
11 pca_component_11 0.000105
12 pca_component_12 0.015744
13 manufacturer_HGST -0.460060
14 manufacturer_Seagate -0.492139
15 manufacturer_Toshiba -0.140594
16 manufacturer_Western Digital 0.003819
17 smart_191_cat_0 -0.573064
18 smart_191_cat_1 -0.369432
19 smart_191_cat_2 -0.146479
20 smart_184_cat_0 -1.097394
21 smart_184_cat_1 0.008419
22 smart_200_cat_0 -0.957929
23 smart_200_cat_1 -0.132230
24 smart_200_cat_2 0.001185
25 smart_196_cat_0 -0.492988
26 smart_196_cat_1 -0.607767
27 smart_196_cat_2 0.011781
28 smart_8_cat_0 -0.489170
29 smart_8_cat_1 -0.363044
30 smart_8_cat_2 -0.236761
31 smart_2_cat_0 -0.489170
32 smart_2_cat_1 -0.262214
33 smart_2_cat_2 -0.337592
34 smart_223_cat_0 -0.917877
35 smart_223_cat_1 -0.173562
36 smart_223_cat_2 0.002464
37 smart_220_cat_0 -0.948460
38 smart_220_cat_1 -0.072994
39 smart_220_cat_2 -0.067521
40 smart_222_cat_0 -0.948380
41 smart_222_cat_1 -0.068817
42 smart_222_cat_2 -0.071777
43 smart_226_cat_0 -0.948380
44 smart_226_cat_1 -0.043812
45 smart_226_cat_2 -0.096782
46 smart_11_cat_0 -1.093642
47 smart_11_cat_1 -0.000917
48 smart_11_cat_2 0.005585
In [12]:
coefs.where(coefs['Coefficient'] > 0).sort_values(['Coefficient'], \
                                                  ascending = False).dropna()
Out[12]:
Column Coefficient
6 pca_component_6 0.224330
1 pca_component_1 0.192845
0 pca_component_0 0.118622
8 pca_component_8 0.037145
10 pca_component_10 0.021765
12 pca_component_12 0.015744
27 smart_196_cat_2 0.011781
21 smart_184_cat_1 0.008419
48 smart_11_cat_2 0.005585
16 manufacturer_Western Digital 0.003819
4 pca_component_4 0.002550
36 smart_223_cat_2 0.002464
24 smart_200_cat_2 0.001185
11 pca_component_11 0.000105
In [13]:
coefs.where(coefs['Coefficient'] < 0).sort_values(['Coefficient']).dropna()
Out[13]:
Column Coefficient
20 smart_184_cat_0 -1.097394
46 smart_11_cat_0 -1.093642
22 smart_200_cat_0 -0.957929
37 smart_220_cat_0 -0.948460
43 smart_226_cat_0 -0.948380
40 smart_222_cat_0 -0.948380
34 smart_223_cat_0 -0.917877
26 smart_196_cat_1 -0.607767
17 smart_191_cat_0 -0.573064
25 smart_196_cat_0 -0.492988
14 manufacturer_Seagate -0.492139
31 smart_2_cat_0 -0.489170
28 smart_8_cat_0 -0.489170
2 pca_component_2 -0.487482
13 manufacturer_HGST -0.460060
18 smart_191_cat_1 -0.369432
29 smart_8_cat_1 -0.363044
33 smart_2_cat_2 -0.337592
32 smart_2_cat_1 -0.262214
30 smart_8_cat_2 -0.236761
35 smart_223_cat_1 -0.173562
19 smart_191_cat_2 -0.146479
15 manufacturer_Toshiba -0.140594
23 smart_200_cat_1 -0.132230
7 pca_component_7 -0.115211
5 pca_component_5 -0.107725
45 smart_226_cat_2 -0.096782
38 smart_220_cat_1 -0.072994
42 smart_222_cat_2 -0.071777
41 smart_222_cat_1 -0.068817
39 smart_220_cat_2 -0.067521
44 smart_226_cat_1 -0.043812
3 pca_component_3 -0.021484
9 pca_component_9 -0.016201
47 smart_11_cat_1 -0.000917
In [14]:
accuracy = regression.score(x_test, y_test)
accuracy
Out[14]:
0.9999375860754078
In [15]:
predictions = regression.predict(x_test)
actual = y_test
In [16]:
confusion = confusion_matrix(actual, predictions)
confusion
Out[16]:
array([[2194886,       1],
       [    136,       0]], dtype=int64)
In [17]:
precision = precision_score(actual, predictions)
precision
Out[17]:
0.0
In [18]:
print(classification_report(actual, predictions))
              precision    recall  f1-score   support

       False       1.00      1.00      1.00   2194887
        True       0.00      0.00      0.00       136

    accuracy                           1.00   2195023
   macro avg       0.50      0.50      0.50   2195023
weighted avg       1.00      1.00      1.00   2195023

SMOTE

In [19]:
sm = SMOTE(random_state = 13)
In [20]:
x_train, y_train = sm.fit_resample(x_train, y_train)
In [23]:
y_train['failure'].value_counts()
Out[23]:
True     7682103
False    7682103
Name: failure, dtype: int64
In [24]:
if not os.path.isfile('smote_x_df.pkl'):
    x_train.to_pickle('smote_x_df.pkl')
In [25]:
if not os.path.isfile('smote_y_df.pkl'):
    y_train.to_pickle('smote_y_df.pkl')
In [4]:
reload_smote = True
if reload_smote:
    x_train = pd.read_pickle('smote_x_df.pkl')
    y_train = pd.read_pickle('smote_y_df.pkl')
    x_test = pd.read_csv('cat_x_test.csv')
    y_test = pd.read_csv('y_test.csv')
    x_valid = pd.read_csv('cat_x_valid.csv')
    y_valid = pd.read_csv('y_valid.csv')
    n_rows = len(x_train)

Logistic Regression With SMOTE

In [51]:
regression = LogisticRegression(solver = 'liblinear')
In [52]:
regression.fit(x_train, y_train)
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\sklearn\svm\_base.py:975: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)
Out[52]:
LogisticRegression(solver='liblinear')
In [53]:
regression_accuracy = regression.score(x_test, y_test)
regression_accuracy
Out[53]:
0.6419859837459562
In [54]:
regression_predictions = regression.predict(x_test)
actual = y_test
In [55]:
regression_confusion = confusion_matrix(actual, regression_predictions)
regression_confusion
Out[55]:
array([[1409080,  785807],
       [     42,      94]], dtype=int64)
In [56]:
regression_precision = precision_score(actual, regression_predictions)
regression_precision
Out[56]:
0.00011960794044033536
In [57]:
print(classification_report(actual, regression_predictions))
              precision    recall  f1-score   support

       False       1.00      0.64      0.78   2194887
        True       0.00      0.69      0.00       136

    accuracy                           0.64   2195023
   macro avg       0.50      0.67      0.39   2195023
weighted avg       1.00      0.64      0.78   2195023

In [58]:
regression_probabilities = regression.predict_proba(x_test)
predictions = regression_probabilities[:,1]
In [59]:
regression_false_positive_rate, regression_true_positive_rate, threshold =\
    roc_curve(y_test, predictions)
In [60]:
regression_roc_auc = auc(regression_false_positive_rate, regression_true_positive_rate)
regression_roc_auc
Out[60]:
0.7732849586066055
In [61]:
plt.title('Liblinear Logistic Regression Receiver Operating Characteristic Curve')
plt.plot(regression_false_positive_rate, regression_true_positive_rate, 'blue',
         label = 'AUC = %0.2f' % regression_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/Liblinear Logistic ROC AUC.svg')
plt.show()
In [62]:
# Save the model to disk
pickle.dump(regression, open('Models/Logistic Regression Liblinear.sav', 'wb'))
In [63]:
regression = LogisticRegression(solver = 'sag', n_jobs = -1)
In [64]:
regression.fit(x_train, y_train)
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\sklearn\linear_model\_sag.py:330: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
  "the coef_ did not converge", ConvergenceWarning)
Out[64]:
LogisticRegression(solver='sag')
In [65]:
regression_accuracy = regression.score(x_test, y_test)
regression_accuracy
Out[65]:
0.6437262844170654
In [66]:
regression_predictions = regression.predict(x_test)
actual = y_test
In [67]:
regression_confusion = confusion_matrix(actual, regression_predictions)
regression_confusion
Out[67]:
array([[1412901,  781986],
       [     43,      93]], dtype=int64)
In [68]:
regression_precision = precision_score(actual, regression_predictions)
regression_precision
Out[68]:
0.00011891381816926424
In [69]:
print(classification_report(actual, regression_predictions))
              precision    recall  f1-score   support

       False       1.00      0.64      0.78   2194887
        True       0.00      0.68      0.00       136

    accuracy                           0.64   2195023
   macro avg       0.50      0.66      0.39   2195023
weighted avg       1.00      0.64      0.78   2195023

In [70]:
regression_probabilities = regression.predict_proba(x_test)
predictions = regression_probabilities[:,1]
In [71]:
regression_false_positive_rate, regression_true_positive_rate, threshold =\
    roc_curve(y_test, predictions)
In [72]:
regression_roc_auc = auc(regression_false_positive_rate, regression_true_positive_rate)
regression_roc_auc
Out[72]:
0.7734260619446601
In [74]:
plt.title('SAG Logistic Regression Receiver Operating Characteristic Curve')
plt.plot(regression_false_positive_rate, regression_true_positive_rate, 'blue',
         label = 'AUC = %0.2f' % regression_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/Logistic SAG ROC AUC.svg')
plt.show()
In [75]:
# Save the model to disk
pickle.dump(regression, open('Models/Logistic Regression SAG.sav', 'wb'))
In [76]:
regression = LogisticRegression(solver = 'saga', n_jobs = -1)
In [77]:
regression.fit(x_train, y_train)
C:\Users\aedri\Anaconda3\envs\pytorch2\lib\site-packages\sklearn\linear_model\_sag.py:330: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
  "the coef_ did not converge", ConvergenceWarning)
Out[77]:
LogisticRegression(n_jobs=-1, solver='saga')
In [78]:
regression_accuracy = regression.score(x_test, y_test)
regression_accuracy
Out[78]:
0.6401891916394498
In [79]:
regression_predictions = regression.predict(x_test)
actual = y_test
In [80]:
regression_confusion = confusion_matrix(actual, regression_predictions)
regression_confusion
Out[80]:
array([[1405136,  789751],
       [     42,      94]], dtype=int64)
In [81]:
regression_precision = precision_score(actual, regression_predictions)
regression_precision
Out[81]:
0.00011901069197120954
In [82]:
print(classification_report(actual, regression_predictions))
              precision    recall  f1-score   support

       False       1.00      0.64      0.78   2194887
        True       0.00      0.69      0.00       136

    accuracy                           0.64   2195023
   macro avg       0.50      0.67      0.39   2195023
weighted avg       1.00      0.64      0.78   2195023

In [83]:
regression_probabilities = regression.predict_proba(x_test)
predictions = regression_probabilities[:,1]
In [84]:
regression_false_positive_rate, regression_true_positive_rate, threshold =\
    roc_curve(y_test, predictions)
In [85]:
regression_roc_auc = auc(regression_false_positive_rate, regression_true_positive_rate)
regression_roc_auc
Out[85]:
0.7773458805155158
In [86]:
plt.title('SAGA Logistic Regression Receiver Operating Characteristic Curve')
plt.plot(regression_false_positive_rate, regression_true_positive_rate, 'blue',
         label = 'AUC = %0.2f' % regression_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/Logistic SAGA ROC AUC.svg')
plt.show()
In [87]:
# Save the model to disk
pickle.dump(regression, open('Models/Logistic Regression SAGA.sav', 'wb'))

LBFGS Solver Logistic Regression

In [58]:
regression = LogisticRegression(solver = 'lbfgs', \
                                max_iter = 10000, n_jobs = 1)
In [59]:
regression.fit(x_train, y_train.values.ravel())
Out[59]:
LogisticRegression(max_iter=10000, n_jobs=1)
In [60]:
regression_accuracy = regression.score(x_test, y_test)
regression_accuracy
Out[60]:
0.9731984585127355
In [61]:
regression_predictions = regression.predict(x_test)
actual = y_test
In [62]:
regression_confusion = confusion_matrix(actual, regression_predictions)
regression_confusion
Out[62]:
array([[2136106,   58781],
       [     49,      87]], dtype=int64)
In [63]:
regression_precision = precision_score(actual, regression_predictions)
regression_precision
Out[63]:
0.0014778827206631787
In [64]:
print(classification_report(actual, regression_predictions))
              precision    recall  f1-score   support

         0.0       1.00      0.97      0.99   2194887
         1.0       0.00      0.64      0.00       136

    accuracy                           0.97   2195023
   macro avg       0.50      0.81      0.49   2195023
weighted avg       1.00      0.97      0.99   2195023

In [65]:
regression_probabilities = regression.predict_proba(x_test)
predictions = regression_probabilities[:,1]
In [66]:
regression_false_positive_rate, regression_true_positive_rate, threshold =\
    roc_curve(y_test, predictions)
In [67]:
regression_roc_auc = auc(regression_false_positive_rate, \
                         regression_true_positive_rate)
regression_roc_auc
Out[67]:
0.8729115935460594
In [68]:
plt.title('LBFGS Logistic Regression Receiver Operating Characteristic Curve')
plt.plot(regression_false_positive_rate, regression_true_positive_rate, 'blue',
         label = 'AUC = %0.2f' % regression_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/LBFGS Logistic ROC AUC.svg')
plt.savefig('Charts/LBFGS Logistic ROC AUC.png')
plt.show()
In [69]:
# Save the model to disk
pickle.dump(regression, open('Models/Logistic Regression lbfgs.sav', 'wb'))
In [70]:
coefs = pd.concat([pd.DataFrame(x_train.columns),
                   pd.DataFrame(np.transpose(regression.coef_))], axis = 1)

coefs.columns = ["Column", "Coefficient"]
coefs
Out[70]:
Column Coefficient
0 pca_component_0 0.032932
1 pca_component_1 0.003315
2 pca_component_2 -0.077006
3 pca_component_3 0.066112
4 pca_component_4 0.439588
5 pca_component_5 0.169322
6 pca_component_6 1.206256
7 pca_component_7 0.827434
8 pca_component_8 1.075113
9 pca_component_9 2.859798
10 pca_component_10 0.484139
11 pca_component_11 -0.145952
12 pca_component_12 -0.164462
13 manufacturer_HGST -6.401614
14 manufacturer_Seagate 10.498548
15 manufacturer_Toshiba -14.982792
16 manufacturer_Western Digital 1.954611
17 smart_191_cat_0 11.131975
18 smart_191_cat_1 9.553069
19 smart_191_cat_2 9.068537
20 smart_184_cat_0 -21.494625
21 smart_184_cat_1 14.234208
22 smart_200_cat_0 -0.668820
23 smart_200_cat_1 -1.631528
24 smart_200_cat_2 1.908393
25 smart_196_cat_0 15.354500
26 smart_196_cat_1 14.288892
27 smart_196_cat_2 14.950731
28 smart_8_cat_0 9.101696
29 smart_8_cat_1 8.182413
30 smart_8_cat_2 8.185803
31 smart_2_cat_0 9.101696
32 smart_2_cat_1 8.760081
33 smart_2_cat_2 24.940918
34 smart_223_cat_0 7.907068
35 smart_223_cat_1 -22.029754
36 smart_223_cat_2 -1.942933
37 smart_220_cat_0 -4.428850
38 smart_220_cat_1 15.449774
39 smart_220_cat_2 15.663388
40 smart_222_cat_0 -4.410220
41 smart_222_cat_1 11.463524
42 smart_222_cat_2 16.003702
43 smart_226_cat_0 -4.410220
44 smart_226_cat_1 15.141804
45 smart_226_cat_2 8.614702
46 smart_11_cat_0 -6.074088
47 smart_11_cat_1 3.885108
48 smart_11_cat_2 6.091707

Decision Tree

In [78]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
In [82]:
tree = DecisionTreeClassifier(max_depth = 20, splitter = 'best', \
                              random_state = 13)
In [83]:
tree.fit(x_train, y_train)
Out[83]:
DecisionTreeClassifier(max_depth=20, random_state=13)
In [84]:
tree_accuracy = tree.score(x_test, y_test)
tree_accuracy
Out[84]:
0.9690331263043713
In [85]:
tree_predictions = tree.predict(x_test)
actual = y_test
In [86]:
tree_confusion = confusion_matrix(actual, tree_predictions)
tree_confusion
Out[86]:
array([[2126990,   67897],
       [     76,      60]], dtype=int64)
In [87]:
tree_precision = precision_score(actual, tree_predictions)
tree_precision
Out[87]:
0.0008829112527039157
In [88]:
print(classification_report(actual, tree_predictions))
              precision    recall  f1-score   support

         0.0       1.00      0.97      0.98   2194887
         1.0       0.00      0.44      0.00       136

    accuracy                           0.97   2195023
   macro avg       0.50      0.71      0.49   2195023
weighted avg       1.00      0.97      0.98   2195023

In [89]:
tree_probabilities = tree.predict_proba(x_test)
predictions = tree_probabilities[:,1]
In [90]:
tree_false_positive_rate, tree_true_positive_rate, threshold =\
    roc_curve(y_test, predictions)
In [91]:
tree_roc_auc = auc(tree_false_positive_rate, tree_true_positive_rate)
tree_roc_auc
Out[91]:
0.6900202690992079
In [92]:
plt.title('Decision Tree Receiver Operating Characteristic Curve')
plt.plot(tree_false_positive_rate, tree_true_positive_rate, 'blue',
         label = 'AUC = %0.2f' % tree_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/Tree ROC AUC.svg')
plt.savefig('Charts/Tree ROC AUC.png')
plt.show()
In [97]:
fig, ax = plt.subplots(figsize=(40, 20))
plot_tree(tree, fontsize = 6, max_depth = 3, class_names = True, \
          feature_names = x_train.columns)
plt.savefig('Charts/Decision Tree.svg', dpi=100)
plt.savefig('Charts/Decision Tree.png', dpi=100)
In [93]:
# Save the model to disk
pickle.dump(tree, open('Models/Decision Tree.sav', 'wb'))
In [98]:
tree = pickle.load(open('Models/Decision Tree.sav', 'rb'))

Random Forest Ensemble

In [5]:
from sklearn.ensemble import RandomForestClassifier
In [7]:
forest = RandomForestClassifier(max_depth = 20, verbose = 1, \
                                random_state = 13, n_jobs = -1)
In [8]:
forest.fit(x_train, y_train.values.ravel())
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed: 16.5min
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 45.8min finished
Out[8]:
RandomForestClassifier(max_depth=20, n_jobs=-1, random_state=13, verbose=1)
In [91]:
forest_accuracy = forest.score(x_test, y_test)
forest_accuracy
[Parallel(n_jobs=12)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  26 tasks      | elapsed:    1.9s
[Parallel(n_jobs=12)]: Done 100 out of 100 | elapsed:    5.9s finished
Out[91]:
0.9902342708937446
In [92]:
forest_predictions = forest.predict(x_test)
actual = y_test
[Parallel(n_jobs=12)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  26 tasks      | elapsed:    1.8s
[Parallel(n_jobs=12)]: Done 100 out of 100 | elapsed:    5.8s finished
In [93]:
forest_confusion = confusion_matrix(actual, forest_predictions)
forest_confusion
Out[93]:
array([[2173538,   21349],
       [     87,      49]], dtype=int64)
In [14]:
forest_precision = precision_score(actual, forest_predictions)
forest_precision
Out[14]:
0.0022899336386578185
In [15]:
print(classification_report(actual, forest_predictions))
              precision    recall  f1-score   support

       False       1.00      0.99      1.00   2194887
        True       0.00      0.36      0.00       136

    accuracy                           0.99   2195023
   macro avg       0.50      0.68      0.50   2195023
weighted avg       1.00      0.99      1.00   2195023

In [16]:
forest_probabilities = forest.predict_proba(x_test)
predictions = forest_probabilities[:,1]
[Parallel(n_jobs=12)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  26 tasks      | elapsed:    1.9s
[Parallel(n_jobs=12)]: Done 100 out of 100 | elapsed:    6.2s finished
In [17]:
forest_false_positive_rate, forest_true_positive_rate, threshold =\
    roc_curve(y_test, predictions)
In [18]:
forest_roc_auc = auc(forest_false_positive_rate, forest_true_positive_rate)
forest_roc_auc
Out[18]:
0.7974132039599305
In [20]:
plt.title('Random Forest Receiver Operating Characteristic Curve')
plt.plot(forest_false_positive_rate, forest_true_positive_rate, 'blue',
         label = 'AUC = %0.2f' % forest_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/Forest ROC AUC.png')
plt.savefig('Charts/Forest ROC AUC.svg')
plt.show()
In [44]:
# Save the model to disk
pickle.dump(forest, open('Models/Random Forest Ensemble.sav', 'wb'))
In [ ]:
reload_forest = False
if reload_forest:
    forest = pickle.load(open('Models/Random Forest Ensemble.sav', 'rb'))

Random Forest Ensemble With Class Weights

In an attempt to train the ensemble in a way that prioritizes the true negative, or actual failure cases, this version of the random forest weights failures as twice as important as non-failures.

In [45]:
weighted_forest = RandomForestClassifier(max_depth = 20, verbose = 1, \
        random_state = 13, n_jobs = -1, class_weight = {0: 1, 1: 2})
In [46]:
weighted_forest.fit(x_train, y_train.values.ravel())
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed: 15.7min
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 44.5min finished
Out[46]:
RandomForestClassifier(class_weight={0: 1, 1: 2}, max_depth=20, n_jobs=-1,
                       random_state=13, verbose=1)
In [47]:
weighted_forest_accuracy = weighted_forest.score(x_test, y_test)
weighted_forest_accuracy
[Parallel(n_jobs=12)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  26 tasks      | elapsed:    1.8s
[Parallel(n_jobs=12)]: Done 100 out of 100 | elapsed:    5.9s finished
Out[47]:
0.9717009798986161
In [88]:
weighted_forest_predictions = weighted_forest.predict(x_test)
actual = y_test
[Parallel(n_jobs=12)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  26 tasks      | elapsed:    2.3s
[Parallel(n_jobs=12)]: Done 100 out of 100 | elapsed:    6.5s finished

Compared to the unweighted Random Forest ensemble, this weighted ensemble gains another 6 true negative classifications for a true negative rate of 40% rather than 36%, but also gains 40,687 false positive classifications, for 0.028%, instead of 0.0097% false positives.

In [89]:
weighted_forest_confusion = confusion_matrix(actual, \
                                             weighted_forest_predictions)
weighted_forest_confusion
Out[89]:
array([[2132851,   62036],
       [     81,      55]], dtype=int64)
In [50]:
weighted_forest_precision = precision_score(actual, \
                                            weighted_forest_predictions)
weighted_forest_precision
Out[50]:
0.0008857966532991899
In [51]:
print(classification_report(actual, weighted_forest_predictions))
              precision    recall  f1-score   support

         0.0       1.00      0.97      0.99   2194887
         1.0       0.00      0.40      0.00       136

    accuracy                           0.97   2195023
   macro avg       0.50      0.69      0.49   2195023
weighted avg       1.00      0.97      0.99   2195023

In [52]:
weighted_forest_probabilities = weighted_forest.predict_proba(x_test)
weighted_predictions = weighted_forest_probabilities[:,1]
[Parallel(n_jobs=12)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  26 tasks      | elapsed:    1.8s
[Parallel(n_jobs=12)]: Done 100 out of 100 | elapsed:    6.2s finished
In [53]:
weighted_forest_false_positive_rate, weighted_forest_true_positive_rate, \
                    threshold = roc_curve(y_test, weighted_predictions)
In [54]:
weighted_forest_roc_auc = auc(weighted_forest_false_positive_rate, \
                              weighted_forest_true_positive_rate)
weighted_forest_roc_auc
Out[54]:
0.799827248576833
In [55]:
plt.title('Weighted Random Forest Receiver Operating Characteristic Curve')
plt.plot(weighted_forest_false_positive_rate, \
         weighted_forest_true_positive_rate, \
         'blue', label = 'AUC = %0.2f' % weighted_forest_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/Weighted Forest ROC AUC.png')
plt.savefig('Charts/Weighted Forest ROC AUC.svg')
plt.show()
In [56]:
# Save the model to disk
pickle.dump(forest, open('Models/Random Forest Ensemble Weighted.sav', 'wb'))
In [57]:
reload_weighted_forest = False
if reload_weighted_forest:
    forest = pickle.load(open('Models/Random Forest Ensemble Weighted.sav', 'rb'))

Neural Networks

In [5]:
import torch
from torch import nn, optim
import torch.utils.data as data_utils
torch.manual_seed(13)
Out[5]:
<torch._C.Generator at 0x1ed8cb527d0>

PyTorch requires the boolean values to be converted to floating point, so these dtypes will be changed before the neural network is defined.

In [72]:
x_train.dtypes
Out[72]:
pca_component_0                 float64
pca_component_1                 float64
pca_component_2                 float64
pca_component_3                 float64
pca_component_4                 float64
pca_component_5                 float64
pca_component_6                 float64
pca_component_7                 float64
pca_component_8                 float64
pca_component_9                 float64
pca_component_10                float64
pca_component_11                float64
pca_component_12                float64
manufacturer_HGST               float64
manufacturer_Seagate            float64
manufacturer_Toshiba            float64
manufacturer_Western Digital    float64
smart_191_cat_0                 float64
smart_191_cat_1                 float64
smart_191_cat_2                 float64
smart_184_cat_0                 float64
smart_184_cat_1                 float64
smart_200_cat_0                 float64
smart_200_cat_1                 float64
smart_200_cat_2                 float64
smart_196_cat_0                 float64
smart_196_cat_1                 float64
smart_196_cat_2                 float64
smart_8_cat_0                   float64
smart_8_cat_1                   float64
smart_8_cat_2                   float64
smart_2_cat_0                   float64
smart_2_cat_1                   float64
smart_2_cat_2                   float64
smart_223_cat_0                 float64
smart_223_cat_1                 float64
smart_223_cat_2                 float64
smart_220_cat_0                 float64
smart_220_cat_1                 float64
smart_220_cat_2                 float64
smart_222_cat_0                 float64
smart_222_cat_1                 float64
smart_222_cat_2                 float64
smart_226_cat_0                 float64
smart_226_cat_1                 float64
smart_226_cat_2                 float64
smart_11_cat_0                  float64
smart_11_cat_1                  float64
smart_11_cat_2                  float64
dtype: object
In [6]:
for col in x_train:
    if x_train[col].dtype == "bool":
        x_train[col] = x_train[col].astype(float)
        x_test[col] = x_test[col].astype(float)
In [7]:
x_train.dtypes
Out[7]:
pca_component_0                 float64
pca_component_1                 float64
pca_component_2                 float64
pca_component_3                 float64
pca_component_4                 float64
pca_component_5                 float64
pca_component_6                 float64
pca_component_7                 float64
pca_component_8                 float64
pca_component_9                 float64
pca_component_10                float64
pca_component_11                float64
pca_component_12                float64
manufacturer_HGST               float64
manufacturer_Seagate            float64
manufacturer_Toshiba            float64
manufacturer_Western Digital    float64
smart_191_cat_0                 float64
smart_191_cat_1                 float64
smart_191_cat_2                 float64
smart_184_cat_0                 float64
smart_184_cat_1                 float64
smart_200_cat_0                 float64
smart_200_cat_1                 float64
smart_200_cat_2                 float64
smart_196_cat_0                 float64
smart_196_cat_1                 float64
smart_196_cat_2                 float64
smart_8_cat_0                   float64
smart_8_cat_1                   float64
smart_8_cat_2                   float64
smart_2_cat_0                   float64
smart_2_cat_1                   float64
smart_2_cat_2                   float64
smart_223_cat_0                 float64
smart_223_cat_1                 float64
smart_223_cat_2                 float64
smart_220_cat_0                 float64
smart_220_cat_1                 float64
smart_220_cat_2                 float64
smart_222_cat_0                 float64
smart_222_cat_1                 float64
smart_222_cat_2                 float64
smart_226_cat_0                 float64
smart_226_cat_1                 float64
smart_226_cat_2                 float64
smart_11_cat_0                  float64
smart_11_cat_1                  float64
smart_11_cat_2                  float64
dtype: object
In [75]:
x_train.isna().sum().sum()
Out[75]:
0
In [76]:
x_test.dtypes
Out[76]:
pca_component_0                 float64
pca_component_1                 float64
pca_component_2                 float64
pca_component_3                 float64
pca_component_4                 float64
pca_component_5                 float64
pca_component_6                 float64
pca_component_7                 float64
pca_component_8                 float64
pca_component_9                 float64
pca_component_10                float64
pca_component_11                float64
pca_component_12                float64
manufacturer_HGST               float64
manufacturer_Seagate            float64
manufacturer_Toshiba            float64
manufacturer_Western Digital    float64
smart_191_cat_0                 float64
smart_191_cat_1                 float64
smart_191_cat_2                 float64
smart_184_cat_0                 float64
smart_184_cat_1                 float64
smart_200_cat_0                 float64
smart_200_cat_1                 float64
smart_200_cat_2                 float64
smart_196_cat_0                 float64
smart_196_cat_1                 float64
smart_196_cat_2                 float64
smart_8_cat_0                   float64
smart_8_cat_1                   float64
smart_8_cat_2                   float64
smart_2_cat_0                   float64
smart_2_cat_1                   float64
smart_2_cat_2                   float64
smart_223_cat_0                 float64
smart_223_cat_1                 float64
smart_223_cat_2                 float64
smart_220_cat_0                 float64
smart_220_cat_1                 float64
smart_220_cat_2                 float64
smart_222_cat_0                 float64
smart_222_cat_1                 float64
smart_222_cat_2                 float64
smart_226_cat_0                 float64
smart_226_cat_1                 float64
smart_226_cat_2                 float64
smart_11_cat_0                  float64
smart_11_cat_1                  float64
smart_11_cat_2                  float64
dtype: object
In [77]:
x_test.isna().sum().sum()
Out[77]:
0
In [8]:
y_train = y_train.astype(float)
y_test = y_test.astype(float)
In [9]:
train_label = torch.tensor(y_train.values)
trainset = torch.tensor(x_train.values)
train_tensor = data_utils.TensorDataset(trainset, train_label) 
trainloader = data_utils.DataLoader(dataset = train_tensor, \
                                    batch_size = 512, shuffle = True)


test_label = torch.tensor(y_test.values)
testset = torch.tensor(x_test.values)
test_tensor = data_utils.TensorDataset(testset, test_label) 
testloader = data_utils.DataLoader(dataset = test_tensor, \
                                   batch_size = 512, shuffle = True)
In [10]:
torch.backends.cudnn.enabled
Out[10]:
True
In [11]:
torch.cuda.is_available()
Out[11]:
True
In [12]:
print(torch.version.cuda)
10.2
In [13]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device
Out[13]:
device(type='cuda', index=0)

Neural Network 1

In [61]:
class nn_Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(49, 24)
        self.act1 = nn.LeakyReLU()
        self.fc2 = nn.Linear(24, 12)
        self.act2 = nn.LeakyReLU()
        self.fc3 = nn.Linear(12, 1)
        
    def forward(self, x):
        # make sure input tensor is flattened
        x = x.view(-1, 49)
        
        x = self.act1(self.fc1(x))
        x = self.act2(self.fc2(x))
        x = torch.sigmoid(self.fc3(x))
        
        return x
In [62]:
neural_network = nn_Classifier()
criterion = nn.BCELoss()
optimizer = optim.Adam(neural_network.parameters(), lr = 1e-7, \
                       weight_decay = 1e-5)
In [63]:
def init_weights(m):
    if type(m) == nn.Linear:
        torch.nn.init.xavier_uniform_(m.weight)
        m.bias.data.fill_(0.01)
            
neural_network.apply(init_weights)
Out[63]:
nn_Classifier(
  (fc1): Linear(in_features=49, out_features=24, bias=True)
  (act1): LeakyReLU(negative_slope=0.01)
  (fc2): Linear(in_features=24, out_features=12, bias=True)
  (act2): LeakyReLU(negative_slope=0.01)
  (fc3): Linear(in_features=12, out_features=1, bias=True)
)
In [64]:
n_train = len(x_train)
epochs = 10
neural_network.to(device);
In [65]:
train_losses = []
test_losses = []
current = 0
test_loss_min = np.Inf 

for e in range(epochs):
    neural_network.train()
    running_loss = 0
    for row, target in trainloader:
        row = row.to(device)
        target = target.to(device)
        
        optimizer.zero_grad()
        
        output = neural_network(row.float())
        loss = criterion(output, target.float())
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
        
        # Reporting
        print(str(current) + " / " + str(n_train) + "({:.3f}%)".format((current / n_train) * 100), end = "\r", flush = True)
        current += len(row)
        
    else:
        neural_network.eval()
        test_loss = 0
        accuracy = 0
        
        # Turn off gradients for validation, saves memory and computations
        with torch.no_grad():
            for row, target in testloader:
                row = row.to(device)
                target = target.to(device)
                
                
                output = neural_network(row.float())
                test_loss += criterion(output, target.float())

                current = 0
        
        # Calculate average losses
        train_losses.append(running_loss/len(trainloader))
        valid_loss = test_loss/len(testloader)
        test_losses.append(valid_loss)
        
        # Print validation statistics
        print("Epoch: {}/{}.. ".format(e+1, epochs),
              "Training Loss: {:.2f}.. ".format(running_loss/len(trainloader)),
              "Test Loss: {:.2f}.. ".format(valid_loss))
        
         
        # Save the model if test loss has decreased
        if test_loss/len(testloader) <= test_loss_min:
            print('Test loss decreased ({:.4f} --> {:.4f}).  Saving model ...'.format(
                test_loss_min, valid_loss))
            torch.save(neural_network.state_dict(), 'Models/Neural Network 1.pt')
            test_loss_min = valid_loss
Epoch: 1/10..  Training Loss: 0.66..  Test Loss: 0.90.. 
Test loss decreased (inf --> 0.8998).  Saving model ...
Epoch: 2/10..  Training Loss: 0.64..  Test Loss: 0.83.. 
Test loss decreased (0.8998 --> 0.8279).  Saving model ...
Epoch: 3/10..  Training Loss: 0.62..  Test Loss: 0.77.. 
Test loss decreased (0.8279 --> 0.7658).  Saving model ...
Epoch: 4/10..  Training Loss: 0.61..  Test Loss: 0.71.. 
Test loss decreased (0.7658 --> 0.7128).  Saving model ...
Epoch: 5/10..  Training Loss: 0.59..  Test Loss: 0.67.. 
Test loss decreased (0.7128 --> 0.6690).  Saving model ...
Epoch: 6/10..  Training Loss: 0.58..  Test Loss: 0.63.. 
Test loss decreased (0.6690 --> 0.6340).  Saving model ...
Epoch: 7/10..  Training Loss: 0.57..  Test Loss: 0.61.. 
Test loss decreased (0.6340 --> 0.6058).  Saving model ...
Epoch: 8/10..  Training Loss: 0.56..  Test Loss: 0.58.. 
Test loss decreased (0.6058 --> 0.5828).  Saving model ...
Epoch: 9/10..  Training Loss: 0.55..  Test Loss: 0.56.. 
Test loss decreased (0.5828 --> 0.5647).  Saving model ...
Epoch: 10/10..  Training Loss: 0.54..  Test Loss: 0.55.. 
Test loss decreased (0.5647 --> 0.5507).  Saving model ...

While it may eventually improve with enough training, it's most likely that this architecture of neural network is too simple for the problem at hand. A more complex one will be built next.

In [66]:
for param in neural_network.parameters():
    print(param.data)
tensor([[-0.1473,  0.2276,  0.0955,  ..., -0.0943, -0.1436,  0.2180],
        [ 0.1537,  0.1776,  0.0052,  ..., -0.2855,  0.2067,  0.2886],
        [ 0.0298,  0.0174, -0.2388,  ...,  0.2323, -0.2218,  0.1123],
        ...,
        [-0.0063,  0.2199,  0.0515,  ...,  0.2911, -0.0368, -0.2214],
        [ 0.1956,  0.0806, -0.0785,  ...,  0.1534,  0.2400,  0.1957],
        [-0.2404, -0.0611, -0.2725,  ..., -0.1479,  0.0216,  0.1225]],
       device='cuda:0')
tensor([ 0.0136, -0.0186,  0.0186,  0.0352, -0.0144, -0.0008,  0.0370,  0.0393,
         0.0043,  0.0392, -0.0165,  0.0240, -0.0170,  0.0299, -0.0164,  0.0381,
         0.0371,  0.0386,  0.0289, -0.0026,  0.0215,  0.0374,  0.0129,  0.0231],
       device='cuda:0')
tensor([[ 1.7849e-01, -2.0251e-01,  6.2617e-02,  3.1074e-01, -3.6079e-01,
          2.9729e-01, -2.8113e-01,  2.7099e-01,  2.2215e-01, -2.2240e-02,
         -4.0342e-01,  1.1523e-01, -3.1116e-01,  2.6588e-01, -3.7920e-01,
          3.0629e-01,  1.4652e-02,  1.8915e-01,  3.9967e-01, -1.0067e-01,
          4.2945e-01,  3.6560e-01,  3.9827e-01, -1.6999e-01],
        [ 1.9745e-01,  1.8876e-01, -2.7809e-01,  5.0733e-02,  6.1933e-02,
          3.1665e-01,  4.1407e-01, -2.2936e-01, -2.5261e-01,  3.3287e-01,
          1.2620e-01,  2.0362e-02, -1.3069e-01,  2.2032e-02, -4.3166e-01,
          2.4362e-01,  3.8399e-01, -1.9928e-01,  3.2188e-02,  3.2998e-02,
          3.4668e-01,  2.7267e-01, -2.1278e-01,  4.0523e-01],
        [-3.4832e-01,  1.9394e-01,  3.4774e-02,  2.4953e-01,  3.8886e-01,
          1.6397e-01, -2.1415e-01, -1.7721e-01, -1.7167e-01,  1.0021e-01,
         -4.3153e-02, -2.4533e-01, -1.9537e-01,  3.3032e-01,  3.9929e-01,
          2.0657e-01,  2.8361e-01,  2.9780e-01,  2.2265e-01,  3.3895e-01,
          6.8382e-02, -3.8065e-01, -1.2103e-02,  1.7160e-01],
        [ 1.8496e-01, -2.6005e-01, -1.9058e-01,  1.0866e-01,  1.9734e-01,
         -2.8778e-01,  2.5133e-01,  6.2480e-02,  4.2115e-02,  8.9481e-02,
          5.0526e-03,  9.6357e-03,  3.3971e-01, -4.2915e-02, -4.1379e-01,
         -2.7652e-01,  2.2339e-01, -2.3685e-01, -3.5120e-01, -6.4610e-02,
         -1.0527e-02, -3.9578e-01,  3.9148e-01,  3.3773e-01],
        [ 2.0560e-01,  1.0526e-01, -1.4828e-01,  7.8619e-02, -1.1924e-01,
         -2.5331e-01,  2.6876e-01,  6.1315e-02,  1.0085e-01,  2.3695e-01,
         -1.8098e-01, -2.2123e-01, -3.6014e-01,  1.5567e-01, -2.7817e-01,
          2.1415e-01,  2.9940e-01, -4.1336e-01,  1.4994e-01,  6.3024e-03,
         -3.0537e-01,  3.2188e-01,  9.7505e-02,  2.3926e-01],
        [ 2.0996e-01,  2.7758e-01,  3.3542e-01, -2.8499e-01,  2.1536e-01,
          2.7024e-01, -1.7165e-01, -1.5995e-01,  5.9415e-02, -1.0555e-02,
          3.5253e-01, -1.3720e-01,  8.6542e-02,  1.8572e-01,  3.2024e-01,
          2.2338e-01, -3.0169e-01,  3.4114e-01, -5.0227e-02, -2.0665e-01,
         -3.6495e-01,  7.1415e-02,  2.7955e-01, -2.1537e-02],
        [ 3.8182e-01,  6.1844e-02,  9.8586e-04, -5.1438e-02,  8.6961e-02,
          1.2606e-01,  3.0335e-01, -4.0351e-01,  3.9948e-01, -2.6709e-01,
         -3.5960e-01,  3.3448e-01,  2.4755e-01, -3.9389e-01, -3.6011e-01,
          1.8035e-01,  3.8077e-01, -1.5780e-01, -3.8327e-01,  1.1362e-01,
         -3.3365e-02, -4.5110e-02,  3.3626e-02, -2.1293e-01],
        [ 3.8949e-01, -3.8495e-01, -3.0542e-01, -1.8660e-01, -3.7620e-02,
         -6.8434e-02,  1.7219e-02,  2.8387e-01, -3.5285e-01, -2.6009e-01,
         -2.8474e-01, -1.1658e-01,  1.4181e-01, -4.0640e-01,  1.0785e-01,
          2.0529e-02, -5.4648e-02,  8.1995e-02,  3.9354e-01,  7.1802e-02,
          3.3340e-01,  2.8701e-01, -7.4615e-02,  4.0314e-01],
        [-3.6525e-01,  5.6254e-02,  3.6376e-01,  2.2438e-01,  1.9816e-01,
         -7.9381e-02, -2.4098e-01, -3.7098e-02, -5.5936e-02, -2.4190e-01,
         -2.6380e-01,  3.9556e-01,  2.1115e-01,  3.9093e-01,  8.4963e-02,
         -2.2577e-01,  3.3309e-03,  5.9278e-02, -5.3482e-02, -1.1502e-01,
          1.0826e-01,  6.7052e-02, -2.8193e-02,  4.0294e-01],
        [-2.0106e-01,  1.8851e-01, -6.0427e-02,  3.1058e-01,  2.5597e-01,
          3.3932e-01,  2.1778e-01, -1.2057e-01,  3.8046e-01, -8.1065e-02,
          9.2627e-02, -2.2181e-01, -3.0132e-01,  2.6635e-01,  1.2915e-01,
         -1.2134e-01, -3.2905e-01, -1.2862e-01,  3.8297e-01,  1.0554e-01,
          3.5215e-01, -2.7842e-01,  6.2954e-02,  3.0246e-01],
        [ 8.5760e-02,  2.1523e-01, -3.6957e-01, -2.9393e-01,  3.2447e-01,
          1.4559e-03, -1.8984e-01,  5.3440e-02, -1.6062e-01,  3.8645e-01,
         -2.7754e-01, -1.2209e-01, -3.5887e-01,  7.9270e-03, -3.2843e-01,
          7.7787e-03, -8.3935e-02, -3.9126e-01, -1.0404e-01,  3.5060e-01,
          3.9034e-01, -2.0370e-01, -1.8387e-01,  1.1450e-01],
        [ 3.8823e-01, -8.8690e-02, -1.4096e-01, -3.3738e-01, -2.4424e-01,
          1.5122e-01, -8.6990e-02,  2.6272e-01, -2.7245e-01, -6.6767e-05,
         -1.1266e-01, -1.3454e-01,  1.8355e-01, -1.9562e-01, -1.4209e-01,
         -3.4258e-02,  8.6167e-02,  9.8019e-02, -4.0399e-01, -6.7994e-03,
         -6.4918e-02,  5.0701e-02, -1.5243e-01,  1.3844e-01]], device='cuda:0')
tensor([ 0.0377, -0.0176,  0.0356,  0.0121,  0.0375, -0.0067, -0.0167,  0.0026,
         0.0377, -0.0158,  0.0088,  0.0044], device='cuda:0')
tensor([[-0.4812,  0.0197, -0.3723, -0.1464,  0.3376,  0.3456,  0.4956,  0.6614,
         -0.1128,  0.5379,  0.3026,  0.3946]], device='cuda:0')
tensor([-0.0167], device='cuda:0')
In [67]:
plt.figure(figsize = (12, 5))
train_ax, = plt.plot(np.arange(epochs), train_losses, 'r--', \
                     label = "Training Loss")
test_ax, = plt.plot(np.arange(epochs), test_losses, 'b--', \
                    label = "Test Loss")
plt.title("Neural Network 1 Losses")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.xticks(range(0, epochs))
plt.grid(b = True, which = 'major', color = 'w', linewidth = 1.0)
plt.grid(b = True, which = 'minor', color = 'w', linewidth = 0.5)
plt.legend(handles = [train_ax, test_ax])
plt.savefig("Charts/NN1 Loss Plot.svg")
plt.savefig("Charts/NN1 Loss Plot.png")
In [68]:
neural_network.eval()
output = []
pred_targets = []

with torch.no_grad():
    for rows, targets in testloader:
        rows = rows.to(device)
                
        output += neural_network(rows.float())
        pred_targets += targets
In [69]:
output[:30]
Out[69]:
[tensor([0.4071], device='cuda:0'),
 tensor([0.4367], device='cuda:0'),
 tensor([0.4916], device='cuda:0'),
 tensor([0.4602], device='cuda:0'),
 tensor([0.4200], device='cuda:0'),
 tensor([0.5220], device='cuda:0'),
 tensor([0.3182], device='cuda:0'),
 tensor([0.4554], device='cuda:0'),
 tensor([0.4201], device='cuda:0'),
 tensor([0.4309], device='cuda:0'),
 tensor([0.3266], device='cuda:0'),
 tensor([0.3411], device='cuda:0'),
 tensor([0.4875], device='cuda:0'),
 tensor([0.3779], device='cuda:0'),
 tensor([0.3846], device='cuda:0'),
 tensor([0.4094], device='cuda:0'),
 tensor([0.3893], device='cuda:0'),
 tensor([0.3415], device='cuda:0'),
 tensor([0.4405], device='cuda:0'),
 tensor([0.3425], device='cuda:0'),
 tensor([0.3623], device='cuda:0'),
 tensor([0.4208], device='cuda:0'),
 tensor([0.3198], device='cuda:0'),
 tensor([0.3298], device='cuda:0'),
 tensor([0.3357], device='cuda:0'),
 tensor([0.6797], device='cuda:0'),
 tensor([0.4066], device='cuda:0'),
 tensor([0.4710], device='cuda:0'),
 tensor([0.3952], device='cuda:0'),
 tensor([0.6402], device='cuda:0')]
In [70]:
nn1_predictions = []
actual = []

for i, x in enumerate(output):
    if output[i].item() <= 0.5:
        nn1_predictions.append(0)
    else:
        nn1_predictions.append(1)
        
    if pred_targets[i].item() == 0.0:
        actual.append(0)
    elif pred_targets[i].item() == 1.0:
        actual.append(1)
In [71]:
nn1_confusion = confusion_matrix(actual, nn1_predictions)
nn1_confusion
Out[71]:
array([[2016112,  178775],
       [     52,      84]], dtype=int64)
In [72]:
nn1_precision = precision_score(actual, nn1_predictions)
nn1_precision
Out[72]:
0.0004696436858083742
In [73]:
print(classification_report(actual, nn1_predictions))
              precision    recall  f1-score   support

           0       1.00      0.92      0.96   2194887
           1       0.00      0.62      0.00       136

    accuracy                           0.92   2195023
   macro avg       0.50      0.77      0.48   2195023
weighted avg       1.00      0.92      0.96   2195023

In [74]:
nn1_false_positive_rate, nn1_true_positive_rate, threshold =\
    roc_curve(actual, nn1_predictions)
In [75]:
nn1_roc_auc = auc(nn1_false_positive_rate, nn1_true_positive_rate)
nn1_roc_auc
Out[75]:
0.7680981982215941
In [76]:
plt.title('Neural Network 1 Receiver Operating Characteristic Curve')
plt.plot(nn1_false_positive_rate, nn1_true_positive_rate, 'blue',
         label = 'AUC = %0.2f' % nn1_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/NN1 ROC AUC.png')
plt.savefig('Charts/NN1 ROC AUC.svg')
plt.show()

Neural Network 2

In [14]:
class nn_Classifier2(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(49, 98)
        self.act1 = nn.LeakyReLU()
        self.fc2 = nn.Linear(98, 72)
        self.act2 = nn.LeakyReLU()
        self.fc3 = nn.Linear(72, 36)
        self.act3 = nn.LeakyReLU()
        self.fc4 = nn.Linear(36, 9)
        self.act4 = nn.LeakyReLU()
        self.fc5 = nn.Linear(9, 1)
        
    def forward(self, x):
        # make sure input tensor is flattened
        x = x.view(-1, 49)
        
        x = self.act1(self.fc1(x))
        x = self.act2(self.fc2(x))
        x = self.act3(self.fc3(x))
        x = self.act4(self.fc4(x))
        x = torch.sigmoid(self.fc5(x))
        
        return x
In [15]:
neural_network2 = nn_Classifier2()
criterion = nn.BCELoss()
optimizer = optim.Adam(neural_network2.parameters(), lr = 1e-7, weight_decay = 1e-5)
In [16]:
def init_weights2(m):
    if type(m) == nn.Linear:
        torch.nn.init.xavier_uniform_(m.weight)
        m.bias.data.fill_(0.01)
            
neural_network2.apply(init_weights2)
Out[16]:
nn_Classifier2(
  (fc1): Linear(in_features=49, out_features=98, bias=True)
  (act1): LeakyReLU(negative_slope=0.01)
  (fc2): Linear(in_features=98, out_features=72, bias=True)
  (act2): LeakyReLU(negative_slope=0.01)
  (fc3): Linear(in_features=72, out_features=36, bias=True)
  (act3): LeakyReLU(negative_slope=0.01)
  (fc4): Linear(in_features=36, out_features=9, bias=True)
  (act4): LeakyReLU(negative_slope=0.01)
  (fc5): Linear(in_features=9, out_features=1, bias=True)
)
In [17]:
n_train = len(x_train)
epochs = 50
neural_network2.to(device);
In [18]:
train_losses = []
test_losses = []
current = 0
test_loss_min = np.Inf

for e in range(epochs):
    neural_network2.train()
    running_loss = 0
    for row, target in trainloader:
        row = row.to(device)
        target = target.to(device)

        optimizer.zero_grad()
        
        output = neural_network2(row.float())
        loss = criterion(output, target.float())
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
        
        # Reporting
        print(str(current) + " / " + str(n_train) + "({:.3f}%)".format((current / n_train) * 100), end = "\r", flush = True)
        current += len(row)
        
    else:
        neural_network2.eval()
        test_loss = 0
        
        # Turn off gradients for validation, saves memory and computations
        with torch.no_grad():
            for row, target in testloader:
                row = row.to(device)
                target = target.to(device)
                
                output = neural_network2(row.float())
                test_loss += criterion(output, target.float())

                current = 0
        
        # Calculate average losses
        train_losses.append(running_loss/len(trainloader))
        valid_loss = (test_loss/len(testloader))
        test_losses.append(valid_loss)
        
        # Print validation statistics
        print("Epoch: {}/{}.. ".format(e+1, epochs),
              "Training Loss: {:.2f}.. ".format(running_loss/len(trainloader)),
              "Test Loss: {:.2f}.. ".format(valid_loss))
        
        # Save the model if test loss has decreased
        if test_loss/len(testloader) <= test_loss_min:
            print('Test loss decreased ({:.4f} --> {:.4f}).  Saving model ...'.format(
                test_loss_min, valid_loss))
            torch.save(neural_network2.state_dict(), 'Models/Neural Network 2.pt')
            test_loss_min = valid_loss
Epoch: 1/50..  Training Loss: 0.64..  Test Loss: 0.70.. 
Test loss decreased (inf --> 0.7035).  Saving model ...
Epoch: 2/50..  Training Loss: 0.60..  Test Loss: 0.63.. 
Test loss decreased (0.7035 --> 0.6301).  Saving model ...
Epoch: 3/50..  Training Loss: 0.56..  Test Loss: 0.55.. 
Test loss decreased (0.6301 --> 0.5467).  Saving model ...
Epoch: 4/50..  Training Loss: 0.53..  Test Loss: 0.51.. 
Test loss decreased (0.5467 --> 0.5089).  Saving model ...
Epoch: 5/50..  Training Loss: 0.51..  Test Loss: 0.48.. 
Test loss decreased (0.5089 --> 0.4839).  Saving model ...
Epoch: 6/50..  Training Loss: 0.50..  Test Loss: 0.47.. 
Test loss decreased (0.4839 --> 0.4651).  Saving model ...
Epoch: 7/50..  Training Loss: 0.48..  Test Loss: 0.45.. 
Test loss decreased (0.4651 --> 0.4544).  Saving model ...
Epoch: 8/50..  Training Loss: 0.47..  Test Loss: 0.44.. 
Test loss decreased (0.4544 --> 0.4439).  Saving model ...
Epoch: 9/50..  Training Loss: 0.46..  Test Loss: 0.43.. 
Test loss decreased (0.4439 --> 0.4348).  Saving model ...
Epoch: 10/50..  Training Loss: 0.46..  Test Loss: 0.43.. 
Test loss decreased (0.4348 --> 0.4289).  Saving model ...
Epoch: 11/50..  Training Loss: 0.45..  Test Loss: 0.42.. 
Test loss decreased (0.4289 --> 0.4247).  Saving model ...
Epoch: 12/50..  Training Loss: 0.44..  Test Loss: 0.42.. 
Test loss decreased (0.4247 --> 0.4200).  Saving model ...
Epoch: 13/50..  Training Loss: 0.44..  Test Loss: 0.42.. 
Test loss decreased (0.4200 --> 0.4164).  Saving model ...
Epoch: 14/50..  Training Loss: 0.43..  Test Loss: 0.41.. 
Test loss decreased (0.4164 --> 0.4122).  Saving model ...
Epoch: 15/50..  Training Loss: 0.43..  Test Loss: 0.41.. 
Test loss decreased (0.4122 --> 0.4094).  Saving model ...
Epoch: 16/50..  Training Loss: 0.42..  Test Loss: 0.41.. 
Test loss decreased (0.4094 --> 0.4061).  Saving model ...
Epoch: 17/50..  Training Loss: 0.42..  Test Loss: 0.40.. 
Test loss decreased (0.4061 --> 0.4030).  Saving model ...
Epoch: 18/50..  Training Loss: 0.41..  Test Loss: 0.40.. 
Test loss decreased (0.4030 --> 0.4004).  Saving model ...
Epoch: 19/50..  Training Loss: 0.41..  Test Loss: 0.40.. 
Test loss decreased (0.4004 --> 0.3973).  Saving model ...
Epoch: 20/50..  Training Loss: 0.41..  Test Loss: 0.39.. 
Test loss decreased (0.3973 --> 0.3937).  Saving model ...
Epoch: 21/50..  Training Loss: 0.40..  Test Loss: 0.39.. 
Test loss decreased (0.3937 --> 0.3901).  Saving model ...
Epoch: 22/50..  Training Loss: 0.40..  Test Loss: 0.39.. 
Test loss decreased (0.3901 --> 0.3871).  Saving model ...
Epoch: 23/50..  Training Loss: 0.39..  Test Loss: 0.38.. 
Test loss decreased (0.3871 --> 0.3839).  Saving model ...
Epoch: 24/50..  Training Loss: 0.39..  Test Loss: 0.38.. 
Test loss decreased (0.3839 --> 0.3812).  Saving model ...
Epoch: 25/50..  Training Loss: 0.39..  Test Loss: 0.38.. 
Test loss decreased (0.3812 --> 0.3775).  Saving model ...
Epoch: 26/50..  Training Loss: 0.38..  Test Loss: 0.38.. 
Test loss decreased (0.3775 --> 0.3758).  Saving model ...
Epoch: 27/50..  Training Loss: 0.38..  Test Loss: 0.37.. 
Test loss decreased (0.3758 --> 0.3731).  Saving model ...
Epoch: 28/50..  Training Loss: 0.38..  Test Loss: 0.37.. 
Test loss decreased (0.3731 --> 0.3706).  Saving model ...
Epoch: 29/50..  Training Loss: 0.38..  Test Loss: 0.37.. 
Test loss decreased (0.3706 --> 0.3670).  Saving model ...
Epoch: 30/50..  Training Loss: 0.37..  Test Loss: 0.37.. 
Test loss decreased (0.3670 --> 0.3650).  Saving model ...
Epoch: 31/50..  Training Loss: 0.37..  Test Loss: 0.36.. 
Test loss decreased (0.3650 --> 0.3623).  Saving model ...
Epoch: 32/50..  Training Loss: 0.37..  Test Loss: 0.36.. 
Test loss decreased (0.3623 --> 0.3590).  Saving model ...
Epoch: 33/50..  Training Loss: 0.36..  Test Loss: 0.36.. 
Test loss decreased (0.3590 --> 0.3568).  Saving model ...
Epoch: 34/50..  Training Loss: 0.36..  Test Loss: 0.35.. 
Test loss decreased (0.3568 --> 0.3535).  Saving model ...
Epoch: 35/50..  Training Loss: 0.36..  Test Loss: 0.35.. 
Test loss decreased (0.3535 --> 0.3511).  Saving model ...
Epoch: 36/50..  Training Loss: 0.35..  Test Loss: 0.35.. 
Test loss decreased (0.3511 --> 0.3467).  Saving model ...
Epoch: 37/50..  Training Loss: 0.35..  Test Loss: 0.34.. 
Test loss decreased (0.3467 --> 0.3447).  Saving model ...
Epoch: 38/50..  Training Loss: 0.35..  Test Loss: 0.34.. 
Test loss decreased (0.3447 --> 0.3428).  Saving model ...
Epoch: 39/50..  Training Loss: 0.35..  Test Loss: 0.34.. 
Test loss decreased (0.3428 --> 0.3398).  Saving model ...
Epoch: 40/50..  Training Loss: 0.34..  Test Loss: 0.34.. 
Test loss decreased (0.3398 --> 0.3369).  Saving model ...
Epoch: 41/50..  Training Loss: 0.34..  Test Loss: 0.34.. 
Test loss decreased (0.3369 --> 0.3360).  Saving model ...
Epoch: 42/50..  Training Loss: 0.34..  Test Loss: 0.33.. 
Test loss decreased (0.3360 --> 0.3328).  Saving model ...
Epoch: 43/50..  Training Loss: 0.34..  Test Loss: 0.33.. 
Test loss decreased (0.3328 --> 0.3298).  Saving model ...
Epoch: 44/50..  Training Loss: 0.34..  Test Loss: 0.33.. 
Test loss decreased (0.3298 --> 0.3279).  Saving model ...
Epoch: 45/50..  Training Loss: 0.33..  Test Loss: 0.32.. 
Test loss decreased (0.3279 --> 0.3248).  Saving model ...
Epoch: 46/50..  Training Loss: 0.33..  Test Loss: 0.32.. 
Test loss decreased (0.3248 --> 0.3236).  Saving model ...
Epoch: 47/50..  Training Loss: 0.33..  Test Loss: 0.32.. 
Test loss decreased (0.3236 --> 0.3208).  Saving model ...
Epoch: 48/50..  Training Loss: 0.33..  Test Loss: 0.32.. 
Test loss decreased (0.3208 --> 0.3191).  Saving model ...
Epoch: 49/50..  Training Loss: 0.32..  Test Loss: 0.32.. 
Test loss decreased (0.3191 --> 0.3178).  Saving model ...
Epoch: 50/50..  Training Loss: 0.32..  Test Loss: 0.31.. 
Test loss decreased (0.3178 --> 0.3129).  Saving model ...

It may seem quite odd that the testing loss is consistently lower than the training loss. In this case, it's quite likely the the sheer size of the training set causes this to occur. The model is constantly improving every training batch and the training loss is calculated from the entire epoch. The testing loss is calculated after the entire epoch of batches have all affected the model for the better.

In [45]:
plt.figure(figsize = (12, 5))
train_ax, = plt.plot(np.arange(epochs), train_losses, 'r--', label = "Training Loss")
test_ax, = plt.plot(np.arange(epochs), test_losses, 'b--', label = "Test Loss")
plt.title("Neural Network 2 Training and Test Losses")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.xticks(range(0, epochs))
plt.grid(b = True, which = 'major', color = 'w', linewidth = 1.0)
plt.grid(b = True, which = 'minor', color = 'w', linewidth = 0.5)
plt.legend(handles = [train_ax, test_ax])
plt.savefig("Charts/NN2 Loss Plot.svg")
plt.savefig("Charts/NN2 Loss Plot.png")
In [20]:
train_losses
Out[20]:
[0.6394088806,
 0.5974827975,
 0.5615484586,
 0.5345383341,
 0.5140510356,
 0.4971041327,
 0.4837051073,
 0.4724557599,
 0.4632148664,
 0.4550733971,
 0.4483568522,
 0.4421497967,
 0.4368479946,
 0.4318463335,
 0.4273023369,
 0.4228789921,
 0.4183978202,
 0.4141231419,
 0.4100503742,
 0.4062411901,
 0.4021704196,
 0.3983474331,
 0.3946568989,
 0.3911679414,
 0.3877969839,
 0.3843898797,
 0.3811379338,
 0.3780967995,
 0.3750898181,
 0.3719848349,
 0.368968146,
 0.3658273222,
 0.3629895356,
 0.3601837461,
 0.3575650346,
 0.3548797017,
 0.3518992016,
 0.3494093663,
 0.347011814,
 0.3447187431,
 0.3424452845,
 0.3400888381,
 0.3374523245,
 0.3351643949,
 0.3328306837,
 0.3308196616,
 0.3288264187,
 0.3263858052,
 0.3239979536,
 0.3216325002]
In [21]:
with open('Models/NN2_train_losses.txt', 'w') as loss:
    for epoch in train_losses:
        loss.write(str(epoch) + "\n")
In [22]:
for epoch in test_losses:
    print(epoch.item())
0.7034912109375
0.6301103234291077
0.5466606616973877
0.5088833570480347
0.4839249551296234
0.46507981419563293
0.4544140696525574
0.4438667297363281
0.4348237216472626
0.42894241213798523
0.4246806800365448
0.4199894666671753
0.4163917005062103
0.41220182180404663
0.40937915444374084
0.4061104357242584
0.403046190738678
0.4004068672657013
0.3973144590854645
0.39373621344566345
0.3901030123233795
0.3871174156665802
0.38390490412712097
0.3812478482723236
0.3775373697280884
0.3758487105369568
0.37314432859420776
0.37061068415641785
0.3670412600040436
0.36504238843917847
0.3622877299785614
0.3589939475059509
0.35675162076950073
0.3535061180591583
0.35114216804504395
0.34674540162086487
0.34468403458595276
0.34278616309165955
0.3398434817790985
0.3369307816028595
0.33598217368125916
0.3327692747116089
0.3297843337059021
0.3278554379940033
0.3247697651386261
0.3235817551612854
0.3208043873310089
0.3191089332103729
0.31777939200401306
0.31286540627479553
In [23]:
with open('Models/NN2_test_losses.txt', 'w') as loss:
    for epoch in test_losses:
        loss.write(str(epoch.item()) + "\n")
In [24]:
neural_network2.eval()
output = []
pred_targets = []

with torch.no_grad():
    for rows, targets in testloader:
        rows = rows.to(device)
                
        output += neural_network2(rows.float())
        pred_targets += targets
In [25]:
output[:30]
Out[25]:
[tensor([0.2725], device='cuda:0'),
 tensor([0.2781], device='cuda:0'),
 tensor([0.1641], device='cuda:0'),
 tensor([0.2510], device='cuda:0'),
 tensor([0.1950], device='cuda:0'),
 tensor([0.5347], device='cuda:0'),
 tensor([0.4186], device='cuda:0'),
 tensor([0.2010], device='cuda:0'),
 tensor([0.0081], device='cuda:0'),
 tensor([0.3655], device='cuda:0'),
 tensor([0.0787], device='cuda:0'),
 tensor([0.0925], device='cuda:0'),
 tensor([0.1309], device='cuda:0'),
 tensor([0.3426], device='cuda:0'),
 tensor([0.2741], device='cuda:0'),
 tensor([0.4018], device='cuda:0'),
 tensor([0.1861], device='cuda:0'),
 tensor([0.1662], device='cuda:0'),
 tensor([0.1504], device='cuda:0'),
 tensor([0.0069], device='cuda:0'),
 tensor([0.0103], device='cuda:0'),
 tensor([0.0778], device='cuda:0'),
 tensor([0.0693], device='cuda:0'),
 tensor([0.2810], device='cuda:0'),
 tensor([0.0530], device='cuda:0'),
 tensor([0.4706], device='cuda:0'),
 tensor([0.3761], device='cuda:0'),
 tensor([0.1868], device='cuda:0'),
 tensor([0.0669], device='cuda:0'),
 tensor([0.0670], device='cuda:0')]
In [26]:
nn2_predictions = []
actual = []

for i, x in enumerate(output):
    if output[i].item() <= 0.5:
        nn2_predictions.append(0)
    else:
        nn2_predictions.append(1)
        
    if pred_targets[i].item() == 0.0:
        actual.append(0)
    elif pred_targets[i].item() == 1.0:
        actual.append(1)
In [29]:
nn2_confusion = confusion_matrix(actual, nn2_predictions)
nn2_confusion
Out[29]:
array([[2060059,  134828],
       [     40,      96]], dtype=int64)
In [30]:
nn2_precision = precision_score(actual, nn2_predictions)
nn2_precision
Out[30]:
0.0007115116658266876
In [31]:
print(classification_report(actual, nn2_predictions))
              precision    recall  f1-score   support

           0       1.00      0.94      0.97   2194887
           1       0.00      0.71      0.00       136

    accuracy                           0.94   2195023
   macro avg       0.50      0.82      0.48   2195023
weighted avg       1.00      0.94      0.97   2195023

In [32]:
nn2_false_positive_rate, nn2_true_positive_rate, threshold =\
    roc_curve(actual, nn2_predictions)
In [33]:
nn2_roc_auc = auc(nn2_false_positive_rate, nn2_true_positive_rate)
nn2_roc_auc
Out[33]:
0.8222270668148293
In [46]:
plt.title('Neural Network 2 Receiver Operating Characteristic Curve at 50 Epochs')
plt.plot(nn2_false_positive_rate, nn2_true_positive_rate, 'blue',
         label = 'AUC = %0.2f' % nn2_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/NN2 ROC AUC at 50 Epochs.png')
plt.savefig('Charts/NN2 ROC AUC at 50 Epochs.svg')
plt.show()

While 50 epochs were originally planned, the test loss consistently decreases even at the 50th epoch. Additionally, when compared to the other models, this neural network has a very high amount of true negative predictions and a moderately low amount of false positive predictions. Additional training may result in a model that outperforms even the LBFGS solved logistic regression model for this task.

In [47]:
for e in range(20):
    neural_network2.train()
    running_loss = 0
    for row, target in trainloader:
        row = row.to(device)
        target = target.to(device)

        optimizer.zero_grad()
        
        output = neural_network2(row.float())
        loss = criterion(output, target.float())
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
        
        # Reporting
        print(str(current) + " / " + str(n_train) + "({:.3f}%)".format((current / n_train) * 100), end = "\r", flush = True)
        current += len(row)
        
    else:
        neural_network2.eval()
        test_loss = 0
        
        # Turn off gradients for validation, saves memory and computations
        with torch.no_grad():
            for row, target in testloader:
                row = row.to(device)
                target = target.to(device)
                
                output = neural_network2(row.float())
                test_loss += criterion(output, target.float())

                current = 0
        
        # Calculate average losses
        train_losses.append(running_loss/len(trainloader))
        valid_loss = (test_loss/len(testloader))
        test_losses.append(valid_loss)
        
        # Print validation statistics
        print("Epoch: {}/{}.. ".format(e + 51, epochs + 20),
              "Training Loss: {:.2f}.. ".format(running_loss/len(trainloader)),
              "Test Loss: {:.2f}.. ".format(valid_loss))
        
        # Save the model if test loss has decreased
        if test_loss/len(testloader) <= test_loss_min:
            print('Test loss decreased ({:.4f} --> {:.4f}).  Saving model ...'.format(
                test_loss_min, valid_loss))
            torch.save(neural_network2.state_dict(), 'Models/Neural Network 2.pt')
            test_loss_min = valid_loss
Epoch: 51/60..  Training Loss: 0.32..  Test Loss: 0.31.. 
Test loss decreased (0.3129 --> 0.3113).  Saving model ...
Epoch: 52/60..  Training Loss: 0.32..  Test Loss: 0.31.. 
Test loss decreased (0.3113 --> 0.3111).  Saving model ...
Epoch: 53/60..  Training Loss: 0.32..  Test Loss: 0.31.. 
Test loss decreased (0.3111 --> 0.3089).  Saving model ...
Epoch: 54/60..  Training Loss: 0.31..  Test Loss: 0.31.. 
Test loss decreased (0.3089 --> 0.3066).  Saving model ...
Epoch: 55/60..  Training Loss: 0.31..  Test Loss: 0.30.. 
Test loss decreased (0.3066 --> 0.3050).  Saving model ...
Epoch: 56/60..  Training Loss: 0.31..  Test Loss: 0.30.. 
Test loss decreased (0.3050 --> 0.3037).  Saving model ...
Epoch: 57/60..  Training Loss: 0.31..  Test Loss: 0.30.. 
Test loss decreased (0.3037 --> 0.3018).  Saving model ...
Epoch: 58/60..  Training Loss: 0.31..  Test Loss: 0.30.. 
Test loss decreased (0.3018 --> 0.3009).  Saving model ...
Epoch: 59/60..  Training Loss: 0.31..  Test Loss: 0.30.. 
Test loss decreased (0.3009 --> 0.2989).  Saving model ...
Epoch: 60/60..  Training Loss: 0.30..  Test Loss: 0.30.. 
Test loss decreased (0.2989 --> 0.2973).  Saving model ...
Epoch: 61/60..  Training Loss: 0.30..  Test Loss: 0.30.. 
Test loss decreased (0.2973 --> 0.2953).  Saving model ...
Epoch: 62/60..  Training Loss: 0.30..  Test Loss: 0.29.. 
Test loss decreased (0.2953 --> 0.2938).  Saving model ...
Epoch: 63/60..  Training Loss: 0.30..  Test Loss: 0.29.. 
Test loss decreased (0.2938 --> 0.2935).  Saving model ...
Epoch: 64/60..  Training Loss: 0.30..  Test Loss: 0.29.. 
Test loss decreased (0.2935 --> 0.2918).  Saving model ...
Epoch: 65/60..  Training Loss: 0.30..  Test Loss: 0.29.. 
Test loss decreased (0.2918 --> 0.2894).  Saving model ...
Epoch: 66/60..  Training Loss: 0.30..  Test Loss: 0.29.. 
Test loss decreased (0.2894 --> 0.2878).  Saving model ...
Epoch: 67/60..  Training Loss: 0.29..  Test Loss: 0.29.. 
Test loss decreased (0.2878 --> 0.2865).  Saving model ...
Epoch: 68/60..  Training Loss: 0.29..  Test Loss: 0.29.. 
Test loss decreased (0.2865 --> 0.2864).  Saving model ...
Epoch: 69/60..  Training Loss: 0.29..  Test Loss: 0.28.. 
Test loss decreased (0.2864 --> 0.2844).  Saving model ...
Epoch: 70/60..  Training Loss: 0.29..  Test Loss: 0.28.. 
Test loss decreased (0.2844 --> 0.2834).  Saving model ...
In [48]:
plt.figure(figsize = (18, 5))
train_ax, = plt.plot(np.arange(epochs + 20), train_losses, 'r--', \
                     label = "Training Loss")
test_ax, = plt.plot(np.arange(epochs + 20), test_losses, 'b--',\
                    label = "Test Loss")
plt.title("Neural Network 2 Training and Test Losses")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.xticks(range(0, epochs + 20))
plt.grid(b = True, which = 'major', color = 'w', linewidth = 1.0)
plt.grid(b = True, which = 'minor', color = 'w', linewidth = 0.5)
plt.legend(handles = [train_ax, test_ax])
plt.savefig("Charts/NN2 Loss Plot 2.svg")
plt.savefig("Charts/NN2 Loss Plot 2.png")
In [49]:
train_losses
Out[49]:
[0.6394088806,
 0.5974827975,
 0.5615484586,
 0.5345383341,
 0.5140510356,
 0.4971041327,
 0.4837051073,
 0.4724557599,
 0.4632148664,
 0.4550733971,
 0.4483568522,
 0.4421497967,
 0.4368479946,
 0.4318463335,
 0.4273023369,
 0.4228789921,
 0.4183978202,
 0.4141231419,
 0.4100503742,
 0.4062411901,
 0.4021704196,
 0.3983474331,
 0.3946568989,
 0.3911679414,
 0.3877969839,
 0.3843898797,
 0.3811379338,
 0.3780967995,
 0.3750898181,
 0.3719848349,
 0.368968146,
 0.3658273222,
 0.3629895356,
 0.3601837461,
 0.3575650346,
 0.3548797017,
 0.3518992016,
 0.3494093663,
 0.347011814,
 0.3447187431,
 0.3424452845,
 0.3400888381,
 0.3374523245,
 0.3351643949,
 0.3328306837,
 0.3308196616,
 0.3288264187,
 0.3263858052,
 0.3239979536,
 0.3216325002,
 0.3197079353,
 0.3179992844,
 0.3160571078,
 0.3142515663,
 0.3126136929,
 0.3110215347,
 0.3094471151,
 0.3077344343,
 0.3061700857,
 0.3046334704,
 0.3029804776,
 0.3013822753,
 0.3000134525,
 0.2985723903,
 0.2972142988,
 0.2958292321,
 0.294525181,
 0.2932756564,
 0.2920163189,
 0.2907511399]
In [50]:
with open('Models/NN2_train_losses.txt', 'w') as loss:
    for epoch in train_losses:
        loss.write(str(epoch) + "\n")
In [51]:
for epoch in test_losses:
    print(epoch.item())
0.7034912109375
0.6301103234291077
0.5466606616973877
0.5088833570480347
0.4839249551296234
0.46507981419563293
0.4544140696525574
0.4438667297363281
0.4348237216472626
0.42894241213798523
0.4246806800365448
0.4199894666671753
0.4163917005062103
0.41220182180404663
0.40937915444374084
0.4061104357242584
0.403046190738678
0.4004068672657013
0.3973144590854645
0.39373621344566345
0.3901030123233795
0.3871174156665802
0.38390490412712097
0.3812478482723236
0.3775373697280884
0.3758487105369568
0.37314432859420776
0.37061068415641785
0.3670412600040436
0.36504238843917847
0.3622877299785614
0.3589939475059509
0.35675162076950073
0.3535061180591583
0.35114216804504395
0.34674540162086487
0.34468403458595276
0.34278616309165955
0.3398434817790985
0.3369307816028595
0.33598217368125916
0.3327692747116089
0.3297843337059021
0.3278554379940033
0.3247697651386261
0.3235817551612854
0.3208043873310089
0.3191089332103729
0.31777939200401306
0.31286540627479553
0.3113062083721161
0.31107133626937866
0.30892065167427063
0.3065762221813202
0.3049827516078949
0.303663969039917
0.3017641305923462
0.3008725643157959
0.29889076948165894
0.29732397198677063
0.2953140437602997
0.29375478625297546
0.2934665083885193
0.2917536795139313
0.2893546223640442
0.28776565194129944
0.2865000367164612
0.2864183485507965
0.2844005227088928
0.28341546654701233
In [52]:
with open('Models/NN2_test_losses.txt', 'w') as loss:
    for epoch in test_losses:
        loss.write(str(epoch.item()) + "\n")
In [53]:
neural_network2.eval()
output = []
pred_targets = []

with torch.no_grad():
    for rows, targets in testloader:
        rows = rows.to(device)
                
        output += neural_network2(rows.float())
        pred_targets += targets
In [54]:
nn2_predictions = []
actual = []

for i, x in enumerate(output):
    if output[i].item() <= 0.5:
        nn2_predictions.append(0)
    else:
        nn2_predictions.append(1)
        
    if pred_targets[i].item() == 0.0:
        actual.append(0)
    elif pred_targets[i].item() == 1.0:
        actual.append(1)
In [55]:
nn2_confusion = confusion_matrix(actual, nn2_predictions)
nn2_confusion
Out[55]:
array([[2055329,  139558],
       [     39,      97]], dtype=int64)
In [56]:
nn2_precision = precision_score(actual, nn2_predictions)
nn2_precision
Out[56]:
0.0006945687587268626
In [57]:
print(classification_report(actual, nn2_predictions))
              precision    recall  f1-score   support

           0       1.00      0.94      0.97   2194887
           1       0.00      0.71      0.00       136

    accuracy                           0.94   2195023
   macro avg       0.50      0.82      0.48   2195023
weighted avg       1.00      0.94      0.97   2195023

In [58]:
nn2_false_positive_rate, nn2_true_positive_rate, threshold =\
    roc_curve(actual, nn2_predictions)
In [59]:
nn2_roc_auc = auc(nn2_false_positive_rate, nn2_true_positive_rate)
nn2_roc_auc
Out[59]:
0.8248260331853076
In [60]:
plt.title('Neural Network 2 Receiver Operating Characteristic Curve at 70 Epochs')
plt.plot(nn2_false_positive_rate, nn2_true_positive_rate, 'blue',
         label = 'AUC = %0.2f' % nn2_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/NN2 ROC AUC at 70 Epochs.png')
plt.savefig('Charts/NN2 ROC AUC at 70 Epochs.svg')
plt.show()

Validation Results

In [99]:
regression = pickle.load(open('Models/Logistic Regression lbfgs.sav', 'rb'))
In [100]:
regression_accuracy = regression.score(x_valid, y_valid)
regression_accuracy
Out[100]:
0.9732312477961497
In [101]:
regression_predictions = regression.predict(x_valid)
actual = y_valid
In [102]:
regression_confusion = confusion_matrix(actual, regression_predictions)
regression_confusion
Out[102]:
array([[1068088,   29355],
       [     24,      44]], dtype=int64)
In [103]:
regression_precision = precision_score(actual, regression_predictions)
regression_precision
Out[103]:
0.001496649545902922
In [104]:
print(classification_report(actual, regression_predictions))
              precision    recall  f1-score   support

       False       1.00      0.97      0.99   1097443
        True       0.00      0.65      0.00        68

    accuracy                           0.97   1097511
   macro avg       0.50      0.81      0.49   1097511
weighted avg       1.00      0.97      0.99   1097511

In [105]:
regression_probabilities = regression.predict_proba(x_valid)
predictions = regression_probabilities[:,1]
In [106]:
regression_false_positive_rate, regression_true_positive_rate, threshold =\
    roc_curve(y_valid, predictions)
In [107]:
regression_roc_auc = auc(regression_false_positive_rate, \
                         regression_true_positive_rate)
regression_roc_auc
Out[107]:
0.8627243456996373
In [108]:
plt.title('LBFGS Logistic Regression Validation ROC Curve')
plt.plot(regression_false_positive_rate, regression_true_positive_rate, \
         'blue', label = 'AUC = %0.2f' % regression_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/LBFGS Logistic Validation ROC AUC.svg')
plt.savefig('Charts/LBFGS Logistic Validation ROC AUC.png')
plt.show()

Table 1
HDD Failure Predictive Model Testing Results

Model Sensitivity Specificity Precision Error Rate ROC AUC
Logistic Regression 0.6397 0.9732 1.1478e-3 2.68% 0.8729
Decision Tree 0.4412 0.9690 0.8829e-3 3.10% 0.6900
Random Forest 0.3603 0.9903 2.2900e-3 0.98% 0.7974
Class-Weighted Random Forest 0.4044 0.9717 0.8858e-3 2.83% 0.7998
Simple DNN 0.6176 0.9185 0.4696e-3 8.15% 0.7681
Complex DNN 0.7132 0.9364 0.6946e-3 6.36% 0.8248

The Project Limitations

A few limitations of this project exist. First, a very large amount of the dataset was made up of missing values. A second limitation that deserves caution is that the ratios of drives made by each manufacturer in the dataset is very imbalanced. No assumptions about value or reliability of the four manufacturers included in the dataset should be made from this data. A third limitation is that the dataset was extremely imbalanced in terms of the minority (failure) and majority (non-failure) classes. Though SMOTE succeeded exceptionally well at allowing predictive models to learn from the imbalanced data, it does introduce bias as the synthetically created instances of the minority classes overrepresent their information in the analysis. Finally, working computer memory was a great limitation throughout the project as the dataset is so large. This limitation prevented factor analysis of mixed data from being performed and PCA had to be selected as the alternative.

Actions Proposed and Expected Benefits

It is highly recommended that either the logistic regression model or the more complex DNN model is added to the daily HDD diagnostics checks and backup procedure pipeline. The complex DNN will successfully flag 71.3% of drives expected to fail that day and the logistic regression 64%, allowing for total backup and retirement of the drive before the failure occurs. Do note that while more sensitive to detecting failure, the DNN does have a higher false positive rate, at 6.36%, than the more conservative logistic regression at 2.68%. Until this can be completed, special care should be given to drives with higher values of SMART attributes 5, 197, and 9 to reduce data loss and complications arising from the events of HDD failure.

Once implemented, an ensemble approach between the two should be tested to further reduce the false positive rate. Furthermore, additional research is warranted beyond the scope and limitations of the project. Taking an RNN approach to the data tidying and predictive modeling will almost certainly improve the results quite significantly, as they are specifically designed for time-series data such as this.

References

Acronis. Knowledge Base 9105. S.M.A.R.T. Attribute: Reallocated Sectors Count | Knowledge Base. https://kb.acronis.com/content/9105.

Acronis. Knowledge Base 9109. S.M.A.R.T. Attribute: Power-On Hours (POH) | Knowledge Base. https://kb.acronis.com/content/9109.

Acronis. Knowledge Base 9128. S.M.A.R.T. Attribute: Load Cycle Count; Load/Unload Cycle Count | Knowledge Base. https://kb.acronis.com/content/9128.

Acronis. Knowledge Base 9133. S.M.A.R.T. Attribute: Current Pending Sector Count | Knowledge Base. https://kb.acronis.com/content/9133.

Acronis. Knowledge Base 9152. S.M.A.R.T. Attribute: Load/Unload Cycle Count | Knowledge Base. https://kb.acronis.com/content/9152.

Backblaze. (2020). data_Q4_2019. San Mateo, CA; Backblaze. Klein, A. (2015, April 16). SMART Hard Drive Attributes: SMART 22 is a Gas Gas Gas. Backblaze Blog | Cloud Storage & Cloud Backup. https://www.backblaze.com/blog/smart-22-is-a-gas-gas-gas/.

Painchaud, A. (2018, October 31). 8 Reasons on How Data Loss Can Negatively Impact Your Bussiness. https://www.sherweb.com/blog/security/statistics-on-data-loss/.

Sanders, J. (2018, November 13). Western Digital spins down HGST and Tegile brands in hard disk market shuffle. TechRepublic. https://www.techrepublic.com/article/western-digital-spins-down-hgst-and-tegile-brands-in-hard-disk-market-shuffle/.

Weiss, G. M. (2013). Foundations of Imbalanced Learning. Imbalanced Learning, 13–41. https://doi.org/10.1002/9781118646106.ch2