This notebook is so large and works with so much data that it was run in multiple sessions, with the kernel reset each time for memory management. As such, the code cells have execution numbers that are not perfectly in order. Though these do not match up perfectly in this version, the code was, and should only be, executed from top to bottom.
Data helps businesses solve problems, make better decisions, and understand consumers, but a great deal of data must be stored and kept available to enable these benefits. Hard drive failure is the most common form of data loss, one of the most impactful problems that businesses can experience today, as simple drive recovery can cost up to $7,500 per drive (Painchaud, 2018). For cloud-based data centers, keeping multitudes of businesses’ data intact for their own operations is crucial. Being able to predict which hard drives are at the highest risk of failure, based on an understanding of combinations of routine diagnostic test results, is an ideal solution: failing drives can be backed up and replaced before the data is lost.
The dataset used is Backblaze’s 4th quarter data from 2019 (Backblaze, 2020). All of the needed data is contained within the .zip file that Backblaze provides to the public as .csv files split by day.
The dataset contains .csv files for each day of its corresponding quarter, from 2019-10-01 to 2019-12-31. As an example, the subsection of the dataset for 2019-10-01 contains 115,259 rows of data. However, as this data contains recorded readings from a live data center, the number of hard drives, and thus rows, changed daily as failed drives were taken out and new drives were installed. The 129 column attributes are Date, Serial Number, Model Number, Capacity, Failure, 62 Self-Monitoring, Analysis and Reporting Technology (SMART) test results, and 62 normalized values of those SMART test results. The Failure attribute is the dependent variable of this study and is a qualitative binary categorical variable. The Date, Serial Number, and Model are nominal qualitative independent variables. Finally, the SMART value columns are continuous quantitative independent variables.
As stated in Backblaze’s Hard Drive Data and Stats page (Backblaze, n.d.), this dataset is free for any use as long as Backblaze is cited as the data source, users accept that they are solely responsible for how the data is used, and the data is not sold to anybody, as it is publicly available.
Python, pandas, and the scikit-learn stack are extensively used for the loading, tidying, manipulation, and analysis of the datasets. PyTorch is used for all neural network related tasks of the analysis and model production. Matplotlib and seaborn are used to create charts and graphics for analysis and presentation of project findings. A needed algorithm, namely Fisher's exact test for contingency tables larger than 2x2, is unavailable in the scikit-learn ecosystem; R's stats package is used for this instead, with rpy2 embedding the R code in the Python process. Prince is used for factor analysis, and imbalanced-learn is used for the implementation of SMOTE.
Like R and unlike SAS, all of these packages are easily available, free, and open-source with Python. These methods have been chosen over R for ease of explanation, as Python code is often understood more readily than R, and because of the potential of integrating this project directly into a program or software for future use. While R is highly specialized for statistics and mathematics, Python is a general-purpose programming language with specialized libraries for the needed tools, and this facilitates project expansion in the future.
Synthetic Minority Over-Sampling Technique (SMOTE) is used specifically to handle the imbalanced classes for training and testing splits. PCA is used for dimensionality reduction. Predictor variables are examined through correlation coefficients and Fisher's exact test, as well as graphed univariate and bivariate distributions. A logistic regression model and a decision tree model are examined along with the results of the PCA to find predictor variables as well. For building a predictive model for future use, the logistic regression model, a random forest ensemble model, and neural networks are compared to determine which can produce the most useful model.
As HDD failure is an extremely rare event, the dependent variable class is extremely imbalanced and failing to control for the imbalance through techniques like boosting or oversampling would lead to ineffective models. As the dependent variable is a Boolean value, this task is a binary classification task. Logistic regression is an ideal predictive model for binary classification tasks that gives a probability for classification while also having a simplistic interpretation of coefficients that can be used for feature selection. Decision trees are also simple to understand and work well for classification tasks. Given the complexity of the various fields in the dataset, a more complicated model may work better for predictive power. Random forests and neural networks work very well for classification tasks under these circumstances.
The key project outcomes are a deep understanding of the risk of hard drive failure based on the results of SMART test values regardless of manufacturer, and predictive models that will be able to flag hard drives that are at high risk of failing. The understanding of the risk of failure based on test values will empower better business decisions by optimizing the choice of storage used based on projected lifetime. The predictive models will allow the business to proactively back up data onto new storage devices before failure, while also allowing hard drives to continue working closer to their end of life, minimizing the waste of constantly replacing hard drives before replacement is needed. The combination of these two products will also enable the future creation of a more automated system that protects data from hard drive failure.
The dataset provided by Backblaze is made up of 92 .csv files, 1 for each day in the 2019 4th quarter, totaling 3.13GB of text data. As hard drive failure is an extremely rare event, all of these days will need to be considered together in order to have enough failures to draw conclusions. The project begins by combining all parts of the dataset from their .csv files into a single file.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import glob
import os
import csv
import scipy.stats as scs
import gc
import pickle
# Jupyter magic commands for displaying plot objects in the notebook and
# setting float display limits.
%matplotlib inline
%precision %.10g
sns.set_style("dark")
if not os.path.isfile('q4_combined.csv'):
    # Create a list of dataset files in the current working directory.
    files = glob.glob(os.path.join(os.getcwd(), "2019-*.csv"))
    # Combine the files into a single file, writing the column header from
    # only the first .csv file.
    header_written = False
    with open('q4_combined.csv', 'w') as combined:
        for file in files:
            with open(file, 'r') as part:
                if not header_written:
                    for row in part:
                        combined.write(row)
                    header_written = True
                else:
                    # Skip the header row of every subsequent file.
                    next(part)
                    for row in part:
                        combined.write(row)
with open('q4_combined.csv') as file:
    for (count, _) in enumerate(file):
        pass
# count ends at the zero-based index of the last line, which equals the
# number of data rows once the header line is discounted.
row_count = count
print("Rows: " + str(row_count))
df = pd.read_csv('q4_combined.csv')
df.info()
Out of 10,991,209 hard drive days, there were only 678 failures, which gives a failure rate of 0.006169%.
df['failure'].value_counts()
nonfailed, failed = df['failure'].value_counts()
failure_rate = failed / nonfailed
print("Failure Rate: " + str("{:.6f}".format(failure_rate * 100)) + "%")
Weiss (2013) defined the imbalance ratio as the ratio between majority and minority classes with a modestly imbalanced dataset having an imbalance ratio of 10:1, and extremely imbalanced datasets as having an imbalance ratio of 1000:1 or greater (pg. 15). This dataset has an imbalance ratio of approximately 16,210:1 and as such will require very careful cultivation in order for any predictive model to successfully learn from. The rarity of the positive failure cases is also the reason that the entire 4th quarter dataset is required.
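The imbalance ratio itself follows directly from the counts reported above; the sketch below uses the dataset's stated totals of 10,991,209 drive days and 678 failures:

```python
# Compute the majority:minority imbalance ratio (Weiss, 2013) from the
# stated totals of this dataset.
total_days = 10_991_209
failed = 678
nonfailed = total_days - failed
imbalance_ratio = nonfailed / failed
print("Imbalance ratio: " + "{:,.0f}".format(imbalance_ratio) + ":1")
```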
Unfortunately, this combined file requires too much memory to load all at once given current hardware constraints: it needs 13.5GB for the data alone, not including the memory needed for the OS and other software, nor working memory for calculations.
# Return the summed memory usage of each column in bytes.
memory_usage = sum(df.memory_usage(deep=True))
memory_usage
print(str(memory_usage / 1000) + "KB")
print(str("{:.2f}".format(memory_usage / 1000000)) + "MB")
print(str("{:.2f}".format(memory_usage / 1000000000)) + "GB")
As this dataset contains both raw and normalized values for all of the SMART values, a simple way to deal with the memory issues is to divide the dataset into a raw form and a normalized form.
list(df.columns.values)
raw_cols = []
for col in df.columns.values:
    if "normalized" not in col:
        raw_cols.append(col)
print(raw_cols)
norm_cols = []
for col in df.columns.values:
    if "raw" not in col:
        norm_cols.append(col)
print(norm_cols)
if not os.path.isfile('q4_raw.csv'):
df.to_csv('q4_raw.csv', columns = raw_cols, index=False)
if not os.path.isfile('q4_normalized.csv'):
df.to_csv('q4_normalized.csv', columns = norm_cols, index=False)
try:
    del [df, nonfailed, failed, failure_rate, memory_usage, raw_cols, norm_cols]
    print("Memory cleared successfully.")
except NameError:
    pass
The considerably smaller raw value subset of data is the main dataset of this project. As with nearly all real-world datasets, this one needs considerable cleaning and tidying before it can be used for analysis.
df = pd.read_csv('q4_raw.csv')
df.info()
null_values = df.isna().sum().sum()
null_values
len(df.columns)
n_rows = len(df)
n_rows
n_values = n_rows * len(df.columns)
n_values
null_values / n_values
# Calculate the number of values in the full dataset (raw and normalized columns).
n_rows * 129
df.head(30)
# Return the memory usage of each column in bytes.
print(df.memory_usage(deep=True))
# Total number of failures
df.failure.sum()
# Average number of failures per day
df.failure.sum() / len(df.date.unique())
All SMART test columns have null values in some rows. The dataset notes state that this stems from differing manufacturers' standards, despite the standardized nature of SMART tests.
for col in df.columns.values:
    print(col + ": " + str(df[col].isnull().values.any()))
Deriving the manufacturer from the model column will allow the dataset to be easily divided by manufacturer.
df.model.unique()
The "DELLBOSS VD" model value seems to be the only value potentially out of place.
df.loc[(df['model'] == "DELLBOSS VD") &
(df['date'] == "2019-10-01")]
None of the SMART values exist for this hard drive model, but 60 of the drives have this model value. Additionally, no failures for this model exist in the dataset. Any row with this model value should be removed from the training data before any predictive analysis. Some searching online suggests that it may be a RAID controller. (https://www.dell.com/support/manuals/au/en/aubsd1/boss-s-1/boss_s1_ug_publication/overview?guid=guid-b20ef25b-b7e3-40f2-b7cd-e497358cd10a&lang=en-us)
df.loc[(df['model'] == "DELLBOSS VD") &
(df['failure'] == 1)]
Additionally, the "Seagate SSD" model seems to be missing information. Like the "DELLBOSS VD" rows, these also contain no failures and will need to be removed before predictive analysis is performed.
df.loc[(df['model'] == "Seagate SSD") &
(df['date'] == "2019-10-01")]
df.loc[(df['model'] == "Seagate SSD") &
(df['failure'] == 1)]
The rows not appropriate for analysis are deleted.
df.drop(df[(df['model'] == "DELLBOSS VD") | \
(df['model'] == "Seagate SSD")].index, axis = 0, inplace = True)
n_rows = len(df)
n_rows
# model: ["Manufacturer", "New Model"]
manufacturer_dict = {
'ST4000DM000': ["Seagate", "ST4000DM000"],
'ST12000NM0007': ["Seagate", "ST12000NM0007"],
'HGST HMS5C4040ALE640': ["HGST", "HMS5C4040ALE640"],
'ST8000NM0055': ["Seagate", "ST8000NM0055"],
'ST8000DM002': ["Seagate", "ST8000DM002"],
'HGST HMS5C4040BLE640': ["HGST", "HMS5C4040BLE640"],
'HGST HUH721212ALN604': ["HGST", "HUH721212ALN604"],
'TOSHIBA MG07ACA14TA': ["Toshiba", "MG07ACA14TA"],
'HGST HUH721212ALE600': ["HGST", "HUH721212ALE600"],
'TOSHIBA MQ01ABF050': ["Toshiba", "MQ01ABF050"],
'ST500LM030': ["Seagate", "ST500LM030"],
'ST6000DX000': ["Seagate", "ST6000DX000"],
'ST10000NM0086': ["Seagate", "ST10000NM0086"],
'DELLBOSS VD': ["Dell", "DELLBOSS VD"],
'TOSHIBA MQ01ABF050M': ["Toshiba", "MQ01ABF050M"],
'WDC WD5000LPVX': ["Western Digital", "WD5000LPVX"],
'ST500LM012 HN': ["Seagate", "ST500LM012 HN"],
'HGST HUH728080ALE600': ["HGST", "HUH728080ALE600"],
'TOSHIBA MD04ABA400V': ["Toshiba", "MD04ABA400V"],
'TOSHIBA HDWF180': ["Toshiba", "HDWF180"],
'ST8000DM005': ["Seagate", "ST8000DM005"],
'Seagate SSD': ["Seagate", "Seagate SSD"],
    'HGST HUH721010ALE600': ["HGST", "HUH721010ALE600"],
'ST4000DM005': ["Seagate", "ST4000DM005"],
'WDC WD5000LPCX': ["Western Digital", "WD5000LPCX"],
'HGST HDS5C4040ALE630': ["HGST", "HDS5C4040ALE630"],
'ST500LM021': ["Seagate", "ST500LM021"],
'Hitachi HDS5C4040ALE630': ["HGST", "HDS5C4040ALE630"],
'HGST HUS726040ALE610': ["HGST", "HUS726040ALE610"],
'Seagate BarraCuda SSD ZA500CM10002': ["Seagate", "ZA500CM10002"],
'ST12000NM0117': ["Seagate", "ST12000NM0117"],
'Seagate BarraCuda SSD ZA2000CM10002': ["Seagate", "ZA2000CM10002"],
'Seagate BarraCuda SSD ZA250CM10002': ["Seagate", "ZA250CM10002"],
'TOSHIBA HDWE160': ["Toshiba", "HDWE160"],
'WDC WD5000BPKT': ["Western Digital", "WD5000BPKT"],
'ST6000DM001': ["Seagate", "ST6000DM001"],
'WDC WD60EFRX': ["Western Digital", "WD60EFRX"],
'ST8000DM004': ["Seagate", "ST8000DM004"],
'HGST HMS5C4040BLE641': ["HGST", "HMS5C4040BLE641"],
    'ST1000LM024 HN': ["Seagate", "ST1000LM024 HN"],
'ST6000DM004': ["Seagate", "ST6000DM004"],
'ST12000NM0008': ["Seagate", "ST12000NM0008"],
'ST16000NM001G': ["Seagate", "ST16000NM001G"]
}
# Change the model column into Manufacturer and Model columns.
df['model_temp'] = df['model']
df['manufacturer'] = ''
df['manufacturer'] = df['model_temp'].map(lambda x: manufacturer_dict[x][0])
df['model'] = df['model_temp'].map(lambda x: manufacturer_dict[x][1])
df.drop(['model_temp'], axis=1, inplace=True)
df.head()
Given the size of the dataset, a few minor changes to the columns may free up a considerable amount of memory. The date and capacity_bytes columns are two easy places to improve.
# date
df['date'].value_counts()
df['date'][0:5]
before_mem = df['date'].memory_usage()
before_mem
df['date'] = df['date'].str[-5:]
df.head()
df['date'] = df['date'].astype('category')
df['date'][0:5]
after_mem = df['date'].memory_usage()
after_mem
memory_saved = before_mem - after_mem
print("Memory saved on the date column: " + str(np.around((memory_saved / 1024 ** 2), 2)) + "MB")
# model
before_mem = df['model'].memory_usage()
df['model'] = df['model'].astype('category')
after_mem = df['model'].memory_usage()
memory_saved = before_mem - after_mem
print("Memory saved on the model column: " + str(np.around((memory_saved / 1024 ** 2), 2)) + "MB")
# failure
before_mem = df['failure'].memory_usage(deep = True)
df['failure'] = df['failure'].astype('bool')
after_mem = df['failure'].memory_usage(deep = True)
memory_saved = before_mem - after_mem
print("Memory saved on the failure column: " + str(np.around((memory_saved / 1024 ** 2), 2)) + "MB")
# capacity_bytes
before_memory = df['capacity_bytes'].memory_usage(deep = True)
before_memory
Here we can see that 1108 drive days have an error value (-1) rather than their actual capacity. These rows may need to be removed, but the error may also be an excellent signal of a failing drive.
df.loc[df["capacity_bytes"] == -1]["manufacturer"].value_counts()
sns.countplot(x = df.loc[df["capacity_bytes"] == -1]["capacity_bytes"], \
hue = df["failure"])
Unfortunately, none of the drives experiencing this error went on to fail, so the error cannot serve as a failure signal and may introduce problems in the final model. As it affects only 0.01% of the dataset, removing the affected rows seems best.
# Calculate the percentage of the dataset that is affected by this error.
error_rows = (df['capacity_bytes'] == -1).sum()
str(np.around(((error_rows / n_rows) * 100), 2)) + "%"
df.drop(df[(df['capacity_bytes'] == -1)].index, axis = 0, inplace = True)
n_rows = len(df)
n_rows
df['capacity_bytes'].value_counts()
The capacity_bytes column is converted from bytes to terabytes to condense the information on disk.
df['capacity_TB'] = np.around((df['capacity_bytes']/(1000*1000*1000*1000)), \
decimals = 2)
df.head()
df['capacity_TB'].value_counts()
df['capacity_TB'] = df['capacity_TB'].astype('category')
after_mem = df['capacity_TB'].memory_usage()
memory_saved = before_memory - after_mem
print("Memory saved on the capacity column: " + \
str(np.around((memory_saved / 1024 ** 2), 2)) + "MB")
df.drop(['capacity_bytes'], axis=1, inplace=True)
df.head()
fail_df = pd.crosstab(df["manufacturer"], df["failure"])
fail_df
fail_df['Rate'] = fail_df[1] / (fail_df[0] + fail_df[1])
fail_df
corr_df = df.corr()
corr_df['failure']
With these things finished, the univariate distributions can be examined to gain a better sense of the data.
The first column, date, shows some sort of testing or operational failure on November 5th.
plt.figure(figsize = (20, 10))
plt.title('Number of Drives in Operation per Day (Q4 2019)')
g = sns.countplot(df['date'], data = df)
g.set_xticklabels(g.get_xticklabels(), rotation = 90)
g.figure.savefig("Charts/Date Distribution.png")
g.figure.savefig("Charts/Date Distribution.svg")
Drive capacities are mostly 4, 8, and 12 TB, likely coinciding with large investments in new drives for the datacenter and possibly alongside the price lowering of specific models.
plt.figure(figsize = (5, 5))
plt.title('Capacity of Drives')
g = sns.countplot(df['capacity_TB'], data = df)
g.set_xticklabels(g.get_xticklabels(), rotation = 90)
for p in g.patches:
    percentage = "{0:.2f}".format((p.get_height() / n_rows) * 100) + "%"
    g.annotate(percentage, (p.get_x() + p.get_width() / 2., p.get_height()),
               ha = 'center', va = 'center', xytext = (0, 7),
               textcoords = 'offset points')
g.figure.savefig("Charts/Capacity Distribution.svg")
g.figure.savefig("Charts/Capacity Distribution.png")
The manufacturer of the most drives in this dataset is Seagate at 72.59%. HGST is the second highest at 24.24%. Western Digital is the least represented manufacturer in the dataset with only 0.23%, but as HGST was acquired by Western Digital in 2012 (Sanders, 2018), the drives in this dataset will likely be quite similar between the two manufacturers given the seven-year timespan between then and the time of dataset recording and creation. Finally, Toshiba is the other manufacturer, with 2.94% of the dataset. This amount is quite low and may make it difficult to accurately predict their drives in comparison.
plt.figure(figsize = (5, 5))
plt.title('Manufacturers of Drives')
g = sns.countplot(df['manufacturer'], data = df)
g.set_xticklabels(g.get_xticklabels(), rotation = 90)
for p in g.patches:
    percentage = "{0:.2f}".format((p.get_height() / n_rows) * 100) + "%"
    g.annotate(percentage, (p.get_x() + p.get_width() / 2., p.get_height()),
               ha = 'center', va = 'center', xytext = (0, 7),
               textcoords = 'offset points')
g.figure.savefig("Charts/Manufacturer Distribution.svg")
g.figure.savefig("Charts/Manufacturer Distribution.png")
The SMART values vary greatly across the many different types of drives in this dataset. Before the columns can be graphed appropriately, the NaN/null values need to be examined. The missing data is most likely related to the hard drive's manufacturer or model.
sns.distplot(df['smart_1_raw'])
plt.grid(True)
plt.show()
# Pandas styling function
def highlight_nans(val):
    color = 'red' if val == True or val > 0 else 'black'
    return 'color: %s' % color
Every single SMART figure column has null values.
pd.set_option('display.max_rows', 70)
pd.set_option('display.max_columns', 75)
df.isna().any()
manu_nan_df = pd.DataFrame()
for manu in df['manufacturer'].unique():
    manu_nan_df[manu] = df.loc[df['manufacturer'] == manu].isna().sum()
manu_nan_df.style.applymap(highlight_nans)
model_nan_df = pd.DataFrame()
for model in df['model'].unique():
    model_nan_df[model] = df.loc[df['model'] == model].isna().sum()
model_nan_df.style.applymap(highlight_nans)
model_nan_percent_df = pd.DataFrame()
for model in df['model'].unique():
    model_nan_percent_df[model] = (df.loc[df['model'] == model].isna().sum()) \
        / len(df.loc[df['model'] == model])
model_nan_percent_df
plt.figure(figsize = (20, 20))
plt.title('Model NaN Value Proportion by Hard Drive Model')
g = sns.heatmap(model_nan_percent_df, linewidths=0.2)
g.figure.savefig("Charts/Model NaN Heatmap.svg")
g.figure.savefig("Charts/Model NaN Heatmap.png")
description_df = df.describe()
description_df
The count row is equivalent to the number of non-null values. If a column has a count of 0, every single value in it is NaN or null, and should be deleted.
description_df.iloc[0]
smart_13_raw, smart_15_raw, smart_179_raw, smart_181_raw, smart_182_raw, smart_201_raw, smart_250_raw, smart_251_raw, smart_252_raw, and smart_255_raw are all empty in this dataset, as all rows have NaN values in these columns.
count_df = pd.DataFrame()
count_df['count'] = description_df.iloc[0]
count_df
# Pandas styling function
def highlight_count_nans1(val):
    if val >= 66.6:
        color = 'green'
    elif val >= 33.3 and val < 66.6:
        color = 'yellow'
    else:
        color = 'red'
    return 'color: %s' % color
# Pandas styling function
def highlight_count_nans2(val):
    green = int((val * 255) / 100)
    red = int(255 - green)
    rgb = (red, green, 0)
    # Convert to hexadecimal for pandas styling
    color = '#%02x%02x%02x' % rgb
    return 'color: %s' % color
count_df['perc_not_nan'] = (count_df['count'] / n_rows) * 100
count_df
count_df.style.applymap(highlight_count_nans1, subset = ['perc_not_nan'])
count_df['bar'] = count_df['perc_not_nan']
count_df.style.\
applymap(highlight_count_nans2, subset = ['perc_not_nan']).\
bar(subset=['bar'], color='#d65f5f')
empty_columns = []
columns_to_examine = []
for row in count_df.iterrows():
    if row[1][0] == 0.0:
        empty_columns.append(row[0])
    elif row[1][0] < (0.8 * n_rows):
        columns_to_examine.append(row[0])
empty_columns
columns_to_examine
before_mem = df.memory_usage(deep=True).sum()
df.drop(empty_columns, axis=1, inplace=True)
after_mem = df.memory_usage(deep=True).sum()
memory_saved = before_mem - after_mem
print("Memory saved on empty column removal: " + \
str(np.around((memory_saved / 1024 ** 2), 2)) + "MB")
# Free up memory for the next computation.
try:
    del [empty_columns, manufacturer_dict, before_mem, after_mem, memory_saved,
         fail_df, corr_df]
    print("Memory successfully cleared.")
except NameError:
    pass
# Save the current form of the dataframe for restoration after the following calculations are performed.
if not os.path.isfile('pre_viz_df.csv'):
    df.to_csv("pre_viz_df.csv", index = False)
df = pd.read_csv('pre_viz_df.csv')
viz_df = df.drop(['date', 'serial_number', 'failure', 'model', 'manufacturer', 'capacity_TB'], axis = 1)
viz_df.columns
# Free up memory for the next computation.
try:
    del df
    print("Memory successfully cleared.")
except NameError:
    pass
# Melt the df in chunks as df.melt() will take far too much memory.
pivot_list = list()
chunk_size = 250000
for i in range(0, len(viz_df), chunk_size):
    row_pivot = viz_df.iloc[i: i + chunk_size].melt()
    pivot_list.append(row_pivot)
melted = pd.concat(pivot_list)
del pivot_list
melted[0:30]
# Free up memory for the next computation.
try:
    del viz_df
    print("Memory successfully cleared.")
except NameError:
    pass
gc.collect()
g = sns.FacetGrid(
melted,
col = 'variable',
hue = 'value',
sharey = 'row',
sharex = 'col',
col_wrap = 7,
legend_out = True,
)
g = g.map(sns.distplot).add_legend()
plt.subplots_adjust(top = 0.9)
g.fig.suptitle('Univariate Continuous Variable Distributions')
g.savefig("Charts/Univariate Distributions.svg")
g
Unfortunately, this operation takes too much memory to do in this manner. Each column will have to be graphed separately and then the graphs combined into a single graphic for the same effect.
# Reset to the dataframes and memory allocations from before the graphing attempts.
try:
    del melted
except NameError:
    pass
df = pd.read_csv('pre_viz_df.csv')
df.head()
sns.distplot(df['smart_1_raw'])
sns.distplot(df['smart_2_raw'])
sns.distplot(df['smart_3_raw'], kde = False)
sns.distplot(df['smart_4_raw'])
sns.distplot(df['smart_5_raw'], kde = False)
sns.distplot(df['smart_7_raw'])
sns.distplot(df['smart_8_raw'])
sns.distplot(df['smart_9_raw'])
sns.distplot(df['smart_10_raw'], kde = False)
sns.distplot(df['smart_11_raw'])
sns.distplot(df['smart_12_raw'])
sns.distplot(df['smart_16_raw'])
sns.distplot(df['smart_17_raw'])
sns.distplot(df['smart_18_raw'], kde = False)
sns.distplot(df['smart_22_raw'], kde = False)
sns.distplot(df['smart_23_raw'])
sns.distplot(df['smart_24_raw'])
sns.distplot(df['smart_168_raw'])
sns.distplot(df['smart_170_raw'])
sns.distplot(df['smart_173_raw'], kde = False)
sns.distplot(df['smart_174_raw'])
sns.distplot(df['smart_177_raw'])
sns.distplot(df['smart_183_raw'], kde_kws={'bw':0.1})
sns.distplot(df['smart_184_raw'], kde = False)
sns.distplot(df['smart_187_raw'], kde = False)
sns.distplot(df['smart_188_raw'], kde = False)
sns.distplot(df['smart_189_raw'], kde = False)
sns.distplot(df['smart_190_raw'])
sns.distplot(df['smart_191_raw'])
sns.distplot(df['smart_192_raw'])
sns.distplot(df['smart_193_raw'])
sns.distplot(df['smart_194_raw'], kde = False)
sns.distplot(df['smart_195_raw'], kde = False)
sns.distplot(df['smart_196_raw'], kde = False)
sns.distplot(df['smart_197_raw'], kde = False)
sns.distplot(df['smart_198_raw'], kde = False)
sns.distplot(df['smart_199_raw'], kde = False)
sns.distplot(df['smart_200_raw'], kde = False)
sns.distplot(df['smart_218_raw'], kde = False)
sns.distplot(df['smart_220_raw'])
sns.distplot(df['smart_222_raw'])
sns.distplot(df['smart_223_raw'], kde = False)
sns.distplot(df['smart_224_raw'], kde = False)
sns.distplot(df['smart_225_raw'], kde = False)
sns.distplot(df['smart_226_raw'])
sns.distplot(df['smart_231_raw'], kde = False)
sns.distplot(df['smart_232_raw'])
sns.distplot(df['smart_233_raw'])
sns.distplot(df['smart_235_raw'],kde = False)
sns.distplot(df['smart_240_raw'])
sns.distplot(df['smart_241_raw'], kde = False)
sns.distplot(df['smart_242_raw'], kde = False)
sns.distplot(df['smart_254_raw'], kde = False)
fig, axes = plt.subplots(7, 8, figsize = (50, 40))
row = 0
col = 0
for df_col in ['smart_1_raw', 'smart_2_raw', 'smart_3_raw', 'smart_4_raw',
               'smart_5_raw', 'smart_7_raw', 'smart_8_raw', 'smart_9_raw',
               'smart_10_raw', 'smart_11_raw', 'smart_12_raw', 'smart_16_raw',
               'smart_17_raw', 'smart_18_raw', 'smart_22_raw', 'smart_23_raw',
               'smart_24_raw', 'smart_168_raw', 'smart_170_raw', 'smart_173_raw',
               'smart_174_raw', 'smart_177_raw', 'smart_183_raw', 'smart_184_raw',
               'smart_187_raw', 'smart_188_raw', 'smart_189_raw', 'smart_190_raw',
               'smart_191_raw', 'smart_192_raw', 'smart_193_raw', 'smart_194_raw',
               'smart_195_raw', 'smart_196_raw', 'smart_197_raw', 'smart_198_raw',
               'smart_199_raw', 'smart_200_raw', 'smart_218_raw', 'smart_220_raw',
               'smart_222_raw', 'smart_223_raw', 'smart_224_raw', 'smart_225_raw',
               'smart_226_raw', 'smart_231_raw', 'smart_232_raw', 'smart_233_raw',
               'smart_235_raw', 'smart_240_raw', 'smart_241_raw', 'smart_242_raw',
               'smart_254_raw']:
    if col == 8:
        row += 1
        col = 0
    sns.distplot(df[df_col], ax = axes[row, col],
                 kde = False, norm_hist = False)
    col += 1
axes[6, 5].set_axis_off()
axes[6, 6].set_axis_off()
axes[6, 7].set_axis_off()
plt.subplots_adjust(top = 0.90)
fig.suptitle("Distribution of Raw SMART Values", fontsize = 96, y = 0.95)
fig.savefig("Charts/SMART Distributions.svg")
fig.savefig("Charts/SMART Distributions.png")
fig, axes = plt.subplots(7, 8, figsize = (50, 40))
row = 0
col = 0
for df_col in ['smart_1_raw', 'smart_2_raw', 'smart_3_raw', 'smart_4_raw',
               'smart_5_raw', 'smart_7_raw', 'smart_8_raw', 'smart_9_raw',
               'smart_10_raw', 'smart_11_raw', 'smart_12_raw', 'smart_16_raw',
               'smart_17_raw', 'smart_18_raw', 'smart_22_raw', 'smart_23_raw',
               'smart_24_raw', 'smart_168_raw', 'smart_170_raw', 'smart_173_raw',
               'smart_174_raw', 'smart_177_raw', 'smart_183_raw', 'smart_184_raw',
               'smart_187_raw', 'smart_188_raw', 'smart_189_raw', 'smart_190_raw',
               'smart_191_raw', 'smart_192_raw', 'smart_193_raw', 'smart_194_raw',
               'smart_195_raw', 'smart_196_raw', 'smart_197_raw', 'smart_198_raw',
               'smart_199_raw', 'smart_200_raw', 'smart_218_raw', 'smart_220_raw',
               'smart_222_raw', 'smart_223_raw', 'smart_224_raw', 'smart_225_raw',
               'smart_226_raw', 'smart_231_raw', 'smart_232_raw', 'smart_233_raw',
               'smart_235_raw', 'smart_240_raw', 'smart_241_raw', 'smart_242_raw',
               'smart_254_raw']:
    if col == 8:
        row += 1
        col = 0
    try:
        sns.distplot(df[df_col], ax = axes[row][col], norm_hist = True)
    except:
        sns.distplot(df[df_col], kde_kws = {'bw': 0.1}, ax = axes[row][col],
                     norm_hist = True)
    col += 1
axes[6, 5].set_axis_off()
axes[6, 6].set_axis_off()
axes[6, 7].set_axis_off()
plt.subplots_adjust(top = 0.90)
fig.suptitle("Distribution and KDE of Raw SMART Values", fontsize = 96, y = 0.95)
fig.savefig("Charts/SMART Distributions KDE.svg")
fig.savefig("Charts/SMART Distributions KDE.png")
# Free up memory for the next section.
try:
    del [highlight_nans, manu_nan_df, model_nan_df, model_nan_percent_df,
         description_df, count_df, highlight_count_nans1,
         highlight_count_nans2, empty_columns]
    print("Memory successfully cleared.")
except NameError:
    pass
gc.collect()
With some dataset tidying complete, the final major adjustment needed before analysis can be performed is dealing with the NaN values. The rows or columns containing them can be removed, or the missing values can be filled in through interpolation or estimation.
columns_to_examine = ['smart_13_raw', 'smart_15_raw', 'smart_179_raw',
'smart_181_raw', 'smart_182_raw', 'smart_201_raw',
'smart_250_raw', 'smart_251_raw', 'smart_252_raw',
'smart_255_raw']
columns_to_examine
#### Memory Management and Reloading Checkpoint
df = pd.read_csv('pre_viz_df.csv')
n_rows = len(df)
df['date'] = df['date'].astype('category')
df['model'] = df['model'].astype('category')
df['failure'] = df['failure'].astype('bool')
df['manufacturer'] = df['manufacturer'].astype('category')
df.isnull().sum().sort_values()
The first five mostly complete columns all have two NaNs, the result of two rows that have no raw SMART values at all. Both drives failed, making them quite important for predicting future failure; however, the lack of data makes the rows useless in their current form.
The most likely scenario is that both drives failed just before diagnostics were collected. As such, these two rows will be deleted, and each drive's row from the previous day will be updated to mark the failure on that day.
df.loc[df['smart_1_raw'].isnull() & df['smart_192_raw'].isnull() & \
df['smart_9_raw'].isnull() & df['smart_12_raw'].isnull() & \
df['smart_194_raw'].isnull()]
df.iloc[4632946]
df.iloc[4797700]
df.loc[(df['serial_number'] == 'ZJV00DR4') & (df['date'] == '11-09')]
df.at[4514189, 'failure'] = 1
df.iloc[4514189]
df.loc[(df['serial_number'] == 'ZHZ3M097') & (df['date'] == '11-10')]
df.at[4678156, 'failure'] = 1
df.iloc[4678156]
n_rows
df.drop(df.index[[4797700, 4632946]], inplace = True)
df.iloc[[4797700, 4632946]]
The next group of columns all have 8792 rows with NaNs, ignoring the 2 rows just removed. Notably, all of these columns share the same problematic rows.
df_8794 = df.loc[df['smart_3_raw'].isnull() & df['smart_4_raw'].isnull() & \
df['smart_5_raw'].isnull() & df['smart_7_raw'].isnull() & \
df['smart_10_raw'].isnull() & df['smart_197_raw'].isnull() & \
df['smart_198_raw'].isnull() & df['smart_199_raw'].isnull()]
This subset of drives is all manufactured by Seagate and consists of 3 size variations of the same model line. There is no updated model from this line in the dataset to interpolate values from.
df_8794['manufacturer'].value_counts()
df_8794['model'].value_counts()
df_8794['capacity_TB'].value_counts()
df_8794['serial_number'].value_counts()
df_8794['serial_number'].value_counts().mean()
df_8794['failure'].value_counts()
[item for i, item in enumerate(df['model'].unique()) if "ZA" in item]
Interpolating mean values from the same manufacturer, Seagate, and the models' respective capacity_TB categories would be a good way to estimate the missing values if enough data exists.
Additionally, creating a boolean column to flag interpolated data as missing may help the predictive models account for it.
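A minimal sketch of that flag idea, on a toy column (the `_was_missing` suffix is this example's naming convention, not a column from the dataset):

```python
import pandas as pd
import numpy as np

df_example = pd.DataFrame({'smart_3_raw': [1150.0, np.nan, 1132.0, np.nan]})

# Flag the rows *before* interpolation, so the models can still see
# that the reading was originally absent.
df_example['smart_3_was_missing'] = df_example['smart_3_raw'].isnull()

# Interpolate afterwards; the flag column is unaffected.
df_example['smart_3_raw'] = df_example['smart_3_raw'].fillna(
    df_example['smart_3_raw'].median())
```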
For the smart_3_raw data, the median for the manufacturer and drive capacity will be used for the second model. The median for the manufacturer without regard for drive capacity will be used for the first and third models of drive, as there are no appropriate rows with values to interpolate from.
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_3_raw'].notnull()) & \
(df['capacity_TB'] == 0.25)]['smart_3_raw'].mean()
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_3_raw'].notnull()) & \
(df['capacity_TB'] == 0.25)]['smart_3_raw']
smart_3_median_specialized = df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_3_raw'].notnull()) & \
(df['capacity_TB'] == 0.50)]['smart_3_raw'].median()
smart_3_median_specialized
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_3_raw'].notnull()) & \
(df['capacity_TB'] == 0.50)]['smart_3_raw'])
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_3_raw'].notnull()) & \
(df['capacity_TB'] == 0.50)]['smart_3_raw']
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_3_raw'].notnull()) & \
(df['capacity_TB'] == 2.00)]['smart_3_raw'].mean()
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_3_raw'].notnull()) & \
(df['capacity_TB'] == 2.00)]['smart_3_raw']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_3_raw'].notnull())]['smart_3_raw'], kde = False)
smart_3_median = df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_3_raw'].notnull())]['smart_3_raw'].median()
smart_3_median
# Use the median to fill the capacity category that can be calculated.
df.loc[(df['smart_3_raw'].isnull()) & \
(df['capacity_TB'] == 0.50), 'smart_3_raw'] = smart_3_median_specialized
# Use the median to fill the capacity categories that cannot be calculated.
df.loc[df['smart_3_raw'].isnull(), 'smart_3_raw'] = smart_3_median
df['smart_3_raw'].isnull().sum()
For the smart_4_raw data, the mean for the manufacturer and drive capacity will be used for the second model. The mean for the manufacturer without regard for drive capacity will be used for the first and third models of drive, as there are no appropriate rows with values to interpolate from.
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_4_raw'].notnull()) & (df['capacity_TB'] == 0.25)]['smart_4_raw']
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_4_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_4_raw']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_4_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_4_raw'])
smart_4_mean_specialized = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_4_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_4_raw'].mean()
smart_4_mean_specialized
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_4_raw'].notnull()) & (df['capacity_TB'] == 2.00)]['smart_4_raw']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_4_raw'].notnull())]['smart_4_raw'])
smart_4_mean = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_4_raw'].notnull())]['smart_4_raw'].mean()
smart_4_mean
# Use the mean to fill the capacity category that can be calculated.
df.loc[(df['smart_4_raw'].isnull()) & (df['capacity_TB'] == 0.50), 'smart_4_raw'] = smart_4_mean_specialized
# Use the mean to fill the capacity categories that cannot be calculated.
df.loc[df['smart_4_raw'].isnull(), 'smart_4_raw'] = smart_4_mean
df['smart_4_raw'].isnull().sum()
For the smart_5_raw data, the median for the manufacturer and drive capacity will be used for the second model. The median for the manufacturer without regard for drive capacity will be used for the first and third models of drive, as there are no appropriate rows with values to interpolate from.
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_5_raw'].notnull()) & (df['capacity_TB'] == 0.25)]['smart_5_raw']
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_5_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_5_raw']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_5_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_5_raw'], kde = False)
smart_5_median_specialized = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_5_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_5_raw'].median()
smart_5_median_specialized
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_5_raw'].notnull()) & (df['capacity_TB'] == 2.00)]['smart_5_raw']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_5_raw'].notnull())]['smart_5_raw'], kde = False)
smart_5_median = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_5_raw'].notnull())]['smart_5_raw'].median()
smart_5_median
# Use the median to fill the capacity category that can be calculated.
df.loc[(df['smart_5_raw'].isnull()) & (df['capacity_TB'] == 0.50), 'smart_5_raw'] = smart_5_median_specialized
# Use the median to fill the capacity categories that cannot be calculated.
df.loc[df['smart_5_raw'].isnull(), 'smart_5_raw'] = smart_5_median
df['smart_5_raw'].isnull().sum()
For the smart_7_raw data, the median for the manufacturer and drive capacity will be used for the second model. The median for the manufacturer without regard for drive capacity will be used for the first and third models of drive, as there are no appropriate rows with values to interpolate from.
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_7_raw'].notnull()) & (df['capacity_TB'] == 0.25)]['smart_7_raw']
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_7_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_7_raw']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_7_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_7_raw'])
smart_7_median_specialized = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_7_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_7_raw'].median()
smart_7_median_specialized
mean = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_7_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_7_raw'].mean()
mean
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_7_raw'].notnull()) & (df['capacity_TB'] == 2.00)]['smart_7_raw']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_7_raw'].notnull())]['smart_7_raw'])
smart_7_median = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_7_raw'].notnull())]['smart_7_raw'].median()
smart_7_median
# Use the median to fill the capacity category that can be calculated.
df.loc[(df['smart_7_raw'].isnull()) & (df['capacity_TB'] == 0.50), 'smart_7_raw'] = smart_7_median_specialized
# Use the median to fill the capacity categories that cannot be calculated.
df.loc[df['smart_7_raw'].isnull(), 'smart_7_raw'] = smart_7_median
df['smart_7_raw'].isnull().sum()
For the smart_10_raw data, the median for the manufacturer will be used to fill the NaN values.
df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_10_raw'].notnull())]['smart_10_raw']
df['smart_10_raw'].value_counts()
df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_10_raw'].notnull())]['smart_10_raw'].value_counts()
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_10_raw'].notnull())]['smart_10_raw'])
smart_10_median = df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_10_raw'].notnull())]['smart_10_raw'].median()
smart_10_median
df.loc[df['smart_10_raw'].isnull(), 'smart_10_raw'] = smart_10_median
df['smart_10_raw'].isnull().sum()
For the smart_197_raw data, the median for the manufacturer and drive capacity will be used for the second model. The median for the manufacturer without regard for drive capacity will be used for the first and third models of drive, as there are no appropriate rows with values to interpolate from.
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_197_raw'].notnull()) & \
(df['capacity_TB'] == 0.25)]['smart_197_raw']
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_197_raw'].notnull()) & \
(df['capacity_TB'] == 0.50)]['smart_197_raw']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_197_raw'].notnull()) & \
(df['capacity_TB'] == 0.50)]['smart_197_raw'], kde = False)
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_197_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_197_raw'].value_counts()
smart_197_median_specialized = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_197_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_197_raw'].median()
smart_197_median_specialized
smart_197_mean_specialized = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_197_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_197_raw'].mean()
smart_197_mean_specialized
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_197_raw'].notnull()) & (df['capacity_TB'] == 2.00)]['smart_197_raw']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_197_raw'].notnull())]['smart_197_raw'], kde = False)
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_197_raw'].notnull())]['smart_197_raw'].value_counts()
smart_197_median = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_197_raw'].notnull())]['smart_197_raw'].median()
smart_197_median
# Use the median to fill the capacity category that can be calculated.
df.loc[(df['smart_197_raw'].isnull()) & (df['capacity_TB'] == 0.50), 'smart_197_raw'] = smart_197_median_specialized
# Use the median to fill the capacity categories that cannot be calculated.
df.loc[df['smart_197_raw'].isnull(), 'smart_197_raw'] = smart_197_median
df['smart_197_raw'].isnull().sum()
For the smart_198_raw data, the median for the manufacturer and drive capacity will be used for the second model. The median for the manufacturer without regard for drive capacity will be used for the first and third models of drive, as there are no appropriate rows with values to interpolate from.
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_198_raw'].notnull()) & (df['capacity_TB'] == 0.25)]['smart_198_raw']
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_198_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_198_raw']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_198_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_198_raw'], kde = False)
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_198_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_198_raw'].value_counts()
smart_198_median_specialized = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_198_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_198_raw'].median()
smart_198_median_specialized
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_198_raw'].notnull()) & (df['capacity_TB'] == 2.00)]['smart_198_raw']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_198_raw'].notnull())]['smart_198_raw'], kde = False)
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_198_raw'].notnull())]['smart_198_raw'].value_counts()
smart_198_median = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_198_raw'].notnull())]['smart_198_raw'].median()
smart_198_median
# Use the median to fill the capacity category that can be calculated.
df.loc[(df['smart_198_raw'].isnull()) & (df['capacity_TB'] == 0.50), 'smart_198_raw'] = smart_198_median_specialized
# Use the median to fill the capacity categories that cannot be calculated.
df.loc[df['smart_198_raw'].isnull(), 'smart_198_raw'] = smart_198_median
df['smart_198_raw'].isnull().sum()
For the smart_199_raw data, the median for the manufacturer and drive capacity will be used for the second model. The median for the manufacturer without regard for drive capacity will be used for the first and third models of drive, as there are no appropriate rows with values to interpolate from.
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_199_raw'].notnull()) & (df['capacity_TB'] == 0.25)]['smart_199_raw']
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_199_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_199_raw']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_199_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_199_raw'], kde = False)
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_199_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_199_raw'].value_counts()
smart_199_median_specialized = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_199_raw'].notnull()) & (df['capacity_TB'] == 0.50)]['smart_199_raw'].median()
smart_199_median_specialized
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_199_raw'].notnull()) & (df['capacity_TB'] == 2.00)]['smart_199_raw']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & (df['smart_199_raw'].notnull())]['smart_199_raw'], kde = False)
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_199_raw'].notnull())]['smart_199_raw'].value_counts()
smart_199_median = df.loc[(df['manufacturer'] == "Seagate") & (df['smart_199_raw'].notnull())]['smart_199_raw'].median()
smart_199_median
# Use the median to fill the capacity category that can be calculated.
df.loc[(df['smart_199_raw'].isnull()) & (df['capacity_TB'] == 0.50), 'smart_199_raw'] = smart_199_median_specialized
# Use the median to fill the capacity categories that cannot be calculated.
df.loc[df['smart_199_raw'].isnull(), 'smart_199_raw'] = smart_199_median
df['smart_199_raw'].isnull().sum()
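The smart_3 through smart_199 fills above all repeat the same two-pass pattern; it could be factored into a reusable helper. This is a sketch under the section's own assumptions (the function name and signature are this example's, not the notebook's):

```python
import pandas as pd
import numpy as np

def fill_by_capacity(df, col, manufacturer, capacity, stat='median'):
    """Fill NaNs in `col` in two passes: first with the statistic for one
    manufacturer/capacity group, then with the manufacturer-wide statistic
    for capacity groups that had no values to compute from."""
    valid = df.loc[(df['manufacturer'] == manufacturer) & df[col].notnull()]
    specialized = getattr(valid.loc[valid['capacity_TB'] == capacity, col], stat)()
    overall = getattr(valid[col], stat)()
    df.loc[df[col].isnull() & (df['capacity_TB'] == capacity), col] = specialized
    df.loc[df[col].isnull(), col] = overall
    return df

# Toy frame standing in for the real data.
toy = pd.DataFrame({
    'manufacturer': ['Seagate'] * 5,
    'capacity_TB': [0.50, 0.50, 0.50, 0.25, 2.00],
    'smart_3_raw': [100.0, 200.0, np.nan, np.nan, np.nan],
})
toy = fill_by_capacity(toy, 'smart_3_raw', 'Seagate', 0.50)
```

Passing `stat='mean'` would reproduce the smart_4_raw variant of the pattern.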
The smart_193_raw column presents a different problem from the last group of columns. It has 53985 rows with NaN values, which is still low enough in this large dataset to interpolate without major ill effects, but it still requires caution.
An important note here is that some manufacturers use different SMART attributes to represent the same information. Most Seagate and some Western Digital and Hitachi drives actually use 225 rather than 193 to store the Load/Unload Cycle Count value (Acronis, Knowledge Base 9128; Acronis, Knowledge Base 9152). We can see here that no row has both 193 and 225 values.
df.loc[(df['smart_193_raw'].notnull()) & \
(df['smart_225_raw'].notnull())][['smart_193_raw', 'smart_225_raw']]
df_193 = df.loc[df['smart_193_raw'].isnull()]
df_193
df_193['manufacturer'].value_counts()
df_193['model'].value_counts()
df_193.loc[df_193['smart_193_raw'] != \
df_193['smart_225_raw']][['smart_193_raw', 'smart_225_raw']]
df_193.loc[(df_193['smart_193_raw'].notnull()) & \
(df_193['smart_225_raw'].notnull())][['smart_193_raw', 'smart_225_raw']]
The only rows that have neither value are exactly the same rows as the last group; those will need to be interpolated if the rows are to be kept. The other 45193 rows can be filled by combining the two columns that represent the same information.
df_193.loc[(df_193['smart_193_raw'].isnull()) & \
(df_193['smart_225_raw'].isnull())][['smart_193_raw', 'smart_225_raw']]
df_193.loc[(df_193['smart_193_raw'].isnull()) & \
(df_193['smart_225_raw'].isnull())]['model'].value_counts()
The smart_193_raw and smart_225_raw columns will be combined into a new smart_193_225 column and then the remaining values filled as in previous columns.
df['smart_193_225'] = df['smart_193_raw']
df['smart_193_225'].fillna(df['smart_225_raw'], inplace = True)
df[['smart_193_raw', 'smart_225_raw', 'smart_193_225']].isna().sum()
df.drop(['smart_193_raw', 'smart_225_raw'], axis=1, inplace=True)
df.head()
Now that the two columns have been merged, the same process of interpolation by model and capacity can be used on the remaining group.
df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_193_225'].notnull()) & \
(df['capacity_TB'] == 0.25)]['smart_193_225']
df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_193_225'].notnull()) & \
(df['capacity_TB'] == 0.50)]['smart_193_225']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_193_225'].notnull()) & \
(df['capacity_TB'] == 0.50)]['smart_193_225'])
df.loc[(df['manufacturer'] == "Seagate") & (df['smart_193_225'].notnull()) & \
(df['capacity_TB'] == 0.50)]['smart_193_225'].value_counts()
smart_193_225_median_specialized = df.loc[(df['manufacturer'] == "Seagate") &\
(df['smart_193_225'].notnull()) & \
(df['capacity_TB'] == 0.50)]['smart_193_225'].median()
smart_193_225_median_specialized
smart_193_225_mean_specialized = df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_193_225'].notnull()) & \
(df['capacity_TB'] == 0.50)]['smart_193_225'].mean()
smart_193_225_mean_specialized
df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_193_225'].notnull()) & \
(df['capacity_TB'] == 2.00)]['smart_193_225']
sns.distplot(df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_193_225'].notnull())]['smart_193_225'])
df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_193_225'].notnull())]['smart_193_225'].value_counts()
smart_193_225_median = df.loc[(df['manufacturer'] == "Seagate") & \
(df['smart_193_225'].notnull())]['smart_193_225'].median()
smart_193_225_median
# Use the median to fill the capacity category that can be calculated.
df.loc[(df['smart_193_225'].isnull()) & \
(df['capacity_TB'] == 0.50), 'smart_193_225'] = \
smart_193_225_median_specialized
# Use the median to fill the capacity categories that cannot be calculated.
df.loc[df['smart_193_225'].isnull(), 'smart_193_225'] = smart_193_225_median
df['smart_193_225'].isnull().sum()
The remaining columns to examine each have over 2 million rows with NaN values. This level of missing data causes interpolation to skew results far more than in the previous groups. The following group of columns each still retains at least 70% of its values.
Column | NaN Count |
---|---|
smart_240_raw | 2733254 |
smart_241_raw | 2909319 |
smart_242_raw | 2909319 |
smart_187_raw | 3062543 |
smart_188_raw | 3062543 |
smart_190_raw | 3062543 |
df.loc[df['failure'] == 0]['smart_240_raw'].isnull().value_counts()
df.loc[df['failure'] == 1]['smart_240_raw'].isnull().value_counts()
df.loc[df['smart_240_raw'].isnull()].head()
Notably, none of the HGST drives have a value for the smart_240_raw column. Additionally, the drives missing the smart_241_raw data appear to be the same drives missing the smart_242_raw data, as both columns have an identical NaN count.
Seagate drives have enough filled values to use and Toshiba drives have no missing values, but the HGST and Western Digital drives do not have enough values to interpolate from. As such, all missing values will be filled in with the mean.
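That overlap can be checked directly by comparing the two missingness masks. A sketch on a toy frame (the same lines apply unchanged to the real df):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    'smart_241_raw': [1.0, np.nan, np.nan, 4.0],
    'smart_242_raw': [2.0, np.nan, np.nan, 8.0],
})

# If both columns are missing on exactly the same rows, the masks are
# equal and the crosstab has no mixed (True, False) cells.
mask_241 = toy['smart_241_raw'].isnull()
mask_242 = toy['smart_242_raw'].isnull()
overlap = pd.crosstab(mask_241, mask_242)
same_rows = mask_241.equals(mask_242)
```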
df.loc[df['smart_240_raw'].isnull()]['manufacturer'].value_counts()
df.loc[df['smart_240_raw'].notnull()]['manufacturer'].value_counts()
df.loc[df['smart_241_raw'].isnull()]['manufacturer'].value_counts()
df.loc[df['smart_241_raw'].notnull()]['manufacturer'].value_counts()
df.loc[df['smart_242_raw'].isnull()]['manufacturer'].value_counts()
df.loc[df['smart_242_raw'].notnull()]['manufacturer'].value_counts()
print("Not Failed Mean: " + str(df.loc[df['failure'] == 0]['smart_240_raw'].mean()))
print("Not Failed Median: " + str(df.loc[df['failure'] == 0]['smart_240_raw'].median()))
print("Failed Mean: " + str(df.loc[df['failure'] == 1]['smart_240_raw'].mean()))
print("Failed Median: " + str(df.loc[df['failure'] == 1]['smart_240_raw'].median()))
print("Not Failed Mean: " + str(df.loc[df['failure'] == 0]['smart_241_raw'].mean()))
print("Not Failed Median: " + str(df.loc[df['failure'] == 0]['smart_241_raw'].median()))
print("Failed Mean: " + str(df.loc[df['failure'] == 1]['smart_241_raw'].mean()))
print("Failed Median: " + str(df.loc[df['failure'] == 1]['smart_241_raw'].median()))
print("Not Failed Mean: " + str(df.loc[df['failure'] == 0]['smart_242_raw'].mean()))
print("Not Failed Median: " + str(df.loc[df['failure'] == 0]['smart_242_raw'].median()))
print("Failed Mean: " + str(df.loc[df['failure'] == 1]['smart_242_raw'].mean()))
print("Failed Median: " + str(df.loc[df['failure'] == 1]['smart_242_raw'].median()))
smart_240_mean = df.loc[df['smart_240_raw'].notnull()]['smart_240_raw'].mean()
smart_240_mean
df['smart_240_raw'].fillna(smart_240_mean, inplace = True)
df['smart_240_raw'].isnull().sum()
smart_241_mean = df.loc[df['smart_241_raw'].notnull()]['smart_241_raw'].mean()
smart_241_mean
df['smart_241_raw'].fillna(smart_241_mean, inplace = True)
df['smart_241_raw'].isnull().sum()
smart_242_mean = df.loc[df['smart_242_raw'].notnull()]['smart_242_raw'].mean()
smart_242_mean
df['smart_242_raw'].fillna(smart_242_mean, inplace = True)
df['smart_242_raw'].isnull().sum()
df.loc[df['failure'] == 0]['smart_187_raw'].isnull().value_counts()
df.loc[df['failure'] == 1]['smart_187_raw'].isnull().value_counts()
df.loc[df['smart_187_raw'].isnull()].head()
The smart_187_raw, smart_188_raw, and smart_190_raw columns are split cleanly by manufacturer: all Seagate drives have the values and none of the other manufacturers' drives do.
df.loc[df['smart_187_raw'].isnull()]['manufacturer'].value_counts()
df.loc[df['smart_187_raw'].notnull()]['manufacturer'].value_counts()
sns.distplot(df.loc[df['smart_187_raw'].notnull()]['smart_187_raw'], \
kde = False)
df.loc[df['smart_188_raw'].isnull()]['manufacturer'].value_counts()
df.loc[df['smart_188_raw'].notnull()]['manufacturer'].value_counts()
sns.distplot(df.loc[df['smart_188_raw'].notnull()]['smart_188_raw'], \
kde = False)
df.loc[df['smart_190_raw'].isnull()]['manufacturer'].value_counts()
df.loc[df['smart_190_raw'].notnull()]['manufacturer'].value_counts()
sns.distplot(df.loc[df['smart_190_raw'].notnull()]['smart_190_raw'])
Given the column distributions, the smart_187_raw and smart_188_raw NaNs will be filled with the medians, and the smart_190_raw NaNs will be filled with the mean.
smart_187_median = df.loc[df['smart_187_raw'].notnull()]['smart_187_raw'].median()
smart_187_median
df['smart_187_raw'].fillna(smart_187_median, inplace = True)
df['smart_187_raw'].isnull().sum()
smart_188_median = df.loc[df['smart_188_raw'].notnull()]['smart_188_raw'].median()
smart_188_median
df['smart_188_raw'].fillna(smart_188_median, inplace = True)
df['smart_188_raw'].isnull().sum()
smart_190_mean = df.loc[df['smart_190_raw'].notnull()]['smart_190_raw'].mean()
smart_190_mean
df['smart_190_raw'].fillna(smart_190_mean, inplace = True)
df['smart_190_raw'].isnull().sum()
if not os.path.isfile('pre_195_df.csv'):
df.to_csv('pre_195_df.csv', index=False)
These remaining columns have over 30% of their values missing, and an individualized approach will be taken with each of them. In some cases, binning the existing values into categories, with NaN as its own category, may preserve whatever predictive information remains.
Column | NaN Count |
---|---|
smart_195_raw | 4806304 |
smart_191_raw | 6402847 |
smart_184_raw | 6781011 |
smart_189_raw | 6781011 |
smart_200_raw | 7076978 |
smart_196_raw | 7921364 |
smart_8_raw | 7946665 |
smart_2_raw | 7946665 |
smart_183_raw | 9132294 |
smart_22_raw | 9741975 |
smart_223_raw | 10463678 |
smart_18_raw | 10651999 |
smart_224_raw | 10652391 |
smart_220_raw | 10652391 |
smart_222_raw | 10652391 |
smart_226_raw | 10652391 |
smart_23_raw | 10742991 |
smart_24_raw | 10742991 |
smart_11_raw | 10904619 |
smart_225_raw | 10929920 |
smart_254_raw | 10948140 |
smart_235_raw | 10966321 |
smart_233_raw | 10966321 |
smart_232_raw | 10966321 |
smart_168_raw | 10966321 |
smart_170_raw | 10966321 |
smart_218_raw | 10966321 |
smart_174_raw | 10966321 |
smart_16_raw | 10966321 |
smart_17_raw | 10966321 |
smart_173_raw | 10966321 |
smart_231_raw | 10966321 |
smart_177_raw | 10966321 |
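A quick way to triage columns like these is by their missing fraction. A sketch on a toy frame; the 0.30 threshold mirrors the 30% figure above, and the column names and values here are illustrative only:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    'smart_5_raw':   [1.0, 2.0, 3.0, 4.0, np.nan],           # 20% missing
    'smart_195_raw': [1.0, np.nan, np.nan, np.nan, np.nan],  # 80% missing
})

# Fraction of NaNs per column, smallest first.
missing_frac = toy.isnull().mean().sort_values()

# Columns under the threshold are candidates for interpolation; the rest
# need an individualized approach (category columns or dropping).
interpolate_cols = missing_frac[missing_frac <= 0.30].index.tolist()
individual_cols = missing_frac[missing_frac > 0.30].index.tolist()
```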
The smart_195_raw column only has values for a single manufacturer's drives, and even then only 77% of them. There appears to be virtually no difference in the column's distribution by failure status. Filling in NaNs with this information would only create collinearity between this column and the manufacturer column, so it will be dropped from the dataframe.
sns.distplot(df.loc[df['smart_195_raw'].notnull()]['smart_195_raw'])
sns.distplot(df.loc[df['failure'] == 0]['smart_195_raw'])
sns.distplot(df.loc[df['failure'] == 1]['smart_195_raw'])
plt.grid(True)
plt.title("smart_195_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df.loc[df['smart_195_raw'].notnull()]['manufacturer'].value_counts()
len(df.loc[df['smart_195_raw'].isnull()])
df['manufacturer'].value_counts()
df[['smart_195_raw', 'failure']].corr()
df.drop(['smart_195_raw'], axis=1, inplace=True)
df.head()
The smart_191_raw column is not split along manufacturer lines like many others, but it still has a large percentage of missing values. A categorical column smart_191_cat will be created with the following categories and values.
Value | Representation |
---|---|
0 | Value NaN |
1 | Below Average |
2 | Above Average |
The original smart_191_raw column will then be dropped.
sns.distplot(df.loc[df['smart_191_raw'].notnull()]['smart_191_raw'])
sns.distplot(df.loc[df['failure'] == 0]['smart_191_raw'])
sns.distplot(df.loc[df['failure'] == 1]['smart_191_raw'])
plt.grid(True)
plt.title("smart_191_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
sns.distplot(df.loc[(df['failure'] == 0) & \
(df['smart_191_raw'] != 0.0)]['smart_191_raw'])
sns.distplot(df.loc[(df['failure'] == 1) & \
(df['smart_191_raw'] != 0.0)]['smart_191_raw'])
plt.grid(True)
plt.title("smart_191_raw Nonzero Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df.loc[df['smart_191_raw'].notnull()]['manufacturer'].value_counts()
len(df.loc[df['smart_191_raw'].isnull()])
smart_191_mean = df.loc[df['smart_191_raw'].notnull()]['smart_191_raw'].mean()
smart_191_mean
df['smart_191_cat'] = 0
df.loc[(df['smart_191_raw'] < smart_191_mean), 'smart_191_cat'] = 1
df.loc[(df['smart_191_raw'] > smart_191_mean), 'smart_191_cat'] = 2
df['smart_191_cat'] = df['smart_191_cat'].astype('category')
df['smart_191_cat'].dtype
df['smart_191_cat'].value_counts()
df['smart_191_cat'].isnull().sum()
df.drop(['smart_191_raw'], axis=1, inplace=True)
df.head()
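The NaN / below-average / above-average encoding just applied to smart_191_raw recurs for several later columns as well; as a sketch, it could be factored into a helper. The function name is this example's, and note that it counts a value exactly equal to the mean as above average, a slight difference from the inline cells, which leave such a value at 0:

```python
import pandas as pd
import numpy as np

def mean_split_category(series):
    # 0 = NaN, 1 = below the column mean, 2 = at or above it.
    mean = series.mean()                 # .mean() skips NaNs
    cat = pd.Series(0, index=series.index)
    cat[series < mean] = 1               # NaN comparisons are False,
    cat[series >= mean] = 2              # so NaN rows stay 0
    return cat.astype('category')

toy = pd.Series([np.nan, 10.0, 30.0, np.nan])
encoded = mean_split_category(toy)
```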
The smart_184_raw column very rarely has any value other than 0 when it is available. However, whenever it is available and nonzero, it shows a disproportionate ratio of failures to non-failures, making it a very useful measure for predicting failure. A categorical column smart_184_cat will be created with the following categories and values.
Value | Representation |
---|---|
0 | Value is 0 or NaN |
1 | Value is above 0 |
The original smart_184_raw column will then be dropped.
sns.distplot(df.loc[df['smart_184_raw'].notnull()]['smart_184_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 0]['smart_184_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 1]['smart_184_raw'], kde = False)
plt.grid(True)
plt.title("smart_184_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df.loc[df['smart_184_raw'].notnull()]['smart_184_raw'].value_counts()
df.loc[(df['smart_184_raw'] != 0) & \
(df['smart_184_raw'].notnull())][['smart_184_raw', 'failure']]
df.loc[df['smart_184_raw'].notnull()]['manufacturer'].value_counts()
len(df.loc[df['smart_184_raw'].isnull()])
df['smart_184_cat'] = 0
df.loc[(df['smart_184_raw'] > 0), 'smart_184_cat'] = 1
df['smart_184_cat'] = df['smart_184_cat'].astype('category')
df['smart_184_cat'].dtype
df['smart_184_cat'].value_counts()
df['smart_184_cat'].isnull().sum()
df.drop(['smart_184_raw'], axis=1, inplace=True)
df.head()
The smart_189_raw column only has values for a single manufacturer's drives, and even then only 38% of them. There is also little correlation between this column and the failure rate. Filling in NaNs could likewise create collinearity with the manufacturer column, so it will be dropped from the dataframe without a category column.
sns.distplot(df.loc[df['smart_189_raw'].notnull()]['smart_189_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 0]['smart_189_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 1]['smart_189_raw'], kde = False)
plt.grid(True)
plt.title("smart_189_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df['smart_189_raw'].value_counts()
df.loc[df['smart_189_raw'].notnull()]['manufacturer'].value_counts()
len(df.loc[df['smart_189_raw'].isnull()])
df[['smart_189_raw', 'failure']].corr()
df.drop(['smart_189_raw'], axis=1, inplace=True)
df.head()
The smart_200_raw column is not entirely split along manufacturer lines like many others, but it still has a large percentage of missing values. Given the reasonably large correlation between a higher value and a higher failure rate, a categorical column smart_200_cat will be created with the following categories and values.
Value | Representation |
---|---|
0 | Value NaN |
1 | Below Average |
2 | Above Average |
The original smart_200_raw column will then be dropped.
sns.distplot(df.loc[df['smart_200_raw'].notnull()]['smart_200_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 0]['smart_200_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 1]['smart_200_raw'], kde = False)
plt.grid(True)
plt.title("smart_200_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df['smart_200_raw'].value_counts()
sns.distplot(df.loc[(df['failure'] == 0) & (df['smart_200_raw'] != 0.0)]['smart_200_raw'])
sns.distplot(df.loc[(df['failure'] == 1) & (df['smart_200_raw'] != 0.0)]['smart_200_raw'])
plt.grid(True)
plt.title("smart_200_raw Nonzero Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df.loc[df['smart_200_raw'].notnull()]['manufacturer'].value_counts()
len(df.loc[df['smart_200_raw'].isnull()])
df[['smart_200_raw', 'failure']].corr()
smart_200_mean = df.loc[df['smart_200_raw'].notnull()]['smart_200_raw'].mean()
smart_200_mean
df.loc[(df['failure'] == 0) & (df['smart_200_raw'].notnull())]['smart_200_raw'].mean()
df.loc[(df['failure'] == 1) & (df['smart_200_raw'].notnull())]['smart_200_raw'].mean()
df['smart_200_cat'] = 0
df.loc[(df['smart_200_raw'] < smart_200_mean), 'smart_200_cat'] = 1
df.loc[(df['smart_200_raw'] > smart_200_mean), 'smart_200_cat'] = 2
df['smart_200_cat'] = df['smart_200_cat'].astype('category')
df['smart_200_cat'].dtype
df['smart_200_cat'].value_counts()
df['smart_200_cat'].isnull().sum()
df.drop(['smart_200_raw'], axis=1, inplace=True)
df.head()
The smart_196_raw column is not split along manufacturer lines at all, but it still has a large percentage of missing values. Given the reasonably large correlation between a higher value and a higher failure rate, a categorical column smart_196_cat will be created with the following categories and values.
Value | Representation |
---|---|
0 | Value NaN |
1 | Below Average |
2 | Above Average |
The original smart_196_raw column will then be dropped.
sns.distplot(df.loc[df['smart_196_raw'].notnull()]['smart_196_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 0]['smart_196_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 1]['smart_196_raw'], kde = False)
plt.grid(True)
plt.title("smart_196_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df['smart_196_raw'].value_counts()
sns.distplot(df.loc[(df['failure'] == 0) & (df['smart_196_raw'] != 0.0)]['smart_196_raw'])
sns.distplot(df.loc[(df['failure'] == 1) & (df['smart_196_raw'] != 0.0)]['smart_196_raw'])
plt.grid(True)
plt.title("smart_196_raw Nonzero Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df.loc[df['smart_196_raw'].notnull()]['manufacturer'].value_counts()
len(df.loc[df['smart_196_raw'].isnull()])
df[['smart_196_raw', 'failure']].corr()
smart_196_mean = df.loc[df['smart_196_raw'].notnull()]['smart_196_raw'].mean()
smart_196_mean
df.loc[(df['failure'] == 0) & (df['smart_196_raw'].notnull())]['smart_196_raw'].mean()
df.loc[(df['failure'] == 1) & (df['smart_196_raw'].notnull())]['smart_196_raw'].mean()
df['smart_196_cat'] = 0
df.loc[(df['smart_196_raw'] < smart_196_mean), 'smart_196_cat'] = 1
df.loc[(df['smart_196_raw'] > smart_196_mean), 'smart_196_cat'] = 2
df['smart_196_cat'] = df['smart_196_cat'].astype('category')
df['smart_196_cat'].dtype
df['smart_196_cat'].value_counts()
df['smart_196_cat'].isnull().sum()
df.drop(['smart_196_raw'], axis=1, inplace=True)
df.head()
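The mean-split encoding applied to smart_196_raw above is repeated for several other SMART columns in this notebook. A hypothetical helper (mean_split_cat is not part of the notebook; it is a sketch of the same pattern) could express it once:

```python
import pandas as pd

def mean_split_cat(df, raw_col, cat_col):
    """Encode raw_col as 0 = NaN, 1 = below mean, 2 = above mean,
    then drop raw_col. NaN comparisons are False, so NaNs stay 0,
    matching the notebook's behavior."""
    mean = df.loc[df[raw_col].notnull(), raw_col].mean()
    df[cat_col] = 0
    df.loc[df[raw_col] < mean, cat_col] = 1
    df.loc[df[raw_col] > mean, cat_col] = 2
    df[cat_col] = df[cat_col].astype('category')
    return df.drop(columns=[raw_col])
```

A drive whose raw value equals the mean exactly would also stay 0 under this sketch, a rare edge case the per-column cells share.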
This column is not strongly split along manufacturer lines, but it still has a large percentage of missing values. Given the reasonably strong negative correlation between higher values and the failure rate, a categorical column smart_8_cat will be created with the following categories and values.
Value | Representation |
---|---|
0 | Value NaN |
1 | Below Average |
2 | Above Average |
The original smart_8_raw column will then be dropped.
sns.distplot(df.loc[df['smart_8_raw'].notnull()]['smart_8_raw'])
sns.distplot(df.loc[df['failure'] == 0]['smart_8_raw'])
sns.distplot(df.loc[df['failure'] == 1]['smart_8_raw'])
plt.grid(True)
plt.title("smart_8_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df['smart_8_raw'].value_counts()
df.loc[df['smart_8_raw'].notnull()]['manufacturer'].value_counts()
len(df.loc[df['smart_8_raw'].isnull()])
df[['smart_8_raw', 'failure']].corr()
smart_8_mean = df.loc[df['smart_8_raw'].notnull()]['smart_8_raw'].mean()
smart_8_mean
df.loc[(df['failure'] == 0) & (df['smart_8_raw'].notnull())]['smart_8_raw'].mean()
df.loc[(df['failure'] == 1) & (df['smart_8_raw'].notnull())]['smart_8_raw'].mean()
df['smart_8_cat'] = 0
df.loc[(df['smart_8_raw'] < smart_8_mean), 'smart_8_cat'] = 1
df.loc[(df['smart_8_raw'] > smart_8_mean), 'smart_8_cat'] = 2
df['smart_8_cat'] = df['smart_8_cat'].astype('category')
df['smart_8_cat'].dtype
df['smart_8_cat'].value_counts()
df['smart_8_cat'].isnull().sum()
df.drop(['smart_8_raw'], axis=1, inplace=True)
df.head()
This column is not strongly split along manufacturer lines, but it still has a large percentage of missing values. Given the reasonably strong negative correlation between higher values and the failure rate, a categorical column smart_2_cat will be created with the following categories and values.
Value | Representation |
---|---|
0 | Value NaN |
1 | Below Average |
2 | Above Average |
The original smart_2_raw column will then be dropped.
sns.distplot(df.loc[df['smart_2_raw'].notnull()]['smart_2_raw'])
sns.distplot(df.loc[df['failure'] == 0]['smart_2_raw'])
sns.distplot(df.loc[df['failure'] == 1]['smart_2_raw'])
plt.grid(True)
plt.title("smart_2_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df['smart_2_raw'].value_counts()
df.loc[df['smart_2_raw'].notnull()]['manufacturer'].value_counts()
len(df.loc[df['smart_2_raw'].isnull()])
df[['smart_2_raw', 'failure']].corr()
smart_2_mean = df.loc[df['smart_2_raw'].notnull()]['smart_2_raw'].mean()
smart_2_mean
df.loc[(df['failure'] == 0) & (df['smart_2_raw'].notnull())]['smart_2_raw'].mean()
df.loc[(df['failure'] == 1) & (df['smart_2_raw'].notnull())]['smart_2_raw'].mean()
df['smart_2_cat'] = 0
df.loc[(df['smart_2_raw'] < smart_2_mean), 'smart_2_cat'] = 1
df.loc[(df['smart_2_raw'] > smart_2_mean), 'smart_2_cat'] = 2
df['smart_2_cat'] = df['smart_2_cat'].astype('category')
df['smart_2_cat'].dtype
df['smart_2_cat'].value_counts()
df['smart_2_cat'].isnull().sum()
df.drop(['smart_2_raw'], axis=1, inplace=True)
df.head()
This column has values only for a single manufacturer's drives, and even then for only 23% of them. There is also little correlation between this column and the failure rate. Filling in the NaNs with this information could additionally introduce collinearity with the manufacturer column, so the column will be dropped from the dataframe without a category column.
sns.distplot(df.loc[df['smart_183_raw'].notnull()]['smart_183_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 0]['smart_183_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 1]['smart_183_raw'], kde = False)
plt.grid(True)
plt.title("smart_183_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df['smart_183_raw'].value_counts()
sns.distplot(df.loc[(df['failure'] == 0) & (df['smart_183_raw'] != 0.0)]['smart_183_raw'])
sns.distplot(df.loc[(df['failure'] == 1) & (df['smart_183_raw'] != 0.0)]['smart_183_raw'])
plt.grid(True)
plt.title("smart_183_raw Nonzero Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df.loc[df['smart_183_raw'].notnull()]['manufacturer'].value_counts()
len(df.loc[df['smart_183_raw'].isnull()])
df[['smart_183_raw', 'failure']].corr()
df.drop(['smart_183_raw'], axis=1, inplace=True)
df.head()
if not os.path.isfile('pre_22_df.csv'):
df.to_csv('pre_22_df.csv', index=False)
df = pd.read_csv('pre_22_df.csv')
n_rows = len(df)
df['date'] = df['date'].astype('category')
df['model'] = df['model'].astype('category')
df['failure'] = df['failure'].astype('bool')
df['manufacturer'] = df['manufacturer'].astype('category')
df['smart_191_cat'] = df['smart_191_cat'].astype('category')
df['smart_184_cat'] = df['smart_184_cat'].astype('category')
df['smart_200_cat'] = df['smart_200_cat'].astype('category')
df['smart_196_cat'] = df['smart_196_cat'].astype('category')
df['smart_8_cat'] = df['smart_8_cat'].astype('category')
This column has values only for a single manufacturer's drives, as it indicates the helium level sealed inside certain HGST drives (Klein, 2015). Given this, it would make no sense to fill the column's NaN values for drives from other manufacturers. Beyond that, the dataset contains no failures with abnormal helium levels, so this column could actually hurt the real-world effectiveness of a predictive model. Given that risk, the risk of collinearity with the manufacturer column, and the column's low correlation with the failure rate, it will be dropped from the dataframe without a category column to simplify the models.
sns.distplot(df.loc[df['smart_22_raw'].notnull()]['smart_22_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 0]['smart_22_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 1]['smart_22_raw'], kde = False)
plt.grid(True)
plt.title("smart_22_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df['smart_22_raw'].value_counts()
sns.distplot(df.loc[(df['failure'] == 0) & (df['smart_22_raw'] != 100.0)]['smart_22_raw'], kde = False)
sns.distplot(df.loc[(df['failure'] == 1) & (df['smart_22_raw'] != 100.0)]['smart_22_raw'], kde = False)
plt.grid(True)
plt.title("smart_22_raw Non100 Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df.loc[df['failure'] == 1]['smart_22_raw'].value_counts()
df.loc[df['smart_22_raw'].notnull()]['manufacturer'].value_counts()
len(df.loc[df['smart_22_raw'].isnull()])
df[['smart_22_raw', 'failure']].corr()
df.drop(['smart_22_raw'], axis=1, inplace=True)
df.head()
This column is not strongly split along manufacturer lines, but it still has a large percentage of missing values. Given the reasonably strong negative correlation between higher values and the failure rate, a categorical column smart_223_cat will be created with the following categories and values.
Value | Representation |
---|---|
0 | Value NaN |
1 | Below Average |
2 | Above Average |
The original smart_223_raw column will then be dropped.
sns.distplot(df.loc[df['smart_223_raw'].notnull()]['smart_223_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 0]['smart_223_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 1]['smart_223_raw'], kde = False)
plt.grid(True)
plt.title("smart_223_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df['smart_223_raw'].value_counts()
sns.distplot(df.loc[(df['failure'] == 0) & (df['smart_223_raw'] != 0.0)]['smart_223_raw'])
sns.distplot(df.loc[(df['failure'] == 1) & (df['smart_223_raw'] != 0.0)]['smart_223_raw'])
plt.grid(True)
plt.title("smart_223_raw Nonzero Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df.loc[df['failure'] == 1]['smart_223_raw'].value_counts()
df.loc[df['smart_223_raw'].notnull()]['manufacturer'].value_counts()
len(df.loc[df['smart_223_raw'].isnull()])
df[['smart_223_raw', 'failure']].corr()
smart_223_mean = df.loc[df['smart_223_raw'].notnull()]['smart_223_raw'].mean()
smart_223_mean
df.loc[(df['failure'] == 0) & (df['smart_223_raw'].notnull())]['smart_223_raw'].mean()
df.loc[(df['failure'] == 1) & (df['smart_223_raw'].notnull())]['smart_223_raw'].mean()
df['smart_223_cat'] = 0
df.loc[(df['smart_223_raw'] < smart_223_mean), 'smart_223_cat'] = 1
df.loc[(df['smart_223_raw'] > smart_223_mean), 'smart_223_cat'] = 2
df['smart_223_cat'] = df['smart_223_cat'].astype('category')
df['smart_223_cat'].dtype
df['smart_223_cat'].value_counts()
df['smart_223_cat'].isnull().sum()
df.drop(['smart_223_raw'], axis=1, inplace=True)
df.head()
if not os.path.isfile('pre_18_df.csv'):
df.to_csv('pre_18_df.csv', index=False)
df = pd.read_csv('pre_18_df.csv')
n_rows = len(df)
df['date'] = df['date'].astype('category')
df['model'] = df['model'].astype('category')
df['failure'] = df['failure'].astype('bool')
df['manufacturer'] = df['manufacturer'].astype('category')
df['smart_191_cat'] = df['smart_191_cat'].astype('category')
df['smart_184_cat'] = df['smart_184_cat'].astype('category')
df['smart_200_cat'] = df['smart_200_cat'].astype('category')
df['smart_196_cat'] = df['smart_196_cat'].astype('category')
df['smart_8_cat'] = df['smart_8_cat'].astype('category')
df['smart_2_cat'] = df['smart_2_cat'].astype('category')
df['smart_223_cat'] = df['smart_223_cat'].astype('category')
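Each checkpoint reload repeats the same dtype restoration, since a to_csv/read_csv round trip loses category and bool dtypes. A hypothetical helper (restore_dtypes is not part of the notebook) could centralize the repetition:

```python
import pandas as pd

def restore_dtypes(df, cat_cols, bool_cols=('failure',)):
    """Re-apply category and bool dtypes lost in a CSV round trip."""
    for col in cat_cols:
        df[col] = df[col].astype('category')
    for col in bool_cols:
        df[col] = df[col].astype('bool')
    return df
```

Each checkpoint cell could then pass its current list of categorical columns, keeping the list in one place per reload.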
This column is not only missing 97% of its values but also has no variance whatsoever, making it useless for analysis.
sns.distplot(df.loc[df['smart_18_raw'].notnull()]['smart_18_raw'], kde = False)
df['smart_18_raw'].value_counts()
df.loc[df['failure'] == 1]['smart_18_raw'].value_counts()
df.loc[df['smart_18_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_18_raw'], axis=1, inplace=True)
df.head()
This column is not only missing 97% of its values but also has no variance whatsoever, making it useless for analysis.
sns.distplot(df.loc[df['smart_224_raw'].notnull()]['smart_224_raw'], kde = False)
df['smart_224_raw'].value_counts()
df.loc[df['failure'] == 1]['smart_224_raw'].value_counts()
df.loc[df['smart_224_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_224_raw'], axis=1, inplace=True)
df.head()
This column is entirely split along manufacturer lines and has a large percentage of missing values, but it appears to be one of the few predictors available for Toshiba drives. Given the relatively strong negative correlation between higher values and the failure rate, a categorical column smart_220_cat will be created with the following categories and values.
Value | Representation |
---|---|
0 | Value NaN |
1 | Below Median |
2 | Above Median |
The original smart_220_raw column will then be dropped.
sns.distplot(df.loc[df['smart_220_raw'].notnull()]['smart_220_raw'])
sns.distplot(df.loc[df['failure'] == 0]['smart_220_raw'], kde = False)
sns.distplot(df.loc[df['failure'] == 1]['smart_220_raw'], kde = False)
plt.grid(True)
plt.title("smart_220_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df['smart_220_raw'].value_counts()
sns.distplot(df.loc[(df['failure'] == 0) & (df['smart_220_raw'] != 0.0)]['smart_220_raw'])
sns.distplot(df.loc[(df['failure'] == 1) & (df['smart_220_raw'] != 0.0)]['smart_220_raw'])
plt.grid(True)
plt.title("smart_220_raw Nonzero Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df.loc[df['failure'] == 1]['smart_220_raw'].value_counts()
len(df.loc[df['smart_220_raw'].isnull()])
df.loc[df['smart_220_raw'].notnull()]['manufacturer'].value_counts()
df[['smart_220_raw', 'failure']].corr()
smart_220_median = df.loc[df['smart_220_raw'].notnull()]['smart_220_raw'].median()
smart_220_median
df.loc[(df['failure'] == 0) & (df['smart_220_raw'].notnull())]['smart_220_raw'].mean()
df.loc[(df['failure'] == 1) & (df['smart_220_raw'].notnull())]['smart_220_raw'].mean()
df['smart_220_cat'] = 0
df.loc[(df['smart_220_raw'] < smart_220_median), 'smart_220_cat'] = 1
df.loc[(df['smart_220_raw'] > smart_220_median), 'smart_220_cat'] = 2
df['smart_220_cat'] = df['smart_220_cat'].astype('category')
df['smart_220_cat'].dtype
df['smart_220_cat'].value_counts()
df['smart_220_cat'].isnull().sum()
df.drop(['smart_220_raw'], axis=1, inplace=True)
df.head()
Although it is available only on Toshiba drives, this column has the strongest correlation with the failure rate yet. A categorical column smart_222_cat will be created with the following categories and values.
Value | Representation |
---|---|
0 | Value NaN |
1 | Below Average |
2 | Above Average |
The original smart_222_raw column will then be dropped.
sns.distplot(df.loc[df['smart_222_raw'].notnull()]['smart_222_raw'])
sns.distplot(df.loc[df['failure'] == 0]['smart_222_raw'])
sns.distplot(df.loc[df['failure'] == 1]['smart_222_raw'])
plt.grid(True)
plt.title("smart_222_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df['smart_222_raw'].value_counts()
df.loc[df['failure'] == 1]['smart_222_raw'].value_counts()
len(df.loc[df['smart_222_raw'].isnull()])
df.loc[df['smart_222_raw'].notnull()]['manufacturer'].value_counts()
df[['smart_222_raw', 'failure']].corr()
smart_222_mean = df.loc[df['smart_222_raw'].notnull()]['smart_222_raw'].mean()
smart_222_mean
df.loc[(df['failure'] == 0) & (df['smart_222_raw'].notnull())]['smart_222_raw'].mean()
df.loc[(df['failure'] == 1) & (df['smart_222_raw'].notnull())]['smart_222_raw'].mean()
df['smart_222_cat'] = 0
df.loc[(df['smart_222_raw'] < smart_222_mean), 'smart_222_cat'] = 1
df.loc[(df['smart_222_raw'] > smart_222_mean), 'smart_222_cat'] = 2
df['smart_222_cat'] = df['smart_222_cat'].astype('category')
df['smart_222_cat'].dtype
df['smart_222_cat'].value_counts()
df['smart_222_cat'].isnull().sum()
df.drop(['smart_222_raw'], axis=1, inplace=True)
df.head()
Although it is available only on Toshiba drives, this column has the strongest negative correlation with the failure rate yet. A categorical column smart_226_cat will be created with the following categories and values.
Value | Representation |
---|---|
0 | Value NaN |
1 | Below Average |
2 | Above Average |
The original smart_226_raw column will then be dropped.
sns.distplot(df.loc[df['smart_226_raw'].notnull()]['smart_226_raw'])
sns.distplot(df.loc[df['failure'] == 0]['smart_226_raw'])
sns.distplot(df.loc[df['failure'] == 1]['smart_226_raw'])
plt.grid(True)
plt.title("smart_226_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df['smart_226_raw'].value_counts()
df.loc[df['failure'] == 1]['smart_226_raw'].value_counts()
len(df.loc[df['smart_226_raw'].isnull()])
df.loc[df['smart_226_raw'].notnull()]['manufacturer'].value_counts()
df[['smart_226_raw', 'failure']].corr()
smart_226_mean = df.loc[df['smart_226_raw'].notnull()]['smart_226_raw'].mean()
smart_226_mean
df.loc[(df['failure'] == 0) & (df['smart_226_raw'].notnull())]['smart_226_raw'].mean()
df.loc[(df['failure'] == 1) & (df['smart_226_raw'].notnull())]['smart_226_raw'].mean()
df['smart_226_cat'] = 0
df.loc[(df['smart_226_raw'] < smart_226_mean), 'smart_226_cat'] = 1
df.loc[(df['smart_226_raw'] > smart_226_mean), 'smart_226_cat'] = 2
df['smart_226_cat'] = df['smart_226_cat'].astype('category')
df['smart_226_cat'].dtype
df['smart_226_cat'].value_counts()
df['smart_226_cat'].isnull().sum()
df.drop(['smart_226_raw'], axis=1, inplace=True)
df.head()
This column is not only missing 98% of its values but also has no variance whatsoever, making it useless for analysis.
sns.distplot(df.loc[df['smart_23_raw'].notnull()]['smart_23_raw'])
df['smart_23_raw'].value_counts()
df.loc[df['smart_23_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_23_raw'], axis=1, inplace=True)
df.head()
This column is not only missing 98% of its values but also has no variance whatsoever, making it useless for analysis.
sns.distplot(df.loc[df['smart_24_raw'].notnull()]['smart_24_raw'])
df['smart_24_raw'].value_counts()
df.loc[df['smart_24_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_24_raw'], axis=1, inplace=True)
df.head()
Although it has values for only 0.64% of drives, this column has the strongest correlation with the failure rate yet. A categorical column smart_11_cat will be created with the following categories and values.
Value | Representation |
---|---|
0 | Value NaN |
1 | Below Average |
2 | Above Average |
The original smart_11_raw column will then be dropped.
sns.distplot(df.loc[df['smart_11_raw'].notnull()]['smart_11_raw'])
sns.distplot(df.loc[df['failure'] == 0]['smart_11_raw'])
sns.distplot(df.loc[df['failure'] == 1]['smart_11_raw'])
plt.grid(True)
plt.title("smart_11_raw Distribution by Failure")
plt.legend(["Not Failed", "Failed"])
df['smart_11_raw'].value_counts()
df.loc[df['failure'] == 1]['smart_11_raw'].value_counts()
df.loc[df['smart_11_raw'].notnull()]['manufacturer'].value_counts()
len(df.loc[df['smart_11_raw'].isnull()])
df[['smart_11_raw', 'failure']].corr()
smart_11_mean = df.loc[df['smart_11_raw'].notnull()]['smart_11_raw'].mean()
smart_11_mean
df.loc[(df['failure'] == 0) & (df['smart_11_raw'].notnull())]['smart_11_raw'].mean()
df.loc[(df['failure'] == 1) & (df['smart_11_raw'].notnull())]['smart_11_raw'].mean()
df['smart_11_cat'] = 0
df.loc[(df['smart_11_raw'] < smart_11_mean), 'smart_11_cat'] = 1
df.loc[(df['smart_11_raw'] > smart_11_mean), 'smart_11_cat'] = 2
df['smart_11_cat'] = df['smart_11_cat'].astype('category')
df['smart_11_cat'].dtype
df['smart_11_cat'].value_counts()
df['smart_11_cat'].isnull().sum()
df.drop(['smart_11_raw'], axis=1, inplace=True)
df.head()
This column is not only missing 99.75% of its values but also has no variance whatsoever, making it useless for analysis.
sns.distplot(df.loc[df['smart_254_raw'].notnull()]['smart_254_raw'])
df['smart_254_raw'].value_counts()
df.loc[df['smart_254_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_254_raw'], axis=1, inplace=True)
df.head()
This column holds an interesting report: the first 3 bytes of the value are the drive's good block count, while the last 2 bytes are its bad block count. However, the column is missing 99.92% of its values, making it useless for this type of predictive analysis.
sns.distplot(df.loc[df['smart_235_raw'].notnull()]['smart_235_raw'])
df['smart_235_raw'].value_counts()
df.loc[df['smart_235_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_235_raw'], axis=1, inplace=True)
df.head()
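If smart_235 had enough values to be usable, the packed report could be decoded with bit masks. A sketch, assuming the good block count occupies the upper 3 bytes and the bad block count the lower 2 bytes of a 5-byte little-endian value (the byte order here is an assumption for illustration, not verified against a drive specification):

```python
def decode_smart_235(raw):
    """Split a packed 5-byte SMART 235 value into
    (good_blocks, bad_blocks). Byte layout is assumed:
    upper 3 bytes = good count, lower 2 bytes = bad count."""
    raw = int(raw)
    good_blocks = (raw >> 16) & 0xFFFFFF  # upper 3 bytes
    bad_blocks = raw & 0xFFFF             # lower 2 bytes
    return good_blocks, bad_blocks
```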
No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.
sns.distplot(df.loc[df['smart_233_raw'].notnull()]['smart_233_raw'])
df.loc[(df['smart_233_raw'].notnull()) & (df['failure'] == 1)]
df.loc[df['smart_233_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_233_raw'], axis=1, inplace=True)
df.head()
No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.
sns.distplot(df.loc[df['smart_232_raw'].notnull()]['smart_232_raw'])
df.loc[(df['smart_232_raw'].notnull()) & (df['failure'] == 1)]
df.loc[df['smart_232_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_232_raw'], axis=1, inplace=True)
df.head()
if not os.path.isfile('pre_168_df.csv'):
df.to_csv('pre_168_df.csv', index=False)
df = pd.read_csv('pre_168_df.csv')
n_rows = len(df)
df['date'] = df['date'].astype('category')
df['model'] = df['model'].astype('category')
df['failure'] = df['failure'].astype('bool')
df['manufacturer'] = df['manufacturer'].astype('category')
df['smart_191_cat'] = df['smart_191_cat'].astype('category')
df['smart_184_cat'] = df['smart_184_cat'].astype('category')
df['smart_200_cat'] = df['smart_200_cat'].astype('category')
df['smart_196_cat'] = df['smart_196_cat'].astype('category')
df['smart_8_cat'] = df['smart_8_cat'].astype('category')
df['smart_2_cat'] = df['smart_2_cat'].astype('category')
df['smart_223_cat'] = df['smart_223_cat'].astype('category')
df['smart_220_cat'] = df['smart_220_cat'].astype('category')
df['smart_222_cat'] = df['smart_222_cat'].astype('category')
df['smart_226_cat'] = df['smart_226_cat'].astype('category')
df['smart_11_cat'] = df['smart_11_cat'].astype('category')
No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.
sns.distplot(df.loc[df['smart_168_raw'].notnull()]['smart_168_raw'])
df.loc[(df['smart_168_raw'].notnull()) & (df['failure'] == 1)]
df.loc[df['smart_168_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_168_raw'], axis=1, inplace=True)
df.head()
No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.
sns.distplot(df.loc[df['smart_170_raw'].notnull()]['smart_170_raw'])
df.loc[(df['smart_170_raw'].notnull()) & (df['failure'] == 1)]
df.loc[df['smart_170_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_170_raw'], axis=1, inplace=True)
df.head()
No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.
sns.distplot(df.loc[df['smart_218_raw'].notnull()]['smart_218_raw'])
df.loc[(df['smart_218_raw'].notnull()) & (df['failure'] == 1)]
df.loc[df['smart_218_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_218_raw'], axis=1, inplace=True)
df.head()
No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.
sns.distplot(df.loc[df['smart_174_raw'].notnull()]['smart_174_raw'])
df.loc[(df['smart_174_raw'].notnull()) & (df['failure'] == 1)]
df.loc[df['smart_174_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_174_raw'], axis=1, inplace=True)
df.head()
No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.
sns.distplot(df.loc[df['smart_16_raw'].notnull()]['smart_16_raw'])
df.loc[(df['smart_16_raw'].notnull()) & (df['failure'] == 1)]
df.loc[df['smart_16_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_16_raw'], axis=1, inplace=True)
df.head()
No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.
sns.distplot(df.loc[df['smart_17_raw'].notnull()]['smart_17_raw'])
df.loc[(df['smart_17_raw'].notnull()) & (df['failure'] == 1)]
df.loc[df['smart_17_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_17_raw'], axis=1, inplace=True)
df.head()
#### Memory Management and Reloading Checkpoint
if not os.path.isfile('pre_173_df.csv'):
df.to_csv('pre_173_df.csv', index=False)
df = pd.read_csv('pre_173_df.csv')
n_rows = len(df)
df['date'] = df['date'].astype('category')
df['model'] = df['model'].astype('category')
df['failure'] = df['failure'].astype('bool')
df['manufacturer'] = df['manufacturer'].astype('category')
df['smart_191_cat'] = df['smart_191_cat'].astype('category')
df['smart_184_cat'] = df['smart_184_cat'].astype('category')
df['smart_200_cat'] = df['smart_200_cat'].astype('category')
df['smart_196_cat'] = df['smart_196_cat'].astype('category')
df['smart_8_cat'] = df['smart_8_cat'].astype('category')
df['smart_2_cat'] = df['smart_2_cat'].astype('category')
df['smart_223_cat'] = df['smart_223_cat'].astype('category')
df['smart_220_cat'] = df['smart_220_cat'].astype('category')
df['smart_222_cat'] = df['smart_222_cat'].astype('category')
df['smart_226_cat'] = df['smart_226_cat'].astype('category')
df['smart_11_cat'] = df['smart_11_cat'].astype('category')
No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.
sns.distplot(df.loc[df['smart_173_raw'].notnull()]['smart_173_raw'])
df.loc[(df['smart_173_raw'].notnull()) & (df['failure'] == 1)]
df.loc[df['smart_173_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_173_raw'], axis=1, inplace=True)
df.head()
No failures exist in the drives that have any value for this column, and it is also missing 99.92% of its values, making it useless for analysis.
sns.distplot(df.loc[df['smart_231_raw'].notnull()]['smart_231_raw'])
df.loc[(df['smart_231_raw'].notnull()) & (df['failure'] == 1)]
df.loc[df['smart_231_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_231_raw'], axis=1, inplace=True)
df.head()
sns.distplot(df.loc[df['smart_177_raw'].notnull()]['smart_177_raw'])
df.loc[(df['smart_177_raw'].notnull()) & (df['failure'] == 1)]
df.loc[df['smart_177_raw'].notnull()]['manufacturer'].value_counts()
df.drop(['smart_177_raw'], axis=1, inplace=True)
df.head()
if not os.path.isfile('explorative_df.csv'):
df.to_csv('explorative_df.csv', index = False)
df = pd.read_csv('explorative_df.csv')
n_rows = len(df)
df['date'] = df['date'].astype('category')
df['model'] = df['model'].astype('category')
df['failure'] = df['failure'].astype('bool')
df['manufacturer'] = df['manufacturer'].astype('category')
df['smart_191_cat'] = df['smart_191_cat'].astype('category')
df['smart_184_cat'] = df['smart_184_cat'].astype('category')
df['smart_200_cat'] = df['smart_200_cat'].astype('category')
df['smart_196_cat'] = df['smart_196_cat'].astype('category')
df['smart_8_cat'] = df['smart_8_cat'].astype('category')
df['smart_2_cat'] = df['smart_2_cat'].astype('category')
df['smart_223_cat'] = df['smart_223_cat'].astype('category')
df['smart_220_cat'] = df['smart_220_cat'].astype('category')
df['smart_222_cat'] = df['smart_222_cat'].astype('category')
df['smart_226_cat'] = df['smart_226_cat'].astype('category')
df['smart_11_cat'] = df['smart_11_cat'].astype('category')
fig, axes = plt.subplots(6, 6, figsize = (30, 25))
row = 0
col = 0
for df_col in ['date', 'model', 'failure', 'smart_1_raw',
'smart_3_raw', 'smart_4_raw', 'smart_5_raw', 'smart_7_raw',
'smart_9_raw', 'smart_10_raw', 'smart_12_raw', 'smart_187_raw',
'smart_188_raw', 'smart_190_raw', 'smart_192_raw', 'smart_194_raw',
'smart_197_raw', 'smart_198_raw', 'smart_199_raw', 'smart_240_raw',
'smart_241_raw', 'smart_242_raw', 'manufacturer', 'capacity_TB',
'smart_193_225', 'smart_191_cat', 'smart_184_cat', 'smart_200_cat',
'smart_196_cat', 'smart_8_cat', 'smart_2_cat', 'smart_223_cat',
'smart_220_cat', 'smart_222_cat', 'smart_226_cat', 'smart_11_cat']:
if col == 6:
row += 1
col = 0
# Histograms
if df[df_col].dtype.name == 'float64':
if df_col in ['smart_1_raw', 'smart_3_raw', 'smart_4_raw',
'smart_5_raw', 'smart_7_raw', 'smart_10_raw',
'smart_12_raw', 'smart_187_raw', 'smart_188_raw',
'smart_192_raw', 'smart_197_raw', 'smart_198_raw',
'smart_199_raw', 'smart_242_raw', 'smart_193_225']:
ax = sns.distplot(df[df_col], ax = axes[row, col], kde = False)
ax.set_yscale('log')
else:
ax = sns.distplot(df[df_col], ax = axes[row, col], kde = False)
# Countplots
elif df[df_col].dtype.name == 'category' or \
df[df_col].dtype.name == 'bool':
if df_col == "date":
ax = sns.countplot(df[df_col], ax = axes[row, col])
ax.set(xticklabels = [])
elif df_col == "model":
ax = sns.countplot(df[df_col], ax = axes[row, col])
ax.set(xticklabels = [])
ax.set_yscale('log')
elif df_col in ['smart_184_cat', 'smart_11_cat']:
ax = sns.countplot(df[df_col], ax = axes[row, col])
ax.set_yscale('log')
else:
sns.countplot(df[df_col], ax = axes[row, col])
else:
print("Unknown column dtype")
col += 1
plt.subplots_adjust(top = 0.90)
fig.suptitle("Distribution of Dataframe Columns", fontsize = 54, y = 0.95)
fig.savefig("Charts/Dataframe Distributions.svg")
fig.savefig("Charts/Dataframe Distributions.png")
fig, axes = plt.subplots(6, 6, figsize = (30, 25))
row = 0
col = 0
for df_col in ['date', 'model', 'failure', 'smart_1_raw',
'smart_3_raw', 'smart_4_raw', 'smart_5_raw', 'smart_7_raw',
'smart_9_raw', 'smart_10_raw', 'smart_12_raw', 'smart_187_raw',
'smart_188_raw', 'smart_190_raw', 'smart_192_raw', 'smart_194_raw',
'smart_197_raw', 'smart_198_raw', 'smart_199_raw', 'smart_240_raw',
'smart_241_raw', 'smart_242_raw', 'manufacturer', 'capacity_TB',
'smart_193_225', 'smart_191_cat', 'smart_184_cat', 'smart_200_cat',
'smart_196_cat', 'smart_8_cat', 'smart_2_cat', 'smart_223_cat',
'smart_220_cat', 'smart_222_cat', 'smart_226_cat', 'smart_11_cat']:
if col == 6:
row += 1
col = 0
# Histograms
if df[df_col].dtype.name == 'float64':
if df_col in ['smart_1_raw', 'smart_3_raw', 'smart_4_raw',
'smart_5_raw', 'smart_7_raw', 'smart_10_raw',
'smart_12_raw', 'smart_187_raw', 'smart_188_raw',
'smart_192_raw', 'smart_197_raw', 'smart_198_raw',
'smart_199_raw', 'smart_242_raw', 'smart_193_225']:
ax = sns.distplot(df.loc[df['failure'] == 1][df_col], ax = axes[row, col], kde = False)
ax.set_yscale('log')
else:
ax = sns.distplot(df.loc[df['failure'] == 1][df_col], ax = axes[row, col], kde = False)
# Countplots
elif df[df_col].dtype.name == 'category' or df[df_col].dtype.name == 'bool':
if df_col == "date":
ax = sns.countplot(df.loc[df['failure'] == 1][df_col], ax = axes[row, col])
ax.set(xticklabels = [])
elif df_col == "model":
ax = sns.countplot(df.loc[df['failure'] == 1][df_col], ax = axes[row, col])
ax.set(xticklabels = [])
ax.set_yscale('log')
elif df_col in ['smart_184_cat', 'smart_11_cat']:
ax = sns.countplot(df.loc[df['failure'] == 1][df_col], ax = axes[row, col])
else:
sns.countplot(df.loc[df['failure'] == 1][df_col], ax = axes[row, col])
else:
print("Unknown column dtype")
col += 1
plt.subplots_adjust(top = 0.90)
fig.suptitle("Distribution of Dataframe Failure", fontsize = 54, y = 0.95)
fig.savefig("Charts/Dataframe Failure Distributions.svg")
fig.savefig("Charts/Dataframe Failure Distributions.png")
With all NaN values interpolated or their columns removed, correlations can be determined between the columns.
corr_df = df.corr(method = 'pearson')
corr_df
A few of the relationships between columns need closer examination based on these correlation coefficients.
The most prominent feature is smart_9_raw, the column with the most extreme correlations with the other columns. This is understandable given that SMART attribute 9 records the total number of hours the drive has been in a power-on state (Acronis, Knowledge Base 9109), and most issues worth measuring are likely to accumulate with drive age and amount of operation. The column may also be a powerful predictor within the models, as an older drive is generally more likely to wear down and fail suddenly than a newer one, even when no other warning values are present. Conversely, even when other predictors of failure are present in an instance, a drive with an average or below-average smart_9_raw value may fail far sooner than the average length of time to failure.
smart_240_raw also has quite high correlations with other independent variables.
smart_197_raw and smart_198_raw are almost perfectly collinear with each other and only weakly correlated with any other column. smart_198_raw will be dropped, as it has the lower correlation with the dependent variable failure.
Finally, smart_190_raw and smart_194_raw are highly collinear with each other and only weakly correlated with any other column, so one of the two likely needs to be removed.
The dataset may be large enough that multicollinearity does not reduce the predictive power of the models, but the redundant information may still skew the results.
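One way to quantify the redundancy among predictors is the variance inflation factor (VIF), where values above roughly 10 are commonly flagged as problematic. A minimal sketch with a hypothetical vif helper (computed from first principles rather than the notebook's dataframe):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples, n_features).
    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j
    on the remaining columns (with an intercept)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = resid @ resid
        ss_tot = ((y - y.mean()) ** 2).sum()
        r2 = 1 - ss_res / ss_tot
        out.append(1.0 / (1.0 - r2) if r2 < 1 else np.inf)
    return out
```

Applied to the numeric SMART columns, near-duplicate pairs such as the ones noted above would produce very large VIFs, while independent columns stay near 1.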
Among the potential predictors of failure, smart_5_raw and smart_197_raw have the highest positive correlations with failure, at 4.4% and 2.7% respectively. SMART attribute 5 is the drive's reallocated sector count, which increments when a read, write, or verification error forces a sector to be remapped (Acronis, Knowledge Base 9105). SMART attribute 197 is the current pending sector count, the number of unstable sectors awaiting remapping (Acronis, Knowledge Base 9133). That value decreases as sectors are remapped, but it would remain consistently high if the sectors cannot be remapped. Both attributes make complete sense as the values most correlated with failure and will likely be the most important predictor variables for HDD failure.
df[['smart_197_raw', 'smart_198_raw', 'failure']].corr()
df.drop('smart_198_raw', axis = 1, inplace = True)
corr_df = df.corr(method = 'pearson')
fig, ax = plt.subplots(figsize = (30, 23))
sns.heatmap(
corr_df,
ax = ax,
annot = True,
fmt = ".1%",
vmin = -1, vmax = 1, center = 0,
linewidths = 3,
linecolor = "white",
xticklabels = corr_df.columns,
yticklabels = corr_df.columns,
square = True,
cbar = True
)
plt.title("Dataframe Correlation Heatmap", fontsize = 54)
fig.savefig("Charts/Corr Heatmap.svg")
fig.savefig("Charts/Corr Heatmap.png")
df.columns
from sklearn.feature_selection import chi2
import scipy.stats as scs
# Display the results of a Chi-Squared test on a contingency table
# in a tabular format
def chi2_output(contingency: pd.core.frame.DataFrame):
chi2, p, dof, expected = scs.chi2_contingency(contingency)
print("χ2-Coefficient: \t" + str(chi2))
print("P-Value: \t\t" + str(p))
print("Degrees of Freedom: \t" + str(dof))
# Access the index names of the contingency table dataframe
ax_1 = str(contingency.axes[1][0])
ax_2 = str(contingency.axes[1][1])
ax_title = ax_1 + ":\t\t" + ax_2 + ":\t\t" + ax_1 + ":\t\t" + ax_2 + ":"
print("Expected Values:\n\t\t\tExpected:\t\t\tActual:")
print("\t" + contingency.axes[1].name + ":\t" + ax_title)
print(contingency.axes[0].name + ":")
# Map the indexes to string values to ensure numeric indexes
# don't cause type errors
contingency.index = contingency.index.map(str)
for i, j in enumerate(contingency.index):
expected_false = "{:.3f}".format(expected[i][0])
actual_false = str(contingency[0][i])
expected_true = "{:.3f}".format(expected[i][1])
actual_true = str(contingency[1][i])
# Tabular spacing adjustments on the assumption that 1 tab = 8 spaces
index_text = " " + j + ": \t"
if len(j) < 3:
index_text += "\t\t"
elif len(j) < 9:
index_text += "\t"
if len(expected_false) < 7:
expected_false += "\t"
if len(actual_false) < 7:
actual_false += "\t"
if len(expected_true) < 7:
expected_true += "]\t"
else:
expected_true += "]"
expected_text = "[" + expected_false + "\t" + expected_true
if len(expected_text) < 16:
expected_text = expected_text + "\t"
actual_text = "[" + actual_false + "\t" + actual_true + "]"
line = expected_text + "\t" + actual_text
print(index_text + line)
import os
import sys
from cffi import FFI
# cffi is used here only to force-load the R DLLs so that rpy2 can find them.
FFI_ = FFI()
def prepend_path_env(added_paths, to_env='PATH'):
    path_sep = ';'
    prior_path_env = os.environ.get(to_env, '')
    prior_paths = prior_path_env.split(path_sep)
    # Only add paths that actually exist on this machine
    added_paths = [x for x in added_paths if os.path.exists(x)]
    new_paths = prior_paths + added_paths
    new_env_val = path_sep.join(new_paths)
    return new_env_val
libs_path = r"C:\Users\aedri\Anaconda3\envs\tf1\lib\R\bin\x64"
libs_path2 = r"C:\Users\aedri\Anaconda3\envs\tf1\Lib\R\library\stats\libs\x64"
# The DLLs are resolved by name through the PATH entries added below
dll_path = "R.dll"
dll_path2 = "stats.dll"
to_env = 'PATH'
if(sys.platform == 'win32'):
os.environ[to_env] = prepend_path_env([libs_path], to_env)
os.environ[to_env] = prepend_path_env([libs_path2], to_env)
LIB = FFI_.dlopen(dll_path, 1) # 1 for Lazy loading
dir(LIB)
LIB2 = FFI_.dlopen(dll_path2, 1) # 1 for Lazy loading
dir(LIB2)
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
import rpy2.robjects as ro
rpy2.robjects.numpy2ri.activate()
rstats = importr('stats')
# Display the formatted results of the R stats Fisher_Test,
# using Monte Carlo Simulation
def r_fisher_output(dataframe):
results = rstats.fisher_test(dataframe.to_numpy(), \
simulate_p_value = True)
# Convert the listvector object returned from R stats to
# a list of string values
d = [key + "_" + str(results.rx2(key)[0]) for key in results.names]
d2 = []
for i in d:
d2.append("".join(i.replace("\t", "").splitlines()))
    # Replicate the tabular data formatting
for line in d2:
if len(line.split("_")[0]) < 8:
print(line.replace("_", "\t\t"))
else:
print(line.replace("_", "\t"))
manufacturer_contingency = pd.crosstab(df['manufacturer'], df['failure'])
manufacturer_contingency
pd.crosstab(df['manufacturer'], df['failure'], normalize = "index")
chi2_output(manufacturer_contingency)
r_fisher_output(manufacturer_contingency)
The model column will ultimately be dropped, even after all of the work that went into cleaning its data. Its large number of categories adds substantial complexity to the models without a corresponding improvement. The manufacturer column, while less specific, captures much of the same variation with only four categories. Additionally, many of the models do not record a single failure, and many more represent only a few thousand hard-drive-days. Leaving the column in for predictive modeling and analysis would only hurt the overall results, so it is removed.
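The cardinality argument above can be illustrated on a tiny hypothetical table (model names and counts are made up, not taken from the Backblaze data):

```python
import pandas as pd

# Hypothetical drive-day records; model names are made up for illustration
records = pd.DataFrame({
    "model":        ["A1", "A1", "A2", "B7", "B7", "B7", "C3", "C3"],
    "manufacturer": ["A",  "A",  "A",  "B",  "B",  "B",  "C",  "C"],
    "failure":      [0,    1,    0,    0,    0,    0,    0,    0],
})

# Per-model support and failure counts: most levels carry no failure signal
per_model = records.groupby("model")["failure"].agg(["count", "sum"])
print(per_model)

# The manufacturer column captures the same grouping with far fewer levels
print(records["model"].nunique(), "models vs",
      records["manufacturer"].nunique(), "manufacturers")
```

Levels with zero failures and tiny support contribute nothing a model can learn from, while each extra level still costs a dummy column after encoding.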
model_contingency = pd.crosstab(df['model'], df['failure'])
model_contingency
pd.crosstab(df['model'], df['failure'], normalize = "index")
chi2_output(model_contingency)
r_fisher_output(model_contingency)
df.drop('model', axis = 1, inplace = True)
capacity_contingency = pd.crosstab(df['capacity_TB'], df['failure'])
capacity_contingency
pd.crosstab(df['capacity_TB'], df['failure'], normalize = "index")
chi2_output(capacity_contingency)
r_fisher_output(capacity_contingency)
This column has the highest p-value of all the categorical columns. While still statistically significant, that is likely a consequence of the dataset's size rather than a genuine relationship; smart_191_cat is unlikely to be a good predictor variable.
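The size effect noted above, where a negligible association becomes "significant" simply because n is huge, can be demonstrated with two contingency tables that share the same weak effect (the counts are invented for illustration):

```python
import numpy as np
from scipy import stats

# Two contingency tables with the same weak association (1.00% vs 1.10%
# failure rates); the second simply has 100x the sample size
small = np.array([[9_900, 100],
                  [9_890, 110]])
big = small * 100

p_small = stats.chi2_contingency(small)[1]
p_big = stats.chi2_contingency(big)[1]
print(p_small, p_big)  # the p-value shrinks with n even though the effect is fixed
```

With roughly 10.8 million drive-day rows in the quarter, even trivial differences in failure rate will clear the 0.05 threshold, so effect size matters more than the p-value alone.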
smart_191_contingency = pd.crosstab(df['smart_191_cat'], df['failure'])
smart_191_contingency
pd.crosstab(df['smart_191_cat'], df['failure'], normalize = "index")
chi2_output(smart_191_contingency)
r_fisher_output(smart_191_contingency)
With a reported p-value of 0.0 (i.e., below floating-point precision), this is likely the strongest relationship with failure in the dataset.
smart_184_contingency = pd.crosstab(df['smart_184_cat'], df['failure'])
smart_184_contingency
pd.crosstab(df['smart_184_cat'], df['failure'], normalize = "index")
chi2_output(smart_184_contingency)
r_fisher_output(smart_184_contingency)
smart_200_contingency = pd.crosstab(df['smart_200_cat'], df['failure'])
smart_200_contingency
pd.crosstab(df['smart_200_cat'], df['failure'], normalize = "index")
chi2_output(smart_200_contingency)
r_fisher_output(smart_200_contingency)
smart_196_contingency = pd.crosstab(df['smart_196_cat'], df['failure'])
smart_196_contingency
pd.crosstab(df['smart_196_cat'], df['failure'], normalize = "index")
chi2_output(smart_196_contingency)
r_fisher_output(smart_196_contingency)
smart_8_contingency = pd.crosstab(df['smart_8_cat'], df['failure'])
smart_8_contingency
pd.crosstab(df['smart_8_cat'], df['failure'], normalize = "index")
chi2_output(smart_8_contingency)
r_fisher_output(smart_8_contingency)
smart_2_contingency = pd.crosstab(df['smart_2_cat'], df['failure'])
smart_2_contingency
pd.crosstab(df['smart_2_cat'], df['failure'], normalize = "index")
chi2_output(smart_2_contingency)
r_fisher_output(smart_2_contingency)
smart_223_contingency = pd.crosstab(df['smart_223_cat'], df['failure'])
smart_223_contingency
pd.crosstab(df['smart_223_cat'], df['failure'], normalize = "index")
chi2_output(smart_223_contingency)
r_fisher_output(smart_223_contingency)
smart_220_contingency = pd.crosstab(df['smart_220_cat'], df['failure'])
smart_220_contingency
pd.crosstab(df['smart_220_cat'], df['failure'], normalize = "index")
chi2_output(smart_220_contingency)
r_fisher_output(smart_220_contingency)
smart_222_contingency = pd.crosstab(df['smart_222_cat'], df['failure'])
smart_222_contingency
pd.crosstab(df['smart_222_cat'], df['failure'], normalize = "index")
chi2_output(smart_222_contingency)
r_fisher_output(smart_222_contingency)
smart_226_contingency = pd.crosstab(df['smart_226_cat'], df['failure'])
smart_226_contingency
pd.crosstab(df['smart_226_cat'], df['failure'], normalize = "index")
chi2_output(smart_226_contingency)
r_fisher_output(smart_226_contingency)
smart_11_contingency = pd.crosstab(df['smart_11_cat'], df['failure'])
smart_11_contingency
pd.crosstab(df['smart_11_cat'], df['failure'], normalize = "index")
chi2_output(smart_11_contingency)
r_fisher_output(smart_11_contingency)
To begin the factor analysis, the dataset needs to be prepared through standardization and normalization, as well as the train, test, and validation splits. Performing these steps before the PCA ensures that the training data is not contaminated by any influence from the testing and validation data.
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
y_df = df['failure']
x_df = df.drop('failure', axis = 1)
x_df.drop(['date', 'serial_number'], axis = 1, inplace = True)
del df
The first split is 80% Train and 20% Test, stratified on the y_df / failure series.
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, \
test_size = 0.2, random_state = 13, stratify = y_df)
Verify the stratified splitting.
y_train.value_counts()
y_train.value_counts()[1] / y_train.value_counts()[0]
y_test.value_counts()
Note that while the ratio is not exact, it is the closest possible given whole-number counts.
y_test.value_counts()[1] / y_test.value_counts()[0]
(y_test.value_counts()[1] - 1) / y_test.value_counts()[0]
(y_test.value_counts()[1] + 1) / y_test.value_counts()[0]
The second split is 87.5% Train and 12.5% Validation, stratified on the y_df / failure series, to result in 70% Train and 10% Validation overall.
x_train, x_valid, y_train, y_valid = train_test_split(x_train, y_train, \
test_size = 0.125, random_state = 13, stratify = y_train)
y_train.value_counts()
y_train.value_counts()[1] / y_train.value_counts()[0]
y_valid.value_counts()
y_valid.value_counts()[1] / y_valid.value_counts()[0]
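The arithmetic of the two-stage split (80/20, then 87.5/12.5 of the remainder, yielding 70/10/20 overall) and the preservation of the failure ratio can be checked on toy data with a 1% positive rate (sizes chosen so the stratified counts come out exact):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10,000 toy rows with a 1% "failure" rate
y = np.array([1] * 100 + [0] * 9_900)
X = np.arange(len(y)).reshape(-1, 1)

# Stage 1: hold out 20% as test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=13, stratify=y)
# Stage 2: 12.5% of the remaining 80% becomes validation (10% overall)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tr, y_tr, test_size=0.125, random_state=13, stratify=y_tr)

print(len(y_tr), len(y_va), len(y_te))     # 7000 1000 2000
print(y_tr.sum(), y_va.sum(), y_te.sum())  # 70 10 20 failures
```

Stratification keeps the 1% failure rate in all three partitions, which is exactly the property verified with value_counts() above.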
A scaler fit to the training data is created to standardize the continuous columns for model training. This avoids any contamination of the training data by ensuring that the test and validation datasets do not influence the training data at all.
scaler = preprocessing.StandardScaler()
x_train.columns
cont_cols = [
'smart_1_raw', 'smart_3_raw', 'smart_4_raw', 'smart_5_raw',
'smart_7_raw', 'smart_9_raw', 'smart_10_raw', 'smart_12_raw',
'smart_187_raw', 'smart_188_raw', 'smart_190_raw', 'smart_192_raw',
'smart_194_raw', 'smart_197_raw', 'smart_199_raw', 'smart_240_raw',
'smart_241_raw', 'smart_242_raw', 'smart_193_225', 'capacity_TB'
]
This fits the scaler to the continuous columns of the training data. The fit scaler will then be used to scale the testing and validation datasets.
x_train[cont_cols] = scaler.fit_transform(x_train[cont_cols])
A mean as close to zero as the data allows and a standard deviation of 1 indicate a successful standardization.
x_train[cont_cols].describe()
x_test[cont_cols] = scaler.transform(x_test[cont_cols])
x_valid[cont_cols] = scaler.transform(x_valid[cont_cols])
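The fit-on-train, transform-everywhere pattern used above can be sketched in isolation (the numbers are toy values, not the notebook's data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[10.0]])   # an outlier that must not influence the scaling

scaler = StandardScaler().fit(train)   # statistics come from train only
train_z = scaler.transform(train)
test_z = scaler.transform(test)

print(train_z.mean(), train_z.std())   # ~0.0 and 1.0 on the training data
print(test_z[0, 0])                    # scaled with train's mean/std: ~6.7
```

Because the test value never touches `fit`, it cannot leak its distribution into the training data; it is simply expressed in the training data's units.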
import prince
pca = prince.PCA(
n_components = len(cont_cols),
n_iter = 3 ,
copy = True,
check_input = True,
random_state = 13
)
pca = pca.fit(x_train[cont_cols])
ax = pca.plot_row_coordinates(
x_train[cont_cols],
ax = None,
figsize = (6, 6),
x_component = 0,
y_component = 1
)
# No .svg file will be saved for this plot as it takes up
# 1.07 GB (1,158,481,389 bytes).
#plt.savefig("Charts/PCA.svg")
plt.savefig("Charts/PCA.png")
pca_results_df = pca.column_correlations(x_train[cont_cols])
pca_results_df
fig, ax = plt.subplots(figsize = (30, 23))
sns.heatmap(
pca_results_df,
ax = ax,
annot = True,
fmt = ".1%",
vmin = -1, vmax = 1, center = 0,
linewidths = 3,
linecolor = "white",
xticklabels = pca_results_df.columns,
yticklabels = pca_results_df.index,
square = True,
cbar = True
)
plt.title("PCA Results Heatmap", fontsize = 54)
plt.savefig("Charts/PCA Heatmap.svg")
plt.savefig("Charts/PCA Heatmap.png")
pca_eigenvalues = pca.eigenvalues_
pca_eigenvalues
pca.explained_inertia_
plt.plot(np.arange(len(cont_cols)), pca_eigenvalues, 'ro-')
plt.title("PCA Scree Plot")
plt.xlabel("Principal Component")
plt.ylabel("Eigenvalue")
plt.xticks(range(0, len(cont_cols)))
plt.grid(b = True, which = 'major', color = 'w', linewidth = 1.0)
plt.grid(b = True, which = 'minor', color = 'w', linewidth = 0.5)
plt.savefig("Charts/PCA Scree Plot.svg")
plt.savefig("Charts/PCA Scree Plot.png")
cum_inertia = [0]
for i, e in enumerate(pca_eigenvalues):
cum_inertia.append(sum(pca_eigenvalues[0:i+1]) / sum(pca_eigenvalues))
cum_inertia
sum(pca_eigenvalues[0:13]) / sum(pca_eigenvalues)
plt.plot(range(0, len(cum_inertia)), cum_inertia)
plt.title("Inertia by Principal Components Kept")
plt.xlabel("Number of Principal Components")
plt.ylabel("Inertia")
plt.xticks(range(0, len(cum_inertia)))
plt.grid(b=True, which = 'major', color = 'w', linewidth = 1.0)
plt.grid(b=True, which = 'minor', color = 'w', linewidth = 0.5)
plt.savefig("Charts/PCA Inertia Plot.svg")
plt.savefig("Charts/PCA Inertia Plot.png")
The eigenvalues and explained inertia were used to create a scree plot, which, together with the cumulative inertia, indicates that 13 principal components are an appropriate degree of dimensionality reduction: these components capture 82.37% of the dataset's inertia using only 13 of the 20 components (65%).
PCA as a form of dimensionality reduction ensures that as little information, in the form of inertia, as possible is lost for a given number of dimensions removed. Because this dataset is quite large, any amount of dimensionality reduction will greatly improve the speed and the chance of proper convergence of the predictive models to come. The result reduces the dimensionality of the data by 35% while losing only 17.63% of the information, roughly a 2-for-1 trade.
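The same component-count selection can be sketched with scikit-learn's PCA on synthetic data (the notebook itself uses prince; the 5-factor toy data and the 80% threshold here are illustrative assumptions only):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 20 features driven by 5 latent factors plus small noise
rng = np.random.default_rng(13)
latent = rng.normal(size=(1_000, 5))
X = latent @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(1_000, 20))

pca = PCA(n_components=20).fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components that reaches an 80% inertia threshold
k = int(np.searchsorted(cum, 0.80)) + 1
print(k, float(cum[k - 1]))
```

Because only a handful of latent factors drive the toy data, a small k clears the threshold, mirroring how 13 of 20 components sufficed above.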
pca = prince.PCA(
n_components = 13,
n_iter = 3,
copy = True,
check_input = True,
random_state = 13
)
pca = pca.fit(x_train[cont_cols])
pca.explained_inertia_
pca_df = pca.transform(x_train[cont_cols])
pca_df = pca_df.add_prefix('pca_component_')
pca_df
pca_df.info()
# Replace the columns that factored in the PCA with
# the reduced-dimension PCA results.
x_train.drop(cont_cols, axis = 1, inplace = True)
x_train = x_train.join(pca_df)
x_train.info()
x_train.head()
pca_df = pca.transform(x_test[cont_cols])
pca_df = pca_df.add_prefix('pca_component_')
pca_df
# Replace the columns that factored in the PCA with
# the reduced-dimension PCA results.
x_test.drop(cont_cols, axis = 1, inplace = True)
x_test = x_test.join(pca_df)
x_test.info()
pca_df = pca.transform(x_valid[cont_cols])
pca_df = pca_df.add_prefix('pca_component_')
pca_df
# Replace the columns that factored in the PCA with
# the reduced-dimension PCA results.
x_valid.drop(cont_cols, axis = 1, inplace = True)
x_valid = x_valid.join(pca_df)
x_valid.info()
if not os.path.isfile('pca_x_train.csv'):
x_train.to_csv('pca_x_train.csv', index = False)
if not os.path.isfile('y_train.csv'):
y_train.to_csv('y_train.csv', index = False, header = True)
if not os.path.isfile('pca_x_test.csv'):
x_test.to_csv('pca_x_test.csv', index = False)
if not os.path.isfile('y_test.csv'):
y_test.to_csv('y_test.csv', index = False, header = True)
if not os.path.isfile('pca_x_valid.csv'):
x_valid.to_csv('pca_x_valid.csv', index = False)
if not os.path.isfile('y_valid.csv'):
y_valid.to_csv('y_valid.csv', index = False, header = True)
reload_pca = False
if reload_pca:
    x_train = pd.read_csv('pca_x_train.csv')
    y_train = pd.read_csv('y_train.csv')
    x_test = pd.read_csv('pca_x_test.csv')
    y_test = pd.read_csv('y_test.csv')
    x_valid = pd.read_csv('pca_x_valid.csv')
    y_valid = pd.read_csv('y_valid.csv')
n_rows = len(x_train)
#df['manufacturer'] = df['manufacturer'].astype('category')
#df['smart_191_cat'] = df['smart_191_cat'].astype('category')
#df['smart_184_cat'] = df['smart_184_cat'].astype('category')
#df['smart_200_cat'] = df['smart_200_cat'].astype('category')
#df['smart_196_cat'] = df['smart_196_cat'].astype('category')
#df['smart_8_cat'] = df['smart_8_cat'].astype('category')
#df['smart_2_cat'] = df['smart_2_cat'].astype('category')
#df['smart_223_cat'] = df['smart_223_cat'].astype('category')
#df['smart_220_cat'] = df['smart_220_cat'].astype('category')
#df['smart_222_cat'] = df['smart_222_cat'].astype('category')
#df['smart_226_cat'] = df['smart_226_cat'].astype('category')
#df['smart_11_cat'] = df['smart_11_cat'].astype('category')
To begin the MCA, the categorical columns need to be converted to boolean dummy-encoded columns.
cat_cols = [
'manufacturer', 'smart_191_cat', 'smart_184_cat',
'smart_200_cat', 'smart_196_cat', 'smart_8_cat',
'smart_2_cat', 'smart_223_cat', 'smart_220_cat',
'smart_222_cat', 'smart_226_cat', 'smart_11_cat'
]
x_train_cat = pd.get_dummies(x_train[cat_cols], \
columns = cat_cols, dtype = bool)
x_test_cat = pd.get_dummies(x_test[cat_cols], \
columns = cat_cols, dtype = bool)
x_valid_cat = pd.get_dummies(x_valid[cat_cols], \
columns = cat_cols, dtype = bool)
x_train_cat.columns
# The MCA is fit on the training-set encodings only
cat_df = x_train_cat
cat_df.dtypes
cat_df.memory_usage().sum()
mca = prince.MCA(
n_components = 13,
n_iter = 3,
copy = True,
random_state = 13
)
mca = mca.fit(cat_df)
ax = mca.plot_coordinates(
X = cat_df,
ax = None,
figsize=(20, 20),
show_row_points = True,
row_points_size = 10,
show_row_labels = False,
show_column_points = True,
column_points_size = 30,
show_column_labels = False,
legend_n_cols = 1
)
plt.savefig('Charts/MCA With Rows.png')
mca_eigenvalues = mca.eigenvalues_
mca_eigenvalues
plt.plot(np.arange(len(mca_eigenvalues)), mca_eigenvalues, 'ro-')
plt.title("Scree Plot")
plt.xlabel("Principal Component")
plt.ylabel("Eigenvalue")
plt.show()
ax = mca.plot_coordinates(
X = cat_df,
ax = None,
figsize = (20, 20),
show_row_points = False,
show_row_labels = False,
show_column_points = True,
column_points_size = 30,
show_column_labels = True,
legend_n_cols = 3
)
plt.savefig('Charts/MCA.svg')
plt.savefig('Charts/MCA.png')
# Replace the categorical columns with their encoded representation columns.
x_train.drop(cat_cols, axis = 1, inplace = True)
x_train = x_train.join(x_train_cat)
x_train.info()
# Replace the categorical columns with their encoded representation columns.
x_test.drop(cat_cols, axis = 1, inplace = True)
x_test = x_test.join(x_test_cat)
# Replace the categorical columns with their encoded representation columns.
x_valid.drop(cat_cols, axis = 1, inplace = True)
x_valid = x_valid.join(x_valid_cat)
if not os.path.isfile('cat_x_train.csv'):
x_train.to_csv('cat_x_train.csv', index = False)
if not os.path.isfile('cat_x_test.csv'):
x_test.to_csv('cat_x_test.csv', index = False)
if not os.path.isfile('cat_x_valid.csv'):
x_valid.to_csv('cat_x_valid.csv', index = False)
reload_cat = True
if reload_cat:
x_train = pd.read_csv('cat_x_train.csv')
y_train = pd.read_csv('y_train.csv')
x_test = pd.read_csv('cat_x_test.csv')
y_test = pd.read_csv('y_test.csv')
x_valid = pd.read_csv('cat_x_valid.csv')
y_valid = pd.read_csv('y_valid.csv')
n_rows = len(x_train)
While Factor Analysis of Mixed Data (FAMD) would have been ideal for dimensionality reduction, the current hardware requirements and software availability do not allow for it with such a large dataset.
Traditional training fails because hard drive failure is an extremely rare occurrence: the model learns to predict only non-failure, making it useless for actually predicting failures. This is why rebalancing the training data, by oversampling the failures and/or undersampling the non-failures, improves the training of the predictive models.
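The core idea behind the SMOTE oversampling used below can be sketched in plain NumPy (this is only the interpolation intuition, not imblearn's exact algorithm, which samples among k=5 nearest neighbors; the data here is synthetic):

```python
import numpy as np

# Synthetic rare "failure" rows standing in for the minority class
rng = np.random.default_rng(13)
minority = rng.normal(loc=3.0, size=(20, 4))

def smote_like(X_min, n_new, rng):
    """Create n_new synthetic rows by interpolating between a random
    minority point and its nearest minority neighbor."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf                      # exclude the point itself
        j = int(np.argmin(d))
        gap = rng.random()                 # random spot on the segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

new_rows = smote_like(minority, n_new=960, rng=rng)
print(new_rows.shape)   # the minority class grown from 20 to 980 rows
```

Because every synthetic row sits on a segment between two real minority points, the new samples stay inside the minority class's region of feature space rather than being naive duplicates.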
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, \
classification_report, roc_curve, auc
regression = LogisticRegression(solver = 'sag', n_jobs = -1)
regression.fit(x_train, y_train.values.ravel())
regression.intercept_
regression.coef_
coefs = pd.concat([pd.DataFrame(x_train.columns),
pd.DataFrame(np.transpose(regression.coef_))], axis = 1)
coefs.columns = ["Column", "Coefficient"]
coefs
coefs.where(coefs['Coefficient'] > 0).sort_values(['Coefficient'], \
ascending = False).dropna()
coefs.where(coefs['Coefficient'] < 0).sort_values(['Coefficient']).dropna()
accuracy = regression.score(x_test, y_test)
accuracy
predictions = regression.predict(x_test)
actual = y_test
confusion = confusion_matrix(actual, predictions)
confusion
precision = precision_score(actual, predictions)
precision
print(classification_report(actual, predictions))
sm = SMOTE(random_state = 13)
x_train, y_train = sm.fit_resample(x_train, y_train)
y_train['failure'].value_counts()
if not os.path.isfile('smote_x_df.pkl'):
x_train.to_pickle('smote_x_df.pkl')
if not os.path.isfile('smote_y_df.pkl'):
y_train.to_pickle('smote_y_df.pkl')
reload_smote = True
if reload_smote:
x_train = pd.read_pickle('smote_x_df.pkl')
y_train = pd.read_pickle('smote_y_df.pkl')
x_test = pd.read_csv('cat_x_test.csv')
y_test = pd.read_csv('y_test.csv')
x_valid = pd.read_csv('cat_x_valid.csv')
y_valid = pd.read_csv('y_valid.csv')
n_rows = len(x_train)
regression = LogisticRegression(solver = 'liblinear')
regression.fit(x_train, y_train)
regression_accuracy = regression.score(x_test, y_test)
regression_accuracy
regression_predictions = regression.predict(x_test)
actual = y_test
regression_confusion = confusion_matrix(actual, regression_predictions)
regression_confusion
regression_precision = precision_score(actual, regression_predictions)
regression_precision
print(classification_report(actual, regression_predictions))
regression_probabilities = regression.predict_proba(x_test)
predictions = regression_probabilities[:,1]
regression_false_positive_rate, regression_true_positive_rate, threshold =\
roc_curve(y_test, predictions)
regression_roc_auc = auc(regression_false_positive_rate, regression_true_positive_rate)
regression_roc_auc
plt.title('Liblinear Logistic Regression Receiver Operating Characteristic Curve')
plt.plot(regression_false_positive_rate, regression_true_positive_rate, 'blue',
label = 'AUC = %0.2f' % regression_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/Liblinear Logistic ROC AUC.svg')
plt.show()
# Save the model to disk
pickle.dump(regression, open('Models/Logistic Regression Liblinear.sav', 'wb'))
regression = LogisticRegression(solver = 'sag', n_jobs = -1)
regression.fit(x_train, y_train)
regression_accuracy = regression.score(x_test, y_test)
regression_accuracy
regression_predictions = regression.predict(x_test)
actual = y_test
regression_confusion = confusion_matrix(actual, regression_predictions)
regression_confusion
regression_precision = precision_score(actual, regression_predictions)
regression_precision
print(classification_report(actual, regression_predictions))
regression_probabilities = regression.predict_proba(x_test)
predictions = regression_probabilities[:,1]
regression_false_positive_rate, regression_true_positive_rate, threshold =\
roc_curve(y_test, predictions)
regression_roc_auc = auc(regression_false_positive_rate, regression_true_positive_rate)
regression_roc_auc
plt.title('SAG Logistic Regression Receiver Operating Characteristic Curve')
plt.plot(regression_false_positive_rate, regression_true_positive_rate, 'blue',
label = 'AUC = %0.2f' % regression_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/Logistic SAG ROC AUC.svg')
plt.show()
# Save the model to disk
pickle.dump(regression, open('Models/Logistic Regression SAG.sav', 'wb'))
regression = LogisticRegression(solver = 'saga', n_jobs = -1)
regression.fit(x_train, y_train)
regression_accuracy = regression.score(x_test, y_test)
regression_accuracy
regression_predictions = regression.predict(x_test)
actual = y_test
regression_confusion = confusion_matrix(actual, regression_predictions)
regression_confusion
regression_precision = precision_score(actual, regression_predictions)
regression_precision
print(classification_report(actual, regression_predictions))
regression_probabilities = regression.predict_proba(x_test)
predictions = regression_probabilities[:,1]
regression_false_positive_rate, regression_true_positive_rate, threshold =\
roc_curve(y_test, predictions)
regression_roc_auc = auc(regression_false_positive_rate, regression_true_positive_rate)
regression_roc_auc
plt.title('SAGA Logistic Regression Receiver Operating Characteristic Curve')
plt.plot(regression_false_positive_rate, regression_true_positive_rate, 'blue',
label = 'AUC = %0.2f' % regression_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/Logistic SAGA ROC AUC.svg')
plt.show()
# Save the model to disk
pickle.dump(regression, open('Models/Logistic Regression SAGA.sav', 'wb'))
regression = LogisticRegression(solver = 'lbfgs', \
max_iter = 10000, n_jobs = 1)
regression.fit(x_train, y_train.values.ravel())
regression_accuracy = regression.score(x_test, y_test)
regression_accuracy
regression_predictions = regression.predict(x_test)
actual = y_test
regression_confusion = confusion_matrix(actual, regression_predictions)
regression_confusion
regression_precision = precision_score(actual, regression_predictions)
regression_precision
print(classification_report(actual, regression_predictions))
regression_probabilities = regression.predict_proba(x_test)
predictions = regression_probabilities[:,1]
regression_false_positive_rate, regression_true_positive_rate, threshold =\
roc_curve(y_test, predictions)
regression_roc_auc = auc(regression_false_positive_rate, \
regression_true_positive_rate)
regression_roc_auc
plt.title('LBFGS Logistic Regression Receiver Operating Characteristic Curve')
plt.plot(regression_false_positive_rate, regression_true_positive_rate, 'blue',
label = 'AUC = %0.2f' % regression_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/LBFGS Logistic ROC AUC.svg')
plt.savefig('Charts/LBFGS Logistic ROC AUC.png')
plt.show()
# Save the model to disk
pickle.dump(regression, open('Models/Logistic Regression lbfgs.sav', 'wb'))
coefs = pd.concat([pd.DataFrame(x_train.columns),
pd.DataFrame(np.transpose(regression.coef_))], axis = 1)
coefs.columns = ["Column", "Coefficient"]
coefs
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
tree = DecisionTreeClassifier(max_depth = 20, splitter = 'best', \
random_state = 13)
tree.fit(x_train, y_train)
tree_accuracy = tree.score(x_test, y_test)
tree_accuracy
tree_predictions = tree.predict(x_test)
actual = y_test
tree_confusion = confusion_matrix(actual, tree_predictions)
tree_confusion
tree_precision = precision_score(actual, tree_predictions)
tree_precision
print(classification_report(actual, tree_predictions))
tree_probabilities = tree.predict_proba(x_test)
predictions = tree_probabilities[:,1]
tree_false_positive_rate, tree_true_positive_rate, threshold =\
roc_curve(y_test, predictions)
tree_roc_auc = auc(tree_false_positive_rate, tree_true_positive_rate)
tree_roc_auc
plt.title('Decision Tree Receiver Operating Characteristic Curve')
plt.plot(tree_false_positive_rate, tree_true_positive_rate, 'blue',
label = 'AUC = %0.2f' % tree_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/Tree ROC AUC.svg')
plt.savefig('Charts/Tree ROC AUC.png')
plt.show()
fig, ax = plt.subplots(figsize=(40, 20))
plot_tree(tree, fontsize = 6, max_depth = 3, class_names = True, \
feature_names = x_train.columns)
plt.savefig('Charts/Decision Tree.svg', dpi=100)
plt.savefig('Charts/Decision Tree.png', dpi=100)
# Save the model to disk
pickle.dump(tree, open('Models/Decision Tree.sav', 'wb'))
tree = pickle.load(open('Models/Decision Tree.sav', 'rb'))
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(max_depth = 20, verbose = 1, \
random_state = 13, n_jobs = -1)
forest.fit(x_train, y_train.values.ravel())
forest_accuracy = forest.score(x_test, y_test)
forest_accuracy
forest_predictions = forest.predict(x_test)
actual = y_test
forest_confusion = confusion_matrix(actual, forest_predictions)
forest_confusion
forest_precision = precision_score(actual, forest_predictions)
forest_precision
print(classification_report(actual, forest_predictions))
forest_probabilities = forest.predict_proba(x_test)
predictions = forest_probabilities[:,1]
forest_false_positive_rate, forest_true_positive_rate, threshold =\
roc_curve(y_test, predictions)
forest_roc_auc = auc(forest_false_positive_rate, forest_true_positive_rate)
forest_roc_auc
plt.title('Random Forest Receiver Operating Characteristic Curve')
plt.plot(forest_false_positive_rate, forest_true_positive_rate, 'blue',
label = 'AUC = %0.2f' % forest_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/Forest ROC AUC.png')
plt.savefig('Charts/Forest ROC AUC.svg')
plt.show()
# Save the model to disk
pickle.dump(forest, open('Models/Random Forest Ensemble.sav', 'wb'))
reload_forest = False
if reload_forest:
forest = pickle.load(open('Models/Random Forest Ensemble.sav', 'rb'))
In an attempt to train the ensemble to prioritize correctly classifying the actual failure cases, this version of the random forest weights failures as twice as important as non-failures.
weighted_forest = RandomForestClassifier(max_depth = 20, verbose = 1, \
random_state = 13, n_jobs = -1, class_weight = {0: 1, 1: 2})
weighted_forest.fit(x_train, y_train.values.ravel())
weighted_forest_accuracy = weighted_forest.score(x_test, y_test)
weighted_forest_accuracy
weighted_forest_predictions = weighted_forest.predict(x_test)
actual = y_test
Compared to the unweighted Random Forest ensemble, this weighted ensemble correctly classifies another 6 actual failures, raising the failure detection rate from 36% to 40%, but it also gains 40,687 false positive classifications, for 0.028% instead of 0.0097% false positives.
weighted_forest_confusion = confusion_matrix(actual, \
weighted_forest_predictions)
weighted_forest_confusion
weighted_forest_precision = precision_score(actual, \
weighted_forest_predictions)
weighted_forest_precision
print(classification_report(actual, weighted_forest_predictions))
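The rates discussed above come straight from the confusion matrix; as a sketch with made-up counts (not the notebook's actual results):

```python
import numpy as np

# Invented counts, in sklearn's confusion_matrix layout:
# rows = actual class, columns = predicted class
cm = np.array([[140_000, 4_000],   # actual non-failure: [TN, FP]
               [     60,     40]]) # actual failure:     [FN, TP]

tn, fp, fn, tp = cm.ravel()
failure_detection_rate = tp / (tp + fn)   # recall on the failure class
false_positive_rate = fp / (fp + tn)
print(failure_detection_rate, false_positive_rate)  # 0.4 and ~0.0278
```

Class weighting trades these two rates against each other: detecting a few more failures typically costs many more false alarms when failures are this rare.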
weighted_forest_probabilities = weighted_forest.predict_proba(x_test)
weighted_predictions = weighted_forest_probabilities[:,1]
weighted_forest_false_positive_rate, weighted_forest_true_positive_rate, \
threshold = roc_curve(y_test, weighted_predictions)
weighted_forest_roc_auc = auc(weighted_forest_false_positive_rate, \
weighted_forest_true_positive_rate)
weighted_forest_roc_auc
plt.title('Weighted Random Forest Receiver Operating Characteristic Curve')
plt.plot(weighted_forest_false_positive_rate, \
weighted_forest_true_positive_rate, \
'blue', label = 'AUC = %0.2f' % weighted_forest_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/Weighted Forest ROC AUC.png')
plt.savefig('Charts/Weighted Forest ROC AUC.svg')
plt.show()
# Save the model to disk
pickle.dump(weighted_forest, open('Models/Random Forest Ensemble Weighted.sav', 'wb'))
reload_weighted_forest = False
if reload_weighted_forest:
    weighted_forest = pickle.load(open('Models/Random Forest Ensemble Weighted.sav', 'rb'))
import torch
from torch import nn, optim
import torch.utils.data as data_utils
torch.manual_seed(13)
PyTorch requires the boolean values to be converted to floating point, so these dtypes will be changed before the neural network is defined.
x_train.dtypes
for col in x_train:
    if x_train[col].dtype == "bool":
        x_train[col] = x_train[col].astype(float)
        x_test[col] = x_test[col].astype(float)
x_train.dtypes
x_train.isna().sum().sum()
x_test.dtypes
x_test.isna().sum().sum()
y_train = y_train.astype(float)
y_test = y_test.astype(float)
train_label = torch.tensor(y_train.values)
trainset = torch.tensor(x_train.values)
train_tensor = data_utils.TensorDataset(trainset, train_label)
trainloader = data_utils.DataLoader(dataset = train_tensor, \
batch_size = 512, shuffle = True)
test_label = torch.tensor(y_test.values)
testset = torch.tensor(x_test.values)
test_tensor = data_utils.TensorDataset(testset, test_label)
testloader = data_utils.DataLoader(dataset = test_tensor, \
batch_size = 512, shuffle = True)
torch.backends.cudnn.enabled
torch.cuda.is_available()
print(torch.version.cuda)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device
class nn_Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(49, 24)
        self.act1 = nn.LeakyReLU()
        self.fc2 = nn.Linear(24, 12)
        self.act2 = nn.LeakyReLU()
        self.fc3 = nn.Linear(12, 1)

    def forward(self, x):
        # Make sure the input tensor is flattened
        x = x.view(-1, 49)
        x = self.act1(self.fc1(x))
        x = self.act2(self.fc2(x))
        x = torch.sigmoid(self.fc3(x))
        return x
neural_network = nn_Classifier()
criterion = nn.BCELoss()
optimizer = optim.Adam(neural_network.parameters(), lr = 1e-7, \
weight_decay = 1e-5)
def init_weights(m):
    if type(m) == nn.Linear:
        torch.nn.init.xavier_uniform_(m.weight)
        m.bias.data.fill_(0.01)
neural_network.apply(init_weights)
n_train = len(x_train)
epochs = 10
neural_network.to(device);
train_losses = []
test_losses = []
current = 0
test_loss_min = np.Inf
for e in range(epochs):
    neural_network.train()
    running_loss = 0
    for row, target in trainloader:
        row = row.to(device)
        target = target.to(device)
        optimizer.zero_grad()
        output = neural_network(row.float())
        loss = criterion(output, target.float())
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        # Reporting
        print(str(current) + " / " + str(n_train) + "({:.3f}%)".format((current / n_train) * 100), end = "\r", flush = True)
        current += len(row)
    else:
        neural_network.eval()
        test_loss = 0
        accuracy = 0
        # Turn off gradients for validation, saves memory and computations
        with torch.no_grad():
            for row, target in testloader:
                row = row.to(device)
                target = target.to(device)
                output = neural_network(row.float())
                test_loss += criterion(output, target.float())
        current = 0
        # Calculate average losses
        train_losses.append(running_loss / len(trainloader))
        valid_loss = test_loss / len(testloader)
        test_losses.append(valid_loss)
        # Print validation statistics
        print("Epoch: {}/{}.. ".format(e + 1, epochs),
              "Training Loss: {:.2f}.. ".format(running_loss / len(trainloader)),
              "Test Loss: {:.2f}.. ".format(valid_loss))
        # Save the model if test loss has decreased
        if valid_loss <= test_loss_min:
            print('Test loss decreased ({:.4f} --> {:.4f}). Saving model ...'.format(
                test_loss_min, valid_loss))
            torch.save(neural_network.state_dict(), 'Models/Neural Network 1.pt')
            test_loss_min = valid_loss
While it may eventually improve with enough training, this network architecture is most likely too simple for the problem at hand. A more complex one will be built next.
for param in neural_network.parameters():
    print(param.data)
plt.figure(figsize = (12, 5))
train_ax, = plt.plot(np.arange(epochs), train_losses, 'r--', \
label = "Training Loss")
test_ax, = plt.plot(np.arange(epochs), test_losses, 'b--', \
label = "Test Loss")
plt.title("Neural Network 1 Losses")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.xticks(range(0, epochs))
plt.grid(b = True, which = 'major', color = 'w', linewidth = 1.0)
plt.grid(b = True, which = 'minor', color = 'w', linewidth = 0.5)
plt.legend(handles = [train_ax, test_ax])
plt.savefig("Charts/NN1 Loss Plot.svg")
plt.savefig("Charts/NN1 Loss Plot.png")
neural_network.eval()
output = []
pred_targets = []
with torch.no_grad():
    for rows, targets in testloader:
        rows = rows.to(device)
        output += neural_network(rows.float())
        pred_targets += targets
output[:30]
nn1_predictions = []
actual = []
for i, x in enumerate(output):
    if output[i].item() <= 0.5:
        nn1_predictions.append(0)
    else:
        nn1_predictions.append(1)
    if pred_targets[i].item() == 0.0:
        actual.append(0)
    elif pred_targets[i].item() == 1.0:
        actual.append(1)
nn1_confusion = confusion_matrix(actual, nn1_predictions)
nn1_confusion
nn1_precision = precision_score(actual, nn1_predictions)
nn1_precision
print(classification_report(actual, nn1_predictions))
nn1_false_positive_rate, nn1_true_positive_rate, threshold =\
roc_curve(actual, nn1_predictions)
nn1_roc_auc = auc(nn1_false_positive_rate, nn1_true_positive_rate)
nn1_roc_auc
plt.title('Neural Network 1 Receiver Operating Characteristic Curve')
plt.plot(nn1_false_positive_rate, nn1_true_positive_rate, 'blue',
label = 'AUC = %0.2f' % nn1_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/NN1 ROC AUC.png')
plt.savefig('Charts/NN1 ROC AUC.svg')
plt.show()
class nn_Classifier2(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(49, 98)
        self.act1 = nn.LeakyReLU()
        self.fc2 = nn.Linear(98, 72)
        self.act2 = nn.LeakyReLU()
        self.fc3 = nn.Linear(72, 36)
        self.act3 = nn.LeakyReLU()
        self.fc4 = nn.Linear(36, 9)
        self.act4 = nn.LeakyReLU()
        self.fc5 = nn.Linear(9, 1)

    def forward(self, x):
        # Make sure the input tensor is flattened
        x = x.view(-1, 49)
        x = self.act1(self.fc1(x))
        x = self.act2(self.fc2(x))
        x = self.act3(self.fc3(x))
        x = self.act4(self.fc4(x))
        x = torch.sigmoid(self.fc5(x))
        return x
neural_network2 = nn_Classifier2()
criterion = nn.BCELoss()
optimizer = optim.Adam(neural_network2.parameters(), lr = 1e-7, weight_decay = 1e-5)
def init_weights2(m):
    if type(m) == nn.Linear:
        torch.nn.init.xavier_uniform_(m.weight)
        m.bias.data.fill_(0.01)
neural_network2.apply(init_weights2)
n_train = len(x_train)
epochs = 50
neural_network2.to(device);
train_losses = []
test_losses = []
current = 0
test_loss_min = np.Inf
for e in range(epochs):
    neural_network2.train()
    running_loss = 0
    for row, target in trainloader:
        row = row.to(device)
        target = target.to(device)
        optimizer.zero_grad()
        output = neural_network2(row.float())
        loss = criterion(output, target.float())
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        # Reporting
        print(str(current) + " / " + str(n_train) + "({:.3f}%)".format((current / n_train) * 100), end = "\r", flush = True)
        current += len(row)
    else:
        neural_network2.eval()
        test_loss = 0
        # Turn off gradients for validation, saves memory and computations
        with torch.no_grad():
            for row, target in testloader:
                row = row.to(device)
                target = target.to(device)
                output = neural_network2(row.float())
                test_loss += criterion(output, target.float())
        current = 0
        # Calculate average losses
        train_losses.append(running_loss / len(trainloader))
        valid_loss = test_loss / len(testloader)
        test_losses.append(valid_loss)
        # Print validation statistics
        print("Epoch: {}/{}.. ".format(e + 1, epochs),
              "Training Loss: {:.2f}.. ".format(running_loss / len(trainloader)),
              "Test Loss: {:.2f}.. ".format(valid_loss))
        # Save the model if test loss has decreased
        if valid_loss <= test_loss_min:
            print('Test loss decreased ({:.4f} --> {:.4f}). Saving model ...'.format(
                test_loss_min, valid_loss))
            torch.save(neural_network2.state_dict(), 'Models/Neural Network 2.pt')
            test_loss_min = valid_loss
It may seem odd that the test loss is consistently lower than the training loss. Here, the sheer size of the training set is the likely cause: the reported training loss is a running average over the entire epoch, during which the model is still improving with every batch, while the test loss is computed only after all of the epoch's batches have already updated the model.
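A toy numeric illustration of this effect, with made-up per-batch losses: because the model improves within the epoch, the epoch-averaged training loss ends up above the loss level of the finished model that the test set sees.

```python
# Hypothetical per-batch losses over one epoch of a steadily improving model
batch_losses = [1.0, 0.8, 0.6, 0.4, 0.2]

# Training loss as reported: a running average over the whole epoch
train_loss = sum(batch_losses) / len(batch_losses)

# The test loss is measured only after the final batch has updated the
# model, so it sits near the end-of-epoch loss level
test_loss = batch_losses[-1]

print(train_loss, test_loss)  # the averaged training loss is higher
```

The effect fades once the model stops improving much within a single epoch, at which point the two curves typically converge.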
plt.figure(figsize = (12, 5))
train_ax, = plt.plot(np.arange(epochs), train_losses, 'r--', label = "Training Loss")
test_ax, = plt.plot(np.arange(epochs), test_losses, 'b--', label = "Test Loss")
plt.title("Neural Network 2 Training and Test Losses")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.xticks(range(0, epochs))
plt.grid(b = True, which = 'major', color = 'w', linewidth = 1.0)
plt.grid(b = True, which = 'minor', color = 'w', linewidth = 0.5)
plt.legend(handles = [train_ax, test_ax])
plt.savefig("Charts/NN2 Loss Plot.svg")
plt.savefig("Charts/NN2 Loss Plot.png")
train_losses
with open('Models/NN2_train_losses.txt', 'w') as loss:
    for epoch in train_losses:
        loss.write(str(epoch) + "\n")
for epoch in test_losses:
    print(epoch.item())
with open('Models/NN2_test_losses.txt', 'w') as loss:
    for epoch in test_losses:
        loss.write(str(epoch.item()) + "\n")
neural_network2.eval()
output = []
pred_targets = []
with torch.no_grad():
    for rows, targets in testloader:
        rows = rows.to(device)
        output += neural_network2(rows.float())
        pred_targets += targets
output[:30]
nn2_predictions = []
actual = []
for i, x in enumerate(output):
    if output[i].item() <= 0.5:
        nn2_predictions.append(0)
    else:
        nn2_predictions.append(1)
    if pred_targets[i].item() == 0.0:
        actual.append(0)
    elif pred_targets[i].item() == 1.0:
        actual.append(1)
nn2_confusion = confusion_matrix(actual, nn2_predictions)
nn2_confusion
nn2_precision = precision_score(actual, nn2_predictions)
nn2_precision
print(classification_report(actual, nn2_predictions))
nn2_false_positive_rate, nn2_true_positive_rate, threshold =\
roc_curve(actual, nn2_predictions)
nn2_roc_auc = auc(nn2_false_positive_rate, nn2_true_positive_rate)
nn2_roc_auc
plt.title('Neural Network 2 Receiver Operating Characteristic Curve at 50 Epochs')
plt.plot(nn2_false_positive_rate, nn2_true_positive_rate, 'blue',
label = 'AUC = %0.2f' % nn2_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/NN2 ROC AUC at 50 Epochs.png')
plt.savefig('Charts/NN2 ROC AUC at 50 Epochs.svg')
plt.show()
While 50 epochs were originally planned, the test loss was still consistently decreasing at the 50th epoch. Additionally, compared to the other models, this neural network produces a very high number of true positive (failure) predictions and a moderately low number of false positives. Additional training may result in a model that outperforms even the LBFGS-solved logistic regression model for this task.
for e in range(20):
    neural_network2.train()
    running_loss = 0
    for row, target in trainloader:
        row = row.to(device)
        target = target.to(device)
        optimizer.zero_grad()
        output = neural_network2(row.float())
        loss = criterion(output, target.float())
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        # Reporting
        print(str(current) + " / " + str(n_train) + "({:.3f}%)".format((current / n_train) * 100), end = "\r", flush = True)
        current += len(row)
    else:
        neural_network2.eval()
        test_loss = 0
        # Turn off gradients for validation, saves memory and computations
        with torch.no_grad():
            for row, target in testloader:
                row = row.to(device)
                target = target.to(device)
                output = neural_network2(row.float())
                test_loss += criterion(output, target.float())
        current = 0
        # Calculate average losses
        train_losses.append(running_loss / len(trainloader))
        valid_loss = test_loss / len(testloader)
        test_losses.append(valid_loss)
        # Print validation statistics
        print("Epoch: {}/{}.. ".format(e + 51, epochs + 20),
              "Training Loss: {:.2f}.. ".format(running_loss / len(trainloader)),
              "Test Loss: {:.2f}.. ".format(valid_loss))
        # Save the model if test loss has decreased
        if valid_loss <= test_loss_min:
            print('Test loss decreased ({:.4f} --> {:.4f}). Saving model ...'.format(
                test_loss_min, valid_loss))
            torch.save(neural_network2.state_dict(), 'Models/Neural Network 2.pt')
            test_loss_min = valid_loss
plt.figure(figsize = (18, 5))
train_ax, = plt.plot(np.arange(epochs + 20), train_losses, 'r--', \
label = "Training Loss")
test_ax, = plt.plot(np.arange(epochs + 20), test_losses, 'b--',\
label = "Test Loss")
plt.title("Neural Network 2 Training and Test Losses")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.xticks(range(0, epochs + 20))
plt.grid(b = True, which = 'major', color = 'w', linewidth = 1.0)
plt.grid(b = True, which = 'minor', color = 'w', linewidth = 0.5)
plt.legend(handles = [train_ax, test_ax])
plt.savefig("Charts/NN2 Loss Plot 2.svg")
plt.savefig("Charts/NN2 Loss Plot 2.png")
train_losses
with open('Models/NN2_train_losses.txt', 'w') as loss:
    for epoch in train_losses:
        loss.write(str(epoch) + "\n")
for epoch in test_losses:
    print(epoch.item())
with open('Models/NN2_test_losses.txt', 'w') as loss:
    for epoch in test_losses:
        loss.write(str(epoch.item()) + "\n")
neural_network2.eval()
output = []
pred_targets = []
with torch.no_grad():
    for rows, targets in testloader:
        rows = rows.to(device)
        output += neural_network2(rows.float())
        pred_targets += targets
nn2_predictions = []
actual = []
for i, x in enumerate(output):
    if output[i].item() <= 0.5:
        nn2_predictions.append(0)
    else:
        nn2_predictions.append(1)
    if pred_targets[i].item() == 0.0:
        actual.append(0)
    elif pred_targets[i].item() == 1.0:
        actual.append(1)
nn2_confusion = confusion_matrix(actual, nn2_predictions)
nn2_confusion
nn2_precision = precision_score(actual, nn2_predictions)
nn2_precision
print(classification_report(actual, nn2_predictions))
nn2_false_positive_rate, nn2_true_positive_rate, threshold =\
roc_curve(actual, nn2_predictions)
nn2_roc_auc = auc(nn2_false_positive_rate, nn2_true_positive_rate)
nn2_roc_auc
plt.title('Neural Network 2 Receiver Operating Characteristic Curve at 70 Epochs')
plt.plot(nn2_false_positive_rate, nn2_true_positive_rate, 'blue',
label = 'AUC = %0.2f' % nn2_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/NN2 ROC AUC at 70 Epochs.png')
plt.savefig('Charts/NN2 ROC AUC at 70 Epochs.svg')
plt.show()
regression = pickle.load(open('Models/Logistic Regression lbfgs.sav', 'rb'))
regression_accuracy = regression.score(x_valid, y_valid)
regression_accuracy
regression_predictions = regression.predict(x_valid)
actual = y_valid
regression_confusion = confusion_matrix(actual, regression_predictions)
regression_confusion
regression_precision = precision_score(actual, regression_predictions)
regression_precision
print(classification_report(actual, regression_predictions))
regression_probabilities = regression.predict_proba(x_valid)
predictions = regression_probabilities[:,1]
regression_false_positive_rate, regression_true_positive_rate, threshold =\
roc_curve(y_valid, predictions)
regression_roc_auc = auc(regression_false_positive_rate, \
regression_true_positive_rate)
regression_roc_auc
plt.title('LBFGS Logistic Regression Validation ROC Curve')
plt.plot(regression_false_positive_rate, regression_true_positive_rate, \
'blue', label = 'AUC = %0.2f' % regression_roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('Charts/LBFGS Logistic Validation ROC AUC.svg')
plt.savefig('Charts/LBFGS Logistic Validation ROC AUC.png')
plt.show()
Table 1
HDD Failure Predictive Model Testing Results
Model | Sensitivity | Specificity | Precision | False Positive Rate | ROC AUC |
---|---|---|---|---|---|
Logistic Regression | 0.6397 | 0.9732 | 1.1478e-3 | 2.68% | 0.8729 |
Decision Tree | 0.4412 | 0.9690 | 0.8829e-3 | 3.10% | 0.6900 |
Random Forest | 0.3603 | 0.9903 | 2.2900e-3 | 0.98% | 0.7974 |
Class-Weighted Random Forest | 0.4044 | 0.9717 | 0.8858e-3 | 2.83% | 0.7998 |
Simple DNN | 0.6176 | 0.9185 | 0.4696e-3 | 8.15% | 0.7681 |
Complex DNN | 0.7132 | 0.9364 | 0.6946e-3 | 6.36% | 0.8248 |
A few limitations of this project exist. First, a very large share of the dataset was made up of missing values. A second limitation that deserves caution is that the ratios of drives made by each manufacturer in the dataset are very imbalanced; no assumptions about the value or reliability of the four manufacturers included in the dataset should be made from this data. A third limitation is that the dataset was extremely imbalanced between the minority (failure) and majority (non-failure) classes. Though SMOTE succeeded exceptionally well at allowing predictive models to learn from the imbalanced data, it does introduce bias, as the synthetically created instances of the minority class overrepresent its information in the analysis. Finally, working computer memory was a great limitation throughout the project because the dataset is so large. This limitation prevented factor analysis of mixed data from being performed, and PCA had to be selected as the alternative.
It is highly recommended that either the logistic regression model or the more complex DNN model be added to the daily HDD diagnostics checks and backup procedure pipeline. The complex DNN will successfully flag 71.3% of drives expected to fail that day, and the logistic regression 64%, allowing for total backup and retirement of a drive before the failure occurs. Note that while more sensitive to detecting failure, the DNN also has a higher false positive rate, at 6.36%, than the more conservative logistic regression at 2.68%. Until this can be implemented, special care should be given to drives with higher values of SMART attributes 5, 197, and 9 to reduce data loss and complications arising from HDD failure.
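In the interim, the attribute-watch suggestion could be scripted as a simple threshold flag over the daily diagnostics. A sketch below uses the dataset's `smart_<id>_raw` column naming, but the cutoff values are purely illustrative assumptions that would need tuning against fleet history:

```python
import pandas as pd

# Hypothetical daily diagnostics snapshot; cutoffs below are illustrative
snapshot = pd.DataFrame({
    "serial_number": ["A1", "B2", "C3"],
    "smart_5_raw":   [0, 120, 8],           # reallocated sectors count
    "smart_197_raw": [0, 4, 0],             # current pending sector count
    "smart_9_raw":   [1000, 42000, 31000],  # power-on hours
})

# Flag drives exceeding any cutoff for priority backup and replacement
at_risk = snapshot[
    (snapshot["smart_5_raw"] > 100)
    | (snapshot["smart_197_raw"] > 0)
    | (snapshot["smart_9_raw"] > 40000)
]
print(at_risk["serial_number"].tolist())
```

Drive B2 trips two of the illustrative cutoffs here, so it alone would be queued for backup under this rule.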
Once implemented, an ensemble approach between the two models should be tested to further reduce the false positive rate. Furthermore, additional research is warranted beyond the scope and limitations of this project. Taking a recurrent neural network (RNN) approach to the data tidying and predictive modeling would likely improve the results significantly, as RNNs are specifically designed for time-series data such as this.
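The proposed ensemble could start as simply as requiring agreement between the two models before raising an alert, which can only lower the false positive count relative to either model alone (at some cost in sensitivity). A sketch with hypothetical per-drive prediction arrays standing in for the two models' outputs:

```python
import numpy as np

# Hypothetical binary failure predictions from the two recommended models
logistic_preds = np.array([1, 0, 1, 1, 0])
dnn_preds      = np.array([1, 1, 0, 1, 0])

# Conservative ensemble: flag a drive only when both models predict failure
ensemble_preds = logistic_preds & dnn_preds
print(ensemble_preds.tolist())
```

Here only the drives both models agree on remain flagged; a disjunctive (OR) rule would instead maximize sensitivity at the cost of more false positives.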
Acronis. Knowledge Base 9105. S.M.A.R.T. Attribute: Reallocated Sectors Count | Knowledge Base. https://kb.acronis.com/content/9105.
Acronis. Knowledge Base 9109. S.M.A.R.T. Attribute: Power-On Hours (POH) | Knowledge Base. https://kb.acronis.com/content/9109.
Acronis. Knowledge Base 9128. S.M.A.R.T. Attribute: Load Cycle Count; Load/Unload Cycle Count | Knowledge Base. https://kb.acronis.com/content/9128.
Acronis. Knowledge Base 9133. S.M.A.R.T. Attribute: Current Pending Sector Count | Knowledge Base. https://kb.acronis.com/content/9133.
Acronis. Knowledge Base 9152. S.M.A.R.T. Attribute: Load/Unload Cycle Count | Knowledge Base. https://kb.acronis.com/content/9152.
Backblaze. (2020). data_Q4_2019. San Mateo, CA; Backblaze.
Klein, A. (2015, April 16). SMART Hard Drive Attributes: SMART 22 is a Gas Gas Gas. Backblaze Blog | Cloud Storage & Cloud Backup. https://www.backblaze.com/blog/smart-22-is-a-gas-gas-gas/.
Painchaud, A. (2018, October 31). 8 Reasons on How Data Loss Can Negatively Impact Your Business. https://www.sherweb.com/blog/security/statistics-on-data-loss/.
Sanders, J. (2018, November 13). Western Digital spins down HGST and Tegile brands in hard disk market shuffle. TechRepublic. https://www.techrepublic.com/article/western-digital-spins-down-hgst-and-tegile-brands-in-hard-disk-market-shuffle/.
Weiss, G. M. (2013). Foundations of Imbalanced Learning. Imbalanced Learning, 13–41. https://doi.org/10.1002/9781118646106.ch2