Statistical Analysis of HDD Failure

Matthew Unrue, Spring 2020

Western Govenors University MSDA Capstone


Website Version Note:

This notebook is so large and works with so much that data that it was run in multiple settings with the kernel reset for memory management each time. As such, the code cell blocks have execution numbers that are not perfectly in order. Though these do not match up perfectly in this version, the code was and should only be excecuted from top to bottom.

Additional Resources:

The 5-page project Executive Summary can be found here.
The reveal.js based multimedia presentation notes can be found here.
The 87-page report write-up of this project can be found here.


Introduction

What factors indicate impending hard disk drive failure?

H0: Study factors do not significantly indicate impending hard disk failure.
H1: Study factors do significantly indicate impending hard disk failure.

Context

Data helps businesses solve problems, make better decisions, and understand consumers, but a lot of data needs to be stored and available to enable these benefits. Hard drive failure is the most common form of data loss, which is one of the most impactful problems that businesses can experience today as simple drive recovery can cost up to $7,500 per drive (Painchaud, 2018). For cloud-based data centers, keeping multitudes of businesses’ data intact for their own operations is crucial. Being able to predict which hard drives are at the highest risk of failure based on understanding of the combinations of routine diagnostics test results is an ideal solution to backup and replace failing drives before the data is lost.

Data

The dataset used is Backblaze’s 4th quarter data from 2019 (Backblaze, 2020). All of the needed data is contained within the .zip file that Backblaze provides to the public as .csv files split by day.

The dataset contains .csv files for each day of its corresponding quarter, from 2019-10-01 to 2019-12-31. As an example, the subsection of the dataset for 2019-10-01 contains 115,259 rows of data. However, as this data contains recorded readings from a live data center, the number of hard drives and thus rows, changed daily as failed drives were taken out and new drives were installed. The 129 column attributes are Date, Serial Number, Model Number, Capacity, Failure, 62 Self-Monitoring, Analysis and Reporting Technology (SMART) test results, and 62 normalized values of the SMART test values. The Failure attribute is the dependent variable of this study and is a qualitative binary categorical variable. The Date, Serial Number, and Model are nominal qualitative independent variables. Finally, and the SMART value columns are continuous quantitative independent variables.

As stated in Backblaze’s Hard Drive Data and Stats page (Backblaze, n.d.), this dataset is free for any use as long as Backblaze is cited as the data source, that users accept that they are solely responsible for how the data is used, and that the data cannot be sold to anybody as it publicly available.

Data Analytics Tools and Techniques

Python, pandas, and the scikit-learn stack are extensively used for the loading, tidying, manipulation, and analysis of the datasets. PyTorch is used for all neural network related tasks of the analysis and model production. Matplotlib and seaborn are used to create charts and graphics for analysis and presentation of project findings. A needed algorithm, namely Fisher's Exact test for contingency tables greater than 2x2 dimensions, is unavailable in the scikit-learn ecosystem, and R.stats is used for this by using rpy2 to run the R code by embedding it in the Python process. Prince is used for factor analysis, and imbalanced-learn is used for the implementation of SMOTE.

Like R and unlike SAS, all of these packages are easily available, free, and open-source with Python. These methods have been chosen over R for ease of explanation, as Python code is often understood more readily than R, and because of the potential of integrating this project directly into a program or software for future use. While R is highly specialized for statistics and mathematics, Python is a general-purpose programming language with specialized libraries for the needed tools, and this facilitates project expansion in the future.

Synthetic Minority Over-Sampling Technique (SMOTE) is used specifically to handle the imbalanced classes for training and testing splits. PCA is used for dimensionality reduction. Predictor variables are examined through correlation coefficients and Fisher's exact test, as well as graphed univariate and bivariate distributions. A logistic regression model and a decision tree model are examined along with the results of the PCA to find predictor variables as well. For building a predictive model for future use, the logistic regression model, a random forest ensemble model, and neural networks are compared to determine which can produce the most useful model.

As HDD failure is an extremely rare event, the dependent variable class is extremely imbalanced and failing to control for the imbalance through techniques like boosting or oversampling would lead to ineffective models. As the dependent variable is a Boolean value, this task is a binary classification task. Logistic regression is an ideal predictive model for binary classification tasks that gives a probability for classification while also having a simplistic interpretation of coefficients that can be used for feature selection. Decision trees are also simple to understand and work well for classification tasks. Given the complexity of the various fields in the dataset, a more complicated model may work better for predictive power. Random forests and neural networks work very well for classification tasks under these circumstances.

Project Outcomes

The key project outcomes are a deep understanding of the risk of hard drive failure based on the results of SMART test values regardless of manufacturer and predictive models that will be able to flag hard drives that are at high risk of failing. The understanding of the risk of failure based on test values will empower better business decisions by optimizing the choice of storage used based on projected lifetime. The predictive models will allow the business to proactively backup data from storage onto new storage devices before failure while also allowing hard drives to continue working closer to their end of life, minimizing waste from constantly replacing hard drives before it is needed. The combination of these two products will also enable the future creation of a more automated system that protects data from hard drive failure.

Dataset Preparation

The dataset provided by Backblaze is made up of 92 .csv files, 1 for each day in the 2019 4th quarter, totaling 3.13GB of text data. As hard drive failure is an extremely rare event, all of these days will need to be considered together in order to have enough failures to draw conclusions. The project begins by combining all parts of the dataset from their .csv files into a single file.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import glob
import os
import csv
import scipy.stats as scs
import gc
import pickle
In [2]:
# Jupyter magic commands for displaying plot objects in the notebook and
# setting float display limits.
%matplotlib inline
%precision %.10g
sns.set_style("dark")
In [3]:
if not os.path.isfile('q4_combined.csv'):
    # Create a generator of dataset files in the current working directory.
    files = glob.glob(os.path.join(os.getcwd(), "2019-*.csv"))

    # Combine the fields into a single file, writing the column index from
    # only the first .csv file.
    index = False
    with open('q4_combined.csv', 'w') as combined:
        for file in files:
            with open(file, 'r') as part:

                if not index:
                    for row in part:
                        combined.write(row)
                    index = True

                else:
                    next(part)
                    for row in part:
                        combined.write(row)
                
In [4]:
with open('q4_combined.csv') as file:
    for (count, _) in enumerate(file, 0):
        pass
    
row_count = count
print("Rows: " + str(row_count))
Rows: 10991209
In [5]:
df = pd.read_csv('q4_combined.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10991209 entries, 0 to 10991208
Columns: 131 entries, date to smart_255_raw
dtypes: float64(126), int64(2), object(3)
memory usage: 10.7+ GB

Out of 10,991,209 hard drive days, there were only 678 failures, which gives a failure rate of 0.006169%.

In [6]:
df['failure'].value_counts()
Out[6]:
0    10990531
1         678
Name: failure, dtype: int64
In [7]:
nonfailed, failed = df['failure'].value_counts()
failure_rate = failed / nonfailed
print("Failure Rate: " + str("{:.6f}".format(failure_rate * 100)) + "%")
Failure Rate: 0.006169%

Weiss (2013) defined the imbalance ratio as the ratio between majority and minority classes with a modestly imbalanced dataset having an imbalance ratio of 10:1, and extremely imbalanced datasets as having an imbalance ratio of 1000:1 or greater (pg. 15). This dataset has an imbalance ratio of approximately 16,210:1 and as such will require very careful cultivation in order for any predictive model to successfully learn from. The rarity of the positive failure cases is also the reason that the entire 4th quarter dataset is required.

Unfortunately, this combined file requires too much memory to load all at once for current hardware restraints. It needs 13.5GB for just the data, not including the memory needed for the OS and other software, nor memory for calculations.

In [8]:
# Return the summed memory usage of each column in bytes.
memory_usage = sum(df.memory_usage(deep=True))
memory_usage
Out[8]:
13499129713
In [9]:
print(str(memory_usage / 1000) + "KB")
print(str("{:.2f}".format(memory_usage / 1000000)) + "MB")
print(str("{:.2f}".format(memory_usage / 1000000000)) + "GB")
13499129.713KB
13499.13MB
13.50GB

As this dataset contains both raw and normalized values for all of the SMART values, a simple way to deal with the memory issues is to divide the dataset into a raw form and a normalized form.

In [10]:
list(df.columns.values)
Out[10]:
['date',
 'serial_number',
 'model',
 'capacity_bytes',
 'failure',
 'smart_1_normalized',
 'smart_1_raw',
 'smart_2_normalized',
 'smart_2_raw',
 'smart_3_normalized',
 'smart_3_raw',
 'smart_4_normalized',
 'smart_4_raw',
 'smart_5_normalized',
 'smart_5_raw',
 'smart_7_normalized',
 'smart_7_raw',
 'smart_8_normalized',
 'smart_8_raw',
 'smart_9_normalized',
 'smart_9_raw',
 'smart_10_normalized',
 'smart_10_raw',
 'smart_11_normalized',
 'smart_11_raw',
 'smart_12_normalized',
 'smart_12_raw',
 'smart_13_normalized',
 'smart_13_raw',
 'smart_15_normalized',
 'smart_15_raw',
 'smart_16_normalized',
 'smart_16_raw',
 'smart_17_normalized',
 'smart_17_raw',
 'smart_18_normalized',
 'smart_18_raw',
 'smart_22_normalized',
 'smart_22_raw',
 'smart_23_normalized',
 'smart_23_raw',
 'smart_24_normalized',
 'smart_24_raw',
 'smart_168_normalized',
 'smart_168_raw',
 'smart_170_normalized',
 'smart_170_raw',
 'smart_173_normalized',
 'smart_173_raw',
 'smart_174_normalized',
 'smart_174_raw',
 'smart_177_normalized',
 'smart_177_raw',
 'smart_179_normalized',
 'smart_179_raw',
 'smart_181_normalized',
 'smart_181_raw',
 'smart_182_normalized',
 'smart_182_raw',
 'smart_183_normalized',
 'smart_183_raw',
 'smart_184_normalized',
 'smart_184_raw',
 'smart_187_normalized',
 'smart_187_raw',
 'smart_188_normalized',
 'smart_188_raw',
 'smart_189_normalized',
 'smart_189_raw',
 'smart_190_normalized',
 'smart_190_raw',
 'smart_191_normalized',
 'smart_191_raw',
 'smart_192_normalized',
 'smart_192_raw',
 'smart_193_normalized',
 'smart_193_raw',
 'smart_194_normalized',
 'smart_194_raw',
 'smart_195_normalized',
 'smart_195_raw',
 'smart_196_normalized',
 'smart_196_raw',
 'smart_197_normalized',
 'smart_197_raw',
 'smart_198_normalized',
 'smart_198_raw',
 'smart_199_normalized',
 'smart_199_raw',
 'smart_200_normalized',
 'smart_200_raw',
 'smart_201_normalized',
 'smart_201_raw',
 'smart_218_normalized',
 'smart_218_raw',
 'smart_220_normalized',
 'smart_220_raw',
 'smart_222_normalized',
 'smart_222_raw',
 'smart_223_normalized',
 'smart_223_raw',
 'smart_224_normalized',
 'smart_224_raw',
 'smart_225_normalized',
 'smart_225_raw',
 'smart_226_normalized',
 'smart_226_raw',
 'smart_231_normalized',
 'smart_231_raw',
 'smart_232_normalized',
 'smart_232_raw',
 'smart_233_normalized',
 'smart_233_raw',
 'smart_235_normalized',
 'smart_235_raw',
 'smart_240_normalized',
 'smart_240_raw',
 'smart_241_normalized',
 'smart_241_raw',
 'smart_242_normalized',
 'smart_242_raw',
 'smart_250_normalized',
 'smart_250_raw',
 'smart_251_normalized',
 'smart_251_raw',
 'smart_252_normalized',
 'smart_252_raw',
 'smart_254_normalized',
 'smart_254_raw',
 'smart_255_normalized',
 'smart_255_raw']
In [11]:
raw_cols = []
for col in df.columns.values:
    if "normalized" not in col:
        raw_cols.append(col)

print(raw_cols)
['date', 'serial_number', 'model', 'capacity_bytes', 'failure', 'smart_1_raw', 'smart_2_raw', 'smart_3_raw', 'smart_4_raw', 'smart_5_raw', 'smart_7_raw', 'smart_8_raw', 'smart_9_raw', 'smart_10_raw', 'smart_11_raw', 'smart_12_raw', 'smart_13_raw', 'smart_15_raw', 'smart_16_raw', 'smart_17_raw', 'smart_18_raw', 'smart_22_raw', 'smart_23_raw', 'smart_24_raw', 'smart_168_raw', 'smart_170_raw', 'smart_173_raw', 'smart_174_raw', 'smart_177_raw', 'smart_179_raw', 'smart_181_raw', 'smart_182_raw', 'smart_183_raw', 'smart_184_raw', 'smart_187_raw', 'smart_188_raw', 'smart_189_raw', 'smart_190_raw', 'smart_191_raw', 'smart_192_raw', 'smart_193_raw', 'smart_194_raw', 'smart_195_raw', 'smart_196_raw', 'smart_197_raw', 'smart_198_raw', 'smart_199_raw', 'smart_200_raw', 'smart_201_raw', 'smart_218_raw', 'smart_220_raw', 'smart_222_raw', 'smart_223_raw', 'smart_224_raw', 'smart_225_raw', 'smart_226_raw', 'smart_231_raw', 'smart_232_raw', 'smart_233_raw', 'smart_235_raw', 'smart_240_raw', 'smart_241_raw', 'smart_242_raw', 'smart_250_raw', 'smart_251_raw', 'smart_252_raw', 'smart_254_raw', 'smart_255_raw']
In [12]:
norm_cols = []
for col in df.columns.values:
    if "raw" not in col:
        norm_cols.append(col)

print(norm_cols)
['date', 'serial_number', 'model', 'capacity_bytes', 'failure', 'smart_1_normalized', 'smart_2_normalized', 'smart_3_normalized', 'smart_4_normalized', 'smart_5_normalized', 'smart_7_normalized', 'smart_8_normalized', 'smart_9_normalized', 'smart_10_normalized', 'smart_11_normalized', 'smart_12_normalized', 'smart_13_normalized', 'smart_15_normalized', 'smart_16_normalized', 'smart_17_normalized', 'smart_18_normalized', 'smart_22_normalized', 'smart_23_normalized', 'smart_24_normalized', 'smart_168_normalized', 'smart_170_normalized', 'smart_173_normalized', 'smart_174_normalized', 'smart_177_normalized', 'smart_179_normalized', 'smart_181_normalized', 'smart_182_normalized', 'smart_183_normalized', 'smart_184_normalized', 'smart_187_normalized', 'smart_188_normalized', 'smart_189_normalized', 'smart_190_normalized', 'smart_191_normalized', 'smart_192_normalized', 'smart_193_normalized', 'smart_194_normalized', 'smart_195_normalized', 'smart_196_normalized', 'smart_197_normalized', 'smart_198_normalized', 'smart_199_normalized', 'smart_200_normalized', 'smart_201_normalized', 'smart_218_normalized', 'smart_220_normalized', 'smart_222_normalized', 'smart_223_normalized', 'smart_224_normalized', 'smart_225_normalized', 'smart_226_normalized', 'smart_231_normalized', 'smart_232_normalized', 'smart_233_normalized', 'smart_235_normalized', 'smart_240_normalized', 'smart_241_normalized', 'smart_242_normalized', 'smart_250_normalized', 'smart_251_normalized', 'smart_252_normalized', 'smart_254_normalized', 'smart_255_normalized']
In [13]:
if not os.path.isfile('q4_raw.csv'):
    df.to_csv('q4_raw.csv', columns = raw_cols, index=False)
In [14]:
if not os.path.isfile('q4_normalized.csv'):
    df.to_csv('q4_normalized.csv', columns = norm_cols, index=False)
In [15]:
try:
    del [df, nonfailed, failed, failure_rate, memory_usage, raw_cols, norm_cols]
    print("Memory cleared successfully.")
except:
    pass
Memory cleared successfully.

The considerably smaller raw value subset of data is the main dataset of this project. As with nearly all real-world datasets, this one needs considerable cleaning and tidying in order to use for analysis.

In [3]:
df = pd.read_csv('q4_raw.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10991209 entries, 0 to 10991208
Data columns (total 68 columns):
 #   Column          Dtype  
---  ------          -----  
 0   date            object 
 1   serial_number   object 
 2   model           object 
 3   capacity_bytes  int64  
 4   failure         int64  
 5   smart_1_raw     float64
 6   smart_2_raw     float64
 7   smart_3_raw     float64
 8   smart_4_raw     float64
 9   smart_5_raw     float64
 10  smart_7_raw     float64
 11  smart_8_raw     float64
 12  smart_9_raw     float64
 13  smart_10_raw    float64
 14  smart_11_raw    float64
 15  smart_12_raw    float64
 16  smart_13_raw    float64
 17  smart_15_raw    float64
 18  smart_16_raw    float64
 19  smart_17_raw    float64
 20  smart_18_raw    float64
 21  smart_22_raw    float64
 22  smart_23_raw    float64
 23  smart_24_raw    float64
 24  smart_168_raw   float64
 25  smart_170_raw   float64
 26  smart_173_raw   float64
 27  smart_174_raw   float64
 28  smart_177_raw   float64
 29  smart_179_raw   float64
 30  smart_181_raw   float64
 31  smart_182_raw   float64
 32  smart_183_raw   float64
 33  smart_184_raw   float64
 34  smart_187_raw   float64
 35  smart_188_raw   float64
 36  smart_189_raw   float64
 37  smart_190_raw   float64
 38  smart_191_raw   float64
 39  smart_192_raw   float64
 40  smart_193_raw   float64
 41  smart_194_raw   float64
 42  smart_195_raw   float64
 43  smart_196_raw   float64
 44  smart_197_raw   float64
 45  smart_198_raw   float64
 46  smart_199_raw   float64
 47  smart_200_raw   float64
 48  smart_201_raw   float64
 49  smart_218_raw   float64
 50  smart_220_raw   float64
 51  smart_222_raw   float64
 52  smart_223_raw   float64
 53  smart_224_raw   float64
 54  smart_225_raw   float64
 55  smart_226_raw   float64
 56  smart_231_raw   float64
 57  smart_232_raw   float64
 58  smart_233_raw   float64
 59  smart_235_raw   float64
 60  smart_240_raw   float64
 61  smart_241_raw   float64
 62  smart_242_raw   float64
 63  smart_250_raw   float64
 64  smart_251_raw   float64
 65  smart_252_raw   float64
 66  smart_254_raw   float64
 67  smart_255_raw   float64
dtypes: float64(63), int64(2), object(3)
memory usage: 5.6+ GB
In [4]:
null_values = df.isna().sum().sum()
null_values
Out[4]:
452576024
In [5]:
len(df.columns)
Out[5]:
68
In [6]:
n_rows = len(df)
n_rows
Out[6]:
10991209
In [7]:
n_values = n_rows * len(df.columns)
n_values
Out[7]:
747402212
In [8]:
null_values / n_values
Out[8]:
0.6055320906649926
In [9]:
# Calculate the number of values in the total dataset
n_rows * 131
Out[9]:
1439848379
In [10]:
df.head(30)
Out[10]:
date serial_number model capacity_bytes failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw ... smart_233_raw smart_235_raw smart_240_raw smart_241_raw smart_242_raw smart_250_raw smart_251_raw smart_252_raw smart_254_raw smart_255_raw
0 2019-10-01 Z305B2QN ST4000DM000 4000787030016 0 97236416.0 NaN 0.0 13.0 0.0 ... NaN NaN 33009.0 5.063798e+10 1.623458e+11 NaN NaN NaN NaN NaN
1 2019-10-01 ZJV0XJQ4 ST12000NM0007 12000138625024 0 4665536.0 NaN 0.0 3.0 0.0 ... NaN NaN 9533.0 5.084775e+10 1.271356e+11 NaN NaN NaN NaN NaN
2 2019-10-01 ZJV0XJQ3 ST12000NM0007 12000138625024 0 92892872.0 NaN 0.0 1.0 0.0 ... NaN NaN 6977.0 4.920827e+10 4.658787e+10 NaN NaN NaN NaN NaN
3 2019-10-01 ZJV0XJQ0 ST12000NM0007 12000138625024 0 231702544.0 NaN 0.0 6.0 0.0 ... NaN NaN 10669.0 5.341374e+10 9.427903e+10 NaN NaN NaN NaN NaN
4 2019-10-01 PL1331LAHG1S4H HGST HMS5C4040ALE640 4000787030016 0 0.0 103.0 436.0 9.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 2019-10-01 ZA16NQJR ST8000NM0055 8001563222016 0 117053872.0 NaN 0.0 7.0 0.0 ... NaN NaN 21190.0 5.861349e+10 1.380783e+11 NaN NaN NaN NaN NaN
6 2019-10-01 ZJV02XWG ST12000NM0007 12000138625024 0 194975656.0 NaN 0.0 8.0 0.0 ... NaN NaN 12038.0 5.206555e+10 9.974091e+10 NaN NaN NaN NaN NaN
7 2019-10-01 ZJV1CSVX ST12000NM0007 12000138625024 0 121918904.0 NaN 0.0 19.0 0.0 ... NaN NaN 10444.0 5.417592e+10 1.400380e+11 NaN NaN NaN NaN NaN
8 2019-10-01 ZJV02XWA ST12000NM0007 12000138625024 0 22209920.0 NaN 0.0 7.0 0.0 ... NaN NaN 12130.0 6.002246e+10 1.372655e+11 NaN NaN NaN NaN NaN
9 2019-10-01 ZA18CEBS ST8000NM0055 8001563222016 0 119880096.0 NaN 0.0 2.0 0.0 ... NaN NaN 18159.0 5.162341e+10 1.326167e+11 NaN NaN NaN NaN NaN
10 2019-10-01 Z305DEMG ST4000DM000 4000787030016 0 161164360.0 NaN 0.0 4.0 0.0 ... NaN NaN 31207.0 4.454928e+10 1.502931e+11 NaN NaN NaN NaN NaN
11 2019-10-01 ZA130TTW ST8000DM002 8001563222016 0 40241952.0 NaN 0.0 2.0 0.0 ... NaN NaN 26265.0 6.771851e+10 1.653885e+11 NaN NaN NaN NaN NaN
12 2019-10-01 ZJV5HJQF ST12000NM0007 12000138625024 0 41766200.0 NaN 0.0 2.0 0.0 ... NaN NaN 93.0 6.804080e+08 3.379383e+08 NaN NaN NaN NaN NaN
13 2019-10-01 ZJV1CSVV ST12000NM0007 12000138625024 0 90869464.0 NaN 0.0 3.0 0.0 ... NaN NaN 7121.0 4.144846e+10 6.582024e+10 NaN NaN NaN NaN NaN
14 2019-10-01 ZA18CEBF ST8000NM0055 8001563222016 0 206980416.0 NaN 0.0 5.0 0.0 ... NaN NaN 18174.0 5.177096e+10 1.458424e+11 NaN NaN NaN NaN NaN
15 2019-10-01 ZJV02XWV ST12000NM0007 12000138625024 0 122003344.0 NaN 0.0 3.0 0.0 ... NaN NaN 12123.0 5.917276e+10 1.325998e+11 NaN NaN NaN NaN NaN
16 2019-10-01 PL2331LAG9TEEJ HGST HMS5C4040ALE640 4000787030016 0 0.0 98.0 449.0 13.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
17 2019-10-01 PL2331LAH3WYAJ HGST HMS5C4040BLE640 4000787030016 0 0.0 106.0 539.0 5.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18 2019-10-01 2AGN81UY HGST HUH721212ALN604 12000138625024 0 0.0 96.0 0.0 1.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
19 2019-10-01 PL1331LAHG53YH HGST HMS5C4040BLE640 4000787030016 0 0.0 104.0 440.0 7.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
20 2019-10-01 88Q0A0LGF97G TOSHIBA MG07ACA14TA 14000519643136 0 0.0 0.0 7795.0 2.0 0.0 ... NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN
21 2019-10-01 PL2331LAHDUVVJ HGST HMS5C4040BLE640 4000787030016 0 0.0 100.0 0.0 4.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
22 2019-10-01 ZA10JDYK ST8000DM002 8001563222016 0 144780968.0 NaN 0.0 5.0 0.0 ... NaN NaN 29378.0 5.207358e+10 1.730103e+11 NaN NaN NaN NaN NaN
23 2019-10-01 2AGN03VY HGST HUH721212ALN604 12000138625024 0 0.0 96.0 0.0 1.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
24 2019-10-01 2AGNBDDY HGST HUH721212ALN604 12000138625024 0 0.0 96.0 0.0 2.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25 2019-10-01 ZA18CEBT ST8000NM0055 8001563222016 0 44530656.0 NaN 0.0 5.0 0.0 ... NaN NaN 18163.0 5.212663e+10 1.302842e+11 NaN NaN NaN NaN NaN
26 2019-10-01 PL1331LAHD252H HGST HMS5C4040BLE640 4000787030016 0 0.0 103.0 432.0 9.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
27 2019-10-01 PL1331LAHD1HTH HGST HMS5C4040BLE640 4000787030016 0 0.0 103.0 432.0 6.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
28 2019-10-01 8CGDP8AH HGST HUH721212ALE600 12000138625024 0 0.0 96.0 384.0 14.0 0.0 ... NaN NaN NaN 3.766782e+10 4.292351e+10 NaN NaN NaN NaN NaN
29 2019-10-01 ZCH0EBLP ST12000NM0007 12000138625024 0 119494112.0 NaN 0.0 9.0 0.0 ... NaN NaN 12020.0 5.211059e+10 9.888794e+10 NaN NaN NaN NaN NaN

30 rows × 68 columns

In [11]:
# Return the memory usage of each column in bytes.
print(df.memory_usage(deep=True))
Index                   128
date              736411003
serial_number     724524693
model             783195873
capacity_bytes     87929672
                    ...    
smart_250_raw      87929672
smart_251_raw      87929672
smart_252_raw      87929672
smart_254_raw      87929672
smart_255_raw      87929672
Length: 69, dtype: int64
In [12]:
# Total number of failures
df.failure.sum()
Out[12]:
678
In [13]:
# Average number of failures per day
df.failure.sum() / len(df.date.unique())
Out[13]:
7.369565217391305

All SMART test columns have null values in some rows. The dataset notes state that this comes from differing manufacturer's standards despite the standardized nature of SMART tests.

In [14]:
for col in df.columns.values:
    print(col + ": " + str(df[col].isnull().values.any()))
date: False
serial_number: False
model: False
capacity_bytes: False
failure: False
smart_1_raw: True
smart_2_raw: True
smart_3_raw: True
smart_4_raw: True
smart_5_raw: True
smart_7_raw: True
smart_8_raw: True
smart_9_raw: True
smart_10_raw: True
smart_11_raw: True
smart_12_raw: True
smart_13_raw: True
smart_15_raw: True
smart_16_raw: True
smart_17_raw: True
smart_18_raw: True
smart_22_raw: True
smart_23_raw: True
smart_24_raw: True
smart_168_raw: True
smart_170_raw: True
smart_173_raw: True
smart_174_raw: True
smart_177_raw: True
smart_179_raw: True
smart_181_raw: True
smart_182_raw: True
smart_183_raw: True
smart_184_raw: True
smart_187_raw: True
smart_188_raw: True
smart_189_raw: True
smart_190_raw: True
smart_191_raw: True
smart_192_raw: True
smart_193_raw: True
smart_194_raw: True
smart_195_raw: True
smart_196_raw: True
smart_197_raw: True
smart_198_raw: True
smart_199_raw: True
smart_200_raw: True
smart_201_raw: True
smart_218_raw: True
smart_220_raw: True
smart_222_raw: True
smart_223_raw: True
smart_224_raw: True
smart_225_raw: True
smart_226_raw: True
smart_231_raw: True
smart_232_raw: True
smart_233_raw: True
smart_235_raw: True
smart_240_raw: True
smart_241_raw: True
smart_242_raw: True
smart_250_raw: True
smart_251_raw: True
smart_252_raw: True
smart_254_raw: True
smart_255_raw: True

Deriving the manufacturer from the model column will allow the dataset to be easily divided by manufacturer.

In [15]:
df.model.unique()
Out[15]:
array(['ST4000DM000', 'ST12000NM0007', 'HGST HMS5C4040ALE640',
       'ST8000NM0055', 'ST8000DM002', 'HGST HMS5C4040BLE640',
       'HGST HUH721212ALN604', 'TOSHIBA MG07ACA14TA',
       'HGST HUH721212ALE600', 'TOSHIBA MQ01ABF050', 'ST500LM030',
       'ST6000DX000', 'ST10000NM0086', 'DELLBOSS VD',
       'TOSHIBA MQ01ABF050M', 'WDC WD5000LPVX', 'ST500LM012 HN',
       'HGST HUH728080ALE600', 'TOSHIBA MD04ABA400V', 'TOSHIBA HDWF180',
       'ST8000DM005', 'Seagate SSD', 'HGST HUH721010ALE600',
       'ST4000DM005', 'WDC WD5000LPCX', 'HGST HDS5C4040ALE630',
       'ST500LM021', 'Hitachi HDS5C4040ALE630', 'HGST HUS726040ALE610',
       'Seagate BarraCuda SSD ZA500CM10002', 'ST12000NM0117',
       'Seagate BarraCuda SSD ZA2000CM10002',
       'Seagate BarraCuda SSD ZA250CM10002', 'TOSHIBA HDWE160',
       'WDC WD5000BPKT', 'ST6000DM001', 'WDC WD60EFRX', 'ST8000DM004',
       'HGST HMS5C4040BLE641', 'ST1000LM024 HN', 'ST6000DM004',
       'ST12000NM0008', 'ST16000NM001G'], dtype=object)

The "DELLBOSS VD" model value seems the be the only value potentially out of place.

In [16]:
df.loc[(df['model'] == "DELLBOSS VD") &
       (df['date'] == "2019-10-01")]
Out[16]:
date serial_number model capacity_bytes failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw ... smart_233_raw smart_235_raw smart_240_raw smart_241_raw smart_242_raw smart_250_raw smart_251_raw smart_252_raw smart_254_raw smart_255_raw
162 2019-10-01 1747287481d20010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1169 2019-10-01 a79d077c55d30010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1666 2019-10-01 8583f658cd680010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3224 2019-10-01 22ecf3ea21150010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3922 2019-10-01 9ac75f2107cc0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4150 2019-10-01 3b8f38bf6bc90010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4337 2019-10-01 29bae1bef9ad0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4567 2019-10-01 c3bea4912a060010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8696 2019-10-01 5bd1f7cc48910010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
12851 2019-10-01 c1858f02677a0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
13299 2019-10-01 b160b38dd1370010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
15676 2019-10-01 ef29e2d545380010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
16477 2019-10-01 7b7ec52d10240010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
19564 2019-10-01 eef069c94dfb0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
20934 2019-10-01 a79beabda2020010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
23008 2019-10-01 10ca0ecb78690010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
26018 2019-10-01 5c2f968553650010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
31033 2019-10-01 350901195c7b0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
35634 2019-10-01 6866178485f00010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
38044 2019-10-01 2d30418626330010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
38758 2019-10-01 17eddeea3c620010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
38973 2019-10-01 128cfa8eabec0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
40819 2019-10-01 e4d24bb6b3290010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
41068 2019-10-01 a23b0568a5f60010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
41224 2019-10-01 a941d1eaf0160010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
41966 2019-10-01 13a2651c44500010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
42935 2019-10-01 b86976e2b7490010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
43813 2019-10-01 45f3334ff98c0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
48684 2019-10-01 37e5a52d44600010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
54392 2019-10-01 49a73bc7c27d0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
57720 2019-10-01 f2907e144db40010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
58040 2019-10-01 312feea327f30010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
59544 2019-10-01 dc85a3f97d6f0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
60703 2019-10-01 f7acc8a7d9220010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
62021 2019-10-01 cce2cfe98b7f0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
62201 2019-10-01 c295df982e020010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
62319 2019-10-01 7818d2d7bc260010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
65014 2019-10-01 b9f8a9fe5d910010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
65271 2019-10-01 af70ef0319310010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
66849 2019-10-01 56cc876a649c0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
73313 2019-10-01 22d96dd0f90c0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
74417 2019-10-01 4d03b7d534ea0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
77924 2019-10-01 c3b6042ce1d70010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
83725 2019-10-01 5de287ae7c050010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
85238 2019-10-01 507b941884d90010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
85320 2019-10-01 1f157071f4590010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
85872 2019-10-01 421ceb5dd0720010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
91051 2019-10-01 9dd00e2a06080010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
91959 2019-10-01 eeb700c6e4960010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
92330 2019-10-01 76db3b83c3b30010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
92831 2019-10-01 bff106d793020010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
93365 2019-10-01 d83f152970950010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
96150 2019-10-01 ccddfe2489d50010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
96190 2019-10-01 e2cfef5b9de50010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
96893 2019-10-01 ad6def546aea0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
98881 2019-10-01 3c8f79f4ce9b0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
101456 2019-10-01 d2830942e1ca0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
101692 2019-10-01 98a871fbf9de0010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
102576 2019-10-01 826fc283ec560010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
111088 2019-10-01 2e591a197fd00010 DELLBOSS VD 480036847616 0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

60 rows × 68 columns

None of the SMART values exist for this hard drive model, but 60 of the drives have this model value. Additionally, no failures for this model exist in the dataset. Any row with this model value should be removed from the training data before any predictive analysis. Some searching online leads to the belief that it may be a RAID controller. (https://www.dell.com/support/manuals/au/en/aubsd1/boss-s-1/boss_s1_ug_publication/overview?guid=guid-b20ef25b-b7e3-40f2-b7cd-e497358cd10a&lang=en-us)

In [17]:
df.loc[(df['model'] == "DELLBOSS VD") &
       (df['failure'] == 1)]
Out[17]:
date serial_number model capacity_bytes failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw ... smart_233_raw smart_235_raw smart_240_raw smart_241_raw smart_242_raw smart_250_raw smart_251_raw smart_252_raw smart_254_raw smart_255_raw

0 rows × 68 columns

Additionally the "Seagate SSD" model seems to be missing information. Like the "DELLBOSS VD" model rows, this one also does not have any failures and will need to be removed before predictive analysis is performed.

In [18]:
df.loc[(df['model'] == "Seagate SSD") &
       (df['date'] == "2019-10-01")]
Out[18]:
date serial_number model capacity_bytes failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw ... smart_233_raw smart_235_raw smart_240_raw smart_241_raw smart_242_raw smart_250_raw smart_251_raw smart_252_raw smart_254_raw smart_255_raw
1113 2019-10-01 NB1206GH Seagate SSD 250059350016 0 0.0 NaN NaN NaN NaN ... 869.0 1.823282e+09 NaN 1399.0 307.0 NaN NaN NaN NaN NaN
1482 2019-10-01 NB120KH2 Seagate SSD 250059350016 0 0.0 NaN NaN NaN NaN ... 15439.0 3.237942e+10 NaN 7769.0 4309.0 NaN NaN NaN NaN NaN
1507 2019-10-01 NB120KHJ Seagate SSD 250059350016 0 0.0 NaN NaN NaN NaN ... 14427.0 3.025683e+10 NaN 7588.0 4182.0 NaN NaN NaN NaN NaN
4724 2019-10-01 NB120H6H Seagate SSD 250059350016 0 0.0 NaN NaN NaN NaN ... 2848.0 5.972913e+09 NaN 1562.0 911.0 NaN NaN NaN NaN NaN
4749 2019-10-01 NB120H66 Seagate SSD 250059350016 0 0.0 NaN NaN NaN NaN ... 2549.0 5.346440e+09 NaN 1686.0 695.0 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
109521 2019-10-01 NB120G0J Seagate SSD 250059350016 0 0.0 NaN NaN NaN NaN ... 7765.0 1.628449e+10 NaN 4844.0 2382.0 NaN NaN NaN NaN NaN
109891 2019-10-01 NB120AKM Seagate SSD 250059350016 0 0.0 NaN NaN NaN NaN ... 17371.0 3.643071e+10 NaN 7951.0 4602.0 NaN NaN NaN NaN NaN
109901 2019-10-01 NB120AKR Seagate SSD 250059350016 0 0.0 NaN NaN NaN NaN ... 12588.0 2.639919e+10 NaN 7887.0 4337.0 NaN NaN NaN NaN NaN
113784 2019-10-01 NB120KY9 Seagate SSD 250059350016 0 0.0 NaN NaN NaN NaN ... 12661.0 2.655319e+10 NaN 8001.0 4103.0 NaN NaN NaN NaN NaN
114644 2019-10-01 NB120HRB Seagate SSD 250059350016 0 0.0 NaN NaN NaN NaN ... 752.0 1.578609e+09 NaN 1189.0 18.0 NaN NaN NaN NaN NaN

96 rows × 68 columns

In [19]:
df.loc[(df['model'] == "Seagate SSD") &
       (df['failure'] == 1)]
Out[19]:
date serial_number model capacity_bytes failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw ... smart_233_raw smart_235_raw smart_240_raw smart_241_raw smart_242_raw smart_250_raw smart_251_raw smart_252_raw smart_254_raw smart_255_raw

0 rows × 68 columns

The rows not appropriate for analysis are deleted.

In [4]:
df.drop(df[(df['model'] == "DELLBOSS VD") | \
           (df['model'] == "Seagate SSD")].index, axis = 0, inplace = True)
In [5]:
n_rows = len(df)
n_rows
Out[5]:
10976221
In [6]:
# model: ["Manufacturer", "New Model"]
manufacturer_dict = {
    'ST4000DM000': ["Seagate", "ST4000DM000"],
    'ST12000NM0007': ["Seagate", "ST12000NM0007"],
    'HGST HMS5C4040ALE640': ["HGST", "HMS5C4040ALE640"],
    'ST8000NM0055': ["Seagate", "ST8000NM0055"],
    'ST8000DM002': ["Seagate", "ST8000DM002"],
    'HGST HMS5C4040BLE640': ["HGST", "HMS5C4040BLE640"],
    'HGST HUH721212ALN604': ["HGST", "HUH721212ALN604"],
    'TOSHIBA MG07ACA14TA': ["Toshiba", "MG07ACA14TA"],
    'HGST HUH721212ALE600': ["HGST", "HUH721212ALE600"],
    'TOSHIBA MQ01ABF050': ["Toshiba", "MQ01ABF050"],
    'ST500LM030': ["Seagate", "ST500LM030"],
    'ST6000DX000': ["Seagate", "ST6000DX000"],
    'ST10000NM0086': ["Seagate", "ST10000NM0086"],
    'DELLBOSS VD': ["Dell", "DELLBOSS VD"],
    'TOSHIBA MQ01ABF050M': ["Toshiba", "MQ01ABF050M"],
    'WDC WD5000LPVX': ["Western Digital", "WD5000LPVX"],
    'ST500LM012 HN': ["Seagate", "ST500LM012 HN"],
    'HGST HUH728080ALE600': ["HGST", "HUH728080ALE600"],
    'TOSHIBA MD04ABA400V': ["Toshiba", "MD04ABA400V"],
    'TOSHIBA HDWF180': ["Toshiba", "HDWF180"],
    'ST8000DM005': ["Seagate", "ST8000DM005"],
    'Seagate SSD': ["Seagate", "Seagate SSD"],
    'HGST HUH721010ALE600': ["HGST", "Seagate SSD"],
    'ST4000DM005': ["Seagate", "ST4000DM005"],
    'WDC WD5000LPCX': ["Western Digital", "WD5000LPCX"],
    'HGST HDS5C4040ALE630': ["HGST", "HDS5C4040ALE630"],
    'ST500LM021': ["Seagate", "ST500LM021"],
    'Hitachi HDS5C4040ALE630': ["HGST", "HDS5C4040ALE630"],
    'HGST HUS726040ALE610': ["HGST", "HUS726040ALE610"],
    'Seagate BarraCuda SSD ZA500CM10002': ["Seagate", "ZA500CM10002"],
    'ST12000NM0117': ["Seagate", "ST12000NM0117"],
    'Seagate BarraCuda SSD ZA2000CM10002': ["Seagate", "ZA2000CM10002"],
    'Seagate BarraCuda SSD ZA250CM10002': ["Seagate", "ZA250CM10002"],
    'TOSHIBA HDWE160': ["Toshiba", "HDWE160"],
    'WDC WD5000BPKT': ["Western Digital", "WD5000BPKT"],
    'ST6000DM001': ["Seagate", "ST6000DM001"],
    'WDC WD60EFRX': ["Western Digital", "WD60EFRX"],
    'ST8000DM004': ["Seagate", "ST8000DM004"],
    'HGST HMS5C4040BLE641': ["HGST", "HMS5C4040BLE641"],
    'ST1000LM024 HN': ["Seagate", "ST1000LM024 HN'"],
    'ST6000DM004': ["Seagate", "ST6000DM004"],
    'ST12000NM0008': ["Seagate", "ST12000NM0008"],
    'ST16000NM001G': ["Seagate", "ST16000NM001G"]
}
In [7]:
# Change the model column into Manufacturer and Model columns.
df['model_temp'] = df['model']
df['manufacturer'] = ''

df['manufacturer'] = df['model_temp'].map(lambda x: manufacturer_dict[x][0])
df['model'] = df['model_temp'].map(lambda x: manufacturer_dict[x][1])

df.drop(['model_temp'], axis=1, inplace=True)
In [8]:
df.head()
Out[8]:
date serial_number model capacity_bytes failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw ... smart_235_raw smart_240_raw smart_241_raw smart_242_raw smart_250_raw smart_251_raw smart_252_raw smart_254_raw smart_255_raw manufacturer
0 2019-10-01 Z305B2QN ST4000DM000 4000787030016 0 97236416.0 NaN 0.0 13.0 0.0 ... NaN 33009.0 5.063798e+10 1.623458e+11 NaN NaN NaN NaN NaN Seagate
1 2019-10-01 ZJV0XJQ4 ST12000NM0007 12000138625024 0 4665536.0 NaN 0.0 3.0 0.0 ... NaN 9533.0 5.084775e+10 1.271356e+11 NaN NaN NaN NaN NaN Seagate
2 2019-10-01 ZJV0XJQ3 ST12000NM0007 12000138625024 0 92892872.0 NaN 0.0 1.0 0.0 ... NaN 6977.0 4.920827e+10 4.658787e+10 NaN NaN NaN NaN NaN Seagate
3 2019-10-01 ZJV0XJQ0 ST12000NM0007 12000138625024 0 231702544.0 NaN 0.0 6.0 0.0 ... NaN 10669.0 5.341374e+10 9.427903e+10 NaN NaN NaN NaN NaN Seagate
4 2019-10-01 PL1331LAHG1S4H HMS5C4040ALE640 4000787030016 0 0.0 103.0 436.0 9.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN HGST

5 rows × 69 columns

Given the size of the dataset, a few minor changes to the columns may free up a considerable amount of memory. The date and capacity_bytes columns are two easy places to improve.

In [25]:
# date
df['date'].value_counts()
Out[25]:
2019-12-23    124853
2019-12-24    124853
2019-12-25    124853
2019-12-22    124851
2019-12-26    124850
               ...  
2019-10-09    115102
2019-10-04    115101
2019-10-07    115100
2019-10-03    115099
2019-11-05     55837
Name: date, Length: 92, dtype: int64
In [26]:
df['date'][0:5]
Out[26]:
0    2019-10-01
1    2019-10-01
2    2019-10-01
3    2019-10-01
4    2019-10-01
Name: date, dtype: object
In [27]:
before_mem = df['date'].memory_usage()
before_mem
Out[27]:
175619536
In [28]:
df['date'] = df['date'].str[-5:]
df.head()
Out[28]:
date serial_number model capacity_bytes failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw ... smart_235_raw smart_240_raw smart_241_raw smart_242_raw smart_250_raw smart_251_raw smart_252_raw smart_254_raw smart_255_raw manufacturer
0 10-01 Z305B2QN ST4000DM000 4000787030016 0 97236416.0 NaN 0.0 13.0 0.0 ... NaN 33009.0 5.063798e+10 1.623458e+11 NaN NaN NaN NaN NaN Seagate
1 10-01 ZJV0XJQ4 ST12000NM0007 12000138625024 0 4665536.0 NaN 0.0 3.0 0.0 ... NaN 9533.0 5.084775e+10 1.271356e+11 NaN NaN NaN NaN NaN Seagate
2 10-01 ZJV0XJQ3 ST12000NM0007 12000138625024 0 92892872.0 NaN 0.0 1.0 0.0 ... NaN 6977.0 4.920827e+10 4.658787e+10 NaN NaN NaN NaN NaN Seagate
3 10-01 ZJV0XJQ0 ST12000NM0007 12000138625024 0 231702544.0 NaN 0.0 6.0 0.0 ... NaN 10669.0 5.341374e+10 9.427903e+10 NaN NaN NaN NaN NaN Seagate
4 10-01 PL1331LAHG1S4H HMS5C4040ALE640 4000787030016 0 0.0 103.0 436.0 9.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN HGST

5 rows × 69 columns

In [29]:
df['date'] = df['date'].astype('category')
df['date'][0:5]
Out[29]:
0    10-01
1    10-01
2    10-01
3    10-01
4    10-01
Name: date, dtype: category
Categories (92, object): [10-01, 10-02, 10-03, 10-04, ..., 12-28, 12-29, 12-30, 12-31]
In [30]:
after_mem = df['date'].memory_usage()
after_mem
Out[30]:
98789285
In [31]:
memory_saved = before_mem - after_mem
print("Memory saved on the date column: " + str(np.around((memory_saved / 1024 ** 2), 2)) + "MB")
Memory saved on the date column: 73.27MB
In [32]:
# model
before_mem = df['model'].memory_usage()
df['model'] = df['model'].astype('category')
after_mem = df['model'].memory_usage()
memory_saved = before_mem - after_mem
print("Memory saved on the model column: " + str(np.around((memory_saved / 1024 ** 2), 2)) + "MB")
Memory saved on the model column: 73.27MB
In [33]:
# failure
before_mem = df['failure'].memory_usage(deep = True)
df['failure'] = df['failure'].astype('bool')
after_mem = df['failure'].memory_usage(deep = True)
memory_saved = before_mem - after_mem
print("Memory saved on the failure column: " + str(np.around((memory_saved / 1024 ** 2), 2)) + "MB")
Memory saved on the failure column: 73.27MB
In [9]:
# capacity_bytes
before_memory = df['capacity_bytes'].memory_usage(deep = True)
before_memory
Out[9]:
175619536

Here we can see that 1108 drive days have an error value rather than their actual capacity. These rows may need to be removed, but may also be an excellent signal for a failing drive.

In [10]:
df.loc[df["capacity_bytes"] == -1]["manufacturer"].value_counts()
Out[10]:
Seagate            759
HGST               299
Toshiba             48
Western Digital      2
Name: manufacturer, dtype: int64
In [11]:
sns.countplot(x = df.loc[df["capacity_bytes"] == -1]["capacity_bytes"], \
              hue = df["failure"])
Out[11]:
<AxesSubplot:xlabel='capacity_bytes', ylabel='count'>

Unfortunately, all drives experiencing this error do not fail and this can introduce problems in the final model. As it only affects 0.01% of the dataset, removing the affected rows seems best.

In [12]:
# Calculate the percentage of the dataset that is affected by this error.
str(np.around(((1008/n_rows) * 100), 2)) + "%"
Out[12]:
'0.01%'
In [13]:
df.drop(df[(df['capacity_bytes'] == -1)].index, axis = 0, inplace = True)
In [14]:
n_rows = len(df)
n_rows
Out[14]:
10975113
In [15]:
df['capacity_bytes'].value_counts()
Out[15]:
12000138625024    4855875
4000787030016     3197457
8001563222016     2309775
14000519643136     232122
500107862016       177166
10000831348736     110993
6001175126016       82595
250059350016         6844
16000900661248       1840
2000398934016         355
1000204886016          91
Name: capacity_bytes, dtype: int64

The capacity_bytes column is converted from bytes to terabytes to condense the information on disk.

In [16]:
df['capacity_TB'] = np.around((df['capacity_bytes']/(1000*1000*1000*1000)), \
                              decimals = 2)
df.head()
Out[16]:
date serial_number model capacity_bytes failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw ... smart_240_raw smart_241_raw smart_242_raw smart_250_raw smart_251_raw smart_252_raw smart_254_raw smart_255_raw manufacturer capacity_TB
0 2019-10-01 Z305B2QN ST4000DM000 4000787030016 0 97236416.0 NaN 0.0 13.0 0.0 ... 33009.0 5.063798e+10 1.623458e+11 NaN NaN NaN NaN NaN Seagate 4.0
1 2019-10-01 ZJV0XJQ4 ST12000NM0007 12000138625024 0 4665536.0 NaN 0.0 3.0 0.0 ... 9533.0 5.084775e+10 1.271356e+11 NaN NaN NaN NaN NaN Seagate 12.0
2 2019-10-01 ZJV0XJQ3 ST12000NM0007 12000138625024 0 92892872.0 NaN 0.0 1.0 0.0 ... 6977.0 4.920827e+10 4.658787e+10 NaN NaN NaN NaN NaN Seagate 12.0
3 2019-10-01 ZJV0XJQ0 ST12000NM0007 12000138625024 0 231702544.0 NaN 0.0 6.0 0.0 ... 10669.0 5.341374e+10 9.427903e+10 NaN NaN NaN NaN NaN Seagate 12.0
4 2019-10-01 PL1331LAHG1S4H HMS5C4040ALE640 4000787030016 0 0.0 103.0 436.0 9.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN HGST 4.0

5 rows × 70 columns

In [17]:
df['capacity_TB'].value_counts()
Out[17]:
12.00    4855875
4.00     3197457
8.00     2309775
14.00     232122
0.50      177166
10.00     110993
6.00       82595
0.25        6844
16.00       1840
2.00         355
1.00          91
Name: capacity_TB, dtype: int64
In [19]:
df['capacity_TB'] = df['capacity_TB'].astype('category')
after_mem = df['capacity_TB'].memory_usage()
memory_saved = before_memory - after_mem
print("Memory saved on the capacity column: " + \
      str(np.around((memory_saved / 1024 ** 2), 2)) + "MB")
Memory saved on the capacity column: 73.28MB
In [20]:
df.drop(['capacity_bytes'], axis=1, inplace=True)
df.head()
Out[20]:
date serial_number model failure smart_1_raw smart_2_raw smart_3_raw smart_4_raw smart_5_raw smart_7_raw ... smart_240_raw smart_241_raw smart_242_raw smart_250_raw smart_251_raw smart_252_raw smart_254_raw smart_255_raw manufacturer capacity_TB
0 2019-10-01 Z305B2QN ST4000DM000 0 97236416.0 NaN 0.0 13.0 0.0 704304346.0 ... 33009.0 5.063798e+10 1.623458e+11 NaN NaN NaN NaN NaN Seagate 4.0
1 2019-10-01 ZJV0XJQ4 ST12000NM0007 0 4665536.0 NaN 0.0 3.0 0.0 422822971.0 ... 9533.0 5.084775e+10 1.271356e+11 NaN NaN NaN NaN NaN Seagate 12.0
2 2019-10-01 ZJV0XJQ3 ST12000NM0007 0 92892872.0 NaN 0.0 1.0 0.0 936518450.0 ... 6977.0 4.920827e+10 4.658787e+10 NaN NaN NaN NaN NaN Seagate 12.0
3 2019-10-01 ZJV0XJQ0 ST12000NM0007 0 231702544.0 NaN 0.0 6.0 0.0 416687782.0 ... 10669.0 5.341374e+10 9.427903e+10 NaN NaN NaN NaN NaN Seagate 12.0
4 2019-10-01 PL1331LAHG1S4H HMS5C4040ALE640 0 0.0 103.0 436.0 9.0 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN HGST 4.0

5 rows × 69 columns

In [21]:
fail_df = pd.crosstab(df["manufacturer"], df["failure"])
fail_df
Out[21]:
failure 0 1
manufacturer
HGST 2660507 26
Seagate 7965951 606
Toshiba 322682 40
Western Digital 25295 6
In [22]:
fail_df['Rate'] = fail_df[1] / (fail_df[0] + fail_df[1])
fail_df
Out[22]:
failure 0 1 Rate
manufacturer
HGST 2660507 26 0.000010
Seagate 7965951 606 0.000076
Toshiba 322682 40 0.000124
Western Digital 25295 6 0.000237
In [23]:
corr_df = df.corr()
In [24]:
corr_df['failure']
Out[24]:
failure          1.000000
smart_1_raw      0.002183
smart_2_raw     -0.003998
smart_3_raw     -0.000161
smart_4_raw      0.001086
                   ...   
smart_250_raw         NaN
smart_251_raw         NaN
smart_252_raw         NaN
smart_254_raw         NaN
smart_255_raw         NaN
Name: failure, Length: 64, dtype: float64

With these things finished, the univariate distributions can be examined to gain a better sense of the data.

The first column, date shows some sort of testing or operational failure on November 5th.

In [25]:
plt.figure(figsize = (20, 10))
plt.title('Number of Drives in Operation per Day (Q4 2019)')
g = sns.countplot(df['date'], data = df)
g.set_xticklabels(g.get_xticklabels(), rotation = 90)
g.figure.savefig("Charts/Date Distribution.png")
g.figure.savefig("Charts/Date Distribution.svg")

Drive capacities are mostly 4, 8, and 12 TB, likely coinciding with large investments in new drives for the datacenter and possibly alongside the price lowering of specific models.

In [26]:
plt.figure(figsize = (5, 5))
plt.title('Capacity of Drives')
g = sns.countplot(df['capacity_TB'], data = df)
g.set_xticklabels(g.get_xticklabels(), rotation = 90)
for p in g.patches:
    percentage = "{0:.2f}".format((p.get_height() / n_rows) * 100) + "%"
    g.annotate(percentage, (p.get_x() + p.get_width() / 2., p.get_height()),
    ha = 'center', va = 'center', xytext = (0, 7), textcoords = 'offset points')

g.figure.savefig("Charts/Capacity Distribution.svg")
g.figure.savefig("Charts/Capacity Distribution.svg")

The manufacturer of the most drives in this dataset is Seagate at 72.59%. HGST is the second highest at 24.24%. Western Digital is the least represented manufacturer in the dataset with only 0.23%, but as HGST was acquired by Western Digital in 2012 (Sanders, 2018), the drives in this dataset will likely be quite similar between the two manufacturers given the seven-year timespan between then and the time of dataset recording and creation. Finally, Toshiba is the other manufacturer, with 2.94% of the dataset. This amount is quite low and may make it difficult to accurately predict their drives in comparison.

In [27]:
plt.figure(figsize = (5, 5))
plt.title('Manufacturers of Drives')
g = sns.countplot(df['manufacturer'], data = df)
g.set_xticklabels(g.get_xticklabels(), rotation = 90)
for p in g.patches:
    percentage = "{0:.2f}".format((p.get_height() / n_rows) * 100) + "%"
    g.annotate(percentage, (p.get_x() + p.get_width() / 2., p.get_height()),
    ha = 'center', va = 'center', xytext = (0, 7), textcoords = 'offset points')

g.figure.savefig("Charts/Manufacturer Distribution.svg")
g.figure.savefig("Charts/Manufacturer Distribution.png")

The SMART values vary greatly from the number of different types of drives that exist in this dataset. Before the columns can be graphed appropriately, the NaN/null values need to be examined. It's most likely that the missing data is most related to the hard drive's manufacturer or model.

In [61]:
sns.distplot(df['smart_1_raw'])
plt.grid(True)
plt.show()
In [62]:
# Pandas styling function
def highlight_nans(val):
    color = 'red' if val == True or val > 0 else 'black'
    return 'color: %s' % color

Every single SMART figure column has null values.

In [63]:
pd.set_option('display.max_rows', 70)
pd.set_option('display.max_columns', 75)
df.isna().any()
Out[63]:
date             False
serial_number    False
model            False
failure          False
smart_1_raw       True
smart_2_raw       True
smart_3_raw       True
smart_4_raw       True
smart_5_raw       True
smart_7_raw       True
smart_8_raw       True
smart_9_raw       True
smart_10_raw      True
smart_11_raw      True
smart_12_raw      True
smart_13_raw      True
smart_15_raw      True
smart_16_raw      True
smart_17_raw      True
smart_18_raw      True
smart_22_raw      True
smart_23_raw      True
smart_24_raw      True
smart_168_raw     True
smart_170_raw     True
smart_173_raw     True
smart_174_raw     True
smart_177_raw     True
smart_179_raw     True
smart_181_raw     True
smart_182_raw     True
smart_183_raw     True
smart_184_raw     True
smart_187_raw     True
smart_188_raw     True
smart_189_raw     True
smart_190_raw     True
smart_191_raw     True
smart_192_raw     True
smart_193_raw     True
smart_194_raw     True
smart_195_raw     True
smart_196_raw     True
smart_197_raw     True
smart_198_raw     True
smart_199_raw     True
smart_200_raw     True
smart_201_raw     True
smart_218_raw     True
smart_220_raw     True
smart_222_raw     True
smart_223_raw     True
smart_224_raw     True
smart_225_raw     True
smart_226_raw     True
smart_231_raw     True
smart_232_raw     True
smart_233_raw     True
smart_235_raw     True
smart_240_raw     True
smart_241_raw     True
smart_242_raw     True
smart_250_raw     True
smart_251_raw     True
smart_252_raw     True
smart_254_raw     True
smart_255_raw     True
manufacturer     False
capacity_TB      False
dtype: bool
In [64]:
manu_nan_df = pd.DataFrame()
for manu in df['manufacturer'].unique():
    manu_nan_df[manu] = df.loc[df['manufacturer'] == manu].isna().sum()
In [65]:
manu_nan_df.style.applymap(highlight_nans)
Out[65]:
Seagate HGST Toshiba Western Digital
date 0 0 0 0
serial_number 0 0 0 0
model 0 0 0 0
failure 0 0 0 0
smart_1_raw 2 0 0 0
smart_2_raw 7921364 0 0 25301
smart_3_raw 8794 0 0 0
smart_4_raw 8794 0 0 0
smart_5_raw 8794 0 0 0
smart_7_raw 8794 0 0 0
smart_8_raw 7921364 0 0 25301
smart_9_raw 2 0 0 0
smart_10_raw 8794 0 0 0
smart_11_raw 7921364 2660533 322722 0
smart_12_raw 2 0 0 0
smart_13_raw 7966557 2660533 322722 25301
smart_15_raw 7966557 2660533 322722 25301
smart_16_raw 7957765 2660533 322722 25301
smart_17_raw 7957765 2660533 322722 25301
smart_18_raw 7643443 2660533 322722 25301
smart_22_raw 7966557 1427395 322722 25301
smart_23_raw 7966557 2660533 90600 25301
smart_24_raw 7966557 2660533 90600 25301
smart_168_raw 7957765 2660533 322722 25301
smart_170_raw 7957765 2660533 322722 25301
smart_173_raw 7957765 2660533 322722 25301
smart_174_raw 7957765 2660533 322722 25301
smart_177_raw 7957765 2660533 322722 25301
smart_179_raw 7966557 2660533 322722 25301
smart_181_raw 7966557 2660533 322722 25301
smart_182_raw 7966557 2660533 322722 25301
smart_183_raw 6123738 2660533 322722 25301
smart_184_raw 3772455 2660533 322722 25301
smart_187_raw 53987 2660533 322722 25301
smart_188_raw 53987 2660533 322722 25301
smart_189_raw 3772455 2660533 322722 25301
smart_190_raw 53987 2660533 322722 25301
smart_191_raw 3727262 2660533 0 15052
smart_192_raw 2 0 0 0
smart_193_raw 53987 0 0 0
smart_194_raw 2 0 0 0
smart_195_raw 1797748 2660533 322722 25301
smart_196_raw 7921364 0 0 0
smart_197_raw 8794 0 0 0
smart_198_raw 8794 0 0 0
smart_199_raw 8794 0 0 0
smart_200_raw 4093723 2660533 322722 0
smart_201_raw 7966557 2660533 322722 25301
smart_218_raw 7957765 2660533 322722 25301
smart_220_raw 7966557 2660533 0 25301
smart_222_raw 7966557 2660533 0 25301
smart_223_raw 7921364 2517013 0 25301
smart_224_raw 7966557 2660533 0 25301
smart_225_raw 7921364 2660533 322722 25301
smart_226_raw 7966557 2660533 0 25301
smart_231_raw 7957765 2660533 322722 25301
smart_232_raw 7957765 2660533 322722 25301
smart_233_raw 7957765 2660533 322722 25301
smart_235_raw 7957765 2660533 322722 25301
smart_240_raw 53987 2660533 0 18734
smart_241_raw 45195 2517013 322722 24389
smart_242_raw 45195 2517013 322722 24389
smart_250_raw 7966557 2660533 322722 25301
smart_251_raw 7966557 2660533 322722 25301
smart_252_raw 7966557 2660533 322722 25301
smart_254_raw 7940496 2660533 322722 24389
smart_255_raw 7966557 2660533 322722 25301
manufacturer 0 0 0 0
capacity_TB 0 0 0 0
In [66]:
model_nan_df = pd.DataFrame()
for model in df['model'].unique():
    model_nan_df[model] = df.loc[df['model'] == model].isna().sum()
In [67]:
model_nan_df.style.applymap(highlight_nans)
Out[67]:
ST4000DM000 ST12000NM0007 HMS5C4040ALE640 ST8000NM0055 ST8000DM002 HMS5C4040BLE640 HUH721212ALN604 MG07ACA14TA HUH721212ALE600 MQ01ABF050 ST500LM030 ST6000DX000 ST10000NM0086 MQ01ABF050M WD5000LPVX ST500LM012 HN HUH728080ALE600 MD04ABA400V HDWF180 ST8000DM005 Seagate SSD ST4000DM005 WD5000LPCX HDS5C4040ALE630 ST500LM021 HUS726040ALE610 ZA500CM10002 ST12000NM0117 ZA2000CM10002 ZA250CM10002 HDWE160 WD5000BPKT ST6000DM001 WD60EFRX ST8000DM004 HMS5C4040BLE641 ST1000LM024 HN' ST6000DM004 ST12000NM0008 ST16000NM001G
date 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
serial_number 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
model 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
failure 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
smart_1_raw 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
smart_2_raw 1757498 3394893 0 1316386 896946 0 0 0 0 0 23025 81493 109173 0 19187 0 0 0 0 2257 0 3555 4928 0 3036 0 1593 462 355 6844 0 912 368 274 273 0 0 92 321275 1840
smart_3_raw 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1593 0 355 6844 0 0 0 0 0 0 0 0 1 0
smart_4_raw 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1593 0 355 6844 0 0 0 0 0 0 0 0 1 0
smart_5_raw 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1593 0 355 6844 0 0 0 0 0 0 0 0 1 0
smart_7_raw 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1593 0 355 6844 0 0 0 0 0 0 0 0 1 0
smart_8_raw 1757498 3394893 0 1316386 896946 0 0 0 0 0 23025 81493 109173 0 19187 0 0 0 0 2257 0 3555 4928 0 3036 0 1593 462 355 6844 0 912 368 274 273 0 0 92 321275 1840
smart_9_raw 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
smart_10_raw 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1593 0 355 6844 0 0 0 0 0 0 0 0 1 0
smart_11_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 0 0 92073 9009 1840 2257 1820 3555 0 2484 3036 2570 1593 462 355 6844 368 0 368 0 273 91 0 92 321275 1840
smart_12_raw 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
smart_13_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 1593 462 355 6844 368 912 368 274 273 91 91 92 321275 1840
smart_15_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 1593 462 355 6844 368 912 368 274 273 91 91 92 321275 1840
smart_16_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 0 462 0 0 368 912 368 274 273 91 91 92 321275 1840
smart_17_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 0 462 0 0 368 912 368 274 273 91 91 92 321275 1840
smart_18_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 1593 462 355 6844 368 912 368 274 273 91 91 92 1 0
smart_22_raw 1757498 3394893 253758 1316386 896946 1168492 0 232122 0 42565 23025 81493 109173 36818 19187 45102 0 9009 1840 2257 0 3555 4928 2484 3036 2570 1593 462 355 6844 368 912 368 274 273 91 91 92 321275 1840
smart_23_raw 1757498 3394893 253758 1316386 896946 1168492 995725 0 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 1593 462 355 6844 368 912 368 274 273 91 91 92 321275 1840
smart_24_raw 1757498 3394893 253758 1316386 896946 1168492 995725 0 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 1593 462 355 6844 368 912 368 274 273 91 91 92 321275 1840
smart_168_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 0 462 0 0 368 912 368 274 273 91 91 92 321275 1840
smart_170_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 0 462 0 0 368 912 368 274 273 91 91 92 321275 1840
smart_173_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 0 462 0 0 368 912 368 274 273 91 91 92 321275 1840
smart_174_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 0 462 0 0 368 912 368 274 273 91 91 92 321275 1840
smart_177_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 0 462 0 0 368 912 368 274 273 91 91 92 321275 1840
smart_179_raw 1757498 3394893 253758 1316386 896946 1168492 995725 232122 143520 42565 23025 81493 109173 36818 19187 45102 92073 9009 1840 2257 1820 3555 4928 2484 3036 2570 1593 462 355 6844 368 912 368 274 273 91 91 92