Udacity Data Analyst Nanodegree Project 4
WeRateDogs (https://twitter.com/dog_rates) is a Twitter account well known for its pictures of and commentary on dogs. Its popularity is built on the internet's obsession with dogs, a rating system that reflects how good all dogs are with scores consistently above the stated maximum, and commentary that has spawned a language of its own, from observations about the pictures to snarky and amusing replies to other accounts. I've combed through thousands of this account's tweets to find some of the patterns in its postings.
Not all tweets had ratings, and not all ratings were of dogs. Of all ratings, the average numerator is 12.13, and out of all dog ratings, the average numerator is 11.4. While the denominators in the ratings were most often 10, this wasn't always the case. Of all ratings, the average ratio of numerator over denominator is 1.16, while the same for the ratings of dogs was 1.08. The highest numerator overall is 1776, while the lowest is -5.
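These summary numbers come from simple column averages. A minimal sketch of the calculation, using a hypothetical mini-archive in place of the full dataset (note that the average ratio is computed per tweet, not as the ratio of the two column means):

```python
import pandas as pd

# Hypothetical mini-archive standing in for the full dataset.
ratings = pd.DataFrame({
    'rating_numerator': [12.0, 1776.0, -5.0, 11.0],
    'rating_denominator': [10.0, 10.0, 10.0, 10.0],
})

# Average numerator across all ratings.
mean_numerator = ratings['rating_numerator'].mean()

# Average of the per-tweet numerator/denominator ratio.
mean_ratio = (ratings['rating_numerator'] / ratings['rating_denominator']).mean()
print(mean_numerator, mean_ratio)
```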
Dog Breed | Total Number of Ratings |
---|---|
Golden Retriever | 144 |
Labrador Retriever | 99 |
Pembroke | 88 |
Chihuahua | 83 |
Pug | 57 |
Chow | 44 |
Samoyed | 43 |
Toy Poodle | 39 |
Pomeranian | 38 |
Cocker Spaniel | 30 |
Other | 854 |
Some breeds were definitely more popular than others. Golden Retrievers, Labrador Retrievers, and Pembroke / Welsh Corgis were the most common. Meanwhile, Japanese Spaniels, Clumber Spaniels, Groenendaels, Silky Terriers, Entlebuchers, Scotch Terriers, and Standard Schnauzers were the least common, with only a single rating each. It seems that more popular breeds of dogs are rated more often than less popular breeds.
Dog Breed | Average Numerator Rating |
---|---|
Soft-Coated Wheaten Terrier | 25.45 |
West Highland White Terrier | 15.64 |
Great Pyrenees | 14.93 |
Borzoi | 14.44 |
Labrador Retriever | 13.49 |
Siberian Husky | 13.25 |
Golden Retriever | 13.00 |
Saluki | 12.50 |
Tibetan Mastiff | 12.40 |
Briard | 12.33 |
Out of all breeds, the Soft-Coated Wheaten Terrier, the West Highland White Terrier, and the Great Pyrenees had the highest average numerator rating. Meanwhile, the Japanese Spaniel had the lowest average numerator rating of 5, followed by the Weimaraner at 8.53 and the Basenji at an average of 8.73.
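The per-breed averages above can be produced with a groupby. A sketch with hypothetical data, assuming a dataframe with the breed prediction in 'p1' and the rating in 'rating_numerator':

```python
import pandas as pd

# Hypothetical mini-dataset standing in for the dog-only dataframe.
ratings = pd.DataFrame({
    'p1': ['Borzoi', 'Borzoi', 'Saluki'],
    'rating_numerator': [14.0, 15.0, 12.5],
})

# Average numerator per breed, highest first.
breed_means = (ratings.groupby('p1')['rating_numerator']
               .mean()
               .sort_values(ascending=False))
print(breed_means)
```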
The data for this project was gathered from three different sources. The first dataframe was loaded from the downloaded file 'twitter-archive-enhanced.csv'. The second dataframe was programmatically requested from 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'. The third dataframe was gathered from the .json data accessed from the Twitter API through Tweepy.
The first step in wrangling these datasets was to fix the tidiness issue of the data for each observation, or tweet, being split across multiple dataframes when it would be best for all of it to be in a single dataframe. Copies of the dataframes were made and merged on their 'tweet_id' columns into the 'master_df' dataframe before any other wrangling or cleaning was performed, in order to avoid duplicating any fixes.
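The merge step can be sketched as follows, with tiny stand-in dataframes in place of the three real sources (only the shared 'tweet_id' key and one payload column each are shown):

```python
import pandas as pd

# Minimal stand-ins for the three source dataframes.
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'text': ['a', 'b', 'c']})
predictions = pd.DataFrame({'tweet_id': [1, 2], 'p1': ['Pug', 'Chow']})
api_data = pd.DataFrame({'tweet_id': [1, 2], 'favorite_count': [10, 20]})

# Chained inner merges keep only tweets present in every source.
master = (archive.merge(predictions, on='tweet_id')
                 .merge(api_data, on='tweet_id'))
print(master)
```

Because the merges are inner joins, tweets missing from any one source (here, tweet_id 3) drop out of the combined dataframe.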
To avoid unnecessary work, one quality issue was fixed before the remaining tidiness issues. Fourteen outliers were removed because they had values in multiple development stage categories when each tweet should have at most one. The four development stage columns were then combined into a single 'development_stage' column that records the stage value. These four columns, along with four other redundant columns, were dropped from 'master_df'.
The quality issues were mostly minor changes. All id columns were changed to the object dtype to reflect their non-numeric nature. The name column had ninety-eight values that were not names, and these were changed to 'None' like all others that did not have a name. Finally, the names of dog breeds in the 'p1', 'p2', and 'p3' columns were cleaned to make them more readable by changing the underscores to spaces and consistently capitalizing the words with title casing.
The largest issue was that the numerator and denominator values of the ratings were incorrectly pulled in several cases. It appears that they were algorithmically pulled from the 'text' values, but that this algorithm did not account for the possibility of multiple ratings or fractions in the text field. It also did not account for decimal places, float values, or negative numbers appearing as part of a rating. Each row was iterated over and the 'rating_numerator' and 'rating_denominator' values were recalculated and stored with these points in mind. All rows whose text contained multiple forward slashes ('/') were checked manually to ensure that the rating values were pulled correctly.
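One compact way to handle decimals and negatives is a single regular expression. This is a hedged sketch of the idea, not the notebook's exact per-character loop; the example texts are made up:

```python
import re

# Matches an optionally negative, optionally decimal numerator
# followed by '/' and an integer denominator.
RATING_RE = re.compile(r'(-?\d+(?:\.\d+)?)/(\d+)')

def extract_rating(text):
    """Return (numerator, denominator) of the first rating in text, or None."""
    match = RATING_RE.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

print(extract_rating("This is Bella. 13.5/10 would pet"))
print(extract_rating("After so much sleep... -5/10"))
```

A pattern like this would still need the manual check described above for texts with multiple fractions, since it simply takes the first match.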
Some minor wrangling was required for the analysis portion of the project as well. A 'score' column was created to capture the ratio of 'rating_numerator' to 'rating_denominator', since the denominator was not always ten. A second dataframe, 'is_dog', was created to restrict analysis to the rows whose images the machine learning algorithm from the second dataset determined to actually contain a dog. Lastly, the 'master_df' dataframe was stored in the file 'twitter_archive_master.csv'.
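The two analysis-prep steps can be sketched as below, with a hypothetical two-row stand-in for 'master_df' ('p1_dog' flags whether the top image prediction was a dog breed):

```python
import pandas as pd

# Tiny stand-in for master_df.
master = pd.DataFrame({
    'rating_numerator': [13.0, 84.0],
    'rating_denominator': [10.0, 70.0],
    'p1_dog': [True, False],
})

# 'score' normalizes ratings whose denominator is not 10.
master['score'] = master['rating_numerator'] / master['rating_denominator']

# Restrict analysis to rows the classifier judged to contain a dog.
is_dog = master.loc[master['p1_dog'] == True]
print(is_dog['score'])
```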
import pandas as pd
import numpy as np
import requests
import matplotlib.pyplot as plt
Load the data from the downloaded .csv file into a dataframe.
df_1 = pd.read_csv('twitter-archive-enhanced.csv')
df_1.info()
Request the .tsv file and load it into a separate dataframe.
r = requests.get('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv')
# Display a limited amount of the requested data.
print(r.text[:5000])
with open('image_predictions.tsv', 'w') as file:
    file.write(r.text)
df_2 = pd.read_csv('image_predictions.tsv', sep='\t')
df_2.info()
Load the .json data for each tweet gathered with Tweepy into a separate dataframe.
'''
import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer
# Query Twitter API for each tweet in the Twitter archive and save JSON in a text file
# These are hidden to comply with Twitter's API terms and conditions
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)
# NOTE TO STUDENT WITH MOBILE VERIFICATION ISSUES:
# df_1 is a DataFrame with the twitter_archive_enhanced.csv file. You may have to
# change line 17 to match the name of your DataFrame with twitter_archive_enhanced.csv
# NOTE TO REVIEWER: this student had mobile verification issues so the following
# Twitter API code was sent to this student from a Udacity instructor
# Tweet IDs for which to gather additional data via Twitter's API
tweet_ids = df_1.tweet_id.values
len(tweet_ids)
# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError as e:
            print("Fail")
            fails_dict[tweet_id] = e
end = timer()
print(end - start)
print(fails_dict)
'''
df_3 = pd.read_json('tweet-json.txt', lines = True)
df_3.info()
df_1[['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id']]
df_2['tweet_id']
df_3[['id', 'id_str', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'quoted_status_id', 'quoted_status_id_str']]
df_1[['doggo', 'floofer', 'pupper', 'puppo']][:20]
Check whether the 'doggo', 'floofer', 'pupper', and 'puppo' variables are set correctly based on whether they appear in the 'text', then review the cases where they do not appear to determine if the variables are incorrect.
check_categories = []
stages = ['doggo', 'floofer', 'pupper', 'puppo']
for index, row in df_1.iterrows():
    appending = [index]
    for stage in stages:
        if row[stage] == stage and stage not in row['text'].lower():
            print(str(index) + '\t' + stage + ': Is not in Text.')
            appending.append(stage)
    if len(appending) > 1:
        check_categories.append(appending)
check_categories
All of these values appear to be correctly set, so these will be left as they are.
Find if multiple dog categories appear in any row.
def check_multiple_categories(df):
    """Return the indexes of rows with more than one development stage set."""
    stages = ['doggo', 'floofer', 'pupper', 'puppo']
    category_check_list = []
    for index, row in df.iterrows():
        if sum(row[stage] == stage for stage in stages) > 1:
            category_check_list.append(index)
    return category_check_list
check_multiple_categories(df_1)
Check to see if the numerator values are properly calculated.
# Get a list of tuples of the value count keys and values.
num_value_counts = list(zip(df_1['rating_numerator'].value_counts().keys(), df_1['rating_numerator'].value_counts()))
print(num_value_counts)
Create a list of the rating numerators that are uncommon enough to need to be checked.
numerators_to_check = []
for count in num_value_counts:
    if count[1] < 10:
        numerators_to_check.append(count[0])
numerators_to_check.sort()
print(numerators_to_check)
num_indices_to_check = df_1.loc[df_1['rating_numerator'].isin(numerators_to_check)][['text', 'rating_numerator']].index.values
num_indices_to_check
for index in num_indices_to_check:
    print(df_1.iloc[index]['text'])
    print(df_1.iloc[index]['rating_numerator'])
Check to see if the denominator values are properly calculated in a similar fashion.
# Get a list of tuples of the value count keys and values.
denom_value_counts = list(zip(df_1['rating_denominator'].value_counts().keys(), df_1['rating_denominator'].value_counts()))
denom_value_counts
denominators_to_check = []
for count in denom_value_counts:
    if count[1] < 4:
        denominators_to_check.append(count[0])
denominators_to_check.sort()
denominators_to_check
denom_indices_to_check = df_1.loc[df_1['rating_denominator'].isin(denominators_to_check)][['text', 'rating_denominator']].index.values
denom_indices_to_check
for index in denom_indices_to_check:
    print(df_1.iloc[index]['text'])
    print(df_1.iloc[index]['rating_denominator'])
Check for proper name values based on the text values.
df_1['name'].value_counts()[:20]
# Get a list of tuples of the value count keys and values.
name_value_counts = list(zip(df_1['name'].value_counts().keys(), df_1['name'].value_counts()))
# Print the (name, count) pairs five to a row for easier scanning.
for i in range(0, len(name_value_counts), 5):
    print('\t'.join(str(pair) for pair in name_value_counts[i:i + 5]))
Nearly all of the suspicious name values are words from the text that were false positives in whatever name detection algorithm was used. These all begin with lowercase letters, so we can create a list of row indexes to check based on whether the name value begins with a lowercase letter.
names_to_check = []
for index, row in df_1.iterrows():
    if row['name'][0].islower():
        names_to_check.append(index)
# Print the flagged row indexes five to a row.
for i in range(0, len(names_to_check), 5):
    print('\t'.join(str(n) for n in names_to_check[i:i + 5]))
# Print the flagged 'name' values five to a row.
for i in range(0, len(names_to_check), 5):
    print('\t'.join(str(df_1.iloc[n]['name']) for n in names_to_check[i:i + 5]))
text_names_to_check = []
for index, row in df_1.iterrows():
    if row['name'] not in row['text'] and row['name'] != 'None':
        text_names_to_check.append(index)
text_names_to_check
All of the name values that are not 'None' do appear to be present in their respective row's text column.
df_1[:20]
df_3.info()
df_2['p1'].sample(10)
Tidiness Issue 3 and Quality Issue 7
df_3_copy = df_3.rename(index=str, columns={"id": "tweet_id"})
df_3_copy['tweet_id']
master_df = df_1.merge(df_2, on = 'tweet_id')
master_df = master_df.merge(df_3_copy, on = 'tweet_id')
master_df.info()
Quality Issue 3
to_del = check_multiple_categories(master_df)
to_del
master_df = master_df.drop(to_del)
check_multiple_categories(master_df)
Tidiness Issue 1 and Quality Issue 2
master_df['development_stage'] = np.nan
master_df['development_stage'] = master_df['development_stage'].astype('object')
master_df['development_stage']
for index, row in master_df.iterrows():
    # Record the first development stage whose column is set.
    for stage in ['doggo', 'floofer', 'pupper', 'puppo']:
        if row[stage] == stage:
            master_df.at[index, 'development_stage'] = stage
            break
master_df['development_stage'] = master_df['development_stage'].astype('category')
master_df.info()
master_df = master_df.drop(['doggo', 'floofer', 'pupper', 'puppo'], axis = 1)
master_df.info()
Tidiness Issue 2
master_df = master_df.drop(['id_str', 'in_reply_to_status_id_str', 'in_reply_to_user_id_str', 'quoted_status_id_str'], axis = 1)
master_df.info()
Quality Issue 1
# The id columns hold identifiers, not quantities, so store them as objects.
id_columns = ['tweet_id', 'in_reply_to_status_id_x', 'in_reply_to_user_id_x',
              'retweeted_status_id', 'retweeted_status_user_id',
              'in_reply_to_status_id_y', 'in_reply_to_user_id_y', 'quoted_status_id']
master_df[id_columns] = master_df[id_columns].astype('object')
master_df.info()
Quality Issue 8
names_to_change = []
for index, row in master_df.iterrows():
    if row['name'][0].islower():
        names_to_change.append(index)
len(names_to_change)
for index in names_to_change:
    master_df.at[index, 'name'] = 'None'
# Verify that no lowercase non-names remain.
names_to_change = []
for index, row in master_df.iterrows():
    if row['name'][0].islower():
        names_to_change.append(index)
len(names_to_change)
Quality Issue 4, Quality Issue 5, and Quality Issue 6
master_df['rating_numerator'] = master_df['rating_numerator'].astype('float')
for index, row in master_df.iterrows():
    numerator = 0.0
    numerator_end = 0
    numerator_start = 0
    # Find the position of the / in the first fraction in the text.
    for letter_index, letter in enumerate(row['text']):
        if letter == '/' and row['text'][letter_index + 1].isdigit():
            numerator_end = letter_index
            break
    # Walk backwards to find the beginning of the numerator.
    for letter_index2, letter in enumerate(row['text'][numerator_end::-1]):
        if letter.isspace():
            numerator_start = numerator_end - letter_index2
            break
    # Some text values do not have a space before the rating.
    # In these cases, shorten the string until it is just the rating numbers.
    getting_numerator = True
    while getting_numerator:
        try:
            numerator = float(row['text'][numerator_start:numerator_end])
            getting_numerator = False
        except ValueError:
            numerator_start += 1
    master_df.at[index, 'rating_numerator'] = numerator
master_df['rating_numerator'].value_counts()
fractions_to_check = []
for index, row in master_df.iterrows():
    # Count each / but ignore the 3 that belong to each shortened URL.
    num_forward_slashes = row['text'].count('/')
    num_url_slashes = row['text'].count('https://t.co/') * 3
    if num_forward_slashes - num_url_slashes != 1:
        fractions_to_check.append(index)
len(fractions_to_check)
for index in fractions_to_check:
    print(master_df.iloc[index]['text'])
    print(str(master_df.iloc[index]['rating_numerator']) + '/' + str(master_df.iloc[index]['rating_denominator']) + '\n\n')
Quality Issue 9
master_df[['p1', 'p2', 'p3']] = master_df[['p1', 'p2', 'p3']].replace("_", " ", regex = True)
master_df['p1'] = master_df['p1'].str.title()
master_df['p2'] = master_df['p2'].str.title()
master_df['p3'] = master_df['p3'].str.title()
master_df.p1
Of all ratings
master_df['rating_numerator'].mean()
Of all dog ratings
master_df.loc[master_df['p1_dog'] == True]['rating_numerator'].mean()
# Create a column that is rating_numerator / rating_denominator and then get the mean.
master_df['score'] = master_df['rating_numerator'] / master_df['rating_denominator']
Of the scores of all ratings
master_df['score'].mean()
Of the scores of all dog ratings
master_df.loc[master_df['p1_dog'] == True]['score'].mean()
Only consider the ratings that correspond to an actual dog.
is_dog = master_df.loc[master_df['p1_dog'] == True]
is_dog.info()
is_dog['p1'].value_counts()
# Split the breed counts into a top 10 list and all others.
totals = is_dog['p1'].value_counts()
top_10_totals = totals[0:10].copy()  # Copy so adding 'Other' below does not modify a view.
others = totals[10:]
# Also create a bottom 10 list.
bottom_10_totals = totals[-10:]
bottom_10_totals
# Combine the other breeds into a single value for plotting.
top_10_totals.at['Other'] = others.sum()
top_10_totals
plt.pie(top_10_totals, labels = top_10_totals.index)
plt.title('The Most Common Breeds of Dogs on We Rate Dogs')
plt.savefig('Dog Breed Proportions Pieplot.png', dpi=300, bbox_inches = "tight")
plt.show()
breeds = is_dog['p1'].value_counts().keys()
breed_mean_rating = []
breeds
for breed in breeds:
    avg = is_dog.loc[is_dog['p1'] == breed]['rating_numerator'].mean()
    breed_mean_rating.append([breed, avg])
breed_mean_rating.sort(key=lambda x: x[1], reverse=True)
breed_mean_rating
top_10_breeds = breed_mean_rating[0:10]
top_10_breeds
bottom_10_breeds = breed_mean_rating[-10:]
bottom_10_breeds
# Limit the averages to two decimal places for plotting.
for breed in top_10_breeds:
    breed[1] = round(breed[1], 2)
top_10_breeds
breed_names = list(zip(*top_10_breeds))[0]
breed_ratings = list(zip(*top_10_breeds))[1]
plt.bar(breed_names, breed_ratings)
plt.xticks(rotation=90)
plt.xlabel('Dog Breed')
plt.ylabel('Average Rating')
plt.title('Top 10 Average Ratings of Dog Breeds')
plt.savefig('Top 10 Ratings Barplot.png', dpi=300, bbox_inches = "tight")
plt.show()
master_df.to_csv('twitter_archive_master.csv', index=False)