The Statistics of WeRateDogs

Matthew Unrue, 2018

Udacity Data Analyst Nanodegree Project 4


act_report addition:

WeRateDogs (https://twitter.com/dog_rates) is a Twitter account that is well known for its pictures and commentary of dogs. Its popularity is built on the internet's obsession with dogs, a rating system that reflects how good all dogs are with scores consistently above the maximum, and commentary that has spawned a language of its own through observations about the pictures and both snarky and amusing remarks to other user accounts. I've combed through thousands of this account's tweets to find some of the patterns in their postings.

Not all tweets had ratings, and not all ratings were of dogs. Of all ratings, the average numerator is 12.13, and out of all dog ratings, the average numerator is 11.4. While the denominators in the ratings were most often 10, this wasn't always the case. Of all ratings, the average ratio of numerator over denominator is 1.16, while the same for the ratings of dogs was 1.08. The highest numerator overall is 1776, while the lowest is -5.

Dog%20Breed%20Proportions%20Pieplot%20Resized.png

Dog Breed Total Number of Ratings
Golden Retriever 144
Labrador Retriever 99
Pembroke 88
Chihuahua 83
Pug 57
Chow 44
Samoyed 43
Toy Poodle 39
Pomeranian 38
Cocker Spaniel 30
Other 854

Some breeds were definitely more popular than others. Golden Retrievers, Labrador Retrievers, and Pembroke / Welsh Corgis were the most common. Meanwhile, Japanese Spaniels, Clumber Spaniels, Groenendaels, Silky Terriers, Entlebuchers, Scotch Terriers, and Standard Schnauzers were the least common, with only a single rating each. It seems that more popular breeds of dogs are rated more often than less popular breeds.

Top%2010%20Ratings%20Barplot%20Resized.png

Dog Breed Average Numerator Rating
Soft-Coated Wheaten Terrier 25.45
West Highland White Terrier 15.64
Great Pyrenees 14.93
Borzoi 14.44
Labrador Retriever 13.49
Siberian Husky 13.25
Golden Retriever 13.00
Saluki 12.50
Tibetan Mastiff 12.40
Briard 12.33

Out of all breeds, the Soft-Coated Wheaten Terrier, the West Highland White Terrier, and the Great Pyrenees had the highest average numerator rating. Meanwhile, the Japanese Spaniel had the lowest average numerator rating of 5, followed by the Weimaraner at 8.53 and the Basenji at an average of 8.73.



wrangle_report addition

The data for this project was gathered from three different sources. The first dataframe was loaded from the downloaded file 'twitter-archive-enhanced.csv'. The second dataframe was programmatically requested from 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'. The third dataframe was gathered from the .json data accessed from the Twitter API through Tweepy.

The first step in wrangling these datasets was to fix the tidyness issue of the data for each observation, or tweet, being split across multiple dataframes when it would be best for it to all be in a single dataframe. Copies of the dataframes were made and merged based on the 'tweet_id' columns into the 'master_df' dataframe before any other wrangling or cleaning was performed in order to avoid duplicating any fixes.

To avoid unnecessary work, one quality issue was fixed before the remaining tidyness issues. Fourteen outliers were removed because they had multiple values in the development stage categories when it appeared that each should only have one value. These development stage columns were then combined into a single 'development_stage' column and the values from the four stages were recorded here. These four columns, along with four other columns were dropped from the 'master_df' because of their redundancy.

The quality issues were mostly minor changes. All id columns were changed to the object dtype to reflect their non-numeric nature. The name column had ninety-eight values that were not names, and these were changed to 'None' like all others that did not have a name. Finally, the names of dog breeds in the 'p1', 'p2', and 'p3' columns were cleaned to make them more readable by changing the underscores to spaces and consistently capitalizing the words with title casing.

The largest issue was that the numerator and denominator values of the ratings were incorrectly pulled in several cases. It appears that they were algorithmically pulled from the 'text' values, but that this algorithm did not account for the possibility for multiple ratings or fractions in the text field. Additionally, it did not account for the possibility of decimal places, float values, or negative numbers existing as a part of the rating. Each row was iterated over and the 'rating_numerator' and 'rating_denominator' values were recalculated and stored with these points in mind. All rows with text values that contained multiple forward slashes ('/') were checked manually to ensure that the rating values were pulled correctly.

Some minor wrangling was required for the analysis portion of the project as well. A score column was created to reflect the ratio between the 'rating_numerator' and 'rating_denominator' values since the denominator only tended to be ten. Another dataframe, the 'is_dog' dataframe was created to easily do analysis on specifically the rows that pertained to images that were determined to actually contain a dog by the machine learning algorithm applied to the original second dataset. Lastly, the 'master_df' dataframe was stored in the file 'twitter_archive_master.csv', available here.


Original Notebook:

In [1]:
import pandas as pd
import numpy as np
import requests
import matplotlib.pyplot as plt

Gathering

Load the data from the downloaded .csv file into a dataframe.

In [2]:
df_1 = pd.read_csv('twitter-archive-enhanced.csv')
In [3]:
df_1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), object(10)
memory usage: 313.0+ KB

Request the .tsv file and load it into a separate dataframe.

In [4]:
r = requests.get('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv')
In [5]:
# Display a limited amount of the requested data.
print(r.text[:5000])
tweet_id	jpg_url	img_num	p1	p1_conf	p1_dog	p2	p2_conf	p2_dog	p3	p3_conf	p3_dog
666020888022790149	https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg	1	Welsh_springer_spaniel	0.465074	True	collie	0.156665	True	Shetland_sheepdog	0.0614285	True
666029285002620928	https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg	1	redbone	0.506826	True	miniature_pinscher	0.07419169999999999	True	Rhodesian_ridgeback	0.07201	True
666033412701032449	https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg	1	German_shepherd	0.596461	True	malinois	0.13858399999999998	True	bloodhound	0.11619700000000001	True
666044226329800704	https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg	1	Rhodesian_ridgeback	0.408143	True	redbone	0.360687	True	miniature_pinscher	0.222752	True
666049248165822465	https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg	1	miniature_pinscher	0.560311	True	Rottweiler	0.243682	True	Doberman	0.154629	True
666050758794694657	https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg	1	Bernese_mountain_dog	0.651137	True	English_springer	0.263788	True	Greater_Swiss_Mountain_dog	0.0161992	True
666051853826850816	https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg	1	box_turtle	0.9330120000000001	False	mud_turtle	0.04588540000000001	False	terrapin	0.017885299999999996	False
666055525042405380	https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg	1	chow	0.692517	True	Tibetan_mastiff	0.058279399999999995	True	fur_coat	0.0544486	False
666057090499244032	https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg	1	shopping_cart	0.962465	False	shopping_basket	0.014593799999999999	False	golden_retriever	0.00795896	True
666058600524156928	https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg	1	miniature_poodle	0.201493	True	komondor	0.192305	True	soft-coated_wheaten_terrier	0.08208610000000001	True
666063827256086533	https://pbs.twimg.com/media/CT5Vg_wXIAAXfnj.jpg	1	golden_retriever	0.77593	True	Tibetan_mastiff	0.0937178	True	Labrador_retriever	0.07242660000000001	True
666071193221509120	https://pbs.twimg.com/media/CT5cN_3WEAAlOoZ.jpg	1	Gordon_setter	0.503672	True	Yorkshire_terrier	0.174201	True	Pekinese	0.109454	True
666073100786774016	https://pbs.twimg.com/media/CT5d9DZXAAALcwe.jpg	1	Walker_hound	0.260857	True	English_foxhound	0.17538199999999998	True	Ibizan_hound	0.0974705	True
666082916733198337	https://pbs.twimg.com/media/CT5m4VGWEAAtKc8.jpg	1	pug	0.489814	True	bull_mastiff	0.40472199999999997	True	French_bulldog	0.0489595	True
666094000022159362	https://pbs.twimg.com/media/CT5w9gUW4AAsBNN.jpg	1	bloodhound	0.195217	True	German_shepherd	0.0782598	True	malinois	0.07562780000000001	True
666099513787052032	https://pbs.twimg.com/media/CT51-JJUEAA6hV8.jpg	1	Lhasa	0.58233	True	Shih-Tzu	0.166192	True	Dandie_Dinmont	0.0896883	True
666102155909144576	https://pbs.twimg.com/media/CT54YGiWUAEZnoK.jpg	1	English_setter	0.298617	True	Newfoundland	0.149842	True	borzoi	0.133649	True
666104133288665088	https://pbs.twimg.com/media/CT56LSZWoAAlJj2.jpg	1	hen	0.965932	False	cock	0.0339194	False	partridge	5.20658e-05	False
666268910803644416	https://pbs.twimg.com/media/CT8QCd1WEAADXws.jpg	1	desktop_computer	0.086502	False	desk	0.0855474	False	bookcase	0.0794797	False
666273097616637952	https://pbs.twimg.com/media/CT8T1mtUwAA3aqm.jpg	1	Italian_greyhound	0.176053	True	toy_terrier	0.111884	True	basenji	0.111152	True
666287406224695296	https://pbs.twimg.com/media/CT8g3BpUEAAuFjg.jpg	1	Maltese_dog	0.8575309999999999	True	toy_poodle	0.0630638	True	miniature_poodle	0.0255806	True
666293911632134144	https://pbs.twimg.com/media/CT8mx7KW4AEQu8N.jpg	1	three-toed_sloth	0.9146709999999999	False	otter	0.01525	False	great_grey_owl	0.0132072	False
666337882303524864	https://pbs.twimg.com/media/CT9OwFIWEAMuRje.jpg	1	ox	0.41666899999999996	False	Newfoundland	0.278407	True	groenendael	0.10264300000000001	True
666345417576210432	https://pbs.twimg.com/media/CT9Vn7PWoAA_ZCM.jpg	1	golden_retriever	0.8587440000000001	True	Chesapeake_Bay_retriever	0.054786800000000004	True	Labrador_retriever	0.014240899999999999	True
666353288456101888	https://pbs.twimg.com/media/CT9cx0tUEAAhNN_.jpg	1	malamute	0.33687399999999995	True	Siberian_husky	0.147655	True	Eskimo_dog	0.09341239999999999	True
666362758909284353	https://pbs.twimg.com/media/CT9lXGsUcAAyUFt.jpg	1	guinea_pig	0.9964959999999999	False	skunk	0.00240245	False	hamster	0.00046086300000000005	False
666373753744588802	https://pbs.twimg.com/media/CT9vZEYWUAAlZ05.jpg	1	soft-coated_wheaten_terrier	0.326467	True	Afghan_hound	0.25955100000000003	True	briard	0.20680300000000001	True
666396247373291520	https://pbs.twimg.com/media/CT-D2ZHWIAA3gK1.jpg	1	Chihuahua	0.978108	True	toy_terrier	0.00939697	True	papillon	0.00457681	True
666407126856765440	https://pbs.twimg.com/media/CT-NvwmW4AAugGZ.jpg	1	black-and-tan_coonhound	0.529139	True	bloodhound	0.24422	True	flat-coated_retriever	0.17381	True
666411507551481857	https://pbs.twimg.com/media/CT-RugiWIAELEaq.jpg	1	coho	0.40464	False	barracouta	0.271485	False	gar	0.189945	False
666418789513326592	https://pbs.twimg.com/media/CT-YWb7U8AA7QnN.jpg	1	toy_terri
In [6]:
file = open('image_predictions.tsv', 'w')
file.write(r.text)
file.close()
In [7]:
df_2 = pd.read_csv('image_predictions.tsv', delim_whitespace = True)
In [8]:
df_2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB

Load the .json data for each tweet gathered with Tweepy into a separate dataframe.

In [9]:
'''
import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer

# Query Twitter API for each tweet in the Twitter archive and save JSON in a text file
# These are hidden to comply with Twitter's API terms and conditions
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)

# NOTE TO STUDENT WITH MOBILE VERIFICATION ISSUES:
# df_1 is a DataFrame with the twitter_archive_enhanced.csv file. You may have to
# change line 17 to match the name of your DataFrame with twitter_archive_enhanced.csv
# NOTE TO REVIEWER: this student had mobile verification issues so the following
# Twitter API code was sent to this student from a Udacity instructor
# Tweet IDs for which to gather additional data via Twitter's API
tweet_ids = df_1.tweet_id.values
len(tweet_ids)

# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError as e:
            print("Fail")
            fails_dict[tweet_id] = e
            pass
end = timer()
print(end - start)
print(fails_dict)
'''
Out[9]:
'\nimport tweepy\nfrom tweepy import OAuthHandler\nimport json\nfrom timeit import default_timer as timer\n\n# Query Twitter API for each tweet in the Twitter archive and save JSON in a text file\n# These are hidden to comply with Twitter\'s API terms and conditions\nconsumer_key = \'HIDDEN\'\nconsumer_secret = \'HIDDEN\'\naccess_token = \'HIDDEN\'\naccess_secret = \'HIDDEN\'\n\nauth = OAuthHandler(consumer_key, consumer_secret)\nauth.set_access_token(access_token, access_secret)\n\napi = tweepy.API(auth, wait_on_rate_limit=True)\n\n# NOTE TO STUDENT WITH MOBILE VERIFICATION ISSUES:\n# df_1 is a DataFrame with the twitter_archive_enhanced.csv file. You may have to\n# change line 17 to match the name of your DataFrame with twitter_archive_enhanced.csv\n# NOTE TO REVIEWER: this student had mobile verification issues so the following\n# Twitter API code was sent to this student from a Udacity instructor\n# Tweet IDs for which to gather additional data via Twitter\'s API\ntweet_ids = df_1.tweet_id.values\nlen(tweet_ids)\n\n# Query Twitter\'s API for JSON data for each tweet ID in the Twitter archive\ncount = 0\nfails_dict = {}\nstart = timer()\n# Save each tweet\'s returned JSON as a new line in a .txt file\nwith open(\'tweet_json.txt\', \'w\') as outfile:\n    # This loop will likely take 20-30 minutes to run because of Twitter\'s rate limit\n    for tweet_id in tweet_ids:\n        count += 1\n        print(str(count) + ": " + str(tweet_id))\n        try:\n            tweet = api.get_status(tweet_id, tweet_mode=\'extended\')\n            print("Success")\n            json.dump(tweet._json, outfile)\n            outfile.write(\'\n\')\n        except tweepy.TweepError as e:\n            print("Fail")\n            fails_dict[tweet_id] = e\n            pass\nend = timer()\nprint(end - start)\nprint(fails_dict)\n'
In [10]:
df_3 = pd.read_json('tweet-json.txt', lines = True)
In [11]:
df_3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 31 columns):
created_at                       2354 non-null datetime64[ns, UTC]
id                               2354 non-null int64
id_str                           2354 non-null int64
full_text                        2354 non-null object
truncated                        2354 non-null bool
display_text_range               2354 non-null object
entities                         2354 non-null object
extended_entities                2073 non-null object
source                           2354 non-null object
in_reply_to_status_id            78 non-null float64
in_reply_to_status_id_str        78 non-null float64
in_reply_to_user_id              78 non-null float64
in_reply_to_user_id_str          78 non-null float64
in_reply_to_screen_name          78 non-null object
user                             2354 non-null object
geo                              0 non-null float64
coordinates                      0 non-null float64
place                            1 non-null object
contributors                     0 non-null float64
is_quote_status                  2354 non-null bool
retweet_count                    2354 non-null int64
favorite_count                   2354 non-null int64
favorited                        2354 non-null bool
retweeted                        2354 non-null bool
possibly_sensitive               2211 non-null float64
possibly_sensitive_appealable    2211 non-null float64
lang                             2354 non-null object
retweeted_status                 179 non-null object
quoted_status_id                 29 non-null float64
quoted_status_id_str             29 non-null float64
quoted_status                    28 non-null object
dtypes: bool(4), datetime64[ns, UTC](1), float64(11), int64(4), object(11)
memory usage: 505.9+ KB
In [ ]:
 

Assessing

Quality 1: Consistency
The id columns in all three dataframes have the incorrect dtype. They are all floats when they should be strings, as ids should never be used in mathematical calculations.

df_1['tweet_id', 'in_reply_to_status_id' 'in_reply_to_user_id' 'retweeted_status_id' 'retweeted_status_user_id']

df_2['id']

df_3['id', 'id_str', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'quoted_status_id', 'quoted_status_id_str']

In [12]:
df_1[['doggo', 'floofer', 'pupper', 'puppo']][:20]
Out[12]:
doggo floofer pupper puppo
0 None None None None
1 None None None None
2 None None None None
3 None None None None
4 None None None None
5 None None None None
6 None None None None
7 None None None None
8 None None None None
9 doggo None None None
10 None None None None
11 None None None None
12 None None None puppo
13 None None None None
14 None None None puppo
15 None None None None
16 None None None None
17 None None None None
18 None None None None
19 None None None None

Quality 2: Consistency
These dog development category columns have 'None' when NaN should be used.

See if the 'doggo', 'floofer', 'pupper', and 'puppo' variables are set correctly based on whether or not they appear in the 'text' and then go over the times they do not appear to determine if the variables are incorrect.

In [13]:
check_categories = []

for index, row in df_1.iterrows():
    appending = [index]
    needs_appended = False
    
    if row['doggo'] == 'doggo':
        if ('doggo') not in row['text'].lower():
            print(str(index) + '\tdoggo: Is not in Text.')
            appending.append('doggo')
            needs_appended = True
    if row['floofer'] == 'floofer':
        if ('floofer') not in row['text'].lower():
            print(str(index) + '\tfloofer: Is not in Text.')
            appending.append('floofer')
            needs_appended = True
    if row['pupper'] == 'pupper':
        if ('pupper') not in row['text'].lower():
            print(str(index) + '\tpupper: Is not in Text.')
            appending.append('pupper')
            needs_appended = True
    if row['puppo'] == 'puppo':
        if ('puppo') not in row['text'].lower():
            print(str(index) + '\tpuppo: Is not in Text.')
            appending.append('puppo')
            needs_appended = True
    
    if needs_appended:
        check_categories.append(appending)
        
check_categories
Out[13]:
[]

All of these values appear to be correctly set, so these will be left as they are.

Find if multiple dog categories appear in any row.

In [14]:
def check_multiple_categories(df):
    category_check_list = []
    for index, row in df.iterrows():
        count = 0

        if row['doggo'] == 'doggo':
            count += 1
        if row['floofer'] == 'floofer':
            count += 1
        if row['pupper'] == 'pupper':
            count += 1
        if row['puppo'] == 'puppo':
            count += 1
        if count > 1:
            category_check_list.append(index)

    return category_check_list
In [15]:
check_multiple_categories(df_1)
Out[15]:
[191, 200, 460, 531, 565, 575, 705, 733, 778, 822, 889, 956, 1063, 1113]

Quality 3: Validity
It appears that only one of the dog development categories should exist per row. These 14 rows have multiple values, and appear to be outliers or incorrect. As they are a very small fraction of the rows and we cannot know which category is correct, the simpelst fix is to drop these rows.

Tidyness 1: Each variable forms a column.
The 4 dog development categories should be combined to make a single categorical column.

Check to see if the numerator values are properly calculated.

In [16]:
# Get a list of tuples of the value count keys and values.
num_value_counts = list(zip(df_1['rating_numerator'].value_counts().keys(), df_1['rating_numerator'].value_counts()))
print(num_value_counts)
[(12, 558), (11, 464), (10, 461), (13, 351), (9, 158), (8, 102), (7, 55), (14, 54), (5, 37), (6, 32), (3, 19), (4, 17), (1, 9), (2, 9), (420, 2), (0, 2), (15, 2), (75, 2), (80, 1), (20, 1), (24, 1), (26, 1), (44, 1), (50, 1), (60, 1), (165, 1), (84, 1), (88, 1), (144, 1), (182, 1), (143, 1), (666, 1), (960, 1), (1776, 1), (17, 1), (27, 1), (45, 1), (99, 1), (121, 1), (204, 1)]

Create a list of the rating numerators that are uncommon enough to need to be checked.

In [17]:
numerators_to_check = []
for count in num_value_counts:
    if count[1] < 10:
        numerators_to_check.append(count[0])
        
numerators_to_check.sort()
print(numerators_to_check)
[0, 1, 2, 15, 17, 20, 24, 26, 27, 44, 45, 50, 60, 75, 80, 84, 88, 99, 121, 143, 144, 165, 182, 204, 420, 666, 960, 1776]
In [18]:
num_indexs_to_check = df_1.loc[df_1['rating_numerator'].isin(numerators_to_check)][['text', 'rating_numerator']].index.values
num_indexs_to_check
Out[18]:
array([  55,  188,  189,  285,  290,  291,  313,  315,  340,  433,  516,
        605,  695,  763,  902,  979, 1016, 1120, 1202, 1228, 1254, 1274,
       1351, 1433, 1446, 1634, 1635, 1663, 1712, 1761, 1764, 1779, 1843,
       1869, 1920, 1940, 2038, 2074, 2079, 2091, 2237, 2246, 2261, 2310,
       2326, 2335, 2338, 2349], dtype=int64)
In [19]:
for index in num_indexs_to_check:
    print(df_1.iloc[index]['text'])
    print(df_1.iloc[index]['rating_numerator'])
@roushfenway These are good dogs but 17/10 is an emotional impulse rating. More like 13/10s
17
@dhmontgomery We also gave snoop dogg a 420/10 but I think that predated your research
420
@s8n You tried very hard to portray this good boy as not so good, but you have ultimately failed. His goodness shines through. 666/10
666
RT @KibaDva: I collected all the good dogs!! 15/10 @dog_rates #GoodDogs https://t.co/6UCGFczlOI
15
@markhoppus 182/10
182
@bragg6of8 @Andy_Pace_ we are still looking for the first 15/10
15
@jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho
960
When you're so blinded by your systematic plagiarism that you forget what day it is. 0/10 https://t.co/YbEJPkg4Ag
0
RT @dog_rates: This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wu…
75
The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd
84
Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. 
Keep Sam smiling by clicking and sharing this link:
https://t.co/98tB8y7y7t https://t.co/LouL5vdvxx
24
RT @dog_rates: Not familiar with this breed. No tail (weird). Only 2 legs. Doesn't bark. Surprisingly quick. Shits eggs. 1/10 https://t.co/…
1
This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS
75
This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back https://t.co/QFaUiIHxHq
27
Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE
165
This is Atticus. He's quite simply America af. 1776/10 https://t.co/GRXwMxLBkh
1776
PUPDATE: can't see any. Even if I could, I couldn't reach them to pet. 0/10 much disappointment https://t.co/c7WXaB2nqX
0
Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv
204
This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq
50
Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody https://t.co/mpvaVxKmc1
99
Here's a brigade of puppers. All look very prepared for whatever happens next. 80/80 https://t.co/0eb7R1Om12
80
From left to right:
Cletus, Jerome, Alejandro, Burp, &amp; Titson
None know where camera is. 45/50 would hug all at once https://t.co/sedre1ivTK
45
Here is a whole flock of puppers.  60/50 I'll take the lot https://t.co/9dpcw6MdWa
60
Happy Wednesday here's a bucket of pups. 44/40 would pet all at once https://t.co/HppvrYuamZ
44
After reading the comments I may have overestimated this pup. Downgraded to a 1/10. Please forgive me
1
Two sneaky puppers were not initially seen, moving the rating to 143/130. Please forgive us. Thank you https://t.co/kRK51Y5ac3
143
Someone help the girl is being mugged. Several are distracting her while two steal her shoes. Clever puppers 121/110 https://t.co/1zfnTJLt55
121
I'm aware that I could've said 20/16, but here at WeRateDogs we are very professional. An inconsistent rating scale is simply irresponsible
20
Here we have uncovered an entire battalion of holiday puppers. Average of 11.26/10 https://t.co/eNm2S6p9BD
26
Exotic pup here. Tail long af. Throat looks swollen. Might breathe fire. Exceptionally unfluffy 2/10 would still pet https://t.co/a8SqCaSo2r
2
This is Crystal. She's a shitty fireman. No sense of urgency. People could be dying Crystal. 2/10 just irresponsible https://t.co/rtMtjSl9pz
2
IT'S PUPPERGEDDON. Total of 144/120 ...I think https://t.co/ZanVtAtvIq
144
Here we have an entire platoon of puppers. Total score: 88/80 would pet all at once https://t.co/y93p6FLvVw
88
What kind of person sends in a picture without a dog in it? 1/10 just because that's a nice table https://t.co/RDXCfk8hK0
1
This is Henry. He's a shit dog. Short pointy ears. Leaves trail of pee. Not fluffy. Doesn't come when called. 2/10 https://t.co/Pu9RhfHDEQ
2
The millennials have spoken and we've decided to immediately demote to a 1/10. Thank you
1
After 22 minutes of careful deliberation this dog is being demoted to a 1/10. The longer you look at him the more terrifying he becomes
1
After so many requests... here you go.

Good dogg. 420/10 https://t.co/yfAAo1gdeY
420
Scary dog here. Too many legs. Extra tail. Not soft, let alone fluffy. Won't bark. Moves sideways. Has weapon. 2/10 https://t.co/XOPXCSXiUT
2
Flamboyant pup here. Probably poisonous. Won't eat kibble. Doesn't bark. Slow af. Petting doesn't look fun. 1/10 https://t.co/jxukeh2BeO
1
This lil pup is Oliver. Hops around. Has wings but doesn't fly (lame). Annoying chirp. Won't catch tennis balls 2/10 https://t.co/DnhUw0aBM2
2
This is Tedrick. He lives on the edge. Needs someone to hit the gas tho. Other than that he's a baller. 10&amp;2/10 https://t.co/LvP1TTYSCN
2
Never seen dog like this. Breathes heavy. Tilts head in a pattern. No bark. Shitty at fetch. Not even cordless. 1/10 https://t.co/i9iSGNn3fx
1
Unfamiliar with this breed. Ears pointy af. Won't let go of seashell. Won't eat kibble. Not very fast. Bad dog 2/10 https://t.co/EIn5kElY1S
2
This is quite the dog. Gets really excited when not in water. Not very soft tho. Bad at fetch. Can't do tricks. 2/10 https://t.co/aMCTNWO94t
2
This is an Albanian 3 1/2 legged  Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv
1
Not familiar with this breed. No tail (weird). Only 2 legs. Doesn't bark. Surprisingly quick. Shits eggs. 1/10 https://t.co/Asgdc6kuLX
1
This is an odd dog. Hard on the outside but loving on the inside. Petting still fun. Doesn't play catch well. 2/10 https://t.co/v5A4vzSDdc
2

Quality 4: Accuracy
Some of these rating_numerator values are set to the numerator of the first fraction found in the text and not the actual rating fraction. These will be most easily dealt with by looking over and changing them manually.

Quality 5: Accuracy
Additionally, some of these ratings contain float values, like 9.75 specifically, so the rating numerator dtype should be updated to allow this, and the numerators should be updated from the text.

Check to see if the denominator values are properly calculated in a similar fashion.

In [20]:
# Get a list of tuples of the value count keys and values.
denom_value_counts = list(zip(df_1['rating_denominator'].value_counts().keys(), df_1['rating_denominator'].value_counts()))
denom_value_counts
Out[20]:
[(10, 2333),
 (11, 3),
 (50, 3),
 (80, 2),
 (20, 2),
 (2, 1),
 (16, 1),
 (40, 1),
 (70, 1),
 (15, 1),
 (90, 1),
 (110, 1),
 (120, 1),
 (130, 1),
 (150, 1),
 (170, 1),
 (7, 1),
 (0, 1)]
In [21]:
denominators_to_check = []
for count in denom_value_counts:
    if count[1] < 4:
        denominators_to_check.append(count[0])
        
denominators_to_check.sort()
denominators_to_check
Out[21]:
[0, 2, 7, 11, 15, 16, 20, 40, 50, 70, 80, 90, 110, 120, 130, 150, 170]
In [22]:
denom_indexs_to_check = df_1.loc[df_1['rating_denominator'].isin(denominators_to_check)][['text', 'rating_denominator']].index.values
denom_indexs_to_check
Out[22]:
array([ 313,  342,  433,  516,  784,  902, 1068, 1120, 1165, 1202, 1228,
       1254, 1274, 1351, 1433, 1598, 1634, 1635, 1662, 1663, 1779, 1843,
       2335], dtype=int64)
In [23]:
for index in denom_indexs_to_check:
    print(df_1.iloc[index]['text'])
    print(df_1.iloc[index]['rating_denominator'])
@jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho
0
@docmisterio account started on 11/15/15
15
The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd
70
Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. 
Keep Sam smiling by clicking and sharing this link:
https://t.co/98tB8y7y7t https://t.co/LouL5vdvxx
7
RT @dog_rates: After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https:/…
11
Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE
150
After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ
11
Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv
170
Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a
20
This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq
50
Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody https://t.co/mpvaVxKmc1
90
Here's a brigade of puppers. All look very prepared for whatever happens next. 80/80 https://t.co/0eb7R1Om12
80
From left to right:
Cletus, Jerome, Alejandro, Burp, &amp; Titson
None know where camera is. 45/50 would hug all at once https://t.co/sedre1ivTK
50
Here is a whole flock of puppers.  60/50 I'll take the lot https://t.co/9dpcw6MdWa
50
Happy Wednesday here's a bucket of pups. 44/40 would pet all at once https://t.co/HppvrYuamZ
40
Yes I do realize a rating of 4/20 would've been fitting. However, it would be unjust to give these cooperative pups that low of a rating
20
Two sneaky puppers were not initially seen, moving the rating to 143/130. Please forgive us. Thank you https://t.co/kRK51Y5ac3
130
Someone help the girl is being mugged. Several are distracting her while two steal her shoes. Clever puppers 121/110 https://t.co/1zfnTJLt55
110
This is Darrel. He just robbed a 7/11 and is in a high speed police chase. Was just spotted by the helicopter 10/10 https://t.co/7EsP8LmSp5
11
I'm aware that I could've said 20/16, but here at WeRateDogs we are very professional. An inconsistent rating scale is simply irresponsible
16
IT'S PUPPERGEDDON. Total of 144/120 ...I think https://t.co/ZanVtAtvIq
120
Here we have an entire platoon of puppers. Total score: 88/80 would pet all at once https://t.co/y93p6FLvVw
80
This is an Albanian 3 1/2 legged  Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv
2

Quality 6: Accuracy
Some of these rating_denominator values are also set to the denominator of the first fraction found in the text and not the actual rating fraction. These will be most easily dealt with by looking over and changing them manually.

Quality 7: Validity/Accuracy
After looking over many of these tweets with odd or at least uncommon rating values, it's become clear that some of these tweets are replies or retweets that do not actually have a real dog, or a real rating. These all need to be removed to ensure that they do not impact any analysis.

Check for proper name values based on the text values.

In [24]:
df_1['name'].value_counts()[:20]
Out[24]:
None       745
a           55
Charlie     12
Cooper      11
Lucy        11
Oliver      11
Lola        10
Penny       10
Tucker      10
Winston      9
Bo           9
the          8
Sadie        8
Bailey       7
an           7
Toby         7
Daisy        7
Buddy        7
Bella        6
Jax          6
Name: name, dtype: int64
In [25]:
# Get a list of tuples of the value count keys and values.
name_value_counts = list(zip(df_1['name'].value_counts().keys(), df_1['name'].value_counts()))

count = 0
item = ""
for i, name in enumerate(name_value_counts):
    item = item + str(name)
    if count < 4:
        count += 1
        if len(str(name)) < 16:
            item = item + "\t\t"
        else:
            item = item + "\t"
    elif count == 4:
        print(item)
        count = 0
        item = ""
print(item)
('None', 745)		('a', 55)		('Charlie', 12)		('Cooper', 11)		('Lucy', 11)
('Oliver', 11)		('Lola', 10)		('Penny', 10)		('Tucker', 10)		('Winston', 9)
('Bo', 9)		('the', 8)		('Sadie', 8)		('Bailey', 7)		('an', 7)
('Toby', 7)		('Daisy', 7)		('Buddy', 7)		('Bella', 6)		('Jax', 6)
('Koda', 6)		('Dave', 6)		('Milo', 6)		('Stanley', 6)		('Leo', 6)
('Oscar', 6)		('Scout', 6)		('Rusty', 6)		('Jack', 6)		('Louis', 5)
('Sammy', 5)		('Larry', 5)		('Phil', 5)		('Oakley', 5)		('very', 5)
('Finn', 5)		('Chester', 5)		('Sunny', 5)		('Bentley', 5)		('George', 5)
('Alfie', 5)		('Gus', 5)		('Loki', 4)		('Dexter', 4)		('Boomer', 4)
('Brody', 4)		('Bruce', 4)		('Reginald', 4)		('Jerry', 4)		('Shadow', 4)
('Carl', 4)		('Cassie', 4)		('Moose', 4)		('Scooter', 4)		('Beau', 4)
('Luna', 4)		('Maggie', 4)		('Maximus', 4)		('Jeffrey', 4)		('Chip', 4)
('Bear', 4)		('Clarence', 4)		('Gerald', 4)		('Ruby', 4)		('Riley', 4)
('Maddie', 4)		('Gary', 4)		('just', 4)		('Walter', 4)		('Reggie', 4)
('Derek', 4)		('Duke', 4)		('quite', 4)		('Sophie', 4)		('Clark', 4)
('Hank', 4)		('one', 4)		('Sampson', 4)		('Archie', 4)		('Winnie', 4)
('Klevin', 3)		('Coco', 3)		('Ted', 3)		('Lorenzo', 3)		('Calvin', 3)
('Peaches', 3)		('Gizmo', 3)		('Max', 3)		('Sebastian', 3)	('Zoey', 3)
('Rory', 3)		('Rosie', 3)		('Paisley', 3)		('Nala', 3)		('Earl', 3)
('Vincent', 3)		('Olive', 3)		('Frankie', 3)		('Louie', 3)		('Kyle', 3)
('Mia', 3)		('Arnie', 3)		('Lily', 3)		('Wallace', 3)		('Samson', 3)
('Steven', 3)		('Waffles', 3)		('Otis', 3)		('Ellie', 3)		('Zeke', 3)
('Malcolm', 3)		('Jimothy', 3)		('Colby', 3)		('Wilson', 3)		('Doug', 3)
('Reese', 3)		('Wyatt', 3)		('Thumas', 2)		('Titan', 2)		('Misty', 2)
('Alice', 2)		('Kilo', 2)		('Remington', 2)	('Olivia', 2)		('Pipsy', 2)
('Jeph', 2)		('Astrid', 2)		('Kreg', 2)		('Pippa', 2)		('Opal', 2)
('Bernie', 2)		('Finley', 2)		('Ollie', 2)		('Oliviér', 2)		('Aspen', 2)
('Patrick', 2)		('Roosevelt', 2)	('Paull', 2)		('Chompsky', 2)		('Juno', 2)
('Phred', 2)		('Watson', 2)		('Wally', 2)		('Quinn', 2)		('Kenny', 2)
('Calbert', 2)		('Ken', 2)		('Sansa', 2)		('Happy', 2)		('Theodore', 2)
('Lincoln', 2)		('Piper', 2)		('Elliot', 2)		('Cupcake', 2)		('Anakin', 2)
('Gromit', 2)		('Benedict', 2)		('Chelsea', 2)		('Rizzy', 2)		('Fred', 2)
('Baxter', 2)		('Lennon', 2)		('Albus', 2)		('Dakota', 2)		('Terry', 2)
('actually', 2)		('Charles', 2)		('Herald', 2)		('Ozzy', 2)		('Canela', 2)
('Hunter', 2)		('Klein', 2)		('Cali', 2)		('Timison', 2)		('Trooper', 2)
('Harold', 2)		('Sandy', 2)		('Lilly', 2)		('Jackson', 2)		('Kenneth', 2)
('Axel', 2)		('Atlas', 2)		('Carly', 2)		('Keurig', 2)		('Linda', 2)
('Atticus', 2)		('mad', 2)		('Nelly', 2)		('Meyer', 2)		('Gabe', 2)
('Layla', 2)		('Lou', 2)		('Rocky', 2)		('Raymond', 2)		('Kyro', 2)
('Solomon', 2)		('Rubio', 2)		('Jamesy', 2)		('Albert', 2)		('Leela', 2)
('Cash', 2)		('not', 2)		('Curtis', 2)		('Chet', 2)		('Neptune', 2)
('Ash', 2)		('Hobbes', 2)		('Bubbles', 2)		('Jesse', 2)		('Seamus', 2)
('Herm', 2)		('Jiminy', 2)		('Hammond', 2)		('Brad', 2)		('Dash', 2)
('Sarge', 2)		('Indie', 2)		('Benji', 2)		('Cody', 2)		('Herschel', 2)
('Fizz', 2)		('Dawn', 2)		('Flávio', 2)		('CeCe', 2)		('Django', 2)
('Betty', 2)		('Levi', 2)		('Rocco', 2)		('Blitz', 2)		('Oshie', 2)
('Bob', 2)		('Percy', 2)		('Chipson', 2)		('Phineas', 2)		('Philbert', 2)
('Belle', 2)		('getting', 2)		('Butter', 2)		('Emmy', 2)		('Stephan', 2)
('Penelope', 2)		('Marley', 2)		('Romeo', 2)		('Bungalo', 2)		('Eve', 2)
('Churlie', 2)		('Mister', 2)		('Kevin', 2)		('Sierra', 2)		('Stubert', 2)
('Sam', 2)		('Bell', 2)		('Hurley', 2)		('Tyrone', 2)		('Doc', 2)
('Tyr', 2)		('Mattie', 2)		('Maxaroni', 2)		('Davey', 2)		('Kreggory', 2)
('Baloo', 2)		('Griffin', 2)		('Logan', 2)		('Balto', 2)		('Coops', 2)
('Gidget', 2)		('Moreton', 2)		('Yogi', 2)		('Harper', 2)		('Bisquick', 2)
('Odie', 2)		('Sugar', 2)		('Crystal', 2)		('Franklin', 2)		('Fiona', 2)
('Panda', 2)		('Frank', 2)		('Chuckles', 2)		('Shaggy', 2)		('Keith', 2)
('Pickles', 2)		('Smokey', 2)		('Luca', 2)		('Lenny', 2)		('Abby', 2)
('Eli', 2)		('Ava', 2)		('Kirby', 2)		('Nollie', 2)		('Jimison', 2)
('Hercules', 2)		('Terrance', 2)		('Rufus', 2)		('Moe', 2)		('Pablo', 2)
('Walker', 1)		('Julius', 1)		('Rodney', 1)		('Benny', 1)		('Sojourner', 1)
('Jessiga', 1)		('Griswold', 1)		('Ron', 1)		('Antony', 1)		('Comet', 1)
('old', 1)		('Cheesy', 1)		('Scott', 1)		('Brownie', 1)		('Chadrick', 1)
('Mabel', 1)		('Mutt', 1)		('Nida', 1)		('Ulysses', 1)		('Cal', 1)
('Wesley', 1)		('Lance', 1)		('Willy', 1)		('Brian', 1)		('Cedrick', 1)
('Olaf', 1)		('Brockly', 1)		('Huck', 1)		('officially', 1)	('Grizzwald', 1)
('Poppy', 1)		('Shawwn', 1)		('Evy', 1)		('my', 1)		('Karma', 1)
('Thor', 1)		('Aja', 1)		('Fynn', 1)		('General', 1)		('Sky', 1)
('Brutus', 1)		('Hall', 1)		('Jersey', 1)		('Danny', 1)		('Raphael', 1)
('Edmund', 1)		('Ralphson', 1)		('Petrick', 1)		('Dug', 1)		('Trevith', 1)
('Creg', 1)		('Joshwa', 1)		('Kathmandu', 1)	('Genevieve', 1)	('Steve', 1)
('Tedrick', 1)		('Aqua', 1)		('Pubert', 1)		('Crawford', 1)		('Jockson', 1)
('Chesney', 1)		('Sprinkles', 1)	('Dixie', 1)		('Durg', 1)		('Timber', 1)
('Dobby', 1)		('Ivar', 1)		('Jim', 1)		('Dot', 1)		('Buckley', 1)
('Mojo', 1)		('Jazzy', 1)		('Enchilada', 1)	('Timmy', 1)		('Wishes', 1)
('Livvie', 1)		('Rambo', 1)		('Jeremy', 1)		('Jett', 1)		('Flash', 1)
('Fabio', 1)		('Bode', 1)		('Pluto', 1)		('Ralpher', 1)		('Kollin', 1)
('Rupert', 1)		('Dwight', 1)		('Tove', 1)		('Gustaf', 1)		('Pumpkin', 1)
('Lilli', 1)		('Charleson', 1)	('Ziva', 1)		('Dido', 1)		('Odin', 1)
('Amélie', 1)		('Geoff', 1)		('Dutch', 1)		('Jeb', 1)		('Gerbald', 1)
('Bobb', 1)		('Randall', 1)		('Sephie', 1)		('Vinnie', 1)		('Clyde', 1)
('Sage', 1)		('Jordy', 1)		('this', 1)		('Covach', 1)		('Marty', 1)
('Banjo', 1)		('Swagger', 1)		('Willem', 1)		('Mason', 1)		('Orion', 1)
('Tater', 1)		('River', 1)		('Gabby', 1)		('Chaz', 1)		('infuriating', 1)
('Horace', 1)		('light', 1)		('Tedders', 1)		('Chubbs', 1)		('Stella', 1)
('Pippin', 1)		('Crumpet', 1)		('Shikha', 1)		('Longfellow', 1)	('Kallie', 1)
('Grizz', 1)		('DonDon', 1)		('Dewey', 1)		('Ralphus', 1)		('Shiloh', 1)
('Fwed', 1)		('Amy', 1)		('Major', 1)		('Chuck', 1)		('Anna', 1)
('Bodie', 1)		('Marlee', 1)		('Newt', 1)		('Pawnd', 1)		('Einstein', 1)
('Miley', 1)		('Bonaparte', 1)	('Andy', 1)		('Arya', 1)		('Theo', 1)
('Storkson', 1)		('Florence', 1)		('Bradlay', 1)		('Ole', 1)		('Devón', 1)
('Venti', 1)		('Glenn', 1)		('Zooey', 1)		('life', 1)		('Kane', 1)
('Divine', 1)		('Bertson', 1)		('Lizzie', 1)		('Oreo', 1)		('Eevee', 1)
('Godzilla', 1)		('Chesterson', 1)	('Spark', 1)		('Mookie', 1)		('Spencer', 1)
('Kulet', 1)		('Tito', 1)		('Rufio', 1)		('Pinot', 1)		('Wiggles', 1)
('William', 1)		('Winifred', 1)		('Siba', 1)		('Mingus', 1)		('Dale', 1)
('by', 1)		('Samsom', 1)		('Mona', 1)		('Halo', 1)		('Stephanus', 1)
('Ember', 1)		('Diogi', 1)		('Pherb', 1)		('Bilbo', 1)		('Strudel', 1)
('Ferg', 1)		('Snicku', 1)		('Corey', 1)		('Clifford', 1)		('Laika', 1)
('Eleanor', 1)		('Lucky', 1)		('Autumn', 1)		('Shadoe', 1)		('Pete', 1)
('Grizzie', 1)		('Reptar', 1)		('Linus', 1)		('Sweet', 1)		('Beebop', 1)
('Sully', 1)		('Alejandro', 1)	('Jarvis', 1)		('Tino', 1)		('Berkeley', 1)
('Superpup', 1)		('Lenox', 1)		('Kramer', 1)		('Brat', 1)		('Koko', 1)
('Rilo', 1)		('Blipson', 1)		('Harry', 1)		('Gòrdón', 1)		('Ace', 1)
('Link', 1)		('Tiger', 1)		('Fillup', 1)		('Bauer', 1)		('Alexander', 1)
('Jarod', 1)		('Colin', 1)		('Sailor', 1)		('Adele', 1)		('Farfle', 1)
('Lipton', 1)		('Ito', 1)		('Derby', 1)		('Ester', 1)		('Lacy', 1)
('Ralphie', 1)		('Jebberson', 1)	('Harnold', 1)		('Amber', 1)		('Philippe', 1)
('Cupid', 1)		('Torque', 1)		('Vince', 1)		('Burt', 1)		('Herb', 1)
('Darby', 1)		('Clybe', 1)		('Sundance', 1)		('Kota', 1)		('Vinscent', 1)
('Clarkus', 1)		('Lambeau', 1)		('Remy', 1)		('Gert', 1)		('Emmie', 1)
('Brandi', 1)		('Christoper', 1)	('Jessifer', 1)		('Barry', 1)		('Michelangelope', 1)
('Rudy', 1)		('Bluebert', 1)		('Hanz', 1)		('Ridley', 1)		('Nigel', 1)
('Combo', 1)		('unacceptable', 1)	('Tilly', 1)		('Kona', 1)		('Simba', 1)
('Blu', 1)		('such', 1)		('Darla', 1)		('Tayzie', 1)		('Stormy', 1)
('Ben', 1)		('Caryl', 1)		('Kloey', 1)		('Terrenth', 1)		('Mairi', 1)
('Mike', 1)		('Kirk', 1)		('Pepper', 1)		('Beckham', 1)		('Monster', 1)
('Strider', 1)		('Arlen', 1)		('Bruiser', 1)		('Reagan', 1)		('Kobe', 1)
('Rhino', 1)		('Godi', 1)		('Cannon', 1)		('Hermione', 1)		('Mitch', 1)
('Snoop', 1)		('Julio', 1)		('Bobble', 1)		('Dallas', 1)		('Beemo', 1)
('Champ', 1)		('Meatball', 1)		('Bobby', 1)		('Jiminus', 1)		('Trip', 1)
('Huxley', 1)		('Asher', 1)		('Ozzie', 1)		('Mya', 1)		('Roscoe', 1)
('Yukon', 1)		('Taco', 1)		('Ronnie', 1)		('Vixen', 1)		('Mimosa', 1)
('Kial', 1)		('Stuart', 1)		('Bert', 1)		('Jomathan', 1)		('Chevy', 1)
('Perry', 1)		('Ginger', 1)		('Noah', 1)		('Tyrus', 1)		('Gunner', 1)
('Shelby', 1)		('Maxwell', 1)		('Jed', 1)		('Richie', 1)		('Rose', 1)
('Rodman', 1)		('Augie', 1)		('Emma', 1)		('Skittle', 1)		('Laela', 1)
('Hector', 1)		('Skye', 1)		('Saydee', 1)		('space', 1)		('Apollo', 1)
('Dudley', 1)		('Boots', 1)		('Ambrose', 1)		('Jennifur', 1)		('Sora', 1)
('Keet', 1)		('Stark', 1)		('Aldrick', 1)		('Humphrey', 1)		('Karl', 1)
('Howie', 1)		('Harvey', 1)		('Skittles', 1)		('Ronduh', 1)		('Tuco', 1)
('Stu', 1)		('Robin', 1)		('Rooney', 1)		('Ralf', 1)		('Jazz', 1)
('Lilah', 1)		('Luther', 1)		('Billl', 1)		('Hamrick', 1)		('Cermet', 1)
('Snickers', 1)		('Izzy', 1)		('Kody', 1)		('Marvin', 1)		('his', 1)
('Lillie', 1)		('Sailer', 1)		('Shooter', 1)		('Coleman', 1)		('Malikai', 1)
('Chuq', 1)		('Loomis', 1)		('Zuzu', 1)		('Dylan', 1)		('Jackie', 1)
('Kawhi', 1)		('Georgie', 1)		('Tanner', 1)		('Harrison', 1)		('Pavlov', 1)
('Arnold', 1)		('Glacier', 1)		('Bones', 1)		('Kaia', 1)		('Smiley', 1)
('Arlo', 1)		('Cuddles', 1)		('Kaiya', 1)		('Andru', 1)		('Crouton', 1)
('Rumble', 1)		('Brandonald', 1)	('Ike', 1)		('Mark', 1)		('Duddles', 1)
('Berb', 1)		('Lassie', 1)		('Napolean', 1)		('Noosh', 1)		('Margo', 1)
('Teddy', 1)		('Pupcasso', 1)		('Brooks', 1)		('Striker', 1)		('all', 1)
('Freddery', 1)		('Aiden', 1)		('Mac', 1)		('Stewie', 1)		('Zara', 1)
('Coopson', 1)		('Dante', 1)		('Maisey', 1)		('Chase', 1)		('JD', 1)
('Chef', 1)		('Boston', 1)		('Tuck', 1)		('Tripp', 1)		('Doobert', 1)
('Furzey', 1)		('Liam', 1)		('Bookstore', 1)	('Jaspers', 1)		('Sandra', 1)
('Tess', 1)		('Stefan', 1)		('Lucia', 1)		('Sprout', 1)		('Schnitzel', 1)
('Sonny', 1)		('Sobe', 1)		('Cleopatricia', 1)	('Timofy', 1)		('incredibly', 1)
('Alf', 1)		('Shnuggles', 1)	('Alexanderson', 1)	('Jonah', 1)		('Cecil', 1)
('Daniel', 1)		('Birf', 1)		('Bradley', 1)		('Kuyu', 1)		('Harlso', 1)
('Toffee', 1)		('Hubertson', 1)	('Jay', 1)		('Ralph', 1)		('Rontu', 1)
('Tommy', 1)		('Ebby', 1)		('Socks', 1)		('Brandy', 1)		('Filup', 1)
('Traviss', 1)		('Zoe', 1)		('Rueben', 1)		('Kanu', 1)		('Brady', 1)
('Sweets', 1)		('Rolf', 1)		('Fido', 1)		('Gilbert', 1)		('Lupe', 1)
('Duchess', 1)		('Scruffers', 1)	('Crimson', 1)		('Grey', 1)		('Kayla', 1)
('Milky', 1)		('Yoda', 1)		('Mauve', 1)		('Jeffrie', 1)		('Holly', 1)
('Lulu', 1)		('Nimbus', 1)		('Dook', 1)		('Pilot', 1)		('Kevon', 1)
('Bloop', 1)		('Jo', 1)		('Jimbo', 1)		('Monty', 1)		('Bruno', 1)
('Clarq', 1)		('Edgar', 1)		('Edd', 1)		('Suki', 1)		('Norman', 1)
('Bloo', 1)		('Iroh', 1)		('Patch', 1)		('Spanky', 1)		('Blue', 1)
('Craig', 1)		('Bubba', 1)		('Biden', 1)		('Jeffri', 1)		('Bronte', 1)
('Murphy', 1)		('Remus', 1)		('Flurpson', 1)		('Bayley', 1)		('Aubie', 1)
('Billy', 1)		('Fiji', 1)		('Tycho', 1)		('Gordon', 1)		('Wafer', 1)
('Rizzo', 1)		('Karll', 1)		('Tug', 1)		('Meera', 1)		('Baron', 1)
('Blakely', 1)		('Molly', 1)		('Darrel', 1)		('Callie', 1)		('Jameson', 1)
('Rinna', 1)		('Josep', 1)		('Tassy', 1)		('Mollie', 1)		('Anthony', 1)
('Jangle', 1)		('Hero', 1)		('Eugene', 1)		('Dotsy', 1)		('Goliath', 1)
('Beya', 1)		('Tebow', 1)		('Ashleigh', 1)		('Tom', 1)		('Tupawc', 1)
('Mack', 1)		('Tango', 1)		('Kenzie', 1)		('Hazel', 1)		('Donny', 1)
('Gustav', 1)		('Jerome', 1)		('Grady', 1)		('Marq', 1)		('Franq', 1)
('Binky', 1)		('Rey', 1)		('Blanket', 1)		('Willow', 1)		('Barclay', 1)
('Puff', 1)		('Finnegus', 1)		('O', 1)		('Nico', 1)		('Bowie', 1)
('Snoopy', 1)		('Obi', 1)		('Iggy', 1)		('Cilantro', 1)		('Gin', 1)
('Dex', 1)		('Carbon', 1)		('Brudge', 1)		('Naphaniel', 1)	('Barney', 1)
('Tobi', 1)		('Alfy', 1)		('Jareld', 1)		('Emanuel', 1)		('Monkey', 1)
('Lugan', 1)		('Jaycob', 1)		('Bobbay', 1)		('Geno', 1)		('Batdog', 1)
('Miguel', 1)		('Cora', 1)		('Mosby', 1)		('Cheryl', 1)		('Rorie', 1)
('Pip', 1)		('Carper', 1)		('Kendall', 1)		('Butters', 1)		('Lili', 1)
('Fletcher', 1)		('Rumpole', 1)		('Zeus', 1)		('Willie', 1)		('Travis', 1)
('Millie', 1)		('Damon', 1)		('Eriq', 1)		('Banditt', 1)		('Carll', 1)
('Joey', 1)		('Chloe', 1)		('Nugget', 1)		('Maks', 1)		('Sparky', 1)
('Charl', 1)		('Sunshine', 1)		('Katie', 1)		('Ed', 1)		('Kingsley', 1)
('Ralphy', 1)		('Erik', 1)		('Dietrich', 1)		('Tessa', 1)		('Todo', 1)
('Lorelei', 1)		('Rascal', 1)		('Mo', 1)		('DayZ', 1)		('Dunkin', 1)
('BeBe', 1)		('Heinrich', 1)		('Eazy', 1)		('Ruffles', 1)		('Tonks', 1)
('Ralphé', 1)		('Ricky', 1)		('Rover', 1)		('Zeek', 1)		('Angel', 1)
('Juckson', 1)		('Goose', 1)		('Maya', 1)		('Lolo', 1)		('Carter', 1)
('Staniel', 1)		('Maude', 1)		('Opie', 1)		('Buddah', 1)		('Peanut', 1)
('Kara', 1)		('Frönq', 1)		('Trigger', 1)		('Deacon', 1)		('Schnozz', 1)
('Al', 1)		('Shakespeare', 1)	('Henry', 1)		('Leonidas', 1)		('Pancake', 1)
('Kellogg', 1)		('Claude', 1)		('Mary', 1)		('Acro', 1)		('Obie', 1)
('Sid', 1)		('Akumi', 1)		('Moofasa', 1)		('Taz', 1)		('Severus', 1)
('Oddie', 1)		('Leonard', 1)		

Nearly all of the name values are words from the text that were false positives in whatever name detection algorithm was used. These all begin with lowercase letters, so we can create a list of row indexes to check based on whether or not the name value begins with an lowercase letter or not.

In [26]:
names_to_check = []
for index, row in df_1.iterrows():
    if row['name'][0].islower():
        names_to_check.append(index)
In [27]:
count = 0
item = ""
for i, name in enumerate(names_to_check):
    item = item + str(name)
    if count < 4:
        count += 1
        if len(str(name)) < 16:
            item = item + "\t\t"
        else:
            item = item + "\t"
    elif count == 4:
        print(item)
        count = 0
        item = ""
print(item)
22		56		118		169		193
335		369		542		649		682
759		773		801		819		822
852		924		988		992		993
1002		1004		1017		1025		1031
1040		1049		1063		1071		1095
1097		1120		1121		1138		1193
1206		1207		1259		1340		1351
1361		1362		1368		1382		1385
1435		1457		1499		1527		1603
1693		1724		1737		1747		1785
1797		1815		1853		1854		1877
1878		1916		1923		1936		1941
1955		1994		2001		2019		2030
2034		2037		2066		2116		2125
2128		2146		2153		2161		2191
2198		2204		2211		2212		2218
2222		2235		2249		2255		2264
2273		2287		2304		2311		2314
2326		2327		2333		2334		2335
2345		2346		2347		2348		2349
2350		2352		2353		2354		
In [28]:
count = 0
item = ""
for i, name in enumerate(names_to_check):
    item = item + str(df_1.iloc[name]['name'])
    if count < 4:
        count += 1
        if len(str(df_1.iloc[name]['name'])) < 8:
            item = item + "\t\t"
        else:
            item = item + "\t"
    elif count == 4:
        print(item)
        count = 0
        item = ""
print(item)
such		a		quite		quite		quite
not		one		incredibly	a		mad
an		very		a		very		just
my		one		not		his		one
a		a		a		an		very
actually	a		just		getting		mad
very		this		unacceptable	all		a
old		a		infuriating	a		a
a		an		a		a		very
getting		just		a		the		the
actually	by		a		officially	a
the		the		a		a		a
a		life		a		one		a
a		a		light		just		space
a		the		a		a		a
a		a		a		a		a
a		an		a		the		a
a		a		a		a		a
a		a		a		a		a
quite		a		an		a		an
the		the		a		a		an
a		a		a		a		

Quality 8: Accuracy
Many of the name values are incorrectly set to words from the text value when they should be 'None'.

In [29]:
text_names_to_check = []
for index, row in df_1.iterrows():
    if row['name'] not in(row['text']) and row['name'] != 'None':
        text_names_to_check.append(index)

text_names_to_check
Out[29]:
[]

All of the name values that are not 'None' do appear to be present in their respective row's text column.

In [30]:
df_1[:20]
Out[30]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" r... This is Phineas. He's a mystical boy. Only eve... NaN NaN NaN https://twitter.com/dog_rates/status/892420643... 13 10 Phineas None None None None
1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" r... This is Tilly. She's just checking pup on you.... NaN NaN NaN https://twitter.com/dog_rates/status/892177421... 13 10 Tilly None None None None
2 891815181378084864 NaN NaN 2017-07-31 00:18:03 +0000 <a href="http://twitter.com/download/iphone" r... This is Archie. He is a rare Norwegian Pouncin... NaN NaN NaN https://twitter.com/dog_rates/status/891815181... 12 10 Archie None None None None
3 891689557279858688 NaN NaN 2017-07-30 15:58:51 +0000 <a href="http://twitter.com/download/iphone" r... This is Darla. She commenced a snooze mid meal... NaN NaN NaN https://twitter.com/dog_rates/status/891689557... 13 10 Darla None None None None
4 891327558926688256 NaN NaN 2017-07-29 16:00:24 +0000 <a href="http://twitter.com/download/iphone" r... This is Franklin. He would like you to stop ca... NaN NaN NaN https://twitter.com/dog_rates/status/891327558... 12 10 Franklin None None None None
5 891087950875897856 NaN NaN 2017-07-29 00:08:17 +0000 <a href="http://twitter.com/download/iphone" r... Here we have a majestic great white breaching ... NaN NaN NaN https://twitter.com/dog_rates/status/891087950... 13 10 None None None None None
6 890971913173991426 NaN NaN 2017-07-28 16:27:12 +0000 <a href="http://twitter.com/download/iphone" r... Meet Jax. He enjoys ice cream so much he gets ... NaN NaN NaN https://gofundme.com/ydvmve-surgery-for-jax,ht... 13 10 Jax None None None None
7 890729181411237888 NaN NaN 2017-07-28 00:22:40 +0000 <a href="http://twitter.com/download/iphone" r... When you watch your owner call another dog a g... NaN NaN NaN https://twitter.com/dog_rates/status/890729181... 13 10 None None None None None
8 890609185150312448 NaN NaN 2017-07-27 16:25:51 +0000 <a href="http://twitter.com/download/iphone" r... This is Zoey. She doesn't want to be one of th... NaN NaN NaN https://twitter.com/dog_rates/status/890609185... 13 10 Zoey None None None None
9 890240255349198849 NaN NaN 2017-07-26 15:59:51 +0000 <a href="http://twitter.com/download/iphone" r... This is Cassie. She is a college pup. Studying... NaN NaN NaN https://twitter.com/dog_rates/status/890240255... 14 10 Cassie doggo None None None
10 890006608113172480 NaN NaN 2017-07-26 00:31:25 +0000 <a href="http://twitter.com/download/iphone" r... This is Koda. He is a South Australian decksha... NaN NaN NaN https://twitter.com/dog_rates/status/890006608... 13 10 Koda None None None None
11 889880896479866881 NaN NaN 2017-07-25 16:11:53 +0000 <a href="http://twitter.com/download/iphone" r... This is Bruno. He is a service shark. Only get... NaN NaN NaN https://twitter.com/dog_rates/status/889880896... 13 10 Bruno None None None None
12 889665388333682689 NaN NaN 2017-07-25 01:55:32 +0000 <a href="http://twitter.com/download/iphone" r... Here's a puppo that seems to be on the fence a... NaN NaN NaN https://twitter.com/dog_rates/status/889665388... 13 10 None None None None puppo
13 889638837579907072 NaN NaN 2017-07-25 00:10:02 +0000 <a href="http://twitter.com/download/iphone" r... This is Ted. He does his best. Sometimes that'... NaN NaN NaN https://twitter.com/dog_rates/status/889638837... 12 10 Ted None None None None
14 889531135344209921 NaN NaN 2017-07-24 17:02:04 +0000 <a href="http://twitter.com/download/iphone" r... This is Stuart. He's sporting his favorite fan... NaN NaN NaN https://twitter.com/dog_rates/status/889531135... 13 10 Stuart None None None puppo
15 889278841981685760 NaN NaN 2017-07-24 00:19:32 +0000 <a href="http://twitter.com/download/iphone" r... This is Oliver. You're witnessing one of his m... NaN NaN NaN https://twitter.com/dog_rates/status/889278841... 13 10 Oliver None None None None
16 888917238123831296 NaN NaN 2017-07-23 00:22:39 +0000 <a href="http://twitter.com/download/iphone" r... This is Jim. He found a fren. Taught him how t... NaN NaN NaN https://twitter.com/dog_rates/status/888917238... 12 10 Jim None None None None
17 888804989199671297 NaN NaN 2017-07-22 16:56:37 +0000 <a href="http://twitter.com/download/iphone" r... This is Zeke. He has a new stick. Very proud o... NaN NaN NaN https://twitter.com/dog_rates/status/888804989... 13 10 Zeke None None None None
18 888554962724278272 NaN NaN 2017-07-22 00:23:06 +0000 <a href="http://twitter.com/download/iphone" r... This is Ralphus. He's powering up. Attempting ... NaN NaN NaN https://twitter.com/dog_rates/status/888554962... 13 10 Ralphus None None None None
19 888202515573088257 NaN NaN 2017-07-21 01:02:36 +0000 <a href="http://twitter.com/download/iphone" r... RT @dog_rates: This is Canela. She attempted s... 8.874740e+17 4.196984e+09 2017-07-19 00:47:34 +0000 https://twitter.com/dog_rates/status/887473957... 13 10 Canela None None None None
In [31]:
df_3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 31 columns):
created_at                       2354 non-null datetime64[ns, UTC]
id                               2354 non-null int64
id_str                           2354 non-null int64
full_text                        2354 non-null object
truncated                        2354 non-null bool
display_text_range               2354 non-null object
entities                         2354 non-null object
extended_entities                2073 non-null object
source                           2354 non-null object
in_reply_to_status_id            78 non-null float64
in_reply_to_status_id_str        78 non-null float64
in_reply_to_user_id              78 non-null float64
in_reply_to_user_id_str          78 non-null float64
in_reply_to_screen_name          78 non-null object
user                             2354 non-null object
geo                              0 non-null float64
coordinates                      0 non-null float64
place                            1 non-null object
contributors                     0 non-null float64
is_quote_status                  2354 non-null bool
retweet_count                    2354 non-null int64
favorite_count                   2354 non-null int64
favorited                        2354 non-null bool
retweeted                        2354 non-null bool
possibly_sensitive               2211 non-null float64
possibly_sensitive_appealable    2211 non-null float64
lang                             2354 non-null object
retweeted_status                 179 non-null object
quoted_status_id                 29 non-null float64
quoted_status_id_str             29 non-null float64
quoted_status                    28 non-null object
dtypes: bool(4), datetime64[ns, UTC](1), float64(11), int64(4), object(11)
memory usage: 505.9+ KB
In [32]:
df_2['p1'].sample(10)
Out[32]:
723     Rhodesian_ridgeback
609                  pillow
1039               Shih-Tzu
568                 dogsled
1363     Norwegian_elkhound
1709       golden_retriever
1944          Arabian_camel
75            Saint_Bernard
390                      ox
1826                  hyena
Name: p1, dtype: object

Quality 9: Consistancy
The dog breed name values in the p1, p2, and p3 columns are not consistantly stored. Make sure that capitalization of the values is uniform and replace the underscores with spaces for readability.

Tidyness 2: Each variable forms a column.
The columns 'id', 'in_reply_to_status_id', 'in_reply_to_user_id', and 'quoted_status_id' are currently represented in redundant duplicate columns that originally held string versions of the values, but this is unnecessary. To make this dataset tidy, it is simplest to just drop the repeat columns.

Tidyness 3: Each type of observational unit forms a table.
These three dataframes are all referencing the same observations. It would be best to combine them all into a single dataframe so that all of the information for each tweet is in the same place.

In [ ]:
 

Cleaning

Quality Issues

  1. id columns in all three dataframes are the wrong dtype.
  2. The dog development category columns in df_1 are set to 'None' when NaN should be used.
  3. The 14 rows with multiple dog development category values appear to be outliers or incorrect and need to be removed.
  4. Some of the rating_numerator values are set to the numerator of the first fraction found in the text and not the actual rating fraction.
  5. Some of these rating_numerator values should contain float values, so the rating numerator dtype should be updated to allow this and the numerators recalculated from the text values.
  6. Some of the rating_denominator values are set to the demoninator of the first fraction found in the text and not the actual rating fraction.
  7. Some of the row are replies or retweets that do not actually have a real dog, or a real rating. These all need to be removed to ensure that they do not impact any analysis.
  8. Many of the name values are incorrectly set to words from the text value when they should be 'None'.
  9. Make the capitalization of the p1, p2, and p3 column values uniform and replace underscores with spaces.

Tidyness Issues

  1. The 4 dog development categories should be combined to make a single categorical column.
  2. Duplicate data columns need to be removed.
  3. The three dataframes need to be combined because the rows in each are all for the same observations.

Tidyness Issue 3 and Quality Issue 7

In [33]:
df_3_copy = df_3.rename(index=str, columns={"id": "tweet_id"});
df_3_copy['tweet_id']
Out[33]:
0       892420643555336193
1       892177421306343426
2       891815181378084864
3       891689557279858688
4       891327558926688256
               ...        
2349    666049248165822465
2350    666044226329800704
2351    666033412701032449
2352    666029285002620928
2353    666020888022790149
Name: tweet_id, Length: 2354, dtype: int64
In [34]:
master_df = df_1.merge(df_2, on = 'tweet_id')
master_df = master_df.merge(df_3_copy, on = 'tweet_id')
master_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2073 entries, 0 to 2072
Data columns (total 58 columns):
tweet_id                         2073 non-null int64
in_reply_to_status_id_x          23 non-null float64
in_reply_to_user_id_x            23 non-null float64
timestamp                        2073 non-null object
source_x                         2073 non-null object
text                             2073 non-null object
retweeted_status_id              79 non-null float64
retweeted_status_user_id         79 non-null float64
retweeted_status_timestamp       79 non-null object
expanded_urls                    2073 non-null object
rating_numerator                 2073 non-null int64
rating_denominator               2073 non-null int64
name                             2073 non-null object
doggo                            2073 non-null object
floofer                          2073 non-null object
pupper                           2073 non-null object
puppo                            2073 non-null object
jpg_url                          2073 non-null object
img_num                          2073 non-null int64
p1                               2073 non-null object
p1_conf                          2073 non-null float64
p1_dog                           2073 non-null bool
p2                               2073 non-null object
p2_conf                          2073 non-null float64
p2_dog                           2073 non-null bool
p3                               2073 non-null object
p3_conf                          2073 non-null float64
p3_dog                           2073 non-null bool
created_at                       2073 non-null datetime64[ns, UTC]
id_str                           2073 non-null int64
full_text                        2073 non-null object
truncated                        2073 non-null bool
display_text_range               2073 non-null object
entities                         2073 non-null object
extended_entities                2073 non-null object
source_y                         2073 non-null object
in_reply_to_status_id_y          23 non-null float64
in_reply_to_status_id_str        23 non-null float64
in_reply_to_user_id_y            23 non-null float64
in_reply_to_user_id_str          23 non-null float64
in_reply_to_screen_name          23 non-null object
user                             2073 non-null object
geo                              0 non-null float64
coordinates                      0 non-null float64
place                            1 non-null object
contributors                     0 non-null float64
is_quote_status                  2073 non-null bool
retweet_count                    2073 non-null int64
favorite_count                   2073 non-null int64
favorited                        2073 non-null bool
retweeted                        2073 non-null bool
possibly_sensitive               2073 non-null float64
possibly_sensitive_appealable    2073 non-null float64
lang                             2073 non-null object
retweeted_status                 79 non-null object
quoted_status_id                 0 non-null float64
quoted_status_id_str             0 non-null float64
quoted_status                    0 non-null object
dtypes: bool(7), datetime64[ns, UTC](1), float64(18), int64(7), object(25)
memory usage: 856.3+ KB

Quality Issue 3

In [35]:
to_del = check_multiple_categories(master_df)
to_del
Out[35]:
[154, 160, 366, 429, 457, 464, 566, 627, 665, 722, 780, 871, 917]
In [36]:
master_df = master_df.drop(to_del)
In [37]:
check_multiple_categories(master_df)
Out[37]:
[]

Tidyness Issue 1 and Quality Issue 2

In [38]:
master_df['development_stage'] = np.NaN
master_df['development_stage'] = master_df['development_stage'].astype('object')
master_df['development_stage']
Out[38]:
0       NaN
1       NaN
2       NaN
3       NaN
4       NaN
       ... 
2068    NaN
2069    NaN
2070    NaN
2071    NaN
2072    NaN
Name: development_stage, Length: 2060, dtype: object
In [39]:
for index, row in master_df.iterrows():

    if row['doggo'] == 'doggo':
        master_df.at[index, 'development_stage'] = 'doggo'
    elif row['floofer'] == 'floofer':
        master_df.at[index, 'development_stage'] = 'floofer'
    elif row['pupper'] == 'pupper':
        master_df.at[index, 'development_stage'] = 'pupper'
    elif row['puppo'] == 'puppo':
        master_df.at[index, 'development_stage'] = 'puppo'
In [40]:
master_df['development_stage'] = master_df['development_stage'].astype('category')
master_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2060 entries, 0 to 2072
Data columns (total 59 columns):
tweet_id                         2060 non-null int64
in_reply_to_status_id_x          22 non-null float64
in_reply_to_user_id_x            22 non-null float64
timestamp                        2060 non-null object
source_x                         2060 non-null object
text                             2060 non-null object
retweeted_status_id              77 non-null float64
retweeted_status_user_id         77 non-null float64
retweeted_status_timestamp       77 non-null object
expanded_urls                    2060 non-null object
rating_numerator                 2060 non-null int64
rating_denominator               2060 non-null int64
name                             2060 non-null object
doggo                            2060 non-null object
floofer                          2060 non-null object
pupper                           2060 non-null object
puppo                            2060 non-null object
jpg_url                          2060 non-null object
img_num                          2060 non-null int64
p1                               2060 non-null object
p1_conf                          2060 non-null float64
p1_dog                           2060 non-null bool
p2                               2060 non-null object
p2_conf                          2060 non-null float64
p2_dog                           2060 non-null bool
p3                               2060 non-null object
p3_conf                          2060 non-null float64
p3_dog                           2060 non-null bool
created_at                       2060 non-null datetime64[ns, UTC]
id_str                           2060 non-null int64
full_text                        2060 non-null object
truncated                        2060 non-null bool
display_text_range               2060 non-null object
entities                         2060 non-null object
extended_entities                2060 non-null object
source_y                         2060 non-null object
in_reply_to_status_id_y          22 non-null float64
in_reply_to_status_id_str        22 non-null float64
in_reply_to_user_id_y            22 non-null float64
in_reply_to_user_id_str          22 non-null float64
in_reply_to_screen_name          22 non-null object
user                             2060 non-null object
geo                              0 non-null float64
coordinates                      0 non-null float64
place                            1 non-null object
contributors                     0 non-null float64
is_quote_status                  2060 non-null bool
retweet_count                    2060 non-null int64
favorite_count                   2060 non-null int64
favorited                        2060 non-null bool
retweeted                        2060 non-null bool
possibly_sensitive               2060 non-null float64
possibly_sensitive_appealable    2060 non-null float64
lang                             2060 non-null object
retweeted_status                 77 non-null object
quoted_status_id                 0 non-null float64
quoted_status_id_str             0 non-null float64
quoted_status                    0 non-null object
development_stage                307 non-null category
dtypes: bool(7), category(1), datetime64[ns, UTC](1), float64(18), int64(7), object(25)
memory usage: 933.2+ KB
In [41]:
master_df = master_df.drop(['doggo', 'floofer', 'pupper', 'puppo'], axis = 1)
master_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2060 entries, 0 to 2072
Data columns (total 55 columns):
tweet_id                         2060 non-null int64
in_reply_to_status_id_x          22 non-null float64
in_reply_to_user_id_x            22 non-null float64
timestamp                        2060 non-null object
source_x                         2060 non-null object
text                             2060 non-null object
retweeted_status_id              77 non-null float64
retweeted_status_user_id         77 non-null float64
retweeted_status_timestamp       77 non-null object
expanded_urls                    2060 non-null object
rating_numerator                 2060 non-null int64
rating_denominator               2060 non-null int64
name                             2060 non-null object
jpg_url                          2060 non-null object
img_num                          2060 non-null int64
p1                               2060 non-null object
p1_conf                          2060 non-null float64
p1_dog                           2060 non-null bool
p2                               2060 non-null object
p2_conf                          2060 non-null float64
p2_dog                           2060 non-null bool
p3                               2060 non-null object
p3_conf                          2060 non-null float64
p3_dog                           2060 non-null bool
created_at                       2060 non-null datetime64[ns, UTC]
id_str                           2060 non-null int64
full_text                        2060 non-null object
truncated                        2060 non-null bool
display_text_range               2060 non-null object
entities                         2060 non-null object
extended_entities                2060 non-null object
source_y                         2060 non-null object
in_reply_to_status_id_y          22 non-null float64
in_reply_to_status_id_str        22 non-null float64
in_reply_to_user_id_y            22 non-null float64
in_reply_to_user_id_str          22 non-null float64
in_reply_to_screen_name          22 non-null object
user                             2060 non-null object
geo                              0 non-null float64
coordinates                      0 non-null float64
place                            1 non-null object
contributors                     0 non-null float64
is_quote_status                  2060 non-null bool
retweet_count                    2060 non-null int64
favorite_count                   2060 non-null int64
favorited                        2060 non-null bool
retweeted                        2060 non-null bool
possibly_sensitive               2060 non-null float64
possibly_sensitive_appealable    2060 non-null float64
lang                             2060 non-null object
retweeted_status                 77 non-null object
quoted_status_id                 0 non-null float64
quoted_status_id_str             0 non-null float64
quoted_status                    0 non-null object
development_stage                307 non-null category
dtypes: bool(7), category(1), datetime64[ns, UTC](1), float64(18), int64(7), object(21)
memory usage: 868.8+ KB

Tidyness Issue 2

In [42]:
master_df = master_df.drop(['id_str', 'in_reply_to_status_id_str', 'in_reply_to_user_id_str', 'quoted_status_id_str'], axis = 1)
master_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2060 entries, 0 to 2072
Data columns (total 51 columns):
tweet_id                         2060 non-null int64
in_reply_to_status_id_x          22 non-null float64
in_reply_to_user_id_x            22 non-null float64
timestamp                        2060 non-null object
source_x                         2060 non-null object
text                             2060 non-null object
retweeted_status_id              77 non-null float64
retweeted_status_user_id         77 non-null float64
retweeted_status_timestamp       77 non-null object
expanded_urls                    2060 non-null object
rating_numerator                 2060 non-null int64
rating_denominator               2060 non-null int64
name                             2060 non-null object
jpg_url                          2060 non-null object
img_num                          2060 non-null int64
p1                               2060 non-null object
p1_conf                          2060 non-null float64
p1_dog                           2060 non-null bool
p2                               2060 non-null object
p2_conf                          2060 non-null float64
p2_dog                           2060 non-null bool
p3                               2060 non-null object
p3_conf                          2060 non-null float64
p3_dog                           2060 non-null bool
created_at                       2060 non-null datetime64[ns, UTC]
full_text                        2060 non-null object
truncated                        2060 non-null bool
display_text_range               2060 non-null object
entities                         2060 non-null object
extended_entities                2060 non-null object
source_y                         2060 non-null object
in_reply_to_status_id_y          22 non-null float64
in_reply_to_user_id_y            22 non-null float64
in_reply_to_screen_name          22 non-null object
user                             2060 non-null object
geo                              0 non-null float64
coordinates                      0 non-null float64
place                            1 non-null object
contributors                     0 non-null float64
is_quote_status                  2060 non-null bool
retweet_count                    2060 non-null int64
favorite_count                   2060 non-null int64
favorited                        2060 non-null bool
retweeted                        2060 non-null bool
possibly_sensitive               2060 non-null float64
possibly_sensitive_appealable    2060 non-null float64
lang                             2060 non-null object
retweeted_status                 77 non-null object
quoted_status_id                 0 non-null float64
quoted_status                    0 non-null object
development_stage                307 non-null category
dtypes: bool(7), category(1), datetime64[ns, UTC](1), float64(15), int64(6), object(21)
memory usage: 804.4+ KB

Quality Issue 1

In [43]:
master_df['tweet_id'] = master_df['tweet_id'].astype('object')
master_df['in_reply_to_status_id_x'] = master_df['in_reply_to_status_id_x'].astype('object')
master_df['in_reply_to_user_id_x'] = master_df['in_reply_to_user_id_x'].astype('object')
master_df['retweeted_status_id'] = master_df['retweeted_status_id'].astype('object')
master_df['retweeted_status_user_id'] = master_df['retweeted_status_user_id'].astype('object')
master_df['in_reply_to_status_id_y'] = master_df['in_reply_to_status_id_y'].astype('object')
master_df['in_reply_to_user_id_y'] = master_df['in_reply_to_user_id_y'].astype('object')
master_df['quoted_status_id'] = master_df['quoted_status_id'].astype('object')
master_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2060 entries, 0 to 2072
Data columns (total 51 columns):
tweet_id                         2060 non-null object
in_reply_to_status_id_x          22 non-null object
in_reply_to_user_id_x            22 non-null object
timestamp                        2060 non-null object
source_x                         2060 non-null object
text                             2060 non-null object
retweeted_status_id              77 non-null object
retweeted_status_user_id         77 non-null object
retweeted_status_timestamp       77 non-null object
expanded_urls                    2060 non-null object
rating_numerator                 2060 non-null int64
rating_denominator               2060 non-null int64
name                             2060 non-null object
jpg_url                          2060 non-null object
img_num                          2060 non-null int64
p1                               2060 non-null object
p1_conf                          2060 non-null float64
p1_dog                           2060 non-null bool
p2                               2060 non-null object
p2_conf                          2060 non-null float64
p2_dog                           2060 non-null bool
p3                               2060 non-null object
p3_conf                          2060 non-null float64
p3_dog                           2060 non-null bool
created_at                       2060 non-null datetime64[ns, UTC]
full_text                        2060 non-null object
truncated                        2060 non-null bool
display_text_range               2060 non-null object
entities                         2060 non-null object
extended_entities                2060 non-null object
source_y                         2060 non-null object
in_reply_to_status_id_y          22 non-null object
in_reply_to_user_id_y            22 non-null object
in_reply_to_screen_name          22 non-null object
user                             2060 non-null object
geo                              0 non-null float64
coordinates                      0 non-null float64
place                            1 non-null object
contributors                     0 non-null float64
is_quote_status                  2060 non-null bool
retweet_count                    2060 non-null int64
favorite_count                   2060 non-null int64
favorited                        2060 non-null bool
retweeted                        2060 non-null bool
possibly_sensitive               2060 non-null float64
possibly_sensitive_appealable    2060 non-null float64
lang                             2060 non-null object
retweeted_status                 77 non-null object
quoted_status_id                 0 non-null object
quoted_status                    0 non-null object
development_stage                307 non-null category
dtypes: bool(7), category(1), datetime64[ns, UTC](1), float64(8), int64(5), object(29)
memory usage: 804.4+ KB

Quality Issue 8

In [44]:
names_to_change = []
for index, row in master_df.iterrows():
    if row['name'][0].islower():
        names_to_change.append(index)

len(names_to_change)
Out[44]:
98
In [45]:
for index in names_to_change:
    master_df.at[index, 'name'] = 'None'
In [46]:
names_to_change = []
for index, row in master_df.iterrows():
    if row['name'][0].islower():
        names_to_change.append(index)

len(names_to_change)
Out[46]:
0

Quality Issue 4, Quality Issue 5, and Quality Issue 6

In [47]:
master_df['rating_numerator'] = master_df['rating_numerator'].astype('float')
In [48]:
for index, row in master_df.iterrows():
    numerator = 0.0
    numerator_end = 0
    numerator_start = 0
    
    # Find the position of the / in the first fraction in the text.
    for letter_index, letter in enumerate(row['text']):
        if letter == '/' and row['text'][letter_index + 1].isdigit():
            numerator_end = letter_index
            break
    
    # Find the position of the beginning of the numerator of the first fraction in the text.
    for letter_index2, letter in enumerate(row['text'][numerator_end::-1]):
        if letter.isspace():
            numerator_start = numerator_end - letter_index2
            break
    
    # Some text values do not have a space before the rating.
    # In these cases, shorten the string until it is just the rating numbers.
    getting_numerator = True
    while getting_numerator:
        try:
            numerator = float(row['text'][numerator_start:numerator_end])
            getting_numerator = False
            break
        except:
            pass
        if getting_numerator:
            numerator_start += 1
    
    master_df.at[index, 'rating_numerator'] = numerator
In [49]:
master_df['rating_numerator'].value_counts()
Out[49]:
 12.00      464
 10.00      418
 11.00      407
 13.00      281
 9.00       148
 8.00        94
 7.00        52
 14.00       40
 6.00        32
 5.00        31
 3.00        19
 4.00        16
 0.10        10
 2.00         9
 1.00         5
 0.11         4
 0.90         3
 0.00         2
 0.12         2
 50.00        1
 84.00        1
 24.00        1
 15.00        1
 13.50        1
 0.80         1
 80.00        1
 9.75         1
 420.00       1
 1776.00      1
 165.00       1
 45.00        1
 204.00       1
 99.00        1
 121.00       1
 60.00        1
 11.27        1
 11.26        1
-5.00         1
 88.00        1
 144.00       1
 143.00       1
 44.00        1
Name: rating_numerator, dtype: int64
In [50]:
fractions_to_check = []
for index, row in master_df.iterrows():
    num_forward_slashs = 0
    num_url_slashs = 0
    
    # Check for each / and ignore 3 of them for each URL.
    num_forward_slashs = row['text'].count('/')
    num_url_slashes = row['text'].count('https://t.co/') * 3
    
    if num_forward_slashs - num_url_slashes != 1:
        fractions_to_check.append(index)

len(fractions_to_check)
Out[50]:
30
In [51]:
for index in fractions_to_check:
    print(master_df.iloc[index]['text'])
    print(str(master_df.iloc[index]['rating_numerator']) + '/' + str(master_df.iloc[index]['rating_denominator']) + '\n\n')
This is Loki. He'll do your taxes for you. Can also make room in your budget for all the things you bought today. 12/10 what a puppo https://t.co/5oWrHCWg87
12.0/10


Atlas rolled around in some chalk and now he's a magical rainbow floofer. 13/10 please never take a bath https://t.co/nzqTNw0744
13.0/10


Meet Abby. She's incredibly distracting. Just wants to help steer. Hazardous af. Still 12/10 would pet while driving https://t.co/gLbLiZtwsp
12.0/10


When a single soap orb changes your entire perception of the universe... 10/10 https://t.co/9eCXpVExJc
10.0/10


This is Harnold. He accidentally opened the front facing camera. 10/10 get it together Harnold https://t.co/S6JHaSMtln
10.0/10


This is Curtis. He's an Albino Haberdasher. Terrified of dandelions. They really spook him up. 10/10 it'll be ok pup https://t.co/s8YcfZrWhK
10.0/10


This is Kane. He's a semi-submerged Haitian Huffleplop. Happy af. Sick waterfall. 11/10 would pat head approvingly https://t.co/7zjEC501Ul
11.0/10


Meet Watson. He's a Suzuki Tickleboop. Leader of a notorious biker gang. Only one ear functional. 12/10 snuggable af https://t.co/R1gLc5vDqG
12.0/10


Meet Rilo. He's a Northern Curly Ticonderoga. Currently balancing on one paw even in strong wind. Acrobatic af 11/10 https://t.co/KInss2PXyX
11.0/10


This pupper is afraid of its own feet. 12/10 would comfort https://t.co/Tn9Mp0oPoJ
12.0/10


I hope you guys enjoy this beautiful snowy pupper as much as I did. 11/10 https://t.co/DYUsHtL2aR
11.0/10


This is Hazel. She's a gymnast. Training hard for Rio. 11/10 focused af https://t.co/CneG2ZbxHP
11.0/10


This is Ricky. He's being escorted out of the dog park for talking shit about the other dogs. 8/10 not cool Ricky https://t.co/XtDkrsdEfF
8.0/10


This pupper just wants to say hello. 11/10 would knock down fence for https://t.co/A8X8fwS78x
11.0/10


This is Tess. Her main passions are shelves and baking too many cookies. 11/10 https://t.co/IriJlVZ6m4
11.0/10


Meet Ash. He's just a head now. Lost his body during the Third Crusade. Still in good spirits. 10/10 would pet well https://t.co/NJj2uP0atK
10.0/10


This is Kenneth. He's stuck in a bubble. 10/10 hang in there Kenneth https://t.co/uQt37xlYMJ
10.0/10


Here's a handful of sleepy puppers. All look unaware of their surroundings. Lousy guard dogs. Still cute tho 11/10s https://t.co/lyXX3v5j4s
11.0/10


This is Lenny. He wants to be a sprinkler. 10/10 you got this Lenny https://t.co/CZ0YaB40Hn
10.0/10


This is Kenny. He just wants to be included in the happenings. 11/10 https://t.co/2S6oye3XqK
11.0/10


This is Terry. He's a Toasty Western Sriracha. Doubles as a table. Great for parties. 10/10 would highly recommend https://t.co/1ui7a1ZLTT
10.0/10


This is Lola. She fell asleep on a piece of pizza. 10/10 frighteningly relatable https://t.co/eqmkr2gmPH
10.0/10


This is Jett. He is unimpressed by flower. 7/10 https://t.co/459qWNnV3F
7.0/10


This is Butters. He's not ready for Thanksgiving to be over. 10/10 poor Butters https://t.co/iTc578yDmY
10.0/10


This is a Slovakian Helter Skelter Feta named Leroi. Likes to skip on roofs. Good traction. Much balance. 10/10 wow! https://t.co/Dmy2mY2Qj5
10.0/10


Say hello to Bisquick. He is a Brown Douglass Fir terrier. Very inbred. Looks terrified. 8/10 still cute tho https://t.co/1XYRh8N00K
8.0/10


Exotic dog here. Long neck. Weird paws. Obsessed with bread. Waddles. Flies sometimes (wow!). Very happy dog. 6/10 https://t.co/rqO4I3nf2N
6.0/10


Quite an advanced dog here. Impressively dressed for canine. Has weapon. About to take out trash. 10/10 good dog https://t.co/8uCMwS9CbV
10.0/10


This is Scout. She is a black Downton Abbey. Isn't afraid to get dirty. 9/10 nothing bad to say https://t.co/kH60oka1HW
9.0/10


Here is a Siberian heavily armored polar bear mix. Strong owner. 10/10 I would do unspeakable things to pet this dog https://t.co/rdivxLiqEt
10.0/10


Quality Issue 9

In [52]:
master_df[['p1', 'p2', 'p3']] = master_df[['p1', 'p2', 'p3']].replace("_", " ", regex = True)
master_df['p1'] = master_df['p1'].str.title()
master_df['p2'] = master_df['p2'].str.title()
master_df['p3'] = master_df['p3'].str.title()
master_df.p1
Out[52]:
0                       Orange
1                    Chihuahua
2                    Chihuahua
3                  Paper Towel
4                       Basset
                 ...          
2068        Miniature Pinscher
2069       Rhodesian Ridgeback
2070           German Shepherd
2071                   Redbone
2072    Welsh Springer Spaniel
Name: p1, Length: 2060, dtype: object

Analysis

The average scores

Of all ratings

In [53]:
master_df['rating_numerator'].mean()
Out[53]:
12.128621359223303

Of all dog ratings

In [54]:
master_df.loc[master_df['p1_dog'] == True]['rating_numerator'].mean()
Out[54]:
11.39437788018433
In [55]:
#Create a row that is rating_numerator / rating_denominator and then get the mean.
master_df['score'] = master_df['rating_numerator'] / master_df['rating_denominator']

Of the scores of all ratings

In [56]:
master_df['score'].mean()
Out[56]:
1.1559121926617073

Of the scores of all dog ratings

In [57]:
master_df.loc[master_df['p1_dog'] == True]['score'].mean()
Out[57]:
1.083576532749673
In [ ]:
 

The most common breed of dog

Only consider the ratings that correspond to an actual dog.

In [58]:
is_dog = master_df.loc[master_df['p1_dog'] == True]
is_dog.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1519 entries, 1 to 2072
Data columns (total 52 columns):
tweet_id                         1519 non-null object
in_reply_to_status_id_x          13 non-null object
in_reply_to_user_id_x            13 non-null object
timestamp                        1519 non-null object
source_x                         1519 non-null object
text                             1519 non-null object
retweeted_status_id              52 non-null object
retweeted_status_user_id         52 non-null object
retweeted_status_timestamp       52 non-null object
expanded_urls                    1519 non-null object
rating_numerator                 1519 non-null float64
rating_denominator               1519 non-null int64
name                             1519 non-null object
jpg_url                          1519 non-null object
img_num                          1519 non-null int64
p1                               1519 non-null object
p1_conf                          1519 non-null float64
p1_dog                           1519 non-null bool
p2                               1519 non-null object
p2_conf                          1519 non-null float64
p2_dog                           1519 non-null bool
p3                               1519 non-null object
p3_conf                          1519 non-null float64
p3_dog                           1519 non-null bool
created_at                       1519 non-null datetime64[ns, UTC]
full_text                        1519 non-null object
truncated                        1519 non-null bool
display_text_range               1519 non-null object
entities                         1519 non-null object
extended_entities                1519 non-null object
source_y                         1519 non-null object
in_reply_to_status_id_y          13 non-null object
in_reply_to_user_id_y            13 non-null object
in_reply_to_screen_name          13 non-null object
user                             1519 non-null object
geo                              0 non-null float64
coordinates                      0 non-null float64
place                            1 non-null object
contributors                     0 non-null float64
is_quote_status                  1519 non-null bool
retweet_count                    1519 non-null int64
favorite_count                   1519 non-null int64
favorited                        1519 non-null bool
retweeted                        1519 non-null bool
possibly_sensitive               1519 non-null float64
possibly_sensitive_appealable    1519 non-null float64
lang                             1519 non-null object
retweeted_status                 52 non-null object
quoted_status_id                 0 non-null object
quoted_status                    0 non-null object
development_stage                225 non-null category
score                            1519 non-null float64
dtypes: bool(7), category(1), datetime64[ns, UTC](1), float64(10), int64(4), object(29)
memory usage: 546.1+ KB
In [59]:
is_dog['p1'].value_counts()
Out[59]:
Golden Retriever      144
Labrador Retriever     99
Pembroke               88
Chihuahua              83
Pug                    57
                     ... 
Clumber                 1
Entlebucher             1
Japanese Spaniel        1
Standard Schnauzer      1
Silky Terrier           1
Name: p1, Length: 111, dtype: int64
In [60]:
# Split the breed counts into a top 10 list and all others.
totals = is_dog['p1'].value_counts()
top_10_totals = totals[0:10]
others = totals[10:]
In [61]:
# Also create a bottom 10 list.
bottom_10_totals = totals[-10:]
bottom_10_totals
Out[61]:
Appenzeller           2
Australian Terrier    2
Toy Terrier           2
Groenendael           1
Scotch Terrier        1
Clumber               1
Entlebucher           1
Japanese Spaniel      1
Standard Schnauzer    1
Silky Terrier         1
Name: p1, dtype: int64
In [62]:
# Combine the other breeds into a single value for plotting.
top_10_totals.at['Other'] = others.sum()
top_10_totals
Out[62]:
Golden Retriever      144
Labrador Retriever     99
Pembroke               88
Chihuahua              83
Pug                    57
Chow                   44
Samoyed                43
Toy Poodle             39
Pomeranian             38
Cocker Spaniel         30
Other                 854
Name: p1, dtype: int64
In [63]:
plt.pie(top_10_totals, labels = top_10_totals.index)
plt.title('The Most Common Breeds of Dogs on We Rate Dogs')
plt.savefig('Dog Breed Proportions Pieplot.png', dpi=300, bbox_inches = "tight")
plt.show()

The breed of dog that gets the highest ratings

In [64]:
breeds = is_dog['p1'].value_counts().keys()
breed_mean_rating = []
breeds
Out[64]:
Index(['Golden Retriever', 'Labrador Retriever', 'Pembroke', 'Chihuahua',
       'Pug', 'Chow', 'Samoyed', 'Toy Poodle', 'Pomeranian', 'Cocker Spaniel',
       ...
       'Appenzeller', 'Australian Terrier', 'Toy Terrier', 'Groenendael',
       'Scotch Terrier', 'Clumber', 'Entlebucher', 'Japanese Spaniel',
       'Standard Schnauzer', 'Silky Terrier'],
      dtype='object', length=111)
In [65]:
for breed in breeds:
    avg = is_dog.loc[is_dog['p1'] == breed]['rating_numerator'].mean()
    breed_mean_rating.append([breed, avg])
In [66]:
breed_mean_rating.sort(key=lambda x: x[1], reverse=True)
breed_mean_rating
Out[66]:
[['Soft-Coated Wheaten Terrier', 25.454545454545453],
 ['West Highland White Terrier', 15.642857142857142],
 ['Great Pyrenees', 14.928571428571429],
 ['Borzoi', 14.444444444444445],
 ['Labrador Retriever', 13.487979797979797],
 ['Siberian Husky', 13.25],
 ['Golden Retriever', 12.997361111111111],
 ['Saluki', 12.5],
 ['Tibetan Mastiff', 12.4],
 ['Briard', 12.333333333333334],
 ['Giant Schnauzer', 12.0],
 ['Standard Schnauzer', 12.0],
 ['Silky Terrier', 12.0],
 ['Irish Setter', 11.833333333333334],
 ['Eskimo Dog', 11.777777777777779],
 ['Gordon Setter', 11.75],
 ['Samoyed', 11.69767441860465],
 ['Chow', 11.636363636363637],
 ['Wire-Haired Fox Terrier', 11.5],
 ['Australian Terrier', 11.5],
 ['Kelpie', 11.454545454545455],
 ['Norfolk Terrier', 11.428571428571429],
 ['Greater Swiss Mountain Dog', 11.333333333333334],
 ['Irish Water Spaniel', 11.333333333333334],
 ['Leonberg', 11.333333333333334],
 ['Pembroke', 11.319431818181819],
 ['Rottweiler', 11.294117647058824],
 ['Blenheim Spaniel', 11.272727272727273],
 ['Clumber', 11.27],
 ['Doberman', 11.25],
 ['Bernese Mountain Dog', 11.2],
 ['Old English Sheepdog', 11.166666666666666],
 ['Pekinese', 11.153846153846153],
 ['Basset', 11.153846153846153],
 ['Pomeranian', 11.151315789473685],
 ['Kuvasz', 11.14125],
 ['Toy Poodle', 11.128205128205128],
 ['Norwegian Elkhound', 11.125],
 ['Collie', 11.1],
 ['Cocker Spaniel', 11.07],
 ['Cardigan', 11.005263157894737],
 ['American Staffordshire Terrier', 11.0],
 ['English Springer', 11.0],
 ['Komondor', 11.0],
 ['Cairn', 11.0],
 ['Sussex Spaniel', 11.0],
 ['Appenzeller', 11.0],
 ['Toy Terrier', 11.0],
 ['Entlebucher', 11.0],
 ['Malamute', 10.9],
 ['Schipperke', 10.9],
 ['French Bulldog', 10.88846153846154],
 ['Vizsla', 10.846153846153847],
 ['Irish Terrier', 10.833333333333334],
 ['Lakeland Terrier', 10.823529411764707],
 ['Staffordshire Bullterrier', 10.8],
 ['Standard Poodle', 10.75],
 ['Mexican Hairless', 10.75],
 ['Malinois', 10.666666666666666],
 ['Dandie Dinmont', 10.666666666666666],
 ['Miniature Pinscher', 10.652173913043478],
 ['Boxer', 10.6],
 ['German Shepherd', 10.595],
 ['Border Terrier', 10.587142857142856],
 ['Border Collie', 10.583333333333334],
 ['Flat-Coated Retriever', 10.571428571428571],
 ['Chihuahua', 10.544819277108434],
 ['Shih-Tzu', 10.529411764705882],
 ['Yorkshire Terrier', 10.5],
 ['Afghan Hound', 10.5],
 ['Bluetick', 10.5],
 ['Black-And-Tan Coonhound', 10.5],
 ['Beagle', 10.444444444444445],
 ['Whippet', 10.444444444444445],
 ['Bloodhound', 10.428571428571429],
 ['Brittany Spaniel', 10.428571428571429],
 ['Bull Mastiff', 10.4],
 ['Lhasa', 10.4],
 ['Shetland Sheepdog', 10.38888888888889],
 ['Chesapeake Bay Retriever', 10.352173913043478],
 ['Scottish Deerhound', 10.333333333333334],
 ['Pug', 10.31578947368421],
 ['Keeshond', 10.25],
 ['Newfoundland', 10.2],
 ['Saint Bernard', 10.142857142857142],
 ['Papillon', 10.125],
 ['English Setter', 10.0],
 ['Bedlington Terrier', 10.0],
 ['Brabancon Griffon', 10.0],
 ['Groenendael', 10.0],
 ['Italian Greyhound', 9.9375],
 ['Miniature Poodle', 9.875],
 ['Airedale', 9.833333333333334],
 ['Rhodesian Ridgeback', 9.75],
 ['Redbone', 9.666666666666666],
 ['Dalmatian', 9.545454545454545],
 ['Boston Bull', 9.444444444444445],
 ['Great Dane', 9.31111111111111],
 ['Maltese Dog', 9.277777777777779],
 ['Tibetan Terrier', 9.25],
 ['Miniature Schnauzer', 9.25],
 ['German Short-Haired Pointer', 9.014285714285714],
 ['Walker Hound', 9.0],
 ['Norwich Terrier', 9.0],
 ['Ibizan Hound', 9.0],
 ['Welsh Springer Spaniel', 9.0],
 ['Curly-Coated Retriever', 9.0],
 ['Scotch Terrier', 9.0],
 ['Basenji', 8.73],
 ['Weimaraner', 8.525],
 ['Japanese Spaniel', 5.0]]
In [67]:
top_10_breeds = breed_mean_rating[0:10]
top_10_breeds
Out[67]:
[['Soft-Coated Wheaten Terrier', 25.454545454545453],
 ['West Highland White Terrier', 15.642857142857142],
 ['Great Pyrenees', 14.928571428571429],
 ['Borzoi', 14.444444444444445],
 ['Labrador Retriever', 13.487979797979797],
 ['Siberian Husky', 13.25],
 ['Golden Retriever', 12.997361111111111],
 ['Saluki', 12.5],
 ['Tibetan Mastiff', 12.4],
 ['Briard', 12.333333333333334]]
In [68]:
bottom_10_breeds = breed_mean_rating[-10:]
bottom_10_breeds
Out[68]:
[['German Short-Haired Pointer', 9.014285714285714],
 ['Walker Hound', 9.0],
 ['Norwich Terrier', 9.0],
 ['Ibizan Hound', 9.0],
 ['Welsh Springer Spaniel', 9.0],
 ['Curly-Coated Retriever', 9.0],
 ['Scotch Terrier', 9.0],
 ['Basenji', 8.73],
 ['Weimaraner', 8.525],
 ['Japanese Spaniel', 5.0]]
In [69]:
# Limit the averages to two decimal places for plotting.
for breed in top_10_breeds:
    breed[1] = float("{0:.2f}".format(breed[1]))
    
top_10_breeds
Out[69]:
[['Soft-Coated Wheaten Terrier', 25.45],
 ['West Highland White Terrier', 15.64],
 ['Great Pyrenees', 14.93],
 ['Borzoi', 14.44],
 ['Labrador Retriever', 13.49],
 ['Siberian Husky', 13.25],
 ['Golden Retriever', 13.0],
 ['Saluki', 12.5],
 ['Tibetan Mastiff', 12.4],
 ['Briard', 12.33]]
In [70]:
breed_names = list(zip(*top_10_breeds))[0]
breed_ratings = list(zip(*top_10_breeds))[1]
In [71]:
plt.bar(breed_names, breed_ratings)
plt.xticks(rotation=90)
plt.xlabel('Dog Breed')
plt.ylabel('Average Rating')
plt.title('Top 10 Average Ratings of Dog Breeds')
plt.savefig('Top 10 Ratings Barplot.png', dpi=300, bbox_inches = "tight")
plt.show()

Storage

In [72]:
master_df.to_csv('twitter_archive_master.csv')
In [ ]: