より効率的第二言語習得のための日本語補題検討

Examination of Japanese Lemmas for a More Efficient Second Language Acquisition

Matthew Unrue, December 2018

Udacity Data Analyst Nanodegree Capstone Project

Project context, goals, and findings can be found in the readme.txt file here, which is also reproduced below for convenience. Download links to the datasets created throughout this project are available below as well.

Sources:

  • The lemma dataset can be found hosted here and similar datasets for other languages can be found here.
  • The Part of Speech distribution frequency dataset can be found here.
  • More information on lemmas can be found on its corresponding Wikipedia page.
  • The JLPT tier dataset was created from this webpage here.
  • The Kanji .json dataset is found here.
  • Character information is additionally gathered from here.

readme.txt addition:

Project Context:

In linguistics, a lexeme is the set of forms that a word, or more specifically a single semantic value, can take on in a language regardless of the number of ways it can be modified through inflection.

A lemma is the dictionary form of a word that is chosen by the conventions of its individual language to represent the entirety of the lexeme.

Lemmas and word stems are different in that a stem is the portion of a word that remains constant despite morphological inflection while a lemma is the base form of the word that represents the distinct meaning of the word regardless of inflection.
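
As a rough illustration of the distinction above, the sketch below uses a small, hypothetical set of inflected forms of one Japanese verb. Treating the stem as the longest prefix shared by all forms is a simplifying assumption made only for this example; it is not a general morphological analysis.

```python
import os

# Hypothetical lexeme: a few inflected forms of the verb "to eat".
lexeme = ["食べる", "食べた", "食べない", "食べます"]

# By convention, the dictionary form of a Japanese verb serves as its lemma.
lemma = "食べる"

# Simplifying assumption: the stem is the longest prefix shared by every form.
# os.path.commonprefix compares the strings character by character.
stem = os.path.commonprefix(lexeme)

print(lemma, stem)
```

Here the lemma is a complete word that can appear in a dictionary, while the stem is only the unchanging portion of the forms.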

When studying a language, many different approaches can be taken. One efficient method is to learn the base form of a concept, the lemma, and then, by applying the grammatical rules of the language, incorporate the remainder of the lexeme into one's usage.

More information on lemmas can be found on its corresponding Wikipedia page: (https://en.wikipedia.org/wiki/Lemma_(morphology))

Project Goals:

This project examines the frequencies of lemmas in the Japanese language, and what factors influence those frequencies, in order to determine a more efficient approach towards Japanese second language acquisition and the ordering of teaching materials for this purpose.

Efficiency will be measured by the estimated frequency, and thus number of applications or general usefulness, that learning a word will give, assuming that the student can apply grammatical rules to utilize all appropriate forms of the word, as determined by the frequency of the lemma in the Internet Corpus.

Additionally, the part of speech that each lemma is classified as will be used to look into ideas for a more efficient order of learning various sets of grammatical rules in Japanese.

Data Sources:

Main Findings:

  1. Lemma frequency is most strongly affected by its length and both total and average character stroke counts.

  2. When weighting the importance of learning a lemma, kanji and hiragana script words should be prioritized more.

  3. While longer lemmas are generally used less frequently, the lemmas included in the JLPT n5 exam vocabulary list should be made exempt from negative weighting from length due to the type of hiragana words in this category.

  4. A more efficient Japanese learning order will have the largest focus on nouns, verbs, and adjective syntax, but will cover auxiliary verb, conjunction, and particle rules in earlier stages.

Visualizations for Presentation:

  1. The 'Frequency by Lemma Length', 'Frequency by Lemma Total Stroke Count', and 'Frequency by Lemma Average Character Stroke Count' plots all have extremely similar trend lines, which means that lemma length, lemma total stroke count, and lemma average character stroke count all have similar impacts on a lemma's frequency. Combining these as subplots on a single plot makes this comparison clear.

  2. The 'Lemma Frequency by Script' plot shows that while the differences in medians and interquartile ranges among the script types show the varying importance of focusing on each script, the sheer number of extremely high frequency outliers in the hiragana script is worth discussion and study alone.

  3. The 'Lemma JLPT Level by Frequency' plot shows the distributions of each JLPT level's frequencies. While the n5 exam has only the third highest median and third quartile of the levels, it also has the most outliers and the highest frequency values of all. When compared to the 'Lemma JLPT Level by Lemma Length' plot, the reason for this becomes apparent: the JLPT n5 exam has the highest mean and range of lemma length.

  4. The 'Distribution of Lemma Parts of Speech' shows the ratio of each lemma part of speech, with nouns, verbs, and adjectives having the most representation by far. Comparing this information with the 'Lemma Average Character Stroke Count' and 'Lemma Average Character Frequency' plots shows why language ordering cannot be only based on ratios, as three of the least common parts of speech in the dataset are shown to have the highest average character frequencies.


The project begins by loading in datasets and creating columns of data from the existing information in order to have a sufficient amount of variables to examine.

In [1]:
# Import all modules and libraries, as well as set matplotlib plotting to occur in the notebook.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
from bs4 import BeautifulSoup
import json
import statsmodels.api as sm

%matplotlib inline
In [2]:
# Read the lemma dataset into a dataframe called original_df.
original_df = pd.read_csv('japanese_lemmas.csv')
original_df.head()
Out[2]:
rank frequency lemma
0 1 41309.50
1 2 23509.54
2 3 22216.80
3 4 20431.93
4 5 20326.59

The lemma dataset has three columns:

  • rank: The ranking of frequency of the lemma
  • frequency: The number of instances of the lemma per million words
  • lemma: The actual lemma
In [3]:
# Read the part of speech dataset into a dataframe called original_pos_df.
original_pos_df = pd.read_csv('japanese_pos_frequencies.csv', names = ['rank', 'frequency', 'jap_pos'])
original_pos_df.head()
Out[3]:
rank frequency jap_pos
0 1 343804.25 名詞
1 2 208342.21 助詞
2 3 203199.30 記号
3 4 99121.80 動詞
4 5 68734.93 助動詞

The part of speech dataset has three columns:

  • rank: The ranking of frequency of the part of speech
  • frequency: The number of instances of the part of speech per million words.
  • jap_pos: The actual part of speech
In [4]:
# Work with copies of the original dataframes.
df = original_df.copy()
pos_df = original_pos_df.copy()
In [5]:
# Visualize the information of the lemma frequency dataset.
# Scale the y values as log because of the large frequency differences between the most common lemmas and the bulk of the lemmas.
x = df['rank']
y = df['frequency']
plt.plot(x, y)
plt.title('Frequencies of the 15,000 Most Common Lemmas in Japanese')
plt.xlabel('Lemma Rank')
plt.xticks([0, 1500, 3000, 4500, 6000, 7500, 9000, 10500, 12000, 13500, 15000], rotation = 'vertical')
plt.ylabel('Lemma Frequency')
plt.yscale('log')
plt.show()

The distribution of the lemma frequencies appears logarithmic, which makes sense because only a few words should be extremely common from simplicity or syntactical importance, with the bulk of others slowly becoming less frequent as they become more specific or niche.
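
One way to make this concentration concrete is to compute how much of the corpus the top-ranked lemmas cover. The sketch below is illustrative only: it uses the five frequencies shown in Out[2] rather than the full dataset, and it assumes that "per million words" can be read as a share of all corpus tokens. The names toy_df and cumulative_coverage are hypothetical.

```python
import pandas as pd

# The five head frequencies shown in Out[2] (instances per million words).
toy_df = pd.DataFrame({
    'rank': [1, 2, 3, 4, 5],
    'frequency': [41309.50, 23509.54, 22216.80, 20431.93, 20326.59],
})

# Cumulative share of corpus tokens covered after learning the top-k lemmas,
# under the assumption that frequencies are counts per 1,000,000 tokens.
toy_df['cumulative_coverage'] = toy_df['frequency'].cumsum() / 1_000_000

print(toy_df)
```

On these five rows the cumulative coverage reaches roughly 12.8% of tokens, which is the kind of payoff the efficiency measure in this project is meant to capture.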

In [6]:
# Check for duplicate rows.
df.duplicated().sum()
Out[6]:
0
In [7]:
# Check for null values.
df.isnull().sum()
Out[7]:
rank         0
frequency    0
lemma        0
dtype: int64
In [8]:
# Check dtypes.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 3 columns):
rank         15000 non-null int64
frequency    15000 non-null float64
lemma        15000 non-null object
dtypes: float64(1), int64(1), object(1)
memory usage: 351.7+ KB

This dataset is already clean, so little tidying needs to be done.

In [9]:
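# Store the rank column with the object dtype so that it is treated as a label rather than a quantity.
# Note that astype('object') keeps the integer values; it does not convert them to strings.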
df['rank'] = df['rank'].astype('object')
In [10]:
# Check dtypes.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 3 columns):
rank         15000 non-null object
frequency    15000 non-null float64
lemma        15000 non-null object
dtypes: float64(1), object(2)
memory usage: 351.7+ KB
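
For reference, the object dtype and an actual string conversion behave differently. The following is a minimal sketch on a hypothetical three-element Series, not on the project's dataframe.

```python
import pandas as pd

ranks = pd.Series([1, 2, 3])

# The object dtype stores generic Python objects; the values remain integers.
as_object = ranks.astype('object')

# Converting with str turns each value into text such as '1'.
as_string = ranks.astype(str)

print(as_object.dtype, type(as_object.iloc[0]).__name__)
print(type(as_string.iloc[0]).__name__)
```

If true strings are needed later (for example, for string concatenation), astype(str) is the conversion to reach for; for simply excluding the column from numeric summaries, the object dtype is enough.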

Begin working with the part of speech frequency dataset by adding a translated and transliterated part of speech column for English speakers.

In [11]:
# Manually define the translations and transliterations.
translations = ['noun', 'particle', 'symbol', 'verb', 'auxiliary verb', 'adverb', 'adjective', 'adnominal', 'conjunction', 'prefix', 'interjection', 'filler', 'other']
transliterations = ['めいし', 'じょし', 'きごう', 'どうし', 'じょどうし', 'ふくし', 'けいようし', 'れんたいし', 'せつぞくし', 'せっとうし', 'かんどうし', 'ふぃらあ', 'そのほか']
pos_df['eng_pos'] = translations
pos_df['transliterated_pos'] = transliterations

# Reorder the columns so that eng_pos is next to frequency and the two Japanese columns are adjacent for easier reading.
pos_df = pos_df[['rank', 'frequency', 'eng_pos', 'jap_pos', 'transliterated_pos']]
In [12]:
# Ensure the pos_df is easily readable.
pos_df
Out[12]:
rank frequency eng_pos jap_pos transliterated_pos
0 1 343804.25 noun 名詞 めいし
1 2 208342.21 particle 助詞 じょし
2 3 203199.30 symbol 記号 きごう
3 4 99121.80 verb 動詞 どうし
4 5 68734.93 auxiliary verb 助動詞 じょどうし
5 6 15003.37 adverb 副詞 ふくし
6 7 10040.91 adjective 形容詞 けいようし
7 8 7509.62 adnominal 連体詞 れんたいし
8 9 5684.89 conjunction 接続詞 せつぞくし
9 10 5227.79 prefix 接頭詞 せっとうし
10 11 1257.96 interjection 感動詞 かんどうし
11 12 64.82 filler フィラー ふぃらあ
12 13 14.98 other その他 そのほか
In [13]:
# Check and set column dtypes.
pos_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 5 columns):
rank                  13 non-null int64
frequency             13 non-null float64
eng_pos               13 non-null object
jap_pos               13 non-null object
transliterated_pos    13 non-null object
dtypes: float64(1), int64(1), object(3)
memory usage: 648.0+ bytes

This part of speech dataset is also already clean and just needs minor dtype adjustment.

In [14]:
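# Store the rank column with the object dtype so that it is treated as a label rather than a quantity.
# Note that astype('object') keeps the integer values; it does not convert them to strings.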
pos_df['rank'] = pos_df['rank'].astype('object')
In [15]:
# Check dtypes.
pos_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 5 columns):
rank                  13 non-null object
frequency             13 non-null float64
eng_pos               13 non-null object
jap_pos               13 non-null object
transliterated_pos    13 non-null object
dtypes: float64(1), object(4)
memory usage: 648.0+ bytes

The ratio of each part of speech in pos_df will be useful to have alongside each row.

In [16]:
# Calculate each part of speech's share of the total frequency (a proportion between 0 and 1).
total = pos_df['frequency'].sum()
pos_df['frequency_percentage'] = pos_df['frequency'] / total
In [17]:
# Reorder the columns to place the frequency percentage by the frequency.
pos_df = pos_df[['rank', 'eng_pos', 'frequency', 'frequency_percentage', 'jap_pos', 'transliterated_pos']]
pos_df
Out[17]:
rank eng_pos frequency frequency_percentage jap_pos transliterated_pos
0 1 noun 343804.25 0.355167 名詞 めいし
1 2 particle 208342.21 0.215228 助詞 じょし
2 3 symbol 203199.30 0.209915 記号 きごう
3 4 verb 99121.80 0.102398 動詞 どうし
4 5 auxiliary verb 68734.93 0.071007 助動詞 じょどうし
5 6 adverb 15003.37 0.015499 副詞 ふくし
6 7 adjective 10040.91 0.010373 形容詞 けいようし
7 8 adnominal 7509.62 0.007758 連体詞 れんたいし
8 9 conjunction 5684.89 0.005873 接続詞 せつぞくし
9 10 prefix 5227.79 0.005401 接頭詞 せっとうし
10 11 interjection 1257.96 0.001300 感動詞 かんどうし
11 12 filler 64.82 0.000067 フィラー ふぃらあ
12 13 other 14.98 0.000015 その他 そのほか

Nouns understandably take up just over a third of the language usage, but particles actually take up a full fifth of the language usage, twice as much as verbs.

Dataframe to Create:

  • Character Frequency

Columns to Calculate:

Character Frequency Dataframe

  • Frequency
  • Frequency Rank
  • Character
  • Weighted Frequency (Sum of the Frequency of Character in all Words * Frequency of the Respective Words)
  • Type of Character
  • JLPT Exam Level
  • Stroke Count

Lemma Frequency Dataframe

  • Type of Characters
  • Part of Speech
  • Highest JLPT Exam Character

Create a dataframe of every individual character found in the lemma dataset.

While populating the list of characters, total the number of times each character is used.

In [18]:
# Find each character and the number of times it appears in the lemma dataset.
characters = {}
for lemma in df['lemma']:
    for char in lemma:
        if char in characters:
            characters[char] += 1
        else:
            characters[char] = 1
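
The same tally can be written more compactly with the standard library's collections.Counter. This is an alternative sketch rather than the code used in the project, and the lemmas list is a hypothetical stand-in for df['lemma'].

```python
from collections import Counter

# Hypothetical stand-in for df['lemma'].
lemmas = ['する', 'ある', 'いる']

# Counter.update on a string adds one count per character occurrence.
characters = Counter()
for lemma in lemmas:
    characters.update(lemma)

print(characters.most_common(2))
```

A Counter is a dict subclass, so it can be passed to pd.DataFrame.from_dict in the same way as the plain dictionary built above.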

Glance at the first 30 entries of the character dictionary.

In [19]:
dict(list(characters.items())[:30])
Out[19]:
{'の': 89,
 'に': 164,
 'は': 73,
 'て': 143,
 'を': 7,
 'が': 92,
 'だ': 76,
 'た': 182,
 'す': 338,
 'る': 1135,
 'と': 200,
 'ま': 224,
 'で': 58,
 'な': 184,
 'い': 603,
 'も': 120,
 'あ': 122,
 '・': 1,
 '「': 1,
 '」': 1,
 'こ': 131,
 'e': 1,
 'か': 300,
 'o': 1,
 'a': 1,
 't': 1,
 'れ': 205,
 'ら': 192,
 ')': 2,
 '(': 2}
In [20]:
# Create the character dataframe from the characters dictionary.
# Change the column names, sort the rows by descending frequency, and correct the index.
char_df = pd.DataFrame.from_dict(characters, orient = 'index')
char_df = char_df.reset_index()
char_df = char_df.rename({'index': 'character', 0: 'frequency'}, axis='columns')
char_df.sort_values('frequency', ascending = False, inplace = True)
char_df = char_df.reset_index(drop = True)
char_df.head()
Out[20]:
character frequency
0 1135
1 873
2 673
3 603
4 401

Additionally, calculate a 'weighted' frequency for each character.

This is the sum of the frequencies of the words that the character appears in, counting each time the character appears in the word.

In [21]:
# Calculate the approximate number of times that each character appeared in the corpus that the lemma dataset was calculated from.
weighted_characters = {}
for index, row in df.iterrows():
    for char in row['lemma']:
        if char in weighted_characters:
            weighted_characters[char] += row['frequency']
        else:
            weighted_characters[char] = row['frequency']
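
The weighted total can also be computed without an explicit loop. The sketch below is an alternative under stated assumptions: it needs pandas 0.25 or later for DataFrame.explode, and toy_df with its three rows is hypothetical data chosen so the result is easy to check by hand.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'lemma': ['する', 'ある', 'まま'],
    'frequency': [10.0, 5.0, 2.0],
})

weighted = (
    # Split each lemma into a list of characters, one list per row.
    toy_df.assign(character=toy_df['lemma'].map(list))
          # Give every character its own row, repeating the lemma's frequency.
          .explode('character')
          # Sum the repeated frequencies per character.
          .groupby('character')['frequency']
          .sum()
          .sort_values(ascending=False)
)

print(weighted)
```

Because a character that occurs twice in one lemma produces two rows, it is counted twice, which matches the definition of the weighted frequency given above.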
In [22]:
dict(list(weighted_characters.items())[:30])
Out[22]:
{'の': 50197.62000000002,
 'に': 29859.61000000001,
 'は': 23858.740000000005,
 'て': 28420.48999999998,
 'を': 20445.979999999996,
 'が': 21896.87999999998,
 'だ': 22435.67,
 'た': 25189.439999999988,
 'す': 38030.48,
 'る': 76544.48999999989,
 'と': 29886.07999999998,
 'ま': 16669.62,
 'で': 20441.469999999994,
 'な': 19293.459999999985,
 'い': 40478.36000000004,
 'も': 15020.910000000005,
 'あ': 9598.750000000005,
 '・': 6001.95,
 '「': 5690.07,
 '」': 5672.68,
 'こ': 13838.199999999997,
 'e': 5444.29,
 'か': 15954.139999999992,
 'o': 4590.55,
 'a': 4553.18,
 't': 4248.5,
 'れ': 12860.55,
 'ら': 9623.280000000002,
 ')': 3697.93,
 '(': 3661.64}
In [23]:
# Create the weighted character dataframe from the weighted_characters dictionary.
# Change the column names, sort the rows by descending frequency, and correct the index.
weighted_char_df = pd.DataFrame.from_dict(weighted_characters, orient = 'index')
weighted_char_df = weighted_char_df.reset_index()
weighted_char_df = weighted_char_df.rename({'index': 'character', 0: 'weighted_frequency'}, axis='columns')
weighted_char_df.sort_values('weighted_frequency', ascending = False, inplace = True)
weighted_char_df = weighted_char_df.reset_index(drop = True)
weighted_char_df.head()
Out[23]:
character weighted_frequency
0 76544.49
1 50197.62
2 40478.36
3 38030.48
4 29886.08
In [24]:
# Merge the character dataframes based on the character for each row.
char_df = char_df.merge(weighted_char_df)
char_df.sort_values('weighted_frequency', ascending = False, inplace = True)
char_df = char_df.reset_index(drop = True)
char_df
Out[24]:
character frequency weighted_frequency
0 1135 76544.49
1 89 50197.62
2 603 40478.36
3 338 38030.48
4 200 29886.08
... ... ... ...
2635 1 2.25
2636 1 2.25
2637 1 2.25
2638 1 2.24
2639 1 2.24

2640 rows × 3 columns

In [25]:
# Ensure that each row has a value for each column of data.
char_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2640 entries, 0 to 2639
Data columns (total 3 columns):
character             2640 non-null object
frequency             2640 non-null int64
weighted_frequency    2640 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 62.0+ KB

Create a rank column in the character dataframe that is equivalent to the rank column in the lemma dataframe.

In [26]:
# Create a rank column for the char_df to match the lemma df for both weighted and non-weighted frequency columns.
char_df['weighted_rank'] = char_df['weighted_frequency'].rank(method = 'first', ascending = False,)
char_df.sort_values('frequency', ascending = False, inplace = True)
char_df['rank'] = char_df['frequency'].rank(method = 'first', ascending = False,)
char_df
Out[26]:
character frequency weighted_frequency weighted_rank rank
0 1135 76544.49 1.0 1.0
22 873 10294.03 23.0 2.0
27 673 7547.24 28.0 3.0
2 603 40478.36 3.0 4.0
39 401 5061.23 40.0 5.0
... ... ... ... ... ...
2006 1 9.45 2007.0 2636.0
2007 1 9.44 2008.0 2637.0
2008 1 9.42 2009.0 2638.0
2009 1 9.42 2010.0 2639.0
2639 1 2.24 2640.0 2640.0

2640 rows × 5 columns
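
Many characters share the same frequency, so the choice of tie-breaking method matters for the rank columns. The sketch below compares method='first', as used above, with method='min' on a small hypothetical Series.

```python
import pandas as pd

counts = pd.Series([5, 3, 5, 1])

# 'first' breaks ties by order of appearance, so every rank is unique.
first = counts.rank(method='first', ascending=False)

# 'min' gives tied values the same (lowest) rank and leaves a gap afterwards.
shared = counts.rank(method='min', ascending=False)

print(pd.DataFrame({'count': counts, 'first': first, 'min': shared}))
```

With method='first' the result depends on row order, which is why sorting the dataframe before ranking affects which of the tied characters receives the better rank.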

In [27]:
# Fix the index and convert the rank and weighted_rank columns from float to int, stored with the object dtype.
char_df = char_df.reset_index(drop = True)
char_df['rank'] = char_df['rank'].astype('int').astype('object')
char_df['weighted_rank'] = char_df['weighted_rank'].astype('int').astype('object')
char_df
Out[27]:
character frequency weighted_frequency weighted_rank rank
0 1135 76544.49 1 1
1 873 10294.03 23 2
2 673 7547.24 28 3
3 603 40478.36 3 4
4 401 5061.23 40 5
... ... ... ... ... ...
2635 1 9.45 2007 2636
2636 1 9.44 2008 2637
2637 1 9.42 2009 2638
2638 1 9.42 2010 2639
2639 1 2.24 2640 2640

2640 rows × 5 columns

In [28]:
# Reorder the columns for human readability.
char_df = char_df[['rank', 'frequency','character', 'weighted_frequency', 'weighted_rank',]]
char_df
Out[28]:
rank frequency character weighted_frequency weighted_rank
0 1 1135 76544.49 1
1 2 873 10294.03 23
2 3 673 7547.24 28
3 4 603 40478.36 3
4 5 401 5061.23 40
... ... ... ... ... ...
2635 2636 1 9.45 2007
2636 2637 1 9.44 2008
2637 2638 1 9.42 2009
2638 2639 1 9.42 2010
2639 2640 1 2.24 2640

2640 rows × 5 columns

Each character needs to be tagged with its appropriate script type.

We can easily classify each character by looking up its Unicode code point.

In [29]:
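def classify_char_script(char):
    """Return the script name for the given character based on its Unicode code point."""

    code_point = ord(char)
    # Everything up to U+218F is grouped as 'latin'. This range also covers
    # other blocks, such as Greek, Cyrillic, and general punctuation.
    if 0 <= code_point <= 8591:
        return 'latin'
    # U+3000 to U+303F: CJK Symbols and Punctuation.
    elif 12288 <= code_point <= 12351:
        return 'punctuation'
    # U+3040 to U+309F: Hiragana.
    elif 12352 <= code_point <= 12447:
        return 'hiragana'
    # U+30A0 to U+30FF: Katakana.
    elif 12448 <= code_point <= 12543:
        return 'katakana'
    # U+4E00 to U+9FAF: CJK Unified Ideographs.
    elif 19968 <= code_point <= 40879:
        return 'kanji'
    # U+FF00 to U+FF5E: full-width ASCII variants.
    elif 65280 <= code_point <= 65374:
        return 'full-width_roman'
    # U+FF5F to U+FFEF: the rest of the Halfwidth and Fullwidth Forms block,
    # labeled here as half-width katakana.
    elif 65375 <= code_point <= 65519:
        return 'half-width_katakana'
    else:
        return 'other'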
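
As a cross-check on the code point ranges, a character's script can also be read from its official Unicode name via the standard library's unicodedata module. The helper classify_by_name below is hypothetical and is not part of the project. It is only a sketch, and its categories differ slightly from the function above (for example, digits and punctuation fall into 'other').

```python
import unicodedata

def classify_by_name(char):
    """Classify a character by the prefix of its official Unicode name."""
    # unicodedata.name returns the default '' for unnamed code points.
    name = unicodedata.name(char, '')
    if name.startswith('HIRAGANA'):
        return 'hiragana'
    if name.startswith('HALFWIDTH KATAKANA'):
        return 'half-width_katakana'
    if name.startswith('KATAKANA'):
        return 'katakana'
    if name.startswith('CJK UNIFIED IDEOGRAPH'):
        return 'kanji'
    if name.startswith('FULLWIDTH LATIN'):
        return 'full-width_roman'
    if name.startswith('LATIN'):
        return 'latin'
    return 'other'

for char in 'るン日aＡ':
    print(char, classify_by_name(char))
```

Comparing the two classifiers on the character dataframe would be one way to spot characters that the numeric ranges label unexpectedly.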
In [30]:
# Classify each character in the char_df
char_df['script'] = char_df['character'].apply(classify_char_script)
C:\Users\aedri\Anaconda3\envs\tf1\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
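
The SettingWithCopyWarning above most likely appears because char_df was last produced by selecting a list of columns from another dataframe, so pandas cannot tell whether the new column is being written to a copy. A common way to avoid the warning is to take an explicit copy when subsetting. The sketch below shows this on hypothetical data; toy_df and subset are illustrative names.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'character': ['る', 'ー'],
    'frequency': [1135, 873],
    'unused': [0, 0],
})

# .copy() makes the column subset an independent dataframe,
# so later column assignments are unambiguous.
subset = toy_df[['frequency', 'character']].copy()
subset['ord'] = subset['character'].map(ord)

print(subset)
```

The assignment still succeeds without the copy, as the later outputs in this notebook show, so the warning here is a caution rather than an error.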
  
In [31]:
# Check the total number of characters in each script in this dataset.
char_df['script'].value_counts()
Out[31]:
kanji                  2279
katakana                 84
hiragana                 83
latin                    80
full-width_roman         74
punctuation              19
other                    19
half-width_katakana       2
Name: script, dtype: int64

Adding the characters' Unicode code point to each row may be useful for later sorting or testing.

In [32]:
def get_ord(row):
    """Return the character's integer representing Unicode code point from the given row."""
    return ord(row['character'])
In [33]:
# Create a column that holds each character's integer representing Unicode code point for reference.
char_df['ord'] = char_df.apply(get_ord, axis = 1)
char_df
Out[33]:
rank frequency character weighted_frequency weighted_rank script ord
0 1 1135 76544.49 1 hiragana 12427
1 2 873 10294.03 23 katakana 12540
2 3 673 7547.24 28 katakana 12531
3 4 603 40478.36 3 hiragana 12356
4 5 401 5061.23 40 katakana 12473
... ... ... ... ... ... ... ...
2635 2636 1 9.45 2007 kanji 29677
2636 2637 1 9.44 2008 kanji 36795
2637 2638 1 9.42 2009 kanji 25524
2638 2639 1 9.42 2010 kanji 25375
2639 2640 1 2.24 2640 kanji 24265

2640 rows × 7 columns

The Japanese-Language Proficiency Test (JLPT) is an extremely influential standardized test used to evaluate a non-native student's Japanese ability.

It consists of 5 different levels, the n5, n4, n3, n2, and n1 exams. The n5 is the easiest, testing beginner concepts, and the n1 is the most advanced, testing the ability to understand Japanese in virtually any circumstance.

Each character should be tagged with its appropriate JLPT exam level.

In [34]:
# Read in the JLPT level dataset and rename its kanji column to match the character dataframe.
jlpt_df = pd.read_csv('jlpt_levels.csv')
jlpt_df = jlpt_df.rename(columns={"kanji": "character"})
jlpt_df.head()
Out[34]:
character jlpt_level
0 5
1 5
2 5
3 5
4 5
In [35]:
# Use the object dtype for the jlpt_level column because they are categorical, not quantitative.
char_df = char_df.merge(jlpt_df, how = 'left')
char_df['jlpt_level'] = char_df['jlpt_level'].astype('object')
char_df
Out[35]:
rank frequency character weighted_frequency weighted_rank script ord jlpt_level
0 1 1135 76544.49 1 hiragana 12427 NaN
1 2 873 10294.03 23 katakana 12540 NaN
2 3 673 7547.24 28 katakana 12531 NaN
3 4 603 40478.36 3 hiragana 12356 NaN
4 5 401 5061.23 40 katakana 12473 NaN
... ... ... ... ... ... ... ... ...
2635 2636 1 9.45 2007 kanji 29677 1
2636 2637 1 9.44 2008 kanji 36795 NaN
2637 2638 1 9.42 2009 kanji 25524 NaN
2638 2639 1 9.42 2010 kanji 25375 2
2639 2640 1 2.24 2640 kanji 24265 1

2640 rows × 8 columns
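
A left merge keeps every row of the left dataframe and fills unmatched rows with NaN, which is why the kana rows above have no level yet. The sketch below shows this on hypothetical data and uses the indicator option of DataFrame.merge to make the unmatched rows explicit.

```python
import pandas as pd

chars = pd.DataFrame({'character': ['日', 'る', 'a']})
levels = pd.DataFrame({'character': ['日', '月'], 'jlpt_level': [5, 5]})

# indicator=True adds a _merge column with 'both' or 'left_only' for each row.
merged = chars.merge(levels, how='left', on='character', indicator=True)

print(merged)
print(merged['_merge'].value_counts())
```

Rows marked left_only are the characters with no entry in the level dataset. Characters that exist only in the level dataset are dropped by a left merge, which would explain a lower n1 count in the merged dataframe than in the source dataset.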

In [36]:
# Check the total number of characters in each JLPT exam level in the JLPT dataset.
jlpt_df['jlpt_level'].value_counts()
Out[36]:
1    1235
3     370
2     368
4     167
5      80
Name: jlpt_level, dtype: int64
In [37]:
# Check the total number of characters in each JLPT exam level in the character dataframe.
char_df['jlpt_level'].value_counts()
Out[37]:
1.0    963
3.0    370
2.0    368
4.0    167
5.0     80
Name: jlpt_level, dtype: int64

The jlpt_level values are floats rather than ints because the unmatched rows hold NaN, which forces a float representation.

In [38]:
# Replace the float level values with the exam level names (n1 through n5).
char_df.loc[char_df['jlpt_level'] == 1.0, 'jlpt_level'] = 'n1'
char_df.loc[char_df['jlpt_level'] == 2.0, 'jlpt_level'] = 'n2'
char_df.loc[char_df['jlpt_level'] == 3.0, 'jlpt_level'] = 'n3'
char_df.loc[char_df['jlpt_level'] == 4.0, 'jlpt_level'] = 'n4'
char_df.loc[char_df['jlpt_level'] == 5.0, 'jlpt_level'] = 'n5'
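
The five assignments can also be expressed as a single mapping. This is an alternative sketch on a hypothetical Series, not the code used in the project. Series.map leaves values that are missing from the dictionary, including NaN, as NaN.

```python
import numpy as np
import pandas as pd

levels = pd.Series([1.0, 5.0, np.nan, 3.0])

# Build {1.0: 'n1', ..., 5.0: 'n5'} and apply it to every value.
level_names = {float(n): f'n{n}' for n in range(1, 6)}
labels = levels.map(level_names)

print(labels)
```

Keeping the NaN values at this step preserves the ability to find the untagged characters afterwards with isna().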
In [39]:
char_df['jlpt_level'].value_counts()
Out[39]:
n1    963
n3    370
n2    368
n4    167
n5     80
Name: jlpt_level, dtype: int64

Hiragana and katakana characters are considerably easier to learn than nearly all kanji, and both syllabaries are expected to be known before the JLPT n5 exam is taken.

The hiragana and katakana characters will be set to the easiest JLPT exam level, the n5.

In [40]:
# Set all hiragana and katakana characters to the easiest JLPT exam level.
char_df.loc[char_df['script'] == 'hiragana', 'jlpt_level'] = 'n5'
char_df.loc[char_df['script'] == 'katakana', 'jlpt_level'] = 'n5'
char_df.loc[char_df['script'] == 'half-width_katakana', 'jlpt_level'] = 'n5'
char_df
Out[40]:
rank frequency character weighted_frequency weighted_rank script ord jlpt_level
0 1 1135 76544.49 1 hiragana 12427 n5
1 2 873 10294.03 23 katakana 12540 n5
2 3 673 7547.24 28 katakana 12531 n5
3 4 603 40478.36 3 hiragana 12356 n5
4 5 401 5061.23 40 katakana 12473 n5
... ... ... ... ... ... ... ... ...
2635 2636 1 9.45 2007 kanji 29677 n1
2636 2637 1 9.44 2008 kanji 36795 NaN
2637 2638 1 9.42 2009 kanji 25524 NaN
2638 2639 1 9.42 2010 kanji 25375 n2
2639 2640 1 2.24 2640 kanji 24265 n1

2640 rows × 8 columns

In [41]:
# Check the new total number of characters in each JLPT exam level in the character dataframe.
char_df['jlpt_level'].value_counts()
Out[41]:
n1    963
n3    370
n2    368
n5    249
n4    167
Name: jlpt_level, dtype: int64
In [42]:
# Visualize the total number of characters in each JLPT exam level in the character dataframe.
plot_data = char_df['jlpt_level'].value_counts()
x = plot_data.keys().tolist()
y = plot_data.values.tolist()
ax = sns.barplot(x = x, y = y, order = ['n5', 'n4', 'n3', 'n2', 'n1'])
ax.set(xlabel = 'JLPT Exam Level', ylabel = 'Number of Characters')
plt.title('JLPT Exam Level Character Distribution')
plt.show()

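After the initial investment of learning both hiragana and katakana with basic kanji, the number of new kanji expected generally grows with each JLPT exam level. The n3 and n2 levels are nearly equal at 370 and 368, and the n1 level adds by far the most at 963.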

The JLPT tests fluency, but cannot be truly comprehensive. Many kanji characters are not regularly used by even native speakers, so these are not tested.

Set every character that still has no JLPT level to the value n0. There is no n0 JLPT exam, but this signifies that the character is beyond the exams. Note that this also tags non-Japanese characters, such as Latin letters and punctuation, as n0.

In [43]:
# Set the JLPT exam level to n0 for every character that does not have a jlpt_level value yet.
char_df.loc[char_df['jlpt_level'].isna(), 'jlpt_level'] = 'n0'
In [44]:
# Check the new total number of characters in each JLPT exam level in the character dataframe.
char_df['jlpt_level'].value_counts()
Out[44]:
n1    963
n0    523
n3    370
n2    368
n5    249
n4    167
Name: jlpt_level, dtype: int64
In [45]:
# Visualize the new total number of characters in each JLPT exam level in the character dataframe with a barplot.
plot_data = char_df['jlpt_level'].value_counts()
x = plot_data.keys().tolist()
y = plot_data.values.tolist()
ax = sns.barplot(x = x, y = y, order = ['n5', 'n4', 'n3', 'n2', 'n1', 'n0'])
ax.set(xlabel = 'JLPT Exam Level', ylabel = 'Number of Characters')
plt.title('JLPT Exam Level Character Distribution Including Advanced Characters')
plt.show()

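There are far more kanji than are represented on this graph, but these counts cover all of the characters in the 15,000 most common lemmas of the Japanese language. 523 of these characters are not included on the JLPT exams. Going by the script counts above, roughly 331 of them are kanji that are still common enough to plan to learn eventually, and the remaining 192 or so are Latin, full-width Roman, punctuation, and other characters.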

The stroke count, or the total number of strokes needed to write the character, is another metric for assessing the difficulty of learning Japanese characters. The higher the stroke count, the more individual pieces need to be memorized and recalled correctly.

Each character will be tagged with the appropriate stroke count.

Characters that have a debatable stroke count will be given the higher number of the possibilities.

In [46]:
# Read in the kanji.json file to get the stroke count of each character.
json_df = pd.read_json('kanji.json', encoding = 'UTF-8')
json_df.head()
Out[46]:
kanji occurrence stroke grade radical onyomi kunyomi nanori meaning
0 9999 7 表外漢字 (hyōgai kanji) a cup with pendants, a pennant, wild, barren, ...
1 9999 13 表外漢字 (hyōgai kanji) to help, to assist, to achieve, to rise, to raise
2 9999 5 表外漢字 (hyōgai kanji)
3 9999 8 表外漢字 (hyōgai kanji) ragged clothing, ragged, old and wear out
4 9999 10 表外漢字 (hyōgai kanji) a vase, a pitcher, earthenware
In [47]:
# Check the total number of characters in each grade group in the kanji dataset.
json_df['grade'].value_counts()
Out[47]:
常用漢字 (jōyō kanji)         1041
教育漢字 (kyōiku kanji)       1006
表外漢字 (hyōgai kanji)        426
人名用漢字 (jinmeiyō kanji)     373
Name: grade, dtype: int64

While the grade grouping information is very useful, breaking it down into individual grade level will be more useful, so this information will be left out.

In [48]:
# Sort by stroke count.
json_df.sort_values('stroke', inplace = True)
json_df = json_df.reset_index(drop = True)
json_df
Out[48]:
kanji occurrence stroke grade radical onyomi kunyomi nanori meaning
0 9999 1 表外漢字 (hyōgai kanji) チュ dot, tick or dot radical (no. 3)
1 丿 9999 1 表外漢字 (hyōgai kanji) 丿 ヘツ えい, よう katakana no radical (no. 4)
2 9999 1 表外漢字 (hyōgai kanji) なが.れる
3 1841 1 常用漢字 (jōyō kanji) 乙(⺄,乚) オツ, イツ おと-, きのと the latter, duplicate, strange, witty, fishhoo...
4 9999 1 表外漢字 (hyōgai kanji) 乙(⺄,乚) イン, オン かく.す, かく.れる, かか.す, よ.る hidden, mysterious, secret, to conceal, small,...
... ... ... ... ... ... ... ... ... ...
2841 2482 23 人名用漢字 (jinmeiyō kanji) ソン, セン, ザン ます salmon trout
2842 1391 23 常用漢字 (jōyō kanji) 金(釒) カン かんが.みる, かがみ あき, あきら specimen, take warning from, learn from
2843 2494 24 人名用漢字 (jinmeiyō kanji) リン うろこ, こけ, こけら scales (fish)
2844 2172 24 人名用漢字 (jinmeiyō kanji) さぎ heron
2845 1676 24 人名用漢字 (jinmeiyō kanji) ヨウ, オウ たか hawk

2846 rows × 9 columns

In [49]:
# Rename the kanji column to character as in all other dataframes.
json_df = json_df.rename(columns={"kanji": "character"})
json_df.head()
Out[49]:
character occurrence stroke grade radical onyomi kunyomi nanori meaning
0 9999 1 表外漢字 (hyōgai kanji) チュ dot, tick or dot radical (no. 3)
1 丿 9999 1 表外漢字 (hyōgai kanji) 丿 ヘツ えい, よう katakana no radical (no. 4)
2 9999 1 表外漢字 (hyōgai kanji) なが.れる
3 1841 1 常用漢字 (jōyō kanji) 乙(⺄,乚) オツ, イツ おと-, きのと the latter, duplicate, strange, witty, fishhoo...
4 9999 1 表外漢字 (hyōgai kanji) 乙(⺄,乚) イン, オン かく.す, かく.れる, かか.す, よ.る hidden, mysterious, secret, to conceal, small,...
In [50]:
# Check for any oddities.
char_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2640 entries, 0 to 2639
Data columns (total 8 columns):
rank                  2640 non-null object
frequency             2640 non-null int64
character             2640 non-null object
weighted_frequency    2640 non-null float64
weighted_rank         2640 non-null object
script                2640 non-null object
ord                   2640 non-null int64
jlpt_level            2640 non-null object
dtypes: float64(1), int64(2), object(5)
memory usage: 185.6+ KB
In [51]:
# Get the stroke count for each character from the json_df for the char_df.
char_df['stroke_count'] = char_df['character'].map(json_df.set_index('character')['stroke'])
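
The lookup above relies on Series.map with another Series acting as a lookup table indexed by character. The sketch below shows the same pattern on hypothetical data. Dropping duplicate characters first is an added precaution, since duplicates in the lookup index can cause map to fail, and counting the NaN results shows how many characters the lookup did not cover.

```python
import pandas as pd

chars = pd.DataFrame({'character': ['日', '月', 'る']})
kanji = pd.DataFrame({'character': ['日', '月', '日'], 'stroke': [4, 4, 4]})

# Build a character -> stroke lookup with a unique index.
lookup = kanji.drop_duplicates('character').set_index('character')['stroke']

# Characters that are absent from the lookup receive NaN.
chars['stroke_count'] = chars['character'].map(lookup)
missing = chars['stroke_count'].isna().sum()

print(chars)
print(missing)
```

The resulting column is float because of the NaN entries, which matches the stroke_count values such as 10.0 shown in this notebook.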
In [52]:
char_df
Out[52]:
rank frequency character weighted_frequency weighted_rank script ord jlpt_level stroke_count
0 1 1135 76544.49 1 hiragana 12427 n5 NaN
1 2 873 10294.03 23 katakana 12540 n5 NaN
2 3 673 7547.24 28 katakana 12531 n5 NaN
3 4 603 40478.36 3 hiragana 12356 n5 NaN
4 5 401 5061.23 40 katakana 12473 n5 NaN
... ... ... ... ... ... ... ... ... ...
2635 2636 1 9.45 2007 kanji 29677 n1 10.0
2636 2637 1 9.44 2008 kanji 36795 n0 5.0
2637 2638 1 9.42 2009 kanji 25524 n0 NaN
2638 2639 1 9.42 2010 kanji 25375 n2 9.0
2639 2640 1 2.24 2640 kanji 24265 n1 13.0

2640 rows × 9 columns

Many of the stroke counts are still missing. The remaining data will be obtained from another source. While this is being collected, the grade and frequency rating will be collected as well.

In [53]:
# Create empty columns for the grade and frequency rating values.
char_df['grade'] = np.nan
char_df['frequency_rating'] = np.nan
In [54]:
char_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2640 entries, 0 to 2639
Data columns (total 11 columns):
rank                  2640 non-null object
frequency             2640 non-null int64
character             2640 non-null object
weighted_frequency    2640 non-null float64
weighted_rank         2640 non-null object
script                2640 non-null object
ord                   2640 non-null int64
jlpt_level            2640 non-null object
stroke_count          2115 non-null float64
grade                 0 non-null float64
frequency_rating      0 non-null float64
dtypes: float64(4), int64(2), object(5)
memory usage: 247.5+ KB
In [55]:
char_df
Out[55]:
rank frequency character weighted_frequency weighted_rank script ord jlpt_level stroke_count grade frequency_rating
0 1 1135 76544.49 1 hiragana 12427 n5 NaN NaN NaN
1 2 873 10294.03 23 katakana 12540 n5 NaN NaN NaN
2 3 673 7547.24 28 katakana 12531 n5 NaN NaN NaN
3 4 603 40478.36 3 hiragana 12356 n5 NaN NaN NaN
4 5 401 5061.23 40 katakana 12473 n5 NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ...
2635 2636 1 9.45 2007 kanji 29677 n1 10.0 NaN NaN
2636 2637 1 9.44 2008 kanji 36795 n0 5.0 NaN NaN
2637 2638 1 9.42 2009 kanji 25524 n0 NaN NaN NaN
2638 2639 1 9.42 2010 kanji 25375 n2 9.0 NaN NaN
2639 2640 1 2.24 2640 kanji 24265 n1 13.0 NaN NaN

2640 rows × 11 columns

That dataset was not complete enough, so the stroke count will be scraped from http://www.edrdg.org/.

Additionally, the school grade that the character is typically learned in will be scraped as well.

In [56]:
from bs4 import Tag

def char_lookup(char):
    """Scrapes the character's corresponding webpage at http://www.edrdg.org/ and sets the character's char_df stroke_count, grade, and frequency_rating column values to the scraped information."""

    try:
        url_base = 'http://www.edrdg.org/cgi-bin/wwwjdic/wwwjdic?1MMJ'

        # Create the URL for the current character's page.
        current_url = url_base + str(char)

        # Request the current character's page.
        page = requests.get(current_url)

        # Parse the page with BS4.
        # This source has extra </b></td></tr> tags that break the python default HTML parser.
        # Use the external lxml parser instead.
        soup = BeautifulSoup(page.content, 'lxml')

        table = soup.find_all("table")[1]
        
        stroke_element = table.find("td", string = 'Stroke Count')
        stroke_count = str(stroke_element.next_sibling)[7:-9]
        
        # Some kanji have different possibilities based on writing style.
        # Calculate both and then assume the higher.
        if ' ' in stroke_count:
            stroke_count = stroke_count[:-1]
            before = str(stroke_count)
            
            # Split the alternatives on whitespace and take the higher stroke count.
            first, second = stroke_count.split()[:2]
            stroke_count = max(int(first), int(second))
            
            after = str(stroke_count)
            print(char + ': ' + before + ' -> ' + after)
            
        #print('stroke_count: ' + stroke_count)
        
        try:
            grade_element = table.find("td", string = 'Grade')
            grade = str(grade_element.next_sibling)[7:-9]
            #print('grade: ' + grade)
        except:
            print(char + ': Grade not found.')
            grade = np.nan
        
        try:
            freq_element = table.find("td", string = 'Frequency ranking')
            frequency_ranking = str(freq_element.next_sibling)[7:-9]
            #print('frequency_ranking: ' + frequency_ranking)
        except:
            print(char + ': No frequency ranking found.')
            frequency_ranking = np.nan
        
        # Save the results
        index = char_df.loc[char_df['character'] == char].index.item()
        char_df.at[index, 'stroke_count'] = stroke_count
        char_df.at[index, 'grade'] = grade
        char_df.at[index, 'frequency_ranking'] = frequency_ranking
            
        print(char + ': Success')
        
    except:
        print(char + ': Failed')
In [57]:
# Test the function on a simple and common character
char_lookup('本')
本: Success
C:\Users\aedri\Anaconda3\envs\tf1\lib\site-packages\ipykernel_launcher.py:64: FutureWarning: `item` has been deprecated and will be removed in a future version
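
The two-value case handled inside char_lookup() can be looked at in isolation. The following is a minimal sketch rather than the notebook's own code: pick_stroke_count() is a hypothetical helper, and it assumes that alternative counts arrive as a whitespace-separated string such as '4 5'.

```python
def pick_stroke_count(raw):
    """Return the highest stroke count found in a whitespace-separated string.

    Hypothetical helper for illustration; it assumes every token is an integer.
    """
    counts = [int(token) for token in raw.split()]
    if not counts:
        raise ValueError('no stroke count found in ' + repr(raw))
    return max(counts)


# A single value is returned unchanged; with several, the highest wins.
single = pick_stroke_count('13')
alternatives = pick_stroke_count('4 5')
```

Taking the higher value mirrors the choice made in char_lookup(); a different policy, such as the lower value, would only require swapping max() for min().
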
In [58]:
# Note that this takes a long time.
# This block is commented out to avoid accidentally re-scraping it all.
'''
# Use the char_lookup function to scrape the stroke_count, grade, and frequency_rating column values for each character, and then save the resulting dataframe to avoid having to rescrape the data.
char_df['character'].map(char_lookup)
char_df.to_csv('char_df_with_strokes.csv')
'''
Out[58]:
"\n# Use the char_lookup function to scrape the stroke_count, grade, and frequency_rating column values for each character, and then save the resulting dataframe to avoid having to rescrape the data.\nchar_df['character'].map(char_lookup)\nchar_df.to_csv('char_df_with_strokes.csv')\n"
In [59]:
# Presume that the entire notebook is being run, and reload the previously saved dataframe that includes the scraped data.
char_df = pd.read_csv('char_df_with_strokes.csv', index_col = 0)
In [60]:
# Check to see if any kanji characters' stroke_counts are missing and need to be re-scraped.
chars_to_redo = char_df.loc[(char_df['stroke_count'].isnull()) & (char_df['script'] == 'kanji')].index.tolist()
chars_to_redo
Out[60]:
[]
In [61]:
# Re-scrape any missing stroke_counts.
for index in chars_to_redo:
    char_lookup(char_df.at[index, 'character'])

The stroke counts for the hiragana and katakana characters will be added manually, because this is simpler than scraping them from a different source.

In [62]:
# Create a dictionary with each hiragana character's stroke count.
hiragana_strokes = {
    'あ': 3,
    'い': 2,
    'う': 2,
    'え': 2,
    'お': 3,
    'か': 3,
    'き': 3,
    'く': 1,
    'け': 3,
    'こ': 2,
    'さ': 2,
    'し': 1,
    'す': 2,
    'せ': 3,
    'そ': 1,
    'た': 4,
    'ち': 2,
    'つ': 1,
    'て': 1,
    'と': 2,
    'な': 4,
    'に': 3,
    'ぬ': 2,
    'ね': 2,
    'の': 1,
    'は': 3,
    'ひ': 1,
    'ふ': 4,
    'へ': 1,
    'ほ': 4,
    'ま': 3,
    'み': 2,
    'む': 3,
    'め': 2,
    'も': 3,
    'や': 3,
    'ゆ': 2,
    'よ': 2,
    'ら': 2,
    'り': 2,
    'る': 1,
    'れ': 2,
    'ろ': 1,
    'わ': 2,
    'を': 3,
    'ん': 1,
    'が': 3,
    'ぎ': 3,
    'ぐ': 1,
    'げ': 3,
    'ご': 2,
    'ざ': 2,
    'じ': 1,
    'ず': 2,
    'ぜ': 3,
    'ぞ': 1,
    'だ': 4,
    'ぢ': 2,
    'づ': 1,
    'で': 1,
    'ど': 2,
    'ば': 3,
    'び': 1,
    'ぶ': 4,
    'べ': 1,
    'ぼ': 4,
    'ぱ': 3,
    'ぴ': 1,
    'ぷ': 4,
    'ぺ': 1,
    'ぽ': 4,
    'ゃ': 3,
    'ゅ': 2,
    'ょ': 2,
    ' ゙': 2,
    '゜': 1,
    'ゐ': 1,
    'ゑ': 1
}
In [63]:
# Create a dictionary with each katakana and needed computer symbol character's stroke count.
katakana_strokes = {
    'ア': 2,
    'イ': 2,
    'ウ': 3,
    'エ': 3,
    'オ': 3,
    'カ': 2,
    'キ': 3,
    'ク': 2,
    'ケ': 3,
    'コ': 2,
    'サ': 3,
    'シ': 3,
    'ス': 2,
    'セ': 2,
    'ソ': 2,
    'タ': 3,
    'チ': 3,
    'ツ': 3,
    'テ': 3,
    'ト': 2,
    'ナ': 2,
    'ニ': 2,
    'ヌ': 2,
    'ネ': 4,
    'ノ': 1,
    'ハ': 2,
    'ヒ': 2,
    'フ': 1,
    'ヘ': 1,
    'ホ': 4,
    'マ': 2,
    'ミ': 3,
    'ム': 2,
    'メ': 2,
    'モ': 3,
    'ヤ': 2,
    'ユ': 2,
    'ヨ': 3,
    'ラ': 2,
    'リ': 2,
    'ル': 2,
    'レ': 1,
    'ロ': 3,
    'ワ': 2,
    'ヲ': 3,
    'ン': 2,
    'ガ': 2,
    'ギ': 3,
    'グ': 2,
    'ゲ': 3,
    'ゴ': 2,
    'ザ': 3,
    'ジ': 3,
    'ズ': 2,
    'ゼ': 2,
    'ゾ': 2,
    'ダ': 3,
    'ヂ': 3,
    'ヅ': 3,
    'デ': 3,
    'ド': 2,
    'バ': 2,
    'ビ': 2,
    'ブ': 1,
    'ベ': 1,
    'ボ': 4,
    'パ': 2,
    'ピ': 2,
    'プ': 1,
    'ペ': 1,
    'ポ': 4,
    'ャ': 2,
    'ュ': 2,
    'ョ': 3,
    'ヰ': 4,
    'ヱ': 3,
    # Nonbasic characters below here.
    'ー': 1,
    'ィ': 2,
    '々': 3,
    'ェ': 3,
    'ァ': 2,
    'ォ': 3,
    'ぁ': 3,
    'ヴ': 3,
    '―': 1,
    '─': 1,
    'ヶ': 3,
    'ぇ': 2,
    'ゝ': 1,
    'ぉ': 3,
    '¥': 4,
    '□': 3,
    'ゞ': 1,
    '〒': 3,
    'ヵ': 2,
    '・': 1,
    'T': 2,
    '0': 1,
    '1': 1,
    '2': 1,
    '3': 1,
    '4': 2,
    '5': 2,
    '6': 1,
    '7': 1,
    '8': 1,
    '9': 1,
    '「': 1,
    '」': 1,
    '(': 1,
    ')': 1,
    '{': 1,
    '}': 1,
    '’': 1,
    '”': 2,
    '<': 1,
    '>': 1,
    '、': 1,
    '。': 1,
    '・': 1,
    '?': 2,
    '゛': 2,
    '〜': 1,
    # W杯 / W-hai for World Cup
    'W': 1,
    # Tシャツ / T-shatsu for T-Shirt
    'T': 2,
    # Jリーグ / J-riigu for J1 League
    'J': 2,
    #  ̄ Upperscore / Macron for Hepburn long vowel notation
    ' ̄': 1,
    # ヽ Katakana iteration mark
    'ヽ': 1,
    # ヾ Katakana dakuten / voiced iteration mark
    'ヾ': 3,
    # ゛ Dakuten
    '゛': 2
}
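
The two dictionaries above give each voiced or semi-voiced kana the same stroke count as its base character. As a rough sketch of how that repetition could be avoided, Unicode canonical decomposition (NFD) splits such a kana into its base character plus a combining mark. The names base_strokes, base_kana, and kana_strokes are hypothetical, and only a handful of base characters are included here.

```python
import unicodedata

# Base-character stroke counts only; values copied from the dictionaries above.
base_strokes = {'か': 3, 'は': 3, 'ハ': 2, 'ウ': 3}


def base_kana(char):
    """Return the kana with any dakuten or handakuten removed."""
    # NFD splits a voiced kana into its base character followed by a combining mark.
    return unicodedata.normalize('NFD', char)[0]


def kana_strokes(char):
    """Look up the stroke count of the base kana, ignoring voicing marks."""
    return base_strokes[base_kana(char)]
```

This only reproduces the convention already used above, where the strokes of the voicing mark itself are not counted.
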
In [64]:
# Set the stroke_counts and grade for the hiragana manually.
for char in hiragana_strokes.keys():
    try:
        index = char_df.loc[char_df['character'] == char].index.item()
        char_df.at[index, 'stroke_count'] = hiragana_strokes[char]
        char_df.at[index, 'grade'] = 0
    except:
        print(char + ': Not in char_df')
C:\Users\aedri\Anaconda3\envs\tf1\lib\site-packages\ipykernel_launcher.py:4: FutureWarning: `item` has been deprecated and will be removed in a future version
  after removing the cwd from sys.path.
 ゙: Not in char_df
ゑ: Not in char_df
In [65]:
# Set the stroke_counts and grade for the katakana manually.
for char in katakana_strokes.keys():
    try:
        index = char_df.loc[char_df['character'] == char].index.item()
        char_df.at[index, 'stroke_count'] = katakana_strokes[char]
        char_df.at[index, 'grade'] = 0
    except:
        print(char + ': Not in char_df')
ヂ: Not in char_df
ヅ: Not in char_df
ヰ: Not in char_df
ヱ: Not in char_df
”: Not in char_df
、: Not in char_df
。: Not in char_df
C:\Users\aedri\Anaconda3\envs\tf1\lib\site-packages\ipykernel_launcher.py:4: FutureWarning: `item` has been deprecated and will be removed in a future version
  after removing the cwd from sys.path.
In [66]:
# Display the number of characters with stroke counts, the number of characters still without stroke counts, and the fraction of characters still without stroke counts.
num_has_strokes = char_df['stroke_count'].count()
num_without_strokes = len(char_df) - num_has_strokes
percent_without_strokes = num_without_strokes / len(char_df)
num_has_strokes, num_without_strokes, percent_without_strokes
Out[66]:
(2479, 161, 0.06098484848484848)
In [67]:
# Create a list of characters that still do not have stroke counts to consider adding them individually.
char_to_manually_add = []

for char in char_df.loc[char_df['stroke_count'].isnull()]['character']:
    char_to_manually_add.append(char[0])

Some characters will still not have a stroke count or grade value, but these should only include non-Japanese characters that are irrelevant to this project.

A dataframe that only contains the relevant characters will be created later, after ensuring that the remaining characters will not be needed.
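
One way to sanity-check the leftover characters is to group them by Unicode general category, so that letters, punctuation, and symbols can be reviewed separately. This is only a sketch: group_by_category() is a hypothetical helper, and the short input list is a sample of the kind of characters involved rather than the full list.

```python
import unicodedata
from collections import defaultdict


def group_by_category(chars):
    """Group characters by the first letter of their Unicode general category."""
    groups = defaultdict(list)
    for char in chars:
        # 'L' = letter, 'N' = number, 'P' = punctuation, 'S' = symbol.
        groups[unicodedata.category(char)[0]].append(char)
    return dict(groups)


# A few characters of the kind that remain without stroke counts.
leftovers = ['F', 'm', '&', '+']
grouped = group_by_category(leftovers)
```

Anything in the letter group that is not a Latin or Greek letter would be a candidate for adding to the dictionaries by hand.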

In [68]:
# Search through these and retroactively add the remaining non-Latin
# and non-punctuation characters to the hiragana and katakana dictionaries.
char_to_manually_add
Out[68]:
['F',
 'P',
 'N',
 'O',
 'S',
 'D',
 'B',
 'C',
 'G',
 'K',
 'M',
 'm',
 'A',
 'I',
 'k',
 'A',
 'R',
 'V',
 'H',
 '−',
 'c',
 'g',
 'C',
 'X',
 'U',
 'L',
 'Y',
 '&',
 'b',
 'e',
 'E',
 'D',
 'Q',
 'U',
 'Z',
 'β',
 'p',
 'x',
 'σ',
 '÷',
 'μ',
 'Σ',
 'ε',
 'Ω',
 'a',
 '〆',
 'v',
 's',
 'E',
 'R',
 'I',
 'M',
 '…',
 ',',
 'P',
 'k',
 'w',
 'g',
 ':',
 'O',
 'N',
 '`',
 'G',
 '■',
 'H',
 'W',
 'v',
 'F',
 '○',
 '.',
 'L',
 '』',
 '『',
 'f',
 'B',
 'y',
 '!',
 'c',
 'l',
 'm',
 's',
 'r',
 'n',
 'd',
 '‐',
 'S',
 'b',
 '|',
 '#',
 'u',
 'p',
 '】',
 '【',
 '_',
 '+',
 '[',
 '▼',
 'Z',
 '△',
 '←',
 '〇',
 '※',
 'w',
 'Q',
 '‥',
 '〔',
 '〕',
 '_',
 '^',
 ']',
 '▲',
 '↑',
 'q',
 '《',
 'Δ',
 '》',
 '@',
 ';',
 '↓',
 '◇',
 '◎',
 '×',
 '☆',
 '*',
 '^',
 '=',
 '/',
 '%',
 'x',
 '●',
 'V',
 '★',
 '$',
 'j',
 'J',
 'K',
 '▽',
 'Y',
 'z',
 '\\',
 '~',
 '◆',
 '|',
 '→',
 '´',
 '〉',
 '〈',
 'X',
 'i',
 'α',
 '$',
 '{',
 'a',
 'o',
 '‘',
 'e',
 '}',
 't',
 '〓',
 '〃',
 'ω']
In [69]:
# Check column dtypes.
char_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2640 entries, 0 to 2639
Data columns (total 11 columns):
rank                  2640 non-null int64
frequency             2640 non-null int64
character             2640 non-null object
weighted_frequency    2640 non-null float64
weighted_rank         2640 non-null int64
script                2640 non-null object
ord                   2640 non-null int64
jlpt_level            2640 non-null object
stroke_count          2479 non-null float64
grade                 2381 non-null float64
frequency_ranking     2115 non-null float64
dtypes: float64(4), int64(4), object(3)
memory usage: 327.5+ KB
In [70]:
# Check the total number of characters in each grade.
char_df['grade'].value_counts()
Out[70]:
8.0     991
4.0     202
3.0     200
0.0     197
5.0     185
6.0     180
9.0     176
2.0     160
1.0      80
10.0     10
Name: grade, dtype: int64

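Grades 1 through 6 are the elementary school years in which a kanji is taught. Grade 8 covers the remaining jōyō kanji, which are taught in secondary school. Grades 9 and 10 are not school years: they mark jinmeiyō kanji (characters approved for use in names) and jinmeiyō variants of jōyō kanji. 7 isn't used at all by convention, and 0 is the value assigned manually above to kana and other non-kanji characters.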

In [71]:
# Change the char_df's rank and weighted_rank columns' dtypes from int64 to object.
char_df['rank'] = char_df['rank'].astype('object')
char_df['weighted_rank'] = char_df['weighted_rank'].astype('object')
In [72]:
char_df.head(2)
Out[72]:
rank frequency character weighted_frequency weighted_rank script ord jlpt_level stroke_count grade frequency_ranking
0 1 1135 76544.49 1 hiragana 12427 n5 1.0 0.0 NaN
1 2 873 10294.03 23 katakana 12540 n5 1.0 0.0 NaN

Now that a complete character dataframe has been created, the information in it can be used to gain insight into the lemma dataset.

First, the script that each lemma is made up of will be calculated.

In [73]:
def classify_word_script(row):
    """Use the classify_char_script() function to determine and return the script(s) the lemma in the given row is made up of."""
    
    word = row['lemma']
    char_scripts = []
    kanji = False
    hiragana = False
    katakana = False
    
    for char in word:
        char_scripts.append(classify_char_script(char))
        
        if 'kanji' in char_scripts:
            kanji = True
        if 'hiragana' in char_scripts:
            hiragana = True
        if 'katakana' in char_scripts:
            katakana = True
        if 'half-width_katakana' in char_scripts:
            katakana = True
      
    # Return the proper category of combinations.
    if kanji and not hiragana and not katakana:
        return 'kanji'
    elif hiragana and not kanji and not katakana:
        return 'hiragana'
    elif katakana and not kanji and not hiragana:
        return 'katakana'
    elif kanji and hiragana and not katakana:
        return 'kanji_and_hiragana'
    elif kanji and katakana and not hiragana:
        return 'kanji_and_katakana'
    elif hiragana and katakana and not kanji:
        return 'hiragana_and_katakana'
    elif kanji and hiragana and katakana:
        return 'all'
    else:
        return 'not_japanese'
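
classify_char_script() is defined earlier in the notebook and is not repeated here. Purely as an illustration of the same idea, the sketch below classifies characters by Unicode block and then combines the scripts found in a word. The block ranges and the names char_script() and word_script() are assumptions made for this example, so edge cases such as punctuation or iteration marks may be handled differently from the notebook's own classifier.

```python
def char_script(char):
    """Classify one character by Unicode block (assumed ranges, see the text above)."""
    code = ord(char)
    if 0x3040 <= code <= 0x309F:
        return 'hiragana'
    # Full-width katakana block, plus the half-width katakana forms.
    if 0x30A0 <= code <= 0x30FF or 0xFF66 <= code <= 0xFF9F:
        return 'katakana'
    if 0x4E00 <= code <= 0x9FFF:
        return 'kanji'
    return None


def word_script(word):
    """Combine the scripts found in a word into one of the notebook's labels."""
    found = {char_script(char) for char in word} - {None}
    if not found:
        return 'not_japanese'
    if len(found) == 3:
        return 'all'
    # Fixed order so that combined labels match the notebook's spelling.
    order = ['kanji', 'hiragana', 'katakana']
    return '_and_'.join(script for script in order if script in found)
```

Collecting the scripts into a set replaces the three boolean flags and the chain of elif branches, while producing the same seven labels plus 'not_japanese'.
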
In [74]:
# Create a script column for the lemma dataframe by using the classify_word_script() function on each row.
df['script'] = df.apply(classify_word_script, axis = 1)
df
Out[74]:
rank frequency lemma script
0 1 41309.50 hiragana
1 2 23509.54 hiragana
2 3 22216.80 hiragana
3 4 20431.93 hiragana
4 5 20326.59 hiragana
... ... ... ... ...
14995 14996 2.24 夕べ kanji_and_hiragana
14996 14997 2.24 売場 kanji
14997 14998 2.24 たたき台 kanji_and_hiragana
14998 14999 2.24 かしこ hiragana
14999 15000 2.24 バックグラウンド katakana

15000 rows × 4 columns

In [75]:
# Check the total number of lemmas in each script combination.
df['script'].value_counts()
Out[75]:
kanji                    8185
kanji_and_hiragana       2446
katakana                 2387
hiragana                 1686
not_japanese              230
kanji_and_katakana         42
hiragana_and_katakana      24
Name: script, dtype: int64
In [76]:
# Visualize the total number of lemmas in each script combination with a barplot.
plot_data = df['script'].value_counts()
x = plot_data.keys().tolist()
y = plot_data.values.tolist()
ax = sns.barplot(x, y)
ax.set(xlabel = 'Script', ylabel = 'Number of Lemmas')
ax.set_xticklabels(ax.get_xticklabels(), rotation = 90)
plt.title('Lemma Script Distribution')
plt.show()

Given how many more kanji there are than characters in the other scripts, it's no surprise that kanji-only words make up the bulk of the lemmas. However, there's not a single instance of a word that contains all three scripts. Additionally, katakana-only words are more common than hiragana-only words, likely because of foreign loanwords.

As with the scripts, a minimum JLPT exam level can be assigned to each lemma by taking the most difficult exam level among the characters that the lemma is made up of.

In [77]:
def jlpt_level(row):
    """Determines and returns the highest ranking jlpt_level among all characters in the lemma of the given row. JLPT Ranking Order: n0 > n1 > n2 > n3 > n4 > n5."""
    
    word = row['lemma']
    
    char_levels = []
    
    for char in word:
        char_row = char_df.loc[char_df['character'] == char]
        char_levels.append(str(char_row.get('jlpt_level').item()))
    
    # Return the highest character rank, since knowing the word requires knowing all the characters in it.
    if 'n0' in char_levels:
        return 'n0'
    elif 'n1' in char_levels:
        return 'n1'
    elif 'n2' in char_levels:
        return 'n2'
    elif 'n3' in char_levels:
        return 'n3'
    elif 'n4' in char_levels:
        return 'n4'
    elif 'n5' in char_levels:
        return 'n5'
    else:
        return 'error'
In [78]:
# Test the jlpt_level() function.
jlpt_level(df.iloc[400])
C:\Users\aedri\Anaconda3\envs\tf1\lib\site-packages\ipykernel_launcher.py:10: FutureWarning: `item` has been deprecated and will be removed in a future version
  # Remove the CWD from sys.path while we load stuff.
Out[78]:
'n5'
In [79]:
# Create an ordered list to use when referencing all JLPT exam levels from here on out.
jlpt_exams = ['n5', 'n4', 'n3', 'n2', 'n1', 'n0']
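
With an ordered list like this one, the hardest level in a word can also be picked by position instead of a chain of elif branches. The sketch below assumes a plain dictionary from character to level; char_levels and hardest_level() are hypothetical names, and the placeholder characters and levels are made up for the example.

```python
# Ordered from easiest to hardest, matching jlpt_exams above.
levels_in_order = ['n5', 'n4', 'n3', 'n2', 'n1', 'n0']

# Hypothetical character-to-level lookup; in the notebook this could be built
# from char_df, but the values here are made up for the example.
char_levels = {'A': 'n5', 'B': 'n4', 'C': 'n1'}


def hardest_level(word):
    """Return the hardest JLPT level among the characters of a word."""
    levels = [char_levels[char] for char in word]
    # A higher index in levels_in_order means a harder exam level.
    return max(levels, key=levels_in_order.index)
```

A dictionary lookup per character would likely also be faster than filtering char_df once per character, which may matter when the function is applied to all 15,000 lemmas.
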
In [80]:
# Note that this takes some time to execute.
# Use the jlpt_level on each row of the lemma dataframe to assign a JLPT exam level to each lemma.
df['jlpt_level'] = df.apply(jlpt_level, axis = 1)
df
C:\Users\aedri\Anaconda3\envs\tf1\lib\site-packages\ipykernel_launcher.py:10: FutureWarning: `item` has been deprecated and will be removed in a future version
  # Remove the CWD from sys.path while we load stuff.
Out[80]:
rank frequency lemma script jlpt_level
0 1 41309.50 hiragana n5
1 2 23509.54 hiragana n5
2 3 22216.80 hiragana n5
3 4 20431.93 hiragana n5
4 5 20326.59 hiragana n5
... ... ... ... ... ...
14995 14996 2.24 夕べ kanji_and_hiragana n4
14996 14997 2.24 売場 kanji n4
14997 14998 2.24 たたき台 kanji_and_hiragana n4
14998 14999 2.24 かしこ hiragana n5
14999 15000 2.24 バックグラウンド katakana n5

15000 rows × 5 columns

In [81]:
# Check the total number of lemmas in each JLPT exam level.
df['jlpt_level'].value_counts()
Out[81]:
n5    4629
n3    3332
n1    3054
n2    1855
n4    1435
n0     695
Name: jlpt_level, dtype: int64
In [82]:
# Visualize the total number of lemmas in each JLPT exam level with a barplot.
plot_data = df['jlpt_level'].value_counts()
x = plot_data.keys().tolist()
y = plot_data.values.tolist()
ax = sns.barplot(x, y, order = jlpt_exams)
ax.set(xlabel = 'JLPT Exam Level', ylabel = 'Number of Lemmas')
plt.title('Lemma JLPT Exam Level Distribution')
plt.show()

The n5 level is expected to include the most lemmas, because hiragana-only and katakana-only words fall into it. Kana are easy to learn, but these words perhaps should not all be learned at this level alone. Accounting for this caveat, the largest leaps in vocabulary accessibility come at the n3 and the n1 exam levels.
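
To make the size of each leap concrete, the per-level counts from the output above can be turned into a running total. This is a small sketch using only those counts; the variable names are chosen for the example.

```python
from itertools import accumulate

# Lemma counts per level, copied from the value_counts() output above.
levels = ['n5', 'n4', 'n3', 'n2', 'n1', 'n0']
counts = [4629, 1435, 3332, 1855, 3054, 695]

# Running total of lemmas readable once each level has been reached.
cumulative = list(accumulate(counts))
total = cumulative[-1]

for level, gained, reached in zip(levels, counts, cumulative):
    print(f'{level}: +{gained:>4}  {reached:>5} lemmas  ({reached / total:.1%})')
```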

An additional pair of variables for each character will be the total number of lemmas that the character appears in and the average character length of those lemmas.

In [83]:
def count_word_usage(char):
    """Calculates and returns the number of lemmas that the given character appears in and the average length of those lemmas."""
    
    # Collect every lemma that contains the character at least once.
    word_list = [lemma for lemma in df['lemma'] if char in lemma]
    
    # Total number of characters across the matching lemmas.
    char_count = sum(len(word) for word in word_list)
    
    count = len(word_list)
    average_word_length = round(char_count / count, 2)
            
    return count, average_word_length
In [84]:
# Test the count_word_usage() function.
print(count_word_usage('食'))
(33, 2.15)
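
count_word_usage() scans the full lemma list once per character. As an alternative sketch, the same two values can be collected for every character in a single pass over the lemmas. count_all_word_usage() is a hypothetical name, and the three-lemma input is a toy list for illustration only.

```python
from collections import defaultdict


def count_all_word_usage(lemmas):
    """Return {character: (word_count, average_word_length)} in one pass."""
    word_counts = defaultdict(int)
    length_totals = defaultdict(int)

    for lemma in lemmas:
        # set() ensures a lemma is counted once even if a character repeats in it.
        for char in set(lemma):
            word_counts[char] += 1
            length_totals[char] += len(lemma)

    return {
        char: (count, round(length_totals[char] / count, 2))
        for char, count in word_counts.items()
    }


# Toy lemma list for illustration only.
usage = count_all_word_usage(['食事', '食', '事実'])
```

The resulting dictionary could then be mapped onto the character column in the same way as the per-character function.
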
In [85]:
# Create a word count and average word length column for each character in the character dataframe using the count_word_usage() function.
# frequency is the number of times the character appears in the list of lemmas.
# word_count is the number of lemmas that the character appears in at least once.
char_df['word_count'], char_df['average_word_length'] = zip(*char_df['character'].map(count_word_usage))
char_df
Out[85]:
rank frequency character weighted_frequency weighted_rank script ord jlpt_level stroke_count grade frequency_ranking word_count average_word_length
0 1 1135 76544.49 1 hiragana 12427 n5 1.0 0.0 NaN 1133 3.21
1 2 873 10294.03 23 katakana 12540 n5 1.0 0.0 NaN 806 4.37
2 3 673 7547.24 28 katakana 12531 n5 2.0 0.0 387.0 618 4.55
3 4 603 40478.36 3 hiragana 12356 n5 2.0 0.0 NaN 576 3.33
4 5 401 5061.23 40 katakana 12473 n5 2.0 0.0 NaN 386 4.25
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2635 2636 1 9.45 2007 kanji 29677 n1 10.0 6.0 1592.0 1 1.00
2636 2637 1 9.44 2008 kanji 36795 n0 5.0 9.0 1614.0 1 1.00
2637 2638 1 9.42 2009 kanji 25524 n0 11.0 NaN NaN 1 2.00
2638 2639 1 9.42 2010 kanji 25375 n2 9.0 8.0 1870.0 1 2.00
2639 2640 1 2.24 2640 kanji 24265 n1 13.0 8.0 2066.0 1 2.00

2640 rows × 13 columns

In [86]:
# Describe the word_count column data.
char_df['word_count'].describe()
Out[86]:
count    2640.000000
mean       13.996212
std        44.592024
min         1.000000
25%         1.000000
50%         4.000000
75%        10.250000
max      1133.000000
Name: word_count, dtype: float64
In [87]:
# Visualize the character word_count amounts with a scatterplot and a log scaled y-axis.
x = range(0, len(char_df))
y = char_df['word_count']
plt.scatter(x, y)
plt.title('Word Counts of Japanese Characters')
plt.ylabel('Word Count')
plt.yscale('log')
plt.show()

Like the lemma frequencies, the total number of lemmas each character appears in is somewhat logarithmic, but word count has a significantly less dramatic curve.

In [88]:
# Describe the average_word_length column data.
char_df['average_word_length'].describe()
Out[88]:
count    2640.000000
mean        2.026080
std         0.682107
min         1.000000
25%         1.750000
50%         2.000000
75%         2.250000
max         5.880000
Name: average_word_length, dtype: float64
In [89]:
# Visualize the character word_count values by their average_word_length with a scatterplot.
x = char_df['average_word_length']
y = char_df['word_count']
plt.scatter(x, y, alpha = 0.1)
plt.title('Word Counts of Japanese Characters')
plt.xlabel('Average Word Length')
plt.ylabel('Word Count')
#plt.yscale('log')
plt.show()

It would make sense for word count and average word length to be related, given the assumption that simpler is more common, but that isn't always the case in languages. There appears to be a very slight positive linear relationship between these two variables, but it doesn't appear to be significant.
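
The strength of a linear relationship like this one can be quantified with the Pearson correlation coefficient instead of being judged by eye. The sketch below computes it in plain Python on made-up values; pearson_r() is a hypothetical helper and the numbers are not taken from char_df.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up values standing in for average_word_length and word_count.
lengths = [1.0, 2.0, 3.0, 4.0]
counts = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(lengths, counts)
```

On the actual data, the same statistic should be available through pandas as char_df['average_word_length'].corr(char_df['word_count']).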

In [90]:
# Visualize the kanji character word_count values by their average_word_length with a scatterplot.
x = char_df.loc[char_df['script'] == 'kanji']['average_word_length']
y = char_df.loc[char_df['script'] == 'kanji']['word_count']
plt.scatter(x, y, alpha = 0.2)
plt.title('Average Word Lengths of Kanji by Word Count')
plt.xlabel('Average Word Length')
plt.ylabel('Word Count')
plt.show()
In [91]:
# Visualize the hiragana character word_count values by their average_word_length with a scatterplot.
x = char_df.loc[char_df['script'] == 'hiragana']['average_word_length']
y = char_df.loc[char_df['script'] == 'hiragana']['word_count']
plt.scatter(x, y)
plt.title('Average Word Lengths of Hiragana by Word Count')
plt.xlabel('Average Word Length')
plt.ylabel('Word Count')
plt.show()
In [92]:
# Visualize the katakana character word_count values by their average_word_length with a scatterplot.
x = char_df.loc[char_df['script'] == 'katakana']['average_word_length']
y = char_df.loc[char_df['script'] == 'katakana']['word_count']
plt.scatter(x, y)
plt.title('Average Word Lengths of Katakana by Word Count')
plt.xlabel('Average Word Length')
plt.ylabel('Word Count')
plt.show()

These three graphs show the word count and word length of each character by script type. Hiragana and katakana tend towards longer words, while kanji words are far more concise.

In [93]:
# View and reassess a sample of the lemma dataframe before continuing.
df.sample(20)
Out[93]:
rank frequency lemma script jlpt_level
5696 5697 9.61 朝刊 kanji n2
6639 6640 7.70 取り除く kanji_and_hiragana n3
1393 1394 53.83 集める kanji_and_hiragana n4
2300 2301 30.23 kanji n1
11634 11635 3.33 私見 kanji n4
2804 2805 24.13 寄せる kanji_and_hiragana n3
14520 14521 2.36 二日酔い kanji_and_hiragana n1
5049 5050 11.30 部会 kanji n3
4992 4993 11.48 かよう hiragana n5
5140 5141 11.02 kanji n1
1122 1123 67.15 地球 kanji n3
4243 4244 14.28 誇る kanji_and_hiragana n1
11337 11338 3.47 メトロ katakana n5
12990 12991 2.81 贈与 kanji n2
8438 8439 5.39 ため息 kanji_and_hiragana n3
8189 8190 5.64 きみ hiragana n5
2572 2573 27.05 強調 kanji n3
14451 14452 2.38 前項 kanji n1
6069 6070 8.77 営利 kanji n2
9529 9530 4.48 ようこそ hiragana n5

Although part of speech information is difficult to obtain without tokenizing and PoS tagging a Japanese corpus, the most common part of speech usage forms can be looked up for each word.

This will allow a comparison between the lemma dataset's distribution of parts of speech and that of the pos dataframe.

In [94]:
def jisho_lookup(word):
    """Scrapes and returns the lemma's most common part of speech from its corresponding webpage at http://www.jisho.org/."""
    
    try:
        url_base = "http://jisho.org/search/"

        # Create the URL for the current search page.
        current_url = url_base + str(word)

        # Request the current search page.
        page = requests.get(current_url)

        # Parse the page with BS4.
        soup = BeautifulSoup(page.content, 'html.parser')

        pos = soup.find("div", {"class": "meaning-tags"}).contents[0]
        
        print(word + ': ' + pos)
        
        return pos
        
    except:
        print(word + ': Failed')
        return 'Failed'
In [95]:
def jisho_api_lookup(word):
    """Looks up and returns the lemma's most common part of speech through the experimental alpha API of http://www.jisho.org/."""
    
    try:
        pos = []

        url_base = 'https://jisho.org/api/v1/search/words?keyword='

        # Create the URL for the current API request.
        current_url = url_base + str(word)

        # Request the current API page.
        page = requests.get(current_url)
        
        # Return a failure on a 404 or 500.
        if page.status_code == 404:
            print(word + ': Failed. 404 error.')
            return ['Failed. 404 error.']
        if page.status_code == 500:
            print(word + ': Failed. 500 error.')
            return ['Failed. 500 error.']

        # Parse the page with BS4.
        soup = BeautifulSoup(page.content, 'html.parser')

        data = json.loads(str(soup))['data']
        
        # Some characters like 」 will load an api page but not have data.
        if data == []:
            print(word + ': Failed. No api data.')
            return ['Failed. No api data.']

        # Access the correct values in the nested dictionaries and lists.
        for item in data:
            for x in item['japanese']:

                # Entries can lack either the 'word' or the 'reading' key, so use .get() for both.
                if x.get('word') == word or x.get('reading') == word:

                    # Return from inside the loop so that only the first sense is used.
                    for variation in item['senses']:

                        for part in variation['parts_of_speech']:
                            pos.append(part)

                        print(word + ': ' + str(pos))
                        return str(pos)

        # If an api page is loaded with information but none of it matches the word, return a failure message.
        print(word + ': Failed due to incorrect data.')
        return ['Failed due to incorrect data.']
        
    except:
        print(word + ': Failed')
        return ['Failed']
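
The nested structure that jisho_api_lookup() walks through is easier to follow on a small example. The payload below is hand-written to mimic only the fields used above ('data', 'japanese', 'word', 'reading', 'senses', 'parts_of_speech'); it is not real API output, its second sense is invented, and first_sense_pos() is a hypothetical helper.

```python
import json

# Hand-written sample that mimics only the fields used above; not real API output.
sample_response = json.loads('''
{"data": [
  {"japanese": [{"word": "換算", "reading": "かんさん"}],
   "senses": [{"parts_of_speech": ["Noun", "Suru verb"]},
              {"parts_of_speech": ["Noun"]}]}
]}
''')


def first_sense_pos(data, word):
    """Return the parts of speech of the first sense of the first matching entry."""
    for item in data:
        for form in item['japanese']:
            # Either key may be absent, so .get() avoids a KeyError.
            if word in (form.get('word'), form.get('reading')):
                senses = item['senses']
                return list(senses[0]['parts_of_speech']) if senses else []
    return None


pos = first_sense_pos(sample_response['data'], '換算')
```

Returning None for a missing word keeps "no match" distinct from "matched but no senses", which the failure strings in the function above distinguish in a similar way.
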
In [96]:
# Test the jisho_lookup() function.
print(jisho_lookup('換算'))
換算: Noun, Suru verb
Noun, Suru verb
In [97]:
# Test the jisho_api_lookup() function with a character that should fail.
jisho_api_lookup('$')
$: Failed. 500 error.
Out[97]:
['Failed. 500 error.']
In [98]:
# Scrape all parts of speech from Jisho.org for each lemma.
# Note that this takes a very long time to execute.
# This block is commented out to avoid accidentally re-scraping it all.
'''
df['pos'] = df['lemma'].apply(jisho_api_lookup)
df.to_csv('df_with_pos.csv', sep = '|')
df
'''
Out[98]:
"\ndf['pos'] = df['lemma'].apply(jisho_api_lookup)\ndf.to_csv('df_with_pos.csv', sep = '|')\ndf\n"
In [99]:
# Presume that the entire notebook is being run, and reload the previously saved dataframe that includes the scraped data.
df = pd.read_csv('df_with_pos.csv', sep = '|', index_col = 0)
In [100]:
df
Out[100]:
rank frequency lemma script jlpt_level pos
0 1 41309.50 hiragana n5 ['Particle']
1 2 23509.54 hiragana n5 ['Numeric']
2 3 22216.80 hiragana n5 ['Particle']
3 4 20431.93 hiragana n5 ['Noun']
4 5 20326.59 hiragana n5 ['Particle']
... ... ... ... ... ... ...
14995 14996 2.24 夕べ kanji_and_hiragana n4 ['Adverbial noun', 'Temporal noun']
14996 14997 2.24 売場 kanji n4 ['Noun']
14997 14998 2.24 たたき台 kanji_and_hiragana n4 ['Noun']
14998 14999 2.24 かしこ hiragana n5 ['Expression']
14999 15000 2.24 バックグラウンド katakana n5 ['Noun']

15000 rows × 6 columns

In [101]:
# View the total value counts for each lemma part of speech value.
df['pos'].value_counts()
Out[101]:
['Noun']                                                                 5868
['Noun', 'Suru verb']                                                    1985
['Noun', 'No-adjective']                                                  907
['Failed due to incorrect data.']                                         647
['Noun', 'Suru verb', 'No-adjective']                                     408
                                                                         ... 
['Adverbial noun', 'Noun - used as a suffix']                               1
['Godan verb with su ending', 'Transitive verb', 'intransitive verb']       1
['Expression', 'No-adjective', 'Adverb']                                    1
['Taru-adjective', "Adverb taking the 'to' particle", 'Adverb']             1
['Suru verb', 'Noun']                                                       1
Name: pos, Length: 290, dtype: int64
In [102]:
# Calculate each unique part of speech from the jisho api scrape.
from ast import literal_eval

parts_of_speech = {}
for pos_list in df['pos']:
    for pos in literal_eval(pos_list):
        if pos in parts_of_speech:
            parts_of_speech[pos] += 1
        else:
            parts_of_speech[pos] = 1
            
parts_of_speech
Out[102]:
{'Particle': 50,
 'Numeric': 40,
 'Noun': 10385,
 'Copula': 1,
 'Suru verb - irregular': 1,
 'Godan verb with su ending': 241,
 'intransitive verb': 663,
 'Transitive verb': 998,
 'Noun - used as a suffix': 316,
 'I-adjective': 244,
 'Ichidan verb': 584,
 'Godan verb with ru ending (irregular verb)': 3,
 'Expression': 169,
 'Failed due to incorrect data.': 647,
 'Failed. No api data.': 214,
 'Godan verb with ru ending': 388,
 'Auxiliary verb': 19,
 'Godan verb with u ending': 156,
 'Pre-noun adjectival': 40,
 'Na-adjective': 833,
 'Suffix': 95,
 'Pronoun': 76,
 'Kuru verb - special class': 4,
 'Prefix': 85,
 'Godan verb - Iku/Yuku special class': 4,
 'Adverb': 554,
 'Conjunction': 89,
 'Adverbial noun': 302,
 'Yodan verb with ru ending (archaic)': 3,
 'Taru-adjective': 11,
 "Adverb taking the 'to' particle": 84,
 'Noun - used as a prefix': 32,
 'Godan verb with ku ending': 136,
 'Godan verb with tsu ending': 32,
 'Suru verb': 2512,
 'No-adjective': 1739,
 'Temporal noun': 218,
 'Godan verb with mu ending': 128,
 'Noun or verb acting prenominally': 40,
 'Godan verb - aru special class': 7,
 'Counter': 40,
 'Auxiliary adjective': 2,
 'Godan verb with u ending (special class)': 3,
 'Godan verb with bu ending': 19,
 'Godan verb with nu ending': 2,
 'Irregular nu verb': 2,
 'Wikipedia definition': 171,
 'Su verb - precursor to the modern suru': 9,
 'Place': 39,
 'Failed. 500 error.': 1,
 'Godan verb with gu ending': 26,
 'Auxiliary': 5,
 'Suru verb - special class': 12,
 'Full name': 2,
 'I-adjective (yoi/ii class)': 3,
 'Proper noun': 4,
 'Product': 1,
 'Archaic/formal form of na-adjective': 2,
 'Unclassified': 1,
 'Nidan verb (upper class) with ru ending (archaic)': 1,
 'Nidan verb (lower class) with ru ending (archaic)': 1,
 'Ichidan verb - zuru verb (alternative form of -jiru verbs)': 2,
 'Company': 3,
 'Nidan verb (lower class) with mu ending (archaic)': 1}
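
The tally above can also be written with `collections.Counter`. The following is a minimal sketch of that idea, not the code that produced the output above. The `pos_column` list is a hypothetical stand-in for `df['pos']`, whose cells are the string form of a list because the dataframe was reloaded from a CSV file.

```python
from ast import literal_eval
from collections import Counter

# Hypothetical stand-in for df['pos']: each cell is a list stored as a string.
pos_column = [
    "['Particle']",
    "['Noun', 'Suru verb']",
    "['Noun', 'No-adjective']",
    "['Noun']",
]

# literal_eval turns each string back into a list; Counter tallies every tag it sees.
tag_counts = Counter(tag for cell in pos_column for tag in literal_eval(cell))

# Show the two most common tags and their counts.
print(tag_counts.most_common(2))
```

A `Counter` behaves like the dictionary built above, so the later steps that read its keys would work unchanged.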
In [103]:
# Look at one of the odder part of speech scrape values.
df.loc[df['pos'] == "['Wikipedia definition']"].head()
Out[103]:
rank frequency lemma script jlpt_level pos
1531 1532 48.05 小泉 kanji n2 ['Wikipedia definition']
2154 2155 32.71 田中 kanji n4 ['Wikipedia definition']
2213 2214 31.58 佐藤 kanji n1 ['Wikipedia definition']
2697 2698 25.43 ジョン katakana n5 ['Wikipedia definition']
2746 2747 24.81 村上 kanji n2 ['Wikipedia definition']

There is an excessively large number of highly specific parts of speech in this data.

Create a dictionary to simplify and 'translate' the parts_of_speech to the pos_df eng_pos values.

In [104]:
# Create a dictionary to simplify and translate the parts_of_speech to the pos_df eng_pos values.
translation_dict = {}
for pos in list(parts_of_speech.keys()):
    translation_dict[pos] = ''

translation_dict
Out[104]:
{'Particle': '',
 'Numeric': '',
 'Noun': '',
 'Copula': '',
 'Suru verb - irregular': '',
 'Godan verb with su ending': '',
 'intransitive verb': '',
 'Transitive verb': '',
 'Noun - used as a suffix': '',
 'I-adjective': '',
 'Ichidan verb': '',
 'Godan verb with ru ending (irregular verb)': '',
 'Expression': '',
 'Failed due to incorrect data.': '',
 'Failed. No api data.': '',
 'Godan verb with ru ending': '',
 'Auxiliary verb': '',
 'Godan verb with u ending': '',
 'Pre-noun adjectival': '',
 'Na-adjective': '',
 'Suffix': '',
 'Pronoun': '',
 'Kuru verb - special class': '',
 'Prefix': '',
 'Godan verb - Iku/Yuku special class': '',
 'Adverb': '',
 'Conjunction': '',
 'Adverbial noun': '',
 'Yodan verb with ru ending (archaic)': '',
 'Taru-adjective': '',
 "Adverb taking the 'to' particle": '',
 'Noun - used as a prefix': '',
 'Godan verb with ku ending': '',
 'Godan verb with tsu ending': '',
 'Suru verb': '',
 'No-adjective': '',
 'Temporal noun': '',
 'Godan verb with mu ending': '',
 'Noun or verb acting prenominally': '',
 'Godan verb - aru special class': '',
 'Counter': '',
 'Auxiliary adjective': '',
 'Godan verb with u ending (special class)': '',
 'Godan verb with bu ending': '',
 'Godan verb with nu ending': '',
 'Irregular nu verb': '',
 'Wikipedia definition': '',
 'Su verb - precursor to the modern suru': '',
 'Place': '',
 'Failed. 500 error.': '',
 'Godan verb with gu ending': '',
 'Auxiliary': '',
 'Suru verb - special class': '',
 'Full name': '',
 'I-adjective (yoi/ii class)': '',
 'Proper noun': '',
 'Product': '',
 'Archaic/formal form of na-adjective': '',
 'Unclassified': '',
 'Nidan verb (upper class) with ru ending (archaic)': '',
 'Nidan verb (lower class) with ru ending (archaic)': '',
 'Ichidan verb - zuru verb (alternative form of -jiru verbs)': '',
 'Company': '',
 'Nidan verb (lower class) with mu ending (archaic)': ''}
In [105]:
# View the part of speech values from the pos_df.
pos_df['eng_pos']
Out[105]:
0               noun
1           particle
2             symbol
3               verb
4     auxiliary verb
5             adverb
6          adjective
7          adnominal
8        conjunction
9             prefix
10      interjection
11            filler
12             other
Name: eng_pos, dtype: object
In [106]:
# Manually fill out the dictionary for condensing the lemma part of speech values.
translation_dict = {
    'Particle': 'particle',
    'Numeric': 'other',
    'Noun': 'noun',
    'Copula': 'other',
    'Suru verb - irregular': 'verb',
    'Godan verb with su ending': 'verb',
    'intransitive verb': 'verb',
    'Transitive verb': 'verb',
    'Noun - used as a suffix': 'noun',
    'I-adjective': 'adjective',
    'Ichidan verb': 'verb',
    'Godan verb with ru ending (irregular verb)': 'verb',
    'Expression': 'other',
    'Failed due to incorrect data.': '',
    'Failed. No api data.': '',
    'Godan verb with ru ending': 'verb',
    'Auxiliary verb': 'auxiliary verb',
    'Godan verb with u ending': 'verb',
    'Pre-noun adjectival': 'adjective',
    'Na-adjective': 'adjective',
    'Suffix': 'other',
    'Pronoun': 'noun',
    'Kuru verb - special class': 'verb',
    'Prefix': 'prefix',
    'Godan verb - Iku/Yuku special class': 'verb',
    'Adverb': 'adverb',
    'Conjunction': 'conjunction',
    'Adverbial noun': 'noun',
    'Yodan verb with ru ending (archaic)': 'verb',
    'Taru-adjective': 'adjective',
    "Adverb taking the 'to' particle": 'adverb',
    'Noun - used as a prefix': 'noun',
    'Godan verb with ku ending': 'verb',
    'Godan verb with tsu ending': 'verb',
    'Suru verb': 'verb',
    'No-adjective': 'adjective',
    'Temporal noun': 'noun',
    'Godan verb with mu ending': 'verb',
    'Noun or verb acting prenominally': 'other',
    'Godan verb - aru special class': 'verb',
    'Counter': 'symbol',
    'Auxiliary adjective': 'adjective',
    'Godan verb with u ending (special class)': 'verb',
    'Godan verb with bu ending': 'verb',
    'Godan verb with nu ending': 'verb',
    'Irregular nu verb': 'verb',
    'Wikipedia definition': '',
    'Su verb - precursor to the modern suru': 'verb',
    'Place': 'noun',
    'Failed. 500 error.': '',
    'Godan verb with gu ending': 'verb',
    'Auxiliary': 'auxiliary verb',
    'Suru verb - special class': 'verb',
    'Full name': 'noun',
    'I-adjective (yoi/ii class)': 'adjective',
    'Proper noun': 'noun',
    'Product': 'other',
    'Archaic/formal form of na-adjective': 'verb',
    'Unclassified': '',
    'Nidan verb (upper class) with ru ending (archaic)': 'verb',
    'Nidan verb (lower class) with ru ending (archaic)': 'verb',
    'Ichidan verb - zuru verb (alternative form of -jiru verbs)': 'verb',
    'Company': 'noun',
    'Nidan verb (lower class) with mu ending (archaic)': 'verb'
}

This condensing of part of speech values unfortunately limits the precision of the calculations, but it is the best that can be done without detailed knowledge of how the pos_df was originally tagged. This could be examined in a later project by looking into the ChaSen morphological analyzer that was used. (http://chasen-legacy.osdn.jp/)
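
As a rough illustration of what the condensing does to a single lemma, the sketch below maps a list of scraped tags onto the coarse categories. The `translation_subset` dictionary copies a few entries from the translation_dict above, and the helper name `condense_tags` is hypothetical.

```python
# A few entries copied from translation_dict; '' marks tags that are dropped.
translation_subset = {
    'Noun': 'noun',
    'Suru verb': 'verb',
    'Transitive verb': 'verb',
    'Ichidan verb': 'verb',
    'Wikipedia definition': '',
}

def condense_tags(tags, mapping):
    """Return the sorted, de-duplicated coarse categories for a list of scraped tags."""
    coarse = {mapping[tag] for tag in tags}
    coarse.discard('')  # Tags mapped to '' carry no coarse category.
    return sorted(coarse)

print(condense_tags(['Noun', 'Suru verb'], translation_subset))
print(condense_tags(['Ichidan verb', 'Transitive verb'], translation_subset))
print(condense_tags(['Wikipedia definition'], translation_subset))
```

Two scraped tags that map to the same category collapse into one, which is where the loss of precision comes from.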

In [107]:
# View the columns for each dataframe before continuing.
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 15000 entries, 0 to 14999
Data columns (total 6 columns):
rank          15000 non-null int64
frequency     15000 non-null float64
lemma         15000 non-null object
script        15000 non-null object
jlpt_level    15000 non-null object
pos           15000 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 820.3+ KB
In [108]:
# View the columns for each dataframe before continuing.
char_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2640 entries, 0 to 2639
Data columns (total 13 columns):
rank                   2640 non-null object
frequency              2640 non-null int64
character              2640 non-null object
weighted_frequency     2640 non-null float64
weighted_rank          2640 non-null object
script                 2640 non-null object
ord                    2640 non-null int64
jlpt_level             2640 non-null object
stroke_count           2479 non-null float64
grade                  2381 non-null float64
frequency_ranking      2115 non-null float64
word_count             2640 non-null int32
average_word_length    2640 non-null float64
dtypes: float64(5), int32(1), int64(2), object(5)
memory usage: 358.4+ KB
In [109]:
# View the columns for each dataframe before continuing.
pos_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 6 columns):
rank                    13 non-null object
eng_pos                 13 non-null object
frequency               13 non-null float64
frequency_percentage    13 non-null float64
jap_pos                 13 non-null object
transliterated_pos      13 non-null object
dtypes: float64(2), object(4)
memory usage: 752.0+ bytes
In [110]:
# Create a list of all scraped part of speech value combinations.
pos_lists = df['pos'].value_counts().keys()
pos_lists
Out[110]:
Index(['['Noun']', '['Noun', 'Suru verb']', '['Noun', 'No-adjective']',
       '['Failed due to incorrect data.']',
       '['Noun', 'Suru verb', 'No-adjective']', '['Na-adjective', 'Noun']',
       '['Ichidan verb', 'Transitive verb']', '['Adverb']', '['I-adjective']',
       '['Noun', 'Noun - used as a suffix']',
       ...
       '['Adverbial noun', 'Conjunction']', '['No-adjective', 'Prefix']',
       '["Adverb taking the 'to' particle"]',
       '['Noun', 'Suru verb', 'Adverb', "Adverb taking the 'to' particle"]',
       '['Noun', 'Noun - used as a prefix', 'No-adjective']',
       '['Adverbial noun', 'Noun - used as a suffix']',
       '['Godan verb with su ending', 'Transitive verb', 'intransitive verb']',
       '['Expression', 'No-adjective', 'Adverb']',
       '['Taru-adjective', "Adverb taking the 'to' particle", 'Adverb']',
       '['Suru verb', 'Noun']'],
      dtype='object', length=290)

Nearly all of these are made up of very niche or at least overly specific parts of speech. A dummy variable column for each part of speech can be created to simplify the grouping of rows.

In [111]:
# Create a dummy variable column for each part of speech from the pos_df for each lemma in the lemma dataframe.
df = df.assign(**{'adjective': 0, 'adverb': 0, 'auxiliary verb': 0, 'conjunction': 0, 'noun': 0, 'other': 0, 'particle': 0, 'prefix': 0, 'symbol': 0, 'verb': 0})
df
Out[111]:
rank frequency lemma script jlpt_level pos adjective adverb auxiliary verb conjunction noun other particle prefix symbol verb
0 1 41309.50 hiragana n5 ['Particle'] 0 0 0 0 0 0 0 0 0 0
1 2 23509.54 hiragana n5 ['Numeric'] 0 0 0 0 0 0 0 0 0 0
2 3 22216.80 hiragana n5 ['Particle'] 0 0 0 0 0 0 0 0 0 0
3 4 20431.93 hiragana n5 ['Noun'] 0 0 0 0 0 0 0 0 0 0
4 5 20326.59 hiragana n5 ['Particle'] 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14995 14996 2.24 夕べ kanji_and_hiragana n4 ['Adverbial noun', 'Temporal noun'] 0 0 0 0 0 0 0 0 0 0
14996 14997 2.24 売場 kanji n4 ['Noun'] 0 0 0 0 0 0 0 0 0 0
14997 14998 2.24 たたき台 kanji_and_hiragana n4 ['Noun'] 0 0 0 0 0 0 0 0 0 0
14998 14999 2.24 かしこ hiragana n5 ['Expression'] 0 0 0 0 0 0 0 0 0 0
14999 15000 2.24 バックグラウンド katakana n5 ['Noun'] 0 0 0 0 0 0 0 0 0 0

15000 rows × 16 columns

In [112]:
def translate_pos(pos):
    """Returns the condensed part of speech value calculated from the translation_dict."""
    
    return translation_dict[pos].lower()
In [113]:
def set_df_pos_columns(row, index):
    """Iterates over each separate scraped part of speech for the given row and sets the corresponding dummy variable columns for each condensed part of speech."""
    
    for pos in literal_eval(row['pos'].item()):
        if translate_pos(pos) != '':
            df.loc[[index],[translate_pos(pos)]] = 1
In [114]:
# Test the set_df_pos_columns() function with the first lemma dataframe row.
set_df_pos_columns(df.iloc[[0]], 0)
df.head()
C:\Users\aedri\Anaconda3\envs\tf1\lib\site-packages\ipykernel_launcher.py:4: FutureWarning: `item` has been deprecated and will be removed in a future version
  after removing the cwd from sys.path.
Out[114]:
rank frequency lemma script jlpt_level pos adjective adverb auxiliary verb conjunction noun other particle prefix symbol verb
0 1 41309.50 hiragana n5 ['Particle'] 0 0 0 0 0 0 1 0 0 0
1 2 23509.54 hiragana n5 ['Numeric'] 0 0 0 0 0 0 0 0 0 0
2 3 22216.80 hiragana n5 ['Particle'] 0 0 0 0 0 0 0 0 0 0
3 4 20431.93 hiragana n5 ['Noun'] 0 0 0 0 0 0 0 0 0 0
4 5 20326.59 hiragana n5 ['Particle'] 0 0 0 0 0 0 0 0 0 0
In [115]:
# Note that this will take some time.
# Use the set_df_pos_columns() function to set all part of speech dummy variables for each row in the lemma dataframe.
for index, row in enumerate(df.iterrows()):
    set_df_pos_columns(df.iloc[[index]], index)
In [116]:
# Calculate the total number of lemmas of each part of speech based on the dummy variables.
lemma_pos_counts = {}
for column in ['adjective', 'adverb', 'auxiliary verb', 'conjunction', 'noun', 'other', 'particle', 'prefix', 'symbol', 'verb']:
    lemma_pos_counts[column] = df[column].sum()
    
lemma_pos_counts
Out[116]:
{'adjective': 2652,
 'adverb': 568,
 'auxiliary verb': 24,
 'conjunction': 89,
 'noun': 10809,
 'other': 341,
 'particle': 50,
 'prefix': 85,
 'symbol': 40,
 'verb': 4266}

Nouns (10,809) and verbs (4,266) are understandably the most common parts of speech among the lemmas.
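
The totals in the output above sum to 18,924, which is more than the 15,000 lemmas, because one lemma can set several dummy variables. The toy sketch below shows the effect with three lemmas whose condensed categories follow from the rows and the translation_dict shown earlier. The variable names are illustrative only.

```python
categories = ['adjective', 'noun', 'verb']

# Condensed categories for three lemmas that appear earlier in the notebook.
condensed = {
    '換算': ['noun', 'verb'],   # scraped as Noun, Suru verb
    '売場': ['noun'],           # scraped as Noun
    'ない': ['adjective'],      # scraped as I-adjective
}

# One 0/1 flag per category for every lemma, like the dummy variable columns.
flags = {
    lemma: {category: int(category in tags) for category in categories}
    for lemma, tags in condensed.items()
}

# Column totals, the equivalent of df[column].sum() for each dummy column.
totals = {category: sum(row[category] for row in flags.values()) for category in categories}

print(totals)
print(sum(totals.values()), 'flags set across', len(condensed), 'lemmas')
```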

The average frequency of all of the characters in a lemma may be a way to measure the characters' impact on a lemma's frequency.

In [117]:
def average_char_frequency(word):
    """Calculates and returns the average frequency among all characters in the given word."""
    
    freq_total = 0
    char_count = len(word)
    
    for char in word:
        freq_total += char_df.loc[char_df['character'] == char]['frequency'].item()
        
    return freq_total / char_count
In [118]:
# Test the average_char_frequency() function.
average_char_frequency('図書館')
Out[118]:
26.666666666666668
In [119]:
# Note that this will take some time.
# Calculate the average character frequency of each lemma in the lemma dataframe.
df['average_character_frequency'] = df['lemma'].map(average_char_frequency)
df
Out[119]:
rank frequency lemma script jlpt_level pos adjective adverb auxiliary verb conjunction noun other particle prefix symbol verb average_character_frequency
0 1 41309.50 hiragana n5 ['Particle'] 0 0 0 0 0 0 1 0 0 0 89.000000
1 2 23509.54 hiragana n5 ['Numeric'] 0 0 0 0 0 1 0 0 0 0 164.000000
2 3 22216.80 hiragana n5 ['Particle'] 0 0 0 0 0 0 1 0 0 0 73.000000
3 4 20431.93 hiragana n5 ['Noun'] 0 0 0 0 1 0 0 0 0 0 143.000000
4 5 20326.59 hiragana n5 ['Particle'] 0 0 0 0 0 0 1 0 0 0 7.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14995 14996 2.24 夕べ kanji_and_hiragana n4 ['Adverbial noun', 'Temporal noun'] 0 0 0 0 1 0 0 0 0 0 19.000000
14996 14997 2.24 売場 kanji n4 ['Noun'] 0 0 0 0 1 0 0 0 0 0 34.000000
14997 14998 2.24 たたき台 kanji_and_hiragana n4 ['Noun'] 0 0 0 0 1 0 0 0 0 0 143.000000
14998 14999 2.24 かしこ hiragana n5 ['Expression'] 0 0 0 0 0 1 0 0 0 0 253.333333
14999 15000 2.24 バックグラウンド katakana n5 ['Noun'] 0 0 0 0 1 0 0 0 0 0 247.625000

15000 rows × 17 columns
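
Most of the time in this step goes to filtering char_df once per character. One possible speed-up, sketched here under assumptions, is to build a plain dictionary once and look each character up in it. The function name `average_char_frequency_fast` is hypothetical, and the three frequencies are invented values chosen only so that they average to the test result shown above.

```python
# Invented frequencies for illustration; the real values live in char_df.
# With pandas, the lookup could be built once as:
#     dict(zip(char_df['character'], char_df['frequency']))
char_frequency = {'図': 20, '書': 45, '館': 15}

def average_char_frequency_fast(word, frequency_by_char):
    """Average the frequencies of the characters in word using dictionary lookups."""
    return sum(frequency_by_char[char] for char in word) / len(word)

print(average_char_frequency_fast('図書館', char_frequency))
```

Like the original function, this raises an error for a character that is missing from the lookup, so it assumes every character in the lemma is known.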

Likewise, the total stroke count of all characters in a word, as well as the average stroke count per character, can provide a rough measure of how difficult a word is to write.

In [120]:
def total_stroke_count(word):
    """Calculates and returns the sum of the stroke counts of all characters in the given word."""
    
    stroke_total = 0
    
    try:
        for char in word:
            stroke_total += char_df.loc[char_df['character'] == char]['stroke_count'].item()
    except Exception:
        # A character that cannot be looked up in char_df makes the whole total undefined.
        return np.nan
    
    return stroke_total
In [121]:
def average_char_stroke_count(word):
    """Calculates and returns the average stroke count among all characters in the given word."""
    
    stroke_total = total_stroke_count(word)
    char_count = len(word)
    
    return stroke_total / char_count
In [122]:
# Test the total_stroke_count() and average_char_stroke_count() functions.
total_stroke_count('図書館'), average_char_stroke_count('図書館')
Out[122]:
(33.0, 11.0)
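
The test result can be checked by hand: 図 has 7 strokes, 書 has 10, and 館 has 16, so the total is 33 and the average is 33 / 3 = 11. The sketch below repeats the calculation with a plain dictionary in place of char_df. The helper names are hypothetical, and a missing character yields NaN in the same spirit as the function above.

```python
import math

# Stroke counts for the three kanji in the test word.
stroke_counts = {'図': 7, '書': 10, '館': 16}

def total_strokes(word, strokes_by_char):
    """Sum the stroke counts of a word, returning NaN if any character is unknown."""
    try:
        return float(sum(strokes_by_char[char] for char in word))
    except KeyError:
        return math.nan

def average_strokes(word, strokes_by_char):
    """Divide the total stroke count by the number of characters."""
    return total_strokes(word, strokes_by_char) / len(word)

print(total_strokes('図書館', stroke_counts), average_strokes('図書館', stroke_counts))
```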
In [123]:
# Note that this will take a while.
# Calculate the total and average stroke counts for each lemma in the lemma dataframe.
df['total_stroke_count'] = df['lemma'].map(total_stroke_count)
df['average_character_stroke_count'] = df['lemma'].map(average_char_stroke_count)
df
Out[123]:
rank frequency lemma script jlpt_level pos adjective adverb auxiliary verb conjunction noun other particle prefix symbol verb average_character_frequency total_stroke_count average_character_stroke_count
0 1 41309.50 hiragana n5 ['Particle'] 0 0 0 0 0 0 1 0 0 0 89.000000 1.0 1.00
1 2 23509.54 hiragana n5 ['Numeric'] 0 0 0 0 0 1 0 0 0 0 164.000000 3.0 3.00
2 3 22216.80 hiragana n5 ['Particle'] 0 0 0 0 0 0 1 0 0 0 73.000000 3.0 3.00
3 4 20431.93 hiragana n5 ['Noun'] 0 0 0 0 1 0 0 0 0 0 143.000000 1.0 1.00
4 5 20326.59 hiragana n5 ['Particle'] 0 0 0 0 0 0 1 0 0 0 7.000000 3.0 3.00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14995 14996 2.24 夕べ kanji_and_hiragana n4 ['Adverbial noun', 'Temporal noun'] 0 0 0 0 1 0 0 0 0 0 19.000000 4.0 2.00
14996 14997 2.24 売場 kanji n4 ['Noun'] 0 0 0 0 1 0 0 0 0 0 34.000000 19.0 9.50
14997 14998 2.24 たたき台 kanji_and_hiragana n4 ['Noun'] 0 0 0 0 1 0 0 0 0 0 143.000000 16.0 4.00
14998 14999 2.24 かしこ hiragana n5 ['Expression'] 0 0 0 0 0 1 0 0 0 0 253.333333 6.0 2.00
14999 15000 2.24 バックグラウンド katakana n5 ['Noun'] 0 0 0 0 1 0 0 0 0 0 247.625000 22.0 2.75

15000 rows × 19 columns

In [124]:
# Visualize the total number of each lemma part of speech with a bar plot.
x = list(lemma_pos_counts.keys())
y = list(lemma_pos_counts.values())

sns.barplot(x = x, y = y, orient = 'v')
plt.title('Part of Speech Totals for the Most Common Japanese Lemmas')
plt.ylabel('Number of Lemmas')
plt.xlabel('Part of Speech')
plt.xticks(rotation = 90)
plt.show()

The large differences in values between these categories may make graphing the data tricky, but it is also very easy to see the distinctive differences in counts.

Nonbasic Lemma Dataframe Creation

Create a dataframe that contains only Japanese lemmas, excluding any lemma that is a single hiragana or katakana character.

In [125]:
# Create a completely separate copy of the lemma dataframe.
nonbasic_lemma_df = df.copy(deep = True)
In [126]:
# Remove non-Japanese lemmas from the lemma dataframe copy.
nonbasic_lemma_df = nonbasic_lemma_df[nonbasic_lemma_df['script'] != 'not_japanese']
In [127]:
# Look at the new total number of lemmas in each script.
nonbasic_lemma_df['script'].value_counts()
Out[127]:
kanji                    8185
kanji_and_hiragana       2446
katakana                 2387
hiragana                 1686
kanji_and_katakana         42
hiragana_and_katakana      24
Name: script, dtype: int64
In [128]:
# Remove all hiragana-only and katakana-only lemmas that are only a single character long.
to_remove = nonbasic_lemma_df.loc[(nonbasic_lemma_df['script'] == 'hiragana') & (nonbasic_lemma_df['lemma'].str.len() == 1)].index.tolist()
to_remove.extend(nonbasic_lemma_df.loc[(nonbasic_lemma_df['script'] == 'katakana') & (nonbasic_lemma_df['lemma'].str.len() == 1)].index.tolist())
to_remove.sort()
# Drop by index label, because labels no longer match row positions after the previous filter.
nonbasic_lemma_df.drop(to_remove, inplace = True)
nonbasic_lemma_df
Out[128]:
rank frequency lemma script jlpt_level pos adjective adverb auxiliary verb conjunction noun other particle prefix symbol verb average_character_frequency total_stroke_count average_character_stroke_count
8 9 16841.17 する hiragana n5 ['Suru verb - irregular'] 0 0 0 0 0 0 0 0 0 1 736.500000 3.0 1.50
10 11 9604.49 ます hiragana n5 ['Godan verb with su ending', 'intransitive ve... 0 0 0 0 0 0 0 0 0 1 281.000000 5.0 2.50
12 13 8189.00 ない hiragana n5 ['I-adjective'] 1 0 0 0 0 0 0 0 0 0 393.500000 6.0 3.00
13 14 8140.22 いる hiragana n5 ['Ichidan verb', 'intransitive verb'] 0 0 0 0 0 0 0 0 0 1 869.000000 3.0 1.50
15 16 6766.19 ある hiragana n5 ['Godan verb with ru ending (irregular verb)',... 0 0 0 0 0 0 0 0 0 1 628.500000 4.0 2.00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14995 14996 2.24 夕べ kanji_and_hiragana n4 ['Adverbial noun', 'Temporal noun'] 0 0 0 0 1 0 0 0 0 0 19.000000 4.0 2.00
14996 14997 2.24 売場 kanji n4 ['Noun'] 0 0 0 0 1 0 0 0 0 0 34.000000 19.0 9.50
14997 14998 2.24 たたき台 kanji_and_hiragana n4 ['Noun'] 0 0 0 0 1 0 0 0 0 0 143.000000 16.0 4.00
14998 14999 2.24 かしこ hiragana n5 ['Expression'] 0 0 0 0 0 1 0 0 0 0 253.333333 6.0 2.00
14999 15000 2.24 バックグラウンド katakana n5 ['Noun'] 0 0 0 0 1 0 0 0 0 0 247.625000 22.0 2.75

14686 rows × 19 columns

In [129]:
# Check the non-null values of each column.
nonbasic_lemma_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 14686 entries, 8 to 14999
Data columns (total 19 columns):
rank                              14686 non-null int64
frequency                         14686 non-null float64
lemma                             14686 non-null object
script                            14686 non-null object
jlpt_level                        14686 non-null object
pos                               14686 non-null object
adjective                         14686 non-null int64
adverb                            14686 non-null int64
auxiliary verb                    14686 non-null int64
conjunction                       14686 non-null int64
noun                              14686 non-null int64
other                             14686 non-null int64
particle                          14686 non-null int64
prefix                            14686 non-null int64
symbol                            14686 non-null int64
verb                              14686 non-null int64
average_character_frequency       14686 non-null float64
total_stroke_count                14686 non-null float64
average_character_stroke_count    14686 non-null float64
dtypes: float64(4), int64(11), object(4)
memory usage: 2.2+ MB
In [130]:
# Reset the index in the new dataframe.
nonbasic_lemma_df.reset_index(drop = True, inplace = True)
nonbasic_lemma_df
Out[130]:
rank frequency lemma script jlpt_level pos adjective adverb auxiliary verb conjunction noun other particle prefix symbol verb average_character_frequency total_stroke_count average_character_stroke_count
0 9 16841.17 する hiragana n5 ['Suru verb - irregular'] 0 0 0 0 0 0 0 0 0 1 736.500000 3.0 1.50
1 11 9604.49 ます hiragana n5 ['Godan verb with su ending', 'intransitive ve... 0 0 0 0 0 0 0 0 0 1 281.000000 5.0 2.50
2 13 8189.00 ない hiragana n5 ['I-adjective'] 1 0 0 0 0 0 0 0 0 0 393.500000 6.0 3.00
3 14 8140.22 いる hiragana n5 ['Ichidan verb', 'intransitive verb'] 0 0 0 0 0 0 0 0 0 1 869.000000 3.0 1.50
4 16 6766.19 ある hiragana n5 ['Godan verb with ru ending (irregular verb)',... 0 0 0 0 0 0 0 0 0 1 628.500000 4.0 2.00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14681 14996 2.24 夕べ kanji_and_hiragana n4 ['Adverbial noun', 'Temporal noun'] 0 0 0 0 1 0 0 0 0 0 19.000000 4.0 2.00
14682 14997 2.24 売場 kanji n4 ['Noun'] 0 0 0 0 1 0 0 0 0 0 34.000000 19.0 9.50
14683 14998 2.24 たたき台 kanji_and_hiragana n4 ['Noun'] 0 0 0 0 1 0 0 0 0 0 143.000000 16.0 4.00
14684 14999 2.24 かしこ hiragana n5 ['Expression'] 0 0 0 0 0 1 0 0 0 0 253.333333 6.0 2.00
14685 15000 2.24 バックグラウンド katakana n5 ['Noun'] 0 0 0 0 1 0 0 0 0 0 247.625000 22.0 2.75

14686 rows × 19 columns

This new dataframe keeps 14,686 of the original 15,000 lemmas and contains no lemma written in non-Japanese characters.

It also keeps single hiragana and katakana characters from being examined as if they were lemmas themselves, which removes the largest frequency outliers.
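
The filtering rule can be summarized in a few lines of plain Python. This is a sketch of the rule only, not of the pandas code above. The function name `is_nonbasic` is hypothetical, and the rows marked as invented are placeholders for the kinds of entries that get filtered.

```python
rows = [
    {'lemma': 'の', 'script': 'hiragana'},        # invented single-kana entry
    {'lemma': 'する', 'script': 'hiragana'},
    {'lemma': 'ア', 'script': 'katakana'},        # invented single-kana entry
    {'lemma': '売場', 'script': 'kanji'},
    {'lemma': '木', 'script': 'kanji'},           # invented single kanji, which is kept
    {'lemma': 'abc', 'script': 'not_japanese'},   # invented non-Japanese entry
]

def is_nonbasic(row):
    """Keep Japanese lemmas unless they are a single hiragana or katakana character."""
    if row['script'] == 'not_japanese':
        return False
    single_kana = row['script'] in ('hiragana', 'katakana') and len(row['lemma']) == 1
    return not single_kana

kept = [row['lemma'] for row in rows if is_nonbasic(row)]
print(kept)
```

Single kanji stay in the data because a lone kanji is often a full word, while a lone kana is usually a particle or a fragment.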

Now that the data has been set up as needed for visualization and analysis, the plotting can begin. Frequency is the most important variable to look into, because it is such a large factor in the immediate usefulness of learning a word. The JLPT exam level is another variable to examine because it affects so many students of the Japanese language through language programs, classes, and tools. Finally, the part of speech ratios and the school grades in which native speakers learn these characters will be looked at as well.

First, the lemmas will be examined, followed by the individual characters that make up the lemmas.

Create wrapper functions to simplify the creation of plots.

In [131]:
def create_graph_bar(y, x, title, ylabel, xlabel, rotate_degree = None, ylog = None, order = None, orient = None):
    """A wrapper function for creating seaborn barplots."""
    
    # Pass x and y by keyword, since newer seaborn releases no longer accept them positionally.
    sns.barplot(x = x, y = y, order = order, orient = orient).set_title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    if ylog: plt.yscale('log')
    if rotate_degree is not None: plt.xticks(rotation = rotate_degree)
    plt.show()
In [132]:
def create_graph_reg(y, x, title, ylabel, xlabel, order = 1, ylog = None, alpha = 1, line_color = None, truncate = True, xjitter = 0):
    """A wrapper function for creating seaborn regplots."""
    
    # Only override the regression line color when one is given.
    line_kws = {'color': line_color} if line_color is not None else None
    # Pass the truncate argument through instead of always using True.
    sns.regplot(x = x, y = y, scatter = True, truncate = truncate, order = order, x_jitter = xjitter, scatter_kws = {'alpha': alpha}, line_kws = line_kws).set_title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    if ylog: plt.yscale('log')
    plt.show()
In [133]:
def create_graph_box(y, x, title, ylabel, xlabel, rotate_degree = None, ylog = None, xlog = None, order = None, orient = None):
    """A wrapper function for creating seaborn boxplots."""
    
    # Pass x and y by keyword, since newer seaborn releases no longer accept them positionally.
    sns.boxplot(x = x, y = y, order = order, orient = orient).set_title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    if ylog: plt.yscale('log')
    if xlog: plt.xscale('log')
    if rotate_degree is not None: plt.xticks(rotation = rotate_degree)
    plt.show()
In [134]:
def create_graph_cat(y, x,  title, ylabel, xlabel, rotate_degree = None, ylog = None, xlog = None, hue = None,  data = None, kind = 'scatter',  order = None):
    """A wrapper function for creating seaborn catplots."""
    
    ax = sns.catplot(x = x, y = y, hue = hue, data = data, kind = kind, order = order)
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    if rotate_degree is not None: plt.xticks(rotation = rotate_degree)
    if ylog: plt.yscale('log')
    if xlog: plt.xscale('log')
    plt.show()

Visualizing the numerical data for the lemma dataframe will show how the data is distributed. Any polynomial trend will be very interesting, but any linear or exponential trend will also be useful to know about.

In [135]:
# Display a grid of histograms of univariate numerical columns.
nonbasic_lemma_df.hist(column = ['frequency', 'average_character_frequency', 'total_stroke_count', 'average_character_stroke_count'], bins = 16, figsize = (10, 10), grid = False);
In [136]:
# Visualize the information of the lemma frequency dataset.
# Scale the y values as log because of the large frequency differences between the most common lemmas and the bulk of the lemmas.
x = df['rank']
y = df['frequency']
plt.plot(x, y)
plt.title('Frequencies of the 15,000 Most Common Lemmas in Japanese')
plt.xlabel('Lemma Rank')
plt.xticks([0, 1500, 3000, 4500, 6000, 7500, 9000, 10500, 12000, 13500, 15000], rotation = 'vertical')
plt.ylabel('Lemma Frequency')
plt.yscale('log')
plt.show()

Frequency varies too widely for its histogram to be useful in the same form as the others, but the rank-frequency chart made earlier can be reexamined instead.
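
The scale of that variance is easy to check from the first and last rows of the lemma dataframe shown earlier: the most frequent lemma has a frequency of 41309.50 and the 15,000th has 2.24, a spread of more than four orders of magnitude. The short sketch below works through the arithmetic behind the log-scaled y-axis. The variable names are illustrative only.

```python
import math

top_frequency = 41309.50    # rank 1 in the lemma dataframe
bottom_frequency = 2.24     # rank 15,000 in the lemma dataframe

# How many times more frequent the top lemma is than the bottom one.
ratio = top_frequency / bottom_frequency

# On a log10 axis the two values sit only a few units apart.
log_top = math.log10(top_frequency)
log_bottom = math.log10(bottom_frequency)

print(round(ratio), round(log_top, 2), round(log_bottom, 2))
```

On a linear axis almost every lemma would be pressed against the bottom of the plot, which is why the line chart uses `plt.yscale('log')`.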

Average character frequency, average character stroke count, and total character stroke count all appear to have a polynomial trend in which values near the median occur more often than the values on either side.

Lemma frequency is one of the most important variables for this dataset.

The relationships between a lemma's frequency and script, length, JLPT exam level, average character frequency, total stroke count, and average character stroke count will be visualized to give insight to which factors may need to be further researched.

In [137]:
# Display the relationship between frequency and script with a box plot.
create_graph_box(nonbasic_lemma_df['frequency'], nonbasic_lemma_df['script'], 'Lemma Frequency by Script', 'Frequency', 'Script', 45, ylog = True)

Frequency seems to be only very slightly affected by the script type.

In [138]:
# Display the relationship between frequency and lemma length with a reg plot.
create_graph_reg(nonbasic_lemma_df['frequency'], nonbasic_lemma_df['lemma'].str.len(), 'Frequency by Lemma Length', 'Frequency', 'Lemma Length', 3, True, line_color = 'red')

A lemma's frequency seems to be negatively affected by its length. Additionally, the frequency of a lemma drops drastically when it is longer than 8 characters.

In [139]:
# Display the relationship between frequency and JLPT exam level with a box plot.
create_graph_box(nonbasic_lemma_df['frequency'], nonbasic_lemma_df['jlpt_level'], 'Lemma Frequency by JLPT Exam Level', 'Frequency', 'JLPT Exam Level', ylog = True, order = jlpt_exams)

While the n5 exam has some of the most frequently used lemmas, it has a lower median than other exam levels.

In [140]:
# Display the relationship between frequency and average character frequency with a reg plot.
create_graph_reg(nonbasic_lemma_df['frequency'], nonbasic_lemma_df['average_character_frequency'], 'Frequency by Lemma Average Character Frequency', 'Frequency', 'Average Character Frequency', ylog = True, alpha = 0.05, line_color = 'red')

A lemma's frequency is positively affected by its average character frequency as a general trend.

In [141]:
# Display the relationship between frequency and total stroke count with a reg plot.
create_graph_reg(nonbasic_lemma_df['frequency'], nonbasic_lemma_df['total_stroke_count'], 'Frequency by Lemma Total Stroke Count', 'Frequency', 'Lemma Total Stroke Count', order = 1, ylog = True, alpha = 0.05, line_color = 'red')