Udacity Data Analyst Nanodegree Capstone Project
Project context, goals, and findings can be found in the readme.txt file, which is reproduced below for convenience. Download links to the datasets created throughout this project are also provided below.
Sources:
Dataset downloads:
In linguistics, a lexeme is the set of forms that a word, or more specifically a single semantic value, can take on in a language regardless of the number of ways it can be modified through inflection.
A lemma is the dictionary form of a word that is chosen by the conventions of its individual language to represent the entirety of the lexeme.
Lemmas and word stems are different in that a stem is the portion of a word that remains constant despite morphological inflection while a lemma is the base form of the word that represents the distinct meaning of the word regardless of inflection.
When studying a language, many different approaches can be taken. One efficient method is to memorize the base form of a concept, the lemma, and then, by applying the grammatical rules of the language, incorporate the remainder of the lexeme into one's usage.
More information on lemmas can be found on its corresponding Wikipedia page: (https://en.wikipedia.org/wiki/Lemma_(morphology))
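As a concrete illustration of the distinction, a lexeme can be modeled as a lemma mapped to its inflected forms. The sketch below uses the Japanese verb 食べる ("to eat") with a small, deliberately incomplete sample of its forms:

```python
# Illustrative sketch only: a lexeme modeled as a lemma plus inflected forms.
lexeme = {
    'lemma': '食べる',   # dictionary (plain) form: "to eat"
    'forms': [
        '食べる',        # plain non-past
        '食べた',        # plain past
        '食べます',      # polite non-past
        '食べない',      # plain negative
    ],
}

# Counting frequency at the lemma level groups all of these forms together.
assert all(form.startswith('食べ') for form in lexeme['forms'])
```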
This project examines the frequencies of lemmas in the Japanese language, and what factors influence those frequencies, in order to determine a more efficient approach towards Japanese second language acquisition and the ordering of teaching materials for this purpose.
Efficiency will be measured by a word's estimated frequency in the Internet Corpus, and thus the number of applications or general usefulness that learning it provides, assuming that the student can apply grammatical rules to produce all appropriate forms of the word.
Additionally, the part of speech that each lemma is classified as will be used to look into ideas for a more efficient order of learning various sets of grammatical rules in Japanese.
The lemma dataset can be found hosted at http://corpus.leeds.ac.uk/frqc/internet-jp.num and similar datasets for other languages can be found at http://corpus.leeds.ac.uk/list.html
The Part of Speech distribution frequency dataset can be found at http://corpus.leeds.ac.uk/frqc/internet-jp-pos.num
The JLPT tier dataset was created from this webpage https://www.nihongo-pro.com/kanji-pal/list/jlpt
The Kanji .json dataset is found at https://thekanjimap.com/kanji.html
Character information is additionally gathered from http://www.edrdg.org/cgi-bin/wwwjdic/wwwjdic?1C
Lemma frequency is most strongly affected by its length and both total and average character stroke counts.
When weighting the importance of learning a lemma, words written in the kanji and hiragana scripts should be given higher priority.
While longer lemmas are generally used less frequently, the lemmas included in the JLPT n5 exam vocabulary list should be made exempt from negative weighting from length due to the type of hiragana words in this category.
A more efficient Japanese learning order will have the largest focus on nouns, verbs, and adjective syntax, but will cover auxiliary verb, conjunction, and particle rules in earlier stages.
The 'Frequency by Lemma Length', 'Frequency by Lemma Total Stroke Count', and 'Frequency by Lemma Average Character Stroke Count' plots all have extremely similar trend lines, which means that lemma length, lemma total stroke count, and lemma average character stroke count all have similar impacts on a lemma's frequency. Combining these as subplots on a single plot makes this comparison clear.
The 'Lemma Frequency by Script' plot shows that the differences in medians and interquartile ranges among the script types indicate the varying importance of focusing on each script; beyond that, the sheer number of extremely high-frequency outliers in the hiragana script is worth discussion and study on its own.
The 'Lemma JLPT Level by Frequency' plot shows the distributions of each JLPT level's frequencies. While the n5 exam has only the third highest median and third quartile of the levels, it also has the most outliers and the highest frequency values of all. When compared to the 'Lemma JLPT Level by Lemma Length' plot, the reason becomes apparent: the JLPT n5 exam vocabulary has the highest mean and range of lemma length.
The 'Distribution of Lemma Parts of Speech' plot shows the ratio of each part of speech, with nouns, verbs, and adjectives having by far the most representation. Comparing this with the 'Lemma Average Character Stroke Count' and 'Lemma Average Character Frequency' plots shows why a learning order cannot be based only on these ratios, as three of the least common parts of speech in the dataset have the highest average character frequencies.
The project begins by loading in datasets and creating columns of data from the existing information in order to have a sufficient amount of variables to examine.
# Import all modules and libraries, as well as set matplotlib plotting to occur in the notebook.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
from bs4 import BeautifulSoup
import json
import statsmodels.api as sm
%matplotlib inline
# Read the lemma dataset into a dataframe called original_df.
original_df = pd.read_csv('japanese_lemmas.csv')
original_df.head()
The lemma dataset has three columns:
# Read the part of speech dataset into a dataframe called original_pos_df.
original_pos_df = pd.read_csv('japanese_pos_frequencies.csv', names = ['rank', 'frequency', 'jap_pos'])
original_pos_df.head()
The part of speech dataset has three columns:
# Work with copies of the original dataframes.
df = original_df.copy()
pos_df = original_pos_df.copy()
# Visualize the information of the lemma frequency dataset.
# Scale the y values as log because of the large frequency differences between the most common lemmas and the bulk of the lemmas.
x = df['rank']
y = df['frequency']
plt.plot(x, y)
plt.title('Frequencies of the 15,000 Most Common Lemmas in Japanese')
plt.xlabel('Lemma Rank')
plt.xticks([0, 1500, 3000, 4500, 6000, 7500, 9000, 10500, 12000, 13500, 15000], rotation = 'vertical')
plt.ylabel('Lemma Frequency')
plt.yscale('log')
plt.show()
The distribution of the lemma frequencies is heavily skewed, consistent with Zipf's law: only a few words are extremely common due to simplicity or syntactic importance, while the bulk of the rest slowly become less frequent as they become more specific or niche.
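This rank-frequency shape is characteristic of Zipf's law, under which frequency is roughly proportional to 1/rank^s. A quick sketch of how such a curve behaves on a log-log scale, using synthetic data rather than the corpus file:

```python
import numpy as np

# Synthetic Zipfian frequencies: frequency proportional to 1 / rank**s.
ranks = np.arange(1, 15001)
s = 1.0
freqs = 1e6 / ranks ** s

# On a log-log scale the relationship is linear with slope -s,
# so a least-squares fit recovers the exponent.
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
assert abs(slope - (-s)) < 1e-6
```

Fitting the same line to the real `df['rank']` and `df['frequency']` columns would give an empirical estimate of the corpus's exponent.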
# Check for duplicate rows.
df.duplicated().sum()
# Check for null values.
df.isnull().sum()
# Check dtypes.
df.info()
This dataset is already clean, so little tidying needs to be done.
# Correct the rank column's dtype from int to string.
df['rank'] = df['rank'].astype('object')
# Check dtypes.
df.info()
Begin working with the part of speech frequency dataset by adding a translated and transliterated part of speech column for English speakers.
# Manually define the translations and transliterations.
translations = ['noun', 'particle', 'symbol', 'verb', 'auxiliary verb', 'adverb', 'adjective', 'adnominal', 'conjunction', 'prefix', 'interjection', 'filler', 'other']
transliterations = ['めいし', 'じょし', 'きごう', 'どうし', 'じょどうし', 'ふくし', 'けいようし', 'れんたいし', 'せつぞくし', 'せっとうし', 'かんどうし', 'ふぃらあ', 'そのほか']
pos_df['eng_pos'] = translations
pos_df['transliterated_pos'] = transliterations
# Reorder the columns so that eng_pos is next to frequency and the two Japanese columns are adjacent for easier reading.
pos_df = pos_df[['rank', 'frequency', 'eng_pos', 'jap_pos', 'transliterated_pos']]
# Ensure the pos_df is easily readable.
pos_df
# Check and set column dtypes.
pos_df.info()
This part of speech dataset is also already clean and just needs minor dtype adjustment.
# Correct the rank column's dtype from int to string.
pos_df['rank'] = pos_df['rank'].astype('object')
# Check dtypes.
pos_df.info()
The ratio of each part of speech will be useful to have attached to each row of the pos_df.
# Calculate the frequency percentage of each part of speech.
total = pos_df['frequency'].sum()
pos_df['frequency_percentage'] = pos_df['frequency'].apply(lambda x: x / total)
# Reorder the columns to place the frequency percentage by the frequency.
pos_df = pos_df[['rank', 'eng_pos', 'frequency', 'frequency_percentage', 'jap_pos', 'transliterated_pos']]
pos_df
Nouns understandably take up just over a third of the language usage, but particles actually take up a full fifth of the language usage, twice as much as verbs.
Create a dataframe of every individual character found in the lemma dataset.
While populating the character list, total the number of times each character is used.
# Find each character and the number of times it appears in the lemma dataset.
characters = {}
for lemma in df['lemma']:
    for char in lemma:
        if char in characters:
            characters[char] += 1
        else:
            characters[char] = 1
Glance at a subsection of the character dictionary.
dict(list(characters.items())[:30])
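The same tally can be written more compactly with `collections.Counter`, which the dictionary loop above re-implements. A standalone sketch on toy data (in the notebook the loop would run over `df['lemma']` instead):

```python
from collections import Counter

# Counter.update(iterable) counts each element, so updating with a string
# counts its characters.
lemmas = ['日本', '日本語', '語る']
characters = Counter()
for lemma in lemmas:
    characters.update(lemma)

assert characters['日'] == 2
assert characters['語'] == 2
assert characters['る'] == 1
```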
# Create the character dataframe from the characters dictionary.
# Change the column names, sort the rows by descending frequency, and correct the index.
char_df = pd.DataFrame.from_dict(characters, orient = 'index')
char_df = char_df.reset_index()
char_df = char_df.rename({'index': 'character', 0: 'frequency'}, axis='columns')
char_df.sort_values('frequency', ascending = False, inplace = True)
char_df = char_df.reset_index(drop = True)
char_df.head()
Additionally, calculate a 'weighted' frequency for each character.
This is the sum of the frequencies of the words that the character appears in, counting each time the character appears in the word.
# Approximate the number of times each character appeared in the corpus that the lemma dataset was derived from.
weighted_characters = {}
for index, row in df.iterrows():
    for char in row['lemma']:
        if char in weighted_characters:
            weighted_characters[char] += row['frequency']
        else:
            weighted_characters[char] = row['frequency']
dict(list(weighted_characters.items())[:30])
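`DataFrame.iterrows()` is notoriously slow on large frames; zipping the two columns iterates the same pairs as plain Python objects and is usually much faster. A standalone sketch on toy data (the notebook version would zip `df['lemma']` and `df['frequency']`):

```python
import pandas as pd

# Toy stand-in for the lemma dataframe.
toy = pd.DataFrame({'lemma': ['日本', '日本語'], 'frequency': [100, 40]})

# Same weighted tally as above, without iterrows().
weighted = {}
for lemma, freq in zip(toy['lemma'], toy['frequency']):
    for char in lemma:
        weighted[char] = weighted.get(char, 0) + freq

assert weighted == {'日': 140, '本': 140, '語': 40}
```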
# Create the weighted character dataframe from the characters dictionary.
# Change the column names, sort the rows by descending frequency, and correct the index.
weighted_char_df = pd.DataFrame.from_dict(weighted_characters, orient = 'index')
weighted_char_df = weighted_char_df.reset_index()
weighted_char_df = weighted_char_df.rename({'index': 'character', 0: 'weighted_frequency'}, axis='columns')
weighted_char_df.sort_values('weighted_frequency', ascending = False, inplace = True)
weighted_char_df = weighted_char_df.reset_index(drop = True)
weighted_char_df.head()
# Merge the character dataframes based on the character for each row.
char_df = char_df.merge(weighted_char_df)
char_df.sort_values('weighted_frequency', ascending = False, inplace = True)
char_df = char_df.reset_index(drop = True)
char_df
# Ensure that each row has a value for each column of data.
char_df.info()
Create a rank column in the character dataframe that is equivalent to the rank column in the lemma dataframe.
# Create a rank column for the char_df to match the lemma df for both weighted and non-weighted frequency columns.
char_df['weighted_rank'] = char_df['weighted_frequency'].rank(method = 'first', ascending = False)
char_df.sort_values('frequency', ascending = False, inplace = True)
char_df['rank'] = char_df['frequency'].rank(method = 'first', ascending = False)
char_df
# Fix the index and correct the rank and weighted_rank dtypes from int to string.
char_df = char_df.reset_index(drop = True)
char_df['rank'] = char_df['rank'].astype('int').astype('object')
char_df['weighted_rank'] = char_df['weighted_rank'].astype('int').astype('object')
char_df
# Reorder the columns for human readability.
char_df = char_df[['rank', 'frequency', 'character', 'weighted_frequency', 'weighted_rank']]
char_df
Each character needs to be tagged with its appropriate script type.
We can easily classify each character by looking up its unicode representation.
def classify_char_script(char):
    """Look up the Unicode code point of the given character and return its script."""
    char = ord(char)
    if 0 <= char <= 8591:
        return 'latin'
    elif 12288 <= char <= 12351:
        return 'punctuation'
    elif 12352 <= char <= 12447:
        return 'hiragana'
    elif 12448 <= char <= 12543:
        return 'katakana'
    elif 19968 <= char <= 40879:
        return 'kanji'
    elif 65280 <= char <= 65374:
        return 'full-width_roman'
    elif 65375 <= char <= 65519:
        return 'half-width_katakana'
    else:
        return 'other'
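A few spot checks of the classification; the Japanese ranges are restated minimally here so this cell runs on its own:

```python
# Minimal restatement of the Unicode ranges above, for standalone sanity checks.
def script_of(char):
    cp = ord(char)
    if 12352 <= cp <= 12447:
        return 'hiragana'
    if 12448 <= cp <= 12543:
        return 'katakana'
    if 19968 <= cp <= 40879:
        return 'kanji'
    return 'other'

assert script_of('あ') == 'hiragana'   # U+3042
assert script_of('ア') == 'katakana'   # U+30A2
assert script_of('本') == 'kanji'      # U+672C
```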
# Classify each character in the char_df
char_df['script'] = char_df['character'].apply(classify_char_script)
# Check the total number of characters in each script in this dataset.
char_df['script'].value_counts()
Adding the characters' Unicode code point to each row may be useful for later sorting or testing.
def get_ord(row):
    """Return the Unicode code point of the character in the given row."""
    return ord(row['character'])
# Create a column that holds each character's Unicode code point for reference.
char_df['ord'] = char_df.apply(get_ord, axis = 1)
char_df
The Japanese-Language Proficiency Test (JLPT) is an extremely influential standardized test used to evaluate a non-native student's Japanese ability.
It consists of 5 different levels, the n5, n4, n3, n2, and n1 exams. The n5 is the easiest, testing beginner concepts, and the n1 is the most advanced, testing the ability to understand Japanese in virtually any circumstance.
Each character should be tagged with its appropriate JLPT exam level.
# Read in and then merge the JLPT rank dataset into the character dataframe.
jlpt_df = pd.read_csv('jlpt_levels.csv')
jlpt_df = jlpt_df.rename(columns={"kanji": "character"})
jlpt_df.head()
# Use the object dtype for the jlpt_level column because they are categorical, not quantitative.
char_df = char_df.merge(jlpt_df, how = 'left')
char_df['jlpt_level'] = char_df['jlpt_level'].astype('object')
char_df
# Check the total number of characters in each JLPT exam level in the JLPT dataset.
jlpt_df['jlpt_level'].value_counts()
# Check the total number of characters in each JLPT exam level in the character dataframe.
char_df['jlpt_level'].value_counts()
The jlpt_level column contains float-derived values rather than integers because the merge introduced NaNs, which force a float dtype.
# Correct the float strings to integer strings and change the exam level names.
char_df.loc[char_df['jlpt_level'] == 1.0, 'jlpt_level'] = 'n1'
char_df.loc[char_df['jlpt_level'] == 2.0, 'jlpt_level'] = 'n2'
char_df.loc[char_df['jlpt_level'] == 3.0, 'jlpt_level'] = 'n3'
char_df.loc[char_df['jlpt_level'] == 4.0, 'jlpt_level'] = 'n4'
char_df.loc[char_df['jlpt_level'] == 5.0, 'jlpt_level'] = 'n5'
char_df['jlpt_level'].value_counts()
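The five .loc assignments can also be collapsed into a single `Series.map` call with a dict; a sketch on a toy Series (with `na_action='ignore'`, NaN rows pass through untouched):

```python
import pandas as pd

# Toy stand-in for the float-valued jlpt_level column.
levels = pd.Series([1.0, 5.0, None, 3.0])
labels = levels.map({1.0: 'n1', 2.0: 'n2', 3.0: 'n3', 4.0: 'n4', 5.0: 'n5'},
                    na_action='ignore')

assert labels[0] == 'n1'
assert labels[1] == 'n5'
assert pd.isna(labels[2])
```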
Hiragana and Katakana characters are considerably easier to learn than nearly all kanji, and both syllabaries are expected to be known before the JLPT n5 exam is taken.
The hiragana and katakana characters will be set to the easiest JLPT exam level, the n5.
# Set all hiragana and katakana characters to the easiest JLPT exam level.
char_df.loc[char_df['script'] == 'hiragana', 'jlpt_level'] = 'n5'
char_df.loc[char_df['script'] == 'katakana', 'jlpt_level'] = 'n5'
char_df.loc[char_df['script'] == 'half-width_katakana', 'jlpt_level'] = 'n5'
char_df
# Check the new total number of characters in each JLPT exam level in the character dataframe.
char_df['jlpt_level'].value_counts()
# Visualize the total number of characters in each JLPT exam level in the character dataframe.
plot_data = char_df['jlpt_level'].value_counts()
x = plot_data.keys().tolist()
y = plot_data.values.tolist()
ax = sns.barplot(x, y, order = ['n5', 'n4', 'n3', 'n2', 'n1'])
ax.set(xlabel = 'JLPT Exam Level', ylabel = 'Number of Characters')
plt.title('JLPT Exam Level Character Distribution')
plt.show()
After the initial investment of learning both hiragana and katakana with basic kanji, each JLPT exam level expects more new kanji than before.
The JLPT tests fluency, but cannot be truly comprehensive. Many kanji characters are not regularly used by even native speakers, so these are not tested.
Set the kanji that are more advanced than the JLPT exams to the value of n0. There is no n0 JLPT exam, but this will signify that the character is beyond the exams.
# Set the JLPT exam level to n0 for every character that does not have a jlpt_level value yet.
char_df.loc[char_df['jlpt_level'].isna(), 'jlpt_level'] = 'n0'
# Check the new total number of characters in each JLPT exam level in the character dataframe.
char_df['jlpt_level'].value_counts()
# Visualize the new total number of characters in each JLPT exam level in the character dataframe with a barplot.
plot_data = char_df['jlpt_level'].value_counts()
x = plot_data.keys().tolist()
y = plot_data.values.tolist()
ax = sns.barplot(x, y, order = ['n5', 'n4', 'n3', 'n2', 'n1', 'n0'])
ax.set(xlabel = 'JLPT Exam Level', ylabel = 'Number of Characters')
plt.title('JLPT Exam Level Character Distribution Including Advanced Characters')
plt.show()
There are far, far more kanji than are represented on this graph, but these counts cover all of the characters in the most common 15,000 lemmas of the Japanese language. 523 of these kanji are not included on the JLPT exams but are still common enough to be worth eventually learning.
The stroke count, or the total number of strokes needed to write the character, is another metric of assessing difficulty of learning Japanese characters. The higher the stroke count, the more individual pieces needed to be memorized and recalled correctly.
Each character will be tagged with the appropriate stroke count.
Characters that have a debatable stroke count will be given the higher number of the possibilities.
# Read in the kanji.json file to get the stroke count of each character.
json_df = pd.read_json('kanji.json', encoding = 'UTF-8')
json_df.head()
# Check the total number of characters in each grade group in the json dataframe.
json_df['grade'].value_counts()
While this grade grouping information is useful, the individual grade levels gathered later will be more useful, so this grouped version will be left out.
# Sort by stroke count.
json_df.sort_values('stroke', inplace = True)
json_df = json_df.reset_index(drop = True)
json_df
# Rename the kanji column to character as in all other dataframes.
json_df = json_df.rename(columns={"kanji": "character"})
json_df.head()
# Check for any oddities.
char_df.info()
# Get the stroke count for each character from the json_df for the char_df.
char_df['stroke_count'] = char_df['character'].map(json_df.set_index('character')['stroke'])
char_df
Many of the stroke counts are still missing, so the remaining data will be retrieved from another source. While it is being collected, the grade and frequency rating will be gathered as well.
# Create empty columns for the grade and frequency rating values.
char_df['grade'] = np.nan
char_df['frequency_rating'] = np.nan
char_df.info()
char_df
That dataset was not complete enough, so the stroke count will be scraped from http://www.edrdg.org/.
Additionally, the school grade that the character is typically learned in will be scraped as well.
def char_lookup(char):
    """Scrape the character's page at http://www.edrdg.org/ and set the character's
    stroke_count, grade, and frequency_rating values in char_df to the scraped information."""
    try:
        url_base = 'http://www.edrdg.org/cgi-bin/wwwjdic/wwwjdic?1MMJ'
        # Create the URL for the current character's page.
        current_url = url_base + str(char)
        # Request the page.
        page = requests.get(current_url)
        # Parse the page with BS4.
        # This source has extra </b></td></tr> tags that break Python's default HTML parser,
        # so use the external lxml parser instead.
        soup = BeautifulSoup(page.content, 'lxml')
        table = soup.find_all("table")[1]
        stroke_element = table.find("td", string = 'Stroke Count')
        stroke_count = str(stroke_element.next_sibling)[7:-9]
        # Some kanji have different possible stroke counts based on writing style.
        # Parse both and assume the higher.
        if ' ' in stroke_count:
            stroke_count = stroke_count[:-1]
            before = str(stroke_count)
            first = stroke_count.split()[0]
            second = stroke_count.split()[1]
            # Take the higher stroke count.
            stroke_count = max(int(first), int(second))
            after = str(stroke_count)
            print(char + ': ' + before + ' -> ' + after)
        try:
            grade_element = table.find("td", string = 'Grade')
            grade = str(grade_element.next_sibling)[7:-9]
        except:
            print(char + ': Grade not found.')
            grade = np.nan
        try:
            freq_element = table.find("td", string = 'Frequency ranking')
            frequency_rating = str(freq_element.next_sibling)[7:-9]
        except:
            print(char + ': No frequency ranking found.')
            frequency_rating = np.nan
        # Save the results.
        index = char_df.loc[char_df['character'] == char].index.item()
        char_df.at[index, 'stroke_count'] = stroke_count
        char_df.at[index, 'grade'] = grade
        # Write to the frequency_rating column created earlier
        # (the original wrote to a new frequency_ranking column by mistake).
        char_df.at[index, 'frequency_rating'] = frequency_rating
        print(char + ': Success')
    except:
        print(char + ': Failed')
# Test the function on a simple and common character
char_lookup('本')
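The core parsing step in char_lookup() — finding a cell by its text and reading its next sibling — can also be checked offline against a hand-written snippet. The snippet below is an assumption for illustration only; the real page layout at edrdg.org may differ:

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one row of the scraped table.
html = '<table><tr><td>Stroke Count</td><td>5</td></tr></table>'
# The toy snippet is well-formed, so the default parser suffices here.
soup = BeautifulSoup(html, 'html.parser')

cell = soup.find('td', string = 'Stroke Count')
stroke_count = cell.next_sibling.get_text()
assert stroke_count == '5'
```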
# Note that this takes a long time.
# This block is commented out to avoid accidentally re-scraping it all.
'''
# Use the char_lookup function to scrape the stroke_count, grade, and frequency_rating cloumn values for each character, and then save the resulting dataframe to avoid having to rescrape the data.
char_df['character'].map(char_lookup)
char_df.to_csv('char_df_with_strokes.csv')
'''
# Presume that the entire notebook is being run, and reload the previously saved dataframe that includes the scraped data.
char_df = pd.read_csv('char_df_with_strokes.csv', index_col = 0)
# Check to see if any kanji characters' stroke_counts are missing and need to be re-scraped.
chars_to_redo = char_df.loc[(char_df['stroke_count'].isnull()) & (char_df['script'] == 'kanji')].index.tolist()
chars_to_redo
# Re-scrape any missing stroke_counts.
for index in chars_to_redo:
char_lookup(char_df.at[index, 'character'])
The hiragana and katakana stroke counts will be added manually, as that is simpler than scraping them from yet another source.
# Create a dictionary with each hiragana character's stroke count.
hiragana_strokes = {
'あ': 3,
'い': 2,
'う': 2,
'え': 2,
'お': 3,
'か': 3,
'き': 3,
'く': 1,
'け': 3,
'こ': 2,
'さ': 2,
'し': 1,
'す': 2,
'せ': 3,
'そ': 1,
'た': 4,
'ち': 2,
'つ': 1,
'て': 1,
'と': 2,
'な': 4,
'に': 3,
'ぬ': 2,
'ね': 2,
'の': 1,
'は': 3,
'ひ': 1,
'ふ': 4,
'へ': 1,
'ほ': 4,
'ま': 3,
'み': 2,
'む': 3,
'め': 2,
'も': 3,
'や': 3,
'ゆ': 2,
'よ': 2,
'ら': 2,
'り': 2,
'る': 1,
'れ': 2,
'ろ': 1,
'わ': 2,
'を': 3,
'ん': 1,
'が': 3,
'ぎ': 3,
'ぐ': 1,
'げ': 3,
'ご': 2,
'ざ': 2,
'じ': 1,
'ず': 2,
'ぜ': 3,
'ぞ': 1,
'だ': 4,
'ぢ': 2,
'づ': 1,
'で': 1,
'ど': 2,
'ば': 3,
'び': 1,
'ぶ': 4,
'べ': 1,
'ぼ': 4,
'ぱ': 3,
'ぴ': 1,
'ぷ': 4,
'ぺ': 1,
'ぽ': 4,
'ゃ': 3,
'ゅ': 2,
'ょ': 2,
' ゙': 2,
'゜': 1,
'ゐ': 1,
'ゑ': 1
}
# Create a dictionary with each katakana and needed computer symbol character's stroke count.
katakana_strokes = {
'ア': 2,
'イ': 2,
'ウ': 3,
'エ': 3,
'オ': 3,
'カ': 2,
'キ': 3,
'ク': 2,
'ケ': 3,
'コ': 2,
'サ': 3,
'シ': 3,
'ス': 2,
'セ': 2,
'ソ': 2,
'タ': 3,
'チ': 3,
'ツ': 3,
'テ': 3,
'ト': 2,
'ナ': 2,
'ニ': 2,
'ヌ': 2,
'ネ': 4,
'ノ': 1,
'ハ': 2,
'ヒ': 2,
'フ': 1,
'ヘ': 1,
'ホ': 4,
'マ': 2,
'ミ': 3,
'ム': 2,
'メ': 2,
'モ': 3,
'ヤ': 2,
'ユ': 2,
'ヨ': 3,
'ラ': 2,
'リ': 2,
'ル': 2,
'レ': 1,
'ロ': 3,
'ワ': 2,
'ヲ': 3,
'ン': 2,
'ガ': 2,
'ギ': 3,
'グ': 2,
'ゲ': 3,
'ゴ': 2,
'ザ': 3,
'ジ': 3,
'ズ': 2,
'ゼ': 2,
'ゾ': 2,
'ダ': 3,
'ヂ': 3,
'ヅ': 3,
'デ': 3,
'ド': 2,
'バ': 2,
'ビ': 2,
'ブ': 1,
'ベ': 1,
'ボ': 4,
'パ': 2,
'ピ': 2,
'プ': 1,
'ペ': 1,
'ポ': 4,
'ャ': 2,
'ュ': 2,
'ョ': 3,
'ヰ': 4,
'ヱ': 3,
# Nonbasic characters below here.
'ー': 1,
'ィ': 2,
'々': 3,
'ェ': 3,
'ァ': 2,
'ォ': 3,
'ぁ': 3,
'ヴ': 3,
'―': 1,
'─': 1,
'ヶ': 3,
'ぇ': 2,
'ゝ': 1,
'ぉ': 3,
'¥': 4,
'□': 3,
'ゞ': 1,
'〒': 3,
'ヵ': 2,
'・': 1,
'0': 1,
'1': 1,
'2': 1,
'3': 1,
'4': 2,
'5': 2,
'6': 1,
'7': 1,
'8': 1,
'9': 1,
'「': 1,
'」': 1,
'(': 1,
')': 1,
'{': 1,
'}': 1,
'’': 1,
'”': 2,
'<': 1,
'>': 1,
'、': 1,
'。': 1,
'?': 2,
'〜': 1,
# W杯 / W-hai for World Cup
'W': 1,
# Tシャツ / T-shatsu for T-Shirt
'T': 2,
# Jリーグ / J-riigu for J1 League
'J': 2,
#  ̄ Macron / overline for Hepburn long vowel notation
' ̄': 1,
# ヽ Katakana iteration mark
'ヽ': 1,
# ヾ Katakana dakuten / voiced iteration mark
'ヾ': 3,
# ゛ Dakuten
'゛': 2
}
# Set the stroke_counts and grade for the hiragana manually.
for char, strokes in hiragana_strokes.items():
    try:
        index = char_df.loc[char_df['character'] == char].index.item()
        char_df.at[index, 'stroke_count'] = strokes
        char_df.at[index, 'grade'] = 0
    except:
        print(char + ': Not in char_df')
# Set the stroke_counts and grade for the katakana manually.
for char, strokes in katakana_strokes.items():
    try:
        index = char_df.loc[char_df['character'] == char].index.item()
        char_df.at[index, 'stroke_count'] = strokes
        char_df.at[index, 'grade'] = 0
    except:
        print(char + ': Not in char_df')
# Display the total number of characters with stroke counts, the total number of characters still without stroke counts, and the percentage of characters still without stroke counts.
num_has_strokes = char_df['stroke_count'].value_counts().sum()
total_chars = len(char_df)
num_without_strokes = total_chars - num_has_strokes
percent_without_strokes = num_without_strokes / total_chars
num_has_strokes, num_without_strokes, percent_without_strokes
# Create a list of characters that still do not have stroke counts, to consider adding them individually.
char_to_manually_add = []
for char in char_df.loc[char_df['stroke_count'].isnull()]['character']:
    char_to_manually_add.append(char)
Some characters will still not have a stroke count or grade value, but these should only be non-Japanese characters that are irrelevant to this project.
A dataframe that contains only the relevant characters will be created later, after ensuring that these characters will not be needed.
# Search through these and retroactively add the remaining non-Latin
# and non-punctuation characters to the hiragana and katakana dictionaries.
char_to_manually_add
# Check column dtypes.
char_df.info()
# Check the total number of characters in each grade.
char_df['grade'].value_counts()
Grade 8 represents secondary school as a whole in Japan; grades 9 and 10 are more niche, advanced groupings covering the same period of education, and grade 7 is unused by convention.
# Change the char_df's rank and weighted_rank columns' dtypes from int and float to string.
char_df['rank'] = char_df['rank'].astype('object')
char_df['weighted_rank'] = char_df['weighted_rank'].astype('object')
char_df.head(2)
Now that a complete character dataframe has been created, we can apply the information calculated in it to provide a lot of insight and information about the lemma dataset.
First, the script that each lemma is made up of will be calculated.
def classify_word_script(row):
    """Use the classify_char_script function to determine and return the script(s) that the lemma in the given row is made up of."""
    word = row['lemma']
    char_scripts = []
    kanji = False
    hiragana = False
    katakana = False
    for char in word:
        char_scripts.append(classify_char_script(char))
    if 'kanji' in char_scripts:
        kanji = True
    if 'hiragana' in char_scripts:
        hiragana = True
    if 'katakana' in char_scripts or 'half-width_katakana' in char_scripts:
        katakana = True
    # Return the proper combination category.
    if kanji and not hiragana and not katakana:
        return 'kanji'
    elif hiragana and not kanji and not katakana:
        return 'hiragana'
    elif katakana and not kanji and not hiragana:
        return 'katakana'
    elif kanji and hiragana and not katakana:
        return 'kanji_and_hiragana'
    elif kanji and katakana and not hiragana:
        return 'kanji_and_katakana'
    elif hiragana and katakana and not kanji:
        return 'hiragana_and_katakana'
    elif kanji and hiragana and katakana:
        return 'all'
    else:
        return 'not_japanese'
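An equivalent formulation (a sketch) collects the set of scripts present and looks the combination up in a dict, rather than tracking three booleans. classify_char_script is restated minimally here so the cell stands alone:

```python
# Minimal stand-in for classify_char_script, for a self-contained example.
def toy_char_script(char):
    cp = ord(char)
    if 12352 <= cp <= 12447:
        return 'hiragana'
    if 12448 <= cp <= 12543:
        return 'katakana'
    if 19968 <= cp <= 40879:
        return 'kanji'
    return 'other'

SCRIPT_NAMES = {
    frozenset(['kanji']): 'kanji',
    frozenset(['hiragana']): 'hiragana',
    frozenset(['katakana']): 'katakana',
    frozenset(['kanji', 'hiragana']): 'kanji_and_hiragana',
    frozenset(['kanji', 'katakana']): 'kanji_and_katakana',
    frozenset(['hiragana', 'katakana']): 'hiragana_and_katakana',
    frozenset(['kanji', 'hiragana', 'katakana']): 'all',
}

def word_script(word):
    scripts = {toy_char_script(c) for c in word}
    # Fold half-width katakana into katakana (a no-op with the toy classifier),
    # then ignore non-Japanese scripts.
    scripts = {'katakana' if s == 'half-width_katakana' else s for s in scripts}
    scripts &= {'kanji', 'hiragana', 'katakana'}
    return SCRIPT_NAMES.get(frozenset(scripts), 'not_japanese')

assert word_script('食べる') == 'kanji_and_hiragana'
assert word_script('カメラ') == 'katakana'
assert word_script('abc') == 'not_japanese'
```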
# Create a script column for the lemma dataframe by using the classify_word_script() function on each row.
df['script'] = df.apply(classify_word_script, axis = 1)
df
# Check the total number of lemmas in each script combination.
df['script'].value_counts()
# Visualize the total number of lemmas in each script combination with a barplot.
plot_data = df['script'].value_counts()
x = plot_data.keys().tolist()
y = plot_data.values.tolist()
ax = sns.barplot(x, y)
ax.set(xlabel = 'Script', ylabel = 'Number of Lemmas')
ax.set_xticklabels(ax.get_xticklabels(), rotation = 90)
plt.title('Lemma Script Distribution')
plt.show()
With the disproportionate number of kanji in the language compared to the other scripts, it's no surprise that kanji-only words make up the bulk of the language usage. However, there is not a single instance of a word that contains all three scripts. Additionally, katakana-only words are more common than hiragana-only words, likely because of foreign loanwords.
Like the script makeup, a minimum JLPT exam level can be determined for each lemma by finding the highest exam level among the characters it is made up of.
def jlpt_level(row):
    """Determine and return the highest-ranking jlpt_level among all characters in the lemma of the given row.
    JLPT ranking order: n0 > n1 > n2 > n3 > n4 > n5."""
    word = row['lemma']
    char_levels = []
    for char in word:
        char_row = char_df.loc[char_df['character'] == char]
        char_levels.append(str(char_row.get('jlpt_level').item()))
    # Return the highest character rank, since knowing the word requires knowing all the characters in it.
    if 'n0' in char_levels:
        return 'n0'
    elif 'n1' in char_levels:
        return 'n1'
    elif 'n2' in char_levels:
        return 'n2'
    elif 'n3' in char_levels:
        return 'n3'
    elif 'n4' in char_levels:
        return 'n4'
    elif 'n5' in char_levels:
        return 'n5'
    else:
        return 'error'
# Test the jlpt_level() function.
jlpt_level(df.iloc[400])
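The chain of membership checks amounts to taking the hardest level present. That rule can be stated directly with min() over a numeric ordering (a sketch; smaller number = harder level):

```python
# n0 is the hardest "level" (beyond the exams), n5 the easiest.
HARDNESS = {'n0': 0, 'n1': 1, 'n2': 2, 'n3': 3, 'n4': 4, 'n5': 5}

def hardest_level(char_levels):
    """Return the hardest JLPT level in the given list of per-character levels."""
    return min(char_levels, key=HARDNESS.__getitem__)

assert hardest_level(['n5', 'n3', 'n5']) == 'n3'
assert hardest_level(['n1', 'n0']) == 'n0'
```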
# Create an ordered list to use when referencing all JLPT exam levels from here on out.
jlpt_exams = ['n5', 'n4', 'n3', 'n2', 'n1', 'n0']
# Note that this takes some time to execute.
# Use the jlpt_level on each row of the lemma dataframe to assign a JLPT exam level to each lemma.
df['jlpt_level'] = df.apply(jlpt_level, axis = 1)
df
# Check the total number of lemmas in each JLPT exam level.
df['jlpt_level'].value_counts()
# Visualize the total number of lemmas in each JLPT exam level with a barplot.
plot_data = df['jlpt_level'].value_counts()
x = plot_data.keys().tolist()
y = plot_data.values.tolist()
ax = sns.barplot(x, y, order = jlpt_exams)
ax.set(xlabel = 'JLPT Exam Level', ylabel = 'Number of Lemmas')
plt.title('Lemma JLPT Exam Level Distribution')
plt.show()
The n5 level includes the most lemmas, as expected, because hiragana-only and katakana-only words can be easily read, though perhaps they should not all be learned at this level alone. Accounting for this caveat, the largest leaps in vocabulary accessibility come at the n3 and n1 exam levels.
An additional pair of variables for each character will be the total number of words the character appears in and the average length of those words.
def count_word_usage(char):
    """Calculate and return the number of lemmas that the given character appears in and the average length of those lemmas."""
    word_list = []
    char_count = 0
    for lemma in df['lemma']:
        if char in lemma:
            word_list.append(lemma)
    # Total the characters across the matching lemmas.
    for word in word_list:
        char_count += len(word)
    count = len(word_list)
    average_word_length = round(char_count / count, 2)
    return count, average_word_length
# Test the count_word_usage() function.
print(count_word_usage('食'))
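The per-character scan can also be expressed with vectorized string operations; a standalone sketch on toy data (the notebook version would use df['lemma'] and the real character):

```python
import pandas as pd

# Toy stand-in for the lemma dataframe.
toy = pd.DataFrame({'lemma': ['食べる', '食事', '水']})

# Boolean-mask the lemma column once, then take the count and mean length.
mask = toy['lemma'].str.contains('食', regex = False)
matches = toy.loc[mask, 'lemma']

count = len(matches)
average_word_length = round(matches.str.len().mean(), 2)
assert (count, average_word_length) == (2, 2.5)
```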
# Create a word count and average word length column for each character in the character dataframe using the count_word_usage() function.
# frequency is the number of times the character appears in the list of lemmas.
# word_count is the number of lemmas that the character appears in at least once.
char_df['word_count'], char_df['average_word_length'] = zip(*char_df['character'].map(count_word_usage))
char_df
# Describe the word_count column data.
char_df['word_count'].describe()
# Visualize the character word_count amounts with a scatterplot and a log scaled y-axis.
x = range(0, len(char_df))
y = char_df['word_count']
plt.scatter(x, y)
plt.title('Word Counts of Japanese Characters')
plt.ylabel('Word Count')
plt.yscale('log')
plt.show()
Like the lemma frequencies, the total number of lemmas each character appears in follows a roughly logarithmic curve, though the word count curve is significantly less dramatic.
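One way to quantify how close such a curve is to a power law is to fit a line in log-log space and read off the slope. A sketch on synthetic counts (with the real data, `word_count` would be `char_df['word_count']` sorted in descending order):

```python
import numpy as np

# Synthetic rank/count data following an exact power law (illustrative only).
rank = np.arange(1, 1001)
word_count = 5000 / rank ** 0.7

# A power law is linear in log-log space, so a straight-line fit to the logs
# recovers the decay exponent as the slope.
slope, intercept = np.polyfit(np.log(rank), np.log(word_count), 1)
print(round(slope, 2))  # → -0.7
```

A shallower (less negative) slope for word counts than for lemma frequencies would match the "less dramatic curve" observed above.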
# Describe the average_word_length column data.
char_df['average_word_length'].describe()
# Visualize the character word_count values by their average_word_length with a scatterplot.
x = char_df['average_word_length']
y = char_df['word_count']
plt.scatter(x, y, alpha = 0.1)
plt.title('Word Counts of Japanese Characters')
plt.xlabel('Average Word Length')
plt.ylabel('Word Count')
#plt.yscale('log')
plt.show()
It would make sense for word count and average word length to be related, given the assumption that simpler is more common, but that isn't always the case in languages. There appears to be a very slight positive linear relationship between these two variables, but it does not appear to be significant.
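The strength of that relationship can be put on a number with a correlation coefficient. A sketch on synthetic data (with the real data, the inputs would be `char_df['average_word_length']` and `char_df['word_count']`):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data with a deliberately weak positive relationship (illustrative only).
avg_word_length = rng.uniform(1, 6, 500)
word_count = 50 + 2 * avg_word_length + rng.normal(0, 40, 500)

# Pearson correlation coefficient; values near 0 indicate a weak relationship.
r = np.corrcoef(avg_word_length, word_count)[0, 1]
print(round(r, 3))
```

A coefficient this close to zero would support treating the visual trend as negligible.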
# Visualize the kanji character word_count values by their average_word_length with a scatterplot.
x = char_df.loc[char_df['script'] == 'kanji']['average_word_length']
y = char_df.loc[char_df['script'] == 'kanji']['word_count']
plt.scatter(x, y, alpha = 0.2)
plt.title('Average Word Lengths of Kanji by Word Count')
plt.xlabel('Average Word Length')
plt.ylabel('Word Count')
plt.show()
# Visualize the hiragana character word_count values by their average_word_length with a scatterplot.
x = char_df.loc[char_df['script'] == 'hiragana']['average_word_length']
y = char_df.loc[char_df['script'] == 'hiragana']['word_count']
plt.scatter(x, y)
plt.title('Average Word Lengths of Hiragana by Word Count')
plt.xlabel('Average Word Length')
plt.ylabel('Word Count')
plt.show()
# Visualize the katakana character word_count values by their average_word_length with a scatterplot.
x = char_df.loc[char_df['script'] == 'katakana']['average_word_length']
y = char_df.loc[char_df['script'] == 'katakana']['word_count']
plt.scatter(x, y)
plt.title('Average Word Lengths of Katakana by Word Count')
plt.xlabel('Average Word Length')
plt.ylabel('Word Count')
plt.show()
These three graphs show the word count and word length of each character by script type. Hiragana and Katakana tend towards longer words, while kanji words are far more concise.
# View and reassess a sample of the lemma dataframe before continuing.
df.sample(20)
Though difficult to find outside of tokenizing and PoS tagging a Japanese corpus, the most common part of speech usage forms can be looked up for each word.
This will allow a comparison of the lemma dataset's distribution of parts of speech and the pos dataframe's.
def jisho_lookup(word):
"""Scrapes and returns the lemma's most common part of speech from its corresponding webpage at http://www.jisho.org/ ."""
try:
url_base = "http://jisho.org/search/"
# Create the URL for the word's search page.
current_url = url_base + str(word)
# Request the search page.
page = requests.get(current_url)
# Parse the page with BS4.
soup = BeautifulSoup(page.content, 'html.parser')
pos = soup.find("div", {"class": "meaning-tags"}).contents[0]
print(word + ': ' + pos)
return pos
except:
print(word + ': Failed')
return 'Failed'
def jisho_api_lookup(word):
"""Scrapes and returns the lemma's parts of speech through http://www.jisho.org/ 's experimental alpha API."""
try:
pos = []
url_base = 'https://jisho.org/api/v1/search/words?keyword='
# Create the URL for the word's API page.
current_url = url_base + str(word)
# Request the API page.
page = requests.get(current_url)
# Return a failure on a 404 or 500.
if page.status_code == 404:
print(word + ': Failed. 404 error.')
return ['Failed. 404 error.']
if page.status_code == 500:
print(word + ': Failed. 500 error.')
return ['Failed. 500 error.']
# Parse the page with BS4.
soup = BeautifulSoup(page.content, 'html.parser')
data = json.loads(str(soup))['data']
# Some characters like 」 will load an api page but not have data.
if data == []:
print(word + ': Failed. No api data.')
return ['Failed. No api data.']
# Access the correct values in the nested dictionaries and lists.
for item in data:
for x in item['japanese']:
# Use .get() so entries missing a 'word' or 'reading' key are skipped safely.
if x.get('word') == word or x.get('reading') == word:
for variation in item['senses']:
for part in variation['parts_of_speech']:
pos.append(part)
print(word + ': ' + str(pos))
return str(pos)
# If an api page is loaded with information but none of it is correct, return none.
print(word + ': Failed due to incorrect data.')
return ['Failed due to incorrect data.']
except:
print(word + ': Failed')
return ['Failed']
# Test the jisho_lookup() function.
print(jisho_lookup('換算'))
# Test the jisho_api_lookup() function with a character that should fail.
jisho_api_lookup('$')
# Scrape all parts of speech from Jisho.org for each lemma.
# Note that this takes a very long time to execute.
# This block is commented out to avoid accidentally re-scraping it all.
'''
df['pos'] = df['lemma'].apply(jisho_api_lookup)
df.to_csv('df_with_pos.csv', sep = '|')
df
'''
# Presume that the entire notebook is being run, and reload the previously saved dataframe that includes the scraped data.
df = pd.read_csv('df_with_pos.csv', sep = '|', index_col = 0)
df
# View the total value counts for each lemma part of speech value.
df['pos'].value_counts()
# Calculate each unique part of speech from the jisho api scrape.
from ast import literal_eval
parts_of_speech = {}
for pos_list in df['pos']:
for pos in literal_eval(pos_list):
if pos in parts_of_speech:
parts_of_speech[pos] += 1
else:
parts_of_speech[pos] = 1
parts_of_speech
# Look at one of the odder part of speech scrape values.
df.loc[df['pos'] == "['Wikipedia definition']"].head()
There is an excessively large number of overly specific parts of speech in this data.
A dictionary can be created to simplify and 'translate' these parts_of_speech values to the pos_df eng_pos values.
# Create a dictionary to simplify and translate the parts_of_speech to the pos_df eng_pos values.
translation_dict = {}
for pos in list(parts_of_speech.keys()):
translation_dict[pos] = ''
translation_dict
# View the part of speech values from the pos_df.
pos_df['eng_pos']
# Manually fill out the dictionary for condensing the lemma part of speech values.
translation_dict = {
'Particle': 'particle',
'Numeric': 'other',
'Noun': 'noun',
'Copula': 'other',
'Suru verb - irregular': 'verb',
'Godan verb with su ending': 'verb',
'intransitive verb': 'verb',
'Transitive verb': 'verb',
'Noun - used as a suffix': 'noun',
'I-adjective': 'adjective',
'Ichidan verb': 'verb',
'Godan verb with ru ending (irregular verb)': 'verb',
'Expression': 'other',
'Failed due to incorrect data.': '',
'Failed. No api data.': '',
'Godan verb with ru ending': 'verb',
'Auxiliary verb': 'auxiliary verb',
'Godan verb with u ending': 'verb',
'Pre-noun adjectival': 'adjective',
'Na-adjective': 'adjective',
'Suffix': 'other',
'Pronoun': 'noun',
'Kuru verb - special class': 'verb',
'Prefix': 'prefix',
'Godan verb - Iku/Yuku special class': 'verb',
'Adverb': 'adverb',
'Conjunction': 'conjunction',
'Adverbial noun': 'noun',
'Yodan verb with ru ending (archaic)': 'verb',
'Taru-adjective': 'adjective',
"Adverb taking the 'to' particle": 'adverb',
'Noun - used as a prefix': 'noun',
'Godan verb with ku ending': 'verb',
'Godan verb with tsu ending': 'verb',
'Suru verb': 'verb',
'No-adjective': 'adjective',
'Temporal noun': 'noun',
'Godan verb with mu ending': 'verb',
'Noun or verb acting prenominally': 'other',
'Godan verb - aru special class': 'verb',
'Counter': 'symbol',
'Auxiliary adjective': 'adjective',
'Godan verb with u ending (special class)': 'verb',
'Godan verb with bu ending': 'verb',
'Godan verb with nu ending': 'verb',
'Irregular nu verb': 'verb',
'Wikipedia definition': '',
'Su verb - precursor to the modern suru': 'verb',
'Place': 'noun',
'Failed. 500 error.': '',
'Godan verb with gu ending': 'verb',
'Auxiliary': 'auxiliary verb',
'Suru verb - special class': 'verb',
'Full name': 'noun',
'I-adjective (yoi/ii class)': 'adjective',
'Proper noun': 'noun',
'Product': 'other',
'Archaic/formal form of na-adjective': 'adjective',
'Unclassified': '',
'Nidan verb (upper class) with ru ending (archaic)': 'verb',
'Nidan verb (lower class) with ru ending (archaic)': 'verb',
'Ichidan verb - zuru verb (alternative form of -jiru verbs)': 'verb',
'Company': 'noun',
'Nidan verb (lower class) with mu ending (archaic)': 'verb'
}
This condensing of part of speech values inherently limits the precision of the calculations, unfortunately, but is the best that can be done without an intricate knowledge of how the pos_df was originally tagged. Perhaps this can be examined in a later project by looking into the ChaSen morphological analyzer that was used. (http://chasen-legacy.osdn.jp/)
# View the columns for each dataframe before continuing.
df.info()
# View the columns for each dataframe before continuing.
char_df.info()
# View the columns for each dataframe before continuing.
pos_df.info()
# Create a list of all scraped part of speech value combinations.
pos_lists = df['pos'].value_counts().keys()
pos_lists
Nearly all of these are made up of very niche or at least overly specific parts of speech. A dummy variable column for each part of speech can be created to simplify the grouping of rows.
# Create a dummy variable column for each part of speech from the pos_df for each lemma in the lemma dataframe.
df = df.assign(**{'adjective': 0, 'adverb': 0, 'auxiliary verb': 0, 'conjunction': 0, 'noun': 0, 'other': 0, 'particle': 0, 'prefix': 0, 'symbol': 0, 'verb': 0})
df
def translate_pos(pos):
"""Returns the condensed part of speech value calculated from the translation_dict."""
return translation_dict[pos].lower()
def set_df_pos_columns(row, index):
"""Iterates over each separate scraped part of speech for the given row and sets the corresponding dummy variable columns for each condensed part of speech."""
for pos in literal_eval(row['pos'].item()):
if translate_pos(pos) != '':
df.loc[[index],[translate_pos(pos)]] = 1
# Test the set_df_pos_columns() function with the first lemma dataframe row.
set_df_pos_columns(df.iloc[[0]], 0)
df.head()
# Note that this will take some time.
# Use the set_df_pos_columns() function to set all part of speech dummy variables for each row in the lemma dataframe.
for index in range(len(df)):
set_df_pos_columns(df.iloc[[index]], index)
# Calculate the total number of lemmas of each part of speech based on the dummy variables.
lemma_pos_counts = {}
for column in ['adjective', 'adverb', 'auxiliary verb', 'conjunction', 'noun', 'other', 'particle', 'prefix', 'symbol', 'verb']:
lemma_pos_counts[column] = df[column].sum()
lemma_pos_counts
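As an aside, the row-by-row dummy assignment above can also be done in a vectorized way. A sketch on a toy frame (`toy_translation` stands in for translation_dict; the lemmas and tags are illustrative):

```python
import pandas as pd
from ast import literal_eval

# Toy frame mimicking the scraped 'pos' column (illustrative values only).
toy = pd.DataFrame({'lemma': ['食べる', '本', '静か'],
                    'pos': ["['Ichidan verb']", "['Noun']", "['Na-adjective']"]})

# Toy translation table in the same shape as translation_dict.
toy_translation = {'Ichidan verb': 'verb', 'Noun': 'noun', 'Na-adjective': 'adjective'}

# Translate each scraped list into condensed tags, then build all dummy
# columns in one pass instead of setting cells row by row.
condensed = toy['pos'].apply(
    lambda s: sorted({toy_translation[p] for p in literal_eval(s) if toy_translation[p]}))
dummies = condensed.str.join('|').str.get_dummies()
print(toy.join(dummies))
```

The `str.get_dummies` call handles multi-tag lemmas as well, since each '|'-joined tag becomes its own column.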
Nouns and verbs are understandably the most common words.
The average frequency of all the characters in a lemma may be a way to measure the characters' impact on the lemma's frequency.
def average_char_frequency(word):
"""Calculates and returns the average frequency among all characters in the given word."""
freq_total = 0
char_count = len(word)
for char in word:
freq_total += char_df.loc[char_df['character'] == char]['frequency'].item()
return freq_total / char_count
# Test the average_char_frequency() function.
average_char_frequency('図書館')
# Note that this will take some time.
# Calculate the average character frequency of each lemma in the lemma dataframe.
df['average_character_frequency'] = df['lemma'].map(average_char_frequency)
df
Likewise, the total stroke count of all characters in a word, as well as the average stroke count per character, can provide a measure of the word's writing difficulty.
def total_stroke_count(word):
"""Calculates and returns the sum of the stroke counts of all characters in the given word."""
stroke_total = 0
try:
for char in word:
stroke_total += char_df.loc[char_df['character'] == char]['stroke_count'].item()
except:
return np.nan
return stroke_total
def average_char_stroke_count(word):
"""Calculates and returns the average stroke count among all characters in the given word."""
stroke_total = total_stroke_count(word)
char_count = len(word)
return stroke_total / char_count
# Test the total_stroke_count() and average_char_stroke_count() functions.
total_stroke_count('図書館'), average_char_stroke_count('図書館')
# Note that this will take a while.
# Calculate the total and average stroke counts for each lemma in the lemma dataframe.
df['total_stroke_count'] = df['lemma'].map(total_stroke_count)
df['average_character_stroke_count'] = df['lemma'].map(average_char_stroke_count)
df
# Visualize the total number of each lemma part of speech with a bar plot.
x = list(lemma_pos_counts.keys())
y = list(lemma_pos_counts.values())
sns.barplot(x, y, orient = 'v')
plt.title('Part of Speech Totals for the Most Common Japanese Lemmas')
plt.ylabel('Number of Lemmas')
plt.xlabel('Part of Speech')
plt.xticks(rotation = '90')
plt.show()
The large differences in values between these categories may make graphing the data tricky, but they also make the differences in counts very easy to see.
Create a dataframe that will only contain Japanese Lemmas made of multiple kana or kanji.
# Create a completely separate copy of the lemma dataframe.
nonbasic_lemma_df = df.copy(deep = True)
# Remove non-Japanese lemmas from the lemma dataframe copy.
nonbasic_lemma_df = nonbasic_lemma_df[nonbasic_lemma_df['script'] != 'not_japanese']
# Look at the new total number of lemmas in each script.
nonbasic_lemma_df['script'].value_counts()
# Remove all hiragana-only and katakana-only lemmas that are only a single character long.
to_remove = nonbasic_lemma_df.loc[(nonbasic_lemma_df['script'] == 'hiragana') & (nonbasic_lemma_df['lemma'].str.len() == 1)].index.tolist()
to_remove.extend(nonbasic_lemma_df.loc[(nonbasic_lemma_df['script'] == 'katakana') & (nonbasic_lemma_df['lemma'].str.len() == 1)].index.tolist())
to_remove.sort()
nonbasic_lemma_df.drop(to_remove, inplace = True)
nonbasic_lemma_df
# Check the non-null values of each column.
nonbasic_lemma_df.info()
# Reset the index in the new dataframe.
nonbasic_lemma_df.reset_index(drop = True, inplace = True)
nonbasic_lemma_df
This new dataframe includes nearly all of the original lemma dataset, but contains no lemmas made up of irrelevant characters.
It also stops single hiragana and katakana characters from being examined as if they were lemmas themselves, removing the largest frequency outliers.
Now that the data has been set up in the ways needed for visualization and analysis, the plotting can begin. Frequency is the biggest variable to look into, because it is such a large factor into the immediate usefulness of learning a word. The JLPT exam level is another variable to examine because of how it impacts so many students of the Japanese language through language programs, classes, and tools. Finally, the part of speech ratios and the grades that native speakers order these lemmas in will be looked at as well.
First, the lemmas will be examined, followed by the individual characters that make up the lemmas.
Create wrapper functions to simplify the creation of plots.
def create_graph_bar(y, x, title, ylabel, xlabel, rotate_degree = None, ylog = None, order = None, orient = None):
"""A wrapper function for creating seaborn barplots."""
ax = sns.barplot(x, y, order = order, orient = orient).set_title(title)
plt.xlabel(xlabel)
plt.ylabel(ylabel)
if ylog: plt.yscale('log')
if rotate_degree != None: plt.xticks(rotation = rotate_degree)
plt.show()
def create_graph_reg(y, x, title, ylabel, xlabel, order = 1, ylog = None, alpha = 1, line_color = None, truncate = True, xjitter = 0):
"""A wrapper function for creating seaborn regplots."""
# Pass the truncate argument through instead of hardcoding it.
if line_color is not None:
ax = sns.regplot(x, y, scatter = True, truncate = truncate, order = order, x_jitter = xjitter, scatter_kws = {'alpha': alpha}, line_kws = {"color": line_color}).set_title(title)
else:
ax = sns.regplot(x, y, scatter = True, truncate = truncate, order = order, x_jitter = xjitter, scatter_kws = {'alpha': alpha}).set_title(title)
plt.xlabel(xlabel)
plt.ylabel(ylabel)
if ylog: plt.yscale('log')
plt.show()
def create_graph_box(y, x, title, ylabel, xlabel, rotate_degree = None, ylog = None, xlog = None, order = None, orient = None):
"""A wrapper function for creating seaborn boxplots."""
ax = sns.boxplot(x, y, order = order, orient = orient).set_title(title)
plt.xlabel(xlabel)
plt.ylabel(ylabel)
if ylog: plt.yscale('log')
if xlog: plt.xscale('log')
if rotate_degree != None: plt.xticks(rotation = rotate_degree)
plt.show()
def create_graph_cat(y, x, title, ylabel, xlabel, rotate_degree = None, ylog = None, xlog = None, hue = None, data = None, kind = 'scatter', order = None):
"""A wrapper function for creating seaborn catplots."""
ax = sns.catplot(x = x, y = y, hue = hue, data = data, kind = kind, order = order)
plt.title(title)
plt.xlabel(xlabel)
plt.ylabel(ylabel)
if rotate_degree != None: plt.xticks(rotation = rotate_degree)
if ylog: plt.yscale('log')
if xlog: plt.xscale('log')
plt.show()
Visualizing the numerical data for the lemma dataframe will show how the data is distributed. Any polynomial trend will be very interesting, but any linear or exponential trend will also be useful to know about.
# Display a grid of histograms of univariate numerical columns.
nonbasic_lemma_df.hist(column = ['frequency', 'average_character_frequency', 'total_stroke_count', 'average_character_stroke_count'], bins = 16, figsize = (10, 10), grid = False);
# Visualize the information of the lemma frequency dataset.
# Scale the y values as log because of the large frequency differences between the most common lemmas and the bulk of the lemmas.
x = df['rank']
y = df['frequency']
plt.plot(x, y)
plt.title('Frequencies of the 15,000 Most Common Lemmas in Japanese')
plt.xlabel('Lemma Rank')
plt.xticks([0, 1500, 3000, 4500, 6000, 7500, 9000, 10500, 12000, 13500, 15000], rotation = 'vertical')
plt.ylabel('Lemma Frequency')
plt.yscale('log')
plt.show()
Frequency has too high of a variance for the visual to be useful in the same form as the others, but a previously made chart can be reexamined.
Average character frequency, average character stroke count, and total character stroke count all appear to have a polynomial trend in which values near the median occur more often than those in the surrounding ranges.
Lemma frequency is one of the most important variables for this dataset.
The relationships between a lemma's frequency and script, length, JLPT exam level, average character frequency, total stroke count, and average character stroke count will be visualized to give insight to which factors may need to be further researched.
# Display the relationship between frequency and script with a box plot.
create_graph_box(nonbasic_lemma_df['frequency'], nonbasic_lemma_df['script'], 'Lemma Frequency by Script', 'Frequency', 'Script', 45, ylog = True)
Frequency seems very slightly affected by the script type.
# Display the relationship between frequency and lemma length with a reg plot.
create_graph_reg(nonbasic_lemma_df['frequency'], nonbasic_lemma_df['lemma'].str.len(), 'Frequency by Lemma Length', 'Frequency', 'Lemma Length', 3, True, line_color = 'red')
A lemma's frequency seems to be negatively affected by its length. Additionally, the frequency of a lemma drops drastically when it is longer than 8 characters.
# Display the relationship between frequency and JLPT exam level with a box plot.
create_graph_box(nonbasic_lemma_df['frequency'], nonbasic_lemma_df['jlpt_level'], 'Lemma Frequency by JLPT Exam Level', 'Frequency', 'JLPT Exam Level', ylog = True, order = jlpt_exams)
While the n5 exam has some of the most frequently used lemmas, it has a lower median than other exam levels.
# Display the relationship between frequency and average character frequency with a reg plot.
create_graph_reg(nonbasic_lemma_df['frequency'], nonbasic_lemma_df['average_character_frequency'], 'Frequency by Lemma Average Character Frequency', 'Frequency', 'Average Character Frequency', ylog = True, alpha = 0.05, line_color = 'red')
A lemma's frequency is positively affected by its average character frequency as a general trend.
# Display the relationship between frequency and total stroke count with a reg plot.
create_graph_reg(nonbasic_lemma_df['frequency'], nonbasic_lemma_df['total_stroke_count'], 'Frequency by Lemma Total Stroke Count', 'Frequency', 'Lemma Total Stroke Count', order = 1, ylog = True, alpha = 0.05, line_color = 'red')
A lemma's frequency is negatively affected by its total stroke count. Additionally, the frequency of a lemma drops drastically around and past a maximum of 25 total strokes.
# Display the relationship between frequency and average character stroke count with a reg plot.
create_graph_reg(nonbasic_lemma_df['frequency'], nonbasic_lemma_df['average_character_stroke_count'], 'Frequency by Lemma Average Character Stroke Count', 'Frequency', 'Lemma Average Character Stroke Count', ylog = True, alpha = 0.05, line_color = 'red')
A lemma's frequency is negatively affected by its average character stroke count. Additionally, the frequency of a lemma drops drastically past a maximum average of 15 strokes per character.
Lemma frequency is most strongly affected by length and stroke count. The more complex that a lemma is to read and write, the less frequently it tends to be used.
While one might expect the JLPT n5 exam to contain the most frequent lemmas, it only does so to a point, largely because these relationships are biased by the simplicity of the hiragana and katakana scripts.
Lemmas with a length of more than 8 characters, a total stroke count of 25 or greater, or an average of 15 or more strokes per character have significantly lower frequencies in Japanese.
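The three cutoffs above can be combined into a single complexity flag. A sketch on a toy frame (the column names match those built earlier; the stroke values are illustrative):

```python
import pandas as pd

# Toy lemma rows with the complexity columns built earlier (values illustrative).
toy = pd.DataFrame({
    'lemma': ['図書館', '本', 'ありがとうございます'],
    'total_stroke_count': [33, 5, 10],
    'average_character_stroke_count': [11.0, 5.0, 1.0],
})

# Flag lemmas past any of the observed low-frequency thresholds:
# length > 8 characters, >= 25 total strokes, or >= 15 strokes per character.
complex_mask = (
    (toy['lemma'].str.len() > 8)
    | (toy['total_stroke_count'] >= 25)
    | (toy['average_character_stroke_count'] >= 15)
)
print(toy.loc[complex_mask, 'lemma'].tolist())  # → ['図書館', 'ありがとうございます']
```

Applied to nonbasic_lemma_df, such a mask would isolate the lemmas expected to have significantly lower frequencies.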
Though not as directly important as frequency, determining the effectiveness of the JLPT exam divisions and ordering will be helpful in applying these goals to the overall language.
The relationships between a lemma's JLPT exam level and length, average character frequency, total stroke count, and average character stroke count will be visualized to give insight to which factors may need to be further researched.
# Display the relationship between JLPT exam level and lemma length with a box plot.
create_graph_box(nonbasic_lemma_df['jlpt_level'], nonbasic_lemma_df['lemma'].str.len(), 'Lemma JLPT Exam Level by Lemma Length', 'JLPT Exam Level', 'Lemma Length', order = jlpt_exams)
The JLPT n5 exam has significantly longer lemmas than all other exam levels. This is likely because of the number of foreign loan-words and modern words created after the period of Chinese influence and the influx of new kanji; these types of words are written out with many hiragana or katakana characters.
# Display the relationship between JLPT exam level and average character frequency with a box plot.
create_graph_box(nonbasic_lemma_df['jlpt_level'], nonbasic_lemma_df['average_character_frequency'], 'Lemma JLPT Exam Level by Average Character Frequency', 'JLPT Exam Level', 'Average Character Frequency', order = jlpt_exams)
JLPT exam level difficulty has a negative relation with average character frequency.
# Display the relationship between JLPT exam level and total stroke count with a box plot.
create_graph_box(nonbasic_lemma_df['jlpt_level'], nonbasic_lemma_df['total_stroke_count'], 'Lemma JLPT Exam Level by Total Stroke Count', 'JLPT Exam Level', 'Total Stroke Count', order = jlpt_exams)
JLPT exam level difficulty has a positive relation with total stroke count.
# Display the relationship between JLPT exam level and average character stroke count with a box plot.
create_graph_box(nonbasic_lemma_df['jlpt_level'], nonbasic_lemma_df['average_character_stroke_count'], 'Lemma JLPT Exam Level by Average Character Stroke Count', 'JLPT Exam Level', 'Average Character Stroke Count', order = jlpt_exams)
JLPT exam level difficulty has a positive relation with average character stroke count.
The JLPT exams do tend to follow frequency trends, and likely used character and lemma frequency as a metric when dividing the material between the exam levels.
When weighting the importance of learning a lemma, kanji and hiragana script words should be prioritized more. Additionally, while longer lemmas are less important based on frequency, the JLPT n5 exam vocabulary list should be made exempt from negative weighting from length.
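One hypothetical way to encode these weighting rules in a scoring function (the function name and the multiplier constants are inventions for illustration, not fitted values): prioritize kanji- and hiragana-script lemmas, penalize length, and exempt n5 vocabulary from the length penalty.

```python
def study_weight(frequency, script, length, jlpt_level):
    """Hypothetical study-priority score for a lemma; constants are illustrative."""
    weight = frequency
    # Prioritize kanji- and hiragana-script lemmas (covers mixed scripts like
    # 'kanji_and_hiragana' via the substring check).
    if 'kanji' in script or 'hiragana' in script:
        weight *= 1.2
    # Penalize long lemmas, except those on the JLPT n5 vocabulary list.
    if length > 8 and jlpt_level != 'n5':
        weight *= 0.5
    return weight

print(study_weight(1000, 'kanji', 2, 'n2'))      # → 1200.0
print(study_weight(1000, 'katakana', 10, 'n5'))  # → 1000 (n5 exempt from the length penalty)
```

The constants would need tuning against the frequency data, but the structure mirrors the recommendations above.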
Like lemma frequency, the part of speech is one of the most important variables for this dataset; if not for vocabulary applications, then syntax and grammar.
The relationships between a lemma's part of speech and frequency, average character frequency, and average stroke count will be visualized to give insight into which factors may need to be further researched.
# Create a list of parts of speech for referencing.
pos_list = ['adjective', 'adverb', 'auxiliary verb', 'conjunction', 'noun', 'other', 'particle', 'prefix', 'symbol', 'verb']
# Create a dataframe of lemma column averages grouped by part of speech.
col_list = ['total_lemmas', 'total_lemma_proportion', 'average_character_frequency', 'total_stroke_count', 'average_character_stroke_count']
pos_means_df = pd.DataFrame(index = pos_list, columns = col_list)
pos_means_df
# Calculate the averages of various columns grouped by lemma part of speech.
for pos in pos_list:
total_lemmas = nonbasic_lemma_df.loc[nonbasic_lemma_df[pos] == 1][pos].sum()
average_character_frequency = nonbasic_lemma_df.loc[nonbasic_lemma_df[pos] == 1]['average_character_frequency'].mean()
total_stroke_count = nonbasic_lemma_df.loc[nonbasic_lemma_df[pos] == 1]['total_stroke_count'].mean()
average_character_stroke_count = nonbasic_lemma_df.loc[nonbasic_lemma_df[pos] == 1]['average_character_stroke_count'].mean()
# Add an np.nan for total_lemma_proportion because the total of all parts of speech is needed to calculate the proportion.
pos_means_df.loc[[pos]] = [[total_lemmas, np.nan, average_character_frequency, total_stroke_count, average_character_stroke_count]]
# Now calculate the total_lemma_proportion column.
total = pos_means_df['total_lemmas'].sum()
for pos in pos_list:
pos_means_df.at[pos, 'total_lemma_proportion'] = (pos_means_df.at[pos, 'total_lemmas']) / total
# Check to make sure that the proportions add up to 100%.
print('Total Percentage: ' + str(pos_means_df['total_lemma_proportion'].sum() * 100) + '%')
# View the resulting averages.
pos_means_df
# Display the ratio of parts of speech with bar plots.
create_graph_bar(pos_means_df['total_lemma_proportion'], pos_means_df.index, 'Distribution of Lemma Parts of Speech', 'Percentage of Total', 'Part of Speech', 45)
create_graph_bar(pos_means_df['total_lemma_proportion'], pos_means_df.index, 'Distribution of Lemma Parts of Speech (Y Logarithmic Scale)', 'Percentage of Total', 'Part of Speech', 45, ylog = True)
Nouns, verbs, and adjectives are by far the most common parts of speech, and their related grammatical rules should be weighted to reflect that.
# Display the relationship between part of speech and average character frequency with a bar plot.
create_graph_bar(pos_means_df['average_character_frequency'], pos_means_df.index, 'Lemma Average Character Frequency by Part of Speech', 'Average Character Frequency', 'Part of Speech', 45)
When grouping by average character frequency, auxiliary verbs go from being the least common part of speech to the one whose characters are by far the most common. This means that auxiliary verbs may be learnable much earlier in reading and writing studies than other parts of speech.
# Display the relationship between part of speech and average character stroke count with a bar plot.
create_graph_bar(pos_means_df['average_character_stroke_count'], pos_means_df.index, 'Lemma Average Character Stroke Count by Part of Speech', 'Average Character Stroke Count', 'Part of Speech', 45)
Auxiliary verbs, conjunctions, and particles are likely to be the easiest to write, and may be learned early on as a whole.
# Display and compare the ratio of parts of speech with bar plots for both the part of speech df and the lemma dataframe.
# Only display the parts of speech that exist in both dataframes.
create_graph_bar(pos_means_df['total_lemma_proportion'], pos_means_df.index, 'Distribution of Lemma Parts of Speech (Y Logarithmic Scale)', 'Percentage of Total', 'Part of Speech', 45, ylog = True)
create_graph_bar(pos_df['frequency_percentage'], pos_df['eng_pos'], 'Distribution of Japanese Parts of Speech (Y Logarithmic Scale)', 'Percentage of Total', 'Part of Speech', 45, ylog = True, order = pos_list)
These proportions likely differ because not every possible combination of part of speech usage per lemma was factored in, and because the tokenized corpus behind the frequency datasets was unavailable. These ratios should not be used for larger calculations.
Noun, verb, and adjective grammar and syntactical rules should be learned freely and early on for the sake of efficiency, but conjunctions, particles, and especially auxiliary verbs will be the easiest to write and read early on.
Perhaps this latter set of rules can best be covered early on in Japanese second language acquisition between a larger focus on the former set of rules.
The various JLPT exam levels of lemmas and scripts used in Japanese can be encountered in many different ways. Comparing the six script combinations with the five JLPT exam levels (plus the lemmas beyond the exams) gives thirty-six different combinations to examine when determining how to adapt the existing JLPT ordering into a newer form.
# Create a dataframe of the lemma dataframe JLPT exam level counts grouped by script in order to plot it.
scripts = set(nonbasic_lemma_df['script'].values)
script_jlpt_df = pd.DataFrame(columns = scripts)
for script in scripts:
counts = nonbasic_lemma_df.loc[nonbasic_lemma_df['script'] == script]['jlpt_level'].value_counts()
for exam in jlpt_exams:
# Some combinations of script and JLPT exam level do not exist in the nonbasic_lemma_df
try:
script_jlpt_df.at[exam, script] = counts[exam]
except:
script_jlpt_df.at[exam, script] = 0
# Display the calculated counts.
script_jlpt_df
# Prepare the JLPT exam level counts by script dataframe for plotting.
script_jlpt_df = script_jlpt_df.transpose()
script_jlpt_df.reset_index(inplace = True)
script_jlpt_df.rename(columns = {'index': 'script'}, inplace = True)
script_jlpt_df = pd.melt(script_jlpt_df, id_vars = "script", var_name = "exam_level", value_name = "count");
# Display the relationship between the distributions of JLPT exam level counts grouped by scripts with a grouped bar plot.
create_graph_cat('count', 'script', 'Distribution of JLPT Exam Level Lemmas by Script', 'Amount of Lemmas', 'Script', hue = 'exam_level', data = script_jlpt_df, kind = 'bar', rotate_degree = 90, ylog = True, order = ['hiragana', 'katakana', 'hiragana_and_katakana', 'kanji', 'kanji_and_hiragana', 'kanji_and_katakana'])
Lemmas made up of kanji and/or hiragana are by far the most common. Additionally, knowledge of kanji will be needed the most out of all scripts.
Create a grouped barplot for the part of speech ratios from the pos_df and the lemma dataframe.
# Display and compare the ratio of parts of speech with bar plots for both the part of speech df and the lemma dataframe.
# This time, display all parts of speech even if they do not exist in both dataframes.
create_graph_bar(pos_means_df['total_lemma_proportion'], pos_means_df.index, 'Distribution of Lemma Parts of Speech (Y Logarithmic Scale)', 'Percentage of Total', 'Part of Speech', 45, ylog = True)
create_graph_bar(pos_df['frequency_percentage'], pos_df['eng_pos'], 'Distribution of Japanese Parts of Speech (Y Logarithmic Scale)', 'Percentage of Total', 'Part of Speech', 45, ylog = True, order = pos_list)
Combining these two plots into a single grouped bar chart will make the data easier to compare and understand.
# Create lists of pos, proportion of total, and the source dataframe to build a new dataframe for multivariate plotting.
temp_pos_list = pos_means_df.index.tolist() + pos_df['eng_pos'].tolist()
temp_proportion_list = pos_means_df['total_lemma_proportion'].tolist() + pos_df['frequency_percentage'].tolist()
temp_group_list = ['lemma_df'] * len(pos_means_df) + ['pos_df'] * len(pos_df)
# Combine these lists into a dataframe for plotting.
data = {'pos': temp_pos_list, 'proportion': temp_proportion_list, 'dataframe': temp_group_list}
grouped_df = pd.DataFrame(data)
# Manually add blank rows for the parts of speech missing from the lemma dataframe.
missing_rows = pd.DataFrame({'pos': ['adnominal', 'filler', 'interjection'],
                             'proportion': 0,
                             'dataframe': 'lemma_df'})
grouped_df = pd.concat([grouped_df, missing_rows], ignore_index = True)
grouped_df
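The whole construction above (concatenating lists, then patching in zero rows) can also be written as a pair of `pd.concat` calls. A sketch on toy stand-ins for `pos_means_df` and `pos_df` (column names assumed from the cells above, values made up):

```python
import pandas as pd

# Toy stand-ins for the two proportion sources.
pos_means_df = pd.DataFrame({'total_lemma_proportion': [60.0, 40.0]},
                            index = ['noun', 'verb'])
pos_df = pd.DataFrame({'eng_pos': ['noun', 'verb', 'filler'],
                       'frequency_percentage': [55.0, 35.0, 10.0]})

# One long frame, labelled by which dataframe each proportion came from.
grouped_df = pd.concat([
    pd.DataFrame({'pos': pos_means_df.index,
                  'proportion': pos_means_df['total_lemma_proportion'].values,
                  'dataframe': 'lemma_df'}),
    pd.DataFrame({'pos': pos_df['eng_pos'],
                  'proportion': pos_df['frequency_percentage'],
                  'dataframe': 'pos_df'}),
], ignore_index = True)

# Add zero rows for parts of speech missing from the lemma side,
# computed instead of hard-coded.
missing = set(pos_df['eng_pos']) - set(pos_means_df.index)
zeros = pd.DataFrame({'pos': sorted(missing), 'proportion': 0.0,
                      'dataframe': 'lemma_df'})
grouped_df = pd.concat([grouped_df, zeros], ignore_index = True)
print(grouped_df)
```

Computing the missing set rather than listing it by hand keeps the cell correct if the source dataframes change.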
# Update the pos_list with the parts of speech that exist in the part of speech dataframe but not the lemma dataframe.
pos_list = ['adjective', 'adnominal', 'adverb', 'auxiliary verb', 'conjunction', 'filler', 'interjection', 'noun', 'other', 'particle', 'prefix', 'symbol', 'verb']
# Create a grouped barplot based on the part of speech proportions in both the lemma and pos dataframes.
fig = sns.catplot(x = 'pos', y = 'proportion', hue = 'dataframe', data = grouped_df, kind = 'bar', order = pos_list)
fig.set_axis_labels(x_var = 'Part of Speech', y_var = 'Percentage of Total')
fig.set_xticklabels(rotation = 90)
plt.title('Part of Speech Ratios by Dataframe')
plt.show()
These differences in ratios may stem from a low-quality source dataset or from the tidying steps applied earlier. Whether this needs to be accounted for in the analysis applications will have to be determined later.
The character dataframe has considerably more numerical variables than the lemma dataframe. As before, knowing the trends of the data in these variables is important.
# Display a grid of histograms of all univariate numerical columns.
char_df.hist(column = ['frequency', 'weighted_frequency', 'stroke_count', 'grade', 'frequency_ranking', 'word_count', 'average_word_length'], bins = 10, figsize = (10, 10), grid = False);
Frequency, weighted frequency, and word count are heavily right-skewed and likely logarithmic in distribution.
Average word length and stroke count appear quadratic, with average word length showing a significantly larger number of words between two and three characters long.
Grade appears confusing here, but this is because of how Japanese standards group secondary education requirements.
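One quick way to check the "likely logarithmic" reading is to histogram (or measure the skewness of) the log of the skewed columns: if the raw values are roughly log-normal, the transformed values should look symmetric. A sketch on synthetic data standing in for a `char_df` frequency column:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a heavily right-skewed frequency column.
frequency = rng.lognormal(mean = 5.0, sigma = 1.0, size = 5000)

def skew(x):
    """Sample skewness: third standardized moment."""
    x = np.asarray(x, dtype = float)
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

print(skew(frequency))            # strongly positive for the raw values
print(skew(np.log10(frequency)))  # near zero if roughly log-normal
```

On the real `char_df`, the same transform applied before `hist` would make the shape of these distributions much easier to judge.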
As before with lemmas, frequency is one of the most important variables for this dataset.
The relationships between a character's frequency and script, JLPT exam level, stroke count, grade, and average word length will be visualized to give insight to which factors may need to be further researched, and to begin to order which types of characters should be focused on in Japanese language acquisition.
# Display the relationship between frequency and script with a box plot.
create_graph_box(char_df['frequency'], char_df['script'], 'Character Frequency by Script', 'Frequency', 'Script', 45, ylog = True)
The hiragana and katakana scripts should obviously be mastered before turning to kanji, but the usage of roman characters in contemporary Japanese is worth discussion as well.
# Display the relationship between frequency and JLPT exam level with a box plot.
create_graph_box(char_df['frequency'], char_df['jlpt_level'], 'Character Frequency by JLPT Exam Level', 'Frequency', 'JLPT Exam Level', ylog = True, order = jlpt_exams)
No real deviation from learning characters in order of frequency is necessary to pass the JLPT exams in order.
# Display the relationship between frequency and stroke count with a box plot.
create_graph_reg(char_df['frequency'], char_df['stroke_count'], 'Frequency by Character Stroke Count', 'Frequency', 'Stroke Count', 3, True, line_color = 'Red', alpha = 0.1)
# Look at the value counts for the relationship between frequency and stroke count.
char_df['stroke_count'].value_counts().sort_index()
# Plot the counts of characters grouped by stroke count.
plt.plot(char_df['stroke_count'].value_counts().sort_index())
plt.ylabel('Amount of Characters')
plt.xlabel('Character Stroke Count')
plt.title('Amount of Characters by Stroke Count')
plt.show()
Character frequency seems negatively correlated with stroke count, but an interesting frequency trend occurs between 11 and 16 strokes. This likely occurs because of the limitation of meaningfully distinguished combinations of strokes in symbols under a certain threshold of complexity.
# Display the relationship between frequency and grade with a reg plot.
create_graph_reg(char_df['frequency'], char_df['grade'], 'Frequency by Grade', 'Frequency', 'Grade Learned', 3, True, alpha = 0.1, line_color = 'red', xjitter = 0.5)
The Japanese schooling system's ordering of character learning is very strongly correlated with the overall frequency of the characters. Even so, books, tools, software, and other materials designed for Japanese first language acquisition will largely be ineffective, or at least inefficient, for Japanese second language acquisition.
# Display the relationship between frequency and a character's average word length with a reg plot.
create_graph_reg(char_df['frequency'], char_df['average_word_length'], 'Frequency by Character Average Word Length', 'Frequency', 'Average Word Length', ylog = True, alpha = 0.1, line_color = 'red')
Interestingly enough, character frequency is positively correlated with the average length of the lemmas that the character appears in. This may have to do with how hiragana and katakana account for a combination of the longest and some of the most frequently occurring lemmas.
Although character frequency can help order the generally chaotic learning of kanji, it doesn't offer much insight in other ways. Hiragana and katakana should be learned first by all metrics.
As before with lemmas, JLPT exam level is not as directly important as frequency, but determining the effectiveness of the JLPT exam divisions and ordering will be helpful in applying these goals to the overall language.
The relationships between a character's JLPT exam level and stroke count, grade, word count, and average word length will be visualized to give insight to which factors may need to be further researched.
# Display the relationship between JLPT exam level and stroke count with a box plot.
create_graph_box(char_df['jlpt_level'], char_df['stroke_count'], 'Character JLPT Exam Level by Stroke Count', 'JLPT Exam Level', 'Character Stroke Count', order = jlpt_exams)
The n5 JLPT exam level continues to have the most outliers of any exam level. Here we see that a character's stroke count is treated as a measure of difficulty by these exams.
# Display the relationship between JLPT exam level and grade with a box plot.
create_graph_box(char_df['jlpt_level'], char_df['grade'], 'Character JLPT Exam Level by Grade', 'JLPT Exam Level', 'Character Grade', order = jlpt_exams)
Like stroke counts before, grade is also positively related with exam level. However, with grade, the JLPT n1 exam level has the most outliers, with some characters generally taught to non-native speakers much later on actually being taught to native Japanese speakers very early.
# Display the relationship between JLPT exam level and a character's word count with a box plot.
create_graph_box(char_df['jlpt_level'], char_df['word_count'], 'Character JLPT Exam Level by Word Count', 'JLPT Exam Level', 'Word Appearance Count', xlog = True, order = jlpt_exams)
Word count has a lot of outliers in all JLPT exam levels, but the trend is that the less common characters are seen as either more difficult or less important to learn early on.
# Display the relationship between JLPT exam level and a character's average word length with a box plot.
create_graph_box(char_df['jlpt_level'], char_df['average_word_length'], 'Character JLPT Exam Level by Average Word Appearance Length', 'JLPT Exam Level', 'Average Word Length', order = jlpt_exams)
Like word count, average word length has a lot of outliers in all JLPT exam levels, but the trend is that the less common characters are seen as either more difficult or less important to learn early on.
All of these plots show that the JLPT ordering is an excellent metric to order character study by. Exam ease correlates with greater character simplicity and frequency.
Character grade is not as directly important as frequency, but determining the trends of the grade divisions and ordering of native education programs and standards will be helpful in applying these goals to the overall language.
The relationships between a character's grade and frequency, stroke count, word count, and average word length will be visualized to give insight to which factors may need to be further researched.
# Display the relationship between grade and frequency with a box plot.
create_graph_box(char_df['grade'], char_df['frequency'], 'Character Grade Level by Frequency', 'Grade', 'Character Frequency', xlog = True, orient = 'h')
The more frequent the character, the earlier it tends to be learned by native speakers.
# Display the relationship between grade and stroke count with a box plot.
create_graph_box(char_df['grade'], char_df['stroke_count'], 'Character Grade Level by Stroke Count', 'Grade', 'Stroke Count', orient = 'h')
The more complex the character is in the form of stroke count, the later the character tends to be learned by native speakers.
# Display the relationship between grade and a character's word count with a box plot.
create_graph_box(char_df['grade'], char_df['word_count'], 'Character Grade Level by Word Appearance Count', 'Grade', 'Number of Appearances', xlog = True, orient = 'h')
The more words a character appears in, the earlier it tends to be learned by native speakers.
# Display the relationship between grade and a character's average word length with a box plot.
create_graph_box(char_df['grade'], char_df['average_word_length'], 'Character Grade Level by Average Word Appearance Length', 'Grade', 'Average Word Length', orient = 'h')
Average word length follows the same general pattern, again tracking how early a character tends to be learned by native speakers.
These graphs reinforce the trends seen throughout the character dataframe sections, repeating the same patterns as JLPT exam level.
The JLPT exam levels of characters and the scripts used in Japanese intersect in many different ways. Comparing the six script combinations with the five JLPT exam levels (plus the characters beyond the exams) gives thirty-six combinations to examine when determining how to adapt the existing JLPT ordering into a newer form.
# Create a dataframe of the character dataframe script counts grouped by JLPT exam level in order to plot it.
scripts_char = set(char_df['script'].values)
script_jlpt_df_char = pd.DataFrame(columns = scripts_char)
for script in scripts_char:
    counts = char_df.loc[char_df['script'] == script]['jlpt_level'].value_counts()
    for exam in jlpt_exams:
        # Some combinations of script and JLPT exam level do not exist in the char_df.
        try:
            script_jlpt_df_char.at[exam, script] = counts[exam]
        except KeyError:
            script_jlpt_df_char.at[exam, script] = 0
# Display the calculated counts.
script_jlpt_df_char
# Prepare the character JLPT exam level by script dataframe for plotting.
script_jlpt_df_char = script_jlpt_df_char.transpose()
script_jlpt_df_char.reset_index(inplace = True)
script_jlpt_df_char.rename(columns = {'index': 'script'}, inplace = True)
script_jlpt_df_char = pd.melt(script_jlpt_df_char, id_vars = "script", var_name = "exam_level", value_name = "count")
# Display the relationship between the distributions of character JLPT exam level grouped by scripts with a grouped bar plot.
create_graph_cat('count', 'script', 'Distribution of JLPT Exam Level Characters by Script', 'Amount of Characters', 'Script', hue = 'exam_level', data = script_jlpt_df_char, kind = 'bar', rotate_degree = 90, ylog = True, order = ['hiragana', 'katakana', 'hiragana_and_katakana', 'kanji', 'kanji_and_hiragana', 'kanji_and_katakana'])
Kanji continues to be the most important focus in writing overall, with the focus needing to be stronger as the learner goes up through the JLPT exam levels until getting beyond all of them.
The scripts of characters used in Japanese and the grade at which native speakers learn them intersect in many different combinations. Comparing the six script combinations with the ten effective grades gives sixty combinations to examine when determining how to adapt the existing grade ordering into a newer form.
# Create a dataframe of the character dataframe grade counts grouped by script in order to plot it.
scripts_char = set(char_df['script'].values)
grades_char = set(char_df['grade'].values)
script_grade_df_char = pd.DataFrame(columns = scripts_char)
for script in scripts_char:
    counts = char_df.loc[char_df['script'] == script]['grade'].value_counts()
    for grade in grades_char:
        # Some combinations of script and grade do not exist in the char_df.
        try:
            script_grade_df_char.at[grade, script] = counts[grade]
        except KeyError:
            script_grade_df_char.at[grade, script] = 0
# Display the calculated counts.
script_grade_df_char
# Prepare the character grade by script dataframe for plotting.
script_grade_df_char = script_grade_df_char.transpose()
script_grade_df_char.reset_index(inplace = True)
script_grade_df_char.rename(columns = {'index': 'script'}, inplace = True)
script_grade_df_char = pd.melt(script_grade_df_char, id_vars = "script", var_name = "grade", value_name = "count")
# Display the relationship between the distributions of character grade grouped by scripts with a grouped bar plot.
create_graph_cat('count', 'script', 'Character Grade Levels by Script', 'Amount of Characters', 'Script', hue = 'grade', data = script_grade_df_char, kind = 'bar', rotate_degree = 90, ylog = True, order = ['hiragana', 'katakana', 'kanji'])
With character grade, unlike JLPT exam level, the kanji focus remains more consistent at the character level.
The JLPT exam levels of characters and the grades at which native speakers learn those characters intersect in many different ways. Comparing the five JLPT exam levels (plus the characters beyond the exams) with the ten effective grades gives sixty combinations to examine when determining how to adapt the existing JLPT and grade orderings into a newer form.
# Create a dataframe of the character dataframe JLPT exam level counts grouped by grade in order to plot it.
grades_char = set(char_df.loc[char_df['grade'].notna()]['grade'].values)
grades_jlpt_df_char = pd.DataFrame(columns = grades_char)
for grade in grades_char:
    counts = char_df.loc[char_df['grade'] == grade]['jlpt_level'].value_counts()
    for exam in jlpt_exams:
        # Some combinations of grade and JLPT exam level do not exist in the char_df.
        try:
            grades_jlpt_df_char.at[exam, grade] = counts[exam]
        except KeyError:
            grades_jlpt_df_char.at[exam, grade] = 0
# Display the calculated counts.
grades_jlpt_df_char
# Prepare the character JLPT exam level by grade dataframe for plotting.
grades_jlpt_df_char = grades_jlpt_df_char.transpose()
grades_jlpt_df_char.reset_index(inplace = True)
grades_jlpt_df_char.rename(columns = {'index': 'grades'}, inplace = True)
grades_jlpt_df_char = pd.melt(grades_jlpt_df_char, id_vars = "grades", var_name = "jlpt_level", value_name = "count")
# Display the relationship between the distributions of character JLPT exam level grouped by grade with a grouped bar plot.
create_graph_cat('count', 'grades', 'JLPT Exam Level Characters by Grade', 'Amount of Characters', 'Grade', hue = 'jlpt_level', data = grades_jlpt_df_char, kind = 'bar', rotate_degree = 90, ylog = True)
This shows how the JLPT exam levels compare to the compulsory education of Japanese native speakers. It seems that the higher exam levels still have a lot of lower grade material for students to learn.
# Save all dataframes for future use and loading into nbconvert slides.
nonbasic_lemma_df.to_csv('final_lemma_df.csv', sep = '|')
char_df.to_csv('final_char_df.csv', sep = '|')
pos_df.to_csv('final_pos_df.csv', sep = '|')
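Because these saves use a pipe separator, the same `sep` must be passed when the files are reloaded (e.g. in the nbconvert slides), along with `index_col = 0` to restore the saved index. A sketch of the round trip on a toy frame (the filename matches the cell above; the index handling is an assumption about how the files will be read back):

```python
import pandas as pd

# Toy frame standing in for one of the saved dataframes.
df = pd.DataFrame({'lemma': ['猫', '犬'], 'frequency': [120, 95]})
df.to_csv('final_lemma_df.csv', sep = '|')

# Reload: pass the same sep and restore the saved index column.
reloaded = pd.read_csv('final_lemma_df.csv', sep = '|', index_col = 0)
print(reloaded)
```

Forgetting the `sep` on reload would silently produce a single-column dataframe rather than raising an error, so it is worth keeping the separator choice documented alongside the download links.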