より効率的な第二言語習得のための日本語補題の検討

Examination of Japanese Lemmas for a More Efficient Second Language Acquisition

Matthew Unrue, December 2018

Udacity Data Analyst Nanodegree Capstone Project

Table of Contents:

Data Loading and Creation
Character Dataframe Creation
Lemma Dataframe Column Creation
Part of Speech Data Gathering
Nonbasic Lemma Dataframe Creation
Data Exploration and Plotting

Project context, goals, and findings can be found in the readme.txt file, which is reproduced below for convenience. Additionally, download links to the datasets created throughout this project are available below.

Sources:

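The lemma dataset is hosted at http://corpus.leeds.ac.uk/frqc/internet-jp.num and similar datasets for other languages can be found at http://corpus.leeds.ac.uk/list.html
The Part of Speech distribution frequency dataset can be found at http://corpus.leeds.ac.uk/frqc/internet-jp-pos.num
More information on lemmas can be found on the corresponding Wikipedia page: https://en.wikipedia.org/wiki/Lemma_(morphology)
The JLPT tier dataset was created from https://www.nihongo-pro.com/kanji-pal/list/jlpt
The Kanji .json dataset is found at https://thekanjimap.com/kanji.html
Character information is additionally gathered from http://www.edrdg.org/cgi-bin/wwwjdic/wwwjdic?1C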

Dataset downloads:

readme.txt addition:

Project Context:

In linguistics, a lexeme is the set of forms that a word, or more specifically a single semantic value, can take on in a language regardless of the number of ways it can be modified through inflection.

A lemma is the dictionary form of a word that is chosen by the conventions of its individual language to represent the entirety of the lexeme.

Lemmas and word stems are different in that a stem is the portion of a word that remains constant despite morphological inflection while a lemma is the base form of the word that represents the distinct meaning of the word regardless of inflection.

When studying a language, many different approaches can be taken. One efficient method is for the student to learn the base form of a concept, the lemma, and then, by applying the grammatical rules of the language, incorporate the remainder of the lexeme into their usage.

More information on lemmas can be found on its corresponding Wikipedia page: (https://en.wikipedia.org/wiki/Lemma_(morphology))
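The distinction between a lexeme, its lemma, and its stem can be made concrete with a small sketch. The entry below is hand-written for illustration and does not come from the project's datasets; the helper name `lemma_of` is hypothetical.

```python
# Illustrative only: a hand-written lexeme for the verb "to eat".
lexeme = {
    "lemma": "食べる",  # dictionary form chosen to represent the whole lexeme
    "stem": "食べ",     # portion that stays constant under inflection
    "forms": ["食べる", "食べます", "食べた", "食べて", "食べない"],
}


def lemma_of(surface_form, lexemes):
    """Return the lemma whose lexeme contains the given surface form, or None."""
    for entry in lexemes:
        if surface_form in entry["forms"]:
            return entry["lemma"]
    return None


# Every inflected form maps back to the same lemma and shares the same stem.
print(lemma_of("食べた", [lexeme]))
print(all(form.startswith(lexeme["stem"]) for form in lexeme["forms"]))
```

A learner who knows the lemma and the relevant conjugation rules can produce every entry in `forms`, which is the assumption the project relies on.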

Project Goals:

This project examines the frequencies of lemmas in the Japanese language, and what factors influence those frequencies, in order to determine a more efficient approach towards Japanese second language acquisition and the ordering of teaching materials for this purpose.

Efficiency will be measured by the estimated frequency, and thus number of applications or general usefulness, that learning a word will give, assuming that the student can apply grammatical rules to utilize all appropriate forms of the word, as determined by the frequency of the lemma in the Internet Corpus.
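One possible way to read this efficiency measure is as cumulative coverage: how many of the top-ranked lemmas are needed before a given share of all listed usage is reached. The sketch below uses made-up frequencies and a hypothetical helper, `lemmas_needed`; it is not the calculation performed later in this notebook.

```python
from itertools import accumulate

# Hypothetical frequencies (occurrences per million words), sorted by rank.
frequencies = [50000.0, 30000.0, 20000.0, 5000.0, 1000.0]


def lemmas_needed(frequencies, target_share):
    """Return how many top-ranked lemmas reach target_share of the listed frequency mass."""
    total = sum(frequencies)
    for count, running in enumerate(accumulate(frequencies), start=1):
        if running / total >= target_share:
            return count
    return len(frequencies)


print(lemmas_needed(frequencies, 0.5))
print(lemmas_needed(frequencies, 0.9))
```

With these toy numbers, the first two lemmas already account for more than half of the listed usage, which is the kind of concentration that makes frequency a useful ordering criterion.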

Additionally, the part of speech that each lemma is classified as will be used to look into ideas for a more efficient order of learning various sets of grammatical rules in Japanese.

Data Sources:

The lemma dataset can be found hosted at http://corpus.leeds.ac.uk/frqc/internet-jp.num and similar datasets for other languages can be found at http://corpus.leeds.ac.uk/list.html
The Part of Speech distribution frequency dataset can be found at http://corpus.leeds.ac.uk/frqc/internet-jp-pos.num
The JLPT tier dataset was created from this webpage https://www.nihongo-pro.com/kanji-pal/list/jlpt
The Kanji .json dataset is found at https://thekanjimap.com/kanji.html
Character information is additionally gathered from http://www.edrdg.org/cgi-bin/wwwjdic/wwwjdic?1C

Main Findings:

Lemma frequency is most strongly affected by its length and both total and average character stroke counts.
When weighting the importance of learning a lemma, kanji and hiragana script words should be prioritized more.
While longer lemmas are generally used less frequently, the lemmas included in the JLPT n5 exam vocabulary list should be made exempt from negative weighting from length due to the type of hiragana words in this category.
A more efficient Japanese learning order will have the largest focus on nouns, verbs, and adjective syntax, but will cover auxiliary verb, conjunction, and particle rules in earlier stages.

Visualizations for Presentation:

The 'Frequency by Lemma Length', 'Frequency by Lemma Total Stroke Count', and 'Frequency by Lemma Average Character Stroke Count' plots all have extremely similar trend lines, which means that lemma length, lemma total stroke count, and lemma average character stroke count all have similar impacts on a lemma's frequency. Combining these as subplots on a single plot makes this comparison clear.
The 'Lemma Frequency by Script' plot shows that while the differences in medians and interquartile ranges among the script types show the varying importance of focusing on each script, the sheer number of extremely high frequency outliers in the hiragana script is worth discussion and study alone.
The 'Lemma JLPT Level by Frequency' plot shows the distributions of each JLPT level's frequencies. While the n5 exam has only the third highest median and third quartile of the levels, it also has the most outliers and the highest frequency values of all. When compared to the 'Lemma JLPT Level by Lemma Length' plot, the reason for this becomes apparent: the JLPT n5 exam has the highest mean and range of lemma length.
The 'Distribution of Lemma Parts of Speech' shows the ratio of each lemma part of speech, with nouns, verbs, and adjectives having the most representation by far. Comparing this information with the 'Lemma Average Character Stroke Count' and 'Lemma Average Character Frequency' plots shows why language ordering cannot be only based on ratios, as three of the least common parts of speech in the dataset are shown to have the highest average character frequencies.

Data Loading and Creation

The project begins by loading the datasets and deriving new columns from the existing information so that there are enough variables to examine.

# Import all modules and libraries, as well as set matplotlib plotting to occur in the notebook.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
from bs4 import BeautifulSoup
import json
import statsmodels.api as sm

%matplotlib inline

# Read the lemma dataset into a dataframe called original_df.
original_df = pd.read_csv('japanese_lemmas.csv')
original_df.head()

The lemma dataset has three columns:

rank: The ranking of frequency of the lemma
frequency: The number of instances of the lemma per million words
lemma: The actual lemma

# Read the part of speech dataset into a dataframe called original_pos_df.
original_pos_df = pd.read_csv('japanese_pos_frequencies.csv', names = ['rank', 'frequency', 'jap_pos'])
original_pos_df.head()

The part of speech dataset has three columns:

rank: The ranking of frequency of the part of speech
frequency: The number of instances of the part of speech per million words.
jap_pos: The actual part of speech

# Work with copies of the original dataframes.
df = original_df.copy()
pos_df = original_pos_df.copy()

# Visualize the information of the lemma frequency dataset.
# Scale the y values as log because of the large frequency differences between the most common lemmas and the bulk of the lemmas.
x = df['rank']
y = df['frequency']
plt.plot(x, y)
plt.title('Frequencies of the 15,000 Most Common Lemmas in Japanese')
plt.xlabel('Lemma Rank')
plt.xticks([0, 1500, 3000, 4500, 6000, 7500, 9000, 10500, 12000, 13500, 15000], rotation = 'vertical')
plt.ylabel('Lemma Frequency')
plt.yscale('log')
plt.show()

The distribution of the lemma frequencies appears logarithmic, which makes sense because only a few words should be extremely common from simplicity or syntactical importance, with the bulk of others slowly becoming less frequent as they become more specific or niche.
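One hedged way to characterize the shape of such a rank-frequency curve is to fit a straight line to log(frequency) against log(rank): a slope near -1 would indicate a Zipf-like power law. The sketch below runs on synthetic data only; the function name `log_log_slope` is hypothetical and nothing here is a measurement of the lemma dataset.

```python
import math


def log_log_slope(ranks, frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(f) for f in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Synthetic data that follows frequency = 1000 / rank exactly.
ranks = list(range(1, 101))
frequencies = [1000 / r for r in ranks]
slope = log_log_slope(ranks, frequencies)
print(round(slope, 6))
```

Applied to the real data, the same function could take `df['rank']` and `df['frequency']` as inputs, provided rank is still numeric at that point.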

# Check for duplicate rows.
df.duplicated().sum()

0

# Check for null values.
df.isnull().sum()

rank         0
frequency    0
lemma        0
dtype: int64

# Check dtypes.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 3 columns):
rank         15000 non-null int64
frequency    15000 non-null float64
lemma        15000 non-null object
dtypes: float64(1), int64(1), object(1)
memory usage: 351.7+ KB

This dataset is already clean, so little tidying needs to be done.

# Store the rank column with the object dtype so that it is treated as a label rather than a quantity.
df['rank'] = df['rank'].astype('object')

# Check dtypes.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 3 columns):
rank         15000 non-null object
frequency    15000 non-null float64
lemma        15000 non-null object
dtypes: float64(1), object(2)
memory usage: 351.7+ KB

Begin working with the part of speech frequency dataset by adding a translated and transliterated part of speech column for English speakers.

# Manually define the translations and transliterations.
translations = ['noun', 'particle', 'symbol', 'verb', 'auxiliary verb', 'adverb', 'adjective', 'adnominal', 'conjunction', 'prefix', 'interjection', 'filler', 'other']
transliterations = ['めいし', 'じょし', 'きごう', 'どうし', 'じょどうし', 'ふくし', 'けいようし', 'れんたいし', 'せつぞくし', 'せっとうし', 'かんどうし', 'ふぃらあ', 'そのほか']
pos_df['eng_pos'] = translations
pos_df['transliterated_pos'] = transliterations

# Reorder the columns so that eng_pos is next to frequency and the two Japanese columns are adjacent for easier reading.
pos_df = pos_df[['rank', 'frequency', 'eng_pos', 'jap_pos', 'transliterated_pos']]

# Ensure the pos_df is easily readable.
pos_df

# Check and set column dtypes.
pos_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 5 columns):
rank                  13 non-null int64
frequency             13 non-null float64
eng_pos               13 non-null object
jap_pos               13 non-null object
transliterated_pos    13 non-null object
dtypes: float64(1), int64(1), object(3)
memory usage: 648.0+ bytes

This part of speech dataset is also already clean and just needs minor dtype adjustment.

# Store the rank column with the object dtype so that it is treated as a label rather than a quantity.
pos_df['rank'] = pos_df['rank'].astype('object')

# Check dtypes.
pos_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 5 columns):
rank                  13 non-null object
frequency             13 non-null float64
eng_pos               13 non-null object
jap_pos               13 non-null object
transliterated_pos    13 non-null object
dtypes: float64(1), object(4)
memory usage: 648.0+ bytes

The ratio of each part of speech in the pos_df will be useful to have alongside each row.

# Calculate each part of speech's share of the total frequency (a fraction between 0 and 1).
total = pos_df['frequency'].sum()
pos_df['frequency_percentage'] = pos_df['frequency'] / total

# Reorder the columns to place the frequency percentage by the frequency.
pos_df = pos_df[['rank', 'eng_pos', 'frequency', 'frequency_percentage', 'jap_pos', 'transliterated_pos']]
pos_df

Nouns understandably take up just over a third of the language usage, but particles actually take up a full fifth of the language usage, twice as much as verbs.

Dataframe to Create:

Character Frequency

Columns to Calculate:

Character Frequency Dataframe

Frequency
Frequency Rank
Character
Weighted Frequency (Sum of the Frequency of Character in all Words * Frequency of the Respective Words)
Type of Character
JLPT Exam Level
Stroke Count

Lemma Frequency Dataframe

Type of Characters
Part of Speech
Highest JLPT Exam Character

Character Dataframe Creation

Create a dataframe of every individual character found in the lemma dataset.

While populating the dictionary of characters, count the number of times each character is used.

# Find each character and the number of times it appears in the lemma dataset.
characters = {}
for lemma in df['lemma']:
    for char in lemma:
        if char in characters:
            characters[char] += 1
        else:
            characters[char] = 1
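The same tally can be written more compactly with `collections.Counter` from the standard library. This is a sketch of an equivalent idiom rather than the code the notebook ran; the three-lemma sample stands in for `df['lemma']`.

```python
from collections import Counter

# Hypothetical sample standing in for df['lemma'].
sample_lemmas = ["する", "なる", "ある"]

# Count every character across all lemmas in a single pass.
character_counts = Counter(char for lemma in sample_lemmas for char in lemma)
print(character_counts.most_common(1))
```

A `Counter` behaves like a dictionary, so it could be passed to `pd.DataFrame.from_dict` in the same way as the hand-built `characters` dictionary.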

Glance at the first 30 entries of the character dictionary.

dict(list(characters.items())[:30])

{'の': 89,
 'に': 164,
 'は': 73,
 'て': 143,
 'を': 7,
 'が': 92,
 'だ': 76,
 'た': 182,
 'す': 338,
 'る': 1135,
 'と': 200,
 'ま': 224,
 'で': 58,
 'な': 184,
 'い': 603,
 'も': 120,
 'あ': 122,
 '・': 1,
 '「': 1,
 '」': 1,
 'こ': 131,
 'e': 1,
 'か': 300,
 'o': 1,
 'a': 1,
 't': 1,
 'れ': 205,
 'ら': 192,
 '）': 2,
 '（': 2}

# Create the character dataframe from the characters dictionary.
# Change the column names, sort the rows by descending frequency, and reset the index.
char_df = pd.DataFrame.from_dict(characters, orient = 'index')
char_df = char_df.reset_index()
char_df = char_df.rename({'index': 'character', 0: 'frequency'}, axis='columns')
char_df.sort_values('frequency', ascending = False, inplace = True)
char_df = char_df.reset_index(drop = True)
char_df.head()

Additionally, calculate a 'weighted' frequency for each character.

This is the sum of the frequencies of the words that the character appears in, counted once for each time the character appears in the word.
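As a small worked example with made-up numbers: if いい has a frequency of 100 and いく has a frequency of 50, then い receives 100 + 100 + 50 = 250 and く receives 50, even though the raw counts are only 3 and 1. The sketch below reproduces that arithmetic; the sample values are not taken from the dataset.

```python
from collections import defaultdict

# Made-up (lemma, frequency per million words) pairs.
sample = [("いい", 100.0), ("いく", 50.0)]

weighted = defaultdict(float)
for lemma, frequency in sample:
    for char in lemma:
        # Add the lemma's frequency once per occurrence of the character.
        weighted[char] += frequency

print(dict(weighted))
```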

# Calculate approximately how often each character appeared in the corpus that the lemma dataset was calculated from.
weighted_characters = {}
for index, row in df.iterrows():
    for char in row['lemma']:
        if char in weighted_characters:
            weighted_characters[char] += row['frequency']
        else:
            weighted_characters[char] = row['frequency']

dict(list(weighted_characters.items())[:30])

{'の': 50197.62000000002,
 'に': 29859.61000000001,
 'は': 23858.740000000005,
 'て': 28420.48999999998,
 'を': 20445.979999999996,
 'が': 21896.87999999998,
 'だ': 22435.67,
 'た': 25189.439999999988,
 'す': 38030.48,
 'る': 76544.48999999989,
 'と': 29886.07999999998,
 'ま': 16669.62,
 'で': 20441.469999999994,
 'な': 19293.459999999985,
 'い': 40478.36000000004,
 'も': 15020.910000000005,
 'あ': 9598.750000000005,
 '・': 6001.95,
 '「': 5690.07,
 '」': 5672.68,
 'こ': 13838.199999999997,
 'e': 5444.29,
 'か': 15954.139999999992,
 'o': 4590.55,
 'a': 4553.18,
 't': 4248.5,
 'れ': 12860.55,
 'ら': 9623.280000000002,
 '）': 3697.93,
 '（': 3661.64}

# Create the weighted character dataframe from the weighted_characters dictionary.
# Change the column names, sort the rows by descending weighted frequency, and reset the index.
weighted_char_df = pd.DataFrame.from_dict(weighted_characters, orient = 'index')
weighted_char_df = weighted_char_df.reset_index()
weighted_char_df = weighted_char_df.rename({'index': 'character', 0: 'weighted_frequency'}, axis='columns')
weighted_char_df.sort_values('weighted_frequency', ascending = False, inplace = True)
weighted_char_df = weighted_char_df.reset_index(drop = True)
weighted_char_df.head()

# Merge the character dataframes based on the character for each row.
char_df = char_df.merge(weighted_char_df)
char_df.sort_values('weighted_frequency', ascending = False, inplace = True)
char_df = char_df.reset_index(drop = True)
char_df

# Ensure that each row has a value for each column of data.
char_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2640 entries, 0 to 2639
Data columns (total 3 columns):
character             2640 non-null object
frequency             2640 non-null int64
weighted_frequency    2640 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 62.0+ KB

Create a rank column in the character dataframe that is equivalent to the rank column in the lemma dataframe.

# Create a rank column for the char_df to match the lemma df for both weighted and non-weighted frequency columns.
char_df['weighted_rank'] = char_df['weighted_frequency'].rank(method = 'first', ascending = False,)
char_df.sort_values('frequency', ascending = False, inplace = True)
char_df['rank'] = char_df['frequency'].rank(method = 'first', ascending = False,)
char_df

# Reset the index, convert the ranks from float to int, and store them with the object dtype so they are treated as labels.
char_df = char_df.reset_index(drop = True)
char_df['rank'] = char_df['rank'].astype('int').astype('object')
char_df['weighted_rank'] = char_df['weighted_rank'].astype('int').astype('object')
char_df

# Reorder the columns for human readability.
char_df = char_df[['rank', 'frequency','character', 'weighted_frequency', 'weighted_rank',]]
char_df

Each character needs to be tagged with its appropriate script type.

We can easily classify each character by looking up its Unicode code point.

def classify_char_script(char):
    """Look up the integer Unicode code point of the given character and return its script."""
    
    char = ord(char)
    if 0 <= char <= 8591:
        return 'latin'
    elif 12288 <= char <= 12351:
        return 'punctuation'
    elif 12352 <= char <= 12447:
        return 'hiragana'
    elif 12448 <= char <= 12543:
        return 'katakana'
    elif 19968 <= char <= 40879:
        return 'kanji'
    elif 65280 <= char <= 65374:
        return 'full-width_roman'
    elif 65375 <= char <= 65519:
        return 'half-width_katakana'
    else:
        return 'other'
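The decimal boundaries used above can be sanity-checked against the official Unicode character names from the standard library's `unicodedata` module. This is an optional cross-check and not part of the original pipeline; `describe` is a hypothetical helper.

```python
import unicodedata


def describe(char):
    """Return the decimal code point, its U+XXXX form, and the Unicode name of a character."""
    code_point = ord(char)
    return code_point, f"U+{code_point:04X}", unicodedata.name(char)


# One sample each from hiragana, katakana, kanji, and full-width Roman.
for sample in ["あ", "ア", "本", "Ａ"]:
    print(sample, describe(sample))
```

For example, あ has the code point 12354, which falls inside the 12352 to 12447 range labelled 'hiragana' in the function above.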

# Classify each character in the char_df
char_df['script'] = char_df['character'].apply(classify_char_script)

C:\Users\aedri\Anaconda3\envs\tf1\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

# Check the total number of characters in each script in this dataset.
char_df['script'].value_counts()

kanji                  2279
katakana                 84
hiragana                 83
latin                    80
full-width_roman         74
punctuation              19
other                    19
half-width_katakana       2
Name: script, dtype: int64

Adding the characters' Unicode code point to each row may be useful for later sorting or testing.

def get_ord(row):
    """Return the integer Unicode code point of the character in the given row."""
    return ord(row['character'])

# Create a column that holds each character's integer Unicode code point for reference.
char_df['ord'] = char_df.apply(get_ord, axis = 1)
char_df

C:\Users\aedri\Anaconda3\envs\tf1\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

The Japanese-Language Proficiency Test (JLPT) is an extremely influential standardized test used to evaluate a non-native student's Japanese ability.

It consists of 5 different levels, the n5, n4, n3, n2, and n1 exams. The n5 is the easiest, testing beginner concepts, and the n1 is the most advanced, testing the ability to understand Japanese in virtually any circumstance.

Each character should be tagged with its appropriate JLPT exam level.

# Read in and then merge the JLPT rank dataset into the character dataframe.
jlpt_df = pd.read_csv('jlpt_levels.csv')
jlpt_df = jlpt_df.rename(columns={"kanji": "character"})
jlpt_df.head()

# Use the object dtype for the jlpt_level column because the levels are categorical, not quantitative.
char_df = char_df.merge(jlpt_df, how = 'left')
char_df['jlpt_level'] = char_df['jlpt_level'].astype('object')
char_df

# Check the total number of characters in each JLPT exam level in the JLPT dataset.
jlpt_df['jlpt_level'].value_counts()

1    1235
3     370
2     368
4     167
5      80
Name: jlpt_level, dtype: int64

# Check the total number of characters in each JLPT exam level in the character dataframe.
char_df['jlpt_level'].value_counts()

1.0    963
3.0    370
2.0    368
4.0    167
5.0     80
Name: jlpt_level, dtype: int64

The jlpt_level column holds float values rather than int values because the left merge left NaNs in the rows without a match, and NaN requires a float representation.

# Replace the float values with the exam level names.
char_df.loc[char_df['jlpt_level'] == 1.0, 'jlpt_level'] = 'n1'
char_df.loc[char_df['jlpt_level'] == 2.0, 'jlpt_level'] = 'n2'
char_df.loc[char_df['jlpt_level'] == 3.0, 'jlpt_level'] = 'n3'
char_df.loc[char_df['jlpt_level'] == 4.0, 'jlpt_level'] = 'n4'
char_df.loc[char_df['jlpt_level'] == 5.0, 'jlpt_level'] = 'n5'

char_df['jlpt_level'].value_counts()

n1    963
n3    370
n2    368
n4    167
n5     80
Name: jlpt_level, dtype: int64

Hiragana and Katakana characters are considerably easier to learn than nearly all kanji, and both syllabaries are expected to be known before the JLPT n5 exam is taken.

The hiragana and katakana characters will be set to the easiest JLPT exam level, the n5.

# Set all hiragana and katakana characters to the easiest JLPT exam level.
char_df.loc[char_df['script'] == 'hiragana', 'jlpt_level'] = 'n5'
char_df.loc[char_df['script'] == 'katakana', 'jlpt_level'] = 'n5'
char_df.loc[char_df['script'] == 'half-width_katakana', 'jlpt_level'] = 'n5'
char_df

# Check the new total number of characters in each JLPT exam level in the character dataframe.
char_df['jlpt_level'].value_counts()

n1    963
n3    370
n2    368
n5    249
n4    167
Name: jlpt_level, dtype: int64

# Visualize the total number of characters in each JLPT exam level in the character dataframe.
plot_data = char_df['jlpt_level'].value_counts()
x = plot_data.keys().tolist()
y = plot_data.values.tolist()
# Pass x and y by keyword; recent seaborn releases no longer accept them positionally.
ax = sns.barplot(x = x, y = y, order = ['n5', 'n4', 'n3', 'n2', 'n1'])
ax.set(xlabel = 'JLPT Exam Level', ylabel = 'Number of Characters')
plt.title('JLPT Exam Level Character Distribution')
plt.show()

After the initial investment of learning both hiragana and katakana with basic kanji, each JLPT exam level expects more new kanji than before.

The JLPT tests fluency, but cannot be truly comprehensive. Many kanji characters are not regularly used by even native speakers, so these are not tested.

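Set every character that still has no level to the value of n0. This covers the kanji that are more advanced than the JLPT exams as well as the Latin letters, punctuation, and other symbols. There is no n0 JLPT exam, but this will signify that the character is beyond the exams.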

# Set the JLPT exam level to n0 for every character that does not have a jlpt_level value yet.
char_df.loc[char_df['jlpt_level'].isna(), 'jlpt_level'] = 'n0'
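Taken together, the labelling steps above amount to a three-way rule: a listed level becomes 'n1' through 'n5', kana default to 'n5', and everything else becomes 'n0'. The function below is a plain-Python restatement of that rule for illustration; `label_jlpt_level` is a hypothetical name, and it assumes that kana never carry a listed level of their own.

```python
import math


def label_jlpt_level(script, level):
    """Return 'n1'..'n5' for a listed level, 'n5' for kana, and 'n0' otherwise."""
    # A missing level arrives as NaN after the left merge (None is also tolerated here).
    if level is not None and not math.isnan(level):
        return f"n{int(level)}"
    if script in ("hiragana", "katakana", "half-width_katakana"):
        return "n5"
    return "n0"


print(label_jlpt_level("kanji", 3.0))
print(label_jlpt_level("hiragana", float("nan")))
print(label_jlpt_level("latin", float("nan")))
```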

# Check the new total number of characters in each JLPT exam level in the character dataframe.
char_df['jlpt_level'].value_counts()

n1    963
n0    523
n3    370
n2    368
n5    249
n4    167
Name: jlpt_level, dtype: int64

# Visualize the new total number of characters in each JLPT exam level in the character dataframe with a barplot.
plot_data = char_df['jlpt_level'].value_counts()
x = plot_data.keys().tolist()
y = plot_data.values.tolist()
# Pass x and y by keyword; recent seaborn releases no longer accept them positionally.
ax = sns.barplot(x = x, y = y, order = ['n5', 'n4', 'n3', 'n2', 'n1', 'n0'])
ax.set(xlabel = 'JLPT Exam Level', ylabel = 'Number of Characters')
plt.title('JLPT Exam Level Character Distribution Including Advanced Characters')
plt.show()

There are far more kanji than are represented on this graph, but these counts cover every character in the 15,000 most common lemmas of the Japanese language. 523 of these characters are not included on the JLPT exams. Going by the script counts above, roughly 331 of them are kanji, which are still common enough to plan to eventually learn, and the remainder are Latin letters, punctuation, and other symbols.
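The counts reported in the outputs above can be cross-checked with simple arithmetic. The sketch below copies the numbers from the `value_counts()` outputs and assumes that every character matched by the JLPT list was classified as kanji; under that assumption, the n0 group splits into kanji beyond the exams and non-Japanese symbols.

```python
# Counts copied from the script value_counts() output above.
script_counts = {
    "kanji": 2279, "katakana": 84, "hiragana": 83, "latin": 80,
    "full-width_roman": 74, "punctuation": 19, "other": 19,
    "half-width_katakana": 2,
}

# Characters matched by the JLPT list (n1, n3, n2, n4, n5) before kana were added.
listed_kanji = 963 + 370 + 368 + 167 + 80

kana = (script_counts["hiragana"] + script_counts["katakana"]
        + script_counts["half-width_katakana"])
n5_total = 80 + kana

# Assumption: all JLPT-listed characters fall in the kanji script class.
unlisted_kanji = script_counts["kanji"] - listed_kanji
non_kanji_non_kana = sum(script_counts.values()) - script_counts["kanji"] - kana
n0_total = unlisted_kanji + non_kanji_non_kana

print(n5_total, unlisted_kanji, non_kanji_non_kana, n0_total)
```

The totals agree with the reported 249 characters at n5 and 523 at n0.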

The stroke count, or the total number of strokes needed to write the character, is another metric for assessing the difficulty of learning Japanese characters. The higher the stroke count, the more individual pieces need to be memorized and recalled correctly.

Each character will be tagged with the appropriate stroke count.

Characters that have a debatable stroke count will be given the higher number of the possibilities.
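A minimal sketch of that tie-breaking rule, assuming the alternative counts arrive as a whitespace-separated string such as '11 12' (the helper name `higher_stroke_count` is hypothetical):

```python
def higher_stroke_count(text):
    """Parse a stroke-count field such as '11' or '11 12' and return the largest value."""
    return max(int(part) for part in text.split())


print(higher_stroke_count("11 12"))
print(higher_stroke_count("5"))
```

Taking the maximum errs on the side of treating an ambiguous character as harder to write.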

# Read in the kanji.json file to get the stroke count of each character.
json_df = pd.read_json('kanji.json', encoding = 'UTF-8')
json_df.head()

# Check the total number of characters in each grade group in the kanji dataset.
json_df['grade'].value_counts()

常用漢字 (jōyō kanji)         1041
教育漢字 (kyōiku kanji)       1006
表外漢字 (hyōgai kanji)        426
人名用漢字 (jinmeiyō kanji)     373
Name: grade, dtype: int64

While the grade grouping information is useful, a breakdown by individual grade level will be more useful, so the grouping will be left out.

# Sort by stroke count.
json_df.sort_values('stroke', inplace = True)
json_df = json_df.reset_index(drop = True)
json_df

# Rename the kanji column to character as in all other dataframes.
json_df = json_df.rename(columns={"kanji": "character"})
json_df.head()

# Check for any oddities.
char_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2640 entries, 0 to 2639
Data columns (total 8 columns):
rank                  2640 non-null object
frequency             2640 non-null int64
character             2640 non-null object
weighted_frequency    2640 non-null float64
weighted_rank         2640 non-null object
script                2640 non-null object
ord                   2640 non-null int64
jlpt_level            2640 non-null object
dtypes: float64(1), int64(2), object(5)
memory usage: 185.6+ KB

# Get the stroke count for each character from the json_df for the char_df.
char_df['stroke_count'] = char_df['character'].map(json_df.set_index('character')['stroke'])

char_df

Many of the stroke counts are still missing. The remaining data will be obtained from another source. While this is being collected, the grade and frequency rating will be collected as well.

# Create empty columns for the grade and frequency rating values.
char_df['grade'] = np.nan
char_df['frequency_rating'] = np.nan

char_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2640 entries, 0 to 2639
Data columns (total 11 columns):
rank                  2640 non-null object
frequency             2640 non-null int64
character             2640 non-null object
weighted_frequency    2640 non-null float64
weighted_rank         2640 non-null object
script                2640 non-null object
ord                   2640 non-null int64
jlpt_level            2640 non-null object
stroke_count          2115 non-null float64
grade                 0 non-null float64
frequency_rating      0 non-null float64
dtypes: float64(4), int64(2), object(5)
memory usage: 247.5+ KB

char_df

That dataset was not complete enough, so the stroke count will be scraped from http://www.edrdg.org/.

Additionally, the school grade that the character is typically learned in will be scraped as well.

from bs4 import Tag

def char_lookup(char):
    """Scrape the character's page at http://www.edrdg.org/ and store its stroke count, grade, and frequency rating in char_df."""

    try:
        url_base = 'http://www.edrdg.org/cgi-bin/wwwjdic/wwwjdic?1MMJ'

        # Create the URL for the current character's page.
        current_url = url_base + str(char)

        # Request the current character's page.
        page = requests.get(current_url)

        # Parse the page with BS4.
        # This source has extra </b></td></tr> tags that break the python default HTML parser.
        # Use the external lxml parser instead.
        soup = BeautifulSoup(page.content, 'lxml')

        table = soup.find_all("table")[1]

        # The value sits in the cell after the label; the slice trims the markup wrapped around it.
        stroke_element = table.find("td", string = 'Stroke Count')
        stroke_count = str(stroke_element.next_sibling)[7:-9]

        # Some kanji have different possibilities based on writing style.
        # Parse both and then assume the higher.
        if ' ' in stroke_count:
            stroke_count = stroke_count[:-1]
            before = str(stroke_count)

            first = stroke_count.split()[0]
            second = stroke_count.split()[1]

            # Take the higher stroke count.
            stroke_count = max(int(first), int(second))

            after = str(stroke_count)
            print(char + ': ' + before + ' -> ' + after)

        # A missing label makes find() return None, so the attribute access raises and the fallback is used.
        try:
            grade_element = table.find("td", string = 'Grade')
            grade = str(grade_element.next_sibling)[7:-9]
        except Exception:
            print(char + ': Grade not found.')
            grade = np.nan

        try:
            freq_element = table.find("td", string = 'Frequency ranking')
            frequency_ranking = str(freq_element.next_sibling)[7:-9]
        except Exception:
            print(char + ': No frequency ranking found.')
            frequency_ranking = np.nan

        # Save the results in the columns created earlier.
        index = char_df.loc[char_df['character'] == char].index.item()
        char_df.at[index, 'stroke_count'] = stroke_count
        char_df.at[index, 'grade'] = grade
        char_df.at[index, 'frequency_rating'] = frequency_ranking

        print(char + ': Success')

    except Exception:
        print(char + ': Failed')

# Test the function on a simple and common character
char_lookup('本')

本: Success

C:\Users\aedri\Anaconda3\envs\tf1\lib\site-packages\ipykernel_launcher.py:64: FutureWarning: `item` has been deprecated and will be removed in a future version

# Note that this takes a long time.
# The two calls below are commented out to avoid accidentally re-scraping everything.
# Use the char_lookup function to scrape the stroke_count, grade, and frequency_rating column values for each character,
# and then save the resulting dataframe to avoid having to re-scrape the data.
# char_df['character'].map(char_lookup)
# char_df.to_csv('char_df_with_strokes.csv')

# Presume that the entire notebook is being run, and reload the previously saved dataframe that includes the scraped data.
char_df = pd.read_csv('char_df_with_strokes.csv', index_col = 0)
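The save-then-reload pattern above can also be applied per character, so that an interrupted run does not lose earlier results. The sketch below is one possible approach using only the standard library; `cached_lookup`, the cache file name, and the stand-in `fake_fetch` function are all hypothetical and are not part of the original notebook.

```python
import json
import os
import tempfile


def cached_lookup(char, cache_path, fetch):
    """Return fetch(char), reusing a JSON cache on disk when the character is already stored."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as handle:
            cache = json.load(handle)
    if char not in cache:
        # Only call the (slow) fetch function for characters not seen before.
        cache[char] = fetch(char)
        with open(cache_path, "w", encoding="utf-8") as handle:
            json.dump(cache, handle, ensure_ascii=False)
    return cache[char]


# Demonstration with a stand-in fetch function that records each call.
calls = []


def fake_fetch(char):
    calls.append(char)
    return {"stroke_count": 5}


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "char_cache.json")
    first = cached_lookup("本", path, fake_fetch)
    second = cached_lookup("本", path, fake_fetch)

print(first, second, calls)
```

In the demonstration, the second call is answered from the cache file, so the stand-in fetch function runs only once.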

# Check to see if any kanji characters' stroke_counts are missing and need to be re-scraped.
chars_to_redo = char_df.loc[(char_df['stroke_count'].isnull()) & (char_df['script'] == 'kanji')].index.tolist()
chars_to_redo

[]

# Re-scrape any missing stroke_counts.
for index in chars_to_redo:
    char_lookup(char_df.at[index, 'character'])

The stroke counts for the hiragana and katakana characters will be added manually because this is simpler than scraping them from a different source.

# Create a dictionary with each hiragana character's stroke count.
hiragana_strokes = {
    'あ': 3,
    'い': 2,
    'う': 2,
    'え': 2,
    'お': 3,
    'か': 3,
    'き': 3,
    'く': 1,
    'け': 3,
    'こ': 2,
    'さ': 2,
    'し': 1,
    'す': 2,
    'せ': 3,
    'そ': 1,
    'た': 4,
    'ち': 2,
    'つ': 1,
    'て': 1,
    'と': 2,
    'な': 4,
    'に': 3,
    'ぬ': 2,
    'ね': 2,
    'の': 1,
    'は': 3,
    'ひ': 1,
    'ふ': 4,
    'へ': 1,
    'ほ': 4,
    'ま': 3,
    'み': 2,
    'む': 3,
    'め': 2,
    'も': 3,
    'や': 3,
    'ゆ': 2,
    'よ': 2,
    'ら': 2,
    'り': 2,
    'る': 1,
    'れ': 2,
    'ろ': 1,
    'わ': 2,
    'を': 3,
    'ん': 1,
    'が': 3,
    'ぎ': 3,
    'ぐ': 1,
    'げ': 3,
    'ご': 2,
    'ざ': 2,
    'じ': 1,
    'ず': 2,
    'ぜ': 3,
    'ぞ': 1,
    'だ': 4,
    'ぢ': 2,
    'づ': 1,
    'で': 1,
    'ど': 2,
    'ば': 3,
    'び': 1,
    'ぶ': 4,
    'べ': 1,
    'ぼ': 4,
    'ぱ': 3,
    'ぴ': 1,
    'ぷ': 4,
    'ぺ': 1,
    'ぽ': 4,
    'ゃ': 3,
    'ゅ': 2,
    'ょ': 2,
    ' ﾞ': 2,
    '゜': 1,
    'ゐ': 1,
    'ゑ': 1
}
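In the table above, each voiced or semi-voiced kana is given the same count as its unvoiced base (for example, が and か are both 3). If one wanted to derive those entries instead of typing them, Unicode canonical decomposition separates the base kana from the combining voicing mark. The sketch below shows the idea; `base_kana` is a hypothetical helper and was not used to build the dictionary.

```python
import unicodedata


def base_kana(char):
    """Return the kana with any combining dakuten or handakuten removed."""
    # NFD splits a voiced kana into its base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFD", char)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


for sample in ["が", "ぱ", "あ", "ガ"]:
    print(sample, "->", base_kana(sample))
```

A lookup such as `hiragana_strokes[base_kana('が')]` would then reuse the entry for か.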

# Create a dictionary with each katakana and needed computer symbol character's stroke count.
katakana_strokes = {
    'ア': 2,
    'イ': 2,
    'ウ': 3,
    'エ': 3,
    'オ': 3,
    'カ': 2,
    'キ': 3,
    'ク': 2,
    'ケ': 3,
    'コ': 2,
    'サ': 3,
    'シ': 3,
    'ス': 2,
    'セ': 2,
    'ソ': 2,
    'タ': 3,
    'チ': 3,
    'ツ': 3,
    'テ': 3,
    'ト': 2,
    'ナ': 2,
    'ニ': 2,
    'ヌ': 2,
    'ネ': 4,
    'ノ': 1,
    'ハ': 2,
    'ヒ': 2,
    'フ': 1,
    'ヘ': 1,
    'ホ': 4,
    'マ': 2,
    'ミ': 3,
    'ム': 2,
    'メ': 2,
    'モ': 3,
    'ヤ': 2,
    'ユ': 2,
    'ヨ': 3,
    'ラ': 2,
    'リ': 2,
    'ル': 2,
    'レ': 1,
    'ロ': 3,
    'ワ': 2,
    'ヲ': 3,
    'ン': 2,
    'ガ': 2,
    'ギ': 3,
    'グ': 2,
    'ゲ': 3,
    'ゴ': 2,
    'ザ': 3,
    'ジ': 3,
    'ズ': 2,
    'ゼ': 2,
    'ゾ': 2,
    'ダ': 3,
    'ヂ': 3,
    'ヅ': 3,
    'デ': 3,
    'ド': 2,
    'バ': 2,
    'ビ': 2,
    'ブ': 1,
    'ベ': 1,
    'ボ': 4,
    'パ': 2,
    'ピ': 2,
    'プ': 1,
    'ペ': 1,
    'ポ': 4,
    'ャ': 2,
    'ュ': 2,
    'ョ': 3,
    'ヰ': 4,
    'ヱ': 3,
    # Nonbasic characters below here.
    'ー': 1,
    'ィ': 2,
    '々': 3,
    'ェ': 3,
    'ァ': 2,
    'ォ': 3,
    'ぁ': 3,
    'ヴ': 3,
    '―': 1,
    '─': 1,
    'ヶ': 3,
    'ぇ': 2,
    'ゝ': 1,
    'ぉ': 3,
    '￥': 4,
    '□': 3,
    'ゞ': 1,
    '〒': 3,
    'ヵ': 2,
    '・': 1,
    'Ｔ': 2,
    '０': 1,
    '１': 1,
    '２': 1,
    '３': 1,
    '４': 2,
    '５': 2,
    '６': 1,
    '７': 1,
    '８': 1,
    '９': 1,
    '「': 1,
    '」': 1,
    '（': 1,
    '）': 1,
    '｛': 1,
    '｝': 1,
    '’': 1,
    '”': 2,
    '＜': 1,
    '＞': 1,
    '、': 1,
    '。': 1,
    '・': 1,
    '？': 2,
    '゛': 2,
    '〜': 1,
    # Ｗ杯 / W-hai for World Cup
    'Ｗ': 1,
    # Tシャツ / T-shatsu for T-Shirt
    'T': 2,
    # Ｊリーグ / J-riigu for J1 League
    'Ｊ': 2,
    # ￣ Upperscore / Macron for Hepburn long vowel notation
    '￣': 1,
    # ヽ Katakana iteration mark
    'ヽ': 1,
    # ヾ Katakana dakuten / voiced iteration mark
    'ヾ': 3,
    # ゛ Dakuten
    '゛': 2
}
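Both dictionaries give each voiced (dakuten) or semi-voiced (handakuten) kana the same stroke count as its unvoiced base; for example, か and が are both 3. Assuming that convention is intentional, the voiced entries could be derived rather than typed by hand. The sketch below is only an illustration: `base_strokes` and `kana_stroke_count` are hypothetical names that are not part of this notebook, and the approach relies on Unicode canonical decomposition (NFD), which splits が into か plus a combining voicing mark.

```python
import unicodedata

# Hypothetical subset of the unvoiced entries from the dictionaries above.
base_strokes = {'か': 3, 'は': 3, 'ウ': 3}

def kana_stroke_count(char, strokes):
    """Return the stroke count of char, falling back to its unvoiced base kana."""
    if char in strokes:
        return strokes[char]
    # NFD splits a voiced kana into its base character plus a combining mark.
    base = unicodedata.normalize('NFD', char)[0]
    # None is returned when the base character is not listed either.
    return strokes.get(base)

for kana in ['が', 'ぱ', 'ヴ', 'ん']:
    print(kana, kana_stroke_count(kana, base_strokes))
```

Characters without a decomposition, such as ん or the small kana, would still need explicit entries.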

# Set the stroke_counts and grade for the hiragana manually.
for char in hiragana_strokes.keys():
    try:
        index = char_df.loc[char_df['character'] == char].index.item()
        char_df.at[index, 'stroke_count'] = hiragana_strokes[char]
        char_df.at[index, 'grade'] = 0
    # .item() raises a ValueError when the character has no single matching row.
    except ValueError:
        print(char + ': Not in char_df')

C:\Users\aedri\Anaconda3\envs\tf1\lib\site-packages\ipykernel_launcher.py:4: FutureWarning: `item` has been deprecated and will be removed in a future version
  after removing the cwd from sys.path.

 ﾞ: Not in char_df
ゑ: Not in char_df

# Set the stroke_counts and grade for the katakana manually.
for char in katakana_strokes.keys():
    try:
        index = char_df.loc[char_df['character'] == char].index.item()
        char_df.at[index, 'stroke_count'] = katakana_strokes[char]
        char_df.at[index, 'grade'] = 0
    # .item() raises a ValueError when the character has no single matching row.
    except ValueError:
        print(char + ': Not in char_df')

ヂ: Not in char_df
ヅ: Not in char_df
ヰ: Not in char_df
ヱ: Not in char_df
”: Not in char_df
、: Not in char_df
。: Not in char_df

C:\Users\aedri\Anaconda3\envs\tf1\lib\site-packages\ipykernel_launcher.py:4: FutureWarning: `item` has been deprecated and will be removed in a future version
  after removing the cwd from sys.path.

# Display the number of characters with stroke counts, the number still without them, and the fraction still without them.
total_chars = len(char_df)
num_has_strokes = char_df['stroke_count'].count()
num_without_strokes = total_chars - num_has_strokes
percent_without_strokes = num_without_strokes / total_chars
num_has_strokes, num_without_strokes, percent_without_strokes

(2479, 161, 0.06098484848484848)

# Create a list of characters that still do not have stroke counts to consider adding them individually.
char_to_manually_add = []

for char in char_df.loc[char_df['stroke_count'].isnull()]['character']:
    char_to_manually_add.append(char[0])

Some characters will still not have a stroke count or grade value, but these should only include non-Japanese characters that are irrelevant to this project.

A dataframe that contains only the relevant characters will be created later, after ensuring that the remaining characters will not be needed.

# Search through these and retroactively add the remaining non-Latin
# and non-punctuation characters to the hiragana and katakana dictionaries.
char_to_manually_add

['Ｆ',
 'Ｐ',
 'Ｎ',
 'Ｏ',
 'Ｓ',
 'Ｄ',
 'Ｂ',
 'Ｃ',
 'Ｇ',
 'Ｋ',
 'Ｍ',
 'ｍ',
 'Ａ',
 'Ｉ',
 'ｋ',
 'A',
 'Ｒ',
 'Ｖ',
 'Ｈ',
 '−',
 'ｃ',
 'ｇ',
 'C',
 'Ｘ',
 'Ｕ',
 'Ｌ',
 'Ｙ',
 '＆',
 'ｂ',
 'ｅ',
 'Ｅ',
 'D',
 'Ｑ',
 'U',
 'Ｚ',
 'β',
 'ｐ',
 'ｘ',
 'σ',
 '÷',
 'μ',
 'Σ',
 'ε',
 'Ω',
 'ａ',
 '〆',
 'ｖ',
 'ｓ',
 'E',
 'R',
 'I',
 'M',
 '…',
 '，',
 'P',
 'k',
 'w',
 'g',
 '：',
 'O',
 'N',
 '｀',
 'G',
 '■',
 'H',
 'W',
 'v',
 'F',
 '○',
 '．',
 'L',
 '』',
 '『',
 'f',
 'B',
 'y',
 '！',
 'c',
 'l',
 'm',
 's',
 'r',
 'n',
 'd',
 '‐',
 'S',
 'b',
 '|',
 '#',
 'u',
 'p',
 '】',
 '【',
 '_',
 '＋',
 '［',
 '▼',
 'Z',
 '△',
 '←',
 '〇',
 '※',
 'ｗ',
 'Q',
 '‥',
 '〔',
 '〕',
 '＿',
 '＾',
 '］',
 '▲',
 '↑',
 'q',
 '《',
 'Δ',
 '》',
 '＠',
 '；',
 '↓',
 '◇',
 '◎',
 '×',
 '☆',
 '＊',
 '^',
 '＝',
 '／',
 '％',
 'x',
 '●',
 'V',
 '★',
 '$',
 'j',
 'J',
 'K',
 '▽',
 'Y',
 'z',
 '\\',
 '~',
 '◆',
 '｜',
 '→',
 '´',
 '〉',
 '〈',
 'X',
 'i',
 'α',
 '＄',
 '{',
 'a',
 'o',
 '‘',
 'e',
 '}',
 't',
 '〓',
 '〃',
 'ω']

# Check column dtypes.
char_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2640 entries, 0 to 2639
Data columns (total 11 columns):
rank                  2640 non-null int64
frequency             2640 non-null int64
character             2640 non-null object
weighted_frequency    2640 non-null float64
weighted_rank         2640 non-null int64
script                2640 non-null object
ord                   2640 non-null int64
jlpt_level            2640 non-null object
stroke_count          2479 non-null float64
grade                 2381 non-null float64
frequency_ranking     2115 non-null float64
dtypes: float64(4), int64(4), object(3)
memory usage: 327.5+ KB

# Check the total number of characters in each grade.
char_df['grade'].value_counts()

8.0     991
4.0     202
3.0     200
0.0     197
5.0     185
6.0     180
9.0     176
2.0     160
1.0      80
10.0     10
Name: grade, dtype: int64

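Grade 8 represents basic high school as a whole in Japan, while grades 9 and 10 are more niche and advanced levels from the same period of education. By convention, grade 7 isn't used at all. Grade 0 is the placeholder value assigned manually above to the kana and symbol characters.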

# Change the char_df's rank and weighted_rank columns' dtypes from int to object.
char_df['rank'] = char_df['rank'].astype('object')
char_df['weighted_rank'] = char_df['weighted_rank'].astype('object')

char_df.head(2)

Lemma Dataframe Column Creation

Now that a complete character dataframe has been created, we can apply the information calculated in it to provide a lot of insight and information about the lemma dataset.

First, the script that each lemma is made up of will be calculated.

def classify_word_script(row):
    """Use the classify_char_script() function to determine and return the script(s) the lemma in the given row is made up of."""
    
    word = row['lemma']
    char_scripts = []
    kanji = False
    hiragana = False
    katakana = False
    
    for char in word:
        char_scripts.append(classify_char_script(char))
        
        if 'kanji' in char_scripts:
            kanji = True
        if 'hiragana' in char_scripts:
            hiragana = True
        if 'katakana' in char_scripts:
            katakana = True
        if 'half-width_katakana' in char_scripts:
            katakana = True
      
    # Return the proper category of combinations.
    if kanji and not hiragana and not katakana:
        return 'kanji'
    elif hiragana and not kanji and not katakana:
        return 'hiragana'
    elif katakana and not kanji and not hiragana:
        return 'katakana'
    elif kanji and hiragana and not katakana:
        return 'kanji_and_hiragana'
    elif kanji and katakana and not hiragana:
        return 'kanji_and_katakana'
    elif hiragana and katakana and not kanji:
        return 'hiragana_and_katakana'
    elif kanji and hiragana and katakana:
        return 'all'
    else:
        return 'not_japanese'

# Create a script column for the lemma dataframe by using the classify_word_script() function on each row.
df['script'] = df.apply(classify_word_script, axis = 1)
df
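The same classification can be written as a lookup from the set of scripts found in a word to a label. The sketch below is an illustration only: `char_script` is a hypothetical stand-in for `classify_char_script()` based on Unicode block ranges (which may not match the notebook's actual rules for characters such as 々), and `LABELS` and `word_script` are names introduced here.

```python
def char_script(char):
    """Hypothetical stand-in for classify_char_script(), based on Unicode blocks."""
    code = ord(char)
    if 0x3040 <= code <= 0x309F:
        return 'hiragana'
    # Full-width katakana block plus the half-width katakana range.
    if 0x30A0 <= code <= 0x30FF or 0xFF65 <= code <= 0xFF9F:
        return 'katakana'
    if 0x4E00 <= code <= 0x9FFF:
        return 'kanji'
    return 'other'

# Map each combination of scripts to the labels used by classify_word_script().
LABELS = {
    frozenset(['kanji']): 'kanji',
    frozenset(['hiragana']): 'hiragana',
    frozenset(['katakana']): 'katakana',
    frozenset(['kanji', 'hiragana']): 'kanji_and_hiragana',
    frozenset(['kanji', 'katakana']): 'kanji_and_katakana',
    frozenset(['hiragana', 'katakana']): 'hiragana_and_katakana',
    frozenset(['kanji', 'hiragana', 'katakana']): 'all',
}

def word_script(word):
    """Return the script label for a word, ignoring non-Japanese characters."""
    scripts = {char_script(char) for char in word} - {'other'}
    return LABELS.get(frozenset(scripts), 'not_japanese')

for word in ['食べる', 'テレビ', '学校', 'ABC']:
    print(word, word_script(word))
```

As in `classify_word_script()`, a word that mixes Japanese and non-Japanese characters is labelled by its Japanese scripts alone, and only a word with no Japanese characters at all is labelled `not_japanese`.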

# Check the total number of lemmas in each script combination.
df['script'].value_counts()

kanji                    8185
kanji_and_hiragana       2446
katakana                 2387
hiragana                 1686
not_japanese              230
kanji_and_katakana         42
hiragana_and_katakana      24
Name: script, dtype: int64

# Visualize the total number of lemmas in each script combination with a barplot.
plot_data = df['script'].value_counts()
x = plot_data.keys().tolist()
y = plot_data.values.tolist()
ax = sns.barplot(x = x, y = y)
ax.set(xlabel = 'Script', ylabel = 'Number of Lemmas')
ax.set_xticklabels(ax.get_xticklabels(), rotation = 90)
plt.title('Lemma Script Distribution')
plt.show()

Because kanji vastly outnumber the characters of the other scripts, it's no surprise that kanji-only lemmas make up the bulk of the dataset (8,185 of 15,000). However, not a single lemma contains all three scripts. Additionally, katakana-only lemmas (2,387) are more common than hiragana-only lemmas (1,686), likely because of foreign loanwords.

As with the scripts, a minimum JLPT exam level can be determined for each lemma by taking the hardest exam level among the characters that the lemma is made up of.

def jlpt_level(row):
    """Determines and returns the highest ranking jlpt_level among all characters in the lemma of the given row. JLPT Ranking Order: n0 > n1 > n2 > n3 > n4 > n5."""
    
    word = row['lemma']
    
    char_levels = []
    
    for char in word:
        char_row = char_df.loc[char_df['character'] == char]
        char_levels.append(str(char_row.get('jlpt_level').item()))
    
    # Return the highest character rank, since knowing the word requires knowing all the characters in it.
    if 'n0' in char_levels:
        return 'n0'
    elif 'n1' in char_levels:
        return 'n1'
    elif 'n2' in char_levels:
        return 'n2'
    elif 'n3' in char_levels:
        return 'n3'
    elif 'n4' in char_levels:
        return 'n4'
    elif 'n5' in char_levels:
        return 'n5'
    else:
        return 'error'

# Test the jlpt_level() function.
jlpt_level(df.iloc[400])

C:\Users\aedri\Anaconda3\envs\tf1\lib\site-packages\ipykernel_launcher.py:10: FutureWarning: `item` has been deprecated and will be removed in a future version
  # Remove the CWD from sys.path while we load stuff.

'n5'

# Create an ordered list to use when referencing all JLPT exam levels from here on out.
jlpt_exams = ['n5', 'n4', 'n3', 'n2', 'n1', 'n0']
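The "hardest character wins" rule can also be expressed with an ordered list like the one above and a plain dictionary lookup, which avoids filtering the character dataframe once per character. This is a sketch under assumptions: `toy_levels` holds made-up levels for illustration (the real values live in `char_df['jlpt_level']`), and `hardest_level` is a hypothetical helper that skips unknown characters instead of raising an error.

```python
# Ordered from easiest to hardest, matching the jlpt_exams list.
exam_order = ['n5', 'n4', 'n3', 'n2', 'n1', 'n0']

# Hypothetical character levels used only for this illustration.
toy_levels = {'食': 'n4', 'べ': 'n5', 'る': 'n5', '換': 'n1', '算': 'n2'}

def hardest_level(word, levels, order=exam_order):
    """Return the hardest JLPT level among the word's known characters, or 'error'."""
    found = [levels[char] for char in word if char in levels]
    if not found:
        return 'error'
    # The level with the largest position in the ordered list is the hardest.
    return max(found, key=order.index)

print(hardest_level('食べる', toy_levels))
print(hardest_level('換算', toy_levels))
```

With the real data, the lookup dictionary could be built once with `dict(zip(char_df['character'], char_df['jlpt_level']))`.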

# Note that this takes some time to execute.
# Use the jlpt_level() function on each row of the lemma dataframe to assign a JLPT exam level to each lemma.
df['jlpt_level'] = df.apply(jlpt_level, axis = 1)
df

C:\Users\aedri\Anaconda3\envs\tf1\lib\site-packages\ipykernel_launcher.py:10: FutureWarning: `item` has been deprecated and will be removed in a future version
  # Remove the CWD from sys.path while we load stuff.

# Check the total number of lemmas in each JLPT exam level.
df['jlpt_level'].value_counts()

n5    4629
n3    3332
n1    3054
n2    1855
n4    1435
n0     695
Name: jlpt_level, dtype: int64

# Visualize the total number of lemmas in each JLPT exam level with a barplot.
plot_data = df['jlpt_level'].value_counts()
x = plot_data.keys().tolist()
y = plot_data.values.tolist()
ax = sns.barplot(x = x, y = y, order = jlpt_exams)
ax.set(xlabel = 'JLPT Exam Level', ylabel = 'Number of Lemmas')
plt.title('Lemma JLPT Exam Level Distribution')
plt.show()

The n5 level is expected to include the most lemmas, because hiragana-only and katakana-only words can be learned easily, though perhaps they should not be learned at this level alone. Accounting for this caveat, the largest leaps in vocabulary accessibility come at the n3 and n1 exam levels.

Two more variables will be added for each character: the total number of lemmas it appears in and the average character length of those lemmas.
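To make the size of each leap concrete, the counts from the output above can be accumulated in exam order. This is a small worked example; the dictionary simply copies the `value_counts()` numbers, and the variable names are introduced here.

```python
from itertools import accumulate

# Lemma counts per JLPT level, copied from the value_counts() output, easiest first.
level_counts = {'n5': 4629, 'n4': 1435, 'n3': 3332, 'n2': 1855, 'n1': 3054, 'n0': 695}

total = sum(level_counts.values())
running = list(accumulate(level_counts.values()))

# Show how many lemmas are readable once each level has been reached.
for level, reached in zip(level_counts, running):
    print(f'{level}: {reached} of {total} lemmas ({reached / total:.1%})')
```

The running totals rise from 6,064 to 9,396 at n3 and from 11,251 to 14,305 at n1, which are the two largest steps after n5.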

def count_word_usage(char):
    """Calculates and returns the number of lemmas and the average lemma length that the given character appears in."""
    
    word_list = []
    char_count = 0
    
    for lemma in df['lemma']:
        if char in lemma:
            word_list.append(lemma)
            
    # Sum the lengths of the matching lemmas without shadowing the char parameter.
    char_count = sum(len(word) for word in word_list)
    
    count = len(word_list)
    average_word_length = char_count / count
    average_word_length = round(average_word_length, 2)
            
    return count, average_word_length

# Test the count_word_usage() function.
print(count_word_usage('食'))

(33, 2.15)

# Create a word count and average word length column for each character in the character dataframe using the count_word_usage() function.
# frequency is the number of times the character appears in the list of lemmas.
# word_count is the number of lemmas that the character appears in at least once.
char_df['word_count'], char_df['average_word_length'] = zip(*char_df['character'].map(count_word_usage))
char_df
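The difference between `frequency` and `word_count` only shows up when a character repeats inside a single lemma. A toy illustration, using a made-up lemma list and two hypothetical helper functions:

```python
# Made-up lemmas for illustration; 'き' appears twice in the first one.
toy_lemmas = ['ときどき', 'とき', 'きく']

def occurrences(char, lemmas):
    """Count every appearance of char, including repeats within one lemma."""
    return sum(lemma.count(char) for lemma in lemmas)

def lemmas_containing(char, lemmas):
    """Count the lemmas that contain char at least once."""
    return sum(1 for lemma in lemmas if char in lemma)

print(occurrences('き', toy_lemmas), lemmas_containing('き', toy_lemmas))
```

Here き occurs four times but in only three lemmas, so its frequency-style count is 4 while its word count is 3.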

# Describe the word_count column data.
char_df['word_count'].describe()

count    2640.000000
mean       13.996212
std        44.592024
min         1.000000
25%         1.000000
50%         4.000000
75%        10.250000
max      1133.000000
Name: word_count, dtype: float64

# Visualize the character word_count amounts with a scatterplot and a log scaled y-axis.
x = range(0, len(char_df))
y = char_df['word_count']
plt.scatter(x, y)
plt.title('Word Counts of Japanese Characters')
plt.ylabel('Word Count')
plt.yscale('log')
plt.show()

Like the lemma frequencies, the number of lemmas each character appears in is heavily skewed towards a small set of very common characters, but the word count curve is significantly less dramatic.

# Describe the average_word_length column data.
char_df['average_word_length'].describe()

count    2640.000000
mean        2.026080
std         0.682107
min         1.000000
25%         1.750000
50%         2.000000
75%         2.250000
max         5.880000
Name: average_word_length, dtype: float64

# Visualize the character word_count values by their average_word_length with a scatterplot.
x = char_df['average_word_length']
y = char_df['word_count']
plt.scatter(x, y, alpha = 0.1)
plt.title('Word Counts of Japanese Characters')
plt.xlabel('Average Word Length')
plt.ylabel('Word Count')
#plt.yscale('log')
plt.show()

It would make sense for word count and average word length to be related, given the assumption that simpler is more common, but that isn't always the case in languages. There appears to be a very slight positive linear relationship between these two variables, but it doesn't appear to be significant.
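A visual impression like this can be backed by a correlation coefficient. The sketch below computes Pearson's r by hand on made-up numbers, so the printed value says nothing about the real data; `pearson_r`, `toy_lengths`, and `toy_counts` are names introduced for illustration.

```python
import math

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

# Made-up stand-ins for char_df['average_word_length'] and char_df['word_count'].
toy_lengths = [1.5, 2.0, 2.0, 2.5, 3.0]
toy_counts = [12, 4, 30, 9, 15]

print(round(pearson_r(toy_lengths, toy_counts), 3))
```

On the actual dataframe, `char_df['average_word_length'].corr(char_df['word_count'])` should give the same statistic, since pandas uses Pearson's method by default.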

# Visualize the kanji character word_count values by their average_word_length with a scatterplot.
x = char_df.loc[char_df['script'] == 'kanji']['average_word_length']
y = char_df.loc[char_df['script'] == 'kanji']['word_count']
plt.scatter(x, y, alpha = 0.2)
plt.title('Average Word Lengths of Kanji by Word Count')
plt.xlabel('Average Word Length')
plt.ylabel('Word Count')
plt.show()

# Visualize the hiragana character word_count values by their average_word_length with a scatterplot.
x = char_df.loc[char_df['script'] == 'hiragana']['average_word_length']
y = char_df.loc[char_df['script'] == 'hiragana']['word_count']
plt.scatter(x, y)
plt.title('Average Word Lengths of Hiragana by Word Count')
plt.xlabel('Average Word Length')
plt.ylabel('Word Count')
plt.show()

# Visualize the katakana character word_count values by their average_word_length with a scatterplot.
x = char_df.loc[char_df['script'] == 'katakana']['average_word_length']
y = char_df.loc[char_df['script'] == 'katakana']['word_count']
plt.scatter(x, y)
plt.title('Average Word Lengths of Katakana by Word Count')
plt.xlabel('Average Word Length')
plt.ylabel('Word Count')
plt.show()

These three graphs show the word count and average word length of each character by script type. Hiragana and katakana tend towards longer words, while kanji words are far more concise.
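The comparison between scripts can also be summarised numerically by grouping the characters by script and taking the median of their average word lengths. The rows below are made-up values chosen only to show the mechanics, not results from the dataset.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (script, average_word_length) pairs standing in for char_df rows.
toy_rows = [
    ('kanji', 2.0), ('kanji', 2.2), ('kanji', 1.8),
    ('hiragana', 3.1), ('hiragana', 2.9),
    ('katakana', 4.0), ('katakana', 3.6),
]

# Collect the average word lengths of the characters in each script.
by_script = defaultdict(list)
for script, length in toy_rows:
    by_script[script].append(length)

medians = {script: median(values) for script, values in by_script.items()}
print(medians)
```

With the real dataframe, `char_df.groupby('script')['average_word_length'].median()` would produce the equivalent summary in one line.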

# View and reassess a sample of the lemma dataframe before continuing.
df.sample(20)

Part of Speech Data Gathering

Although this information is difficult to obtain without tokenizing and PoS tagging a Japanese corpus, the most common part of speech for each word can be looked up.

This will allow the lemma dataset's distribution of parts of speech to be compared with the pos dataframe's.

def jisho_lookup(word):
    """Scrapes and returns the lemma's most common part of speech from its corresponding webpage at http://www.jisho.org/ ."""
    
    try:
        url_base = "http://jisho.org/search/"

        # Create the URL for the current word's search page.
        current_url = url_base + str(word)

        # Request the search page.
        page = requests.get(current_url)

        # Parse the page with BS4.
        soup = BeautifulSoup(page.content, 'html.parser')

        pos = soup.find("div", {"class": "meaning-tags"}).contents[0]
        
        print(word + ': ' + pos)
        
        return pos
        
    except:
        print(word + ': Failed')
        return 'Failed'

def jisho_api_lookup(word):
    """Scrapes and returns the lemma's most common part of speech through http://www.jisho.org/ 's experimental alpha API."""
    
    try:
        pos = []

        url_base = 'https://jisho.org/api/v1/search/words?keyword='

        # Create the API URL for the current word.
        current_url = url_base + str(word)

        # Request the API results for the current word.
        page = requests.get(current_url)
        
        # Return a failure on a 404 or 500.
        if page.status_code == 404:
            print(word + ': Failed. 404 error.')
            return ['Failed. 404 error.']
        if page.status_code == 500:
            print(word + ': Failed. 500 error.')
            return ['Failed. 500 error.']

        # Decode the JSON body of the API response directly.
        data = page.json()['data']
        
        # Some characters like 」 will load an api page but not have data.
        if data == []:
            print(word + ': Failed. No api data.')
            return ['Failed. No api data.']

        # Access the correct values in the nested dictionaries and lists.
        for item in data:
            for x in item['japanese']:
                try:
                    # An entry may lack either key, so use .get() rather than indexing.
                    if x.get('word') == word or x.get('reading') == word:

                        for variation in item['senses']:

                            for part in variation['parts_of_speech']:
                                pos.append(part)

                            # Return after the first sense only.
                            print(word + ': ' + str(pos))
                            return str(pos)

                except KeyError:
                    pass

        # If an api page is loaded with information but none of it matches the word, return a failure marker.
        print(word + ': Failed due to incorrect data.')
        return ['Failed due to incorrect data.']
        
    except:
        print(word + ': Failed')
        return ['Failed']

# Test the jisho_lookup() function.
print(jisho_lookup('換算'))

換算: Noun, Suru verb
Noun, Suru verb

# Test the jisho_api_lookup() function with a character that should fail.
jisho_api_lookup('$')

$: Failed. 500 error.

['Failed. 500 error.']
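The nested lookup inside `jisho_api_lookup()` can be tried offline against a small dictionary shaped like the fields the function reads (`data`, `japanese`, `word`, `reading`, `senses`, `parts_of_speech`). The payload below is hand-written for illustration and is not a captured API response; `first_sense_pos` is a hypothetical helper.

```python
# Hand-written payload that mimics only the fields jisho_api_lookup() reads.
sample_response = {
    'data': [
        {
            'japanese': [{'word': '換算', 'reading': 'かんさん'}],
            'senses': [
                {'parts_of_speech': ['Noun', 'Suru verb']},
                {'parts_of_speech': ['Noun']},
            ],
        }
    ]
}

def first_sense_pos(response, word):
    """Return the parts of speech of the first sense of the first matching entry."""
    for entry in response.get('data', []):
        for form in entry.get('japanese', []):
            # Match on either the written form or the reading.
            if word in (form.get('word'), form.get('reading')):
                senses = entry.get('senses', [])
                return list(senses[0].get('parts_of_speech', [])) if senses else []
    # No entry matched the word.
    return None

print(first_sense_pos(sample_response, '換算'))
print(first_sense_pos(sample_response, 'かんさん'))
```

Separating the parsing from the network request this way would also make the failure cases testable without contacting the site.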

# Scrape all parts of speech from Jisho.org for each lemma.
# Note that this takes a very long time to execute.
# This block is commented out to avoid accidentally re-scraping it all.
'''
df['pos'] = df['lemma'].apply(jisho_api_lookup)
df.to_csv('df_with_pos.csv', sep = '|')
df
'''

"\ndf['pos'] = df['lemma'].apply(jisho_api_lookup)\ndf.to_csv('df_with_pos.csv', sep = '|')\ndf\n"

# Presume that the entire notebook is being run, and reload the previously saved dataframe that includes the scraped data.
df = pd.read_csv('df_with_pos.csv', sep = '|', index_col = 0)

df

# View the total value counts for each lemma part of speech value.
df['pos'].value_counts()

['Noun']                                                                 5868
['Noun', 'Suru verb']                                                    1985
['Noun', 'No-adjective']                                                  907
['Failed due to incorrect data.']                                         647
['Noun', 'Suru verb', 'No-adjective']                                     408
                                                                         ... 
['Adverbial noun', 'Noun - used as a suffix']                               1
['Godan verb with su ending', 'Transitive verb', 'intransitive verb']       1
['Expression', 'No-adjective', 'Adverb']                                    1
['Taru-adjective', "Adverb taking the 'to' particle", 'Adverb']             1
['Suru verb', 'Noun']                                                       1
Name: pos, Length: 290, dtype: int64

# Calculate each unique part of speech from the jisho api scrape.
from ast import literal_eval

parts_of_speech = {}
for pos_list in df['pos']:
    for pos in literal_eval(pos_list):
        if pos in parts_of_speech:
            parts_of_speech[pos] += 1
        else:
            parts_of_speech[pos] = 1
            
parts_of_speech

{'Particle': 50,
 'Numeric': 40,
 'Noun': 10385,
 'Copula': 1,
 'Suru verb - irregular': 1,
 'Godan verb with su ending': 241,
 'intransitive verb': 663,
 'Transitive verb': 998,
 'Noun - used as a suffix': 316,
 'I-adjective': 244,
 'Ichidan verb': 584,
 'Godan verb with ru ending (irregular verb)': 3,
 'Expression': 169,
 'Failed due to incorrect data.': 647,
 'Failed. No api data.': 214,
 'Godan verb with ru ending': 388,
 'Auxiliary verb': 19,
 'Godan verb with u ending': 156,
 'Pre-noun adjectival': 40,
 'Na-adjective': 833,
 'Suffix': 95,
 'Pronoun': 76,
 'Kuru verb - special class': 4,
 'Prefix': 85,
 'Godan verb - Iku/Yuku special class': 4,
 'Adverb': 554,
 'Conjunction': 89,
 'Adverbial noun': 302,
 'Yodan verb with ru ending (archaic)': 3,
 'Taru-adjective': 11,
 "Adverb taking the 'to' particle": 84,
 'Noun - used as a prefix': 32,
 'Godan verb with ku ending': 136,
 'Godan verb with tsu ending': 32,
 'Suru verb': 2512,
 'No-adjective': 1739,
 'Temporal noun': 218,
 'Godan verb with mu ending': 128,
 'Noun or verb acting prenominally': 40,
 'Godan verb - aru special class': 7,
 'Counter': 40,
 'Auxiliary adjective': 2,
 'Godan verb with u ending (special class)': 3,
 'Godan verb with bu ending': 19,
 'Godan verb with nu ending': 2,
 'Irregular nu verb': 2,
 'Wikipedia definition': 171,
 'Su verb - precursor to the modern suru': 9,
 'Place': 39,
 'Failed. 500 error.': 1,
 'Godan verb with gu ending': 26,
 'Auxiliary': 5,
 'Suru verb - special class': 12,
 'Full name': 2,
 'I-adjective (yoi/ii class)': 3,
 'Proper noun': 4,
 'Product': 1,
 'Archaic/formal form of na-adjective': 2,
 'Unclassified': 1,
 'Nidan verb (upper class) with ru ending (archaic)': 1,
 'Nidan verb (lower class) with ru ending (archaic)': 1,
 'Ichidan verb - zuru verb (alternative form of -jiru verbs)': 2,
 'Company': 3,
 'Nidan verb (lower class) with mu ending (archaic)': 1}
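The same tally can be produced with `collections.Counter`, which also sorts the result by count. A minimal sketch on a made-up column of stringified lists in the format that `df['pos']` has after being reloaded from the CSV file:

```python
from ast import literal_eval
from collections import Counter

# Made-up stand-in for df['pos']; each value is a list stored as a string.
toy_pos_column = ["['Noun']", "['Noun', 'Suru verb']", "['Adverb']"]

# Parse each string back into a list and count every part of speech it contains.
counts = Counter(
    pos
    for pos_list in toy_pos_column
    for pos in literal_eval(pos_list)
)

print(counts.most_common())
```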

# Look at one of the odder scraped part of speech values.
df.loc[df['pos'] == "['Wikipedia definition']"].head()

There is an excessively large number of highly specific parts of speech in this data.

A dictionary will be created to simplify and 'translate' the parts_of_speech keys to the pos_df eng_pos values.

# Create a dictionary to simplify and translate the parts_of_speech to the pos_df eng_pos values.
translation_dict = {}
for pos in list(parts_of_speech.keys()):
    translation_dict[pos] = ''

translation_dict

{'Particle': '',
 'Numeric': '',
 'Noun': '',
 'Copula': '',
 'Suru verb - irregular': '',
 'Godan verb with su ending': '',
 'intransitive verb': '',
 'Transitive verb': '',
 'Noun - used as a suffix': '',
 'I-adjective': '',
 'Ichidan verb': '',
 'Godan verb with ru ending (irregular verb)': '',
 'Expression': '',
 'Failed due to incorrect data.': '',
 'Failed. No api data.': '',
 'Godan verb with ru ending': '',
 'Auxiliary verb': '',
 'Godan verb with u ending': '',
 'Pre-noun adjectival': '',
 'Na-adjective': '',
 'Suffix': '',
 'Pronoun': '',
 'Kuru verb - special class': '',
 'Prefix': '',
 'Godan verb - Iku/Yuku special class': '',
 'Adverb': '',
 'Conjunction': '',
 'Adverbial noun': '',
 'Yodan verb with ru ending (archaic)': '',
 'Taru-adjective': '',
 "Adverb taking the 'to' particle": '',
 'Noun - used as a prefix': '',
 'Godan verb with ku ending': '',
 'Godan verb with tsu ending': '',
 'Suru verb': '',
 'No-adjective': '',
 'Temporal noun': '',
 'Godan verb with mu ending': '',
 'Noun or verb acting prenominally': '',
 'Godan verb - aru special class': '',
 'Counter': '',
 'Auxiliary adjective': '',
 'Godan verb with u ending (special class)': '',
 'Godan verb with bu ending': '',
 'Godan verb with nu ending': '',
 'Irregular nu verb': '',
 'Wikipedia definition': '',
 'Su verb - precursor to the modern suru': '',
 'Place': '',
 'Failed. 500 error.': '',
 'Godan verb with gu ending': '',
 'Auxiliary': '',
 'Suru verb - special class': '',
 'Full name': '',
 'I-adjective (yoi/ii class)': '',
 'Proper noun': '',
 'Product': '',
 'Archaic/formal form of na-adjective': '',
 'Unclassified': '',
 'Nidan verb (upper class) with ru ending (archaic)': '',
 'Nidan verb (lower class) with ru ending (archaic)': '',
 'Ichidan verb - zuru verb (alternative form of -jiru verbs)': '',
 'Company': '',
 'Nidan verb (lower class) with mu ending (archaic)': ''}

# View the part of speech values from the pos_df.
pos_df['eng_pos']

0               noun
1           particle
2             symbol
3               verb
4     auxiliary verb
5             adverb
6          adjective
7          adnominal
8        conjunction
9             prefix
10      interjection
11            filler
12             other
Name: eng_pos, dtype: object

# Manually fill out the dictionary for condensing the lemma part of speech values.
translation_dict = {
    'Particle': 'particle',
    'Numeric': 'other',
    'Noun': 'noun',
    'Copula': 'other',
    'Suru verb - irregular': 'verb',
    'Godan verb with su ending': 'verb',
    'intransitive verb': 'verb',
    'Transitive verb': 'verb',
    'Noun - used as a suffix': 'noun',
    'I-adjective': 'adjective',
    'Ichidan verb': 'verb',
    'Godan verb with ru ending (irregular verb)': 'verb',
    'Expression': 'other',
    'Failed due to incorrect data.': '',
    'Failed. No api data.': '',
    'Godan verb with ru ending': 'verb',
    'Auxiliary verb': 'auxiliary verb',
    'Godan verb with u ending': 'verb',
    'Pre-noun adjectival': 'adjective',
    'Na-adjective': 'adjective',
    'Suffix': 'other',
    'Pronoun': 'noun',
    'Kuru verb - special class': 'verb',
    'Prefix': 'prefix',
    'Godan verb - Iku/Yuku special class': 'verb',
    'Adverb': 'adverb',
    'Conjunction': 'conjunction',
    'Adverbial noun': 'noun',
    'Yodan verb with ru ending (archaic)': 'verb',
    'Taru-adjective': 'adjective',
    "Adverb taking the 'to' particle": 'adverb',
    'Noun - used as a prefix': 'noun',
    'Godan verb with ku ending': 'verb',
    'Godan verb with tsu ending': 'verb',
    'Suru verb': 'verb',
    'No-adjective': 'adjective',
    'Temporal noun': 'noun',
    'Godan verb with mu ending': 'verb',
    'Noun or verb acting prenominally': 'other',
    'Godan verb - aru special class': 'verb',
    'Counter': 'symbol',
    'Auxiliary adjective': 'adjective',
    'Godan verb with u ending (special class)': 'verb',
    'Godan verb with bu ending': 'verb',
    'Godan verb with nu ending': 'verb',
    'Irregular nu verb': 'verb',
    'Wikipedia definition': '',
    'Su verb - precursor to the modern suru': 'verb',
    'Place': 'noun',
    'Failed. 500 error.': '',
    'Godan verb with gu ending': 'verb',
    'Auxiliary': 'auxiliary verb',
    'Suru verb - special class': 'verb',
    'Full name': 'noun',
    'I-adjective (yoi/ii class)': 'adjective',
    'Proper noun': 'noun',
    'Product': 'other',
    'Archaic/formal form of na-adjective': 'adjective',
    'Unclassified': '',
    'Nidan verb (upper class) with ru ending (archaic)': 'verb',
    'Nidan verb (lower class) with ru ending (archaic)': 'verb',
    'Ichidan verb - zuru verb (alternative form of -jiru verbs)': 'verb',
    'Company': 'noun',
    'Nidan verb (lower class) with mu ending (archaic)': 'verb'
}

Condensing the part of speech values this way inherently limits the precision of the calculations, but it is the best that can be done without intricate knowledge of how the pos_df was originally tagged. This could be examined in a later project by looking into the ChaSen morphological analyzer that was used (http://chasen-legacy.osdn.jp/).

# View the columns for each dataframe before continuing.
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15000 entries, 0 to 14999
Data columns (total 6 columns):
rank          15000 non-null int64
frequency     15000 non-null float64
lemma         15000 non-null object
script        15000 non-null object
jlpt_level    15000 non-null object
pos           15000 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 820.3+ KB

# View the columns for each dataframe before continuing.
char_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2640 entries, 0 to 2639
Data columns (total 13 columns):
rank                   2640 non-null object
frequency              2640 non-null int64
character              2640 non-null object
weighted_frequency     2640 non-null float64
weighted_rank          2640 non-null object
script                 2640 non-null object
ord                    2640 non-null int64
jlpt_level             2640 non-null object
stroke_count           2479 non-null float64
grade                  2381 non-null float64
frequency_ranking      2115 non-null float64
word_count             2640 non-null int32
average_word_length    2640 non-null float64
dtypes: float64(5), int32(1), int64(2), object(5)
memory usage: 358.4+ KB

# View the columns for each dataframe before continuing.
pos_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 6 columns):
rank                    13 non-null object
eng_pos                 13 non-null object
frequency               13 non-null float64
frequency_percentage    13 non-null float64
jap_pos                 13 non-null object
transliterated_pos      13 non-null object
dtypes: float64(2), object(4)
memory usage: 752.0+ bytes

# Create a list of all scraped part of speech value combinations.
pos_lists = df['pos'].value_counts().keys()
pos_lists

Index(['['Noun']', '['Noun', 'Suru verb']', '['Noun', 'No-adjective']',
       '['Failed due to incorrect data.']',
       '['Noun', 'Suru verb', 'No-adjective']', '['Na-adjective', 'Noun']',
       '['Ichidan verb', 'Transitive verb']', '['Adverb']', '['I-adjective']',
       '['Noun', 'Noun - used as a suffix']',
       ...
       '['Adverbial noun', 'Conjunction']', '['No-adjective', 'Prefix']',
       '["Adverb taking the 'to' particle"]',
       '['Noun', 'Suru verb', 'Adverb', "Adverb taking the 'to' particle"]',
       '['Noun', 'Noun - used as a prefix', 'No-adjective']',
       '['Adverbial noun', 'Noun - used as a suffix']',
       '['Godan verb with su ending', 'Transitive verb', 'intransitive verb']',
       '['Expression', 'No-adjective', 'Adverb']',
       '['Taru-adjective', "Adverb taking the 'to' particle", 'Adverb']',
       '['Suru verb', 'Noun']'],
      dtype='object', length=290)

Nearly all of these are made up of very niche or at least overly specific parts of speech. A dummy variable column for each part of speech can be created to simplify the grouping of rows.

# Create a dummy variable column for each part of speech from the pos_df for each lemma in the lemma dataframe.
df = df.assign(**{'adjective': 0, 'adverb': 0, 'auxiliary verb': 0, 'conjunction': 0, 'noun': 0, 'other': 0, 'particle': 0, 'prefix': 0, 'symbol': 0, 'verb': 0})
df

def translate_pos(pos):
    """Returns the condensed part of speech value calculated from the translation_dict."""
    
    return translation_dict[pos].lower()

def set_df_pos_columns(row, index):
    """Iterates over each separate scraped part of speech for the given row and sets the corresponding dummy variable columns for each condensed part of speech."""
    
    for pos in literal_eval(row['pos'].item()):
        if translate_pos(pos) != '':
            df.loc[[index],[translate_pos(pos)]] = 1

# Test the set_df_pos_columns() function with the first lemma dataframe row.
set_df_pos_columns(df.iloc[[0]], 0)
df.head()

C:\Users\aedri\Anaconda3\envs\tf1\lib\site-packages\ipykernel_launcher.py:4: FutureWarning: `item` has been deprecated and will be removed in a future version
  after removing the cwd from sys.path.

# Note that this will take some time.
# Use the set_df_pos_columns() function to set all part of speech dummy variables for each row in the lemma dataframe.
for index in range(len(df)):
    set_df_pos_columns(df.iloc[[index]], index)

C:\Users\aedri\Anaconda3\envs\tf1\lib\site-packages\ipykernel_launcher.py:4: FutureWarning: `item` has been deprecated and will be removed in a future version
  after removing the cwd from sys.path.
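
The row-by-row loop above is slow because every row triggers several dataframe assignments. As a minimal sketch, and not the notebook's own method, the same dummy columns could be built by parsing each `pos` string once. Here `toy_df`, `toy_translation`, and `pos_dummies()` are hypothetical stand-ins for the lemma dataframe, `translation_dict`, and the dummy-column step, and unknown tags are skipped with `dict.get()` rather than raising an error.

```python
from ast import literal_eval

import pandas as pd

# Hypothetical stand-ins for the lemma dataframe and translation_dict.
toy_df = pd.DataFrame({
    'lemma': ['勉強', '食べる', '静か'],
    'pos': ["['Noun', 'Suru verb']",
            "['Ichidan verb', 'Transitive verb']",
            "['Na-adjective', 'Noun']"],
})
toy_translation = {
    'Noun': 'noun',
    'Suru verb': 'verb',
    'Ichidan verb': 'verb',
    'Transitive verb': 'verb',
    'Na-adjective': 'adjective',
}

def pos_dummies(frame, translation):
    """Returns a dataframe of 0/1 columns, one per condensed part of speech."""
    # Parse the stringified lists once per row.
    parsed = frame['pos'].map(literal_eval)
    # Condense each scraped tag and drop tags that have no translation.
    condensed = parsed.map(lambda tags: {translation.get(tag, '').lower() for tag in tags} - {''})
    columns = sorted(set().union(*condensed))
    return pd.DataFrame({column: condensed.map(lambda tags: int(column in tags)) for column in columns},
                        index = frame.index)

dummies = pos_dummies(toy_df, toy_translation)
print(dummies)
```

The result could then be joined back onto the lemma dataframe with `pd.concat([toy_df, dummies], axis = 1)`.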

# Calculate the total number of lemmas of each part of speech based on the dummy variables.
lemma_pos_counts = {}
for column in ['adjective', 'adverb', 'auxiliary verb', 'conjunction', 'noun', 'other', 'particle', 'prefix', 'symbol', 'verb']:
    lemma_pos_counts[column] = df[column].sum()
    
lemma_pos_counts

{'adjective': 2652,
 'adverb': 568,
 'auxiliary verb': 24,
 'conjunction': 89,
 'noun': 10809,
 'other': 341,
 'particle': 50,
 'prefix': 85,
 'symbol': 40,
 'verb': 4266}

Nouns and verbs are understandably the most common words.

The average frequency of all the characters in a lemma may be a way to measure the characters' impact on a lemma's frequency.

def average_char_frequency(word):
    """Calculates and returns the average frequency among all characters in the given word."""
    
    freq_total = 0
    char_count = len(word)
    
    for char in word:
        freq_total += char_df.loc[char_df['character'] == char]['frequency'].item()
        
    return freq_total / char_count

# Test the average_char_frequency() function.
average_char_frequency('図書館')

C:\Users\aedri\Anaconda3\envs\tf1\lib\site-packages\ipykernel_launcher.py:8: FutureWarning: `item` has been deprecated and will be removed in a future version

26.666666666666668

# Note that this will take some time.
# Calculate the average character frequency of each lemma in the lemma dataframe.
df['average_character_frequency'] = df['lemma'].map(average_char_frequency)
df

C:\Users\aedri\Anaconda3\envs\tf1\lib\site-packages\ipykernel_launcher.py:8: FutureWarning: `item` has been deprecated and will be removed in a future version
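
Filtering the character dataframe once per character is what makes this step slow. A minimal sketch of an alternative, assuming each character appears only once in the character dataframe, is to build a lookup dictionary a single time. The names `toy_char_df`, `freq_lookup`, and `average_char_frequency_fast()` are hypothetical, and the frequency values are made up for illustration.

```python
import pandas as pd

# Hypothetical character table; the frequency values are made up for illustration.
toy_char_df = pd.DataFrame({'character': ['図', '書', '館'], 'frequency': [12, 30, 18]})

# Build the character -> frequency lookup once instead of filtering the dataframe per character.
freq_lookup = dict(zip(toy_char_df['character'], toy_char_df['frequency']))

def average_char_frequency_fast(word, lookup):
    """Returns the mean frequency of the characters in word, using a prebuilt lookup dictionary."""
    return sum(lookup[char] for char in word) / len(word)

toy_average = average_char_frequency_fast('図書館', freq_lookup)
print(toy_average)
```

Like the original function, this sketch raises an error for a character that is missing from the table, so any fallback behavior would need to be added deliberately.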

Likewise, the total stroke count of all characters in a word, as well as the average stroke count per character, can provide a rough measure of how difficult a word is to write.

def total_stroke_count(word):
    """Calculates and returns the sum of the stroke counts of all characters in the given word."""
    
    stroke_total = 0
    
    try:
        for char in word:
            stroke_total += char_df.loc[char_df['character'] == char]['stroke_count'].item()
    except:
        return np.nan
    
    return stroke_total

def average_char_stroke_count(word):
    """Calculates and returns the average stroke count among all characters in the given word."""
    
    stroke_total = total_stroke_count(word)
    char_count = len(word)
    
    return stroke_total / char_count

# Test the total_stroke_count() and average_char_stroke_count() functions.
total_stroke_count('図書館'), average_char_stroke_count('図書館')

C:\Users\aedri\Anaconda3\envs\tf1\lib\site-packages\ipykernel_launcher.py:8: FutureWarning: `item` has been deprecated and will be removed in a future version

(33.0, 11.0)

# Note that this will take a while.
# Calculate the total and average stroke counts for each lemma in the lemma dataframe.
df['total_stroke_count'] = df['lemma'].map(total_stroke_count)
df['average_character_stroke_count'] = df['lemma'].map(average_char_stroke_count)
df

C:\Users\aedri\Anaconda3\envs\tf1\lib\site-packages\ipykernel_launcher.py:8: FutureWarning: `item` has been deprecated and will be removed in a future version

# Visualize the total number of each lemma part of speech with a bar plot.
x = list(lemma_pos_counts.keys())
y = list(lemma_pos_counts.values())

sns.barplot(x = x, y = y, orient = 'v')
plt.title('Part of Speech Totals for the Most Common Japanese Lemmas')
plt.ylabel('Number of Lemmas')
plt.xlabel('Part of Speech')
plt.xticks(rotation = 90)
plt.show()

The large differences in values between these categories may make graphing the data tricky, but it is also very easy to see the distinctive differences in counts.

Nonbasic Lemma Dataframe Creation ¶

Create a dataframe that contains only Japanese lemmas, excluding those that are a single hiragana or katakana character.

# Create a completely separate copy of the lemma dataframe.
nonbasic_lemma_df = df.copy(deep = True)

# Remove non-Japanese lemmas from the lemma dataframe copy.
nonbasic_lemma_df = nonbasic_lemma_df[nonbasic_lemma_df['script'] != 'not_japanese']

# Look at the new total number of lemmas in each script.
nonbasic_lemma_df['script'].value_counts()

kanji                    8185
kanji_and_hiragana       2446
katakana                 2387
hiragana                 1686
kanji_and_katakana         42
hiragana_and_katakana      24
Name: script, dtype: int64

# Remove all hiragana-only and katakana-only lemmas that are only a single character long.
# The collected values are index labels, so they are passed to drop() directly rather than used as positions.
to_remove = nonbasic_lemma_df.loc[(nonbasic_lemma_df['script'] == 'hiragana') & (nonbasic_lemma_df['lemma'].str.len() == 1)].index.tolist()
to_remove.extend(nonbasic_lemma_df.loc[(nonbasic_lemma_df['script'] == 'katakana') & (nonbasic_lemma_df['lemma'].str.len() == 1)].index.tolist())
to_remove.sort()
nonbasic_lemma_df.drop(index = to_remove, inplace = True)
nonbasic_lemma_df

# Check the non-null values of each column.
nonbasic_lemma_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14686 entries, 8 to 14999
Data columns (total 19 columns):
rank                              14686 non-null int64
frequency                         14686 non-null float64
lemma                             14686 non-null object
script                            14686 non-null object
jlpt_level                        14686 non-null object
pos                               14686 non-null object
adjective                         14686 non-null int64
adverb                            14686 non-null int64
auxiliary verb                    14686 non-null int64
conjunction                       14686 non-null int64
noun                              14686 non-null int64
other                             14686 non-null int64
particle                          14686 non-null int64
prefix                            14686 non-null int64
symbol                            14686 non-null int64
verb                              14686 non-null int64
average_character_frequency       14686 non-null float64
total_stroke_count                14686 non-null float64
average_character_stroke_count    14686 non-null float64
dtypes: float64(4), int64(11), object(4)
memory usage: 2.2+ MB

# Reset the index in the new dataframe.
nonbasic_lemma_df.reset_index(drop = True, inplace = True)
nonbasic_lemma_df

This new dataframe retains nearly all of the original lemma dataset (14,686 of the original 15,000 lemmas) while excluding every lemma made up of non-Japanese characters.

It also removes the hiragana and katakana characters from being examined as if they were lemmas themselves, removing the largest outliers for frequencies.
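
The two filtering rules described above can also be written as boolean masks, which avoids collecting index values by hand. This is only a sketch on made-up rows. `toy_df`, `is_japanese`, `is_single_kana`, and `toy_nonbasic` are hypothetical names, and the `script` labels are assumed to match the ones used in the lemma dataframe.

```python
import pandas as pd

# Made-up rows that cover each case: single kana, kanji, multi-character kana, non-Japanese.
toy_df = pd.DataFrame({
    'lemma': ['の', '図書館', 'ア', 'コーヒー', 'abc'],
    'script': ['hiragana', 'kanji', 'katakana', 'katakana', 'not_japanese'],
})

# Rule 1: keep only lemmas written in a Japanese script.
is_japanese = toy_df['script'] != 'not_japanese'

# Rule 2: flag hiragana-only or katakana-only lemmas that are a single character long.
is_single_kana = toy_df['script'].isin(['hiragana', 'katakana']) & (toy_df['lemma'].str.len() == 1)

# Apply both rules at once and renumber the rows.
toy_nonbasic = toy_df[is_japanese & ~is_single_kana].reset_index(drop = True)
print(toy_nonbasic)
```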

Now that the data has been set up in the ways needed for visualization and analysis, the plotting can begin. Frequency is the biggest variable to look into, because it is such a large factor in the immediate usefulness of learning a word. The JLPT exam level is another variable to examine because of how it impacts so many students of the Japanese language through language programs, classes, and tools. Finally, the part of speech ratios and the school grades in which native speakers learn the characters will be looked at as well.

First, the lemmas will be examined, followed by the individual characters that make up the lemmas.

Create wrapper functions to simplify the creation of plots.

def create_graph_bar(y, x, title, ylabel, xlabel, rotate_degree = None, ylog = None, order = None, orient = None):
    """A wrapper function for creating seaborn barplots."""
    
    # Pass x and y by keyword so the call does not depend on positional argument order.
    sns.barplot(x = x, y = y, order = order, orient = orient).set_title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    if ylog: plt.yscale('log')
    if rotate_degree is not None: plt.xticks(rotation = rotate_degree)
    plt.show()

def create_graph_reg(y, x, title, ylabel, xlabel, order = 1, ylog = None, alpha = 1, line_color = None, truncate = True, xjitter = 0):
    """A wrapper function for creating seaborn regplots."""
    
    # Only override the regression line color when one is given.
    line_kws = {'color': line_color} if line_color is not None else None
    # Pass the truncate argument through instead of always using True.
    sns.regplot(x = x, y = y, scatter = True, truncate = truncate, order = order, x_jitter = xjitter, scatter_kws = {'alpha': alpha}, line_kws = line_kws).set_title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    if ylog: plt.yscale('log')
    plt.show()

def create_graph_box(y, x, title, ylabel, xlabel, rotate_degree = None, ylog = None, xlog = None, order = None, orient = None):
    """A wrapper function for creating seaborn boxplots."""
    
    # Pass x and y by keyword so the call does not depend on positional argument order.
    sns.boxplot(x = x, y = y, order = order, orient = orient).set_title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    if ylog: plt.yscale('log')
    if xlog: plt.xscale('log')
    if rotate_degree is not None: plt.xticks(rotation = rotate_degree)
    plt.show()

def create_graph_cat(y, x,  title, ylabel, xlabel, rotate_degree = None, ylog = None, xlog = None, hue = None,  data = None, kind = 'scatter',  order = None):
    """A wrapper function for creating seaborn catplots."""
    
    ax = sns.catplot(x = x, y = y, hue = hue, data = data, kind = kind, order = order)
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    if rotate_degree != None: plt.xticks(rotation = rotate_degree)
    if ylog: plt.yscale('log')
    if xlog: plt.xscale('log')
    plt.show()

Dataframe: Lemma ¶

Univariate ¶

Visualizing the numerical data for the lemma dataframe will show how the data is distributed. Any polynomial trend will be very interesting, but any linear or exponential trend will also be useful to know about.

# Display a grid of histograms of univariate numerical columns.
nonbasic_lemma_df.hist(column = ['frequency', 'average_character_frequency', 'total_stroke_count', 'average_character_stroke_count'], bins = 16, figsize = (10, 10), grid = False);

# Visualize the information of the lemma frequency dataset.
# Scale the y values as log because of the large frequency differences between the most common lemmas and the bulk of the lemmas.
x = df['rank']
y = df['frequency']
plt.plot(x, y)
plt.title('Frequencies of the 15,000 Most Common Lemmas in Japanese')
plt.xlabel('Lemma Rank')
plt.xticks([0, 1500, 3000, 4500, 6000, 7500, 9000, 10500, 12000, 13500, 15000], rotation = 'vertical')
plt.ylabel('Lemma Frequency')
plt.yscale('log')
plt.show()

Frequency varies too widely for its histogram to be useful in the same form as the others, so the previously made rank-frequency chart is reexamined here on a logarithmic scale.

Average character frequency, average character stroke count, and total character stroke count all appear to follow a polynomial trend in which values near the median occur more often than those farther from it.

Bivariate: Frequency ¶

Lemma frequency is one of the most important variables for this dataset.

The relationships between a lemma's frequency and script, length, JLPT exam level, average character frequency, total stroke count, and average character stroke count will be visualized to give insight to which factors may need to be further researched.

# Display the relationship between frequency and script with a box plot.
create_graph_box(nonbasic_lemma_df['frequency'], nonbasic_lemma_df['script'], 'Lemma Frequency by Script', 'Frequency', 'Script', 45, ylog = True)

Frequency seems very slightly affected by the script type.

# Display the relationship between frequency and lemma length with a reg plot.
create_graph_reg(nonbasic_lemma_df['frequency'], nonbasic_lemma_df['lemma'].str.len(), 'Frequency by Lemma Length', 'Frequency', 'Lemma Length', 3, True, line_color = 'red')

A lemma's frequency seems to be negatively affected by its length. Additionally, the frequency of a lemma drops drastically when it is longer than 8 characters.

# Display the relationship between frequency and JLPT exam level with a box plot.
create_graph_box(nonbasic_lemma_df['frequency'], nonbasic_lemma_df['jlpt_level'], 'Lemma Frequency by JLPT Exam Level', 'Frequency', 'JLPT Exam Level', ylog = True, order = jlpt_exams)

While the n5 exam has some of the most frequently used lemmas, it has a lower median than other exam levels.

# Display the relationship between frequency and average character frequency with a reg plot.
create_graph_reg(nonbasic_lemma_df['frequency'], nonbasic_lemma_df['average_character_frequency'], 'Frequency by Lemma Average Character Frequency', 'Frequency', 'Average Character Frequency', ylog = True, alpha = 0.05, line_color = 'red')

A lemma's frequency is positively affected by its average character frequency as a general trend.

# Display the relationship between frequency and total stroke count with a reg plot.
create_graph_reg(nonbasic_lemma_df['frequency'], nonbasic_lemma_df['total_stroke_count'], 'Frequency by Lemma Total Stroke Count', 'Frequency', 'Lemma Total Stroke Count', order = 1, ylog = True, alpha = 0.05, line_color = 'red')

A lemma's frequency is negatively affected by its total stroke count. Additionally, the frequency of a lemma drops drastically around and past a maximum of 25 total strokes.

# Display the relationship between frequency and average character stroke count with a reg plot.
create_graph_reg(nonbasic_lemma_df['frequency'], nonbasic_lemma_df['average_character_stroke_count'], 'Frequency by Lemma Average Character Stroke Count', 'Frequency', 'Lemma Average Character Stroke Count', ylog = True, alpha = 0.05, line_color = 'red')

A lemma's frequency is negatively affected by its average character stroke count. Additionally, the frequency of a lemma drops drastically past a maximum average of 15 strokes per character.

Lemma Frequency Summary¶

Lemma frequency is most strongly affected by length and stroke count. The more complex that a lemma is to read and write, the less frequently it tends to be used.

While one might expect the JLPT n5 exam to consist largely of the most frequent lemmas, it does so only to a point, largely because these relationships are biased by the simplicity of the hiragana and katakana scripts.

Lemmas that are longer than 8 characters, that have a total of 25 strokes or more, or that average 15 strokes or more per character have significantly lower frequencies in Japanese.
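
The relationships in this summary are read from plots. One way to put a number on such a trend, offered here only as a sketch, is a rank (Spearman) correlation, computed as the Pearson correlation of the ranked values so that only pandas is needed. The values in `toy` are made up, and `rank_correlation()` is a hypothetical helper that is not part of the notebook.

```python
import pandas as pd

# Made-up lengths and frequencies; the real columns would come from the lemma dataframe.
toy = pd.DataFrame({
    'length': [1, 2, 3, 4, 5],
    'frequency': [900.0, 400.0, 150.0, 40.0, 5.0],
})

def rank_correlation(x, y):
    """Returns the Spearman correlation, computed as the Pearson correlation of the ranks."""
    return x.rank().corr(y.rank())

rho = rank_correlation(toy['length'], toy['frequency'])
print(rho)
```

A value near -1 indicates that frequency falls consistently as length grows, while a value near 0 indicates no monotonic relationship. A rank correlation suits this data because the frequencies span several orders of magnitude.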

Bivariate: JLPT Exam Level ¶

Though not as directly important as frequency, determining the effectiveness of the JLPT exam divisions and ordering will be helpful in applying these goals to the overall language.

The relationships between a lemma's JLPT exam level and length, average character frequency, total stroke count, and average character stroke count will be visualized to give insight to which factors may need to be further researched.

# Display the relationship between JLPT exam level and lemma length with a box plot.
create_graph_box(nonbasic_lemma_df['jlpt_level'], nonbasic_lemma_df['lemma'].str.len(), 'Lemma JLPT Exam Level by Lemma Length', 'JLPT Exam Level', 'Lemma Length', order = jlpt_exams)

The JLPT n5 exam has significantly longer lemmas than all other exam levels. This is likely because it contains many foreign loanwords and modern words that were created after Japanese had split from Chinese influence and its influx of new kanji. These types of words are written out with many hiragana or katakana characters.

# Display the relationship between JLPT exam level and average character frequency with a box plot.
create_graph_box(nonbasic_lemma_df['jlpt_level'], nonbasic_lemma_df['average_character_frequency'], 'Lemma JLPT Exam Level by Average Character Frequency', 'JLPT Exam Level', 'Average Character Frequency', order = jlpt_exams)

JLPT exam level difficulty has a negative relation with average character frequency.

# Display the relationship between JLPT exam level and total stroke count with a box plot.
create_graph_box(nonbasic_lemma_df['jlpt_level'], nonbasic_lemma_df['total_stroke_count'], 'Lemma JLPT Exam Level by Total Stroke Count', 'JLPT Exam Level', 'Total Stroke Count', order = jlpt_exams)

JLPT exam level difficulty has a positive relation with total stroke count.

# Display the relationship between JLPT exam level and average character stroke count with a box plot.
create_graph_box(nonbasic_lemma_df['jlpt_level'], nonbasic_lemma_df['average_character_stroke_count'], 'Lemma JLPT Exam Level by Average Character Stroke Count', 'JLPT Exam Level', 'Average Character Stroke Count', order = jlpt_exams)

JLPT exam level difficulty has a positive relation with average character stroke count.

Lemma JLPT Exam Level Summary¶

The JLPT exams do tend to follow frequency trends, and likely used character and lemma frequency as a metric when dividing the material between the exam levels.

When weighting the importance of learning a lemma, kanji and hiragana script words should be prioritized more. Additionally, while longer lemmas are less important based on frequency, the JLPT n5 exam vocabulary list should be made exempt from negative weighting from length.

Lemma Part of Speech Grouping Data ¶

Like lemma frequency, the part of speech is one of the most important variables for this dataset; if not for vocabulary applications, then for syntax and grammar.

The relationships between a lemma's part of speech and frequency, average character frequency, and average stroke count will be visualized to give insight into which factors may need to be further researched.

# Create a list of parts of speech for referencing.
pos_list = ['adjective', 'adverb', 'auxiliary verb', 'conjunction', 'noun', 'other', 'particle', 'prefix', 'symbol', 'verb']

# Create a dataframe of lemma column averages grouped by part of speech.
col_list = ['total_lemmas', 'total_lemma_proportion', 'average_character_frequency', 'total_stroke_count', 'average_character_stroke_count']
pos_means_df = pd.DataFrame(index = pos_list, columns = col_list)
pos_means_df

# Calculate the averages of various columns grouped by lemma part of speech.
for pos in pos_list:
    total_lemmas = nonbasic_lemma_df.loc[nonbasic_lemma_df[pos] == 1][pos].sum()
    average_character_frequency = nonbasic_lemma_df.loc[nonbasic_lemma_df[pos] == 1]['average_character_frequency'].mean()
    total_stroke_count = nonbasic_lemma_df.loc[nonbasic_lemma_df[pos] == 1]['total_stroke_count'].mean()
    average_character_stroke_count = nonbasic_lemma_df.loc[nonbasic_lemma_df[pos] == 1]['average_character_stroke_count'].mean()
    
    # Add an np.nan for total_lemma_proportion because the total of all parts of speech is needed to calculate the proportion.
    pos_means_df.loc[[pos]] = [[total_lemmas, np.nan, average_character_frequency, total_stroke_count, average_character_stroke_count]]

    
# Now calculate the total_lemma_proportion column.
total = pos_means_df['total_lemmas'].sum()

for pos in pos_list:
    pos_means_df.at[pos, 'total_lemma_proportion'] = (pos_means_df.at[pos, 'total_lemmas']) / total
    
    
# Check to make sure that the proportions add up to 100%.
print('Total Percentage: ' + str(pos_means_df['total_lemma_proportion'].sum() * 100) + '%')

Total Percentage: 99.99999999999999%
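
The total printed above falls just short of 100% because of floating-point rounding, not because any lemmas are missing. As a minimal sketch on made-up dummy columns, the proportions can be computed in one step and checked with a tolerance instead of an exact comparison. `toy`, `pos_columns`, `totals`, `proportions`, and `sums_to_one` are hypothetical names.

```python
import math

import pandas as pd

# Made-up 0/1 dummy columns; a lemma may belong to more than one part of speech.
toy = pd.DataFrame({'noun': [1, 1, 0, 1], 'verb': [0, 1, 1, 0], 'adverb': [0, 0, 0, 1]})
pos_columns = ['noun', 'verb', 'adverb']

# Column sums give the number of lemmas flagged with each part of speech.
totals = toy[pos_columns].sum()

# Dividing by the grand total turns the counts into proportions.
proportions = totals / totals.sum()

# Compare with a tolerance, since floating-point sums are rarely exactly 1.0.
sums_to_one = math.isclose(proportions.sum(), 1.0)
print(proportions)
print(sums_to_one)
```

Because one lemma can carry several parts of speech, the grand total here counts part-of-speech assignments, not distinct lemmas, which matches how the loop above computes its total.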

# View the resulting averages.
pos_means_df

# Display the ratio of parts of speech with bar plots.
create_graph_bar(pos_means_df['total_lemma_proportion'], pos_means_df.index, 'Distribution of Lemma Parts of Speech', 'Percentage of Total', 'Part of Speech', 45)
create_graph_bar(pos_means_df['total_lemma_proportion'], pos_means_df.index, 'Distribution of Lemma Parts of Speech (Y Logarithmic Scale)', 'Percentage of Total', 'Part of Speech', 45, ylog = True)

Nouns, verbs, and adjectives are by far the most common parts of speech, and their related grammatical rules should be weighted to reflect that.

# Display the relationship between part of speech and average character frequency with a bar plot.
create_graph_bar(pos_means_df['average_character_frequency'], pos_means_df.index, 'Lemma Average Character Frequency by Part of Speech', 'Average Character Frequency', 'Part of Speech', 45)

When grouping by average character frequency, auxiliary verbs go from being the least common part of speech to being the part of speech whose characters are extremely common in comparison to the others. This also means that auxiliary verbs may be learnable much earlier in writing and reading studies than other parts of speech.

# Display the relationship between part of speech and average character stroke count with a bar plot.
create_graph_bar(pos_means_df['average_character_stroke_count'], pos_means_df.index, 'Lemma Average Character Stroke Count by Part of Speech', 'Average Character Stroke Count', 'Part of Speech', 45)

Auxiliary verbs, conjunctions, and particles are likely to be the easiest to write, and may be easily learned early on as a whole.

# Display and compare the ratio of parts of speech with bar plots for both the part of speech df and the lemma dataframe.
# Only display the parts of speech that exist in both dataframes.
create_graph_bar(pos_means_df['total_lemma_proportion'], pos_means_df.index, 'Distribution of Lemma Parts of Speech (Y Logarithmic Scale)', 'Percentage of Total', 'Part of Speech', 45, ylog = True)
create_graph_bar(pos_df['frequency_percentage'], pos_df['eng_pos'], 'Distribution of Japanese Parts of Speech (Y Logarithmic Scale)', 'Percentage of Total', 'Part of Speech', 45, ylog = True, order = pos_list)

These proportions likely differ because of not factoring in every possible combination of part of speech usage per lemma and not having access to the tokenized corpus alongside the frequency datasets used. These ratios should not be used for larger calculations.

Lemma Part of Speech Grouping Summary¶

Noun, verb, and adjective grammar and syntactical rules should be learned freely and early on for the sake of efficiency, but conjunctions, particles, and especially auxiliary verbs will be the easiest to write and read early on.

Perhaps this latter set of rules can best be covered early on in Japanese second language acquisition between a larger focus on the former set of rules.

Multivariate ¶

The various JLPT exam levels of lemmas and scripts used in Japanese can be encountered in many different ways. Comparing the six script combinations and the five JLPT exam levels (and the lemmas beyond the exams) gives thirty-six different aspects to examine and then to determine how to go about applying the existing JLPT ordering to a newer form.

# Create a dataframe of the lemma dataframe JLPT exam level counts grouped by script in order to plot it.
# Sort the unique scripts into a list so that the column order is stable.
scripts = sorted(set(nonbasic_lemma_df['script'].values))
script_jlpt_df = pd.DataFrame(columns = scripts)

for script in scripts:
    counts = nonbasic_lemma_df.loc[nonbasic_lemma_df['script'] == script]['jlpt_level'].value_counts()
    
    for exam in jlpt_exams:
        # Some combinations of script and JLPT exam level do not exist in the nonbasic_lemma_df, so default to 0.
        script_jlpt_df.at[exam, script] = counts.get(exam, 0)

# Display the calculated counts.
script_jlpt_df
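
A table of counts like this one can also be produced with `pd.crosstab`, which fills missing combinations with 0 automatically. The following is only a sketch on made-up rows. `toy`, `toy_exams`, and `counts` are hypothetical names, and `toy_exams` stands in for `jlpt_exams`, whose exact contents are assumed.

```python
import pandas as pd

# Made-up rows; the real columns would come from nonbasic_lemma_df.
toy = pd.DataFrame({
    'script': ['kanji', 'kanji', 'hiragana', 'katakana', 'kanji'],
    'jlpt_level': ['n5', 'n4', 'n5', 'n5', 'n5'],
})

# Hypothetical stand-in for jlpt_exams; 'n3' does not occur in the toy data.
toy_exams = ['n5', 'n4', 'n3']

# Count every (exam level, script) pair; combinations that never occur become 0.
counts = pd.crosstab(toy['jlpt_level'], toy['script'])

# Put the exam levels in the desired order and add levels with no lemmas at all.
counts = counts.reindex(toy_exams, fill_value = 0)
print(counts)
```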

# Prepare the JLPT exam level counts by script dataframe for plotting.
script_jlpt_df = script_jlpt_df.transpose()
script_jlpt_df.reset_index(inplace = True)
script_jlpt_df.rename(columns = {'index': 'script'}, inplace = True)
script_jlpt_df = pd.melt(script_jlpt_df, id_vars = "script", var_name = "exam_level", value_name = "count");

# Display the relationship between the distributions of JLPT exam level counts grouped by scripts with a grouped bar plot.
create_graph_cat('count', 'script', 'Distribution of JLPT Exam Level Lemmas by Script', 'Amount of Lemmas', 'Script', hue = 'exam_level', data = script_jlpt_df, kind = 'bar', rotate_degree = 90, ylog = True, order = ['hiragana', 'katakana', 'hiragana_and_katakana', 'kanji', 'kanji_and_hiragana', 'kanji_and_katakana'])

Lemmas made up of kanji and/or hiragana are by far the most common. Additionally, knowledge of kanji will be needed the most out of all scripts.

Create a grouped barplot for the part of speech ratios from the pos_df and the lemma dataframe.

# Display and compare the ratio of parts of speech with bar plots for both the part of speech df and the lemma dataframe.
# This time, display all parts of speech even if they do not exist in both dataframes.
create_graph_bar(pos_means_df['total_lemma_proportion'], pos_means_df.index, 'Distribution of Lemma Parts of Speech (Y Logarithmic Scale)', 'Percentage of Total', 'Part of Speech', 45, ylog = True)
create_graph_bar(pos_df['frequency_percentage'], pos_df['eng_pos'], 'Distribution of Japanese Parts of Speech (Y Logarithmic Scale)', 'Percentage of Total', 'Part of Speech', 45, ylog = True, order = pos_list)

Combining these two plots into a single grouped bar chart will make the data easier to compare and understand.

# Create a list for pos, proportion of total, and the dataframe it came from to create a new dataframe for multivariate plotting.
temp_pos_list = pos_means_df.index.tolist() + pos_df['eng_pos'].tolist()
temp_proportion_list = pos_means_df['total_lemma_proportion'].tolist() + pos_df['frequency_percentage'].tolist()
temp_group_list = ['lemma_df'] * len(pos_means_df) + ['pos_df'] * len(pos_df)

# Combine these lists into a dataframe for plotting.
data = {'pos': temp_pos_list, 'proportion': temp_proportion_list, 'dataframe': temp_group_list}
grouped_df = pd.DataFrame(data)


# Manually add the missing blank rows.
missing_rows = pd.DataFrame({'pos': ['adnominal', 'filler', 'interjection'], 'proportion': 0, 'dataframe': 'lemma_df'})
grouped_df = pd.concat([grouped_df, missing_rows], ignore_index = True)

grouped_df

# Update the pos_list with the parts of speech that exist in the part of speech dataframe but not the lemma dataframe.
pos_list = ['adjective', 'adnominal', 'adverb', 'auxiliary verb', 'conjunction', 'filler', 'interjection', 'noun', 'other', 'particle', 'prefix', 'symbol', 'verb']

# Create a grouped barplot based on the part of speech proportions in both the lemma and pos dataframes.
fig = sns.catplot(x = 'pos', y = 'proportion', hue = 'dataframe', data = grouped_df, kind = 'bar', order = pos_list)
fig.set_axis_labels(x_var = 'Part of Speech', y_var = 'Percentage of Total')
fig.set_xticklabels(rotation = 90)
plt.title('Part of Speech Ratios by Dataframe')
plt.show()

These differences in ratios may be the result of a poor dataset or from the dataset tidying. Whether or not this will need to be taken into account for the analysis applications will have to be determined at a later time.

Dataframe: Character ¶

Univariate ¶

The character dataframe has considerably more numerical variables than the lemma dataframe. As before, knowing the trends of the data in these variables is important.

# Display a grid of histograms of all univariate numerical columns.
char_df.hist(column = ['frequency', 'weighted_frequency', 'stroke_count', 'grade', 'frequency_ranking', 'word_count', 'average_word_length'], bins = 10, figsize = (10, 10), grid = False);

Frequency, weighted frequency, and word count are likely logarithmic.

Average word length and stroke count appear quadratic, with the average word length having a significantly larger number of words between two and three characters long.

Grade appears confusing here, but this is because of how Japanese standards group secondary education requirements.

Bivariate: Frequency ¶

As before with lemmas, frequency is one of the most important variables for this dataset.

The relationships between a character's frequency and script, JLPT exam level, stroke count, grade, and average word length will be visualized to give insight to which factors may need to be further researched, and to begin to order which types of characters should be focused on in Japanese language acquisition.

# Display the relationship between frequency and script with a box plot.
create_graph_box(char_df['frequency'], char_df['script'], 'Character Frequency by Script', 'Frequency', 'Script', 45, ylog = True)

The hiragana and katakana scripts should obviously be mastered before turning to kanji, but the usage of roman characters in contemporary Japanese is worth discussion as well.

# Display the relationship between frequency and JLPT exam level with a box plot.
create_graph_box(char_df['frequency'], char_df['jlpt_level'], 'Character Frequency by JLPT Exam Level', 'Frequency', 'JLPT Exam Level', ylog = True, order = jlpt_exams)

No real deviation from learning characters in order of frequency is necessary to pass the JLPT exams in order.

# Display the relationship between frequency and stroke count with a reg plot.
create_graph_reg(char_df['frequency'], char_df['stroke_count'], 'Frequency by Character Stroke Count', 'Frequency', 'Stroke Count', 3, True, line_color = 'red', alpha = 0.1)

# Look at the value counts for the relationship between frequency and stroke count.
char_df['stroke_count'].value_counts().sort_index()

1.0      59
2.0      86
3.0      85
4.0      79
5.0     103
6.0     118
7.0     161
8.0     216
9.0     204
10.0    225
11.0    233
12.0    231
13.0    174
14.0    131
15.0    130
16.0     90
17.0     51
18.0     42
19.0     34
20.0     11
21.0      8
22.0      4
23.0      2
24.0      1
29.0      1
Name: stroke_count, dtype: int64

# Plot the counts of characters grouped by stroke count.
plt.plot(char_df['stroke_count'].value_counts().sort_index())
plt.ylabel('Amount of Characters')
plt.xlabel('Character Stroke Count')
plt.title('Amount of Characters by Stroke Count')
plt.show()

Character frequency seems negatively correlated with stroke count, but an interesting frequency trend occurs between 11 and 16 strokes. This likely occurs because only a limited number of meaningfully distinct symbols can be formed below a certain threshold of stroke complexity.
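The negative relationship could be quantified with a rank correlation, which is less sensitive to the heavy skew in frequency than a linear fit. This is a minimal sketch on toy data rather than on `char_df`. The helper `spearman_corr` is hypothetical, and it computes Spearman's coefficient as the Pearson correlation of the average ranks.

```python
import pandas as pd

def spearman_corr(df, x, y):
    """Spearman rank correlation between columns x and y, ignoring rows with NaN."""
    pair = df[[x, y]].dropna()
    # Spearman's rho is the Pearson correlation of the ranked values.
    return pair[x].rank().corr(pair[y].rank())

# Toy data for illustration only; these are not values from char_df.
toy_df = pd.DataFrame({
    'frequency':    [1135, 603, 120, 40, 9, 1],
    'stroke_count': [1, 2, 8, 11, 10, 16],
})

rho = spearman_corr(toy_df, 'frequency', 'stroke_count')
print(round(rho, 3))
```

A value near -1 would support the negative trend, while a value near 0 would suggest the trend in the plot is weaker than it looks.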

# Display the relationship between frequency and grade with a reg plot.
create_graph_reg(char_df['frequency'], char_df['grade'], 'Frequency by Grade', 'Frequency', 'Grade Learned', 3, True, alpha = 0.1, line_color = 'red', xjitter = 0.5)

The Japanese schooling system's ordering of character learning is very strongly correlated with the overall frequency of the characters. This means that books, tools, software, and other materials for Japanese first language acquisition will largely be ineffective, or at least inefficient, for Japanese second language acquisition.

# Display the relationship between frequency and a character's average word length with a reg plot.
create_graph_reg(char_df['frequency'], char_df['average_word_length'], 'Frequency by Character Average Word Length', 'Frequency', 'Average Word Length', ylog = True, alpha = 0.1, line_color = 'red')

Interestingly enough, character frequency is positively correlated with the average length of the lemmas that the character appears in. This may have to do with how hiragana and katakana have a combination of the longest and some of the most frequently occurring lemmas.
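One hedged way to probe this explanation is to summarize both variables separately for each script. The sketch below uses toy data, not `char_df`, and `summarize_by_script` is a hypothetical helper name.

```python
import pandas as pd

def summarize_by_script(df):
    """Median frequency and median average word length for each script."""
    return df.groupby('script')[['frequency', 'average_word_length']].median()

# Toy data for illustration only; these are not values from char_df.
toy_df = pd.DataFrame({
    'script': ['hiragana', 'hiragana', 'katakana', 'katakana', 'kanji', 'kanji'],
    'frequency': [1135, 603, 873, 401, 12, 4],
    'average_word_length': [3.4, 3.0, 4.6, 4.2, 2.1, 2.3],
})

summary = summarize_by_script(toy_df)
print(summary)
```

If the kana rows sit high on both columns while kanji sit low on both, the pooled positive correlation may mostly reflect differences between scripts rather than a trend within any one script.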

Character Frequency Summary¶

Although character frequency can help order the generally chaotic learning of kanji, it doesn't offer much insight in other ways. Hiragana and katakana should be learned first by all metrics.

Bivariate: JLPT Exam Level ¶

As before with lemmas, JLPT exam level is not as directly important as frequency, but determining the effectiveness of the JLPT exam divisions and ordering will be helpful in applying these goals to the overall language.

The relationships between a character's JLPT exam level and its stroke count, grade, word count, and average word length will be visualized to give insight into which factors may need further research.

# Display the relationship between JLPT exam level and stroke count with a box plot.
create_graph_box(char_df['jlpt_level'], char_df['stroke_count'], 'Character JLPT Exam Level by Stroke Count', 'JLPT Exam Level', 'Character Stroke Count', order = jlpt_exams)

The n5 JLPT exam level continues to have the most outliers of any exam level. Here we see that a character's stroke count is treated as a form of difficulty by these exams.

# Display the relationship between JLPT exam level and grade with a box plot.
create_graph_box(char_df['jlpt_level'], char_df['grade'], 'Character JLPT Exam Level by Grade', 'JLPT Exam Level', 'Character Grade', order = jlpt_exams)

Like stroke counts before, grade is also positively related with exam level. However, with grade, the JLPT n1 exam level has the most outliers, with some characters generally taught to non-native speakers much later on actually being taught to native Japanese speakers very early.

# Display the relationship between JLPT exam level and a character's word count with a box plot.
create_graph_box(char_df['jlpt_level'], char_df['word_count'], 'Character JLPT Exam Level by Word Count', 'JLPT Exam Level', 'Word Appearance Count', xlog = True, order = jlpt_exams)

Word count has a lot of outliers in all JLPT exam levels, but the trend is that the less common characters are seen as either more difficult or less important to learn early on.

# Display the relationship between JLPT exam level and a character's average word length with a box plot.
create_graph_box(char_df['jlpt_level'], char_df['average_word_length'], 'Character JLPT Exam Level by Average Word Appearance Length', 'JLPT Exam Level', 'Average Word Length', order = jlpt_exams)

Like word count, average word length has a lot of outliers in all JLPT exam levels, but the trend is that the less common characters are seen as either more difficult or less important to learn early on.

Character JLPT Exam Level Summary¶

All of these plots show that the JLPT ordering is an excellent metric to order character study by. Exam ease correlates with greater character simplicity and frequency.

Bivariate: Grade ¶

Character grade is not as directly important as frequency, but determining the trends of the grade divisions and ordering of native education programs and standards will be helpful in applying these goals to the overall language.

The relationships between a character's grade and its frequency, stroke count, word count, and average word length will be visualized to give insight into which factors may need further research.

# Display the relationship between grade and frequency with a box plot.
create_graph_box(char_df['grade'], char_df['frequency'], 'Character Grade Level by Frequency', 'Grade', 'Character Frequency', xlog = True, orient = 'h')

The more frequent the character, the earlier it tends to be learned by native speakers.

# Display the relationship between grade and stroke count with a box plot.
create_graph_box(char_df['grade'], char_df['stroke_count'], 'Character Grade Level by Stroke Count', 'Grade', 'Stroke Count', orient = 'h')

The more complex the character is in the form of stroke count, the later the character tends to be learned by native speakers.

# Display the relationship between grade and a character's word count with a box plot.
create_graph_box(char_df['grade'], char_df['word_count'], 'Character Grade Level by Word Appearance Count', 'Grade', 'Number of Appearances', xlog = True, orient = 'h')

The more frequent the character, the earlier it tends to be learned by native speakers.

# Display the relationship between grade and a character's average word length with a box plot.
create_graph_box(char_df['grade'], char_df['average_word_length'], 'Character Grade Level by Average Word Appearance Length', 'Grade', 'Average Word Length', orient = 'h')

The more frequent the character, the earlier it tends to be learned by native speakers.

Character Grades Summary¶

These graphs simply reinforce similar trends throughout the character dataframe sections, repeating the same trends as JLPT exam level.

Multivariate ¶

The various JLPT exam levels of characters and scripts used in Japanese can be encountered in many different ways. Comparing the six script combinations and the five JLPT exam levels (and the characters beyond the exams) gives thirty-six different aspects to examine and then to determine how to go about applying the existing JLPT ordering to a newer form.

# Create a dataframe of the character dataframe script counts grouped by JLPT exam level in order to plot it.
# The set of scripts is converted to a list so the columns argument receives an ordered, array-like collection.
scripts_char = list(set(char_df['script'].values))
script_jlpt_df_char = pd.DataFrame(columns = scripts_char)

for script in scripts_char:
    counts = char_df.loc[char_df['script'] == script]['jlpt_level'].value_counts()
    
    for exam in jlpt_exams:
        # Some combinations of script and JLPT exam level do not exist in char_df, so default to 0.
        script_jlpt_df_char.at[exam, script] = counts.get(exam, 0)

# Display the calculated counts.
script_jlpt_df_char
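The same table of counts can also be built without an explicit loop. The following is a sketch using `pd.crosstab` on toy data, offered as an alternative rather than as the notebook's method. The local `jlpt_exams` list is assumed to match the one defined earlier in the notebook, and `count_scripts_by_level` is a hypothetical helper name.

```python
import pandas as pd

# Local stand-in for the notebook's exam ordering (an assumption).
jlpt_exams = ['n5', 'n4', 'n3', 'n2', 'n1', 'n0']

def count_scripts_by_level(df, order):
    """Count characters for every (JLPT level, script) pair."""
    counts = pd.crosstab(df['jlpt_level'], df['script'])
    # Reindex so every exam level appears, in exam order, with 0 for missing levels.
    return counts.reindex(order, fill_value=0)

# Toy data for illustration only; these are not values from char_df.
toy_df = pd.DataFrame({
    'script': ['hiragana', 'katakana', 'kanji', 'kanji', 'kanji', 'latin'],
    'jlpt_level': ['n5', 'n5', 'n5', 'n4', 'n1', 'n0'],
})

script_jlpt_counts = count_scripts_by_level(toy_df, jlpt_exams)
print(script_jlpt_counts)
```

`pd.crosstab` already fills combinations that never occur with 0, so the `reindex` call is only needed to add exam levels that are entirely absent and to fix the row order.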

# Prepare the character JLPT exam level by script dataframe for plotting.
script_jlpt_df_char = script_jlpt_df_char.transpose()
script_jlpt_df_char.reset_index(inplace = True)
script_jlpt_df_char.rename(columns = {'index': 'script'}, inplace = True)
script_jlpt_df_char = pd.melt(script_jlpt_df_char, id_vars = "script", var_name = "exam_level", value_name = "count")
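To make the reshaping in the cell above concrete, here is a small sketch of the same transpose, rename, and melt steps applied to a toy two-by-two table. The values are invented for illustration and the variable names `wide` and `long` are hypothetical.

```python
import pandas as pd

# Toy wide table: exam levels as rows, scripts as columns (not real counts).
wide = pd.DataFrame(
    {'hiragana': [2, 0], 'kanji': [1, 3]},
    index=['n5', 'n4'],
)

# Scripts become rows, then the unnamed index column is renamed to 'script'.
long = wide.transpose().reset_index().rename(columns={'index': 'script'})

# Melt to one row per (script, exam level) pair, the shape a grouped bar plot expects.
long = pd.melt(long, id_vars='script', var_name='exam_level', value_name='count')
print(long)
```

Each of the four cells in the wide table becomes its own row in the long table, with the exam level moved from a column header into the `exam_level` column.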

# Display the relationship between the distributions of character JLPT exam level grouped by scripts with a grouped bar plot.
create_graph_cat('count', 'script', 'Distribution of JLPT Exam Level Characters by Script', 'Amount of Characters', 'Script', hue = 'exam_level', data = script_jlpt_df_char, kind = 'bar', rotate_degree = 90, ylog = True, order = ['hiragana', 'katakana', 'hiragana_and_katakana', 'kanji', 'kanji_and_hiragana', 'kanji_and_katakana'])

Kanji continues to be the most important focus in writing overall, with the focus needing to be stronger as the learner goes up through the JLPT exam levels until getting beyond all of them.

The various scripts of characters used in Japanese and the grade learned by native speakers can be encountered in many different combinations. Comparing the six script combinations and the ten effective grades gives sixty different aspects to examine and then to determine how to go about applying the existing grade ordering to a newer form.

# Create a dataframe of the character dataframe grade counts grouped by script in order to plot it.
# The sets are converted to lists so the columns argument receives an ordered, array-like collection.
scripts_char = list(set(char_df['script'].values))
grades_char = list(set(char_df['grade'].values))
script_grade_df_char = pd.DataFrame(columns = scripts_char)

for script in scripts_char:
    counts = char_df.loc[char_df['script'] == script]['grade'].value_counts()
    
    for grade in grades_char:
        # Some combinations of script and grade do not exist in char_df, so default to 0.
        script_grade_df_char.at[grade, script] = counts.get(grade, 0)

# Display the calculated counts.
script_grade_df_char

# Prepare the character grade by script dataframe for plotting.
script_grade_df_char = script_grade_df_char.transpose()
script_grade_df_char.reset_index(inplace = True)
script_grade_df_char.rename(columns = {'index': 'script'}, inplace = True)
script_grade_df_char = pd.melt(script_grade_df_char, id_vars = "script", var_name = "grade", value_name = "count")

# Display the relationship between the distributions of character grade grouped by scripts with a grouped bar plot.
create_graph_cat('count', 'script', 'Character Grade Levels by Script', 'Amount of Characters', 'Script', hue = 'grade', data = script_grade_df_char, kind = 'bar', rotate_degree = 90, ylog = True, order = ['hiragana', 'katakana', 'kanji',])

With character grade, unlike JLPT exam level, kanji focus remains more consistent on the character level.

The various JLPT exam levels of characters and the grades these characters are learned by native speakers can be encountered in many different ways. Comparing the five JLPT exam levels (and the characters beyond the exams) with the ten effective grades gives sixty different aspects to examine and then to determine how to go about applying the existing JLPT and grade ordering to a newer form.

# Create a dataframe of the character dataframe JLPT exam level counts grouped by grade in order to plot it.
# The non-missing grades are sorted into a list so the columns appear in grade order.
grades_char = sorted(set(char_df.loc[char_df['grade'].notna()]['grade'].values))
grades_jlpt_df_char = pd.DataFrame(columns = grades_char)

for grade in grades_char:
    counts = char_df.loc[char_df['grade'] == grade]['jlpt_level'].value_counts()
    
    for exam in jlpt_exams:
        # Some combinations of grade and JLPT exam level do not exist in char_df, so default to 0.
        grades_jlpt_df_char.at[exam, grade] = counts.get(exam, 0)

# Display the calculated counts.
grades_jlpt_df_char

# Prepare the character JLPT exam level by grade dataframe for plotting.
grades_jlpt_df_char = grades_jlpt_df_char.transpose()
grades_jlpt_df_char.reset_index(inplace = True)
grades_jlpt_df_char.rename(columns = {'index': 'grades'}, inplace = True)
grades_jlpt_df_char = pd.melt(grades_jlpt_df_char, id_vars = "grades", var_name = "jlpt_level", value_name = "count")

# Display the relationship between the distributions of character JLPT exam level grouped by grade with a grouped bar plot.
create_graph_cat('count', 'grades', 'JLPT Exam Level Characters by Grade', 'Amount of Characters', 'Grade', hue = 'jlpt_level', data = grades_jlpt_df_char, kind = 'bar', rotate_degree = 90, ylog = True)

This shows how the JLPT exam levels compare to the compulsory education of Japanese native speakers. It seems that the higher exam levels still have a lot of lower grade material for students to learn.
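The observation about lower grade material can be put into numbers from the JLPT-by-grade counts displayed in this notebook. The sketch below copies those counts by hand and computes, for each exam level, the share of characters whose grade falls between 1.0 and 6.0. Treating that range as "lower grade" is an assumption made here for illustration, and `lower_grade_share` is a hypothetical helper name.

```python
# Counts copied from the grade-by-JLPT table displayed in this notebook.
# Each list follows the grade columns 0.0, 1.0, ..., 6.0, 8.0, 9.0, 10.0.
grades = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 9.0, 10.0]
counts = {
    'n5': [167, 49, 31, 0, 2, 0, 0, 0, 0, 0],
    'n4': [0, 19, 73, 58, 13, 2, 2, 0, 0, 0],
    'n3': [0, 4, 33, 89, 85, 67, 42, 50, 0, 0],
    'n2': [0, 8, 18, 40, 68, 66, 54, 114, 0, 0],
    'n1': [0, 0, 4, 12, 34, 50, 82, 710, 71, 0],
    'n0': [30, 0, 1, 1, 0, 0, 0, 117, 105, 10],
}

def lower_grade_share(row, grades, low=1.0, high=6.0):
    """Fraction of a row's characters whose grade lies in [low, high] (assumed range)."""
    total = sum(row)
    lower = sum(count for grade, count in zip(grades, row) if low <= grade <= high)
    return lower / total if total else 0.0

shares = {exam: lower_grade_share(row, grades) for exam, row in counts.items()}
for exam, share in shares.items():
    print(f'{exam}: {share:.1%}')
```

Rows with a large grade 0.0 count, such as n5, have a lower share under this definition because those characters fall outside the assumed 1.0 to 6.0 range.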

# Save all dataframes for future use and loading into nbconvert slides.
nonbasic_lemma_df.to_csv('final_lemma_df.csv', sep = '|')
char_df.to_csv('final_char_df.csv', sep = '|')
pos_df.to_csv('final_pos_df.csv', sep = '|')
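When these files are loaded again, the same `'|'` separator has to be passed to the reader. The following is a minimal round-trip sketch on a toy frame written to an in-memory buffer rather than to the real files. It assumes that a column such as `pos` holds Python lists, in which case the values come back as strings and need to be parsed.

```python
import ast
import io

import pandas as pd

# Toy frame for illustration only; the real files are written by the cell above.
toy_df = pd.DataFrame({
    'lemma': ['の', 'する'],
    'pos': [['Particle'], ['Suru verb - irregular']],
})

# Write with the same '|' separator, here to an in-memory buffer instead of a file.
buffer = io.StringIO()
toy_df.to_csv(buffer, sep='|')
buffer.seek(0)

# Read it back: the first column is the saved index, and list columns return as strings.
loaded = pd.read_csv(buffer, sep='|', index_col=0)

# Convert the string form of each list back into a real Python list.
loaded['pos'] = loaded['pos'].apply(ast.literal_eval)
print(loaded)
```

Passing `index_col=0` avoids an extra unnamed column, since `to_csv` writes the index as the first column by default.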

	rank	frequency	jap_pos
0	1	343804.25	名詞
1	2	208342.21	助詞
2	3	203199.30	記号
3	4	99121.80	動詞
4	5	68734.93	助動詞

	character	frequency	weighted_frequency	weighted_rank	rank
0	る	1135	76544.49	1	1
1	ー	873	10294.03	23	2
2	ン	673	7547.24	28	3
3	い	603	40478.36	3	4
4	ス	401	5061.23	40	5
...	...	...	...	...	...
2635	班	1	9.45	2007	2636
2636	辻	1	9.44	2008	2637
2637	掴	1	9.42	2009	2638
2638	挟	1	9.42	2010	2639
2639	廉	1	2.24	2640	2640

	rank	frequency	character	weighted_frequency	weighted_rank
0	1	1135	る	76544.49	1
1	2	873	ー	10294.03	23
2	3	673	ン	7547.24	28
3	4	603	い	40478.36	3
4	5	401	ス	5061.23	40
...	...	...	...	...	...
2635	2636	1	班	9.45	2007
2636	2637	1	辻	9.44	2008
2637	2638	1	掴	9.42	2009
2638	2639	1	挟	9.42	2010
2639	2640	1	廉	2.24	2640

	character	jlpt_level
0	一	5
1	七	5
2	万	5
3	三	5
4	上	5

	kanji	occurrence	stroke	grade	radical	meaning
0	㐬	9999	7	表外漢字 (hyōgai kanji)	亠	a cup with pendants, a pennant, wild, barren, ...
1	㐮	9999	13	表外漢字 (hyōgai kanji)	亠	to help, to assist, to achieve, to rise, to raise
2	㠯	9999	5	表外漢字 (hyōgai kanji)	己
3	㡀	9999	8	表外漢字 (hyōgai kanji)	巾	ragged clothing, ragged, old and wear out
4	䍃	9999	10	表外漢字 (hyōgai kanji)	缶	a vase, a pitcher, earthenware

	rank	frequency	lemma
0	1	41309.50	の
1	2	23509.54	に
2	3	22216.80	は
3	4	20431.93	て
4	5	20326.59	を

	rank	frequency	eng_pos	jap_pos	transliterated_pos
0	1	343804.25	noun	名詞	めいし
1	2	208342.21	particle	助詞	じょし
2	3	203199.30	symbol	記号	きごう
3	4	99121.80	verb	動詞	どうし
4	5	68734.93	auxiliary verb	助動詞	じょどうし
5	6	15003.37	adverb	副詞	ふくし
6	7	10040.91	adjective	形容詞	けいようし
7	8	7509.62	adnominal	連体詞	れんたいし
8	9	5684.89	conjunction	接続詞	せつぞくし
9	10	5227.79	prefix	接頭詞	せっとうし
10	11	1257.96	interjection	感動詞	かんどうし
11	12	64.82	filler	フィラー	ふぃらあ
12	13	14.98	other	その他	そのほか

	rank	eng_pos	frequency	frequency_percentage	jap_pos	transliterated_pos
0	1	noun	343804.25	0.355167	名詞	めいし
1	2	particle	208342.21	0.215228	助詞	じょし
2	3	symbol	203199.30	0.209915	記号	きごう
3	4	verb	99121.80	0.102398	動詞	どうし
4	5	auxiliary verb	68734.93	0.071007	助動詞	じょどうし
5	6	adverb	15003.37	0.015499	副詞	ふくし
6	7	adjective	10040.91	0.010373	形容詞	けいようし
7	8	adnominal	7509.62	0.007758	連体詞	れんたいし
8	9	conjunction	5684.89	0.005873	接続詞	せつぞくし
9	10	prefix	5227.79	0.005401	接頭詞	せっとうし
10	11	interjection	1257.96	0.001300	感動詞	かんどうし
11	12	filler	64.82	0.000067	フィラー	ふぃらあ
12	13	other	14.98	0.000015	その他	そのほか

	character	weighted_frequency
0	る	76544.49
1	の	50197.62
2	い	40478.36
3	す	38030.48
4	と	29886.08

	rank	frequency	character	weighted_frequency	weighted_rank	script	ord
0	1	1135	る	76544.49	1	hiragana	12427
1	2	873	ー	10294.03	23	katakana	12540
2	3	673	ン	7547.24	28	katakana	12531
3	4	603	い	40478.36	3	hiragana	12356
4	5	401	ス	5061.23	40	katakana	12473
...	...	...	...	...	...	...	...
2635	2636	1	班	9.45	2007	kanji	29677
2636	2637	1	辻	9.44	2008	kanji	36795
2637	2638	1	掴	9.42	2009	kanji	25524
2638	2639	1	挟	9.42	2010	kanji	25375
2639	2640	1	廉	2.24	2640	kanji	24265

	kanji	occurrence	stroke	grade	radical	onyomi	kunyomi	nanori	meaning
0	丶	9999	1	表外漢字 (hyōgai kanji)	丶	チュ			dot, tick or dot radical (no. 3)
1	丿	9999	1	表外漢字 (hyōgai kanji)	丿	ヘツ	えい, よう	の	katakana no radical (no. 4)
2	乁	9999	1	表外漢字 (hyōgai kanji)	丶	イ	なが.れる
3	乙	1841	1	常用漢字 (jōyō kanji)	乙(⺄,乚)	オツ, イツ	おと-, きのと		the latter, duplicate, strange, witty, fishhoo...
4	乚	9999	1	表外漢字 (hyōgai kanji)	乙(⺄,乚)	イン, オン	かく.す, かく.れる, かか.す, よ.る		hidden, mysterious, secret, to conceal, small,...
...	...	...	...	...	...	...	...	...	...
2841	鱒	2482	23	人名用漢字 (jinmeiyō kanji)	魚	ソン, セン, ザン	ます		salmon trout
2842	鑑	1391	23	常用漢字 (jōyō kanji)	金(釒)	カン	かんが.みる, かがみ	あき, あきら	specimen, take warning from, learn from
2843	鱗	2494	24	人名用漢字 (jinmeiyō kanji)	魚	リン	うろこ, こけ, こけら		scales (fish)
2844	鷺	2172	24	人名用漢字 (jinmeiyō kanji)	鳥	ロ	さぎ		heron
2845	鷹	1676	24	人名用漢字 (jinmeiyō kanji)	鳥	ヨウ, オウ	たか		hawk

	rank	frequency	lemma	script	jlpt_level
5696	5697	9.61	朝刊	kanji	n2
6639	6640	7.70	取り除く	kanji_and_hiragana	n3
1393	1394	53.83	集める	kanji_and_hiragana	n4
2300	2301	30.23	辰	kanji	n1
11634	11635	3.33	私見	kanji	n4
2804	2805	24.13	寄せる	kanji_and_hiragana	n3
14520	14521	2.36	二日酔い	kanji_and_hiragana	n1
5049	5050	11.30	部会	kanji	n3
4992	4993	11.48	かよう	hiragana	n5
5140	5141	11.02	剛	kanji	n1
1122	1123	67.15	地球	kanji	n3
4243	4244	14.28	誇る	kanji_and_hiragana	n1
11337	11338	3.47	メトロ	katakana	n5
12990	12991	2.81	贈与	kanji	n2
8438	8439	5.39	ため息	kanji_and_hiragana	n3
8189	8190	5.64	きみ	hiragana	n5
2572	2573	27.05	強調	kanji	n3
14451	14452	2.38	前項	kanji	n1
6069	6070	8.77	営利	kanji	n2
9529	9530	4.48	ようこそ	hiragana	n5

	rank	frequency	lemma	script	jlpt_level	pos
0	1	41309.50	の	hiragana	n5	['Particle']
1	2	23509.54	に	hiragana	n5	['Numeric']
2	3	22216.80	は	hiragana	n5	['Particle']
3	4	20431.93	て	hiragana	n5	['Noun']
4	5	20326.59	を	hiragana	n5	['Particle']
...	...	...	...	...	...	...
14995	14996	2.24	夕べ	kanji_and_hiragana	n4	['Adverbial noun', 'Temporal noun']
14996	14997	2.24	売場	kanji	n4	['Noun']
14997	14998	2.24	たたき台	kanji_and_hiragana	n4	['Noun']
14998	14999	2.24	かしこ	hiragana	n5	['Expression']
14999	15000	2.24	バックグラウンド	katakana	n5	['Noun']

	rank	frequency	lemma	script	jlpt_level	pos
1531	1532	48.05	小泉	kanji	n2	['Wikipedia definition']
2154	2155	32.71	田中	kanji	n4	['Wikipedia definition']
2213	2214	31.58	佐藤	kanji	n1	['Wikipedia definition']
2697	2698	25.43	ジョン	katakana	n5	['Wikipedia definition']
2746	2747	24.81	村上	kanji	n2	['Wikipedia definition']

	rank	frequency	lemma	script	jlpt_level	pos	adjective	adverb	auxiliary verb	conjunction	noun	other	particle	prefix	symbol	verb	average_character_frequency	total_stroke_count	average_character_stroke_count
8	9	16841.17	する	hiragana	n5	['Suru verb - irregular']	0	0	0	0	0	0	0	0	0	1	736.500000	3.0	1.50
10	11	9604.49	ます	hiragana	n5	['Godan verb with su ending', 'intransitive ve...	0	0	0	0	0	0	0	0	0	1	281.000000	5.0	2.50
12	13	8189.00	ない	hiragana	n5	['I-adjective']	1	0	0	0	0	0	0	0	0	0	393.500000	6.0	3.00
13	14	8140.22	いる	hiragana	n5	['Ichidan verb', 'intransitive verb']	0	0	0	0	0	0	0	0	0	1	869.000000	3.0	1.50
15	16	6766.19	ある	hiragana	n5	['Godan verb with ru ending (irregular verb)',...	0	0	0	0	0	0	0	0	0	1	628.500000	4.0	2.00
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
14995	14996	2.24	夕べ	kanji_and_hiragana	n4	['Adverbial noun', 'Temporal noun']	0	0	0	0	1	0	0	0	0	0	19.000000	4.0	2.00
14996	14997	2.24	売場	kanji	n4	['Noun']	0	0	0	0	1	0	0	0	0	0	34.000000	19.0	9.50
14997	14998	2.24	たたき台	kanji_and_hiragana	n4	['Noun']	0	0	0	0	1	0	0	0	0	0	143.000000	16.0	4.00
14998	14999	2.24	かしこ	hiragana	n5	['Expression']	0	0	0	0	0	1	0	0	0	0	253.333333	6.0	2.00
14999	15000	2.24	バックグラウンド	katakana	n5	['Noun']	0	0	0	0	1	0	0	0	0	0	247.625000	22.0	2.75

	total_lemmas	total_lemma_proportion	average_character_frequency	total_stroke_count	average_character_stroke_count
adjective	NaN	NaN	NaN	NaN	NaN
adverb	NaN	NaN	NaN	NaN	NaN
auxiliary verb	NaN	NaN	NaN	NaN	NaN
conjunction	NaN	NaN	NaN	NaN	NaN
noun	NaN	NaN	NaN	NaN	NaN
other	NaN	NaN	NaN	NaN	NaN
particle	NaN	NaN	NaN	NaN	NaN
prefix	NaN	NaN	NaN	NaN	NaN
symbol	NaN	NaN	NaN	NaN	NaN
verb	NaN	NaN	NaN	NaN	NaN

	total_lemmas	total_lemma_proportion	average_character_frequency	total_stroke_count	average_character_stroke_count
adjective	2636	0.140759	92.403	15.9537	7.361
adverb	564	0.0301169	152.161	10.9255	3.85753
auxiliary verb	24	0.00128157	330.61	7.875	2.91389
conjunction	85	0.0045389	163.507	8.16471	2.77202
noun	10666	0.569552	90.2465	15.0653	7.25892
other	337	0.0179954	155.879	10.2493	4.13076
particle	41	0.00218935	166.059	5.73171	2.7378
prefix	85	0.0045389	78.552	7.57647	6.3402
symbol	39	0.00208255	57.7949	9.53846	6.7265
verb	4250	0.226945	168.891	15.9456	6.94886

	hiragana_and_katakana	kanji	katakana	kanji_and_katakana	kanji_and_hiragana	hiragana
n5	24	307	2375	18	215	1652
n4	0	1004	0	8	410	0
n3	0	2386	0	9	920	0
n2	0	1546	0	3	300	0
n1	0	2566	0	4	476	0
n0	0	340	3	0	112	8

	pos	proportion	dataframe
0	adjective	0.140759	lemma_df
1	adverb	0.030117	lemma_df
2	auxiliary verb	0.001282	lemma_df
3	conjunction	0.004539	lemma_df
4	noun	0.569552	lemma_df
5	other	0.017995	lemma_df
6	particle	0.002189	lemma_df
7	prefix	0.004539	lemma_df
8	symbol	0.002083	lemma_df
9	verb	0.226945	lemma_df
10	noun	0.355167	pos_df
11	particle	0.215228	pos_df
12	symbol	0.209915	pos_df
13	verb	0.102398	pos_df
14	auxiliary verb	0.071007	pos_df
15	adverb	0.015499	pos_df
16	adjective	0.010373	pos_df
17	adnominal	0.007758	pos_df
18	conjunction	0.005873	pos_df
19	prefix	0.005401	pos_df
20	interjection	0.001300	pos_df
21	filler	0.000067	pos_df
22	other	0.000015	pos_df
23	adnominal	0.000000	lemma_df
24	filler	0.000000	lemma_df
25	interjection	0.000000	lemma_df

	latin	other	full-width_roman	kanji	katakana	half-width_katakana	hiragana	punctuation
n5	0	0	0	80	84	2	83	0
n4	0	0	0	167	0	0	0	0
n3	0	0	0	370	0	0	0	0
n2	0	0	0	368	0	0	0	0
n1	0	0	0	963	0	0	0	0
n0	80	19	74	331	0	0	0	19

	latin	other	full-width_roman	kanji	katakana	half-width_katakana	hiragana	punctuation
0.0	3	2	20	0	83	2	82	5
1.0	0	0	0	80	0	0	0	0
2.0	0	0	0	160	0	0	0	0
3.0	0	0	0	200	0	0	0	0
4.0	0	0	0	200	1	0	1	0
5.0	0	0	0	185	0	0	0	0
6.0	0	0	0	180	0	0	0	0
NaN	0	0	0	0	0	0	0	0
8.0	0	0	0	991	0	0	0	0
9.0	0	0	0	176	0	0	0	0
10.0	0	0	0	10	0	0	0	0

	0.0	1.0	2.0	3.0	4.0	5.0	6.0	8.0	9.0	10.0
n5	167	49	31	0	2	0	0	0	0	0
n4	0	19	73	58	13	2	2	0	0	0
n3	0	4	33	89	85	67	42	50	0	0
n2	0	8	18	40	68	66	54	114	0	0
n1	0	0	4	12	34	50	82	710	71	0
n0	30	0	1	1	0	0	0	117	105	10
