Loading and Exploring Japanese Kanji Data Using R

Introduction

In this blog post, I'll demonstrate how to use R to load, explore, and filter data from a dataset containing Japanese characters, known as "kanji". The datasets were obtained from an online kanji database. We'll focus on using the tidyverse family of packages to illustrate how to select and filter relevant information efficiently.

Setup and Loading

To begin, we need to load the necessary libraries and import the datasets:

```r
#| warning: false
# Loading necessary libraries
library(tidyverse)
library(here)
library(janitor)

# Loading datasets
data_kanji <- read.csv2(here("data/kanji", "Kanji_20240227_081842.csv")) %>%
  clean_names()

data_jukugo <- read.csv2(here("data/kanji", "Jukugo_20240227_081908.csv")) %>%
  clean_names()
```
Here's a breakdown of the code:

- `library(tidyverse)`: we load the tidyverse package, which includes `dplyr`, `ggplot2`, and other useful packages.
- `library(here)`: this package helps manage file paths conveniently.
- `library(janitor)`: useful for standardizing variable names and data cleaning.
- We use `read.csv2()` to import CSV files that use semicolons (`;`) as separators.
- `here("data/kanji", "Kanji_20240227_081842.csv")` uses the function `here()` to access the data file, which is saved inside the folders data > kanji.
- The characters `%>%` are called a "pipe" in the tidyverse. You can type one by pressing Ctrl + Shift + M (on Windows). Basically, it tells R that we want to apply some step to the preceding data. In this example, I tell R to apply the function `clean_names()` to the data I've just loaded with `read.csv2()`.
- `clean_names()` is a `janitor` function that renames all variables in a standard format to make them easier to manipulate. Specifically, `clean_names()` sets all names to lowercase, removes punctuation and symbols, and replaces spaces with underscores.
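To see what `clean_names()` does in practice, here's a minimal sketch using a made-up data frame (the messy column names are hypothetical, not from the kanji dataset):

```r
library(janitor)

# A toy data frame with messy column names
df <- data.frame("Kanji ID" = 1, "Stroke Count" = 2, check.names = FALSE)

names(clean_names(df))
# "kanji_id" "stroke_count"
```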
Now I have two separate datasets: one for kanji (single characters), and one for jukugo (compound words). Let’s take a look at them.
Exploring the data
Let’s examine the first few rows of each dataset:
```r
head(data_kanji)
```
```
    id kanji strokes grade
1   41    一       1     1
2  124    乙       1     7
3 2060    了       2     7
4 2074    力       2     1
5 1577    二       2     1
6 1070    人       2     1
```
```r
head(data_jukugo)
```
```
   id comp_word frequency          grammatical_feature pronunciation
1 173      一部     46289 possible to use as an adverb         itibu
2 234      一般     39274                 general noun         ippan
3 432      一時     25126 possible to use as an adverb         itizi
4 461      一番     24155 possible to use as an adverb        itiban
5 481      一緒     23453    light-verb -suru attached         issyo
6 529      一致     21388    light-verb -suru attached          itti
  english_translation position kanji kanji_id
1            one part        L    一       41
2             general        L    一       41
3         one o'clock        L    一       41
4                best        L    一       41
5            together        L    一       41
6         coincidence        L    一       41
```
We're using the base function `head()` to show the first rows, or observations, of our datasets.
We can see that `data_kanji` has four columns or variables:

- `id` shows a unique identification number.
- `kanji` stores the actual character.
- `strokes` represents the number of distinct lines, or strokes, that the character has.
- `grade` is the official categorization of kanji by educational year in Japan. Grade 1 includes the easiest or most common kanji, and the scale goes all the way up to grade 7.
On the other hand, `data_jukugo` contains nine variables:

- `id` is the identification number for jukugos.
- `comp_word` is the actual word.
- `frequency` measures how many times each jukugo appears in a selected corpus of Japanese text (extracted from Japanese newspapers).
- `grammatical_feature` gives us more context on how the word is used in grammatical terms.
- `pronunciation` tells us the pronunciation in "romaji", the Latin alphabet.
- `english_translation` stores the English translation.

The last three variables in `data_jukugo` describe the kanji that is part of the jukugo:

- `position` tells us if the kanji is used in the left position ("L") or the right position ("R").
- `kanji` shows the kanji used in the jukugo. The first rows all show jukugos composed with the kanji "一".
- `kanji_id` is the identification number of the kanji part. We can use this id to link `data_jukugo` with `data_kanji` if we want to.
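Linking the two tables via `kanji_id` would be a `left_join()`. Here's a hedged sketch on a couple of made-up rows standing in for the real datasets (the join key is the same: `kanji_id` against `id`):

```r
library(dplyr)

# Tiny stand-ins for data_kanji and data_jukugo
kanji  <- tibble(id = c(41, 744), kanji = c("一", "三"), strokes = c(1, 3))
jukugo <- tibble(comp_word = c("一部", "三一"), kanji_id = c(41, 744))

# Attach each jukugo's kanji information, matching kanji_id to id
jukugo %>% left_join(kanji, by = c("kanji_id" = "id"))
```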
Another way of looking into a dataset is to explore how each variable is encoded:
```r
glimpse(data_kanji)
```
```
Rows: 2,136
Columns: 4
$ id      <int> 41, 124, 2060, 2074, 1577, 1070, 1584, 829, 359, 1647, 1903, 1…
$ kanji   <chr> "一", "乙", "了", "力", "二", "人", "入", "七", "九", "八", "…
$ strokes <int> 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3,…
$ grade   <int> 1, 7, 7, 1, 1, 1, 1, 1, 1, 1, 7, 3, 1, 2, 1, 2, 6, 7, 6, 1, 2,…
```
```r
glimpse(data_jukugo)
```
```
Rows: 52,791
Columns: 9
$ id                  <int> 173, 234, 432, 461, 481, 529, 937, 1465, 1521, 156…
$ comp_word           <chr> "一部", "一般", "一時", "一番", "一緒", "一致", "…
$ frequency           <int> 46289, 39274, 25126, 24155, 23453, 21388, 12477, 7…
$ grammatical_feature <chr> "possible to use as an adverb", "general noun", "p…
$ pronunciation       <chr> "itibu", "ippan", "itizi", "itiban", "issyo", "itt…
$ english_translation <chr> "one part", "general", "one o'clock", "best", "tog…
$ position            <chr> "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", …
$ kanji               <chr> "一", "一", "一", "一", "一", "一", "一", "一", "…
$ kanji_id            <int> 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41…
```
The `glimpse()` function allows us to quickly glance at the data structure.

We can see that `data_kanji` has 2,136 rows, or observations, and 4 columns, or variables. We also see the first values of each of its four variables. More importantly, we can see which data type each variable stores. The `kanji` variable has type `<chr>`, which means "character", or text, while the rest of the variables have type `<int>`, which means "integer", a whole number. R automatically detects the data types when importing data with functions like `read.csv2()`.
Regarding `data_jukugo`, it has 52,791 rows and 9 columns, of which 3 have type `<int>` and 6 have type `<chr>`.
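If we only want the storage type of a single column, base R can tell us directly; a quick sketch on a toy data frame:

```r
# Inspect how R stored each column
x <- data.frame(kanji = "一", strokes = 1L)

class(x$kanji)    # "character" — shown as <chr> by glimpse()
class(x$strokes)  # "integer"   — shown as <int> by glimpse()
```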
Manipulating the data
Now that I'm familiar with this dataset, it's useful to lay out my analysis plan. In other words, what do I want to learn from this data? In this case, I want to be able to find words (jukugo) that only contain kanji from a selected list of kanji that I'm learning. So, for example, if I only know the kanji 一, 人, and 十, I want to know all the possible combinations of these three kanji.
For this exercise, I'm interested in separating each jukugo into two parts: the left kanji and the right kanji. The dataset already has half of this information, but sometimes it gives us the left kanji and sometimes the right kanji (more on this later). I want to systematically get both the left and right kanji in the same row, so I'll create new variables called `kanji_left` and `kanji_right`.
```r
data_jukugo <- data_jukugo %>%
  mutate(kanji_left = substr(comp_word, 1, 1),
         kanji_right = substr(comp_word, 2, 2))
```
```r
glimpse(data_jukugo)
```
```
Rows: 52,791
Columns: 11
$ id                  <int> 173, 234, 432, 461, 481, 529, 937, 1465, 1521, 156…
$ comp_word           <chr> "一部", "一般", "一時", "一番", "一緒", "一致", "…
$ frequency           <int> 46289, 39274, 25126, 24155, 23453, 21388, 12477, 7…
$ grammatical_feature <chr> "possible to use as an adverb", "general noun", "p…
$ pronunciation       <chr> "itibu", "ippan", "itizi", "itiban", "issyo", "itt…
$ english_translation <chr> "one part", "general", "one o'clock", "best", "tog…
$ position            <chr> "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", …
$ kanji               <chr> "一", "一", "一", "一", "一", "一", "一", "一", "…
$ kanji_id            <int> 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41…
$ kanji_left          <chr> "一", "一", "一", "一", "一", "一", "一", "一", "…
$ kanji_right         <chr> "部", "般", "時", "番", "緒", "致", "定", "連", "…
```
Let's explain the code:

- `mutate()` is the `dplyr` function used to create or change variables. Here, I create two variables, `kanji_left` and `kanji_right`.
- `substr()` is a base function that extracts a substring from a character variable. `substr(comp_word, 1, 1)` extracts only the first character, and `substr(comp_word, 2, 2)` gets the second character.
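To make `substr()` concrete, here's what it returns on one of the jukugo from the dataset (R counts characters, not bytes, so multibyte kanji are handled correctly):

```r
word <- "一部"

substr(word, 1, 1)  # "一" — the left kanji
substr(word, 2, 2)  # "部" — the right kanji
```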
Alright, now I need to define the list of kanji that I'm currently learning. For now I'll do this manually, but later I'll explain how to do it more dynamically.
```r
kanji_learning <- c("一", "二", "三", "王", "玉", "十", "五")
```
Lastly, I'll tell R to filter the jukugos so that they only include kanji that are on my learning list. I also want to sort the jukugos from most to least used.
```r
jukugo_learning <- data_jukugo %>%
  filter(kanji_left %in% kanji_learning, kanji_right %in% kanji_learning) %>%
  arrange(desc(frequency))
```
```r
head(jukugo_learning)
```
```
     id comp_word frequency          grammatical_feature pronunciation
1 17059      二三        32 possible to use as an adverb         nisan
2 17059      二三        32 possible to use as an adverb         nisan
3 20330      三一        12                 general noun        sanpin
4 20330      三一        12                 general noun        sanpin
5 23443      一一         4 possible to use as an adverb        itiiti
6 23443      一一         4 possible to use as an adverb        itiiti
   english_translation position kanji kanji_id kanji_left kanji_right
1         two or three        L    二     1577         二          三
2         two or three        R    三      744         二          三
3  low-ranking samurai        R    一       41         三          一
4  low-ranking samurai        L    三      744         三          一
5           one-by-one        L    一       41         一          一
6           one-by-one        R    一       41         一          一
```
The `filter()` function selects rows based on one or more conditions. I've passed two conditions: that `kanji_left` is included in the `kanji_learning` "list" (in R we'd call this a vector, not a list), and that `kanji_right` is also included in `kanji_learning`. "Is included in" is represented in R by the operator `%in%`.
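On its own, `%in%` returns a logical vector, one TRUE/FALSE per element of the left-hand side:

```r
kanji_learning <- c("一", "二", "三")

c("一", "部") %in% kanji_learning
# TRUE FALSE — "一" is on the list, "部" is not
```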
The `arrange()` function reorders the rows based on one or more variables. I've passed the argument `desc(frequency)` because I want the words sorted in descending order of frequency (from most to least frequent).
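A quick sketch of `arrange(desc())` on a toy tibble with the frequencies seen above:

```r
library(dplyr)

words <- tibble(comp_word = c("二三", "一一", "三一"), frequency = c(32, 4, 12))

words %>% arrange(desc(frequency))
# Rows come back ordered 32, 12, 4
```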
However, something odd has happened: now we have two copies of each jukugo. The dataset contains complete duplicates, the only difference being which kanji appears in the variables `position`, `kanji`, and `kanji_id`. For example, "nisan" (二三) appears twice: once with `position` L, `kanji` 二, and `kanji_id` 1577, and once with `position` R, `kanji` 三, and `kanji_id` 744. This is something I didn't notice the first time I explored the dataset.
I could have done things differently. Instead of splitting the jukugos manually, I could have performed a “self-join” of the duplicated rows. But one cool thing about data cleaning and analysis is that there are always different ways to reach the same goal. It’s an iterative process, and by trial and error I can learn a lot and find alternative methods of doing things.
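For the curious, that self-join alternative might look something like the following. This is only a sketch on toy rows, assuming every jukugo has exactly one "L" row and one "R" row: split the table by `position` and join the two halves back together on the jukugo `id`.

```r
library(dplyr)

# Toy duplicated rows, as in the dataset: one row per kanji position
jukugo <- tibble(
  id        = c(17059, 17059),
  comp_word = c("二三", "二三"),
  position  = c("L", "R"),
  kanji     = c("二", "三")
)

left  <- jukugo %>% filter(position == "L") %>% select(id, comp_word, kanji_left = kanji)
right <- jukugo %>% filter(position == "R") %>% select(id, kanji_right = kanji)

left %>% inner_join(right, by = "id")
# One row per jukugo, with kanji_left and kanji_right side by side
```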
Moving forward, since I'm only interested in keeping one record of each jukugo, I can drop these duplicates. Additionally, I'll keep only the variables I'm interested in.
```r
jukugo_learning <- jukugo_learning %>%
  select(id, comp_word, frequency, grammatical_feature, pronunciation,
         english_translation, kanji_left, kanji_right) %>%
  distinct()
```
```r
head(jukugo_learning)
```
```
     id comp_word frequency          grammatical_feature pronunciation
1 17059      二三        32 possible to use as an adverb         nisan
2 20330      三一        12                 general noun        sanpin
3 23443      一一         4 possible to use as an adverb        itiiti
4 25773      二王         2                 general noun          nioo
          english_translation kanji_left kanji_right
1                two or three         二          三
2         low-ranking samurai         三          一
3                  one-by-one         一          一
4 the two guardian Deva kings         二          王
```
I've used two new `dplyr` functions: `select()` keeps only some columns or variables, and `distinct()` keeps only non-duplicated rows.
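Note that `distinct()` compares whole rows, which is why dropping the columns that differed (`position`, `kanji`, `kanji_id`) first was necessary; a minimal sketch:

```r
library(dplyr)

# Two rows that are identical once the position columns are gone
dup <- tibble(comp_word = c("二三", "二三"), frequency = c(32, 32))

dup %>% distinct()
# A single row remains
```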
The final result contains four distinct jukugos: 二三, 三一, 一一, and 二王. All of them are very low-frequency, with the most common of them appearing only 32 times.
Next step: making it interactive
So far, I have written code that filters Japanese kanji words based on whichever kanji components I want. However, the whole process would be nicer if I had a way of selecting the data interactively, maybe by pressing some buttons. We can do just that using R Shiny applications. Find out how in this post!