Generating Random City Names Based on Syllable Formation Rules

R
strings
functions
creative project
A step-by-step guide on how to generate random city names using syllable-based rules in R.
Author

Carlos Fernández

Published

July 15, 2024

Introduction

In this blog post, we’ll explore how to generate random city names by applying various syllable-based rules to the names of municipalities in the Province of Alicante, Spain. We will be using several functions to transform and manipulate strings in R.

Motivation

This post aims to create fictional names for locations, such as those found in fantasy novels, that resemble Spanish words but do not actually exist. This method can also be applied to generate names for characters or invent unique words for various creative projects. The main challenge is implementing rules for syllable formation, which are the building blocks of word generation, and then finding combinations of syllables present in real words to replicate something similar. By doing so, we can produce names that sound authentic yet are entirely new.

Load Libraries

First, we load the necessary libraries.

library(tidyverse)
library(rvest)
library(ggwordcloud)

sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Europe/Madrid
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggwordcloud_0.6.2 rvest_1.0.4       lubridate_1.9.3   forcats_1.0.0    
 [5] stringr_1.5.1     dplyr_1.1.4       purrr_1.0.2       readr_2.1.5      
 [9] tidyr_1.3.1       tibble_3.2.1      ggplot2_3.5.1     tidyverse_2.0.0  

loaded via a namespace (and not attached):
 [1] gtable_0.3.5      jsonlite_1.8.8    compiler_4.4.1    Rcpp_1.0.12      
 [5] tidyselect_1.2.1  xml2_1.3.6        png_0.1-8         scales_1.3.0     
 [9] yaml_2.3.9        fastmap_1.2.0     R6_2.5.1          generics_0.1.3   
[13] knitr_1.48        htmlwidgets_1.6.4 munsell_0.5.1     pillar_1.9.0     
[17] tzdb_0.4.0        rlang_1.1.4       utf8_1.2.4        stringi_1.8.4    
[21] xfun_0.45         timechange_0.3.0  cli_3.6.3         withr_3.0.0      
[25] magrittr_2.0.3    gridtext_0.1.5    digest_0.6.36     grid_4.4.1       
[29] rstudioapi_0.16.0 hms_1.1.3         lifecycle_1.0.4   vctrs_0.6.5      
[33] evaluate_0.24.0   glue_1.7.0        fansi_1.0.6       colorspace_2.1-0 
[37] httr_1.4.7        rmarkdown_2.27    tools_4.4.1       pkgconfig_2.0.3  
[41] htmltools_0.5.8.1

Import and Clean Data

We start by importing the data of municipalities in the Province of Alicante from Wikipedia and cleaning it.

# Import data of municipalities in Alacant Province -------------------------------------------------------------
html_alacant <- read_html("https://es.wikipedia.org/wiki/Anexo:Municipios_de_la_provincia_de_Alicante")

cities_alicante <- html_alacant |>
  html_element(".wikitable") |>
  html_table() |>
  select(name = `Nombre en castellano`)

head(cities_alicante)
# A tibble: 6 × 1
  name          
  <chr>         
1 Adsubia       
2 Agost         
3 Agres         
4 Aguas de Busot
5 Albatera      
6 Alcalalí      

Functions

Pre-process Words

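This function transforms the names into a format suitable for syllable extraction. It lowercases the names, moves any part that follows a comma to the front (the pattern "(.*), (.*)" is rewritten as "\\2 \\1"), replaces spaces with underscores, and removes leftover commas. It then substitutes the digraphs ch, rr, qu, and ll with single symbols, and c before a, o, or u with k, so that each of these sounds is treated as one consonant by the syllable rules.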

letter_to_symbol <- function(df, var) {
  df |>
    mutate(
      palabras = str_to_lower({{ var }}),
      palabras = str_replace(palabras, "(.*), (.*)", "\\2 \\1"),
      palabras = str_replace_all(palabras, " ", "_"),
      palabras = str_replace_all(palabras, ",", ""),
      palabras = str_replace_all(palabras, "ch", "ʧ"),
      palabras = str_replace_all(palabras, "rr", "ʀ"),
      palabras = str_replace_all(palabras, "qu", "q"),
      palabras = str_replace_all(palabras, "ll", "ʝ"),
      palabras = str_replace_all(palabras, "c([aou])", "k\\1")
    )
}
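
To see what this encoding produces, here is a minimal sketch that applies the same replacements to a plain character vector instead of a data frame column. The second input, "Campello, El", is a hypothetical "Name, Article" entry used only to illustrate the comma rule; it is not taken from the scraped table.

```r
library(stringr)

# Example inputs; the "Name, Article" form is an assumption for illustration
x <- c("Callosa de Segura", "Campello, El")

encoded <- x |>
  str_to_lower() |>
  str_replace("(.*), (.*)", "\\2 \\1") |>   # move the part after the comma to the front
  str_replace_all(" ", "_") |>              # spaces become word separators
  str_replace_all(c("ch" = "ʧ", "rr" = "ʀ", "qu" = "q", "ll" = "ʝ")) |>
  str_replace_all("c([aou])", "k\\1")       # hard c becomes k

encoded
# "kaʝosa_de_segura" "el_kampeʝo"
```

Passing a named vector to str_replace_all() applies the replacements one after another, which is equivalent to the chain of mutate() steps in letter_to_symbol.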

Apply Syllable-Based Rules

This function applies one regex rule to every word. Each capture group in the rule becomes a separate piece (one row per piece), so a matching word is split at the boundary that the rule encodes. Words that do not match the rule are kept unchanged.

apply_rule <- function(df, regex) {
  df |>
    mutate(
      # TRUE when the rule matches, i.e. the word can be split
      rule = str_detect(palabras, regex),
      # One column per capture group (all NA when the rule does not match)
      syllable = as_tibble(str_match(palabras, regex)[, -1], .name_repair = "minimal")
    ) |>
    unnest_wider(syllable, names_sep = "_") |>
    # One row per captured piece
    pivot_longer(
      cols = starts_with("syllable"),
      values_to = "silabas",
      names_to = "norma"
    ) |>
    mutate(
      # Keep non-matching words whole instead of dropping them
      silabas = ifelse(!rule & norma == "syllable_1", palabras, silabas)
    ) |>
    filter(!is.na(silabas) & silabas != "") |>
    select(
      name,
      palabras = silabas
    )
}
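
The sketch below shows, on a plain vector, the two stringr calls that apply_rule relies on. It uses the VCV pattern that this post later assigns to norma_1; the three test words are just convenient examples.

```r
library(stringr)

# Same pattern as the VCV rule (norma_1) defined later in the post
vcv <- regex("(.*[aeiouáéíóú])([^_aeiouáéíóú][aeiouáéíóú].*)")

words <- c("albatera", "agost", "sax")

matched <- str_detect(words, vcv)   # which words can be split by this rule
pieces <- str_match(words, vcv)     # column 1: full match, columns 2-3: capture groups

matched
# TRUE TRUE FALSE
pieces[, 2]
# "albate" "a" NA
pieces[, 3]
# "ra" "gost" NA
```

Because the leading .* is greedy, each pass only splits off the last matching boundary ("albate" + "ra"), which is why the rule has to be applied repeatedly. The NA row for "sax" is the case that the ifelse() fallback in apply_rule handles.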

Apply Rules in Loop

This function applies the given regex rule repeatedly until a pass no longer increases the number of rows, which means that no piece can be split any further by that rule.

loop_apply_rule <- function(df, regex) {
  # Re-apply the rule until a pass adds no rows (nothing left to split)
  repeat {
    previous_nrow <- nrow(df)
    df <- apply_rule(df, regex)
    if (nrow(df) == previous_nrow) break
  }
  df
}
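
The same "repeat until nothing changes" idea can be sketched on a plain character vector. This is an illustration only, not the data frame implementation above: split_once is a hypothetical helper, and the pattern is the VCV rule (norma_1) from the rules section.

```r
library(stringr)

vcv <- "(.*[aeiouáéíóú])([^_aeiouáéíóú][aeiouáéíóú].*)"

# Hypothetical helper: split every piece once; non-matching pieces are kept whole
split_once <- function(pieces, pattern) {
  unlist(lapply(pieces, function(p) {
    m <- str_match(p, pattern)
    if (is.na(m[1, 1])) p else unname(m[1, -1])
  }), use.names = FALSE)
}

pieces <- "albatera"
repeat {
  new_pieces <- split_once(pieces, vcv)
  if (length(new_pieces) == length(pieces)) break  # fixed point reached
  pieces <- new_pieces
}

pieces
# "alba" "te" "ra"
```

The piece "alba" is left intact by this rule; in the full pipeline it is one of the consonant-cluster rules that later separates it into "al" and "ba".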

Revert Symbols to Original Letters

This function reverses the symbol transformations applied by letter_to_symbol, converting symbols back to their original letter combinations.

symbol_to_letter <- function(df, var) {
  df |>
    mutate(
      {{ var }} := str_replace_all({{ var }}, "_", " "),
      {{ var }} := str_replace_all({{ var }}, "ʧ", "ch"),
      {{ var }} := str_replace_all({{ var }}, "ʝ", "ll"),
      {{ var }} := str_replace_all({{ var }}, "ʀ", "rr"),
      {{ var }} := str_replace_all({{ var }}, "q", "qu"),
      {{ var }} := str_replace_all({{ var }}, "k([aou])", "c\\1")
    )
}
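
A quick way to check that the two transformations are inverses is a round trip on a few names. The sketch below re-implements both directions on plain vectors; encode and decode are hypothetical helper names, and "Kansas" is a made-up non-Spanish input chosen to expose a limitation.

```r
library(stringr)

# Hypothetical vector versions of letter_to_symbol and symbol_to_letter
encode <- function(x) {
  x |>
    str_to_lower() |>
    str_replace_all(" ", "_") |>
    str_replace_all(c("ch" = "ʧ", "rr" = "ʀ", "qu" = "q", "ll" = "ʝ")) |>
    str_replace_all("c([aou])", "k\\1")
}

decode <- function(x) {
  x |>
    str_replace_all(c("_" = " ", "ʧ" = "ch", "ʝ" = "ll", "ʀ" = "rr", "q" = "qu")) |>
    str_replace_all("k([aou])", "c\\1")
}

city_names <- c("Elche", "Torrevieja", "Villajoyosa", "Callosa de Segura")
round_trip <- decode(encode(city_names))

identical(round_trip, str_to_lower(city_names))
# TRUE

decode(encode("Kansas"))
# "cansas"
```

The last call shows that the mapping is only reversible when the input has no native k before a, o, or u (and no q without a following u). That is rare in Spanish place names, but worth keeping in mind for other datasets.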

Spanish Syllable Rules

Here, we define the regex rules for syllable separation. They are based on the Spanish rules for dividing words into syllables (inspired by this document).

These rules depend entirely on the language we are trying to imitate.

norma_0a <- regex("(^.+)(_[dy].*_)(.+$)") # split off connecting words that start with d or y, such as "de", "de los", "y"
norma_0b <- regex("(^[^_]+_)(.+)") # split any remaining multi-word name after its first word
norma_1 <- regex("(.*[aeiouáéíóú])([^_aeiouáéíóú][aeiouáéíóú].*)") # VCV -> V-CV
norma_2a <- regex("(.*)([pkbgf][rl].*)") # C1 C2 -> -C1C2 (obstruent + liquid stay together)
norma_2b <- regex("(.*)([dt][r].*)") # C1 C2 -> -C1C2 (d or t + r stay together)
norma_2c <- regex("(.*[^_aeiouáéíóú])([^_aeiouáéíóúrl][^_].*)") # C1 C2 -> C1-C2 (C2 is not a liquid)
norma_2d <- regex("(.*[^_aeiouáéíóúpkbgfdt])([rl].+)") # C1 C2 -> C1-C2 (C1 is not p, k, b, g, f, d, t; C2 is a liquid)
norma_2e <- regex("(.*[^_aeiouáéíóúpkbgf])([l].+)") # C1 C2 -> C1-C2 (remaining case: d or t + l)
norma_5 <- regex("(.*[aeo])([aeo].*)") # V1 V2 -> V1-V2 (two non-high vowels)
norma_6a <- regex("(.*[íú])([aeiou].*)") # V1 V2 -> V1-V2 (hiatus: accented high vowel first)
norma_6b <- regex("(.*[aeiou])([íú].*)") # V1 V2 -> V1-V2 (hiatus: accented high vowel second)
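
Each rule can be tried in isolation with str_match() to see where it places the boundary. The sketch below copies three of the patterns above and applies them to sample words; "aldea" and "garcía" are ordinary Spanish words picked for illustration, not names from the dataset.

```r
library(stringr)

# Copies of three rules from the list above
norma_2a <- regex("(.*)([pkbgf][rl].*)")
norma_5 <- regex("(.*[aeo])([aeo].*)")
norma_6a <- regex("(.*[íú])([aeiou].*)")

# Return only the capture groups, i.e. the pieces on each side of the boundary
split_with <- function(word, rule) unname(str_match(word, rule)[1, -1])

cluster <- split_with("agres", norma_2a)   # obstruent + liquid stays together
strong <- split_with("aldea", norma_5)     # two non-high vowels are separated
hiatus <- split_with("garcía", norma_6a)   # accented í breaks the diphthong

cluster
# "a" "gres"
strong
# "alde" "a"
hiatus
# "garcí" "a"
```

Here split_with is a hypothetical convenience wrapper. Trying rules one at a time like this is a practical way to debug a pattern before adding it to the pipeline.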

Separate Words into Syllables

This function runs the whole pipeline: it encodes the names, applies every syllable separation rule in order, and converts the symbols back to letters. It also labels each syllable’s position within its name as inicio (first), medio (middle), or final (last).

names_to_syllables <- function(df) {
  df |>
    letter_to_symbol(name) |>
    loop_apply_rule(norma_0a) |>
    loop_apply_rule(norma_0b) |>
    loop_apply_rule(norma_1) |>
    loop_apply_rule(norma_2a) |>
    loop_apply_rule(norma_2b) |>
    loop_apply_rule(norma_2c) |>
    loop_apply_rule(norma_2d) |>
    loop_apply_rule(norma_2e) |>
    loop_apply_rule(norma_5) |>
    loop_apply_rule(norma_6a) |>
    loop_apply_rule(norma_6b) |>
    symbol_to_letter(palabras) |>
    group_by(name) |>
    mutate(
      posicion = case_when(
        row_number() == 1 ~ "inicio",
        row_number() == n() ~ "final",
        TRUE ~ "medio"
      )
    ) |>
    ungroup() |>
    rename(silaba = palabras)
}
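
The position labelling at the end of the pipeline can be tried on its own. The sketch below uses a small hand-written table of syllables (toy data, not the pipeline output) and the same case_when() logic.

```r
library(dplyr)
library(tibble)

# Toy input: one row per syllable, in order, grouped by name
toy <- tibble(
  name = c("Agost", "Agost", "Adsubia", "Adsubia", "Adsubia", "Sax"),
  silaba = c("a", "gost", "ad", "su", "bia", "sax")
)

labelled <- toy |>
  group_by(name) |>
  mutate(
    posicion = case_when(
      row_number() == 1 ~ "inicio",   # checked first, so it wins for one-syllable names
      row_number() == n() ~ "final",
      TRUE ~ "medio"
    )
  ) |>
  ungroup()

labelled$posicion
# "inicio" "final" "inicio" "medio" "final" "inicio"
```

Note that a one-syllable name is labelled inicio rather than final, because case_when() returns the first condition that matches.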

Create Randomly-Generated City Name

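This function generates random city names. For each name, it draws a syllable count from the observed distribution of syllables per name, then samples one syllable per slot from those seen in the same position (inicio, medio, or final), weighted by frequency. The optional beginning argument prepends a fixed prefix to every name.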

create_random_name <- function(df_syllables_per_word, df_syllables_freq, length = 1, beginning = "") {
  new_names_vector <- character(length)
  # seq_len() avoids the 1:0 trap when length is 0
  for (x in seq_len(length)) {
    # Draw the syllable count from the observed syllables-per-name distribution
    n_syllables <- df_syllables_per_word |>
      slice_sample(n = 1) |>
      pull(n)
    new_name <- character(n_syllables)
    for (i in seq_len(n_syllables)) {
      place <- case_when(
        i == 1 ~ "inicio",
        i == n_syllables ~ "final",
        TRUE ~ "medio"
      )
      # Sample one syllable for this position, weighted by its frequency
      new_name[i] <- df_syllables_freq |>
        filter(posicion == place) |>
        slice_sample(n = 1, weight_by = n) |>
        pull(silaba)
    }
    new_name <- str_c(beginning, str_flatten(new_name)) |> str_to_title()
    new_names_vector[x] <- new_name
  }
  new_names_vector
}
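
The core of the generator is the weighted draw with slice_sample(). Here is a minimal sketch of that step on a made-up frequency table; the syllables and counts are invented for illustration, and draw_syllable is a hypothetical helper.

```r
library(dplyr)
library(tibble)
library(stringr)

# Invented frequency table with the same columns as the real one
toy_freq <- tibble(
  posicion = c("inicio", "inicio", "final", "final"),
  silaba = c("al", "be", "ra", "net"),
  n = c(3L, 1L, 2L, 2L)
)

# Hypothetical helper: one syllable for a given position, weighted by frequency
draw_syllable <- function(freq, place) {
  freq |>
    filter(posicion == place) |>
    slice_sample(n = 1, weight_by = n) |>
    pull(silaba)
}

set.seed(42)
two_syllable_name <- str_c(
  draw_syllable(toy_freq, "inicio"),
  draw_syllable(toy_freq, "final")
) |> str_to_title()

two_syllable_name
```

With these weights, "al" should be drawn as the first syllable about three times as often as "be", while both endings are equally likely.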

Putting All Together

Combine all steps into a single function to generate random city names.

create_city_names <- function(df, length = 1, beginning = "") {
  syllables <- names_to_syllables(df)
  syllables_freq <- syllables |> count(posicion, silaba)
  syllables_per_word <- syllables |> count(name)
  create_random_name(syllables_per_word, syllables_freq, length = length, beginning = beginning)
}

Applying to Alicante Cities Data

Transform the Alicante municipalities’ names into syllables.

Names to Syllables

silabas <- cities_alicante |> names_to_syllables()
head(silabas)
# A tibble: 6 × 3
  name    silaba posicion
  <chr>   <chr>  <chr>   
1 Adsubia ad     inicio  
2 Adsubia su     medio   
3 Adsubia bia    final   
4 Agost   a      inicio  
5 Agost   gost   final   
6 Agres   a      inicio  

Syllable Frequency

Calculate the frequency of each syllable and its position.

silabas_freq <- silabas |> count(posicion, silaba)
head(silabas_freq)
# A tibble: 6 × 3
  posicion silaba     n
  <chr>    <chr>  <int>
1 final    a          3
2 final    ba         2
3 final    beig       1
4 final    ber        1
5 final    bi         2
6 final    bia        1

Plot the syllables that appear at least twice as a word cloud, sized by frequency.

silabas_freq |>
  filter(n >= 2) |>
  ggplot(aes(label = silaba, size = n)) +
  scale_size_area(max_size = 20) +
  geom_text_wordcloud() +
  theme_minimal()

Syllables per Word

Count the number of syllables per name and visualize the distribution.

silabas_por_palabra <- silabas |> count(name)

ggplot(silabas_por_palabra, aes(x = n)) +
  geom_bar() +
  theme_classic() +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_continuous(breaks = 1:8) +
  labs(
    x = "Number of syllables per City Name",
    y = "Count"
  )

Generate Random Names

Generate a set of random city names based on the Alicante data.

set.seed(1234)
create_city_names(cities_alicante, length = 20)
 [1] "Semancent"            "Pi De Ja"             "Ojória"              
 [4] "Cabalisot"            "Ate"                  "Alniecogost"         
 [7] "Mondacañedes"         "Amataviedro"          "San  De Las Na"      
[10] "Parchenes"            "Banar"                "Pora"                
[13] "Befajuanjachell"      "Danipi"               "Beniarra De Nas"     
[16] "Comarnigo"            "Biarrate"             "Algra"               
[19] "Benichell"            "No De Minedanillenes"

Challenges Left

Despite the progress made, several challenges remain. The generated names still need post-processing to fix issues such as double blank spaces, multiple accents in a single word, and overly long or difficult-to-pronounce words. A future improvement could be to use two-syllable combinations instead of single syllables as building blocks, which would produce more natural-sounding names at the expense of variety.
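
As a starting point, some of these issues could be handled with a simple filter. The sketch below is one possible approach, not part of the pipeline above: the 15-character limit is an arbitrary assumption, and "Ojóría" is a made-up name added to show the accent check (the other names come from the output above).

```r
library(stringr)

generated <- c("San  De Las Na", "Ojória", "No De Minedanillenes", "Algra", "Ojóría")
max_chars <- 15  # arbitrary threshold, chosen only for this example

# Collapse repeated spaces and trim the ends
cleaned <- str_squish(generated)

# Spanish words carry at most one written accent
n_accents <- str_count(cleaned, "[áéíóúÁÉÍÓÚ]")

keep <- n_accents <= 1 & str_length(cleaned) <= max_chars
final_names <- cleaned[keep]

final_names
# "San De Las Na" "Ojória" "Algra"
```

Filtering for pronounceability is harder and is not attempted here.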

Conclusion

In this post, we demonstrated how to generate random city names by applying syllable-based rules to the names of municipalities in Alicante, Spain. By following these steps, you can create your own set of random names for any dataset of city names.

References