NPC Name Generator

On NPC Names

An NPC, or a non-playing character in a computer or role-playing game is a character controlled by the game, who helps set the scene and progress the story. I’ll be addressing NPCs which are a part of sci-fi or fantasy games. In my opinion, these NPCs should have names that deviate enough from regular, run-of-the-mill, “Earthly” names, and impart an unfamiliar and exotic feel to the setting. However, they should not have names which are contrived and do not have a realistic mouthfeel, like “Tchu’quixthal”, or be too generic, like “The Wise Sage”.

Diablo 2 is a game that gets it right in my opinion. Some examples are: Gheed, Akara, Charsi, Jerhyn.

Now, the challenge here is to come up with names which are different, yet similar to “Earthly” names. My idea is to take a “seed” name, and generate variants of it to come up with NPC names. The inspiration for this derives from Isaac Asimov’s Caves of Steel, in which there is a character called Daneel, whose name is similar enough to Daniel, but yet exotic enough (for my tastes, at least!).

Generating Random Names

It’s easy to generate random names of a particular length. Let’s assume that it’s also possible to rate names based on their closeness to another name. This rating is called the score. Using these two facts, and something called a genetic algorithm, I can come up with generated NPC names, and here’s how:

Decide a “seed” name manually.
Generate a large number of random names with lengths close to the “seed” name.
Apply the genetic algorithm (explained below) to create a new list of names from the previous list.
Find out the score of all names in the list to the “seed” name.
If we have a sufficient number of names whose score is (say) above 0.9, print them, else repeat from 3.

Given a list, a genetic algorithm drives us towards another list, which has an overall score better than the current one, and it is the actual “magic” behind how the names are generated.

A genetic algorithm is based loosely on how evolution works: the survival of the fittest.

Each animal in a population has a fitness.
Two (or more) animals can mate to produce an offspring. Higher fitness means that there is a higher chance that the animal will survive and mate.
Each animal can also have a mutation while producing offspring. This mutation can be both good or bad in terms of how it affects the fitness of the offspring.

In our scenario, the correspondence is such:

population is the list of names at the current moment.
mating is the process of generating a new population of the same size based on the current one. The actual mating process is based on combination of names. For instance, “milind” and “luthra” can generate “mi-thra”, “mil-hra”, “mili-ra” and so on: the prefix of the generated name comes from the first name, and the suffix comes from the second name. If a name has a larger score, then it will contribute to more prefixes and suffixes in the new population.
mutation is the process of random changes in a name, so “milind” can mutate to “nilind”. This is necessary because we can sometimes reach a “plateau” where no amount of mating can produce a better score. The probability of mutation is very low.

If you are interested in genetic algorithms and their implementation, read the wikipedia page and then try doing this question.

Scoring Random Names

We’ve used score above as a measure of how close a name is to the “seed” name, but we’ve not discussed how to actually do this. A usual way of doing this is counting changes (number of additions, alterations and deletions needed to make the names same). I’ve done something similar, using alignments. I generate all the alignments of the two names and to score the best one.

What is an alignment?

m i l i n d
+ | | | | |
n i l i n d

The alignment above attempts to align “milind” to “nilind”. The | denotes a match, while the + denotes a mismatch. A mismatch roughly hints towards an alteration.

m i l i n d
| _ | | | |
m   l i n d

The alignment above attempts to align “milind” to “mlind”. The | denotes a match, while the _ denotes a gap. A gap roughly hints towards an addition or a deletion.

There can be many alignments for any two names. For instance, for “mil” and “lul”, all the three alignments given below are valid.

m i l
_ _ | _ _
    l u l
score: -3

m i l
+ + |
l u l
score: 2

m i l
_ + + _
  l u l
score: -1

Each of these alignments is assigned a score. For each match, an alignment is given 1 point. For each mismatch, an alignment is given 0.5 points, and for each gap, an alignment is given -1 points. We calculate the score for all possible alignments, and choose the one with the best score. In the case above, it would be the second one.

This method causes longer names to have better scores than smaller ones, because they simply have more matches. To remove the effect of name length on the score, we divide the best score by the length of the name.

Thus, the score we talked about is defined as:

$$ score = \frac{BestAlignmentScore(\text{name},\text{seed})}{\text{name}.length} $$

Actually - that is not one hundred percent true, I’ve simplified it a bit above.

Firstly, I don’t match names written down using the letters of the English alphabet. Instead, I use names written down using the International Phonetic Alphabet (IPA). In the English alphabet, the way we write something and the way we say it is quite different, and I aim to produce similar sounding names. The IPA provides an easy, comprehensive way to represent how a word should be spoken out loud. At the same time, the IPA uses a rather large set of symbols, like ‘ɳ’ and ‘θ’ and so on. I have used ASCII representations instead, as described in this very useful document.

Secondly, the mismatch score is not a constant. Consider the two mismatches - (a: as in arm, e: as in turn) and (a: as in arm, and l as in leg). The first pair consists of similar sounding vowels, while the second pair consists of one vowel and one consonant, which sound very different. Thus, the first pair gets a larger mismatch score. The entire scheme of mismatch scores is decided by this grid I came up with. The gap score is also smaller, around -0.15 by default.

Thirdly, the name and the “seed” might have different lengths. In particular, for a “seed” of length 6, the name might be 5-7 characters long. Thus, the final score definition actually looks somewhat different:

$$ score = \frac{BestAlignmentScore(\text{name},\text{seed})}{Mean(\text{name}.length, \text{seed}.length)} $$

The algorithm that finds the best alignment is mostly the same as the Needleman-Wunsch Algorithm, which is used to align protein sequences.

NPC-Name Generator in Action

This is an example run of the generator. Please feel free to try it out yourself at https://github.com/milindl/NPC-name

Seed name: d ei v i d (DAVID) Generated names:

d ei o: i d (DAOUID) (DAY - OO - EE - D)
d ei v i TH (DAVITH)
d ei v TH (DAVTH) (pretty much useless)
d ei v i e:(r) (DAVIER)
TH ei v i (THAYVI) (TH as in THIN)

Clearly, not all of them are useable, but they’re not useless either.

Future Work

Asymmetric Grid

Currently, the grid used to score mismatches is symmetric. Thus, a: turning to e: is as likely as e: turning to a:. However, by changing this, I can model the flow of time. For instance, I want to model that over time, both the ʒ (as in pleaSure) and z (as in zoo) sounds being replaced with z. In this case, I can make it more likely for ʒ sound to change into z than the other way around. This way, I can probably look at how a word changes over time.

Accurate Grid

Right now, the grid is based on me saying some words and comparing how similar they sound to me. Instead, I should base my grid off this chart, which could help me describe the closeness of two symbols based on phonetic properties.

Neighboring-Letter Based Alignment

Certain letters occur together more often than not, and my program does not take that into account while computing alignment scoring. Thus, the alignment scores should depend not only on the letter, but also its neighboring letters.