An NPC, or a non-playing character in a computer or role-playing game is a character controlled by the game, who helps set the scene and progress the story. I’ll be addressing NPCs which are a part of sci-fi or fantasy games. In my opinion, these NPCs should have names that deviate enough from regular, run-of-the-mill, “Earthly” names, and impart an unfamiliar and exotic feel to the setting. However, they should not have names which are contrived and do not have a realistic mouthfeel, like “Tchu’quixthal”, or be too generic, like “The Wise Sage”.
Diablo 2 is a game that gets it right in my opinion. Some examples are: Gheed, Akara, Charsi, Jerhyn.
Now, the challenge here is to come up with names which are different, yet similar to “Earthly” names. My idea is to take a “seed” name, and generate variants of it to come up with NPC names. The inspiration for this derives from Isaac Asimov’s Caves of Steel, in which there is a character called Daneel, whose name is similar enough to Daniel, but yet exotic enough (for my tastes, at least!).
It’s easy to generate random names of a particular length. Let’s assume that it’s also possible to rate names based on their closeness to another name. This rating is called the score. Using these two facts, and something called a genetic algorithm, I can come up with generated NPC names, and here’s how:
Given a list, a genetic algorithm drives us towards another list, which has an overall score better than the current one, and it is the actual “magic” behind how the names are generated.
A genetic algorithm is based loosely on how evolution works: the survival of the fittest.
In our scenario, the correspondence is such:
If you are interested in genetic algorithms and their implementation, read the wikipedia page and then try doing this question.
We’ve used score above as a measure of how close a name is to the “seed” name, but we’ve not discussed how to actually do this. A usual way of doing this is counting changes (number of additions, alterations and deletions needed to make the names same). I’ve done something similar, using alignments. I generate all the alignments of the two names and to score the best one.
What is an alignment?
m i l i n d
+ | | | | |
n i l i n d
The alignment above attempts to align “milind” to “nilind”. The |
denotes a match, while the +
denotes a mismatch. A mismatch roughly hints towards an alteration.
m i l i n d
| _ | | | |
m l i n d
The alignment above attempts to align “milind” to “mlind”. The |
denotes a match, while the _
denotes a gap. A gap roughly hints towards an addition or a deletion.
There can be many alignments for any two names. For instance, for “mil” and “lul”, all the three alignments given below are valid.
m i l
_ _ | _ _
l u l
score: -3
m i l
+ + |
l u l
score: 2
m i l
_ + + _
l u l
score: -1
Each of these alignments is assigned a score. For each match, an alignment is given 1 point. For each mismatch, an alignment is given 0.5 points, and for each gap, an alignment is given -1 points. We calculate the score for all possible alignments, and choose the one with the best score. In the case above, it would be the second one.
This method causes longer names to have better scores than smaller ones, because they simply have more matches. To remove the effect of name length on the score, we divide the best score by the length of the name.
Thus, the score we talked about is defined as:
$$ score = \frac{BestAlignmentScore(\text{name},\text{seed})}{\text{name}.length} $$
Actually - that is not one hundred percent true, I’ve simplified it a bit above.
Firstly, I don’t match names written down using the letters of the English alphabet. Instead, I use names written down using the International Phonetic Alphabet (IPA). In the English alphabet, the way we write something and the way we say it is quite different, and I aim to produce similar sounding names. The IPA provides an easy, comprehensive way to represent how a word should be spoken out loud. At the same time, the IPA uses a rather large set of symbols, like ‘ɳ’ and ‘θ’ and so on. I have used ASCII representations instead, as described in this very useful document.
Secondly, the mismatch score is not a constant. Consider the two mismatches - (a: as in arm, e: as in turn) and (a: as in arm, and l as in leg). The first pair consists of similar sounding vowels, while the second pair consists of one vowel and one consonant, which sound very different. Thus, the first pair gets a larger mismatch score. The entire scheme of mismatch scores is decided by this grid I came up with. The gap score is also smaller, around -0.15 by default.
Thirdly, the name and the “seed” might have different lengths. In particular, for a “seed” of length 6, the name might be 5-7 characters long. Thus, the final score definition actually looks somewhat different:
$$ score = \frac{BestAlignmentScore(\text{name},\text{seed})}{Mean(\text{name}.length, \text{seed}.length)} $$
The algorithm that finds the best alignment is mostly the same as the Needleman-Wunsch Algorithm, which is used to align protein sequences.
This is an example run of the generator. Please feel free to try it out yourself at https://github.com/milindl/NPC-name
Seed name: d ei v i d (DAVID) Generated names:
Clearly, not all of them are useable, but they’re not useless either.
Currently, the grid used to score mismatches is symmetric. Thus, a: turning to e: is as likely as e: turning to a:. However, by changing this, I can model the flow of time. For instance, I want to model that over time, both the ʒ (as in pleaSure) and z (as in zoo) sounds being replaced with z. In this case, I can make it more likely for ʒ sound to change into z than the other way around. This way, I can probably look at how a word changes over time.
Right now, the grid is based on me saying some words and comparing how similar they sound to me. Instead, I should base my grid off this chart, which could help me describe the closeness of two symbols based on phonetic properties.
Certain letters occur together more often than not, and my program does not take that into account while computing alignment scoring. Thus, the alignment scores should depend not only on the letter, but also its neighboring letters.