New: The stats project is moving to a wiki.

VN Frequency List

This is a frequency list derived from around 50 VN scripts, totalling roughly 20 million words. Before we continue, some notes:

Japanese is not written with spaces, so identifying the words in a sentence is not trivial. Even humans disagree on exactly where word boundaries lie, especially for compound words and set phrases. To get words out of running text, we use morphological analysis.
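To make the ambiguity concrete, here's a toy illustration (this is not how the real analyzers work, and the dictionaries are made up): a naive greedy longest-match segmenter, whose output depends entirely on which words its dictionary happens to contain.

```python
# Toy illustration only: greedy longest-match segmentation against a
# tiny hand-made dictionary. Real analyzers like MeCab weigh many
# candidate splits instead of committing to the first match.

def greedy_segment(text, dictionary):
    """Split text by always taking the longest dictionary match."""
    words, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

# A classic ambiguity: 東京都 is "Tokyo Metropolis" (東京 + 都),
# but it also contains 京都 "Kyoto" (東 + 京都).
print(greedy_segment("東京都", {"東京", "都"}))  # ['東京', '都']
print(greedy_segment("東京都", {"京都"}))        # ['東', '京都']
```

The same three characters segment two different ways depending on the dictionary, which is exactly why segmentation can't just be looked up.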

Morphological analyzers like MeCab and Kuromoji work under the assumption that they're analysing the output of a hidden Markov model, and use the Viterbi algorithm to pick the most likely sequence of tokens. This comes with three problems.
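For readers unfamiliar with it, here's a minimal sketch of Viterbi decoding over a toy HMM: given emission and transition probabilities, find the single most likely sequence of hidden states behind an observed sequence. In a real analyzer the "states" are dictionary entries with part-of-speech connection costs; all the states and probabilities below are invented purely for illustration.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    # best[s] = log-probability of the best path ending in state s
    best = {s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
            for s in states}
    backpointers = []
    for obs in observations[1:]:
        prev_best, best, back = best, {}, {}
        for s in states:
            # Choose the best previous state to have transitioned from.
            p, prev = max((prev_best[q] + math.log(trans_p[q][s]), q)
                          for q in states)
            best[s] = p + math.log(emit_p[s][obs])
            back[s] = prev
        backpointers.append(back)
    # Trace the best path backwards from the best final state.
    state = max(best, key=best.get)
    path = [state]
    for back in reversed(backpointers):
        state = back[state]
        path.append(state)
    return path[::-1]

# Toy tagger: is each token a noun-ish word (N) or a particle (P)?
states = ("N", "P")
start_p = {"N": 0.8, "P": 0.2}
trans_p = {"N": {"N": 0.3, "P": 0.7}, "P": {"N": 0.8, "P": 0.2}}
emit_p = {"N": {"猫": 0.5, "が": 0.1, "寝る": 0.4},
          "P": {"猫": 0.1, "が": 0.8, "寝る": 0.1}}
print(viterbi(("猫", "が", "寝る"), states, start_p, trans_p, emit_p))
# → ['N', 'P', 'N']
```

The dynamic programming makes this linear in the sentence length rather than exponential in the number of candidate paths, which is why analyzers can afford to consider every possible split.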

One, Markov models merely approximate language, so there's some inherent error there.

Two, the analyzers don't output "this is this word from this dictionary"; we have to infer which word we're looking at based on what the analyzer says the lemma (the corresponding parent word) is, and some distinct words share the same lemma.

Three, the algorithm needs to know how likely particular words or sequences are, and this has to be "trained" on text tagged by humans. The dictionary files for MeCab and Kuromoji contain this "training" information, and it could have come from anywhere. The dictionary used in this analysis, UniDic, is trained on the Balanced Corpus of Contemporary Written Japanese, which is not biased towards fiction, the focus of this frequency list. As such, the analysis tool used to make this frequency list has some stopgaps to fix certain words, e.g. 私 almost always coming out of the analyzer with the reading わたくし, but there are no such stopgaps for words that are not extremely common. In fact, more than a couple of words have the wrong reading attached. The analysis tool also filters out ruby text markup completely; otherwise the morphological analyzer would get very confused.
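The two stopgaps mentioned above might look something like this sketch. The markup and the override table are assumptions for illustration: every VN engine has its own ruby syntax (Aozora-style 漢字《かんじ》 is assumed here), and the real tool's override list is maintained by hand.

```python
import re

# Assumed ruby notation: readings in 《…》 plus an optional ｜/| marker
# before the base text, as in ｜聖杯《せいはい》. Real scripts vary.
RUBY = re.compile(r"《[^》]*》|[|｜]")

def strip_ruby(text):
    """Remove ruby reading annotations so the analyzer never sees them."""
    return RUBY.sub("", text)

# Hypothetical hand-maintained overrides for readings the analyzer
# reliably gets wrong on extremely common words.
READING_OVERRIDES = {("私", "わたくし"): "わたし"}

def fix_reading(surface, reading):
    """Swap in a corrected reading when (surface, reading) is known bad."""
    return READING_OVERRIDES.get((surface, reading), reading)

print(strip_ruby("｜聖杯《せいはい》の力"))  # 聖杯の力
print(fix_reading("私", "わたくし"))          # わたし
```

Anything not in the override table passes through untouched, which is exactly the limitation described above: rare words with wrong readings stay wrong.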

In addition to the challenges posed by analysis, there are challenges posed by the small size of the corpus. Fate/stay night contains the word セイバー so many times that it lands on the top page of the frequency list if you don't filter it out. This is just one example: jargon and names from specific stories very strongly pollute a naive "concatenate all the scripts and analyze them as a single chunk of text" approach. Because of this, each script is analyzed independently, then each word's frequency is averaged across scripts, with outliers removed from the averaging process. This comes with a couple of caveats.
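The per-script averaging can be sketched as follows. The exact outlier rule here (drop the single highest and lowest per-script rate before averaging) is an assumption; the real tool's rule may differ.

```python
from collections import Counter

def per_million(tokens):
    """Word frequency per million tokens, for one script."""
    counts = Counter(tokens)
    scale = 1_000_000 / len(tokens)
    return {w: c * scale for w, c in counts.items()}

def averaged_frequencies(scripts):
    """Average each word's per-script rate, trimming the extremes.

    A word absent from a script counts as rate 0 there, so a name
    that dominates a single script barely moves its overall average.
    """
    rates = [per_million(tokens) for tokens in scripts]
    vocab = set().union(*rates)
    averaged = {}
    for word in vocab:
        values = sorted(r.get(word, 0.0) for r in rates)
        trimmed = values[1:-1] if len(values) > 2 else values
        averaged[word] = sum(trimmed) / len(trimmed)
    return averaged

# One script is flooded with a character name; the others never use it.
scripts = [["の"] * 8 + ["セイバー"] * 2, ["の"] * 10, ["の"] * 10]
freqs = averaged_frequencies(scripts)
print(freqs["セイバー"])  # 0.0 -- the outlier script is trimmed away
print(freqs["の"])        # 1000000.0
```

With only three toy scripts the trimming is brutal, but it shows the intended effect: story-specific vocabulary concentrated in one script drops out, while words common everywhere keep their rank.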

One, the relationship between frequency and ranking is almost Zipfian, meaning the most linear relationship is between the logarithms of the two variables. That implies the right way to average frequency data into a good ranking is a geometric or harmonic mean. But we're working with heavily quantized frequencies, which can even be 0 in individual scripts for uncommon words, and that makes those means basically unusable. Removing outliers helps, but we still have to use a simple arithmetic average instead of a geometric or harmonic mean. As a result, the overall distribution of frequencies is different from what it should be for a 20-million-word text: the relationship between ranking and frequency is a little less Zipfian after averaging than it naturally would be, and words that would show up once or twice per 20 million words are completely unrepresented, which cuts off an extremely large portion of the frequency list.

Two, small scripts can't be included, because they would dramatically alter the ranking of any words they don't use. I used an artificial cutoff: a script has to be at least as long in bytes as Leyline 1.
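That cutoff amounts to a one-line filter over the corpus. MIN_BYTES below is a placeholder; the actual threshold is whatever the Leyline 1 script measures in bytes.

```python
# Placeholder threshold -- the real cutoff is the byte length of the
# Leyline 1 script, not this number.
MIN_BYTES = 500_000

def scripts_to_analyze(scripts):
    """Keep only scripts long enough (in UTF-8 bytes) to average fairly."""
    return [s for s in scripts if len(s.encode("utf-8")) >= MIN_BYTES]
```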

This is only a summary of the potential quality problems this frequency list may have. Still, I think it's probably better than using frequency lists built from social science essays. Seriously, what the hell.

This frequency list is subject to arbitrary changes at any time as I find new ways to try to make up for the above problems.

Now that we've covered all the major warnings, here's the list.