New: The stats project is moving to a wiki.

HOW I RIP SCRIPTS AND PROCESS STATS FOR THEM

Before we start

You need Python 3, 64-bit Java, and a Bash prompt. If you don't know what those are, this guide isn't for you. For the record, I do all of this on Windows, not Linux.

Getting the files

Step 1: Buy the game. Otherwise borrow it from a friend. Legally.

Step 2: Identify the game engine. It might be listed on the tlwiki tools page or the tlwiki eroge script sizes page.

Sometimes a game has dedicated translation tools on the tlwiki tools page. If so, great: there's roughly an 80% chance the script format those tools produce is usable for our purposes. Sometimes, however, the format is crappy, and you have to treat the extraction tools as documentation of how the engine stores its data and write your own tools. Only programmers need apply.

If there are no tools for your game, first see whether the scripts can be extracted with arc_unpacker or GARbro. If they can't be, use Google. As a last resort, open the archives in a hex editor and see whether the format is trivial. If it is, you can ask a programmer to make an unpacker for you.
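For reference, a "trivial" format usually means a small header followed by a flat table of file entries. Here's a minimal sketch in Python of what an unpacker for such a format might look like; the magic string, field sizes, and layout below are made up for illustration and do not correspond to any real engine.

    # Hypothetical unpacker sketch for a made-up trivial archive layout:
    #   4-byte magic "ARC0", uint32 file count, then per file:
    #   32-byte null-padded name, uint32 offset, uint32 size (little-endian).
    # Real engines differ; adjust the layout to match what you see in the hex editor.
    import struct
    import sys
    from pathlib import Path

    def unpack(archive_path, out_dir):
        data = Path(archive_path).read_bytes()
        if data[:4] != b"ARC0":
            raise ValueError("not the format this sketch assumes")
        (count,) = struct.unpack_from("<I", data, 4)
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        pos = 8
        for _ in range(count):
            name_raw, offset, size = struct.unpack_from("<32sII", data, pos)
            pos += 40
            name = name_raw.rstrip(b"\x00").decode("ascii", errors="replace")
            (out / name).write_bytes(data[offset:offset + size])

    if __name__ == "__main__":
        unpack(sys.argv[1], sys.argv[2])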

I haven't kept track of all the script extraction tools I've made over the months. If you want one of them, ask for it on DJT.

Once you have the files

If you have something resembling a web novel with crappy markup, skip this section. If you have a binary file, you need to decompile the script. If you have something that reads like program code, you need a parser that pulls just the dialogue out of it. If there is no decompiler or parser for your game's scripts, you're going to have to write your own, ask a programmer, or give up.
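What the parser looks like depends entirely on the engine, but the idea is always the same: walk the decompiled text and keep only the dialogue. Here's a minimal sketch assuming a made-up engine where dialogue is emitted as message("…") commands; the command name and syntax are invented for illustration.

    # Sketch of a dialogue-only parser for a hypothetical decompiled script
    # where dialogue appears as: message("「こんにちは」")
    # Adjust the pattern to whatever your engine's decompiled output actually looks like.
    import re
    import sys

    DIALOGUE_RE = re.compile(r'message\("(.*?)"\)')

    def extract_dialogue(text):
        return [m.group(1) for m in DIALOGUE_RE.finditer(text)]

    if __name__ == "__main__":
        with open(sys.argv[1], encoding="utf-8") as f:
            for line in extract_dialogue(f.read()):
                print(line)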

Once you have something resembling a simple script

Once the script is plain text in a good format, you have to strip any residual markup. vnscripts excludes all speaker names, puts dialogue in plain corner brackets (i.e. 「」), and puts ruby text inside 《》. If ruby text is not removed or placed specifically inside 《》, statistical analysis will give crappy results. Use regexes for this. If the script is being decompiled or parsed anyway, do it during that step if possible; it's easier to get right there.
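As an example of the kind of regex work involved: suppose the engine marks ruby as ｜漢字《かんじ》 (a common convention, but check what yours actually does). Something like the sketch below either drops the reading or keeps it inside 《》; the exact pattern is an assumption, not part of the real tools.

    # Sketch: normalize ruby markup, assuming the ｜base《reading》 convention.
    # strip_ruby() drops readings; keep_ruby() keeps them inside 《》 as vnscripts expects.
    import re

    RUBY_RE = re.compile(r'｜?([^｜《》]+)《([^《》]+)》')

    def strip_ruby(line):
        return RUBY_RE.sub(r'\1', line)

    def keep_ruby(line):
        return RUBY_RE.sub(r'\1《\2》', line)

    print(strip_ruby("彼は｜運命《さだめ》に逆らった"))  # 彼は運命に逆らった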

Line wraps must be undone, even if they're hardcoded into the original script. A line wrap is when the script starts a new line in the middle of a sentence. If you don't join wrapped lines back together, statistical analysis will give crappy results. Going to a new line after the end of a sentence (a "line break"), however, is correct and desired if the original VN is formatted that way.
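One crude but workable heuristic, assuming one display line per text line: join any line that doesn't end in sentence-final punctuation or a closing quote onto the line after it. This is only a sketch, not the real scripts' logic, and the character set and filename are placeholders to tune per game.

    # Sketch: undo hard line wraps by joining lines that don't end a sentence.
    # SENTENCE_END is a guess at typical sentence-final characters; tune it per game.
    SENTENCE_END = tuple("。！？!?」』…")

    def unwrap(lines):
        out = []
        for line in lines:
            line = line.rstrip("\n")
            if not line:
                continue
            if out and not out[-1].endswith(SENTENCE_END):
                out[-1] += line          # continuation of a wrapped sentence
            else:
                out.append(line)         # a real line break
        return out

    # "script.txt" is a placeholder path
    with open("script.txt", encoding="utf-8") as f:
        for line in unwrap(f.readlines()):
            print(line)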

Finally, all scripts must be converted to UTF-8 without a BOM if they're not already in that format.
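In Python that conversion is a couple of lines per file. Here's a sketch that assumes the source files are Shift-JIS (cp932), which is common for VN scripts, or UTF-8 with a BOM; check the actual encoding first.

    # Sketch: re-encode a script as UTF-8 without a BOM.
    # Assumes the input is cp932 (Shift-JIS); swap in "utf-8-sig" if it's UTF-8 with a BOM.
    from pathlib import Path

    def to_utf8(path, source_encoding="cp932"):
        text = Path(path).read_text(encoding=source_encoding)
        Path(path).write_text(text, encoding="utf-8")  # Python's "utf-8" codec writes no BOM

    to_utf8("somescript.txt")  # placeholder filename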

Running the scripts

You need Python 3 and 64-bit Java.

The processing scripts can be found here. They come with several VN scripts so you can Just Do It and make sure everything works. Run fullredo.sh to generate two frequency lists for each script ("count" excludes certain types of words, "altcount" doesn't) and table.html, the stats table. This takes several minutes, maybe ten on an old computer. VNFreqList.tsv was generated by running normalizer.jar over the "count" directory of frequency .tsv files, excluding any frequency .tsv under roughly 400 KB. BCCWJ_frequencylist_suw_ver1_0.tsv is credited to NINJAL.
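If you're curious what that aggregation step amounts to conceptually, here's a toy sketch. It is not normalizer.jar, and it assumes a plain word<TAB>count layout for the frequency .tsv files, which may not match what the scripts actually emit; it only illustrates the "merge the count directory, skipping small files" idea.

    # Toy aggregation sketch, NOT normalizer.jar: merge per-script frequency .tsv files,
    # skipping anything under ~400 KB, assuming each line is "word<TAB>count".
    # Check the real column layout of the generated .tsv files before using anything like this.
    from collections import Counter
    from pathlib import Path

    def merge_counts(count_dir, min_bytes=400_000):
        total = Counter()
        for tsv in Path(count_dir).glob("*.tsv"):
            if tsv.stat().st_size < min_bytes:
                continue  # skip scripts too small to be representative
            for line in tsv.read_text(encoding="utf-8").splitlines():
                word, _, count = line.partition("\t")
                if count.strip().isdigit():
                    total[word] += int(count)
        return total

    for word, count in merge_counts("count").most_common(20):
        print(f"{word}\t{count}")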