Wednesday, April 25, 2012

Kanjidic2 as CSV in HTML and text

To aid a neutral party in assessing approaches to digital dictionaries for Japanese, I have posted an HTML file displaying the first 10,000+ entries in Kanjidic2 at
I have restricted the dump to the kanji, the UCS code, and a maximum of 12 of a possible 14 meanings.

There are fewer than 10,200 entries because many of the first 12,155 entries had no XML meaning content without a language attribute (in Kanjidic2, a meaning with no language attribute is English). Those few thousand entries may have English translations in markup previously used for foreign languages.
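The filtering described above can be sketched in Python (the dump itself was produced with Curl, so this is only an illustration, not the code that was used). It assumes the standard Kanjidic2 layout of `character`, `literal`, `cp_value` with `cp_type="ucs"`, and `meaning` elements with an optional `m_lang` attribute. The function names and the `SAMPLE` fragment are made up for the example; the fragment is hand-trimmed, not a verbatim excerpt.

```python
import csv
import io
import xml.etree.ElementTree as ET

MAX_MEANINGS = 12  # the cap described in the post


def kanji_rows(xml_source, max_meanings=MAX_MEANINGS):
    """Yield [kanji, ucs_code, meaning, ...] for entries that have
    at least one meaning without a language attribute."""
    root = ET.parse(xml_source).getroot()
    for ch in root.iter("character"):
        literal = ch.findtext("literal")
        ucs = next(
            (cp.text for cp in ch.iter("cp_value") if cp.get("cp_type") == "ucs"),
            None,
        )
        # Keep only meanings that carry no m_lang attribute.
        meanings = [
            m.text
            for m in ch.iter("meaning")
            if m.get("m_lang") is None and m.text
        ]
        if literal and ucs and meanings:
            yield [literal, ucs, *meanings[:max_meanings]]


def rows_to_csv(rows):
    """Serialize rows as CSV text, one entry per line."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Hand-trimmed sample: the second entry has no meanings and is skipped.
SAMPLE = """<kanjidic2>
  <character>
    <literal>亜</literal>
    <codepoint><cp_value cp_type="ucs">4e9c</cp_value></codepoint>
    <reading_meaning><rmgroup>
      <meaning>Asia</meaning>
      <meaning>rank next</meaning>
      <meaning>come after</meaning>
      <meaning m_lang="fr">Asie</meaning>
    </rmgroup></reading_meaning>
  </character>
  <character>
    <literal>丂</literal>
    <codepoint><cp_value cp_type="ucs">4e02</cp_value></codepoint>
  </character>
</kanjidic2>"""

if __name__ == "__main__":
    source = io.BytesIO(SAMPLE.encode("utf-8"))
    print(rows_to_csv(kanji_rows(source)), end="")
```

Rows vary in length because each entry contributes only as many meaning columns as it has, up to the cap. A fixed-width layout would pad short rows with empty fields instead.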

The file can be found as
with a three-line header, which you may have to alter for your purposes.

The Kanjidic2 XML file was parsed using the Curl XDM library from curl.com (Japanese site: http://www.curlap.com).

As it stands, the HTML file should be useful for building custom Anki flashcards (which are themselves stored in SQLite). I will be using a variant of the CSV output to construct dictionary software with annotations and spaced-repetition options. Curl has both CSV and SQLite libraries in addition to its XML libraries.
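As a rough idea of the CSV-to-SQLite step, here is a minimal Python sketch (again, not the Curl code). It assumes each CSV row is kanji, UCS code, then meanings, with any header lines already removed. The table and column names are invented for the example and are not Anki's own schema.

```python
import csv
import io
import sqlite3


def load_csv_into_sqlite(csv_text, conn):
    """Load rows of kanji, ucs, meaning... into two hypothetical tables."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS kanji ("
        "ucs TEXT PRIMARY KEY, literal TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS meaning ("
        "ucs TEXT NOT NULL REFERENCES kanji(ucs), "
        "position INTEGER NOT NULL, gloss TEXT NOT NULL)"
    )
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        literal, ucs, *glosses = row
        conn.execute("INSERT OR REPLACE INTO kanji VALUES (?, ?)", (ucs, literal))
        # Replace any earlier meanings so reloading a row stays consistent.
        conn.execute("DELETE FROM meaning WHERE ucs = ?", (ucs,))
        conn.executemany(
            "INSERT INTO meaning VALUES (?, ?, ?)",
            [(ucs, i, gloss) for i, gloss in enumerate(glosses)],
        )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_csv_into_sqlite("亜,4e9c,Asia,rank next\n", conn)
    for row in conn.execute(
        "SELECT k.literal, m.gloss FROM kanji k "
        "JOIN meaning m ON m.ucs = k.ucs ORDER BY m.position"
    ):
        print(row)
```

Keeping meanings in their own table avoids a fixed number of meaning columns, and leaves room for annotation or review-schedule tables keyed on the same UCS code.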

