Sunday, September 29, 2013

Pleco - Cantonese and Historical Chinese dictionaries

My god, it's been... 3 years since I've posted here?
Well, I've decided to make this the place where I post useful information that I've worked out and that isn't easy to find by searching the internet.
On that note, today's post is about Pleco, the fabulous mobile Chinese dictionary. I found myself wanting a good Cantonese dictionary and a good Historical Chinese dictionary for it, but none were available. So I made my own (that is, I found them on the internet, edited them, and converted them to Pleco format).

Details and instructions for importing Yedict (a Cantonese dictionary based on CEDICT and other material) and a list of Chinese characters with Baxter's Middle and Old Chinese readings and definitions can be found below, for your perusal. I won't provide the files themselves, though, because I'm not totally sure about the copyright status of the sources.

YEDICT
A large compilation of quasi-free Cantonese definitions and readings. Most of the definitions are from CEDICT, which is already available in Pleco for free but has no Cantonese readings. Yedict (as it's called) also merges in Cantonese definitions from Cantonese Stardict, so it's much more useful for specifically Cantonese terms.

Source: http://writecantonese8.wordpress.com/2012/02/04/cantonese-cedict-project/
SPECIFICALLY, tym's amended version (from the comments), which saves some time: http://dl.dropbox.com/u/3648660/yedict_20130108.7z

A lot of these instructions are modelled on alex_hk90's instructions, which also inspired me to write this. If parts of this are confusing, he may have a better explanation in his post here:
http://www.plecoforums.com/threads/user-dictionary-specification.3218/page-2
However, his instructions use commands from an application that only seems to run (well) on *nix systems (Linux, OS X). So I wrote these to work on any computer that can run Notepad++ (all of them?). The regular expressions are largely the same, but there are some differences.
If you don't know what Notepad++ is, google it.

Right now, the entries should look like this:
繒 缯 [jang1] [zeng1|zeng4] /silk fabrics/surname Zeng/to tie/to bind/

-Change to single delimiter
Use Notepad++ regular expression find/replace with: 
Find: (.+?) (.+?) \[(.*?)\] \[(.*?)\] /(.+)
Replace with: \1@\2@\[\3\]@\[\4\]@/\5
NOTE: as written here, there is a space after the colon in "Find:" and "Replace with:"; do not include that space in the find/replace fields.
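
If you'd rather script this step than click through Notepad++, here is a minimal Python sketch of the same find/replace. The function name is my own invention. One difference to watch for: Python's replacement template wants a plain [ rather than \[.

```python
import re

# Same pattern as the Notepad++ "Find" field for this step.
YEDICT_LINE = re.compile(r"(.+?) (.+?) \[(.*?)\] \[(.*?)\] /(.+)")


def to_single_delimiter(line: str) -> str:
    """Join the five fields of one Yedict line with '@' (hypothetical helper)."""
    # Brackets are written literally here; a backslash before them would be
    # copied into the output by Python's re module.
    return YEDICT_LINE.sub(r"\1@\2@[\3]@[\4]@/\5", line)


sample = "繒 缯 [jang1] [zeng1|zeng4] /silk fabrics/surname Zeng/to tie/to bind/"
print(to_single_delimiter(sample))
# → 繒@缯@[jang1]@[zeng1|zeng4]@/silk fabrics/surname Zeng/to tie/to bind/
```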

-Add Pleco definition formatting
1. Add bullet-point to the beginning of defs.
Find: (.+?)@/(.+)/
Replace with: \1@•\2
2. Add bullet-point to the middle parts of defs.
Find: /
Replace with: • 
NOTE the special character here. This is a special Unicode character that Pleco uses to break lines. The bullet is for cosmetic purposes.
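
The two bullet steps could be sketched in Python like this. Two things here are assumptions on my part: that Pleco's line-break character is the private-use code point U+EAB1 (that's what I've seen cited on the Pleco forums, so check it against your own file), and that it goes directly before the bullet. The line-break character is a parameter so you can swap in whatever your copy actually uses.

```python
import re

# ASSUMPTION: Pleco's line-break character is U+EAB1. Verify before relying on it.
PLECO_NEWLINE = "\uEAB1"


def add_bullets(line: str, newline: str = PLECO_NEWLINE) -> str:
    """Turn the slash-separated definitions into bulleted ones (hypothetical helper)."""
    # Step 1: the slash that opens the definitions becomes a bullet,
    # and the trailing slash is dropped.
    line = re.sub(r"(.+?)@/(.+)/", r"\1@•\2", line)
    # Step 2: every remaining slash separates two definitions.
    return line.replace("/", newline + "• ")


sample = "繒@缯@[jang1]@[zeng1|zeng4]@/silk fabrics/surname Zeng/to tie/to bind/"
# repr() makes the otherwise invisible line-break character show up as \ueab1.
print(repr(add_bullets(sample)))
```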

-Convert to Pleco card format.
Find: (.*)@(.*)@\[(.*)\]@\[(.*)\]@(.*)
Replace with: \2[\1]\t\4\t[\3]\5

Now, the entries should look something like this:
缯[繒] zeng1|zeng4 [jang1]•silk fabrics• surname Zeng• to tie• to bind
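
For reference, the card-format step looks roughly like this in Python. It's only a sketch with a made-up function name; the pattern is the same one as in the Find field, and \t in the replacement becomes a real tab.

```python
import re

# Same pattern as the Notepad++ "Find" field for the card-format step.
DELIMITED = re.compile(r"(.*)@(.*)@\[(.*)\]@\[(.*)\]@(.*)")


def to_pleco_card(line: str) -> str:
    """simplified[traditional] <tab> Mandarin <tab> [Cantonese]definitions"""
    return DELIMITED.sub(r"\2[\1]\t\4\t[\3]\5", line)


sample = "繒@缯@[jang1]@[zeng1|zeng4]@•silk fabrics• surname Zeng• to tie• to bind"
print(to_pleco_card(sample))
```

The printed line should match the example entry above, with tabs between the three parts.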

-Manually break into smaller files (?)
This is because Pleco is EXTREMELY slow at importing. I broke it into 22 files of 10,000 definitions each, but that was maybe a bad idea. A better idea would be to break it into fewer, larger files (maybe 30,000 entries each) and let one import run overnight each night for a week. Remember to BACK UP THE DATA FILE while importing. And after each import, CHECK THE ENTRY COUNT to make sure it didn't stop midway without informing you (which it does, a lot).
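
Splitting by hand is tedious, so here is a rough Python sketch that does it and also reports how many entries went into each file, which gives you a number to compare against Pleco's entry count after each import. The file names and the 30,000 default are just placeholders of mine.

```python
from pathlib import Path


def split_entries(lines, chunk_size=30000):
    """Yield successive lists of at most chunk_size entries."""
    for start in range(0, len(lines), chunk_size):
        yield lines[start:start + chunk_size]


def write_chunks(source, out_dir, chunk_size=30000):
    """Write the chunks as numbered .txt files; return (file name, entry count) pairs."""
    text = Path(source).read_text(encoding="utf-8")
    lines = [ln for ln in text.split("\n") if ln.strip()]  # skip blank lines
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    counts = []
    for i, chunk in enumerate(split_entries(lines, chunk_size), start=1):
        path = out / f"yedict_part{i:02d}.txt"  # hypothetical naming scheme
        path.write_text("\n".join(chunk) + "\n", encoding="utf-8")
        counts.append((path.name, len(chunk)))
    return counts


# Example use (paths are placeholders):
# for name, count in write_chunks("yedict_pleco.txt", "chunks"):
#     print(name, count)
```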

-Move files to phone and import into a new Pleco user dictionary. The file should be .txt. A few of the entries have errors, but that's something you can fix by hand as you find them, if you wish.


HISTORICAL
A little over 4000 characters with their Middle and Old Chinese readings and meanings. I took the liberty of making it much cleaner and easier to read for people who aren't historical linguists of Chinese, while retaining as much accuracy as possible. This uses the same method as Yedict, above.

Source: http://crlao.ehess.fr/docannexe.php?id=1221

The entries should look like this (the fields are separated by tabs, and the first field is the character itself):
埃 āi ai1 'oj '- -oj A *qˤə dust 0938b U+57C3

-Get rid of extra data (pīnyīn, MCI, MCF, GSR, Unicode code point)
Use Notepad++ regular expression find/replace with:
Find: (.)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)
Replace with: \1\t\3\t\4\t\7\t\8\t\9
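
The same column-dropping can be done in Python by splitting on tabs instead of using a regular expression. This is a sketch under a couple of assumptions: that every line has exactly 11 tab-separated fields, and that the first field is the single character (埃 for U+57C3), as the leading (.) in the pattern implies. The field descriptions in the comment are my reading of the sample line.

```python
# Zero-based positions of \1, \3, \4, \7, \8, \9 from the replacement:
# character, numbered pinyin, MC reading, tone letter, OC reading, gloss.
KEEP = (0, 2, 3, 6, 7, 8)


def drop_extra_fields(line: str) -> str:
    """Keep only the six fields used in the later steps (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, got {len(fields)}: {line!r}")
    return "\t".join(fields[i] for i in KEEP)


sample = "埃\tāi\tai1\t'oj\t'-\t-oj\tA\t*qˤə\tdust\t0938b\tU+57C3"
print(drop_extra_fields(sample))
```

Raising an error on a wrong field count is a deliberate choice: it surfaces the broken lines early instead of silently shifting the columns.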

-Check for missing tabs after the headword character and manually add them (encoding issue?). This one is only a search, so there is no replacement.
Find: ^(.)[^\t](.*\w+)
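
To get a list of the offending lines all at once, the same pattern can be run from Python. A small sketch, with a function name of my own choosing:

```python
import re

# Same pattern as the "Find" field: a first character NOT followed by a tab.
MISSING_TAB = re.compile(r"^(.)[^\t](.*\w+)")


def lines_missing_tab(lines):
    """Return (line number, line) pairs for lines whose second character is not a tab."""
    return [(n, line) for n, line in enumerate(lines, start=1)
            if MISSING_TAB.search(line)]


rows = ["埃\tai1\t'oj", "埃ai1\t'oj"]
print(lines_missing_tab(rows))  # only the second row is reported
```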

-Get rid of Baxter's tone tags on MC (he records the tone as a letter in a separate field, so these are redundant)
Find: ^(.*)\t(.*)\t(.*)[XH]\t([ABCD])
Replace with: \1\t\2\t\3\t\4
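
Here is that step as a Python sketch. The 馬 row is an illustrative line I made up in the same shape as the data, not one copied from the source file, so only take the mechanics from it.

```python
import re

# Same pattern as the "Find" field: an X or H at the end of the MC reading,
# directly before the tone-letter field.
TONE_TAG = re.compile(r"^(.*)\t(.*)\t(.*)[XH]\t([ABCD])")


def strip_tone_tag(line: str) -> str:
    """Drop the trailing X/H from the MC field (hypothetical helper)."""
    return TONE_TAG.sub(r"\1\t\2\t\3\t\4", line)


illustrative = "馬\tma3\tmaeX\tB\t*mˤraʔ\thorse"  # made-up row for illustration
untagged = "埃\tai1\t'oj\tA\t*qˤə\tdust"           # no tag, so it is left alone

print(strip_tone_tag(illustrative))
print(strip_tone_tag(untagged))
```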

-Change tone letters to names
Find: (.*)\t(.*)\t(.*)\tA\t(.*)\t(.*)
Replace with: \1\t\2\t\3\teven\t\4\t\5
Find: (.*)\t(.*)\t(.*)\tB\t(.*)\t(.*)
Replace with: \1\t\2\t\3\trising\t\4\t\5
Find: (.*)\t(.*)\t(.*)\tC\t(.*)\t(.*)
Replace with: \1\t\2\t\3\tdeparting\t\4\t\5
Find: (.*)\t(.*)\t(.*)\tD\t(.*)\t(.*)
Replace with: \1\t\2\t\3\tentering\t\4\t\5
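
In a script, the four find/replace passes collapse into one lookup table. Note that this is a different technique from the regular expressions above (splitting on tabs and replacing the fourth field), and it assumes each line has exactly six fields at this point.

```python
TONE_NAMES = {"A": "even", "B": "rising", "C": "departing", "D": "entering"}


def name_tone(line: str) -> str:
    """Replace the tone letter in the fourth field with its name (hypothetical helper)."""
    fields = line.split("\t")
    # Lines that don't have six fields, or have something else in the
    # tone field, are returned unchanged.
    if len(fields) == 6 and fields[3] in TONE_NAMES:
        fields[3] = TONE_NAMES[fields[3]]
    return "\t".join(fields)


print(name_tone("埃\tai1\t'oj\tA\t*qˤə\tdust"))
```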

-Add labels
Find: (.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)
Replace with: \1\t\2\tMC: \3 \(\4 tone\)\tOC: \5\t\6
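
The labelling step as a Python sketch, using the same six-group pattern. Parentheses need no backslash in a Python replacement string.

```python
import re

SIX_FIELDS = re.compile(r"(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)")


def add_labels(line: str) -> str:
    """Merge the MC reading and tone name into one labelled field and label the OC field."""
    return SIX_FIELDS.sub(r"\1\t\2\tMC: \3 (\4 tone)\tOC: \5\t\6", line)


print(add_labels("埃\tai1\t'oj\teven\t*qˤə\tdust"))
```

After this step each line has five fields instead of six, which is why the later patterns only have five groups.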

-Add Pleco definition formatting
Add bullet-point to the beginning of defs.
Find: (.*)\t(.*)\t(.*)\t(.*)\t(.*)
Replace with: \1\t\2\t\3\t\4\t• \5


-Rearrange and convert to Pleco card format
Find: (.*)\t(.*)\t(.*)\t(.*)\t(.*)
Replace with: \1\t\2\t\5 \3 \4

By now, the entries should look like this:
埃 ai1 • dust MC: 'oj (even tone) OC: *qˤə
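
The last two steps (bullet, then rearrange) can be sketched together in Python. The separator between the definition, MC and OC parts is a parameter: it shows up as a plain space in this post, but you could pass in Pleco's line-break character instead if you want each part on its own line. The function name is my own, and the headword 埃 is assumed to be the first field.

```python
import re

FIVE_FIELDS = re.compile(r"(.*)\t(.*)\t(.*)\t(.*)\t(.*)")


def to_pleco_card(line: str, sep: str = " ") -> str:
    """character <tab> pinyin <tab> bulleted gloss, then the MC and OC readings."""
    # Bullet in front of the gloss (the fifth field).
    line = FIVE_FIELDS.sub(r"\1\t\2\t\3\t\4\t• \5", line)

    # Move the gloss ahead of the MC and OC fields.
    def rearrange(match):
        char, pinyin, mc, oc, gloss = match.groups()
        return f"{char}\t{pinyin}\t{gloss}{sep}{mc}{sep}{oc}"

    return FIVE_FIELDS.sub(rearrange, line)


print(to_pleco_card("埃\tai1\tMC: 'oj (even tone)\tOC: *qˤə\tdust"))
```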

-Manually delete first line (the fields key)

-Move file to phone and import into a new Pleco user dictionary. The file should be .txt. A few of the entries have errors, but that's something you can fix by hand as you find them, if you wish.