Please use this thread for all comments and bug reports. Before posting, please read the following notes/status.
1. The analyzer will never be 100% perfect. I am aiming for 95-98% accuracy.
2. There will be times where it simply cannot break apart a small piece of the text. Please report these for checking. In particular, I am most interested in verbs or adjectives that seem to be broken up. When submitting, please do the following:
A. Try submitting a single sentence of Japanese that contains the error.
B. Report with the entire sentence, as well as the location of the error (and expected results).
3. For now, words that were not found in the dictionary will appear RED. This will include errors (where words were broken up incorrectly) as well as words that are real, but not in renshuu's dictionary (such as names).
4. CURRENT ISSUES
- Verbs with って afterwards parse badly (Ex. 見つかったって)
Also noting that the 。 in the second test was flagged as not in the dictionary, whereas the 。 in the first test was recognized. I'm assuming that's because it's half-width?
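If the half-width guess is right, one possible workaround is to normalize the text before the dictionary lookup. This is only a minimal Python sketch, not renshuu's actual code, and the function name `normalize_punctuation` is made up for illustration. It relies on Unicode NFKC normalization, which maps the half-width ｡ (U+FF61) to the regular 。 (U+3002).

```python
import unicodedata


def normalize_punctuation(text: str) -> str:
    """Fold half-width forms (e.g. U+FF61) into their standard equivalents.

    Hypothetical helper, not part of renshuu or kuromoji.
    """
    # NFKC applies Unicode compatibility mappings, so half-width punctuation
    # and half-width katakana are replaced by their standard-width forms.
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # \uff61 is the half-width period discussed above.
    print(normalize_punctuation("見つかった\uff61"))
```

Note that NFKC also changes other compatibility characters (for example, full-width Latin letters become plain ASCII), so it may be too aggressive if the original text has to be displayed exactly as typed.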
This'll be all I can do tonight! (Note: it does not fix already-parsed readings.)
1. さ (adjective > noun) form fixed
2. half-width period fixed
3. 写メ not in base dictionary (this is not renshuu's dictionary), impossible (for now)
4. ながら fixed
5. ぼっこぼこ not in base
6. ベテランニート fixed
7. がった fixed
Notes: The "base dictionary" is a layer below renshuu, part of a package called kuromoji. Although renshuu can work around it *sometimes*, if the word is not in this dictionary, it'll get broken up into smaller components that are very hard to stitch and put back together (on renshuu's layer). Looking into a way to inject terms into that base dictionary, but it's always a nightmare working with someone else's code library.
I'm just tossing this out here - if anyone is proficient in node/js and wants to help me reverse-engineer the dictionary file formats in the base layer that I am using, let me know! (https://github.com/takuyaa/kur...) Been trying to figure out how to add new entries to the underlying dictionary.
2. 履歴書 is not yet possible with current setup. kuromoji splits it, and both words are in the dictionary. This will be fixed by the future implementation of complex words, where a word can be marked as a subset of a larger word, and the system will try to match those together. 100% doable, just not yet.
3. So, it's marking そりゃ as a verb form of する. Any idea what that's called? I can easily code it into the system, but I'm not sure what it is or what the full rule set is.
4. 行かなきゃ fixed
5. かな fixed (interestingly, it had this as a form of the base "unit" か).
6. お客 partially fixed - but no お客さん (see #2)
7. 事なきを得る < see #2
8. 着の身着のまま家から <-- bad kuromoji marking. This may be tricky - we'll have to see how many more of these come out before we can consider a rule to overrule kuromoji.
I think the compound word issue is going to be the largest one, and one that I may need to implement sooner rather than later.
In order to give you all something to play with, though, I'll try to get the Actions panel set up soon so you can start exporting this stuff to lessons/schedules.
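For anyone curious what the compound-word matching from #2 could look like, here is a rough Python sketch. It is not the planned implementation; the function name `merge_compounds`, the `max_parts` limit, and the exact way the words are split (履歴 + 書, お客 + さん) are assumptions made for illustration. The idea is to look at runs of adjacent tokens and join them whenever the joined string is itself a dictionary entry, trying the longest run first.

```python
def merge_compounds(tokens, dictionary, max_parts=4):
    """Greedily join adjacent tokens whose concatenation is a dictionary entry.

    tokens:     list of surface strings from the tokenizer (assumed input shape)
    dictionary: set of known words (stand-in for the real dictionary)
    max_parts:  hypothetical cap on how many tokens one compound may span
    """
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so a full compound wins over its parts.
        for size in range(min(max_parts, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + size])
            if candidate in dictionary:
                merged.append(candidate)
                i += size
                break
        else:
            # No longer entry starts at this token; keep it unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


if __name__ == "__main__":
    words = {"履歴", "書", "履歴書", "書く", "お客", "さん", "お客さん"}
    print(merge_compounds(["履歴", "書", "を", "書く"], words))
    print(merge_compounds(["お客", "さん"], words))
```

A greedy longest match like this will sometimes glue together tokens that should stay separate, so a real version would probably need part-of-speech checks on top.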
2) Original: ‥‥摑まりたくねえ‥‥ ねえ gets recognized as "right?/don't you think?" instead of ない
3) Original: 彼奴が遣ったこと こと always gets the translation of "particle indicating command, mild enthusiasm etc" instead of the more common 事
4) Original: ‥‥そうだ、彼奴だ‥‥ Parsed: ‥‥庄だ、彼奴だ‥‥
5) Original:(もう、何もかも オシマイなんだぁ!) Parsed: (もう、何もかも 御仕舞いな乃だぁ!) If katakana is used for emphasis, foreign accents, a robotic voice, etc., maybe the Reader should find a way to link 御仕舞い under the hood but leave it displayed as オシマイ
6) Original: あそこで叫んでるの Parsed: 彼処で叫出る乃
Contracted Te-iru form turned into でる
7) Original:‥‥死にたがってるわよ。 Parsed:‥‥死にたがってるわよ。 たがる is failing to be linked
8) Original:矢張 (Yahari - family name) Names are getting parsed incorrectly; this was split into 矢 and 張. It might be the right time to import the same names dictionary used by jisho :D
9) Original: うわあ Parsed: うわあ This is being split into 2
10) Original:言わなくちゃ ちゃ is not recognized as the contraction of ては
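To illustrate the suggestion in 5) (link the dictionary entry under the hood, but keep the katakana on screen), here is a small Python sketch. It is only an illustration under assumptions: the function names and the reading-to-entry table are hypothetical, not how the Reader actually stores its data. It uses the fact that katakana ァ (U+30A1) through ヶ (U+30F6) sit exactly 0x60 above the matching hiragana in Unicode.

```python
def katakana_to_hiragana(text: str) -> str:
    """Convert katakana to hiragana, leaving every other character alone."""
    out = []
    for ch in text:
        code = ord(ch)
        # U+30A1..U+30F6 map onto hiragana by subtracting 0x60.
        if 0x30A1 <= code <= 0x30F6:
            out.append(chr(code - 0x60))
        else:
            out.append(ch)
    return "".join(out)


def lookup_with_surface(surface: str, reading_to_entry: dict) -> dict:
    """Look up by hiragana reading, but keep the original text for display.

    reading_to_entry is a hypothetical table of reading -> dictionary entry.
    """
    key = katakana_to_hiragana(surface)
    return {"display": surface, "entry": reading_to_entry.get(key)}


if __name__ == "__main__":
    table = {"おしまい": "御仕舞い"}
    print(lookup_with_surface("オシマイ", table))
```

The text the user sees stays オシマイ, while the linked entry can still be 御仕舞い.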
3. So, it's marking そりゃ as a verb form of する. Any idea what that's called? I can easily code it into the system, but I'm not sure what it is or what the full rule set is
*そりゃ is a blend of それは, not a form of する afaik. It has its own dictionary entry and as far as I've seen gets parsed correctly.
ichi.moe categorizes すりゃ the same as the "provisional" -eba form (link). I think the rule is to just replace れば with りゃ for verbs and ければ with きゃ for adjectives; すれば→すりゃ, なければ→なきゃ.
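In case it helps with coding it in, here is how I'd sketch that rule in reverse (contracted form back to the -eba form) in Python. This is just my understanding of the rule, not a complete rule set; the function name and the exception table are hypothetical. The exception table is there because そりゃ is それは, so the suffix rule must not touch it.

```python
# Words that end in りゃ/きゃ but are not contracted -eba forms (assumed list,
# only そりゃ is taken from this thread).
EXCEPTIONS = {"そりゃ": "それは"}


def expand_conditional_contraction(word: str) -> str:
    """Expand a contracted provisional form: りゃ -> れば, きゃ -> ければ."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    if word.endswith("きゃ"):
        # Adjective-type ending: なきゃ -> なければ
        return word[:-2] + "ければ"
    if word.endswith("りゃ"):
        # Verb-type ending: すりゃ -> すれば
        return word[:-2] + "れば"
    # Not a contraction this sketch knows about; return it unchanged.
    return word


if __name__ == "__main__":
    for w in ["すりゃ", "なきゃ", "行かなきゃ", "そりゃ"]:
        print(w, "->", expand_conditional_contraction(w))
```

There are likely more exceptions than そりゃ, so the table would need to grow as reports come in.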