Forums - Text Analyzer v2: feedback

マイコー

Level: 328

I'd love any feedback on the visual update of the Text Analyzer, as well as the new data available. Specifically, here's what's changed:

1. It no longer runs each piece of text through the *fast* and then the *thorough* parsers - it just sticks to the more accurate one.

2. On tapping the name of the text, you get a new page with all the text info / reader, in addition to a new breakdown of grammar found in the text.

3. On the updated text list, you now have a Word and Grammar level. It's determined by the easiest (highest) level of the JLPT that covers 70% of the material.

Example: If you have N5: 80% of the words, it would be rated as N5

However, if you have 60% of N5 and 15% of N4 (=75% at N4 or easier), it would be rated N4.
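
The rating rule described above can be sketched roughly as follows. This is only an illustration of the cumulative 70% idea, not renshuu's actual code: the function name `rate_text`, the dictionary input, and the fallback to N1 when no level reaches the cutoff are all assumptions.

```python
# JLPT levels ordered from easiest to hardest.
LEVELS = ["N5", "N4", "N3", "N2", "N1"]


def rate_text(percent_by_level, cutoff=70.0):
    """Return the easiest level whose cumulative coverage reaches the cutoff.

    percent_by_level maps a level name to the percentage of words (or
    grammar points) at that level. Hypothetical helper, for illustration.
    """
    covered = 0.0
    for level in LEVELS:
        covered += percent_by_level.get(level, 0.0)
        if covered >= cutoff:
            return level
    # Assumption: if nothing reaches the cutoff, fall back to the hardest level.
    return LEVELS[-1]


print(rate_text({"N5": 80}))            # N5
print(rate_text({"N5": 60, "N4": 15}))  # N4
```

Both examples from the post come out as stated: 80% at N5 rates as N5, and 60% + 15% = 75% at N4 or easier rates as N4.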

Current concerns that I have:

1. Whether, in addition to overall grammar, conjugation data should be split out as well (what percentage is polite, casual, etc.).

2. Should the 70% value be customizable?

3. While the JLPT level works fairly well on grammar, I feel like it works less so on vocab. When it comes to a lot of the text being posted in here (lyrics, novels, manga, anime subs, etc.), I feel like the JLPT (even with the expansions that renshuu is putting in with the revised Beginner/Basics lists) is narrow enough that more things are going to be marked as N1 (highest level), giving texts an impression of being overly difficult.

Anyway, I'd love any feedback or comments!

3 years ago

ポールおじちゃん

Level: 1765

My two cents:

1. Seems potentially informative. Why not?
2. I don’t see a use case for that.
3. True enough. I suggest (a) treating non-JLPT words as insignificant for difficulty, so if a text is 50% non-JLPT words, just rate it on the 50% that is JLPT; and (b) providing more details (like a histogram?) on request. Representing difficulty as a single number is never going to be more than a crude approximation, so don’t expect too much.

3 years ago

Anonymous123

Level: 1590

I think making the 70% value customizable would be good. I would probably want to use a much higher value (closer to 90%).

That being said, I'm not actively using the Text Analyzer, so my opinion on the matter should only hold a teeny tiny amount of weight.

3 years ago

Yatyisam

Level: 253

This has so much potential and I can see myself using it very often.

However, most of my readings are showing as N1 vocabulary, but they definitely are not N1 level texts. I think part of the problem is that all of the texts I have entered have 20-40% of the vocab under "Not in JLPT". I have no idea what those words are, but it's going to be impossible to get 70% of any level if 40% are not in the JLPT. (Is it counting conjugated verbs as not in JLPT by any chance?)

3 years ago

Yatyisam

Level: 253

I just noticed something else. The vocabulary percentages for the N Levels for my texts (including "not in JLPT") are not adding up to 100%. They all seem to give a total of around 80%.

3 years ago

マイコー

Level: 328

Thanks for the feedback!

1. The 100% issue is fixed. In the back, N5 is actually split up into "N6" (what renshuu uses in the Japanese Basics lessons - basically N5 part 1) and "N5". It wasn't properly counting the N6 ones in the N5 percentage.

2. Since you have renshuu pro, it's actually pretty easy to see which ones are in the "not in jlpt" category. When you're looking at the percentages, look below to the actions section - you can filter on "not in jlpt", then make a new, private lesson. I went ahead and did this for one of your readings so you can see it. They are reachable under Resources > Lesson Center > Me (it's in the Assorted Lessons section).

3 years ago

Yatyisam

Level: 253

The 100% issue does seem to be working ok now. And thanks for reminding me about how to view the "not in JLPT" words as well. That's very useful! But the overall word level score is still very misleading. Take for example one of my texts, where the vocab levels are:

N5 - 52%
N4 - 6%
N3 - 6%
N2 - 2%
N1 - 0%
Not in JLPT - 34%

This text is rated as an N1 vocabulary, mostly because 34% are not in JLPT. But when I looked at those unlisted words I see:

十月
~ごろ
居酒屋
焼鳥
ポテト
トマト

Most of these words can be found in the most common word lists, or the most common news words list. So I assume that means that they are words that N5 or N4 learners would be able to recognize (at least the kana versions for words like izakaya and yakitori).

I understand that in an advanced text, the "not in JLPT" words would probably be advanced N1+ words. But in easy, graded reader style stuff like what I am looking at, the "not in JLPT" words are often common every day things that aren't "academic/business" enough to be on the JLPT lists, so it is skewing the results.

Potential solutions are (maybe?)

an option to not consider the "not in JLPT" words and just use the cutoff percentage on what is left
a histogram or other visual way to see the breakdown (echoing a previous post)
add information to the analysis about how many words are in the "common" word lists
allow users to set the cutoff (based on what I see in my data, 50% seems more accurate)
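
The first suggestion (drop the "not in JLPT" words and apply the cutoff to what is left) could look something like the sketch below, using the percentages from the text quoted in this post. The function name and the idea of renormalizing against the JLPT-listed total are assumptions for illustration, not how renshuu necessarily does it.

```python
# JLPT levels ordered from easiest to hardest.
LEVELS = ["N5", "N4", "N3", "N2", "N1"]


def rate_ignoring_unlisted(percent_by_level, cutoff=70.0):
    """Rate a text using only the words that appear in some JLPT list.

    Keys other than the five JLPT levels (e.g. "Not in JLPT") are ignored,
    and the remaining percentages are rescaled so they add up to 100.
    """
    listed_total = sum(percent_by_level.get(level, 0.0) for level in LEVELS)
    if listed_total == 0:
        return None  # nothing to rate on
    covered = 0.0
    for level in LEVELS:
        covered += percent_by_level.get(level, 0.0)
        if covered / listed_total * 100 >= cutoff:
            return level
    return LEVELS[-1]


sample = {"N5": 52, "N4": 6, "N3": 6, "N2": 2, "N1": 0, "Not in JLPT": 34}
print(rate_ignoring_unlisted(sample))  # N5
```

Here the JLPT-listed words make up 66% of the text, and N5 alone is 52 / 66, or about 79% of those, so the text rates as N5 instead of N1.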

Oh, and by the way, most of the texts that I saved have been fairly easy, and the grammar level for each is mostly showing N4 or N5, so that seems to be working pretty well.

3 years ago

マイコー

Level: 328

I definitely agree that it is less than ideal. However, while I'm not sure which "common word lists" you're referring to (I've seen ones that start at 10,000 words), the newspaper word lists that I've seen are at least 5 or 10 thousand words long. The entirety of the JLPT is less than 10,000 words, so I am not sure "in common word lists" or "in newspaper lists" fully equates with "N5/N4 learners will know these".

On a different metric, the Genki I and Genki II textbooks have roughly 1,200 words between the two of them. It is often said (in a general sense) that finishing Genki II will take you almost to the end of N4, so at least from the perspective of JLPT, I do not think the data is *wrong* (just not as useable as it should be).

It may be that the JLPT itself is not a great metric for grading the difficulty of the texts. However, many people (on renshuu and elsewhere) are following material lists that generally aim in the same direction as the JLPT. If this one page alone uses a different set of metrics and says (as an example) "this is a pre-intermediate text" (level 2 out of 5, with 5 being the hardest), and they say "well, I have done the pre-intermediate word groups on renshuu, but I can't read this at all", then it's not a great solution.

We are already in the process of expanding our renshuu JLPT lists (I've practically doubled the N5 level since we began this transition, split into Japanese Basics and Beginner Japanese), but it's still tricky. For example, いざかや is arguably not an N5 or N4 word, maybe not even N3. When you go through the N5 list itself, you realize that you can get through 500-600 "must need" words without even breaking a sweat. Adding on "common foods you may see at a grocery store or restaurant" could easily add 7-10% to it with just that one category, and then suddenly, you have (I'm just guessing here) 2,000 words for "this is the most basic level, N5".

It is still far from perfect, and I really appreciate all the feedback. In the meantime, I will try looking into the "common word" list (which is already stored in renshuu). It's kind of crazy, because the data in there often seems to defy what is "common" to us.

Example: (both N5 words)

遊ぶ

秋

秋 is in the top 500 words (group 1, where each group has 500 words), while 遊ぶ is in group 26!? (roughly rank 12,500). So going by the frequency lists, a word like あそぶ (which seems pretty common to *me*) is not something you'd learn for quite a while.
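
For reference, the group arithmetic works out as in this small sketch. The 500-word group size is from the post; the function names and the 1-based numbering are assumptions.

```python
GROUP_SIZE = 500  # each frequency group holds 500 words, per the post


def frequency_group(rank):
    """Return the 1-based group number for a 1-based frequency rank."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    return (rank - 1) // GROUP_SIZE + 1


def group_rank_range(group):
    """Return the (first, last) ranks covered by a 1-based group number."""
    first = (group - 1) * GROUP_SIZE + 1
    return first, group * GROUP_SIZE


print(frequency_group(500))    # 1
print(frequency_group(501))    # 2
print(group_rank_range(26))    # (12501, 13000)
```

Under this numbering, group 26 covers ranks 12,501 to 13,000, which matches the "roughly rank 12,500" figure.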

Of course, this is cherry picking, but there doesn't seem to be a single method that is going to give us something satisfactory.

3 years ago

gdartfow

Level: 2197

Most texts I've tried (easier and harder) came out as N1 regardless, which is misleading and makes the feature feel useless overall.

I do like the idea of using a frequency list, rather than JLPT level, to estimate text difficulty. But I agree that finding one that conforms to "common sense" wouldn't be easy.

I ran a quick search in some lists and got fairly different results: Wiktionary (https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/Japanese) puts 秋 at 794 and 遊ぶ at 3143, Leeds Corpus (http://corpus.leeds.ac.uk/frqc/internet-jp.num) puts 秋 at 1899 and 遊ぶ at 1593 and NINJAL's BCCWJ (https://clrd.ninjal.ac.jp/bccwj/en/freq-list.html) puts 秋 at 1357 and 遊ぶ at 995.

3 years ago

マイコー

Level: 328

That third frequency list seems useful, although I bet that ultimately, it might be most useful to take an average of multiple ones. I'll try playing with it and see what comes up.
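
As a rough idea of what combining lists might look like, here is a sketch using the six ranks quoted in the previous post. The choice of mean and median as the combining functions is an assumption; nothing in the thread says which would be used.

```python
from statistics import mean, median

# Ranks quoted earlier in the thread: Wiktionary, Leeds Corpus, BCCWJ.
RANKS = {
    "秋": [794, 1899, 1357],
    "遊ぶ": [3143, 1593, 995],
}


def combined_rank(word, combine=mean):
    """Combine a word's ranks from several lists into a single number."""
    return combine(RANKS[word])


for word in RANKS:
    print(word, round(combined_rank(word)), combined_rank(word, median))
```

The median (1357 for 秋, 1593 for 遊ぶ) is less sensitive to a single list that disagrees strongly with the others, which may matter given how far apart these lists are.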

3 years ago

マイコー

Level: 328

Bit of an update - since there is always some ambiguity for certain terms (usually ones without kanji), I can (currently) match up 25.5k out of 30k of the top words in the third frequency list.

As I continue to work on this, I'm wondering how to divide up the groupings. For example, if I set 5 difficulty levels (5 easiest, 1 hardest), what "categories" would line up with each of those?

Example:

70% in top 1,000 = Level 5

70% in top 2,500 = Level 4

70% in top 4,000 = Level 3

70% in top 8,000 = Level 2

otherwise = Level 1
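
A sketch of how such a tiered rule might be evaluated for a whole text is below. It assumes the scale runs from Level 5 (easiest) down to Level 1 (hardest), so the last two tiers are taken to be Level 2 and Level 1; the function name and the use of `None` for words missing from the frequency list are also assumptions.

```python
# (rank limit, level) pairs from easiest to hardest, per the example tiers.
TIERS = [(1000, 5), (2500, 4), (4000, 3), (8000, 2)]


def text_level(word_ranks, cutoff=0.70):
    """Return the difficulty level of a text from its words' frequency ranks.

    word_ranks holds one rank per word; None marks a word that is not in
    the frequency list. Such words still count toward the total.
    """
    if not word_ranks:
        raise ValueError("word_ranks must not be empty")
    total = len(word_ranks)
    for limit, level in TIERS:
        within = sum(1 for r in word_ranks if r is not None and r <= limit)
        if within / total >= cutoff:
            return level
    return 1  # hardest level when no tier reaches the cutoff


print(text_level([1500] * 8 + [9000] * 2))  # 4
```

In the usage line, no word is in the top 1,000, but 8 of 10 are in the top 2,500, so the text lands on Level 4.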

3 years ago

マイコー

Level: 328

I went ahead and built an integration around this, and it's available now.

It turns out that that frequency list by itself was not fully adequate, and so I blended it with the JLPT lists for this calculation. Basically, the levels I mentioned one post above this are true, except they also consider JLPT.

So if a word is in the top 1000 (from the frequency list) OR in the jlpt n5 list, it'd be considered level 5.
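
The per-word part of that blend might be sketched like this. Only the "top 1000 OR N5 = level 5" case is stated in the post; the mapping of N4 through N1 onto levels 4 through 1, the frequency tiers for the lower levels, and the function name are assumptions for illustration.

```python
# Frequency tiers (rank limit, level), easiest first.
FREQ_TIERS = [(1000, 5), (2500, 4), (4000, 3), (8000, 2)]

# Assumption: each JLPT level maps onto the same-numbered difficulty level.
JLPT_TO_LEVEL = {"N5": 5, "N4": 4, "N3": 3, "N2": 2, "N1": 1}


def word_level(freq_rank=None, jlpt=None):
    """Return the easiest level a word qualifies for by frequency OR JLPT.

    Level 5 is easiest and level 1 hardest, so "easiest" means the larger
    number. A word with neither a rank nor a JLPT level gets level 1.
    """
    level = 1
    if freq_rank is not None:
        for limit, tier_level in FREQ_TIERS:
            if freq_rank <= limit:
                level = tier_level
                break
    if jlpt is not None:
        level = max(level, JLPT_TO_LEVEL[jlpt])
    return level


print(word_level(freq_rank=12500, jlpt="N5"))  # 5
print(word_level(freq_rank=794))               # 5
```

With this rule, a word like 遊ぶ that ranks low in one frequency list still counts as level 5 because it is in the N5 list.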

3 years ago

gdartfow

Level: 2197

Something is wrong. Everything looks skewed the other way now:

I would've expected a text on Kant's categorical imperative to rank slightly higher than level 4...

3 years ago

マイコー

Level: 328

I had this long explanation about how it was working, then I looked at it, and found a "small" mistake. Try it now, please :)

3 years ago

gdartfow

Level: 2197

The word level definitely looks better now.

The grammar level percentages might need to be tweaked a little. Most texts I've tried are marked N4, presumably because even texts geared towards N1 still use plenty of more common grammar too (e.g. から, ながら, だろう). A text composed solely of N1 grammar examples still only ranked N2:

By the way, does the text analyzer parse grammar differently from the sentence lookup tab?

I tried checking the same sentence (#167770) in both. The lookup said 笑っているの = の、~こと (Changes A into a noun) while the text analyzer said 笑っているの = て、~で (Since A; As A) + いる (Present Casual).

I don't expect it to be perfect, but I did think it would be consistent...

3 years ago

マイコー

Level: 328

With regards to the numbers, I think the update I just put up will help. It has two things:

1. Split the cutoff percentages so that vocab (70%) and grammar (80%) are separate.

2. Allows you to easily change the percentages to better match your expectations.

As to the second: the parsing systems are different, but the markup system is the same. (Individual sentences are manually confirmed, or at least the new ones are.) It looks like the parser is not linking the ている together. I just made a change that should help, although no changes to parsing will affect already parsed texts.

It may be the case that I periodically rerun all submitted texts through the updated parser, but as that takes several hours to run, it's not something I can do with every change that's made.

3 years ago

イクト

Level: 1312

On my smartphone the table is too wide to fit on the screen, so you have to scroll to the right. This also seems to cause the pop-up dictionary to act weird.

Something else I would love to have is some way to store additional notes about the text. I often paste stuff like NHK articles into it and would like to save the URL of the original article.

3 years ago

マイコー

Level: 328

Fixed!

As to the notes section, I'll get that added sometime in 2023.

3 years ago

イクト

Level: 1312

I found a minor but interesting bug.

I do the following:

1. Read a previously analyzed text.

2. Close it again to get back to the menu.

3. Try to add a new text.

Result: the submit button is missing; only cancel is there.

3 years ago

マイコー

Level: 328

Fixed! Looks like it was a fragment of code from the old UI.

3 years ago
