- Integrating BCC Corpus Data into Dictionary - Pleco Software Forums
I'm honestly a little wary of adding built-in frequency listings because I don't think they're a very good way to learn Chinese; even a really excellent corpus will probably be several years out of date for slang vocabulary, so a term that comes up as uncommon may actually be quite common now (or vice versa) - people are constantly repurposing old words - plus I don't believe they're accurate
- Word frequency list based on a 15 billion character corpus: BCC (BLCU . . .
The corpus is much larger than the CCL (470 million characters), the CNC (100 million characters), SUBTLEX-CH (47 million characters) and the LCMC (less than 2 million characters). It seems as if the frequency lists derived from this corpus might be the most reliable frequency lists currently available.
- Word frequency list based on a 15 billion character corpus: BCC (BLCU . . .
I would read in the BCC corpus frequency list as a dictionary. Then, having concatenated all the news magazine articles as plain text, I would build a dictionary of all the words in the news magazine articles up to 8 characters long, counting their number of occurrences with the help of the BCC frequency list (which tells us which combinations
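The approach described in this post can be sketched roughly as follows. This is only a minimal illustration, not the poster's actual script: the tab-separated `word<TAB>count` layout of the frequency list is an assumption, and the names `load_known_words` and `count_known_words` are hypothetical. It counts every substring of up to 8 characters that appears in the frequency list, so overlapping matches are all counted; it does not perform real word segmentation.

```python
from collections import Counter

MAX_WORD_LEN = 8  # the post mentions words "up to 8 characters long"


def load_known_words(lines):
    """Parse 'word<TAB>count' lines into a dict.

    The tab-separated layout is an assumption about the list format;
    lines that do not match it are skipped.
    """
    known = {}
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) != 2:
            continue
        word, count = parts
        if count.isdigit():
            known[word] = int(count)
    return known


def count_known_words(text, known, max_len=MAX_WORD_LEN):
    """Count occurrences in `text` of every substring found in `known`.

    Overlapping matches are all counted (no segmentation is attempted).
    """
    counts = Counter()
    for start in range(len(text)):
        for length in range(1, max_len + 1):
            candidate = text[start:start + length]
            if len(candidate) < length:
                break  # ran past the end of the text
            if candidate in known:
                counts[candidate] += 1
    return counts


if __name__ == "__main__":
    # Tiny made-up sample standing in for the real frequency list and articles.
    sample_list = ["中国\t100", "人民\t50", "中\t10"]
    known_words = load_known_words(sample_list)
    article_text = "中国人民中国"
    for word, n in count_known_words(article_text, known_words).most_common():
        print(word, n)
```

With a real list, the lines would come from reading the frequency file, and `article_text` would be the concatenated articles. Because every dictionary hit is counted, short words embedded in longer ones are counted as well; a segmenter would be needed to avoid that.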
- Integrating BCC Corpus Data into Dictionary
Thank you very much for your detailed explanation! Yes, that makes sense. Also, by importing the card as a user dictionary you gain additional benefits without losing anything! So if my understanding is correct, it seems there are no significant downsides :) You're welcome! Yeah, it's true, for
- Bigrams sorted by frequency with pinyin English?
The Beijing Language and Culture University created a balanced corpus of 15 billion characters. It's based on news (人民日报 1946-2018, 人民日报海外版 2000-2018), literature (books by 472 authors, including a significant portion of non-Chinese writers), non-fiction books, blog and weibo entries as well as
- Common Idioms; A Collection by Grade [HSK old HSK 中考 高考 . . . ]
The corpus is much larger than the CCL (470 million characters), the CNC (100 million characters), SUBTLEX-CH (47 million characters) and the LCMC (less than 2 million characters). It seems as if the frequency lists derived from this corpus might be the most reliable frequency lists currently available.
- audio recording corpus | Pleco Software Forums
Hey Mike, I'm a big user of vocab lists and I'm about 1.5 months away from finishing the HSK4 list. Recently I've been studying some colloquial stuff and
- Media-related vocabulary gathering project - Pleco Software Forums
With a small corpus of 650 articles from People's Daily, downloaded using a Python script, I hope to start providing a more modern frequency list of media-related vocabulary. The frequency list has the following features: It uses all sections of the 人民日报 (People's Daily) newspaper, including the sports section.