aragonc 94984e6b12 merge 10 years ago
language_profiles 94984e6b12 merge 10 years ago
sample_texts 94984e6b12 merge 10 years ago
index.html 94984e6b12 merge 10 years ago
readme.txt cbf2d50681 Updating license information for the internationalization library. 14 years ago
update_language_profiles.php 6c33498a64 Minor - Adapting code comments to phpdoc 13 years ago

readme.txt

Library of statistical profiles for language recognition
--------------------------------------------------------

The sample texts for different languages have been taken from:
Perl module: Lingua::LanguageGuesser - http://gensen.dl.itc.u-tokyo.ac.jp/LanguageGuesser/LanguageGuesser_demo.html
Statistical Text Analysis - http://boxoffice.ch/pseudo/
Some random sample texts have been taken from Wikipedia - http://wikipedia.org/

All the sample texts should be UTF-8 encoded!
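
One way to check the UTF-8 requirement before adding new sample texts is sketched below. It is only an illustration in Python, not part of this library; the function names is_valid_utf8 and find_non_utf8_files are hypothetical, and the sketch assumes the sample texts are plain files under one directory.

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def find_non_utf8_files(directory: str) -> list:
    """List the files under `directory` whose contents are not valid UTF-8.

    Hypothetical helper: it walks the directory recursively and checks
    every regular file, whatever its extension.
    """
    bad = []
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file() and not is_valid_utf8(path.read_bytes()):
            bad.append(str(path))
    return bad
```

Calling find_non_utf8_files on the directory that holds the sample texts returns the paths that need re-encoding; an empty list means every file decoded as UTF-8.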

To understand how language recognition works, you need to read the following remarkable work:
W. B. Cavnar and J. M. Trenkle. N-gram-based text categorization. Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, 1994.
http://citeseer.ist.psu.edu/cache/papers/cs/810/http:zSzzSzwww.info.unicaen.frzSz~giguetzSzclassifzSzcavnar_trenkle_ngram.pdf/n-gram-based-text.pdf
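
The core idea of the paper can be sketched in a few lines: build a rank-ordered list of the most frequent character n-grams (lengths 1 to 5) for each language, build the same list for the text to classify, and pick the language whose list is closest by the "out-of-place" measure. The Python below is a simplified illustration of that idea, not the code shipped with this library. All function names are hypothetical, tokenization is plain whitespace splitting, a single underscore marks word boundaries, and the cutoff of 300 n-grams is an illustrative value.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top_k=300):
    """Build a ranked n-gram profile: most frequent n-grams first."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; ties are broken alphabetically so that
    # the resulting ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between two profiles.

    An n-gram missing from the language profile gets a fixed maximum
    penalty, here the length of the language profile.
    """
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def guess_language(text, profiles):
    """Return the key of the profile with the smallest distance to `text`."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))
```

In use, one profile would be built per language from its sample text, collected in a dictionary keyed by language code, and passed to guess_language together with the text to identify. Longer sample texts give more stable rankings.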

License: GNU General Public License, version 3, as published by the Free Software Foundation (http://www.fsf.org/).
Assembled by Ivan Tcholakov,
November 2009