Chapter 7 Appendix

7.1 Encoding

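  • ASCII: one of the earliest and smallest character sets. It defines 128 code points covering English letters, digits, punctuation, and control characters.

  • Most later character sets are supersets of ASCII, for example Windows-1252 and Unicode.

  • Unicode text can be stored in several encodings, such as UTF-8 and UTF-16. GB18030, a Chinese national standard, also covers the full Unicode range.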
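The difference between these encodings is easiest to see at the byte level. The following is a minimal sketch in R; the variable names are illustrative, and the CJK character is written as a \u escape so the snippet itself stays pure ASCII.

```r
# "\u4e2d" is the Chinese character "zhong" (middle), given as a Unicode escape.
ascii_char <- "A"
cjk_char <- "\u4e2d"

# An ASCII character is a single byte, and UTF-8 keeps that byte unchanged.
ascii_bytes <- charToRaw(ascii_char)
print(ascii_bytes)

# The same CJK character is stored as different bytes in different encodings.
utf8_bytes <- charToRaw(enc2utf8(cjk_char))
utf16_bytes <- iconv(cjk_char, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
print(utf8_bytes)   # three bytes in UTF-8
print(utf16_bytes)  # two bytes in UTF-16 (little endian)
```

Running iconvlist() shows which encoding names the local iconv implementation accepts, which is worth checking before relying on a less common name.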

Linux, Mac: To list all the locales supported on your computer, run locale -a in a terminal.

By default, R uses the locale setting of the operating system. Sys.setlocale() changes the locale only for the current R session; the setting is lost when R is restarted.
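The session-only nature of Sys.setlocale() can be sketched as follows. This is an illustration rather than a recommended workflow: it assumes that switching LC_CTYPE to the portable "C" locale is acceptable for a moment, and the variable names are made up for the example.

```r
# Show the locale R inherited from the operating system.
print(Sys.getlocale())

# Remember the current character-type locale so it can be restored later.
old_ctype <- Sys.getlocale("LC_CTYPE")

# Switch to the portable "C" locale; only the running session is affected.
Sys.setlocale("LC_CTYPE", "C")
changed_ctype <- Sys.getlocale("LC_CTYPE")

# Restore the original setting.
Sys.setlocale("LC_CTYPE", old_ctype)
restored_ctype <- Sys.getlocale("LC_CTYPE")

# On Linux and Mac, the installed locales can also be listed from within R,
# assuming the "locale" program is on the PATH.
if (.Platform$OS.type == "unix" && nzchar(Sys.which("locale"))) {
  print(head(system("locale -a", intern = TRUE)))
}
```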

7.2 Windows Chinese Locale

How to change the current locale to Chinese?

  • In the Windows Control Panel -> Region and Language -> Format, choose the format you want, e.g. “Chinese (Simplified, PRC)”, and apply the change. Then restart R and check the locale with the command Sys.getlocale(). The default locale for R is now “Chinese (Simplified)_People’s Republic of China.936”, where 936 is the Windows code page for Simplified Chinese (GBK, which the GB18030 standard extends).
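After restarting R, the result can be inspected from the console. The sketch below only reads the current settings; the codepage element of l10n_info() is present on Windows only, so the code checks for it before printing.

```r
# The character-type locale currently in effect.
ctype <- Sys.getlocale("LC_CTYPE")
cat("LC_CTYPE:", ctype, "\n")

# l10n_info() summarises the encoding properties of the current locale.
info <- l10n_info()
cat("Multibyte locale:", info$MBCS, "\n")
cat("UTF-8 locale:", info$`UTF-8`, "\n")

# On Windows the list also carries the active code page (e.g. 936).
if (!is.null(info$codepage)) {
  cat("Code page:", info$codepage, "\n")
}
```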

How to find a valid locale name?

Search the internet for “Windows Language Strings” to find the current list. The “Language string” column contains the valid values for setting the locale in R. For example, to change the current locale to Traditional Chinese, use the command: Sys.setlocale(category = "LC_ALL", locale = "cht").
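Because valid names differ between operating systems, it can help to test a name before relying on it. The helper try_locale() below is hypothetical (it is not part of R); it relies on the documented behaviour that Sys.setlocale() returns an empty string, with a warning, when the request cannot be honoured.

```r
# Hypothetical helper: try a locale name, report success, then restore the old one.
try_locale <- function(name, category = "LC_CTYPE") {
  old <- Sys.getlocale(category)
  on.exit(suppressWarnings(Sys.setlocale(category, old)), add = TRUE)
  # An empty string means the operating system rejected the name.
  result <- suppressWarnings(Sys.setlocale(category, name))
  nzchar(result)
}

print(try_locale("C"))               # the "C" locale is always available
print(try_locale("cht"))             # a Windows language string; result depends on the OS
print(try_locale("no_such_locale"))  # an invalid name is rejected
```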

7.3 Opening a script file (text-based)

If I receive a script that contains Simplified Chinese characters and was created under Windows, its encoding is most likely GBK or its superset GB18030. To display it correctly, I can change the locale of R to “chinese” and import the source code using readLines(). There is an even easier way: open the file in Microsoft Word. If RStudio is the editor for R, it is enough to change the default encoding to “GB18030” in “Tools -> Options -> General -> Default text encoding”. Apply the setting, then open the script; it should now be displayed correctly.
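The readLines() route can also be taken without changing the locale, by reading the raw lines and converting them with iconv(). The sketch below is self-contained: it first writes a small GB18030-encoded file to a temporary path (a stand-in for the received script) and assumes the local iconv supports the name "GB18030".

```r
# Build a one-line script containing two Chinese characters, encoded as GB18030.
text_utf8 <- "\u4e2d\u6587"
gb_bytes <- iconv(text_utf8, from = "UTF-8", to = "GB18030", toRaw = TRUE)[[1]]
script_path <- tempfile(fileext = ".R")
writeBin(c(charToRaw("x <- \""), gb_bytes, charToRaw("\"\n")), script_path)

# Read the lines as raw bytes, then convert them from GB18030 to UTF-8.
raw_lines <- readLines(script_path, warn = FALSE)
decoded <- iconv(raw_lines, from = "GB18030", to = "UTF-8")
print(decoded)

# Remove the temporary file.
unlink(script_path)
```

For a real file, replace script_path with the path to the received script. If iconv() returns NA for a line, the file is probably not in the assumed encoding.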