New Issue: Can we stop requiring particular C LOCALE / LANG settings?

19193, "mppf", "Can we stop requiring particular C LOCALE / LANG settings?", "2022-02-05T12:50:30Z"

Buried in our documentation is a requirement on the environment in which Chapel programs run:

https://chapel-lang.org/docs/usingchapel/chplenv.html#character-set

Chapel works with the Unicode character set with UTF-8 encoding and the traditional C collating sequence. Users are responsible for making sure that they are running Chapel in a suitable environment. For example, for en_US locale, the following environment variables should be set:

LANG=en_US.UTF-8
LC_COLLATE=C
LC_ALL=""

Note: Other character sets may be supported in the future.
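
To make the effect of those three settings concrete, here is a minimal sketch of how a POSIX system resolves a locale category from the environment, assuming the standard precedence (LC_ALL first, then the category's own variable, then LANG, with an empty value counting as unset). `effective_locale` is a hypothetical helper written for illustration; it is not part of Chapel or of Python's `locale` module.

```python
def effective_locale(env, category):
    """Return the locale name POSIX would select for `category`, or None."""
    # Precedence: LC_ALL, then the category's own variable, then LANG.
    # An empty string counts as unset.
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name, "")
        if value:
            return value
    return None  # nothing set: the default is implementation-defined


# The settings quoted from the Chapel documentation above.
documented_env = {"LANG": "en_US.UTF-8", "LC_COLLATE": "C", "LC_ALL": ""}

collate = effective_locale(documented_env, "LC_COLLATE")  # "C"
ctype = effective_locale(documented_env, "LC_CTYPE")      # "en_US.UTF-8"
```

Under that assumption, the empty LC_ALL matters: a non-empty LC_ALL would override both LANG and LC_COLLATE.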

The situation is that, in practice, the I/O code currently assumes that non-binary data is stored as UTF-8. This is connected to the fact that our string type always stores UTF-8 and we don’t have any character set conversion in the standard library (see also #12726).

However, there is some old code in the I/O library that reads a "character" according to the C definition of a multibyte character. It does this if qio_glocale_utf8 is not set to UTF-8 or ASCII. That leads to garbled data or bad-encoding errors, because storing the result into a string assumes it is a UTF-8 character. Beyond that, we have seen some bugs in this code path that only affect people with a different C LANG setting.
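
The general failure mode can be shown outside of Chapel. The following is a small Python illustration, not the qio code: bytes produced under one character set (Latin-1 is used here as an arbitrary example) are handed to something that assumes UTF-8, and the reverse. `decode_as_utf8` is a hypothetical helper.

```python
def decode_as_utf8(data):
    """Interpret raw bytes as UTF-8, the way a UTF-8-only string type must."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None  # stands in for a "bad encoding" error


word = "café"
latin1_bytes = word.encode("latin-1")  # the final character is one byte, 0xE9
utf8_bytes = word.encode("utf-8")      # the final character is two bytes, 0xC3 0xA9

bad = decode_as_utf8(latin1_bytes)       # None: 0xE9 alone is not valid UTF-8
good = decode_as_utf8(utf8_bytes)        # the original word
mojibake = utf8_bytes.decode("latin-1")  # decodes without error, but garbled
```

The first mismatch surfaces as an encoding error, while the second silently produces garbled text. Both kinds of symptom are described above.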

The code that sets up qio_glocale_utf8 is here:

In terms of design in this area, it’s my view that we need to pick one of these two options:

  1. all textual I/O is done in UTF-8 and if that’s not what you want, do binary I/O and then use a library to convert. (AFAIK this is what Rust does)
  2. all textual I/O is done with a configurable character set that can be specified when opening the file/channel/reader/writer and is then converted to Unicode / UTF-8 during I/O (because strings are always UTF-8). (AFAIK this is what Python, Java, and C# do).
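
As a rough sketch of how the two options differ from the caller's side, here they are in Python, which natively follows option 2 but can also emulate option 1. The function names, the temporary file, and the choice of Latin-1 are all illustrative assumptions; none of this is a proposed Chapel API.

```python
import os
import tempfile


def read_option1(path, source_encoding):
    # Option 1: the I/O layer only moves bytes; converting a non-UTF-8
    # file is a separate, explicit library call made by the user.
    with open(path, "rb") as f:
        raw = f.read()
    return raw.decode(source_encoding)


def read_option2(path, encoding):
    # Option 2: the character set is given when the file is opened,
    # and the I/O layer converts to Unicode while reading.
    with open(path, "r", encoding=encoding) as f:
        return f.read()


def demo():
    # Write a small Latin-1 file, then read it back both ways.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        with open(path, "wb") as f:
            f.write("café".encode("latin-1"))
        return read_option1(path, "latin-1"), read_option2(path, "latin-1")


results = demo()
```

In both functions the encoding is an explicit argument, so the result does not change with the process's LANG or LC_* settings.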

Note that neither of these depends on the C LOCALE / LANG environment variables. Also, we could arguably do (1) now and later add (2) as a non-breaking change. So maybe we could completely remove qio_glocale_utf8 and switch to assuming textual data in I/O is UTF-8 in all cases in the near term. (The reason I say "maybe" is that there might be some C code in the library somewhere that uses the C multibyte infrastructure even for UTF-8. I don't think there is, but it is something to check.)