New Issue: Can we stop requiring particular C LOCALE / LANG settings?

mppf1 · February 8, 2022, 7:24pm

19193, "mppf", "Can we stop requiring particular C LOCALE / LANG settings?", "2022-02-05T12:50:30Z"

github.com/chapel-lang/chapel

Can we stop requiring particular C LOCALE / LANG settings?

opened 12:50PM - 05 Feb 22 UTC

closed 12:45PM - 10 Mar 22 UTC

mppf

area: Runtime


There is a requirement on the environment that Chapel programs are run in that is buried in our documentation:

https://chapel-lang.org/docs/usingchapel/chplenv.html#character-set

> Chapel works with the Unicode character set with UTF-8 encoding and the traditional C collating sequence. Users are responsible for making sure that they are running Chapel in a suitable environment. For example, for en_US locale, the following environment variables should be set:
>
> LANG=en_US.UTF-8
> LC_COLLATE=C
> LC_ALL=""
>
> Note: Other character sets may be supported in the future.
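
As a rough illustration of how these variables reach a C program, the sketch below asks the C library which codeset it derives from the environment. It uses only the standard `setlocale` and `nl_langinfo` calls; the function name `environment_codeset` is made up for this example and is not part of the Chapel runtime.

```c
#include <langinfo.h>
#include <locale.h>

/* Returns the codeset name the C library derives from the LANG / LC_*
 * environment variables (hypothetical helper, for illustration only). */
const char *environment_codeset(void) {
    /* An empty locale name tells setlocale() to consult LC_ALL, LC_CTYPE,
     * and LANG. If that locale is unavailable, setlocale() returns NULL
     * and the current locale is left unchanged. */
    (void)setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}
```

With `LANG=en_US.UTF-8` set and that locale installed, printing the result would typically show `UTF-8`. Under the plain C locale the name varies by platform, which is why the runtime code quoted further down compares against more than one ASCII name.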

The situation is that, in practice, the I/O code currently assumes that non-binary data is stored as UTF-8. This is connected to the fact that our string type always stores UTF-8 and we don’t have any character set conversion in the standard library (see also #12726).

However, there is some old code in the I/O library that reads a "character" according to the C definition of a multibyte character. It does this if qio_glocale_utf8 is not set to UTF-8 or ASCII. But this leads to garbled text or bad-encoding errors, because storing the result into a string assumes it is a UTF-8 character. Besides that, we have seen some bugs in this code that only affect people with a different C LANG setting.
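
To make the failure mode concrete, here is a minimal sketch of a UTF-8 well-formedness check. It is not the Chapel runtime's code, and `is_valid_utf8` is a hypothetical name. It shows why bytes read under a non-UTF-8 codeset cannot simply be stored in a UTF-8 string: the ISO-8859-1 encoding of "é" is the single byte 0xE9, which is not valid UTF-8 on its own.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if buf[0..len) is well-formed UTF-8. Rejects truncated
 * sequences, overlong forms, surrogates, and values above U+10FFFF. */
bool is_valid_utf8(const unsigned char *buf, size_t len) {
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t extra;          /* number of continuation bytes expected */
        unsigned long cp, min; /* decoded value and smallest legal value */

        if (b < 0x80) { i++; continue; }  /* ASCII byte */
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else return false;  /* stray continuation byte or invalid lead byte */

        if (len - i <= extra) return false;  /* sequence is cut short */
        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80) return false;  /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += extra + 1;
    }
    return true;
}
```

For example, the bytes `'c','a','f',0xE9` ("café" in ISO-8859-1) are rejected, while `'c','a','f',0xC3,0xA9` (the same word in UTF-8) are accepted.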

The code that sets up qio_glocale_utf8 is here:

https://github.com/chapel-lang/chapel/blob/6702b173cfc8fffe10185e7f145c2c2e2ba55b33/runtime/src/chpl-init.c#L176-L180


      
          // Declare that we are 'locale aware' so that
          // UTF-8 functions (e.g. wcrtomb) work as
          // indicated by the locale environment variables.
          setlocale(LC_CTYPE,"");
          qio_set_glocale();

https://github.com/chapel-lang/chapel/blob/6702b173cfc8fffe10185e7f145c2c2e2ba55b33/runtime/src/qio/qio_formatted.c#L36-L57


      
          // 0 means not set
          // 1 means use faster, hard-coded UTF-8 decode/encoder
          // -1 means use C multibyte functions (e.g. mbtowc)
          // -2 means use C locale (ie 1 byte per character)
          int qio_glocale_utf8 = 0;
          
          void qio_set_glocale(void) {
          #ifdef HAS_WCTYPE_H
            char* codeset = nl_langinfo(CODESET);
          
            if( 0 == strcmp(codeset, "UTF-8") ) {
              qio_glocale_utf8 = QIO_GLOCALE_UTF8;
            } else if( 0 == strcmp(codeset, "ANSI_X3.4-1968") || // what Linux calls it
                       0 == strcmp(codeset, "US-ASCII") ) { // what Mac OS X calls it
              qio_glocale_utf8 = QIO_GLOCALE_ASCII;
            } else {
              qio_glocale_utf8 = QIO_GLOCALE_OTHER;
            }
          #else
            qio_glocale_utf8 = QIO_GLOCALE_ASCII;

This file has been truncated.

In terms of design in this area, it’s my view that we need to pick one of these two options:

1. all textual I/O is done in UTF-8 and if that’s not what you want, do binary I/O and then use a library to convert. (AFAIK this is what Rust does)
2. all textual I/O is done with a configurable character set that can be specified when opening the file/channel/reader/writer and is then converted to Unicode / UTF-8 during I/O (because strings are always UTF-8). (AFAIK this is what Python, Java, and C# do).

Note that neither of these depends on the C LOCALE / LANG environment variables. Also, we could arguably do (1) now and later add (2) as a non-breaking change. So maybe we could completely remove qio_glocale_utf8 and switch to assuming textual data in I/O is UTF-8 in all cases in the near term. (The reason I say "maybe" is that there might be some C code in the library somewhere that uses the C multibyte infrastructure even for UTF-8. I don't think there is, but it is something to check.)
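
As a sketch of what option (1) implies on the output side, a hard-coded encoder can turn a code point into UTF-8 bytes without consulting the locale at all, unlike `wcrtomb`. This is only an illustration under that assumption; `encode_utf8` is a hypothetical name and not a function from the Chapel runtime.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes code point cp as UTF-8 into out and returns the number of bytes
 * written (1 to 4). Returns 0 for surrogates and values above U+10FFFF.
 * The result never depends on LANG, LC_ALL, or setlocale(). */
size_t encode_utf8(uint32_t cp, unsigned char out[4]) {
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate range */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;  /* outside the Unicode range */
}
```

For instance, U+00E9 ("é") encodes to the two bytes 0xC3 0xA9 regardless of how the process environment is configured. A configurable character set as in option (2) could later be layered on top as a conversion step before or after a routine like this.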
