Unicode Support on FreeBSD


Overview
Basic Setup for Unicode
    Fixing an error in the UTF-8/LC_CTYPE file
    Fixing a bug in hexdump(1)/od(1)
    $LANG and $LC_CTYPE
    Converting Files Using iconv
    Keyboard Input
Graphical Mode Utilities: KDE, Firefox/Opera, Evolution, Openoffice
Text-Mode Utilities
    xterm
    XTerm Fonts
    Wide Character Libraries
        libutf-8
        libncursesw
    Utilities
        bash
        claws
        csh
        cups
        emacs
        grep
        groff/grops
        joe
        ls
        mutt
        nvi
        sh
        sylpheed
        tcsh
        zsh
Servers
    Apache Web Server
Unicode on the Console
Roadmap for Adoption of Unicode as FreeBSD's Standard Character Encoding

Overview

FreeBSD still (as of April 2006) isn't shipped with Unicode support enabled by default. With some reconfiguration however, it can be turned on, and does work fine. This document describes how to go about such a reconfiguration. The reconfiguration involves rebuilding selected utilities with their support for Unicode character sets enabled. The good news is that, with few exceptions, all the needed support is already present, so it is not hard to get things going. In a few cases, however, certain utilities do not come with Unicode support included, and these need to be patched or replaced before rebuilding. Instructions for what to do are included below.

A Little Background About Character Sets

For a long time, BSD systems used the ASCII character set. Characters were 7-bit, with codes between 0..127 or hex 0x0..0x7f. This character set has room for the alphanumeric characters A-Z, a-z and 0-9, the 33 common punctuation characters encountered on American keyboards, and another set of 33 control characters. One can write English documents and messages using ASCII, but other languages are difficult or impossible due to the lack of any accent characters or support for their character sets.

During the 1980s other expanded character sets were developed to allow larger sets of characters in a single set. Work focused on 8-bit character sets with 256 characters in them. Unfortunately, many parts of the BSD system were coded with assumptions that a character was 7 bits, so a major effort was undertaken during the late 1980s and 1990s to make the system `8-bit clean'. This involved changes to the system itself and to many utilities. The result was that FreeBSD evolved to use the 8-bit ISO8859 character set, which is still the default today. In ISO8859, the 8 bits allow for codes from 0..255 or hex 0x0..0xff. The US-ASCII characters occupy the same 0x0..0x7f positions and an additional selection of common western European accented characters occupies the 0x80..0xff positions. At the time, this was a great improvement as documents could now be written in many more languages. Unfortunately, however, 128 extra western European characters do not by any means include all European characters and are therefore not sufficient even for many Europeans (eastern European characters are missing, for example, not to mention Greek). As a result, several alternate ISO8859 character sets were developed, known as ISO8859-1, ISO8859-2, ISO8859-3 etc., each with a different set of characters in the 0x80-0xff range. ISO8859-1 contains western European characters, ISO8859-2 contains eastern European characters, and there are variants with Cyrillic, Greek, Arabic and Hebrew characters too. A document or message, however, can only use characters from one character set, so combining Hebrew with French in the same document would still pose a problem. Moreover, ISO8859 doesn't handle Asian character sets or languages which have more than 256 characters in general use and which therefore won't fit in ISO8859's 8-bit range.

In today's increasingly global world, it is becoming more and more common to have multi-lingual documents and to want to share them with folk all over the world. In fact, a simple list of people's names often poses a problem if those names would normally be written in differing character sets. It's also common to exchange messages and documents with people from many parts of the world, and while multi-lingual documents might not be necessary all the time, having to continually switch character sets for work with different people becomes tedious and annoying. The expanded 8-bit ISO8859 character sets don't solve the problem. A third evolution of character sets has therefore taken place.

Unicode encodes characters as multi-byte values, supporting a much larger range of characters. Encoding of all the world's languages is supported within Unicode, as are various specialized characters such as mathematical characters, music symbols, Braille and so on. Unicode isn't merely an extension from an 8-bit to a 16-bit character set; its code space is larger than 16 bits, with code points running up to U+10FFFF. Even within Unicode, there are alternative possible character encodings, but one such encoding, UTF-8, is commonly accepted as a good base for handling most of the world's languages without excessive overhead. Unicode's UTF-8 encoding is supported on platforms such as Linux, Mac OS X and Windows, too, which is an important aspect of making it useful for message and document interchange.
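
To make the variable-length nature of UTF-8 concrete, the following sketch dumps the bytes that encode a few characters. The helper name utf8_bytes is made up for this illustration; it only wraps printf and od. The characters are written as octal escapes so that the script itself stays pure ASCII.

```sh
#!/bin/sh
# utf8_bytes STRING
# Print the bytes of STRING as space-separated hex values.
# (Hypothetical helper name, not a standard tool.)
utf8_bytes() {
    # Unquoted on purpose: word splitting collapses od's padding
    # and line breaks into single spaces.
    echo $(printf '%s' "$1" | od -An -tx1)
}

utf8_bytes "$(printf 'e')"               # ASCII e: one byte, 65
utf8_bytes "$(printf '\303\251')"        # e-acute (U+00E9): two bytes, c3 a9
utf8_bytes "$(printf '\342\202\254')"    # euro sign (U+20AC): three bytes, e2 82 ac
```

ASCII characters keep their single-byte codes, which is why existing ASCII files need no conversion.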

To accommodate Unicode, the system and utilities must, once more, be adapted, this time to support varying length characters. This support is known as wide character support and has been implemented extensively throughout FreeBSD and many of the base system utilities and ports. Converting a FreeBSD system to use Unicode involves enabling this wide character support in system libraries and utilities where it is needed. It also involves telling the system to interpret characters as Unicode sequences and to convert keyboard keystrokes to Unicode character values.

This document is NOT a detailed tutorial on Unicode or UTF-8. If you want such details, go here.

This document describes, in practical terms, how to make Unicode work on FreeBSD. Sufficient for us here is to say that Unicode is supported using the UTF-8 encoding, and how to use this is what we discuss below.

There are several aspects to making this possible:

Each of these is described below.

Please note that this document is, by no means, a complete list of everything. We just document the basic steps needed to get Unicode working for the tools we've looked at so far. If you have feedback on this, for example, you have input for another tool you want to see added to this document, please email fbsd@opal.com.

Do keep in mind that enabling Unicode support involves rebuilding many utilities with that support enabled, so any time a system upgrade is performed you will need to redo the reconfiguration as needed.

Basic Setup for Unicode

Fixing an error in the UTF-8/LC_CTYPE file

If you are using a version of 7-current CVS'd later than 2006/07/28 06:10 UTC, this patch is already present and you do not need it. If your system is based on 7.0 code from before that date, you need to apply this patch. If you are using a version of 6.1-stable CVS'd later than 2006/08/03 08:07 UTC, this patch is already present and you do not need it. If your system is based on 6.x code from before that date, you need to apply this patch. If your system is based on 5.x code, or earlier, you need to apply this patch.

There is an error in the FreeBSD distributed UTF-8/LC_CTYPE file. The file lists the sets of combining characters (that is, characters that overstrike other characters, sometimes known as non-spacing characters) with a width of 1. They need to be listed with a width of 0.

The fix is to download and save this patch and apply it using:

    # cd /usr/src/share/mklocale
    # fetch http://opal.com/jr/freebsd/unicode/lcctype.diff
    # patch <./lcctype.diff
    # make
    # make install

Fixing a bug in hexdump(1)/od(1)

If you are using a version of 7-current CVS'd later than 2006/07/31 14:17 UTC, this patch is already present and you do not need it. If your system is based on 7.0 code from before that date, you need to apply this patch. If you are using a version of 6.1-stable CVS'd later than 2006/08/03 09:06 UTC, this patch is already present and you do not need it. If your system is based on 6.x code from before that date, you need to apply this patch. If your system is based on 5.x code, or earlier, you need to apply this patch.

The hexdump(1) and od(1) utilities have code which assumes that all characters have a width of at least 1. There's an assert check for this and hexdump/od can fail if this isn't the case. Having applied the above lcctype.diff, this assertion is no longer correct.

A patch is available here. Apply using:

    # cd /usr/src/usr.bin/hexdump
    # fetch http://opal.com/jr/freebsd/unicode/hexdump.diff
    # patch <./hexdump.diff
    # make
    # make install

$LANG and $LC_CTYPE

OK, having applied the above patches, you're ready to enable UTF-8 encoding.

First, set either your LANG or LC_CTYPE environment variable to a value which includes the UTF-8 encoding. A list of locales which support UTF-8 can be found using:

    $ ls -d /usr/share/locale/*.UTF-8
For sh/bash/zsh users:
    $ export LANG=en_US.UTF-8
For csh/tcsh users:
    % setenv LANG en_US.UTF-8
Using LANG sets your locale to use the rules of the locale's character typing, but also collation sequences, and number, date and time formatting rules. For finer control, it can be useful to set each of these individually, as in:
    $ export LC_CTYPE=en_US.UTF-8
    $ export LC_COLLATE=POSIX

You should also put these settings in your ~/.profile or ~/.login startup file. This can also be done on a system-wide basis in the /etc/profile and /etc/csh.login files.
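
As a quick check that the settings took effect, the sketch below reports which locale name governs character typing and whether it names a UTF-8 codeset. The function names are made up for this illustration. It also consults LC_ALL, which is not discussed above but which POSIX defines as overriding both LC_CTYPE and LANG.

```sh
#!/bin/sh
# effective_ctype: print the locale name that governs character typing.
# POSIX precedence: LC_ALL, then LC_CTYPE, then LANG; empty counts as unset.
effective_ctype() {
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}

# is_utf8_locale: succeed if that locale name carries a UTF-8 codeset suffix.
# This only inspects the name; it does not check that the locale is installed.
is_utf8_locale() {
    case "$(effective_ctype)" in
        *.[Uu][Tt][Ff]-8|*.[Uu][Tt][Ff]-8@*|*.[Uu][Tt][Ff]8|*.[Uu][Tt][Ff]8@*) return 0 ;;
        *) return 1 ;;
    esac
}

if is_utf8_locale; then
    echo "UTF-8 locale in effect: $(effective_ctype)"
else
    echo "not a UTF-8 locale: $(effective_ctype)"
fi
```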

For a list of which locales are available with UTF-8 support, look at the contents of the /usr/share/locale directory.

Converting Files Using iconv

Note: If you have files containing non-ASCII ISO8859 characters (i.e., characters in the range 128..255, hex 0x80..0xff), your system will now assume these are UTF-8 characters. They are not, so the characters in these files will be misinterpreted and tools that use them will start breaking. Vi, for example, will give "conversion errors" if it reads ISO8859 data while expecting UTF-8. You will need to convert such files to UTF-8 for them to be useful in your Unicode environment. You can do this using iconv from the converters/libiconv port. Add the package using:
    # pkg_add -rv libiconv
Then convert files using:
    $ iconv -f iso8859-1 -t utf-8 file >file.new
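
Running the conversion twice on the same file would garble it, so it can help to test whether a file already decodes as UTF-8 before converting. The wrapper below is a minimal sketch of that idea; the name to_utf8 and the default source encoding of ISO-8859-1 are assumptions made for this example.

```sh
#!/bin/sh
# to_utf8 FILE [SOURCE_ENCODING]
# Convert FILE to FILE.new unless it already decodes cleanly as UTF-8.
# (Hypothetical helper; the ISO-8859-1 default is an assumption.)
to_utf8() {
    file=$1
    from=${2:-ISO-8859-1}
    # Converting UTF-8 "to itself" fails on bytes that are not valid UTF-8.
    if iconv -f UTF-8 -t UTF-8 "$file" >/dev/null 2>&1; then
        echo "$file: already valid UTF-8 (or plain ASCII), skipped"
    else
        iconv -f "$from" -t UTF-8 "$file" >"$file.new" &&
            echo "$file: converted from $from to $file.new"
    fi
}
```

For example, to_utf8 notes.txt writes notes.txt.new, which can be inspected and then moved over the original. Note that the check cannot tell which 8-bit encoding a file was written in; that still has to be known or guessed.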

Keyboard Input

Keyboard input mapping depends on what keyboard you have, and which characters you want to be able to type. More than likely, you'll end up defining several keyboard maps and a means of toggling between them. For example, my keyboard is a US English one. I use a normal us keyboard map for most input. When I want to enter European characters, I press the right-hand Ctrl key (known as the RCtrl key) to toggle to us(intl) mapping.

To do this on a per-user basis, put a command such as this one in your .xinitrc or .xsession X11 startup file.

    setxkbmap "us,us(intl)" -option "grp:rctrl_toggle,lv3:ralt_switch"
This creates two maps, us and us(intl) with the RCtrl key being used to toggle between them, and the RAlt key (the right-hand Alt key) being used as a shift key to access the additional characters. The first map given, us in this case, is the one in which the keyboard map initially starts. Each time the grp key, RCtrl in this case, is pressed, the keyboard map moves to the next one in the list, wrapping to the first again at the end of the list.

You can have more than two maps. If you regularly use several languages, you may want a separate map for each language. When writing in English, for example, I set my map to the us one. When writing in French or German though, I use RCtrl to toggle to the us(intl) map and toggle back to us again once I'm done. If I used additional languages, I'd add additional maps for those languages. Additional RCtrl presses could then take me to an eastern European map, or Greek, Hebrew, Cyrillic, or one of the Asian ones - whatever I needed.

You need to memorize the layout of the characters on the keyboard for each map. In the us mapping characters are typed as labeled on the US keyboard.

In the us(intl) mapping, the accent keys `, ', ", ^ and a few others are dead keys. This means that each must be typed and followed by a suitable letter to obtain the accented version of the letter. To type an e-acute, é, for example, type ' followed by e, and to enter an a-circumflex, â, type ^ followed by a. To obtain the ASCII character for each of those accents themselves, the accent key is typed twice. In addition, a number of western European characters can be typed directly by using the RAlt key as follows:

    ~~ !¹ @˝ #¯ $£ %  ^Þ &  *˛ (˘ )° _  +÷ |¦ <ÿ
    `` 1¡ 2² 3³ 4¤ 5€ 6¼ 7½ 8¾ 9‘ 0’ -¥ =× \¬ <ÿ

         QÄ WÅ EÉ R® TÞ YÜ UÚ IÍ OÓ PÖ {« }» |¦
         qä wå eé r® tþ yü uú ií oó pö [« ]» \¬

           AÁ S§ DÐ FF GG HH JJ KK LØ :° ¨"
           aá sß dð ff gg hh jj kk lø ;¶ ´'

             ZÆ XØ C¢ VV BB NÑ Mµ <Ç >ˇ ? 
             zæ xø c© vv bb nñ mµ ,ç .˙ /¿
To be specific, each key has four values: Another useful one is:

The us(intl) map has the standard alphanumeric and punctuation keys in their normal positions and additional accented characters are typed using RAlt as a shift key. However, in many maps, standard alphanumeric and punctuation keys also move to new places. You'll need to know where characters are in each mapping and a printout of each of your maps' layouts is useful.

To determine what keyboard maps are available, look in the directory /usr/local/share/X11/xkb/symbols/pc or, on older systems, /usr/X11R6/lib/X11/xkb/symbols/pc. The files there are the basic keyboard maps. To discover any variants, such as the (intl) variant on the us map, you need to study each file's contents.
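
Rather than reading each file by eye, the variant names can be pulled out mechanically, since each variant in an XKB symbols file is introduced by a line of the form xkb_symbols "name" {. The helper name list_xkb_variants is made up for this sketch.

```sh
#!/bin/sh
# list_xkb_variants FILE
# Print the variant names declared in an XKB symbols file, one per line.
# Each variant starts with a line of the form:  xkb_symbols "name" {
list_xkb_variants() {
    sed -n 's/^.*xkb_symbols[[:space:]]*"\([^"]*\)".*$/\1/p' "$1"
}

# Example, using the directory named in the text (adjust for your system):
#   list_xkb_variants /usr/local/share/X11/xkb/symbols/pc/us
```

A name printed this way, such as intl, is what goes in parentheses after the map name in the setxkbmap command.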

Graphical Mode Utilities: KDE, Firefox/Opera, Evolution, Openoffice

Good news! Having set up your locale and keyboard map, all these graphical utilities just work.

If you're a user who lives within a graphical world only, and never uses text-mode utilities, you are done! Enjoy. You don't need the rest of this document. The following involves rebuilding/recompiling ports and tools in the base system, and possibly also patching them beforehand.

Text-Mode Utilities

If you're a user who uses xterm command-line, text-mode utilities, you'll need to review the following sections to see how to rebuild specific ports to make them work in a Unicode world.

xterm

Depending on your version of xterm, you may need to recompile it for wide character support. Prior to January 8, 2009, xterm did not need any special compilation. On that date, wide character support was removed as a default on the grounds that it breaks 8-bit character sets. To build it with wide character support, simply use:
    # cd /usr/ports/x11/xterm
    # make -DWITH_WIDE_CHARS
    # make install
or add WITH_WIDE_CHARS=1 to /etc/make.conf.

XTerm Fonts

Choose an xterm font with 10646 in the name, such as:
    -misc-fixed-medium-r-normal--18-120-100-100-c-90-iso10646-1
    -misc-fixed-medium-r-normal--15-140-75-75-c-90-iso10646-1
    -misc-fixed-medium-r-normal--14-130-75-75-c-70-iso10646-1
You can get a complete list of the 10646 fonts on your system using:
    $ grep 10646 /usr/local/lib/X11/fonts/*/fonts.dir
or (on older systems):
    $ grep 10646 /usr/X11R6/lib/X11/fonts/*/fonts.dir
Note that different fonts include different combinations of encoded characters and languages. You may need to experiment with various fonts to find one which includes the characters/languages you need.

Once you have chosen a suitable font, add lines such as the following to your ~/.Xdefaults file:

    xterm*vt100.font:               -misc-fixed-medium-r-normal--15-140-75-75-c-90-iso10646-1
    xterm*vt100.utf8Fonts.font:     -misc-fixed-medium-r-normal--15-140-75-75-c-90-iso10646-1
Reload that file using xrdb .Xdefaults, then start a new xterm. Download and save this file, then cat it in your xterm to verify you can see all the pretty Unicode characters.
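
A rough local check can also be generated without any sample file. The sketch below prints a handful of characters from different scripts, written as octal escapes of their UTF-8 bytes; the function name and the choice of characters are arbitrary. If the font and locale are set up correctly, each line should show readable glyphs rather than boxes or pairs of accented Latin letters.

```sh
#!/bin/sh
# unicode_sample: print a few non-ASCII characters from different scripts.
# All characters are given as octal escapes of their UTF-8 bytes.
unicode_sample() {
    printf 'Latin:    \303\251 \303\274 \303\261\n'    # e-acute, u-diaeresis, n-tilde
    printf 'Greek:    \316\261 \316\262 \316\263\n'    # alpha, beta, gamma
    printf 'Cyrillic: \320\226 \320\264 \321\217\n'    # Zhe, de, ya
    printf 'Symbols:  \342\202\254 \342\206\222\n'     # euro sign, rightwards arrow
}

unicode_sample
```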

Prerequisite Wide-Character Libraries

libutf-8 library

The libutf-8.so library is needed to support wide characters in the libncursesw.so library.

Port converters/libutf-8 can be installed without changes from the ports or as a pre-compiled package:

    # pkg_add -rv libutf-8

ncursesw library

The libncursesw.so library provides wide character support to curses-based applications, such as the vi editor, the mutt mailer, etc.

The ncurses library is part of the base system, and as of March 9th, 2007, 7-current includes wide character support as standard. Since April 7th, 2007 6-stable also includes this as standard.

However the 5.x version still uses an older version of ncurses which doesn't have wide character support. An alternate version in the ports tree, devel/ncurses does have support for wide characters. Since May 22nd, 2006, wide character support is enabled in this port by default, so it's simply necessary to build the port or install a pre-compiled package:

    # pkg_add -rv ncurses
For earlier port versions, wide character support was not enabled by default, and requires local compilation:
    # cd /usr/ports/devel/ncurses
    edit the Makefile and uncomment the --enable-widec keyword already there
    # make
    # make install
It will install in /usr/local/lib as libncursesw.so. There is no need to remove the base system libncurses.so.

Utilities

bash

Works as is.

claws

Works as is.

csh

FreeBSD's csh is actually tcsh.

cups

Cups uses other tools such as groff to actually process the files for printing. Groff is installed from the base system and the version there is not built with wide character (multibyte) support. See the section on groff for further information.

At this time, the simplest thing for printing many text files is to convert them to ISO8859-1 for printing (however, this won't handle any Unicode characters that have no ISO8859-1 equivalent):

    $ iconv -f utf-8 -t iso8859-1 file | lpr

egrep

See grep.

emacs

If you install the devel/ncurses version of the ncurses library, as described above, and choose an iso10646 font, emacs works as is both in curses mode and X11 mode.

ex

See nvi.

fgrep

See grep.

grep

Works as is.

groff/grops

Groff and its grops backend convert files to PostScript for printing or other use. Groff is part of the FreeBSD base system, and is not installed from the ports tree. As shipped, the source (in /usr/src/contrib/groff) does not include wide character (multibyte) support.

A simple solution is to install the japanese/groff port which does include wide character support. However note that this is a slightly older version of groff. The system groff is renamed to _groff to ensure that the version from the port is what is used.

    # cd /usr/ports/japanese/groff
    # make
    # make install
    # mv /usr/bin/groff /usr/bin/_groff

joe

Works as is.

ls

Works as is.

mutt

Has wide character support, but needs compiling with it enabled.

For the mail/mutt 1.4.2 stable version of mutt:

    # cd /usr/ports/mail/mutt
    Edit the Makefile to change the dependency on ncurses.5 to ncursesw.5
    # make -DWITH_NCURSES_PORT
    # make -DWITH_NCURSES_PORT install
or
    # make -DWITH_NCURSES_PORT WITH_SGML_DOCS=no
    # make -DWITH_NCURSES_PORT WITH_SGML_DOCS=no install
For the mail/mutt-devel 1.5.x development version of mutt:
    # make -DWITH_MUTT_NCURSES_PORT
    # make -DWITH_MUTT_NCURSES_PORT install
To use, add to ~/.muttrc:
    set send_charset="US-ASCII:UTF-8"
or just:
    set send_charset="UTF-8"
The former variant of send_charset tells mutt to tag outgoing mail as US-ASCII if the content of the message can be represented in US-ASCII, or as UTF-8 if it cannot. This behavior is consistent with other mailers. However, some mailers apparently limit the encoding of replies to the charset of the received message, which makes it impossible to reply from such mailers with a message containing UTF-8 characters when the original contained only US-ASCII. The second variant of send_charset, above, tags all your outgoing messages as UTF-8, which avoids this problem, but which may cause problems if you send an ASCII message to someone with a mailer which does not support UTF-8.

If your incoming mail is not displayed properly, check that you have the LANG environment variable properly set (as described above) or configure mutt:

    set charset="UTF-8"
If you want color, before running mutt, set the environment variable:
    export TERM=xterm-color

nvi

FreeBSD's vi is actually nvi. The version in the base system is version 1.79 which does not have wide character support. However, the version in port editors/nvi-devel is version 1.81.5 and does have wide character support. It's necessary to compile this port with that support enabled, install it, then rename the base system's vi and ex to ensure that the one from the port is used.
    # cd /usr/ports/editors/nvi-devel
    edit the Makefile and add the following:

	CONFIGURE_ARGS+=        --enable-widechar
    # make
    # make install
    # mv /usr/bin/vi /usr/bin/_vi
    # mv /usr/bin/view /usr/bin/_view
    # mv /usr/bin/ex /usr/bin/_ex
    # mv /usr/bin/nvi /usr/bin/_nvi
    # mv /usr/bin/nview /usr/bin/_nview
    # mv /usr/bin/nex /usr/bin/_nex
Note that there are some bugs in this version of vi. These don't cause undue problems, but should be borne in mind:

sh

The FreeBSD base system sh does not support wide characters needed for Unicode/UTF-8. Other shells, such as csh, tcsh and zsh do.

sylpheed

Works as is.

tcsh

Works as is.

vi

See nvi.

zsh

Update: I have just been informed (May 1st, 2006) by the maintainer of this port, that he has committed an update to 4.3.2 with multibyte enabled by default. So, if you get this version of the port, it should just work as is. If you have an earlier version of the port, the following applies.

The version of zsh in the ports is currently (April 2006) 4.2.6. Unicode support is not present in this version; however, it is present in version 4.3.2, which is the latest version of zsh available from zsh.org. It is necessary to manually download this version of the zsh source code and compile it yourself.

Use of the configure option --enable-multibyte is needed in order to enable the Unicode functionality.

Once the port is updated to 4.3.2, it will still be necessary to add the --enable-multibyte option to the CONFIGURE_ARGS variable in the Makefile unless this is enabled by default in the port.
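
To see whether an installed zsh is at least 4.3.2, a dotted-version comparison such as the following can be used. This is a minimal sketch: version_ge is a made-up helper, it handles purely numeric components only, and the usage line assumes that zsh --version prints the version number as its second word.

```sh
#!/bin/sh
# version_ge A B
# Succeed if dotted version A is greater than or equal to B.
# Components must be plain numbers; missing components count as 0.
version_ge() {
    a=$1 b=$2
    while [ -n "$a" ] || [ -n "$b" ]; do
        x=${a%%.*} y=${b%%.*}          # leading component of each version
        [ "${x:-0}" -gt "${y:-0}" ] && return 0
        [ "${x:-0}" -lt "${y:-0}" ] && return 1
        # Drop the component just compared.
        case $a in *.*) a=${a#*.} ;; *) a= ;; esac
        case $b in *.*) b=${b#*.} ;; *) b= ;; esac
    done
    return 0
}

# Usage (assumes "zsh --version" prints e.g. "zsh 4.3.2 (...)"):
#   version_ge "$(zsh --version | awk '{print $2}')" 4.3.2 && echo "new enough"
```

A sufficiently new version is necessary but not sufficient: the binary must also have been built with --enable-multibyte.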

Servers

Apache Web Server

Apache needs some way of determining the encoding of text and html files it is serving so that it can indicate the correct encoding in the HTTP MIME headers. There are various ways of doing this, including specifying file types using filename extensions (e.g., foo.utf8) or by embedding special tags (such as META HTTP-EQUIV) within the file.

In the absence of such clues, apache uses the encoding specified in its configuration file as the default. Once you have converted all files with 8-bit data in them to UTF-8, you'll probably want to change this default. For the www/apache2 version, in the configuration file /usr/local/etc/apache2/httpd.conf, look for and change:

    AddDefaultCharset ISO-8859-1
to
    AddDefaultCharset UTF-8
For the www/apache22 version, there is no AddDefaultCharset directive in the default configuration file. In this case, the equivalent configuration is to edit /usr/local/etc/apache22/httpd.conf and uncomment the line:
    #Include etc/apache22/extra/httpd-languages.conf
to
    Include etc/apache22/extra/httpd-languages.conf
then edit the file /usr/local/etc/apache22/extra/httpd-languages.conf and add:
    #
    # Specify a default charset for all pages sent out. This is
    # always a good idea and opens the door for future internationalisation
    # of your web site, should you ever want it. Specifying it as
    # a default does little harm; as the standard dictates that a page
    # is in iso-8859-1 (latin1) unless specified otherwise i.e. you
    # are merely stating the obvious. There are also some security
    # reasons in browsers, related to javascript and URL parsing
    # which encourage you to always set a default char set.
    #
    AddDefaultCharset UTF-8
after the ForceLanguagePriority directive and before the set of AddCharset directives.

Note that if all the files on your web server contain only 7-bit ASCII characters, you don't need to do this as it makes no difference. But if you have 8-bit characters in text or html files on your web server, you will want to convert them and change this setting, else it will be a pain to edit them. Convert them using iconv as shown above.
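
To find out which files still need converting before the default is changed, something like the following sketch can be run over the document root. The function name find_non_utf8 is made up for this example, and only the .html and .txt extensions are checked; extend the list to suit your site.

```sh
#!/bin/sh
# find_non_utf8 DIR
# List .html and .txt files under DIR that do not decode as UTF-8,
# i.e. the ones that still need converting. Pure-ASCII files are
# valid UTF-8 and are therefore not listed.
# (Hypothetical helper; file names containing newlines are not handled.)
find_non_utf8() {
    find "$1" -type f \( -name '*.html' -o -name '*.txt' \) -print |
    while IFS= read -r f; do
        iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1 || printf '%s\n' "$f"
    done
}

# Example (substitute your own document root):
#   find_non_utf8 /usr/local/www/data
```

An empty listing means every checked file is either plain ASCII or already valid UTF-8.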

Unicode on the Console

Use of UTF-8 on the FreeBSD console doesn't work at the moment. The console uses display fonts and keyboard mappings from /usr/share/syscons/{fonts,keymaps}, which are managed using vidfont(1) and kbdmap(1). The source for the font and keyboard definitions is in /usr/src/share/syscons/{fonts,keymaps}.

For UTF-8 support a font definition containing a reasonable set of characters as well as a set of suitable keyboard maps are needed.

The sysutils/jfbterm port provides a framebuffered console which does have UTF-8 support.

Roadmap for Adoption of Unicode as FreeBSD's Standard Character Encoding

Before UTF-8 can be adopted as FreeBSD's standard character encoding, there needs to be:
  1. General discussion on the FreeBSD mailing lists and consensus to move ahead with this.

    This process has started (June 2006) with some initial discussion on the freebsd-current@ list (see here) and with publication of the link to this document in that discussion. Resulting from this, many hundreds of views of this page have taken place as have downloads of the patches documented here. This means that folk are now reviewing this and checking for problems.

  2. Done: July 31, 2006. Commits of the patches that are necessary for UTF-8 support.

    In due course, and in the absence of negative feedback suggesting insurmountable problems, it will be appropriate to post the patches to GNATS as FreeBSD update requests and request that a committer add them to head and merge to 6.1-STABLE. This is anticipated in mid July 2006.

  3. Work done to update libraries and utilities in the base system from their newer versions with wide-character support.

    There will then have to be more discussion of merging the newer versions of ncurses and nvi from their ports to replace the older versions currently in the base system.

    Done: March 9, 2007. Merged into 6-stable: April 7, 2007. A request for the ncurses update was made in the same thread on freebsd-current@ but the maintainer of ncurses in the base system indicated that he is busy for now and may not get to this soon.

    Discussion with the maintainer of the nvi port is in progress, but work here is also slow. The problems with the new version of nvi documented above in this document will need working out as part of this step.

  4. Work done to find console fonts and keyboard maps with ISO10646 UTF-8 character support.

    As noted above, the Linux community may already have done this and their stuff may be compatible.

  5. Work done to add support for collation (LC_COLLATE) to the UTF-8 locale.

    Support of collation in UTF-8 will need work from the larger community.

  6. Preparation of notes describing how to switch your system encoding to UTF-8 and then how to convert existing 8-bit files on your system from your current encoding to UTF-8, and how to assist users/customers to do the same (e.g., virtual web host users/customers on an apache server).

  7. Updates to ports to enable wide-character support as needed.

  8. Updates to the FreeBSD Handbook.

  9. Preparation of a Heads Up about the change.

If you are able to contribute to any of these items, please contact fbsd@opal.com and I will help coordinate things.

Last updated: Saturday, 2009 April 11 at 10:49 UTC