During the 1980s other expanded character sets were developed to allow larger sets of characters in a single set. Work focused on 8-bit character sets with 256 characters in them. Unfortunately, many parts of the BSD system were coded with the assumption that a character was 7 bits, so a major effort was undertaken during the late 1980s and 1990s to make the system "8-bit clean". This involved changes to the system itself and to many utilities. The result was that FreeBSD evolved to use the 8-bit ISO8859 character set, which is still the default today. In ISO8859, the 8 bits allow for codes from 0..255 or hex 0x0..0xff. The US-ASCII characters occupy the same 0x0..0x7f positions, and an additional selection of common western European accented characters occupies the 0x80..0xff positions. At the time, this was a great improvement, as documents could now be written in many more languages. Unfortunately, 128 extra western European characters do not come close to covering all European characters, so the set is not sufficient even for many Europeans (it lacks eastern European characters, for example, not to mention Greek). As a result, several alternate ISO8859 character sets were developed, known as ISO8859-1, ISO8859-2, ISO8859-3 etc., each with a different set of characters in the 0x80..0xff range. ISO8859-1 contains western European characters, ISO8859-2 contains eastern European characters, and there are variants with Cyrillic, Greek, Arabic and Hebrew characters too. A document or message, however, can only use characters from one character set, so combining Hebrew with French in the same document would still pose a problem. Moreover, ISO8859 doesn't handle Asian character sets, or languages which have more than 256 characters in general use and which therefore won't fit in ISO8859's 8-bit range.
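To make the clash between the ISO8859 variants concrete, the following sketch decodes one and the same byte, 0xf1, under four of them. It assumes an iconv(1) that knows these charset names and a UTF-8 terminal to show the result; the helper name decode_f1 is made up for this illustration.

```shell
#!/bin/sh
# decode_f1 CHARSET: interpret the single byte 0xf1 (octal 361) as a
# character of CHARSET and print it re-encoded as UTF-8.
# (decode_f1 is a hypothetical helper, not an existing utility.)
decode_f1() {
    printf '\361' | iconv -f "$1" -t UTF-8
}

# Same byte, four different characters depending on the chosen set.
for cs in ISO-8859-1 ISO-8859-2 ISO-8859-5 ISO-8859-7; do
    printf '%s: %s\n' "$cs" "$(decode_f1 "$cs")"
done
```

On a UTF-8 terminal the four lines should show ñ, ń, ё and ρ respectively, which is why a file of 8-bit text is ambiguous unless the reader knows which ISO8859 variant it was written in.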
In today's increasingly global world, it is becoming more and more common to have multi-lingual documents and to want to share them with folk all over the world. In fact, a simple list of people's names often poses a problem if those names would normally be written in differing character sets. It's also common to exchange messages and documents with people from many parts of the world, and while multi-lingual documents might not be necessary all the time, having to continually switch character sets for work with different people becomes tedious and annoying. The expanded 8-bit ISO8859 character sets don't solve the problem. A third evolution of character sets has therefore taken place.
Unicode encodes characters as multi-byte values, supporting a much larger range of characters. Encoding of all the world's languages is supported within Unicode, as are various specialized characters such as mathematical characters, music symbols, Braille and so on. Unicode isn't merely an extension from an 8-bit to a 16-bit character set: its code points range from U+0000 to U+10FFFF, which takes 21 bits to represent. Even within Unicode, there are alternative possible character encodings, but one such encoding, UTF-8, is commonly accepted as a good base for handling most of the world's languages without excessive overhead. Unicode's UTF-8 encoding is also supported on platforms such as Linux, Mac OS X and Windows, which is an important aspect of making it useful for message and document interchange.
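The varying length of UTF-8 characters is easy to see from the shell. The sketch below dumps the bytes of four characters of increasing code point; utf8_bytes is a made-up helper name, and the sample characters are my own choice. The bytes are written as octal escapes so the script does not depend on the encoding of the file it is saved in.

```shell
#!/bin/sh
# utf8_bytes STRING: print the bytes of STRING as hex, e.g. "c3a9".
# (utf8_bytes is a hypothetical helper name used only in this sketch.)
utf8_bytes() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

a=$(printf 'A')                      # U+0041, 1 byte
e_acute=$(printf '\303\251')         # U+00E9, 2 bytes
euro=$(printf '\342\202\254')        # U+20AC, 3 bytes
clef=$(printf '\360\235\204\236')    # U+1D11E musical G clef, 4 bytes

for ch in "$a" "$e_acute" "$euro" "$clef"; do
    printf '%s -> %s\n' "$ch" "$(utf8_bytes "$ch")"
done
```

US-ASCII characters keep their single-byte values, so plain ASCII files are already valid UTF-8; only characters beyond 0x7f grow to two or more bytes.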
To accommodate Unicode, the system and utilities must once more be adapted, this time to support varying-length characters. This support is known as wide character support and has been implemented extensively throughout FreeBSD and many of the base system utilities and ports. Converting a FreeBSD system to use Unicode involves enabling this wide character support in system libraries and utilities where it is needed. It also involves telling the system to interpret characters as Unicode sequences and to convert keyboard keystrokes to Unicode character values.
This document is NOT a detailed tutorial on Unicode or UTF-8. If you want such details, go here.
This document describes, in practical terms, how to make Unicode work on FreeBSD. It is sufficient to say here that Unicode is supported using the UTF-8 encoding; how to use it is what we discuss below.
There are several aspects to making this possible:
Please note that this document is by no means a complete list of everything. We just document the basic steps needed to get Unicode working for the tools we've looked at so far. If you have feedback on this, for example input on another tool you want to see added to this document, please email fbsd@opal.com.
Do keep in mind that, since enabling Unicode support involves rebuilding many
utilities with that support enabled, the reconfiguration must be redone as
needed any time a system upgrade is performed.
There is an error in the UTF-8/LC_CTYPE file distributed with FreeBSD.
The file lists the combining characters (that is,
characters that overstrike other characters, sometimes known as non-spacing
characters) with a width of 1. They need to be listed with a width of 0.
The fix is to download and save this patch
and apply it using:
The hexdump(1) and od(1) utilities have code which assumes
that all characters have a width of at least 1. There's an assert check
for this and hexdump/od can fail if this isn't the case. Having applied the above
lcctype.diff, this assertion is no longer correct.
A patch is available here. Apply using:
First, set either your LANG or LC_CTYPE environment variable to a value
which includes the UTF-8 encoding.
A list of locales which support UTF-8 can be found using:
You should also put these settings in your ~/.profile or ~/.login
startup file. This can also be done on a system-wide basis in the /etc/profile
and /etc/csh.login files.
For a list of which locales are available with UTF-8 support, look at the
contents of the /usr/share/locale directory.
To do this on a per-user basis, put a command such as this one in your .xinitrc
or .xsession X11 startup file.
You can have more than two maps. If you regularly use several languages, you
may want a separate map for each language. When writing in English, for example,
I set my map to the us one. When writing in French or German though, I
use RCtrl to toggle to the us(intl) map and toggle back to us
again once I'm done. If I used additional languages, I'd add additional maps
for those languages.
Additional RCtrl presses could then take me to an eastern European map,
or Greek, Hebrew, Cyrillic, or one of the Asian ones - whatever I needed.
You need to memorize the layout of the characters on the keyboard for each map.
In the us mapping, characters are typed as labeled on the US keyboard.
In the us(intl) mapping, the accent keys `, ', ", ^ and a few others
are dead keys. This means that each must be typed and followed by a suitable
letter to obtain the accented version of the letter. To type an e-acute, é,
for example, type ´ followed by e, and to enter an a-circumflex,
â, type ^ followed by a. To obtain the ASCII
character for one of those accents itself, type the accent key twice.
In addition, a number of western European characters can be typed directly by
using the RAlt key as follows:
The us(intl) map has the standard alphanumeric and punctuation keys
in their normal positions, and additional accented characters are typed using
RAlt as a shift key. However, in many maps, standard alphanumeric and
punctuation keys also move to new places. You'll need to know where
characters are in each mapping and a printout of each of your maps' layouts
is useful.
To determine what keyboard maps are available, look in the directory
/usr/local/share/X11/xkb/symbols/pc or, on older systems,
/usr/X11R6/lib/X11/xkb/symbols/pc. The files there are the basic keyboard
maps. To discover any variants, such as the (intl) variant on
the us map, you need to study each file's contents.
If you're a user who lives within a graphical
world only, and never uses text-mode utilities, you are done! Enjoy. You don't need
the rest of this document. The following involves rebuilding/recompiling ports
and tools in the base system, and possibly also patching them beforehand.
Choose a suitable font, then add lines such as the following to your ~/.Xdefaults file:
Port converters/libutf-8 can be installed without changes from the ports or as a
pre-compiled package:
The ncurses library is part of the base system, and as of
March 9th, 2007, 7-current includes wide character support as
standard. Since April 7th, 2007 6-stable also includes this as
standard.
However, the 5.x version still uses an older version of ncurses
which doesn't have wide character support. An alternate
version in the ports tree, devel/ncurses, does have support for
wide characters.
Since May 22nd, 2006, wide character support is enabled in this port
by default, so it's simply necessary to build the port or install a
pre-compiled package:
At this time, the simplest thing for printing many text files is to convert them
to ISO8859 for printing (however, this won't handle any non-ISO8859 Unicode characters):
A simple solution is to install the japanese/groff port which does include wide
character support. However note that this is a slightly older version of groff.
The system groff is renamed to _groff to ensure that the version from the port
is what is used.
For the mail/mutt 1.4.2 stable version of mutt:
If your incoming mail is not displayed properly, check that you have the
LANG environment variable properly set (as described above) or configure mutt:
The version of zsh in the ports is currently (April 2006) 4.2.6. Unicode
support is not present in this version; however, it is present in version 4.3.2, which
is the latest version of zsh available from zsh.org.
It is necessary to manually download this version of the zsh source code
and compile it yourself.
Use of the configure option --enable-multibyte is needed in order to
enable the Unicode functionality.
Once the port is updated to 4.3.2, it will still be necessary to add the --enable-multibyte
option to the CONFIGURE_ARGS variable in the Makefile unless this is
enabled by default in the port.
In the absence of such clues, Apache uses the encoding specified in its
configuration file as the default. Once you have converted all files with 8-bit
data in them to UTF-8, you'll probably want to change this default. For the
www/apache2 version, in the
configuration file, /usr/local/etc/apache2/httpd.conf, look for and change:
Note that if all the files on your web server contain only 7-bit ASCII characters,
you don't need to do this as it makes no difference. But if you have 8-bit characters
in text or html files on your web server, you will want to convert them and change
this setting, else it will be a pain to edit them. Convert them using iconv
as shown above.
For UTF-8 support a font definition containing a reasonable
set of characters as well as a set of suitable keyboard
maps are needed.
The sysutils/jfbterm port provides a framebuffered console which does have UTF-8 support.
This process has started (June 2006) with some initial
discussion on the freebsd-current@ list (see
here)
and with publication of the link to this document in that discussion.
Resulting from this, many hundreds of views of this page have taken
place as have downloads of the patches documented here. This means
that folk are now reviewing this and checking for problems.
In due course, and in the absence of negative feedback suggesting
insurmountable problems, it will be appropriate to post the patches
to GNATS as FreeBSD update requests and request that a committer
add them to head and merge to 6.1-STABLE. This is anticipated in mid
July 2006.
There will then have to be more discussion of merging the newer
versions of
Done: March 9, 2007. Merged into 6-stable: April 7, 2007.
Discussion with the maintainer of the nvi port is
in progress, but work here is also slow.
The problems with the new version of nvi documented above
in this document will need working out as part of this step.
As noted above, the Linux community may already have done this
and their stuff may be compatible.
Support of collation in UTF-8 will need work from the
larger community.
If you are able to contribute to any of these items, please contact
fbsd@opal.com and I will help coordinate
things.
Last updated: Saturday, 2009 April 11 at 10:49 UTC
Basic Setup for Unicode
Fixing an error in the UTF-8/LC_CTYPE file
If you are using a version of 7-current CVS'd later than 2006/07/28 06:10 UTC,
this patch is already present and you do not need it. If your system is based
on 7.0 code from before that date, you need to apply this patch.
If you are using a version of 6.1-stable CVS'd later than 2006/08/03 08:07 UTC,
this patch is already present and you do not need it. If your system is based
on 6.x code from before that date, you need to apply this patch.
If your system is based on 5.x code, or earlier, you need to apply this patch.
# cd /usr/src/share/mklocale
# fetch http://opal.com/jr/freebsd/unicode/\
lcctype.diff
# patch <./lcctype.diff
# make
# make install
Fixing a bug in hexdump(1)/od(1)
If you are using a version of 7-current CVS'd later than 2006/07/31 14:17 UTC,
this patch is already present and you do not need it. If your system is based
on 7.0 code from before that date, you need to apply this patch.
If you are using a version of 6.1-stable CVS'd later than 2006/08/03 09:06 UTC,
this patch is already present and you do not need it. If your system is based
on 6.x code from before that date, you need to apply this patch.
If your system is based on 5.x code, or earlier, you need to apply this patch.
# cd /usr/src/usr.bin/hexdump
# fetch http://opal.com/jr/freebsd/unicode/\
hexdump.diff
# patch <./hexdump.diff
# make
# make install
$LANG and $LC_CTYPE
OK, having applied the above patches, you're ready to enable UTF-8 encoding.
$ ls -d /usr/share/locale/*.UTF-8
For sh/bash/zsh users:
$ export LANG=en_US.UTF-8
For csh/tcsh users:
% setenv LANG en_US.UTF-8
Setting LANG selects not only the locale's character typing rules,
but also its collation sequence and its number, date and time formatting rules.
For finer control, it can be useful to set each of these individually, as in:
$ export LC_CTYPE=en_US.UTF-8
$ export LC_COLLATE=POSIX
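Several variables can select the character typing locale, so it is easy to export one and have it overridden by another. The sketch below follows the POSIX precedence (LC_ALL first, then LC_CTYPE, then LANG) to report which setting wins and whether it names a UTF-8 locale. The function names are made up for this illustration, the fallback to "C" is the usual default rather than a guarantee, and the name check is only a heuristic on the locale string.

```shell
#!/bin/sh
# effective_ctype LC_ALL LC_CTYPE LANG
# Print the locale name that governs character typing, using the POSIX
# precedence: LC_ALL overrides LC_CTYPE, which overrides LANG. Empty
# values count as unset; with nothing set we assume the "C" locale.
effective_ctype() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"
    elif [ -n "$3" ]; then
        printf '%s\n' "$3"
    else
        printf 'C\n'
    fi
}

# names_utf8 LOCALE: succeed if the locale name carries a UTF-8 codeset,
# e.g. en_US.UTF-8. A heuristic on the name only.
names_utf8() {
    case $1 in
        *.UTF-8|*.UTF-8@*|*.utf8|*.utf8@*) return 0 ;;
        *) return 1 ;;
    esac
}

current=$(effective_ctype "${LC_ALL:-}" "${LC_CTYPE:-}" "${LANG:-}")
if names_utf8 "$current"; then
    echo "character typing locale: $current (UTF-8)"
else
    echo "character typing locale: $current (not UTF-8)"
fi
```

A stray LC_ALL=C left in a startup file is the case this catches: it silently defeats both of the exports shown above.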
Converting Files Using iconv
Note: If you have files containing non-ASCII ISO8859 characters (i.e.,
characters in the range 128..255, hex 0x80..0xff), your system will now assume these are UTF-8
characters. They're not though, and the characters in these files will be misinterpreted
which means that tools that use them will start breaking. Vi, for example, will
give "conversion errors" if it reads ISO8859 data when thinking it has UTF-8. You
will need to convert such files to UTF-8 for them to be useful in your Unicode
environment. You can do this using iconv from the converters/libiconv
port. Add the package using:
# pkg_add -rv libiconv
Then convert files using:
$ iconv -f iso8859-1 -t utf-8 file >file.new
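To see what the conversion does at the byte level, and to avoid converting a file twice, something like the following can help. It is a sketch under assumptions: hex and is_utf8 are made-up helper names, menu.txt is a throwaway sample file, and the validity check relies on iconv exiting with an error when its input is not valid UTF-8.

```shell
#!/bin/sh
# hex FILE: print the bytes of FILE as one run of hex digits.
hex() {
    od -An -tx1 "$1" | tr -d ' \n'
    echo
}

# is_utf8 FILE: succeed if FILE decodes cleanly as UTF-8. Pure ASCII
# files pass too, since ASCII is a subset of UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

work=$(mktemp -d) || exit 1

# "cafe" with e-acute in ISO8859-1: the accented letter is the single
# byte 0xe9 (octal 351).
printf 'caf\351\n' >"$work/menu.txt"
hex "$work/menu.txt"

# Convert only when needed, so that a file which is already UTF-8 is
# not converted a second time.
if is_utf8 "$work/menu.txt"; then
    cp "$work/menu.txt" "$work/menu.txt.new"
else
    iconv -f ISO-8859-1 -t UTF-8 "$work/menu.txt" >"$work/menu.txt.new"
fi

converted=$(hex "$work/menu.txt.new")
echo "$converted"
rm -rf "$work"
```

The first dump ends in e90a and the second in c3a90a: the one-byte ISO8859-1 character has become the two-byte UTF-8 sequence. Running iconv from ISO8859-1 over a file that is already UTF-8 would mangle it, which is the reason for the check.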
Keyboard Input
Keyboard input mapping depends on what keyboard you have, and which characters
you want to be able to type. More than likely, you'll end up defining several
keyboard maps and a means of toggling between them. For example, my keyboard
is a US English one. I use a normal us keyboard map for most input. When I
want to enter European characters, I press the right-hand Ctrl key (known as
the RCtrl key) to toggle to us(intl) mapping.
setxkbmap "us,us(intl)" -option \
"grp:rctrl_toggle,lv3:ralt_switch"
This creates two maps, us and us(intl) with the RCtrl key
being used to toggle between them, and the RAlt key (the right-hand Alt key)
being used as a shift key to access the additional characters.
The first map given, us in this case, is the one in which the
keyboard map initially starts. Each
time the grp key, RCtrl in this case, is pressed, the keyboard map
moves to the next one in the list, wrapping to the first again at the end of the
list.
~~ !¹ @˝ #¯ $£ % ^Þ & *˛ (˘ )° _ +÷ |¦ <ÿ
`` 1¡ 2² 3³ 4¤ 5€ 6¼ 7½ 8¾ 9‘ 0’ -¥ =× \¬ <ÿ
QÄ WÅ EÉ R® TÞ YÜ UÚ IÍ OÓ PÖ {« }» |¦
qä wå eé r® tþ yü uú ií oó pö [« ]» \¬
AÁ S§ DÐ FF GG HH JJ KK LØ :° ¨"
aá sß dð ff gg hh jj kk lø ;¶ ´'
ZÆ XØ C¢ VV BB NÑ Mµ <Ç >ˇ ?
zæ xø c© vv bb nñ mµ ,ç .˙ /¿
To be specific, each key has four values:
Another useful one is:
Graphical Mode Utilities: KDE, Firefox/Opera, Evolution, Openoffice
Good news! Having set up your locale and keyboard map, all these graphical
utilities just work.
Text-Mode Utilities
If you're a user who uses xterm command-line, text-mode utilities, you'll need to
review the following sections to see how to rebuild specific ports to make them
work in a Unicode world.
xterm
Depending on your version of xterm, you may need to recompile it for wide character
support.
Prior to January 8, 2009,
xterm did not need any special compilation. On that date,
wide character support was removed as a default on the grounds that it breaks
8-bit character sets.
To build it with wide character support, simply use:
# cd /usr/ports/x11/xterm
# make -DWITH_WIDE_CHARS
# make install
or add WITH_WIDE_CHARS=1 to /etc/make.conf.
XTerm Fonts
Choose an xterm font with 10646 in the name, such as:
-misc-fixed-medium-r-normal--18-120-100-100-c-90-iso10646-1
-misc-fixed-medium-r-normal--15-140-75-75-c-90-iso10646-1
-misc-fixed-medium-r-normal--14-130-75-75-c-70-iso10646-1
You can get a complete list of the 10646 fonts on your system using:
$ grep 10646 /usr/local/lib/X11/fonts/*/fonts.dir
or (on older systems):
$ grep 10646 /usr/X11R6/lib/X11/fonts/*/fonts.dir
Note that different fonts include different combinations of encoded characters
and languages. You may need to experiment with various fonts to find one which
includes the characters/languages you need.
xterm*vt100.font: -misc-fixed-medium-r-normal--15-140-75-75-c-90-iso10646-1
xterm*vt100.utf8Fonts.font: -misc-fixed-medium-r-normal--15-140-75-75-c-90-iso10646-1
Reload that file using xrdb .Xdefaults.
Then start a new xterm. Download and save this file then
cat it on your xterm to verify you can see all the pretty Unicode characters.
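If the sample file is not to hand, a few lines of test text can be produced locally. The sketch below is a stand-in, not the file referred to above: utf8_sample is a made-up name, and the characters were picked only to exercise several scripts and one combining accent. The bytes are written as octal escapes so the script itself stays plain ASCII.

```shell
#!/bin/sh
# utf8_sample: print one line per script, each encoded as UTF-8.
utf8_sample() {
    # U+00E9 e-acute, U+00DF sharp s, U+00F8 o-slash
    printf 'Latin:     \303\251 \303\237 \303\270\n'
    # U+20AC euro sign
    printf 'Currency:  \342\202\254\n'
    # U+03B1 alpha, U+03B2 beta, U+03B3 gamma
    printf 'Greek:     \316\261 \316\262 \316\263\n'
    # U+0416 capital zhe, U+0434 small de
    printf 'Cyrillic:  \320\226 \320\264\n'
    # U+05D0 alef, U+05D1 bet
    printf 'Hebrew:    \327\220 \327\221\n'
    # e followed by U+0301 COMBINING ACUTE ACCENT: should occupy one
    # column and look like e-acute when combining characters have width 0.
    printf 'Combining: e\314\201\n'
}

utf8_sample
```

Boxes or question marks in place of a character usually mean the chosen font lacks that glyph; a stray space after the e on the last line suggests the combining character is being given a width of 1.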
Prerequisite Wide-Character Libraries
libutf-8 library
The libutf-8.so library is needed to support wide characters in
the libncursesw.so library.
# pkg_add -rv libutf-8
ncursesw library
The libncursesw.so library provides wide character support to
curses-based applications, such as the vi editor, the mutt
mailer, etc.
# pkg_add -rv ncurses
For earlier port versions, wide character support was not enabled by default,
and requires local compilation:
# cd /usr/ports/devel/ncurses
edit the Makefile and uncomment the
--enable-widec keyword already there
# make
# make install
It will install in /usr/local/lib as libncursesw.so.
There is no need to remove the base system libncurses.so.
Utilities
bash
Works as is.
claws
Works as is.
csh
FreeBSD's csh is actually tcsh.
cups
Cups uses other tools such as groff to actually
process the files for printing. Groff is installed from the base system and
the version there is not built with wide character (multibyte) support. See the
section on groff for further information.
$ iconv -f utf-8 -t iso8859-1 file | lpr
egrep
See grep.
emacs
If you install the devel/ncurses version of the ncurses library, as described
above, and choose an iso10646 font, emacs works as is, both in curses mode and in X11 mode.
ex
See nvi.
fgrep
See grep.
grep
Works as is.
groff/grops
Groff and its grops backend convert files to PostScript for printing or
other use. Groff is part of the FreeBSD base system, and is not installed from
the ports tree. As shipped, the source (in /usr/src/contrib/groff) does not include
wide character (multibyte) support.
# cd /usr/ports/japanese/groff
# make
# make install
# mv /usr/bin/groff /usr/bin/_groff
joe
Works as is.
ls
Works as is.
mutt
Has wide character support, but needs compiling with it enabled.
# cd /usr/ports/mail/mutt
Edit the Makefile to change the dependency on
ncurses.5 to ncursesw.5
# make -DWITH_NCURSES_PORT
# make -DWITH_NCURSES_PORT install
or
# make -DWITH_NCURSES_PORT WITH_SGML_DOCS=no
# make -DWITH_NCURSES_PORT WITH_SGML_DOCS=no \
install
For the mail/mutt-devel 1.5.x development version of mutt:
# make -DWITH_MUTT_NCURSES_PORT
# make -DWITH_MUTT_NCURSES_PORT install
To use, add to ~/.muttrc:
set send_charset="US-ASCII:UTF-8"
or just:
set send_charset="UTF-8"
The first variant of send_charset tells mutt to tag outgoing
mail as US-ASCII if the content of the message can be represented
in US-ASCII, and as UTF-8 if it cannot.
This behavior is consistent with other mailers. However, some mailers apparently
limit the encoding of a reply to the charset of the received message, so
a correspondent using such a mailer cannot put UTF-8 characters in
a reply to a message tagged as US-ASCII. The second variant of
send_charset, above, tags all your outgoing messages as UTF-8, which
avoids this problem, but which may cause problems if you send an ASCII
message to someone with a mailer which does not support UTF-8.
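The first-fit rule behind send_charset="US-ASCII:UTF-8" can be sketched in a few lines of shell. This illustrates the idea only and is not mutt's code; pick_charset is a made-up name, and the sketch handles just the two charsets discussed here.

```shell
#!/bin/sh
# pick_charset TEXT: print US-ASCII if every byte of TEXT is 7-bit
# ASCII, otherwise UTF-8. This mirrors a list of the form
# "US-ASCII:UTF-8", where the first charset able to represent the
# message is the one chosen.
pick_charset() {
    # Delete all 7-bit bytes (octal 001-177) and count what remains.
    high_bytes=$(printf '%s' "$1" | LC_ALL=C tr -d '\001-\177' | wc -c | tr -d ' ')
    if [ "$high_bytes" -eq 0 ]; then
        echo US-ASCII
    else
        echo UTF-8
    fi
}

pick_charset 'See you at the cafe'
pick_charset "$(printf 'See you at the caf\303\251')"
```

The first call prints US-ASCII and the second prints UTF-8, since the e-acute is encoded as two bytes above 0x7f.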
set charset="UTF-8"
If you want color, before running mutt, set the environment variable:
export TERM=xterm-color
nvi
FreeBSD's vi is actually nvi. The version in the base system is
version 1.79 which does not have wide character support. However, the version
in port editors/nvi-devel is version 1.81.5 and does have wide character
support. It's necessary to compile this port with that support enabled, install
it, then rename the base system's vi and ex to ensure that the
one from the port is used.
# cd /usr/ports/editors/nvi-devel
edit the Makefile and add the following:
CONFIGURE_ARGS+= --enable-widechar
# make
# make install
# mv /usr/bin/vi /usr/bin/_vi
# mv /usr/bin/view /usr/bin/_view
# mv /usr/bin/ex /usr/bin/_ex
# mv /usr/bin/nvi /usr/bin/_nvi
# mv /usr/bin/nview /usr/bin/_nview
# mv /usr/bin/nex /usr/bin/_nex
Note that there are some bugs in this version of vi. These don't cause
undue problems, but should be borne in mind:
sh
The FreeBSD base system sh does not support the wide characters needed
for Unicode/UTF-8. Other shells, such as csh, tcsh and zsh, do.
sylpheed
Works as is.
tcsh
Works as is.
vi
See nvi.
zsh
Update: I have just been informed (May 1st, 2006) by the maintainer of
this port, that he has committed an update to 4.3.2 with multibyte enabled by
default. So, if you get this version of the port, it should just work as is.
If you have an earlier version of the port, the following applies.
Servers
Apache Web Server
Apache needs some way of determining the encoding of text and html files it is
serving so that it can indicate the correct encoding in the HTTP MIME headers. There are
various ways of doing this, including specifying file types using filename extensions
(e.g., foo.utf8) or by embedding special tags (such as META HTTP-EQUIV)
within the file.
AddDefaultCharset ISO-8859-1
to
AddDefaultCharset UTF-8
For the www/apache22 version, there is no AddDefaultCharset directive in
the default configuration file. In this case, the equivalent configuration is
to edit /usr/local/etc/apache22/httpd.conf and change the line:
#Include etc/apache22/extra/httpd-languages.conf
to
Include etc/apache22/extra/httpd-languages.conf
then edit the file /usr/local/etc/apache22/extra/httpd-languages.conf
and add:
#
# Specify a default charset for all pages sent
# out. This is always a good idea and opens the
# door for future internationalisation of your
# web site, should you ever want it. Specifying
# it as a default does little harm; as the
# standard dictates that a page is in iso-8859-1
# (latin1) unless specified otherwise i.e. you
# are merely stating the obvious. There are also
# some security reasons in browsers, related to
# javascript and URL parsing which encourage you
# to always set a default char set.
#
AddDefaultCharset UTF-8
after the ForceLanguagePriority directive and before the
set of AddCharset directives.
Unicode on the Console
Use of UTF-8 on the FreeBSD console doesn't work at the moment.
The console uses display fonts and keyboard mappings from
/usr/share/syscons/{fonts,keymaps}, which are managed using
vidfont(1) and kbdmap(1). The source for the
font and keyboard definitions is in
/usr/src/share/syscons/{fonts,keymaps}.
Roadmap for Adoption of Unicode as FreeBSD's Standard Character Encoding
Before UTF-8 can be adopted as FreeBSD's standard character encoding,
there needs to be:
Commits of the patches that are necessary for UTF-8 support.
ncurses and nvi from their ports to replace
the older versions currently in the base system.
A request for the ncurses update was made in
the same thread on freebsd-current@ but the maintainer of
ncurses in the base system indicated that he is busy for
now and may not get to this soon.