Man and Groff - Author Feedbacks - UTF-8 Comments

Alexander E. Patrakov patrakov at ums.usu.ru
Thu Jan 19 22:44:58 PST 2006


Jim Gifford wrote:

(many thanks to Jim for contacting the authors)

> Here is a response from the Man maintainer on UTF-8
>
> Federico Lucifredi wrote:
>
>> Hello Jim,
>>  Man will not be UTF-8 problematic soon - the timeframe is March. So,
>> hold on to your horses, we are working on it and it is targeted to be in
>> release 1.6d.
>>
>>  Hope that will solve your problem -- entirely =)
>
Yes, this is in the TODO list in Man-1.6b. Please make sure, however, 
that the maintainer fully understands the problem. And, unfortunately, I 
can't verify the progress because I don't have read access to the 
development sources of Man.

Currently, Man has the following problems:

1) It doesn't output error messages (such as translations of "no such 
manual page") properly in UTF-8 locales. Currently, it just copies the 
sequence of bytes supplied by the translator to the terminal. This is 
correct only if the translator and the user use the same locale. Such an
assumption was reasonable until 2003, but it fails now and produces
just a sequence of "invalid character" squares if, e.g., the
translator uses ru_RU.KOI8-R and the user uses ru_RU.UTF-8. A sequence
of empty squares is not a helpful error message.

Man should either convert messages between the translator's and the
user's character sets on the fly (this is best done by switching from
the obsolete catgets family of translation functions to gettext), or not
attempt to translate error messages from English at all. For LFS, we can
always get the second behaviour by passing "+lang none" to Man's
./configure line.
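
To make problem 1) concrete, here is a minimal Python sketch. It is not
Man's code, and the Russian text is an illustrative string, not the
actual entry from Man's message catalogue. It shows what a UTF-8
terminal receives when KOI8-R bytes are copied unchanged, and what it
receives when they are transcoded first:

```python
# Illustrative text only; this is NOT the actual string from Man's catalogue.
text = "Нет справочной страницы"

# The bytes as a translator working in ru_RU.KOI8-R would have saved them.
catalog_bytes = text.encode("koi8_r")

# Copying those bytes unchanged to a UTF-8 terminal: for this text none of
# them forms a valid UTF-8 sequence, so each letter becomes U+FFFD
# (the "invalid character" square).
as_seen = catalog_bytes.decode("utf-8", errors="replace")

# Transcoding from the translator's charset to the user's charset instead.
fixed_bytes = catalog_bytes.decode("koi8_r").encode("utf-8")

print(as_seen.count("\ufffd"), "replacement characters when copied as-is")
print("round trip intact:", fixed_bytes.decode("utf-8") == text)
```

A gettext-based program gets the second behaviour because the message
is converted to the user's locale charset when it is looked up.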

2) The language special-casing logic (strncmp(lang, "ja", 2) == 0) 
doesn't match today's reality and should be either implemented properly 
(as in man-db) or omitted completely and left to distro patches. With 
today's groff and Debian's policy on storing manual pages on the 
filesystem, the reality is as follows:

A) CVS version of Groff: see the bottom of this mail.
B) Released versions of Groff-1.19.x: In 8-bit locales, use -Tlatin1. In
UTF-8 locales which have a corresponding non-UTF-8 locale, use the
-Tlatin1 device and convert the output from the encoding in which the
manual page is stored on disk to UTF-8. In essentially-multibyte
languages (i.e., Chinese, Japanese, and Korean) there is no way to
format manual pages correctly with this version of groff.
C) Debian-patched Groff-1.18.1.1: same as Groff-1.19.x, but has special 
rules for CJK: the manual page should be converted from the encoding in 
which it is stored on disk to the locale charset, and then fed to groff 
with the -Tnippon or -Tutf8 parameter depending on the locale character set.
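
Rules B) and C) can be written down as a small decision function. This
is only a sketch in Python, not code from Man or man-db. The function
name, the flavour strings and the tuple layout are made up for
illustration. Two assumptions go beyond the text above: in an 8-bit
locale the page is stored in the locale charset, and in case C) -Tutf8
is used for UTF-8 locales and -Tnippon otherwise (the text only says
the device depends on the locale charset).

```python
CJK = {"zh", "ja", "ko"}

def plan(groff, lang, page_charset, locale_charset):
    """Return (pre_iconv, device, post_iconv), or None if the page
    cannot be formatted correctly.

    groff: "1.19" (released) or "debian-1.18.1.1" (hypothetical labels)
    pre_iconv / post_iconv: (from_charset, to_charset) pairs or None
    """
    if lang[:2] in CJK:
        if groff != "debian-1.18.1.1":
            return None  # released 1.19.x cannot format CJK pages
        # ASSUMPTION: utf8 device for UTF-8 locales, nippon otherwise.
        device = "utf8" if locale_charset == "UTF-8" else "nippon"
        pre = None
        if page_charset != locale_charset:
            pre = (page_charset, locale_charset)
        return (pre, device, None)

    # Non-CJK pages: same treatment for both groff flavours.
    if locale_charset == "UTF-8":
        return (None, "latin1", (page_charset, "UTF-8"))
    # ASSUMPTION: in an 8-bit locale the page is already in that charset.
    return (None, "latin1", None)

print(plan("1.19", "ru", "KOI8-R", "UTF-8"))
print(plan("1.19", "ja", "EUC-JP", "UTF-8"))
```

Note that the decision needs the on-disk charset and the locale charset
as inputs, which a bare strncmp(lang, "ja", 2) test cannot supply.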

3) Feature request: it would be very nice if LFS had a way to tell
Man to ignore /usr/share/man/ja/* even in Japanese locales, because the
system's Groff-1.19.{2,3cvs} can't format those manuals. This also
applies to other languages, so it may be better to implement this as a
whitelist rather than a blacklist. The whitelist should be different for
printing and display purposes.
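
A sketch of what such a whitelist could look like, in Python. Nothing
here exists in Man: the function name, the two sets and their contents
are hypothetical and only illustrate the requested behaviour.

```python
# HYPOTHETICAL whitelists; the real contents would depend on what the
# installed groff can actually format.
DISPLAY_OK = {"ru", "de", "fr"}
PRINT_OK = {"de", "fr"}

def man_search_dirs(locale_name, purpose, base="/usr/share/man"):
    """Return the directories to search, most specific first.

    purpose: "display" or "print"
    """
    # "ja_JP.UTF-8" -> "ja"
    lang = locale_name.split(".")[0].split("_")[0]
    allowed = DISPLAY_OK if purpose == "display" else PRINT_OK
    dirs = []
    if lang in allowed:
        dirs.append(base + "/" + lang)
    dirs.append(base)  # the English pages are always a valid fallback
    return dirs

print(man_search_dirs("ja_JP.UTF-8", "display"))
print(man_search_dirs("ru_RU.UTF-8", "display"))
```

With a whitelist, a language that nobody has tested falls back to the
English pages instead of producing unreadable output.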

> Here is a response from the Groff maintainer on UTF-8
>
> Werner LEMBERG wrote:
>
>>> Is the current CVS of groff, utf-8 friendly.
>>>     
>>
>>
>> Yes.  It doesn't have the final form (the preconv preprocessor will
>> get folded into soelim) which means that files included with .so
>> aren't handled yet automatically, but the interface won't change, this
>> is, options `-k' and `-K <enc>' will stay to convert the input file
>> encoding to something groff can understand.
>>
>> Note that you still need fonts which actually have those Unicode
>> characters.
>
I have checked out today's Groff from Savannah CVS. The test results are 
below.

1) It depends upon netpbm programs. This dependency can be circumvented 
for LFS purposes by issuing "make -k" for compilation.

2) The relocation stuff segfaults, so I had to disable it by editing 
src/libs/libgroff/Makefile.sub.

3) After that, groff can correctly format the Russian manual page for 
the /etc/passwd file in both ru_RU.KOI8-R and ru_RU.UTF-8 with the 
following command:

groff -K KOI8-R -Tutf8 -mandoc /usr/share/man/ru/man5/passwd.5 | iconv -f UTF-8 -t //TRANSLIT

This, of course, scales to more (but not all) languages by changing
KOI8-R to the character set in which the manual pages for that language
are stored. Also, if one stores manual pages on disk in RedHat fashion,
this works if KOI8-R is changed to UTF-8. So, the new architecture is
good and general. Thanks to the authors.

The old pre-1.19.3cvs method (used by man-db) still works:

groff -Tlatin1 -mandoc /usr/share/man/ru/man5/passwd.5 | iconv -f KOI8-R
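
As a sketch of how a wrapper could generalise the two commands above to
other languages, here is a small Python helper that only builds the
command lines and runs nothing. The function names and the table are
made up. Only the "ru" entry comes from this mail; the "pl" entry is an
assumed example.

```python
import shlex

# On-disk charset per language. Only "ru" is from the mail above;
# "pl" is an ASSUMED example entry.
PAGE_CHARSET = {"ru": "KOI8-R", "pl": "ISO-8859-2"}

def _sh(argv):
    """Join an argument list into a shell-safe command string."""
    return " ".join(shlex.quote(a) for a in argv)

def new_pipeline(page, lang, stored_as_utf8=False):
    """Command line for CVS groff (explicit -K, utf8 device)."""
    enc = "UTF-8" if stored_as_utf8 else PAGE_CHARSET[lang]
    groff = ["groff", "-K", enc, "-Tutf8", "-mandoc", page]
    iconv = ["iconv", "-f", "UTF-8", "-t", "//TRANSLIT"]
    return _sh(groff) + " | " + _sh(iconv)

def old_pipeline(page, lang):
    """Command line for the pre-1.19.3cvs method (latin1 device)."""
    groff = ["groff", "-Tlatin1", "-mandoc", page]
    iconv = ["iconv", "-f", PAGE_CHARSET[lang]]
    return _sh(groff) + " | " + _sh(iconv)

page = "/usr/share/man/ru/man5/passwd.5"
print(new_pipeline(page, "ru"))
print(old_pipeline(page, "ru"))
```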

4) The "-k" option (encoding autoguessing) is a bad idea because not
every manual page is tagged properly (e.g., the passwd(5) manual page is
not tagged). Everyone will end up using -K with the explicit encoding
specified (and, in fact, that's Man's responsibility, not the user's).
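
A sketch of how Man could pick the value for -K, in Python. This is not
groff's own detection code. I assume here that a "tagged" page carries
an Emacs-style "-*- coding: NAME -*-" comment in one of its first two
lines; the function name and the fallback parameter are made up.

```python
import re

# ASSUMPTION: the tag is an Emacs-style "-*- coding: NAME -*-" comment.
CODING_TAG = re.compile(r"-\*-.*?\bcoding:\s*([A-Za-z0-9._-]+).*?-\*-")

def encoding_for_K(page_text, lang_default):
    """Return the charset to pass to groff -K.

    Uses the tag if the page has one in its first two lines; otherwise
    falls back to the per-language default that Man would have to know.
    """
    for line in page_text.splitlines()[:2]:
        match = CODING_TAG.search(line)
        if match:
            return match.group(1)
    return lang_default

tagged = '.\\" -*- coding: koi8-r -*-\n.TH PASSWD 5\n'
untagged = ".TH PASSWD 5\n"
print(encoding_for_K(tagged, "KOI8-R"))
print(encoding_for_K(untagged, "KOI8-R"))
```

For an untagged page such as passwd(5), only the per-language default
gives a usable answer, which is why the caller has to supply it.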

5) New Groff is still not able to format Japanese manuals. Is there any 
timeline for this?

So, since the set of languages supported by new groff is a strict subset 
of those supported by Debian-patched groff-1.18.1.1 and Man-DB, there is 
no direct merit in upgrading now or in March. This does not, however, 
make testing of new versions of Man and Groff irrelevant.

> Both also expressed an interest for us to assist in testing.

Thanks to both of them. With their help, Man-DB will certainly not be
needed in the future.

Jim: If you really want to drop Man-DB right now in favour of Man, I
will (on your request) make a proof-of-concept patch for the current LFS
book that installs Man and a safe subset of manual pages without
confusing instructions. That would be a huge drop in functionality
(though without "unreadable manual page" bugs similar to those found in
RedHat 8), so I really don't want this to be applied.

-- 
Alexander E. Patrakov


