Please review for Man-DB changes

Alexander E. Patrakov patrakov at gmail.com
Sat Oct 25 08:47:39 PDT 2008


DJ Lucas wrote:

> Many other distributions ignore the on disk encodings completely, 
> leaving the end user with a mix of improperly encoded manual pages.

Well, the end user doesn't care how the manual pages are encoded on 
disk. The only thing that matters is whether they are displayed correctly. 
And I can't translate the sentence into Russian, because I don't know 
how an encoding can be ignored by the distribution. Issues can be 
ignored, and encodings can be mishandled.

And you lost the important bit from your previous mail, that in such 
distributions some pages (that match the de-facto Man setup) are 
readable, while others display as completely "illegible" lines of 
characters.

And BTW, Lingvo (the leading online English<->Russian dictionary) 
doesn't even list your intended meaning among the list of available 
translations for "illegible". They think that this word can apply only 
to handwriting or typesetting, and is a synonym for "blurry", or "too 
small to read". I.e., it means something which can be characterized with 
a certain degree of "illegibility", while we are talking about perfectly 
displayed, but wrong characters (and one cannot talk about "more 
correct" or "less correct" characters). So, please choose another word 
below.

> When man encounters an unexpected encoding, it will display the contents 
> as configured, resulting in completely illegible text.

Man (original) doesn't _know_ the encoding. It just passes the manual 
page through a pipeline designed (deliberately or by copying others' 
setup blindly) to process text in a certain encoding. Garbage in, 
garbage out. Yes, that's essentially what you said, but not all Man 
implementations have enough brains to "expect" some encoding - the 
original Man just pipes text through the static user-configured pipeline.
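
To illustrate "garbage in, garbage out", here is a minimal Python sketch. 
It is not how any Man implementation is written: static_pipeline() is a 
hypothetical stand-in for a fixed, user-configured pipeline, and the 
choice of KOI8-R and CP1251 is only an assumption for the example. The 
point is that the output consists of perfectly valid characters that 
are simply the wrong ones.

```python
def static_pipeline(raw: bytes, assumed_charset: str) -> str:
    """Hypothetical stand-in for a fixed, user-configured pipeline.

    It never inspects the input; it simply decodes with the one charset
    it was configured for.
    """
    return raw.decode(assumed_charset)


page_text = "привет"                  # what the page author wrote
on_disk = page_text.encode("koi8_r")  # the page as stored, in KOI8-R

good = static_pipeline(on_disk, "koi8_r")  # pipeline matches the page
bad = static_pipeline(on_disk, "cp1251")   # pipeline set up for CP1251

print(good)
print(bad)  # every character is a real letter, just not the intended one
```

No error is raised in the second case, which is why the pipeline 
cannot be said to "encounter an unexpected encoding".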

Sorry, it is too late here for me to try suggesting a better wording. I 
will do this tomorrow if you don't do it yourself while I sleep.

>>> Man-DB uses a
>>> built-in table (see below) to find the correct search directory for
>>> manual pages based on the user's locale settings.
>>>     
>> No, it doesn't look into the table in this case. See add_nls_manpath() 
>> in http://www.chiark.greenend.org.uk/~cjwatson/bzr/man-db/trunk/src/manp.c
>>
>> It iterates over all subdirectories and tests whether the subdirectory 
>> is for the user's language, completely disregarding the encoding.
>>   
> ...ships with manual pages in legacy encodings.  Man-DB uses a built-in 
> table (see below) to determine the on disk encoding of the manual pages 
> found for a user's locale. If the directories found do not contain the 
> ".UTF-8" extension, Man-DB checks the table, and performs the necessary 
> conversion.  E.g., because of "UTF-8" in the directory name...

It doesn't work this way. Suppose that the user's locale is 
ll_CC.CODESET. Man-DB looks for subdirectories of /usr/share/man that, 
after removing a possible suffix, reduce to either ll_CC or ll. For each 
of the directories found with a suffix, it uses the suffix as the 
encoding. If the directory has no suffix, Man-DB checks the table. 
"UTF-8" has no special meaning, but your text creates a false impression 
that it does. E.g., if /usr/share/man/ru.CP1251 existed, Man-DB would 
expect to find CP1251-encoded manual pages there. Again, please read the 
source. Oh, you did.
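
The lookup described above can be sketched roughly as follows. This is 
a simplified Python illustration, not a translation of Man-DB's C code: 
the function names, the one-entry table and the ISO-8859-1 default are 
assumptions made for the example, and locale modifiers such as "@euro" 
are ignored.

```python
from typing import Dict, List, Optional, Tuple

# Illustrative table only; the real one lives in Man-DB's source.
FALLBACK_TABLE: Dict[str, str] = {"ru": "KOI8-R"}
DEFAULT_ENCODING = "ISO-8859-1"  # assumed default for unlisted directories


def split_suffix(name: str) -> Tuple[str, Optional[str]]:
    """Split "ru.UTF-8" into ("ru", "UTF-8"); no dot gives (name, None)."""
    base, dot, suffix = name.partition(".")
    return base, (suffix if dot else None)


def candidate_dirs(locale: str, subdirs: List[str]) -> List[Tuple[str, str]]:
    """Return (directory, assumed on-disk encoding) pairs for a locale."""
    lang_cc, _ = split_suffix(locale)   # the locale's codeset is not used
    lang = lang_cc.split("_")[0]
    wanted = {lang_cc, lang}
    result = []
    for name in subdirs:
        base, suffix = split_suffix(name)
        if base not in wanted:
            continue
        if suffix is not None:
            encoding = suffix           # any suffix is taken as the encoding
        else:
            encoding = FALLBACK_TABLE.get(base, DEFAULT_ENCODING)
        result.append((name, encoding))
    return result


found = candidate_dirs("ru_RU.KOI8-R", ["de", "ru", "ru.UTF-8", "ru.CP1251"])
print(found)
```

Note that "ru.UTF-8" and "ru.CP1251" go through exactly the same branch, 
and that the codeset of the user's locale plays no part in choosing the 
directories.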

> Some interesting reading in the source.  Looks like at least 
> unpack_locale_bits() does not care what the codeset is, but it's checked 
> in encodings.c. So:
> 
> ...If the directories found do not contain an extension, Man-DB checks 
> the table, and performs the necessary conversion. E.g., because of 
> "UTF-8" extension in the directory name...

It always performs the necessary conversion (e.g., in ru_RU.KOI8-R 
locale, it can use manual pages from /usr/share/man/ru.UTF-8), so let's 
drop or move "and performs the necessary conversion". Also, in UTF-8 
locales, it does _double_ conversion: first to the encoding from the 
table, then (after processing with Groff) back, because Groff doesn't 
understand UTF-8. Other than that, good.
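
A rough Python sketch of that double conversion, under stated 
assumptions: fake_groff() is a hypothetical placeholder that passes 
bytes through unchanged, render() and its parameters are invented names, 
and KOI8-R stands in for "the encoding from the table".

```python
def fake_groff(data: bytes) -> bytes:
    """Hypothetical placeholder for Groff; passes bytes through unchanged."""
    return data


def render(page: bytes, page_encoding: str, table_encoding: str,
           locale_encoding: str) -> bytes:
    # First conversion: from the on-disk encoding to the 8-bit encoding
    # from the table, which the formatter can handle.
    for_groff = page.decode(page_encoding).encode(table_encoding)
    formatted = fake_groff(for_groff)
    # Second conversion: from the table encoding to the user's locale.
    return formatted.decode(table_encoding).encode(locale_encoding)


text = "справка"
utf8_page = text.encode("utf-8")  # e.g. a page from /usr/share/man/ru.UTF-8

# UTF-8 locale: UTF-8 -> KOI8-R -> (formatter) -> UTF-8
out_utf8 = render(utf8_page, "utf-8", "koi8_r", "utf-8")

# KOI8-R locale reading the same UTF-8 page: the second step is a no-op
out_koi8 = render(utf8_page, "utf-8", "koi8_r", "koi8_r")
```

In both cases the page is converted regardless of which directory it 
came from, which is why "and performs the necessary conversion" should 
not be tied to the table lookup.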

-- 
Alexander E. Patrakov


