New Last Call: 'Tags for Identifying Languages' to BCP

The IESG has been considering

- 'Tags for Identifying Languages'
   <draft-phillips-langtags-08.txt> as a BCP

There have been considerable changes to the document since the
initial last call, and the IESG would like the community to consider
the changes.  In addition, the authors have prepared text describing
why this mechanism is needed as a replacement for the existing
procedure; it is included below.

The IESG plans to make a decision in the next few weeks, and solicits
final comments on this action.  Please send any comments to the
iesg@ietf.org or ietf@ietf.org mailing lists by 2005-01-05.

The file can be obtained via
http://www.ietf.org/internet-drafts/draft-phillips-langtags-08.txt

Authors' discussion of drivers for this work:

Reasons for Enhancing RFC 3066

RFC 3066 and its predecessor, RFC 1766, define language tags for use on the
Internet. Language tags are necessary for many applications, ranging from
cataloging content to computer processing of text. The RFC 3066 standard for
language tags has been widely adopted in various protocols and text formats,
including HTML, XML, and CLDR, as the best means of identifying languages and
language preferences.

This specification proposes enhancements to RFC 3066. Because RFC 3066 is so
widely used, revisions to it have broad implications, and it is important to
understand the reasons for modifying the structure of language tags and the
design implications of the proposed replacement.

Problems

This specification, the proposed successor to RFC 3066, addresses a number of
issues that implementers of language tags have faced in recent years:

    * Stability of the underlying ISO standards
    * Accessibility of the underlying ISO standards for implementers
    * Ambiguity of the tags defined by these ISO standards
    * Difficulty with registrations and their acceptance
    * Identification of script where necessary
    * Extensibility

The stability, accessibility, and ambiguity issues are crucial. Currently,
because of changes in underlying ISO standards, a valid RFC 3066 language tag
may become invalid (or have its meaning change) at a later date. With much of
the world's computing infrastructure dependent on language tags, this is simply
unacceptable: it invalidates content that may have an extensive shelf-life. In
this specification, once a language tag is valid, it remains valid forever.

RFC 3066 Language Tags: A brief survey

Tags defined by RFC 3066 take two forms. Most tags are formed using an ISO
639-1 (two-letter) or ISO 639-2 (three-letter) language code, optionally followed
by an ISO 3166 country code. Tags formed in this manner are not individually
registered and anyone can use such a combination of codes to identify their
language preferences or the language of some piece of content. Because this
system allows a broad range of tags to be formed by reference to the underlying
standards, these tags are referred to as generative in nature. The generative
system is very powerful and allows content authors and others to form and use
very expressive tags without the need to engage in a long and arduous
registration process. Examples of such tags are:

    * en-US (English as used in the United States)
    * fr-CA (French as used in Canada)
    * de-CH (German as used in Switzerland)
    * ja (Japanese)
    * ale-CA (Aleut as used in Canada)
    * ale-BE (Aleut as used in Belgium)

While it is possible to generate tags that do not identify any likely
real-world content, such as Aleut as used in Belgium, tags of this nature do not
represent a serious problem. Consider the case of a database that can identify
people by national origin and by hair color. It is not a problem that one could
compose a query for blond Mongolians, even though no results would ever be
returned.

There are problems with the RFC 3066 definition of generative tags,
however. The ISO 639 and ISO 3166 standards are not freely available and evolve
over time. For example, ISO 3166 has withdrawn codes in the past and, worse, then
reassigned them to a different country altogether. As a result, it is difficult
for implementers to obtain a correct list of codes and then ensure
interoperability with other implementations of language tags.

The other way to form an RFC 3066 tag is via registration with IANA. Tags
registered with IANA identify a specific language, dialect or variation. Unlike
the generative tags, the registered values cannot be combined with other
standard subtags to form additional tags that are more descriptive. Examples of
such tags are:

    * no-nyn (Nynorsk variation of Norwegian, 
              deprecated: use 'nn' instead)
    * cel-gaulish
    * i-klingon (deprecated: use 'tlh' instead)
    * etc.

Registration, besides being a long and arduous process, also presents a variety
of problems for implementers. Although the tags are freely available, most
implementations do not support these tags because they do not fit neatly into
the generative system. Special logic is required to handle them, especially when
performing language negotiation or fallback. In addition, many of the tags are
deprecated: because IANA registration has historically been less opaque and less
time-consuming than registering a language with the ISO 639 MA/RA, languages
were often registered with IANA first. Eventually ISO 639 does catch up and
assign the language a code, resulting in overlapping tag choices. Implementations
must also deal with the implications of multiple valid tags identifying what is
essentially the same content.

But most problematic is the lack of a relationship to the generative mechanism.
Since each variation of a tag must be separately registered, language variations
with a broad range of valid uses require an enormous number of registrations.
For example, there are 8 registrations to deal with minor spelling reforms in
the German language and these registrations cover just three countries where
German is commonly spoken--and no countries where it is not the major language.
Variations in languages with a broader diffusion (such as Chinese) may require
20 or more registrations to gain full coverage, sometimes of important
distinctions.

Solving the Problems

This specification addresses each of these issues with a simple, elegant design
that is compatible with existing language tags and implementations.

This compatibility exists on several levels. All language tags, both generative
and registered, that were valid under RFC 3066 are still valid under this
specification. In addition, and very importantly, language tags that are newly
defined by this specification are compatible with the ABNF syntax, matching,
parsing, and other mechanisms defined by RFC 3066.

Thus for an implementation of RFC 3066, all of the new tags defined by this
specification are still in the form of valid registered tags, and will simply be
dealt with in whatever fashion the implementation used to handle future
registrations, those that were added to the registry after the implementation
was created. In other words, tags formed under this specification that are
unfamiliar to RFC 3066 implementations will be treated by those implementations
as if they were registered tags from a future version of the 3066 registry.
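
The compatibility claim above can be checked mechanically. The following is a
minimal sketch in Python, assuming only the RFC 3066 syntax (a primary subtag of
1 to 8 letters followed by hyphen-separated subtags of 1 to 8 letters or
digits). The function name is illustrative and is not part of either
specification.

```python
import re

# RFC 3066 syntax: a primary subtag of 1-8 letters, then zero or more
# hyphen-separated subtags of 1-8 letters or digits.
RFC3066_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


def is_rfc3066_well_formed(tag: str) -> bool:
    """Return True if the whole string fits the RFC 3066 tag syntax."""
    return RFC3066_TAG.fullmatch(tag) is not None


# Tags from this discussion, old and new, plus one malformed string.
for tag in ("en-US", "zh-Hans-CN", "sl-Latn-IT-x-xyzzy", "en_US"):
    print(tag, is_rfc3066_well_formed(tag))
```

An RFC 3066 processor that only checks syntax would accept 'zh-Hans-CN' and
'sl-Latn-IT-x-xyzzy' in the same way it accepts any registered tag it has not
seen before.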

Subtags and the Registry

The largest change in the specification is that it modifies the structure of
the language tag registry. Instead of having to obtain lists of codes from five
separate external standards (not all of which are easily available), the IANA
registry will maintain a comprehensive list of valid subtags that can be used in
the generative mechanism in a machine-parseable text format. This registry will
continue to track the existing core standards and will start with the current
list of valid codes. As future codes are assigned, the IANA registry will be
updated to reflect the changes.
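
To illustrate what a machine-parseable registry makes possible, here is a
minimal sketch in Python. The record layout (one "Name: value" field per line,
records separated by "%%") is an assumption modeled on the format used by the
IANA registry as later published; the draft's exact format may differ. The
sample records and the function names are illustrative only.

```python
# Illustrative sample only; not an excerpt from the real registry.
SAMPLE_REGISTRY = """\
Type: language
Subtag: sl
Description: Slovenian
%%
Type: script
Subtag: Hans
Description: Han (Simplified variant)
%%
Type: region
Subtag: SG
Description: Singapore
"""


def parse_registry(text: str) -> list[dict[str, str]]:
    """Split the text into records and each record into name/value fields.

    Assumes one field per line; folded (continued) lines are not handled.
    """
    records = []
    for block in text.split("%%"):
        record = {}
        for line in block.strip().splitlines():
            name, _, value = line.partition(":")
            record[name.strip()] = value.strip()
        if record:
            records.append(record)
    return records


def subtags_by_type(records: list[dict[str, str]]) -> dict[str, set[str]]:
    """Index the subtags by their category (language, script, region, ...)."""
    index: dict[str, set[str]] = {}
    for record in records:
        index.setdefault(record["Type"], set()).add(record["Subtag"])
    return index


registry = subtags_by_type(parse_registry(SAMPLE_REGISTRY))
print(sorted(registry))
```

With such an index an implementation needs only the one registry file, rather
than copies of each underlying standard.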

Having a separate registry allows IANA language tags to resolve ambiguity and
stability problems with the underlying standards. Language tags formed today
will be guaranteed to maintain their validity and meaning essentially forever,
something that is not true today.

In addition, switching to a subtag registry changes the nature of registrations
themselves. Instead of registering complete tags and therefore potentially
having to register a very large number of them (complicating life for
implementers and discouraging support for the registry), a single subtag can be
generatively combined to form many useful tags.

For example, one registered tag today is 'zh-Hans', which represents "Chinese
written in the Simplified Chinese script". Only this tag is valid under RFC
3066. Useful tags such as 'zh-Hans-SG' (SG=Singapore) or 'zh-Hans-CN' are not
valid. By switching to a registry in which 'Hans' is a registered subtag, any of
these valid and useful tags can be formed generatively.
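
The generative mechanism amounts to joining independently registered subtags.
A minimal sketch in Python follows; the function name and the casing (lowercase
language, titlecase script, uppercase region) are assumptions that follow the
conventions used in the examples in this text. Tags themselves are compared
without regard to case.

```python
from typing import Optional


def make_tag(language: str,
             script: Optional[str] = None,
             region: Optional[str] = None) -> str:
    """Join subtags into a language tag, applying conventional casing."""
    parts = [language.lower()]
    if script:
        parts.append(script.title())   # e.g. "hans" -> "Hans"
    if region:
        parts.append(region.upper())   # e.g. "sg" -> "SG"
    return "-".join(parts)


# One registered script subtag yields a tag for every region of interest.
for region in ("CN", "SG"):
    print(make_tag("zh", script="Hans", region=region))
```

A single registration of 'Hans' thus covers 'zh-Hans-CN', 'zh-Hans-SG', and any
other combination, where RFC 3066 would have needed one registration per tag.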

In addition, the subtag registry will encourage implementers to support
registered items, since the subtags will fit the generative mechanism and
exception handling code will no longer be necessary.

To prevent the IANA language registry filling up with deprecated entries, rules
have also been introduced to curb harmful registrations that should be handled
by the various ISO maintenance and registration authorities (such as ISO 639).

The new structure and registry allow implementations to determine much more
about tags, even in the absence of registry information. This is important
because at any given point in time there will be a mixture of implementations
that have different snapshots of the registry. The new structure allows these
implementations to interoperate effectively. In particular, the category of
all subtags (as language, region, script, etc.) can be determined without
reference to the particular version of the registry snapshot by the
implementation. This allows for much more robust implementations, and greater
compatibility over time.
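
As a rough illustration of determining a subtag's category from its shape
alone, consider the sketch below in Python. The length rules it uses (four
letters for a script, two letters or three digits for a region, three letters
for an extended language subtag, a single character introducing private use or
an extension) are assumptions taken from the specification as later published;
the draft's details may differ. Tags that begin with 'x' or 'i' are not handled.

```python
def classify_subtags(tag: str) -> list[tuple[str, str]]:
    """Label each subtag by category using only its length and content."""
    subtags = tag.split("-")
    result = [(subtags[0], "language")]
    mode = "normal"  # switches to "extension" or "private" after a singleton
    for s in subtags[1:]:
        if mode == "private":
            kind = "private use"
        elif len(s) == 1:
            mode = "private" if s.lower() == "x" else "extension"
            kind = "singleton"
        elif mode == "extension":
            kind = "extension"
        elif len(s) == 4 and s.isalpha():
            kind = "script"
        elif (len(s) == 2 and s.isalpha()) or (len(s) == 3 and s.isdigit()):
            kind = "region"
        elif len(s) == 3 and s.isalpha():
            kind = "extlang"
        else:
            kind = "variant"
        result.append((s, kind))
    return result


for subtag, kind in classify_subtags("sl-Latn-IT-x-xyzzy"):
    print(subtag, kind)
```

No registry lookup is involved, so two implementations holding different
registry snapshots would still agree on which subtag is the script and which is
the region.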

In addition, this specification also makes it possible, for the first time, to
effectively test whether an implementation conforms to the specification. The
problem with RFC 3066 is that to determine the status of an implementation
produced at a given point, one has to reconstruct the historical contents of
each of the ISO standards and the historical contents of the registry. This is a
time-consuming and error-prone process. The new registry provides a complete,
easily parseable file that gives the precise contents of valid tags for
any point in time.

Additional Subtag Sources

This specification introduces two additional international standards as sources
for language tags.

ISO 15924 represents script codes. (The example above of 'Hans' is from ISO
15924.) Writing system variations are often crucial to communicate, especially
when selecting content using language negotiation. Addition of this standard
will allow these distinctions to be formed generatively, rather than via
individual registration.

UN M.49 represents region and country codes. The UN M.49 standard is used by
ISO 3166 to determine what a country is. The UN M.49 codes are used by this
specification in two ways. First, if ISO 3166 reassigns a country code formerly
associated with one country to another country (as it did in 2003 with the 'CS'
code, formerly Czechoslovakia and now assigned to Serbia and Montenegro), then
the UN M.49 code can be placed in the registry to preserve stability. Secondly,
the UN M.49 standard defines regional codes for areas such as "Central and South
America" which can be useful in forming language tags for larger regions.

Future-Proofing: Private Use and Extensions

Because of the widespread use of language tags, it is potentially disruptive to
have periodic revisions of the core specification, despite demonstrated need.
This specification addresses this problem by fully specifying the valid syntax
of language tags, while providing for future, unforeseen requirements. One of
these mechanisms is the extlang subtag, which allows for future extensions of
ISO 639, in particular ISO 639-3.

Private use subtags are another of these mechanisms. In RFC 3066, any tag
that was not registered or wholly made up of generative subtags had to be
tagged entirely as private use. Recipients of such a tag are not allowed to
infer any information from it, except by private agreement. Thus if any
private-use information needed to be included in the tag, the entire tag had to
be private use, making the entire tag uninterpretable to other implementations.

This specification allows for private use subtags in a particular, prescribed
manner. Consider the IANA registered tag 'sl-nedis', which represents the
Natisone dialect of Slovenian. The subtag 'sl' is a valid ISO 639-1 code for
Slovenian. Prior to its registration with IANA, if users wished to tag content
as being in the Natisone dialect, they had two choices for language tags: 'sl'
and 'x-sl-nedis' (or similar). The first tag does not meet the need of
distinguishing the text from other varieties of Slovenian, while the second one
does not convey the relationship to Slovenian to outside processors (a human
might look at the tag and infer Slovenian, but the 'sl' subtag doesn't
necessarily represent that language).

Under this specification, if a new dialect of Slovenian were needed (let's call
it the 'xyzzy' dialect), a tag such as 'sl-x-xyzzy' can be used. In fact, a
quite comprehensive amount of information can be communicated:
'sl-Latn-IT-x-xyzzy' would represent Slovenian written using the Latin script as
used in Italy with some additional private distinguishing information (which
implementations of this specification can match algorithmically).
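
The practical benefit is that a processor can separate the publicly
interpretable part of a tag from the private part. A minimal sketch in Python,
assuming only that the single-character subtag 'x' introduces the private use
portion as described above; the function name is illustrative.

```python
def split_private_use(tag: str) -> tuple[str, str]:
    """Split a tag into (publicly interpretable part, private use part)."""
    subtags = tag.split("-")
    for i, subtag in enumerate(subtags):
        if subtag.lower() == "x":
            return "-".join(subtags[:i]), "-".join(subtags[i + 1:])
    return tag, ""


for tag in ("sl-Latn-IT-x-xyzzy", "x-sl-nedis", "sl"):
    print(tag, split_private_use(tag))
```

For 'sl-Latn-IT-x-xyzzy' the public part is 'sl-Latn-IT', which any processor
can interpret. For the RFC 3066 style tag 'x-sl-nedis' the public part is
empty, which is exactly the loss of information described above.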

Note that RFC 3066 private use tags are still permitted and have the same
information content and treatment as they did previously.

The extension mechanism also provides a way for independent RFCs to define
extensions to language tags. These extensions have a very constrained,
well-defined structure to prevent extensions from interfering with
implementations of this specification (or RFC 3066).

Matching and Language Negotiation

Content tagging is only one of the applications for language tags. The other
major applications are querying for matches and content negotiation. RFC
3066 defines "language ranges" for use in content negotiation and querying and
describes a very simple matching algorithm. This specification maintains
compatibility with this language negotiation scheme, while providing additional
information on the implementation of language matching.
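
The RFC 3066 matching rule is short enough to sketch in Python: a language
range matches a tag if the two are equal, or if the range is a prefix of the
tag and the next character of the tag is a hyphen. The range "*" matches any
tag. The function name is illustrative; comparison is case-insensitive.

```python
def range_matches(language_range: str, tag: str) -> bool:
    """Apply the RFC 3066 prefix-matching rule for language ranges."""
    if language_range == "*":
        return True
    r, t = language_range.lower(), tag.lower()
    return t == r or t.startswith(r + "-")


print(range_matches("en", "en-US"))          # range is a prefix at a hyphen
print(range_matches("en", "eng"))            # prefix, but not at a hyphen
print(range_matches("zh-Hans", "zh-Hans-CN"))
```

Because the rule looks only at hyphen-delimited prefixes, it applies unchanged
to tags that carry the new script subtag.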

Well-Formed vs. Validating

Existing language tag processors already fall into two categories. There are
language tag processors that check if language tags have the proper,
well-formed, syntax, but which do not validate their content, and there are
language tag processors that in addition validate and reject unrecognized tags.
Each of these categories is appropriate to different implementations. For
example, to process incoming tags that may have been formed under a future
registry, an implementation may restrict itself to only checking
well-formedness. Another implementation that allows users to generate tags may
fully validate.
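
The difference between the two categories can be sketched in Python as
follows. The syntax check assumes the RFC 3066 pattern. The set of known
subtags is a hypothetical registry snapshot, and a real validating processor
would check each subtag against its category and position rather than against
one flat set.

```python
import re

# Syntax only: 1-8 letters, then hyphen-separated runs of 1-8 letters/digits.
WELL_FORMED = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")

# Hypothetical snapshot of known subtags, stored lowercase for comparison.
KNOWN_SUBTAGS = {"en", "sl", "zh", "latn", "hans", "us", "it", "cn", "sg"}


def is_well_formed(tag: str) -> bool:
    """Check syntax only; any unknown but well-shaped subtag is accepted."""
    return WELL_FORMED.fullmatch(tag) is not None


def is_valid(tag: str) -> bool:
    """Check syntax, then reject tags containing unrecognized subtags."""
    if not is_well_formed(tag):
        return False
    return all(s.lower() in KNOWN_SUBTAGS for s in tag.split("-"))


for tag in ("zh-Hans-SG", "tlh-Latn-AQ", "en--US"):
    print(tag, is_well_formed(tag), is_valid(tag))
```

A receiver of incoming tags might stop at is_well_formed so that tags formed
under a newer registry still pass, while a tool that lets users generate tags
might require is_valid.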

This specification clearly distinguishes these two possible classes of
conformance, and provides an explicit, testable definition of each one.

Impact of the New Design on Existing Implementations

One concern that is crucial to acceptance of the new language tag design is how
it works with existing implementations of RFC 3066 and how existing
implementations will interact with implementations of the newer language tags.

It is important to recognize that all language tags that were valid under the
existing RFC 3066 will remain valid, with their meanings intact, under this
specification. In fact, this specification stabilizes these meanings so that
existing implementations can be carried forward for as long as is necessary.
Content, regardless of its format, will remain valid, essentially forever.

As content and systems begin to make use of the new language tags by adopting
the additional fields defined by this specification, there will be an impact on
software and systems that expect only the older tags. The design of this
specification was carefully created so that all of the new values that can be
assigned fit the pattern for registered language tags under RFC 3066. Thus while
existing implementations will not recognize the meaning in the tags, they will
be able to process them as if they were unrecognized-but-well-formed registered
tags.

In addition, although this specification acknowledges the possibility of
alternate or advanced matching and negotiation strategies, it maintains the
existing matching algorithm (by removing subtags from the right side of a
language tag until a match is obtained), simply providing more detail on usage.
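
The truncation strategy mentioned above can be sketched in Python as follows.
The function and parameter names are illustrative. The sketch simply drops one
subtag at a time from the right and does not treat a trailing single-character
subtag specially.

```python
from typing import Optional


def lookup(requested: str, available: list[str],
           default: Optional[str] = None) -> Optional[str]:
    """Find the closest available tag by removing subtags from the right."""
    by_lower = {tag.lower(): tag for tag in available}
    subtags = requested.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in by_lower:
            return by_lower[candidate]
        subtags.pop()  # drop the rightmost subtag and try again
    return default


print(lookup("zh-Hans-SG", ["en", "zh-Hans", "zh"]))
print(lookup("de-CH-1996", ["de", "fr"]))
```

A request for 'zh-Hans-SG' falls back first to 'zh-Hans' and only then to 'zh',
so the script distinction is preserved whenever matching content exists.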

Summary

The authors of this specification have worked for the past year with a wide
range of experts in the language tagging community to build consensus on a
design for language tags that meets the needs and requirements of the user
community. Language tags form a basic building block for natural language
support in computer systems and content. The revision proposed in this
specification addresses the needs of this community of users with a minimal
impact on existing content and implementations, while providing a stable basis
for future development, expansion, and improvement.
