Network Working Group                                         P. Hoffman
Internet-Draft                                            VPN Consortium
Updates: 2223 (if approved)                                      T. Bray
Intended status: Informational                          Sun Microsystems
Expires: April 5, 2009                                   October 2, 2008


                   Using non-ASCII Characters in RFCs
                     draft-hoffman-utf8-rfcs-03.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on April 5, 2009.

Copyright Notice

   Copyright (C) The IETF Trust (2008).

Abstract

   This document specifies a change to the IETF process in which
   Internet Drafts and RFCs are allowed to contain non-ASCII
   characters. The proposed change is to change the encoding of
   Internet Drafts and RFCs to UTF-8.


Hoffman & Bray            Expires April 5, 2009                 [Page 1]

Internet-Draft              Non-ASCII in RFCs               October 2008


1. Introduction

   The purpose of this document is to specify a way for the IETF to use
   non-ASCII characters in Internet Drafts and RFCs. Various guideline
   documents in the IETF, notably [RFC2223], specify that RFCs must use
   only the US-ASCII character set.
   This restriction has historically caused problems, notably:

   o  Names and addresses of authors of IETF documents are misspelled

   o  Names and document titles in references are misspelled

   o  Protocol examples that include non-ASCII characters cannot be
      included straightforwardly

   The first two issues cause real problems for people searching for
   RFCs for particular authors or references that contain non-ASCII
   characters. For many languages that use Latin characters outside
   the ASCII range, there are no absolute mappings between those
   non-ASCII characters and ASCII equivalents. A common example is
   that "a-with-umlaut" (U+00E4) may be mapped to "a" or to "ae"; many
   other mapping difficulties exist.

   The third issue reduces the effectiveness of IETF specifications.
   Implementors of protocols that carry textual payloads often
   experience difficulty in achieving interoperability related to the
   use of character sets from around the world. Specifications that
   can provide concrete examples of such protocol scenarios will be of
   significant benefit to these implementors.

   Now that UTF-8 [RFC3629] is nearly universally available in
   text-editing and display systems, the IETF can eliminate these
   problems by allowing RFCs to use UTF-8.

   This document uses example characters as specified in [RFC5137].
   Had the recommendations from this document already been implemented,
   this alternate representation would, of course, not be necessary.

   It is important to note that this document does not use RFC 2119
   language (MUST, SHOULD, and so on). Instead, it lists practices
   that the IETF should consider. If the ideas in this document are
   adopted, the final list of rules for using UTF-8 in Internet Drafts
   and RFCs would be published by the IAOC. The authors are open to
   changing this and using 2119-style language if the community prefers
   it.

2. Use of UTF-8 in Internet Drafts and RFCs

   Upon publication of this document as an RFC, all existing RFCs and
   Internet Drafts will be considered to be encoded in UTF-8. The RFC
   Editor needs to change their processes to publish documents that are
   valid UTF-8.

   Similarly, upon acceptance of this document by the IETF, the IAOC
   should direct the IETF Secretariat to have all Internet Drafts
   encoded in UTF-8. The Secretariat needs to change their processes
   to publish documents that are valid UTF-8.

2.1. Limits On the Locations In Which Non-ASCII Text May Be Used

   It is suggested that the IETF Secretariat and RFC Editor limit
   non-ASCII characters to the following:

   o  Names and addresses of authors, used at the top of RFCs and in
      Author Contact sections

   o  Names and document titles used in References sections

   o  Quotations from non-English languages

   o  Protocol examples that show non-ASCII characters, for example in
      Internationalized Domain Names (IDNs), Internationalized Resource
      Identifiers (IRIs), and internationalized email addresses.

2.2. Allowable Character Repertoire

   UTF-8 is an encoding of the Unicode Character Set and can be used to
   encode any of its numeric codepoints, from 0 to 0x10FFFF inclusive.

   Specifications encoded in UTF-8 should not contain the encodings of
   certain Unicode codepoints. The codepoint ranges given in this
   section are inclusive:

   o  The "ASCII control characters" in the ranges U+0000 to U+0008,
      and U+000B to U+001F. These lack either visual representations,
      interoperable semantics, or both.

   o  The Surrogate-block range U+D800 to U+DFFF. These codepoints do
      not identify characters, but exist to support the UTF-16
      encoding.

   o  The ZERO WIDTH NO-BREAK SPACE U+FEFF and its mirror image U+FFFE.

   o  The Private-Use-Area ranges, U+E000 to U+F8FF, U+F0000 to
      U+FFFFD, and U+100000 to U+10FFFD.
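
   The exclusions listed above could be checked mechanically. The
   following is a minimal sketch in Python, not part of the proposal
   itself. The names DISALLOWED_RANGES, is_disallowed, find_disallowed,
   and check_document are hypothetical and chosen only for
   illustration; the ranges are copied from the list above.

```python
from typing import List, Tuple

# Inclusive codepoint ranges taken from the list in Section 2.2.
# All names in this sketch are illustrative assumptions.
DISALLOWED_RANGES: List[Tuple[int, int]] = [
    (0x0000, 0x0008),      # ASCII control characters
    (0x000B, 0x001F),      # ASCII control characters (U+0009, U+000A allowed)
    (0xD800, 0xDFFF),      # surrogate block
    (0xFEFF, 0xFEFF),      # ZERO WIDTH NO-BREAK SPACE
    (0xFFFE, 0xFFFE),      # its mirror image
    (0xE000, 0xF8FF),      # private use area
    (0xF0000, 0xFFFFD),    # private use area
    (0x100000, 0x10FFFD),  # private use area
]


def is_disallowed(codepoint: int) -> bool:
    """Return True if the codepoint falls in any listed range."""
    return any(low <= codepoint <= high for low, high in DISALLOWED_RANGES)


def find_disallowed(text: str) -> List[Tuple[int, str]]:
    """Return (index, 'U+XXXX') for every disallowed codepoint in text."""
    return [
        (index, f"U+{ord(char):04X}")
        for index, char in enumerate(text)
        if is_disallowed(ord(char))
    ]


def check_document(raw: bytes) -> List[Tuple[int, str]]:
    """Decode raw as strict UTF-8, then report disallowed codepoints.

    Invalid UTF-8 raises UnicodeDecodeError before any range check runs.
    """
    return find_disallowed(raw.decode("utf-8"))
```

   Under these assumptions, check_document returns an empty list for a
   document that is valid UTF-8 and uses none of the listed codepoints.
   The strict decoding step is a design choice: it rejects octet
   sequences that are not valid UTF-8 before the range check is
   applied.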

   Specifications encoded in UTF-8 should not contain the encodings of
   Unicode codepoints which are "Compatibility Characters", that is,
   those whose properties include a compatibility decomposition. Note
   that such characters occur rarely and detecting them requires
   run-time access to the Unicode character database, which may not be
   practical in some situations.

2.3. Normalization

   Due to the way that Unicode uses combining characters, there are
   sometimes multiple codepoint sequences that denote what, to a human,
   is the same character. For example, the character
   "lowercase-a-with-accent" can be spelled in two ways: as a single
   character (U+00E1) or as two characters (U+0061 followed by U+0301).
   This can present problems in searching and rendering. The process
   of standardizing on one of these possibilities is referred to as
   "normalization" and several "normalization forms" are defined by the
   Unicode Consortium.

   All UTF-8 text appearing in RFCs (but not necessarily Internet
   Drafts) ought to be normalized using Normalization Form C.

2.4. Author and Employer Names

   Authors can choose how to spell their names and the names of their
   employers in the various parts of Internet Drafts they are writing.
   The spelling at the top of the first page of the document needs to
   match the spelling in the "Authors' Addresses" section near the end
   of the document, but the latter can have alternate spellings to help
   those searching documents by name. Postal information listed in the
   "Authors' Addresses" section can also use non-ASCII.

   For example, assume that an author whose name is Fltstrm has a
   preferred all-ASCII spelling of Xiaodong Faltstrom. Two expected
   allowed methods for spelling his name would be:

   Network Working Group                                    X. Faltstrom
   Internet-Draft                                              ExampleCo
   . . .
   Author's Address
   Xiaodong Faltstrom ( Fltstrm)
   ExampleCo
   Email: xiaodong.faltstrom@example.com

   Network Working Group                                      X. Fltstrm
   Internet-Draft                                              ExampleCo
   . . .
   Author's Address
   Fltstrm (Xiaodong Faltstrom)
   ExampleCo
   Email: xiaodong.faltstrom@example.com

3. Security Considerations

   A display program that expects only US-ASCII input may fail when it
   encounters octets outside the US-ASCII range of values. Such a
   failure may become a security issue. For example, the program may
   display incorrect results for the input. More seriously, the
   program may have an internal error that causes it to fail in a
   security-compromising fashion. Note that such a program is
   vulnerable to many attacks other than just showing IETF documents.

   Someone could insert a UTF-8 host name in an RFC that has visually
   confusing characters. Another person could copy that host name out
   of the RFC and have it resolve to an unintended DNS name. This
   scenario seems quite far-fetched, given that tracking the RFC back
   to the author is trivial.

4. IAOC Considerations

   If this document is adopted by the IETF, it will be up to the IAOC
   to have the IETF Secretariat and the RFC Editor implement it. The
   IAOC needs to consider all of the suggested rules in this document,
   both the positive ones (such as allowing additional characters in
   some parts of Internet Drafts and RFCs) and the negative ones (such
   as disallowing particular characters from being used). The IAOC
   might want to publish proposed instructions to the IETF Secretariat
   and the RFC Editor and ask for community input on the specific
   instructions.

5. Informative References

   [RFC2223]  Postel, J. and J. Reynolds, "Instructions to RFC
              Authors", RFC 2223, October 1997.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, November 2003.

   [RFC5137]  Klensin, J., "ASCII Escaping of Unicode Characters",
              BCP 137, RFC 5137, February 2008.

Appendix A. Arguments Against Changing to UTF-8

   Over more than a decade, the question of changing the encoding of
   RFCs to UTF-8 has come up repeatedly. Although many people wanted
   the change, various people had different reasons why they felt it
   was a bad idea. This appendix is a summary of those arguments and
   an explanation of why they are no longer as critical as they were
   long ago.

A.1. Difficulty in Displaying

   Some text display systems only know how to display US-ASCII.
   Displaying an RFC that uses non-ASCII characters encoded in UTF-8
   will cause those characters to be unreadable.

   There are, of course, still such display systems, and there always
   will be. However, the number is dwindling as more software is
   improved to display non-ASCII characters and, in particular, to read
   UTF-8 as an encoding. Of the systems that can only render US-ASCII,
   only a small subset drop non-ASCII characters: the others show an
   incorrect character in its place. Thus, the person using such a
   system can often see that there is a problem, and can possibly
   choose to get better display software.

A.2. Difficulty in Printing

   Some printers can only print a limited set of characters due to the
   fact that they are character-oriented, not graphical. Such printers
   inherently cannot print characters they do not understand. Almost
   all such printers print the ASCII characters just fine.

   There are, of course, still such printers, and there always will be.
   However, the number is dwindling as older printers are replaced with
   ones that can print graphics so that now-common text features like
   boldface and italics can be printed.

A.3. Insufficient Fonts

   Almost no display system that can display text that is encoded with
   UTF-8 can display every character in the Unicode repertoire.
   Thus, some non-ASCII characters that are included in RFCs will not
   display properly.

   Virtually every system that can display Unicode knows how to
   substitute a replacement character for ones that cannot be
   displayed. In fact, most such systems have glyphs for rendering
   unknown characters and different glyphs for rendering known
   characters for which the system has no font.

A.4. Inability to Search for Non-ASCII Characters

   If authors start using non-ASCII characters in their names and/or
   addresses, people who know the characters but are unfamiliar with
   the user interface on their computers may not be able to enter those
   characters in the search criteria. For example, some people do not
   know how to enter "u-with-umlaut" in their operating system, even
   though the operating system allows such input.

   This is a valid concern, but one that is orthogonal to whether or
   not RFCs should use these characters. The alternative (never go to
   UTF-8) simply shifts the problem to forcing the user to guess which
   ASCII-only spelling to use when searching.

Appendix B. Changes from -02 to -03

   Changed the example name from Frank Hrst to Fltstrm.

   In 2.1, changed "It is suggested that the RFC Editor limit..." to
   "It is suggested that the IETF Secretariat and RFC Editor limit..."

   Made 2.4 match 2.1 by saying that postal addresses can be in UTF-8
   as well.

Authors' Addresses

   Paul Hoffman
   VPN Consortium

   Email: paul.hoffman@vpnc.org


   Tim Bray
   Sun Microsystems

   Email: tbray@textuality.com

Full Copyright Statement

   Copyright (C) The IETF Trust (2008).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on
   an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
   REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE
   IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL
   WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY
   WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE
   ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS
   FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed
   to pertain to the implementation or use of the technology described
   in this document or the extent to which any license under such
   rights might or might not be available; nor does it represent that
   it has made any independent effort to identify any such rights.
   Information on the procedures with respect to rights in RFC
   documents can be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use
   of such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository
   at http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at
   ietf-ipr@ietf.org.

Acknowledgment

   Funding for the RFC Editor function is provided by the IETF
   Administrative Support Activity (IASA).