Commentary by Mark Wahl, CISA

Organizing principles for systems:
Issues with internationalizing domain names (20070729)

In the OpenID Authentication 1.1 protocol, an end user provides their identifier URL (in either the http or https scheme) to a relying party web site they are visiting, by typing it into a field of the relying party site's web form. The OpenID Authentication 2.0 protocol is similar, but currently also allows the end user's identifier to be an XRI.

An HTTP or HTTPS URL is typically expressed with the components

the scheme name            http or https
a host id                  example.com
an optional port number    :8080
an optional path           /x.cgi
an optional query          ?foo=bar
an optional fragment       #section2
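
As a minimal sketch of this decomposition, the example components above can be joined into one URL and split again with Python's standard `urllib.parse` module. The helper name `url_components` is an assumption made for illustration and is not part of OpenID.

```python
from urllib.parse import urlsplit


def url_components(url: str) -> dict:
    """Split an http/https URL into the components listed above."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,      # http or https
        "host": parts.hostname,      # example.com
        "port": parts.port,          # 8080, or None when absent
        "path": parts.path,          # /x.cgi
        "query": parts.query,        # foo=bar (without the leading '?')
        "fragment": parts.fragment,  # section2 (without the leading '#')
    }


components = url_components("http://example.com:8080/x.cgi?foo=bar#section2")
for name, value in components.items():
    print(f"{name}: {value}")
```

Note that the parser returns the query and fragment without their `?` and `#` delimiters, and the port as an integer.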

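Currently the two most common representation choices of OpenID URLs, for a user with a userid at an identity provider organization, are to place the userid in a domain name component of the host, or to place it in the path of the URL.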

There are some significant differences between representing a userid as a domain name component and in a path, including

domain name component | HTTP URI path
case-insensitive | case-sensitive
length limited to 63 octets per component and 255 octets for the whole name (RFC 1035 section 2.3.4; see also RFC 1123 section 2.1) | length not limited by HTTP
either an ASCII alphanumeric string [a-z0-9-] (RFC 1034 section 3.5), or an international domain name component that is UTF-8 encoded and with its octets percent-encoded | must begin with a /; the strings /./ and /../ have special significance; some characters must be percent-encoded
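
The case-sensitivity and dot-segment differences can be illustrated with a short sketch using Python's `urllib.parse`. The userids `alice` and `bob`, the helper `same_identifier`, and the choice of which components to compare are assumptions made for this example; they do not reproduce the normalization rules of any OpenID specification.

```python
from urllib.parse import urljoin, urlsplit


def same_identifier(a: str, b: str) -> bool:
    """Compare two URLs, treating the host as case-insensitive
    and the path as case-sensitive (hypothetical helper)."""
    pa, pb = urlsplit(a), urlsplit(b)
    # urlsplit lower-cases the scheme, and .hostname lower-cases the host;
    # the path is left exactly as written.
    return (pa.scheme, pa.hostname, pa.port, pa.path) == (
        pb.scheme, pb.hostname, pb.port, pb.path)


# Userid as a domain name component: case differences do not matter.
host_match = same_identifier("http://Alice.example.com/",
                             "http://alice.example.com/")

# Userid in the path: case differences produce a different identifier.
path_match = same_identifier("http://example.com/Alice",
                             "http://example.com/alice")

# "/../" has special significance in a path: it removes a segment.
resolved = urljoin("http://example.com/users/alice/", "../bob")

print(host_match, path_match, resolved)
```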

IDNA (RFC 3490), a proposed standard, defines an internationalized domain name as one whose components can be internationalized labels, which contain encoded Unicode characters from outside of the ASCII range. Only a small part of the Unicode 3.2 repertoire cannot be used in a label: the Unicode "dots" U+3002 (ideographic full stop), U+FF0E (fullwidth full stop) and U+FF61 (halfwidth ideographic full stop), which are treated as label separators, as well as space characters, control characters, private use characters, non-character code points, surrogate characters, characters inappropriate for plain text or canonical representation, display property characters and language tag characters. IDNA converts non-ASCII characters using the "nameprep" (RFC 3491) profile of "stringprep" (RFC 3454), which among other things maps upper case characters to lower case.
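
A minimal sketch of this conversion, using the `idna` codec in Python's standard library (which implements RFC 3490 and nameprep), is shown here. The domain `bücher.example` and the helper `to_ascii` are assumptions chosen for illustration.

```python
from typing import Optional


def to_ascii(domain: str) -> Optional[str]:
    """Return the ASCII (Punycode) form of a domain name, or None if
    nameprep rejects one of its characters (hypothetical helper)."""
    try:
        return domain.encode("idna").decode("ascii")
    except UnicodeError:
        return None


# A label with a non-ASCII character is converted to an "xn--" label.
plain = to_ascii("bücher.example")

# nameprep maps upper case to lower case before encoding.
upper = to_ascii("BÜCHER.example")

# U+3002 (ideographic full stop) is treated as a label separator.
ideographic_dot = to_ascii("bücher\u3002example")

# A private use character (U+E000) is prohibited, so conversion fails.
private_use = to_ascii("b\ue000cher.example")

print(plain, upper, ideographic_dot, private_use)
```

One caveat of this codec: labels that are already pure ASCII are passed through without nameprep, so their case is preserved as typed.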

The internet draft "Proposed Issues and Changes for IDNA - An overview" by John Klensin of July 2007 discusses some of the issues that have been found with the model.

One observation in that draft is that

"Historically, many, perhaps most, of the 'names' in the DNS have just been mnemonics to identify some particular concept, object, or organization. They are typically derived from, or rooted in, some language because most people think in language-based ways. But, because they are mnemonics, they need not obey the orthographic conventions of any language: it is not a requirement that it be possible for them to be 'words'."

Another consideration is display order of the components of a domain name (left-to-right vs right-to-left), which may be different from the order in which the components are transmitted.

"Questions remain about protocol constraints implying that the overall direction of these strings will always be left-to-right (or right-to-left) for an IRI or email address, or if they even should conform to such rules. These questions also have several possible answers. Should a domain name abc.def, in which both labels are represented in scripts that are written right-to-left, be displayed as fed.cba or cba.fed? An IRI for clear text web access would, in network order, begin with 'http://' and the characters will appear as 'http://abc.def' -- but what does this suggest about the display order? When entering a URI to many browsers, it may be possible to provide only the domain name and leave the 'http://' to be filled in by default, assuming no tail (an approach that does not work for other protocols). The natural display order for the typed domain name on a right-to-left system is fed.cba. Does this change if a protocol identifier, tail, and the corresponding delimiters are specified?"
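
To make the distinction between network order and display order concrete, the following sketch reports, for each label of a domain name held in network (logical) order, whether it contains right-to-left characters. It only classifies labels with Python's `unicodedata` module and does not attempt to answer the display-order questions raised in the draft. The Hebrew test labels and the helper `label_directions` are assumptions for illustration.

```python
import unicodedata

# Bidirectional classes for strongly right-to-left characters:
# "R" (e.g. Hebrew letters) and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}


def label_directions(domain: str) -> list:
    """Return (label, 'rtl' or 'ltr') pairs in network order."""
    result = []
    for label in domain.split("."):
        is_rtl = any(unicodedata.bidirectional(ch) in RTL_CLASSES
                     for ch in label)
        result.append((label, "rtl" if is_rtl else "ltr"))
    return result


# Two labels written in a right-to-left script, stored in network order.
rtl_domain = "\u05d0\u05d1\u05d2.\u05d3\u05d4\u05d5"
rtl_result = label_directions(rtl_domain)
ltr_result = label_directions("abc.def")

print([direction for _, direction in rtl_result])
print([direction for _, direction in ltr_result])
```

The string in memory always lists the labels in the same order; how they appear on screen is decided later by whatever renders the text.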

An important issue with OpenID in web browser interactions as it relates to international domain names is that the user does not type in their OpenID identifier URL in the 'address bar' of the web browser, where URLs are typically typed in, but instead they enter their URL in a form field of a web page.
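
One likely consequence is that any conversion the browser would apply to an address-bar URL has to be done by the relying party on the submitted form value instead. The sketch below shows what such handling might look like. The function name `normalize_typed_identifier`, the default of `http://` when no scheme is typed, the added trailing `/`, and the dropping of the fragment and any userinfo are all assumptions of this example, not a statement of what the OpenID specifications require.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_typed_identifier(typed: str) -> str:
    """Turn a string typed into a form field into an ASCII-host URL
    (hypothetical relying-party helper)."""
    text = typed.strip()
    if "://" not in text:
        # Assumption: a bare domain name is treated as an http URL.
        text = "http://" + text
    parts = urlsplit(text)
    host = parts.hostname or ""
    # Convert internationalized labels to their ASCII form (RFC 3490).
    ascii_host = host.encode("idna").decode("ascii")
    netloc = ascii_host if parts.port is None else f"{ascii_host}:{parts.port}"
    path = parts.path or "/"
    # Assumption: the fragment and any userinfo are dropped here.
    return urlunsplit((parts.scheme, netloc, path, parts.query, ""))


typed_idn = normalize_typed_identifier("  bücher.example ")
typed_url = normalize_typed_identifier("https://Example.com/alice#x")

print(typed_idn)
print(typed_url)
```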

Addressing these limitations would require changes such as

   <form>
      <label for="openid_identifier">OpenID</label>
      <input type="uri" id="openid_identifier" name="openid_identifier" title="OpenID">
      <label for="openid_locale">Your language</label>
      <select id="openid_locale" name="openid_locale" title="Your Language" size="1">
         <option value="en" label="English" />
         <option value="fr" label="French" />
         ...
      </select>
   </form>