Commentary by Mark Wahl, CISA

Organizing principles for systems:
Observations 1-5 for identity data sharing (20070719)

Marc Canter and others have been proposing a data sharing summit to discuss the protocols and other standards needed for data sharing between social networking services.

Besides Kim Cameron of Microsoft's well-known "Seven Laws of Identity"

  1. Digital identity systems must only reveal information identifying a user with the user's consent.
  2. The solution which discloses the least identifying information and best limits its use is the most stable, long-term solution.
  3. Digital identity systems must limit disclosure of identifying information to parties having a necessary and justifiable place in a given identity relationship.
  4. A universal identity metasystem must support both 'omnidirectional' identifiers for use by public entities and 'unidirectional' identifiers for private entities, thus facilitating discovery while preventing unnecessary release of correlation handles.
  5. A universal identity metasystem must channel and enable the interworking of multiple identity technologies run by multiple identity providers.
  6. A unifying identity metasystem must define the human user as a component integrated through protected and unambiguous human-machine communications.
  7. A unifying identity metasystem must provide a simple consistent experience while enabling separation of contexts through multiple operators and technologies.

and Mike Neuenschwander of the Burton Group's "Seven Tragic Flaws of Identity" (which I described here, and Dave Kearns here):

  1. Failure of the weakest links mustn't lead to catastrophe
  2. Don't put the role before the start
  3. Not every identity nail requires the technology hammer
  4. Use of a system invites abuse of it
  5. Identifying things doesn't make them more secure
  6. Identity isn't about the individual
  7. There are a lot more than 7 flaws

I have noted some observations, made previously on this blog, on the specific topic of the technology problems that arise when sharing identity data across organizational boundaries.

1. Attempting to model a real-world object as complex as a person as merely a 'list of type:value pairs' or 'a hierarchical XML document' is insufficient and unworkable.

Don't make spherical cow assumptions: it's been tried many times, and invariably it fails each time. Today, these are the wrong starting points as they lead to problems when attempting to interwork with systems that have a much richer model, as richer data can't be flattened into a simple list of attributes. Connections are lost, metadata/annotations are lost.

This is derived in part from Mike Neuenschwander's Flaw #6: identity is not about the individual, it's about the relationship. It was also seen in the failures of LDAP and X.500 to model attributes that are related to each other (e.g., a telephone number and a mailing address that are both tied to a site): X.500 tried with its "families of entries" concept, and LDAP deployments ended up with application-proprietary XML blobs as values.

For another example, Mike Jones of Microsoft wrote in "Interoperable Verified Identity Claims" about a form of value metadata of interest to CardSpace: indicating whether an attribute value has been verified, and if so, by whom, how, and for what validity period. He suggested a syntax for a rich attribute representing a "verified name". I discussed in a "scorecard" how hard this would be to implement in various existing identity data models.
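To make the loss concrete, here is a minimal Python sketch of a value that carries verification metadata, and of what happens when it is flattened into type:value pairs. The class names, field names and sample values are hypothetical illustrations; this is not the syntax proposed in "Interoperable Verified Identity Claims".

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Verification:
    verified_by: str   # who performed the check (hypothetical field)
    method: str        # how the value was checked
    valid_from: date   # start of the validity period
    valid_until: date  # end of the validity period


@dataclass(frozen=True)
class AttributeValue:
    attr_type: str
    value: str
    verification: Optional[Verification] = None

    def is_verified_on(self, day: date) -> bool:
        v = self.verification
        return v is not None and v.valid_from <= day <= v.valid_until


def flatten(values):
    """Reduce rich values to plain (type, value) pairs.

    The verification metadata has nowhere to go and is silently dropped.
    """
    return [(v.attr_type, v.value) for v in values]


# Illustrative data only.
name = AttributeValue(
    "verifiedName",
    "Robert Smith",
    Verification("Example Registrar", "in-person document check",
                 date(2007, 1, 1), date(2008, 1, 1)),
)
```

After `flatten([name])` the recipient holds only `("verifiedName", "Robert Smith")` and can no longer tell who verified the name or when that verification lapses.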

The best thing to do might be to not even attempt to send the description of a person from one system to another.

 

2. It's better to provide answers to questions, than the data behind those answers.

Bob Blakley of the Burton Group wrote in "The Meta-identity System" about an "Identity Oracle"

"In order to build an asset, the Identity Provider has to stop giving its crown jewels - identity data - to its customers. It can do this simply by changing what it puts into the claims it hands out to Relying Parties. Instead of answering a Relying Party's query "How old is Bob?" with the claim "Bob is 45", it can answer "How old is Bob?" with the claim "Bob is over 18". Instead of answering the query "Is Bob a good credit risk?" with the claim "Bob's credit history is (fifty-page report goes here)", it can answer "Is Bob a good credit risk?" with the claim "97% of people with credit histories similar to Bob's repaid loans of under $200,000 on time." ... The second advantage ... is that it allows the Identity Provider to provide a service to Relying Parties while minimizing the disclosure of specific personal information to those parties - thereby reducing privacy risks to subjects."

This would also help to address the problems I mentioned in an earlier post: services receiving data from other services, and then re-using that information in ways the user didn't expect. Also it makes it somewhat easier for the provider of the data to specify whether it is authoritative for that data.

Of course, this makes caching by the recipient more difficult, and issues for how to handle notification of changes would need to be addressed.
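As a rough illustration of answering the question rather than releasing the data, the following Python sketch shows a provider that holds dates of birth but only ever returns a yes/no answer. The class and method names are assumptions made for this example and do not come from "The Meta-identity System".

```python
from datetime import date


class IdentityOracle:
    """Hypothetical provider that answers predicates instead of releasing raw data."""

    def __init__(self, birth_dates):
        # subject -> date of birth; kept private and never returned to callers
        self._birth_dates = dict(birth_dates)

    def is_at_least(self, subject: str, years: int, today: date) -> bool:
        born = self._birth_dates[subject]
        # Subtract one year if this year's birthday has not happened yet.
        had_birthday = (today.month, today.day) >= (born.month, born.day)
        age = today.year - born.year - (0 if had_birthday else 1)
        return age >= years


# Illustrative data only.
oracle = IdentityOracle({"bob": date(1962, 3, 14)})
```

A relying party calling `oracle.is_at_least("bob", 18, date(2007, 7, 19))` learns that Bob is over 18 and nothing else. Note that the answer depends on the date asked, which is one reason caching by the recipient becomes harder.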

However, before sending an answer, it is necessary for the sender to have confidence the service that asked the question will be able to understand it.

 

3. A service shouldn't share an item of identity data with another service unless the sender knows the recipient can understand it.

Early directory browsing applications were based on the assumption that the client application would search for some user's entry, and upon finding it, retrieve "all user attributes". The directory server would happily oblige and return all the attributes of the user in that entry (subject to access control restrictions), and leave it to the client to sort out what to do with them.

As I mentioned in the principle of contractual disclosure, this is a bad idea today. Instead,

An identity system must only reveal identifying information to a recipient if the identity system and that recipient have agreed on how the recipient can handle and use that information.

This is to ensure the identity system does not inadvertently violate its data management policy by revealing information to a recipient that is not going to follow a compatible policy.

Without such an agreement in place, the operators of an identity system have no information about what the recipient will do with that information, and in particular have no recourse if they find a recipient has misused a user's information provided to it, even if that recipient in general has a place in identity relationships, and even if the information is part of the 'least' identifying information for the user. If the recipient has no identified policy for managing a particular kind of identifying information, then the identity system wouldn't be able to fully answer a user's or auditor's question of how it ensures that the control of that data is maintained.
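One way to picture the principle is a disclosure check that consults a registered agreement before releasing anything. This is only a sketch under stated assumptions: `Agreement`, `IdentitySystem` and `DisclosureError` are hypothetical names, and a real agreement would describe handling and use policy in far more detail than a set of attribute types.

```python
from dataclasses import dataclass


class DisclosureError(Exception):
    """Raised when no agreement covers the requested disclosure."""


@dataclass(frozen=True)
class Agreement:
    recipient: str
    permitted_attributes: frozenset  # attribute types the recipient agreed to handle


class IdentitySystem:
    def __init__(self, records):
        self._records = records  # subject -> {attribute type: value}
        self._agreements = {}    # recipient -> Agreement

    def register_agreement(self, agreement):
        self._agreements[agreement.recipient] = agreement

    def disclose(self, recipient, subject, requested):
        agreement = self._agreements.get(recipient)
        if agreement is None:
            raise DisclosureError(f"no agreement with {recipient}")
        record = self._records[subject]
        # Release only what was asked for AND is covered by the agreement,
        # never "all user attributes".
        return {
            attr: record[attr]
            for attr in requested
            if attr in agreement.permitted_attributes and attr in record
        }


# Illustrative data only.
system = IdentitySystem(
    {"bob": {"mail": "bob@example.com", "telephoneNumber": "+1 555 0100"}}
)
system.register_agreement(Agreement("photo-service", frozenset({"mail"})))
```

Here `system.disclose("photo-service", "bob", ["mail", "telephoneNumber"])` returns only the mail address, and a request from a recipient with no registered agreement raises `DisclosureError` instead of returning data.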

This principle would IMHO be used primarily with respect to private information, but also is relevant for public information: most identity providers will want to make certain statements about the information they provide (e.g. the format, interpretation of fields, ownership/copyright, appropriate/acceptable use etc), as well as defend against libel, provide the right to later revise or revoke it, etc.

Some of these topics are also being considered by the Liberty Alliance as it has taken on the evolution of the "Carmel Apple" (IGF CARML AAPML) specifications proposed earlier this year by Phil Hunt and others of Oracle.

It is also necessary for the recipient of a piece of information to understand not just the information itself, but any metadata which the sender has attached to it.

 

4. Metadata needs to stay attached to identity data.

Metadata for an element of identity information includes administrative annotations, rights and restrictions. Besides the example given earlier, other forms of metadata include,

In the electronic news gathering and media environments, the preservation of the metadata on press photos is critically important. If the name of the photographer or the releases becomes detached from a photo as the photo travels from one organization to another, the photographer may not get credit, or the recipient may not be able to use it. For this reason, tools such as Photoshop that perform 'drastic' manipulation of images (cropping, resizing, retouching) still recognize the metadata and preserve the metadata values on the image files which these tools generate.

Yet this data is at risk in cross-organization identity flows, as existing identity protocols and systems often make "lossy" translations, and don't have fields in their protocols or databases for this incoming metadata.
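The contrast with image tools can be sketched in Python as follows: a transformation changes the value but carries the annotations forward, and an export step refuses to proceed if the recipient has no field for some annotation. All names and sample values here are hypothetical, and refusing the export is just one possible policy.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotated:
    value: str
    metadata: dict = field(default_factory=dict)  # e.g. creator, rights, restrictions


def transform(item, fn):
    """Change the value but carry every annotation forward unchanged."""
    return Annotated(fn(item.value), dict(item.metadata))


def export_for(item, recipient_fields):
    """Refuse a lossy translation.

    recipient_fields is the set of metadata keys the recipient can store
    (an assumption of this sketch).
    """
    missing = sorted(set(item.metadata) - set(recipient_fields))
    if missing:
        raise ValueError(f"recipient would drop metadata: {missing}")
    return {"value": item.value, "metadata": dict(item.metadata)}


# Illustrative data only.
caption = Annotated(
    "sunrise over the harbour",
    {"creator": "A. Photographer", "rights": "editorial use only"},
)
```

With this shape, `transform(caption, str.upper)` still reports the same creator and rights, while `export_for(caption, {"creator"})` fails loudly rather than quietly detaching the rights statement.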

In theory, while it might be considered OK to remove metadata when the data has been munged into a privacy protecting database where individuals can no longer be directly discerned, in practice these databases may not be viable, as I mentioned in "extracting data from links in social networks" and "attacks on anonymized social networks and fudging oracles", and quoted

"... any privacy mechanism, interactive or non-interactive, providing reasonably accurate answers to a 0.761 fraction of randomly generated weighted subset sum queries, and arbitrary answers on the remaining 0.239 fraction, is blatantly non-private."

 

5. There is no universal schema.

While person schemas have been around for 107 years, during the early 1980s the CCITT/ISO for OSI standardized, based on the work of the IFIP International Computer Messaging WG, an electronic mail addressing scheme that would in theory route messages based on a combination of

 country-name               
 administration-domain-name  
 network-address            
 terminal-identifier         
 private-domain-name         
 organization-name           
 numeric-user-identifier     
 personal-name               
 organizational-unit-names   

This led to the X.500 organizationalPerson, and numerous attempts were subsequently made to develop a common, minimal subset representation of an individual person in an LDAP directory: account, Internet White Pages Schema, wpSchema, Lightweight Internet Person Schema, inetOrgPerson, Microsoft User, etc.

As I mentioned in "Client implications of Kim's fifth law", schemas such as these made assumptions which were not global. For example, in the case of naming attributes,

Schema assumption: individuals only have 3-4 naming attributes
Invalidated by: "A fully evolved nomenclature consists of (in this order) laqab, kunya, ism, patronymic (with or without further nasab), nisba(s)..." from Arabic Nomenclature: a summary guide for beginners

Schema assumption: everyone has a given name and surname
Invalidated by: cultures which do not make use of a surname or family name, e.g., Icelandic or Malay

Schema assumption: everyone has a single full name or nickname that is what everyone else refers to them as
Invalidated by: Even in the US, Dr. Robert Smith would expect to be called "Dr. Smith" by a patient and "Bob" by a family member.

Schema assumption: everyone in a culture has the same 'display order' for their name
Invalidated by: "I've found that name order in Chinese persons is a marginally reliable indicator of attitudes towards the West.", quoted from http://www.crookedtimber.org/archives/001435.html

In the late 1990s I and others in the SP-DNA working group attempted to develop a CIM-aligned "conceptual model" for directory contents, to separate applications from directory schemas.

The Conceptual Model was to be specified as a series of views, each corresponding to a role in the deployment (Service Provider or Hosted Organization). Within each view were a series of layers. Each layer defined a UML graph illustrating the relationships between the modeled objects, implemented as instances of classes. (The layers interconnected and were mainly a way of constraining the size of each graph to be understandable and fit on a single page.)

More recently, the identity schemas have been tied to the web ontology language OWL, through reverse engineering of schema. However, as I discussed in "ontologies for schema, continued", there is probably no single ontology possible for describing all that has been done with directory schema, and in "Schema ontologies: some considerations", that there is not yet a generally agreed-upon ontology framework amongst ontology researchers (everything from OpenCyc with its topic map to the ISO SUMO to dozens of industry bodies have their own upper-level ontologies and languages, with no de facto standard, as debated on the Wikipedia page for upper ontology).

Furthermore, the 'minimal subset' of schema across existing identity systems is so minimal that it is not really worth standardizing, since too many systems have assumptions that limit their interoperability in this regard (e.g. basic issues such as restrictions on the form of a name or the uniqueness constraints (across all time, across multiple systems) of a "unique identifier" attribute, let alone questions such as whether they have a concept of an "individual person" or not).

A bigger issue than trying to find the minimal schema is how service-specific, application-specific, community-specific and user-specific schema extensions can better be handled. There is no way that a universal schema will accommodate every schema need, since any set of users might identify a need for an element of schema that is not met by any existing schema.

For a practical example, suppose a service wishes to interwork with another service in which there are Icelandic or Russian users who have a naming attribute "Patronymic". It is insufficient for a service to merely store this as 'yet another attribute', as it has key social significance, as described in the Wikipedia entry on Patronymic:

"A Russian will almost never formally address a person named Mikhail as just 'Mikhail', but rather as 'Mikhail' plus his patronymic (for instance, 'Mikhail Nikolayevich' or 'Mikhail Sergeyevich' etc). However, on informal occasions when a person is using the diminutive of a name, such as Misha for Mikhail, the patronymic is hardly ever used. "

This issue is discussed further in my post on "decentralized l10n".
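To illustrate why the patronymic is more than 'yet another attribute', here is a small Python sketch of the form-of-address convention from the Wikipedia quotation above. The class and function names are hypothetical, and the two-way formal/informal switch is a deliberate simplification of real usage.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RussianStyleName:
    given: str
    patronymic: str
    diminutive: Optional[str] = None


def form_of_address(name: RussianStyleName, formal: bool) -> str:
    if formal:
        # Formal address: given name plus patronymic, never the given name alone.
        return f"{name.given} {name.patronymic}"
    # Informal address: the diminutive if one is known, without the patronymic.
    return name.diminutive or name.given


# Illustrative data only.
mikhail = RussianStyleName("Mikhail", "Sergeyevich", diminutive="Misha")
```

A service that stored the patronymic as an opaque string could keep the value, but it would have no way to know that it belongs in the formal greeting and not the informal one.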

One can envisage locales which correspond to very small enclaves of use, perhaps one individual or a particular set of individuals. These locales are defined by their participant users rather than by a geography. They do not need to imply a different and private written language; the locale could use English, Klingon, or other languages or combinations of languages. Instead, the locale defines a particular set of operating conventions for software in this locale, that are driven by the requirements of the participants.

For Identity Management, this implies that the schema of the deployment, as well as aspects of the system which depend upon that schema, such as the provisioning user interface, should be determined by the choices made for that locale, which might be arbitrary and change over time. If a community decides to have for example a favoriteDrink attribute, and defines the data management expectations for this attribute (it is a user supplied string that can be displayed to any user of the system), there is no reason to suggest that this attribute will be in conflict with other requirements of an extensible or general purpose identity system, so it should be possible to implement.
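The favoriteDrink example might be sketched in Python as a schema owned by the locale itself, where the community registers the attribute together with its data management expectations. `AttributeDefinition`, `LocaleSchema` and the `"any-user"` visibility label are names invented for this sketch, not part of any existing identity system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeDefinition:
    name: str
    value_type: type
    user_supplied: bool
    visible_to: str  # data management expectation, e.g. "any-user"


class LocaleSchema:
    def __init__(self, locale):
        self.locale = locale
        self._definitions = {}

    def define(self, definition):
        if definition.name in self._definitions:
            raise ValueError(f"{definition.name} is already defined in {self.locale}")
        self._definitions[definition.name] = definition

    def problems(self, entry):
        """Return a list of human-readable problems; empty means the entry conforms."""
        found = []
        for name, value in entry.items():
            definition = self._definitions.get(name)
            if definition is None:
                found.append(f"{name}: not defined in this locale")
            elif not isinstance(value, definition.value_type):
                found.append(f"{name}: expected {definition.value_type.__name__}")
        return found


# Illustrative locale defined by its participants rather than by a vendor.
club = LocaleSchema("example-club")
club.define(
    AttributeDefinition("favoriteDrink", str, user_supplied=True, visible_to="any-user")
)
```

Because the definitions live with the locale, aspects that depend on the schema, such as a provisioning user interface, could in principle be generated from them rather than hard coded.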

As the set of potential locales in this expanded concept is quite large, there is no possibility that a single vendor could attempt to hard code or upon demand implement every possible locale. Nor would there be an external volunteer community that would be willing to take up the challenge and implement the locale. Furthermore, attempting to even coordinate this system at a single point would be taxing to that service provider, as this would require a system that could scale to the size of (for example) the Yahoo! Groups environment.

For this concept to be realized, therefore, the localization process needs to be decentralized: the users of a system can modify their system to meet their needs. The approach sounds simple in principle, although going by implementation experience, it is rarely fully realized in practice.

Some of the barriers to decentralized localization of existing Identity Management deployments have been:

Some of the evaluation criteria I proposed, in the context of naming attributes, included