Folding and collation (comparing strings)

Describes descriptor folding and descriptor collation.

There are two techniques that may be used to modify the characters in a descriptor prior to performing operations such as comparisons on text strings:

folding
collation

Folding

Folding is a relatively simple way of normalising text for comparison by removing case distinctions, converting accented characters to characters without accents, and so on. Folding is used for tolerant comparisons, that is, comparisons that are biased towards a match.

For example, the file system uses folding to decide whether two file names are identical or not. Folding is locale-independent, which means that the file system, for example, can also be locale-independent.

It is important to note that there is no guarantee that folding is in any way culturally appropriate, and it should not be used for comparing strings in natural language; collation is the correct functionality for this.

Variants of member functions that fold are provided where appropriate. For example, TDesC16::CompareF() for folded comparison.
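
The idea behind a folded comparison can be sketched in standard C++. This is a toy model only: FoldChar() and CompareFolded() are hypothetical names, the folding table covers just a few Latin characters, and none of it is the Symbian implementation behind CompareF().

```cpp
#include <string>

// Toy folding table: maps one UTF-16 code unit to a case-free, accent-free form.
// Only ASCII letters and four accented letters are covered (an assumption made
// for illustration; a real folding table is far larger).
char16_t FoldChar(char16_t c) {
    if (c >= u'A' && c <= u'Z') return static_cast<char16_t>(c - u'A' + u'a');
    switch (c) {
        case 0x00C9: case 0x00E9: return u'e';  // É, é
        case 0x00DC: case 0x00FC: return u'u';  // Ü, ü
        default: return c;
    }
}

// Returns a negative value, 0 or a positive value, comparing the folded
// forms of the two strings one code unit at a time.
int CompareFolded(const std::u16string& a, const std::u16string& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        const char16_t fa = FoldChar(a[i]);
        const char16_t fb = FoldChar(b[i]);
        if (fa != fb) return fa < fb ? -1 : 1;
    }
    if (a.size() == b.size()) return 0;
    return a.size() < b.size() ? -1 : 1;
}
```

Because the mapping is strictly one character to one character, this sketch treats "README.TXT" and "readme.txt" as identical, but it cannot make "straße" match "STRASSE". That limitation is one reason folding is unsuitable for natural-language text.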

Collation

Collation is a more powerful way to compare strings than folding, and produces a dictionary-like ('lexicographic') ordering. Folding cannot remove piece accents or deal with correspondences that are not one-to-one, such as the mapping from German upper case SS to lower case ß. In addition, folding cannot optionally ignore punctuation.

For languages using the Latin script, for example, collation is about deciding whether to ignore punctuation, whether to fold upper and lower case, how to treat accents, and so on. In a given locale there is usually a standard set of collation rules that can be used.

Collation should always be used for comparing strings in natural language.

Variants of member functions that use collation are provided where appropriate. For example, TDesC16::CompareC() for collated comparison.

Comparing and sorting strings

The TDesC16::CompareC() variant prototyped as:

TInt CompareC(const TDesC16& aDes, TInt aMaxLevel, const TCollationMethod* aCollationMethod) const;

returns 0 if the two strings match.

There are many ways in which two strings can match, even when they do not have the same length:

if one string includes combining characters, but the collation level is set to 0 (which means that accents are ignored)
if one string contains "pre-composed" versions of accented characters and the other contains "decomposed" versions of the same character
if one string contains a ligature that, in a collation table, matches multiple characters in the other string and the collation level is set to less than 3 (for example "æ" might match "ae")
if one string contains a "surrogate pair" (a character outside the Basic Multilingual Plane, encoded as two 16-bit code units) which happens to match a normal character at the level specified
if the collation method does not have its "ignore none" flag set and the collation level is set to less than 3, then spaces and punctuation are ignored; this means that one string could be much longer than the other simply because it contains a large number of spaces
if one string contains the Hangul representation of Korean text and the other contains the Jamo representation of the same text, and the collation level is set to less than 3.
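
Some of the cases listed above can be illustrated with a small standard C++ model. LetterKey() and MatchesAtLevel0() are hypothetical names, and the rules cover only a handful of characters; the sketch shows how strings of different lengths can reduce to the same sequence of letter identities, not how the Symbian collation tables actually work.

```cpp
#include <string>

// Builds a key of letter identities only, in the spirit of collation level 0.
// Toy rules (assumptions, not a real collation table):
//  - combining marks U+0300..U+036F are dropped
//  - precomposed é (U+00E9) reduces to 'e'
//  - the ligature æ (U+00E6) expands to "ae"
//  - spaces and a few punctuation characters are ignored
//  - ASCII upper case is mapped to lower case
std::u16string LetterKey(const std::u16string& s) {
    std::u16string key;
    for (char16_t c : s) {
        if (c >= 0x0300 && c <= 0x036F) continue;                        // combining accent
        if (c == u' ' || c == u'-' || c == u',' || c == u'.') continue;  // ignored
        if (c == 0x00E9) { key += u'e'; continue; }                      // é
        if (c == 0x00E6) { key += u"ae"; continue; }                     // æ
        if (c >= u'A' && c <= u'Z') c = static_cast<char16_t>(c - u'A' + u'a');
        key += c;
    }
    return key;
}

// Two strings match when their letter keys are identical, even if the
// original strings contain different numbers of code units.
bool MatchesAtLevel0(const std::u16string& a, const std::u16string& b) {
    return LetterKey(a) == LetterKey(b);
}
```

With these rules, a precomposed "é" (one code unit) matches "e" followed by a combining acute accent (two code units), "æon" matches "aeon", and "Co-op" matches "c o o p".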

The collation level is an integer that can take one of the values 0, 1, 2 or 3, and determines how strict the matching of two strings should be. This value is passed as the second parameter to CompareC(). The values have the following meanings:

0 - only test the character identity; accents and case are ignored
1 - test the character identity and accents; case is ignored
2 - test the character identity, accents and case
3 - test the Unicode value as well as the character identity, accents and case.

At levels 0-2:

ligatures (e.g. "æ") are the same as their decomposed equivalents (e.g. "ae")
script variants are the same (for example "R" matches the mathematical real number symbol (Unicode 211D))
the "micro" symbol (Unicode 00B5) matches Greek "mu" (Unicode 03BC).

At level 3 these are treated differently.
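
The way the levels build on one another can be modelled in standard C++. The sketch below is a toy, not the Symbian algorithm: Weights, Decompose(), Elements() and CompareToy() are hypothetical names, and the tables know only ASCII letters, é/É, the double-struck R and the micro sign. Level 3 is modelled as a comparison of the decomposed code points.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Weights of one collation element for levels 0 (letter), 1 (accent), 2 (case).
struct Weights { char32_t letter; int accent; int caseRank; };

// Decomposes the two precomposed characters this toy knows about.
std::u32string Decompose(const std::u32string& s) {
    std::u32string out;
    for (char32_t c : s) {
        if (c == 0x00E9)      out += U"e\u0301";  // é -> e + combining acute
        else if (c == 0x00C9) out += U"E\u0301";  // É -> E + combining acute
        else                  out += c;
    }
    return out;
}

// Maps a decomposed string to collation elements.
std::vector<Weights> Elements(const std::u32string& decomposed) {
    std::vector<Weights> out;
    for (char32_t c : decomposed) {
        if (c == 0x0301) {                       // accent modifies the previous letter
            if (!out.empty()) out.back().accent = 1;
        } else if (c >= U'A' && c <= U'Z') {     // upper case: same letter, case rank 1
            out.push_back({static_cast<char32_t>(c - U'A' + U'a'), 0, 1});
        } else if (c == 0x211D) {                // double-struck R: script variant of "R"
            out.push_back({U'r', 0, 1});
        } else if (c == 0x00B5) {                // micro sign: same letter as Greek mu
            out.push_back({0x03BC, 0, 0});
        } else {
            out.push_back({c, 0, 0});
        }
    }
    return out;
}

long WeightAt(const Weights& w, int level) {
    if (level == 0) return static_cast<long>(w.letter);
    return level == 1 ? w.accent : w.caseRank;
}

// Compares level by level; a higher level is consulted only when every
// lower level reports a tie. Returns <0, 0 or >0.
int CompareToy(const std::u32string& a, const std::u32string& b, int maxLevel) {
    assert(maxLevel >= 0 && maxLevel <= 3);
    const std::u32string da = Decompose(a), db = Decompose(b);
    const std::vector<Weights> ea = Elements(da), eb = Elements(db);
    for (int level = 0; level <= std::min(maxLevel, 2); ++level) {
        const std::size_t n = std::min(ea.size(), eb.size());
        for (std::size_t i = 0; i < n; ++i) {
            const long wa = WeightAt(ea[i], level), wb = WeightAt(eb[i], level);
            if (wa != wb) return wa < wb ? -1 : 1;
        }
        if (ea.size() != eb.size()) return ea.size() < eb.size() ? -1 : 1;
    }
    if (maxLevel == 3 && da != db) return da < db ? -1 : 1;  // Unicode values
    return 0;
}
```

In this model "resume" and "résumé" match at level 0 but not at level 1, "bee" and "BEE" match at level 1 but not at level 2, and "R" and the double-struck R match at level 2 but not at level 3. A precomposed "é" and its decomposed form match at every level, including 3.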

If the aim is to sort strings, then level 3 must be used. For any strings a and b, if a < b at some level of collation, then a < b at all higher levels as well. Using a level lower than 3 therefore cannot change the relative order of any strings that differ at that level; it only causes strings that match at that level to be sorted in an arbitrary order. In standard English, sorting at level 3 gives the following order:

bat < bee < BEE < bus

The case of the B only affects the comparison after all the letter identities have been found to be the same. This is usually the effect people are trying to achieve by sorting at a collation level lower than 3, so lowering the level is never necessary.
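
The ordering above can be reproduced with a short standard C++ sketch for ASCII strings. CompareLetters(), LessFull() and SortFull() are hypothetical names; the code models only the "letters first, case as tie-breaker" behaviour and does not use CompareC().

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Letter identity only (case ignored), similar in spirit to a low collation level.
int CompareLetters(const std::string& a, const std::string& b) {
    std::string la = a, lb = b;
    for (char& c : la) c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    for (char& c : lb) c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return la.compare(lb);
}

// Full ordering: letters decide first; case is consulted only when all the
// letters are the same, with lower case sorting before upper case.
bool LessFull(const std::string& a, const std::string& b) {
    const int byLetters = CompareLetters(a, b);
    if (byLetters != 0) return byLetters < 0;
    // Same letters, so the lengths are equal; the first case difference decides.
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) return std::islower(static_cast<unsigned char>(a[i])) != 0;
    }
    return false;
}

// Returns a sorted copy of the input using the full ordering.
std::vector<std::string> SortFull(std::vector<std::string> v) {
    std::sort(v.begin(), v.end(), LessFull);
    return v;
}
```

Sorting "bus", "BEE", "bat", "bee" with SortFull() yields bat, bee, BEE, bus. Sorting with CompareLetters() alone would give no defined order for "bee" and "BEE", because they compare as equal.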

The sort order can be affected by setting flags in the TCollationMethod object.

Note that when strings match at level 3, they do not necessarily have the same binary representation, or even the same length. Unicode contains many strings that are regarded as equivalent, even though they have different binary representations.