<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"https://www.w3.org/TR/html4/loose.dtd">
|
|
|
<html>
|
|
|
<head>
|
|
|
<meta name="generator" content=
|
|
|
"HTML Tidy for HTML5 for Apple macOS version 5.6.0">
|
|
|
<meta http-equiv="Content-Type" content=
|
|
|
"text/html; charset=utf-8">
|
|
|
<meta http-equiv="Content-Language" content="en-us">
|
|
|
<link rel="stylesheet" href=
|
|
|
"../reports.css" type="text/css">
|
|
|
<title>UTS #35: Unicode LDML: Collation</title>
|
|
|
<style type="text/css">
|
|
|
<!--
|
|
|
.dtd {
|
|
|
font-family: monospace;
|
|
|
font-size: 90%;
|
|
|
background-color: #CCCCFF;
|
|
|
border-style: dotted;
|
|
|
border-width: 1px;
|
|
|
}
|
|
|
|
|
|
.xmlExample {
|
|
|
font-family: monospace;
|
|
|
font-size: 80%
|
|
|
}
|
|
|
|
|
|
.blockedInherited {
|
|
|
font-style: italic;
|
|
|
font-weight: bold;
|
|
|
border-style: dashed;
|
|
|
border-width: 1px;
|
|
|
background-color: #FF0000
|
|
|
}
|
|
|
|
|
|
.inherited {
|
|
|
font-weight: bold;
|
|
|
border-style: dashed;
|
|
|
border-width: 1px;
|
|
|
background-color: #00FF00
|
|
|
}
|
|
|
|
|
|
.element {
|
|
|
font-weight: bold;
|
|
|
color: red;
|
|
|
}
|
|
|
|
|
|
.attribute {
|
|
|
font-weight: bold;
|
|
|
color: maroon;
|
|
|
}
|
|
|
|
|
|
.attributeValue {
|
|
|
font-weight: bold;
|
|
|
color: blue;
|
|
|
}
|
|
|
|
|
|
li, p {
|
|
|
margin-top: 0.5em;
|
|
|
margin-bottom: 0.5em
|
|
|
}
|
|
|
|
|
|
h2, h3, h4, table {
|
|
|
margin-top: 1.5em;
|
|
|
margin-bottom: 0.5em;
|
|
|
}
|
|
|
-->
|
|
|
</style>
|
|
|
</head>
|
|
|
<body>
|
|
|
<table class="header" width="100%">
|
|
|
<tr>
|
|
|
<td class="icon"><a href="https://unicode.org"><img alt=
|
|
|
"[Unicode]" src="../logo60s2.gif"
|
|
|
width="34" height="33" style=
|
|
|
"vertical-align: middle; border-left-width: 0px; border-bottom-width: 0px; border-right-width: 0px; border-top-width: 0px;"></a>
|
|
|
<a class="bar" href=
|
|
|
"https://www.unicode.org/reports/">Technical Reports</a></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td class="gray"> </td>
|
|
|
</tr>
|
|
|
</table>
|
|
|
<div class="body">
|
|
|
<h2 style="text-align: center">Unicode Technical Standard #35</h2>
|
|
|
<h1>Unicode Locale Data Markup Language (LDML)<br>
|
|
|
Part 5: Collation</h1>
|
|
|
<!-- At least the first row of this header table should be identical across the parts of this UTS. -->
|
|
|
<table border="1" cellpadding="2" cellspacing="0" class="wide">
|
|
|
<tr>
|
|
|
<td>Version</td>
|
|
|
<td>38</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>Editors</td>
|
|
|
<td>Markus Scherer (<a href="mailto:markus.icu@gmail.com">markus.icu@gmail.com</a>) and
|
|
|
<a href="tr35.html#Acknowledgments">other CLDR committee
|
|
|
members</a></td>
|
|
|
</tr>
|
|
|
</table>
|
|
|
<p>For the full header, summary, and status, see <a href=
|
|
|
"tr35.html">Part 1: Core</a></p>
|
|
|
<h3><i>Summary</i></h3>
|
|
|
<p>This document describes parts of an XML format
|
|
|
(<i>vocabulary</i>) for the exchange of structured locale data.
|
|
|
This format is used in the <a href=
|
|
|
"https://unicode.org/cldr/">Unicode Common Locale Data
|
|
|
Repository</a>.</p>
|
|
|
<p>This is a partial document, describing only those parts of
|
|
|
the LDML that are relevant for collation (sorting, searching
|
|
|
&amp; grouping). For the other parts of the LDML see the
|
|
|
<a href="tr35.html">main LDML document</a> and the links
|
|
|
above.</p>
|
|
|
<h3><i>Status</i></h3>
|
|
|
|
|
|
<!-- NOT YET APPROVED
|
|
|
<p>
|
|
|
<i class="changed">This is a<b><font color="#ff3333">
|
|
|
draft </font></b>document which may be updated, replaced, or superseded by
|
|
|
other documents at any time. Publication does not imply endorsement
|
|
|
by the Unicode Consortium. This is not a stable document; it is
|
|
|
inappropriate to cite this document as other than a work in
|
|
|
progress.
|
|
|
</i>
|
|
|
</p>
|
|
|
END NOT YET APPROVED -->
|
|
|
<!-- APPROVED -->
|
|
|
<p><i>This document has been reviewed by Unicode members and
|
|
|
other interested parties, and has been approved for publication
|
|
|
by the Unicode Consortium. This is a stable document and may be
|
|
|
used as reference material or cited as a normative reference by
|
|
|
other specifications.</i></p>
|
|
|
<!-- END APPROVED -->
|
|
|
|
|
|
<blockquote>
|
|
|
<p><i><b>A Unicode Technical Standard (UTS)</b> is an
|
|
|
independent specification. Conformance to the Unicode
|
|
|
Standard does not imply conformance to any UTS.</i></p>
|
|
|
</blockquote>
|
|
|
<p><i>Please submit corrigenda and other comments with the CLDR
|
|
|
bug reporting form [<a href="tr35.html#Bugs">Bugs</a>]. Related
|
|
|
information that is useful in understanding this document is
|
|
|
found in the <a href="tr35.html#References">References</a>. For
|
|
|
the latest version of the Unicode Standard see [<a href=
|
|
|
"tr35.html#Unicode">Unicode</a>]. For a list of current Unicode
|
|
|
Technical Reports see [<a href=
|
|
|
"tr35.html#Reports">Reports</a>]. For more information about
|
|
|
versions of the Unicode Standard, see [<a href=
|
|
|
"tr35.html#Versions">Versions</a>].</i></p>
|
|
|
<h2><a name="Parts" href="#Parts" id="Parts">Parts</a></h2>
|
|
|
<!-- This section of Parts should be identical in all of the parts of this UTS. -->
|
|
|
<p>The LDML specification is divided into the following
|
|
|
parts:</p>
|
|
|
<ul class="toc">
|
|
|
<li>Part 1: <a href="tr35.html#Contents">Core</a> (languages,
|
|
|
locales, basic structure)</li>
|
|
|
<li>Part 2: <a href="tr35-general.html#Contents">General</a>
|
|
|
(display names &amp; transforms, etc.)</li>
|
|
|
<li>Part 3: <a href="tr35-numbers.html#Contents">Numbers</a>
|
|
|
(number &amp; currency formatting)</li>
|
|
|
<li>Part 4: <a href="tr35-dates.html#Contents">Dates</a>
|
|
|
(date, time, time zone formatting)</li>
|
|
|
<li>Part 5: <a href=
|
|
|
"tr35-collation.html#Contents">Collation</a> (sorting,
|
|
|
searching, grouping)</li>
|
|
|
<li>Part 6: <a href=
|
|
|
"tr35-info.html#Contents">Supplemental</a> (supplemental
|
|
|
data)</li>
|
|
|
<li>Part 7: <a href=
|
|
|
"tr35-keyboards.html#Contents">Keyboards</a> (keyboard
|
|
|
mappings)</li>
|
|
|
</ul>
|
|
|
<h2><a name="Contents" href="#Contents" id="Contents">Contents
|
|
|
of Part 5, Collation</a></h2>
|
|
|
<!-- START Generated TOC: CheckHtmlFiles -->
|
|
|
<ul class="toc">
|
|
|
<li>1 <a href="#CLDR_Collation">CLDR Collation</a>
|
|
|
<ul class="toc">
|
|
|
<li>1.1 <a href="#CLDR_Collation_Algorithm">CLDR
|
|
|
Collation Algorithm</a>
|
|
|
<ul class="toc">
|
|
|
<li>1.1.1 <a href="#Algorithm_FFFE">U+FFFE</a></li>
|
|
|
<li>1.1.2 <a href=
|
|
|
"#Context_Sensitive_Mappings">Context-Sensitive
|
|
|
Mappings</a></li>
|
|
|
<li>1.1.3 <a href="#Algorithm_Case">Case
|
|
|
Handling</a></li>
|
|
|
<li>1.1.4 <a href=
|
|
|
"#Algorithm_Reordering_Groups">Reordering
|
|
|
Groups</a></li>
|
|
|
<li>1.1.5 <a href="#Combining_Rules">Combining
|
|
|
Rules</a></li>
|
|
|
</ul>
|
|
|
</li>
|
|
|
</ul>
|
|
|
</li>
|
|
|
<li>2 <a href="#Root_Collation">Root Collation</a>
|
|
|
<ul class="toc">
|
|
|
<li>2.1 <a href=
|
|
|
"#grouping_classes_of_characters">Grouping classes of
|
|
|
characters</a></li>
|
|
|
<li>2.2 <a href="#non_variable_symbols">Non-variable
|
|
|
symbols</a></li>
|
|
|
<li>2.3 <a href="#tibetan_contractions">Additional
|
|
|
contractions for Tibetan</a></li>
|
|
|
<li>2.4 <a href="#tailored_noncharacter_weights">Tailored
|
|
|
noncharacter weights</a></li>
|
|
|
<li>2.5 <a href="#Root_Data_Files">Root Collation Data
|
|
|
Files</a></li>
|
|
|
<li>2.6 <a href="#Root_Data_File_Formats">Root Collation
|
|
|
Data File Formats</a>
|
|
|
<ul class="toc">
|
|
|
<li>2.6.1 <a href=
|
|
|
"#File_Format_allkeys_CLDR_txt">allkeys_CLDR.txt</a></li>
|
|
|
<li>2.6.2 <a href=
|
|
|
"#File_Format_FractionalUCA_txt">FractionalUCA.txt</a></li>
|
|
|
<li>2.6.3 <a href=
|
|
|
"#File_Format_UCA_Rules_txt">UCA_Rules.txt</a></li>
|
|
|
</ul>
|
|
|
</li>
|
|
|
</ul>
|
|
|
</li>
|
|
|
<li>3 <a href="#Collation_Tailorings">Collation
|
|
|
Tailorings</a>
|
|
|
<ul class="toc">
|
|
|
<li>3.1 <a href="#Collation_Types">Collation Types</a>
|
|
|
<ul class="toc">
|
|
|
<li>3.1.1 <a href=
|
|
|
"#Collation_Type_Fallback">Collation Type
|
|
|
Fallback</a>
|
|
|
<ul class="toc">
|
|
|
<li>Table: <a href=
|
|
|
"#Sample_requested_and_actual_collation_locales_and_types">
|
|
|
Sample requested and actual collation locales and
|
|
|
types</a></li>
|
|
|
</ul>
|
|
|
</li>
|
|
|
</ul>
|
|
|
</li>
|
|
|
<li>3.2 <a href="#Collation_Version">Version</a></li>
|
|
|
<li>3.3 <a href="#Collation_Element">Collation
|
|
|
Element</a></li>
|
|
|
<li>3.4 <a href="#Setting_Options">Setting Options</a>
|
|
|
<ul class="toc">
|
|
|
<li>Table: <a href="#Collation_Settings">Collation
|
|
|
Settings</a></li>
|
|
|
<li>3.4.1 <a href="#Common_Settings">Common settings
|
|
|
combinations</a></li>
|
|
|
<li>3.4.2 <a href="#Normalization_Setting">Notes on
|
|
|
the normalization setting</a></li>
|
|
|
<li>3.4.3 <a href="#Variable_Top_Settings">Notes on
|
|
|
variable top settings</a></li>
|
|
|
</ul>
|
|
|
</li>
|
|
|
<li>3.5 <a href="#Rules">Collation Rule Syntax</a></li>
|
|
|
<li>3.6 <a href="#Orderings">Orderings</a>
|
|
|
<ul class="toc">
|
|
|
<li>Table: <a href=
|
|
|
"#Specifying_Collation_Ordering">Specifying Collation
|
|
|
Ordering</a></li>
|
|
|
<li>Table: <a href=
|
|
|
"#Abbreviating_Ordering_Specifications">Abbreviating
|
|
|
Ordering Specifications</a></li>
|
|
|
</ul>
|
|
|
</li>
|
|
|
<li>3.7 <a href="#Contractions">Contractions</a>
|
|
|
<ul class="toc">
|
|
|
<li>Table: <a href=
|
|
|
"#Specifying_Contractions">Specifying
|
|
|
Contractions</a></li>
|
|
|
</ul>
|
|
|
</li>
|
|
|
<li>3.8 <a href="#Expansions">Expansions</a></li>
|
|
|
<li>3.9 <a href="#Context_Before">Context Before</a>
|
|
|
<ul class="toc">
|
|
|
<li>Table: <a href=
|
|
|
"#Specifying_Previous_Context">Specifying Previous
|
|
|
Context</a></li>
|
|
|
</ul>
|
|
|
</li>
|
|
|
<li>3.10 <a href=
|
|
|
"#Placing_Characters_Before_Others">Placing Characters
|
|
|
Before Others</a></li>
|
|
|
<li>3.11 <a href="#Logical_Reset_Positions">Logical Reset
|
|
|
Positions</a>
|
|
|
<ul class="toc">
|
|
|
<li>Table: <a href=
|
|
|
"#Specifying_Logical_Positions">Specifying Logical
|
|
|
Positions</a></li>
|
|
|
</ul>
|
|
|
</li>
|
|
|
<li>3.12 <a href=
|
|
|
"#Special_Purpose_Commands">Special-Purpose Commands</a>
|
|
|
<ul class="toc">
|
|
|
<li>Table: <a href=
|
|
|
"#Special_Purpose_Elements">Special-Purpose
|
|
|
Elements</a></li>
|
|
|
</ul>
|
|
|
</li>
|
|
|
<li>3.13 <a href="#Script_Reordering">Collation
|
|
|
Reordering</a>
|
|
|
<ul class="toc">
|
|
|
<li>3.13.1 <a href=
|
|
|
"#Interpretation_reordering">Interpretation of a
|
|
|
reordering list</a></li>
|
|
|
<li>3.13.2 <a href=
|
|
|
"#Reordering_Groups_allkeys">Reordering Groups for
|
|
|
allkeys.txt</a></li>
|
|
|
</ul>
|
|
|
</li>
|
|
|
<li>3.14 <a href="#Case_Parameters">Case Parameters</a>
|
|
|
<ul class="toc">
|
|
|
<li>3.14.1 <a href="#Case_Untailored">Untailored
|
|
|
Characters</a></li>
|
|
|
<li>3.14.2 <a href="#Case_Weights">Compute Modified
|
|
|
Collation Elements</a></li>
|
|
|
<li>3.14.3 <a href="#Case_Tailored">Tailored
|
|
|
Strings</a></li>
|
|
|
</ul>
|
|
|
</li>
|
|
|
<li>3.15 <a href="#Visibility">Visibility</a></li>
|
|
|
<li>3.16 <a href="#Collation_Indexes">Collation
|
|
|
Indexes</a>
|
|
|
<ul class="toc">
|
|
|
<li>3.16.1 <a href="#Index_Characters">Index
|
|
|
Characters</a></li>
|
|
|
<li>3.16.2 <a href="#CJK_Index_Markers">CJK Index
|
|
|
Markers</a></li>
|
|
|
</ul>
|
|
|
</li>
|
|
|
</ul>
|
|
|
</li>
|
|
|
</ul><!-- END Generated TOC: CheckHtmlFiles -->
|
|
|
<h2>1 <a name="CLDR_Collation" href="#CLDR_Collation" id=
|
|
|
"CLDR_Collation">CLDR Collation</a></h2>
|
|
|
<p>Collation is the general term for the process and function
|
|
|
of determining the sorting order of strings of characters, for
|
|
|
example for lists of strings presented to users, or in
|
|
|
databases for sorting and selecting records.</p>
|
|
|
<p>Collation varies by language, by application (some languages
|
|
|
use special phonebook sorting), and other criteria (for
|
|
|
example, phonetic vs. visual).</p>
|
|
|
<p>CLDR provides collation data for many languages and styles.
|
|
|
The data supports not only sorting but also language-sensitive
|
|
|
searching and grouping under index headers. All CLDR collations
|
|
|
are based on the [<a href=
|
|
|
"https://www.unicode.org/reports/tr41/#UTS10">UCA</a>] default
|
|
|
order, with common modifications applied in the CLDR root
|
|
|
collation, and further tailored for language and style as
|
|
|
needed.</p>
|
|
|
<h3>1.1 <a name="CLDR_Collation_Algorithm" href=
|
|
|
"#CLDR_Collation_Algorithm" id="CLDR_Collation_Algorithm">CLDR
|
|
|
Collation Algorithm</a></h3>
|
|
|
<p>The CLDR collation algorithm is an extension of the <a href=
|
|
|
"https://www.unicode.org/reports/tr10/#Main_Algorithm">Unicode
|
|
|
Collation Algorithm</a>.</p>
|
|
|
<h4>1.1.1 <a name="Algorithm_FFFE" href="#Algorithm_FFFE" id=
|
|
|
"Algorithm_FFFE">U+FFFE</a></h4>
|
|
|
<p>U+FFFE maps to a CE with a minimal, unique primary weight.
|
|
|
Its primary weight is not "variable": U+FFFE must not become
|
|
|
ignorable in alternate handling. On the identical level, a
|
|
|
minimal, unique “weight” must be emitted for U+FFFE as well.
|
|
|
This allows for <a href=
|
|
|
"https://www.unicode.org/reports/tr10/#Merging_Sort_Keys">Merging
|
|
|
Sort Keys</a> within code point space.</p>
|
|
|
<p>For example, when sorting names in a database, a sortable
|
|
|
string can be formed with <em>last_name</em> + '\uFFFE' +
|
|
|
<em>first_name</em>. These strings would sort properly, without
|
|
|
ever comparing the last part of a last name with the first part
|
|
|
of another first name.</p>
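<p>The effect of the low separator weight can be sketched in plain Java. This is a toy model, not the CLDR algorithm: the class name <code>MergedNameKey</code> and its weight function are assumptions made for illustration. U+FFFE gets weight 0, and every other UTF-16 code unit gets its own value plus one.</p>

```java
class MergedNameKey {
    static final char SEP = '\uFFFE';

    // Builds the sortable string: last_name + U+FFFE + first_name.
    static String key(String lastName, String firstName) {
        return lastName + SEP + firstName;
    }

    // Toy primary weight (an assumption): U+FFFE is lower than everything else.
    static int weight(char c) {
        return c == SEP ? 0 : c + 1;
    }

    // Compares two strings by their toy primary weights, unit by unit.
    static int compare(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            int d = weight(a.charAt(i)) - weight(b.charAt(i));
            if (d != 0) {
                return d;
            }
        }
        return a.length() - b.length();
    }
}
```

<p>In this model, ("Smith", "xavier") sorts before ("Smithe", "al") when the fields are joined with U+FFFE. Plain concatenation would instead compare the "x" of one first name against the "e" of the other last name.</p>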
|
|
|
<p>For backwards secondary level sorting, text <i>segments</i>
|
|
|
separated by U+FFFE are processed in forward segment order, and
|
|
|
<i>within</i> each segment the secondary weights are compared
|
|
|
backwards. This is so that such combined strings are processed
|
|
|
consistently with merging their sort keys (for example, by
|
|
|
concatenating them level by level with a low separator).</p>
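<p>The segment-wise processing described above can be sketched as follows. The class name, the integer weights, and the separator value <code>SEP</code> are hypothetical; the sketch only shows that segments keep their forward order while the secondary weights inside each segment are reversed.</p>

```java
class BackwardSecondary {
    // Hypothetical low secondary weight standing in for U+FFFE.
    static final int SEP = 1;

    // Returns the secondary weights in comparison order for backwards-secondary
    // sorting: segments stay in forward order, each segment is reversed.
    static int[] backwardsBySegment(int[] secondaries) {
        int[] out = secondaries.clone();
        int start = 0;
        for (int i = 0; i <= out.length; i++) {
            if (i == out.length || out[i] == SEP) {
                // Reverse the segment out[start, i) in place.
                for (int lo = start, hi = i - 1; lo < hi; lo++, hi--) {
                    int t = out[lo];
                    out[lo] = out[hi];
                    out[hi] = t;
                }
                start = i + 1;
            }
        }
        return out;
    }
}
```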
|
|
|
<p class="note">Note: With unique, low weights on <i>all</i>
|
|
|
levels it is possible to achieve <code>sortkey(str1 + "\uFFFE"
|
|
|
+ str2) == mergeSortkeys(sortkey(str1), sortkey(str2))</code>.
|
|
|
When that is not necessary, then code can be a little simpler
|
|
|
(no special handling for U+FFFE except for
|
|
|
backwards-secondary), sort keys can be a little shorter (when
|
|
|
using compressible common non-primary weights for U+FFFE), and
|
|
|
another low weight can be used in tailorings.</p>
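<p>The identity in the note can be checked on a toy two-level sort key. Everything in this sketch is an assumption made for illustration: the class name, the weight values, the level separator 0, and the merge separator 1 that U+FFFE emits on both levels. Real sort keys have more levels and compressed weights.</p>

```java
class MergeSortKeys {
    static final int LEVEL_SEP = 0; // toy separator between levels
    static final int MERGE_SEP = 1; // toy low weight emitted for U+FFFE on every level

    // Toy key layout: primary weights, LEVEL_SEP, case weights.
    static int[] sortkey(String s) {
        int n = s.length();
        int[] key = new int[2 * n + 1];
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            boolean sep = c == '\uFFFE';
            key[i] = sep ? MERGE_SEP : Character.toLowerCase(c) + 2;
            key[n + 1 + i] = sep ? MERGE_SEP : (Character.isUpperCase(c) ? 3 : 2);
        }
        key[n] = LEVEL_SEP;
        return key;
    }

    // Concatenates two toy keys level by level, with MERGE_SEP between the parts.
    static int[] mergeSortkeys(int[] k1, int[] k2) {
        int n1 = k1.length / 2; // weights per level in k1
        int n2 = k2.length / 2; // weights per level in k2
        int[] out = new int[k1.length + k2.length + 1];
        int pos = 0;
        for (int level = 0; level < 2; level++) {
            if (level == 1) {
                out[pos++] = LEVEL_SEP;
            }
            System.arraycopy(k1, level * (n1 + 1), out, pos, n1);
            pos += n1;
            out[pos++] = MERGE_SEP;
            System.arraycopy(k2, level * (n2 + 1), out, pos, n2);
            pos += n2;
        }
        return out;
    }
}
```

<p>Because U+FFFE contributes the same low weight on each level of the toy key, the key of the combined string and the merged key of the two parts are the same array.</p>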
|
|
|
<h4>1.1.2 <a name="Context_Sensitive_Mappings" href=
|
|
|
"#Context_Sensitive_Mappings" id=
|
|
|
"Context_Sensitive_Mappings">Context-Sensitive
|
|
|
Mappings</a></h4>
|
|
|
<p>Contraction matching, as in the UCA, starts from the first
|
|
|
character of the contraction string. It slows down processing
|
|
|
of that first character even when none of its contractions
|
|
|
matches. In some cases, it is preferable to change such
|
|
|
contractions to mappings with a prefix (context before a
|
|
|
character), so that complex processing is done only when the
|
|
|
less-frequently occurring trailing character is
|
|
|
encountered.</p>
|
|
|
<p>For example, the DUCET contains contractions for several
|
|
|
variants of L· (L followed by middle dot). Collating ASCII text
|
|
|
is slowed down by contraction matching starting with L/l. In
|
|
|
the CLDR root collation, these contractions are replaced by
|
|
|
prefix mappings (L|·) which are triggered only when the middle
|
|
|
dot is encountered. CLDR also uses prefix rules in the Japanese
|
|
|
tailoring, for processing of Hiragana/Katakana length and
|
|
|
iteration marks.</p>
|
|
|
<p>The mapping is conditional on the prefix match but does not
|
|
|
change the mappings for the preceding text. As a result, a
|
|
|
contraction mapping for "px" can be replaced by a prefix rule
|
|
|
"p|x" only if px maps to the collation elements for p followed
|
|
|
by the collation elements for "x if after p". In the DUCET, L·
|
|
|
maps to CE(L) followed by a special secondary CE (which differs
|
|
|
from CE(·) when · is not preceded by L). In the CLDR root
|
|
|
collation, L has no context-sensitive mappings, but · maps to
|
|
|
that special secondary CE if preceded by L.</p>
|
|
|
<p>A prefix mapping for p|x behaves mostly like the contraction
|
|
|
px, except when there is a contraction that overlaps with the
|
|
|
prefix, for example one for "op". A contraction matches only
|
|
|
new text (and consumes it), while a prefix matches only
|
|
|
already-consumed text.</p>
|
|
|
<ul>
|
|
|
<li>With mappings for "op" and "px", only the first
|
|
|
contraction matches in text "opx". (It consumes the "op"
|
|
|
characters, and there is no context-sensitive mapping for
|
|
|
x.)</li>
|
|
|
<li>With mappings for "op" and "p|x", both the contraction
|
|
|
and the prefix rule match in text "opx". (The prefix always
|
|
|
matches already-consumed characters, regardless of whether
|
|
|
they mapped as part of contractions.)</li>
|
|
|
</ul>
|
|
|
<p class="note">Note: Matching of discontiguous contractions
|
|
|
should be implemented without rewriting the text (unlike in the
|
|
|
[<a href="https://www.unicode.org/reports/tr41/#UTS10">UCA</a>]
|
|
|
algorithm specification), so that prefix matching is
|
|
|
predictable. (It should also help with contraction matching
|
|
|
performance.) An implementation that does rewrite the text, as
|
|
|
in the UCA, will get different results for some (unusual)
|
|
|
combinations of contractions, prefix rules, and input text.</p>
|
|
|
<p>Prefix matching uses a simple longest-match algorithm (op|c
|
|
|
wins over p|c). It is recommended that prefix rules be limited
|
|
|
to mappings where both the prefix string and the mapped string
|
|
|
begin with an NFC boundary (that is, with a normalization
|
|
|
starter that does not combine backwards). (In op|ch both o and
|
|
|
c should be starters (ccc=0) and NFC_QC=Yes.) Otherwise, prefix
|
|
|
matching would be affected by canonical reordering and
|
|
|
discontiguous matching, like contractions. Prefix matching is
|
|
|
thus always contiguous.</p>
|
|
|
<p>A character can have mappings with both prefixes (context
|
|
|
before) and contraction suffixes. Prefixes are matched first.
|
|
|
This is to keep them reasonably implementable: When there is a
|
|
|
mapping with both a prefix and a contraction suffix (like in
|
|
|
Japanese: ぐ|ゞ), then the matching needs to go in both
|
|
|
directions. The contraction might involve discontiguous
|
|
|
matching, which needs complex text iteration and handling of
|
|
|
skipped combining marks, and will consume the matching suffix.
|
|
|
Prefix matching should be first because, regardless of whether
|
|
|
there is a match, the implementation will always return to the
|
|
|
original text index (right after the prefix) from where it will
|
|
|
start to look at all of the contractions for that prefix.</p>
|
|
|
<p>If there is a match for a prefix but no match for any of the
|
|
|
suffixes for that prefix, then fall back to mappings with the
|
|
|
next-longest matching prefix, and so on, ultimately to mappings
|
|
|
with no prefix. (Otherwise mappings with longer prefixes would
|
|
|
“hide” mappings with shorter prefixes.)</p>
|
|
|
<p>Consider the following mappings.</p>
|
|
|
<ol>
|
|
|
<li>p → CE(p)</li>
|
|
|
<li>h → CE(h)</li>
|
|
|
<li>c → CE(c)</li>
|
|
|
<li>ch → CE(d)</li>
|
|
|
<li>p|c → CE(u)</li>
|
|
|
<li>p|ci → CE(v)</li>
|
|
|
<li>p|ĉ → CE(w)</li>
|
|
|
<li>op|ck → CE(x)</li>
|
|
|
</ol>
|
|
|
<p>With these, text collates like this:</p>
|
|
|
<ul>
|
|
|
<li>pc → CE(p)CE(u)</li>
|
|
|
<li>pci → CE(p)CE(v)</li>
|
|
|
<li>pch → CE(p)CE(u)CE(h)</li>
|
|
|
<li>pĉ → CE(p)CE(w)</li>
|
|
|
<li>pĉ̣ → CE(p)CE(w)CE(U+0323) // discontiguous</li>
|
|
|
<li>opck → CE(o)CE(p)CE(x)</li>
|
|
|
<li>opch → CE(o)CE(p)CE(u)CE(h)</li>
|
|
|
</ul>
|
|
|
<p>However, if the mapping p|c → CE(u) is missing, then text
|
|
|
"pch" maps to CE(p)CE(d), "opch" maps to CE(o)CE(p)CE(d), and
|
|
|
"pĉ̣" maps to CE(p)CE(c)CE(U+0323)CE(U+0302) (because
|
|
|
discontiguous contraction matching extends <i>an existing
|
|
|
match</i> by one non-starter at a time).</p>
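<p>The lookup order for the example mappings can be sketched as below. This is a simplified illustration, not an implementation of the full algorithm: it handles only contiguous matches, so the mapping for p|ĉ and the discontiguous cases are left out. The class name, the table layout, and the textual "CE(...)" output are assumptions.</p>

```java
class PrefixMatcher {
    // Each row: { prefix (context before), string to match, name of the resulting CE }.
    static final String[][] MAPPINGS = {
        {"", "p", "p"}, {"", "h", "h"}, {"", "c", "c"}, {"", "ch", "d"},
        {"p", "c", "u"}, {"p", "ci", "v"}, {"op", "ck", "x"},
    };

    // Maps text to a string of CE names, preferring the longest matching prefix
    // and, for that prefix, the longest matching string. Shorter prefixes are
    // used only when no mapping with a longer prefix matches.
    static String map(String text, String[][] table) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            String[] best = null;
            for (String[] m : table) {
                String prefix = m[0], s = m[1];
                // The prefix is matched against already-consumed text before i.
                boolean prefixOk = i >= prefix.length()
                        && text.startsWith(prefix, i - prefix.length());
                if (!prefixOk || !text.startsWith(s, i)) {
                    continue;
                }
                if (best == null
                        || prefix.length() > best[0].length()
                        || (prefix.length() == best[0].length()
                                && s.length() > best[1].length())) {
                    best = m;
                }
            }
            if (best == null) {
                // Unmapped character: emit a CE named after the character itself.
                out.append("CE(").append(text.charAt(i)).append(')');
                i++;
            } else {
                out.append("CE(").append(best[2]).append(')');
                i += best[1].length();
            }
        }
        return out.toString();
    }
}
```

<p>Calling <code>PrefixMatcher.map("opch", PrefixMatcher.MAPPINGS)</code> first tries the prefix "op" (no match for "ck"), then falls back to the prefix "p", which yields the CEs for o, p, u, and h. With a table that omits the p|c row, the same text falls back to the unprefixed contraction "ch".</p>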
|
|
|
<h4>1.1.3 <a name="Algorithm_Case" href="#Algorithm_Case" id=
|
|
|
"Algorithm_Case">Case Handling</a></h4>
|
|
|
<p>CLDR specifies how to sort lowercase or uppercase first, as
|
|
|
a stronger distinction than other tertiary variants
|
|
|
(<strong>caseFirst</strong>) or while completely ignoring all
|
|
|
other tertiary distinctions (<strong>caseLevel</strong>). See
|
|
|
<i>Section 3.4 <a href="#Setting_Options">Setting
|
|
|
Options</a></i> and <i>Section 3.14 <a href=
|
|
|
"#Case_Parameters">Case Parameters</a></i>.</p>
|
|
|
<h4>1.1.4 <a name="Algorithm_Reordering_Groups" href=
|
|
|
"#Algorithm_Reordering_Groups" id=
|
|
|
"Algorithm_Reordering_Groups">Reordering Groups</a></h4>
|
|
|
<p>CLDR specifies how to do parametric reordering of groups of
|
|
|
scripts (e.g., “native script first”) as well as special groups
|
|
|
(e.g., “digits after letters”), and provides data for the
|
|
|
effective implementation of such reordering.</p>
|
|
|
<h4>1.1.5 <a name="Combining_Rules" href="#Combining_Rules" id=
|
|
|
"Combining_Rules">Combining Rules</a></h4>
|
|
|
<p>Rules from different sources can be combined, with the later
|
|
|
rules overriding the earlier ones. The following is an example
|
|
|
of how this can be useful.</p>
|
|
|
<p>There is a root collation for "emoji" in CLDR. So use of
|
|
|
"-u-co-emoji" in a Unicode locale identifier will access that
|
|
|
ordering.</p>
|
|
|
<p>Example, using ICU:</p>
|
|
|
<blockquote>
|
|
|
<p>Collator collator =
Collator.getInstance(ULocale.forLanguageTag("en-u-co-emoji"));</p>
|
|
|
</blockquote>
|
|
|
<p>However, the emoji collation supplants the language's
customizations, so the above is equivalent to:</p>
|
|
|
<blockquote>
|
|
|
<p>Collator collator =
Collator.getInstance(ULocale.forLanguageTag("und-u-co-emoji"));</p>
|
|
|
</blockquote>
|
|
|
<p>The same structure will not work for a language that does
require customization, like Danish. That is, the following will
not produce the Danish ordering, because the emoji rules replace
the Danish ones.</p>
|
|
|
<blockquote>
|
|
|
<p>Collator collator =
Collator.getInstance(ULocale.forLanguageTag("da-u-co-emoji"));</p>
|
|
|
</blockquote>
|
|
|
<p>For that, a slightly more cumbersome method needs to be
|
|
|
employed, which is to take the rules for Danish, and explicitly
|
|
|
add the rules for emoji.</p>
|
|
|
<blockquote>
|
|
|
<p>RuleBasedCollator collator = new RuleBasedCollator(<br>
((RuleBasedCollator) Collator.getInstance(ULocale.forLanguageTag("da"))).getRules() +<br>
((RuleBasedCollator) Collator.getInstance(ULocale.forLanguageTag("und-u-co-emoji"))).getRules());</p>
|
|
|
</blockquote>
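<p>The same append-the-rules pattern can be tried without ICU by using the JDK's own <code>java.text.RuleBasedCollator</code>. This is only a sketch under stated assumptions: the JDK collator implements an older rule syntax and does not use CLDR root data, the helper class <code>CombinedRules</code> is hypothetical, the cast assumes that the JDK returns a <code>RuleBasedCollator</code> for English, and the extra rule <code>&amp; c &lt; ch</code> is an arbitrary tailoring chosen for the demonstration.</p>

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;

class CombinedRules {
    // Returns the JDK's English collator (assumed to be rule-based).
    static RuleBasedCollator english() {
        return (RuleBasedCollator) Collator.getInstance(Locale.ENGLISH);
    }

    // Builds a collator from the base rules followed by the extra rules,
    // so that the extra rules are applied after the base rules.
    static RuleBasedCollator append(RuleBasedCollator base, String extraRules) {
        try {
            return new RuleBasedCollator(base.getRules() + extraRules);
        } catch (ParseException e) {
            throw new IllegalArgumentException("invalid rules: " + extraRules, e);
        }
    }
}
```

<p>For example, <code>CombinedRules.append(CombinedRules.english(), "&amp; c &lt; ch")</code> keeps the base ordering and additionally treats "ch" as a unit that sorts after "c", so "cz" sorts before "ch".</p>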
|
|
|
<p>The following table shows the differences. When emoji
|
|
|
ordering is supported, the two faces will be adjacent. When
|
|
|
Danish ordering is supported, the ü is after the y.</p>
|
|
|
<table class='simple'>
|
|
|
<tbody>
|
|
|
<tr>
|
|
|
<td>code point order</td>
|
|
|
<td>,</td>
|
|
|
<td>Z</td>
|
|
|
<td>a</td>
|
|
|
<td>y</td>
|
|
|
<td>ü</td>
|
|
|
<td>☹️</td>
|
|
|
<td>✈️️</td>
|
|
|
<td>글</td>
|
|
|
<td>😀</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>en</td>
|
|
|
<td>,</td>
|
|
|
<td>☹️</td>
|
|
|
<td>✈️️</td>
|
|
|
<td>😀</td>
|
|
|
<td>a</td>
|
|
|
<td>ü</td>
|
|
|
<td>y</td>
|
|
|
<td>Z</td>
|
|
|
<td>글</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>en-u-co-emoji</td>
|
|
|
<td>,</td>
|
|
|
<td>😀</td>
|
|
|
<td>☹️</td>
|
|
|
<td>✈️️</td>
|
|
|
<td>a</td>
|
|
|
<td>ü</td>
|
|
|
<td>y</td>
|
|
|
<td>Z</td>
|
|
|
<td>글</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>da</td>
|
|
|
<td>,</td>
|
|
|
<td>☹️</td>
|
|
|
<td>✈️️</td>
|
|
|
<td>😀</td>
|
|
|
<td>a</td>
|
|
|
<td>y</td>
|
|
|
<td><strong><u>ü</u></strong></td>
|
|
|
<td>Z</td>
|
|
|
<td>글</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>da-u-co-emoji</td>
|
|
|
<td>,</td>
|
|
|
<td>😀</td>
|
|
|
<td>☹️</td>
|
|
|
<td>✈️️</td>
|
|
|
<td>a</td>
|
|
|
<td><strong><u>ü</u></strong></td>
|
|
|
<td>y</td>
|
|
|
<td>Z</td>
|
|
|
<td>글</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>combined rules</td>
|
|
|
<td>,</td>
|
|
|
<td>😀</td>
|
|
|
<td>☹️</td>
|
|
|
<td>✈️️</td>
|
|
|
<td>a</td>
|
|
|
<td>y</td>
|
|
|
<td><strong><u>ü</u></strong></td>
|
|
|
<td>Z</td>
|
|
|
<td>글</td>
|
|
|
</tr>
|
|
|
</tbody>
|
|
|
</table><br>
|
|
|
<p> </p>
|
|
|
<h2>2 <a name="Root_Collation" href="#Root_Collation" id=
|
|
|
"Root_Collation">Root Collation</a></h2>
|
|
|
<p>The CLDR root collation order is based on the <a href=
|
|
|
"https://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table">
|
|
|
Default Unicode Collation Element Table (DUCET)</a> defined in
|
|
|
<em>UTS #10: Unicode Collation Algorithm</em> [<a href=
|
|
|
"https://www.unicode.org/reports/tr41/#UTS10">UCA</a>]. It is
|
|
|
used by all other locales by default, or as the base for their
|
|
|
tailorings. (For a chart view of the UCA, see Collation Chart
|
|
|
[<a href="tr35.html#UCAChart">UCAChart</a>].)</p>
|
|
|
<p>Starting with CLDR 1.9, CLDR uses modified tables for the
|
|
|
root collation order. The root locale ordering is tailored in
|
|
|
the following ways:</p>
|
|
|
<h3>2.1 <a name="grouping_classes_of_characters" href=
|
|
|
"#grouping_classes_of_characters" id=
|
|
|
"grouping_classes_of_characters">Grouping classes of
|
|
|
characters</a></h3>
|
|
|
<p>As of Version 6.1.0, the DUCET puts characters into the
|
|
|
following ordering:</p>
|
|
|
<ul>
|
|
|
<li>First "common characters": whitespace, punctuation,
|
|
|
general symbols, some numbers, currency symbols, and other
|
|
|
numbers.</li>
|
|
|
<li>Then "script characters": Latin, Greek, and the rest of
|
|
|
the scripts.</li>
|
|
|
</ul>
|
|
|
<p>(There are a few exceptions to this general ordering.)</p>
|
|
|
<p>The CLDR root locale modifies the DUCET tailoring by
|
|
|
ordering the common characters more strictly by category:</p>
|
|
|
<ul>
|
|
|
<li>whitespace, punctuation, general symbols, currency
|
|
|
symbols, and numbers.</li>
|
|
|
</ul>
|
|
|
<p>The regrouping allows users to reorder the groups
parametrically. For example, users can reorder numbers
|
|
|
after all scripts, or reorder Greek before Latin.</p>
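<p>One way to picture parametric reordering is as a permutation of primary-weight lead bytes. The sketch below is an illustration under assumptions, not CLDR data: the class name, the group list, and the byte ranges are hypothetical, and each group is assumed to own one contiguous range of lead bytes.</p>

```java
class GroupReorder {
    // Hypothetical groups in root order; group i owns lead bytes
    // STARTS[i] (inclusive) to STARTS[i + 1] (exclusive).
    static final String[] GROUPS =
        {"space", "punct", "symbol", "currency", "digit", "Latn", "Grek"};
    static final int[] STARTS = {0x03, 0x05, 0x0C, 0x0E, 0x0F, 0x29, 0x5F, 0x62};

    static int indexOf(String group) {
        for (int i = 0; i < GROUPS.length; i++) {
            if (GROUPS[i].equals(group)) {
                return i;
            }
        }
        throw new IllegalArgumentException("unknown group: " + group);
    }

    // Builds a table that maps each root lead byte to its lead byte after
    // reordering. newOrder must list every group exactly once.
    static int[] permutation(String[] newOrder) {
        int[] table = new int[256];
        for (int b = 0; b < 256; b++) {
            table[b] = b; // bytes outside the groups are left unchanged
        }
        int next = STARTS[0];
        for (String group : newOrder) {
            int g = indexOf(group);
            for (int b = STARTS[g]; b < STARTS[g + 1]; b++) {
                table[b] = next++;
            }
        }
        return table;
    }
}
```

<p>Passing the order space, punct, symbol, currency, Latn, Grek, digit moves the digit range behind both scripts while the relative order inside each group stays the same.</p>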
|
|
|
<p>The relative order within each of these groups still matches
|
|
|
the DUCET. Symbols, punctuation, and numbers that are grouped
|
|
|
with a particular script stay with that script. The differences
|
|
|
between CLDR and the DUCET order are:</p>
|
|
|
<ol>
|
|
|
<li>CLDR groups the numbers together after currency symbols,
|
|
|
instead of splitting them with some before and some after.
|
|
|
Thus the following are put <em>after</em> currencies and just
|
|
|
before all the other numbers.
|
|
|
<blockquote>
|
|
|
<p>U+09F4 ( ৴ ) [No] BENGALI CURRENCY NUMERATOR ONE<br>
|
|
|
...<br>
|
|
|
U+1D371 ( 𝍱 ) [No] COUNTING ROD TENS DIGIT NINE</p>
|
|
|
</blockquote>
|
|
|
</li>
|
|
|
<li>CLDR handles a few other characters differently
|
|
|
<ol>
|
|
|
<li>U+10A7F ( 𐩿 ) [Po] OLD SOUTH ARABIAN NUMERIC
|
|
|
INDICATOR is put with punctuation, not symbols</li>
|
|
|
<li>U+20A8 ( ₨ ) [Sc] RUPEE SIGN and U+FDFC ( ﷼ ) [Sc]
|
|
|
RIAL SIGN are put with currency signs, not with R and
|
|
|
REH.</li>
|
|
|
</ol>
|
|
|
</li>
|
|
|
</ol>
|
|
|
<h3>2.2 <a name="non_variable_symbols" href=
|
|
|
"#non_variable_symbols" id="non_variable_symbols">Non-variable
|
|
|
symbols</a></h3>
|
|
|
<p>There are multiple <a href=
|
|
|
"https://www.unicode.org/reports/tr10/#Variable_Weighting">Variable-Weighting</a>
|
|
|
options in the UCA for symbols and punctuation, including
|
|
|
<em>non-ignorable</em> and <em>shifted</em>. With the
|
|
|
<em>shifted</em> option, almost all symbols and punctuation are
|
|
|
ignored—except at a fourth level. The CLDR root locale ordering
|
|
|
is modified so that symbols are not affected by the
|
|
|
<em>shifted</em> option. That is, by default, symbols are not
|
|
|
“variable” in CLDR. So <em>shifted</em> only causes whitespace
|
|
|
and punctuation to be ignored, but not symbols (like ♥). The
|
|
|
DUCET behavior can be specified with a locale ID using the "kv"
|
|
|
keyword, to set the Variable section to include all of the
|
|
|
symbols below it, or be set parametrically where
|
|
|
implementations allow access.</p>
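<p>The difference between the two behaviors can be approximated at the primary level with a toy filter. This sketch makes several assumptions: the class name is hypothetical, characters are classified by general category through <code>Character.getType</code> rather than by the root collation data, and variable characters are simply dropped instead of being moved to a fourth level.</p>

```java
class ShiftedSketch {
    static final int SPACE = 0, PUNCT = 1, SYMBOL = 2, OTHER = 3;

    // Rough group classification by general category (an approximation).
    static int group(int cp) {
        if (Character.isWhitespace(cp)) {
            return SPACE;
        }
        switch (Character.getType(cp)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return PUNCT;
            case Character.MATH_SYMBOL:
            case Character.MODIFIER_SYMBOL:
            case Character.OTHER_SYMBOL:
                return SYMBOL;
            default:
                return OTHER;
        }
    }

    // Returns the text that still matters on the primary level when every
    // group up to and including maxVariable is treated as variable.
    static String primaryText(String s, int maxVariable) {
        StringBuilder sb = new StringBuilder();
        s.codePoints()
            .filter(cp -> group(cp) > maxVariable)
            .forEach(sb::appendCodePoint);
        return sb.toString();
    }
}
```

<p>With <code>maxVariable</code> set to <code>PUNCT</code>, as in the CLDR default, the hyphen in "black-bird" is ignored but the heart in "I♥NY" is kept. With <code>SYMBOL</code>, which corresponds to the DUCET behavior, the heart is ignored as well.</p>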
|
|
|
<p>See also:</p>
|
|
|
<ul>
|
|
|
<li><i>Section 3.4, <a href="#Setting_Options">Setting
|
|
|
Options</a></i></li>
|
|
|
<li><a href=
|
|
|
"https://www.unicode.org/charts/collation/">https://www.unicode.org/charts/collation/</a></li>
|
|
|
</ul>
|
|
|
<h3>2.3 <a name="tibetan_contractions" href=
|
|
|
"#tibetan_contractions" id="tibetan_contractions">Additional
|
|
|
contractions for Tibetan</a></h3>
|
|
|
<p>Ten contractions are added for Tibetan: Two to fulfill
|
|
|
<a href=
|
|
|
"https://www.unicode.org/reports/tr10/#WF5">well-formedness
|
|
|
condition 5</a>, and eight more to preserve the default order
|
|
|
for Tibetan. For details see <i>UTS #10, Section 3.8.2,
|
|
|
<a href="https://www.unicode.org/reports/tr10/#Well_Formed_DUCET">
|
|
|
Well-Formedness of the DUCET</a></i>.</p>
|
|
|
<h3>2.4 <a name="tailored_noncharacter_weights" href=
|
|
|
"#tailored_noncharacter_weights" id=
|
|
|
"tailored_noncharacter_weights">Tailored noncharacter
|
|
|
weights</a></h3>
|
|
|
<p>U+FFFE and U+FFFF have special tailorings:</p>
|
|
|
<blockquote>
|
|
|
<p><strong>U+FFFF:</strong> This code point is tailored to
|
|
|
have a primary weight higher than all other characters. This
|
|
|
allows the reliable specification of a range, such as “Sch” ≤
|
|
|
X ≤ “Sch\uFFFF”, to include all strings starting with "sch"
|
|
|
or equivalent.</p>
|
|
|
<p><strong>U+FFFE:</strong> This code point produces a CE
|
|
|
with minimal, unique weights on primary and identical levels.
|
|
|
For details see the <i><a href="#Algorithm_FFFE">CLDR
|
|
|
Collation Algorithm</a></i> above.</p>
|
|
|
</blockquote>
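<p>The range idiom for U+FFFF can be sketched with ordinary string comparison. This is an analogy under assumptions, not a collator: <code>String.compareTo</code> compares UTF-16 code units, where U+FFFF is already the highest value, so the sketch is case-sensitive and ignores collation equivalence. The class name is hypothetical.</p>

```java
class PrefixRange {
    // Counts the names X with prefix <= X <= prefix + U+FFFF
    // in UTF-16 code unit order.
    static int countInRange(String[] names, String prefix) {
        String upper = prefix + '\uFFFF';
        int count = 0;
        for (String name : names) {
            if (name.compareTo(prefix) >= 0 && name.compareTo(upper) <= 0) {
                count++;
            }
        }
        return count;
    }
}
```

<p>With a CLDR-based collator, the same two bounds would be compared in collation order, where the tailored weight of U+FFFF guarantees that the upper bound sorts after every string that starts with the prefix.</p>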
|
|
|
<p>UCA (beginning with version 6.3) also maps
|
|
|
<strong>U+FFFD</strong> to a special collation element with a
|
|
|
very high primary weight, so that it is reliably non-<a href=
|
|
|
"https://www.unicode.org/reports/tr10/#Variable_Weighting">variable</a>,
|
|
|
for use with <a href=
|
|
|
"https://www.unicode.org/reports/tr10/#Handling_Illformed">ill-formed
|
|
|
code unit sequences</a>.</p>
|
|
|
<p>In CLDR, so as to maintain the special collation elements,
|
|
|
<strong>U+FFFD..U+FFFF</strong> are not further tailorable, and
|
|
|
nothing can tailor to them. That is, neither can occur in a
|
|
|
collation rule. For example, the following rules are
|
|
|
illegal:</p>
|
|
|
<p><code>&amp;\uFFFF &lt; x</code></p>
<p><code>&amp;x &lt; \uFFFF</code></p>
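<p>A rule builder could reject such rules with a simple check before parsing. The following is a minimal sketch with a hypothetical class and method name; it assumes that escapes such as <code>\uFFFF</code> in the rule text have already been converted to the actual characters.</p>

```java
class RuleCheck {
    // Returns true if the unescaped rule text contains none of U+FFFD..U+FFFF.
    static boolean usesOnlyTailorableCharacters(String rules) {
        return rules.chars().noneMatch(c -> c >= 0xFFFD && c <= 0xFFFF);
    }
}
```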
|
|
|
<p class="note"><b>Note:</b></p>
|
|
|
<ul>
|
|
|
<li class="note">Java uses an early version of this collation
|
|
|
syntax, but has not been updated recently. It does not
|
|
|
support any of the syntax marked with [...], and its default
|
|
|
table is not the DUCET nor the CLDR root collation.</li>
|
|
|
</ul>
|
|
|
<h3>2.5 <a name="Root_Data_Files" href="#Root_Data_Files" id=
|
|
|
"Root_Data_Files">Root Collation Data Files</a></h3>
|
|
|
<p>The CLDR root collation data files are in the CLDR
|
|
|
repository and release, under the path <a href=
|
|
|
"https://github.com/unicode-org/cldr/tree/latest/common/uca/">common/uca/</a>.</p>
|
|
|
<p>For most data files there are <strong>_SHORT</strong>
|
|
|
versions available. They contain the same data but only minimal
|
|
|
comments, to reduce the file sizes.</p>
|
|
|
<p>Comments with DUCET-style weights in files other than
|
|
|
allkeys_CLDR.txt and allkeys_DUCET.txt use the weights defined
|
|
|
in allkeys_CLDR.txt.</p>
|
|
|
<ul>
|
|
|
<li><strong>allkeys_CLDR</strong> - A file that provides a
|
|
|
remapping of UCA DUCET weights for use with CLDR.</li>
|
|
|
<li><strong>allkeys_DUCET</strong> - The same as DUCET
|
|
|
allkeys.txt, but in alternate=non-ignorable sort order, for
|
|
|
easier comparison with allkeys_CLDR.txt.</li>
|
|
|
<li>
|
|
|
<strong>FractionalUCA</strong> - A file that provides a
|
|
|
remapping of UCA DUCET weights for use with CLDR. The
|
|
|
weight values are modified:
|
|
|
<ul>
|
|
|
<li>The weights have variable length, with 1..4 bytes
|
|
|
each. Each secondary or tertiary weight currently uses at
|
|
|
most 2 bytes.</li>
|
|
|
<li>There are tailoring gaps between adjacent weights, so
|
|
|
that a number of characters can be tailored to sort
|
|
|
between any two root collation elements.</li>
|
|
|
<li>There are collation elements with primary weights at
|
|
|
the boundaries between reordering groups and Unicode
|
|
|
scripts, so that tailoring around the first or last
|
|
|
primary of a group/script results in new collation
|
|
|
elements that sort and reorder together with that group
|
|
|
or script. These boundary weights also define the primary
|
|
|
weight ranges for parametric group and script
|
|
|
reordering.</li>
|
|
|
</ul>An implementation may modify the weights further to
|
|
|
fit the needs of its data structures.
|
|
|
</li>
|
|
|
<li><strong>UCA_Rules</strong> - A file that specifies the
|
|
|
root collation order in the form of <a href=
|
|
|
"#Collation_Tailorings">tailoring rules</a>. This is only an
|
|
|
approximation of the FractionalUCA data, since the rule
|
|
|
syntax cannot express every detail of the collation elements.
|
|
|
For example, in the DUCET and in FractionalUCA, tertiary
|
|
|
differences are usually expressed with special tertiary
|
|
|
weights on all collation elements of an expansion, while a
|
|
|
typical from-rules builder will modify the tertiary weight of
|
|
|
only one of the collation elements.</li>
|
|
|
<li>
|
|
|
<strong>CollationTest_CLDR</strong> - The CLDR versions of
|
|
|
the CollationTest files, which use the tailorings for CLDR.
|
|
|
For information on the format, see <a href=
|
|
|
"https://www.unicode.org/Public/UCA/latest/CollationTest.html">
|
|
|
CollationTest.html</a> in the <a href=
|
|
|
"https://www.unicode.org/reports/tr10/#Data10">UCA data
|
|
|
directory</a>.
|
|
|
<ul>
|
|
|
<li>CollationTest_CLDR_NON_IGNORABLE.txt</li>
|
|
|
<li>CollationTest_CLDR_SHIFTED.txt</li>
|
|
|
</ul>
|
|
|
</li>
|
|
|
</ul>
|
|
|
<h3>2.6 <a name="Root_Data_File_Formats" href=
|
|
|
"#Root_Data_File_Formats" id="Root_Data_File_Formats">Root
|
|
|
Collation Data File Formats</a></h3>
|
|
|
<p>The file formats may change between versions of CLDR. The
|
|
|
formats for CLDR 23 and beyond are as follows. As usual, text
|
|
|
after a # is a comment.</p>
|
|
|
<h4>2.6.1 <a name="File_Format_allkeys_CLDR_txt" href=
|
|
|
"#File_Format_allkeys_CLDR_txt" id=
|
|
|
"File_Format_allkeys_CLDR_txt">allkeys_CLDR.txt</a></h4>
|
|
|
<p>This file defines CLDR’s tailoring of the DUCET, as
|
|
|
described in <i>Section 2, <a href="#Root_Collation">Root
|
|
|
Collation</a></i>.</p>
|
|
|
<p>The format is similar to that of <a href=
|
|
|
"https://www.unicode.org/reports/tr10/#File_Format">allkeys.txt</a>,
|
|
|
although there may be some differences in whitespace.</p>
|
|
|
<h4>2.6.2 <a name="File_Format_FractionalUCA_txt" href=
|
|
|
"#File_Format_FractionalUCA_txt" id=
|
|
|
"File_Format_FractionalUCA_txt">FractionalUCA.txt</a></h4>
|
|
|
<p>The format is illustrated by the following sample lines,
|
|
|
with commentary afterwards.</p>
|
|
|
<pre>[UCA version = 6.0.0]</pre>
|
|
|
<blockquote>
|
|
|
<p>Provides the version number of the UCA table.</p>
|
|
|
</blockquote>
|
|
|
<pre>
[Unified_Ideograph 4E00..9FCC FA0E..FA0F FA11 FA13..FA14 FA1F FA21 FA23..FA24 FA27..FA29 3400..4DB5 20000..2A6D6 2A700..2B734 2B740..2B81D]</pre>
|
|
|
<blockquote>
|
|
|
<p>Lists the ranges of Unified_Ideograph characters in
|
|
|
collation order. (New in CLDR 24.) They map to collation
|
|
|
elements with <a href=
|
|
|
"https://www.unicode.org/reports/tr10/#Implicit_Weights">implicit
|
|
|
(constructed) primary weights</a>.</p>
|
|
|
</blockquote>
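<p>As an illustration, the <code>[Unified_Ideograph ...]</code> line can be read into a list of ranges, and a character's position in that list (not its code point) gives its relative order. This is a sketch under the assumption that ranges are written as <code>start..end</code> or as a single hex code point, as in the sample above; the function names are hypothetical.</p>

```python
def parse_unified_ideograph(line: str) -> list[tuple[int, int]]:
    """Return the (start, end) code point ranges in the order listed."""
    body = line.strip()
    if not (body.startswith("[Unified_Ideograph") and body.endswith("]")):
        raise ValueError("not a [Unified_Ideograph ...] line")
    ranges = []
    for token in body[len("[Unified_Ideograph"):-1].split():
        start, _, end = token.partition("..")
        # A token without ".." is a single code point.
        ranges.append((int(start, 16), int(end or start, 16)))
    return ranges


def ideograph_rank(cp: int, ranges: list[tuple[int, int]]):
    """Return the zero-based position of cp in collation order, or None."""
    offset = 0
    for start, end in ranges:
        if start <= cp <= end:
            return offset + (cp - start)
        offset += end - start + 1
    return None
```

<p>With the sample line, U+3400 ranks after U+9FCC even though its code point is smaller, because its range is listed later.</p>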
|
|
|
<pre>[radical 6=⼅亅:亅𠄌了𠄍-𠄐亇𠄑予㐧𠄒-𠄔争𠀩𠄕亊𠄖-𠄘𪜜事㐨𠄙-𠄛𪜝𠄜𠄝]
[radical 210=⿑齊:齊𪗄𪗅齋䶒䶓𪗆齌𠆜𪗇𪗈齍𪗉-𪗌齎𪗎𪗍齏𪗏-𪗓]
[radical 210'=⻬齐:齐齑]
[radical end]</pre>
|
|
|
<blockquote>
|
|
|
<p>Data for Unihan radical-stroke order. (New in CLDR 26.)
|
|
|
Following the [Unified_Ideograph] line, a section of
|
|
|
<code>[radical ...]</code> lines defines a radical-stroke
|
|
|
order of the Unified_Ideograph characters.</p>
|
|
|
<p>For Han characters, an implementation may choose either to
|
|
|
implement the order defined in the UCA and the
|
|
|
[Unified_Ideograph] data, or to implement the order defined
|
|
|
by the <code>[radical ...]</code> lines. Beginning with CLDR
|
|
|
26, the CJK type="unihan" tailorings assume that the root
|
|
|
collation order sorts Han characters in Unihan radical-stroke
|
|
|
order according to the <code>[radical ...]</code> data. The
|
|
|
CollationTest_CLDR files only contain Han characters that are
|
|
|
in the same relative order using implicit weights or the
|
|
|
radical-stroke order.</p>
|
|
|
<p>The root collation radical-stroke order is derived from
|
|
|
the first (normative) values of the <a href=
|
|
|
"https://www.unicode.org/reports/tr38/#kRSUnicode">Unihan
|
|
|
kRSUnicode</a> field for each Han character. Han characters
|
|
|
are ordered by radical, with traditional forms sorting before
|
|
|
simplified ones. Characters with the same radical are ordered
|
|
|
by residual stroke count. Characters with the same
|
|
|
radical-stroke values are ordered by block and code point, as
|
|
|
for <a href=
|
|
|
"https://www.unicode.org/reports/tr10/#Implicit_Weights">UCA
|
|
|
implicit weights</a>.</p>
|
|
|
<p>There is one <code>[radical ...]</code> line per radical,
|
|
|
in the order of radical numbers. Each line shows the radical
|
|
|
number and the representative characters from the <a href=
|
|
|
"https://www.unicode.org/reports/tr44/#UCD_Files_Table">UCD
|
|
|
file CJKRadicals.txt</a>, followed by a colon (“:”) and the
|
|
|
Han characters with that radical in the order as described
|
|
|
above. A range like <code>万-丌</code> indicates that the code
|
|
|
points in that range sort in code point order.</p>
|
|
|
<p>The radical number and characters are informational. The
|
|
|
sort order is established only by the order of the
|
|
|
<code>[radical ...]</code> lines, and within each line by the
|
|
|
characters and ranges between the colon (“:”) and the bracket
|
|
|
(“]”).</p>
|
|
|
<p>Each Unified_Ideograph occurs exactly once. Only
|
|
|
Unified_Ideograph characters are listed on <code>[radical
|
|
|
...]</code> lines.</p>
|
|
|
<p>This section is terminated with one <code>[radical
|
|
|
end]</code> line.</p>
|
|
|
</blockquote>
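<p>The <code>[radical ...]</code> format described above can be read with a small parser. The following is a sketch only: the function name is hypothetical, and it assumes that a hyphen-minus between two characters always denotes a code point range.</p>

```python
def parse_radical_line(line: str):
    """Parse one [radical ...] line.

    Returns None for the terminating "[radical end]" line, otherwise
    (radical_number, representative_chars, ordered_characters).
    """
    body = line.strip()
    if body == "[radical end]":
        return None
    if not (body.startswith("[radical ") and body.endswith("]")):
        raise ValueError("not a [radical ...] line")
    header, _, chars = body[len("[radical "):-1].partition(":")
    number, _, representatives = header.partition("=")
    ordered = []
    i = 0
    while i < len(chars):
        if i + 2 < len(chars) and chars[i + 1] == "-":
            # A range such as X-Y: code points sort in code point order.
            first, last = ord(chars[i]), ord(chars[i + 2])
            ordered.extend(chr(cp) for cp in range(first, last + 1))
            i += 3
        else:
            ordered.append(chars[i])
            i += 1
    return number, representatives, ordered
```

<p>Only the returned character list, together with the order of the lines, determines the sort order; the radical number and representative characters are informational.</p>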
|
|
|
<pre>
0000; [,,] # Zyyy Cc [0000.0000.0000] * &lt;NULL&gt;</pre>
|
|
|
<blockquote>
|
|
|
<p>Provides a weight line. The first element (before the ";")
|
|
|
is a hex codepoint sequence. The second field is a sequence
|
|
|
of collation elements. Each collation element has 3 parts
|
|
|
separated by commas: the primary weight, secondary weight,
|
|
|
and tertiary weight. The tertiary weight actually consists of
|
|
|
two components: the top two bits (0xC0) are used for the
|
|
|
<em>case level</em>, and should be masked off where a case
|
|
|
level is not used.</p>
|
|
|
<p>A weight is either empty (meaning a zero or ignorable
|
|
|
weight) or is a sequence of one or more bytes. The bytes are
|
|
|
interpreted as a "fraction", meaning that the ordering is 04
&lt; 05 05 &lt; 06. The weights are constructed so that no
|
|
|
weight is an initial subsequence of another: that is, having
|
|
|
both the weights 05 and 05 05 is illegal. The above line
|
|
|
consists of all ignorable weights.</p>
|
|
|
<p>The vertical bar (“|”) character is used to indicate
|
|
|
context, as in:</p>
|
|
|
</blockquote>
|
|
|
<pre>006C | 00B7; [, DB A9, 05]</pre>
|
|
|
<blockquote>
|
|
|
This example indicates that if U+00B7 appears immediately
|
|
|
after U+006C, it is given the corresponding collation element
|
|
|
instead. This syntax is roughly equivalent to the following
|
|
|
contraction, but is more efficient. For details see the
|
|
|
specification of <i><a href=
|
|
|
"#Context_Sensitive_Mappings">Context-Sensitive
|
|
|
Mappings</a></i> above.
|
|
|
</blockquote>
|
|
|
<pre>006C 00B7; <em>CE(006C)</em> [, DB A9, 05]</pre>
|
|
|
<blockquote>
|
|
|
<p>Single-byte primary weights are given to particularly
frequent characters, such as space, digits, and a-z. Most
characters are given two-byte weights, while relatively
infrequent characters are given three-byte weights. For
example:</p>
|
|
|
</blockquote>
|
|
|
<pre>...
0009; [03 05, 05, 05] # Zyyy Cc [0100.0020.0002] * &lt;CHARACTER TABULATION&gt;
...
1B60; [06 14 0C, 05, 05] # Bali Po [0111.0020.0002] * BALINESE PAMENENG
...
0031; [14, 05, 05] # Zyyy Nd [149B.0020.0002] * DIGIT ONE</pre>
|
|
|
<blockquote>
|
|
|
<p>The assignment of 2 vs 3 bytes does not reflect
|
|
|
importance, or exact frequency.</p>
|
|
|
</blockquote>
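<p>The weight-line format described above can be sketched in code as follows. This is an illustrative reading, not a reference parser: the function names are hypothetical, collation elements given by reference (such as <code>[U+4E0D]</code>) are not handled, and the case-bit mask is assumed to apply to the first tertiary byte. Comparing weights as byte strings gives the "fraction" ordering.</p>

```python
import re

CE_PATTERN = re.compile(r"\[([^\[\]]*)\]")


def parse_weight_line(line: str):
    """Split 'prefix | string; [p, s, t]... # comment' into its parts.

    Returns (prefix code points, string code points, collation elements),
    where each collation element is a (primary, secondary, tertiary)
    tuple of bytes objects; an empty bytes object is an ignorable weight.
    """
    data = line.split("#", 1)[0]              # text after '#' is a comment
    chars, _, ces_text = data.partition(";")
    prefix_text, _, string_text = chars.rpartition("|")
    prefix = [int(h, 16) for h in prefix_text.split()]
    string = [int(h, 16) for h in string_text.split()]
    ces = []
    for match in CE_PATTERN.finditer(ces_text):
        fields = match.group(1).split(",")
        if len(fields) != 3 or fields[0].strip().startswith("U+"):
            raise ValueError("reference-style elements are not handled here")
        ces.append(tuple(bytes.fromhex(f.strip()) for f in fields))
    return prefix, string, ces


def strip_case_bits(tertiary: bytes) -> bytes:
    """Mask off the top two bits (0xC0) of the first tertiary byte."""
    if not tertiary:
        return tertiary
    return bytes([tertiary[0] & 0x3F]) + tertiary[1:]


def is_prefix_free(weights) -> bool:
    """Check that no non-empty weight is an initial subsequence of another."""
    ordered = sorted({w for w in weights if w})
    # In sorted order, a weight that is a prefix of any other weight is
    # also a prefix of its immediate successor.
    return not any(b.startswith(a) for a, b in zip(ordered, ordered[1:]))
```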
|
|
|
<pre>
3041; [76 06, 05, 03] # Hira Lo [3888.0020.000D] * HIRAGANA LETTER SMALL A
3042; [76 06, 05, 85] # Hira Lo [3888.0020.000E] * HIRAGANA LETTER A
30A1; [76 06, 05, 10] # Kana Lo [3888.0020.000F] * KATAKANA LETTER SMALL A
30A2; [76 06, 05, 9E] # Kana Lo [3888.0020.0011] * KATAKANA LETTER A</pre>
|
|
|
<blockquote>
|
|
|
<p>Beginning with CLDR 27, some primary or secondary
|
|
|
collation elements may have below-common tertiary weights
|
|
|
(e.g., <code>03</code>), in particular to allow normal
|
|
|
Hiragana letters to have common tertiary weights.</p>
|
|
|
</blockquote>
|
|
|
<pre># SPECIAL MAX/MIN COLLATION ELEMENTS
FFFE; [02, 05, 05] # Special LOWEST primary, for merge/interleaving
FFFF; [EF FE, 05, 05] # Special HIGHEST primary, for ranges</pre>
|
|
|
<blockquote>
|
|
|
<p>The two tailored noncharacters have their own primary
|
|
|
weights.</p>
|
|
|
</blockquote>
|
|
|
<pre>
F967; [U+4E0D] # Hani Lo [FB40.0020.0002][CE0D.0000.0000] * CJK COMPATIBILITY IDEOGRAPH-F967
2F02; [U+4E36, 10] # Hani So [FB40.0020.0004][CE36.0000.0000] * KANGXI RADICAL DOT
2E80; [U+4E36, 70, 20] # Hani So [FB40.0020.0004][CE36.0000.0000][0000.00FC.0004] * CJK RADICAL REPEAT</pre>
|
|
|
<blockquote>
|
|
|
<p>Some collation elements are specified by reference to
|
|
|
other mappings. This is particularly useful for Han
|
|
|
characters which are given implicit/constructed primary
|
|
|
weights; the reference to a Unified_Ideograph makes these
|
|
|
mappings independent of implementation details. This
|
|
|
technique may also be used in other mappings to show the
|
|
|
relationship of character variants.</p>
|
|
|
<p>The referenced character must have a mapping listed
|
|
|
earlier in the file, or the mapping must have been defined
|
|
|
via the [Unified_Ideograph] data line. The referenced
|
|
|
character must map to exactly one collation element.</p>
|
|
|
<p><code>[U+4E0D]</code> copies U+4E0D’s entire collation
|
|
|
element. <code>[U+4E36, 10]</code> copies U+4E36’s primary
|
|
|
and secondary weights and specifies a different tertiary
|
|
|
weight. <code>[U+4E36, 70, 20]</code> only copies U+4E36’s
|
|
|
primary weight and specifies other secondary and tertiary
|
|
|
weights.</p>
|
|
|
<p>FractionalUCA.txt does not have any explicit mappings for
|
|
|
implicit weights. Therefore, an implementation is free to
|
|
|
choose an algorithm for computing implicit weights according
|
|
|
to the principles specified in the UCA.</p>
|
|
|
</blockquote>
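<p>The three reference forms described above can be resolved against mappings that are already known. The sketch below assumes a hypothetical <code>mappings</code> dictionary from code point to a list of (primary, secondary, tertiary) byte-string tuples; how the referenced character obtained its weights (explicitly or as an implicit weight) is left to the implementation.</p>

```python
def resolve_reference(fields, mappings):
    """Resolve one reference-style collation element.

    fields   -- comma-separated fields of the element, e.g. ["U+4E36", "10"]
    mappings -- hypothetical dict: code point -> list of (p, s, t) tuples
    """
    if not fields or not fields[0].startswith("U+"):
        raise ValueError("not a reference-style collation element")
    ces = mappings[int(fields[0][2:], 16)]
    if len(ces) != 1:
        raise ValueError("referenced character must map to exactly one element")
    primary, secondary, tertiary = ces[0]
    if len(fields) == 1:      # [U+4E0D]: copy the entire collation element
        return primary, secondary, tertiary
    if len(fields) == 2:      # [U+4E36, 10]: new tertiary weight
        return primary, secondary, bytes.fromhex(fields[1])
    if len(fields) == 3:      # [U+4E36, 70, 20]: new secondary and tertiary
        return primary, bytes.fromhex(fields[1]), bytes.fromhex(fields[2])
    raise ValueError("too many fields")
```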
|
|
|
<pre>
FDD1 20AC; [0D 20 02, 05, 05] # CURRENCY first primary
FDD1 0034; [0E 02 02, 05, 05] # DIGIT first primary starts new lead byte
FDD0 FF21; [26 02 02, 05, 05] # REORDER_RESERVED_BEFORE_LATIN first primary starts new lead byte
FDD1 004C; [28 02 02, 05, 05] # LATIN first primary starts new lead byte
FDD0 FF3A; [5D 02 02, 05, 05] # REORDER_RESERVED_AFTER_LATIN first primary starts new lead byte
FDD1 03A9; [5F 04 02, 05, 05] # GREEK first primary starts new lead byte (compressible)
FDD1 03E2; [5F 60 02, 05, 05] # COPTIC first primary (compressible)</pre>
|
|
|
<blockquote>
|
|
|
<p>These are special mappings with primaries at the
|
|
|
boundaries of scripts and reordering groups. They serve as
|
|
|
tailoring boundaries, so that tailoring near the first or
|
|
|
last character of a script or group places the tailored item
|
|
|
into the same group. Beginning with CLDR 24, each of these is
|
|
|
a contraction of U+FDD1 with a character of the corresponding
|
|
|
script (or of the General_Category [Z, P, S, Sc, Nd]
|
|
|
corresponding to a special reordering group), mapping to the
|
|
|
first possible primary weight per script or group. They can
|
|
|
be enumerated for implementations of <a href=
|
|
|
"#Collation_Indexes">Collation Indexes</a>. (Earlier versions
|
|
|
mapped contractions with U+FDD0 to the last primary weights
|
|
|
of each group but not each script.)</p>
|
|
|
<p>Beginning with CLDR 27, these mappings alone define the
|
|
|
boundaries for reordering single scripts. (There are no
|
|
|
mappings for Hrkt, Hans, or Hant because they are not fully
|
|
|
distinct scripts; they share primary weights with other
|
|
|
scripts: Hrkt=Hira=Kana & Hans=Hant=Hani.) There are some
|
|
|
reserved ranges, beginning at boundaries marked with U+FDD0
|
|
|
plus following characters as shown above. The reserved ranges
|
|
|
are not used for collation elements and are not available for
|
|
|
tailoring.</p>
|
|
|
<p>Some primary lead bytes must be reserved so that
|
|
|
reordering of scripts along partial-lead-byte boundaries can
|
|
|
“split” the primary lead byte and use up a reserved byte.
|
|
|
This is for implementations that write sort keys, which must
|
|
|
reorder primary weights by offsetting them by whole lead
|
|
|
bytes. There are reorder-reserved ranges before and after
|
|
|
Latin, so that reordering scripts with few primary lead bytes
|
|
|
relative to Latin can move those scripts into the reserved
|
|
|
ranges without changing the primary weights of any other
|
|
|
script. Each of these boundaries begins with a new two-byte
|
|
|
primary; that is, no two groups/scripts/ranges share the top
|
|
|
16 bits of their primary weights.</p>
|
|
|
</blockquote>
|
|
|
<pre>
FDD0 0034; [11, 05, 05] # lead byte for numeric sorting</pre>
|
|
|
<blockquote>
|
|
|
<p>This mapping specifies the lead byte for numeric sorting.
|
|
|
It must be different from the lead byte of any other primary
|
|
|
weight, otherwise numeric sorting would generate ill-formed
|
|
|
collation elements. Therefore, this mapping itself must be
|
|
|
excluded from the set of regular mappings. This value can be
|
|
|
ignored by implementations that do not support numeric
|
|
|
sorting. (Other contractions with U+FDD0 can normally be
|
|
|
ignored altogether.)</p>
|
|
|
</blockquote>
|
|
|
<pre>
# HOMELESS COLLATION ELEMENTS
FDD0 0063; [, 97, 3D] # [15E4.0020.0004] [1844.0020.0004] [0000.0041.001F] * U+01C6 LATIN SMALL LETTER DZ WITH CARON
FDD0 0064; [, A7, 09] # [15D1.0020.0004] [0000.0056.0004] * U+1DD7 COMBINING LATIN SMALL LETTER C CEDILLA
FDD0 0065; [, B1, 09] # [1644.0020.0004] [0000.0061.0004] * U+A7A1 LATIN SMALL LETTER G WITH OBLIQUE STROKE</pre>
|
|
|
<blockquote>
|
|
|
<p>The DUCET has some weights that don't correspond directly
to a character. So that implementations can have a mapping
for each collation element (necessary for certain
implementations of tailoring), special sequences are
constructed for those weights. These collation elements can
normally be ignored.</p>
|
|
|
</blockquote>
|
|
|
<p>Next, a number of tables are defined. The function of each
|
|
|
of the tables is summarized afterwards.</p>
|
|
|
<pre># VALUES BASED ON UCA
...
[first regular [0D 0A, 05, 05]] # U+0060 GRAVE ACCENT
[last regular [7A FE, 05, 05]] # U+1342E EGYPTIAN HIEROGLYPH AA032
[first implicit [E0 04 06, 05, 05]] # CONSTRUCTED
[last implicit [E4 DF 7E 20, 05, 05]] # CONSTRUCTED
[first trailing [E5, 05, 05]] # CONSTRUCTED
[last trailing [E5, 05, 05]] # CONSTRUCTED
...</pre>
|
|
|
<blockquote>
|
|
|
<p>This table summarizes ranges of important groups of
|
|
|
characters for implementations.</p>
|
|
|
</blockquote>
|
|
|
<pre># Top Byte =&gt; Reordering Tokens
[top_byte 00 TERMINATOR ] # [0] TERMINATOR=1
[top_byte 01 LEVEL-SEPARATOR ] # [0] LEVEL-SEPARATOR=1
[top_byte 02 FIELD-SEPARATOR ] # [0] FIELD-SEPARATOR=1
[top_byte 03 SPACE ] # [9] SPACE=1 Cc=6 Zl=1 Zp=1 Zs=1
...</pre>
|
|
|
<blockquote>
|
|
|
<p>This table defines the reordering groups, for script
|
|
|
reordering. The table maps from the first bytes of the
|
|
|
fractional weights to a reordering token. The format is
|
|
|
"[top_byte " byte-value reordering-token "COMPRESS"? "]". The
|
|
|
"COMPRESS" value is present when there is only one byte in
|
|
|
the reordering token, and primary-weight compression can be
|
|
|
applied. Most reordering tokens are script values; others are
|
|
|
special-purpose values, such as PUNCTUATION. Beginning with
|
|
|
CLDR 24, this table precedes the regular mappings, so that
|
|
|
parsers can use this information while processing and
|
|
|
optimizing mappings. Beginning with CLDR 27, most of this
|
|
|
data is irrelevant because single scripts can be reordered.
|
|
|
Only the "COMPRESS" data is still useful.</p>
|
|
|
</blockquote>
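<p>A reader for the <code>[top_byte ...]</code> lines might look like the following sketch. The function name is hypothetical, and it assumes that a line may list more than one reordering token before the optional <code>COMPRESS</code> keyword.</p>

```python
def parse_top_byte(line: str):
    """Parse '[top_byte XX token... COMPRESS? ]' into (byte, tokens, compress)."""
    data = line.split("#", 1)[0].strip()      # drop the trailing comment
    if not (data.startswith("[top_byte ") and data.endswith("]")):
        raise ValueError("not a [top_byte ...] line")
    fields = data[len("[top_byte "):-1].split()
    byte_value = int(fields[0], 16)
    compress = fields[-1] == "COMPRESS"
    tokens = fields[1:-1] if compress else fields[1:]
    return byte_value, tokens, compress
```

<p>For CLDR 27 and later data, an implementation would typically keep only the <code>compress</code> flag from each line.</p>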
|
|
|
<pre># Reordering Tokens =&gt; Top Bytes
[reorderingTokens Arab 61=910 62=910 ]
[reorderingTokens Armi 7A=22 ]
[reorderingTokens Armn 5F=82 ]
[reorderingTokens Avst 7A=54 ]
...</pre>
|
|
|
<blockquote>
|
|
|
<p>This table is an inverse mapping from reordering token to
|
|
|
top byte(s). In terms like "61=910", the first value is the
|
|
|
top byte, while the second is informational, indicating the
|
|
|
number of primaries assigned with that top byte.</p>
|
|
|
</blockquote>
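<p>The inverse table can be read in the same style. This sketch (with a hypothetical function name) returns the reordering token and a dictionary from top byte to the informational primary count.</p>

```python
def parse_reordering_tokens(line: str):
    """Parse '[reorderingTokens Arab 61=910 62=910 ]' into (token, counts)."""
    data = line.split("#", 1)[0].strip()
    if not (data.startswith("[reorderingTokens ") and data.endswith("]")):
        raise ValueError("not a [reorderingTokens ...] line")
    token, *pairs = data[len("[reorderingTokens "):-1].split()
    counts = {}
    for pair in pairs:
        top_byte, _, primaries = pair.partition("=")
        # Top byte is hexadecimal; the primary count is decimal.
        counts[int(top_byte, 16)] = int(primaries)
    return token, counts
```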
|
|
|
<pre># General Categories =&gt; Top Byte
[categories Cc 03{SPACE}=6 ]
[categories Cf 77{Khmr Tale Talu Lana Cham Bali Java Mong Olck Cher Cans Ogam Runr Orkh Vaii Bamu}=2 ]
[categories Lm 0D{SYMBOL}=25 0E{SYMBOL}=22 27{Latn}=12 28{Latn}=12 29{Latn}=12 2A{Latn}=12...</pre>
|
|
|
<blockquote>
|
|
|
<p>This table is informational, providing the top bytes,
|
|
|
scripts, and primaries associated with each general category
|
|
|
value.</p>
|
|
|
</blockquote>
|
|
|
<pre># FIXED VALUES
[fixed first implicit byte E0]
[fixed last implicit byte E4]
[fixed first trail byte E5]
[fixed last trail byte EF]
[fixed first special byte F0]
[fixed last special byte FF]

[fixed secondary common byte 05]
[fixed last secondary common byte 45]
[fixed first ignorable secondary byte 80]

[fixed tertiary common byte 05]
[fixed first ignorable tertiary byte 3C]
</pre>
|
|
|
<blockquote>
|
|
|
<p>The final table gives certain hard-coded byte values. The
|
|
|
"trail" area is provided for implementation of the "trailing
|
|
|
weights" as described in the UCA.</p>
|
|
|
</blockquote>
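<p>As a sketch of how these values might be used, the following reads the <code>[fixed ...]</code> lines and classifies a primary lead byte by area. The function names and the label <code>"other"</code> are hypothetical, and the byte values come from the sample above, which an implementation is free to change.</p>

```python
import re

FIXED_PATTERN = re.compile(r"\[fixed (.+) byte ([0-9A-Fa-f]{2})\]")


def parse_fixed(lines) -> dict:
    """Map names such as 'first implicit' to their byte values."""
    fixed = {}
    for line in lines:
        match = FIXED_PATTERN.search(line.split("#", 1)[0])
        if match:
            fixed[match.group(1)] = int(match.group(2), 16)
    return fixed


def classify_lead_byte(lead: int, fixed: dict) -> str:
    """Return the area of a primary lead byte according to the fixed values."""
    for area, key in (("implicit", "implicit"),
                      ("trailing", "trail"),
                      ("special", "special")):
        if fixed[f"first {key}"] <= lead <= fixed[f"last {key}"]:
            return area
    return "other"
```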
|
|
|
<p class="note">Note: The particular primary lead bytes for
|
|
|
Hani vs. IMPLICIT vs. TRAILING are only an example. An
|
|
|
implementation is free to move them if it also moves the
|
|
|
explicit TRAILING weights. This affects only a small number of
|
|
|
explicit mappings in FractionalUCA.txt, such as for U+FFFD,
|
|
|
U+FFFF, and the “unassigned first primary”. It is possible to
|
|
|
use no SPECIAL bytes at all, and to use only the one primary
|
|
|
lead byte FF for TRAILING weights.</p>
|
|
|
<h4>2.6.3 <a name="File_Format_UCA_Rules_txt" href=
|
|
|
"#File_Format_UCA_Rules_txt" id=
|
|
|
"File_Format_UCA_Rules_txt">UCA_Rules.txt</a></h4>
|
|
|
<p>The format for this file uses the CLDR collation syntax; see
<i>Section 3, <a href="#Collation_Tailorings">Collation
Tailorings</a></i>.</p>
|
|
|
<h2>3 <a name="Collation_Tailorings" href=
|
|
|
"#Collation_Tailorings" id="Collation_Tailorings">Collation
|
|
|
Tailorings</a></h2>
|
|
|
<p class="dtd">&lt;!ELEMENT collations (alias |
(defaultCollation?, collation*, special*)) &gt;</p>
|
|
|
<p class="dtd">&lt;!ELEMENT defaultCollation ( #PCDATA )
&gt;</p>
|
|
|
<p>This element of the LDML format contains one or more
|
|
|
<span class="element">collation</span> elements, distinguished
|
|
|
by type. Each <span class="element">collation</span> contains
|
|
|
elements with parametric settings, or rules that specify a
|
|
|
certain sort order, as a tailoring of the root order, or
|
|
|
both.</p>
|
|
|
<p class="note">Note: CLDR collation tailoring data should
|
|
|
follow the <a href=
|
|
|
"http://cldr.unicode.org/index/cldr-spec/collation-guidelines">CLDR
|
|
|
Collation Guidelines</a>.</p>
|
|
|
<h3>3.1 <a name="Collation_Types" href="#Collation_Types" id=
|
|
|
"Collation_Types">Collation Types</a></h3>
|
|
|
<p>Each locale may have multiple sort orders (types). The
|
|
|
<span class="element">defaultCollation</span> element defines
|
|
|
the default tailoring for a locale and its sublocales. For
|
|
|
example:</p>
|
|
|
<ul>
|
|
|
<li>root.xml:
<code>&lt;defaultCollation&gt;standard&lt;/defaultCollation&gt;</code></li>
<li>zh.xml:
<code>&lt;defaultCollation&gt;pinyin&lt;/defaultCollation&gt;</code></li>
<li>zh_Hant.xml:
<code>&lt;defaultCollation&gt;stroke&lt;/defaultCollation&gt;</code></li>
|
|
|
</ul>
|
|
|
<p>To allow implementations in reduced memory environments to
use CJK sorting, there are also short forms of each of these
collation sequences. These cover only the most commonly used
characters, and are marked with <span class=
"attribute">alt</span>="<span class=
"attributeValue">short</span>".</p>
|
|
|
<p>A collation type name that starts with "private-", for
|
|
|
example, "private-kana", indicates an incomplete tailoring that
|
|
|
is only intended for import into one or more other tailorings
|
|
|
(usually for sharing common rules). It does not establish a
|
|
|
complete sort order. An implementation should not build data
|
|
|
tables for a private collation type, and should not include a
|
|
|
private collation type in a list of available types.</p>
|
|
|
<p class="note"><b>Note:</b></p>
|
|
|
<ul>
|
|
|
<li>There is an on-line demonstration of collation at
|
|
|
[<a href="tr35.html#LocaleExplorer">LocaleExplorer</a>] that
|
|
|
uses the same rule syntax. (Pick the locale and scroll to
|
|
|
"Collation Rules", near the end.)</li>
|
|
|
<li class="note">In CLDR 23 and before, LDML collation files
|
|
|
used an XML format. Starting with CLDR 24, the XML collation
|
|
|
syntax is deprecated and no longer used. See the <i><a href=
|
|
|
"https://www.unicode.org/reports/tr35/tr35-31/tr35-collation.html#Collation_Tailorings">
|
|
|
CLDR 23 version of this document</a></i> for details about
|
|
|
the XML collation syntax.</li>
|
|
|
</ul>
|
|
|
<h4>3.1.1 <a name="Collation_Type_Fallback" href=
|
|
|
"#Collation_Type_Fallback" id=
|
|
|
"Collation_Type_Fallback">Collation Type Fallback</a></h4>
|
|
|
<p>When loading a requested tailoring from its data file and
|
|
|
the parent file chain, use the following type fallback to find
|
|
|
the tailoring.</p>
|
|
|
<ol>
|
|
|
<li>Determine the default type from the
&lt;defaultCollation&gt; element; map the default type to its
alias if one is defined. If there is no
&lt;defaultCollation&gt; element, then use "standard" as the
default type.</li>
<li>If the request language tag specifies the collation type
(keyword "co"), then map it to its alias if one is defined
(e.g., "-co-phonebk" → "phonebook"). If the language tag does
not specify the type, then use the default type.</li>
<li>Use the &lt;collation&gt; element with this type.</li>
<li>If it does not exist, and the type starts with "search"
but is longer, then set the type to "search" and use that
&lt;collation&gt; element. (For example, "searchjl" →
"search".)</li>
<li>If it does not exist, and the type is not the default
type, then set the type to the default type and use that
&lt;collation&gt; element.</li>
<li>If it does not exist, and the type is not "standard",
then set the type to "standard" and use that
&lt;collation&gt; element.</li>
<li>If it does not exist, then use the CLDR root
collation.</li>
|
|
|
</ol>
|
|
|
<p class="note">Note that the CLDR collation/root.xml contains
&lt;defaultCollation&gt;standard&lt;/defaultCollation&gt;,
&lt;collation type="standard"&gt; (with an empty tailoring, so
this is the same as the CLDR root collation), and &lt;collation
type="search"&gt;.</p>
|
|
|
<p>For example, assume that we have collation data for the
|
|
|
following tailorings. ("da/search" is shorthand for
|
|
|
"da-u-co-search".)</p>
|
|
|
<ul>
|
|
|
<li>root/defaultCollation=standard</li>
|
|
|
<li>root/standard (this is the same as “the CLDR root
|
|
|
collator”)</li>
|
|
|
<li>root/search</li>
|
|
|
<li>da/standard</li>
|
|
|
<li>da/search</li>
|
|
|
<li>el/standard</li>
|
|
|
<li>ko/standard</li>
|
|
|
<li>ko/search</li>
|
|
|
<li>ko/searchjl</li>
|
|
|
<li>zh/defaultCollation=pinyin</li>
|
|
|
<li>zh/pinyin</li>
|
|
|
<li>zh/stroke</li>
|
|
|
<li>zh-Hant/defaultCollation=stroke</li>
|
|
|
</ul>
|
|
|
<table>
|
|
|
<caption>
|
|
|
<a name=
|
|
|
"Sample_requested_and_actual_collation_locales_and_types"
|
|
|
href=
|
|
|
"#Sample_requested_and_actual_collation_locales_and_types"
|
|
|
id=
|
|
|
"Sample_requested_and_actual_collation_locales_and_types">Sample
|
|
|
requested and actual collation locales and types</a>
|
|
|
</caption>
|
|
|
<tr>
|
|
|
<th>requested</th>
|
|
|
<th>actual</th>
|
|
|
<th>comment</th>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>da/phonebook</td>
|
|
|
<td>da/standard</td>
|
|
|
<td>default type for Danish</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>zh</td>
|
|
|
<td>zh/pinyin</td>
|
|
|
<td>default type for zh</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>zh/standard</td>
|
|
|
<td>root/standard</td>
|
|
|
<td>no "standard" tailoring for zh, falls back to root</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>zh/phonebook</td>
|
|
|
<td>zh/pinyin</td>
|
|
|
<td>default type for zh</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>zh-Hant/phonebook</td>
|
|
|
<td>zh/stroke</td>
|
|
|
<td>default type for zh-Hant is "stroke"</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>da/searchjl</td>
|
|
|
<td>da/search</td>
|
|
|
<td>"search.+" falls back to "search"</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>el/search</td>
|
|
|
<td>root/search</td>
|
|
|
<td>no "search" tailoring for Greek</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>el/searchjl</td>
|
|
|
<td>root/search</td>
|
|
|
<td>"search.+" falls back to "search", found in root</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>ko/searchjl</td>
|
|
|
<td>ko/searchjl</td>
|
|
|
<td>requested data is actually available</td>
|
|
|
</tr>
|
|
|
</table>
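<p>The fallback steps can be sketched as follows, using the sample data listed above. This is an illustration rather than a normative implementation: the function names are hypothetical, and the parent chain is assumed to be found by dropping trailing subtags (zh-Hant → zh → root), which may differ from the parent rules used by real data.</p>

```python
ROOT = "root"

DEFAULTS = {"root": "standard", "zh": "pinyin", "zh-Hant": "stroke"}
TAILORINGS = {
    ("root", "standard"), ("root", "search"),
    ("da", "standard"), ("da", "search"),
    ("el", "standard"),
    ("ko", "standard"), ("ko", "search"), ("ko", "searchjl"),
    ("zh", "pinyin"), ("zh", "stroke"),
}


def parent_chain(locale):
    """Yield the locale, then its parents, ending with root (assumed rule)."""
    parts = [] if locale == ROOT else locale.split("-")
    while parts:
        yield "-".join(parts)
        parts.pop()
    yield ROOT


def find(locale, ctype, tailorings):
    """Return 'locale/type' for the first file in the chain that has ctype."""
    for loc in parent_chain(locale):
        if (loc, ctype) in tailorings:
            return f"{loc}/{ctype}"
    return None


def resolve(locale, requested=None, defaults=DEFAULTS,
            tailorings=TAILORINGS, aliases=None):
    aliases = aliases or {}
    # Step 1: default type from the nearest defaultCollation, else "standard".
    default = next((defaults[loc] for loc in parent_chain(locale)
                    if loc in defaults), "standard")
    default = aliases.get(default, default)
    # Step 2: requested type (alias-mapped), or the default type.
    ctype = aliases.get(requested, requested) if requested else default
    # Step 3: look for the tailoring with this type.
    hit = find(locale, ctype, tailorings)
    # Step 4: "search..." falls back to "search".
    if hit is None and ctype.startswith("search") and ctype != "search":
        ctype = "search"
        hit = find(locale, ctype, tailorings)
    # Step 5: fall back to the default type.
    if hit is None and ctype != default:
        ctype = default
        hit = find(locale, ctype, tailorings)
    # Step 6: fall back to "standard".
    if hit is None and ctype != "standard":
        ctype = "standard"
        hit = find(locale, ctype, tailorings)
    # Step 7: fall back to the CLDR root collation.
    return hit or ROOT
```

<p>Under these assumptions, <code>resolve("da", "phonebook")</code> yields <code>"da/standard"</code> and <code>resolve("el", "searchjl")</code> yields <code>"root/search"</code>, matching the rows of the table.</p>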
|
|
|
<h3>3.2 <a name="Collation_Version" href="#Collation_Version"
|
|
|
id="Collation_Version">Version</a></h3>
|
|
|
<p>The version attribute is used in case a specific version of
|
|
|
the UCA is to be specified. It is optional, and is specified if
|
|
|
the results are to be identical on different systems. If it is
|
|
|
not supplied, then the version is assumed to be the same as the
|
|
|
Unicode version for the system as a whole.</p>
|
|
|
<blockquote>
|
|
|
<p class="note"><b>Note:</b> For version 3.1.1 of the UCA,
|
|
|
the version of Unicode must also be specified with any
|
|
|
versioning information; an example would be "3.1.1/3.2" for
|
|
|
version 3.1.1 of the UCA, for version 3.2 of Unicode. This
|
|
|
was changed by decision of the UTC, so that dual versions
|
|
|
were no longer necessary. So for UCA 4.0 and beyond, the
|
|
|
version just has a single number.</p>
|
|
|
</blockquote>
|
|
|
<h3>3.3 <a name="Collation_Element" href="#Collation_Element"
|
|
|
id="Collation_Element">Collation Element</a></h3>
|
|
|
<p class="dtd">&lt;!ELEMENT collation (alias | (cr*, special*))
&gt;</p>
|
|
|
<p>The tailoring syntax is designed to be independent of the
|
|
|
actual weights used in any particular UCA table. That way the
|
|
|
same rules can be applied to UCA versions over time, even if
|
|
|
the underlying weights change. The following illustrates the
|
|
|
overall structure of a <span class=
|
|
|
"element">collation</span>:</p>
|
|
|
<pre>&lt;collation type="phonebook"&gt;
  &lt;cr&gt;&lt;![CDATA[
    [caseLevel on]
    &amp;c &lt; k
  ]]&gt;&lt;/cr&gt;
&lt;/collation&gt;</pre>
|
|
|
<h3>3.4 <a name="Setting_Options" href="#Setting_Options" id=
|
|
|
"Setting_Options">Setting Options</a></h3>
|
|
|
<p>Parametric settings can be specified in language tags or in
|
|
|
rule syntax (in the form <code>[keyword value]</code>). For
|
|
|
example, <code>-ks-level2</code> or <code>[strength 2]</code>
|
|
|
will only compare strings based on their primary and secondary
|
|
|
weights.</p>
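<p>The correspondence between the two forms can be held in a simple lookup table. The sketch below covers only some rows of the Collation Settings table (strength, alternate, backwards, normalization, case level, and <code>caseFirst upper</code>); the function name is hypothetical, and the input is assumed to be a simplified sequence of key-value subtag pairs rather than a full language tag.</p>

```python
SETTINGS = {
    ("ks", "level1"): "[strength 1]",
    ("ks", "level2"): "[strength 2]",
    ("ks", "level3"): "[strength 3]",
    ("ks", "level4"): "[strength 4]",
    ("ks", "identic"): "[strength I]",
    ("ka", "noignore"): "[alternate non-ignorable]",
    ("ka", "shifted"): "[alternate shifted]",
    ("kb", "true"): "[backwards 2]",
    ("kk", "true"): "[normalization on]",
    ("kk", "false"): "[normalization off]",
    ("kc", "true"): "[caseLevel on]",
    ("kc", "false"): "[caseLevel off]",
    ("kf", "upper"): "[caseFirst upper]",
}


def rule_settings(subtags: str) -> list[str]:
    """Convert e.g. 'ks-level2-kc-true' to the equivalent rule settings."""
    parts = subtags.strip("-").split("-")
    if len(parts) % 2:
        raise ValueError("expected key-value pairs")
    # A pair not in the table raises KeyError in this sketch.
    return [SETTINGS[(key, value)]
            for key, value in zip(parts[::2], parts[1::2])]
```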
|
|
|
<p>If a setting is not present, the CLDR default (or the
|
|
|
default for the locale, if there is one) is used. That default
|
|
|
is listed in bold italics. Where there is a UCA default that is
|
|
|
different, it is listed in bold with (<strong>UCA
|
|
|
default</strong>). Note that the default value for a locale may
|
|
|
be different than the normal default value for the setting.</p>
|
|
|
<table>
|
|
|
<caption>
|
|
|
<a name="Collation_Settings" href="#Collation_Settings" id=
|
|
|
"Collation_Settings">Collation Settings</a>
|
|
|
</caption>
|
|
|
<tr>
|
|
|
<th>BCP47 Key</th>
|
|
|
<th>BCP47 Value</th>
|
|
|
<th>Rule Syntax</th>
|
|
|
<th>Description</th>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td rowspan="5">ks</td>
|
|
|
<td>level1</td>
|
|
|
<td><code>[strength 1]</code><br>
|
|
|
(primary)</td>
|
|
|
<td rowspan="5">Sets the default strength for comparison,
|
|
|
as described in the [<a href=
|
|
|
"https://www.unicode.org/reports/tr41/#UTS10">UCA</a>].
|
|
|
<em>Note that a strength setting of greater than 3 may have
|
|
|
the same effect as <strong>identical</strong>, depending on
|
|
|
the locale and implementation.</em></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>level2</td>
|
|
|
<td><code>[strength 2]</code><br>
|
|
|
(secondary)</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>level3</td>
|
|
|
<td><em><strong><code>[strength 3]</code><br>
|
|
|
(tertiary)</strong></em></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>level4</td>
|
|
|
<td><code>[strength 4]</code><br>
|
|
|
(quaternary)</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>identic</td>
|
|
|
<td><code>[strength I]</code><br>
|
|
|
(identical)</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td rowspan="3">ka</td>
|
|
|
<td>noignore</td>
|
|
|
<td><i><strong><code>[alternate
|
|
|
non-ignorable]</code></strong></i><br></td>
|
|
|
<td rowspan="3">Sets alternate handling for variable
|
|
|
weights, as described in [<a href=
|
|
|
"https://www.unicode.org/reports/tr41/#UTS10">UCA</a>],
|
|
|
where "shifted" causes certain characters to be ignored in
|
|
|
comparison. <em>The default for LDML is different than it
|
|
|
is in the UCA. In LDML, the default for alternate handling
|
|
|
is <strong>non-ignorable</strong>, while in UCA it is
|
|
|
<strong>shifted</strong>. In addition, in LDML only
|
|
|
whitespace and punctuation are variable by
|
|
|
default.</em></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>shifted</td>
|
|
|
<td><strong><code>[alternate shifted]</code><br>
|
|
|
(UCA default)</strong></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><em>n/a</em></td>
|
|
|
<td><i>n/a</i><br>
|
|
|
(blanked)</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td rowspan="2">kb</td>
|
|
|
<td>true</td>
|
|
|
<td><code>[backwards 2]</code></td>
|
|
|
<td rowspan="2">Sets the comparison for the second level to
|
|
|
be <strong>backwards</strong>, as described in [<a href=
|
|
|
"https://www.unicode.org/reports/tr41/#UTS10">UCA</a>].</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>false</td>
|
|
|
<td><i><strong>n/a</strong></i></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td rowspan="2">kk</td>
|
|
|
<td>true</td>
|
|
|
<td><strong><code>[normalization on]</code><br>
|
|
|
(UCA default)</strong></td>
|
|
|
<td rowspan="2">If <strong>on</strong>, then the normal
|
|
|
[<a href=
|
|
|
"https://www.unicode.org/reports/tr41/#UTS10">UCA</a>]
|
|
|
algorithm is used. If <strong>off</strong>, then most
|
|
|
strings should still sort correctly despite not normalizing
|
|
|
to NFD first.<br>
|
|
|
<em>Note that the default for CLDR locales may be different
|
|
|
than in the UCA. The rules for particular locales have it
|
|
|
set to <strong>on</strong>: those locales whose exemplar
|
|
|
characters (in forms commonly interchanged) would be
|
|
|
affected by normalization.</em></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>false</td>
|
|
|
<td><i><strong><code>[normalization
|
|
|
off]</code></strong></i></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td rowspan="2">kc</td>
|
|
|
<td>true</td>
|
|
|
<td><code>[caseLevel on]</code></td>
|
|
|
<td rowspan="2">If set to <strong>on</strong><i>,</i> a
|
|
|
level consisting only of case characteristics will be
|
|
|
inserted in front of tertiary level, as a "Level 2.5". To
|
|
|
ignore accents but take case into account, set strength to
|
|
|
<strong>primary</strong> and case level to
|
|
|
<strong>on</strong>. For details, see <em>Section 3.14,
|
|
|
<a href="#Case_Parameters">Case Parameters</a></em> .</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>false</td>
|
|
|
<td><i><strong><code>[caseLevel
|
|
|
off]</code></strong></i></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td rowspan="3">kf</td>
|
|
|
<td>upper</td>
|
|
|
<td><code>[caseFirst upper]</code></td>
|
|
|
<td rowspan="3">If set to <strong>upper</strong>, causes
|
|
|
upper case to sort before lower case. If set to
|
|
|
<strong>lower</strong>, causes lower case to sort before
|
|
|
upper case. Useful for locales whose ordering is already
supported but which require a different order of cases. Affects case
|
|
|
and tertiary levels. For details, see <em>Section 3.14,
|
|
|
<a href="#Case_Parameters">Case Parameters</a></em> .</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>lower</td>
|
|
|
<td><code>[caseFirst lower]</code></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>false</td>
|
|
|
<td><i><strong><code>[caseFirst
|
|
|
off]</code></strong></i></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td rowspan="2">kh</td>
|
|
|
<td>true<br>
|
|
|
<i><strong>Deprecated:</strong></i> Use rules with
|
|
|
quaternary relations instead.</td>
|
|
|
<td><code>[hiraganaQ on]</code></td>
|
|
|
<td rowspan="2">Controls special treatment of Hiragana code
|
|
|
points on quaternary level. If turned <strong>on</strong>,
|
|
|
Hiragana code points will get lower values than all the
|
|
|
other non-variable code points in <strong>shifted</strong>.
|
|
|
That is, the normal Level 4 value for a regular collation
|
|
|
element is FFFF, as described in [<a href=
|
|
|
"https://www.unicode.org/reports/tr41/#UTS10">UCA</a>],
|
|
|
<em>Section 3.6, <a href=
|
|
|
"https://www.unicode.org/reports/tr10/#Variable_Weighting">Variable
|
|
|
Weighting</a></em> . This is changed to FFFE for
|
|
|
[:script=Hiragana:] characters. The strength must be
quaternary or greater for this attribute to have any
effect.</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>false</td>
|
|
|
<td><i><strong><code>[hiraganaQ
|
|
|
off]</code></strong></i></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td rowspan="2">kn</td>
|
|
|
<td>true</td>
|
|
|
<td><code>[numericOrdering on]</code></td>
|
|
|
<td rowspan="2">If set to <strong>on</strong>, any sequence
|
|
|
of Decimal Digits (General_Category = Nd in the [<a href=
|
|
|
"https://www.unicode.org/reports/tr41/#UAX44">UAX44</a>]) is
|
|
|
sorted at a primary level with its numeric value. For
|
|
|
example, "A-21" < "A-123". The computed primary weights
|
|
|
are all at the start of the <strong>digit</strong>
|
|
|
reordering group. Thus with an untailored UCA table, "a$"
|
|
|
< "a0" < "a2" < "a12" < "a⓪" < "aa".</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>false</td>
|
|
|
<td><i><strong><code>[numericOrdering
|
|
|
off]</code></strong></i></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>kr</td>
|
|
|
<td>a sequence of one or more reorder codes: <strong>space,
|
|
|
punct, symbol, currency, digit</strong>, or any BCP47
|
|
|
script ID</td>
|
|
|
<td><code>[reorder Grek digit]</code></td>
|
|
|
<td>Specifies a reordering of scripts or other significant
|
|
|
blocks of characters such as symbols, punctuation, and
|
|
|
digits. For the precise meaning and usage of the reorder
|
|
|
codes, see <em>Section 3.13, <a href=
|
|
|
"#Script_Reordering">Collation Reordering</a>.</em></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td rowspan="4">kv</td>
|
|
|
<td>space</td>
|
|
|
<td><code>[maxVariable space]</code></td>
|
|
|
<td rowspan="4">Sets the variable top to the top of the
|
|
|
specified reordering group. All code points with primary
|
|
|
weights less than or equal to the variable top will be
|
|
|
considered variable, and thus affected by the alternate
|
|
|
handling. Variables are ignorable by default in [<a href=
|
|
|
"https://www.unicode.org/reports/tr41/#UTS10">UCA</a>], but
|
|
|
not in CLDR.</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>punct</td>
|
|
|
<td><i><strong><code>[maxVariable
|
|
|
punct]</code></strong></i></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>symbol</td>
|
|
|
<td><strong><code>[maxVariable symbol]</code><br>
|
|
|
(UCA default)</strong></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>currency</td>
|
|
|
<td><code>[maxVariable currency]</code></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>vt</td>
|
|
|
<td>See <i>Part 1 Section 3.6.4, <a href=
|
|
|
"tr35.html#Unicode_Locale_Extension_Data_Files">U Extension
|
|
|
Data Files</a></i>.<br>
|
|
|
<i><strong>Deprecated:</strong></i> Use maxVariable
|
|
|
instead.</td>
|
|
|
<td><code>&\u00XX\uYYYY < [variable top]</code><br>
|
|
|
<br>
|
|
|
(the default is set to the highest punctuation, thus
|
|
|
including spaces and punctuation, but not symbols)</td>
|
|
|
<td>
|
|
|
<p>The BCP47 value is described in <i>Appendix Q:
|
|
|
<a href="tr35.html#Locale_Extension_Key_and_Type_Data">Locale
|
|
|
Extension Keys and Types</a>.</i></p>
|
|
|
<p>Sets the string value for the variable top. All the
|
|
|
code points with primary weights less than or equal to
|
|
|
the variable top will be considered variable, and thus
|
|
|
affected by the alternate handling.<br>
|
|
|
An implementation that supports the variableTop setting
|
|
|
should also support the maxVariable setting, and it
|
|
|
should "pin" ("round up") the variableTop to the top of
|
|
|
the containing reordering group.<br>
|
|
|
Variables are ignorable by default in [<a href=
|
|
|
"https://www.unicode.org/reports/tr41/#UTS10">UCA</a>],
|
|
|
but not in CLDR. See below for more information.</p>
|
|
|
</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><em>n/a</em></td>
|
|
|
<td><em>n/a</em></td>
|
|
|
<td><em>n/a</em></td>
|
|
|
<td>match-boundaries: <em><strong>none</strong></em> |
|
|
|
whole-character | whole-word<br>
|
|
|
Defined by <em>Section 8, <a href=
|
|
|
"https://www.unicode.org/reports/tr10/#Searching">Searching
|
|
|
and Matching</a></em> of [<a href=
|
|
|
"https://www.unicode.org/reports/tr41/#UTS10">UCA</a>].</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><em>n/a</em></td>
|
|
|
<td><em>n/a</em></td>
|
|
|
<td><em>n/a</em></td>
|
|
|
<td>match-style: <em><strong>minimal</strong></em> | medial
|
|
|
| maximal<br>
|
|
|
Defined by <em>Section 8, <a href=
|
|
|
"https://www.unicode.org/reports/tr10/#Searching">Searching
|
|
|
and Matching</a></em> of [<a href=
|
|
|
"https://www.unicode.org/reports/tr41/#UTS10">UCA</a>].</td>
|
|
|
</tr>
|
|
|
</table>
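<p>The numericOrdering behavior described in the table above can be approximated outside a collator. The following Python sketch is illustrative only. The name <code>numeric_key</code> is hypothetical. The key compares runs of decimal digits (General_Category = Nd) by numeric value and all other text by code point, so it models the numeric comparison but not real collation weights or the position of digits relative to symbols and letters.</p>

```python
import re


def numeric_key(text):
    """Build a sort key in which digit runs compare by numeric value.

    Simplification: non-digit runs compare by code point, not by
    collation weights.
    """
    key = []
    # In Python 3, \d matches any Unicode decimal digit (category Nd).
    for run in re.findall(r"\d+|\D+", text):
        if run.isdecimal():
            key.append((0, int(run)))
        else:
            key.append((1, run))
    return key


print(sorted(["A-123", "A-21"], key=numeric_key))
print(sorted(["a12", "a2", "a0"], key=numeric_key))
```

<p>With plain code point comparison "A-123" would sort before "A-21"; with this key the shorter number sorts first, as in the example from the table.</p>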
|
|
|
<h4>3.4.1 <a name="Common_Settings" href="#Common_Settings" id=
|
|
|
"Common_Settings">Common settings combinations</a></h4>
|
|
|
<p>Some commonly used parametric collation settings are
|
|
|
available via combinations of LDML settings attributes:</p>
|
|
|
<ul>
|
|
|
<li>“Ignore accents”: <strong>strength=primary</strong></li>
|
|
|
<li>“Ignore accents” but take case into account:
|
|
|
<strong>strength=primary caseLevel=on</strong></li>
|
|
|
<li>“Ignore case”: <strong>strength=secondary</strong></li>
|
|
|
<li>“Ignore punctuation” (completely):
|
|
|
<strong>strength=tertiary alternate=shifted</strong></li>
|
|
|
<li>“Ignore punctuation” but distinguish among punctuation
|
|
|
marks: <strong>strength=quaternary
|
|
|
alternate=shifted</strong></li>
|
|
|
</ul>
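<p>The combinations above can also be requested through a locale identifier, using the BCP 47 keys and types from the settings table. The following Python sketch is a minimal illustration; the function name <code>collation_locale</code> and the dictionary layout are assumptions made for this example, and only three of the settings are covered.</p>

```python
# LDML setting name -> (BCP 47 key, {LDML value: BCP 47 type}).
# The keys and types are taken from the settings table; the layout is
# a hypothetical choice for this sketch.
BCP47 = {
    "strength": ("ks", {
        "primary": "level1", "secondary": "level2", "tertiary": "level3",
        "quaternary": "level4", "identical": "identic",
    }),
    "alternate": ("ka", {"non-ignorable": "noignore", "shifted": "shifted"}),
    "caseLevel": ("kc", {"on": "true", "off": "false"}),
}


def collation_locale(language, **settings):
    """Append collation keywords to a language tag as a -u- extension."""
    keywords = []
    for name, value in settings.items():
        key, types = BCP47[name]
        keywords.append((key, types[value]))
    if not keywords:
        return language
    # Keywords are emitted in alphabetical order of their keys.
    return language + "-u-" + "-".join(f"{k}-{t}" for k, t in sorted(keywords))


# "Ignore accents" but take case into account:
print(collation_locale("en", strength="primary", caseLevel="on"))
# "Ignore punctuation" but distinguish among punctuation marks:
print(collation_locale("de", strength="quaternary", alternate="shifted"))
```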
|
|
|
<h4>3.4.2 <a name="Normalization_Setting" href=
|
|
|
"#Normalization_Setting" id="Normalization_Setting">Notes on
|
|
|
the normalization setting</a></h4>
|
|
|
<p>The UCA always normalizes input strings into NFD form before
|
|
|
the rest of the algorithm. However, normalizing every string
unconditionally results in poor performance.</p>
|
|
|
<p>With <strong>normalization=off</strong>, strings that are in
|
|
|
[<a href="tr35.html#FCD">FCD</a>] and do not contain Tibetan
|
|
|
precomposed vowels (U+0F73, U+0F75, U+0F81) should sort
|
|
|
correctly. With <strong>normalization=on</strong>, an
|
|
|
implementation that does not normalize to NFD must at least
|
|
|
perform an incremental FCD check and normalize substrings as
|
|
|
necessary. It should also always decompose the Tibetan
|
|
|
precomposed vowels. (Otherwise discontiguous contractions
|
|
|
across their leading components cannot be handled
|
|
|
correctly.)</p>
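<p>The FCD condition mentioned above can be tested one code point at a time. The following Python sketch is an illustration, not the algorithm of any particular implementation; the names <code>is_fcd</code> and <code>has_tibetan_precomposed_vowel</code> are hypothetical. It assumes the usual FCD definition: for each adjacent pair of code points, the combining class at the end of the first one's canonical decomposition must not exceed the combining class at the start of the second one's decomposition, unless the latter is 0.</p>

```python
import unicodedata

# The three Tibetan precomposed vowels named in the text.
TIBETAN_PRECOMPOSED_VOWELS = {"\u0f73", "\u0f75", "\u0f81"}


def is_fcd(text):
    """Return True if text passes the FCD check described above."""
    prev_trail = 0
    for ch in text:
        nfd = unicodedata.normalize("NFD", ch)
        lead = unicodedata.combining(nfd[0])
        if lead != 0 and lead < prev_trail:
            return False
        prev_trail = unicodedata.combining(nfd[-1])
    return True


def has_tibetan_precomposed_vowel(text):
    """Return True if text contains U+0F73, U+0F75, or U+0F81."""
    return any(ch in TIBETAN_PRECOMPOSED_VOWELS for ch in text)


print(is_fcd("a\u00e4"))        # precomposed a-diaeresis after 'a'
print(is_fcd("a\u0308\u0323"))  # diaeresis (ccc=230) before dot below (ccc=220)
```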
|
|
|
<p>Another complication for an implementation that does not
|
|
|
always use NFD arises when contraction mappings overlap with
|
|
|
canonical Decomposition_Mapping strings. For example, the
|
|
|
Danish contraction “aa” overlaps with the decompositions of
|
|
|
‘ä’, ‘å’, and other characters. In the root collation (and in
|
|
|
the DUCET), Cyrillic ‘ӛ’ maps to a single collation element,
|
|
|
which means that its decomposition “ә+◌̈” forms a contraction,
|
|
|
and its second character (U+0308) is the same as the first
|
|
|
character in the Decomposition_Mapping of U+0344
|
|
|
‘◌̈́’=“◌̈+◌́”.</p>
|
|
|
<p>In order to handle strings with these characters (e.g., “aä”
|
|
|
and “ӛ́” [which are in FCD]) exactly as with prior NFD
|
|
|
normalization, an implementation needs to either add overlap
|
|
|
contractions to its data (e.g., “a+ä” and “ә+◌̈́”), or it needs
|
|
|
to decompose the relevant composites (e.g., ‘ä’ and ‘◌̈́’) as
|
|
|
soon as they are encountered.</p>
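<p>The overlaps described above can be found mechanically from the Decomposition_Mapping property. The following Python sketch uses the standard <code>unicodedata</code> module; the helper names <code>canonical_mapping</code> and <code>overlaps</code> are hypothetical, and the overlap test is simplified to a single shared character at the boundary.</p>

```python
import unicodedata


def canonical_mapping(ch):
    """Return the canonical Decomposition_Mapping of ch, or "" if none."""
    fields = unicodedata.decomposition(ch).split()
    # Compatibility mappings start with a tag such as <compat>.
    if not fields or fields[0].startswith("<"):
        return ""
    return "".join(chr(int(f, 16)) for f in fields)


def overlaps(contraction, ch):
    """True if the contraction ends with the character that starts
    the canonical decomposition of ch (simplified overlap test)."""
    mapping = canonical_mapping(ch)
    return len(mapping) > 1 and contraction[-1] == mapping[0]


print(overlaps("aa", "\u00e4"))            # Danish "aa" vs. a-diaeresis
print(overlaps("\u04d9\u0308", "\u0344"))  # schwa + diaeresis vs. U+0344
```

<p>An implementation that skips full NFD normalization could run such a check when building its data, in order to decide which overlap contractions to add or which composites to decompose eagerly.</p>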
|
|
|
<h4>3.4.3 <a name="Variable_Top_Settings" href=
|
|
|
"#Variable_Top_Settings" id="Variable_Top_Settings">Notes on
|
|
|
variable top settings</a></h4>
|
|
|
<p>Users may want to include more or fewer characters as
|
|
|
Variable. For example, someone could want to restrict the
|
|
|
Variable characters to just include space marks. In that case,
|
|
|
maxVariable would be set to "space". (In CLDR 24 and earlier,
|
|
|
the now-deprecated variableTop would be set to U+1680, see the
|
|
|
“Whitespace” <a href="https://unicode.org/charts/collation/">UCA
|
|
|
collation chart</a>). Alternatively, someone could want more of
|
|
|
the Common characters in them, and include characters up to
|
|
|
(but not including) '0', by setting maxVariable to "currency".
|
|
|
(In CLDR 24 and earlier, the now-deprecated variableTop would
|
|
|
be set to U+20BA, see the “Currency-Symbol” collation
|
|
|
chart).</p>
|
|
|
<p>These settings control which sets of characters are
ignored when comparing strings. For
|
|
|
example, the locale identifier "de-u-ka-shifted-kv-currency" is
|
|
|
requesting settings appropriate for German, including German
|
|
|
sorting conventions, and that currency symbols and characters
|
|
|
sorting below them are ignored in sorting.</p>
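<p>A locale identifier such as the one above can be split into its collation keywords with a few lines of code. The following Python sketch is a simplified illustration; the function name <code>collation_keywords</code> is hypothetical, and the parser handles only the -u- extension with two-character keys, ignoring attributes and private-use subtags.</p>

```python
def collation_keywords(locale_id):
    """Return the -u- extension keywords of a locale identifier as a dict."""
    subtags = locale_id.lower().split("-")
    if "u" not in subtags:
        return {}
    keywords = {}
    key = None
    for subtag in subtags[subtags.index("u") + 1:]:
        if len(subtag) == 1:
            break                      # start of another extension
        if len(subtag) == 2:
            key = subtag               # a new key such as "ka" or "kv"
            keywords[key] = []
        elif key is not None:
            keywords[key].append(subtag)
    return {k: "-".join(v) for k, v in keywords.items()}


print(collation_keywords("de-u-ka-shifted-kv-currency"))
```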
|
|
|
<h3>3.5 <a name="Rules" href="#Rules" id="Rules">Collation Rule
|
|
|
Syntax</a></h3>
|
|
|
<p class="dtd">&lt;!ELEMENT cr #PCDATA &gt;</p>
|
|
|
<p>The goal for the collation rule syntax is to have clearly
|
|
|
expressed rules with a concise format. The CLDR rule syntax is
|
|
|
a subset of the [<a href=
|
|
|
"tr35.html#ICUCollation">ICUCollation</a>] syntax.</p>
|
|
|
<p>For the CLDR root collation, the FractionalUCA.txt file
|
|
|
defines all mappings for all of Unicode directly, and it also
|
|
|
provides information about script boundaries, reordering
|
|
|
groups, and other details. For tailorings, this is neither
|
|
|
necessary nor practical. In particular, while the root
|
|
|
collation sort order rarely changes for existing characters,
|
|
|
their numeric collation weights change with every version. If
|
|
|
tailorings also specified numeric weights directly, then they
|
|
|
would have to change with every version, parallel with the root
|
|
|
collation. Instead, for tailorings, mappings are added and
|
|
|
modified relative to the root collation. (There is no syntax to
|
|
|
<i>remove</i> mappings, except via <a href=
|
|
|
"#Special_Purpose_Commands">special [suppressContractions
|
|
|
[...]]</a> .)</p>
|
|
|
<p>The ASCII [:P:] and [:S:] characters are reserved for
|
|
|
collation syntax: <code>[\u0021-\u002F \u003A-\u0040
|
|
|
\u005B-\u0060 \u007B-\u007E]</code></p>
|
|
|
<p>Unicode Pattern_White_Space characters between tokens are
|
|
|
ignored. Unquoted white space terminates reset and relation
|
|
|
strings.</p>
|
|
|
<p>A pair of ASCII apostrophes encloses quoted literal text.
|
|
|
They are normally used to enclose a syntax character or white
|
|
|
space, or a whole reset/relation string containing one or more
|
|
|
such characters, so that those are parsed as part of the
|
|
|
reset/relation strings rather than treated as syntax. A pair of
|
|
|
immediately adjacent apostrophes is used to encode one
|
|
|
apostrophe.</p>
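<p>The quoting convention above can be sketched as a small function that removes the quoting from an already isolated reset or relation string. This is a minimal Python illustration, not the parser of any particular implementation; the name <code>unquote</code> is hypothetical.</p>

```python
def unquote(token):
    """Remove apostrophe quoting from one reset or relation string."""
    out = []
    quoted = False
    i = 0
    while i < len(token):
        ch = token[i]
        if ch == "'":
            if token[i + 1:i + 2] == "'":
                out.append("'")        # doubled apostrophe encodes one
                i += 2
                continue
            quoted = not quoted        # opening or closing apostrophe
        else:
            out.append(ch)
        i += 1
    if quoted:
        raise ValueError("unterminated quoted text")
    return "".join(out)


print(unquote("'-'"))      # a quoted syntax character
print(unquote("'a''b'"))   # an apostrophe inside quoted text
```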
|
|
|
<p>Code points can be escaped with <code>\uhhhh</code> and
|
|
|
<code>\U00hhhhhh</code> escapes, as well as common escapes like
|
|
|
<code>\t</code> and <code>\n</code> . (For details see the
|
|
|
documentation of ICU UnicodeString::unescape().) This is
|
|
|
particularly useful for default-ignorable code points,
|
|
|
combining marks, visually indistinct variants, hard-to-type
|
|
|
characters, etc. These sequences are unescaped before the rules
|
|
|
are parsed; this means that even escaped syntax and white space
|
|
|
characters need to be enclosed in apostrophes. For example:
|
|
|
<code>&'\u0020'='\u3000'</code>. Note: The unescaping is
|
|
|
done by ICU tools (genrb) and demos before passing rule strings
|
|
|
into the ICU library code. The ICU collation API does not
|
|
|
unescape rule strings.</p>
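<p>Because unescaping happens before parsing, it can be implemented as a separate preprocessing step. The following Python sketch handles only the escape forms named above (<code>\uhhhh</code>, <code>\U00hhhhhh</code>, <code>\t</code>, <code>\n</code>); it is not a reimplementation of ICU's unescape function, and the name <code>unescape</code> is simply chosen for this example.</p>

```python
import re

_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})|\\([tn])")


def unescape(rules):
    """Replace \\uhhhh, \\Uhhhhhhhh, \\t and \\n escapes in a rule string."""
    def replace(match):
        if match.group(3):
            return {"t": "\t", "n": "\n"}[match.group(3)]
        return chr(int(match.group(1) or match.group(2), 16))
    return _ESCAPE.sub(replace, rules)


# The apostrophes survive unescaping, so the space and the ideographic
# space are still quoted when the rule is parsed afterwards.
print(repr(unescape(r"&'\u0020'='\u3000'")))
```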
|
|
|
<p>The ASCII double quote must be both escaped (so that the
|
|
|
collation syntax can be enclosed in pairs of double quotes in
|
|
|
programming environments such as ICU resource bundle .txt
|
|
|
files) and quoted. For example:
|
|
|
<code>&amp;'\u0022'&lt;&lt;&lt;x</code></p>
|
|
|
<p>Comments are allowed at the beginning, and after any
|
|
|
complete reset, relation, setting, or command. A comment begins
|
|
|
with a <code>#</code> and extends to the end of the line
|
|
|
(according to the Unicode Newline Guidelines).</p>
|
|
|
<p>The collation syntax is case-sensitive.</p>
|
|
|
<h3>3.6 <a name="Orderings" href="#Orderings" id=
|
|
|
"Orderings">Orderings</a></h3>
|
|
|
<p>The root collation mappings form the initial state. Mappings
|
|
|
are added and removed via a sequence of rule chains. Each
|
|
|
tailoring rule builds on the current state after all of the
|
|
|
preceding rules (and is not affected by any following rules).
|
|
|
Rule chains may alternate with comments, settings, and special
|
|
|
commands.</p>
|
|
|
<p>A rule chain consists of a reset followed by one or more
|
|
|
relations. The reset position is a string which maps to one or
|
|
|
more collation elements according to the current state. A
|
|
|
relation consists of an operator and a string; it maps the
|
|
|
string to the current collation elements, modified according to
|
|
|
the operator.</p>
|
|
|
<table>
|
|
|
<caption>
|
|
|
<a name="Specifying_Collation_Ordering" href=
|
|
|
"#Specifying_Collation_Ordering" id=
|
|
|
"Specifying_Collation_Ordering">Specifying Collation
|
|
|
Ordering</a>
|
|
|
</caption>
|
|
|
<tr>
|
|
|
<th>Relation Operator</th>
|
|
|
<th> Example</th>
|
|
|
<th>Description</th>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><code>&</code></td>
|
|
|
<td><code>& Z</code></td>
|
|
|
<td>Map Z to collation elements according to the current
|
|
|
state. These will be modified according to the following
|
|
|
relation operators and then assigned to the corresponding
|
|
|
relation strings.</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><code><</code></td>
|
|
|
<td><code>& a<br>
|
|
|
< b</code></td>
|
|
|
<td>Make 'b' sort after 'a', as a <i>primary</i>
|
|
|
(base-character) difference</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><code><<</code></td>
|
|
|
<td><code>& a<br>
|
|
|
<< ä</code></td>
|
|
|
<td>Make 'ä' sort after 'a' as a <i>secondary</i> (accent)
|
|
|
difference</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><code><<<</code></td>
|
|
|
<td><code>& a<br>
|
|
|
<<< A</code></td>
|
|
|
<td>Make 'A' sort after 'a' as a <i>tertiary</i>
|
|
|
(case/variant) difference</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><code><<<<</code></td>
|
|
|
<td><code>& か<br>
|
|
|
<<<< カ</code></td>
|
|
|
<td>Make 'カ' (Katakana Ka) sort after 'か' (Hiragana Ka) as
|
|
|
a <i>quaternary</i> difference</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><code>= </code></td>
|
|
|
<td><code>& v<br>
|
|
|
= w </code></td>
|
|
|
<td>Make 'w' sort <i>identically</i> to 'v'</td>
|
|
|
</tr>
|
|
|
</table>
|
|
|
<p>The following shows the result of serially applying three
|
|
|
rules.</p>
|
|
|
<table>
|
|
|
<tr>
|
|
|
<th> </th>
|
|
|
<th>Rules</th>
|
|
|
<th>Result</th>
|
|
|
<th>Comment</th>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>1</td>
|
|
|
<td>& a < g</td>
|
|
|
<td>... a <font color="red"><<sub>1</sub> g</font>
|
|
|
...</td>
|
|
|
<td>Put g after a.</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>2</td>
|
|
|
<td>& a < h < k</td>
|
|
|
<td>... a <font color="red"><<sub>1</sub> h
|
|
|
<<sub>1</sub> k</font> <<sub>1</sub> g ...</td>
|
|
|
<td>Now put h and k after a (inserting before the g).</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>3</td>
|
|
|
<td>& h << g</td>
|
|
|
<td>... a <<sub>1</sub> h <font color=
"red"><<sub>2</sub> g</font> <<sub>1</sub> k ...</td>
|
|
|
<td>Now put g after h (inserting before k).</td>
|
|
|
</tr>
|
|
|
</table>
|
|
|
<p>Notice that relation strings can occur multiple times, and
|
|
|
thus override previous rules.</p>
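<p>The serial application of rules shown in the table can be simulated with a simple ordered list. The following Python sketch is a toy model under stated assumptions: it tracks only the order of strings and the strength of the difference from the preceding entry, not collation elements or weights. The names <code>apply_chain</code> and <code>show</code> are hypothetical.</p>

```python
LEVEL = {"<": 1, "<<": 2, "<<<": 3, "<<<<": 4}


def apply_chain(order, reset, relations):
    """Apply one rule chain to order, a list of (string, level) pairs.

    level is the strength of the difference from the preceding entry.
    """
    order = list(order)
    anchor = reset
    for op, string in relations:
        level = LEVEL[op]
        # A string that was tailored before is moved: drop its old entry.
        for i, (s, old_level) in enumerate(order):
            if s == string:
                del order[i]
                if i < len(order):
                    nxt, nxt_level = order[i]
                    order[i] = (nxt, min(nxt_level, old_level))
                break
        pos = [s for s, _ in order].index(anchor) + 1
        # Skip entries that differ from the anchor only at a weaker level.
        while pos < len(order) and order[pos][1] > level:
            pos += 1
        order.insert(pos, (string, level))
        anchor = string
    return order


def show(order):
    """Format the order in the subscript notation used in the table."""
    return order[0][0] + "".join(f" <{lv} {s}" for s, lv in order[1:])


state = [("a", 1), ("b", 1)]
state = apply_chain(state, "a", [("<", "g")])
state = apply_chain(state, "a", [("<", "h"), ("<", "k")])
state = apply_chain(state, "h", [("<<", "g")])
print(show(state))
```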
|
|
|
<p>Each relation uses and modifies the collation elements of
|
|
|
the immediately preceding reset position or relation. A rule
|
|
|
chain with two or more relations is equivalent to a sequence of
|
|
|
“atomic rules” where each rule chain has exactly one relation,
|
|
|
and each relation is followed by a reset to this same relation
|
|
|
string.</p>
|
|
|
<p><i>Example:</i></p>
|
|
|
<table>
|
|
|
<tr>
|
|
|
<th>Rules</th>
|
|
|
<th>Equivalent Atomic Rules</th>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>& b < q <<< Q<br>
|
|
|
& a < x <<< X << q <<< Q
|
|
|
< z</td>
|
|
|
<td>& b < q<br>
|
|
|
& q <<< Q<br>
|
|
|
& a < x<br>
|
|
|
& x <<< X<br>
|
|
|
& X << q<br>
|
|
|
& q <<< Q<br>
|
|
|
& Q < z</td>
|
|
|
</tr>
|
|
|
</table>
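<p>The rewriting of a rule chain into atomic rules, as in the table above, is mechanical. The following Python sketch assumes that tokens are separated by white space and that no quoting, prefixes, or extensions are used; the function name <code>atomic_rules</code> is hypothetical.</p>

```python
def atomic_rules(chain):
    """Split a rule chain into atomic rules with one relation each."""
    tokens = chain.split()
    if len(tokens) < 4 or tokens[0] != "&" or len(tokens) % 2 != 0:
        raise ValueError("expected '& reset' followed by operator/string pairs")
    rules = []
    anchor = tokens[1]
    for op, string in zip(tokens[2::2], tokens[3::2]):
        rules.append(f"& {anchor} {op} {string}")
        anchor = string    # the next relation resets to this string
    return rules


for rule in atomic_rules("& a < x <<< X << q <<< Q < z"):
    print(rule)
```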
|
|
|
<p>This is not always possible because prefix and extension
|
|
|
strings can occur in a relation but not in a reset (see
|
|
|
below).</p>
|
|
|
<p>The relation operator <code>=</code> maps its relation
|
|
|
string to the current collation elements. Any other relation
|
|
|
operator modifies the current collation elements as
|
|
|
follows.</p>
|
|
|
<ul>
|
|
|
<li>Find the <i>last</i> collation element whose strength is
|
|
|
at least as great as the strength of the operator. For
|
|
|
example, for <code><<</code> find the last primary or
|
|
|
secondary CE. This CE will be modified; all following CEs
|
|
|
should be removed. If there is no such CE, then reset the
|
|
|
collation elements to a single completely-ignorable CE.</li>
|
|
|
<li>Increment the collation element weight corresponding to
|
|
|
the strength of the operator. For example, for
|
|
|
<code><<</code> increment the secondary weight.</li>
|
|
|
<li>The new weight must be less than the next weight for the
|
|
|
same combination of higher-level weights of any collation
|
|
|
element according to the current state.</li>
|
|
|
<li>Weights must be allocated in accordance with the <a href=
|
|
|
"https://www.unicode.org/reports/tr10/#Well-Formed">UCA
|
|
|
well-formedness conditions</a>.</li>
|
|
|
<li>When incrementing any weight, lower-level weights should
|
|
|
be reset to the “common” values, to help with sort key
|
|
|
compression.</li>
|
|
|
</ul>
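<p>The first, second, and last steps of the list above can be sketched on collation elements represented as (primary, secondary, tertiary) tuples. The following Python code is a simplified illustration: the name <code>tailor_ces</code> and the "common" weight values are assumptions, weights are plain integers, and the increment is always 1, so the requirement that the new weight stay below the next existing weight is not enforced.</p>

```python
COMMON_SECONDARY = 0x20   # illustrative "common" weights, assumed here
COMMON_TERTIARY = 0x02


def tailor_ces(ces, level):
    """Modify CEs for a relation operator of the given level (1, 2, or 3)."""
    # Find the last CE that is non-zero at the operator's level or above;
    # all following CEs are removed.
    for i in range(len(ces) - 1, -1, -1):
        if any(ces[i][:level]):
            kept = list(ces[:i + 1])
            break
    else:
        kept = [(0, 0, 0)]            # a single completely-ignorable CE
    weights = list(kept[-1])
    weights[level - 1] += 1           # increment the operator's level
    common = (None, COMMON_SECONDARY, COMMON_TERTIARY)
    for lower in range(level, 3):     # reset lower levels to "common"
        weights[lower] = common[lower]
    kept[-1] = tuple(weights)
    return kept


# A primary relation after a two-CE reset position: the first CE is
# kept and the primary weight of the second CE is incremented.
print(tailor_ces([(10, 0x20, 2), (14, 0x20, 2)], 1))
```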
|
|
|
<p>In all cases, even for <code>=</code> , the case bits are
|
|
|
recomputed according to <i>Section 3.14, <a href=
|
|
|
"#Case_Parameters">Case Parameters</a></i>. (This can be
|
|
|
skipped if an implementation does not support the caseLevel or
|
|
|
caseFirst settings.)</p>
|
|
|
<p>For example, <code>&amp;ae&lt;x</code> maps ‘x’ to two
|
|
|
collation elements. The first one is the same as for ‘a’, and
|
|
|
the second one has a primary weight between those for ‘e’ and
|
|
|
‘f’. As a result, ‘x’ sorts between “ae” and “af”. (If the
|
|
|
primary of the first collation element was incremented instead,
|
|
|
then ‘x’ would sort after “az”. While also sorting
|
|
|
primary-after “ae” this would be surprising and
|
|
|
sub-optimal.)</p>
|
|
|
<p>Some additional operators are provided to save space with
|
|
|
large tailorings. The addition of a * to the relation operator
|
|
|
indicates that each of the following single characters is to
|
|
|
be handled as if they were separate relations with the
|
|
|
corresponding strength. Each of the following single characters
|
|
|
must be NFD-inert, that is, it does not have a canonical
|
|
|
decomposition and it does not reorder (ccc=0). This keeps
|
|
|
abbreviated rules unambiguous.</p>
|
|
|
<p>A starred relation operator is followed by a sequence of
|
|
|
characters with the same quoting/escaping rules as normal
|
|
|
relation strings. Such a sequence can also be followed by one
|
|
|
or more pairs of ‘-’ and another sequence of characters. The
|
|
|
single characters adjacent to the ‘-’ establish a code point
|
|
|
order range. The same character cannot be both the end of a
|
|
|
range and the start of another range. (For example,
|
|
|
<code>&lt;a-d-g</code> is not allowed.)</p>
|
|
|
<table>
|
|
|
<caption>
|
|
|
<a name="Abbreviating_Ordering_Specifications" href=
|
|
|
"#Abbreviating_Ordering_Specifications" id=
|
|
|
"Abbreviating_Ordering_Specifications">Abbreviating
|
|
|
Ordering Specifications</a>
|
|
|
</caption>
|
|
|
<tr>
|
|
|
<th>Relation Operator</th>
|
|
|
<th>Example</th>
|
|
|
<th>Equivalent</th>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><code><*</code></td>
|
|
|
<td><code>& <span style="color: blue">a</span><br>
|
|
|
<* <span style=
|
|
|
"color: blue">bcd-gp-s</span> </code></td>
|
|
|
<td><code>& <span style="color: blue">a</span><br>
|
|
|
< <span style="color: blue">b</span> < <span style=
|
|
|
"color: blue">c</span> < <span style=
|
|
|
"color: blue">d</span> < <span style=
|
|
|
"color: blue">e</span> < <span style=
|
|
|
"color: blue">f</span> < <span style=
|
|
|
"color: blue">g</span> < <span style=
|
|
|
"color: blue">p</span> < <span style=
|
|
|
"color: blue">q</span> < <span style=
|
|
|
"color: blue">r</span> < <span style=
|
|
|
"color: blue">s</span></code></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><code><<*</code></td>
|
|
|
<td><code>& <span style="color: blue">a</span><br>
|
|
|
<<* <span style="color: blue">æᶏɐ</span></code></td>
|
|
|
<td><code>& <span style="color: blue">a</span><br>
|
|
|
<< <span style="color: blue">æ</span> <<
|
|
|
<span style="color: blue">ᶏ</span> << <span style=
|
|
|
"color: blue">ɐ</span></code></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><code><<<*</code></td>
|
|
|
<td><code>& <span style="color: blue">p</span><br>
|
|
|
<<<* <span style=
|
|
|
"color: blue">PpP</span></code></td>
|
|
|
<td><code>& <span style="color: blue">p</span><br>
|
|
|
<<< <span style="color: blue">P</span>
|
|
|
<<< <span style="color: blue">p</span>
|
|
|
<<< <span style="color: blue">P</span></code></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><code><<<<*</code></td>
|
|
|
<td><code>& <span style="color: blue">k</span><br>
|
|
|
<<<<* <span style=
|
|
|
"color: blue">qQ</span></code></td>
|
|
|
<td><code>& <span style="color: blue">k</span><br>
|
|
|
<<<< <span style="color: blue">q</span>
|
|
|
<<<< <span style=
|
|
|
"color: blue">Q</span></code></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><code>=*</code></td>
|
|
|
<td><code>& <span style="color: blue">v</span><br>
|
|
|
=* <span style="color: blue">VwW</span></code></td>
|
|
|
<td><code>& <span style="color: blue">v</span><br>
|
|
|
= <span style="color: blue">V</span> = <span style=
|
|
|
"color: blue">w</span> = <span style=
|
|
|
"color: blue">W</span></code></td>
|
|
|
</tr>
|
|
|
</table>
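<p>The expansion of the character sequence after a starred operator, including code point ranges, can be sketched as follows. This Python example is illustrative only: the function name <code>expand_starred</code> is hypothetical, quoting and escaping are not handled, and the NFD-inert requirement is not checked.</p>

```python
def expand_starred(sequence):
    """Expand the characters after a starred operator into a list."""
    chars = []
    range_ended = False    # True if the previous character closed a range
    i = 0
    while i < len(sequence):
        ch = sequence[i]
        if ch == "-":
            if not chars or i + 1 >= len(sequence) or range_ended:
                raise ValueError("malformed range in " + repr(sequence))
            start, end = chars[-1], sequence[i + 1]
            if ord(end) < ord(start):
                raise ValueError("range end precedes range start")
            # Add the code points after start, up to and including end.
            chars.extend(chr(c) for c in range(ord(start) + 1, ord(end) + 1))
            range_ended = True
            i += 2
        else:
            chars.append(ch)
            range_ended = False
            i += 1
    return chars


print(" < ".join(expand_starred("bcd-gp-s")))
```

<p>A sequence such as "a-d-g" is rejected, because the same character would end one range and start another.</p>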
|
|
|
<h3>3.7 <a name="Contractions" href="#Contractions" id=
|
|
|
"Contractions">Contractions</a></h3>
|
|
|
<p>A multi-character relation string defines a contraction.</p>
|
|
|
<table>
|
|
|
<caption>
|
|
|
<a name="Specifying_Contractions" href=
|
|
|
"#Specifying_Contractions" id=
|
|
|
"Specifying_Contractions">Specifying Contractions</a>
|
|
|
</caption>
|
|
|
<tr>
|
|
|
<th>Example</th>
|
|
|
<th>Description</th>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><code>& k<br>
|
|
|
< ch</code></td>
|
|
|
<td>Make the sequence 'ch' sort after 'k', as a primary
|
|
|
(base-character) difference</td>
|
|
|
</tr>
|
|
|
</table>
|
|
|
<h3>3.8 <a name="Expansions" href="#Expansions" id=
|
|
|
"Expansions">Expansions</a></h3>
|
|
|
<p>A mapping to multiple collation elements defines an
|
|
|
expansion. This is normally the result of a reset position
|
|
|
(and/or preceding relation) that yields multiple collation
|
|
|
elements, for example <code>&amp;ae&lt;x</code> or
<code>&amp;æ&lt;y</code>.</p>
|
|
|
<p>A relation string can also be followed by <code>/</code> and
|
|
|
an <i>extension string</i>. The extension string is mapped to
|
|
|
collation elements according to the current state, and the
|
|
|
relation string is mapped to the concatenation of the regular
|
|
|
CEs and the extension CEs. The extension CEs are not modified,
|
|
|
not even their case bits. The extension CEs are <i>not</i>
|
|
|
retained for following relations.</p>
|
|
|
<p>For example, <code>&amp;a&lt;z/e</code> maps ‘z’ to an
expansion similar to <code>&amp;ae&lt;x</code>. However, the
|
|
|
first CE of ‘z’ is primary-after that of ‘a’, and the second CE
|
|
|
is exactly that of ‘e’, which yields the order ae < x <
|
|
|
af < ag < ... < az < z < b.</p>
|
|
|
<p>The choice of reset-to-expansion vs. use of an extension
|
|
|
string can be exploited to affect contextual mappings. For
|
|
|
example, <code>&L·=x</code> yields a second CE for ‘x’
|
|
|
equal to the context-sensitive middle-dot-after-L (which is a
|
|
|
secondary CE in the root collation). On the other hand,
|
|
|
<code>&L=x/·</code> yields a second CE of the middle dot by
|
|
|
itself (which is a primary CE).</p>
|
|
|
<p>The two ways of specifying expansions also differ in how
|
|
|
case bits are computed. When some of the CEs are copied
|
|
|
verbatim from an extension string, then the relation string’s
|
|
|
case bits are distributed over a smaller number of normal CEs.
|
|
|
For example, <code>&aE=Ch</code> yields an uppercase CE and
|
|
|
a lowercase CE, but <code>&a=Ch/E</code> yields a
|
|
|
mixed-case CE (for ‘C’ and ‘h’ together) followed by an
|
|
|
uppercase CE (copied from ‘E’).</p>
|
|
|
<p>In summary, there are two ways of specifying expansions
|
|
|
which produce subtly different mappings. The use of extension
|
|
|
strings is unusual but sometimes necessary.</p>
|
|
|
<h3>3.9 <a name="Context_Before" href="#Context_Before" id=
|
|
|
"Context_Before">Context Before</a></h3>
|
|
|
<p>A relation string can have a prefix (context before) which
|
|
|
makes the mapping from the relation string to its tailored
|
|
|
position conditional on the string occurring after that prefix.
|
|
|
For details see the specification of <i><a href=
|
|
|
"#Context_Sensitive_Mappings">Context-Sensitive
|
|
|
Mappings</a></i>.</p>
|
|
|
<p>For example, suppose that "-" is sorted like the previous
|
|
|
vowel. Then one could have rules that define contractions for "a-", "e-", and so
|
|
|
on. However, that means that every time a very common character
|
|
|
(a, e, ...) is encountered, a system will slow down as it looks
|
|
|
for possible contractions. An alternative is to indicate that
|
|
|
when "-" is encountered, and it comes after an 'a', it sorts
|
|
|
like an 'a', and so on.</p>
|
|
|
<table>
|
|
|
<caption>
|
|
|
<a name="Specifying_Previous_Context" href=
|
|
|
"#Specifying_Previous_Context" id=
|
|
|
"Specifying_Previous_Context">Specifying Previous
|
|
|
Context</a>
|
|
|
</caption>
|
|
|
<tr>
|
|
|
<th>Rules</th>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><code>& a <<< a | '-'<br>
|
|
|
& e <<< e | '-'<br>
|
|
|
...</code></td>
|
|
|
</tr>
|
|
|
</table>
|
|
|
<p>Both the prefix and extension strings can occur in a
|
|
|
relation. For example, the following are allowed:</p>
|
|
|
<ul>
|
|
|
<li><code>< abc | def / ghi</code></li>
|
|
|
<li><code>< def / ghi</code></li>
|
|
|
<li><code>< abc | def</code></li>
|
|
|
</ul>
|
|
|
<h3>3.10 <a name="Placing_Characters_Before_Others" href=
|
|
|
"#Placing_Characters_Before_Others" id=
|
|
|
"Placing_Characters_Before_Others">Placing Characters Before
|
|
|
Others</a></h3>
|
|
|
<p>There are certain circumstances where characters need to be
|
|
|
placed before a given character, rather than after. This is the
|
|
|
case with Pinyin, for example, where certain accented letters
|
|
|
are positioned before the base letter. That is accomplished
|
|
|
with the following syntax.</p>
|
|
|
<pre>&[before 2] a << à</pre>
|
|
|
<p>The before-strength can be 1 (primary), 2 (secondary), or 3
|
|
|
(tertiary).</p>
|
|
|
<p>It is an error if the strength of the reset-before differs
|
|
|
from the strength of the immediately following relation. Thus
|
|
|
the following are errors.</p>
|
|
|
<ul>
|
|
|
<li><code>&[before 2] a < à # error</code></li>
|
|
|
<li><code>&[before 2] a <<< à #
|
|
|
error</code></li>
|
|
|
</ul>
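<p>The strength-matching requirement can be checked with a small validator. The following Python sketch is a simplified illustration: the function name <code>reset_before_is_valid</code> is hypothetical, and the pattern covers only a plain reset string followed by a single relation operator.</p>

```python
import re

_RESET_BEFORE = re.compile(
    r"&\s*\[before\s+([123])\]\s*([^\s<=]+)\s*(<{1,4}|=)"
)


def reset_before_is_valid(rule):
    """True if the first relation has the same strength as the reset-before."""
    match = _RESET_BEFORE.match(rule.strip())
    if match is None:
        raise ValueError("not a reset-before rule: " + repr(rule))
    before_strength = int(match.group(1))
    operator = match.group(3)
    # "<" is primary, "<<" secondary, "<<<" tertiary; "=" never matches.
    return operator != "=" and len(operator) == before_strength


print(reset_before_is_valid("&[before 2] a << à"))
print(reset_before_is_valid("&[before 2] a < à"))
```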
|
|
|
<h3>3.11 <a name="Logical_Reset_Positions" href=
|
|
|
"#Logical_Reset_Positions" id="Logical_Reset_Positions">Logical
|
|
|
Reset Positions</a></h3>
|
|
|
<p>The CLDR table (based on UCA) has the following overall
|
|
|
structure for weights, going from low to high.</p>
|
|
|
<table>
|
|
|
<caption>
|
|
|
<a name="Specifying_Logical_Positions" href=
|
|
|
"#Specifying_Logical_Positions" id=
|
|
|
"Specifying_Logical_Positions">Specifying Logical
|
|
|
Positions</a>
|
|
|
</caption>
|
|
|
<tr>
|
|
|
<th>Name</th>
|
|
|
<th>Description</th>
|
|
|
<th>UCA Examples</th>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>first tertiary ignorable<br>
|
|
|
...<br>
|
|
|
last tertiary ignorable</td>
|
|
|
<td>p, s, t = ignore</td>
|
|
|
<td>Control Codes<br>
|
|
|
Format Characters<br>
|
|
|
Hebrew Points<br>
|
|
|
Tibetan Signs<br>
|
|
|
...</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>first secondary ignorable<br>
|
|
|
...<br>
|
|
|
last secondary ignorable</td>
|
|
|
<td>p, s = ignore</td>
|
|
|
<td>None in UCA</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>first primary ignorable<br>
|
|
|
...<br>
|
|
|
last primary ignorable</td>
|
|
|
<td>p = ignore</td>
|
|
|
<td>Most combining marks</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>first variable<br>
|
|
|
...<br>
|
|
|
last variable</td>
|
|
|
<td><i><b>if</b> alternate = non-ignorable<br></i> p !=
|
|
|
ignore,<br>
|
|
|
<i><b>if</b> alternate = shifted</i><br>
|
|
|
p, s, t = ignore</td>
|
|
|
<td>Whitespace,<br>
|
|
|
Punctuation</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>first regular<br>
|
|
|
...<br>
|
|
|
last regular</td>
|
|
|
<td>p != ignore</td>
|
|
|
<td>General Symbols<br>
|
|
|
Currency Symbols<br>
|
|
|
Numbers<br>
|
|
|
Latin<br>
|
|
|
Greek<br>
|
|
|
...</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>first implicit<br>
|
|
|
...<br>
|
|
|
last implicit</td>
|
|
|
<td>p != ignore, assigned automatically</td>
|
|
|
<td>CJK, CJK compatibility (those that are not
|
|
|
decomposed)<br>
|
|
|
CJK Extension A, B, C, ...<br>
|
|
|
Unassigned</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>first trailing<br>
|
|
|
...<br>
|
|
|
last trailing</td>
|
|
|
<td>p != ignore,<br>
|
|
|
used for trailing syllable components</td>
|
|
|
<td>Jamo Trailing<br>
|
|
|
Jamo Leading<br>
|
|
|
U+FFFD<br>
|
|
|
U+FFFF</td>
|
|
|
</tr>
|
|
|
</table>
|
|
|
<p>Each of the above Names can be used with a reset to position
|
|
|
characters relative to that logical position. That allows
|
|
|
characters to be ordered before or after a <i>logical</i>
|
|
|
position rather than a specific character.</p>
|
|
|
<blockquote>
|
|
|
<p class="note"><b>Note:</b> The reason for this is so that
|
|
|
tailorings can be more stable. A future version of the UCA
|
|
|
might add characters at any point in the above list. Suppose
|
|
|
that you set character X to be after Y. It could be that you
|
|
|
want X to come after Y, no matter what future characters are
|
|
|
added; or it could be that you just want X to come after a
|
|
|
given logical position, for example, after the last primary
|
|
|
ignorable.</p>
|
|
|
</blockquote>
|
|
|
<p>Each of these special reset positions always maps to a
|
|
|
single collation element.</p>
|
|
|
<p>Here is an example of the syntax:</p>
|
|
|
<pre>& [first tertiary ignorable] << à </pre>
|
|
|
<p>For example, to make a character be a secondary ignorable,
|
|
|
one can make it be immediately after (at a secondary level) a
|
|
|
specific character (like a combining diaeresis), or one can
|
|
|
make it be immediately after the last secondary ignorable.</p>
|
|
|
<p>Each special reset position adjusts to the effects of
|
|
|
preceding rules, just like normal reset position strings. For
|
|
|
example, if a tailoring rule creates a new collation element
|
|
|
after <code>&[last variable]</code> (via explicit tailoring
|
|
|
after that, or via tailoring after the relevant character),
|
|
|
then this new CE becomes the new <i>last variable</i> CE, and
|
|
|
is used in following resets to <code>[last variable]</code>.</p>
|
|
|
<p>[first variable] and [first regular] and [first trailing]
|
|
|
should be the first real such CEs (e.g., CE(U+0060 `)), as
|
|
|
adjusted according to the tailoring, not the boundary CEs (see
|
|
|
the FractionalUCA.txt “first primary” mappings starting with
|
|
|
U+FDD1).</p>
|
|
|
<p><code>[last regular]</code> is not actually the last normal
|
|
|
CE with a primary weight before implicit primaries. It is used
|
|
|
to tailor large numbers of characters, usually CJK, into the
|
|
|
script=Hani range between the last regular script and the first
|
|
|
implicit CE. (The first group of implicit CEs is for Han
|
|
|
characters.) Therefore, <code>[last regular]</code> is set to
|
|
|
the first Hani CE, the artificial script boundary CE at the
|
|
|
beginning of this range. For example: <code>&[last
|
|
|
regular]<*亜唖娃阿...</code></p>
|
|
|
<p>The [last trailing] is the CE of U+FFFF. Tailoring to that
|
|
|
is not allowed.</p>
|
|
|
<p>The <code>[last variable]</code> indicates the "highest"
|
|
|
character that is treated as punctuation with alternate
|
|
|
handling.</p>
|
|
|
<p>The value can be changed by using the maxVariable setting.
|
|
|
This takes effect, however, after the rules have been built,
|
|
|
and does not affect any characters that are reset relative to
|
|
|
the <code>[last variable]</code> value when the rules are being
|
|
|
built. The maxVariable setting might also be changed via a
|
|
|
runtime parameter. That also does not affect the rules.<br>
|
|
|
(In CLDR 24 and earlier, the variable top could also be set by
|
|
|
using a tailoring rule with <code>[variable top]</code> in the
|
|
|
place of a relation string.)</p>
|
|
|
<h3>3.12 <a name="Special_Purpose_Commands" href=
|
|
|
"#Special_Purpose_Commands" id=
|
|
|
"Special_Purpose_Commands">Special-Purpose Commands</a></h3>
|
|
|
<p>The import command imports rules from another collation.
|
|
|
This allows for better maintenance and smaller rule sizes. The
|
|
|
source is a BCP 47 language tag with an optional collation type
|
|
|
but without other extensions. The collation type is the BCP 47
|
|
|
form of the collation type in the source; it defaults to
|
|
|
"standard".</p>
|
|
|
<p><em>Examples:</em></p>
|
|
|
<ul>
|
|
|
<li><code>[import de-u-co-phonebk]</code> (not
|
|
|
"...-co-phonebook")</li>
|
|
|
<li><code>[import und-u-co-search]</code> (not
|
|
|
"root-...")</li>
|
|
|
<li><code>[import ja-u-co-private-kana]</code>
|
|
|
(language "ja" required even when this import itself is in
|
|
|
another "ja" tailoring.)</li>
|
|
|
</ul>
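<p>As an illustration of how the import source splits into a language tag and a collation type, here is a minimal sketch. The function name <code>parse_import</code> is hypothetical, and the code relies on the restriction stated above that the source carries no extensions other than the optional collation type.</p>

```python
def parse_import(command):
    """Split '[import <source>]' into (language_tag, collation_type)."""
    inner = command.strip()
    if not (inner.startswith("[import ") and inner.endswith("]")):
        raise ValueError(f"not an import command: {command!r}")
    source = inner[len("[import "):-1].strip()
    # Assumption: the only extension that may appear is "-u-co-<type>".
    language, separator, collation_type = source.partition("-u-co-")
    if not separator:
        return source, "standard"  # the collation type defaults to "standard"
    return language, collation_type
```

<p>For the examples above, this yields <code>("de", "phonebk")</code>, <code>("und", "search")</code>, and <code>("ja", "private-kana")</code>.</p>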
|
|
|
<table>
|
|
|
<caption>
|
|
|
<a name="Special_Purpose_Elements" href=
|
|
|
"#Special_Purpose_Elements" id=
|
|
|
"Special_Purpose_Elements">Special-Purpose Elements</a>
|
|
|
</caption>
|
|
|
<tr>
|
|
|
<th>Rule Syntax</th>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>[suppressContractions [Љ-ґ]]</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td>[optimize [Ά-ώ]]</td>
|
|
|
</tr>
|
|
|
</table>
|
|
|
<p>The <i>suppress contractions</i> tailoring command turns off
|
|
|
any existing contractions that begin with those characters, as
|
|
|
well as any prefixes for those characters. It is typically used
|
|
|
to turn off the Cyrillic contractions in the UCA, since they
|
|
|
are not used in many languages and have a considerable
|
|
|
performance penalty. The argument is a <a href=
|
|
|
"tr35.html#Unicode_Sets">Unicode Set</a>.</p>
|
|
|
<p>The <i>suppress contractions</i> command has immediate
|
|
|
effect on the current set of mappings, including mappings added
|
|
|
by preceding rules. Following rules are processed after
|
|
|
removing any context-sensitive mappings originating from any of
|
|
|
the characters in the set.</p>
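<p>The effect on the current set of mappings can be pictured with a small sketch. The data model is an assumption made for illustration only: each mapping is keyed by a hypothetical <code>(prefix, string)</code> pair, a mapping counts as context-sensitive if it has a prefix or a string of more than one code point, and the collation elements are placeholder labels.</p>

```python
def suppress_contractions(mappings, characters):
    """Return the mappings without context-sensitive ones that start
    with a character in `characters`.

    `mappings` maps (prefix, string) to a list of collation elements.
    """
    kept = {}
    for (prefix, string), elements in mappings.items():
        context_sensitive = bool(prefix) or len(string) > 1
        if context_sensitive and string[0] in characters:
            continue  # turned off by [suppressContractions ...]
        kept[(prefix, string)] = elements
    return kept


# Placeholder data: the CE labels and the "x" prefix are made up.
sample = {
    ("", "и"): ["CE(и)"],
    ("", "и\u0306"): ["CE(й)"],    # contraction starting with и
    ("x", "и"): ["CE(x|и)"],       # prefix mapping for и
    ("", "ch"): ["CE(ch)"],        # contraction outside the set
}
remaining = suppress_contractions(sample, {"и"})
```

<p>Here only the simple mapping for и and the unrelated contraction remain; rules processed afterwards would see this reduced set.</p>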
|
|
|
<p>The <i>optimize</i> tailoring command is purely for
|
|
|
performance. It indicates that those characters are
|
|
|
sufficiently common in the target language for the tailoring
|
|
|
that their performance should be enhanced.</p>
|
|
|
<p>The reason that these are not settings is so that their
|
|
|
contents can be arbitrary characters.</p>
|
|
|
<hr width="50%">
|
|
|
<p><i>Example:</i></p>
|
|
|
<p>The following is a simple example that combines portions of
|
|
|
different tailorings for illustration. For more complete
|
|
|
examples, see the actual locale data: <a href=
|
|
|
"https://github.com/unicode-org/cldr/tree/latest/common/collation/ja.xml">
|
|
|
Japanese</a>, <a href=
|
|
|
"https://github.com/unicode-org/cldr/tree/latest/common/collation/zh.xml">
|
|
|
Chinese</a>, <a href=
|
|
|
"https://github.com/unicode-org/cldr/tree/latest/common/collation/sv.xml">
|
|
|
Swedish</a>, and <a href=
|
|
|
"https://github.com/unicode-org/cldr/tree/latest/common/collation/de.xml">
|
|
|
German</a> (type="phonebook") are particularly
|
|
|
illustrative.</p>
|
|
|
<pre><collation>
  <cr><![CDATA[
    [caseLevel on]
    &Z
    < æ <<< Æ
    < å <<< Å <<< aa <<< aA <<< Aa <<< AA
    < ä <<< Ä
    < ö <<< Ö << ű <<< Ű
    < ő <<< Ő << ø <<< Ø
    &V <<<* wW
    &Y <<<* üÜ
    &[last non-ignorable]
    <span style="color: green"># The following is equivalent to <亜<唖<娃...</span>
    <* 亜唖娃阿哀愛挨姶逢葵茜穐悪握渥旭葦芦
    <* 鯵梓圧斡扱
  ]]></cr>
</collation></pre>
|
|
|
<h3>3.13 <a name="Script_Reordering" href="#Script_Reordering"
|
|
|
id="Script_Reordering">Collation Reordering</a></h3>
|
|
|
<p>Collation reordering allows scripts and certain other
|
|
|
defined blocks of characters to be moved relative to each other
|
|
|
parametrically, without changing the detailed rules for all the
|
|
|
characters involved. This reordering is done on top of any
|
|
|
specific ordering rules within the script or block currently in
|
|
|
effect. Reordering can specify groups to be placed at the start
|
|
|
and/or the end of the collation order. For example, to reorder
|
|
|
Greek characters before Latin characters, and digits afterwards
|
|
|
(but before other scripts), the following can be used:</p>
|
|
|
<table>
|
|
|
<tr>
|
|
|
<th>Rule Syntax</th>
|
|
|
<th>Locale Identifier</th>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><code>[reorder Grek Latn digit]</code></td>
|
|
|
<td><code>en-u-kr-grek-latn-digit</code></td>
|
|
|
</tr>
|
|
|
</table>
|
|
|
<p>In each case, a sequence of
|
|
|
<em><strong>reorder_codes</strong></em> is used, separated by
|
|
|
spaces in the settings attribute and in rule syntax, and by
|
|
|
hyphens in locale identifiers.</p>
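<p>The two spellings can be converted mechanically. The following is a rough sketch with hypothetical function names; it assumes that script codes are the only four-letter reorder codes (which holds for the special codes listed in this section) and that a subtag of one or two characters ends the <code>kr</code> value.</p>

```python
def reorder_rule_to_keyword(rule):
    """'[reorder Grek Latn digit]' -> 'kr-grek-latn-digit'"""
    inner = rule.strip()
    if not (inner.startswith("[reorder") and inner.endswith("]")):
        raise ValueError(f"not a reorder rule: {rule!r}")
    codes = inner[len("[reorder"):-1].split()
    if not codes:
        # An empty sequence cannot be expressed in a locale identifier.
        raise ValueError("empty reorder list")
    return "kr-" + "-".join(code.lower() for code in codes)


def locale_to_reorder_rule(locale_id):
    """'en-u-kr-grek-latn-digit' -> '[reorder Grek Latn digit]' (or None)."""
    subtags = locale_id.lower().split("-")
    if "u" not in subtags:
        return None
    start = subtags.index("u") + 1
    if "kr" not in subtags[start:]:
        return None
    first = subtags.index("kr", start) + 1
    codes = []
    for subtag in subtags[first:]:
        if len(subtag) <= 2:  # next key or singleton ends the kr value
            break
        # Script codes are written in title case; special codes stay lowercase.
        codes.append(subtag.title() if len(subtag) == 4 else subtag)
    return "[reorder " + " ".join(codes) + "]"
```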
|
|
|
<p>A <strong><em>reorder_code</em></strong> is any of the
|
|
|
following special codes:</p>
|
|
|
<ol>
|
|
|
<li><strong>space, punct, symbol, currency, digit</strong> -
|
|
|
core groups of characters below 'a'</li>
|
|
|
<li>
|
|
|
<strong>any script code</strong> except
|
|
|
<strong>Common</strong> and <strong>Inherited</strong>.
|
|
|
<ul>
|
|
|
<li>Some pairs of scripts sort primary-equal and always
|
|
|
reorder together. For example, Katakana characters are
|
|
|
always reordered with Hiragana.</li>
|
|
|
</ul>
|
|
|
</li>
|
|
|
<li><strong>others</strong> - where all codes not explicitly
|
|
|
mentioned should be ordered. The script code
|
|
|
<strong>Zzzz</strong> (Unknown Script) is a synonym for
|
|
|
<strong>others</strong>.</li>
|
|
|
</ol>
|
|
|
<p>It is an error if a code occurs multiple times.</p>
|
|
|
<p>It is an error if the sequence of reorder codes is empty in
|
|
|
the XML attribute or in the locale identifier. Some
|
|
|
implementations may interpret an empty sequence in the
|
|
|
<code>[reorder]</code> rule syntax as a reset to the DUCET
|
|
|
ordering, synonymous with <code>[reorder others]</code>; other
|
|
|
implementations may forbid an empty sequence in the rule syntax
|
|
|
as well.</p>
|
|
|
<p>Interaction with <strong>alternate=shifted</strong>: Whether
|
|
|
a primary weight is “variable” is determined according to the
|
|
|
“variable top”, before applying script reordering. Once that is
|
|
|
determined, script reordering is applied to the primary weight
|
|
|
regardless of whether it is “regular” (used in the primary
|
|
|
level) or “shifted” (used in the quaternary level).</p>
|
|
|
<h4>3.13.1 <a name="Interpretation_reordering" href=
|
|
|
"#Interpretation_reordering" id=
|
|
|
"Interpretation_reordering">Interpretation of a reordering
|
|
|
list</a></h4>
|
|
|
<p>The reordering list is interpreted as if it were processed
|
|
|
in the following way.</p>
|
|
|
<ol>
|
|
|
<li>If any core code is not present, then it is inserted at
|
|
|
the front of the list in the order given above.</li>
|
|
|
<li>If the <strong>others</strong> code is not present, then
|
|
|
it is inserted at the end of the list.</li>
|
|
|
<li>The <strong>others</strong> code is replaced by the list
|
|
|
of all script codes not explicitly mentioned, in DUCET
|
|
|
order.</li>
|
|
|
<li>The reordering list is now complete, and used to reorder
|
|
|
characters in collation accordingly.</li>
|
|
|
</ol>
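<p>The steps above can be sketched as follows. The script list <code>SAMPLE_SCRIPT_ORDER</code> is a heavily abbreviated, hypothetical stand-in for the full DUCET script order, and the function name is illustrative; a real implementation works on primary-weight ranges rather than on code names.</p>

```python
CORE_CODES = ["space", "punct", "symbol", "currency", "digit"]

# Hypothetical, abbreviated stand-in for the DUCET order of scripts.
SAMPLE_SCRIPT_ORDER = ["Latn", "Grek", "Cyrl", "Hebr", "Arab", "Hani"]


def complete_reordering(codes, script_order=SAMPLE_SCRIPT_ORDER):
    """Expand a list of reorder codes into the complete reordering list."""
    codes = ["others" if code == "Zzzz" else code for code in codes]
    if len(set(codes)) != len(codes):
        raise ValueError("a reorder code occurs multiple times")
    # Step 1: missing core codes go to the front, in their standard order.
    codes = [core for core in CORE_CODES if core not in codes] + codes
    # Step 2: a missing "others" goes to the end.
    if "others" not in codes:
        codes.append("others")
    # Step 3: "others" is replaced by all unmentioned scripts, in DUCET order.
    mentioned = set(codes)
    rest = [script for script in script_order if script not in mentioned]
    position = codes.index("others")
    # Step 4: the list is now complete.
    return codes[:position] + rest + codes[position + 1:]
```

<p>For example, <code>complete_reordering(["Grek", "Latn", "digit"])</code> places the four remaining core groups first, then Greek, Latin, and digits, followed by the other sample scripts in their original order.</p>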
|
|
|
<p>The locale data may have a particular ordering. For example,
|
|
|
the Czech locale data could put digits after all letters, with
|
|
|
<code>[reorder others digit]</code>. Any reordering codes
specified on top of that (such as with a BCP 47 locale
|
|
|
identifier) completely replace what was there. To specify a
|
|
|
version of collation that completely resets any existing
|
|
|
reordering to the DUCET ordering, the single code
|
|
|
<strong>Zzzz</strong> or <strong>others</strong> can be used,
|
|
|
as below.</p>
|
|
|
<p><em>Examples:</em></p>
|
|
|
<table cellpadding="0" cellspacing="0">
|
|
|
<tbody>
|
|
|
<tr>
|
|
|
<th>Locale Identifier</th>
|
|
|
<th>Effect</th>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><code>en-u-kr-latn-digit</code></td>
|
|
|
<td>Reorder digits after Latin characters (but before
|
|
|
other scripts like Cyrillic).</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><code>en-u-kr-others-digit</code></td>
|
|
|
<td>Reorder digits after all other characters.</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><code>en-u-kr-arab-cyrl-others-symbol</code></td>
|
|
|
<td>Reorder Arabic characters first, then Cyrillic, and
|
|
|
put symbols at the end—after all other characters.</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><code>en-u-kr-others</code></td>
|
|
|
<td>Remove any locale-specific reordering, and use DUCET
|
|
|
order for reordering blocks.</td>
|
|
|
</tr>
|
|
|
</tbody>
|
|
|
</table>
|
|
|
<p>The default reordering groups are defined by the
|
|
|
FractionalUCA.txt file, based on the primary weights of
|
|
|
associated collation elements. The file contains special
|
|
|
mappings for the start of each group, script, and
|
|
|
reorder-reserved range, see <i>Section 2.6.2, <a href=
|
|
|
"#File_Format_FractionalUCA_txt">FractionalUCA.txt</a></i>.</p>
|
|
|
<p>There are some special cases:</p>
|
|
|
<ul>
|
|
|
<li>The <strong>Hani</strong> group includes implicit weights
|
|
|
for <em>Han characters</em> according to the UCA as well as
|
|
|
any characters tailored relative to a Han character, or after
|
|
|
<code>&[first Hani]</code>.</li>
|
|
|
<li>Implicit weights for <em>unassigned code points</em>
|
|
|
according to the UCA reorder as the last weights in the
|
|
|
<strong>others</strong> (<strong>Zzzz</strong>) group.<br>
|
|
|
There is no script code to explicitly reorder the
|
|
|
unassigned-implicit weights into a particular position.
|
|
|
(Unassigned-implicit weights are used for non-Hani code
|
|
|
points without any mappings. For a given Unicode version they
|
|
|
are the code points with General_Category values Cn, Co,
|
|
|
Cs.)</li>
|
|
|
<li>The TRAILING group, the FIELD-SEPARATOR (associated with
|
|
|
U+FFFE), and collation elements with only zero primary
|
|
|
weights are not reordered.</li>
|
|
|
<li>The TERMINATOR, LEVEL-SEPARATOR, and SPECIAL groups are
|
|
|
never associated with characters.</li>
|
|
|
</ul>
|
|
|
<p>For example, <code>reorder="Hani Zzzz Grek"</code> sorts
|
|
|
Hani, Latin, Cyrillic, ... (all other scripts) ..., unassigned,
|
|
|
Greek, TRAILING.</p>
|
|
|
<p>Notes for implementations that write sort keys:</p>
|
|
|
<ul>
|
|
|
<li>Primaries must always be offset by one or more whole
|
|
|
primary lead bytes. (Otherwise the number of bytes in a
|
|
|
fractional weight may change, compressible scripts may span
|
|
|
multiple lead bytes, or trailing primary bytes may collide
|
|
|
with separators and primary-compression terminators.)</li>
|
|
|
<li>When a script is reordered that does not start and end on
|
|
|
whole-primary-lead-byte boundaries, then the lead byte needs
|
|
|
to be “split”, and a reserved byte is used up. The data
|
|
|
supports this via reorder-reserved ranges of primary weights
|
|
|
that are not used for collation elements.</li>
|
|
|
<li>Primary weights from different original lead bytes can be
|
|
|
reordered to a shared lead byte, as long as they do not
|
|
|
overlap. Primary compression ends when the target lead byte
|
|
|
differs or when the original lead byte of the next primary is
|
|
|
not compressible.</li>
|
|
|
<li>Non-compressible groups and scripts begin or end on
|
|
|
whole-primary-lead-byte boundaries (or both), so that
|
|
|
reordering cannot surround a non-compressible script by two
|
|
|
compressible ones within the same target lead byte. This is
|
|
|
so that primary compression can be terminated reliably
|
|
|
(choosing the low or high terminator byte) simply by
|
|
|
comparing the previous and current primary weights. Otherwise
|
|
|
it would have to also check for another condition (e.g.,
|
|
|
equal scripts).</li>
|
|
|
</ul>
|
|
|
<h4>3.13.2 <a name="Reordering_Groups_allkeys" href=
|
|
|
"#Reordering_Groups_allkeys" id=
|
|
|
"Reordering_Groups_allkeys">Reordering Groups for
|
|
|
allkeys.txt</a></h4>
|
|
|
<p>For allkeys_CLDR.txt, the start of each reordering group can
|
|
|
be determined from FractionalUCA.txt, by finding the first real
|
|
|
mapping (after “xyz first primary”) of that group (e.g.,
|
|
|
<code>0060; [0D 07, 05, 05] # Zyyy Sk [0312.0020.0002] * GRAVE
ACCENT</code>), and looking for that mapping's character
sequence (<code>0060</code>) in allkeys_CLDR.txt. The comment
in FractionalUCA.txt (<code>[0312.0020.0002]</code>) also
shows the allkeys_CLDR.txt collation elements.</p>
|
|
|
<p>The DUCET ordering of some characters is slightly different
|
|
|
from the CLDR root collation order. The reordering groups for
|
|
|
the DUCET are not specified. The following describes how
|
|
|
reordering groups for the DUCET can be derived.</p>
|
|
|
<p>For allkeys_DUCET.txt, the start of each reordering group is
|
|
|
normally the primary weight corresponding to the same character
|
|
|
sequence as for allkeys_CLDR.txt. In a few cases this requires
|
|
|
adjustment, especially for the special reordering groups, due
|
|
|
to CLDR’s ordering the common characters more strictly by
|
|
|
category than the DUCET (as described in <i>Section 2, <a href=
|
|
|
"#Root_Collation">Root Collation</a></i>). The necessary
|
|
|
adjustment would set the start of each allkeys_DUCET.txt
|
|
|
reordering group to the primary weight of the first mapping for
|
|
|
the relevant General_Category for a special reordering group
|
|
|
(for characters that sort before ‘a’), or the primary weight of
|
|
|
the first mapping for the first script (e.g., sc=Grek) of an
|
|
|
“alphabetic” group (for characters that sort at or after
|
|
|
‘a’).</p>
|
|
|
<p>Note that the following only applies to primary weights
|
|
|
greater than the one for U+FFFE and less than "trailing"
|
|
|
weights.</p>
|
|
|
<p>The special reordering groups correspond to General_Category
|
|
|
values as follows:</p>
|
|
|
<ul>
|
|
|
<li>punct: P</li>
|
|
|
<li>symbol: Sk, Sm, So</li>
|
|
|
<li>space: Z, Cc</li>
|
|
|
<li>currency: Sc</li>
|
|
|
<li>digit: Nd</li>
|
|
|
</ul>
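<p>The category correspondence listed above can be written down directly. This sketch uses Python's standard <code>unicodedata</code> module and a hypothetical function name. It looks only at the General_Category of a single character and does not check the primary-weight range, so it is a starting point for the derivation rather than the derivation itself.</p>

```python
import unicodedata


def special_reordering_group(character):
    """Map a character to a special reordering group by General_Category.

    Returns None when the category is not one of those listed.
    """
    category = unicodedata.category(character)
    if category.startswith("P"):
        return "punct"
    if category in ("Sk", "Sm", "So"):
        return "symbol"
    if category.startswith("Z") or category == "Cc":
        return "space"
    if category == "Sc":
        return "currency"
    if category == "Nd":
        return "digit"
    return None
```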
|
|
|
<p>In the DUCET, some characters that sort below ‘a’ and have
|
|
|
other General_Category values not mentioned above (e.g., gc=Lm)
|
|
|
are also grouped with symbols. Variants of numbers (gc=No or
|
|
|
Nl) can be found among punctuation, symbols, and digits.</p>
|
|
|
<p>Each collation element of an expansion may be in a different
|
|
|
reordering group, for example for parenthesized characters.</p>
|
|
|
<h3>3.14 <a name="Case_Parameters" href="#Case_Parameters" id=
|
|
|
"Case_Parameters">Case Parameters</a></h3>
|
|
|
<p>The <strong>case level</strong> is an <em>optional</em>
|
|
|
intermediate level ("2.5") between Level 2 and Level 3 (or
|
|
|
after Level 1, if there is no Level 2 due to strength
|
|
|
settings). The case level is used to support two parametric
|
|
|
features: ignoring non-case variants (Level 3 differences)
|
|
|
except for case, and giving case differences a higher-level
|
|
|
priority than other tertiary differences. Distinctions between
|
|
|
small and large Kana characters are also included as case
|
|
|
differences, to support Japanese collation.</p>
|
|
|
<p>The <strong>case first</strong> parameter controls whether
|
|
|
to swap the order of upper and lowercase. It can be used with
|
|
|
or without the case level.</p>
|
|
|
<p>Importantly, the case parameters have no effect in many
|
|
|
instances. For example, they have no effect on the comparison
|
|
|
of two non-ignorable characters with different primary weights,
|
|
|
or with different secondary weights if the strength =
|
|
|
<strong>secondary (or higher).</strong></p>
|
|
|
<p>When either the <strong>case level</strong> or <strong>case
|
|
|
first</strong> parameters are set, the following describes the
|
|
|
derivation of the modified collation elements. It assumes the
|
|
|
original levels for the code point are [p.s.t] (primary,
|
|
|
secondary, tertiary). This derivation may change in future
|
|
|
versions of LDML, to track the case characteristics more
|
|
|
closely.</p>
|
|
|
<h4>3.14.1 <a name="Case_Untailored" href="#Case_Untailored"
|
|
|
id="Case_Untailored">Untailored Characters</a></h4>
|
|
|
<p>For untailored characters and strings, that is, for mappings
|
|
|
in the root collation, the case value for each collation
|
|
|
element is computed from the tertiary weight listed in
|
|
|
allkeys_CLDR.txt. This is used to modify the collation
|
|
|
element.</p>
|
|
|
<p>Look up a case value for the tertiary weight x of each
|
|
|
collation element:</p>
|
|
|
<ol>
|
|
|
<li>UPPER if x ∈ {08-0C, 0E, 11, 12, 1D}</li>
|
|
|
<li>UNCASED otherwise</li>
|
|
|
<li>FractionalUCA.txt encodes the case information in bits 6
|
|
|
and 7 of the first byte in each tertiary weight. The case
|
|
|
bits are set to 00 for UNCASED and LOWERCASE, and 10 for
|
|
|
UPPER. There is no MIXED case value (01) in the root
|
|
|
collation.</li>
|
|
|
</ol>
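<p>The lookup can be expressed as a small table. The names below are illustrative; the tertiary values are the hexadecimal values listed above.</p>

```python
# Tertiary weights from allkeys_CLDR.txt that indicate uppercase:
# 08-0C, 0E, 11, 12, 1D (hexadecimal).
UPPER_TERTIARIES = frozenset(
    list(range(0x08, 0x0C + 1)) + [0x0E, 0x11, 0x12, 0x1D]
)


def case_value(tertiary):
    """Return the case value for one collation element's tertiary weight."""
    return "UPPER" if tertiary in UPPER_TERTIARIES else "UNCASED"
```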
|
|
|
<h4>3.14.2 <a name="Case_Weights" href="#Case_Weights" id=
|
|
|
"Case_Weights">Compute Modified Collation Elements</a></h4>
|
|
|
<p>From a computed case value, set a weight <strong>c</strong>
|
|
|
according to the following.</p>
|
|
|
<ol>
|
|
|
<li>If <strong>CaseFirst=UpperFirst</strong>, set
|
|
|
<strong>c</strong> = UPPER ? <strong>1</strong> : MIXED ? 2 :
|
|
|
<strong>3</strong></li>
|
|
|
<li>Otherwise set <strong>c</strong> = UPPER ?
|
|
|
<strong>3</strong> : MIXED ? 2 : <strong>1</strong></li>
|
|
|
</ol>
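<p>In code form, the weight <strong>c</strong> might be computed as in the following sketch. The function name and the string case values are assumptions for illustration; any case value other than UPPER or MIXED is treated as uncased/lowercase.</p>

```python
def case_weight(case, upper_first=False):
    """Return the case weight c for a computed case value."""
    if upper_first:
        # CaseFirst=UpperFirst: UPPER sorts first.
        return {"UPPER": 1, "MIXED": 2}.get(case, 3)
    # Otherwise: uncased/lowercase sorts first.
    return {"UPPER": 3, "MIXED": 2}.get(case, 1)
```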
|
|
|
<p>Compute a new collation element according to the following
|
|
|
table. The notation <em>xt</em> means that the values are
|
|
|
numerically combined into a single level, such that xt < yu
|
|
|
whenever x < y. The fourth level (if it exists) is
|
|
|
unaffected. Note that a secondary CE must have a secondary
|
|
|
weight S which is greater than the secondary weight s of any
|
|
|
primary CE; and a tertiary CE must have a tertiary weight T
|
|
|
which is greater than the tertiary weight t of any primary or
|
|
|
secondary CE ([<a href=
|
|
|
"https://www.unicode.org/reports/tr41/#UTS10">UCA</a>] <a href=
|
|
|
"https://www.unicode.org/reports/tr10/#WF2">WF2</a>).</p>
|
|
|
<div align="center">
|
|
|
<table>
|
|
|
<tbody>
|
|
|
<tr>
|
|
|
<th>Case Level</th>
|
|
|
<th>Strength</th>
|
|
|
<th>Original CE</th>
|
|
|
<th>Modified CE</th>
|
|
|
<th>Comment</th>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td rowspan="5"><strong>on</strong></td>
|
|
|
<td rowspan="2"><strong>primary</strong></td>
|
|
|
<td><code>0.S.t</code></td>
|
|
|
<td><code>0.0</code></td>
|
|
|
<td rowspan="2">ignore case level weights of
|
|
|
primary-ignorable CEs</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><code>p.s.t</code></td>
|
|
|
<td><code>p.c</code></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td rowspan="3"><strong>secondary<br></strong> or
|
|
|
higher</td>
|
|
|
<td><code>0.0.T</code></td>
|
|
|
<td><code>0.0.0.T</code></td>
|
|
|
<td rowspan="3">ignore case level weights of
|
|
|
secondary-ignorable CEs</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><code>0.S.t</code></td>
|
|
|
<td><code>0.S.c.t</code></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><code>p.s.t</code></td>
|
|
|
<td><code>p.s.c.t</code></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td rowspan="4"><strong>off</strong></td>
|
|
|
<td rowspan="4">any</td>
|
|
|
<td><code>0.0.0</code></td>
|
|
|
<td><code>0.0.00</code></td>
|
|
|
<td rowspan="4">ignore case level weights of
|
|
|
tertiary-ignorable CEs</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><code>0.0.T</code></td>
|
|
|
<td><code>0.0.3T</code></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><code>0.S.t</code></td>
|
|
|
<td><code>0.S.ct</code></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><code>p.s.t</code></td>
|
|
|
<td><code>p.s.ct</code></td>
|
|
|
</tr>
|
|
|
</tbody>
|
|
|
</table>
|
|
|
</div>
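<p>The table can be read as a function from an original CE to a modified CE. The sketch below is one possible encoding, not a normative one: a CE is a hypothetical <code>(p, s, t)</code> tuple, the result is a tuple of levels, and a combined level such as <em>ct</em> is modeled as the pair <code>(c, t)</code>, whose lexicographic comparison gives xt < yu whenever x < y.</p>

```python
def modified_ce(ce, c, case_level, strength="tertiary"):
    """Apply the table to one CE (p, s, t) with case weight c.

    `strength` only distinguishes "primary" from "secondary or higher".
    """
    p, s, t = ce
    if case_level and strength == "primary":
        # 0.S.t -> 0.0 ; p.s.t -> p.c
        return (p, c) if p else (0, 0)
    if case_level:
        if p == 0 and s == 0:
            return (0, 0, 0, t)      # 0.0.T -> 0.0.0.T
        return (p, s, c, t)          # 0.S.t -> 0.S.c.t ; p.s.t -> p.s.c.t
    # Case level off: case and tertiary weights share one level.
    if p == 0 and s == 0:
        if t == 0:
            return (0, 0, (0, 0))    # 0.0.0 -> 0.0.00
        return (0, 0, (3, t))        # 0.0.T -> 0.0.3T (constant 3, not c)
    return (p, s, (c, t))            # 0.S.t -> 0.S.ct ; p.s.t -> p.s.ct
```

<p>The weights passed in are arbitrary sample numbers; only the shape of the result is meant to mirror the table.</p>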
|
|
|
<p>For primary+case, which is used for “ignore accents but not
|
|
|
case” collation, primary ignorables are ignored so that a = ä.
|
|
|
For secondary+case, which would by analogy mean “ignore
|
|
|
variants but not case”, secondary ignorables are ignored for
|
|
|
equivalent behavior.</p>
|
|
|
<p>When using <strong>caseFirst</strong> but not
|
|
|
<strong>caseLevel</strong>, the combined case+tertiary weight
|
|
|
of a tertiary CE must be greater than the combined
|
|
|
case+tertiary weight of any primary or secondary CE so that
|
|
|
[<a href="https://www.unicode.org/reports/tr41/#UTS10">UCA</a>]
|
|
|
<a href=
|
|
|
"https://www.unicode.org/reports/tr10/#WF2">well-formedness
|
|
|
condition 2</a> is fulfilled. Since the tertiary CE’s tertiary
|
|
|
weight T is already greater than any t of primary or secondary
|
|
|
CEs, it is sufficient to set its case weight to UPPER=3. It
|
|
|
must not be affected by <strong>caseFirst=upper</strong>. (The
|
|
|
table uses the constant 3 in this case rather than the computed
|
|
|
c.)</p>
|
|
|
<p>The case weight of a tertiary-ignorable CE must be 0 so that
|
|
|
[<a href="https://www.unicode.org/reports/tr41/#UTS10">UCA</a>]
|
|
|
<a href=
|
|
|
"https://www.unicode.org/reports/tr10/#WF1">well-formedness
|
|
|
condition 1</a> is fulfilled.</p>
|
|
|
<h4>3.14.3 <a name="Case_Tailored" href="#Case_Tailored" id=
|
|
|
"Case_Tailored">Tailored Strings</a></h4>
|
|
|
<p>Characters and strings that are tailored have case values
|
|
|
computed from their root collation case bits.</p>
|
|
|
<ol>
|
|
|
<li>Look up the tailored string’s root CEs. (Ignore any
|
|
|
prefix or extension strings.) N=number of primary root
|
|
|
CEs.</li>
|
|
|
<li>Determine the number and type (primary vs. weaker) of CEs
|
|
|
a tailored string maps to. M=number of primary tailored
|
|
|
CEs.</li>
|
|
|
<li>If N<=M (no more root than tailoring primary CEs):
|
|
|
Copy the root case bits for primary CEs 0..N-1.
|
|
|
<ul>
|
|
|
<li>If N<M (fewer root primary CEs): Clear the case
|
|
|
bits of the remaining tailored primary CEs.
|
|
|
(uncased/lowercase/small Kana)</li>
|
|
|
</ul>
|
|
|
</li>
|
|
|
<li>If N>M (more root primary CEs): Copy the root case
|
|
|
bits for primary CEs 0..M-2. Set the case bits for tailored
|
|
|
primary CE M-1 according to the remaining root primary CEs
|
|
|
M-1..N-1:
|
|
|
<ul>
|
|
|
<li>Set to uncased/lower if all remaining root primary
|
|
|
CEs have uncased/lower.</li>
|
|
|
<li>Set to uppercase if all remaining root primary CEs
|
|
|
have uppercase.</li>
|
|
|
<li>Otherwise, set to mixed.</li>
|
|
|
</ul>
|
|
|
</li>
|
|
|
<li>Clear the case bits for secondary CEs 0.s.t.</li>
|
|
|
<li>Tertiary CEs 0.0.t must get uppercase bits.</li>
|
|
|
<li>Tertiary-ignorable CEs 0.0.0 must get
|
|
|
ignorable-case=lowercase bits.</li>
|
|
|
</ol>
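<p>Steps 1 through 4 can be sketched for the primary CEs as follows. The function name and the string case values are hypothetical; "UNCASED" stands for uncased/lowercase/small Kana, and the root case values are assumed to be only "UNCASED" or "UPPER", since the root collation has no MIXED value.</p>

```python
def tailored_primary_cases(root_cases, tailored_count):
    """Return case values for the M tailored primary CEs.

    root_cases: case values of the N root primary CEs, in order.
    tailored_count: M, the number of tailored primary CEs.
    """
    n, m = len(root_cases), tailored_count
    if m == 0:
        return []
    if n <= m:
        # Copy the root case values; clear the remaining tailored CEs.
        return list(root_cases) + ["UNCASED"] * (m - n)
    # N > M: copy CEs 0..M-2, then summarize root CEs M-1..N-1 in CE M-1.
    head = list(root_cases[:m - 1])
    remaining = root_cases[m - 1:]
    if all(case == "UNCASED" for case in remaining):
        last = "UNCASED"
    elif all(case == "UPPER" for case in remaining):
        last = "UPPER"
    else:
        last = "MIXED"
    return head + [last]
```

<p>For instance, a string whose root CEs are UPPER followed by UNCASED, tailored to a single primary CE, gets the value MIXED.</p>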
|
|
|
<p class="note">Note: Almost all Cased characters have primary
|
|
|
(non-ignorable) root collation CEs, except for U+0345 Combining
|
|
|
Ypogegrammeni which is Lowercase. All Uppercase characters have
|
|
|
primary root collation CEs.</p>
|
|
|
<h3>3.15 <a name="Visibility" href="#Visibility" id=
|
|
|
"Visibility">Visibility</a></h3>
|
|
|
<p>Collations have external visibility by default, meaning that
|
|
|
they can be displayed in a list of collation options for users
|
|
|
to choose from. A collation whose type name starts with
|
|
|
"private-" is internal and should not be shown in such a list.
|
|
|
Collations are typically internal when they are partial
|
|
|
sequences included in other collations. See <i>Section 3.1,
|
|
|
<a href="#Collation_Types">Collation Types</a></i>.</p>
|
|
|
<h3>3.16 <a name="Collation_Indexes" href="#Collation_Indexes"
|
|
|
id="Collation_Indexes">Collation Indexes</a></h3>
|
|
|
<h4>3.16.1 <a name="Index_Characters" href="#Index_Characters"
|
|
|
id="Index_Characters">Index Characters</a></h4>
|
|
|
<p>The main data includes <exemplarCharacters> for
|
|
|
collation indexes. See <i>Part 2 General, Section 3, <a href=
|
|
|
"tr35-general.html#Character_Elements">Character
|
|
|
Elements</a></i>, for general information about exemplar
|
|
|
characters.</p>
|
|
|
<p>The index characters are a set of characters for use as a UI
|
|
|
"index", that is, a list of clickable characters (or character
|
|
|
sequences) that allow the user to see a segment of a larger
|
|
|
"target" list. Each character corresponds to a bucket in the
|
|
|
target list. There can be different kinds of index lists: one
is relatively static, and the other produces roughly
equally-sized buckets.
|
|
|
While CLDR is mostly focused on the first, there is provision
|
|
|
for supporting the second as well.</p>
|
|
|
<p>The index characters need to be used in conjunction with a
|
|
|
collation for the locale, which will determine the order of the
|
|
|
characters. It will also determine which index characters show
|
|
|
up.</p>
|
|
|
<p>The static list would be presented as something like the
|
|
|
following (either vertically or horizontally):</p>
|
|
|
<p align="center">… A B C D E F G H CH I J K L M N O P Q R
|
|
|
S T U V W X Y Z …</p>
|
|
|
<p>In the "A" bucket, you would find all items that are primary
|
|
|
greater than or equal to "A" in collation order, and primary
|
|
|
less than "B". The use of the list requires that the target
|
|
|
list be sorted according to the locale that is used to create
|
|
|
that list. Although we say "character" above, the index
|
|
|
character could be a sequence, like "CH" above. The index
|
|
|
exemplar characters must always be used with a collation
|
|
|
appropriate for the locale. Any characters that do not have
|
|
|
primary differences from others in the set should be
|
|
|
removed.</p>
|
|
|
<p>Details:</p>
|
|
|
<ol>
|
|
|
<li>The primary weight (according to the collation) is used
|
|
|
to determine which bucket a string is in. There are special
|
|
|
buckets for before the first character, between buckets of
|
|
|
different scripts, and after the last bucket (and of a
|
|
|
different script).</li>
|
|
|
<li>Characters in the <em>index characters</em> do not need
|
|
|
to have distinct primary weights. That is, the <em>index
|
|
|
characters</em> are adapted to the underlying collation:
|
|
|
normally Ё is in the Е bucket for Russian, but if someone
|
|
|
used a variant of Russian collation that distinguished them
|
|
|
on a primary level, then Ё would show up as its own
|
|
|
bucket.</li>
|
|
|
<li>If an <em>index character</em> string ends with a single
|
|
|
"*" (U+002A), for example "Sch*" and "St*" in German, then
|
|
|
there will be a separate bucket for the string minus the "*",
|
|
|
for example "Sch" and "St", even if that string does not sort
|
|
|
distinctly.</li>
|
|
|
<li>An <em>index character</em> can have multiple primary
|
|
|
weights, for example "Æ" and "Sch". Names that have the same
|
|
|
initial primary weights sort into this <em>index
|
|
|
character</em>’s bucket. This can be achieved by using an
|
|
|
upper-boundary string that is the concatenation of the
|
|
|
<em>index character</em> and U+FFFF, for example "Æ\uFFFF"
|
|
|
and "Sch\uFFFF". Names that sort greater than this upper
|
|
|
boundary but less than the next index character are
|
|
|
redirected to the last preceding single-primary index
|
|
|
character (A and S for the examples here).</li>
|
|
|
</ol>
|
|
|
<p>For example, for index characters <code>[A Æ B R S {Sch*}
|
|
|
{St*} T]</code> the following sample names are sorted into an
|
|
|
index as shown.</p>
|
|
|
<ul>
|
|
|
<li>A — Adelbert, Afrika</li>
|
|
|
<li>Æ — Æsculap, Aesthet</li>
|
|
|
<li>B — Berlin</li>
|
|
|
<li>R — Rilke</li>
|
|
|
<li>S — Sacher, Seiler, Sultan</li>
|
|
|
<li>Sch — Schiller</li>
|
|
|
<li>St — Steiff</li>
|
|
|
<li>T — Thomas</li>
|
|
|
</ul>
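<p>The bucketing in this example, including the redirection described in item 4 above, can be approximated with a short sketch. It does not use real collation data: <code>primary_key</code> is a toy stand-in that strips accents, expands Æ to AE, and uppercases, so that each letter plays the role of one primary weight. All function names are hypothetical.</p>

```python
import bisect
import unicodedata


def primary_key(text):
    """Toy primary-level key; NOT real UCA or CLDR collation weights."""
    expanded = text.replace("Æ", "AE").replace("æ", "ae")
    decomposed = unicodedata.normalize("NFD", expanded)
    letters = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return letters.upper()


def build_labels(index_characters):
    """Strip a single trailing "*" and sort the labels by primary key."""
    labels = [c[:-1] if c.endswith("*") else c for c in index_characters]
    return sorted(labels, key=primary_key)


def bucket_for(name, labels):
    """Return the label of the bucket that `name` falls into."""
    keys = [primary_key(label) for label in labels]
    name_key = primary_key(name)
    # Last label whose key is less than or equal to the name's key.
    i = bisect.bisect_right(keys, name_key) - 1
    # A multi-primary label (key longer than one letter here) only takes
    # names that share its initial primaries; other names are redirected
    # to the last preceding single-primary label.
    if i >= 0 and len(keys[i]) > 1 and not name_key.startswith(keys[i]):
        while i >= 0 and len(keys[i]) > 1:
            i -= 1
    return labels[i] if i >= 0 else "…"


labels = build_labels(["A", "Æ", "B", "R", "S", "Sch*", "St*", "T"])
```

<p>With these labels, <code>bucket_for("Seiler", labels)</code> first lands on "Sch", fails the prefix test, and is redirected to "S", which matches the list above. A production implementation would compare collation keys from the locale's collator instead of the toy keys used here.</p>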
<p>The … items are special: each is a bucket for
everything else, either less or greater. They are inserted at
the start and end of the index list, <em>and</em> on script
boundaries. Each script has its own range, except where scripts
sort primary-equal (for example, Hiragana and Katakana). All
characters that sort in one of the low reordering groups
(whitespace, punctuation, symbols, currency symbols, digits)
are treated as a single script for this purpose.</p>
<p>If you tailor a Greek character into the Cyrillic script,
that Greek character will be bucketed (and sorted) among the
Cyrillic ones.</p>
<p>Even in an implementation that reorders groups of scripts
rather than single scripts, for example Hebrew together with
Phoenician and Samaritan, the index boundaries are really
script boundaries, <em>not</em> multi-script-group boundaries.
So if you had a collation that reordered Hebrew after Ethiopic,
you would still get index boundaries between the following (and
in that order):</p>
<ol>
<li>Ethiopic</li>
<li>Hebrew</li>
<li>Phoenician<em> // included in the Hebrew reordering
group</em></li>
<li>Samaritan<em> // included in the Hebrew reordering
group</em></li>
<li>Devanagari</li>
</ol>
<p>(Beginning with CLDR 27, single scripts can be
reordered.)</p>
<p>In the UI, an index character could also be omitted or
grayed out if its bucket is empty. For example, if there is
nothing in the bucket for Q, then Q could be omitted. That
choice is up to the implementation. Additional buckets could be
added if other characters are present. For example, an index
for Greek might look like one of the following:</p>
<table border="1" cellspacing="0">
<tbody>
<tr align="center">
<td>
<div align="center">
<strong>Sample Greek Index<br></strong>
</div>
</td>
<td><strong>Contents<br></strong></td>
</tr>
<tr align="center">
<td>
<div align="center">
Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω
</div>
</td>
<td>With only content beginning with Greek
letters<br></td>
</tr>
<tr align="center">
<td>
<div align="center">
… Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ
Ω …
</div>
</td>
<td>With some content before or after</td>
</tr>
<tr align="center">
<td>
<div align="center">
… 9 Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ
Ψ Ω …
</div>
</td>
<td>With numbers, and nothing between 9 and Alpha</td>
</tr>
<tr align="center">
<td>
<div align="center">
… 9 <em>A-Z</em> Α Β Γ Δ Ε Ζ Η Θ Ι Κ
Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω …
</div>
</td>
<td>With numbers, some Latin</td>
</tr>
</tbody>
</table>
<p>Here is a sample of the XML structure:</p>
<pre>
&lt;exemplarCharacters type="index"&gt;[A B C D E F G H I J K L M N O P Q R S T U V W X Y Z]&lt;/exemplarCharacters&gt;</pre>
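<p>As a rough sketch, the index strings can be read out of such
an element as follows. The function name
<code>index_strings</code> is hypothetical, and the parser only
handles the simple form used in this section (single characters
and brace-delimited strings separated by spaces). It is not a
full UnicodeSet parser: ranges, escapes, and nested sets are out
of scope.</p>

```python
import re
import xml.etree.ElementTree as ET


def index_strings(xml_text):
    """Extract index strings from an <exemplarCharacters type="index"> element.

    Assumption: the set body contains only single characters and
    {multi-character} strings, as in the examples in this section.
    """
    elem = ET.fromstring(xml_text)
    if elem.tag != "exemplarCharacters" or elem.get("type") != "index":
        raise ValueError("expected <exemplarCharacters type='index'>")
    body = (elem.text or "").strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("expected a bracketed set")
    # Either a {braced string} or one non-space character per token.
    tokens = re.findall(r"\{([^}]*)\}|(\S)", body[1:-1])
    return [braced or single for braced, single in tokens]


sample = ('<exemplarCharacters type="index">'
          '[A Æ B R S {Sch*} {St*} T]</exemplarCharacters>')
parsed = index_strings(sample)
```

<p>A trailing "*" is kept in the returned string (for example
"Sch*"), so that the caller can decide how to treat it.</p>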
<p>The display of the index characters can be modified with the
index label elements, discussed in <i>Part 2 General,
Section 3.3, <a href="tr35-general.html#IndexLabels">Index
Labels</a></i>.</p>
<h4>3.16.2 <a name="CJK_Index_Markers" href=
"#CJK_Index_Markers" id="CJK_Index_Markers">CJK Index
Markers</a></h4>
<p>Special index markers have been added to the CJK collations
for stroke, pinyin, zhuyin, and unihan. These markers allow for
effective and robust use of indexes for these collations.</p>
<p>The per-language index exemplar characters are not useful
for collation indexes for CJK because for each such language
there are multiple sort orders in use (for example, Chinese
pinyin vs. stroke vs. unihan vs. zhuyin), and these sort orders
use very different index characters. In addition, sometimes the
boundary strings are different from the bucket label strings.
For collations that contain index markers, the boundary strings
and bucket labels should be derived from those index markers,
ignoring the index exemplar characters.</p>
<p>For example, near the start of the pinyin tailoring there is
the following:</p>
<p>&lt;p&gt; A&lt;/p&gt;&lt;!-- INDEX A --&gt;<br>
&lt;pc&gt;阿呵𥥩锕𠼞𨉚&lt;/pc&gt;&lt;!-- ā --&gt;</p>
<p>…</p>
<p>&lt;pc&gt;翶&lt;/pc&gt;&lt;!-- ao --&gt;<br>
&lt;p&gt; B&lt;/p&gt;&lt;!-- INDEX B --&gt;</p>
<p>These indicate the boundaries of "buckets" that can be used
for indexing. They are always two characters starting with the
noncharacter U+FDD0, and thus will not occur in normal text.
For pinyin the second character is A-Z; for unihan it is one of
the radicals; and for stroke it is a character after U+2800
indicating the number of strokes, such as ⠁. For zhuyin the
second character is one of the standard Bopomofo characters in
the range U+3105 through U+3129.</p>
<p>The corresponding bucket label strings are the boundary
strings with the leading U+FDD0 removed. For example, the
Pinyin boundary string "\uFDD0A" yields the label string
"A".</p>
<p>However, for stroke order, the label string is the stroke
count (second character minus U+2800) as a decimal-digit number
followed by 劃 (U+5283). For example, the stroke order boundary
string "\uFDD0\u2805" yields the label string "5劃".</p>
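<p>The label derivation described above could be sketched as
follows. The function name <code>bucket_label</code> is
hypothetical, and the sketch assumes that a second character in
the range U+2801 through U+28FF identifies a stroke-count
marker. That upper limit is an assumption made for illustration;
the text above only says "a character after U+2800".</p>

```python
INDEX_MARKER = "\uFDD0"   # noncharacter that starts every boundary string
STROKE_BASE = 0x2800      # stroke count = second character minus U+2800
STROKE_SUFFIX = "\u5283"  # 劃


def bucket_label(boundary):
    """Derive a bucket label from a two-character CJK index boundary string."""
    if len(boundary) != 2 or boundary[0] != INDEX_MARKER:
        raise ValueError("expected U+FDD0 followed by exactly one character")
    second = boundary[1]
    strokes = ord(second) - STROKE_BASE
    # Assumption: U+2801..U+28FF marks a stroke-order bucket.
    if 0 < strokes <= 0xFF:
        return f"{strokes}{STROKE_SUFFIX}"
    # Pinyin, unihan, zhuyin: the label is the boundary minus U+FDD0.
    return second


pinyin_label = bucket_label("\uFDD0A")
stroke_label = bucket_label("\uFDD0\u2805")
```

<p>Here <code>pinyin_label</code> is "A" and
<code>stroke_label</code> is "5劃", matching the two examples
given above.</p>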
<hr>
<p class="copyright">Copyright © 2001–2020 Unicode, Inc. All
Rights Reserved. The Unicode Consortium makes no expressed or
implied warranty of any kind, and assumes no liability for
errors or omissions. No liability is assumed for incidental and
consequential damages in connection with or arising out of the
use of the information or programs contained or accompanying
this technical report. The Unicode <a href=
"https://unicode.org/copyright.html">Terms of Use</a> apply.</p>
<p class="copyright">Unicode and the Unicode logo are
trademarks of Unicode, Inc., and are registered in some
jurisdictions.</p>
</div>
</body>
</html>