This report identifies PUA (Private Use Area) characters and superscript characters in manuscript description files. It is sorted by filename; within each file, it lists only the items that include PUA or superscript characters, and only the textual snippets that contain those characters.
PUA characters are in red and their Unicode codepoint values are reported in parentheses after the sample text, along with a count (in square brackets) of how many times each PUA character occurs in that sample.
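As a rough illustration of the reporting format described above, the following Python sketch lists the PUA codepoints in a snippet together with their counts. The function name `pua_report` and the exact output layout are assumptions made for this example, not the code that produced this report; PUA characters are identified here by the Unicode general category `Co` (private use).

```python
import unicodedata
from collections import Counter


def pua_report(snippet: str) -> str:
    """Return PUA codepoints in the snippet with per-character counts.

    Hypothetical helper: a character counts as PUA when its Unicode
    general category is "Co" (private use).
    """
    counts = Counter(ch for ch in snippet if unicodedata.category(ch) == "Co")
    # Sort by codepoint so the output is stable from run to run.
    return ", ".join(f"U+{ord(ch):04X} [{n}]" for ch, n in sorted(counts.items()))


# Example: two occurrences of U+E001 and one of U+F8FF.
print(pua_report("\u0430\uE001\u0431\uE001\uF8FF"))
```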
As of Unicode 9.0, superscript Cyrillic letters are in Cyrillic Extended-A (U+2DE0–U+2DFF) and in
Cyrillic Extended-B (U+A674–U+A67B). The superscript
characters are in blue, and their Unicode codepoint values are reported in parentheses after the sample text,
along with a count (in square brackets) of how many times each superscript character occurs in that sample.
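The same kind of check can be sketched for superscript letters by testing each character against the two codepoint ranges named above. The names `SUPERSCRIPT_RANGES`, `is_superscript`, and `superscript_report` are hypothetical, chosen only for this example.

```python
from collections import Counter

# Ranges taken from the paragraph above (inclusive on both ends).
SUPERSCRIPT_RANGES = ((0x2DE0, 0x2DFF), (0xA674, 0xA67B))


def is_superscript(ch: str) -> bool:
    """True if ch falls in one of the superscript Cyrillic ranges."""
    return any(lo <= ord(ch) <= hi for lo, hi in SUPERSCRIPT_RANGES)


def superscript_report(snippet: str) -> str:
    """Return superscript codepoints in the snippet with per-character counts."""
    counts = Counter(ch for ch in snippet if is_superscript(ch))
    return ", ".join(f"U+{ord(ch):04X} [{n}]" for ch, n in sorted(counts.items()))


# Example: one U+2DE2 and two U+A674.
print(superscript_report("\u0435\u2DE2\u0430\uA674\uA674"))
```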
Because Unicode does not provide combining superscript versions of all Cyrillic letters, even were we to use
the ones that are available, we would have to fall back on an alternative for others, which would introduce
inconsistencies into the representation of superscription in the corpus. For that reason, our policy is to
represent all instances of superscription by wrapping markup around regular Cyrillic letters, so that,
for example, е<seg rend="sup">г</seg> would be rendered as е followed by a superscript г.
If a titlo, a pokrytie, or an accentual or breathing diacritic appears over the superscript letter, it should be
included with the letter inside the same <seg> element.
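One way the policy might be applied mechanically is sketched below: each Unicode superscript letter is replaced by its regular lowercase counterpart wrapped in <seg rend="sup">, and any combining diacritics that immediately follow it are kept inside the same element. This is only a sketch under stated assumptions, not the project's actual tooling. The function names are hypothetical, the base letter is guessed from the Unicode character name (so a superscript such as U+2DF5, which has no single-letter counterpart, is left unchanged), and consecutive superscript letters are each given their own <seg>.

```python
import unicodedata
from typing import Optional

SUPERSCRIPT_RANGES = ((0x2DE0, 0x2DFF), (0xA674, 0xA67B))
PREFIX = "COMBINING CYRILLIC LETTER "


def is_superscript(ch: str) -> bool:
    return any(lo <= ord(ch) <= hi for lo, hi in SUPERSCRIPT_RANGES)


def base_letter(ch: str) -> Optional[str]:
    """Map a superscript letter to a regular small letter via its Unicode name.

    Assumption: "COMBINING CYRILLIC LETTER X" corresponds to
    "CYRILLIC SMALL LETTER X"; returns None when no such letter exists.
    """
    name = unicodedata.name(ch, "")
    if not name.startswith(PREFIX):
        return None
    try:
        return unicodedata.lookup("CYRILLIC SMALL LETTER " + name[len(PREFIX):])
    except KeyError:
        return None


def to_seg_markup(text: str) -> str:
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        base = base_letter(ch) if is_superscript(ch) else None
        if base is None:
            out.append(ch)  # ordinary character, or superscript with no base
            i += 1
            continue
        # Collect diacritics (nonzero combining class) sitting on this
        # superscript letter, stopping at the next superscript letter.
        j = i + 1
        while (j < len(text) and unicodedata.combining(text[j])
               and not is_superscript(text[j])):
            j += 1
        out.append(f'<seg rend="sup">{base}{text[i + 1:j]}</seg>')
        i = j
    return "".join(out)


# е + superscript г with a pokrytie (U+0487) over it
print(to_seg_markup("\u0435\u2DE2\u0487"))
```

The result of any such conversion would still need to be checked by hand, since the name-based mapping is a heuristic.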