Character, Code Point and Grapheme



The character à can be stored as a single code point or as a sequence of two code points. In the single code point (precomposed) form it is U+00E0, written \u00E0 in JavaScript. In the two code point (decomposed) form it is \u0061 followed by \u0300. \u0061 is the letter a, and \u0300 is the combining grave accent, the mark drawn above the a. A combining mark such as \u0300 is normally used only directly after a base character. The sequence \u0061 \u0300 is therefore another way of representing the character à.
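The difference between the two forms can be inspected directly. The following is a minimal sketch; the helper name codePoints is made up for this illustration, and the strings are written with escapes so that the two forms cannot be confused when copied.

```javascript
// Two spellings of the same visible character, written with escapes.
const precomposed = "\u00E0"; // one code point: U+00E0
const decomposed = "a\u0300"; // two code points: U+0061 + U+0300

// Hypothetical helper: lists the code points of a string as U+XXXX labels.
// Array.from walks the string one code point at a time.
function codePoints(str) {
  return Array.from(str, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}

console.log(codePoints(precomposed)); // [ 'U+00E0' ]
console.log(codePoints(decomposed));  // [ 'U+0061', 'U+0300' ]
console.log(precomposed.length, decomposed.length); // 1 2
console.log(precomposed === decomposed); // false
```

Both strings are displayed as à, yet a plain comparison treats them as different strings.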

Many languages combine code points in this way to represent accented or otherwise modified characters. A sequence of code points that a reader perceives as one character is called a grapheme (Unicode's own term is grapheme cluster). From a pattern-matching point of view, the regular expression RegExp("^.$") is supposed to match a single character; in reality it matches a single code point (strictly, without the u flag, a single UTF-16 code unit, which is the same thing for the characters used here). If the character à is encoded as two code points, RegExp("^.$") fails to match it. In that case the regular expression that matches it is RegExp("^..$").
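The matching behaviour described above can be checked with a short sketch. The third pattern is not part of the original discussion: it is one possible way, assuming an engine that supports Unicode property escapes (ES2018 or later), to match a base character together with any combining marks that follow it. The variable names are chosen only for this illustration.

```javascript
const precomposed = "\u00E0"; // one code point
const decomposed = "a\u0300"; // base letter + combining grave accent

const oneCodePoint = /^.$/;
const twoCodePoints = /^..$/;
// Assumed alternative: one non-mark character followed by zero or more marks.
const baseWithMarks = /^\P{M}\p{M}*$/u;

console.log(oneCodePoint.test(precomposed));  // true
console.log(oneCodePoint.test(decomposed));   // false
console.log(twoCodePoints.test(decomposed));  // true
console.log(baseWithMarks.test(precomposed)); // true
console.log(baseWithMarks.test(decomposed));  // true
console.log(baseWithMarks.test("ab"));        // false: two base characters
```

This pattern covers simple base-plus-accent sequences like the one here; it is not a full implementation of Unicode grapheme cluster rules.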

Here is a complete example page that demonstrates this behaviour:

<html>
<body>
<script type="text/javascript">
<!--
/*
********************************************************
Javascript Regular Expression Example 28
Unicode grapheme
********************************************************
*/
// Anchored pattern: matches a string of exactly two code points.
var pattern1 = new RegExp("^..$");
var string1 = new Array(2);
string1[0] = "à"; // one code point: \u00E0
string1[1] = "à"; // two code points: \u0061 followed by \u0300
var i;
for (i = 0; i <= 1; i++)
{
    if (pattern1.test(string1[i]))
    {
        document.write(string1[i], " -->matches regular expression", "<br>");
    }
    else
    {
        document.write(string1[i], " --> does not match regular expression", "<br>");
    }
}
//-->
</script>
</body>
</html>


Notice that although string1[0] and string1[1] look as if they hold the same character, they do not. The first is a single code point; the second is a sequence of two code points. If you run this script, you will see the following result:

à --> does not match regular expression
à -->matches regular expression


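One way to make both forms behave alike, not covered in the example above, is to normalize the string before testing it. The sketch below assumes String.prototype.normalize is available (ES2015 or later); the function name matchesSingleChar is made up for this illustration.

```javascript
const singleChar = /^.$/;

// Hypothetical helper: compose the string to NFC first, then test it.
function matchesSingleChar(str) {
  return singleChar.test(str.normalize("NFC"));
}

console.log(matchesSingleChar("\u00E0"));  // true
console.log(matchesSingleChar("a\u0300")); // true: composed to \u00E0 first
console.log(matchesSingleChar("ab"));      // false

// Caveat: not every base + mark pair has a precomposed form.
// q with a grave accent stays as two code points even after NFC.
console.log(matchesSingleChar("q\u0300")); // false
```

Normalization therefore helps with common accented letters but is not a general substitute for grapheme-aware matching.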
You may be wondering how I wrote à in two different ways. I used an online Unicode converter, which turns a code point, or a sequence of code points, into the corresponding character.
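The same two strings can also be produced in JavaScript itself, without a converter. This is a small sketch using String.fromCodePoint and String.prototype.normalize (both ES2015 or later); the variable names are only illustrative.

```javascript
// Build each form from its code point values.
const precomposed = String.fromCodePoint(0x00E0);
const decomposed = String.fromCodePoint(0x0061, 0x0300);

// Convert between the two forms with Unicode normalization.
const toDecomposed = precomposed.normalize("NFD"); // split into base + mark
const toPrecomposed = decomposed.normalize("NFC"); // compose into one code point

console.log(toDecomposed === decomposed);   // true
console.log(toPrecomposed === precomposed); // true
console.log(toDecomposed.length, toPrecomposed.length); // 2 1
```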

If Unicode interests you, you can visit http://unicode.org for more information.