Unicode text styles

5/27/2023

Unicode is a character encoding standard developed many years ago by some intelligent developers with the goal of mapping most of the world’s written characters to a single character set. The practical benefit of this aim is that any user in any location can view Chinese scripts, English alphanumeric characters, or Russian and Arabic text, all within the same file and without having to manually futz with the encoding (code page) for each specific text. Prior to Unicode, you would probably have needed to select a different code page to see each script, if the script even had a code page and a font that supported it, and you wouldn’t be able to view multiple languages or scripts within the same file at all.

Tim Bray, in his article “On the Goodness of Unicode”, explains Unicode in simple terms:

The basics of Unicode are actually pretty simple. It defines a large (and steadily growing) number of characters – just over 100,000 last time I checked. Each character gets a name and a code point, for example LATIN CAPITAL LETTER A is 0041 and TIBETAN SYLLABLE OM is 0F00. Unicode includes a table of useful character properties such as “this is lower case” or “this is a number” or “this is a punctuation mark”.

(Note: As of this update to this power tip, on Nov 2, 2018, there are exactly 137,374 characters in Unicode.)

So, with this knowledge in mind, an updated diagram for how Unicode encoding works is shown below:

The difference between UTF-8, UTF-16, etc.

Because Unicode encompasses well over a hundred thousand characters, multiple bytes are required for each character. Why? Well, as you might already know, in the world of computers, one byte is composed of 8 bits. A bit is the most basic and smallest piece of electronic data and can either be a 0 or a 1 (or “off” / “on”). So this means a single byte can be one of 256 possible combinations of bits (2 to the power of 8). This means that you can only support up to 256 characters with an encoding that uses a single byte for each of those characters. Obviously you’d need far more combinations than 256 to support all characters in the world.

To meet this requirement the developers of Unicode implemented a two-byte character system, but even that didn’t provide enough possible combinations (2 to the power of 16, or 65,536) for all the world’s characters! But solving this problem wasn’t as simple as just increasing it to three or four bytes per character because of memory and space considerations: if each character in a plain text file requires 4 bytes of disk space (or memory space, if it’s loaded into memory), you’re essentially quadrupling the amount of space needed to store that data compared with a single-byte encoding! That’s not efficient at all.

UTF-16 was developed as an alternative, using 16 bits (or 2 bytes) for most characters; characters that don’t fit in 16 bits are stored as a pair of 16-bit units (4 bytes). If you’re doing the math, you’ve already realized that the space calculations still aren’t great, and there is still potential for a lot of wasted space with UTF-16 encoded data, especially if you’re only ever using characters that need just 8 bits (or 1 byte). Additionally, because UTF-16 relies upon a 16-bit unit, many existing programs and applications that were designed to support 8-bit characters had to add special, separate support for UTF-16 (essentially duplicating all their text handling code).
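To make the idea of names, code points, and character properties concrete, here is a minimal sketch using Python’s standard unicodedata module. The describe helper is a hypothetical name chosen for this illustration, not something from the article, and the “category” it reports is Unicode’s general category property (for example “Lu” for an uppercase letter).

```python
import unicodedata


def describe(ch: str) -> dict:
    """Return the code point, official name, and general category of one character.

    `describe` is an illustrative helper, not part of any library.
    """
    return {
        "code_point": f"U+{ord(ch):04X}",         # code point in the usual hex notation
        "name": unicodedata.name(ch),             # official Unicode character name
        "category": unicodedata.category(ch),     # general category, e.g. "Lu", "Nd", "Po"
    }


if __name__ == "__main__":
    print(describe("A"))        # code point U+0041, LATIN CAPITAL LETTER A
    print(describe("\u0F00"))   # code point U+0F00, TIBETAN SYLLABLE OM
    print(describe("7"))        # a decimal digit (category "Nd")
    print(describe("!"))        # a punctuation mark (category "Po")
```

The property table is what lets the same lookup answer “is this lower case?”, “is this a number?”, or “is this punctuation?” for any script.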
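The space arithmetic discussed above can be checked directly. The following is a rough sketch, assuming Python 3; encoded_sizes is a hypothetical helper written for this illustration. It uses the little-endian codec names (“utf-16-le”, “utf-32-le”) so that Python does not prepend a byte order mark, which would otherwise inflate the counts.

```python
# How many distinct values fit in one byte and in two bytes.
COMBINATIONS_PER_BYTE = 2 ** 8        # 256
COMBINATIONS_PER_TWO_BYTES = 2 ** 16  # 65536

# Codec names without a byte order mark, so only the character data is counted.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(text: str) -> dict:
    """Return the number of bytes `text` occupies in each encoding (illustrative helper)."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}


if __name__ == "__main__":
    print(COMBINATIONS_PER_BYTE, COMBINATIONS_PER_TWO_BYTES)
    for sample in ("A", "\u0F00", "\U0001F600"):
        # "A" fits in 8 bits; U+0F00 fits in 16 bits; U+1F600 does not fit in 16 bits.
        print(f"U+{ord(sample):04X}", encoded_sizes(sample))
```

For plain English text, the fixed-width four-byte form costs four times as much as a one-byte-per-character form, which is the inefficiency described above, while UTF-16 still spends two bytes on characters that would fit in one.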
