UTF-8 has become the most popular character encoding standard in recent years.

A character, commonly abbreviated as "char", is a computer symbol, letter, or number.[1] A keyboard is an input device that enters a character when a key is pressed. In Scratch, characters are used in strings, arguments, and any situation in the Scratch editor or the playable project where text is required.

Computers use encoding sets to represent characters. Since computers only understand binary code, characters are identified by certain binary sequences. There are many variations and standards across the world that have changed throughout history.[2]

Types of Characters

Letters

Letters are characters from an alphabet. In English, they consist of lowercase and uppercase characters ranging from the letters "A" to "Z". Combining letters can create words, and combining words can create sentences. "Character" is simply a more universal word that encompasses letters as well as other symbols.

Symbols

Computers have a wide range of recognizable symbols. Some are present on standard keyboards, while others must be entered through software rather than a hardware device. An example of a symbol is the common pound sign: "#". The pound sign is also known as a "hashtag" on social media websites and is arguably the most commonly used symbol. The "&" symbol is also common and represents the word "and" with a single character.

Emojis

Not to be confused with Scratch Emojis.
Emojis in a Scratch project.

Emojis are small images and "smileys" that are a part of the Unicode Technical Standard[3] that is implemented by most computers, phones, and similar devices. They are recognized computer characters and can even be used in project names and projects themselves (as of Scratch 3.0). Emojis have surged in popularity in the last decade due to their fun and visual nature and easy accessibility on cell phones. To input an emoji into a project, there are various methods:

  • Perform the input on a cell phone from the project page
  • Copy-and-paste an emoji from another Internet source
  • Use an on-screen keyboard with emoji support on a computer
The emoji section of an Android keyboard.

In Windows 10, the default on-screen keyboard does not support emojis. However, there is a second on-screen keyboard called the "Touch Keyboard" that has emoji support. The Touch Keyboard can be used even without a touch screen; it can be operated with a traditional computer mouse. To enable it, right-click the taskbar and select "Show touch keyboard button". From there, the Touch Keyboard icon will appear on the right side of the taskbar. On the virtual keyboard, the "smiley" button displays the emoji options. Another method is to press the Windows logo key and the period key to bring up an assortment of emojis and kaomojis.

Numbers

Main article: Numbers

Numbers are also symbols, often used to represent quantities in the context of mathematical operations or within a sentence. Single digit characters can be combined to form larger or more precise numbers. The basic digits range from "0" to "9". Decimal numbers often use the "." character to represent a decimal point. While the "." character alone is a symbol and not a number, it can be used together with numbers.

Non-Printable Characters

The "enter" key is a real character used twice in this text file when jumping down a line.

Some characters are "invisible", as in computers do not display them on a screen.[4] An example of this is the "escape" key. Other examples include the character for the "enter" key, the "tab" key, and even the value "null". Null is not something that has any visual representation but is important in computer programming. In the language C, the "null" character is used to denote the end of a string.

Restrictions

Some computer programs may only allow certain characters to be used in specific circumstances. For instance, Scratch does not allow letter or symbol characters to be typed into a numeric insert. It is up to the programmer to decide which characters are allowed. Many websites only allow letters, numbers, and a few symbols to be used in usernames. Passwords often allow a larger range of characters for enhanced security.

Strings

Main article: String

A string is a chain of characters. A phrase, a word, or even a random jumble of characters can be a string. Ideas typically cannot be communicated with single characters, so multiple characters are used together; a string can, however, consist of just a single character. In Scratch, strings are commonly used in lists, in blocks such as Say (), in encoding and decoding cloud data, and more.

Retrieving a Character from a String

Main article: Letter () of () (block)

In Scratch, the letter () of [] block is used to retrieve a single character from a string. For instance, if the first letter of "Hello World" is to be obtained, arguments can be entered into the block to form letter (1) of [Hello World].
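For comparison, a hypothetical equivalent in a text-based language (Python shown here; note that Python counts letters from 0 while Scratch counts from 1):

    text = "Hello World"
    print(text[0])   # "H" -- the same result as letter (1) of [Hello World]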

Encoding

A computer does not recognize characters the way a human does. A human sees a frowning emoji and interprets it as sadness, or sees numbers and associates mathematics with them. A computer is merely a machine that represents characters using standardized formats known as encodings.[5] Essentially, every character is assigned a code value. Usually the code values are organized for the programmer's convenience: the letters' codes are in alphabetical order, and the digits' codes are in numerical order.
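A minimal sketch of this idea, using Python's built-in ord and chr functions to look up code values (Python is used purely as an illustration here):

    print(ord("A"), ord("B"), ord("C"))   # 65 66 67 -- letters are in alphabetical order
    print(ord("0"), ord("1"), ord("2"))   # 48 49 50 -- digits are in numerical order
    print(chr(65))                        # "A" -- converting a code value back to a character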

ASCII

The American Standard Code for Information Interchange is an old but still-available encoding standard. Each character is associated with an ASCII code and is represented by a single byte (8 bits).[6] Originally, only 7 bits were used to represent ASCII characters, allowing 128 characters. They were still stored in a single byte, though, since computers work with whole bytes better than with an odd number of bits. Eventually an extended family of characters, using codes 128-255, came out and became known as ANSI.

ASCII is a limited character set because of its history. It was designed when computers worked with 7-bit codes, so the character set was restricted to 128 characters (codes 0-127), predominantly those most associated with the English language.
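For example (a small Python sketch, not part of ASCII itself), every ASCII character's code fits into 7 bits:

    for ch in "Hello, World!":
        code = ord(ch)
        print(ch, code, format(code, "07b"))   # each code is at most 127, so 7 bits suffice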

ANSI

ANSI is an extension of the ASCII encoding, doubling the number of characters. It contains characters with codes ranging from 0-255. It differs from ASCII notably by using 8 bits instead of 7 bits to represent a single character.[7] In the present day, though, this difference is insignificant since ASCII characters are essentially stored as 8-bit values with the leading bit always set to "0". Likewise, the values that ANSI adds onto the ASCII character set have the leading bit set to "1".
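A small illustration of this, assuming the common Windows-1252 "ANSI" code page: characters added beyond ASCII have the leading bit of their byte set to 1.

    # "°" has code 176 in the Windows-1252 ("ANSI") code page
    byte = "°".encode("cp1252")[0]
    print(byte)                      # 176
    print(format(byte, "08b"))       # 10110000 -- the leading bit is 1
    print(format(ord("A"), "08b"))   # 01000001 -- an ASCII character keeps a leading 0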

Inputting Characters off the Keyboard

Keyboards only have a limited number of characters. If, for instance, one wants to enter the "°" symbol, the "alt" key can be held down while "0176" is typed on the right-hand number pad of the keyboard.[8] "176" is the code for the degree sign in ANSI. This functionality is built into Windows rather than the keyboard itself.

UTF-8

UTF-8 is a more modern standard that encompasses over a million characters without necessarily requiring multiple bytes per character. The standard can be used globally, allowing Chinese characters to be used in the same text as Spanish characters.[9] UTF-8 encodes into certain bits information about how long the sequence of bits representing a character is. For example, if a certain character has a very long code, a few bits act as a "flag" to alert the computer that it is a longer character. The computer then takes the next byte into account as part of the same single character.

Some characters are represented by fewer bytes than others in the encoding. This allows files to be smaller than with an encoding that gives every character the same number of bytes. The UCS-4 character encoding, for example, represents all characters with 4 bytes. While some characters in UTF-8 may be represented by 4 bytes, many are represented by only 1 or 2. The following chart shows the number of bytes required for each range of character codes:[10]

Bytes Per Character
Min Character Code    Max Character Code    Bytes
0                     127                   1
128                   2,047                 2
2,048                 65,535                3
65,536                1,114,111             4

Because of this setup, all of the original ASCII characters (0-127) still take only 1 byte in UTF-8. The least common characters take up a larger number of bytes.
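These byte counts can be checked directly; the following Python sketch (shown only as an illustration of the table above) encodes a few sample characters:

    for ch in ["A", "é", "€", "😀"]:
        encoded = ch.encode("utf-8")
        print(ch, ord(ch), len(encoded), "byte(s)")
    # "A"  (code 65)      -> 1 byte
    # "é"  (code 233)     -> 2 bytes
    # "€"  (code 8364)    -> 3 bytes
    # "😀" (code 128512)  -> 4 bytes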

Other Variants

UTF-16 and UTF-32 also exist but are less common than UTF-8. UTF-16 uses a minimum of 16 bits, or 2 bytes, for every character.[11] One might assume this would make files larger than UTF-8, but some characters that are represented by 3 bytes in UTF-8 can be represented by 2 bytes in UTF-16. A character whose code needs up to 16 bits takes up 3 bytes in UTF-8 because some of the bits are used to signal that multiple bytes are necessary; in UTF-16, the same character can be represented by 2 bytes.
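For example (a Python sketch comparing the two encodings for one such character):

    euro = "€"   # code value 8,364, which needs up to 16 bits
    print(len(euro.encode("utf-8")))      # 3 bytes in UTF-8
    print(len(euro.encode("utf-16-le")))  # 2 bytes in UTF-16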

Unicode

The Unicode Consortium is a non-profit organization that develops the Unicode standard of computer characters.

Unicode is a standardized character set that can hold over a million characters, and UTF-8 is one encoding of that set. Unicode itself does not specify how to encode its data into binary; it is merely a large database of code values for many characters.[12] Unicode is constantly being updated with new values, as not all of them have been filled yet.
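The distinction between a Unicode code value and its encoded bytes can be seen in a short Python sketch:

    ch = "😀"
    print(ord(ch))              # 128512 -- the Unicode code value (not an encoding)
    print(ch.encode("utf-8"))   # b'\xf0\x9f\x98\x80' -- the same character encoded by UTF-8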

Usage in Scratch

Main article: Encoding and Decoding Cloud Data

Foreknowledge of computer character encoding standards can be beneficial when developing Scratch projects. In particular, using cloud variables to store more data than a counter or high-score value requires a custom-made encoding system. Cloud variables are only capable of storing numbers, so if text is to be stored, it needs to be translated into numeric codes. This is in line with how computers work, as they translate text into sequences of "1"s and "0"s.

Similar to ASCII encoding, a system can be designed where each character is assigned a code and the cloud variable contains a sequence of codes. When the data is to be read, it must be decoded by looking up the characters associated with their respective code values. Since cloud variables allow the digits 0-9, fewer digits per character are needed than in ASCII, which requires 7 binary digits (bits) per character.
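A minimal sketch of such a fixed-width system, written in Python rather than Scratch blocks for brevity (the character set and the two-digit codes here are made up for illustration):

    # Hypothetical character set: each character's position (01-28) is its code
    alphabet = "abcdefghijklmnopqrstuvwxyz !"

    def encode(text):
        # Two decimal digits per character, joined into one number-like string
        return "".join(f"{alphabet.index(ch) + 1:02d}" for ch in text)

    def decode(data):
        # Read the data back two digits at a time and look the codes up again
        return "".join(alphabet[int(data[i:i + 2]) - 1] for i in range(0, len(data), 2))

    print(encode("hi scratch"))          # 08092719031801200308
    print(decode(encode("hi scratch")))  # "hi scratch"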

UTF-8 can also be replicated with cloud variables by using some digits to represent how many of the following digits are part of the same character before moving on to the next one. Suppose the first number in the cloud variable signifies how many following digits make up the code for the next character. If the cloud variable's encoded data is 3564298, then the first character code is "564", followed by "98". The "3" signifies that there are 3 digits in the first character's code, and the "2" signifies that there are 2 digits in the second character's code.
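A sketch of decoding that variable-length scheme (again in Python purely for illustration; the digits are the example from the paragraph above):

    def decode_codes(data):
        # Each character code is preceded by one digit giving its length
        codes = []
        position = 0
        while position < len(data):
            length = int(data[position])   # how many digits the next code uses
            codes.append(data[position + 1:position + 1 + length])
            position += 1 + length         # skip past the length digit and the code
        return codes

    print(decode_codes("3564298"))   # ['564', '98']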

A list can then be used in which the index corresponds to the character code. This type of system is beneficial if a large number of characters is to be recognized by the project. Restrictions can be set by only allowing certain characters and using the simpler ASCII-style encoding system with a fixed number of digits per character. This, however, may cause issues if a username containing an "illegal" character is encoded into the cloud variable; more complex logic could account for such circumstances.

See Also

References
