Dealing with “cursed” Twee files in Extwee

When I wrote my last post, I was working towards Extwee 2.2.5. That release was originally planned around creating new binary files using Node’s experimental support. After several attempts at this, I have pushed that goal to a later version. Instead, I have shifted my focus to an ongoing problem with Extwee that I’ve been calling “cursed” Twee files. These are files where particular combinations of symbols cause problems when Extwee attempts to parse their content.

What is Twee?

The Twee format allows for defining passages in Twine in a plain-text format. It uses special combinations of symbols to help define the names, tags, and body of each passage.

Here is a simple example of the format with two passages:

:: PassageName [tag1 tag2]
This is content

:: PassageName2 [tag2]
This is content

In the above example, there are two passages. The first, PassageName, begins with a special sigil, a combination of symbols, made of two colons: “::”. The sigil marks the start of a passage header, which holds the name of the passage and, optionally, its tags inside square brackets. Everything from the next line up to the line before the next sigil is the body of the passage.

When parsing Twee files, Extwee, like other tools, first reads the Twee data. It then converts this data into another format such as HTML.

Problem: Searching for (ASCII) Symbols

In the current version of Extwee, the process of parsing Twee data begins with searching for symbols. If certain combinations of symbols are found, the code assumes it should be parsing certain parts. For example, if it finds the sigil, it assumes it should be parsing a name and potentially tags. Within another section, if an opening square bracket is found, it assumes it will be parsing tags.

This process works really well for most Twee files. It does not work for some “cursed” examples where special symbols are used in unusual ways or combinations.
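
The symbol-searching approach can be sketched in a few lines of JavaScript. This is only an illustration and not Extwee’s actual code: the function name parseHeaderBySymbols and its return shape are hypothetical, and it assumes tags are separated by spaces inside the square brackets.

```javascript
// Hypothetical helper, not Extwee's actual code: split one header line
// into a name and a list of tags by searching for symbols.
function parseHeaderBySymbols(line) {
  if (!line.startsWith('::')) {
    return null; // no sigil, so this line is not a passage header
  }
  const rest = line.slice(2);
  const open = rest.indexOf('[');
  const close = rest.indexOf(']', open + 1);
  if (open === -1 || close === -1) {
    return { name: rest.trim(), tags: [] }; // header without a tag block
  }
  const name = rest.slice(0, open).trim();
  const tags = rest.slice(open + 1, close).split(/\s+/).filter(Boolean);
  return { name, tags };
}

console.log(parseHeaderBySymbols(':: PassageName [tag1 tag2]'));
```

A sketch like this handles the simple example, but it also shows how an unusual header can go wrong: given “:: A [weird] name [tag]”, it treats the first bracket as the start of the tags and cuts the name down to “A”.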

Earlier this year, Chris Klimas identified a simple example Extwee does not understand:

:: test
\:: nefarious

When attempting to parse the example above, Extwee would fail. It does not understand the backslash and colon combination for a very nefarious reason. In many programming languages and text tools, a backslash marks the start of an escape sequence, a way of writing a meta-character using ordinary ASCII symbols. These meta-characters carry extra information, like where a line ends.

Rewriting the earlier example on a single line, with the line break written as an escape sequence, produces the next example:

:: test\n\:: nefarious

The backslash and lowercase “n” combination is the escape sequence for the newline character. There are several commonly defined escape sequences, and the backslash convention leaves room for more. Because of this convention, code that meets the backslash and colon combination can assume it is looking at yet another meta-character.
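
The backslash behavior can be seen directly in JavaScript string literals. This is a minimal sketch, assuming JavaScript as the language; it shows general string handling and not the specific place where Extwee fails.

```javascript
// In a JavaScript string literal, a backslash starts an escape sequence.
const withNewline = ':: test\n';             // "\n" becomes one newline character
const swallowed = '\:: nefarious';           // "\:" is not a known escape, so the backslash is dropped
const preserved = String.raw`\:: nefarious`; // String.raw keeps the backslash as written

console.log(withNewline.length); // 8
console.log(swallowed);          // :: nefarious
console.log(preserved);          // \:: nefarious
```

The second line is the troubling one: the backslash silently disappears, and what remains looks exactly like a passage header.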

It is possible to overcome this issue by adding extra backslashes before parsing and removing them again afterwards. There is an easier way, though. Instead of working strictly with ASCII symbols, it is possible to convert the text into numbers.

Solution: Using UTF-8 Code Points

Unicode allows for far more symbols than ASCII, and Unicode Transformation Format (UTF) encodings such as UTF-8 are how those symbols are stored in files. Unicode uses a concept known as “code points.” For every symbol, there exists a number. When a computer reads a UTF-8 file, it converts the stored numbers back into symbols.

The advantage of using numerical values instead of symbols is found in meta-characters. Escape sequences are combinations of symbols. When using numbers, these combinations no longer receive special treatment. Every single symbol, regardless of whether it is part of a combination, is treated as a single number.

Converting the earlier single-line example into code points, one symbol at a time, creates a new example as a series of numbers:

58, 58, 32, 116, 101, 115, 116, 92, 110, 92, 58, 58, 32, 110, 101, 102, 97, 114, 105, 111, 117, 115

When using code points, each symbol becomes a number to check against. The sequence that trips up Extwee, “\::”, can be better understood as three numbers: “92, 58, 58”.
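
The conversion above can be reproduced with a short JavaScript sketch. The helper names toCodePoints and indexOfSequence are hypothetical and chosen for illustration. String.raw is used so the backslashes in the example stay as literal symbols.

```javascript
// Convert a string into an array of Unicode code points.
function toCodePoints(text) {
  return Array.from(text, (ch) => ch.codePointAt(0));
}

// Find the first index where `needle` occurs in `haystack`, or -1 if absent.
function indexOfSequence(haystack, needle) {
  for (let i = 0; i + needle.length <= haystack.length; i++) {
    if (needle.every((value, j) => haystack[i + j] === value)) {
      return i;
    }
  }
  return -1;
}

const points = toCodePoints(String.raw`:: test\n\:: nefarious`);
console.log(points.join(', '));
console.log(indexOfSequence(points, [92, 58, 58])); // 9
```

Searching for “92, 58, 58” is a plain comparison of numbers, so nothing in the sequence is treated as an escape along the way.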

This approach does need more code to parse everything. But it also better handles the more “cursed” examples. These include unusual names and combinations of symbols in unusual places.
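
As a rough idea of what checking against code points could look like, here is a minimal sketch. It is not the unreleased Extwee code. The function name findHeaderStarts is hypothetical, and it assumes a passage header only counts when “::” sits at the very start of a line.

```javascript
const COLON = 58;     // ":"
const LINE_FEED = 10; // the newline character

// Hypothetical sketch: return the index of every code point that starts
// a passage header, assuming a header is "::" at the start of a line.
function findHeaderStarts(text) {
  const points = Array.from(text, (ch) => ch.codePointAt(0));
  const starts = [];
  for (let i = 0; i < points.length - 1; i++) {
    const atLineStart = i === 0 || points[i - 1] === LINE_FEED;
    if (atLineStart && points[i] === COLON && points[i + 1] === COLON) {
      starts.push(i);
    }
  }
  return starts;
}

// The cursed example: a real line break, then a line starting with a backslash.
const cursed = ':: test\n' + String.raw`\:: nefarious`;
console.log(findHeaderStarts(cursed)); // [ 0 ]
```

Because the second line begins with 92 and not 58, it is never mistaken for a header, and the backslash stays part of the passage body.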

More to Come

I have not made the new code public yet. I hope to finish it over the next week. Then, I will ask for more “cursed” examples to test against. Hopefully, this new code can become part of the next major version of Extwee in the coming weeks.