Bush hid the facts

Bush hid the facts is a common name for a bug present in some Microsoft Windows applications, which causes a text file encoded in ASCII or one of its supersets (such as a Windows code page) to be interpreted as if it were UTF-16LE, resulting in mojibake. When "Bush hid the facts" (without newline or quotes) is put in a new (pre-Vista) Notepad document and saved, closed, and reopened, the nonsensical Chinese characters "畂桳栠摩琠敨映捡獴" appear instead.

While "Bush hid the facts" is the sentence most commonly presented on the Internet to induce the error, the bug can be triggered by many sentences with characters and spaces in a particular order so that the bytes match the UTF-16LE encoding of valid (if nonsensical) Chinese Unicode characters. Other popular strings are "this app can break", "acre vai pra globo" (Portuguese for "Acre goes to Rede Globo"), and "aaaa aaa aaa aaaaa".[1] The bug is triggered even by the text "a ".
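The byte-level effect described above can be reproduced outside Notepad. The following is a minimal sketch in Python, not Notepad's own code: the helper name misread_as_utf16le is hypothetical, and it only shows what the bytes look like once a program has already decided to treat them as UTF-16LE. It does not model the detection step itself.

```python
def misread_as_utf16le(text: str) -> str:
    """Encode text as ASCII, then decode the very same bytes as UTF-16LE."""
    raw = text.encode("ascii")
    # UTF-16LE reads two bytes per code unit, low byte first, so an
    # odd number of bytes cannot be reinterpreted cleanly.
    if len(raw) % 2:
        raise ValueError("need an even number of bytes")
    return raw.decode("utf-16-le")


# Each of these sentences is 18 bytes long, so it maps to 9 code units.
for sentence in ("Bush hid the facts", "this app can break", "acre vai pra globo"):
    print(sentence, "->", misread_as_utf16le(sentence))
```

Each pair of ASCII bytes becomes one 16-bit code unit. For example, "B" (0x42) followed by "u" (0x75) is read as U+7542, the first character of the garbled output.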

The bug occurs when the string is passed to the Win32 charset detection function IsTextUnicode with no other characters. IsTextUnicode sees what it thinks is valid UTF-16LE Chinese and returns true, and the application then incorrectly interprets the text as UTF-16LE.[2]

Many text editors and tools exhibit this behavior because they use IsTextUnicode as well.

Discovery

The bug has existed since IsTextUnicode was introduced with Windows NT 3.5 in 1994, but it was not discovered until early 2004.[3]

Workarounds

Notepad in Vista SP1 and later includes a workaround for the IsTextUnicode bug.

Editing the text so that it no longer matches a triggering pattern avoids the bug. For instance, adding a newline within the first 20 characters works.

If the file is saved as "UTF-8" rather than "ANSI" the text loads correctly, because Notepad prepends a UTF-8 byte order mark, which is a pattern that does not trigger the bug. UTF-8 without the byte order mark would still trigger the bug, as this sequence is represented identically in UTF-8 as in ASCII.
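The difference between the two UTF-8 variants can be inspected at the byte level. This is a sketch using Python's standard codecs, in which "utf-8-sig" is Python's name for UTF-8 with a leading byte order mark. It illustrates the bytes only and makes no claim about how IsTextUnicode scores them.

```python
import codecs

text = "Bush hid the facts"

ascii_bytes = text.encode("ascii")
utf8_no_bom = text.encode("utf-8")
utf8_with_bom = text.encode("utf-8-sig")  # prepends the bytes EF BB BF

# Without a byte order mark, the UTF-8 form is byte-for-byte the ASCII form.
same_as_ascii = utf8_no_bom == ascii_bytes
# With the mark, the file begins with three bytes that are not ASCII text.
starts_with_bom = utf8_with_bom.startswith(codecs.BOM_UTF8)

print(same_as_ascii, starts_with_bom, utf8_with_bom[:5].hex())
```

Because the version without the mark is identical to the ASCII file, any program that misreads one will misread the other in the same way.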

The bug is also avoided by saving as "Unicode", which in Microsoft Windows means UTF-16LE. When loading this text, IsTextUnicode should (and does) return true, and the text is displayed correctly.

To retrieve the original text using Notepad, bring up the "Open a file" dialog box, select the file, select "ANSI" or "UTF-8" in the "Encoding" list box, and click Open. Under Windows 2000, Notepad lacks the "Encoding" list box. Notepad2 also lacks this. WordPad appears to load the text correctly without choosing the encoding, since it uses its own encoding detection.
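Choosing the encoding explicitly, as the "Encoding" list box does, has a direct equivalent in most programming languages. The sketch below assumes that "ANSI" corresponds to the Western European code page, which Python calls "cp1252"; on other systems the ANSI code page may differ. The helper read_with_encoding and the temporary file are illustrative only.

```python
import os
import tempfile


def read_with_encoding(path: str, encoding: str) -> str:
    """Read a whole text file, decoding it with an explicitly chosen encoding."""
    with open(path, "r", encoding=encoding, newline="") as f:
        return f.read()


# Write the trigger sentence as plain single-byte text, as an "ANSI" save would.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"Bush hid the facts")
    path = tmp.name

garbled = read_with_encoding(path, "utf-16-le")  # the mistaken guess
recovered = read_with_encoding(path, "cp1252")   # the explicit choice
os.remove(path)

print(recovered)
```

The file on disk is never damaged by the bug. Only the interpretation of its bytes changes, which is why reopening it with the right encoding restores the text.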

References

  1. Bush Hid The Facts - Notepad Conspiracy Claim – Hoax-Slayer
  2. Chen, Raymond (2007-03-24). "Some files come up strange in Notepad - The Old New Thing". blogs.msdn.com.
  3. Cumps, David (February 27, 2004). "Notepad bug? Encoding issue?". #region .Net Blog. Retrieved February 15, 2009.

This article is issued from Wikipedia - version of the Thursday, May 05, 2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.