<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>var/log &#187; bugs</title>
	<atom:link href="http://www.varslashlog.com/tag/bugs/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.varslashlog.com</link>
	<description>Yet another weblog</description>
	<lastBuildDate>Sat, 12 Sep 2009 13:34:27 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Why PHP and Unicode/UTF-8 is a bad combination</title>
		<link>http://www.varslashlog.com/2008/10/27/why-php-and-unicode-is-a-bad-combination/</link>
		<comments>http://www.varslashlog.com/2008/10/27/why-php-and-unicode-is-a-bad-combination/#comments</comments>
		<pubDate>Mon, 27 Oct 2008 00:09:46 +0000</pubDate>
		<dc:creator>AHSauge</dc:creator>
				<category><![CDATA[PHP]]></category>
		<category><![CDATA[bugs]]></category>
		<category><![CDATA[mbstring]]></category>
		<category><![CDATA[Unicode]]></category>
		<category><![CDATA[UTF-8]]></category>

		<guid isPermaLink="false">http://www.varslashlog.com/?p=6</guid>
		<description><![CDATA[One thing that really annoys me these days is the unrestrained enthusiasm for PHP and Unicode, primarily in the form of UTF-8. These people seem to think that Unicode is some fantastic thing they just have to use, even though a single-byte charset like ISO-8859-1 is more than sufficient for their needs. Don&#8217;t get [...]]]></description>
			<content:encoded><![CDATA[<p>One thing that really annoys me these days is the unrestrained enthusiasm for PHP and Unicode, primarily in the form of UTF-8. These people seem to think that Unicode is some fantastic thing they just have to use, even though a single-byte charset like ISO-8859-1 is more than sufficient for their needs. Don&#8217;t get me wrong, Unicode is good and everything, but not when you&#8217;re using PHP. Why? Because PHP by itself does <em>not</em> support Unicode. Now I know a lot of people will object and say that there is mbstring. If you&#8217;re one of them, please read this carefully: mbstring is Swiss cheese. I also know that a lot of people aren&#8217;t using mbstring at all and are basically using Unicode (most likely UTF-8) blissfully unaware of its dangers. So before I get into why mbstring is a horrible piece of coding, I&#8217;d like to explain why you shouldn&#8217;t use Unicode without taking some precautions.<span id="more-6"></span></p>
<p>First of all, you shouldn&#8217;t be using a foreign charset or encoding without knowing its risks. Unicode requires a multibyte encoding, which has its own set of problems compared to single-byte charsets like ISO-8859-1 and similar. As I&#8217;m only really familiar with UTF-8, I&#8217;m going to use that as an example. I could probably write a book about the problems, but here&#8217;s a small list:</p>
<ul>
<li>BOM (U+FEFF) and &#8220;reverse&#8221;-BOM (U+FFFE). This is not really a problem in UTF-8, as it doesn&#8217;t have the issues with little and big endian (<a title="Wikipedia article explaining endianness" href="http://en.wikipedia.org/wiki/Endian" target="_blank">Wikipedia on endianness</a>), but if you convert to, for instance, UTF-16, having a &#8220;reverse&#8221;-BOM at the start of your document would be a disaster.</li>
<li>Str-functions aren&#8217;t multibyte aware. They work on bytes, not characters, so they might &#8220;break&#8221; your strings by cutting a character in half.</li>
<li>Non-shortest form. There&#8217;s a bit of math behind this one, which I&#8217;m not going to explain today, so you&#8217;ll just have to trust me on it. Non-shortest form is an illegal representation of a Unicode character: it is encoded using more bytes than necessary. In UTF-8 this means that &#39; , which in hex terms is 0x27, could also be represented by 0xC0A7. If those who developed, for example, the RDBMS you&#8217;re using are just as unaware as a lot of people out there, they will decode or recode this to UTF-16, UCS-2 or UCS-4 and get a single quote ( &#39; ), which might just result in an SQL injection &#8230;</li>
<li>Surrogates. In UTF-8, surrogates (U+D800-U+DFFF) are not allowed, and are actually regarded as a potential security risk. Allowing these will result in a � or a similar &#8220;this is an illegal byte&#8221;-character being displayed.</li>
<li>Illegal combinations of bytes, resulting in a � or a similar &#8220;this is an illegal byte&#8221;-character being displayed.</li>
</ul>
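<p>The last three items on the list can be made concrete with a hand-written check. This is only a minimal sketch, assuming PHP 7.0 or later; <code>is_strict_utf8</code> is a hypothetical helper made up for this post, not a function from PHP or mbstring. It rejects non-shortest forms, encoded surrogates and illegal byte combinations:</p>
<pre><code>&lt;?php
// Hypothetical helper, not part of PHP or mbstring.
// Returns true only for well-formed UTF-8 (assumes PHP 7.0+ for the type hints).
function is_strict_utf8(string $bytes): bool
{
    $len = strlen($bytes); // strlen() counts bytes, which is what we want here
    $i = 0;
    while ($i &lt; $len) {
        $b = ord($bytes[$i]);
        if ($b &lt; 0x80) { // plain ASCII, one byte
            $i++;
            continue;
        }
        // Work out how many continuation bytes follow and the smallest
        // code point that is allowed to use a sequence of this length.
        if (($b &amp; 0xE0) === 0xC0) {
            $need = 1; $cp = $b &amp; 0x1F; $min = 0x80;
        } elseif (($b &amp; 0xF0) === 0xE0) {
            $need = 2; $cp = $b &amp; 0x0F; $min = 0x800;
        } elseif (($b &amp; 0xF8) === 0xF0) {
            $need = 3; $cp = $b &amp; 0x07; $min = 0x10000;
        } else {
            return false; // stray continuation byte or illegal lead byte
        }
        if ($i + $need &gt;= $len) {
            return false; // sequence is cut off at the end of the string
        }
        for ($k = 1; $k &lt;= $need; $k++) {
            $c = ord($bytes[$i + $k]);
            if (($c &amp; 0xC0) !== 0x80) {
                return false; // not a continuation byte
            }
            $cp = ($cp &lt;&lt; 6) | ($c &amp; 0x3F);
        }
        if ($cp &lt; $min) {
            return false; // non-shortest form, e.g. 0xC0A7 for 0x27
        }
        if ($cp &gt;= 0xD800 &amp;&amp; $cp &lt;= 0xDFFF) {
            return false; // surrogate
        }
        if ($cp &gt; 0x10FFFF) {
            return false; // beyond the Unicode range
        }
        $i += $need + 1;
    }
    return true;
}

var_dump(is_strict_utf8("'"));                       // bool(true)
var_dump(is_strict_utf8("\xC0\xA7"));                // bool(false): non-shortest form
var_dump(is_strict_utf8("\xED\xA0\x80"));            // bool(false): surrogate U+D800
var_dump(is_strict_utf8(substr("\xC3\xA6", 0, 1)));  // bool(false): substr() cut a character in half
</code></pre>
<p>The last call also shows the str-function problem from the list: <code>substr</code> counts bytes, so asking for one &#8220;character&#8221; of a two-byte letter leaves you with half of it.</p>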
<p>In addition to the problems above, there is also the possibility of security problems when you&#8217;re escaping input with functions that aren&#8217;t aware of multibyte charsets (<a href="http://shiflett.org/blog/2005/dec/google-xss-example">here&#8217;s an example</a>). With all this in mind, I can tell you why mbstring is such a bad piece of coding.</p>
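<p>As a small aside on the escaping point: one measure is to tell the escaping function which charset the input is supposed to be in. The sketch below assumes PHP 5.4 or later, where, as far as I know, <code>htmlspecialchars</code> returns an empty string for input that is invalid in the given charset unless <code>ENT_IGNORE</code> or <code>ENT_SUBSTITUTE</code> is set. The variable names are just for illustration:</p>
<pre><code>&lt;?php
// Assumption: PHP 5.4 or later, where htmlspecialchars() checks its input
// against the charset given as the third argument.
$overlong = "\xC0\xA7"; // the non-shortest form of a single quote

// Invalid input is not passed through unescaped; an empty string is the
// documented result when no ENT_IGNORE / ENT_SUBSTITUTE flag is given.
$escaped = htmlspecialchars($overlong, ENT_QUOTES, 'UTF-8');
var_dump($escaped === '');

// A well-formed string is escaped as usual.
$safe = htmlspecialchars("it's", ENT_QUOTES, 'UTF-8');
var_dump(strpos($safe, "'") === false);
</code></pre>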
<p>First of all, it doesn&#8217;t have a complete replacement for every str-function. We&#8217;re missing *sort, *trim, strcasecmp, str_ireplace, ucfirst, ucwords and wordwrap, just to mention some. The second thing is that mbstring appears to have been written by Japanese developers, which is quite obvious considering strcasecmp is missing. And no, combining mb_strtolower and strcasecmp is not a valid substitute, as a caseless comparison requires either simple or full case folding, which differs somewhat from the &#8220;to lower&#8221; process. Anyway, mbstring was written more or less to enable Japanese, Chinese and similar languages to work. The thing is, these languages don&#8217;t have case the way most other languages do, which explains the lack of the above.</p>
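<p>To show the difference between &#8220;to lower&#8221; and case folding, here is a minimal sketch. It assumes mbstring is loaded and PHP 7.3 or later, which is where the <code>MB_CASE_FOLD</code> mode of <code>mb_convert_case</code> was added, so it is not something this post could rely on when it was written. <code>mb_strcasecmp_sketch</code> is a hypothetical helper, not a real mbstring function:</p>
<pre><code>&lt;?php
// Assumptions: mbstring is loaded and PHP is 7.3 or later (MB_CASE_FOLD).
$a = "Stra\xC3\x9Fe"; // "Strasse" spelled with U+00DF, written as UTF-8 bytes
$b = "STRASSE";

// Lowercasing leaves U+00DF as it is, so the two strings still differ.
var_dump(mb_strtolower($a, 'UTF-8') === mb_strtolower($b, 'UTF-8')); // bool(false)

// Hypothetical helper: caseless comparison through full case folding.
function mb_strcasecmp_sketch(string $x, string $y): int
{
    return strcmp(
        mb_convert_case($x, MB_CASE_FOLD, 'UTF-8'),
        mb_convert_case($y, MB_CASE_FOLD, 'UTF-8')
    );
}

// Full case folding maps U+00DF to "ss", so the strings now compare equal.
var_dump(mb_strcasecmp_sketch($a, $b) === 0); // bool(true)
</code></pre>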
<p>Third, there are some bugs in mbstring which result in faulty validation. Here are some examples:</p>
<ul>
<li>Anything behind a null-byte isn&#8217;t validated</li>
<li>Surrogates are allowed</li>
<li>&#8220;Reverse&#8221;-BOM is allowed</li>
<li>Title case isn&#8217;t working</li>
</ul>
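<p>If you want to see how your own installation handles the cases above, something like the following probe could be used. It is a sketch that assumes mbstring is loaded; the sample labels are my own, and I&#8217;m deliberately not stating any expected results, since they depend on the PHP and mbstring version:</p>
<pre><code>&lt;?php
// A probe, not a verdict: results depend on the PHP and mbstring version.
// Assumes the mbstring extension is loaded.
$samples = array(
    'invalid byte after a null byte' =&gt; "ok\x00\xFF",
    'encoded surrogate U+D800'       =&gt; "\xED\xA0\x80",
    'reverse BOM U+FFFE'             =&gt; "\xEF\xBF\xBE",
);

$results = array();
foreach ($samples as $label =&gt; $bytes) {
    // mb_check_encoding() reports whether mbstring considers the bytes valid.
    $results[$label] = mb_check_encoding($bytes, 'UTF-8');
    echo $label, ': ', ($results[$label] ? 'accepted' : 'rejected'), "\n";
}

// Title case: inspect the output by eye.
echo mb_convert_case("hello world", MB_CASE_TITLE, 'UTF-8'), "\n";
</code></pre>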
<p>Fourth and finally, the overloading abilities make the str-functions behave differently from the original str-functions. These differences are undocumented and may break applications that expect the original str-function behaviour. For instance, if the second parameter to strrchr is an int, it is converted to a character. In mb_strrchr, however, this will give an error.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.varslashlog.com/2008/10/27/why-php-and-unicode-is-a-bad-combination/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
