<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>var/log &#187; UTF-8</title>
	<atom:link href="http://www.varslashlog.com/tag/utf-8/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.varslashlog.com</link>
	<description>Yet another weblog</description>
	<lastBuildDate>Sat, 12 Sep 2009 13:34:27 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>How to use Unicode/UTF-8 in PHP properly (part 1)</title>
		<link>http://www.varslashlog.com/2009/02/09/how-to-use-unicodeutf-8-in-php-properly-part-1/</link>
		<comments>http://www.varslashlog.com/2009/02/09/how-to-use-unicodeutf-8-in-php-properly-part-1/#comments</comments>
		<pubDate>Mon, 09 Feb 2009 01:45:13 +0000</pubDate>
		<dc:creator>AHSauge</dc:creator>
				<category><![CDATA[PHP]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[iconv]]></category>
		<category><![CDATA[mbstring]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Unicode]]></category>
		<category><![CDATA[UTF-8]]></category>

		<guid isPermaLink="false">http://www.varslashlog.com/?p=119</guid>
		<description><![CDATA[I&#8217;ve previously written about why PHP and Unicode/UTF-8 are a bad combination. Even though UTF-8 in PHP should (for now) be avoided, it is sometimes necessary to use it. As UTF-8 can be quite problematic for some people to use, I thought that this time I should write about how to actually use it [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve <a title="Why PHP and Unicode/UTF-8 is a bad combination" href="http://www.varslashlog.com/2008/10/27/why-php-and-unicode-is-a-bad-combination/">previously</a> written about why PHP and Unicode/UTF-8 are a bad combination. Even though UTF-8 in PHP should (for now) be avoided, it <em>is</em> sometimes necessary to use it. As UTF-8 can be quite problematic for some people to use, I thought that this time I should write about how to actually use it properly. In this first part I&#8217;ll deal with the basic handling of UTF-8 in PHP using the PHP extensions mbstring and/or iconv.</p>
<p><span id="more-119"></span></p>
<h4>The basic facts</h4>
<p>The key element when using a multibyte character set in PHP is to know exactly what you&#8217;re doing. If you don&#8217;t, you can easily end up with partially corrupted text and wrong results. The main reason for this is that PHP by default does not support anything other than single-byte character sets. In fact, strictly speaking, PHP doesn&#8217;t really know what a character set is. All it sees are bytes, not characters. This means that every string function in PHP works on the assumption that a byte is a character. When dealing with, for instance, UTF-8 this is no longer true. The result is that strlen reports the number of bytes in the string, not the number of characters. Similarly, strpos will give you the position in bytes, not characters, and many of the other string functions have similar problems. So what to do?</p>
<p>First off, a very handy fact about UTF-8 is that it&#8217;s ASCII-compatible (7-bit ASCII, that is), meaning these characters are represented in binary as 0xxx xxxx (where the x&#8217;s are the ASCII bits). Another handy fact about valid UTF-8 strings is that any encoded Unicode character has a unique byte sequence, meaning it can&#8217;t be confused with a part of another character. This means that if you encounter the byte 0010 0000 (20 hex or 32 dec) it cannot be anything other than a space character, or else it&#8217;s not a valid UTF-8 string. For those interested in how this is achieved, here&#8217;s the binary representation of UTF-8 characters (skip if you&#8217;re not into that type of stuff ;o)</p>
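<p>To make the problem concrete, here is a minimal sketch of the byte/character mismatch. It assumes mbstring is installed; the sample string and the variable names are my own illustration, and the string is written with byte escapes so that it doesn&#8217;t depend on how the script file itself is saved.</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">// "æøå" written as raw UTF-8 bytes (2 bytes per character)
$text = "\xC3\xA6\xC3\xB8\xC3\xA5";

$byteLength = strlen($text);             // counts bytes: 6
$charLength = mb_strlen($text, 'UTF-8'); // counts characters: 3

// Position of "å": byte offset versus character offset
$bytePos = strpos($text, "\xC3\xA5");                // 4
$charPos = mb_strpos($text, "\xC3\xA5", 0, 'UTF-8'); // 2</pre></div></div>
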
<blockquote>
<pre>1 byte:  0xxx xxxx
2 bytes: 110x xxxx 10xx xxxx
3 bytes: 1110 xxxx 10xx xxxx 10xx xxxx
4 bytes: 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx</pre>
</blockquote>
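<p>The bit patterns above can be turned into a small lookup. This is only a sketch: utf8SequenceLength is a made-up helper name, and it merely classifies a byte by its leading bits. It does not validate the bytes that follow, nor reject lead bytes that can only start an illegal sequence.</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">// Hypothetical helper: how many bytes does a character starting with this byte occupy?
function utf8SequenceLength($leadByte)
{
    $b = ord($leadByte);
    if (($b &amp; 0x80) === 0x00) { return 1; } // 0xxx xxxx
    if (($b &amp; 0xE0) === 0xC0) { return 2; } // 110x xxxx
    if (($b &amp; 0xF0) === 0xE0) { return 3; } // 1110 xxxx
    if (($b &amp; 0xF8) === 0xF0) { return 4; } // 1111 0xxx
    return 0; // 10xx xxxx (a continuation byte) or not a UTF-8 byte at all
}

echo utf8SequenceLength(' '), "\n";    // 1 (the space character, 20 hex)
echo utf8SequenceLength("\xC3"), "\n"; // 2
echo utf8SequenceLength("\xA6"), "\n"; // 0, as this byte can only appear inside a character</pre></div></div>

<p>The last case is what makes the &#8220;unique byte sequence&#8221; property work: a continuation byte can never be mistaken for the start of a character, and never for an ASCII character.</p>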
<p>Well, enough talk about UTF-8 in general. Here&#8217;s a step-by-step guide to using UTF-8 in PHP.</p>
<h4>1. Find out what you have available</h4>
<p>If you&#8217;re going to use UTF-8 in PHP you&#8217;ll first need to find out what&#8217;s available to you. Ideally you should have the PHP extension <a title="link to the PHP manual for mbstring" href="http://www.php.net/mbstring">mbstring</a> installed and it should <span style="text-decoration: underline;">not</span> be set to overload str-functions. With mbstring you&#8217;ll have a set of functions that are multibyte aware. If you don&#8217;t have mbstring available, check for <a title="Link to PHP manual for iconv" href="http://www.php.net/iconv">iconv</a> (also a PHP extension). In PHP 5 and later this extension will give you some very simple functions to work with (strlen, strpos, strrpos, substr and validation). If you have neither mbstring nor iconv available at your host (I&#8217;m assuming that you&#8217;re going to run something on a hosted server), you should strongly consider changing host and/or reconsidering your need for Unicode/UTF-8, as you&#8217;re going to have to write your own functions that work properly on UTF-8 encoded strings. For the sake of simplicity, I&#8217;m going to assume you have mbstring or iconv installed.</p>
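<p>A quick way to check this from a script could look like the sketch below. The function name availableMultibyteExtension is hypothetical; extension_loaded and ini_get are standard PHP functions.</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">// Hypothetical helper: reports which multibyte-aware extension can be used
function availableMultibyteExtension()
{
    if (extension_loaded('mbstring')) {
        return 'mbstring';
    }
    if (extension_loaded('iconv')) {
        return 'iconv';
    }
    return null; // neither is available
}

// Overloading should be off (0). If the setting does not exist in your PHP version,
// ini_get() returns false and the cast gives 0 as well.
$overloading = (int) ini_get('mbstring.func_overload');

echo 'Extension: ', var_export(availableMultibyteExtension(), true), "\n";
echo 'Overloading: ', ($overloading === 0 ? 'off' : 'on'), "\n";</pre></div></div>
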
<h4>2. Store your files as UTF-8</h4>
<p>This should be somewhat obvious. If you&#8217;re going to use UTF-8, you should store your files as UTF-8 too. The simple reason is that any strings you have stored in your scripts will then be UTF-8 as well, and will be output properly provided you&#8217;ve done everything else correctly.</p>
<h4>3. Define input and output as UTF-8</h4>
<p>This depends a lot on what you&#8217;re doing, but chances are that you&#8217;re dealing with HTTP and HTML. If so, you have to add the following function call before any output (unless you&#8217;re using output buffering).</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;"><span style="color: #990000;">header</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'Content-Type: text/html; charset=utf-8'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p>and add the following to the head part of your HTML document:</p>

<div class="wp_syntax"><div class="code"><pre class="html4strict" style="font-family:monospace;"><span style="color: #009900;">&lt;<span style="color: #000000; font-weight: bold;">meta</span> <span style="color: #000066;">http-equiv</span><span style="color: #66cc66;">=</span>Content-<span style="color: #000066;">Type</span> <span style="color: #000066;">content</span><span style="color: #66cc66;">=</span><span style="color: #ff0000;">&quot;text/html; charset=utf-8&quot;</span>&gt;</span></pre></div></div>

<p>This states that any output is HTML encoded as UTF-8. If you&#8217;re not dealing with HTML, just replace text/html with whatever you&#8217;re using (and probably drop the meta tag too). This should also ensure that input from most browsers is UTF-8, but to be really sure it might be a good idea to add the attribute accept-charset to any forms you might have, like this:</p>

<div class="wp_syntax"><div class="code"><pre class="html4strict" style="font-family:monospace;"><span style="color: #009900;">&lt;<span style="color: #000000; font-weight: bold;">form</span> <span style="color: #000066;">accept-charset</span><span style="color: #66cc66;">=</span><span style="color: #ff0000;">&quot;UTF-8&quot;</span> <span style="color: #000066;">method</span><span style="color: #66cc66;">=</span><span style="color: #ff0000;">&quot;post&quot;</span>&gt;</span>
<span style="color: #808080; font-style: italic;">&lt;!-- Input stuff here --&gt;</span>
<span style="color: #009900;">&lt;<span style="color: #66cc66;">/</span><span style="color: #000000; font-weight: bold;">form</span>&gt;</span></pre></div></div>

<p>This attribute can also be used with a comma-separated list of the character sets your script accepts. To keep it simple you should just use UTF-8 as the only accepted &#8220;character set&#8221; (strictly speaking, UTF-8 is an encoding of the Unicode character set, and not a charset by itself).</p>
<p>If you don&#8217;t have control over the input, for instance an RSS feed from a third-party server, transform it to Unicode and encode it as UTF-8. This can be done like this:</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;"><span style="color: #666666; font-style: italic;">//mbstring: the arguments are (string, to-encoding, from-encoding)</span>
<span style="color: #000088;">$UTF8string</span> <span style="color: #339933;">=</span> <span style="color: #990000;">mb_convert_encoding</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$string</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">'UTF-8'</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">'other charset'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #666666; font-style: italic;">// do not drop the last parameter here: it defaults to the internal encoding,</span>
<span style="color: #666666; font-style: italic;">// so after mb_internal_encoding('UTF-8') nothing would be converted</span>
&nbsp;
<span style="color: #666666; font-style: italic;">//iconv: the arguments are (from-encoding, to-encoding, string)</span>
<span style="color: #000088;">$UTF8string</span> <span style="color: #339933;">=</span> <span style="color: #990000;">iconv</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'other charset'</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">'UTF-8'</span><span style="color: #339933;">,</span> <span style="color: #000088;">$string</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p>Consult the PHP manual for supported character sets.</p>
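<p>As a worked example, here is a small sketch converting ISO-8859-1 input to UTF-8 with both extensions. The sample bytes and variable names are mine, and ISO-8859-1 is just an assumed source charset. Note the differing argument order: mb_convert_encoding takes the target encoding before the source encoding, while iconv takes the source first.</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">// "æøå" in ISO-8859-1: one byte per character
$latin1 = "\xE6\xF8\xE5";

$viaMbstring = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
$viaIconv    = iconv('ISO-8859-1', 'UTF-8', $latin1);

// Both give the same six UTF-8 bytes
echo bin2hex($viaMbstring), "\n"; // c3a6c3b8c3a5
echo bin2hex($viaIconv), "\n";    // c3a6c3b8c3a5</pre></div></div>
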
<h4>4. Validate your input</h4>
<p>I cannot stress this point enough. Check that your input actually is UTF-8, or the multibyte-aware functions might not work as expected. Also, if the input isn&#8217;t validated, those who view or store your data might get security problems like SQL injection (though it would be their own fault if that happens &#8230;). For mbstring, validation can be done as follows:</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;"><span style="color: #666666; font-style: italic;">//the last parameter can be dropped if mb_internal_encoding('UTF-8') has been called first</span>
<span style="color: #000088;">$validUTF8</span> <span style="color: #339933;">=</span> <span style="color: #990000;">mb_check_encoding</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$string</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">'UTF-8'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <span style="color: #666666; font-style: italic;">//will give true if valid, false if not</span></pre></div></div>

<p>For iconv you&#8217;ll have to go for a slightly dirtier solution:</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">function</span> validateUTF8_iconv<span style="color: #009900;">&#40;</span><span style="color: #000088;">$before</span><span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
    <span style="color: #000088;">$after</span> <span style="color: #339933;">=</span> <span style="color: #990000;">iconv</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'UTF-8'</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">'UTF-8'</span><span style="color: #339933;">,</span> <span style="color: #000088;">$before</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #b1b100;">return</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$before</span> <span style="color: #339933;">===</span> <span style="color: #000088;">$after</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>The reason this works is that iconv cannot pass an invalid sequence through unchanged: depending on the PHP version, it either cuts the string at the first invalid byte or returns false (raising a notice in both cases), so the result will no longer be identical to the input.</p>
<h4>5. Store your data correctly</h4>
<p>Now this is really a pit a lot of people fall into. Storing UTF-8 on disk is no problem, and is done just as before. Storing data in a database, however, is in the case of MySQL definitely <em>not</em> as before. First off, you&#8217;ll need MySQL 4.1 or later, as earlier versions don&#8217;t support what we&#8217;re about to do. The big problem with PHP and MySQL is that the connection is by default set to latin1, also known as ISO-8859-1, and a lot of people then actually store their data as UTF-8-transformed ISO-8859-1. That is, MySQL thinks your input is ISO-8859-1 and then converts it to Unicode and encodes it as UTF-8. This will lead to problems when you view your data in, for instance, phpMyAdmin, which connects to MySQL the proper way when dealing with UTF-8. The correct way to connect to MySQL is therefore:</p>
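<p>If you want one function that works with either extension, the two approaches can be combined along these lines. This is a sketch only: isValidUtf8 is a hypothetical name, and the @ operator is used simply to silence the notice iconv raises on invalid input.</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">// Hypothetical helper combining the two approaches above
function isValidUtf8($string)
{
    if (function_exists('mb_check_encoding')) {
        return mb_check_encoding($string, 'UTF-8');
    }
    // @ hides the notice iconv raises for an invalid byte sequence
    return @iconv('UTF-8', 'UTF-8', $string) === $string;
}

var_dump(isValidUtf8("bl\xC3\xA5b\xC3\xA6r")); // bool(true):  "blåbær" as UTF-8
var_dump(isValidUtf8("bl\xE5b\xE6r"));         // bool(false): the same word as ISO-8859-1</pre></div></div>
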

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;"><span style="color: #000088;">$link</span> <span style="color: #339933;">=</span> <span style="color: #990000;">mysql_connect</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'localhost'</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">'mysql_user'</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">'mysql_password'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #990000;">mysql_query</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;SET NAMES 'utf8' COLLATE 'utf8_general_ci'&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p>The difference is the second line, which states what charset is used for all further communication on the connection. The collate part can be dropped if you&#8217;re using the default one (utf8_general_ci). In addition to this, the tables and fields should be defined with the collation utf8_[something], and to save space you shouldn&#8217;t use char fields, as they will use 3 bytes for every character you define. Use varchar instead. Please also note that MySQL only supports Unicode 3.0, that is 2-byte Unicode up to code point U+FFFF, or at most 3 bytes per character in UTF-8. If you need any code points above that, you&#8217;re unfortunately in for some trouble &#8230;</p>
<p>This, however, only applies to MySQL. For other database systems you should check what the default charset is and how to change it. The documentation/manual of the system is a good place to start looking for this type of information.</p>
<h4>6. Functions operating on UTF-8</h4>
<p>The points above should make sure that you&#8217;re using UTF-8. The last thing to remember is that any str-function might not work correctly. Try to use the str-functions defined in mbstring or iconv (see the documentation). Some functions do, however, work as expected. Strcmp only does a binary comparison and still works; strcasecmp, however, doesn&#8217;t. Str_replace will also work on valid UTF-8 strings, as any given character has a unique byte sequence, but the case-insensitive version str_ireplace doesn&#8217;t. In general, any str-function that is not case-insensitive and doesn&#8217;t need to operate on the number of characters in the string should work just fine as long as any input to the function is valid UTF-8.</p>
<p>There are also some functions that are not part of mbstring or iconv that do support UTF-8. Htmlspecialchars, htmlentities (both when given UTF-8 as the charset parameter) and preg_* (with the u modifier) are examples of this. There are also functions that operate on a purely binary level without any regard to charset. Examples of this are the md5 and sha1 functions.</p>
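<p>Here is a short sketch of which operations are safe and which aren&#8217;t. It assumes mbstring is available; the sample word and variable names are only for illustration.</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">$word = "sm\xC3\xB8rbr\xC3\xB8d"; // "smørbrød" as UTF-8 bytes

// Byte-wise replacement is safe on valid UTF-8
$replaced = str_replace("\xC3\xB8", 'o', $word); // "smorbrod"

// substr counts bytes and cuts the first "ø" in half
$broken = substr($word, 0, 3);             // "sm\xC3", which is no longer valid UTF-8
$intact = mb_substr($word, 0, 3, 'UTF-8'); // "smø"

// Case conversion needs to know the encoding
$upper = mb_strtoupper($word, 'UTF-8');    // "SMØRBRØD"</pre></div></div>
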
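<p>A minimal sketch of the u modifier and the charset parameter in action (sample strings and variable names are my own):</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">$text = "\xC3\xA6\xC3\xB8\xC3\xA5"; // "æøå" as UTF-8 bytes

// With the u modifier "." matches one character; without it, one byte
$withU    = preg_match('/^...$/u', $text); // 1
$withoutU = preg_match('/^...$/', $text);  // 0, as the string is 6 bytes long

// Splitting a string into its characters
$chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY); // array with 3 elements

// Telling htmlspecialchars which charset it is dealing with
$escaped = htmlspecialchars("&lt;\xC3\xA6&gt;", ENT_QUOTES, 'UTF-8'); // "&amp;lt;\xC3\xA6&amp;gt;"</pre></div></div>
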
<p>That concludes this part of the howto. You should now be able to use UTF-8 the right way. If you have any questions, please leave a comment and I&#8217;ll happily answer. The next part will hopefully deal with some common pitfalls and how to debug and solve them.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.varslashlog.com/2009/02/09/how-to-use-unicodeutf-8-in-php-properly-part-1/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Why PHP and Unicode/UTF-8 is a bad combination</title>
		<link>http://www.varslashlog.com/2008/10/27/why-php-and-unicode-is-a-bad-combination/</link>
		<comments>http://www.varslashlog.com/2008/10/27/why-php-and-unicode-is-a-bad-combination/#comments</comments>
		<pubDate>Mon, 27 Oct 2008 00:09:46 +0000</pubDate>
		<dc:creator>AHSauge</dc:creator>
				<category><![CDATA[PHP]]></category>
		<category><![CDATA[bugs]]></category>
		<category><![CDATA[mbstring]]></category>
		<category><![CDATA[Unicode]]></category>
		<category><![CDATA[UTF-8]]></category>

		<guid isPermaLink="false">http://www.varslashlog.com/?p=6</guid>
		<description><![CDATA[One thing that really annoys me these days is the unrestricted enthusiasm for PHP and Unicode, primarily in the form of UTF-8. These people seem to think that Unicode is some fantastic thing they just have to use even though a single-byte charset like ISO-8859-1 is more than sufficient for their needs. Don&#8217;t get [...]]]></description>
			<content:encoded><![CDATA[<p>One thing that really annoys me these days is the unrestricted enthusiasm for PHP and Unicode, primarily in the form of UTF-8. These people seem to think that Unicode is some fantastic thing they just have to use even though a single-byte charset like ISO-8859-1 is more than sufficient for their needs. Don&#8217;t get me wrong, Unicode is good and everything, but not when you&#8217;re using PHP. Why? Because PHP by itself does <em>not</em> support Unicode. Now I know a lot of people will object and say that there is mbstring. If you&#8217;re one of them, please read this carefully: mbstring is a Swiss cheese. I also know a lot of people are not using mbstring and are basically using Unicode (most likely UTF-8) blissfully unaware of its dangers. So before I get into why mbstring is a horrible piece of coding, I&#8217;d like to explain why you shouldn&#8217;t use Unicode without taking some precautions.<span id="more-6"></span></p>
<p>First of all, you shouldn&#8217;t be using a foreign charset or encoding without knowing its risks. Unicode requires a multibyte encoding, which has its own set of problems compared to single-byte charsets like ISO-8859-1 and similar. As I&#8217;m only really familiar with UTF-8, I&#8217;m going to use that as an example. I could probably write a book about the problems, but here&#8217;s a small list:</p>
<ul>
<li>BOM (U+FEFF) and &#8220;reverse&#8221;-BOM (U+FFFE). This is not really a problem in UTF-8 as it doesn&#8217;t have the issues with little and big endian (<a title="Wikipedia article explaining endianness" href="http://en.wikipedia.org/wiki/Endian" target="_blank">wikipedia about endianness</a>), but if you convert to for instance UTF-16, having a &#8220;reverse&#8221;-BOM at the start of your document would be a disaster.</li>
<li>Str-functions aren&#8217;t multibyte aware and might &#8220;break&#8221; your strings.</li>
<li>Non-shortest form. There&#8217;s a bit of math behind this one, which I&#8217;m not going to explain today, so you&#8217;ll just have to trust me on this one. Non-shortest form is an illegal representation of a Unicode character: it is basically represented using more bytes than necessary. In UTF-8 this means that &#39; , which in hex terms is 0x27, could be represented by 0xC0A7. If those who developed, for example, the RDBMS you&#8217;re using are just as unaware of this as a lot of people out there, they will decode or recode this to UTF-16, UCS-2 or UCS-4 and get a single quote ( &#39; ), which might just result in an SQL injection &#8230;</li>
<li>Surrogates. In UTF-8 surrogates (U+D800-U+DFFF) are not allowed, and are actually regarded as a potential security risk. Allowing these will result in a � or a similar &#8220;this is an illegal byte&#8221;-character being displayed.</li>
<li>Illegal combination of bytes resulting in a � or a similar &#8220;this is an illegal byte&#8221;-character being displayed.</li>
</ul>
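<p>The non-shortest form problem from the list above can be sketched in a few lines. The function naiveDecodeTwoBytes is a made-up example of a careless decoder, and the strict check uses PCRE&#8217;s own UTF-8 validation (triggered by the u modifier) rather than mbstring, which is my choice for this sketch.</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">// Careless decoding of a 2-byte sequence: 5 bits from the first byte, 6 from the second,
// without checking that 2 bytes were actually needed
function naiveDecodeTwoBytes($bytes)
{
    return ((ord($bytes[0]) &amp; 0x1F) &lt;&lt; 6) | (ord($bytes[1]) &amp; 0x3F);
}

$overlong  = "\xC0\xA7";
$codePoint = naiveDecodeTwoBytes($overlong); // 0x27, the single quote

// A strict validator has to reject the sequence instead.
// preg_match returns false when the subject is not valid UTF-8.
$accepted = (preg_match('//u', $overlong) === 1); // false</pre></div></div>

<p>Any escaping done on the raw bytes will never see a quote in 0xC0A7, which is why the sequence must be rejected before it reaches a decoder like the one above.</p>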
<p>In addition to those above, there is also the possibility of security problems when you&#8217;re actually escaping input with functions not aware of multibyte charsets (<a href="http://shiflett.org/blog/2005/dec/google-xss-example">here&#8217;s an example</a>). With all this in mind I can tell you why mbstring is such a bad piece of coding.</p>
<p>First of all, it doesn&#8217;t have a complete replacement for every str-function. We&#8217;re missing *sort, *trim, strcasecmp, str_ireplace, ucfirst, ucwords and wordwrap, just to mention some. The second thing is that mbstring was apparently written primarily with Japanese in mind, which is quite obvious considering strcasecmp is missing. &#8230; and no, using mb_strtolower and strcasecmp is not a valid replacement, as the process requires either simple or full case folding, which differs somewhat from the &#8220;to lower&#8221; process. Anyway, mbstring was written more or less to enable Japanese, Chinese and similar languages to work. The thing is, these languages don&#8217;t have case the way most other languages do, which explains the lack of the above.</p>
<p>Third, there are some bugs in mbstring which result in incorrect validation. Here are some examples:</p>
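<p>To illustrate why lowercasing is not the same as case folding, here is a small sketch using the German &#8220;ß&#8221;. The example words and variable names are mine; under full case folding &#8220;ß&#8221; is folded to &#8220;ss&#8221;, so the two words would be considered equal.</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">$a = "STRASSE";
$b = "stra\xC3\x9Fe"; // "straße" as UTF-8 bytes

$lowerA = mb_strtolower($a, 'UTF-8'); // "strasse"
$lowerB = mb_strtolower($b, 'UTF-8'); // unchanged, as "ß" is already lower case

// Comparing the lowercased strings says "different"
$equalAfterLowering = ($lowerA === $lowerB); // false</pre></div></div>
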
<ul>
<li>Anything behind a null-byte isn&#8217;t validated</li>
<li>Surrogates are allowed</li>
<li>&#8220;Reverse&#8221;-BOM is allowed</li>
<li>Title case isn&#8217;t working</li>
</ul>
<p>Fourth and finally, the overloading abilities make the str-functions behave differently from the original str-functions. These differences are undocumented and may break applications expecting the original str-functionality. For instance, if the second parameter to strrchr is an int, it is converted to a character. In mb_strrchr, however, this will give an error.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.varslashlog.com/2008/10/27/why-php-and-unicode-is-a-bad-combination/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
