Monday, February 09th, 2009 | Author: AHSauge

I’ve previously written about why PHP and Unicode/UTF-8 are a bad combination. Even though UTF-8 in PHP should (for now) be avoided, it is sometimes necessary to use it. As UTF-8 can be quite problematic for some people, I thought that this time I should write about how to actually use it properly. In this first part I’ll deal with the basic handling of UTF-8 in PHP using the PHP extensions mbstring and/or iconv.

The basic facts

The key element when using a multibyte character set in PHP is to know exactly what you’re doing. If you don’t, you can easily end up with partially corrupted text and wrong results. The one big reason for this is that PHP by default does not support anything other than single-byte character sets. In fact, strictly speaking, PHP doesn’t really know what a character set is. All it sees are bytes, not characters. This means that every string function in PHP works on the assumption that a byte is a character. When dealing with, for instance, UTF-8 this is no longer true. The result is that strlen reports the number of bytes in the string, not the number of characters. Similarly, strpos will give you the position in bytes, not characters, and many of the other string functions have similar problems. So what to do?
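Before answering that, a small sketch of the problem may help. It assumes the mbstring extension is installed, and the variable names are only for illustration. The word is written with explicit byte escapes so the result doesn’t depend on how the file itself is saved.

```php
<?php
// "blåbær" as raw UTF-8 bytes: å = C3 A5, æ = C3 A6
$word = "bl\xC3\xA5b\xC3\xA6r";

// The plain string functions count bytes
$bytes   = strlen($word);
$bytePos = strpos($word, 'r');

// The mbstring functions count characters (mbstring assumed to be installed)
$chars   = mb_strlen($word, 'UTF-8');
$charPos = mb_strpos($word, 'r', 0, 'UTF-8');

echo "$bytes bytes, $chars characters\n";
echo "'r' found at byte offset $bytePos, character offset $charPos\n";
```

The word has six characters, but two of them take two bytes each, so the byte-based functions report larger numbers than the character-based ones.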

First off, a very handy fact about UTF-8 is that it’s ASCII-compatible (7-bit ASCII that is), meaning these characters are represented in binary as 0xxx xxxx (where the x’s are the ASCII bits). Another handy fact about valid UTF-8 strings is that any encoded Unicode character has a unique byte sequence, meaning it can’t be confused with part of another character. This means that if you encounter the byte 00100000 (20 hex or 32 dec) it cannot be anything other than a space character, or else it’s not a valid UTF-8 string. For those interested in how this is achieved, here’s the binary representation of UTF-8 characters (skip if you’re not into that type of stuff ;o)

1 byte:  0xxx xxxx
2 bytes: 110x xxxx 10xx xxxx
3 bytes: 1110 xxxx 10xx xxxx 10xx xxxx
4 bytes: 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
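To show how these bit patterns can be read in practice, here is a rough sketch of a helper that looks at a single byte and tells how long the sequence it starts should be. The function name is made up for this example, and it only checks the patterns listed above; it does not catch overlong or out-of-range sequences, so it is no replacement for real validation.

```php
<?php
// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with $byte, or 0 if $byte matches none of the lead-byte patterns
// (for instance a continuation byte, 10xx xxxx).
function utf8_sequence_length($byte)
{
    $value = ord($byte);
    if (($value & 0x80) === 0x00) { return 1; } // 0xxx xxxx
    if (($value & 0xE0) === 0xC0) { return 2; } // 110x xxxx
    if (($value & 0xF0) === 0xE0) { return 3; } // 1110 xxxx
    if (($value & 0xF8) === 0xF0) { return 4; } // 1111 0xxx
    return 0;
}

// The euro sign is E2 82 AC in UTF-8: one lead byte, two continuation bytes
$euro = "\xE2\x82\xAC";
echo utf8_sequence_length($euro[0]), "\n"; // lead byte
echo utf8_sequence_length($euro[1]), "\n"; // continuation byte
```

Since a lead byte never looks like a continuation byte, and neither looks like ASCII, a byte such as 20 hex can only ever be a space.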

Well, enough talk about UTF-8 in general. Here’s a step-by-step guide to using UTF-8 in PHP.

1. Find out what you have available

If you’re going to use UTF-8 in PHP you’ll first need to find out what’s available to you. Ideally you should have the PHP extension mbstring installed, and it should not be set to overload the str-functions. With mbstring you’ll have a set of functions that are multibyte aware. If you don’t have mbstring available, check for iconv (also a PHP extension). In PHP 5 and later this extension will give you some very simple functions to work with (strlen, strpos, strrpos, substr and validation). If you have neither mbstring nor iconv available at your host (I’m assuming that you’re going to run something on a hosted server), you should strongly consider changing host and/or reconsider your need for Unicode/UTF-8, as you’re going to have to write your own functions that work properly on UTF-8 encoded strings. For the sake of simplicity, I’m going to assume you have mbstring or iconv installed.
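A quick way to check this is a small script like the one below. It is only a sketch; the variable names are mine. Note that ini_get returns false when a setting doesn’t exist, which the cast turns into 0 (no overloading).

```php
<?php
// Which of the two extensions does this PHP installation have?
$hasMbstring = extension_loaded('mbstring');
$hasIconv    = extension_loaded('iconv');

// 0 means the normal str-functions are left alone, which is what we want
$overload = (int) ini_get('mbstring.func_overload');

echo 'mbstring: ', ($hasMbstring ? 'yes' : 'no'), "\n";
echo 'iconv:    ', ($hasIconv ? 'yes' : 'no'), "\n";
echo 'mbstring.func_overload: ', $overload, "\n";

if (!$hasMbstring && !$hasIconv) {
    echo "Neither extension is available, UTF-8 handling will be painful.\n";
}
```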

2. Store your files as UTF-8

This should be somewhat obvious. If you’re going to use UTF-8, you should store your files as UTF-8 too. The simple reason is that any strings you have stored in your scripts will then be UTF-8 as well, and will be output properly provided you’ve done everything else correctly.

3. Define input and output as UTF-8

This depends quite a bit on what you’re doing, but chances are that you’re dealing with the HTTP protocol and HTML. If so, you have to add the following function call before any output (if you’re not using output buffering).

header('Content-Type: text/html; charset=utf-8');

and add the following to the head-part of your HTML-document

<meta http-equiv=Content-Type content="text/html; charset=utf-8">

This will state that any output is HTML and UTF-8. If you’re not dealing with HTML, just replace text/html with whatever you’re using (and probably drop the meta tag too). This should also ensure that input from most browsers is UTF-8, but to be really sure it might be a good idea to add the attribute accept-charset to any forms you might have, like this:

<form accept-charset="UTF-8" method="post">
<!-- Input stuff here -->
</form>

This attribute can also be used with a comma-separated list of the character sets your script accepts. To keep it simple you should just use UTF-8 as the only accepted “character set” (strictly speaking, UTF-8 is an encoding of the Unicode character set, and not a charset by itself).

If you don’t have control over the input, for instance an RSS feed from a third-party server, transform it to Unicode and encode it as UTF-8. This can be done like this:

//mbstring: the arguments are the string, the target encoding, then the source encoding
$UTF8string = mb_convert_encoding($string, 'UTF-8', 'other charset');
// Always state the source encoding here. If you drop it, mbstring assumes the
// internal encoding (see mb_internal_encoding), which is not what the input uses.

//iconv: the arguments are the source encoding, the target encoding, then the string
$UTF8string = iconv('other charset', 'UTF-8', $string);

Consult the PHP manual for supported character sets.

4. Validate your input

I cannot stress this point enough. Check that your input actually is UTF-8, or the multibyte aware functions might not work as expected. Also, if it’s not validated, those who view or store your data might get security problems like SQL injection (though it would be their fault it’s happening …). This can be done as follows for mbstring:

//the encoding parameter can be dropped if you have called mb_internal_encoding('UTF-8') once
$validUTF8 = mb_check_encoding($string, 'UTF-8'); //will give true if valid, false if not

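For iconv you’ll have to go for a slightly dirtier solution:

function validateUTF8_iconv($before)
{
    // @ hides the notice iconv raises when it hits an invalid byte sequence
    $after = @iconv('UTF-8', 'UTF-8', $before);
    return ($before === $after);
}

The reason this works is that iconv stops at the first invalid byte sequence and returns either false or a cut-off string (depending on the PHP version), and thus the result will not be equal to the input anymore.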
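If you don’t know in advance which extension a host offers, the two checks can be wrapped in one function. The following is just a sketch under that assumption: the function name is made up, and the last fallback relies on preg_match failing on invalid UTF-8 when the u modifier is used.

```php
<?php
// Hypothetical wrapper: validate with whatever is available on this host
function is_valid_utf8($string)
{
    if (function_exists('mb_check_encoding')) {
        return mb_check_encoding($string, 'UTF-8');
    }
    if (function_exists('iconv')) {
        // @ hides the notice raised on invalid input
        return $string === @iconv('UTF-8', 'UTF-8', $string);
    }
    // Last resort: an empty pattern with the u modifier only matches
    // when the subject is valid UTF-8
    return preg_match('//u', $string) === 1;
}

var_dump(is_valid_utf8("bl\xC3\xA5b\xC3\xA6r")); // valid: "blåbær"
var_dump(is_valid_utf8("bl\xC3"));               // cut in the middle of å
var_dump(is_valid_utf8("bl\xE5b\xE6r"));         // ISO-8859-1 bytes
```

Anything that fails this check should either be rejected or converted from its real encoding before you go any further.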

5. Store your data correctly

Now this is really a pit a lot of people fall into. Storing UTF-8 on disk is no problem, and is done just as before. Storing data in a database however, is in the case of MySQL definitely not as before. First off, you’ll need MySQL 4.1 or later, as earlier versions don’t support what we’re about to do. The big problem with PHP and MySQL is that the connection is by default set to latin1, roughly ISO-8859-1 (MySQL’s latin1 is strictly speaking Windows-1252), and a lot of people then actually store their data as UTF-8-transformed ISO-8859-1. That is, MySQL thinks your input is ISO-8859-1, then converts it to Unicode and encodes it as UTF-8. This will lead to problems when you view your data in, for instance, phpMyAdmin, which connects to MySQL the proper way when dealing with UTF-8. The correct way to connect to MySQL is now

$link = mysql_connect('localhost', 'mysql_user', 'mysql_password');
mysql_query("SET NAMES 'utf8' COLLATE 'utf8_general_ci'");

The difference is the second line, which states which charset is used for all further communication on the connection. The collate part can be dropped if you’re using the default one (utf8_general_ci). In addition to this, the tables and fields should be defined with collation utf8_[something], and to save space don’t use char fields, as they will use 3x the number of characters you define. Instead you should use varchar. Please also note that MySQL’s utf8 only covers the Basic Multilingual Plane, that is code points up to U+FFFF, which take at most 3 bytes in UTF-8. If you need any code points above that, you’re unfortunately in for some trouble …

This is however only the case for MySQL. For other database systems you should check what the default charset is and how to change it. The documentation/manual of the system is a good place to start looking for this type of information.

6. Functions operating on UTF-8

The points above should make sure that you’re using UTF-8. The last thing to remember is that the str-functions might not work correctly. Try to use the string functions defined in mbstring or iconv (see documentation). Some functions do however work as expected. Strcmp only does a binary compare and still works; strcasecmp however doesn’t. Str_replace will also work on valid UTF-8 strings, as any given character has a unique byte sequence, but the case-less version str_ireplace doesn’t. In general, any str-function that is not case-less and doesn’t need to operate on the number of characters in the string should work just fine, as long as any input to the function is valid UTF-8.
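Here is a small sketch of that rule of thumb, assuming mbstring is installed. The sample text and the variable names are only for illustration, and the text is again written as raw bytes.

```php
<?php
// "Blåbær og fløte": å = C3 A5, æ = C3 A6, ø = C3 B8
$text = "Bl\xC3\xA5b\xC3\xA6r og fl\xC3\xB8te";

// Works: str_replace just looks for a byte sequence, and the bytes of ø
// can't show up as part of any other character
$replaced = str_replace("\xC3\xB8", 'o', $text);

// Works: strcmp is a plain binary compare
$same = (strcmp($text, $text) === 0);

// Breaks: substr counts bytes, so this cuts å in half
$cut = substr($text, 0, 3);

// Works: mb_substr counts characters
$cutMb = mb_substr($text, 0, 3, 'UTF-8');

var_dump($replaced);
var_dump($same);
var_dump(mb_check_encoding($cut, 'UTF-8'));   // is the byte-wise cut still valid?
var_dump(mb_check_encoding($cutMb, 'UTF-8')); // is the character-wise cut still valid?
```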

There are also some functions that are not part of mbstring or iconv that do support UTF-8. Htmlspecialchars, htmlentities and preg_* (with the u modifier) are examples of this. There are also functions that operate on a purely binary level without any regard to charset. Examples of this are the md5 and sha1 functions.
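As a rough illustration of the two first groups, the sketch below counts what a single dot matches with and without the u modifier, and escapes a string with the charset stated explicitly. The variable names are mine.

```php
<?php
// "blåbær" as raw UTF-8 bytes
$text = "bl\xC3\xA5b\xC3\xA6r";

// Without the u modifier a dot matches one byte, with it one character
$byteCount = preg_match_all('/./', $text, $byteMatches);
$charCount = preg_match_all('/./u', $text, $charMatches);

// htmlspecialchars takes the charset as its third parameter
$escaped = htmlspecialchars("<b>$text</b>", ENT_QUOTES, 'UTF-8');

echo "$byteCount matches without u, $charCount with u\n";
echo $escaped, "\n";
```

The escaped string still contains the original bytes for å and æ; only the angle brackets are turned into entities.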

That concludes this part of the howto. You should now be able to use UTF-8 the right way. If you have any questions, please leave a comment and I’ll happily answer. The next part will hopefully deal with some common pitfalls and how to debug and solve them.

Category: PHP

6 Responses

  1. Thanks a lot for the post. I’ve been coding (amateur-level) PHP for like 7-8 years, and I always had this encoding/decoding problem. And I always had to invent something new :) That was such a shame :)

  2. Great article, hopefully PHP 6 will solve most of these common unicode problems by switching to multibyte string functions. However, I can’t imagine the time it would take to get most hosting companies to update to version 6.

    Here’s something I would like to know, when using unicode should we htmlspecialchars() the data before entering the database? Or leave data as is, and only htmlspecialchars() the output from the database?

    I’m looking forward to your next article on this subject!

  3. As far as I can see, PHP6 will definitely solve more or less all current problems with PHP and Unicode/UTF-8. It will provide top-to-bottom built-in support for Unicode (both UTF-8 and UTF-16) using Unicode-enabled versions of existing functions etc. and a new datatype for Unicode-strings. Though I too fear that it might take a while before it’s available at most hosting companies. The switching to PHP5 took years, and I fear we’ll be looking at similar times for PHP6 too, which basically means that we’re looking at years before you can assume your users have PHP6 available at their host. However, the Unicode semantics can be set to off as default in php.ini (and turned on at runtime), so the Unicode shouldn’t be that much of a problem for the hosts, but there are other changes in PHP 6 too … Hopefully some hosts might be sensible and default to PHP5 while also offering possibility to convert to a server with PHP6.

    As for your question about htmlspecialchars: If and/or when you use it is up to you. From a performance point of view, it’s better to do it once before you store it instead of doing it multiple times afterwards (at each request), and I can’t see any benefits of doing it at each request vs. before storing. Any editing would require HTML entities for at least < and > anyway to prevent injection of malicious (X)HTML, and any entities will show up as normal characters in inputs and textareas even after using htmlspecialchars, so there’s no real drawback to doing it before. As for the Unicode/UTF-8 side of it, there’s no difference from single-byte charsets except that you have to add 'UTF-8' as the third parameter. Actually, strictly speaking, you don’t even need to add the parameter at all, as it’s only going to convert <, >, &, ' and " to entities, which are all ASCII characters anyway (and are stored the exact same way in UTF-8, hence ASCII-compatible).

    PS: It’s worth noting that this does not apply to htmlentities, which converts a lot more than just those five characters, and in my opinion shouldn’t be used with Unicode/UTF-8 as it sort of defeats the point of using Unicode/UTF-8.

  4. Ziad Hilal, Friday, 15. May 2009

    Thanks for the great reply!

  5. Navigator, Wednesday, 3. June 2009

    Great article! Thanks! :)

  6. Very helpful, thanks too much…
