Monday, October 27th, 2008 | Author: AHSauge

One thing that really annoys me these days is the unrestrained enthusiasm for combining PHP with Unicode, primarily in the form of UTF-8. These people seem to think that Unicode is some fantastic thing they just have to use, even though a single-byte charset like ISO-8859-1 is more than sufficient for their needs. Don’t get me wrong, Unicode is good and everything, but not when you’re using PHP. Why? Because PHP by itself does not support Unicode. Now, I know a lot of people will object and say that there is mbstring. If you’re one of them, please read this carefully: mbstring is a Swiss cheese. I also know that a lot of people are not using mbstring at all and are basically using Unicode (most likely UTF-8) blissfully unaware of its dangers. So before I get into why mbstring is a horrible piece of coding, I’d like to explain why you shouldn’t use Unicode without taking some precautions.

First of all, you shouldn’t be using a foreign charset or encoding without knowing its risks. Unicode requires a multibyte encoding, which has its own set of problems compared to single-byte charsets like ISO-8859-1 and similar. As I’m only really familiar with UTF-8, I’m going to use that as an example. I could probably write a book about the problems, but here’s a small list:

  • BOM (U+FEFF) and “reverse”-BOM (U+FFFE). This is not really a problem in UTF-8 itself, as it doesn’t have the issues with little and big endian (see Wikipedia on endianness), but if you convert to, for instance, UTF-16, having a “reverse”-BOM at the start of your document would be a disaster.
  • Str-functions aren’t multibyte aware. They work on bytes rather than characters and might “break” your strings.
  • Non-shortest form. There’s a bit of math behind this one, which I’m not going to explain today, so you’ll just have to trust me. A non-shortest form is an illegal representation of a Unicode character that uses more bytes than necessary. In UTF-8 this means that the single quote ( ' ), which is 0x27 in hex, could also be written as the two bytes 0xC0 0xA7. If the developers of, for example, the RDBMS you’re using are just as unaware as a lot of people out there, they will decode or recode this to UTF-16, UCS-2 or UCS-4 and get a single quote ( ' ), which might just result in an SQL injection …
  • Surrogates. In UTF-8, surrogates (U+D800-U+DFFF) are not allowed, and they are actually regarded as a potential security risk. Allowing these will result in a � or a similar “this is an illegal byte”-character being displayed.
  • Illegal combinations of bytes, which result in a � or a similar “this is an illegal byte”-character being displayed.
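
Two of the points above are easy to see in a few lines of PHP. The following is only a minimal sketch: the sample string and the helper naiveDecodeTwoBytes() are made up for illustration, and the helper deliberately imitates a careless decoder rather than any real library. It shows strlen() and substr() working on bytes, and how the non-shortest form 0xC0 0xA7 turns into 0x27 when nobody checks for it.

```php
<?php
// "é" is U+00E9, which UTF-8 encodes as the two bytes C3 A9.
$word = "caf\xC3\xA9";

// strlen() counts bytes, not characters: 5 bytes for 4 characters.
$byteLength = strlen($word);

// substr() cuts at a byte offset, so this keeps "caf" plus the lone lead byte C3.
$cut = substr($word, 0, 4);

// PCRE's /u modifier refuses subjects that are not valid UTF-8, which gives
// a quick validity check that does not depend on mbstring.
$cutIsValid = preg_match('//u', $cut) === 1;

// Hypothetical careless decoder: it trusts the bit layout of a two-byte
// sequence (110xxxxx 10xxxxxx) and never checks for shortest form.
function naiveDecodeTwoBytes(string $bytes): int
{
    $lead  = ord($bytes[0]) & 0x1F; // low 5 bits of the lead byte
    $trail = ord($bytes[1]) & 0x3F; // low 6 bits of the trail byte
    return ($lead << 6) | $trail;
}

// The non-shortest form from the list above.
$overlong = "\xC0\xA7";
$decoded  = naiveDecodeTwoBytes($overlong);

printf("bytes: %d, cut is valid UTF-8: %s, decoded: 0x%02X\n",
    $byteLength, $cutIsValid ? 'yes' : 'no', $decoded);
```

The decoded value is the code of the single quote, which is exactly what a strict decoder has to refuse.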

In addition to those above, there is also the possibility of security problems when you’re actually escaping input with functions not aware of multibyte charsets (here’s an example). With all this in mind I can tell you why mbstring is such a bad piece of coding.
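
The escaping problem can be sketched with a well-known case. Note that this is an assumption on my part about what a typical example looks like, it uses the multibyte charset GBK rather than UTF-8, and the input bytes are a hypothetical attacker payload. addslashes() works byte by byte, so it happily puts a backslash in the middle of what a GBK-aware consumer will read as a single character.

```php
<?php
// Hypothetical attacker input: the byte BF followed by a single quote (27).
$input = "\xBF\x27";

// addslashes() is not charset aware: it sees a quote byte and puts a
// backslash (5C) in front of it.
$escaped = addslashes($input);

// The result is BF 5C 27. In GBK, BF 5C is a valid two-byte character,
// so a consumer that interprets the data as GBK reads one character
// followed by a bare, unescaped single quote.
$hex = strtoupper(bin2hex($escaped));

echo $hex, "\n";
```

The usual way around this is to let the database driver do the escaping with the connection charset set correctly, or to use prepared statements.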

First of all, it doesn’t have a complete replacement for every str-function. We’re missing *sort, *trim, strcasecmp, str_ireplace, ucfirst, ucwords and wordwrap, just to mention some. The second thing is that mbstring was apparently written with Japanese in mind, which is quite obvious considering strcasecmp is missing. … And no, using mb_strtolower and strcasecmp is not a valid substitute, as a case-insensitive comparison requires either simple or full case folding, which differs in places from the “to lower” process. Anyway, mbstring is written more or less to enable Japanese, Chinese and similar languages to work. The thing is, these languages don’t have case the way most other languages do, which explains the lack of the above.
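
To illustrate why lowercasing is not the same as case folding, here is a small sketch. It assumes the mbstring extension is loaded, and the two sample words are my own choice. Under full case folding the German ß (U+00DF) folds to “ss”, so “Straße” and “STRASSE” should compare as equal, but lowercasing both sides does not get you there.

```php
<?php
// "Straße", with ß (U+00DF) written as the UTF-8 bytes C3 9F so the
// example does not depend on the encoding of the source file.
$a = "Stra\xC3\x9Fe";
$b = "STRASSE";

// Lowercasing leaves ß as it is, and turns "STRASSE" into "strasse".
$lowerA = mb_strtolower($a, 'UTF-8');
$lowerB = mb_strtolower($b, 'UTF-8');

// The "lowercase both sides and compare" trick therefore says the two
// words differ, although full case folding treats them as equal.
$equalAfterLowercasing = ($lowerA === $lowerB);

var_dump($equalAfterLowercasing);
```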

Third, there are some bugs in mbstring which result in invalid validation. Here are some examples:

  • Anything behind a null-byte isn’t validated
  • Surrogates are allowed
  • “Reverse”-BOM is allowed
  • Title case isn’t working
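
For comparison, a stricter check can be written without mbstring. What follows is a rough sketch, not a replacement for a proper library: the function name isStrictUtf8() is hypothetical, the pattern is the commonly circulated byte-level expression for well-formed UTF-8, and the rejection of the “reverse”-BOM is my own addition because U+FFFE is well-formed as far as the bytes go. Very long input may hit PCRE limits, in which case preg_match() fails and the function errs on the side of rejecting.

```php
<?php
// Hypothetical helper: returns true only for well-formed UTF-8 that
// contains no "reverse"-BOM (U+FFFE).
function isStrictUtf8(string $bytes): bool
{
    // Byte-level pattern (no u modifier), so PCRE sees raw bytes.
    $wellFormed = '/\A(?:
        [\x00-\x7F]                         # ASCII, including the null byte
      | [\xC2-\xDF][\x80-\xBF]              # 2 bytes, no non-shortest forms
      | \xE0[\xA0-\xBF][\x80-\xBF]          # 3 bytes, no non-shortest forms
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # 3 bytes, general case
      | \xED[\x80-\x9F][\x80-\xBF]          # 3 bytes, surrogates excluded
      | \xF0[\x90-\xBF][\x80-\xBF]{2}       # 4 bytes, no non-shortest forms
      | [\xF1-\xF3][\x80-\xBF]{3}           # 4 bytes, general case
      | \xF4[\x80-\x8F][\x80-\xBF]{2}       # 4 bytes, up to U+10FFFF
    )*\z/x';

    // preg_match() is binary safe, so bytes after a null byte are checked too.
    if (preg_match($wellFormed, $bytes) !== 1) {
        return false;
    }

    // U+FFFE is encoded as EF BF BE; reject it explicitly.
    return strpos($bytes, "\xEF\xBF\xBE") === false;
}

var_dump(isStrictUtf8("caf\xC3\xA9"));   // ordinary text
var_dump(isStrictUtf8("\xED\xA0\x80"));  // a surrogate (U+D800)
var_dump(isStrictUtf8("ok\x00\xFF"));    // garbage after a null byte
```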

Fourth and finally, the overloading abilities make the str-functions behave differently from the original str-functions. These differences are undocumented and can break applications that expect the original str-functionality. For instance, if the second parameter to strrchr is an int, it is converted to a character. In mb_strrchr, however, this gives an error.

Category: PHP
