C008: Unicode, UTF-8 and PHP

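PHP strings are sequences of single bytes, so one element of a string cannot hold a character number larger than 8 bits. To cope with large character sets, the characters are encoded in UTF-8. Standard 7-bit ASCII codes are stored as a single byte. Anything bigger than 7 bits takes more than one byte, and the larger the number, the more bytes it takes. (The 5- and 6-byte forms belong to the original UTF-8 design; the current standard, RFC 3629, stops at 4 bytes because Unicode ends at U+10FFFF.) The layout is thus: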

bytes bits code
   1    7  0bbbbbbb 
   2   11  110bbbbb 10bbbbbb 
   3   16  1110bbbb 10bbbbbb 10bbbbbb 
   4   21  11110bbb 10bbbbbb 10bbbbbb 10bbbbbb 
   5   26  111110bb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 
   6   31  1111110b 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 
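
To see the table in action, here is a small sketch that counts the bytes in a few characters and prints their bit patterns. The helper name utf8_bits is made up for this example, and the characters are written as hex escapes so that nothing depends on how the source file itself was saved.

```php
<?php
// Hypothetical helper: show each byte of a string as 8 binary digits.
function utf8_bits($s) {
  $bits = array();
  for ($i = 0; $i < strlen($s); $i++) {
    $bits[] = str_pad(decbin(ord($s[$i])), 8, '0', STR_PAD_LEFT);
  }
  return implode(' ', $bits);
}

$samples = array(
  'A'         => "A",             // U+0041, fits in 7 bits
  'e-acute'   => "\xC3\xA9",      // U+00E9, needs 2 bytes
  'euro sign' => "\xE2\x82\xAC",  // U+20AC, needs 3 bytes
);
foreach ($samples as $name => $s) {
  echo $name, ': ', strlen($s), ' byte(s), ', utf8_bits($s), "\n";
}
```

Note that strlen counts bytes, not characters, which is exactly why the euro sign reports 3.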

Some software uses fixed 2-byte Unicode instead (UCS-2, which is the same as UTF-16 as long as no character needs more than 16 bits). This sometimes makes things easier: it's pretty easy to know how many characters you have in your file, namely (file length - 2) / 2. To identify these files, a special marker is put at the start of the file. This is a two-byte code: the first byte is 0xFF and the second 0xFE. Windows frequently uses this format, which it simply calls Unicode, when saving web pages.
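
As a rough sketch of that check, the function below looks for the FF FE marker and applies the (file length - 2) / 2 sum. The name ucs2_char_count is invented for this example, and it assumes every character really is 2 bytes.

```php
<?php
// Hypothetical helper: given the raw bytes of a file, return the number of
// 2-byte characters if it starts with the FF FE marker, or null otherwise.
function ucs2_char_count($bytes) {
  if (substr($bytes, 0, 2) !== "\xFF\xFE") return null;  // no marker found
  return (int)((strlen($bytes) - 2) / 2);                // (file length - 2) / 2
}

$bytes = "\xFF\xFE" . "H\x00" . "i\x00";  // marker, then "Hi" low byte first
var_dump(ucs2_char_count($bytes));        // int(2)
var_dump(ucs2_char_count("Hi"));          // NULL
```

The FF FE order means the low byte of each character comes first; a file written high byte first would start with FE FF instead, which this sketch does not handle.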

PHP has a couple of useful functions as standard, utf8_encode and utf8_decode. However, these only convert between UTF-8 and the single-byte ISO-8859-1 (Latin-1) character set, so if you're dealing with character sets that need larger numbers, you've got problems. (Both functions are also deprecated as of PHP 8.2.)
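
To show how little utf8_encode actually covers, here is a hand-written sketch of the same conversion. The name latin1_to_utf8 is made up; the point is that the input is one byte per character, so nothing above 0xFF can ever come out of it.

```php
<?php
// Hypothetical helper: do by hand what utf8_encode does, namely turn each
// single-byte ISO-8859-1 character into its UTF-8 form.
function latin1_to_utf8($in) {
  $out = '';
  for ($i = 0; $i < strlen($in); $i++) {
    $c = ord($in[$i]);
    if ($c < 0x80) $out .= chr($c);  // ASCII passes through unchanged
    else $out .= chr(0xC0 | ($c >> 6)) . chr(0x80 | ($c & 0x3F));  // 2 bytes
  }
  return $out;
}

echo bin2hex(latin1_to_utf8("\xE9")), "\n";  // c3a9
```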

There are some add-on solutions, but since you often don't have control of the PHP build, here's a routine to convert UTF-8 into single numbers. Your input string should be in $in:

$out = array();
for ($i=0;$i<strlen($in);$i++) {
  $c = ord($in[$i]); // the lead byte tells us how many bytes follow
  if (($c&0x80)==0x00) $out[]=$c;
  else if (($c&0xE0)==0xC0) $out[]=(($c&0x1F)*0x40)|(ord($in[++$i])&0x3F);
  else if (($c&0xF0)==0xE0) $out[]=(($c&0x0F)*0x1000)|((ord($in[++$i])&0x3F)*0x40)|(ord($in[++$i])&0x3F);
  else if (($c&0xF8)==0xF0) $out[]=(($c&0x07)*0x40000)|((ord($in[++$i])&0x3F)*0x1000)|((ord($in[++$i])&0x3F)*0x40)|(ord($in[++$i])&0x3F);
  else if (($c&0xFC)==0xF8) $out[]=(($c&0x03)*0x1000000)|((ord($in[++$i])&0x3F)*0x40000)|((ord($in[++$i])&0x3F)*0x1000)|((ord($in[++$i])&0x3F)*0x40)|(ord($in[++$i])&0x3F);
  else if (($c&0xFE)==0xFC) $out[]=(($c&0x01)*0x40000000)|((ord($in[++$i])&0x3F)*0x1000000)|((ord($in[++$i])&0x3F)*0x40000)|((ord($in[++$i])&0x3F)*0x1000)|((ord($in[++$i])&0x3F)*0x40)|(ord($in[++$i])&0x3F);
}
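
To make the arithmetic concrete, here is the 3-byte case worked through by hand for the euro sign, whose UTF-8 bytes are E2 82 AC. It uses the same masks and multipliers as the routine above; only the sample input is my own choice.

```php
<?php
// Worked example: decode the three bytes E2 82 AC by hand.
$in = "\xE2\x82\xAC";
$c  = ord($in[0]);                    // 0xE2 = 11100010, a 3-byte lead byte
$n  = (($c & 0x0F) * 0x1000)          // top 4 bits    -> 0x2000
    | ((ord($in[1]) & 0x3F) * 0x40)   // middle 6 bits -> 0x0080
    | (ord($in[2]) & 0x3F);           // bottom 6 bits -> 0x002C
printf("U+%04X\n", $n);               // U+20AC
```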

If you have an array of large character numbers and want to turn them into UTF-8, here's how to do it. Your input array is $in and your output string will be $out.

$out = '';
for($i=0;$i<sizeof($in);$i++) {
  $n = $in[$i]; // the character number to encode
  if ($n<0x80) $o=chr($n);
  else if ($n<0x800) $o=chr((($n&0x7C0)/0x40)|0xC0).chr($n&0x3F|0x80);
  else if ($n<0x10000) $o=chr((($n&0xF000)/0x1000)|0xE0).chr(($n&0xFC0)/0x40|0x80).chr($n&0x3F|0x80);
  else if ($n<0x200000) $o=chr((($n&0x1C0000)/0x40000)|0xF0).chr(($n&0x3F000)/0x1000|0x80).chr(($n&0xFC0)/0x40|0x80).chr($n&0x3F|0x80);
  else if ($n<0x4000000) $o=chr((($n&0x3000000)/0x1000000)|0xF8).chr(($n&0xFC0000)/0x40000|0x80).chr(($n&0x3F000)/0x1000|0x80).chr(($n&0xFC0)/0x40|0x80).chr($n&0x3F|0x80);
  else $o=chr((($n&0x40000000)/0x40000000)|0xFC).chr(($n&0x3F000000)/0x1000000|0x80).chr(($n&0xFC0000)/0x40000|0x80).chr(($n&0x3F000)/0x1000|0x80).chr(($n&0xFC0)/0x40|0x80).chr($n&0x3F|0x80);
  $out .= $o;
}
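
Going the other way, here is the 3-byte case worked by hand for 0x20AC (the euro sign), using the mask-then-divide arithmetic from the routine above. The variable names $b1 to $b3 are just for illustration.

```php
<?php
// Worked example: encode 0x20AC with mask-then-divide arithmetic.
$n  = 0x20AC;
$b1 = (($n & 0xF000) / 0x1000) | 0xE0;  // top 4 bits    -> 0xE2
$b2 = (($n & 0x0FC0) / 0x40)   | 0x80;  // middle 6 bits -> 0x82
$b3 = ($n & 0x3F)              | 0x80;  // bottom 6 bits -> 0xAC
$o  = chr($b1) . chr($b2) . chr($b3);
echo bin2hex($o), "\n";                 // e282ac
```

Because the mask is applied first, each division here is exact, so PHP keeps the results as integers.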

Note that if you have a strongly typed language which knows that variables are integers, it would be more sensible to do the divides before the ands, hence:

  if ($n<0x80) $o=chr($n);
  else if ($n<0x800) $o=chr((($n/0x40)&0x1F)|0xC0).chr($n&0x3F|0x80);
  else if ($n<0x10000) $o=chr((($n/0x1000)&0x0F)|0xE0).chr(($n/0x40)&0x3F|0x80).chr($n&0x3F|0x80);
  else if ($n<0x200000) $o=chr((($n/0x40000)&0x07)|0xF0).chr(($n/0x1000)&0x3F|0x80).chr(($n/0x40)&0x3F|0x80).chr($n&0x3F|0x80);
  else if ($n<0x4000000) $o=chr((($n/0x1000000)&0x03)|0xF8).chr(($n/0x40000)&0x3F|0x80).chr(($n/0x1000)&0x3F|0x80).chr(($n/0x40)&0x3F|0x80).chr($n&0x3F|0x80);
  else $o=chr((($n/0x40000000)&0x01)|0xFC).chr(($n/0x1000000)&0x3F|0x80).chr(($n/0x40000)&0x3F|0x80).chr(($n/0x1000)&0x3F|0x80).chr(($n/0x40)&0x3F|0x80).chr($n&0x3F|0x80);

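but I fear that PHP would throw a wobbly and convert it into a float, which would not be good! (PHP's / operator does return a float whenever the division isn't exact. The right shift operator avoids the problem: $n>>6 is the integer equivalent of $n/0x40.)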
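
Here is a quick sketch of the float problem and one way round it, using right shifts in place of the divides. The function name codepoint_to_utf8 is invented, and it stops at 4 bytes to keep it short.

```php
<?php
// "/" gives a float unless the division is exact; ">>" always gives an int.
var_dump(0x20AC / 0x40);  // float(130.6875)
var_dump(0x20AC >> 6);    // int(130)

// Hypothetical helper: the encoder written with shifts (4 bytes maximum here).
function codepoint_to_utf8($n) {
  if ($n < 0x80)    return chr($n);
  if ($n < 0x800)   return chr(($n >> 6) | 0xC0) . chr(($n & 0x3F) | 0x80);
  if ($n < 0x10000) return chr(($n >> 12) | 0xE0)
                         . chr((($n >> 6) & 0x3F) | 0x80)
                         . chr(($n & 0x3F) | 0x80);
  return chr((($n >> 18) & 0x07) | 0xF0)
       . chr((($n >> 12) & 0x3F) | 0x80)
       . chr((($n >> 6) & 0x3F) | 0x80)
       . chr(($n & 0x3F) | 0x80);
}

echo bin2hex(codepoint_to_utf8(0x20AC)), "\n";  // e282ac
```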


A stream containing UTF-8 encoding may be prefixed by the character U+FEFF as a signature. This, in encoded form, appears as three bytes, namely EF BB BF. The signature is optional, and plenty of UTF-8 files are saved without one.
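
If the signature gets in the way (it will show up as three junk characters in anything that doesn't expect it), it can be removed before processing. A minimal sketch, with the made-up name strip_utf8_bom:

```php
<?php
// Hypothetical helper: drop a leading EF BB BF signature if one is present.
function strip_utf8_bom($s) {
  if (substr($s, 0, 3) === "\xEF\xBB\xBF") {
    return (string)substr($s, 3);  // cast covers older PHP returning false
  }
  return $s;
}

echo strip_utf8_bom("\xEF\xBB\xBFhello"), "\n";  // hello
echo strip_utf8_bom("hello"), "\n";              // hello
```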

For further details, see UTF-8 and Unicode FAQ.

Stop press!

But there is an easier way...

If the problem is that you're trying to output UTF-8 characters to an HTML page, which seems likely given that you're using PHP, you don't need to do any of this. Simply include this tag in the head section of the page
<META http-equiv=Content-Type content="text/html; charset=UTF-8">
and the browser should then understand the UTF-8 in its native form.
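
Since the page is coming out of PHP anyway, the same declaration can also be sent as a real HTTP header with PHP's header() function. A rough sketch follows; the function name utf8_page and the page layout are my own invention.

```php
<?php
// Hypothetical helper: build a page that declares UTF-8 both ways.
function utf8_page($title, $body) {
  // header() only works before any output has been sent, hence the check.
  if (!headers_sent()) header('Content-Type: text/html; charset=UTF-8');
  return "<html><head>\n"
       . "<META http-equiv=Content-Type content=\"text/html; charset=UTF-8\">\n"
       . "<title>$title</title></head>\n"
       . "<body>$body</body></html>\n";
}

echo utf8_page('Price', "Cost: \xE2\x82\xAC5");  // body contains a raw UTF-8 euro sign
```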


The code above is theoretical and hasn't actually been tested...

An easier way for web stuff

If you're struggling to get the right characters to appear on a webpage, the easiest way to force the correct characters is to use the &#nnn; format for specifying characters. It means large amounts of data, but you get the right results.
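
For instance, the euro sign is character number 8364, so it is written as &#8364;. Assuming you already have the character numbers in an array, a sketch of the conversion could look like this (numbers_to_entities is a made-up name):

```php
<?php
// Hypothetical helper: turn an array of character numbers into &#nnn; form.
function numbers_to_entities($in) {
  $out = '';
  foreach ($in as $n) {
    // Plain ASCII goes out as itself; everything else as a numeric reference.
    $out .= ($n < 0x80) ? chr($n) : '&#' . $n . ';';
  }
  return $out;
}

echo numbers_to_entities(array(0x43, 0x61, 0x66, 0xE9, 0x20, 0x20AC)), "\n";
// Caf&#233; &#8364;
```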

The first point to note is that PHP doesn't convert anything when it reads a file: whatever format the file is in on disk, you get exactly those bytes in your string. So the routine below only makes sense if the file was saved as UTF-8 (not as 16-bit Unicode). Here it is, converting the UTF-8 format into HTML-ready characters. And it has been tested !!

function utf8_html ($in) { // Convert UTF-8 strings to HTML-happy format
  $out = '';
  for ($i=0;$i<strlen($in);$i++) {
    $c = ord($in[$i]); // the lead byte tells us how many bytes follow
    if (($c&0x80)==0x00) $o=$c;
    else if (($c&0xE0)==0xC0) $o=((($c&0x1F)*0x40)|(ord($in[++$i])&0x3F));
    else if (($c&0xF0)==0xE0) $o=(($c&0x0F)*0x1000)|((ord($in[++$i])&0x3F)*0x40)|(ord($in[++$i])&0x3F);
    else if (($c&0xF8)==0xF0) $o=(($c&0x07)*0x40000)|((ord($in[++$i])&0x3F)*0x1000)|((ord($in[++$i])&0x3F)*0x40)|(ord($in[++$i])&0x3F);
    else if (($c&0xFC)==0xF8) $o=(($c&0x03)*0x1000000)|((ord($in[++$i])&0x3F)*0x40000)|((ord($in[++$i])&0x3F)*0x1000)|((ord($in[++$i])&0x3F)*0x40)|(ord($in[++$i])&0x3F);
    else if (($c&0xFE)==0xFC) $o=(($c&0x01)*0x40000000)|((ord($in[++$i])&0x3F)*0x1000000)|((ord($in[++$i])&0x3F)*0x40000)|((ord($in[++$i])&0x3F)*0x1000)|((ord($in[++$i])&0x3F)*0x40)|(ord($in[++$i])&0x3F);
    // 0xFE and 0xFF never occur in UTF-8 (they make up the 2-byte Unicode marker),
    // so they get ignored, as do stray continuation bytes.
    else continue;
    // Only plain ASCII goes out as a raw byte; everything else becomes &#nnn;
    $out .= ($o < 0x80) ? chr($o) : '&#'.$o.';';
  }
  return $out;
}

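Oh, by the way, don't use this if the file isn't in UTF-8 format, otherwise it will mistreat any characters in the range 0x80 to 0xFF. So, what you could do is look for the UTF-8 signature when you open the file (bearing in mind that a UTF-8 file saved without the signature will be treated as not being UTF-8):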

  $fh = fopen($filename,'r') or die ("Failed to open $filename");
  $temp = fread ($fh,3);
  $utf8=($temp == chr(0xEF).chr(0xBB).chr(0xBF)); // The UTF-8 signature
  if (!$utf8) fseek($fh,0); // If it's not UTF-8 then rewind to the start
  while (($line=fgets($fh))!==false) { // stop at end of file, not at the first blank line
    $line = trim($line);
    $html = $utf8 ? utf8_html($line) : $line;
  }
  fclose($fh);
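
Since the signature is optional, a file without one may still be UTF-8. One alternative test, offered here only as a sketch, leans on the fact that PHP's preg_match refuses to process invalid UTF-8 when the /u modifier is used. The function name looks_like_utf8 is made up.

```php
<?php
// Hypothetical helper: guess whether a string is UTF-8 without relying on
// the EF BB BF signature. With the /u modifier, preg_match returns false
// (rather than 1) when the subject is not valid UTF-8.
function looks_like_utf8($s) {
  return preg_match('//u', $s) === 1;
}

var_dump(looks_like_utf8("caf\xC3\xA9"));  // bool(true):  e-acute as 2 bytes
var_dump(looks_like_utf8("caf\xE9"));      // bool(false): lone Latin-1 byte
```

Plain ASCII text passes this test too, which is harmless here because ASCII comes out of utf8_html unchanged.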

Whew! Hope that helps...