Solution C008

C008: Unicode, UTF-8 and PHP

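A PHP string is a sequence of single bytes, so one element of a string cannot hold a character number larger than 8 bits. To cope with large character sets, the characters are encoded in UTF-8. Standard 7-bit ASCII codes are stored as a single byte. Anything bigger than 7 bits takes more than one byte, and the larger the number, the more bytes it takes up. (The 5- and 6-byte forms come from the original UTF-8 design; the current standard, RFC 3629, stops at 4 bytes, which covers everything up to U+10FFFF.) The layout is: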

bytes bits code
   1    7  0bbbbbbb 
   2   11  110bbbbb 10bbbbbb 
   3   16  1110bbbb 10bbbbbb 10bbbbbb 
   4   21  11110bbb 10bbbbbb 10bbbbbb 10bbbbbb 
   5   26  111110bb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 
   6   31  1111110b 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 
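
As a rough illustration of the table, the sketch below works out how many bytes a given character number needs. The function name utf8_byte_count is made up for this example; it is not a PHP built-in.

```php
<?php
// Hypothetical helper: number of UTF-8 bytes needed for character number $n,
// following the bytes/bits table above.
function utf8_byte_count(int $n): int
{
    if ($n < 0x80)      return 1; //  7 bits
    if ($n < 0x800)     return 2; // 11 bits
    if ($n < 0x10000)   return 3; // 16 bits
    if ($n < 0x200000)  return 4; // 21 bits
    if ($n < 0x4000000) return 5; // 26 bits
    return 6;                     // 31 bits
}

echo utf8_byte_count(0x41), "\n";   // 'A': prints 1
echo utf8_byte_count(0xE9), "\n";   // e-acute: prints 2
echo utf8_byte_count(0x20AC), "\n"; // euro sign: prints 3
```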

Some software uses fixed 2-byte Unicode instead (UCS-2 or UTF-16). This sometimes makes things easier: it's pretty easy to know how many characters you have in your file (it's (file length - 2) / 2, as long as every character fits in two bytes). To identify such files, a special marker is put at the start of the file. This is a two-byte code: the first byte is 0xFF and the second 0xFE, which also tells you that the low byte of each character comes first. Windows frequently uses this format when saving web pages.
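
Here is a small sketch of that marker check and character count. Both function names are invented for the example, and it assumes the whole file has already been read into a string and that every character occupies exactly two bytes.

```php
<?php
// Hypothetical helpers for a fixed 2-byte file that starts with the FF FE marker.
function has_ff_fe_marker(string $bytes): bool
{
    return strlen($bytes) >= 2 && $bytes[0] === "\xFF" && $bytes[1] === "\xFE";
}

function two_byte_char_count(string $bytes): int
{
    // Drop the 2-byte marker, then count two bytes per character.
    return intdiv(strlen($bytes) - 2, 2);
}

$sample = "\xFF\xFE" . "H\x00" . "i\x00"; // "Hi", low byte first
var_dump(has_ff_fe_marker($sample));      // bool(true)
echo two_byte_char_count($sample), "\n";  // prints 2
```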

PHP has a couple of useful functions as standard, utf8_encode and utf8_decode. However, these only convert between UTF-8 and single-byte ISO-8859-1 (Latin-1) strings, so if you're dealing with character sets whose numbers go beyond 0xFF, you've got problems. (Both functions are deprecated as of PHP 8.2.)
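
To show where that limit comes from, here is a hand-written sketch of the kind of conversion utf8_encode performs: every input byte is taken as one ISO-8859-1 character, so nothing above 0xFF can ever go in. The name latin1_to_utf8 is invented for this example.

```php
<?php
// Hypothetical stand-in for the ISO-8859-1 to UTF-8 direction.
// Each input byte is one character in the range 0x00 to 0xFF.
function latin1_to_utf8(string $in): string
{
    $out = '';
    for ($i = 0, $len = strlen($in); $i < $len; $i++) {
        $c = ord($in[$i]);
        if ($c < 0x80) {
            $out .= chr($c);                   // ASCII: one byte, unchanged
        } else {
            $out .= chr(0xC0 | ($c >> 6))      // lead byte: 110bbbbb
                  . chr(0x80 | ($c & 0x3F));   // continuation byte: 10bbbbbb
        }
    }
    return $out;
}

echo bin2hex(latin1_to_utf8("caf\xE9")), "\n"; // prints 636166c3a9
```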

There are some add-on solutions (the mbstring and iconv extensions, for example), but since you often don't have control of the PHP build, here's a routine to convert UTF-8 into an array of character numbers. Your input string should be in $in:

$out = array();
for ($i=0;$i<strlen($in);$i++) {
  $c=ord($in[$i]); // lead byte: its top bits say how many bytes follow
  if (($c&0x80)==0x00) $out[]=$c; // 0bbbbbbb: plain ASCII
  else if (($c&0xE0)==0xC0) $out[]=(($c&0x1F)*0x40)|(ord($in[++$i])&0x3F); // 2 bytes
  else if (($c&0xF0)==0xE0) $out[]=(($c&0x0F)*0x1000)|((ord($in[++$i])&0x3F)*0x40)|(ord($in[++$i])&0x3F); // 3 bytes
  else if (($c&0xF8)==0xF0) $out[]=(($c&0x07)*0x40000)|((ord($in[++$i])&0x3F)*0x1000)|((ord($in[++$i])&0x3F)*0x40)|(ord($in[++$i])&0x3F); // 4 bytes
  else if (($c&0xFC)==0xF8) $out[]=(($c&0x03)*0x1000000)|((ord($in[++$i])&0x3F)*0x40000)|((ord($in[++$i])&0x3F)*0x1000)|((ord($in[++$i])&0x3F)*0x40)|(ord($in[++$i])&0x3F); // 5 bytes
  else if (($c&0xFE)==0xFC) $out[]=(($c&0x01)*0x40000000)|((ord($in[++$i])&0x3F)*0x1000000)|((ord($in[++$i])&0x3F)*0x40000)|((ord($in[++$i])&0x3F)*0x1000)|((ord($in[++$i])&0x3F)*0x40)|(ord($in[++$i])&0x3F); // 6 bytes
  // Any other byte (a stray 10bbbbbb, 0xFE or 0xFF) is skipped. There is no check for a sequence cut short at the end of $in.
}
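
For anyone who wants to try this out, here is a sketch of the same decoding wrapped in a function. It is a restructured variant rather than the routine above: the continuation bytes are handled in a loop, stray bytes are skipped, and a sequence cut short at the end of the string stops the decoding. The function name utf8_to_numbers is made up.

```php
<?php
// Hypothetical wrapper: turn a UTF-8 string into an array of character numbers.
function utf8_to_numbers(string $in): array
{
    $out = array();
    $len = strlen($in);
    for ($i = 0; $i < $len; $i++) {
        $c = ord($in[$i]);
        if      (($c & 0x80) == 0x00) { $n = $c;        $extra = 0; }
        else if (($c & 0xE0) == 0xC0) { $n = $c & 0x1F; $extra = 1; }
        else if (($c & 0xF0) == 0xE0) { $n = $c & 0x0F; $extra = 2; }
        else if (($c & 0xF8) == 0xF0) { $n = $c & 0x07; $extra = 3; }
        else if (($c & 0xFC) == 0xF8) { $n = $c & 0x03; $extra = 4; }
        else if (($c & 0xFE) == 0xFC) { $n = $c & 0x01; $extra = 5; }
        else continue;                      // stray byte: skip it
        if ($i + $extra >= $len) break;     // sequence cut short: stop
        while ($extra-- > 0) {
            $n = ($n << 6) | (ord($in[++$i]) & 0x3F); // add 6 bits per byte
        }
        $out[] = $n;
    }
    return $out;
}

// 'A', e-acute and the euro sign: gives 65, 233 and 8364
print_r(utf8_to_numbers("A\xC3\xA9\xE2\x82\xAC"));
```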

If you have an array of character numbers and want to turn them into UTF-8, here's how to do it. Your input array is $in and your output string will be $out.

$out='';
for($i=0;$i<sizeof($in);$i++) {
  $n=$in[$i];
  if ($n<0x80) $o=chr($n); // chr() is needed here too, otherwise the number itself gets appended
  else if ($n<0x800) $o=chr((($n&0x7C0)/0x40)|0xC0).chr($n&0x3F|0x80);
  else if ($n<0x10000) $o=chr((($n&0xF000)/0x1000)|0xE0).chr(($n&0xFC0)/0x40|0x80).chr($n&0x3F|0x80);
  else if ($n<0x200000) $o=chr((($n&0x1C0000)/0x40000)|0xF0).chr(($n&0x3F000)/0x1000|0x80).chr(($n&0xFC0)/0x40|0x80).chr($n&0x3F|0x80);
  else if ($n<0x4000000) $o=chr((($n&0x3000000)/0x1000000)|0xF8).chr(($n&0xFC0000)/0x40000|0x80).chr(($n&0x3F000)/0x1000|0x80).chr(($n&0xFC0)/0x40|0x80).chr($n&0x3F|0x80);
  else $o=chr((($n&0x40000000)/0x40000000)|0xFC).chr(($n&0x3F000000)/0x1000000|0x80).chr(($n&0xFC0000)/0x40000|0x80).chr(($n&0x3F000)/0x1000|0x80).chr(($n&0xFC0)/0x40|0x80).chr($n&0x3F|0x80);
  $out .= $o;
}

Note that if you have a strongly typed language which knows that variables are integers, it would be more sensible to do the divides before the ands, hence:

  if ($n<0x80) $o=chr($n);
  else if ($n<0x800) $o=chr((($n/0x40)&0x1F)|0xC0).chr($n&0x3F|0x80);
  else if ($n<0x10000) $o=chr((($n/0x1000)&0x0F)|0xE0).chr(($n/0x40)&0x3F|0x80).chr($n&0x3F|0x80);
  else if ($n<0x200000) $o=chr((($n/0x40000)&0x07)|0xF0).chr(($n/0x1000)&0x3F|0x80).chr(($n/0x40)&0x3F|0x80).chr($n&0x3F|0x80);
  else if ($n<0x4000000) $o=chr((($n/0x1000000)&0x03)|0xF8).chr(($n/0x40000)&0x3F|0x80).chr(($n/0x1000)&0x3F|0x80).chr(($n/0x40)&0x3F|0x80).chr($n&0x3F|0x80);
  else $o=chr((($n/0x40000000)&0x01)|0xFC).chr(($n/0x1000000)&0x3F|0x80).chr(($n/0x40000)&0x3F|0x80).chr(($n/0x1000)&0x3F|0x80).chr(($n/0x40)&0x3F|0x80).chr($n&0x3F|0x80);

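but in PHP the / operator returns a float whenever the division isn't exact. The & then has to squeeze that back into an integer, which PHP 8.1 and later report as deprecated, so this would not be good! Using the shift operator instead ($n>>6 in place of $n/0x40, $n>>12 in place of $n/0x1000, and so on) keeps everything as integers.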
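
The worry about floats can be sidestepped altogether by shifting instead of dividing. The sketch below is one way to write the encoder like that. The function name numbers_to_utf8 is invented, and it assumes every input number lies between 0 and 0x7FFFFFFF.

```php
<?php
// Hypothetical function name; same byte layout as the routines above,
// but using >> so every intermediate value stays an integer.
function numbers_to_utf8(array $in): string
{
    $out = '';
    foreach ($in as $n) {
        if ($n < 0x80) {
            $out .= chr($n);                 // plain ASCII, one byte
            continue;
        }
        if      ($n < 0x800)     { $extra = 1; $lead = 0xC0; }
        else if ($n < 0x10000)   { $extra = 2; $lead = 0xE0; }
        else if ($n < 0x200000)  { $extra = 3; $lead = 0xF0; }
        else if ($n < 0x4000000) { $extra = 4; $lead = 0xF8; }
        else                     { $extra = 5; $lead = 0xFC; }
        // Lead byte: marker bits plus the topmost bits of the number.
        $out .= chr($lead | ($n >> (6 * $extra)));
        // Continuation bytes: 10bbbbbb, six bits at a time, high bits first.
        for ($k = $extra - 1; $k >= 0; $k--) {
            $out .= chr(0x80 | (($n >> (6 * $k)) & 0x3F));
        }
    }
    return $out;
}

echo bin2hex(numbers_to_utf8(array(0x41, 0xE9, 0x20AC))), "\n"; // prints 41c3a9e282ac
```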

Header

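A stream containing UTF-8 encoding may be prefixed with the character U+FEFF as a signature (also known as a byte order mark). This, in encoded form, appears as three bytes, namely EF BB BF. The signature is optional, and plenty of UTF-8 files don't have one.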
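
As a quick sketch of dealing with those three bytes, the helper below removes them from the start of a string if they are there. The name strip_utf8_signature is made up for this example.

```php
<?php
// Hypothetical helper: remove a leading UTF-8 signature (EF BB BF) if present.
function strip_utf8_signature(string $s): string
{
    if (strncmp($s, "\xEF\xBB\xBF", 3) === 0) {
        return substr($s, 3);
    }
    return $s;
}

// U+FEFF is 1111 1110 1111 1111 in binary; split 4 + 6 + 6 for the 3-byte form:
//   1110 1111   10 111011   10 111111   =>   EF BB BF
echo strip_utf8_signature("\xEF\xBB\xBFhello"), "\n"; // prints hello
```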

For further details, see UTF-8 and Unicode FAQ.

Stop press!

But there is an easier way...

If the problem is that you're trying to output UTF-8 characters to an HTML page, which seems likely given that you're using PHP, you don't need to do any of this. Simply include the following tag in the head section of the page
<META http-equiv=Content-Type content="text/html; charset=UTF-8">
and the browser should then understand the UTF-8 in its native form.
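
A minimal sketch of doing this from PHP follows. The helper name utf8_page is invented. The header() call sends the same declaration as a real HTTP header, which has to happen before any output goes out.

```php
<?php
// Hypothetical helper that builds a minimal page declaring UTF-8.
function utf8_page(string $title, string $body): string
{
    return "<html><head>\n"
         . "<META http-equiv=Content-Type content=\"text/html; charset=UTF-8\">\n"
         . "<title>$title</title>\n"
         . "</head><body>\n$body\n</body></html>\n";
}

// Send the declaration as an HTTP header too, if nothing has been output yet.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}
echo utf8_page('Test', "Price: \xE2\x82\xAC10"); // euro sign as raw UTF-8 bytes
```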

PS

The code above is theoretical and hasn't actually been tested...

An easier way for web stuff

If you're struggling to get the right characters to appear on a webpage, the easiest way to force the correct characters is to use the &#nnn; format for specifying characters (a numeric character reference, where nnn is the character number in decimal). It means larger amounts of data, but you get the right results.
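
A tiny sketch of the format, with an invented helper name numeric_ref, also shows the size penalty: the euro sign is 3 bytes as UTF-8 but 7 bytes as a reference.

```php
<?php
// Hypothetical helper: the &#nnn; form for a single character number.
function numeric_ref(int $n): string
{
    return '&#' . $n . ';';
}

echo numeric_ref(0x20AC), "\n";         // prints &#8364; (the euro sign)
echo numeric_ref(0x3A9), "\n";          // prints &#937; (Greek capital omega)
echo strlen(numeric_ref(0x20AC)), "\n"; // prints 7
```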

The first point to note is that PHP doesn't convert anything for you: whatever format a file has on disk, PHP hands you its bytes exactly as they are. The routine below therefore expects UTF-8, and a file saved as 16-bit Unicode has to be converted to UTF-8 first (iconv or mb_convert_encoding can do this if your PHP build has them). So here's a routine for converting UTF-8 into HTML-ready characters. And it has been tested!!

function utf8_html ($in) // Convert UTF-8 strings to HTML-happy format
{
  $out = '';
  for ($i=0;$i<strlen($in);$i++) {
    $c=ord($in[$i]);
    if (($c&0x80)==0x00) $o=$c;
    else if (($c&0xE0)==0xC0) $o=((($c&0x1F)*0x40)|(ord($in[++$i])&0x3F));
    else if (($c&0xF0)==0xE0) $o=(($c&0x0F)*0x1000)|((ord($in[++$i])&0x3F)*0x40)|(ord($in[++$i])&0x3F);
    else if (($c&0xF8)==0xF0) $o=(($c&0x07)*0x40000)|((ord($in[++$i])&0x3F)*0x1000)|((ord($in[++$i])&0x3F)*0x40)|(ord($in[++$i])&0x3F);
    else if (($c&0xFC)==0xF8) $o=(($c&0x03)*0x1000000)|((ord($in[++$i])&0x3F)*0x40000)|((ord($in[++$i])&0x3F)*0x1000)|((ord($in[++$i])&0x3F)*0x40)|(ord($in[++$i])&0x3F);
    else if (($c&0xFE)==0xFC) $o=(($c&0x01)*0x40000000)|((ord($in[++$i])&0x3F)*0x1000000)|((ord($in[++$i])&0x3F)*0x40000)|((ord($in[++$i])&0x3F)*0x1000)|((ord($in[++$i])&0x3F)*0x40)|(ord($in[++$i])&0x3F);
    else continue; // a stray 10bbbbbb byte, 0xFE or 0xFF is not valid here, so ignore it
    if ($o == 0xFEFF) continue; // the UTF-8 signature (EF BB BF) decodes to U+FEFF, so drop it
    $out .= ($o < 0x80) ? chr($o) : '&#'.$o.';'; // ASCII stays as it is, everything else becomes &#nnn;
  }
  return $out;
}

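Oh, by the way, don't use this if the file isn't in UTF-8 format, otherwise it will mistreat any characters in the range 0x80 to 0xFF. So, what you could do is look for the UTF-8 signature when you open the file (bearing in mind that a UTF-8 file saved without the signature will then be treated as not being UTF-8):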

  $fh = fopen($filename,'r') or die ("Failed to open $filename");
  $temp = fread ($fh,3);
  $utf8=($temp === chr(0xEF).chr(0xBB).chr(0xBF)); // The UTF-8 signature
  if (!$utf8) fseek($fh,0); // If it's not UTF-8 then rewind to the start
  while (($line=fgets($fh)) !== false) { // Run to end of file, not just to the first blank line
    $line = trim($line);
    $html = $utf8 ? utf8_html($line) : $line;
...
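
The signature test only helps with files that actually carry one. As an alternative sketch, the bytes themselves can be checked: with the /u modifier, preg_match() returns false when its subject is not valid UTF-8. The helper name looks_like_utf8 is invented, and note that plain ASCII text passes the check as well.

```php
<?php
// Hypothetical helper: guess whether $s is UTF-8 even when no signature is present.
// The empty pattern always matches, so anything other than 1 means the subject was rejected.
function looks_like_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

var_dump(looks_like_utf8("caf\xC3\xA9")); // valid two-byte sequence: bool(true)
var_dump(looks_like_utf8("caf\xE9"));     // lone Latin-1 byte: bool(false)
```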

Whew! Hope that helps...