Monday, October 31, 2011

PHP UTF-8 character to decimal converter

A while ago I was developing a connector to an SMS sending API. Because SMS messages are partly transmitted in a limited 7-bit character set (GSM 03.38, http://en.wikipedia.org/wiki/GSM_03.38), not all characters are supported. To validate every character in a message, it is useful, at least for debugging, to convert each character to its decimal code point and check it against a lookup array.
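To make the lookup idea concrete, here is a minimal sketch. It is not the connector's actual code: the array holds only the printable characters of the GSM 03.38 extension table (the ones used in the sample string further down), not the full alphabet, and the names $gsmExtension and findUnsupported are made up for illustration.

```php
<?php
// Illustrative subset only: printable characters of the GSM 03.38
// extension table, keyed by Unicode code point (decimal).
$gsmExtension = array(
    91   => '[',
    92   => '\\',
    93   => ']',
    94   => '^',
    123  => '{',
    124  => '|',
    125  => '}',
    126  => '~',
    8364 => "\xE2\x82\xAC", // the euro sign, written as its UTF-8 bytes
);

// Hypothetical helper: return the code points that are NOT keys of $allowed.
function findUnsupported(array $codePoints, array $allowed)
{
    $unsupported = array();
    foreach ($codePoints as $cp)
    {
        if (!isset($allowed[$cp]))
        {
            $unsupported[] = $cp;
        }
    }
    return $unsupported;
}

// 124 is '|', 8364 is the euro sign, 9731 is U+2603 (a snowman),
// which GSM 03.38 does not contain.
$rejected = findUnsupported(array(124, 8364, 9731), $gsmExtension);
echo implode(", ", $rejected); // 9731
```

A real validator would also need the basic GSM 03.38 table in the lookup array. Keying the array by code point keeps each check down to a single isset() call.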

Correct me if I am wrong, but in my opinion PHP does not really support UTF-8 development: its core string functions work on bytes, not on characters. That is why my function inspects the bytes at a low level.
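A quick way to see the byte orientation for yourself is the small sketch below. The variable names are only for illustration, and the euro sign is written as escaped bytes so the result does not depend on how the file is saved.

```php
<?php
// The three UTF-8 bytes of the euro sign (U+20AC).
$euro = "\xE2\x82\xAC";

// strlen() counts bytes, not characters.
$byteCount = strlen($euro);

// String indexing returns one byte, here the lead byte of the sequence.
$firstByte = bin2hex($euro[0]);

echo $byteCount." ".$firstByte; // 3 e2
```

One visible character is reported as three units, and indexing cuts it apart. The converter has to handle exactly this.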

Simply set the variable $string to the text you want to convert. The output lists each decimal value and its character, one pair per row.

<html xmlns="http://www.w3.org/1999/xhtml" 
   xml:lang="en-us" lang="en-us" dir="ltr" >
<head>
   <meta http-equiv="content-type" content="text/html; 
      charset=UTF-8" />
</head>
<body>
<?php
$string = "|^€{}[~]\\";
$count = 0;

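// Advance by the length of the decoded character, or by one byte when the
// sequence is invalid ($count is 0 then), so the loop can never stall.
for ($i = 0; $i < strlen($string); $i += max($count, 1))
{
    $code = ordUTF8($string, $i, $count);

    if ($code === false)
    {
        echo "invalid byte 0x".bin2hex($string[$i])."<br />";
        continue;
    }

    // Print the whole character, not just its first byte.
    echo $code." ".htmlspecialchars(substr($string, $i, $count))."<br />";
}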

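// Returns the code point of the UTF-8 character starting at byte $index,
// or false if no valid sequence starts there. $bytes receives the length
// of the sequence in bytes (0 on failure).
// Note: continuation bytes are not checked for the 10xxxxxx pattern.
function ordUTF8($string, $index = 0, &$bytes = null)
{
    $len = strlen($string);
    $bytes = 0;

    if ($index >= $len)
    {
        return false;
    }

    // Lead byte: its high bits tell how long the sequence is.
    // Square brackets replace the old {} offset syntax, which PHP 8 removed.
    $h = ord($string[$index]);

    if ($h <= 0x7F)
    {
        // 0xxxxxxx: plain ASCII, one byte.
        $bytes = 1;
        return $h;
    }
    else if ($h < 0xC2)
    {
        // Stray continuation byte (0x80-0xBF) or overlong lead (0xC0, 0xC1).
        return false;
    }
    else if ($h <= 0xDF && $index < $len - 1)
    {
        // 110xxxxx 10xxxxxx
        $bytes = 2;
        return ($h & 0x1F) << 6
            | (ord($string[$index + 1]) & 0x3F);
    }
    else if ($h <= 0xEF && $index < $len - 2)
    {
        // 1110xxxx 10xxxxxx 10xxxxxx
        $bytes = 3;
        return ($h & 0x0F) << 12
            | (ord($string[$index + 1]) & 0x3F) << 6
            | (ord($string[$index + 2]) & 0x3F);
    }
    else if ($h <= 0xF4 && $index < $len - 3)
    {
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        $bytes = 4;
        return ($h & 0x07) << 18
            | (ord($string[$index + 1]) & 0x3F) << 12
            | (ord($string[$index + 2]) & 0x3F) << 6
            | (ord($string[$index + 3]) & 0x3F);
    }
    else
    {
        // Lead byte above 0xF4, or the string ends in mid-sequence.
        return false;
    }
}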
?>
</body>
</html>
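
To see what the masks and shifts in the three-byte branch do, here is the arithmetic worked through by hand for the euro sign, whose UTF-8 bytes are E2 82 AC. This is only an illustration with made-up variable names, not part of the converter itself.

```php
<?php
// The three UTF-8 bytes of the euro sign.
$b0 = 0xE2; // 1110 0010  lead byte, payload is the low 4 bits
$b1 = 0x82; // 10 000010  continuation, payload is the low 6 bits
$b2 = 0xAC; // 10 101100  continuation, payload is the low 6 bits

$high   = ($b0 & 0x0F) << 12; // 0x2000
$middle = ($b1 & 0x3F) << 6;  // 0x0080
$low    = ($b2 & 0x3F);       // 0x002C

$codePoint = $high | $middle | $low;

echo $codePoint." ".sprintf("U+%04X", $codePoint); // 8364 U+20AC
```

The result, 8364, is the value the page prints next to the euro sign.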
