Detecting UTF BOM – byte order mark

When integrating systems with many different data sources across Europe, you are bound to eventually run into issues with UTF-8 and national character sets such as ISO-8859-1, which is commonly used for Swedish. Even when parsing simple UTF-8 files with comma-separated values, things might pop up to bite you.

One such thing is the UTF byte order mark, or BOM. The byte order mark is the Unicode code point U+FEFF, which UTF-8 encodes as three bytes (0xef, 0xbb and 0xbf) that sit at the beginning of the text file. In UTF-16 it indicates the byte order. In UTF-8, which has no byte-order ambiguity, it is not really necessary.
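
To make the byte sequences concrete, here is a small sketch that prints how U+FEFF is serialized in UTF-8 and in the two UTF-16 byte orders. It assumes PHP 7.0 or later for the "\u{...}" escape, and the variable names are only illustrative.

```php
<?php
// U+FEFF encoded as UTF-8; the "\u{...}" escape requires PHP 7.0 or later.
$utf8Bom = "\u{FEFF}";

// The same code point as one 16-bit unit, big-endian ("n") and little-endian ("v").
$utf16BeBom = pack('n', 0xFEFF);
$utf16LeBom = pack('v', 0xFEFF);

echo bin2hex($utf8Bom), "\n";     // efbbbf
echo bin2hex($utf16BeBom), "\n";  // feff
echo bin2hex($utf16LeBom), "\n";  // fffe
```

In UTF-16 the two possible byte orders give two different sequences, which is what lets a reader tell them apart. UTF-8 only ever produces the one sequence.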

But for UTF-8, especially on Windows, it has become more and more common to use it to indicate that the file is indeed UTF-8. Most text editors handle this well and you won’t ever see these bytes, which is as it should be.

The problems start when you use PHP’s binary-safe string functions such as strcmp() and substr(). These three bytes, which are not visible even in var_dump() output, can then become bothersome. (You would, however, see that the string length reported by var_dump() is correct and counts the invisible bytes.)
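
A minimal sketch of how this shows up in practice, assuming a hypothetical CSV header field called "name" that was read from a file saved with a BOM:

```php
<?php
// Hypothetical first header field of a CSV file that was saved with a UTF-8 BOM.
$field = "\xEF\xBB\xBF" . 'name';

var_dump($field);                  // reports a length of 7, although only "name" is visible
var_dump(strlen($field));          // int(7)
var_dump(strcmp($field, 'name'));  // non-zero: the strings differ
var_dump(substr($field, 0, 4));    // the three BOM bytes plus "n", not "name"
```

The comparison fails and the substring is off by three bytes, even though the field looks perfectly normal when printed.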

So you need to detect the three bytes and remove the BOM. Below is a simplified example of how to do that.

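$str = file_get_contents('file.utf8.csv');
// The UTF-8 BOM as a three-byte binary string
$bom = pack("CCC", 0xef, 0xbb, 0xbf);
// Compare only the first three bytes of the file contents with the BOM
if (0 === strncmp($str, $bom, 3)) {
	echo "BOM detected - file is UTF-8\n";
	// Drop the BOM and keep the rest
	$str = substr($str, 3);
}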

It’s as simple as that.
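
If the file is large, or if UTF-16 input is also possible, the same check can be done on just the first few bytes. The following is only a sketch under those assumptions: the function names detect_bom() and strip_utf8_bom() are made up for illustration, and a php://memory stream stands in for a real file so the example runs on its own.

```php
<?php
// Hypothetical helper: name the BOM found at the start of $head, or return null.
function detect_bom(string $head): ?string
{
    $boms = [
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE", // note: a UTF-32LE BOM starts with the same two bytes
    ];
    foreach ($boms as $name => $bom) {
        if (strncmp($head, $bom, strlen($bom)) === 0) {
            return $name;
        }
    }
    return null;
}

// Hypothetical helper: return $str without a leading UTF-8 BOM.
function strip_utf8_bom(string $str): string
{
    return detect_bom($str) === 'UTF-8' ? substr($str, 3) : $str;
}

// Stand-in for fopen('file.utf8.csv', 'rb'): an in-memory stream that starts with a BOM.
$fh = fopen('php://memory', 'w+b');
fwrite($fh, "\xEF\xBB\xBFid,name\n1,Åsa\n");
rewind($fh);

$head = fread($fh, 3);            // read only the first three bytes
$encoding = detect_bom($head);
if ($encoding !== 'UTF-8') {
    rewind($fh);                  // no UTF-8 BOM: start again from byte 0
}
$firstLine = fgets($fh);          // first line, with the BOM already skipped
fclose($fh);

echo $encoding, "\n";
echo $firstLine;
```

Reading three bytes and continuing from there avoids loading the whole file just to inspect its start. For a UTF-16 file, stripping the BOM is not enough on its own, since the rest of the content still needs converting.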

Posted in PHP
22 comments on “Detecting UTF BOM – byte order mark”
  1. Paul Boisson says:

    Quick word to say that I’d rather use "\xef\xbb\xbf" than pack(). It feels more readable.

  2. Brutos says:

    Loading the whole file into memory if you need just the first three bytes seems like a waste of memory.

  3. Danne says:

    @Paul True. Both ways work.
    @Brutos. It is an example. Nothing else.

  4. spike says:

    Dude. Thanks for this!

    I spent the last hour (googling like crazy) trying to figure out the best way to detect and handle the BOM that was coming from a webservice. I was considering contacting the vendor and asking that he not use the BOM and instead use an http header, but this worked best.

    btw, I iterate over several BOMs to figure out which is being used (0xfffe, 0xfeff and the above) and then handle appropriately.

  5. Have been there, it’s really annoying. If you want to detect more types of UTF BOMs, you might have a look at my articles:

    http://artur.ejsmont.org/blog/content/annoying-utf-byte-order-marks

    UTF-8 BOMs can be safely removed using scripts; others may need proper file conversion:

    http://artur.ejsmont.org/blog/Convertig-file-encoding-to-UTF-8-from-other-Unicode-and-removing-Byte-Order-Marks

  6. Thanks for this simple but great detection.
    I just implemented that into my MySQLDumper-App that now can handle dumps saved with BOM.

  7. Atanas Vasilev says:

    Here’s a short snippet I use to dig out all files containing the BOM sequence recursively in a folder:

    http://pastebin.com/dHbqjUNy

  8. Nathan Kelly says:

    Atanas Vasilev – your code is genius, thanks for sharing.

  9. VoxAppeal says:

    I spent a couple of hours searching for a way of removing the BOM from a text file, & after various scripts & codes in php, perl & others, I found what I was looking for – a simple text editor with the option of encoding in UTF-8 without BOM, Notepad++:
    http://sourceforge.net/projects/notepad-plus/

    A simple properties check already shows 3 bytes less.

    Easy-peasy!

  10. Nacho says:

    Thanks! I included this in a loop so it can recursively check all files in a project:

    foreach( dir_list( dirname( __FILE__ ) . DIRECTORY_SEPARATOR ) as $filename) {
        if( check_extension( $filename ) ) detect_and_remove_DOM( $filename );
    }

    // True if the file name ends with one of the listed extensions
    function check_extension( $filename ) {
        $check_estensions = array( '.php' );
        return ( in_array( substr( $filename, (strlen($filename)-4), 4 ) , $check_estensions) );
    }

    // Strip a leading UTF-8 BOM, if present, and write the file back
    function detect_and_remove_DOM( $filename ) {
        echo "$filename\n";
        $str = file_get_contents($filename);
        $bom = pack("CCC", 0xef, 0xbb, 0xbf);
        if (0 == strncmp($str, $bom, 3)) {
            echo "BOM detected - file is UTF-8\n";
            $str = substr($str, 3);
        }
        return file_put_contents( $filename, $str );
    }

    // Recursively list readable files below $dir, skipping hidden entries
    function dir_list($dir) {
        $retval = array();

        $path = $dir;
        if(substr($dir, -1) != "/") $dir .= "/";

        $d = @dir($dir);

        if(!$d) {
            // DEBUG is not defined in this snippet, so check for it before use
            if (defined('DEBUG') && DEBUG) echo "getFileList: Failed opening directory $dir for reading";
            return array();
        }

        while(false !== ($entry = $d->read())) {
            if($entry[0] == ".") continue; // skip hidden files

            if(is_dir("$dir$entry")) {
                $retval = array_merge( $retval, dir_list( $dir . $entry . DIRECTORY_SEPARATOR ) );
            }
            elseif(is_readable("$dir$entry")) {
                $retval[] = "$path$entry";
            }
        }

        $d->close();

        return $retval;
    }

  11. Devtrix.net says:

    Though I used "\xef\xbb\xbf" and strpos() for detecting the UTF-8 BOM, you helped me with a quick solution.
    Thanks!

  12. Alex says:

    THANK YOU!

    A few years ago I spent a month searching for this info.

    I finally just removed the 3 chars... and it became problematic.

    Your solution saved me today!

  13. magikMaker says:

    Thanks for posting this.
    Finally a way to get rid of these annoying BOM characters п»ї

  14. Richard says:

    Thanks so much this sorted my problem!

  15. Ben K. says:

    This solution (technically, a modification thereof) was just what I was looking for. (I was having a devil of a time getting my RSS feed to connect to a 3rd party app; turns out the BOM was (somehow) getting in the way.) Thanks!

  17. I can’t wait to read another article

  18. Lars says:

    Hi, here are some small Linux comments …

    # find php-files with UTF8-BOM in this directory + subdirectories

    find . -type f -iname "*.php" -exec grep -l $'\xEF\xBB\xBF' {} \;

    # WARNING: you can also remove it this way, but PLEASE make sure you are in the correct directory + user

    find . -type f -iname "*.php" -exec sed -i '1s/^\xEF\xBB\xBF//' {} \;

  19. Once again, a truly fascinating article

  20. Hello to all, how is all, I think every one is getting more from this
    web site, and your views are nice in support of new viewers.

  21. You continually put together appealing posts for us

4 Pings/Trackbacks for "Detecting UTF BOM – byte order mark"
  1. [...] reasons why you should hate your own code. Some really good ones in there, with nice explanations. Detecting UTF BOM – byte order mark: how to check with PHP whether a file contains a UTF BOM (byte order mark). It’s really easy. [...]

  2. [...] Originally Posted by Salathe There is a "Byte order mark" preceding the places in that string, which accounts for the extra length. Thanks, that was the problem. The string I was comparing had originally come from a text file saved with Windows notepad, which includes a BOM when you save as UTF-8. In my case I am fixing my problem by altering my data (using a text editor that allows saving as UTF-8 without a BOM), but I also found this PHP based solution if anyone else has a similar problem in the future: Detecting UTF BOM – byte order mark [...]
