When integrating many different data sources and systems across Europe, you are bound to eventually run into issues with UTF-8 and national character sets such as ISO-8859-1, which is common in Sweden. Even when parsing simple UTF-8 files with comma-separated values, things might pop up to bite you.
One such thing is the UTF byte order mark, or BOM. The byte order mark is the Unicode code point U+FEFF, which UTF-8 encodes as three bytes (0xef, 0xbb and 0xbf) that sit at the beginning of the text file. For UTF-16 it is used to indicate the byte order. For UTF-8 it is not really necessary.
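To make the byte values concrete, here is a minimal sketch that prints the UTF-8 encoding of U+FEFF. It assumes PHP 7 or later for the "\u{...}" escape syntax, and the variable names are only illustrative.

```php
<?php
// U+FEFF written with the PHP 7+ Unicode escape; PHP stores it as UTF-8 bytes.
$bomUtf8 = "\u{FEFF}";

echo bin2hex($bomUtf8), "\n"; // efbbbf
echo strlen($bomUtf8), "\n";  // 3

// The same code point in the two UTF-16 byte orders, written as raw bytes.
$bomUtf16be = "\xFE\xFF";
$bomUtf16le = "\xFF\xFE";
```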
But for UTF-8, especially on Windows, it has become more and more common to use it to indicate that the file is indeed UTF-8. Most text editors handle this well and you won’t ever see these bytes. As it should be.
The problems start when you are using PHP's binary-safe string functions such as strcmp() and substr(), which work on raw bytes. These three bytes, which are not visible even when using var_dump(), then become bothersome. (You would, however, see that the string length reported by var_dump() is correct and counts the invisible bytes.)
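A small sketch of how this bites in practice. The header line "id,name" is a made-up example of the first line of a CSV file saved with a BOM.

```php
<?php
$bom = "\xEF\xBB\xBF";
// Hypothetical first line of a CSV file that was saved with a BOM.
$header = $bom . "id,name";

var_dump(strlen("id,name"));                // int(7)
var_dump(strlen($header));                  // int(10), the BOM adds three bytes
var_dump(strcmp($header, "id,name") === 0); // bool(false)
var_dump(substr($header, 0, 2) === "id");   // bool(false)

// The first column name no longer matches what you expect.
$columns = explode(",", $header);
var_dump($columns[0] === "id");             // bool(false)
```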
So you need to detect the three bytes and remove the BOM. Below is a simplified example of how to detect and remove the three bytes.
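$str = file_get_contents('file.utf8.csv');
// The UTF-8 BOM as a three-byte binary string
$bom = pack("CCC", 0xef, 0xbb, 0xbf);
// Compare only the first three bytes of the file contents
if (0 == strncmp($str, $bom, 3)) {
    echo "BOM detected - file is UTF-8\n";
    // Drop the BOM and keep the rest
    $str = substr($str, 3);
}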
It’s as simple as that.

Quick word to say that I’d rather use "\xef\xbb\xbf" than pack(). It feels more readable.
Loading the whole file into memory if you need just the first three bytes seems like a waste of memory.
@Paul True. Both ways work.
@Brutos. It is an example. Nothing else.
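For the memory concern raised above, a minimal sketch that reads only the first three bytes of the file. has_utf8_bom() is a hypothetical helper name, and treating an unreadable file as "no BOM" is an assumption made for brevity.

```php
<?php
/**
 * Report whether a file starts with the UTF-8 BOM, reading only three bytes.
 * has_utf8_bom() is a hypothetical helper, not part of PHP.
 */
function has_utf8_bom(string $filename): bool
{
    $handle = fopen($filename, 'rb');
    if ($handle === false) {
        return false; // unreadable file: treated as "no BOM" in this sketch
    }
    $head = fread($handle, 3);
    fclose($handle);

    return $head === "\xEF\xBB\xBF";
}
```

This only detects the BOM. Removing it still means rewriting the file, so for small files the original file_get_contents() approach is probably simpler.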
Dude. Thanks for this!
I spent the last hour (googling like crazy) trying to figure out the best way to detect and handle the BOM that was coming from a webservice. I was considering contacting the vendor and asking that he not use the BOM and instead use an http header, but this worked best.
btw, I iterate over several BOMs to figure out which is being used (0xfffe, 0xfeff and the above) and then handle appropriately.
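Iterating over several BOMs could look roughly like the sketch below. detect_bom() is a hypothetical name, and the UTF-32 entries are an addition that the comment above does not mention. The longer marks are checked first because the UTF-32LE BOM starts with the same two bytes as the UTF-16LE BOM.

```php
<?php
/**
 * Return the encoding name suggested by a leading BOM, or null if none is found.
 * detect_bom() is a hypothetical helper, not part of PHP.
 */
function detect_bom(string $str): ?string
{
    // Order matters: longer byte sequences must be tested before their prefixes.
    $boms = [
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];

    foreach ($boms as $encoding => $bom) {
        if (strncmp($str, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }

    return null;
}
```

One caveat: UTF-16LE text whose first character after the BOM is U+0000 would be reported as UTF-32LE. That should be rare in real files, but the result is a hint, not a guarantee.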
Been there; it's really annoying. If you want to detect more types of UTF BOMs, you might have a look at my articles:
http://artur.ejsmont.org/blog/content/annoying-utf-byte-order-marks
UTF-8 BOMs can be safely removed using scripts; others may need proper file conversion:
http://artur.ejsmont.org/blog/Convertig-file-encoding-to-UTF-8-from-other-Unicode-and-removing-Byte-Order-Marks
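To illustrate why UTF-16 needs a conversion and not just byte removal, here is a rough sketch. utf16_to_utf8() is a hypothetical helper, it assumes the iconv extension is available, and it is not taken from the linked articles.

```php
<?php
/**
 * Convert a UTF-16 string that starts with a BOM to UTF-8 without a BOM.
 * utf16_to_utf8() is a hypothetical helper; it assumes the iconv extension.
 */
function utf16_to_utf8(string $str): string
{
    if (strncmp($str, "\xFF\xFE", 2) === 0) {
        $from = 'UTF-16LE';
    } elseif (strncmp($str, "\xFE\xFF", 2) === 0) {
        $from = 'UTF-16BE';
    } else {
        return $str; // no UTF-16 BOM: left untouched in this sketch
    }

    // Strip the two BOM bytes first, then convert the remaining text.
    $converted = iconv($from, 'UTF-8', substr($str, 2));

    return $converted === false ? $str : $converted;
}
```

Stripping the BOM alone would leave every character in two-byte form, which is why the remaining text has to be re-encoded.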
Thanks for this simple but great detection.
I just implemented that into my MySQLDumper-App that now can handle dumps saved with BOM.
Here’s a short snippet I use to dig out all files containing the BOM sequence recursively in a folder:
http://pastebin.com/dHbqjUNy
Atanas Vasilev – your code is genius, thanks for sharing.
I spent a couple of hours searching for a way of removing the BOM from a text file, & after various scripts & codes in php, perl & others, I found what I was looking for – a simple text editor with the option of encoding in UTF-8 without BOM, Notepad++:
http://sourceforge.net/projects/notepad-plus/
A simple properties check already shows 3 bytes less.
Easy-peasy!
Thanks! I included this in a loop so it can recursively check all files in a project:
foreach (dir_list(dirname(__FILE__)) as $filename) {
    if (check_extension($filename)) detect_and_remove_BOM($filename);
}

// Return true for files whose extension is in the list to check.
function check_extension($filename) {
    $check_extensions = array('.php');
    return in_array(substr($filename, -4), $check_extensions);
}

// Strip a leading UTF-8 BOM. The file is rewritten only when a BOM was found.
// Returns the number of bytes written, or false if nothing was changed.
function detect_and_remove_BOM($filename) {
    echo "$filename\n";
    $str = file_get_contents($filename);
    $bom = pack("CCC", 0xef, 0xbb, 0xbf);
    if ($str !== false && 0 == strncmp($str, $bom, 3)) {
        echo "BOM detected - file is UTF-8\n";
        return file_put_contents($filename, substr($str, 3));
    }
    return false;
}

// Recursively list all readable files below $dir.
function dir_list($dir) {
    $retval = array();
    // Normalise to exactly one trailing separator so paths are built consistently.
    $dir = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR;
    $d = @dir($dir);
    if (!$d) {
        if (defined('DEBUG') && DEBUG) echo "dir_list: Failed opening directory $dir for reading";
        return array();
    }
    while (false !== ($entry = $d->read())) {
        if ($entry[0] == ".") continue; // skip hidden files, "." and ".."
        if (is_dir("$dir$entry")) {
            $retval = array_merge($retval, dir_list($dir . $entry));
        }
        elseif (is_readable("$dir$entry")) {
            $retval[] = "$dir$entry";
        }
    }
    $d->close();
    return $retval;
}
Though I used "\xef\xbb\xbf" and strpos() for detecting the UTF-8 BOM, you helped me with a quick solution.
Thanks!
THANK YOU!
A few years ago I spent a month searching for this info.
I finally just removed the 3 chars, and it became problematic.
Your solution saved me today!
Thanks for posting this.
Finally a way to get rid of these annoying BOM characters п»ї
Thanks so much this sorted my problem!
This solution (technically, a modification thereof) was just what I was looking for. (I was having a devil of a time getting my RSS feed to connect to a 3rd party app; turns out the BOM was (somehow) getting in the way.) Thanks!