Detecting UTF BOM – byte order mark

When integrating many different data sources and systems across Europe, you are bound to eventually run into issues with UTF-8 and national character sets such as ISO-8859-1, which is common in Sweden. Even when parsing simple UTF-8 files with comma-separated values, things might pop up to bite you.

One such thing is the occurrence of the UTF byte order mark, or BOM. The BOM is the Unicode character U+FEFF, which UTF-8 encodes as three bytes (0xef, 0xbb and 0xbf) that sit at the beginning of the text file. For UTF-16 it is used to indicate the byte order. For UTF-8, which has no byte order ambiguity, it is not really necessary.
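As a rough illustration of the byte sequences involved, the sketch below guesses an encoding from a leading BOM. The function name bom_encoding() is made up for this example, and the check is deliberately naive: a UTF-32 little-endian BOM also starts with 0xff 0xfe, so it would be reported as UTF-16LE here.

```php
<?php
// Hypothetical helper (not part of PHP): guess the encoding from a leading
// BOM, or return null when the data does not start with one of these marks.
function bom_encoding(string $data): ?string
{
    if (strncmp($data, "\xEF\xBB\xBF", 3) === 0) {
        return 'UTF-8';      // U+FEFF encoded as UTF-8
    }
    if (strncmp($data, "\xFE\xFF", 2) === 0) {
        return 'UTF-16BE';   // U+FEFF, most significant byte first
    }
    if (strncmp($data, "\xFF\xFE", 2) === 0) {
        return 'UTF-16LE';   // U+FEFF, least significant byte first
    }
    return null;
}

// PHP 7+ can write the code point directly; the string holds its UTF-8 bytes.
echo bin2hex("\u{FEFF}"), "\n";   // efbbbf
```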

But for UTF-8, especially on Windows, it has become more and more common to use it to indicate that the file is indeed UTF-8. Most text editors handle this well and you won’t ever see these bytes, which is as it should be.

The problems start when you use PHP's binary-safe string functions such as strcmp() and substr(), which work on raw bytes. The three BOM bytes are not visible in the output, even from var_dump(), yet they take part in every comparison and offset calculation. (The string length reported by var_dump() does count the invisible bytes, which is one way to spot them.)
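To make the symptom concrete, here is a small sketch. It assumes the first header field of a CSV file is "id" and that the file was saved with a BOM; the variable names are only for illustration.

```php
<?php
// What the first field looks like when the file started with a UTF-8 BOM.
$fromFile = "\xEF\xBB\xBFid";
$expected = 'id';

var_dump(strlen($fromFile));                    // int(5): 3 BOM bytes + 2 letters
var_dump(strcmp($fromFile, $expected) === 0);   // bool(false)
var_dump(substr($fromFile, 0, 2) === 'id');     // bool(false)

// A hex dump makes the invisible bytes visible.
echo bin2hex($fromFile), "\n";                  // efbbbf6964
```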

So you need to detect the three bytes and remove the BOM. Below is a simplified example of how to do that.

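// Read the whole file; file_get_contents() returns false on failure.
$str = file_get_contents('file.utf8.csv');
// The UTF-8 BOM as a three-byte string.
$bom = pack("CCC", 0xef, 0xbb, 0xbf);
if ($str !== false && strncmp($str, $bom, 3) === 0) {
	echo "BOM detected - file is UTF-8\n";
	// Drop the first three bytes so the string starts at the real content.
	$str = substr($str, 3);
}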

It’s as simple as that.
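If the same check is needed in several places, it could be wrapped in small helpers along these lines. This is only a sketch: the names UTF8_BOM, strip_utf8_bom() and open_without_bom() are invented for this example, and it assumes PHP 7 or later.

```php
<?php
const UTF8_BOM = "\xEF\xBB\xBF";

// Hypothetical helper: return $text without a leading UTF-8 BOM.
function strip_utf8_bom(string $text): string
{
    return strncmp($text, UTF8_BOM, 3) === 0 ? substr($text, 3) : $text;
}

// Hypothetical helper: open a file for reading and leave the file pointer
// just after the BOM if there is one, or at the very start otherwise.
function open_without_bom(string $path)
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    if (fread($handle, 3) !== UTF8_BOM) {
        rewind($handle);   // no BOM: the bytes just read are real data
    }
    return $handle;
}
```

The stream variant avoids loading a large file into memory; the returned handle can then be passed to fgets() or fgetcsv() as usual.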




