Windows encodes Unicode text as UTF-16, specifically UTF-16LE (little endian), while Linux conventionally uses UTF-8. A Unicode text file can optionally begin with a Byte Order Mark (BOM), which is a special magic number at the start of the file. The BOM is optional for UTF-8, where byte order is not an issue. The Unicode standard does not strictly require it for UTF-16 either, but a UTF-16 file without a BOM gives no indication of its byte order, so in practice Windows applications write a BOM and look for one when reading. Linux tools normally write UTF-8 files without a BOM.
So in summary, Windows uses UTF-16LE with BOM, and Linux uses UTF-8 without BOM.
To check which encoding a file uses, we can use the file command on Linux.
$ file Unicode_Windows.txt
Unicode_Windows.txt: Little-endian UTF-16 Unicode text, with CR line terminators
We can see more detail with hexdump:
$ hexdump -C Unicode_Windows.txt
00000000  ff fe 24 0c 46 0c 32 0c  41 0c 17 0c 41 0c 0d 00  |..$.F.2.A...A...|
00000010  0a 00                                             |..|
00000012
ff fe is the BOM for UTF-16LE, and the end of line appears as 0d 00 (Carriage Return, CR) followed by 0a 00 (Line Feed, LF).
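The same BOM check can be scripted. The following is a minimal sketch, assuming a POSIX-style shell with head -c, od and tr available. The function name detect_bom is made up for this illustration. It reads only three bytes, so it recognises just the BOMs discussed here and does not try to tell UTF-16LE apart from UTF-32LE.

```shell
#!/bin/sh
# detect_bom FILE: print which byte order mark (if any) starts FILE.
# detect_bom is a hypothetical helper name, not a standard tool.
detect_bom() {
    # Read the first three bytes as lowercase hex with no separators.
    first=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
    case "$first" in
        efbbbf) echo "UTF-8 BOM" ;;
        fffe*)  echo "UTF-16LE BOM" ;;
        feff*)  echo "UTF-16BE BOM" ;;
        *)      echo "no BOM" ;;
    esac
}
```

Calling detect_bom Unicode_Windows.txt on the file shown above would report a UTF-16LE BOM, since its first two bytes are ff fe.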
2. Converting from UTF-16 to UTF-8
The file above was created on Windows. There are several ways to convert it to the Linux encoding.
2a. Using iconv
$ iconv -f UTF-16LE -t UTF-8 Unicode_Windows.txt > Unicode_Linux1.txt
Let us check this file.
$ file Unicode_Linux1.txt
Unicode_Linux1.txt: UTF-8 Unicode (with BOM) text, with CRLF line terminators
$ hexdump -C Unicode_Linux1.txt
00000000  ef bb bf e0 b0 a4 e0 b1  86 e0 b0 b2 e0 b1 81 e0  |................|
00000010  b0 97 e0 b1 81 0d 0a                              |.......|
00000017
This converts the file to UTF-8 but keeps the BOM at the beginning of the file (ef bb bf). Each line also still ends with CR (0d) and LF (0a). To get UTF-8 without the BOM and the CR characters, use the following command.
$ iconv -f UTF-16LE -t UTF-8 Unicode_Windows.txt | sed '1s/^.//' | sed 's/\r$//' > Unicode_Linux1.txt
We can verify the result with the commands below.
$ file Unicode_Linux1.txt
Unicode_Linux1.txt: UTF-8 Unicode text
$ hexdump -C Unicode_Linux1.txt
00000000  e0 b0 a4 e0 b1 86 e0 b0  b2 e0 b1 81 e0 b0 97 e0  |................|
00000010  b1 81 0a                                          |...|
00000013
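One caveat with the 1s/^.// expression: it removes the first character of the file, so it strips the whole three-byte BOM only when sed runs in a UTF-8 locale. In the C locale it would remove a single byte. A less locale-dependent variant is sketched below, assuming a POSIX-style shell with iconv and sed. The function name utf16le_to_utf8 is hypothetical.

```shell
#!/bin/sh
# utf16le_to_utf8 IN OUT: convert UTF-16LE text to UTF-8 without BOM or CR.
# The function name is hypothetical, chosen for this sketch.
utf16le_to_utf8() {
    bom=$(printf '\357\273\277')   # the UTF-8 BOM bytes ef bb bf
    cr=$(printf '\r')              # carriage return, 0d
    # LC_ALL=C makes sed work on bytes, so the match does not depend on locale.
    iconv -f UTF-16LE -t UTF-8 "$1" |
        LC_ALL=C sed -e "1s/^$bom//" -e "s/$cr\$//" > "$2"
}
```

It would be called as utf16le_to_utf8 Unicode_Windows.txt Unicode_Linux1.txt. Matching the BOM bytes explicitly also means that a file that has no BOM is left untouched, instead of losing its first character.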
2b. Using dos2unix command
We can also use the dos2unix command, which converts the file from UTF-16LE to UTF-8 and also removes the BOM and CR characters. Here is an example:
$ dos2unix -n Unicode_Windows.txt Unicode_Linux2.txt
dos2unix: converting file Unicode_Windows.txt to file Unicode_Linux2.txt in Unix format ...
$ file Unicode_Linux2.txt
Unicode_Linux2.txt: UTF-8 Unicode text
$ hexdump -C Unicode_Linux2.txt
00000000  e0 b0 a4 e0 b1 86 e0 b0  b2 e0 b1 81 e0 b0 97 e0  |................|
00000010  b1 81 0a                                          |...|
00000013
3. Converting from UTF-8 to UTF-16
There is no direct way to convert files from UTF-8 to UTF-16LE on Linux. The unix2dos command only converts from UTF-8 to UTF-8, adding the CR character. It also does not add a BOM by default, so we have to force one with the -m option.
$ unix2dos -m -n Unicode_Linux1.txt Unicode_Windows1.txt
unix2dos: converting file Unicode_Linux1.txt to file Unicode_Windows1.txt in DOS format ...
$ file Unicode_Windows1.txt
Unicode_Windows1.txt: UTF-8 Unicode (with BOM) text, with CRLF line terminators
$ hexdump -C Unicode_Windows1.txt
00000000  ef bb bf e0 b0 a4 e0 b1  86 e0 b0 b2 e0 b1 81 e0  |................|
00000010  b0 97 e0 b1 81 0d 0a                              |.......|
00000017
To convert it to UTF-16LE, we have to run the iconv command after unix2dos.
$ iconv -f UTF-8 -t UTF-16LE Unicode_Windows1.txt > Unicode_Windows2.txt
$ file Unicode_Windows2.txt
Unicode_Windows2.txt: Little-endian UTF-16 Unicode text, with CR line terminators
$ hexdump -C Unicode_Windows2.txt
00000000  ff fe 24 0c 46 0c 32 0c  41 0c 17 0c 41 0c 0d 00  |..$.F.2.A...A...|
00000010  0a 00                                             |..|
00000012
Instead of using unix2dos, we can use sed directly to add the BOM and CR, and then convert to UTF-16, as shown here.
$ sed '1s/^/\xef\xbb\xbf/' Unicode_Linux1.txt | sed 's/$/\r/' | iconv -f UTF-8 -t UTF-16LE > Unicode_Windows3.txt
$ file Unicode_Windows3.txt
Unicode_Windows3.txt: Little-endian UTF-16 Unicode text, with CR line terminators
$ hexdump -C Unicode_Windows3.txt
00000000  ff fe 24 0c 46 0c 32 0c  41 0c 17 0c 41 0c 0d 00  |..$.F.2.A...A...|
00000010  0a 00                                             |..|
00000012
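The reverse conversion can also be wrapped in a small function that does not go through a UTF-8 BOM first. The sketch below assumes a POSIX-style shell, sed, and an iconv that writes no BOM of its own for -t UTF-16LE (glibc iconv behaves this way). It also assumes the input is BOM-less UTF-8 with LF line endings. The name utf8_to_utf16le is hypothetical.

```shell
#!/bin/sh
# utf8_to_utf16le IN OUT: convert BOM-less UTF-8 text with LF line endings
# to UTF-16LE with BOM and CRLF line endings. Hypothetical helper name.
utf8_to_utf16le() {
    cr=$(printf '\r')
    {
        # Write the UTF-16LE BOM (ff fe) directly as two bytes.
        printf '\377\376'
        # Append CR to every line, then convert the text itself.
        sed "s/\$/$cr/" "$1" | iconv -f UTF-8 -t UTF-16LE
    } > "$2"
}
```

It would be called as utf8_to_utf16le Unicode_Linux1.txt Unicode_Windows3.txt. Writing the BOM with printf keeps the step explicit. If the input already had CRLF line endings, this sketch would add a second CR, so it should only be run on files with Unix line endings.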