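I saw that someone before me posted code for converting between UTF-8 and Unicode, so I would like to add a few notes in passing. I hope they are of some help.
Below are the bit widths used by some encoding forms.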
Examples of fixed-width encoding forms:
Type | Each character encoded as | Notes
7-bit | a single 7-bit quantity | example: ISO 646
8-bit G0/G1 | a single 8-bit quantity | with constraints on use of C0 and C1 spaces
8-bit | a single 8-bit quantity | with no constraints on use of C1 space
8-bit EBCDIC | a single 8-bit quantity | with the EBCDIC conventions rather than ASCII conventions
16-bit (UCS-2) | a single 16-bit quantity | within a code space of 0..FFFF
32-bit (UCS-4) | a single 32-bit quantity | within a code space of 0..7FFFFFFF
32-bit (UTF-32) | a single 32-bit quantity | within a code space of 0..10FFFF
16-bit DBCS process code | a single 16-bit quantity | example: UNIX widechar implementations of Asian CCS's
32-bit DBCS process code | a single 32-bit quantity | example: UNIX widechar implementations of Asian CCS's
DBCS Host | two 8-bit quantities | following IBM host conventions
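To make the code-space limits of the 16-bit and 32-bit forms concrete, here is a minimal C++ sketch. The helper names (`fits_ucs2`, `fits_ucs4`, `in_utf32_code_space`, `to_ucs2`) are made up for illustration and do not come from any library; the ranges are the ones listed in the table.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers: each one checks a value against the code space
// that the table lists for the corresponding fixed-width form.
constexpr bool fits_ucs2(std::uint32_t cp)           { return cp <= 0xFFFFu; }
constexpr bool fits_ucs4(std::uint32_t cp)           { return cp <= 0x7FFFFFFFu; }
constexpr bool in_utf32_code_space(std::uint32_t cp) { return cp <= 0x10FFFFu; }

// Store a code point as one 16-bit UCS-2 code unit.
// Precondition: the value must lie inside the UCS-2 code space.
inline std::uint16_t to_ucs2(std::uint32_t cp) {
    assert(fits_ucs2(cp));
    return static_cast<std::uint16_t>(cp);
}
```

Note that this only checks the numeric range. UTF-32 additionally excludes the surrogate values D800..DFFF, which the sketch does not test for.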
Examples of variable-width encoding forms:
Name | Characters are encoded as | Notes
UTF-8 | a mix of one to four 8-bit code units in Unicode and one to six code units in 10646 | used only with Unicode/10646
UTF-16 | a mix of one to two 16-bit code units | used only with Unicode/10646
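The "one to four" and "one to two" code unit counts can be seen directly in a hand-written encoder. The following is only a sketch under the Unicode limits (values up to 10FFFF, surrogates rejected), not the six-byte 10646 form; `encode_utf8` and `encode_utf16` are illustrative names, not library functions.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one code point as UTF-8. Returns an empty string for
// surrogates (D800..DFFF) and for values above 10FFFF.
inline std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogate, not encodable
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {           // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    assert(out.size() <= 4);               // one to four 8-bit code units
    return out;
}

// Encode one code point as UTF-16: one unit inside the BMP,
// a surrogate pair (two units) above it. Empty result on invalid input.
inline std::u16string encode_utf16(std::uint32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // lone surrogate
        out += static_cast<char16_t>(cp);
    } else if (cp <= 0x10FFFF) {
        const std::uint32_t v = cp - 0x10000;          // 20 bits to split
        out += static_cast<char16_t>(0xD800 + (v >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low surrogate
    }
    return out;
}
```

For example, U+4E2D becomes the three bytes E4 B8 AD in UTF-8 and a single 16-bit unit in UTF-16, while U+1F600 needs four bytes in UTF-8 and the surrogate pair D83D DE00 in UTF-16.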
Boost provides a UTF-8 Codecvt Facet that converts between UTF-8 and UCS-4 (32-bit Unicode).
It is used as follows:
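#include <algorithm>
#include <fstream>
#include <iterator>
#include <locale>
#include <vector>
//...
// My encoding type.
// Note: wchar_t is 32 bits wide on most Unix-like platforms but only
// 16 bits on Windows, so it can hold UCS-4 only on the former.
typedef wchar_t ucs4_t;
std::locale old_locale;
// The template form follows the Boost documentation; if your Boost version
// declares utf8_codecvt_facet as a non-template class, drop <ucs4_t>.
std::locale utf8_locale(old_locale, new utf8_codecvt_facet<ucs4_t>);
// Set a new global locale
std::locale::global(utf8_locale);
// Write the UCS-4 data out, converting it to UTF-8
// (ucs4_data is a container of ucs4_t defined in the elided code)
{
    std::wofstream ofs("data.ucd");
    ofs.imbue(utf8_locale);
    std::copy(ucs4_data.begin(), ucs4_data.end(),
              std::ostream_iterator<ucs4_t, ucs4_t>(ofs));
}
// Read the UTF-8 data back in, converting it to UCS-4
std::vector<ucs4_t> from_file;
{
    std::wifstream ifs("data.ucd");
    ifs.imbue(utf8_locale);
    ucs4_t item = 0;
    // operator>> skips whitespace by default, so whitespace
    // characters in the file are not stored in from_file
    while (ifs >> item) from_file.push_back(item);
}
//...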
For details on the UTF-8 Codecvt Facet, see:
http://www.boost.org/libs/serialization/doc/codecvt.html
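If Boost is not available, the UTF-8 to UCS-4 direction can also be sketched by hand. The function below is my own illustration of the decoding step, not Boost's implementation; `decode_utf8` is a hypothetical name, and it assumes the Unicode limits (at most four bytes per character, no surrogates, nothing above 10FFFF).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decode a UTF-8 byte string into 32-bit code points, appending to `out`.
// Returns false on malformed input: invalid lead byte, truncated sequence,
// bad continuation byte, overlong form, surrogate, or value above 10FFFF.
inline bool decode_utf8(const std::string& in, std::vector<std::uint32_t>& out) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else return false;                       // stray continuation or invalid lead byte
        assert(len >= 1 && len <= 4);
        if (in.size() - i < len) return false;   // sequence cut off at end of input
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // continuation must be 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp[len]) return false;      // overlong encoding
        if (cp > 0x10FFFF) return false;         // outside the Unicode code space
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        out.push_back(cp);
        i += len;
    }
    return true;
}
```

Unlike the stream-based approach, this works on a buffer already in memory and reports malformed input through its return value, which may be easier to handle when the data does not come from a file.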