UTF-8 and Unicode [reposted from http://www.cppblog.com/zuroc/archive/2006/02/15/3269.html ]

Boost: UTF-8 Codecvt Facet (converting between Unicode and UTF-8)

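I saw that a senior developer had written code for converting between UTF-8 and Unicode, so I would like to mention a related option here in the hope that it helps.
Below are the bit widths of some encoding forms.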

Examples of fixed-width encoding forms:

Type	Each character encoded as	Notes
7-bit	a single 7-bit quantity	example: ISO 646
8-bit G0/G1	a single 8-bit quantity	with constraints on use of C0 and C1 spaces
8-bit	a single 8-bit quantity	with no constraints on use of C1 space
8-bit EBCDIC	a single 8-bit quantity	with the EBCDIC conventions rather than ASCII conventions
16-bit (UCS-2)	a single 16-bit quantity	within a code space of 0..FFFF
32-bit (UCS-4)	a single 32-bit quantity	within a code space of 0..7FFFFFFF
32-bit (UTF-32)	a single 32-bit quantity	within a code space of 0..10FFFF
16-bit DBCS process code	a single 16-bit quantity	example: UNIX widechar implementations of Asian CCS's
32-bit DBCS process code	a single 32-bit quantity	example: UNIX widechar implementations of Asian CCS's
DBCS Host	two 8-bit quantities	following IBM host conventions

Examples of variable-width encoding forms:

Name	Characters are encoded as	Notes
UTF-8	a mix of one to four 8-bit code units in Unicode and one to six code units in 10646	used only with Unicode/10646
UTF-16	a mix of one to two 16-bit code units	used only with Unicode/10646
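
To make the "one to four 8-bit code units" row concrete, here is a minimal sketch of a UTF-8 encoder for a single code point. The name encode_utf8 is a hypothetical helper chosen for illustration, not part of Boost or the standard library, and it covers only the Unicode range 0..10FFFF, not the older five- and six-byte forms of 10646.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: encode one Unicode code point as 1 to 4 UTF-8 bytes.
std::string encode_utf8(std::uint32_t cp) {
    // Surrogates and values above 0x10FFFF are not Unicode scalar values.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");
    std::string out;
    if (cp < 0x80) {                       // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Small self-check: the byte count grows with the code point value.
void encode_utf8_examples() {
    assert(encode_utf8(0x41).size() == 1);     // U+0041 'A'
    assert(encode_utf8(0xE9).size() == 2);     // U+00E9
    assert(encode_utf8(0x4E2D).size() == 3);   // U+4E2D
    assert(encode_utf8(0x1F600).size() == 4);  // U+1F600
}
```

The same code point always occupies one 32-bit unit in UTF-32, which is why the table above lists UTF-32 as fixed-width and UTF-8 as variable-width.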

Boost provides a UTF-8 Codecvt Facet that converts between UTF-8 and UCS-4 (32-bit Unicode).
It is used as follows:

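//...
// My encoding type.
// Note: wchar_t holds a full UCS-4 value only where it is 32 bits wide
// (it is 16 bits on Windows).
typedef wchar_t ucs4_t;

std::locale old_locale;
// Template form as shown in the Boost documentation; some Boost releases
// declare utf8_codecvt_facet as a non-template class, in which case
// drop the <ucs4_t> argument.
std::locale utf8_locale(old_locale,new utf8_codecvt_facet<ucs4_t>);

// Set a new global locale
std::locale::global(utf8_locale);

// Convert UCS-4 to UTF-8 while writing
// (ucs4_data is a container of ucs4_t defined elsewhere)
{
    std::wofstream ofs("data.ucd");
    ofs.imbue(utf8_locale);
    std::copy(ucs4_data.begin(),ucs4_data.end(),
          std::ostream_iterator<ucs4_t,ucs4_t>(ofs));
}

// Read UTF-8 and convert it to UCS-4
std::vector<ucs4_t> from_file;
{
    std::wifstream ifs("data.ucd");
    ifs.imbue(utf8_locale);
    ucs4_t item = 0;
    // Formatted extraction (>>) skips whitespace characters
    while (ifs >> item) from_file.push_back(item);
}
//...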
For details on the UTF-8 Codecvt Facet, see
http://www.boost.org/libs/serialization/doc/codecvt.html
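
For readers who want to see what the UTF-8 to UCS-4 direction involves at the byte level, here is a rough hand-written sketch. It is not how Boost implements the facet, only the same mapping spelled out; decode_utf8 is a hypothetical name, and the sketch assumes well-formed input apart from the few checks shown (it does not reject overlong encodings or encoded surrogates).

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: decode a UTF-8 byte string into UCS-4 code points.
std::vector<char32_t> decode_utf8(const std::string& in) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte has the form 10xxxxxx and adds 6 bits.
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Small self-check: three bytes of UTF-8 decode to one code point, U+4E2D.
void decode_utf8_examples() {
    std::vector<char32_t> v = decode_utf8("\xE4\xB8\xAD");
    assert(v.size() == 1);
    assert(v[0] == 0x4E2D);
}
```

Using char32_t here sidesteps the platform-dependent width of wchar_t; a stream-based facet does the same conversion incrementally as bytes arrive.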

posted on 2006-09-10 13:12 Fxzeng's space
