C中诸如isdigit

看到CSDN首页一篇blog—— 。文中提到的算法暂不议论，单单看回复——一个网友指责楼主使用Char.IsNumber(str,i)的效率不如(r[i]<'0' || str[i]>'9')这种方法高，楼主连声表示认同——真的令我感慨万分，隔行如隔山呀。这话怎么说？C#这种不开源的语言使多少人对其实现产生了误解，我不是说C#程序员对效率算法研究不及C程序员，我是想说开源的ANSI C绝对能使程序员更准确的理解其实现的本质，从而不会错用、误用代码而不自知。

我对C#实现理解不深，因此我来谈谈C中诸如isdigit(c)、isalpha(c)之类函数的实现，借而推断上述讨论的正确性。

很多初学者都认为isalnum(c)是这样写的(我最初也是这么人为的)：

				
						inline
				
				
						 int
				 isdigit(char c)
{   return c>='0'&&c<'9';
}

我当初甚至认为，这种写法已经是非常高效了。直到某一天我开始学习Linux kernel source，我才真正见识了前辈的高效算法，真的眩的令人膛目结舌。下面这段代码摘自Linux 0.11 kernel，大家看看再说:)
0 /*
  0  *  linux/lib/ctype.h
  0  *
  1  *  (C) 1991  Linus Torvalds
  2  */  
  3
  4 #define _U      0x01    /* upper */
  5 #define _L      0x02    /* lower */
  6 #define _D      0x04    /* digit */
  7 #define _C      0x08    /* cntrl */
  8 #define _P      0x10    /* punct */
  9 #define _S      0x20    /* white space (space/lf/tab) */
10 #define _X      0x40    /* hex digit */
11 #define _SP     0x80    /* hard space (0x20) */
12 
13 extern unsigned char _ctype[];
14 extern char _ctmp;
15 
16 #define isalnum(c) ((_ctype+1)[c]&(_U|_L|_D))
17 #define isalpha(c) ((_ctype+1)[c]&(_U|_L))
18 #define iscntrl(c) ((_ctype+1)[c]&(_C))
19 #define isdigit(c) ((_ctype+1)[c]&(_D))
20 #define isgraph(c) ((_ctype+1)[c]&(_P|_U|_L|_D))
21 #define islower(c) ((_ctype+1)[c]&(_L))
22 #define isprint(c) ((_ctype+1)[c]&(_P|_U|_L|_D|_SP))
23 #define ispunct(c) ((_ctype+1)[c]&(_P))
24 #define isspace(c) ((_ctype+1)[c]&(_S))
25 #define isupper(c) ((_ctype+1)[c]&(_U))
26 #define isxdigit(c) ((_ctype+1)[c]&(_D|_X))
27 
28 #define isascii(c) (((unsigned) c)<=0x7f)
29 #define toascii(c) (((unsigned) c)&0x7f)
30 
31 #define tolower(c) (_ctmp=c,isupper(_ctmp)?_ctmp-('A'-'a'):_ctmp)
32 #define toupper(c) (_ctmp=c,islower(_ctmp)?_ctmp-('a'-'A'):_ctmp)
33 
34 #endif
35 

  1 /*
  2  *  linux/lib/ctype.c
  3  *
  4  *  (C) 1991  Linus Torvalds
  5  */
  6 
  7 #include <ctype.h>
  8 
  9 char _ctmp;
10 unsigned char _ctype[] = {0x00,                 /* EOF */
11 _C,_C,_C,_C,_C,_C,_C,_C,                        /* 0-7 */
12 _C,_C|_S,_C|_S,_C|_S,_C|_S,_C|_S,_C,_C,         /* 8-15 */
13 _C,_C,_C,_C,_C,_C,_C,_C,                        /* 16-23 */
14 _C,_C,_C,_C,_C,_C,_C,_C,                        /* 24-31 */
15 _S|_SP,_P,_P,_P,_P,_P,_P,_P,                    /* 32-39 */
16 _P,_P,_P,_P,_P,_P,_P,_P,                        /* 40-47 */
17 _D,_D,_D,_D,_D,_D,_D,_D,                        /* 48-55 */
18 _D,_D,_P,_P,_P,_P,_P,_P,                        /* 56-63 */
19 _P,_U|_X,_U|_X,_U|_X,_U|_X,_U|_X,_U|_X,_U,      /* 64-71 */
20 _U,_U,_U,_U,_U,_U,_U,_U,                        /* 72-79 */
21 _U,_U,_U,_U,_U,_U,_U,_U,                        /* 80-87 */
22 _U,_U,_U,_P,_P,_P,_P,_P,                        /* 88-95 */
23 _P,_L|_X,_L|_X,_L|_X,_L|_X,_L|_X,_L|_X,_L,      /* 96-103 */
24 _L,_L,_L,_L,_L,_L,_L,_L,                        /* 104-111 */
25 _L,_L,_L,_L,_L,_L,_L,_L,                        /* 112-119 */
26 _L,_L,_L,_P,_P,_P,_P,_C,                        /* 120-127 */
27 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,                /* 128-143 */
28 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,                /* 144-159 */
29 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,                /* 160-175 */
30 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,                /* 176-191 */
31 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,                /* 192-207 */
32 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,                /* 208-223 */
33 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,                /* 224-239 */
34 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};               /* 240-255 */
35 
36
此一段代码一看便知高下。^_^ 
我不认为C#的构架师是个笨笨，决计做不出来一个用大家反复调用的底层API函数还不及一段用户组合代码效率高的蠢事来，所以我认为“Char.IsNumber(str,i)的效率不如(r[i]<'0' || str[i]>'9')”绝对是个谬论！
 
-------------
乾坤一笑 写于2005年1月19日  转载请标明出处和原文链接

--------------------next---------------------
经过刚刚实际测试，VC++6.0&win2k编译环境下，编译器自动去除了inline前缀。所以改用宏的方法测试，得到结果如下。 
#define isdigit(c) ((c)>='0'&&(c)<'9') 
10: bool a=isdigit(c); 
0040103C movsx eax,byte ptr [ebp-4] 
00401040 cmp eax,30h 
00401043 jl main+37h (00401057) 
00401045 movsx ecx,byte ptr [ebp-4] 
00401049 cmp ecx,39h 
0040104C jge main+37h (00401057) 
0040104E mov dword ptr [ebp-0Ch],1 
00401055 jmp main+3Eh (0040105e) 
00401057 mov dword ptr [ebp-0Ch],0 
0040105E mov dl,byte ptr [ebp-0Ch] 
00401061 mov byte ptr [ebp-8],dl 

#define isdigit(c) ((_ctype+1)[c]&(_D)) 
29: bool a=isdigit(c); 
0040103C movsx eax,byte ptr [ebp-4] 
00401040 xor ecx,ecx 
00401042 mov cl,byte ptr _ctype+1 (00422301)[eax] 
00401048 and ecx,4 
0040104B neg ecx 
0040104D sbb ecx,ecx 
0040104F neg ecx 
00401051 mov byte ptr [ebp-8],cl 

总共节省了三条指令。 

实际如果精简代码，则可以达到九条指令的状态，大致如下： 
#define isdigit(c) ((c)>='0'&&(c)<'9') 
10: bool a=isdigit(c); 
0040103C movsx eax,byte ptr [ebp-4] 
00401040 cmp eax,30h 
00401043 jl main+37h (00401057) 
00401049 cmp eax,39h 
0040104C jge main+37h (00401057) 
0040104E mov dl,1 
00401055 jmp main+3Eh (0040105e) 
00401057 mov dl,0 
00401061 mov byte ptr [ebp-8],dl 

可以看出，两者效率的差异仅仅在一两条指令上，仅仅是编写小型程序并不值得。不过如果是大规模定义此类宏，又有很多非连续的（例如Hex）。那么后者显然值得这些代价。
--------------------next---------------------
楼主，我是《判断一个字符串是否全是数字的多种方法及其性能比较（C#实现）》一文的作者，先不说我那篇文章的水平怎么样（其实已经遭到了很多批评），我但就你这篇文章谈一下看法。 
首先，C#不是C，而楼主却把两者混为一潭。既然楼主承认“我对C#实现理解不深”，却又怎么敢由“C中诸如isdigit(c)、isalpha(c)之类函数的实现”来推断““Char.IsNumber(str,i)的效率不如(r[i]<'0' || str[i]>'9')”绝对是个谬论！”呢？你怎么知道Char.IsNumber(str,i)是通过调用isdigit(char c)来写的呢？你的猜测错了，根本就不是。我通过Reflactor对Framework进行了反编译，得到了Char.IsNumber的实现（用微软提供的ildasm可以得到IL的代码，结论是一样的）： 
这是Char.IsNumber(string,int)的实现： 
public static bool IsNumber(string s, int index) 
{ 
if (s == null) 
{ 
throw new ArgumentNullException("s"); 
} 
if (index >= s.Length) 
{ 
throw new ArgumentOutOfRangeException("index"); 
} 
return char.IsNumber(s[index]); 
} 

被调用的char.IsNumber(char)实现如下： 
public static bool IsNumber(char c) 
{ 
return CharacterInfo.IsNumber(c); 
} 

继续追踪，CharacterInfo.IsNumber(c)的实现： 
public static bool IsNumber(char ch) 
{ 
UnicodeCategory category1 = CharacterInfo.GetUnicodeCategory(ch); 
if ((category1 != UnicodeCategory.DecimalDigitNumber) && (category1 != UnicodeCategory.LetterNumber)) 
{ 
return (category1 == UnicodeCategory.OtherNumber); 
} 
return true; 
} 

继续，CharacterInfo.GetUnicodeCategory(ch)的实现： 
public static UnicodeCategory GetUnicodeCategory(char ch) 
{ 
return CharacterInfo.InternalGetUnicodeCategory(ch); 
} 

还有，CharacterInfo.InternalGetUnicodeCategory(ch)的实现： 
internal static unsafe UnicodeCategory InternalGetUnicodeCategory(char ch) 
{ 
byte num1 = CharacterInfo.m_pDataTable[ch >> 8]; 
ushort num2 = CharacterInfo.m_pLevel2WordOffset[(num1 << 4) + ((ch >> 4) & '\x000f')]; 
byte num3 = CharacterInfo.m_pDataTable[num2 + (ch & '\x000f')]; 
return (UnicodeCategory) num3; 
} 

好了，终于结束了。一个Char.IsNumber(str,i)就是通过上面的一系列调用来实现的，你说效率如何？如果不信可以自己去验证。 

Char.IsNumber(str,i)判断的是Unicode字符，而不仅仅是ASCII字符，我在你批判的那篇文章中也提到了，Char.IsNumber(str,i)可以判断全角的１，所以调用是麻烦了一些，不过功能是增加了。可见，楼主的批判是毫无道理的。
--------------------next---------------------

posted on 2009-02-06 16:13 向左向右走阅读(1274) 评论(1) 编辑收藏引用所属分类: C/C++学习资料库

# re: C中诸如isdigit 2012-07-20 13:37 dragoo1

存在即合理，学习了回复更多评论

刷新评论列表

只有注册用户登录后才能发表评论。

依睛(IT blog) 我回来了，PHP<-->C/C++ LINUX

导航

留言簿(26)

随笔分类(315)

文章分类(5)

收藏夹

随笔档案(330)

文章档案(11)

相册

阅读排行榜

评论排行榜

统计

积分与排名

友情连接

最新评论

C中诸如isdigit

评论

# re: C中诸如isdigit 2012-07-20 13:37 dragoo1