[C#] Determining Whether a String Is Unicode

The title says "determining whether it is Unicode",
but what this post actually does is determine the Unicode category of each character.

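If you have programmed in C/C++,
you have probably written code at least once that checks whether a character is an English letter.
It goes something like "if the ASCII code falls between this value and that value, it is an English letter."
(Or perhaps a program that checks whether a letter is uppercase or lowercase.)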

Because C# is Unicode-based,
it comes with a number of convenient classes for this kind of task.

The relevant one here is
the CharUnicodeInfo class in the System.Globalization namespace.

Just call CharUnicodeInfo.GetUnicodeCategory.

For example:

using System;
using System.Collections.Generic;
using System.Globalization;

namespace WindowsApplication3
{
    static class Program
    {
        static void Main()
        {
            string testKorean = "abcd한글";
            string testDigit = "abc123";
            string testSymbol = "\\*&";
            string testSpace = "abc def";

            PrintCategory(testKorean, GetCodeType(testKorean));
            PrintCategory(testDigit, GetCodeType(testDigit));
            PrintCategory(testSymbol, GetCodeType(testSymbol));
            PrintCategory(testSpace, GetCodeType(testSpace));
        }

        // Returns the distinct Unicode categories found in str, in order of
        // first appearance, or null if str is null or empty.
        static UnicodeCategory[] GetCodeType(string str)
        {
            if (string.IsNullOrEmpty(str))
            {
                return null;
            }

            List<UnicodeCategory> categories = new List<UnicodeCategory>();
            for (int i = 0; i < str.Length; i++)
            {
                // Look up the category of each UTF-16 code unit.
                UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(str[i]);
                if (!categories.Contains(category))
                {
                    categories.Add(category);
                }
            }
            return categories.ToArray();
        }

        // Prints the string on one line and its categories on the next.
        static void PrintCategory(string str, UnicodeCategory[] categories)
        {
            Console.WriteLine(str + " : ");

            if (categories != null)
            {
                for (int i = 0; i < categories.Length; i++)
                {
                    Console.Write(categories[i].ToString() + ", ");
                }
                Console.WriteLine();
            }
        }
    }
}

Running the code above produces the following output.
[Image: screenshot of the console output]
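
One caveat about the program above: it looks up the category of each char, that is, each UTF-16 code unit, so a character outside the Basic Multilingual Plane is seen as two Surrogate values rather than as one character. The sketch below is one possible variation, not part of the original program. It assumes the GetUnicodeCategory(string, int) overload and char.IsSurrogatePair are available, and the class and method names (SurrogateAwareCategories, GetCategories) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class SurrogateAwareCategories
{
    // Hypothetical helper: collects distinct categories like GetCodeType,
    // but treats a surrogate pair as a single character.
    public static List<UnicodeCategory> GetCategories(string text)
    {
        List<UnicodeCategory> result = new List<UnicodeCategory>();
        if (string.IsNullOrEmpty(text))
        {
            return result;
        }

        for (int i = 0; i < text.Length; i++)
        {
            // The (string, int) overload reads a whole surrogate pair at index i.
            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(text, i);
            if (!result.Contains(category))
            {
                result.Add(category);
            }

            // Skip the low surrogate that was consumed as part of the pair.
            if (char.IsSurrogatePair(text, i))
            {
                i++;
            }
        }
        return result;
    }

    static void Main()
    {
        // 'a' followed by U+10400, which is stored as two UTF-16 code units.
        string text = "a\U00010400";

        // Per code unit: the high surrogate on its own is reported as Surrogate.
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory(text[1]));

        // Per code point: the pair is looked up as one character.
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory(text, 1));

        foreach (UnicodeCategory category in GetCategories(text))
        {
            Console.WriteLine(category);
        }
    }
}
```

Unlike GetCodeType, this helper returns an empty list instead of null for a null or empty string, which is purely a design choice of the sketch.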

The categories it can identify are listed below.
http://msdn2.microsoft.com/en-us/librar ··· ory.aspx


Member name: Description
(All members are supported by the .NET Compact Framework.)

ClosePunctuation: Indicates that the character is the closing character of one of the paired punctuation marks, such as parentheses, square brackets, and braces. Signified by the Unicode designation "Pe" (punctuation, close). The value is 21.
ConnectorPunctuation: Indicates that the character is a connector punctuation, which connects two characters. Signified by the Unicode designation "Pc" (punctuation, connector). The value is 18.
Control: Indicates that the character is a control code, whose Unicode value is U+007F or in the range U+0000 through U+001F or U+0080 through U+009F. Signified by the Unicode designation "Cc" (other, control). The value is 14.
CurrencySymbol: Indicates that the character is a currency symbol. Signified by the Unicode designation "Sc" (symbol, currency). The value is 26.
DashPunctuation: Indicates that the character is a dash or a hyphen. Signified by the Unicode designation "Pd" (punctuation, dash). The value is 19.
DecimalDigitNumber: Indicates that the character is a decimal digit; that is, in the range 0 through 9. Signified by the Unicode designation "Nd" (number, decimal digit). The value is 8.
EnclosingMark: Indicates that the character is an enclosing mark, which is a nonspacing combining character that surrounds all previous characters up to and including a base character. Signified by the Unicode designation "Me" (mark, enclosing). The value is 7.
FinalQuotePunctuation: Indicates that the character is a closing or final quotation mark. Signified by the Unicode designation "Pf" (punctuation, final quote). The value is 23.
Format: Indicates that the character is a format character, which is not normally rendered but affects the layout of text or the operation of text processes. Signified by the Unicode designation "Cf" (other, format). The value is 15.
InitialQuotePunctuation: Indicates that the character is an opening or initial quotation mark. Signified by the Unicode designation "Pi" (punctuation, initial quote). The value is 22.
LetterNumber: Indicates that the character is a number represented by a letter, instead of a decimal digit; for example, the Roman numeral for five, which is 'V'. Signified by the Unicode designation "Nl" (number, letter). The value is 9.
LineSeparator: Indicates that the character is used to separate lines of text. Signified by the Unicode designation "Zl" (separator, line). The value is 12.
LowercaseLetter: Indicates that the character is a lowercase letter. Signified by the Unicode designation "Ll" (letter, lowercase). The value is 1.
MathSymbol: Indicates that the character is a mathematical symbol, such as '+' or '='. Signified by the Unicode designation "Sm" (symbol, math). The value is 25.
ModifierLetter: Indicates that the character is a modifier letter, which is a free-standing spacing character that indicates modifications of a preceding letter. Signified by the Unicode designation "Lm" (letter, modifier). The value is 3.
ModifierSymbol: Indicates that the character is a modifier symbol, which indicates modifications of surrounding characters; for example, the fraction slash indicates that the number to the left is the numerator and the number to the right is the denominator. Signified by the Unicode designation "Sk" (symbol, modifier). The value is 27.
NonSpacingMark: Indicates that the character is a nonspacing character, which indicates modifications of a base character. Signified by the Unicode designation "Mn" (mark, nonspacing). The value is 5.
OpenPunctuation: Indicates that the character is the opening character of one of the paired punctuation marks, such as parentheses, square brackets, and braces. Signified by the Unicode designation "Ps" (punctuation, open). The value is 20.
OtherLetter: Indicates that the character is a letter that is not an uppercase letter, a lowercase letter, a titlecase letter, or a modifier letter. Signified by the Unicode designation "Lo" (letter, other). The value is 4.
OtherNotAssigned: Indicates that the character is not assigned to any Unicode category. Signified by the Unicode designation "Cn" (other, not assigned). The value is 29.
OtherNumber: Indicates that the character is a number that is neither a decimal digit nor a letter number; for example, the fraction 1/2. Signified by the Unicode designation "No" (number, other). The value is 10.
OtherPunctuation: Indicates that the character is a punctuation that is not a connector punctuation, a dash punctuation, an open punctuation, a close punctuation, an initial quote punctuation, or a final quote punctuation. Signified by the Unicode designation "Po" (punctuation, other). The value is 24.
OtherSymbol: Indicates that the character is a symbol that is not a mathematical symbol, a currency symbol or a modifier symbol. Signified by the Unicode designation "So" (symbol, other). The value is 28.
ParagraphSeparator: Indicates that the character is used to separate paragraphs. Signified by the Unicode designation "Zp" (separator, paragraph). The value is 13.
PrivateUse: Indicates that the character is a private-use character, whose Unicode value is in the range U+E000 through U+F8FF. Signified by the Unicode designation "Co" (other, private use). The value is 17.
SpaceSeparator: Indicates that the character is a space character, which has no glyph but is not a control or format character. Signified by the Unicode designation "Zs" (separator, space). The value is 11.
SpacingCombiningMark: Indicates that the character is a spacing character, which indicates modifications of a base character and affects the width of the glyph for that base character. Signified by the Unicode designation "Mc" (mark, spacing combining). The value is 6.
Surrogate: Indicates that the character is a high-surrogate or a low-surrogate. Surrogate code values are in the range U+D800 through U+DFFF. Signified by the Unicode designation "Cs" (other, surrogate). The value is 16.
TitlecaseLetter: Indicates that the character is a titlecase letter. Signified by the Unicode designation "Lt" (letter, titlecase). The value is 2.
UppercaseLetter: Indicates that the character is an uppercase letter. Signified by the Unicode designation "Lu" (letter, uppercase). The value is 0.
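
The list above can also be printed from code, since UnicodeCategory is an ordinary enum whose members carry the numeric values quoted in the descriptions. The following is a minimal sketch along those lines. IsLetterCategory is a hypothetical helper added for illustration; it simply groups the five letter categories (Lu, Ll, Lt, Lm, Lo) and is not part of the framework.

```csharp
using System;
using System.Globalization;

static class CategoryTable
{
    // Hypothetical helper: true for the five letter categories (Lu, Ll, Lt, Lm, Lo).
    public static bool IsLetterCategory(UnicodeCategory category)
    {
        switch (category)
        {
            case UnicodeCategory.UppercaseLetter:
            case UnicodeCategory.LowercaseLetter:
            case UnicodeCategory.TitlecaseLetter:
            case UnicodeCategory.ModifierLetter:
            case UnicodeCategory.OtherLetter:
                return true;
            default:
                return false;
        }
    }

    static void Main()
    {
        // Print each member with its numeric value and whether it is a letter category.
        foreach (UnicodeCategory category in Enum.GetValues(typeof(UnicodeCategory)))
        {
            Console.WriteLine("{0,2} {1,-24} letter={2}",
                (int)category, category, IsLetterCategory(category));
        }
    }
}
```

Grouping categories this way is handy when the question is coarser than the enum, for example "is this character any kind of letter?", which is true both for 'a' (LowercaseLetter) and for '한' (OtherLetter).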


In addition, CharUnicodeInfo provides a few methods whose names alone are a welcome sight:

GetDecimalDigitValue()
GetDigitValue()
GetNumericValue()
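
As a rough illustration of these three methods, the sketch below calls each of them on a handful of sample characters. The sample characters and the class name NumericValueDemo are my own choices, not something from the original post. GetDecimalDigitValue and GetDigitValue return an int, and GetNumericValue returns a double.

```csharp
using System;
using System.Globalization;

static class NumericValueDemo
{
    static void Main()
    {
        // '7' (digit seven), U+00B2 (superscript two),
        // U+00BD (vulgar fraction one half), and 'a' (not a number).
        char[] samples = { '7', '\u00B2', '\u00BD', 'a' };

        foreach (char ch in samples)
        {
            // Format the double with the invariant culture so the
            // decimal separator does not depend on the machine's locale.
            Console.WriteLine(
                "U+{0:X4} decimal={1} digit={2} numeric={3}",
                (int)ch,
                CharUnicodeInfo.GetDecimalDigitValue(ch),
                CharUnicodeInfo.GetDigitValue(ch),
                CharUnicodeInfo.GetNumericValue(ch).ToString(CultureInfo.InvariantCulture));
        }
    }
}
```

For '7' all three methods return 7, and for a character with no numeric value, such as 'a', they return -1. The superscript and the fraction are included so you can compare how the three methods treat characters that are numeric but are not plain decimal digits.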

chaoskcuf
Programming/TIP & Study, 2007/07/13 19:06
