Embarcadero's Missing IsPunctuation() Characters

Postby smd » Fri May 22, 2015 1:53 pm

I have limited trust in the non-standard library routines provided by Embarcadero. For instance, for determining upper case, lower case, and punctuation, Embarcadero provides several routines that supposedly do the same job as the standard C/C++ routines isupper(), islower(), and ispunct().

I wrote a function that scanned all 65536 characters of Unicode, checking each character with IsUpper(), IsLower(), and IsPunctuation().

Attached are the results: three text files, one each for upper case, lower case, and punctuation, in UTF-8 format. They should open properly in any Unicode- or UTF-8-compliant text editor (I tested them in several editors).

Regarding the punctuation list: note the gaps in what IsPunctuation() considers punctuation. The C and C++ standards define ispunct() and iswpunct() as returning true for any of

! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
or any punctuation character specific to the current locale.

Note the characters missing from Embarcadero's IsPunctuation() function, specifically:

$ + < = > ^ ` | ~
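For anyone who wants to check the standard-library side of this comparison, here is a minimal sketch in plain standard C++ (not C++Builder-specific). It assumes the program is still in the default "C" locale, and the helper name ascii_punct() is made up for illustration.

```cpp
#include <cctype>
#include <string>

// Collect every 7-bit ASCII character that std::ispunct() accepts.
// Assumption: the program has not called setlocale(), so the default
// "C" locale is in effect and only the ASCII punctuation set matches.
std::string ascii_punct() {
    std::string out;
    for (int c = 0; c < 128; ++c) {
        if (std::ispunct(c)) {
            out.push_back(static_cast<char>(c));
        }
    }
    return out;
}
```

Under that assumption the returned string holds the 32 characters listed above, including the nine that IsPunctuation() rejects.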

As for the non-ASCII characters, I would assume some are missing as well, given that Embarcadero cannot get the basic ASCII punctuation correct.
Attachments
UpperLowerPunct.zip
Zip file of 3 text files in utf-8 format
(10.5 KiB) Downloaded 628 times
-----------------------------
Scott

Re: Embarcadero's Missing IsPunctuation() Characters

Postby rlebeau » Fri May 22, 2015 7:55 pm

smd wrote: I wrote a function that scanned all 65536 characters of Unicode, checking each character with IsUpper(), IsLower(), and IsPunctuation().

Unicode has a LOT more than 65536 codepoints defined: there is room for 1114112 (U+0000 through U+10FFFF), though many are not allocated yet. You are thinking of just the codepoints in the Basic Multilingual Plane (BMP), which has the same 65536 codepoints as UCS-2, and you are ignoring the codepoints outside of it. The Char-based functions only handle BMP codepoints; to handle the higher codepoints, you have to use the UnicodeString-based functions.
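To make the arithmetic concrete, here is a rough sketch in standard C++ of why a single 16-bit character cannot carry a codepoint outside the BMP. The names (kPlaneSize, in_bmp, to_utf16) are illustrative only and are not part of Embarcadero's RTL.

```cpp
#include <cstdint>
#include <string>

constexpr std::uint32_t kPlaneSize = 0x10000;  // 65536 codepoints per plane
constexpr std::uint32_t kPlaneCount = 17;      // planes 0 through 16
constexpr std::uint32_t kCodePointCount = kPlaneSize * kPlaneCount;  // 1114112

// True when the codepoint lies in the Basic Multilingual Plane (plane 0).
constexpr bool in_bmp(std::uint32_t cp) { return cp < kPlaneSize; }

// Encode one codepoint as UTF-16. Returns an empty string for values that
// are not encodable (surrogate codepoints or anything above U+10FFFF).
std::u16string to_utf16(std::uint32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return {};
    }
    if (in_bmp(cp)) {
        return std::u16string(1, static_cast<char16_t>(cp));
    }
    cp -= 0x10000;  // 20 bits remain, split into two 10-bit halves
    std::u16string out;
    out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
    out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    return out;
}
```

A BMP codepoint such as U+0024 comes back as one 16-bit unit, while U+1F600 comes back as the pair 0xD83D 0xDE00. A function that receives a single 16-bit character only ever sees one half of such a pair, which is why a string-based overload is needed for those codepoints.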

smd wrote: Regarding the punctuation list. Note the breaks in what IsPunctuation() considers punctuation. The C and C++ standard defines ispunct() or iswpunct() as returning true for any of

! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
or any punctuation character specific to the current locale.

Note the missing characters in Embarcadero's IsPunctuation() function. Specifically

$ + < = > ^ ` | ~


Embarcadero's functions use internal lookup tables that are based on Unicode classifications. You can use the GetUnicodeCategory() function to determine Embarcadero's categorization for any given codepoint. The specific ones you mention above are categorized as follows:

$ ucCurrencySymbol
+ ucMathSymbol
< ucMathSymbol
= ucMathSymbol
> ucMathSymbol
^ ucModifierSymbol
` ucModifierSymbol
| ucMathSymbol
~ ucMathSymbol

That is why IsPunctuation() is false for all of them - they are all classified as symbols (IsSymbol() is true).

As an example, the Unicode standard classifies the "$" character as a "currency symbol"; it has no punctuation attributes (unless you count that it has a Bidi_Class of "European_Terminator").

IsPunctuation looks for the following classifications:

ucConnectPunctuation
ucDashPunctuation
ucClosePunctuation
ucFinalPunctuation
ucInitialPunctuation
ucOtherPunctuation
ucOpenPunctuation

None of the characters above falls into any of these classifications.
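The category-based test described above can be sketched in standard C++ roughly as follows. This is not Embarcadero's implementation: the enum, the function names, and the hard-coded ASCII table are illustrative assumptions, with the category values taken from the Unicode general categories for these characters.

```cpp
#include <string>

// Hypothetical stand-in for the Unicode general categories named in this thread.
enum class Category {
    ConnectPunctuation, DashPunctuation, OpenPunctuation, ClosePunctuation,
    InitialPunctuation, FinalPunctuation, OtherPunctuation,
    MathSymbol, CurrencySymbol, ModifierSymbol, Unknown
};

// Unicode general category for the 32 ASCII characters that ispunct() accepts.
// Anything else is reported as Unknown in this sketch.
Category category_of(char c) {
    switch (c) {
        case '_': return Category::ConnectPunctuation;
        case '-': return Category::DashPunctuation;
        case '(': case '[': case '{': return Category::OpenPunctuation;
        case ')': case ']': case '}': return Category::ClosePunctuation;
        case '!': case '"': case '#': case '%': case '&': case '\'':
        case '*': case ',': case '.': case '/': case ':': case ';':
        case '?': case '@': case '\\': return Category::OtherPunctuation;
        case '+': case '<': case '=': case '>': case '|': case '~':
            return Category::MathSymbol;
        case '$': return Category::CurrencySymbol;
        case '^': case '`': return Category::ModifierSymbol;
        default: return Category::Unknown;
    }
}

// Punctuation means one of the seven punctuation categories listed above.
bool is_punctuation(Category cat) {
    switch (cat) {
        case Category::ConnectPunctuation: case Category::DashPunctuation:
        case Category::OpenPunctuation:    case Category::ClosePunctuation:
        case Category::InitialPunctuation: case Category::FinalPunctuation:
        case Category::OtherPunctuation:
            return true;
        default:
            return false;
    }
}

// Symbol means one of the symbol categories covered by this sketch.
bool is_symbol(Category cat) {
    return cat == Category::MathSymbol || cat == Category::CurrencySymbol ||
           cat == Category::ModifierSymbol;
}
```

With this table, 23 of the 32 ASCII characters come out as punctuation and the remaining nine ($ + < = > ^ ` | ~) come out as symbols, which matches the results reported at the top of the thread.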

Embarcadero's functions are following the Unicode standard, not the C/C++ standard.
Remy Lebeau (TeamB)
Lebeau Software