The problem with the encoding

Postby Lena » Mon Mar 30, 2015 6:24 am

Hi.
The device sends Russian letters "ABC"
Code: Select all
 if(JSONPair != NULL)
   {
   String JsonMemberName = JSONPair->JsonString->Value();
   String JsonMemberValue = JSONPair->JsonValue->ToString(); // here I get "???"
   }


But I get "???" instead (see picture).
How do I get the Russian letters from JSONPair->JsonValue?
Thanks.
Attachments
1.jpg
Lena
BCBJ Master
Posts: 583
Joined: Sun Feb 06, 2011 1:28 pm

Re: The problem with the encoding

Postby rlebeau » Mon Mar 30, 2015 4:47 pm

What you have shown indicates that the JSON data is going through a charset conversion using a charset that does not support the Russian characters you are looking for. But you have not provided enough information to diagnose why that is happening.

JSON uses UTF-8 by default. Does the input JSON use UTF-8?

When you parse the JSON into a TJSONValue using TJSONObject::ParseJSONValue(), are you parsing it as a String or as a byte array? If the former, where is the String coming from? If the latter, what are you setting the IsUTF8 parameter to?

Where is the JSON actually coming from? Did you verify that the Russian characters in question are correct in the raw JSON data before you parse it?
Last edited by rlebeau on Tue Mar 31, 2015 2:38 pm, edited 1 time in total.
Remy Lebeau (TeamB)
Lebeau Software
rlebeau
BCBJ Author
Posts: 1545
Joined: Wed Jun 01, 2005 3:21 am
Location: California, USA

Re: The problem with the encoding

Postby Lena » Tue Mar 31, 2015 12:23 am

Hi.
rlebeau wrote:If the former, where is the String coming from?


My code C++ Builder XE7 up1:
Code: Select all
void ParseJSONValue(String JsonData, String ServerPascalIP, TIdContext *AContext)
{

  try
  {
     String IP = ServerPascalIP;
     String GJSONString = JsonData;
     std::unique_ptr<TJSONValue> LJSONValue(TJSONObject::ParseJSONValue(GJSONString));
     TJSONObject *LJSONObject = dynamic_cast<TJSONObject*>(LJSONValue.get());

     if(LJSONObject != NULL)
      {
       TJSONPair * JSONPair;
       TJSONArray * JsonArraySens;
       String JsonMemberName;
       String JsonMemberValue;

       for (int i = 0; i < LJSONObject->Count; i++)
        {

         JSONPair = LJSONObject->Pairs[i];
         if(JSONPair != NULL)
            {
             JsonMemberName = JSONPair->JsonString->Value();
             JsonMemberValue = JSONPair->JsonValue->ToString();//"???"


 //***

void __fastcall TForm1Main::IdTCPServer1Execute(TIdContext *AContext)
{
 //***
 String Sdata = AContext->Connection->IOHandler->ReadLn();
 String ConnectServerIP = AContext->Connection->Socket->Binding->PeerIP;
 ParseJSONValue(Sdata, ConnectServerIP, AContext);
 //***
Last edited by Lena on Tue Mar 31, 2015 12:35 am, edited 1 time in total.

Re: The problem with the encoding

Postby Lena » Tue Mar 31, 2015 12:28 am

I tried:
String Sdata = AContext->Connection->IOHandler->ReadLn(TEncoding::UTF8);

[bcc32 Error] Unit1MainForm.cpp(736): E2285 Could not find a match for 'TIdIOHandler::ReadLn(TEncoding *)'

If I do this:
AContext->Connection->IOHandler->DefStringEncoding = enUTF8;
String Sdata = AContext->Connection->IOHandler->ReadLn();
I get "???" again.

Re: The problem with the encoding

Postby HsiaLin » Tue Mar 31, 2015 5:21 am

Have you tried it like this:

AContext->Connection->IOHandler->DefStringEncoding = enUTF8;
UTF8String Sdata = AContext->Connection->IOHandler->ReadLn();
HsiaLin
BCBJ Master
Posts: 299
Joined: Sun Jul 08, 2007 6:29 pm

Re: The problem with the encoding

Postby Lena » Tue Mar 31, 2015 8:20 am

HsiaLin wrote:Have you tried it like this:

AContext->Connection->IOHandler->DefStringEncoding = enUTF8;
UTF8String Sdata = AContext->Connection->IOHandler->ReadLn();


Hi.
I still see "???" :(

Re: The problem with the encoding

Postby smd » Tue Mar 31, 2015 8:47 am

Lena, what are the hex values of the Russian characters (unicode and utf-8 ideally) that you are having a problem with? I want to try something. set a break point at the instruction and look at the hex values. also try typing the characters into notepad, or some other unicode/utf8 capable editor, save, then look at the hex values. might be the characters are multi-byte, but are being corrupted or missing one of the lead bytes due to improper conversion between unicode/utf8/ansi.
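To make the hex comparison suggested here concrete, the following is a minimal sketch in plain C++ with no Indy or VCL dependency. The helper names `EncodeUtf8` and `ToHex` are hypothetical and chosen only for illustration. For the first three Cyrillic capitals, U+0410 to U+0412, UTF-8 uses two bytes per letter (D0 90, D0 91, D0 92), while Windows-1251 uses the single bytes C0, C1 and C2.

```cpp
#include <cstdio>
#include <string>

// Encode one Unicode code point below U+10000 (enough for Cyrillic) as UTF-8.
// Surrogates and code points above U+FFFF are not handled in this sketch.
std::string EncodeUtf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Format raw bytes as space-separated upper-case hex, for example "D0 90".
std::string ToHex(const std::string &bytes)
{
    std::string out;
    char buf[4];
    for (unsigned char b : bytes) {
        if (!out.empty())
            out += ' ';
        std::snprintf(buf, sizeof(buf), "%02X", b);
        out += buf;
    }
    return out;
}
```

Calling `ToHex(EncodeUtf8(0x0410))` gives the UTF-8 bytes of the first letter, which can then be compared with whatever bytes actually arrive from the device.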
-----------------------------
Scott
smd
BCBJ Guru
Posts: 130
Joined: Sat Nov 29, 2014 8:02 pm
Location: Las Vegas

Re: The problem with the encoding

Postby rlebeau » Tue Mar 31, 2015 2:52 pm

Lena wrote:My code C++ Builder XE7 up1


You are parsing the JSON as a String, so any character loss would have had to occur when you stored the JSON in the String before parsing it.

Lena wrote:
Code: Select all
void __fastcall TForm1Main::IdTCPServer1Execute(TIdContext *AContext)
{
 //***
 String Sdata = AContext->Connection->IOHandler->ReadLn();
 ...
 ParseJSONValue(Sdata, ConnectServerIP, AContext);
 //***



By calling ReadLn() without any parameters, raw bytes read from the socket will be decoded into a String using the encoding specified in the IOHandler->DefStringEncoding property, or the global GIdDefaultTextEncoding variable if DefStringEncoding is NULL. Both are set to ASCII by default. That would explain why you are losing Russian characters.
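The effect of an ASCII default can be illustrated with a small standalone sketch. This is not Indy's actual implementation; `DecodeAsAscii` is a hypothetical helper that only mimics what a strict 7-bit decoder does with bytes it cannot represent.

```cpp
#include <string>

// Decode bytes as 7-bit ASCII. Any byte >= 0x80 has no ASCII meaning,
// so it is replaced with a question mark and the original value is lost.
std::string DecodeAsAscii(const std::string &raw)
{
    std::string out;
    for (unsigned char b : raw)
        out += (b < 0x80) ? static_cast<char>(b) : '?';
    return out;
}
```

Under this model every non-ASCII byte becomes one question mark, so three letters sent as three single bytes would show up as three question marks, while the same letters sent as UTF-8 would produce six.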

So, assuming the JSON really is UTF-8 on the wire, you can specify UTF-8 when calling ReadLn():

Code: Select all
// depending on which version of Indy 10 you are using...
String Sdata = AContext->Connection->IOHandler->ReadLn(IndyTextEncoding_UTF8());
// ...->ReadLn(TIdTextEncoding::UTF8);
// ...->ReadLn(TIdTextEncoding_UTF8);
// ...->ReadLn(IndyUTF8Encoding());
// ...->ReadLn(enUTF8);


Or you can set DefStringEncoding to UTF-8 before calling ReadLn(), such as when the client connects:

Code: Select all
void __fastcall TForm1Main::IdTCPServer1Connect(TIdContext *AContext)
{
    // depending on which version of Indy 10 you are using...
    AContext->Connection->IOHandler->DefStringEncoding = IndyTextEncoding_UTF8();
    // ... = TIdTextEncoding::UTF8;
    // ... = TIdTextEncoding_UTF8;
    // ... = IndyUTF8Encoding();
    // ... = enUTF8;
}

Re: The problem with the encoding

Postby rlebeau » Tue Mar 31, 2015 3:02 pm

Lena wrote:I tried:
String Sdata = AContext->Connection->IOHandler->ReadLn(TEncoding::UTF8);

[bcc32 Error] Unit1MainForm.cpp(736): E2285 Could not find a match for 'TIdIOHandler::ReadLn(TEncoding *)'


ReadLn() expects a TIdTextEncoding* or an IIdTextEncoding* (depending on which version of Indy 10 you are using). Presumably you are using Indy 10.6.0.0 (when IIdTextEncoding was introduced) or later, as TIdTextEncoding in earlier versions was just an alias for TEncoding in CB2009+. But you should never have been using TEncoding with Indy to begin with, only TIdTextEncoding/IIdTextEncoding (if you really want to use TEncoding, 10.6.0.0+ has an IndyTextEncoding() overload that wraps a TEncoding inside of an IIdTextEncoding).

Lena wrote:If I do this:
AContext->Connection->IOHandler->DefStringEncoding = enUTF8;
String Sdata = AContext->Connection->IOHandler->ReadLn();
I get "???" again.


Then the raw JSON data being transmitted is likely not actually UTF-8 encoded to begin with. I would guess that it is actually ANSI encoded instead, probably with a Russian-enabled charset like Windows-1251, ISO-8859-5, or KOI8-R/KOI8-U. Indy can convert bytes to String using those charsets if they are installed on your OS (see the IndyTextEncoding(CodePage) and CharsetToEncoding() functions), but first you need to identify the real charset being used on the transmission of the JSON data.
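For reference, here is a rough standalone sketch of what decoding Windows-1251 involves. It is plain C++, not Indy code, and `DecodeWindows1251` and `AppendUtf8` are hypothetical helper names. It covers only part of the code page: ASCII, the contiguous block 0xC0 to 0xFF (U+0410 to U+044F), and the two letters at 0xA8 and 0xB8. Every other byte is replaced with U+FFFD, which a real decoder would not do.

```cpp
#include <string>

// Append one code point below U+10000 to a UTF-8 string.
static void AppendUtf8(std::string &out, unsigned cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Decode Windows-1251 bytes to UTF-8 (partial mapping, see the note above).
std::string DecodeWindows1251(const std::string &raw)
{
    std::string out;
    for (unsigned char b : raw) {
        unsigned cp;
        if (b < 0x80)       cp = b;                    // ASCII is unchanged
        else if (b >= 0xC0) cp = 0x0410 + (b - 0xC0);  // main Russian alphabet
        else if (b == 0xA8) cp = 0x0401;               // capital IO
        else if (b == 0xB8) cp = 0x0451;               // small io
        else                cp = 0xFFFD;               // not covered in this sketch
        AppendUtf8(out, cp);
    }
    return out;
}
```

Feeding it the bytes C0 C1 C2 yields the UTF-8 sequence D0 90 D0 91 D0 92, that is, the first three letters of the Russian alphabet.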
Last edited by rlebeau on Tue Mar 31, 2015 3:18 pm, edited 2 times in total.

Re: The problem with the encoding

Postby rlebeau » Tue Mar 31, 2015 3:08 pm

HsiaLin wrote:Have you tried it like this:

AContext->Connection->IOHandler->DefStringEncoding = enUTF8;
UTF8String Sdata = AContext->Connection->IOHandler->ReadLn();


That is not very useful in this situation. ReadLn() returns a UnicodeString in CB2009+. It would be receiving raw bytes from the socket and converting them from UTF-8 to UTF-16 for output. Assigning a UnicodeString to a UTF8String converts from UTF-16 to UTF-8. ParseJSONValue() takes a UnicodeString as input, so passing a UTF8String would convert from UTF-8 back to UTF-16 again. That is a lot of unnecessary conversions.
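The conversion chain described above can be sketched in plain C++ to show that the round trip only reproduces the original bytes. These are hypothetical helpers, not the RTL's conversion routines. They handle only the Basic Multilingual Plane and do no validation beyond a length check.

```cpp
#include <cstddef>
#include <string>

// UTF-8 -> UTF-16 for 1- to 3-byte sequences (BMP only).
std::u16string Utf8ToUtf16(const std::string &in)
{
    // Read a byte as an unsigned value so the bit masks behave predictably.
    auto at = [&in](std::size_t k) {
        return static_cast<unsigned>(static_cast<unsigned char>(in[k]));
    };
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned b = at(i);
        if (b < 0x80) {
            out += static_cast<char16_t>(b);
            i += 1;
        } else if ((b & 0xE0) == 0xC0 && i + 1 < in.size()) {
            out += static_cast<char16_t>(((b & 0x1F) << 6) | (at(i + 1) & 0x3F));
            i += 2;
        } else if ((b & 0xF0) == 0xE0 && i + 2 < in.size()) {
            out += static_cast<char16_t>(((b & 0x0F) << 12) |
                                         ((at(i + 1) & 0x3F) << 6) |
                                         (at(i + 2) & 0x3F));
            i += 3;
        } else {
            out += static_cast<char16_t>(0xFFFD);  // unsupported or truncated input
            i += 1;
        }
    }
    return out;
}

// UTF-16 -> UTF-8 for BMP code units (surrogate pairs are not combined).
std::string Utf16ToUtf8(const std::u16string &in)
{
    std::string out;
    for (char16_t u : in) {
        unsigned cp = u;
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Six UTF-8 bytes for three Cyrillic letters become three UTF-16 code units and then the same six bytes again, so the intermediate UTF8String adds work without changing the result.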

Re: The problem with the encoding

Postby rlebeau » Tue Mar 31, 2015 3:13 pm

smd wrote:Lena, what are the hex values of the Russian characters (unicode and utf-8 ideally) that you are having a problem with? I want to try something. set a break point at the instruction and look at the hex values. also try typing the characters into notepad, or some other unicode/utf8 capable editor, save, then look at the hex values.


If the raw JSON bytes on the network are not actually encoded as UTF-8 to begin with, the Russian characters would already be lost before ReadLn() exits, if decoded as UTF-8. So you would not be able to inspect the hex data, unless you step into ReadLn() itself with the debugger and look at the raw bytes it reads, before they are converted to a String. Otherwise, a capture of the raw network data from a packet sniffer, such as Wireshark, would be more useful to inspect.

smd wrote:might be the characters are multi-byte, but are being corrupted or missing one of the lead bytes due to improper conversion between unicode/utf8/ansi.


I suspect the raw JSON bytes being transmitted are not actually encoded as UTF-8. That would explain the loss of characters when decoded as UTF-8.
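Once the raw bytes have been captured, one way to test that suspicion is to check whether they form well-formed UTF-8 at all. The sketch below is a simplified standalone checker in plain C++. `IsValidUtf8` is a hypothetical helper, not an Indy or RTL function. A passing result does not prove the data is UTF-8, since text in another charset can occasionally happen to be valid, but a failing result rules UTF-8 out.

```cpp
#include <cstddef>
#include <string>

// Returns true if raw is well-formed UTF-8. Rejects invalid lead bytes,
// missing continuation bytes, overlong forms, surrogates and values
// above U+10FFFF.
bool IsValidUtf8(const std::string &raw)
{
    // Smallest code point that legitimately needs a sequence of each length.
    static const unsigned kMin[] = {0, 0, 0x80, 0x800, 0x10000};

    const std::size_t n = raw.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = static_cast<unsigned char>(raw[i]);
        std::size_t len;
        unsigned cp;
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else return false;               // stray continuation or invalid lead byte

        if (i + len > n)
            return false;                // sequence is truncated

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(raw[i + k]);
            if ((c & 0xC0) != 0x80)
                return false;            // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < kMin[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;                // overlong, out of range, or surrogate
        i += len;
    }
    return true;
}
```

The Windows-1251 bytes C0 C1 C2 fail this check, because C0 announces a two-byte sequence and C1 is not a continuation byte, whereas the UTF-8 bytes D0 90 D0 91 D0 92 pass.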

Re: The problem with the encoding

Postby Lena » Wed Apr 01, 2015 2:02 am

rlebeau wrote:probably with a Russian-enabled charset like Windows-1251


Thank you very much!
Now the letters display correctly.
Code: Select all
void __fastcall TForm1Main::IdTCPServer1Connect(TIdContext *AContext)
{
   // The device sends Windows-1251 (code page 1251), not UTF-8
   AContext->Connection->IOHandler->DefStringEncoding = IndyTextEncoding(1251);
}

