Indy, SOAP, XML and escape chars

Postby macicogna » Fri Apr 24, 2015 8:52 am

Hi All,

I'm working with SOAP and my first approach was to use Indy Components, TIdHTTP class, to develop requests and get responses with a WebServer developed in .NET.

Everything works just fine, but now I'm testing some heavy queries that produce (unzipped) responses of about 30 MBytes. As the web server was planned to be used on my client's intranet, my code handled this traffic just fine in the heavy tests.

So, my problem is that the responses come with escape chars inside the XML data. Here is an example:

Code: Select all
<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/"><s:Body>
...
&lt;xml&gt;&#xD;
  &lt;Node1&gt;&#xD;
  ...
  &lt;/Node1&gt;&#xD;
&lt;/xml&gt;
...
</s:Body></s:Envelope>


My approach was to use an optimized StringReplace function to change the escape chars to "<", ">" and "\n". With small responses this works very well, but with the heavy tests the find/replace took about 10 minutes to process the response XML in memory. The total computational time is about 12 minutes, so the StringReplace is my bottleneck.

On the other hand, I've tested the SOAP framework that uses the THTTPRIO class. I'm using C++Builder XE2, so I think this framework is also using Indy Components, as we can see in Soap.SOAPHTTPTrans unit. The answer with THTTPRIO is something like this:

Code: Select all
<xml>
  <Node1>
  ...
  </Node1>
</xml>


So my question is: is there a way to set up the TIdHTTP object so that it receives the response XML without escape chars?

I haven't tested THTTPRIO with my heavy tests, so I would like to ask this community whether THTTPRIO also does find/replace work or receives the response without escape chars.

Sorry about the length of my question.

Thank you in advance.

Marcelo.

References
FastStringReplace:
http://alexandrecmachado.blogspot.com.b ... elphi.html

Similar problem without answer:
http://codeverge.com/embarcadero.delphi ... ng/1079117

Re: Indy, SOAP, XML and escape chars

Postby rlebeau » Fri Apr 24, 2015 2:37 pm

macicogna wrote:So, my problem is that the responses came with escape chars inside XML data.


The server is sending an XML document embedded inside the SOAP envelope, which is unusual but not unheard of. It is technically valid, as the inner XML document is being properly escaped within the SOAP envelope. The SOAP node simply contains a string value, and that value happens to be another XML document, so it needs to be escaped like any other string value would be.

macicogna wrote:My approach was to use an optimized StringReplace function to change the escape chars to "<", ">" and "\n". With small responses this works very well, but with the heavy tests the find/replace took about 10 minutes to process the response XML in memory.


Then it is not very well optimized, now is it? ;) I would not use a StringReplace() type of function for this situation. It would be more efficient to just run through the SOAP node's data one time, replacing escape sequences as you encounter them, and then truncate the data memory one time (if needed) when finished.
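
As a rough sketch of that single-pass idea in plain standard C++11: the loop below copies characters forward inside the same buffer, decodes each escape sequence where it is encountered, and shrinks the buffer once at the end. The function name UnescapeXmlInPlace and the use of std::string (rather than AnsiString) are assumptions made for illustration, not code from this thread, and only numeric character references below 0x80 are decoded.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Returns the value of digit c in the given base (10 or 16), or -1.
static int DigitValue(char c, int base)
{
    int v = -1;
    if (c >= '0' && c <= '9') v = c - '0';
    else if (c >= 'a' && c <= 'f') v = c - 'a' + 10;
    else if (c >= 'A' && c <= 'F') v = c - 'A' + 10;
    return (v >= 0 && v < base) ? v : -1;
}

// Hypothetical helper: decodes the five predefined XML entities and ASCII
// numeric character references (&#NN; / &#xHH;) in one left-to-right pass.
// Anything it does not recognise is copied through unchanged.
void UnescapeXmlInPlace(std::string &s)
{
    static const struct { const char *name; char ch; } kEntities[] = {
        {"lt;", '<'}, {"gt;", '>'}, {"amp;", '&'}, {"quot;", '"'}, {"apos;", '\''}
    };
    const std::size_t n = s.size();
    std::size_t r = 0;  // read position
    std::size_t w = 0;  // write position
    while (r < n) {
        assert(w <= r);  // output never overtakes input, so in-place is safe
        if (s[r] == '&') {
            if (r + 1 < n && s[r + 1] == '#') {
                // Numeric character reference, decimal or hexadecimal.
                std::size_t p = r + 2;
                int base = 10;
                if (p < n && (s[p] == 'x' || s[p] == 'X')) { base = 16; ++p; }
                unsigned code = 0;
                std::size_t digits = 0;
                int d;
                while (p < n && digits < 8 && (d = DigitValue(s[p], base)) >= 0) {
                    code = code * static_cast<unsigned>(base) + static_cast<unsigned>(d);
                    ++p;
                    ++digits;
                }
                if (digits > 0 && p < n && s[p] == ';' && code > 0 && code < 0x80) {
                    s[w++] = static_cast<char>(code);
                    r = p + 1;
                    continue;
                }
            } else {
                // Named entity: compare the text after '&' with each known name.
                bool decoded = false;
                for (const auto &e : kEntities) {
                    const std::size_t len = std::strlen(e.name);
                    if (s.compare(r + 1, len, e.name) == 0) {
                        s[w++] = e.ch;
                        r += 1 + len;
                        decoded = true;
                        break;
                    }
                }
                if (decoded) continue;
            }
        }
        s[w++] = s[r++];  // ordinary character (or unrecognised '&')
    }
    s.resize(w);  // truncate once, when finished
}
```

Two details are worth noting. "&#xD;" decodes to a carriage return ('\r'), not to "\n". And because each '&' is examined exactly once, "&amp;lt;" correctly becomes "&lt;", whereas a chain of separate replace calls can turn it into "<" depending on the order of the calls.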

Besides, this is not even really needed anyway. Keep reading...

macicogna wrote:On the other hand, I've tested the SOAP framework that uses the THTTPRIO class. I'm using C++Builder XE2, so I think this framework is also using Indy Components, as we can see in Soap.SOAPHTTPTrans unit. The answer with THTTPRIO is something like this:

Code: Select all
<xml>
  <Node1>
  ...
  </Node1>
</xml>



That is what you should be getting when you actually read the text of the SOAP envelope node that contains the inner XML. Any XML parser would do this unescaping for you.

macicogna wrote:So my question is: is there a way to set up the TIdHTTP object so that it receives the response XML without escape chars?


No. This is not an HTTP issue, or even an Indy issue. TIdHTTP is simply providing you with the raw data as it is sent by the server. You have to then parse the data separately. Just use a normal XML parser to process the SOAP envelope, and it will unescape the inner XML data for you.

macicogna wrote:I haven't tested THTTPRIO with my heavy tests, so I would like to ask this community whether THTTPRIO also does find/replace work or receives the response without escape chars.


It is receiving the raw escaped XML and parsing it like any other XML document. The unescaping happens when you read the text of the node that contains the inner XML.
Remy Lebeau (TeamB)
Lebeau Software

Re: Indy, SOAP, XML and escape chars

Postby macicogna » Fri Apr 24, 2015 4:06 pm

Hi Remy,

Very insightful answer from you. Thanks again.

Also, it is nice to hear that my problem was not due to Indy. :D

So I think my problem now lies with my XML parser, which is based on TXMLDocument (and the MSXML DOM).

I didn't mention it at first, but I tried the FastStringReplace() approach because my TXMLDocument object wasn't able to find the inner <xml> node and, as a consequence, its hierarchical child nodes. Actually, if I open the XML file with IE, I can see that the inner XML data, with escape chars, is shown as a string, with "<" and ">", but without a hierarchical node structure. I can provide a real example if you need to check the details.

I've implemented a helper function just to locate the main node in order to import the XML data. Here it is:

Code: Select all
_di_IXMLNode
FindXMLNode(_di_IXMLNode AStartNode, String ANodeName) const
{
  _di_IXMLNode ANode = AStartNode;
  bool Stop  = false;
  bool Found = false;
  while (ANode && !Stop && !Found)
  {
    if (ANodeName == ANode->LocalName)
      Found = true;
    else
    {
      if (ANode->HasChildNodes)
        ANode = ANode->ChildNodes->GetNode(0);
      else
        Stop = true;
    }
  }
  if (!Found) ANode = NULL;
  return (ANode);
}


With an XML file similar to the example in my first post, it returns NULL when I call it like this:

Code: Select all
NodeDoc = MyXMLDoc->DocumentElement;
NodeXML = FindXMLNode(NodeDoc, "xml");
// Now NodeXML == NULL.


I don't want to bother you, but do you have any hint or reference on how to parse my XML SOAP responses with escape chars? Also, do you think that TXMLDocument, with its DOMVendor options, in particular the MSXML DOM, is the real problem here?

Thank you in advance.

Best,

Marcelo

Re: Indy, SOAP, XML and escape chars

Postby rlebeau » Sat Apr 25, 2015 5:39 pm

macicogna wrote:I didn't mention it at first, but I tried the FastStringReplace() approach because my TXMLDocument object wasn't able to find the inner <xml> node and, as a consequence, its hierarchical child nodes.


It would not find the inner XML's child nodes, because the inner XML document is just an arbitrary string value within the SOAP document. You would have to read the string value and then parse it as a new XML document in order to access its nodes.

macicogna wrote:Actually, if I open the XML file with IE, I can see that the inner XML data, with escape chars, are shown as a string


Because it really is, from the perspective of the outer SOAP document.

macicogna wrote:I've implemented a helper function just to locate the main node in order to import the XML data.


That is not a correct search algorithm. It needs to be recursive when searching child nodes, and it also needs to search sibling nodes, e.g.:

Code: Select all
_di_IXMLNode FindXMLNode(_di_IXMLNode AStartNode, String ANodeName) const
{
  _di_IXMLNode ANode = AStartNode;
  while (ANode)
  {
    if (ANodeName == ANode->LocalName)
      return ANode;

    int count = ANode->ChildNodes->Count;
    for(int i = 0; i < count; ++i)
    {
      _di_IXMLNode Found = FindXMLNode(ANode->ChildNodes->Nodes[i], ANodeName);
      if (Found)
        return Found;
    }

    ANode = ANode->NextSibling;
  }

  return NULL;
}


I would suggest taking it a step further by incorporating XPath into the algorithm, when available, as it will greatly simplify the searching:

Code: Select all
_di_IXMLNode FindXMLNode(_di_IXMLNode AStartNode, String ANodeName) const
{
  _di_IDOMNodeSelect XPath = AStartNode->DOMNode;
  if (XPath)
  {
    _di_IDOMNode Found = XPath->selectNode(L"//"+ANodeName);
    if (Found)
    {
      TXMLDocument *doc = NULL;
      _di_IXmlDocumentAccess docAccess = AStartNode->OwnerDocument;
      if (docAccess)
        doc = docAccess->DocumentObject;
      return new TXMLNode(Found, NULL, doc);
    }
    return NULL;
  }
  else
  {
    // code above...
  }
}


macicogna wrote:With an XML file similar to the example in my first post, it returns NULL when I call it like this


As it should, because "<xml>" is not a node of the outer SOAP document. You need to search for the outer SOAP node that contains the "<xml>" data as its Text.

macicogna wrote:I don't want to bother you, but do you have any hint or reference on how to parse my XML SOAP responses with escape chars?


Again, DON'T try to parse them. Let the XML engine do it for you. Find the XML node that contains the inner XML document, read that node's Text, which will unescape the characters, and then parse that string as a new XML document.

macicogna wrote:Also, do you think that TXMLDocument, with its DOMVendor options, in particular the MSXML DOM, is the real problem here?


No. You are simply not processing the SOAP document correctly to locate, extract, and parse the inner XML the right way. The DOMVendor is not a factor, as this is how XML works in general. You have to take these factors into account for any DOMVendor.
Remy Lebeau (TeamB)
Lebeau Software

Re: Indy, SOAP, XML and escape chars

Postby macicogna » Sun Apr 26, 2015 8:15 am

Hi Remy,

Thanks again for your support.

I'll summarize the thread here and ask just one final question at the end of this post.

With the hint about the Text property I was able to get the inner XML string and convert it into a real XML file with the data I need. If someone else needs details I can publish the code here in future posts. I'm doing this just to be concise. The final and readable XML file has now about 25 MBytes.

Also, the example with XPath is very nice. I didn't know that resource and it is very helpful. My start function was intended just to catch the inner XML node, so now I think I might call it GetFirstChildXMLNode(). :D

About the DOM question: I have now reorganized all my code and realized that the 10-minute bottleneck is due to the XML parsing time with TXMLDocument. I looked into it and found a consensus that DOM can be slow with large XML files, so I think this is my case now.

Just to be fair, FastStringReplace took 4.5 s to replace "&lt;", "&gt;" and "&#xD;" with "<", ">" and "\n", respectively, in a 30 MByte file loaded as an AnsiString object. It isn't necessary anymore, but it is still fast. :D

Contrary to what I said initially, I now know that the heavy processing time really starts when it reaches the loop over the nodes, especially when it determines the number of nodes (Count), as shown next:

Code: Select all
...
for (int p=0; p<ANodeList->ChildNodes->Count; p++)
{
  ANodeData = ANodeList->ChildNodes->GetNode(p);
  ...
}


I've looked into SAX, but I estimate that this approach would force me to throw away all the code I have for reading the XML data I need. The data looks like this:

Code: Select all
<xml>
  <List>
    <Node1>
      <Field1>Value</Field1>
      ...
      <FieldN>Value</FieldN>
    </Node1>
    ...
    <NodeM>
      <Field1>Value</Field1>
      ...
      <FieldN>Value</FieldN>
    </NodeM>
  </List>
</xml>


The large XML file has N around 6 fields and M up to 100,000 nodes.

Just checking: do you agree that DOM is now the obstacle, due to a large XML file? I'm not complaining about it, I'm just checking the real situation in order to try an alternative, because the 10-minute process is quite a problem for my client.

Due to the simple structure of my XML data, I think I'll try a mini parser class, based on linear reading of the content between tags, instead of DOM or SAX. If I succeed, I'll post the code here in order to complete the thread and help others with similar problems.

But I'll wait for your comment before trying this alternative.

Thanks for your attention.

Best,

Marcelo.

Re: Indy, SOAP, XML and escape chars

Postby rlebeau » Sun Apr 26, 2015 11:25 am

macicogna wrote:About the DOM question: I have now reorganized all my code and realized that the 10-minute bottleneck is due to the XML parsing time with TXMLDocument. I looked into it and found a consensus that DOM can be slow with large XML files, so I think this is my case now.


Yes, DOM is inherently slow for large documents. SAX is faster, but Delphi does not have a built-in SAX framework, so you would have to use a 3rd party library.

macicogna wrote:Contrary to what I said initially, I now know that the heavy processing time really starts when it reaches the loop over the nodes, especially when it determines the number of nodes (Count)


At the very least, read the Count once and save it in a variable, don't re-read it on each loop iteration:

Code: Select all
int count = ANodeList->ChildNodes->Count;
for (int p=0; p<count; p++)
{
  ANodeData = ANodeList->ChildNodes->GetNode(p);
  ...
}


macicogna wrote:I've looked into SAX, but I estimate that this approach would force me to throw away all the code I have for reading the XML data I need.


Yes, SAX is a different methodology than DOM (DOMs are usually built using SAX internally).

macicogna wrote:Just checking: do you agree that DOM is now the obstacle, due to a large XML file?


If memory usage and speed are an issue, yes. Use SAX instead (callback based), or even better use an XmlReader API (linear reading).
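
For data as regular as the <xml>/<List>/<NodeK>/<FieldK> layout shown earlier in this thread, linear reading can be sketched in plain standard C++11 without any XML library. This is only an illustration under strong assumptions: the class name FlatRecordReader and its Next() method are hypothetical, the input is well formed, tags carry no attributes, there are no CDATA sections, no self-closing tags and no comments containing '>', and field values are returned as raw text (entities are not decoded). It walks the buffer once and hands back one record at a time, so no node tree is ever built.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One record = the (field name, field text) pairs of a single <NodeK> element.
typedef std::vector<std::pair<std::string, std::string> > Record;

class FlatRecordReader
{
public:
    // recordDepth is the nesting level of the record elements:
    // <xml> = 1, <List> = 2, <NodeK> = 3 in the layout discussed above.
    // The reader keeps a reference to xml, so xml must outlive the reader.
    explicit FlatRecordReader(const std::string &xml, int recordDepth = 3)
        : xml_(xml), pos_(0), depth_(0), recordDepth_(recordDepth)
    {
        assert(recordDepth >= 1);
    }

    // Scans forward to the end of the next record element and fills rec.
    // Returns false when no further record is found.
    bool Next(Record &rec)
    {
        rec.clear();
        std::string fieldName;
        std::size_t textStart = 0;
        for (;;) {
            const std::size_t lt = xml_.find('<', pos_);
            if (lt == std::string::npos) return false;
            const std::size_t gt = xml_.find('>', lt + 1);
            if (gt == std::string::npos) return false;
            pos_ = gt + 1;

            const char c = xml_[lt + 1];
            if (c == '?' || c == '!') continue;   // skip <?xml ...?> and <!...>

            if (c == '/') {                        // closing tag
                if (depth_ == 0) return false;     // unbalanced input
                if (depth_ == recordDepth_ + 1)    // a field just ended
                    rec.push_back(std::make_pair(
                        fieldName, xml_.substr(textStart, lt - textStart)));
                --depth_;
                if (depth_ == recordDepth_ - 1)    // a record just ended
                    return true;
            } else {                               // opening tag
                ++depth_;
                if (depth_ == recordDepth_ + 1) {  // a field starts here
                    fieldName = xml_.substr(lt + 1, gt - lt - 1);
                    textStart = pos_;
                }
            }
        }
    }

private:
    const std::string &xml_;
    std::size_t pos_;
    int depth_;
    int recordDepth_;
};
```

Each call to Next() resumes where the previous one stopped, so a loop such as while (reader.Next(rec)) { ... } visits all M records in a single pass over the text.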
Remy Lebeau (TeamB)
Lebeau Software

Re: Indy, SOAP, XML and escape chars

Postby smd » Mon Apr 27, 2015 1:31 am

What is SOAP used for?
-----------------------------
Scott

Re: Indy, SOAP, XML and escape chars

Postby macicogna » Mon Apr 27, 2015 6:49 am

Hi Scott,

I'm using the SOAP Protocol to access WebServices developed with WCF.

I'm using Indy Components, mainly the TIdHTTP class, to send requests and receive responses based on XML data. SOAP uses HTTP methods and encapsulates data using XML, calling them "Envelopes".

Delphi and C++Builder have a framework to automate SOAP use ("WSDL Import", for example), but I decided to use Indy because it is already used in my project, which makes deploying the new feature quicker, and I think this will help future maintenance.

Also, it was a nice opportunity to learn more about HTTP and XML, with this great help from Remy.

This thread was about how to manage large XML files received as responses. In my case, just in specific heavy tests to the WCF WebService.

I can give you more details if you want.

Best,

Marcelo.