TWebBrowser with Navigate inside a loop

Postby macicogna » Tue Aug 23, 2016 10:43 am

Hi All,

This is a very specific topic, so any hint will be appreciated.

Tested Environments: BDS 2006 (Windows Vista) and C++Builder XE2 (Windows 7 and 10).

I've developed a simple VCL App to test the response of Search Engines like Google, Bing and Yahoo. My focus is the performance in terms of seconds/search and traffic that leads to Captcha blocking.

The main idea is to load a list of terms with a TStringList and use TWebBrowser::Navigate() to open URL patterns like these:

Code:
Google=https://www.google.com.br/#q=$(q)
Bing=http://www.bing.com/search?q=$(q)
Yahoo=https://search.yahoo.com/search?p=$(q)


Where the "$(q)" is replaced by TStringList's items.
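
A rough sketch of that substitution in plain standard C++ (not the VCL code from the attachment). The helper names encode_query and expand_pattern are made up for illustration, and percent-encoding the term is an assumption added here so that spaces and symbols survive inside the URL:

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Percent-encode a search term for use in a URL query string.
// Unreserved characters (letters, digits, '-', '_', '.', '~') pass through;
// every other byte becomes %XX.
std::string encode_query(const std::string& term) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : term) {
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}

// Replace every placeholder in the pattern with the encoded term.
std::string expand_pattern(std::string pattern, const std::string& term,
                           const std::string& placeholder = "$(q)") {
    assert(!placeholder.empty());  // an empty placeholder would loop forever
    const std::string encoded = encode_query(term);
    std::string::size_type pos = 0;
    while ((pos = pattern.find(placeholder, pos)) != std::string::npos) {
        pattern.replace(pos, placeholder.size(), encoded);
        pos += encoded.size();  // continue after the inserted text
    }
    return pattern;
}
```

Calling expand_pattern("http://www.bing.com/search?q=$(q)", term) once per list item gives the URL to pass to Navigate().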

In order to "wait" for the result from the search engines, I created this method:

Code:
void
TFormMain::DoNavigate(String AUrl)
{
  Browser->Navigate(WideString(AUrl), navNoHistory | navNoWriteToCache);
  while (Browser->ReadyState != Shdocvw::READYSTATE_COMPLETE)
  {
    Application->ProcessMessages();
  }
}


Here is the problem: in Windows Vista this code works fine with Google, Yahoo and Bing, but in Windows 7 and 10 the Google tests fail. By "fail" I mean that the TWebBrowser object does not show the content of each search result, only that of the last term. It looks like the test runs straight through the term list in less than a second.

I've changed Navigate() flag options (no navNoHistory or navNoWriteToCache), without any luck.

Thank you in advance.

Best,

Marcelo.

Re: TWebBrowser with Navigate inside a loop

Postby rlebeau » Thu Aug 25, 2016 2:44 pm

macicogna wrote:I've developed a simple VCL App to test the response of Search Engines like Google, Bing and Yahoo. My focus is the performance in terms of seconds/search and traffic that leads to Captcha blocking.


Why are you using TWebBrowser for that? That is a UI control, not well-suited for speed testing. An HTTP client class/API, like Indy's TIdHTTP, would be better suited for that task.

macicogna wrote:In order to "wait" for the result from the search engines, I created this method:


Calling ProcessMessages() in a loop is going to affect your timing results. Not to mention it is just plain bad to call it in a loop at all. I would suggest DoNavigate() simply grab the current clock time and then call Navigate() by itself and exit. Let the TWebBrowser::OnDocumentComplete event tell you when the navigation is complete, at which time you can then grab the clock time again and calculate the difference.
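
The timing idea can be sketched in portable C++ roughly as follows. NavigationTimer is a hypothetical helper, not part of the VCL: start() would be called just before Navigate(), and finish() from the OnDocumentComplete handler. The optional time-point arguments exist only so the class can be exercised without a real browser.

```cpp
#include <cassert>
#include <chrono>
#include <string>

// Remembers when a navigation started and reports the elapsed time
// once the completion event arrives.
class NavigationTimer {
public:
    using Clock = std::chrono::steady_clock;

    // Call right before Navigate().
    void start(const std::string& url, Clock::time_point now = Clock::now()) {
        url_ = url;
        started_ = now;
        running_ = true;
    }

    // Call from the completion event handler; returns elapsed milliseconds.
    long long finish(Clock::time_point now = Clock::now()) {
        assert(running_);  // finish() without a matching start() is a logic error
        running_ = false;
        return std::chrono::duration_cast<std::chrono::milliseconds>(
                   now - started_).count();
    }

    bool running() const { return running_; }
    const std::string& url() const { return url_; }

private:
    std::string url_;
    Clock::time_point started_;
    bool running_ = false;
};
```

Because nothing blocks between start() and finish(), the measured interval is not inflated by a polling loop.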

macicogna wrote:Here is the problem: in Windows Vista this code works fine with Google, Yahoo and Bing, but in Windows 7 and 10 the Google tests fail. By "fail" I mean that the TWebBrowser object does not show the content of each search result, only that of the last term. It looks like the test runs straight through the term list in less than a second.


Did you verify with a packet sniffer that Navigate() is actually attempting to contact Google? What you describe sounds like Navigate() is probably failing and exiting immediately.

Can you show your actual test code?
Last edited by rlebeau on Thu Aug 25, 2016 7:52 pm, edited 1 time in total.
Remy Lebeau (TeamB)
Lebeau Software

Re: TWebBrowser with Navigate inside a loop - SOLVED

Postby macicogna » Thu Aug 25, 2016 5:32 pm

Hi Remy,

It is always nice to exchange ideas with you. Thinking about your questions, I've solved the problem, but here are my answers in order to help other readers.

rlebeau wrote:Why are you using TWebBrowser for that?


Since modern search engines like Google return just a bunch of JavaScript, my first step was to capture the "Captcha blocking" page and view it in a TWebBrowser, so that I could inspect the response later and learn how to identify its content as HTML files.

rlebeau wrote:That is a UI control, not well-suited for speed testing. An HTTP client class/API, like Indy's TIdHTTP, would be better suited for that task.


Sure, nice hint. I think I'll implement a complementary version using TIdHTTP to check performance more accurately.

rlebeau wrote:Calling ProcessMessages() in a loop is going to affect your timing results. Not to mention it is just plain bad to call it in a loop at all. I would suggest DoNavigate() simply grab the current clock time and then call Navigate() by itself and exit. Let the TWebBrowser::OnDocumentComplete event tell you when the navigation is complete, at which time you can then grab the clock time again and calculate the difference.


Yes, even with TWebBrowser, the TWebBrowser::OnDocumentComplete approach is a better idea. My concern is that subsequent TWebBrowser::Navigate() calls (inside a loop) might cancel the previous one. I've read about this elsewhere, and that is where I got the ProcessMessages() approach.

rlebeau wrote:Did you verify with a packet sniffer that Navigate() is actually attempting to contact Google?


Yes, I was using TCPView.

rlebeau wrote:Can you show your actual test code?


Sure. I've uploaded it as an attachment.

Your question about "[...] actually attempting to contact Google?" made me think that this problem might be outside my code, since only Google behaves badly in Windows 7 and 10.

So I searched for information about Google's integration with web browsers and saw different Google URL patterns. I changed mine to this one:

Code:
Google=https://www.google.com.br/search?q=$(q)


And "Bingo"! Now it is working! The "/search?" makes the difference in Windows 7 and 10.

I took Firefox's integration as a starting point:

Code:
https://www.google.com.br/?client=firefox-b#q=Your+Query+Here&gfe_rd=cr


And here is my source that made me rethink the pattern: http://superuser.com/questions/578228/h ... -in-chrome

Thank you,

Marcelo.
Attachments
CheckSearchEngine.zip
(14.34 KiB) Downloaded 248 times

Re: TWebBrowser with Navigate inside a loop - SOLVED

Postby rlebeau » Thu Aug 25, 2016 8:12 pm

macicogna wrote:Since modern search engines like Google return just a bunch of JavaScript, my first step was to capture the "Captcha blocking" page and view it in a TWebBrowser, so that I could inspect the response later and learn how to identify its content as HTML files.


Modern search engines provide REST APIs for performing searches in application code, returning machine-parsable results (usually XML or JSON). You should not be submitting HTML webforms and then scraping the resulting HTML/JavaScript for results.

macicogna wrote:Yes, even with TWebBrowser, the TWebBrowser::OnDocumentComplete approach is a better idea. My concern is that subsequent TWebBrowser::Navigate() calls (inside a loop) might cancel the previous one.


If you use the OnDocumentComplete event (and use it correctly, ie no ProcessMessages() loop), then you won't be able to use a simple Navigate() loop anymore. You will have to wait until OnDocumentComplete is fired before then calling Navigate() again. You will have to break up your code logic into pieces, executing each piece at the proper time.

macicogna wrote:
rlebeau wrote:Did you verify with a packet sniffer that Navigate() is actually attempting to contact Google?


Yes, I was using TCPView.


TCPView is not a packet sniffer. It only shows you active connections, not the actual data that is being transmitted on those connections. If you Navigate() to multiple URLs on the same server, connections might get reused. Use a real packet sniffer, like Wireshark or Fiddler, to look at the actual HTTP requests.

macicogna wrote:Your question about "[...] actually attempting to contact Google?" made me think that this problem might be outside my code, since only Google behaves badly in Windows 7 and 10.


Web browsers don't treat Google differently than any other sites. Something else is going on.

macicogna wrote:So I searched for information about Google's integration with web browsers and saw different Google URL patterns. I changed mine to this one:

Code:
Google=https://www.google.com.br/search?q=$(q)


And "Bingo"! Now it is working! The "/search?" makes the difference in Windows 7 and 10.


Whatever made you think that "https://www.google.com.br/#q=$(q)" would work in the first place? "#" is the fragment (bookmark) delimiter. Everything after "#" is not part of the URL that gets requested from the server.

When you navigate to "https://www.google.com.br/#q=bcbj", for example, the web browser will connect to "www.google.com.br" and send a request for "/". The web server will never see "q=bcbj". And Google's "/" page is fairly minimal, which could explain why your Navigate() loop ran so quickly. Only AFTER the response has been fully processed will the web browser look for a bookmark named "q=bcbj" within the HTML and, if it is found, scroll the display to that position.

When you navigate to "https://www.google.com.br/search?q=bcbj" instead, the web browser will connect to "www.google.com.br" and send a request for "/search?q=bcbj", which is a request for "/search" with "q=bcbj" as its input parameters, thus allowing search results to be queried and returned.
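
To make the difference concrete, here is a small standard C++ sketch that splits a URL along the lines described above. split_url and SplitUrl are illustrative names, and the parser is deliberately minimal (absolute URLs only, no port or userinfo handling). It is not how TWebBrowser is implemented; it only shows which part of the URL ends up in the HTTP request.

```cpp
#include <cassert>
#include <string>

struct SplitUrl {
    std::string host;            // e.g. "www.google.com.br"
    std::string request_target;  // path plus query, as sent in the request line
    std::string fragment;        // text after '#', kept by the browser only
};

// Minimal splitter for "scheme://host/path?query#fragment" URLs.
SplitUrl split_url(const std::string& url) {
    SplitUrl result;

    // 1. Cut off the fragment first: it is never transmitted.
    std::string rest = url;
    const std::string::size_type hash = rest.find('#');
    if (hash != std::string::npos) {
        result.fragment = rest.substr(hash + 1);
        rest.erase(hash);
    }

    // 2. Skip the scheme.
    const std::string::size_type scheme_end = rest.find("://");
    assert(scheme_end != std::string::npos);  // sketch handles absolute URLs only
    rest.erase(0, scheme_end + 3);

    // 3. Up to the first '/' or '?' is the host; the remainder is what the
    //    server is asked for.
    const std::string::size_type cut = rest.find_first_of("/?");
    if (cut == std::string::npos) {
        result.host = rest;
        result.request_target = "/";
    } else {
        result.host = rest.substr(0, cut);
        result.request_target = rest.substr(cut);
        if (result.request_target[0] == '?')
            result.request_target = "/" + result.request_target;
    }
    return result;
}
```

For "https://www.google.com.br/#q=bcbj" this yields the request target "/" and the fragment "q=bcbj"; for "https://www.google.com.br/search?q=bcbj" it yields "/search?q=bcbj" and an empty fragment.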

Re: TWebBrowser with Navigate inside a loop

Postby macicogna » Fri Aug 26, 2016 4:15 pm

Hi Remy,

What a lesson here! Thanks.

Our focus with this subject is a freeware Anti-plagiarism software called CopySpider.

rlebeau wrote:Modern search engines provide REST APIs...

Yes, but there are differences between "regular" and API searches [1], and obstacles to their use such as [2] and [3]. So we decided to use TWebBrowser with a set of good search engines to run a few searches, as the user might do by hand, using the inner text of the response to harvest the data we are interested in.

The good news is that there are examples like DuckDuckGo, which has a partnership option that we will try in the near future.

rlebeau wrote:You will have to wait until OnDocumentComplete is fired before...

I know my code needs a complete redesign, but I would like to ask you, if possible, for a hint about how to "wait" for the document complete event. Are you talking about multithreading and, maybe, semaphores?

rlebeau wrote:TCPView is not a packet sniffer.

My bad. I was seeing IPs and ports, but no packets.

rlebeau wrote:Whatever made you think that "https://www.google.com.br/#q=$(q)" would work in the first place? "#" is the fragment (bookmark) delimiter. Everything after "#" is not part of the URL that gets requested from the server.

I plead guilty! I completely overlooked this "#". On the other hand, thanks for your explanation. Now it is clear to me what happened.

In my defense: I don't know why the wrong URL pattern was working properly with my BDS 2006 and Windows Vista. If it had failed there too, maybe I would not have posted this topic.

Thanks again for your attention.

Marcelo.

[1] See "This API does not include all of our links..." in https://duckduckgo.com/api
[2] See "...the Bing Search API has moved..." in http://www.bing.com/toolbox/bingsearchapi
[3] See "...we will discontinue the..." in https://developer.yahoo.com/boss/search/

Re: TWebBrowser with Navigate inside a loop

Postby rlebeau » Sun Aug 28, 2016 3:15 pm

macicogna wrote:
rlebeau wrote:You will have to wait until OnDocumentComplete is fired before...

I know my code needs a complete redesign, but I would like to ask you, if possible, for a hint about how to "wait" for the document complete event.


The only way to wait for the event in a synchronous manner is with a ProcessMessages() loop. And while that will "work", it is not the best way to write code.
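
If a synchronous wait is used anyway, it is safer to bound it with a timeout so that a navigation that never completes cannot hang the loop. A minimal sketch in standard C++, where wait_until is a hypothetical helper, done would test something like Browser->ReadyState, and pump stands in for Application->ProcessMessages():

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Pump pending work until done() returns true or the timeout expires.
// Returns true if the condition was met, false on timeout.
bool wait_until(const std::function<bool()>& done,
                const std::function<void()>& pump,
                std::chrono::milliseconds timeout) {
    assert(done && pump);  // both callbacks must be set
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!done()) {
        if (std::chrono::steady_clock::now() >= deadline)
            return false;  // give up instead of spinning forever
        pump();
    }
    return true;
}
```

This is still a polling loop, with the drawbacks already mentioned; the timeout only limits the damage.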

macicogna wrote:Are you talking about multithreading and, maybe, semaphores?


Only if you want to search multiple sites at the same time in parallel. But you wouldn't use TWebBrowser for that, since it is a UI control. But you could do it with Indy's TIdHTTP instead, for instance.

What I was referring to earlier about using the OnDocumentComplete event is writing simple event-driven code to run your looping logic, for example:

Code:
// needs #include <IdURI.hpp> for TIdURI

private:
    TStringList *URLs;
    String SearchParam;
    int CurrentIndex;   // -1 while no search is in progress

...

__fastcall TFormMain::TFormMain(TComponent *Owner)
    : TForm(Owner)
{
    URLs = new TStringList;
    URLs->Add("Google=https://www.google.com.br/search?q=$(q)");
    URLs->Add("Bing=http://www.bing.com/search?q=$(q)");
    URLs->Add("Yahoo=https://search.yahoo.com/search?p=$(q)");
    CurrentIndex = -1;
}

__fastcall TFormMain::~TFormMain()
{
    delete URLs;
}

void TFormMain::Search(const String &Param)
{
    // URL-encode the term once; it is reused for every engine
    SearchParam = TIdURI::ParamsEncode(Param);
    CurrentIndex = 0;
    DoSearch();
}

void TFormMain::DoSearch()
{
    // start one navigation and return; OnDocumentComplete continues the sequence
    String url = StringReplace(URLs->ValueFromIndex[CurrentIndex], "$(q)", SearchParam, TReplaceFlags());
    StatusLabel->Caption = "Searching " + URLs->Names[CurrentIndex] + "...";
    Browser->Navigate(WideString(url), navNoHistory | navNoWriteToCache);
}

void __fastcall TFormMain::BrowserDocumentComplete(TObject* ASender, const _di_IDispatch pDisp, const OleVariant &URL)
{
    // ignore documents that complete while no search is running
    if (CurrentIndex < 0)
        return;

    StatusLabel->Caption = "Results received, processing...";
    // do something with Browser document content...

    if (++CurrentIndex < URLs->Count)
        DoSearch();
    else
    {
        // all engines done; back to idle
        CurrentIndex = -1;
        SearchParam = "";
    }
}
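
The same control flow can be modelled without any UI, which makes it easy to test in isolation. The sketch below is a hypothetical stand-in, not VCL code: SearchSequencer and its navigate callback are invented names, with the callback playing the role of Browser->Navigate() and on_document_complete() the role of the event handler.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// One navigation is started at a time; the next one starts only when the
// completion event for the current one has been delivered.
class SearchSequencer {
public:
    using NavigateFn = std::function<void(const std::string& url)>;

    SearchSequencer(std::vector<std::string> urls, NavigateFn navigate)
        : urls_(std::move(urls)), navigate_(std::move(navigate)) {
        assert(navigate_);  // a callback is required
    }

    // Begin with the first URL; no-op if the list is empty or a run is active.
    void start() {
        if (urls_.empty() || active_) return;
        active_ = true;
        index_ = 0;
        navigate_(urls_[index_]);
    }

    // Call from the browser's document-complete event.
    void on_document_complete() {
        if (!active_) return;  // ignore events that arrive while idle
        if (++index_ < urls_.size()) {
            navigate_(urls_[index_]);
        } else {
            active_ = false;   // finished; back to idle
        }
    }

    bool active() const { return active_; }

private:
    std::vector<std::string> urls_;
    NavigateFn navigate_;
    std::size_t index_ = 0;
    bool active_ = false;
};
```

Feeding it a callback that records URLs, then calling on_document_complete() once per expected page, shows each URL being requested exactly once and in order.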

Re: TWebBrowser with Navigate inside a loop

Postby macicogna » Mon Aug 29, 2016 11:20 am

Hi Remy,

Thanks for your collaboration.

rlebeau wrote:But you could do it with Indy's TIdHTTP instead, for instance.

Yes, I've tested it with TIdHTTP and the time measurement is more precise, as you said. I've implemented a switch, so the user can choose between TWebBrowser and TIdHTTP.

rlebeau wrote:What I was referring to earlier about using the OnDocumentComplete...

I got it! Thanks for taking the time to show an example. I'll implement this idea too, so that my code works better with TWebBrowser.

If other readers want to see the code, just ask me and I'll post the improved version.

Best,

Marcelo.