Crawling Google Play Store


I am crawling the Google app store. I use Firefox with Firebug to review the requests and responses. There is one parameter I don't understand. For example, with the URL "", when I load the next page it posts a parameter pagtok, whose value is "egiika==:s:ano1ljj4wwq". I don't know where this value comes from. Can anyone help?

Investigation

Since Google changed its paging logic and now requires a token, I've found myself trying to investigate how to either generate these tokens manually or scrape them out of the HTML retrieved on each response. So, let's get our hands dirty.

Using Fiddler2, I was able to isolate some token samples by looking at the requests issued for each "paging" of the Play Store.

Here's the whole request:

POST https://play.google.com/store/search?q=a&c=apps HTTP/1.1
host: play.google.com
connection: keep-alive
content-length: 123
origin: https://play.google.com
user-agent: mozilla/5.0 (windows nt 6.3; wow64) applewebkit/537.36 (khtml, gecko) chrome/39.0.2171.95 safari/537.36
content-type: application/x-www-form-urlencoded;charset=utf-8
accept: */*
x-client-data: cie2yqeiplbjaqiptskbcmg2yqeinobkaqjuimobcimsyge=
referer: https://play.google.com/store/search?q=a&c=apps
accept-encoding: gzip, deflate
accept-language: pt-br,pt;q=0.8,en-us;q=0.6,en;q=0.4,es;q=0.2

** post body **
start=0&num=0&numchildren=0&pagtok=gaeiaggu%3as%3aano1ljltujw&ipf=1&xhr=1&token=bh2mlneviirja8dt-zhakrfnh7q%3a1420660393029
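To make the post body easier to read, here is a minimal Python sketch that splits it into its parameters and decodes the percent-encoded values. It only uses the body captured above; nothing is sent to Google.

```python
from urllib.parse import parse_qs

# The form-encoded post body exactly as captured in the request above.
body = (
    "start=0&num=0&numchildren=0&pagtok=gaeiaggu%3as%3aano1ljltujw"
    "&ipf=1&xhr=1&token=bh2mlneviirja8dt-zhakrfnh7q%3a1420660393029"
)

# parse_qs splits on "&" and "=" and decodes "%3a" back to ":".
params = parse_qs(body)

for name, values in params.items():
    print(name, "=", values[0])
```

Decoded this way, the pagtok value reads "gaeiaggu:s:ano1ljltujw", which is the same shape as the value quoted in the question.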

Now that we know what the request looks like, the next step is to keep track of more requests and try to isolate the token formation logic.

Here are three request tokens I could find:

"gaeiaggu%3as%3aano1ljltujw", "gaeiaggo%3as%3aano1ljierqq", "gaeiagg8%3as%3aano1ljim1ci"

Finding patterns

One thing our brain is good at is finding patterns, so here's what mine found about how the tokens are formed:

1 - Starts with: "gaeiag"

2 - Followed by: 2 random characters

3 - Followed by: "%3as%3a"

4 - Followed by: 11 random characters
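As a sanity check, the pattern can be written as a regular expression and run against the three sample tokens. This is only a sketch: it takes the shared prefix to be "gaeiag" and the separator to be the full "%3as%3a", as the samples show, and it assumes the "random" characters are letters, digits, "-" or "_".

```python
import re

# Prefix, 2 varying characters, the encoded ":s:" separator, 11 characters.
# Assumption: varying characters are limited to letters, digits, "_" and "-".
TOKEN_PATTERN = re.compile(r"gaeiag[\w-]{2}%3as%3a[\w-]{11}", re.IGNORECASE)

samples = [
    "gaeiaggu%3as%3aano1ljltujw",
    "gaeiaggo%3as%3aano1ljierqq",
    "gaeiagg8%3as%3aano1ljim1ci",
]


def matches_token_pattern(token: str) -> bool:
    """Return True when the whole token fits the observed pattern."""
    return TOKEN_PATTERN.fullmatch(token) is not None


for token in samples:
    print(token, matches_token_pattern(token))
```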

Browser JavaScript tricks vs. manual HTTP requests

Doing the same request in a browser will, most of the time, not yield the same results as manually issuing the HTTP request from code. Why? Because of JavaScript.

Google is a heavy JS user, and it uses its own tricks to try to fool you.

If you look at the HTML, you will see no token that matches the pattern described above. Instead, you will find something like:

u0026c\\u003dapps\42,\42gaeiaghq:s:ano1ljlxwby\42,\0420\42,\0420\42,\0420\42]\n

If you look carefully, you will see the token within this "random string". All you have to do is replace ":s:" with "%3as%3a".
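The replacement is ordinary URL percent-encoding: ":" becomes "%3A", and the hex digits are case-insensitive, so "%3a" and "%3A" mean the same thing. A minimal Python sketch, using the token from the HTML sample above:

```python
from urllib.parse import quote, unquote

# Token as it appears inside the HTML sample above.
raw_token = "gaeiaghq:s:ano1ljlxwby"

# safe="" makes quote() encode every reserved character, including ":".
encoded = quote(raw_token, safe="")
print(encoded)

# Decoding gives back the original token, whatever the case of the hex digits.
print(unquote(encoded.lower()))
```

Most HTTP libraries apply this encoding automatically when they build a form-encoded body, so the manual replacement is usually only needed when the body is assembled by hand.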

Regular expressions for the win

If you apply a regex to the page, you will be able to find the token, and then manually replace the :s: string with the %3as%3a one.

Here's the one I ended up using (powered by the best online regex builder):

Generated regular expression:

/gaei+.+:s:.{11}\42/

Textual meaning of the regular expression:

  • match a string that contains the string gae
  • followed by the character i one or more times
  • followed by any character one or more times
  • followed by the string :s:
  • followed by any character 11 times
  • followed by the string \42

TL;DR

The token comes in the HTML, "masked" by Google, which "unmasks" it using JavaScript (which you can only run if you are using a browser engine such as Selenium or something similar).

In order to fetch the pagtok of the next page, read the current page's HTML, scrape the token out of it (logic above), use it on the next request, and repeat.
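The loop can be sketched as follows. Everything here is illustrative: crawl_pages, build_body and fetch_next are hypothetical names, the HTTP call is left to a function you pass in, and the extra "token" parameter seen in the captured request is omitted because its origin is not covered here.

```python
import re
from typing import Callable, Iterator
from urllib.parse import urlencode

# Variant of the regex above that matches the literal text \42 after the token.
TOKEN_RE = re.compile(r"(gaei+[^:\\]+:s:.{11})\\42", re.IGNORECASE)


def build_body(pagtok: str) -> str:
    """Build the form-encoded post body; urlencode turns ":" into "%3A"."""
    return urlencode({
        "start": 0, "num": 0, "numchildren": 0,
        "pagtok": pagtok, "ipf": 1, "xhr": 1,
    })


def crawl_pages(first_html: str,
                fetch_next: Callable[[str], str],
                max_pages: int = 10) -> Iterator[str]:
    """Yield page HTML, following the scraped token until none is found."""
    html = first_html
    for _ in range(max_pages):
        yield html
        match = TOKEN_RE.search(html)
        if match is None:
            return  # no token in this page, so there is no next page to ask for
        # fetch_next is supplied by the caller and should POST the body.
        html = fetch_next(build_body(match.group(1)))
```

Keeping the HTTP call behind fetch_next means the scraping logic can be tried against saved pages before any real request is made.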

I hope this helps. Sorry for the wall of text; I wanted to be as clear as possible.

