Crawling Google Play Store
I am crawling the Google Play store. I use Firefox + Firebug to review the requests and responses, and there is one parameter I don't understand. For example, with the URL "", when the next page loads, the POST includes a parameter pagtok whose value is "egiika==:s:ano1ljj4wwq". I don't know where this value comes from. Can anyone help?
Investigation
Since Google changed its paging logic, and it now requires a token, I found myself investigating how to either generate the tokens manually or scrape them out of the HTML retrieved with each response. So, let's get our hands dirty.
Using Fiddler2, I was able to isolate some token samples by looking at the requests issued for each "paging" of the Play Store.
Here's the whole request:
post https://play.google.com/store/search?q=a&c=apps http/1.1
host: play.google.com
connection: keep-alive
content-length: 123
origin: https://play.google.com
user-agent: mozilla/5.0 (windows nt 6.3; wow64) applewebkit/537.36 (khtml, gecko) chrome/39.0.2171.95 safari/537.36
content-type: application/x-www-form-urlencoded;charset=utf-8
accept: */*
x-client-data: cie2yqeiplbjaqiptskbcmg2yqeinobkaqjuimobcimsyge=
referer: https://play.google.com/store/search?q=a&c=apps
accept-encoding: gzip, deflate
accept-language: pt-br,pt;q=0.8,en-us;q=0.6,en;q=0.4,es;q=0.2

** post body **
start=0&num=0&numchildren=0&pagtok=gaeiaggu%3as%3aano1ljltujw&ipf=1&xhr=1&token=bh2mlneviirja8dt-zhakrfnh7q%3a1420660393029
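To make the POST body easier to read, here is a minimal sketch that decodes it with Python's standard library. The body string is copied from the captured request; the variable names are just for illustration.

```python
from urllib.parse import parse_qsl

# POST body copied from the captured request above.
body = (
    "start=0&num=0&numchildren=0&pagtok=gaeiaggu%3as%3aano1ljltujw"
    "&ipf=1&xhr=1&token=bh2mlneviirja8dt-zhakrfnh7q%3a1420660393029"
)

# parse_qsl splits the form-encoded body and percent-decodes each value,
# so "%3a" comes back as ":".
params = dict(parse_qsl(body, keep_blank_values=True))

for name, value in params.items():
    print(f"{name} = {value}")
```

Once decoded, the pagtok value reads "gaeiaggu:s:ano1ljltujw", with plain colons around the "s".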
Now that we know what the request looks like, the next step is to keep track of more requests and try to isolate the token formation logic.
Here are 3 request tokens I could find:
"gaeiaggu%3as%3aano1ljltujw", "gaeiaggo%3as%3aano1ljierqq", "gaeiagg8%3as%3aano1ljim1ci"
Finding patterns
One thing our brain is great at is finding patterns. Here are the ones I found in how the tokens are formed:
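1 - Starts with: "gaeia"
2 - Followed by: 3 characters that vary between tokens ("ggu", "ggo" and "gg8" in the samples above)
3 - Followed by: "%3as%3a" (the URL-encoded form of ":s:")
4 - Followed by: 11 random characters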
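The pattern can be checked against the three sample tokens with a short sketch like the one below. It assumes the shape seen in those samples (three characters between the "gaeia" prefix and the separator), which is an observation from only three tokens, not a documented format. The names TOKEN_SHAPE and looks_like_token are made up for this example.

```python
import re
from urllib.parse import unquote

# Shape observed in the samples, after percent-decoding:
# "gaeia" + 3 varying characters + ":s:" + 11 characters.
TOKEN_SHAPE = re.compile(r"gaeia.{3}:s:.{11}")

samples = [
    "gaeiaggu%3as%3aano1ljltujw",
    "gaeiaggo%3as%3aano1ljierqq",
    "gaeiagg8%3as%3aano1ljim1ci",
]


def looks_like_token(encoded: str) -> bool:
    """Return True if the percent-decoded value has the observed shape."""
    return TOKEN_SHAPE.fullmatch(unquote(encoded)) is not None


for sample in samples:
    print(sample, looks_like_token(sample))
```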
Browser JavaScript tricks vs. manual HTTP requests
Doing the same request in a browser will, most of the time, not yield the same results as issuing the HTTP request manually from code. Why? Because of JavaScript.
Google is a heavy JS user, and it uses its own tricks to try to fool you.
If you look at the HTML, you will see no token that matches the pattern described above. Instead, you will find something like:
u0026c\\u003dapps\42,\42gaeiaghq:s:ano1ljlxwby\42,\0420\42,\0420\42,\0420\42]\n
If you look carefully, you will see the token within this "random string". All you have to do is replace ":s:" with "%3as%3a".
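Rather than replacing the string by hand, the colons can be percent-encoded with the standard library. This is a small sketch using the token from the HTML sample above. Note that Python emits uppercase hex digits ("%3A"), which is equivalent to the lowercase form seen in the captured request.

```python
from urllib.parse import quote

# Token as it appears in the page HTML (sample from above).
raw_token = "gaeiaghq:s:ano1ljlxwby"

# safe="" forces every reserved character, including ":", to be encoded.
encoded_token = quote(raw_token, safe="")

print(encoded_token)
```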
Regular expressions for the win
If you apply a regex to the page, you will be able to find the token, and then manually replace the :s: string with the %3as%3a one.
Here's the one I ended up using (built with an online regex builder):
Generated regular expression:
/gaei+.+:s:.{11}\42/
Textual meaning of the regular expression:
- Match a string that contains the string "gae"
- Followed by the character "i" 1 or more times
- Followed by any character 1 or more times
- Followed by the string ":s:"
- Followed by any character 11 times
- Followed by the string "\42"
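As an illustration, here is a Python sketch of the same idea applied to the HTML fragment shown earlier. It is an adapted variant, not the exact expression above: it matches the literal backslash-4-2 sequences on both sides, uses a lazy quantifier, and adds a capture group so that only the token is returned. The function name extract_pagtok is hypothetical.

```python
import re
from typing import Optional

# Adapted variant of the expression above: literal "\42" on both sides,
# with the token itself in a capture group.
TOKEN_RE = re.compile(r"\\42(gaei.+?:s:.{11})\\42")


def extract_pagtok(html: str) -> Optional[str]:
    """Return the first token found in the page HTML, or None."""
    match = TOKEN_RE.search(html)
    return match.group(1) if match else None


# HTML fragment copied from the sample above (raw string keeps the backslashes).
fragment = r'u0026c\\u003dapps\42,\42gaeiaghq:s:ano1ljlxwby\42,\0420\42,\0420\42,\0420\42]\n'

print(extract_pagtok(fragment))
```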
TL;DR
The token comes in the HTML, "masked" by Google, which "unmasks" it using JavaScript (which you can only run if you are using a browser engine, such as Selenium or something similar).
In order to fetch the pagtok of the next page, read the current page's HTML, scrape the token out of it (logic above), use it on the next request, and repeat.
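The read, scrape, request, repeat loop could be sketched as follows. This is only an outline under assumptions: fetch_page is a hypothetical callable that you supply, which takes the current token (None for the first page), issues the POST with it as pagtok, and returns the response HTML. The regex is an adapted variant of the one above, and max_pages is an arbitrary safety limit.

```python
import re
from typing import Callable, Iterator, Optional

# Adapted variant of the regex above, with the token in a capture group.
TOKEN_RE = re.compile(r"\\42(gaei.+?:s:.{11})\\42")


def next_token(html: str) -> Optional[str]:
    """Scrape the next page's token out of the current page's HTML."""
    match = TOKEN_RE.search(html)
    return match.group(1) if match else None


def crawl(
    fetch_page: Callable[[Optional[str]], str], max_pages: int = 10
) -> Iterator[str]:
    """Yield page HTML, following scraped tokens until none is found."""
    token: Optional[str] = None  # first request carries no scraped token
    for _ in range(max_pages):
        html = fetch_page(token)  # hypothetical: performs the actual POST
        yield html
        token = next_token(html)
        if token is None:
            break  # no token in the page, so nothing more to request
```

Keeping the HTTP call behind fetch_page means the loop can be tried against canned HTML before pointing it at the real site.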
I hope this helps. Sorry for the wall of text, I wanted to be as clear as possible.