Issue parsing a site with lxml and XPath in Python

I think I am messing up the XPath. I am trying to get the information from each row of the table on the page.

This is what I have so far, but it is not outputting what I'm looking for.

    import requests
    from lxml import etree

    r = requests.get('http://mtgoclanteam.com/cards?edition=dtk')
    doc = etree.HTML(r.text)
    # get list of cards
    cards = [card for card in doc.xpath('id("cardtable")/x:tbody/x:tr[1]/x:td[3]')]
    for card in cards:
        print(card)

The primary problem here is that the actual document served by the server contains an empty table:

<table id="cardtable" class="cardlist"/>

The data is filled in after the page loads by the embedded JavaScript that follows the empty table element:

    <script>
        $('#cardtable').datatable({
            "alengthmenu": [[25, 100, -1], [25, 100, "all"]],
            "bdeferrender": true,
            "aasorting": [],
            "bpaginate": false,
            "aadata": [
                ...data here...
            ],
            "aocolumns": [
                { "stitle": "card name", "swidth": "260" },
                { "stitle": "rarity", "swidth": "40" },
                { "stitle": "buy", "swidth": "80" },
                { "stitle": "sell", "swidth": "80" },
                { "stitle": "bots stock" }]
        })
    </script>

The data is contained in the aadata element of the dictionary passed to the datatable() method. Extracting it in Python is going to be tricky (this isn't a JSON document). Possibly a suitable regular expression applied to the script text would get you what you want (or you could iterate over the lines of the script and take the one after the aadata key).
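As a rough sketch of the regular-expression route: this assumes the script keeps the quoted "aadata" key followed by a bracketed array and then the "aocolumns" key, as in the snippet above, and that the array contents happen to be valid JSON. The sample script text and its card rows are invented for illustration, and the helper name extract_aadata is hypothetical.

```python
import json
import re

# Hypothetical stand-in for the text of the page's <script> element;
# the rows below are invented sample data, not real site content.
SAMPLE_SCRIPT = '''
$('#cardtable').datatable({
    "bpaginate": false,
    "aadata": [
        ["Sample Card A", "R", "1.50", "1.20", "4"],
        ["Sample Card B", "C", "0.05", "0.01", "32"]
    ],
    "aocolumns": [
        { "stitle": "card name", "swidth": "260" }
    ]
})
'''

# Capture the array between the "aadata" key and the "aocolumns" key.
# re.DOTALL lets "." span newlines, so the array may cover several lines.
AADATA_RE = re.compile(r'"aadata"\s*:\s*(\[.*?\])\s*,\s*"aocolumns"', re.DOTALL)


def extract_aadata(script_text):
    """Return the aadata rows as a list of lists, or None if not found."""
    match = AADATA_RE.search(script_text)
    if match is None:
        return None
    return json.loads(match.group(1))


rows = extract_aadata(SAMPLE_SCRIPT)
for row in rows or []:
    print(row[0], row[1])
```

Anchoring the pattern on the following key keeps the match from stopping at the first inner "]", but it also means the pattern needs updating if the order of the keys ever changes.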

For example, using the line-by-line approach:

    import pprint
    import json
    import requests
    from lxml import etree

    r = requests.get('http://mtgoclanteam.com/cards?edition=dtk')
    doc = etree.HTML(r.text)

    # Text of the <script> element that fills in the table
    script = doc.xpath('id("templatemo_content")/script')[0].text
    found = False
    result = None
    for line in script.splitlines():
        if found:
            # First line containing '[' after the aadata key holds the rows
            if '[' in line:
                result = line
                break
        if 'aadata' in line:
            found = True

    if result:
        # Wrap the rows in brackets so they parse as one JSON array
        result = json.loads('[' + result + ']')
        pprint.pprint(result)

This is ugly and fragile (it will break if the format of the script changes), but it works for the current input.
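One way to depend less on how the script is split into lines, offered only as a sketch: find the aadata key, then scan forward matching square brackets to locate the end of the array. It assumes strings in the array are double-quoted and that the array is valid JSON. The helper name find_array_after_key and the sample text are hypothetical.

```python
import json


def find_array_after_key(text, key):
    """Return the parsed JSON array that follows `key` in text, or None."""
    key_pos = text.find(key)
    if key_pos == -1:
        return None
    start = text.find('[', key_pos)
    if start == -1:
        return None

    depth = 0          # current bracket nesting level
    in_string = False  # True while inside a double-quoted string
    escaped = False    # True right after a backslash inside a string
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            # Brackets inside strings do not count toward nesting.
            if escaped:
                escaped = False
            elif ch == '\\':
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == '[':
            depth += 1
        elif ch == ']':
            depth -= 1
            if depth == 0:
                return json.loads(text[start:i + 1])
    return None  # brackets never balanced


# Invented sample text; note the bracket inside a string value.
sample = '{"aasorting": [], "aadata": [["Card [promo]", "R"], ["Other", "C"]], "aocolumns": []}'
rows = find_array_after_key(sample, '"aadata"')
print(rows)
```

This still breaks if the key is renamed or the data stops being JSON-compatible, so it only narrows the fragility rather than removing it.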
