I'm looking to pull a table from http://www.atpworldtour.com/rankings/top-matchfacts.aspx?y=2015&s=1# and put the information in a CSV file.
I've done this, but I'm having a few issues. The first column of the table contains both the ranking of the player and their name. I want to split these so that one column contains the ranking and the other column contains the player name.
Here's the code:
    import urllib2
    from bs4 import BeautifulSoup
    import csv

    url = 'http://www.atpworldtour.com/rankings/top-matchfacts.aspx?y=2015&s=1#'
    req = urllib2.Request(url)
    page = urllib2.urlopen(req)
    soup = BeautifulSoup(page)
    tables = soup.findAll('table')
    my_table = tables[0]

    with open('out2.csv', 'w') as f:
        csvwriter = csv.writer(f)
        for row in my_table.findAll('tr'):
            cells = [c.text.encode('utf-8') for c in row.findAll('td')]
            # Only full data rows have 16 cells; this skips the header row.
            if len(cells) == 16:
                csvwriter.writerow(cells)
Here's the output for a few players:

    "1 Novak Djokovic",SRB,5-0,0-0,9,1.8,7,1.4,62%,74%,58%,88%,42%,68%,39%-57%,46%
    "2 Roger Federer",SUI,1-1,0-1,9,4.5,2,1.0,59%,68%,54%,84%,46%,67%,37%-49%,33%
    "3 Andy Murray",GBR,0-0,0-0,0,0.0,0,0.0,0%,0%,0%,0%,0%,0%,0%-0%,0%
    "4 Rafael Nadal",ESP,11-3,2-1,25,1.8,18,1.3,68%,69%,57%,82%,43%,57%,36%-58%,38%
    "5 Kei Nishikori",JPN,5-0,0-0,14,2.8,9,1.8,57%,75%,62%,92%,49%,80%,39%-62%,42%
The first column isn't displayed properly: the number is on a higher line than the rest of the data, and there is an extremely large gap before the name.
The HTML code for the problem column is more complex than for the rest of the columns:

    <td class="col1" rel="1">1 <a href="/tennis/players/top-players/novak-djokovic.aspx">Novak Djokovic</a></td>
I tried separating them but couldn't get it to work, and thought it might be easier to fix the current CSV file.
Separating the field after pulling it out is pretty easy. You've got a number, a bunch of whitespace, and a name. So use split, with the default delimiter and a max split of 1:
    cells = [c.text.encode('utf-8') for c in row.findAll('td')]
    if len(cells) == 16:
        # Replace the first field with two fields: rank and name.
        cells[0:1] = cells[0].split(None, 1)
        csvwriter.writerow(cells)
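To see what that slice assignment does in isolation, here is a minimal sketch on a hand-made row. The exact whitespace in the sample cell is an assumption (the question doesn't show it), `split_rank_and_name` is just an illustrative helper name, and it's written for Python 3, so there are no `.encode('utf-8')` calls.

```python
def split_rank_and_name(cells):
    """Return a copy of cells with the first field split into rank and name."""
    cells = list(cells)
    # split(None, 1) splits on any run of whitespace, at most once,
    # so the space inside the player's name is left alone.
    cells[0:1] = cells[0].split(None, 1)
    return cells


# Assumed sample: rank, a newline plus padding, then the name.
row = ["1\n        Novak Djokovic", "SRB", "5-0"]
print(split_rank_and_name(row))
```

One thing to watch: with a max split of 1, any trailing whitespace in the cell stays attached to the name, so call `.strip()` on the cell text first if that could happen.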
But you can also separate them within the soup, and that's probably more robust:
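    cells = row.find_all('td')
    # Check the length before popping, so header rows (no <td>) are skipped.
    if len(cells) == 16:
        cell0 = cells.pop(0)
        # The first child of the cell is the bare text node holding the rank.
        rank = next(cell0.children).strip().encode('utf-8')
        name = cell0.find('a').text.encode('utf-8')
        cells = [rank, name] + [c.text.encode('utf-8') for c in cells]
        csvwriter.writerow(cells)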
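If you want to try the soup-based approach without hitting the site, here is a small sketch that runs it against the cell from the question. It assumes Python 3 with `bs4` installed and uses the built-in `html.parser`. The surrounding `<table>` and `<tr>`, the second cell, and the whitespace after the rank are made up for the example, and `rank_and_name` is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# The cell from the question, wrapped in a row so it parses in context.
# The newline and padding after the rank are an assumption.
html = (
    '<table><tr><td class="col1" rel="1">1\n'
    '    <a href="/tennis/players/top-players/novak-djokovic.aspx">'
    'Novak Djokovic</a></td><td>SRB</td></tr></table>'
)


def rank_and_name(cell):
    """Pull the rank and the player name out of a first-column <td>."""
    # The first child is the text node in front of the <a> tag.
    rank = next(cell.children).strip()
    name = cell.find('a').get_text(strip=True)
    return rank, name


soup = BeautifulSoup(html, 'html.parser')
first_cell = soup.find('td')
print(rank_and_name(first_cell))
```

This relies on the rank being the first child of the cell and the name sitting inside the `<a>` tag, which is what the HTML in the question shows. If a row ever lacked the link, `cell.find('a')` would return `None` and the lookup would fail.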