python - BeautifulSoup and CSV files


I'm looking to pull a table from http://www.atpworldtour.com/rankings/top-matchfacts.aspx?y=2015&s=1# and put the information in a CSV file.

I've done this, but I'm having a few issues. The first column of the table contains both the ranking of the player and the name. I want to split these so that one column contains the ranking and the other column contains the player name.

Here's my code:

    import urllib2
    from bs4 import BeautifulSoup
    import csv

    url = 'http://www.atpworldtour.com/rankings/top-matchfacts.aspx?y=2015&s=1#'
    req = urllib2.Request(url)
    page = urllib2.urlopen(req)
    soup = BeautifulSoup(page)
    tables = soup.findAll('table')
    my_table = tables[0]

    with open('out2.csv', 'w') as f:
        csvwriter = csv.writer(f)
        for row in my_table.findAll('tr'):
            cells = [c.text.encode('utf-8') for c in row.findAll('td')]
            # Only data rows have 16 cells; this skips header rows.
            if len(cells) == 16:
                csvwriter.writerow(cells)

Here's the output for a few players:

    "1                             novak djokovic",srb,5-0,0-0,9,1.8,7,1.4,62%,74%,58%,88%,42%,68%,39%-57%,46%
    "2                             roger federer",sui,1-1,0-1,9,4.5,2,1.0,59%,68%,54%,84%,46%,67%,37%-49%,33%
    "3                             andy murray",gbr,0-0,0-0,0,0.0,0,0.0,0%,0%,0%,0%,0%,0%,0%-0%,0%
    "4                             rafael nadal",esp,11-3,2-1,25,1.8,18,1.3,68%,69%,57%,82%,43%,57%,36%-58%,38%
    "5                             kei nishikori",jpn,5-0,0-0,14,2.8,9,1.8,57%,75%,62%,92%,49%,80%,39%-62%,42%

As you can see, the first column isn't displayed properly: the number is on a higher line than the rest of the data, and there is an extremely large gap.

The HTML code for the problem column is more complex than that of the rest of the columns:

<td class="col1" rel="1">1                             <a href="/tennis/players/top-players/novak-djokovic.aspx">novak djokovic</a></td> 

I tried separating them but couldn't get it to work, and thought it might be easier to fix the current CSV file.

Separating the field after pulling it out is pretty easy. You've got a number, a bunch of whitespace, and a name. So use split, with the default delimiter, and a max split of 1:

    cells = [c.text.encode('utf-8') for c in row.findAll('td')]
    if len(cells) == 16:
        # Replace the first cell with its two pieces: rank and name.
        cells[0:1] = cells[0].split(None, 1)
        csvwriter.writerow(cells)
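To see what that slice assignment does in isolation, here is a minimal sketch that needs no network access. It assumes Python 3 strings rather than the encoded bytes used above, and the three-element row is a shortened, hypothetical stand-in for the real 16-cell row:

```python
# First-column value as it appears in the CSV output above.
raw = "1                             novak djokovic"

# split(None, 1) splits on any run of whitespace, at most once,
# so the space inside the player's name is left alone.
rank, name = raw.split(None, 1)

# Shortened, hypothetical row: the combined field plus two other cells.
cells = [raw, "srb", "5-0"]

# Slice assignment swaps the one combined cell for the two split pieces.
cells[0:1] = cells[0].split(None, 1)
print(cells)
```

Because the maximum split is 1, names with several words stay in a single field.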

But you can also separate them within the soup, and that's more robust:

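    cells = row.find_all('td')
    cell0 = cells.pop(0)
    # The first child of the cell is the bare text node holding the rank.
    rank = next(cell0.children).strip().encode('utf-8')
    # The player name is the text of the link inside the cell.
    name = cell0.find('a').text.encode('utf-8')
    cells = [rank, name] + [c.text.encode('utf-8') for c in cells]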
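The same idea can be tried offline against the problem cell quoted in the question. This is only a sketch under a few assumptions: Python 3 with `bs4` installed, the built-in `'html.parser'` backend chosen explicitly, and a made-up two-cell tail (`srb`, `5-0`) standing in for the rest of the row:

```python
from bs4 import BeautifulSoup

# The first <td> is copied from the question; the other cells and the
# surrounding <table>/<tr> are a shortened, hypothetical row.
html = (
    '<table><tr>'
    '<td class="col1" rel="1">1                             '
    '<a href="/tennis/players/top-players/novak-djokovic.aspx">novak djokovic</a></td>'
    '<td>srb</td><td>5-0</td>'
    '</tr></table>'
)

soup = BeautifulSoup(html, 'html.parser')
row = soup.find('tr')

cells = row.find_all('td')
cell0 = cells.pop(0)

# First child of the cell is the text node "1      ..."; strip the padding.
rank = next(cell0.children).strip()
# The name is the text of the <a> element inside the same cell.
name = cell0.find('a').get_text(strip=True)

out = [rank, name] + [c.get_text(strip=True) for c in cells]
print(out)
```

Reading the rank and the name from separate nodes means the result does not depend on how much whitespace the page puts between them.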