Python: Get Html Table Data By Xpath
I feel that extracting data from HTML tables is extremely difficult and requires a custom build for each site. I would very much like to be proved wrong here. Is there a simple py
Solution 1:
There is a fairly general pattern which you could use to parse many, though not all, tables.
import lxml.html as LH
import requests
import pandas as pd

def text(elt):
    return elt.text_content().replace(u'\xa0', u' ')

url = 'http://www.fdmbenzinpriser.dk/searchprices/5/'
r = requests.get(url)
root = LH.fromstring(r.content)

for table in root.xpath('//table[@id="sortabletable"]'):
    header = [text(th) for th in table.xpath('//th')]        # 1
    data = [[text(td) for td in tr.xpath('td')]
            for tr in table.xpath('//tr')]                   # 2
    data = [row for row in data if len(row)==len(header)]    # 3
    data = pd.DataFrame(data, columns=header)                # 4
    print(data)
1. You can use table.xpath('//th') to find the column names.
2. table.xpath('//tr') returns the rows, and for each row, tr.xpath('td') returns the elements representing the "cells" of the table.
3. Sometimes you may need to filter out certain rows, such as in this case, rows with fewer values than the header.
4. What you do with the data (a list of lists) is up to you. Here I use Pandas for presentation only:
     Pris                              Adresse     Tidspunkt
0    8.04          Brovejen 18 5500 Middelfart  3 min 38 sek
1    7.88        Hovedvejen 11 5500 Middelfart  4 min 52 sek
2    7.88        Assensvej 105 5500 Middelfart  5 min 56 sek
3    8.23   Ejby Industrivej 111 2600 Glostrup  6 min 28 sek
4    8.15           Park Alle 125 2605 Brøndby  25 min 21 sek
5    8.09           Sletvej 36 8310 Tranbjerg J  25 min 34 sek
6    8.24     Vindinggård Center 29 7100 Vejle  27 min 6 sek
7  7.99 *        Søndergade 116 8620 Kjellerup  31 min 27 sek
8  7.99 *  Gertrud Rasks Vej 1 9210 Aalborg SØ  31 min 27 sek
9  7.99 *             Sorøvej 13 4200 Slagelse  31 min 27 sek
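One caveat about the pattern above: in XPath, an expression that starts with // is evaluated from the document root even when it is called on an element, so table.xpath('//th') would also pick up headers from other tables on a page that has more than one. Prefixing the expression with a dot scopes it to the current element. The following is a minimal sketch of that difference, assuming a made-up two-table snippet (the markup and the helper name table_rows are hypothetical, not taken from the site or the answer):

```python
import lxml.html as LH

# Hypothetical markup for illustration only; not taken from the site above.
HTML = """
<html><body>
<table id="first">
  <tr><th>Pris</th><th>Adresse</th></tr>
  <tr><td>&nbsp;8.04</td><td>Brovejen 18</td></tr>
  <tr><td colspan="2">footnote row</td></tr>
</table>
<table id="second">
  <tr><th>Other</th></tr>
  <tr><td>x</td></tr>
</table>
</body></html>
"""

def text(elt):
    # Same idea as the helper above, plus strip() to trim padding.
    return elt.text_content().replace(u'\xa0', u' ').strip()

def table_rows(table):
    """Return (header, rows) for one <table>, scoped to that table only."""
    header = [text(th) for th in table.xpath('.//th')]    # relative: this table
    rows = [[text(td) for td in tr.xpath('./td')]
            for tr in table.xpath('.//tr')]
    # Drop rows whose cell count does not match the header.
    rows = [row for row in rows if len(row) == len(header)]
    return header, rows

root = LH.fromstring(HTML)
first = root.xpath('//table[@id="first"]')[0]

# Absolute path: searches the whole document, not just `first`.
absolute_headers = [text(th) for th in first.xpath('//th')]
header, rows = table_rows(first)

print(absolute_headers)
print(header, rows)
```

On a page with a single table the two forms return the same elements, which is why the original code works for this site.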
Solution 2:
If you mean all the text:
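from bs4 import BeautifulSoup
import requests

url_str = 'http://www.fdmbenzinpriser.dk/searchprices/5/'
r = requests.get(url_str).content
# Naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning.
print([x.text for x in BeautifulSoup(r, "html.parser").find_all("table", attrs={"id": "sortabletable"})])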
['Pris\nAdresse\nTidspunkt\n\n\n\n\n* Denne pris er indberettet af selskabet Indberet pris\n\n\n\n\n\n\xa08.24\n\xa0Gladsaxe Møllevej 33 2860 Søborg\n7 min 4 sek \n\n\n\n\xa08.89\n\xa0Frederikssundsvej 356 2700 Brønshøj\n9 min 10 sek \n\n\n\n\xa07.98\n\xa0Gartnerivej 1 7500 Holstebro\n14 min 25 sek \n\n\n\n\xa07.99 *\n\xa0Søndergade 116 8620 Kjellerup\n15 min 7 sek \n\n\n\n\xa07.99 *\n\xa0Gertrud Rasks Vej 1 9210 Aalborg SØ\n15 min 7 sek \n\n\n\n\xa07.99 *\n\xa0Sorøvej 13 4200 Slagelse\n15 min 7 sek \n\n\n\n\xa08.08 *\n\xa0Tørholmsvej 95 9800 Hjørring\n15 min 7 sek \n\n\n\n\xa08.09 *\n\xa0Nordvej 6 9900 Frederikshavn\n15 min 7 sek \n\n\n\n\xa08.09 *\n\xa0Skelmosevej 89 6980 Tim\n15 min 7 sek \n\n\n\n\xa08.09 *\n\xa0Højgårdsvej 2 4000 Roskilde\n15 min 7 sek']
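If one text blob per table is awkward to work with, the same table can be walked row by row instead. This is only a sketch: it runs against a small inline snippet whose markup is an assumption (the cell values are copied from the output above, but the real page structure may differ), and it uses the standard-library "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Assumed markup; only the cell values come from the output shown above.
HTML = """
<table id="sortabletable">
  <tr><th>Pris</th><th>Adresse</th><th>Tidspunkt</th></tr>
  <tr><td>&nbsp;8.24</td><td>&nbsp;Gladsaxe Møllevej 33 2860 Søborg</td><td>7 min 4 sek </td></tr>
  <tr><td>&nbsp;8.89</td><td>&nbsp;Frederikssundsvej 356 2700 Brønshøj</td><td>9 min 10 sek </td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("table", attrs={"id": "sortabletable"})

rows = []
for tr in table.find_all("tr"):
    # Replace non-breaking spaces and trim padding in each cell.
    cells = [td.get_text().replace("\xa0", " ").strip()
             for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so it is skipped here
        rows.append(cells)

print(rows)
```

The result is a list of lists, the same shape that Solution 1 passes to pandas.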