
Python: Get Html Table Data By Xpath

I feel that extracting data from HTML tables is extremely difficult and requires a custom build for each site. I would very much like to be proved wrong here. Is there a simple py

Solution 1:

There is a fairly general pattern which you could use to parse many, though not all, tables.

import lxml.html as LH
import requests
import pandas as pd


def text(elt):
    # Replace non-breaking spaces with ordinary spaces.
    return elt.text_content().replace(u'\xa0', u' ')


url = 'http://www.fdmbenzinpriser.dk/searchprices/5/'
r = requests.get(url)
root = LH.fromstring(r.content)

for table in root.xpath('//table[@id="sortabletable"]'):
    header = [text(th) for th in table.xpath('.//th')]       # 1
    data = [[text(td) for td in tr.xpath('td')]
            for tr in table.xpath('.//tr')]                  # 2
    data = [row for row in data if len(row) == len(header)]  # 3
    data = pd.DataFrame(data, columns=header)                # 4
    print(data)

  1. You can use table.xpath('.//th') to find the column names. The leading dot keeps the search inside this table; a bare '//th' would match every th element in the document.
  2. table.xpath('.//tr') returns the rows, and for each row, tr.xpath('td') returns the elements representing the cells of that row.
  3. Sometimes you may need to filter out certain rows, such as, in this case, rows with fewer values than the header.
  4. What you do with the data (a list of lists) is up to you. Here I use Pandas for presentation only:

     Pris                             Adresse      Tidspunkt
0    8.04         Brovejen 18 5500 Middelfart   3 min 38 sek
1    7.88       Hovedvejen 11 5500 Middelfart   4 min 52 sek
2    7.88       Assensvej 105 5500 Middelfart   5 min 56 sek
3    8.23  Ejby Industrivej 111 2600 Glostrup   6 min 28 sek
4    8.15          Park Alle 125 2605 Brøndby  25 min 21 sek
5    8.09         Sletvej 36 8310 Tranbjerg J  25 min 34 sek
6    8.24    Vindinggård Center 29 7100 Vejle   27 min 6 sek
7  7.99 *       Søndergade 116 8620 Kjellerup  31 min 27 sek
8  7.99 *    Gertrud Rasks Vej 1 9210 Aalborg  31 min 27 sek
9  7.99 *            Sorøvej 13 4200 Slagelse  31 min 27 sek
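To try the pattern without hitting the live site, it can be wrapped in a function and run on an inline snippet. This is only a sketch: the helper name table_rows and the SAMPLE markup are made up for illustration, and the extra "other" table is there to show why the XPath queries should be relative to the table being parsed.

```python
import lxml.html as LH


def text(elt):
    # Replace non-breaking spaces and trim surrounding whitespace.
    return elt.text_content().replace('\xa0', ' ').strip()


def table_rows(table):
    """Return (header, rows) for one <table> element (hypothetical helper)."""
    # './/' searches below this table only; '//' would search the whole document.
    header = [text(th) for th in table.xpath('.//th')]
    rows = [[text(td) for td in tr.xpath('./td')]
            for tr in table.xpath('.//tr')]
    # Keep only the rows that line up with the header.
    rows = [row for row in rows if len(row) == len(header)]
    return header, rows


# Hypothetical sample markup, made up for illustration.
SAMPLE = """
<html><body>
<table id="other"><tr><th>Unrelated</th></tr><tr><td>x</td></tr></table>
<table id="sortabletable">
  <tr><th>Pris</th><th>Adresse</th></tr>
  <tr><td>&nbsp;7.99 *</td><td>&nbsp;Sorøvej 13 4200 Slagelse</td></tr>
  <tr><td colspan="2">footnote row</td></tr>
</table>
</body></html>
"""

root = LH.fromstring(SAMPLE)
table = root.xpath('//table[@id="sortabletable"]')[0]
header, rows = table_rows(table)
print(header)
print(rows)
```

The header row (no td cells) and the single-cell footnote row are both dropped by the length check, which is the same filtering as step 3 above.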

Solution 2:

If you mean all the text:

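from bs4 import BeautifulSoup
import requests

url_str = 'http://www.fdmbenzinpriser.dk/searchprices/5/'
r = requests.get(url_str).content

# Naming the parser explicitly avoids bs4's "no parser specified" warning.
soup = BeautifulSoup(r, "html.parser")
print([x.text for x in soup.find_all("table", attrs={"id": "sortabletable"})])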

['Pris\nAdresse\nTidspunkt\n\n\n\n\n* Denne pris er indberettet af selskabet Indberet pris\n\n\n\n\n\n\xa08.24\n\xa0Gladsaxe Møllevej 33 2860 Søborg\n7 min 4 sek \n\n\n\n\xa08.89\n\xa0Frederikssundsvej 356 2700 Brønshøj\n9 min 10 sek \n\n\n\n\xa07.98\n\xa0Gartnerivej 1 7500 Holstebro\n14 min 25 sek \n\n\n\n\xa07.99 *\n\xa0Søndergade 116 8620 Kjellerup\n15 min 7 sek \n\n\n\n\xa07.99 *\n\xa0Gertrud Rasks Vej 1 9210 Aalborg SØ\n15 min 7 sek \n\n\n\n\xa07.99 *\n\xa0Sorøvej 13 4200 Slagelse\n15 min 7 sek \n\n\n\n\xa08.08 *\n\xa0Tørholmsvej 95 9800 Hjørring\n15 min 7 sek \n\n\n\n\xa08.09 *\n\xa0Nordvej 6 9900 Frederikshavn\n15 min 7 sek \n\n\n\n\xa08.09 *\n\xa0Skelmosevej  89 6980 Tim\n15 min 7 sek \n\n\n\n\xa08.09 *\n\xa0Højgårdsvej 2 4000 Roskilde\n15 min 7 sek']
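If one string per table is too coarse, the same BeautifulSoup lookup can be taken a step further to collect the text cell by cell. The following is a rough sketch on made-up markup (SAMPLE is hypothetical and only mimics the shape of the output above), not a tested scraper for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, made up for illustration.
SAMPLE = """
<table id="sortabletable">
  <tr><th>Pris</th><th>Adresse</th><th>Tidspunkt</th></tr>
  <tr><td>&nbsp;8.24</td><td>&nbsp;Gladsaxe Møllevej 33 2860 Søborg</td><td>7 min 4 sek </td></tr>
  <tr><td>&nbsp;7.99 *</td><td>&nbsp;Søndergade 116 8620 Kjellerup</td><td>15 min 7 sek </td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
table = soup.find("table", attrs={"id": "sortabletable"})

rows = []
for tr in table.find_all("tr"):
    # get_text(strip=True) trims whitespace, including non-breaking spaces.
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    if cells:  # skip rows that contain no cells at all
        rows.append(cells)

for row in rows:
    print(row)
```

Here the first entry of rows is the header and the remaining entries are data rows, so the result has the same list-of-lists shape as in Solution 1.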
