Python Regex - Find String Between Html Tags

I am trying to extract the string between Html tags. I can see that similar questions have been asked on stack overflow before, but I am completely new to python and I am strugglin

Solution 1:

>>> import re
>>> a = '<b>Bold Stuff</b>'
>>> re.findall(r'>(.+?)<', a)
['Bold Stuff']
>>> re.findall(r'>(.*?)<', a)[0]  # non-greedy mode
'Bold Stuff'
>>> re.findall(r'>(.+?)<', a)[0]  # also non-greedy mode
'Bold Stuff'
>>> re.findall(r'>(.*)<', a)[0]  # greedy mode
'Bold Stuff'

For this input, greedy and non-greedy mode give the same result, because the string contains only one pair of tags.

Your pattern uses the first non-greedy form. Here is an example that shows the difference between non-greedy and greedy mode:

>>> a = '<b>Bold <br> Stuff</b>'
>>> re.findall(r'>(.*?)<', a)[0]
'Bold '
>>> re.findall(r'>(.*)<', a)[0]
'Bold <br> Stuff'

And here is what (...) means, as described in the Python re documentation:

(...)

Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below.

To match the literals ( or ), use \( or \), or enclose them inside a character class: [(] [)].
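
As a small illustration of groups and the \number backreference mentioned in the quoted text, the sketch below captures the tag name in group 1 and requires the same name in the closing tag. The pattern and the sample string are made up for this example and are not from the question; it only handles simple tags without attributes or nesting.

```python
import re

# Hypothetical sample input, chosen only for illustration.
text = '<b>Bold Stuff</b> and <i>Italic Stuff</i>'

# Group 1 captures the tag name, group 2 captures the content.
# \1 matches again whatever group 1 matched, so <b> must be closed by </b>.
pattern = re.compile(r'<(\w+)>(.*?)</\1>')

# With two groups in the pattern, findall returns a list of tuples.
pairs = pattern.findall(text)
print(pairs)  # [('b', 'Bold Stuff'), ('i', 'Italic Stuff')]
```

Each tuple holds (tag name, content), so the content alone is the second element of each pair.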

Solution 2:

It's maybe simpler to just remove the HTML tags, leaving the content:

>>> import re
>>> re.sub('<[^<>]+>', '', '<b>Bold Stuff</b>')
'Bold Stuff'

Note that using regexes to remove HTML tags is frequently considered bad practice compared to using a proper HTML parser, but it might be ok if you know your content and can rely on it.
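
For comparison, here is a minimal sketch of the parser-based alternative using only the standard library's html.parser module. The class name TextExtractor and the function strip_tags are hypothetical names introduced for this example; the sketch simply collects every run of text the parser reports and does not treat script or style content specially.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text found between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside of tags.
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


result = strip_tags('<b>Bold Stuff</b>')
print(result)  # Bold Stuff
```

The result matches the re.sub version for this input, but the parser is the part that decides what counts as a tag.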

Solution 3:

I guess that your issue is related to the match object returned by re.search. The matched items can be accessed with its group() method. However, group 0 is the whole match, whereas you want the first parenthesized subgroup.

import re

text = '<b>Bold Stuff</b>'

m = re.search('>([^<>]*)<', text)
print(m.group(0))  # the whole match: >Bold Stuff<
print(m.group())   # the same as with the zero argument
print(m.group(1))  # the first parenthesized subgroup: Bold Stuff

This may work for simple cases. In more complex cases, however, it can be tricky to deal with nested or overlapping tags; see, for example, "RegEx match open tags except XHTML self-contained tags":

You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML...
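
To make the tricky case concrete, here is a small sketch that applies the same pattern to directly nested tags. The sample string is an assumption made up for this example.

```python
import re

pattern = re.compile(r'>([^<>]*)<')

# Hypothetical input with one tag nested directly inside another.
nested = '<b><i>Bold Stuff</i></b>'

# The pattern also matches the empty gaps between adjacent tags.
matches = pattern.findall(nested)
print(matches)  # ['', 'Bold Stuff', '']

# re.search returns only the first match, which here is an empty gap.
first = pattern.search(nested).group(1)
print(repr(first))  # ''
```

With this input, m.group(1) from re.search is an empty string rather than the text, so the caller would need extra filtering.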

Solution 4:

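import requests
from bs4 import BeautifulSoup

# url must be set to the address of the page to fetch; it is not given here.
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('b').text  # text of the first <b> element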
