Let's say I wanted to remove vowels from HTML:
    <a href="foo">Hello there!</a>Hi!

becomes

    <a href="foo">Hll thr!</a>H!
I figure this is a job for Beautiful Soup. How can I select the text in between tags and operate on it like this?
From stackoverflow
mike
-
Suppose the variable test_html has the following HTML content:

    <html>
    <head><title>Test title</title></head>
    <body>
    <p>Some paragraph</p>
    Useless Text
    <a href="http://stackoverflow.com">Some link</a>not a link
    <a href="http://python.org">Another link</a>
    </body></html>

Just do this:

    from BeautifulSoup import BeautifulSoup

    test_html = load_html_from_above()
    soup = BeautifulSoup(test_html)

    for t in soup.findAll(text=True):
        text = unicode(t)
        for vowel in u'aeiou':
            text = text.replace(vowel, u'')
        t.replaceWith(text)

    print soup

That prints:

    <html>
    <head><title>Tst ttl</title></head>
    <body>
    <p>Sm prgrph</p>
    Uslss Txt
    <a href="http://stackoverflow.com">Sm lnk</a>nt  lnk
    <a href="http://python.org">Anthr lnk</a>
    </body></html>

Note that the tags and attributes are untouched.
From nosklo
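
The code in the answer above targets Python 2 and BeautifulSoup 3 (the BeautifulSoup module import, unicode, and the print statement). For Python 3 with the bs4 package, a rough port might look like the sketch below. It assumes bs4 is installed and uses the built-in "html.parser" backend. The helper name strip_vowels_from_html and its vowels parameter are made up for illustration and are not part of the original answer. Like the original, it only removes lowercase vowels by default.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def strip_vowels_from_html(html, vowels="aeiou"):
    """Return html with the given characters removed from text nodes only.

    Hypothetical helper for illustration; not from the original answer.
    """
    # Translation table that deletes every character in `vowels`.
    table = str.maketrans("", "", vowels)
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list, so replacing nodes inside the loop is safe.
    for node in soup.find_all(string=True):
        # Only touch plain text; comments, doctypes and similar nodes are
        # NavigableString subclasses and are left alone.
        if type(node) is NavigableString:
            node.replace_with(str(node).translate(table))
    return str(soup)


print(strip_vowels_from_html('<a href="foo">Hello there!</a>Hi!'))
```

For the snippet from the question this should give back <a href="foo">Hll thr!</a>H!, with the tag and its href attribute unchanged. Passing vowels="aeiouAEIOU" would strip uppercase vowels as well.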