Saturday, February 5, 2011

Using Beautiful Soup, how do I iterate over all embedded text?

Let's say I wanted to remove vowels from HTML:

<a href="foo">Hello there!</a>Hi!

becomes

<a href="foo">Hll thr!</a>H!

I figure this is a job for Beautiful Soup. How can I select the text in between tags and operate on it like this?

  • Suppose the variable test_html holds the following HTML:

    <html>
    <head><title>Test title</title></head>
    <body>
    <p>Some paragraph</p>
    Useless Text
    <a href="http://stackoverflow.com">Some link</a>not a link
    <a href="http://python.org">Another link</a>
    </body></html>
    

    Iterate over every text node with findAll(text=True) and replace each one in place:

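    # Python 2 with BeautifulSoup 3
    from BeautifulSoup import BeautifulSoup
    
    # load_html_from_above() is a placeholder for however you obtain the HTML shown above
    test_html = load_html_from_above()
    soup = BeautifulSoup(test_html)
    
    # text=True matches every text node and skips the tags themselves
    for t in soup.findAll(text=True):
        text = unicode(t)
        # strip lowercase vowels only; capitals are left alone
        for vowel in u'aeiou':
            text = text.replace(vowel, u'')
        # swap the original text node for the edited string
        t.replaceWith(text)
    
    print soup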
    

    That prints:

    <html>
    <head><title>Tst ttl</title></head>
    <body>
    <p>Sm prgrph</p>
    Uslss Txt
    <a href="http://stackoverflow.com">Sm lnk</a>nt  lnk
    <a href="http://python.org">Anthr lnk</a>
    </body></html>
    

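    Note that the tags and attributes are untouched. Only lowercase vowels are removed, which is why the capital "U" in "Uslss" and the "A" in "Anthr" survive.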

    From nosklo
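
    For readers on Python 3, the same idea can be sketched with Beautiful Soup 4 (the bs4 package). This is not part of the original answer: the helper name strip_vowels and the choice of the built-in "html.parser" backend are assumptions made for illustration, and the string=True argument assumes a reasonably recent bs4 release.

```python
from bs4 import BeautifulSoup

VOWELS = "aeiou"  # lowercase only, matching the original answer


def strip_vowels(html):
    """Hypothetical helper: drop lowercase vowels from text nodes only."""
    # "html.parser" is the stdlib-backed parser; it is an assumption here,
    # chosen because it needs no extra install and adds no wrapper tags.
    soup = BeautifulSoup(html, "html.parser")
    # find_all(string=True) returns every text node in the tree as a list,
    # so replacing nodes while looping over it is safe.
    for node in soup.find_all(string=True):
        stripped = "".join(ch for ch in str(node) if ch not in VOWELS)
        node.replace_with(stripped)
    return str(soup)


# The snippet from the question
print(strip_vowels('<a href="foo">Hello there!</a>Hi!'))
```

    As in the original answer, only the text between tags is rewritten; the href value is an attribute, not a text node, so it is never visited.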
