STTTR 1: Python Regex to Remove Tags

I'm starting a new series of posts on coding tricks that are simple in principle but often take someone new to a technology too long to find on Stack Overflow. This serves two purposes. First, I can find them again, which always makes things easier when I haven't used something for a while. Second, maybe other people on the web will stumble on them and find them useful. I'm calling it ST3R, mostly because I'm a dork and I like cubing things.

In today's entry, I'm going to share a quick regular expression that will capture all the tags on a single page. This is useful for parsing HTML, XML, or other markup languages. I should note, however, that actual text processing of HTML tags is best handled by an HTML parser, not a basic regex.
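For comparison, here is a rough sketch of what the parser route could look like using Python's standard-library html.parser module. The class name TextExtractor and the helper strip_tags_with_parser are made up for this illustration, and the sketch only collects text nodes; it makes no attempt to handle things like script or style contents specially.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps the text nodes and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by HTMLParser for each run of text between tags
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)


def strip_tags_with_parser(html):
    """Return the text content of an HTML string, without the tags."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


print(strip_tags_with_parser("<p>Hello <b>world</b></p>"))
```

It's more code than a one-line regex, which is why the quick trick is still handy for throwaway scripts.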

In this case, however, we're going to play out a scenario where we're writing a Python script that will remove all the tags from an HTML document. Let's say our HTML looks something like this:

<h1>This is an awesome Website</h1>  
<p>But I hate all these tags.  Wouldn't it be great if we could remove them <span class="bold">all at once</span>.</p>  

This is pretty simple HTML, but let's look at how we'd write a Python script to remove the tags:

import re  # import our regex module

htmlFile = "THIS STRING CONTAINS THE HTML"

# now, we replace every tag with a single space
# (a raw string keeps the pattern free of accidental escape sequences)
htmlFile = re.sub(r'<.*?>', ' ', htmlFile)

Here, we use the regular expression <.*?>, which matches everything from an opening angle bracket to the nearest closing one. Of course, more advanced processing would take into consideration what's actually between them, but our .* matches any run of characters, and the ? makes the match non-greedy (meaning it stops at the first > instead of running from the first < to the last > on the line). One caveat: by default . does not match a newline, so a tag that is split across lines will be missed unless you pass flags=re.DOTALL to re.sub.
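To see the difference the ? makes, here's a small sketch that runs both the non-greedy and the greedy pattern over the heading from the sample HTML. The strip_tags helper and the multiline test string are my own additions for illustration. The last two lines also show what happens with a tag that is split across two lines: by default . does not match a newline, so the opening tag is left behind unless re.DOTALL is passed.

```python
import re


def strip_tags(html, pattern=r'<.*?>', flags=0):
    """Hypothetical helper: replace every match of pattern with a space."""
    return re.sub(pattern, ' ', html, flags=flags)


sample = '<h1>This is an awesome Website</h1>'

lazy = strip_tags(sample)                     # non-greedy: one tag at a time
greedy = strip_tags(sample, pattern=r'<.*>')  # greedy: first < to last >

print(repr(lazy))    # ' This is an awesome Website '
print(repr(greedy))  # ' '

# A tag broken across two lines (made-up example)
multiline = '<p\nclass="x">text</p>'
print(repr(strip_tags(multiline)))                   # '<p\nclass="x">text '
print(repr(strip_tags(multiline, flags=re.DOTALL)))  # ' text '
```

The greedy version swallows the heading text along with the tags, which is exactly what we don't want.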

That's all for now!
