How to write a regular expression to match an HTML tag?

May 28, 2008

It should be easy. Let’s say we are searching for all scripts on a web page with a regular expression like this:

<script.*</script>

And it will generally work, but only under the assumption that the page contains a single script tag. If there is more than one, for example:

<html><head><script>first script</script></head><body>example body<script>second script</script></body></html>

The result of the match is:

<script>first script</script></head><body>example body<script>second script</script>

instead of the expected:

<script>first script</script>

The reason is the greedy nature of the .* quantifier. It matches as much text as possible, so the match runs on to the last </script> on the page rather than stopping at the first one.
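To see the greedy behaviour in practice, here is a minimal sketch using Python's re module. The variable names page and greedy are only illustrative, and the page content is the example from above.

```python
import re

# The example page from above, split across two string literals for readability.
page = ("<html><head><script>first script</script></head>"
        "<body>example body<script>second script</script></body></html>")

# Greedy: ".*" grabs as much as it can, then backs off only as far as
# needed to find a closing </script>, which is the LAST one on the page.
greedy = re.search(r"<script.*</script>", page)
print(greedy.group(0))
```

The printed match starts at the first opening script tag and ends at the last closing one, swallowing everything in between.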

The solution is to use the non-greedy quantifier .*?, which matches as little text as possible.

So the regular expression should look like this:

<script.*?</script>
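As a rough illustration, the non-greedy pattern could be applied in Python like this. The names page, pattern and scripts are illustrative. The re.DOTALL flag is my own addition, not part of the pattern above: it is only needed if a script spans several lines, because by default the dot does not match a newline.

```python
import re

# The example page from above, split across two string literals for readability.
page = ("<html><head><script>first script</script></head>"
        "<body>example body<script>second script</script></body></html>")

# Non-greedy: ".*?" stops at the first closing </script> it can reach.
# re.DOTALL is an assumption added here so "." also matches newlines.
pattern = re.compile(r"<script.*?</script>", re.DOTALL)

# findall returns every non-overlapping match, one per script element.
scripts = pattern.findall(page)
for script in scripts:
    print(script)
```

Each script element now comes back as its own match instead of one long span.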

Thanks to the Regular Expression HOWTO for explaining this.