Encode Entities Inside PRE Tags

Here’s a little Python script that searches through a given file for pre tags and encodes anything in between. This is useful for when escaping from syntax highlighting plugins and replacing every occurrence of the code shortcode with a pre tag.

import re, sys

# First argument is the filename, output is filename.encoded
filename = sys.argv[1]
f = file(filename)
output = open('%s.encoded' % filename, 'w+');

# Read the whole file, fire the regular expressions
contents = f.read()
expr = re.compile(r'<pre>(.*?)</pre>', re.MULTILINE|re.DOTALL)
matches = expr.findall(contents)

# Loop through each match and replace < > with &lt; and &gt;
for match in matches:
	contents = contents.replace(match, match.replace('<', '&lt;').replace('>', '&gt;'));

# Write output file and close both files
output.write(contents)
output.close()
f.close()

Most syntax highlighting plugins will encode all entities on the fly for you so when you stop using them your code might break. Also, most highlighting plugins will render your TinyMCE visual editor useless when working with code, and I think it’s quite common to work with code using the visual editor in WordPress. At least Twenty Ten and Twenty Eleven understand that ;)

However, as seen from the replacement part, I don’t really encode all entities but rather replace the greater than and less than symbols. It’s enough for most cases but if you need a real entity encoding you should use the cgi.escape function which is similar to htmlspecialchars in php.

Feed this script with your database dump and it’ll create a new file with an .encoded prefix which you can feed back to MySQL. Please note though that this script reads the entier input file which may lead to slow execution, high memory usage and swapping when working with large files. Worked fine on my 30 megabyte database though.

About the author

Konstantin Kovshenin

WordPress Core Contributor, ex-Automattician, public speaker and consultant, enjoying life in Moscow. I blog about tech, WordPress and DevOps.

2 comments

    • Sergey, good question. Before running it I was pretty sure there were no nested <code>pre</code> elements but if you think you might have then you should run a check for right before replacement if you want to leave them as they are. If you'd like to encode them too I think it'll handle it just fine though, you should give it a test ;) Thanks for your comment!