
Protecting against XSS

Aug. 14th, 2011 | 11:49 pm

I wanted a module to strip out potential XSS injections.

I looked at the set of HTML tags allowed in LJ posts and came up with this idea.

Use BeautifulSoup to parse the submitted HTML and remove all tags that aren't in a safe HTML whitelist. Then, for img and a tags, process the URL and require that it start with one of an allowed set of protocols. The main downside is that a relative URL such as <img src=/foo.png/> won't work; you have to spell out the http: scheme.
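To make the relative-URL downside concrete, here's a tiny sketch of just the scheme check. It uses Python 3's urllib.parse (the renamed urlparse module) so it can be run on its own. scheme_is_allowed is a name made up for this illustration, not part of the module.

```python
from urllib.parse import urlparse

# Same scheme whitelist as in the post.
SAFE_URL_SCHEME = {'ftp', 'gopher', 'http', 'https', 'mailto',
                   'svn', 'svn+ssh'}

def scheme_is_allowed(url):
    """Hypothetical helper: True if the URL's scheme is whitelisted."""
    return urlparse(url).scheme.lower() in SAFE_URL_SCHEME

print(scheme_is_allowed('http://example.com/foo.png'))  # True
print(scheme_is_allowed('/foo.png'))                    # False: no scheme at all
print(scheme_is_allowed('javascript:alert(1)'))         # False: scheme not listed
```

A relative URL parses with an empty scheme, and the empty string isn't in the whitelist, so it gets rejected along with the hostile ones.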

This seems like a good method for sanitizing user input while still allowing some HTML, but how can you really know you're protecting against all the possible ways to inject a hostile payload? There are some really funky techniques for tricking the browser at http://ha.ckers.org/xss.html

# Written for Python 2 and BeautifulSoup 3.
from BeautifulSoup import BeautifulSoup, Tag
import urlparse

SAFE_HTML = set(['a','img',
                 'b','big','blockquote','br',
                 'center','cite','code',
                 'dd','div','dl','dt',
                 'em','font','h1','h2','h3','hr',
                 'i','input',
                 'li','nobr','ol','option','p','pre',
                 's','small','span','strike','strong','sub','sup',
                 'table','td','th','tr','tt','u','ul'])
SAFE_URL_SCHEME = set(['ftp', 'gopher', 'http', 'https', 'mailto', 
                       'svn', 'svn+ssh'])

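def pasturize_html(body):
    """Parse body as HTML and return it with unsafe tags and URLs removed."""
    soup = BeautifulSoup(body)
    pasturize_soup_contents(soup)
    return unicode(soup)

def pasturize_soup_contents(contents):
    """Recursively sanitize every Tag found in contents, in place."""
    # Iterate over a copy, since replaceWith() edits the parent's child list.
    for element in list(contents):
        if isinstance(element, Tag):
            # Clean the children first, so nested unsafe tags are handled too.
            if element.next is not None:
                pasturize_soup_contents(element)
            name = element.name.lower()
            if name == 'a':
                href = pasturize_url(element.get('href'))
                if href is not None:
                    element['href'] = href
            elif name == 'img':
                src = pasturize_url(element.get('src'))
                if src is not None:
                    element['src'] = src
            elif name not in SAFE_HTML:
                # Drop the unsafe tag together with everything inside it.
                element.replaceWith('')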

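def pasturize_url(url):
    """Return url if its scheme is whitelisted, '' if not, None if missing."""
    if url is None:
        return None

    parts = urlparse.urlparse(url)
    # Relative URLs have an empty scheme, so they are rejected as well.
    if parts.scheme.lower() not in SAFE_URL_SCHEME:
        return ''
    return urlparse.urlunparse(parts)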
    

body = pasturize_html('''<IMG """><SCRIPT>alert("XSS")</SCRIPT>">''')
assert body == '''<img />">'''

body = pasturize_html('''<IMG SRC="   javascript:alert('XSS');">''')
# The rejected URL is replaced by an empty src; this input has no trailing text.
assert body == '''<img src="" />'''
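The code above targets Python 2 (urlparse, unicode). For anyone who wants to play with the same whitelist idea on Python 3, here's a rough sketch using only the standard library's html.parser. It isn't a port of the module and it isn't a vetted sanitizer. The names WhitelistSanitizer and sanitize are made up for illustration, and it makes two choices the code above doesn't: it drops every attribute except a whitelisted href or src, and it keeps the text inside unknown tags (only script and style bodies are thrown away).

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlparse

# Tag and scheme whitelists copied from the post.
SAFE_HTML = {'a', 'img', 'b', 'big', 'blockquote', 'br', 'center', 'cite',
             'code', 'dd', 'div', 'dl', 'dt', 'em', 'font', 'h1', 'h2', 'h3',
             'hr', 'i', 'input', 'li', 'nobr', 'ol', 'option', 'p', 'pre',
             's', 'small', 'span', 'strike', 'strong', 'sub', 'sup',
             'table', 'td', 'th', 'tr', 'tt', 'u', 'ul'}
SAFE_URL_SCHEME = {'ftp', 'gopher', 'http', 'https', 'mailto',
                   'svn', 'svn+ssh'}

# Assumptions of this sketch, not taken from the post:
URL_ATTRS = {'a': 'href', 'img': 'src'}    # the only attributes kept
VOID_TAGS = {'img', 'br', 'hr', 'input'}   # tags emitted without a closer
DROP_CONTENT = {'script', 'style'}         # tags whose body is discarded

class WhitelistSanitizer(HTMLParser):
    """Hypothetical sanitizer: re-emits only whitelisted tags and URLs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a script/style element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in SAFE_HTML:
            return
        kept = ''
        url_attr = URL_ATTRS.get(tag)
        for name, value in attrs:
            if name != url_attr or not value:
                continue  # everything but a non-empty href/src is dropped
            if urlparse(value).scheme.lower() in SAFE_URL_SCHEME:
                kept = ' %s="%s"' % (name, escape(value, quote=True))
        self.out.append('<%s%s>' % (tag, kept))

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            if self.skip_depth:
                self.skip_depth -= 1
            return
        if not self.skip_depth and tag in SAFE_HTML and tag not in VOID_TAGS:
            self.out.append('</%s>' % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data))  # text is re-escaped on output

def sanitize(html):
    """Run the sketch over html and return the re-emitted markup."""
    parser = WhitelistSanitizer()
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)

print(sanitize('<b onclick="evil()">hi</b><script>alert(1)</script>'))
print(sanitize('<a href="javascript:alert(1)">x</a>'))
```

Because the output is rebuilt from scratch rather than edited in place, anything the parser doesn't hand to one of these three handlers (comments, doctypes, unknown attributes) simply never makes it into the result.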

Comments {2}

Josh


from: irilyth
date: Aug. 16th, 2011 04:40 am (UTC)

Does NoScript do what you're looking to do here?


Diane Trout


from: alienghic
date: Aug. 16th, 2011 05:18 am (UTC)

No, NoScript is protection on the client side. What I'm trying to do is write sanitizing code for a server that accepts input from the wild internet. (E.g., I'm trying to write the server-side component that does the same sanitization as this LJ message box.)
