Fixing urlparse: More on pyparsing and introducing netaddress

This is the last in a series of three posts (1, 2) discussing issues with Python's urlparse module. Here, I intend to provide a solution.

In the last post, I was talking about parser combinators and parsec in particular, mentioning pyparsing towards the end. The angel-app being a Python application, parsec, while cool, is of no immediate use. pyparsing, on the other hand, provides parsec-like functionality for Python. Consider this excerpt from the RFC 3986-compliant URI parser that I'm about to present in this post (please ignore, as usual, the blog's spurious formatting):

dec_octet = Combine(Or([
        Literal("25") + ZeroToFive,            # 250 - 255
        Literal("2") + ZeroToFour + Digit,     # 200 - 249
        Literal("1") + repeat(Digit, 2),       # 100 - 199
        OneToNine + Digit,                     # 10 - 99
        Digit                                  # 0 - 9
        ]))
IPv4address = Group(repeat(dec_octet + Literal("."), 3) + dec_octet)

And now:

>>> from netaddress import IPv4address 
[snipped warning message]
>>> IPv4address.parseString("127.0.0.1")
([(['127', '.', '0', '.', '0', '.', '1'], {})], {})
>>> IPv4address.parseString("350.0.0.1")
Traceback (most recent call last):
File "", line 1, in ?
egg/", line 1244, in parseImpl
raise exc
pyparsing.ParseException: Expected "." (at char 2), (line:1, col:3)
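For readers without pyparsing or netaddress at hand, the same five dec-octet alternatives can be sketched with nothing but the standard library. This is a minimal sketch, not the netaddress implementation: the names DEC_OCTET, IPV4_ADDRESS and is_ipv4_address are made up for illustration, and it only answers yes or no instead of returning a parse result.

```python
import re

# The five dec-octet alternatives from the excerpt above, longest first.
# These names are illustrative assumptions, not part of netaddress.
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
IPV4_ADDRESS = re.compile(r"(?:%s\.){3}%s" % (DEC_OCTET, DEC_OCTET))


def is_ipv4_address(text):
    """True only if the whole string is a dotted quad of valid octets."""
    return IPV4_ADDRESS.fullmatch(text) is not None


print(is_ipv4_address("127.0.0.1"))
print(is_ipv4_address("350.0.0.1"))
```

The important detail is fullmatch: the whole input has to fit the grammar, so a string like 350.0.0.1 is rejected rather than partially accepted.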

Anyhow, what I mean to say is this: We have a validating URI parser now. Apart from the bugs that are still to be expected for a piece of code at this early stage, it should be RFC 3986 compliant. You can get either the Python package, or a tarball of the darcs repository (unfortunately my zope account chokes on the "_darcs" directory filename, so I'm still looking for a good way to host the darcs).

This is how one would use it:

>>> from netaddress import URI
>>> uri = URI.parseString("http://localhost:6221/foo/bar")
>>> uri.port
>>> uri.scheme

Or, in the case of a more complex parse:

>>> uri = URI.parseString("http://vincent@localhost:6221/foo/bar")
>>> uri.asDict().keys()
['scheme', 'hier_part']
>>> uri.hier_part.path_abempty
(['/', 'foo', '/', 'bar'], {})
>>> uri.hier_part.authority.userinfo
>>> uri.hier_part.authority.port

Hope you find this useful.


Fixing urlparse: A case for Parsec and pyparsing

In a previous post, I described issues with parsing and validating URLs with the functionality provided by Python's stdlib. I will just restate that, clearly, all messages exchanged by angel-app nodes must be validated in order for it to work properly. What to do? First of all, I was of course not the first person to notice the module's shortcomings. However, I was surprised at the answers that popped up: it seems like no one was interested in actually coming up with a validating parser (perhaps even just for a subset of the complete URI syntax). Instead, people focussed on fixing specific cases where the parser would fail -- in essence adding new features, rather than putting the whole system on a solid basis. Suggestions go so far as to propose a new URI parsing module. However, the proposed new module is again based on the premise that the input represents a valid URI; the behavior in the case of an invalid input is again left undefined. WTF? Have these people never looked beyond string.split() and regexes?

Dudes, writing a VALIDATING PARSER is NOT THAT HARD, if you have a reasonable grammar and good libs. Why do people keep pretending that it is? Sure, you might be afraid of having to fire up lex, yacc and antlr, and for good reason. But with sufficiently dynamic languages, that's usually unnecessary, if you have a parser combinator library handy.

The key idea behind parser combinators is that you write your parser in a bottom-up fashion, in just the same way that you would define your grammar. You write a parser for a small part of the grammar, then combine these partial parsers to form a complex whole. The canonical example in this context is Haskell's parsec library. Let's start out with a simple restricted URI grammar:

module RestrictedURI where

import Text.ParserCombinators.Parsec

data URI = URI {
    host :: [String],
    port :: Int,
    path :: [String]
    } deriving (Eq, Show, Read)

schemeP = string "http" <?> "scheme"
schemeSepP = string "://" <?> "scheme separator"

hostPartP = many lower <?> "part of a host name"
hostNameP = sepBy hostPartP (string ".") <?> "host name"

pathSegmentP = sepEndBy1 (many1 alphaNum) (string "/") <?> "multiple path segments"
pathP = do {
    root <- string "/" <?> "absolute path required";
    segments <- pathSegmentP;
    return (root:segments)
    } <?> "an absolute path, optionally terminated by a /"

restrictedURIP :: Parser URI
restrictedURIP =
    do {
        ignored <- schemeP;
        ignored <- schemeSepP;
        h <- hostNameP;
        p <- pathP;
        return (URI h 80 p)
    } <?> "a subset of the full URI grammar"

parseURI :: String -> (Either ParseError URI)
parseURI = parse restrictedURIP ""

(Please forgive the blog for inserting break tags all over the place.) But just to illustrate:

vincent$ ghci 
GHCi, version 6.8.1: :? for help
Loading package base ... linking ... done.
Prelude> :l restrictedURI
[1 of 1] Compiling RestrictedURI ( restrictedURI.hs, interpreted )
Ok, modules loaded: RestrictedURI.
*RestrictedURI> parseURI ""
Loading package parsec- ... linking ... done.
Right (URI {host = ["localhost","com"], port = 80, path = ["/","foo","bar"]})

Plus, we get composability, validation and error messages essentially for free:

*RestrictedURI> parseURI "" 
Left (line 1, column 17): unexpected "2" expecting lowercase letter,
"." or an absolute path, optionally terminated by a /
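The bottom-up idea does not depend on Haskell. As a rough sketch, here is the same restricted grammar built from a handful of hand-rolled combinators in plain Python. None of these names (literal, many1, sep_by1, sequence, parse_uri) come from parsec or pyparsing; they are assumptions made up for this illustration. Unlike the Haskell version, this sketch requires at least one host part and rejects a trailing slash.

```python
import string


class ParseError(Exception):
    """Raised when the input does not match the grammar."""


def literal(text):
    """Parser that matches exactly `text`."""
    def parse(s, pos):
        if s.startswith(text, pos):
            return text, pos + len(text)
        raise ParseError("expected %r at position %d" % (text, pos))
    return parse


def many1(chars, label):
    """Parser for one or more characters taken from `chars`."""
    def parse(s, pos):
        end = pos
        while end < len(s) and s[end] in chars:
            end += 1
        if end == pos:
            raise ParseError("expected %s at position %d" % (label, pos))
        return s[pos:end], end
    return parse


def sep_by1(item, separator):
    """Parser for `item (separator item)*`, returning the list of items."""
    def parse(s, pos):
        value, pos = item(s, pos)
        values = [value]
        while True:
            try:
                _, after_sep = separator(s, pos)
                value, pos = item(s, after_sep)
            except ParseError:
                return values, pos  # stop before the unmatched separator
            values.append(value)
    return parse


def sequence(*parsers):
    """Parser that runs `parsers` one after the other."""
    def parse(s, pos):
        values = []
        for parser in parsers:
            value, pos = parser(s, pos)
            values.append(value)
        return values, pos
    return parse


# The grammar, assembled from small parts just like the Haskell version.
scheme = literal("http")
scheme_sep = literal("://")
host_name = sep_by1(many1(string.ascii_lowercase, "lowercase letter"),
                    literal("."))
path = sequence(literal("/"),
                sep_by1(many1(string.ascii_letters + string.digits,
                              "letter or digit"),
                        literal("/")))
restricted_uri = sequence(scheme, scheme_sep, host_name, path)


def parse_uri(text):
    """Return the parsed parts, or raise ParseError for any other input."""
    (_, _, host, (root, segments)), pos = restricted_uri(text, 0)
    if pos != len(text):
        raise ParseError("unexpected %r at position %d" % (text[pos], pos))
    return {"host": host, "port": 80, "path": [root] + segments}
```

Calling parse_uri on a string inside the grammar returns the host parts and path segments; anything else raises ParseError with a position, so the behaviour on invalid input is defined.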

Now consider the following excerpt from Haskell's Network.URI.

--  RFC3986, section 3.1
uscheme :: URIParser String
uscheme =
    do  { s <- oneThenMany alphaChar (satisfy isSchemeChar)
        ; char ':'
        ; return $ s++":"
        }

(Again, please forgive the blog for eating my code, but you can also get it from the Haskell web site.) And compare that to the ABNF found in the corresponding section of the RFC:

scheme      = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

Note how the complete URI grammar specification in the RFC is barely a page long. So yeah, implementing this grammar is a significant amount of work (of course you could always choose to support just a well-defined subset), but if you have a good parser combinator library, it's just a few hours of mechanically transforming the ABNF into your parser grammar. You can even watch the Simpsons while doing it (I did). In the case of Network.URI, this boils down to a line count of 1278, with about half of the lines being comments or empty lines. Not only that, but given the complete grammar specification, it's super easy to formulate a modified grammar.
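To give a feel for how mechanical that transformation is, here is the scheme rule from above written out in plain Python. This is a sketch only: the names ALPHA, DIGIT, SCHEME_TAIL and is_scheme are chosen for this illustration and do not come from any library.

```python
import string

# Character classes named after the ABNF terminals they stand for.
ALPHA = set(string.ascii_letters)
DIGIT = set(string.digits)
SCHEME_TAIL = ALPHA | DIGIT | set("+-.")


def is_scheme(text):
    """scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )"""
    return (bool(text)
            and text[0] in ALPHA
            and all(c in SCHEME_TAIL for c in text[1:]))


for candidate in ["http", "svn+ssh", "6221", ""]:
    print(repr(candidate), is_scheme(candidate))
```

Each piece of the ABNF maps onto one expression: the leading ALPHA is the check on text[0], and the starred group is the all(...) over the remaining characters.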

As it turns out, Python has a library quite like parsec. It's called pyparsing, and I'll bore you with it in my next (and last) post on this topic.

Think you can Trust Python's stdlib? Think again.

It's been a while since I blogged about Ken Thompson's Reflections on Trusting Trust. And this week I was bitten hard by its moral:

The moral is obvious. You can't trust code that you did not totally create yourself. (Especially code from companies that employ people like me.) No amount of source-level verification or scrutiny will protect you from using untrusted code. In demonstrating the possibility of this kind of attack, I picked on the C compiler. I could have picked on any program-handling program such as an assembler, a loader, or even hardware microcode. As the level of program gets lower, these bugs will be harder and harder to detect. A well installed microcode bug will be almost impossible to detect.

The task seemed simple enough. We had been passing around links between clones in a URL-like format of the type ${host}:${port}/${path}, with a small custom parser (an ugly hack) for parsing and unparsing these things. As we adapted the code to support IPv6 it turned out that in many cases (i.e. unless the nodename field was configured), raw IPv6 addresses would be passed around, and the parser would of course choke on that. Fair enough, I thought, time to use the established standards and

import urlparse 

Now this is supposed to split the URI into parts corresponding to scheme, host, path etc. like so

>>> urlparse.urlparse("") 
('http', '', '/bar', '', '', '')

Of course, most nodes still had the old clone links lying around, and I was surprised to find the parse for these entries:

>>> urlparse.urlparse("") 
('', '', '6221/bar', '', '', '')

Hmm. OK. Let's look at the internals of that parser, and vi

def urlsplit(url, scheme='', allow_fragments=1):
    """Parse a URL into 5 components:
    <scheme>://<netloc>/<path>?<query>#<fragment>
    Return a 5-tuple: (scheme, netloc, path, query, fragment).
    Note that we don't break the components up in smaller bits
    (e.g. netloc is a single string) and we don't expand % escapes."""
    key = url, scheme, allow_fragments
    cached = _parse_cache.get(key, None)
    if cached:
        return cached
    if len(_parse_cache) >= MAX_CACHE_SIZE: # avoid runaway growth
        clear_cache()
    netloc = query = fragment = ''
    i = url.find(':')
    if i > 0:
        if url[:i] == 'http': # optimize the common case
            scheme = url[:i].lower()
            url = url[i+1:]
            if url[:2] == '//':
                netloc, url = _splitnetloc(url, 2)
            # [...]
        # [...]
            scheme, url = url[:i].lower(), url[i+1:]
    # [...]
    return tuple

(Why do blogs always _INSIST_ on fucking up source code? But we're kind of on topic, so maybe this fits.) Anyhow, we have a fancy caching scheme, but the parser itself consists of a bunch of if and url.split() statements. Talk about premature optimization. More than that, one would think that language implementors know a thing or two about parsers...

Consider: the parser is written in such a way that the result is predictable if and only if the input string represents a valid URL. But how do you find out if a string is indeed a URL? The answer is easy: you use a parser. In other words, the urlparse module is in most cases useless, because unless you have sufficient control over the input (unlikely for networking apps), the parse result is essentially undefined.
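To make "defined behaviour on invalid input" concrete, here is a minimal sketch of a validating parser for the old clone-link format ${host}:${port}/${path}. It is not the angel-app code: the function name parse_clone_link and the exact rules for host labels and path characters are assumptions made for this illustration. A regular expression is enough here only because this tiny format is regular, and fullmatch makes sure the whole input is checked.

```python
import re

# Hypothetical strict grammar for ${host}:${port}/${path}. The real rules
# used by angel-app are not given in the post; these are assumptions.
_HOST_LABEL = r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"
_CLONE_LINK = re.compile(
    r"(?P<host>%s(?:\.%s)*):(?P<port>[0-9]{1,5})/(?P<path>[^\s?#]*)"
    % (_HOST_LABEL, _HOST_LABEL)
)


def parse_clone_link(text):
    """Return (host, port, path), or raise ValueError. Never guess."""
    match = _CLONE_LINK.fullmatch(text)
    if match is None:
        raise ValueError("not a valid clone link: %r" % (text,))
    port = int(match.group("port"))
    if not 0 < port < 65536:
        raise ValueError("port out of range: %d" % port)
    return match.group("host"), port, "/" + match.group("path")


print(parse_clone_link("localhost:6221/bar"))
```

A raw IPv6 address in the host position is rejected with a ValueError instead of being split at the wrong colon, so the caller always knows whether it is holding a valid parse.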

However the urlparse module is not only "useless", it is in fact dangerous, since by using it for untrusted input, the behaviour of your app is by implication also essentially undefined (how do you handle an undefined result?). Now consider the following quick google code search. I don't suppose that any of the following names rings a bell with you: Zope, Plone, twisted, Turbogears, mailman, django, chandler, bittorrent. Surely all of these software packages have carefully reviewed all of their uses of urlparse, and properly identify and handle all cases where an arbitrary result may be returned... Script kiddies, REJOICE!


CALL for testing ANGEL APPLICATION release candidate 0.2.0rc1

Dear all,

the ANGEL APPLICATION source code has reached a point which we think is good for creating a new public release for m221e ANGELS.

To make sure things go well, we kindly ask that each etoy.AGENT running MAC OS X or a Unix-ish operating system downloads the RELEASE CANDIDATE of the software, which is available at

All we ask for is starting it and checking the following things:

- does it crash?
- does the "p2p process" run continuously?
- do all the icons and images show up correctly?

If you encounter problems, you can do the following:

- purge the repository via the new File menu command and see if the problem persists
- remove all previous data like so:

    rm -rf ~/.angel-app
    rm ~/.angelrc

and see if the problem persists
- report the operating system version
- for mac users, consider copy pasting output from (it shows the logging of angel-app)

For a list of changes, I suggest looking at agent Vincent's blog post at:

It would be nice to get feedback (also positive ;-) ) during the weekend.

thank you!


next generation etoy.TERMINALS

cheaper, faster, lighter, less fragile, easier to set up, batteries included: eeepc. comments welcome.

ANGEL APPLICATION - approaching beta

We're highly pleased with the progress we have been making lately: The next release of the ANGEL APPLICATION is to be expected for one of the coming weekends (obviously, it's ready when it's ready, we're largely debian nerds after all). The obligatory screenie (looks haven't changed much, tho'):

Major changes include:

  • a completely revamped security model: we have abandoned our previously mixed pull/push model in favor of a purely pull model. This greatly simplifies the code, and increases security by disallowing any (with one tiny, optional, exception) modification of data on the clients by remote agents. However, this required
  • NAT traversal support. This we implemented by adding optional support for NAT traversal via teredo/miredo. This in turn required
  • (optional) support for IPv6 in the twisted matrix library, our primary infrastructure library. The extension is available as a (limited, but self-contained) add-on module from our subversion repository.
  • To support transparent addressing in the face of a schizophrenic internet infrastructure, agent.POL has implemented a dynamic DNS service that supports IPv6 (note e.g. the clone located at, IPv6 required). He's currently offering that as a free service on We plan to integrate it more tightly into the angel-app as time and resources permit.
  • A revamped configuration subsystem.
  • Improved GUI support.
  • An extensive code cleanup, resulting in a reasonably clean object model and a rather thorough unit test harness, while actually reducing the size of the code base.

I'm currently in the process of stress-testing the system by letting POL's home machine back up my holiday pictures (again, IPv6 support required). Things are looking good so far ;-) Stay tuned, or grab the latest snapshot from svn.


Supporting IPv6 in twisted

It turns out that adding IPv6 support to the twisted library is rather straightforward. Originally, I just hacked up a few changes to get my prototype running with teredo, resulting in a few lines of changes to the twisted networking code (patch available). It turns out that the resulting code seems fully backwards-compatible with IPv4. Unintended, but highly welcome ;-) YAY (IPv6 required)!

Anyhow -- if you're a hacker, give teredo a try. Turn your laptop into a server in a few minutes. It's a sexy piece of technology which I think will greatly change the way we think about and work with the internet.

One thing to keep in mind is security: teredo provides you with a globally visible IP address, meaning you're directly addressable worldwide. NATs and many in-between firewalls are tunnelled through. If you're using a mac, add something like the following ipfw firewall ruleset (thanks POL) to your miredo startup command (see /etc/miredo.conf) to protect you from unsolicited and possibly dangerous traffic:


exec > "$LOGFILE" 2>&1

echo "Starting miredo hook for setting up firewall rules...."
echo "$0: $$"
echo "uid: $UID"
killall -HUP lookupd DirectoryService

/sbin/ip6fw add 1000 allow log tcp from any to any 6221
/sbin/ip6fw add 1001 deny log tcp from any to any

ANGEL APPLICATION getting ready for IPv6?

After a beautiful afternoon hack, we got the angel-app to work with miredo/teredo and IPv6.

screenshot with IPv6 address


What's the meaning of this, you might ask yourself? Well, it means the potential for true p2p networking, which has so far been a real pain in the ANGEL APPLICATION's backside.

Among other things, agent.POL has been able to access my ANGEL APPLICATION instance running on my laptop at home -- behind 2 layers of NAT, no less.

And if you have a teredo-enabled host (perhaps even with any IPv6-enabled host, we're not sure yet), you can try it yourself for the time being (no guarantees):


This means that a highly secure pull-only model is (in principle) within reach, greatly simplifying and stabilizing the ANGEL APPLICATION.

Stay tuned.


fascinating collection of old usenet threads

since we're in the digital history business: i've run across a fascinating chronological collection of old usenet threads.

Highlights include the announcement of GNU, and the announcement of the www.


Archives and the use of Second Life

The Stanford Humanities Lab and affiliates created this Machinima clip about their (at least) double-layered project called "The Dante Hotel" that conserved and experimented with a real-life hotel room appropriated in 1972, originally by Lynn Hershman Leeson and Eleanor Coppola. More here.