Tuesday, March 15, 2011

Difference in regex behavior between Perl and Python?

I have a couple of email addresses, 'support@company.com' and '1234567@tickets.company.com'.

In Perl, I could take the To: line of a raw email and find either of the above addresses with

/\w+@(tickets\.)?company\.com/i

In Python, I simply wrote the above regex as '\w+@(tickets\.)?company\.com', expecting the same result. However, support@company.com isn't found at all, and a findall on the second address returns a list containing only 'tickets.'. So clearly the '(tickets\.)?' is the problem area, but what exactly is the difference in regular expression rules between Perl and Python that I'm missing?

From stackoverflow
  • Two problems jump out at me:

    1. You need to use a raw string to avoid having to escape "\"
    2. You need to escape "."

    So try:

    r'\w+@(tickets\.)?company\.com'
    

    EDIT

    Sample output:

    >>> import re
    >>> exp = re.compile(r'\w+@(tickets\.)?company\.com')
    >>> bool(exp.match("s@company.com"))
    True
    >>> bool(exp.match("1234567@tickets.company.com"))
    True
    
    jcoon : I second this suggestion.
    BipedalShark : #2 is just me being a newb at stackoverflow. Fixed the initial post. ;)
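
    One thing to watch with the sample above: re.match only tries the pattern at the very start of the string, so it behaves differently on a full To: line than on a bare address. A minimal sketch, assuming a made-up header line (to_line is hypothetical, not taken from the question) and using re.IGNORECASE as the counterpart of Perl's /i:

```python
import re

# Non-capturing group, case-insensitive like the Perl /i version.
pattern = re.compile(r'\w+@(?:tickets\.)?company\.com', re.IGNORECASE)

# Hypothetical header line, made up for illustration.
to_line = "To: Support <support@company.com>"

# match() only tries at position 0, where the text is "To: ...".
anchored = pattern.match(to_line)

# search() scans forward until the pattern matches somewhere.
found = pattern.search(to_line)

print(anchored)        # None
print(found.group(0))  # support@company.com
```

    So for header lines, re.search (or findall/finditer) is probably the closer analogue of the Perl match operator.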
  • I think the problem is in your expectations of extracted values. Try using this in your current Python code:

    r'(\w+@(?:tickets\.)?company\.com)'
    
  • The documentation for re.findall:

    findall(pattern, string, flags=0)
        Return a list of all non-overlapping matches in the string.
    
        If one or more groups are present in the pattern, return a
        list of groups; this will be a list of tuples if the pattern
        has more than one group.
    
        Empty matches are included in the result.
    

    Since (tickets\.) is a capturing group, findall returns that group instead of the whole match. If you want the whole match, put a group around the whole pattern and/or make the inner group non-capturing with (?:...), i.e.

    r'(\w+@(tickets\.)?company\.com)'
    r'\w+@(?:tickets\.)?company\.com'
    

    Note that you'll have to pick out the first element of each tuple returned by findall in the first case.

    Axeman : Okay, but interestingly, not *obvious*.
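
    To make the findall rule concrete, here is a small sketch that runs the three variants of the pattern over one string. The sample text is just the two addresses from the question joined together; the variable names are mine.

```python
import re

text = "support@company.com, 1234567@tickets.company.com"

# One capturing group: findall returns only that group's text,
# with an empty string where the optional group did not take part.
one_group = re.findall(r'\w+@(tickets\.)?company\.com', text)

# Two capturing groups: findall returns one tuple per match.
two_groups = re.findall(r'(\w+@(tickets\.)?company\.com)', text)

# No capturing groups: findall returns the whole matches.
no_groups = re.findall(r'\w+@(?:tickets\.)?company\.com', text)

print(one_group)   # ['', 'tickets.']
print(two_groups)  # [('support@company.com', ''), ('1234567@tickets.company.com', 'tickets.')]
print(no_groups)   # ['support@company.com', '1234567@tickets.company.com']
```

    The first result suggests why support@company.com looked like it wasn't found: the address did match, but findall reported only the (empty) optional group.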
  • There isn't a difference in the regexes, but there is a difference in what you are looking for. In both Perl and Python, the group in your regex captures only "tickets." (when it is present), not the whole address. You probably want something like this

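    #!/usr/bin/python
    
    import re
    
    # Capture the whole address; (?:...) keeps "tickets." out of the results.
    # The raw string passes the backslashes through to the regex engine untouched.
    regex = re.compile(r"(\w+@(?:tickets\.)?company\.com)")
    
    a = [
        "foo@company.com",
        "foo@tickets.company.com",
        "foo@ticketsacompany.com",
        "foo@compant.org",
    ]
    
    # Only the first two addresses match; the last two print an empty list.
    for string in a:
        print(regex.findall(string))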
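
    If you would rather keep the original pattern unchanged, another option is to iterate over match objects and ask for group 0, which is always the whole match. A rough sketch, assuming a made-up To: line (text is hypothetical) and re.IGNORECASE standing in for Perl's /i:

```python
import re

# Original pattern from the question, capturing group left in place.
pattern = re.compile(r'\w+@(tickets\.)?company\.com', re.IGNORECASE)

# Hypothetical header line, made up for illustration.
text = "To: Support@Company.com, 1234567@tickets.company.com"

# group(0) is the entire match, regardless of any capturing groups.
whole = [m.group(0) for m in pattern.finditer(text)]

# group(1) is the optional subdomain, or None when it was absent.
sub = [m.group(1) for m in pattern.finditer(text)]

print(whole)  # ['Support@Company.com', '1234567@tickets.company.com']
print(sub)    # [None, 'tickets.']
```

    This keeps the subdomain available as group 1 for telling the two kinds of address apart.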
    
