STTTR 2: Powerful Regexes for Date Processing

For my second installment of simple technical tricks to remember, I present some powerful regexes for use in date processing!

To try these out, just head over to regexr.com.

This regex will capture all dates in the formats MM/DD/YY, MM/DD/YYYY, MM-DD-YY, MM-D-YYYY, YYYY-MM-DD and more!

([0-2]?\d{1,3})[\/\-](([0-2]?\d{1})|([3][01]{1}))[\/\-](([1]{1}[9]{1}[9]{1}\d{1})|([1-9]{1}\d{3})|(\d{2}))
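If you'd rather sanity-check a pattern like this from a script than from regexr, a few lines of Python will do. This is only a minimal sketch: `NUMERIC_DATE` is a simplified month-first variant I'm assuming for illustration (not the exact regex above), and the sample string is made up.

```python
import re

# Simplified, hypothetical variant of the numeric pattern:
# 1-2 digit month, 1-2 digit day, 2- or 4-digit year, separated by "/" or "-".
NUMERIC_DATE = re.compile(r'\b([01]?\d)[/-]([0-2]?\d|3[01])[/-](\d{4}|\d{2})\b')

# Made-up sample text covering a few of the formats listed above.
samples = "Due 01/24/1990, shipped 1-5-90, invoiced 12/31/2020."

# group(0) is the full matched date string for each hit.
matches = [m.group(0) for m in NUMERIC_DATE.finditer(samples)]
print(matches)
```

Note that inside a character class, characters like `,` and `|` are literal, so `[/-]` is all that's needed to accept either separator.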

This regex will capture dates in the typical US long date format, such as January 24, 1990 (or january 24, 1990):

([Jj]anuary|[Ff]ebruary|[Mm]arch|[Aa]pril|[Mm]ay|[Jj]une|[Jj]uly|[Aa]ugust|[Ss]eptember|[Oo]ctober|[Nn]ovember|[Dd]ecember)[\W|\-|\/]?(([0-2]?\d{1})|([3][01]{1})),[\W|\-|\/]?(([1]{1}[9]{1}[9]{1}\d{1})|([1-9]{1}\d{2,3})|(\d{2}))
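One alternative to spelling out `[Jj]`, `[Ff]`, and so on is to let the regex engine ignore case. The sketch below assumes Python and a simplified pattern of my own (`MONTH_NAME_DATE` and the sample text are hypothetical, not the regex above); it builds the month alternation from a tuple so it's easy to see that all twelve months are present.

```python
import re

# All twelve month names, lowercase; re.IGNORECASE handles capitalization.
MONTHS = (
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
)

# Hypothetical simplified pattern: month name, day, comma, 2- or 4-digit year.
MONTH_NAME_DATE = re.compile(
    r'\b(' + "|".join(MONTHS) + r')\s+([0-2]?\d|3[01]),\s*(\d{4}|\d{2})\b',
    re.IGNORECASE,
)

# Made-up sample text with one lowercase and one capitalized month.
text = "Born january 24, 1990; moved August 3, 2015."

# Collect (month, day, year) tuples from the three capture groups.
found = [(m.group(1), m.group(2), m.group(3)) for m in MONTH_NAME_DATE.finditer(text)]
print(found)
```

The trade-off is that case-insensitive matching also accepts odd spellings like "jAnUaRy", which is usually harmless for date extraction.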

While these are only two regexes, they can be incredibly effective at capturing dates. Here's an example of Python code that will find all date spans in a given collection of text, stored as a string:

import re

# doc_text is the string to search; this sample value is only a placeholder
doc_text = "Invoiced 01/24/1990 and paid on January 30, 1990."

found_dates = []
date_with_dashes_or_slashes = re.compile(
        r'([0-2]?\d{1,3})[\/\-](([0-2]?\d{1})|([3][01]{1}))[\/\-](([1]{1}[9]{1}[9]{1}\d{1})|([1-9]{1}\d{3})|(\d{2}))')
# re.IGNORECASE lets the lowercase month names also match "January", "JANUARY", etc.
full_month_name = re.compile(
        r'(january|february|march|april|may|june|july|august|september|october|november|december)[\W|\-|\/]?(([0-2]?\d{1})|([3][01]{1})),[\W|\-|\/]?(([1]{1}[9]{1}[9]{1}\d{1})|([1-9]{1}\d{2,3})|(\d{2}))',
        re.IGNORECASE)
# collect (start, end) character offsets for every match of each pattern
found_dates.extend([(m.start(0), m.end(0)) for m in date_with_dashes_or_slashes.finditer(doc_text)])
found_dates.extend([(m.start(0), m.end(0)) for m in full_month_name.finditer(doc_text)])
# now, found_dates will contain the spans of date annotations in doc_text
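Because each pattern is run separately, the spans come back grouped by pattern rather than in document order. Here's a small sketch of sorting them and slicing out the matched text. The two short patterns and the sample `doc_text` are stand-ins I'm assuming for brevity, not the full regexes from this post.

```python
import re

# Made-up sample text with one named-month date and one numeric date.
doc_text = "Signed March 3, 2015 and filed 3/14/2015."

# Short, hypothetical stand-in patterns (deliberately looser than the real ones).
numeric = re.compile(r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}')
named = re.compile(r'[A-Za-z]+ \d{1,2}, \d{4}')

# Gather (start, end) spans pattern by pattern, as in the code above.
found_dates = []
for pattern in (numeric, named):
    found_dates.extend((m.start(), m.end()) for m in pattern.finditer(doc_text))

# Tuples sort by start offset first, which restores document order.
found_dates.sort()

# Slice the original text to recover the matched date strings.
date_strings = [doc_text[start:end] for start, end in found_dates]
print(date_strings)
```

Keeping the spans as offsets (rather than just the strings) is handy if you later want to highlight or annotate the dates in the original text.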

If you find an error in this or have an even better regex, let me know in the comments below!
