Wednesday 3 January 2018

pandas - Python still having issues with try-except clause





I am using the tld Python library to grab the first-level domain from the proxy request logs using an apply function. When I run into a strange request that tld doesn't know how to handle, like 'http:1 CON' or 'http:/login.cgi%00', I run into an error message like the following:



TldBadUrl: Is not a valid URL http:1 con!

TldBadUrlTraceback (most recent call last)
in engine
----> 1 new_fld_column = request_2['request'].apply(get_fld)

/usr/local/lib/python2.7/site-packages/pandas/core/series.pyc in apply(self, func, convert_dtype, args, **kwds)
   2353         else:
   2354             values = self.asobject
-> 2355             mapped = lib.map_infer(values, f, convert=convert_dtype)
   2356
   2357         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer (pandas/_libs/lib.c:66440)()

/home/cdsw/.local/lib/python2.7/site-packages/tld/utils.pyc in get_fld(url, fail_silently, fix_protocol, search_public, search_private, **kwargs)
    385         fix_protocol=fix_protocol,
    386         search_public=search_public,
--> 387         search_private=search_private
    388     )
    389

/home/cdsw/.local/lib/python2.7/site-packages/tld/utils.pyc in process_url(url, fail_silently, fix_protocol, search_public, search_private)
    289             return None, None, parsed_url
    290         else:
--> 291             raise TldBadUrl(url=url)
    292
    293     domain_parts = domain_name.split('.')
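
The failure can be reproduced outside of pandas with a single malformed string. Here is a minimal sketch (assuming the same tld library as in the traceback) that triggers the exception directly:

from tld import get_fld
from tld.exceptions import TldBadUrl

try:
    # one of the malformed "requests" from the proxy logs above
    get_fld('http:1 CON')
except TldBadUrl as exc:
    # per the traceback above: "Is not a valid URL http:1 con!"
    print(exc)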


To overcome this, it was suggested to me to wrap the function in a try-except clause so that I can later find the rows that error out by querying for NaN:




import tld
from tld import get_fld

def try_get_fld(x):
    try:
        return get_fld(x)
    except tld.exceptions.TldBadUrl:
        return np.nan



This seems to work for some of the "requests", like "http:1 con" and "http:/login.cgi%00", but then fails for "http://urnt12.knhc..txt/", where I get another error message like the one above:



TldDomainNotFound: Domain urnt12.knhc..txt didn't match any existing TLD name!


This is what the dataframe looks like (a total of 240,000 "requests" in a dataframe called "request"):



   request                                      count
0  https://login.microsoftonline.com            24521
1  https://dt.adsafeprotected.com               11521
2  https://googleads.g.doubleclick.net          6252
3  https://fls-na.amazon.com                    65225
4  https://v10.vortex-win.data.microsoft.com    7852222
5  https://ib.adnxs.com                         12
6  http:1 CON                                   6
7  http:/login.cgi%00                           45822
8  http://urnt12.knhc..txt/                     1
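
For anyone who wants to reproduce the problem without the full 240,000-row CSV, a small hypothetical sample built from the rows shown above is enough:

import pandas as pd

# Hypothetical sample of the "request" dataframe shown above
request = pd.DataFrame({
    'request': ['https://login.microsoftonline.com',
                'http:1 CON',
                'http:/login.cgi%00',
                'http://urnt12.knhc..txt/'],
    'count': [24521, 6, 45822, 1],
})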




My code:



from tld import get_tld
from tld import get_fld
import pandas as pd
import numpy as np

# Read back into a dataframe
request = pd.read_csv('Proxy/Proxy_Analytics/Request_Grouped_By_Request_Count_12032018.csv')

# Remove rows where there were null values in the request column
request = request[pd.notnull(request['request'])]

# Find the urls that contain IP addresses and exclude them from the new dataframe
request = request[~request.request.str.findall(r'[0-9]+(?:\.[0-9]+){3}').astype(bool)]

# Reset index
request = request.reset_index(drop=True)

import tld
from tld import get_fld

def try_get_fld(x):
    try:
        return get_fld(x)
    except tld.exceptions.TldBadUrl:
        return np.nan

request['flds'] = request['request'].apply(try_get_fld)

#faulty_url_df = request[request['flds'].isna()]
#print(faulty_url_df)


Answer





It fails because it's a different exception. You expect a tld.exceptions.TldBadUrl exception but get a TldDomainNotFound.

You can either be less specific in your except clause and catch more exceptions with a single except clause, or add another except clause to catch the other type of exception:



try:
    return get_fld(x)
except tld.exceptions.TldBadUrl:
    return np.nan
except tld.exceptions.TldDomainNotFound:
    print("Domain not found!")
    return np.nan
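
The "one except clause" option mentioned above can be written by catching both exception types as a tuple. The get_fld signature in the traceback also shows a fail_silently parameter; if it behaves as its name suggests and returns None instead of raising, that offers a third option. A sketch of both, assuming the tld version from the question:

import numpy as np
import tld
from tld import get_fld

def try_get_fld(x):
    # Catch both known failure modes with a single except clause
    try:
        return get_fld(x)
    except (tld.exceptions.TldBadUrl, tld.exceptions.TldDomainNotFound):
        return np.nan

# Alternative: let tld swallow the error itself; bad URLs come back as None,
# which pandas treats as missing when you later call .isna()
request['flds'] = request['request'].apply(lambda x: get_fld(x, fail_silently=True))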
