Wednesday 3 January 2018

pandas - Python still having issues with try-except clause





I am using the tld Python library to grab the first-level domain from the proxy request logs using an apply function. When I run into a strange request that tld doesn't know how to handle, like 'http:1 CON' or 'http:/login.cgi%00', I run into an error message like the following:



TldBadUrl: Is not a valid URL http:1 con!

TldBadUrlTraceback (most recent call last)
in engine
----> 1 new_fld_column = request_2['request'].apply(get_fld)

/usr/local/lib/python2.7/site-packages/pandas/core/series.pyc in apply(self, func, convert_dtype, args, **kwds)
   2353         else:
   2354             values = self.asobject
-> 2355             mapped = lib.map_infer(values, f, convert=convert_dtype)
   2356
   2357         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer (pandas/_libs/lib.c:66440)()

/home/cdsw/.local/lib/python2.7/site-packages/tld/utils.pyc in get_fld(url, fail_silently, fix_protocol, search_public, search_private, **kwargs)
    385         fix_protocol=fix_protocol,
    386         search_public=search_public,
--> 387         search_private=search_private
    388     )
    389

/home/cdsw/.local/lib/python2.7/site-packages/tld/utils.pyc in process_url(url, fail_silently, fix_protocol, search_public, search_private)
    289             return None, None, parsed_url
    290         else:
--> 291             raise TldBadUrl(url=url)
    292
    293     domain_parts = domain_name.split('.')
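
The failure can be reproduced outside of pandas with a single malformed string. Here is a minimal sketch (assuming the same tld library as in the traceback) that triggers the exception directly:

from tld import get_fld
from tld.exceptions import TldBadUrl

try:
    # one of the malformed "requests" from the proxy logs above
    get_fld('http:1 CON')
except TldBadUrl as exc:
    # per the traceback above: "Is not a valid URL http:1 con!"
    print(exc)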


To overcome this, it was suggested to me to wrap the function in a try-except clause so that I can later find the rows that error out by querying for NaN:




import tld
from tld import get_fld

def try_get_fld(x):
    try:
        return get_fld(x)
    except tld.exceptions.TldBadUrl:
        return np.nan



This seems to work for some of the "requests", like "http:1 con" and "http:/login.cgi%00", but then fails for "http://urnt12.knhc..txt/", where I get another error message like the one above:



TldDomainNotFound: Domain urnt12.knhc..txt didn't match any existing TLD name!


This is what the dataframe looks like (a total of 240,000 "requests" in a dataframe called "request"):



   request                                      count
0  https://login.microsoftonline.com            24521
1  https://dt.adsafeprotected.com               11521
2  https://googleads.g.doubleclick.net          6252
3  https://fls-na.amazon.com                    65225
4  https://v10.vortex-win.data.microsoft.com    7852222
5  https://ib.adnxs.com                         12
6  http:1 CON                                   6
7  http:/login.cgi%00                           45822
8  http://urnt12.knhc..txt/                     1
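
For anyone who wants to reproduce the problem without the full 240,000-row CSV, a small hypothetical sample built from the rows shown above is enough:

import pandas as pd

# Hypothetical sample of the "request" dataframe shown above
request = pd.DataFrame({
    'request': ['https://login.microsoftonline.com',
                'http:1 CON',
                'http:/login.cgi%00',
                'http://urnt12.knhc..txt/'],
    'count': [24521, 6, 45822, 1],
})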




My code:



from tld import get_tld
from tld import get_fld
import pandas as pd
import numpy as np

# Read back into a dataframe
request = pd.read_csv('Proxy/Proxy_Analytics/Request_Grouped_By_Request_Count_12032018.csv')

# Remove rows where there were null values in the request column
request = request[pd.notnull(request['request'])]

# Find the urls that contain IP addresses and exclude them from the new dataframe
request = request[~request.request.str.findall(r'[0-9]+(?:\.[0-9]+){3}').astype(bool)]

# Reset index
request = request.reset_index(drop=True)

import tld
from tld import get_fld

def try_get_fld(x):
    try:
        return get_fld(x)
    except tld.exceptions.TldBadUrl:
        return np.nan

request['flds'] = request['request'].apply(try_get_fld)

#faulty_url_df = request[request['flds'].isna()]
#print(faulty_url_df)


Answer





It fails because it's a different exception. You expect a tld.exceptions.TldBadUrl exception but get a TldDomainNotFound.

You can either be less specific in your except clause and catch more exceptions with a single except clause, or add another except clause to catch the other type of exception:



try:
    return get_fld(x)
except tld.exceptions.TldBadUrl:
    return np.nan
except tld.exceptions.TldDomainNotFound:
    print("Domain not found!")
    return np.nan
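
The "one except clause" option mentioned above can be written by catching both exception types as a tuple. The get_fld signature in the traceback also shows a fail_silently parameter; if it behaves as its name suggests and returns None instead of raising, that offers a third option. A sketch of both, assuming the tld version from the question:

import numpy as np
import tld
from tld import get_fld

def try_get_fld(x):
    # Catch both known failure modes with a single except clause
    try:
        return get_fld(x)
    except (tld.exceptions.TldBadUrl, tld.exceptions.TldDomainNotFound):
        return np.nan

# Alternative: let tld swallow the error itself; bad URLs come back as None,
# which pandas treats as missing when you later call .isna()
request['flds'] = request['request'].apply(lambda x: get_fld(x, fail_silently=True))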
