Friday 29 December 2017

pandas - Python still having issues with try-except clause

I am using the tld Python library to grab the first-level domain from the proxy request logs with an apply function. When I run into a strange request that tld doesn't know how to handle, like 'http:1 CON' or 'http:/login.cgi%00', I get an error message like the following:



TldBadUrl: Is not a valid URL http:1 con!

TldBadUrl                                 Traceback (most recent call last)
in engine
----> 1 new_fld_column = request_2['request'].apply(get_fld)

/usr/local/lib/python2.7/site-packages/pandas/core/series.pyc in apply(self, func, convert_dtype, args, **kwds)
   2353                 else:
   2354                     values = self.asobject
-> 2355                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   2356
   2357                 if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer (pandas/_libs/lib.c:66440)()

/home/cdsw/.local/lib/python2.7/site-packages/tld/utils.pyc in get_fld(url, fail_silently, fix_protocol, search_public, search_private, **kwargs)
    385         fix_protocol=fix_protocol,
    386         search_public=search_public,
--> 387         search_private=search_private
    388     )
    389

/home/cdsw/.local/lib/python2.7/site-packages/tld/utils.pyc in process_url(url, fail_silently, fix_protocol, search_public, search_private)
    289             return None, None, parsed_url
    290         else:
--> 291             raise TldBadUrl(url=url)
    292
    293     domain_parts = domain_name.split('.')
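
The same failure can be reproduced outside of apply by calling get_fld directly on one of the offending values (a minimal sketch based on the traceback above; the exact message may vary with the installed tld version):

from tld import get_fld
from tld.exceptions import TldBadUrl

try:
    # One of the malformed request strings from the proxy logs
    get_fld('http:1 CON')
except TldBadUrl as exc:
    print(exc)  # Is not a valid URL http:1 con!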


To overcome this, it was suggested that I wrap the function in a try-except clause so that the rows that error out return NaN and can be queried afterwards:



import tld
from tld import get_fld

def try_get_fld(x):
    try:
        return get_fld(x)
    except tld.exceptions.TldBadUrl:
        return np.nan
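
Applied to the column, every row that raises TldBadUrl then carries NaN in the new column and can be selected afterwards (a short sketch reusing the dataframe and column names from the question):

import numpy as np

request['flds'] = request['request'].apply(try_get_fld)

# Rows whose URL could not be parsed now show up as NaN in 'flds'
faulty_url_df = request[request['flds'].isna()]
print(faulty_url_df)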


This seems to work for some of the "requests", like "http:1 con" and "http:/login.cgi%00", but then fails for "http://urnt12.knhc..txt/", where I get another error message like the one above:




TldDomainNotFound: Domain urnt12.knhc..txt didn't match any existing TLD name!


This is what the dataframe looks like (a total of 240,000 "requests" in a dataframe called "request"):



                                      request  request count
0           https://login.microsoftonline.com          24521
1              https://dt.adsafeprotected.com          11521
2         https://googleads.g.doubleclick.net           6252
3                   https://fls-na.amazon.com          65225
4  https://v10.vortex-win.data.microsoft.com         7852222
5                        https://ib.adnxs.com             12
6                                  http:1 CON              6
7                         http:/login.cgi%00          45822
8                    http://urnt12.knhc..txt/              1



My code:




from tld import get_tld
from tld import get_fld
import pandas as pd
import numpy as np

# Read back into a dataframe
request = pd.read_csv('Proxy/Proxy_Analytics/Request_Grouped_By_Request_Count_12032018.csv')

# Remove rows where there were null values in the request column
request = request[pd.notnull(request['request'])]

# Find the urls that contain IP addresses and exclude them from the new dataframe
request = request[~request.request.str.findall(r'[0-9]+(?:\.[0-9]+){3}').astype(bool)]

# Reset index
request = request.reset_index(drop=True)

import tld
from tld import get_fld

def try_get_fld(x):
    try:
        return get_fld(x)
    except tld.exceptions.TldBadUrl:
        return np.nan

request['flds'] = request['request'].apply(try_get_fld)

#faulty_url_df = request[request['flds'].isna()]
#print(faulty_url_df)


Answer




It fails because it's a different exception. You expect a tld.exceptions.TldBadUrl exception but get a TldDomainNotFound.




You can either be less specific in your except clause and catch more exceptions with one except clause, or add another except clause to catch the other type of exception:



try:
    return get_fld(x)
except tld.exceptions.TldBadUrl:
    return np.nan
except tld.exceptions.TldDomainNotFound:
    print("Domain not found!")
    return np.nan
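
For the first option (one, less specific except clause), both exception types can also be listed in a single tuple; a sketch of the same wrapper, assuming numpy is imported as np as in the question:

import numpy as np
import tld
from tld import get_fld

def try_get_fld(x):
    try:
        return get_fld(x)
    # One except clause covering both failure modes seen above
    except (tld.exceptions.TldBadUrl, tld.exceptions.TldDomainNotFound):
        return np.nan

The traceback also shows that get_fld takes a fail_silently argument; depending on the tld version, get_fld(x, fail_silently=True) may simply return None instead of raising, which is worth checking against the installed version.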


