I am using the tld Python library to grab the first-level domain from proxy request logs with an apply function. When I run into a strange request that tld doesn't know how to handle, like 'http:1 CON' or 'http:/login.cgi%00', I get an error message like the following:
TldBadUrl: Is not a valid URL http:1 con!

TldBadUrlTraceback (most recent call last)
in engine
----> 1 new_fld_column = request_2['request'].apply(get_fld)

/usr/local/lib/python2.7/site-packages/pandas/core/series.pyc in apply(self, func, convert_dtype, args, **kwds)
   2353         else:
   2354             values = self.asobject
-> 2355             mapped = lib.map_infer(values, f, convert=convert_dtype)
   2356
   2357         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer (pandas/_libs/lib.c:66440)()

/home/cdsw/.local/lib/python2.7/site-packages/tld/utils.pyc in get_fld(url, fail_silently, fix_protocol, search_public, search_private, **kwargs)
    385         fix_protocol=fix_protocol,
    386         search_public=search_public,
--> 387         search_private=search_private
    388     )
    389

/home/cdsw/.local/lib/python2.7/site-packages/tld/utils.pyc in process_url(url, fail_silently, fix_protocol, search_public, search_private)
    289         return None, None, parsed_url
    290     else:
--> 291         raise TldBadUrl(url=url)
    292
    293 domain_parts = domain_name.split('.')
To overcome this, it was suggested that I wrap the function in a try-except clause so that I can find the rows that error out by filtering for NaN:

import numpy as np
import tld
from tld import get_fld

def try_get_fld(x):
    try:
        return get_fld(x)
    except tld.exceptions.TldBadUrl:
        return np.nan
This seems to work for some of the "requests", like "http:1 CON" and "http:/login.cgi%00", but it then fails for "http://urnt12.knhc..txt/", where I get another error message like the one above:

TldDomainNotFound: Domain urnt12.knhc..txt didn't match any existing TLD name!
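For what it's worth, a Python except clause can take a tuple of exception classes, which is one way both failure modes above could be handled in a single function. Below is a minimal self-contained sketch of that pattern; the exception classes and the fake_get_fld parser are stand-ins I made up to mimic tld's behaviour, not the real tld API:

```python
import numpy as np

# Stand-ins mimicking tld.exceptions (hypothetical, for illustration only;
# the real classes live in tld.exceptions).
class TldBadUrl(ValueError):
    pass

class TldDomainNotFound(ValueError):
    pass

def fake_get_fld(url):
    """Toy parser that raises the same two exception types as tld.get_fld."""
    if not url.startswith(('http://', 'https://')) or ' ' in url:
        raise TldBadUrl('Is not a valid URL %s!' % url)
    host = url.split('//', 1)[1].split('/', 1)[0]
    if '..' in host or '.' not in host:
        raise TldDomainNotFound("Domain %s didn't match any existing TLD name!" % host)
    return '.'.join(host.split('.')[-2:])

def try_get_fld(x):
    # A tuple after `except` catches either exception type.
    try:
        return fake_get_fld(x)
    except (TldBadUrl, TldDomainNotFound):
        return np.nan

print(try_get_fld('https://login.microsoftonline.com'))  # microsoftonline.com
print(try_get_fld('http:1 CON'))                         # nan
print(try_get_fld('http://urnt12.knhc..txt/'))           # nan
```

With the real library, the same tuple would presumably list tld.exceptions.TldBadUrl and tld.exceptions.TldDomainNotFound instead of the stand-ins.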
This is what the dataframe (called "request", 240,000 "requests" in total) looks like:

   request                                     request count
0  https://login.microsoftonline.com           24521
1  https://dt.adsafeprotected.com              11521
2  https://googleads.g.doubleclick.net         6252
3  https://fls-na.amazon.com                   65225
4  https://v10.vortex-win.data.microsoft.com   7852222
5  https://ib.adnxs.com                        12
6  http:1 CON                                  6
7  http:/login.cgi%00                          45822
8  http://urnt12.knhc..txt/                    1
My code:

import pandas as pd
import numpy as np
import tld
from tld import get_tld
from tld import get_fld

# Read the CSV back into a dataframe
request = pd.read_csv('Proxy/Proxy_Analytics/Request_Grouped_By_Request_Count_12032018.csv')

# Remove rows where there were null values in the request column
request = request[pd.notnull(request['request'])]

# Find the URLs that contain IP addresses and exclude them from the new dataframe
request = request[~request.request.str.findall(r'[0-9]+(?:\.[0-9]+){3}').astype(bool)]

# Reset the index
request = request.reset_index(drop=True)

def try_get_fld(x):
    try:
        return get_fld(x)
    except tld.exceptions.TldBadUrl:
        return np.nan

request['flds'] = request['request'].apply(try_get_fld)

#faulty_url_df = request[request['flds'].isna()]
#print(faulty_url_df)
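As a side note, the IP-exclusion step in the code above can be checked in isolation. Here is a small sketch with made-up rows (only pandas, no tld needed); the sample URLs are hypothetical:

```python
import pandas as pd

# Made-up sample mimicking the 'request' column.
df = pd.DataFrame({'request': ['https://10.0.0.1/admin',
                               'https://login.microsoftonline.com']})

# str.findall returns a list per row; an empty list is falsy, so astype(bool)
# flags only the rows whose URL contains a dotted-quad IP address.
ip_mask = df.request.str.findall(r'[0-9]+(?:\.[0-9]+){3}').astype(bool)
df = df[~ip_mask].reset_index(drop=True)

print(df.request.tolist())  # ['https://login.microsoftonline.com']
```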