I am using the tld Python library to grab the first-level domain from proxy request logs with an apply function. When I run into a strange request that tld doesn't know how to handle, like 'http:1 CON' or 'http:/login.cgi%00', I get an error message like the following:
TldBadUrl: Is not a valid URL http:1 con!

TldBadUrlTraceback (most recent call last)
in engine
----> 1 new_fld_column = request_2['request'].apply(get_fld)

/usr/local/lib/python2.7/site-packages/pandas/core/series.pyc in apply(self, func, convert_dtype, args, **kwds)
   2353         else:
   2354             values = self.asobject
-> 2355             mapped = lib.map_infer(values, f, convert=convert_dtype)
   2356
   2357         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer (pandas/_libs/lib.c:66440)()

/home/cdsw/.local/lib/python2.7/site-packages/tld/utils.pyc in get_fld(url, fail_silently, fix_protocol, search_public, search_private, **kwargs)
    385         fix_protocol=fix_protocol,
    386         search_public=search_public,
--> 387         search_private=search_private
    388     )
    389

/home/cdsw/.local/lib/python2.7/site-packages/tld/utils.pyc in process_url(url, fail_silently, fix_protocol, search_public, search_private)
    289         return None, None, parsed_url
    290     else:
--> 291         raise TldBadUrl(url=url)
    292
    293 domain_parts = domain_name.split('.')
To overcome this, it was suggested that I wrap the function in a try-except clause so that I can find the rows that error out by filtering for NaN:

import numpy as np
import tld
from tld import get_fld

def try_get_fld(x):
    try:
        return get_fld(x)
    except tld.exceptions.TldBadUrl:
        return np.nan
This seems to work for some of the "requests", like "http:1 CON" and "http:/login.cgi%00", but it then fails for "http://urnt12.knhc..txt/", where I get another error message like the one above:

TldDomainNotFound: Domain urnt12.knhc..txt didn't match any existing TLD name!
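For what it's worth, a Python except clause can take a tuple of exception classes, which is one way both failure modes above could be handled in a single function. Below is a minimal self-contained sketch of that pattern; the exception classes and the fake_get_fld parser are stand-ins I made up to mimic tld's behaviour, not the real tld API:

```python
import numpy as np

# Stand-ins mimicking tld.exceptions (hypothetical, for illustration only;
# the real classes live in tld.exceptions).
class TldBadUrl(ValueError):
    pass

class TldDomainNotFound(ValueError):
    pass

def fake_get_fld(url):
    """Toy parser that raises the same two exception types as tld.get_fld."""
    if not url.startswith(('http://', 'https://')) or ' ' in url:
        raise TldBadUrl('Is not a valid URL %s!' % url)
    host = url.split('//', 1)[1].split('/', 1)[0]
    if '..' in host or '.' not in host:
        raise TldDomainNotFound("Domain %s didn't match any existing TLD name!" % host)
    return '.'.join(host.split('.')[-2:])

def try_get_fld(x):
    # A tuple after `except` catches either exception type.
    try:
        return fake_get_fld(x)
    except (TldBadUrl, TldDomainNotFound):
        return np.nan

print(try_get_fld('https://login.microsoftonline.com'))  # microsoftonline.com
print(try_get_fld('http:1 CON'))                         # nan
print(try_get_fld('http://urnt12.knhc..txt/'))           # nan
```

With the real library, the same tuple would presumably list tld.exceptions.TldBadUrl and tld.exceptions.TldDomainNotFound instead of the stand-ins.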
This is what the dataframe (called "request", 240,000 "requests" in total) looks like:

   request                                     request count
0  https://login.microsoftonline.com           24521
1  https://dt.adsafeprotected.com              11521
2  https://googleads.g.doubleclick.net         6252
3  https://fls-na.amazon.com                   65225
4  https://v10.vortex-win.data.microsoft.com   7852222
5  https://ib.adnxs.com                        12
6  http:1 CON                                  6
7  http:/login.cgi%00                          45822
8  http://urnt12.knhc..txt/                    1
My code:

import pandas as pd
import numpy as np
import tld
from tld import get_tld
from tld import get_fld

# Read the CSV back into a dataframe
request = pd.read_csv('Proxy/Proxy_Analytics/Request_Grouped_By_Request_Count_12032018.csv')

# Remove rows where there were null values in the request column
request = request[pd.notnull(request['request'])]

# Find the URLs that contain IP addresses and exclude them from the new dataframe
request = request[~request.request.str.findall(r'[0-9]+(?:\.[0-9]+){3}').astype(bool)]

# Reset the index
request = request.reset_index(drop=True)

def try_get_fld(x):
    try:
        return get_fld(x)
    except tld.exceptions.TldBadUrl:
        return np.nan

request['flds'] = request['request'].apply(try_get_fld)

#faulty_url_df = request[request['flds'].isna()]
#print(faulty_url_df)
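As a side note, the IP-exclusion step in the code above can be checked in isolation. Here is a small sketch with made-up rows (only pandas, no tld needed); the sample URLs are hypothetical:

```python
import pandas as pd

# Made-up sample mimicking the 'request' column.
df = pd.DataFrame({'request': ['https://10.0.0.1/admin',
                               'https://login.microsoftonline.com']})

# str.findall returns a list per row; an empty list is falsy, so astype(bool)
# flags only the rows whose URL contains a dotted-quad IP address.
ip_mask = df.request.str.findall(r'[0-9]+(?:\.[0-9]+){3}').astype(bool)
df = df[~ip_mask].reset_index(drop=True)

print(df.request.tolist())  # ['https://login.microsoftonline.com']
```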