How can I use existing PySpark SQL functions to find non-consuming regular expression patterns in a string column?
The following is reproducible, but does not give the desired results.
import pyspark
from pyspark.sql import (
    SparkSession,
    functions as F)
spark = (SparkSession.builder
         .master('yarn')
         .appName("regex")
         .getOrCreate()
         )
sc = spark.sparkContext
sc.version # u'2.2.0'
testdf = spark.createDataFrame([
(1, "Julie", "CEO"),
(2, "Janice", "CFO"),
(3, "Jake", "CTO")],
["ID", "Name", "Title"])
ptrn = '(?=Ja)(?=ke)'
testdf.withColumn('contns_ptrn', testdf.Name.rlike(ptrn)).show()
+---+------+-----+-----------+
| ID| Name|Title|contns_ptrn|
+---+------+-----+-----------+
| 1| Julie| CEO| false|
| 2|Janice| CFO| false|
| 3| Jake| CTO| false|
+---+------+-----+-----------+
testdf.withColumn('contns_ptrn', F.regexp_extract(F.col('Name'), ptrn, 1)).show()
+---+------+-----+-----------+
| ID| Name|Title|contns_ptrn|
+---+------+-----+-----------+
| 1| Julie| CEO| |
| 2|Janice| CFO| |
| 3| Jake| CTO| |
+---+------+-----+-----------+
testdf.withColumn('contns_ptrn', F.regexp_replace(F.col('Name'), ptrn, '')).show()
+---+------+-----+-----------+
| ID| Name|Title|contns_ptrn|
+---+------+-----+-----------+
| 1| Julie| CEO| Julie|
| 2|Janice| CFO| Janice|
| 3| Jake| CTO| Jake|
+---+------+-----+-----------+
The desired results would be:
+---+------+-----+-----------+
| ID| Name|Title|contns_ptrn|
+---+------+-----+-----------+
| 1| Julie| CEO| false|
| 2|Janice| CFO| false|
| 3| Jake| CTO| true|
+---+------+-----+-----------+
Only the third row should be true, because its Name, 'Jake', contains both 'Ja' and 'ke'.
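To make the intent concrete, the same truth table could be produced with two plain substring checks; this is only an illustration of the semantics I am after, not the lookahead-based approach I am asking about:
testdf.withColumn(
    'contns_ptrn',
    testdf.Name.contains('Ja') & testdf.Name.contains('ke')
).show()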
If regexp_extract or regexp_replace are able to extract or replace non-consuming regular expression patterns, then I could also use them together with length to get a Boolean column.
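For example, if regexp_replace did act on the non-consuming pattern, a length comparison along these lines would give the Boolean column (a sketch only; with the current ptrn it still returns false for every row, since that pattern never matches anything):
# insert a marker at each match position; a longer result means the pattern matched somewhere
testdf.withColumn(
    'contns_ptrn',
    F.length(F.regexp_replace(F.col('Name'), ptrn, '#'))
    > F.length(F.col('Name'))
).show()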