Wednesday, 1 January 2020

regex - Use non-consuming regular expression in pySpark sql functions

How can I use the existing PySpark SQL functions to match non-consuming (lookahead) regular expression patterns in a string column?



The following is reproducible, but does not give the desired results.



import pyspark
from pyspark.sql import (
    SparkSession,
    functions as F)


spark = (SparkSession.builder
         .master('yarn')
         .appName("regex")
         .getOrCreate()
         )

sc = spark.sparkContext
sc.version # u'2.2.0'

testdf = spark.createDataFrame([
    (1, "Julie", "CEO"),
    (2, "Janice", "CFO"),
    (3, "Jake", "CTO")],
    ["ID", "Name", "Title"])


ptrn = '(?=Ja)(?=ke)'


testdf.withColumn('contns_ptrn', testdf.Name.rlike(ptrn)).show()



+---+------+-----+-----------+
| ID|  Name|Title|contns_ptrn|
+---+------+-----+-----------+
|  1| Julie|  CEO|      false|
|  2|Janice|  CFO|      false|
|  3|  Jake|  CTO|      false|
+---+------+-----+-----------+



testdf.withColumn('contns_ptrn', F.regexp_extract(F.col('Name'), ptrn, 1)).show()


+---+------+-----+-----------+
| ID|  Name|Title|contns_ptrn|
+---+------+-----+-----------+
|  1| Julie|  CEO|           |
|  2|Janice|  CFO|           |
|  3|  Jake|  CTO|           |
+---+------+-----+-----------+


testdf.withColumn('contns_ptrn', F.regexp_replace(F.col('Name'), ptrn, '')).show()


+---+------+-----+-----------+
| ID|  Name|Title|contns_ptrn|
+---+------+-----+-----------+
|  1| Julie|  CEO|      Julie|
|  2|Janice|  CFO|     Janice|
|  3|  Jake|  CTO|       Jake|
+---+------+-----+-----------+


The desired results would be:



+---+------+-----+-----------+
| ID|  Name|Title|contns_ptrn|
+---+------+-----+-----------+
|  1| Julie|  CEO|      false|
|  2|Janice|  CFO|      false|
|  3|  Jake|  CTO|       true|
+---+------+-----+-----------+


Only the third row should be true, because its Name value contains both 'Ja' and 'ke'.
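One possible reason the pattern never matches: two adjacent lookaheads are both tested at the same position, so '(?=Ja)(?=ke)' asks for text that starts with 'Ja' and 'ke' at once. The sketch below illustrates this with Python's re module rather than Spark. It assumes that Spark's Java regex engine treats these lookaheads the same way, and the '.*' variant is a hypothetical fix, not something taken from the question.

```python
import re

names = ["Julie", "Janice", "Jake"]

# Pattern from the question: both lookaheads are checked at the same
# position, so the text there would have to start with 'Ja' AND 'ke'.
original = re.compile(r"(?=Ja)(?=ke)")

# Hypothetical fix (an assumption, not from the question): '.*' lets each
# lookahead scan ahead independently for its substring.
fixed = re.compile(r"(?=.*Ja)(?=.*ke)")

# search() looks for a match anywhere in the string.
original_hits = [bool(original.search(n)) for n in names]
fixed_hits = [bool(fixed.search(n)) for n in names]

print(original_hits)
print(fixed_hits)
```

If the assumption holds, passing the '.*' pattern to testdf.Name.rlike would presumably give the desired Boolean column, though that has not been run against Spark 2.2.0 here.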



If regexp_extract or regexp_replace can extract or replace non-consuming regular expression patterns, I could also combine them with length to get a Boolean column.
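A caveat for the length idea: a pattern made only of lookaheads matches zero characters, so the extracted text is empty even when the match succeeds. The sketch below shows this with Python's re module, again assuming that Spark's Java regex engine behaves the same way for these constructs. Both patterns and the extract helper are hypothetical names introduced for illustration.

```python
import re

names = ["Julie", "Janice", "Jake"]

# Lookahead-only pattern (an assumption): matches, but consumes nothing.
lookahead_only = re.compile(r"(?=.*Ja)(?=.*ke)")

# Same conditions followed by a consuming '.*', so the match has length.
with_consumer = re.compile(r"^(?=.*Ja)(?=.*ke).*")


def extract(pattern, text):
    """Return the whole match (group 0), or '' when nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else ""


# Zero-width matches extract '' for every row, so length cannot tell
# a successful match apart from a failed one.
zero_width = [len(extract(lookahead_only, n)) for n in names]

# With a consuming tail, a positive length marks the matching row.
consumed = [len(extract(with_consumer, n)) > 0 for n in names]

print(zero_width)
print(consumed)
```

In PySpark this would presumably correspond to F.length(F.regexp_extract(F.col('Name'), pattern, 0)) > 0, using group index 0 because the pattern defines no capture group. That expression is untested here.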
