How can I use existing PySpark SQL functions to find non-consuming regular expression patterns in a string column?
The following is reproducible, but does not give the desired results.
import pyspark
from pyspark.sql import (
    SparkSession,
    functions as F)
spark = (SparkSession.builder
         .master('yarn')
         .appName("regex")
         .getOrCreate()
         )
sc = spark.sparkContext
sc.version # u'2.2.0'
testdf = spark.createDataFrame([
(1, "Julie", "CEO"),
(2, "Janice", "CFO"),
(3, "Jake", "CTO")],
["ID", "Name", "Title"])
ptrn = '(?=Ja)(?=ke)'
testdf.withColumn('contns_ptrn', testdf.Name.rlike(ptrn)).show()
+---+------+-----+-----------+
| ID| Name|Title|contns_ptrn|
+---+------+-----+-----------+
| 1| Julie| CEO| false|
| 2|Janice| CFO| false|
| 3| Jake| CTO| false|
+---+------+-----+-----------+
testdf.withColumn('contns_ptrn', F.regexp_extract(F.col('Name'), ptrn, 1)).show()
+---+------+-----+-----------+
| ID| Name|Title|contns_ptrn|
+---+------+-----+-----------+
| 1| Julie| CEO| |
| 2|Janice| CFO| |
| 3| Jake| CTO| |
+---+------+-----+-----------+
testdf.withColumn('contns_ptrn', F.regexp_replace(F.col('Name'), ptrn, '')).show()
+---+------+-----+-----------+
| ID| Name|Title|contns_ptrn|
+---+------+-----+-----------+
| 1| Julie| CEO| Julie|
| 2|Janice| CFO| Janice|
| 3| Jake| CTO| Jake|
+---+------+-----+-----------+
The desired results would be:
+---+------+-----+-----------+
| ID| Name|Title|contns_ptrn|
+---+------+-----+-----------+
| 1| Julie| CEO| false|
| 2|Janice| CFO| false|
| 3| Jake| CTO| true|
+---+------+-----+-----------+
Only the third row should be true, because its Name, 'Jake', contains both 'Ja' and 'ke'.
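To make the intent concrete, the same truth table could be produced with two plain substring checks; this is only an illustration of the semantics I am after, not the lookahead-based approach I am asking about:
testdf.withColumn(
    'contns_ptrn',
    testdf.Name.contains('Ja') & testdf.Name.contains('ke')
).show()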
If regexp_extract or regexp_replace are able to extract or replace non-consuming regular expression patterns, then I could also use them together with length to get a Boolean column.
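For example, if regexp_replace did act on the non-consuming pattern, a length comparison along these lines would give the Boolean column (a sketch only; with the current ptrn it still returns false for every row, since that pattern never matches anything):
# insert a marker at each match position; a longer result means the pattern matched somewhere
testdf.withColumn(
    'contns_ptrn',
    F.length(F.regexp_replace(F.col('Name'), ptrn, '#'))
    > F.length(F.col('Name'))
).show()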