Regular Expression Tips
The Spark functions regexp_extract and regexp_replace transform data using regular expressions.
Patterns follow Java regular expression syntax (java.util.regex.Pattern).
Task Running Very Slowly
The stack trace of the slow task shows:
java.lang.Character.codePointAt(Character.java:4884)
java.util.regex.Pattern$CharProperty.match(Pattern.java:3789)
java.util.regex.Pattern$Curly.match1(Pattern.java:4307)
java.util.regex.Pattern$Curly.match(Pattern.java:4250)
java.util.regex.Pattern$GroupHead.match(Pattern.java:4672)
java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3812)
java.util.regex.Pattern$Curly.match0(Pattern.java:4286)
java.util.regex.Pattern$Curly.match(Pattern.java:4248)
java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3812)
java.util.regex.Pattern$Curly.match0(Pattern.java:4286)
java.util.regex.Pattern$Curly.match(Pattern.java:4248)
java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3812)
java.util.regex.Pattern$Curly.match0(Pattern.java:4286)
java.util.regex.Pattern$Curly.match(Pattern.java:4248)
java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3812)
java.util.regex.Pattern$Curly.match0(Pattern.java:4286)
java.util.regex.Pattern$Curly.match(Pattern.java:4248)
java.util.regex.Pattern$Start.match(Pattern.java:3475)
java.util.regex.Matcher.search(Matcher.java:1248)
java.util.regex.Matcher.find(Matcher.java:637)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.RegExpExtract_2$(Unknown Source)
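Certain values in the dataset cause regexp_extract with a particular regex pattern to run very slowly. The repeated Pattern$Curly frames in the stack trace are quantifier matches, which suggests excessive backtracking in the Java regex engine rather than a problem in Spark itself.
See https://stackoverflow.com/questions/5011672/java-regular-expression-running-very-slow.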
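A common cause of this kind of slowdown is a pattern whose nested or overlapping quantifiers can split the same text in many different ways, so a value that almost matches forces the engine to try a very large number of combinations. The sketch below illustrates the usual fix, rewriting the pattern so that there is only one way to match. It uses Python's re module only so that it runs without a Spark cluster (re is also a backtracking engine, though not identical to Java's), and both patterns and the sample values are hypothetical, not taken from the failing job.

```python
import re

# Hypothetical patterns: both accept one or more 'a' characters and nothing else.
AMBIGUOUS = re.compile(r"^(a+)+$")  # nested quantifiers: many ways to split a run of 'a'
LINEAR = re.compile(r"^a+$")        # same set of matches, only one way to match


def same_result(value: str) -> bool:
    """Return True if both patterns agree on whether value matches."""
    return bool(AMBIGUOUS.match(value)) == bool(LINEAR.match(value))


# Inputs are kept short on purpose: with the ambiguous pattern, each extra 'a'
# before the trailing 'b' roughly doubles the work done before giving up.
samples = ["a", "aaaa", "", "b", "a" * 12 + "b"]
results = [same_result(s) for s in samples]
```

Checking the rewritten pattern against the original on short inputs, as above, is a cheap way to confirm that the rewrite did not change which values match before running it on the full dataset.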
Match Special Character in PySpark
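In a regular (non-raw) Python string passed to spark.sql, you need 4 backslashes before each regex metacharacter you want to match literally. Python string escaping reduces them to 2, and Spark SQL string-literal parsing (with default settings) reduces those to 1, which is the single backslash the Java regex engine needs to escape the character.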
df = spark.sql("SELECT regexp_replace('{{template}}', '\\\\{\\\\{', '#')")
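The escaping layers can be traced step by step without a Spark session. The sketch below is only a simulation: the replace() call stands in for the SQL string-literal unescaping step and is not Spark's actual parser, and Python's re module stands in for the Java regex engine. The variable names are illustrative.

```python
import re

# What is typed in Python source (non-raw string): four backslashes per brace.
python_source_literal = '\\\\{\\\\{'

# Layer 1: Python string escaping. The SQL text that Spark receives contains
# two backslashes per brace. A raw string spells the same text more readably.
sql_text = python_source_literal
raw_equivalent = r'\\{\\{'

# Layer 2 (simulated): the SQL string-literal parser turns each escaped
# backslash pair into a single backslash. This replace() is an assumption
# standing in for that step.
regex_seen_by_engine = sql_text.replace('\\\\', '\\')

# Layer 3: the regex engine reads backslash-brace as a literal brace.
replaced = re.sub(regex_seen_by_engine, '#', '{{template}}')
```

Because the raw string r'\\{\\{' produces the same SQL text as the four-backslash form, writing the pattern as a raw Python string is one way to keep the query readable.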