SchemaColumnConvertNotSupportedException with vectorized Parquet reads
It is a known limitation that the vectorized Parquet reader in Spark has the following caveats for timestamp-type data:
- a timezone is expected with the timestamp-type data
- timestamp data must not be stored as Int96
- timestamp precision must be microseconds or coarser; higher precision (i.e. nanoseconds) is not supported
If timestamp-type data violates any of the above requirements, jobs that run with the vectorized Parquet reader enabled may fail with SchemaColumnConvertNotSupportedException.
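As a hedged illustration, the PySpark sketch below shows one way to write timestamp data that satisfies these requirements. The paths, app name, and column names are hypothetical; spark.sql.parquet.outputTimestampType and spark.sql.session.timeZone are standard Spark settings that control the on-disk timestamp encoding and the session timezone.
```python
# Sketch (PySpark): writing Parquet timestamps in a form the vectorized
# reader can handle. Paths and column names are hypothetical examples.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("timestamp-parquet-example")
    # Write timestamps as microsecond-precision INT64 values instead of
    # the legacy INT96 encoding that the vectorized reader cannot convert.
    .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
    # Make the session timezone explicit so timestamps are not ambiguous.
    .config("spark.sql.session.timeZone", "UTC")
    .getOrCreate()
)

df = (
    spark.createDataFrame([("2024-01-01 12:34:56.123456",)], ["event_ts_str"])
    .withColumn("event_ts", F.to_timestamp("event_ts_str"))
)

# Spark timestamps are already microsecond precision; nanosecond data
# produced by other writers would need to be truncated before this point.
df.select("event_ts").write.mode("overwrite").parquet("/tmp/events_parquet")
```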
To overcome this, do one of the following:
- Consider using a dateint column instead of a timestamp type; it avoids the complexities of timezones and precision and is simpler for the vectorized reader to handle
- Correct the timestamp-type data to adhere to the requirements above
- Disable the vectorized Parquet reader for the job (i.e. set spark.sql.parquet.enableVectorizedReader to false; see the sketch after this list)
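For the last option, here is a minimal sketch of disabling the vectorized reader at the session level. The config key is the standard Spark property named above; setting it on the session builder is just one way to apply it (it can equally be passed via --conf to spark-submit), and the read path is a hypothetical example.
```python
# Sketch (PySpark): disabling the vectorized Parquet reader for a job.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("fallback-parquet-reader")
    # Fall back to the row-by-row Parquet reader, which tolerates INT96
    # and other timestamp encodings at the cost of scan performance.
    .config("spark.sql.parquet.enableVectorizedReader", "false")
    .getOrCreate()
)

# The setting can also be flipped on an existing session before the scan:
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

df = spark.read.parquet("/tmp/events_parquet")  # hypothetical path
df.show()
```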
There is a related StackOverflow post, "SchemaColumnConvertNotSupportedException: column:", and a Databricks KB entry on this error.