ShuffleExchangeExec loses track of executor registrations
If you encounter a Spark job failing with a lot of shuffle fetch failures but no corresponding loss of shuffle services, the likely cause is a fault-tolerance edge case: Spark jobs can fail if a single external shuffle service process reboots and fails to recover its list of registered executors (something which can happen when using YARN if NodeManager recovery is disabled) and the Spark job has a large number of executors per host.
Ultimately, the resolution was to set spark.files.fetchFailure.unRegisterOutputOnHost to true in the Spark session configuration.
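As a rough sketch (the application name below is hypothetical), the flag can be set when building the SparkSession, since it needs to be in place before the SparkContext starts:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: after a fetch failure, unregister all shuffle map outputs on the
// affected host so Spark recomputes them, instead of repeatedly trying to
// fetch from a shuffle service that has lost its executor registrations.
val spark = SparkSession.builder()
  .appName("fetch-failure-workaround")  // hypothetical application name
  .config("spark.files.fetchFailure.unRegisterOutputOnHost", "true")
  .getOrCreate()
```

The same setting can also be passed at submit time, e.g. with --conf spark.files.fetchFailure.unRegisterOutputOnHost=true on spark-submit.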