Column ambiguity is quite common when you join two tables. It becomes an unnecessary hassle when you want to select all the columns from both tables while discarding the duplicates, and it is especially painful with wide tables, where typing out every column name is not practical.
There are a couple of programmatic solutions to the problem. Both do essentially the same thing, but achieve the result differently:
  • Execute the SQL as usual with ss.sql(query), then manually transform the Dataset by converting it to an RDD and dropping the duplicate column names, or
  • Execute the query much the way Spark does internally, drop the duplicate column names, and only then create the Dataset; this avoids building an intermediate Dataset before the duplicates are dropped.
Note: there is a significant trade-off between the two approaches. The first works in all scenarios; the second relies on executeCollectPublic, which pulls every result row onto the driver and can easily cause heap (OOM) errors on large results.
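
To make the problem concrete, here is a minimal sketch (the orders and customers tables and the customer_id column are hypothetical): a SELECT * over a join on a shared key produces two columns with the same name, and any later reference to that column is ambiguous.

// Hypothetical tables that both contain a customer_id column.
val joined = ss.sql(
  """SELECT o.*, c.*
    |FROM orders o
    |JOIN customers c ON o.customer_id = c.customer_id""".stripMargin)

joined.printSchema()            // customer_id appears twice in the schema
// joined.select("customer_id") // fails: Reference 'customer_id' is ambiguous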

Solution 2:

import scala.collection.mutable
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.{Dataset, Row, SparkSession}
import org.apache.spark.sql.types.{StructField, StructType}

def sqlDropDuplicateColumns(query: String, ss: SparkSession): Dataset[Row] = {
  val logicalPlan = ss.sessionState.sqlParser.parsePlan(sqlText = query)
  val qe = ss.sessionState.executePlan(logicalPlan)
  // Assert the plan is valid before executing it
  qe.assertAnalyzed()
  val ep = qe.executedPlan
  // Build a schema with the duplicate column names dropped
  val schema = schemaToSet(ep.schema)
  // Collect all rows to the driver and keep one value per deduplicated column
  val rows = ep.executeCollectPublic().map { r =>
    val lb = new ListBuffer[Any]
    schema.foreach(sf => lb += r.getAs[Any](sf.name))
    Row(lb: _*)
  }
  ss.createDataFrame(ss.sparkContext.parallelize(rows), schema)
}
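
A quick usage sketch (the join query is the same hypothetical one as above, and ss is an existing SparkSession). Because executeCollectPublic materializes every result row on the driver before the deduplicated DataFrame is rebuilt, keep this variant for result sets that comfortably fit in driver memory.

val deduped = sqlDropDuplicateColumns(
  "SELECT o.*, c.* FROM orders o JOIN customers c ON o.customer_id = c.customer_id", ss)
deduped.printSchema() // customer_id now appears only once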

Solution 1:

def sqlDropDuplicateColumnsRDD(query: String, ss: SparkSession): Dataset[Row] = {
  val df = ss.sql(query)
  // Build a schema with the duplicate column names dropped
  val schema = schemaToSet(df.schema)
  // Rebuild each row keeping one value per deduplicated column; stays distributed
  val rdd = df.rdd.map { r =>
    val lb = new ListBuffer[Any]
    schema.foreach(sf => lb += r.getAs[Any](sf.name))
    Row(lb: _*)
  }
  ss.createDataFrame(rdd, schema)
}

Helper method:

def schemaToSet(schema: StructType): StructType = {
  // Keep only the first StructField seen for each (case-insensitive) column name.
  // Note: a HashMap does not preserve the original column order.
  val schemaMap = new mutable.HashMap[String, StructField]()
  for (sf <- schema) {
    if (!schemaMap.contains(sf.name.toLowerCase)) {
      schemaMap.put(sf.name.toLowerCase, sf)
    }
  }
  StructType(schemaMap.values.toArray)
}
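
Usage is the same for the RDD-based variant, which keeps the data distributed instead of collecting it to the driver (again, the join query is the hypothetical one from above):

val dedupedRdd = sqlDropDuplicateColumnsRDD(
  "SELECT o.*, c.* FROM orders o JOIN customers c ON o.customer_id = c.customer_id", ss)
dedupedRdd.show(5)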