Synapse Analytics and .NET for Apache Spark Example 4 - JOINS
This is a bit of a longer one, a look at how to do all the different joins and the exciting thing for MSSQL developers is that we get a couple of extra joins (semi and anti semi oooooooh).
T-SQL
SELECT * FROM
chicago.safety_data one
INNER JOIN chicago.safety_data two ON one.Address = two.Address;
Spark SQL
SELECT * FROM
chicago.safety_data one
INNER JOIN chicago.safety_data two ON one.Address = two.Address
DataFrame API (C#)
var dataFrame = spark.Read().Table("chicago.safety_data");
dataFrame.Alias("one")
.Join(
dataFrame.Alias("two")
, Functions.Col("one.Address") == Functions.Col("two.Address")
, "inner").Show();
T-SQL LEFT OUTER JOIN
Spark SQL
ELECT * FROM
chicago.safety_data one
LEFT OUTER JOIN chicago.safety_data two ON one.Address = two.Address
DataFrame API (C#)
var dataFrame = spark.Read().Table("chicago.safety_data");;
dataFrame.Alias("one")
.Join(
dataFrame.Alias("two")
, Functions.Col("one.Address") == Functions.Col("two.Address")
, "left_outer").Show(); //alternative "left" or "leftouter"
T-SQL RIGHT OUTER JOIN
SELECT * FROM
chicago.safety_data one
RIGHT OUTER JOIN chicago.safety_data two ON one.Address = two.Address;
Spark SQL
SELECT * FROM
chicago.safety_data one
RIGHT OUTER JOIN chicago.safety_data two ON one.Address = two.Address
DataFrame API (C#)
var dataFrame = spark.Read().Table("chicago.safety_data");;
dataFrame.Alias("one")
.Join(
dataFrame.Alias("two")
, Functions.Col("one.Address") == Functions.Col("two.Address")
, "right_outer").Show(); //alternative "right" or "rightouter"
T-SQL FULL OUTER JOIN
SELECT * FROM
chicago.safety_data one
FULL OUTER JOIN chicago.safety_data two ON one.Address = two.Address;
Spark SQL
SELECT * FROM
chicago.safety_data one
FULL OUTER JOIN chicago.safety_data two ON one.Address = two.Address
DataFrame API (C#)
var dataFrame = spark.Read().Table("chicago.safety_data");
dataFrame.Alias("one")
.Join(
dataFrame.Alias("two")
, Functions.Col("one.Address") == Functions.Col("two.Address")
, "full_outer").Show(); //alternative "outer" or "fullouter"
T-SQL CROSS JOIN
SELECT * FROM
chicago.safety_data one
CROSS JOIN chicago.safety_data two;
Spark SQL
SELECT * FROM
chicago.safety_data one
CROSS JOIN chicago.safety_data
DataFrame API (C#)
Note that we must use CrossJoin
rather than Join
when we don’t pass in a list of columns to join on.
var dataFrame = spark.Read().Table("chicago.safety_data");
dataFrame.Alias("one")
.CrossJoin(
dataFrame.Alias("two")
);
Join types not available in T-SQL
Left-Semi Join
This is an inner join but only returns the columns from the left hand side of the join.
Left-Anti Join
This is the opposite of the Left-Semi join, it returns the columns on the right hand side of the join.
SELECT one.* FROM
chicago.safety_data one
INNER JOIN chicago.safety_data two ON one.Address = two.Address;
Spark SQL
SELECT * FROM
chicago.safety_data one
LEFT SEMI JOIN chicago.safety_data two ON one.Address = two.Address
DataFrame API (C#)
var dataFrame = spark.Read().Table("chicago.safety_data");
dataFrame.Alias("one")
.Join(
dataFrame.Alias("two")
, Functions.Col("one.Address") == Functions.Col("two.Address")
, "left_semi").Show(); //alternative "semi" or "leftsemi"
T-SQL Left-Anti Join
SELECT two.* FROM
chicago.safety_data one
INNER JOIN chicago.safety_data two ON one.Address = two.Address;
Spark SQL
SELECT * FROM
chicago.safety_data one
LEFT ANTI JOIN chicago.safety_data two ON one.Address = two.Address
DataFrame API (C#)
var dataFrame = spark.Read().Table("chicago.safety_data");
dataFrame.Alias("one")
.Join(
dataFrame.Alias("two")
, Functions.Col("one.Address") == Functions.Col("two.Address")
, "left_anti").Show(); //alternative "anti" or "leftanti"
There you have it, how to do joins in .NET for Apache Spark and Spark SQL on Synapse Analytics.