Synapse Analytics and .NET for Apache Spark Example 4 - JOINS

This is a bit of a longer one, a look at how to do all the different joins and the exciting thing for MSSQL developers is that we get a couple of extra joins (semi and anti semi oooooooh).

T-SQL

SELECT * FROM 
    chicago.safety_data one 
INNER JOIN chicago.safety_data two ON one.Address = two.Address;

Spark SQL

SELECT * FROM 
    chicago.safety_data one 
INNER JOIN chicago.safety_data two ON one.Address = two.Address

DataFrame API (C#)

var dataFrame = spark.Read().Table("chicago.safety_data");
dataFrame.Alias("one")
            .Join(
                    dataFrame.Alias("two")
                    , Functions.Col("one.Address") == Functions.Col("two.Address")
                    , "inner").Show();

T-SQL LEFT OUTER JOIN

Spark SQL

ELECT * FROM 
    chicago.safety_data one 
LEFT OUTER JOIN chicago.safety_data two ON one.Address = two.Address

DataFrame API (C#)

var dataFrame = spark.Read().Table("chicago.safety_data");;
dataFrame.Alias("one")
            .Join(
                    dataFrame.Alias("two")
                    , Functions.Col("one.Address") == Functions.Col("two.Address")
                    , "left_outer").Show();  //alternative "left" or "leftouter"

T-SQL RIGHT OUTER JOIN

SELECT * FROM 
    chicago.safety_data one 
RIGHT OUTER JOIN chicago.safety_data two ON one.Address = two.Address;

Spark SQL

SELECT * FROM 
    chicago.safety_data one 
RIGHT OUTER JOIN chicago.safety_data two ON one.Address = two.Address

DataFrame API (C#)

var dataFrame = spark.Read().Table("chicago.safety_data");;
dataFrame.Alias("one")
            .Join(
                    dataFrame.Alias("two")
                    , Functions.Col("one.Address") == Functions.Col("two.Address")
                    , "right_outer").Show();   //alternative "right" or "rightouter"

T-SQL FULL OUTER JOIN

SELECT * FROM 
    chicago.safety_data one 
FULL OUTER JOIN chicago.safety_data two ON one.Address = two.Address;

Spark SQL


SELECT * FROM 
    chicago.safety_data one 
FULL OUTER JOIN chicago.safety_data two ON one.Address = two.Address

DataFrame API (C#)

var dataFrame = spark.Read().Table("chicago.safety_data");
dataFrame.Alias("one")
            .Join(
                    dataFrame.Alias("two")
                    , Functions.Col("one.Address") == Functions.Col("two.Address")
                    , "full_outer").Show(); //alternative "outer" or "fullouter"

T-SQL CROSS JOIN

SELECT * FROM 
    chicago.safety_data one 
CROSS JOIN chicago.safety_data two;

Spark SQL

SELECT * FROM 
    chicago.safety_data one 
CROSS JOIN chicago.safety_data

DataFrame API (C#)

Note that we must use CrossJoin rather than Join when we don't pass in a list of columns to join on.

var dataFrame = spark.Read().Table("chicago.safety_data");
dataFrame.Alias("one")
            .CrossJoin(
                    dataFrame.Alias("two")
                    );

Join types not available in T-SQL

Left-Semi Join

This is an inner join but only returns the columns from the left hand side of the join.

Left-Anti Join

This is the opposite of the Left-Semi join, it returns the columns on the right hand side of the join.

SELECT one.* FROM 
    chicago.safety_data one 
INNER JOIN chicago.safety_data two ON one.Address = two.Address;

Spark SQL

SELECT * FROM 
    chicago.safety_data one 
LEFT SEMI JOIN chicago.safety_data two ON one.Address = two.Address

DataFrame API (C#)

var dataFrame = spark.Read().Table("chicago.safety_data");
dataFrame.Alias("one")
            .Join(
                    dataFrame.Alias("two")
                    , Functions.Col("one.Address") == Functions.Col("two.Address")
                    , "left_semi").Show(); //alternative "semi" or "leftsemi"

T-SQL Left-Anti Join

SELECT two.* FROM 
    chicago.safety_data one 
INNER JOIN chicago.safety_data two ON one.Address = two.Address;

Spark SQL

SELECT * FROM 
    chicago.safety_data one 
LEFT ANTI JOIN chicago.safety_data two ON one.Address = two.Address

DataFrame API (C#)

var dataFrame = spark.Read().Table("chicago.safety_data");
dataFrame.Alias("one")
            .Join(
                    dataFrame.Alias("two")
                    , Functions.Col("one.Address") == Functions.Col("two.Address")
                    , "left_anti").Show(); //alternative "anti" or "leftanti"

There you have it, how to do joins in .NET for Apache Spark and Spark SQL on Synapse Analytics.