"Testing, testing, testing, to get started with automated ETL (ELT) testing have a look here: https://the.agilesql.club/etl-testing/

spark-dotnet examples - reading and writing csv files

How do you read and write CSV files using the dotnet driver for Apache Spark?

I have a runnable example here:

https://github.com/GoEddie/dotnet-spark-examples

Specifically:

https://github.com/GoEddie/dotnet-spark-examples/tree/master/examples/split-csv

Let's take a walkthrough of the demo:

            Console.WriteLine("Hello Spark!");

            var spark = SparkSession
                .Builder()
                .GetOrCreate();

We start with the obligatory “Hello Spark!”, then we create a new SparkSession.
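The builder can also take an application name and Spark settings before GetOrCreate(). A small sketch, where the app name and config value are purely illustrative:

            var spark = SparkSession
                .Builder()
                .AppName("split-csv-example")                 // hypothetical app name
                .Config("spark.sql.shuffle.partitions", "4")  // example setting only
                .GetOrCreate();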


            //Read a single CSV file
            var source = spark
                            .Read()
                            .Option("header", true)
                            .Option("inferShchema", true)
                            .Option("ignoreLeadingWhiteSpace", true)
                            .Option("ignoreTrailingWhiteSpace", true)
                            .Csv("./source.csv");

Here we read a CSV file. The first line of the file contains the column names, so we set the option “header” to true. We also want Spark to take a guess at the schema, so we set “inferSchema” to true. Because the CSV has been written with extra whitespace, like “col1, col2”, if we didn't ignore leading whitespace our second column would be called “ col2”, which isn't ideal.
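For context, source.csv might look something like this (entirely hypothetical rows that match the name/age/city schema used later), with the stray spaces after each comma that the two whitespace options clean up:

            name, age, city
            Ed, 40, London
            Sarah, 35, Manchester
            Tom, 28, London

A quick source.Show() after the read will print the first rows to the console so you can check the columns came through as expected.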


            //Write that CSV into many different CSV files, partitioned by city
            source.Write()
                    .Mode(SaveMode.Overwrite)
                    .Option("header", true)
                    .PartitionBy("city")
                    .Csv("./output-partitioned-city-csv");

We now take that single CSV file we read in and write it back out, but instead of writing to a single file we split the data into multiple CSVs partitioned on “city” (imagine trying to do that in SSIS!).
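On disk, Spark writes one sub-folder per distinct value of the partition column, each holding one or more part files. With hypothetical cities of London and Manchester the output folder would look roughly like this:

            output-partitioned-city-csv/
                city=London/
                    part-00000-....csv
                city=Manchester/
                    part-00000-....csv

Note that the “city” column is encoded in the folder names rather than repeated inside each file.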

Now that we have a set of CSV files, we can read them back in as one dataset; we can point at the folder they are in rather than list each individual CSV file:

            var partitions = spark
                            .Read()
                            .Option("header", true)
                            .Option("ignoreLeadingWhiteSpace", true)
                            .Option("ignoreTrailingWhiteSpace", true)
                            .Schema("name string, age int, city string") //instead of inferSchema we specify the schema
                            .Csv("./output-partitioned-city-csv");

partitions is now a DataFrame that contains all of the CSVs we created earlier, neat hey? Notice also that we can specify the schema really easily using .Schema("name string, age int, city string").
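If you would rather build the schema as objects than as a DDL string, the reader also accepts a StructType. A minimal sketch, assuming the same three columns:

            using Microsoft.Spark.Sql.Types;

            // The same schema expressed as a StructType instead of a DDL string
            var schema = new StructType(new[]
            {
                new StructField("name", new StringType()),
                new StructField("age", new IntegerType()),
                new StructField("city", new StringType())
            });

            var typedPartitions = spark
                            .Read()
                            .Option("header", true)
                            .Schema(schema)
                            .Csv("./output-partitioned-city-csv");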

            partitions.Write()
                    .Mode(SaveMode.Overwrite)
                    .Option("header", true)
                    .Csv("./output-single-partition-again");

Write the DataFrame, containing all of the CSVs, back out as a single CSV.
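One caveat: Spark writes one part file per partition of the DataFrame, so on larger datasets that folder can contain several files. If you genuinely need a single file, coalescing to one partition before the write does the trick; a minimal sketch:

            // Collapse to a single partition so exactly one part file is written
            partitions
                .Coalesce(1)
                .Write()
                .Mode(SaveMode.Overwrite)
                .Option("header", true)
                .Csv("./output-single-partition-again");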

Here we see that a single CSV, “source.csv”, was read; multiple CSVs partitioned by “city” were written and read back; and finally a single CSV was written out again:

[Figure: spark-dotnet csv reading and writing]