spark-dotnet: how does user .NET code run Spark code in a Java VM?

Apache Spark is written in Scala; Scala compiles to Java bytecode and runs inside a Java virtual machine (JVM). The spark-dotnet driver runs .NET code and calls Spark functionality, so how does that work? There are two paths to running .NET code with Spark: the first is the general case, which I will describe here; the second is UDFs, which I will explain in a later post as it is slightly more involved.

Spark and dotnet in a single Docker container

I really like the new dotnet driver for Spark because I think it makes Spark more accessible to devs who might not know Python or Scala. If you want to build and run a dotnet application that uses the dotnet driver locally, you will need: the JRE, Spark, and the dotnet SDK. Now, the JRE has some Docker images and so does the dotnet SDK, but what if you want both in a single container?

Where do you do your database development? Hopefully not production

Here are three scenarios. If you work with SQL Server, either as a provider of database environments (a DBA) or a consumer of database environments for your application (a developer), then you will likely see yourself in one of these descriptions. If you don't, please 100% find some way to tell me (email, comment below, etc.). Prod FTW'ers – there is only one place where the developers can develop. Your DBA (if you have one) complains every so often about someone using “sa” on production, and whoever is using the “sa” account keeps leaving open transactions in SSMS, blocking all the other users.

As a DBA, how do I offload some of my work?

I don't have time for this: "I am a DBA, I am busy, too busy. Developers keep pushing changes to production without me reviewing the code, and now I am stuck again over the weekend, fixing performance issues while the developers are on the beach with a piña colada and a cigar." Sound familiar? Maybe the developers aren't on the beach drinking and smoking, but the sentiment is the same:

What is the key to automated database deployments?

In my blog post here https://the.agilesql.club/2019/06/what-steps-are-there-to-move-to-safe-automated-database-deployments/ I described the steps you need to go through to build up your confidence that you are capable of deploying databases using automation. I mean, after all, knowing that it is possible to automate your deployments and having confidence that they will succeed are two very different things. Even with the best tooling in the world, automated database deployments are still a struggle, and there is one key thing that you can do, no matter what tools you choose: make the deployments re-runnable.
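To make that concrete, here is a minimal sketch of a re-runnable change script (my own illustration rather than code from the post, using a hypothetical dbo.Customers table and LastOrderDate column) that can be executed any number of times without erroring:

-- Re-runnable: only add the column if it does not already exist,
-- so running the script a second time is harmless.
IF NOT EXISTS (
    SELECT 1
    FROM sys.columns
    WHERE object_id = OBJECT_ID('dbo.Customers')
      AND name = 'LastOrderDate'
)
BEGIN
    ALTER TABLE dbo.Customers ADD LastOrderDate datetime2 NULL;
END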

What steps are there to move to safe automated database deployments?

Database deployments are scary: you have all this data, and if you drop the wrong table, run the wrong delete statement or have an error in a stored procedure that forgets to write that critical piece of data, then you may never truly recover from it. You may well have backups, but what if your backup is corrupt? What if your stored procedure hasn't been writing the right data for a week?

How do you choose which columns to include in your SQL Server clustered index?

Database Engineering: Database Modelling: Do I need to have an incrementing identity int/bigint as my clustered index in a SQL Server database? When you want to produce a professional table design that will scale in the future and stop you from being called at 4 AM to fix a performance issue, will you regret the decision not to add an incrementing "id" column to that core table? When you look in forums, do you see people fiercely defending their opinion that you should/should not include an id column?
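For reference, the pattern being debated is a surrogate, ever-increasing clustering key; a minimal sketch (my own example with a hypothetical dbo.Orders table, not taken from the post) looks like this:

-- A surrogate, ever-increasing clustered primary key:
-- the design choice the post weighs up.
CREATE TABLE dbo.Orders
(
    OrderId    bigint IDENTITY(1, 1) NOT NULL,
    CustomerId int       NOT NULL,
    OrderDate  datetime2 NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderId)
);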

Why is my pivot query not returning any data? How to debug a PIVOT query

If you have a PIVOT query and it isn't returning the data you expect, what can you do to troubleshoot it? The thing to do is to break it down into its constituent parts. First, let's take a look at a query and see what we can do to help. If you have this query:

SELECT [1] AS Student1,
       [2] AS Student2,
       COALESCE([3], 0) AS Student3
FROM (SELECT student_id, amount FROM [a]) AS source
PIVOT (
    SUM(amount) FOR student_id IN ([1], [2], [3])
) AS pv

Step 1 - Do you have any actual data?
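The excerpt ends at step 1, but as a sketch of what that first check might look like (my own wording, not the post's), you would run the inner source query on its own and confirm it actually returns rows for the student_ids you are pivoting on:

-- Run the source query by itself: if this returns no rows for
-- student_ids 1, 2 or 3, the PIVOT has nothing to aggregate.
SELECT student_id, amount
FROM [a]
WHERE student_id IN (1, 2, 3);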

How to delete a lot of data from a SQL Server database

How to delete data efficiently. When we delete data from a table in SQL Server, we can often find that the delete is slower than even the original insert, especially if the insert was made using the bulk insert API. Often, when people find that they have a delete that is taking longer than it should and is blocking other users, they kill the delete, but the rollback takes just as long, if not longer, than the delete had taken up until that point.
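The usual way around this (a general pattern I am sketching here with a hypothetical dbo.Events table, rather than code quoted from the post) is to delete in small batches, so that each transaction, and any rollback, stays short:

-- Delete in small batches so each transaction commits quickly
-- and other users are not blocked for the length of the whole delete.
DECLARE @rows int = 1;

WHILE @rows > 0
BEGIN
    DELETE TOP (5000)
    FROM dbo.Events
    WHERE EventDate < '2018-01-01';

    SET @rows = @@ROWCOUNT;
END;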

How to load test a SQL database

How do you know whether a database code change is going to be great, okay or awful? If you want to test a new version of SQL Server, how do you know whether to leave the legacy cardinality estimator on or disable it? Will the latest cumulative update destroy the performance of your application? You don't have to wait until production to find out whether the change you are going to make works or not, and building a system to test the performance of your database is probably more straightforward than you realize.
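As one small example of this kind of test (my own sketch with a hypothetical dbo.Orders query, not code from the post), you can run the same statement under both cardinality estimators and compare the timings:

-- Show elapsed time and I/O for each statement.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- The query under the default (new) cardinality estimator.
SELECT CustomerId, COUNT(*) AS OrderCount
FROM dbo.Orders
GROUP BY CustomerId;

-- The same query forced onto the legacy cardinality estimator.
SELECT CustomerId, COUNT(*) AS OrderCount
FROM dbo.Orders
GROUP BY CustomerId
OPTION (USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;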