Crime in the USA Report

I started my career as a MOLAP cube builder, which left room for creativity in improving the model in the data warehouse. It was always a great feeling to create a tool that gave users the ability to create reports on their own.

I wanted to play the same game with some datasets about crime in the USA from this webpage maintained by the FBI:

About the Crime Data Explorer

The Crime Data Explorer (CDE) is the FBI’s Uniform Crime Reporting (UCR) Program’s dynamic solution to presenting crime data in a more immediate venue that reflects the constant change in the nation’s crime circumstance.

The CDE pages provide a view of estimated national and state data, reported agency-level crime statistics, and graphs of specific variables from the National Incident-Based Reporting System (NIBRS).


SQLSaturday Budapest

I attended the event with some friends on 2019-04-20.
Details below.

Building a modern data warehouse and BI solution in Microsoft cloud by Gergely Csom

Gergely gave an overview of Microsoft's data-warehouse-related tools and services, including SSIS, Azure Data Factory Dataflow and Power BI Dataflow, Azure Analysis Services, and many more.

Top 10 SSAS Design Best Practices (or maybe even more) by Zoltán Horváth

It was a very nice trip down memory lane to hear a talk about OLAP cubes. I spent a great amount of my professional career building MOLAP cubes, and this talk made me want to build one again just for fun.

How to win Kaggle competition and get familiar with machine learning? by Marcin Szeliga

This was the best session for me, and for Balázs as well. Marcin talked about machine learning using the Titanic dataset as an example. Balázs and I have also spent a lot of time with that dataset and the Kaggle competition. It was great to hear about hyperparameters and grid search, because we had only discovered them about a week ago and tested them at the last workshop of the data science study group.

SQL Server In-database machine learning with R and Python by Tomaž Kaštrun

Mr. Kaštrun’s talk was also about data science and machine learning. He gave the audience a deeper understanding of running R and Python code in SQL Server and its benefits and limitations.

SQL Server – A Developers vs Admins story by Liviu Leran

I can tell you that this session was also very informative but I don’t claim that I understood every topic of the session.

Balázs, what do you think about the event? I saw that you mastered the ‘How to win’ part of the machine learning session.

How to win Kaggle competition and get familiar with machine learning, part 2.

Hi there, Balazs here.
I really liked Marcin's part, because I have indeed spent a lot of time working on the Titanic dataset. We got some new tips to improve our accuracy, like using covariances and testing on the whole dataset.
It was great to see that what he was doing with this competition is almost exactly what we've done with it in one of the workshops.

The biggest surprise for me was that I won a mini drone from one of the sponsors, Quest. I’ve never won anything, and when I first heard my name as one of the lucky ones whose name got drawn, I didn’t believe it.
It’s super fun, and pretty hard to control, but I’m on it.

Guess who would survive the Titanic with the Key Influencers visual!

Hi, I am Bertalan Ronai from Hungary.

The following tutorial describes my Power BI report, which uses the new Key Influencers visual to enter a machine learning competition on Kaggle using only Power BI.

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew.

You can read more about the competition and download the CSV files from here:
We are going to use the test.csv, train.csv and gender_submission.csv files.

You train your machine learning model on one dataset and test it on a separate test dataset.
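To make this idea concrete, here is a minimal Python sketch of the train/test split, using scikit-learn with a synthetic dataset as a stand-in for the Titanic data (an assumption for illustration; the report itself works on train.csv and test.csv):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features and labels.
X, y = make_classification(n_samples=500, random_state=0)

# Hold out 20% of the rows as a test set the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

The score on the held-out rows is a fairer estimate of how the model will do on Kaggle's unseen test set than the score on the training rows.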

If you haven’t already, then please watch this video from @PATRICKDBA and @JustynaLucznik !

The Key Influencers visual can find segments in your data and identify key influencers.

Short Detour about importing CSV files

When you import from a CSV file, you might see a hardcoded Columns argument in the Query Editor, like this one:
= Csv.Document(File.Contents("C:\Anaconda\datacamp_facebook_live_titanic\data\train.csv"),[Delimiter=",", Columns=12, Encoding=1250, QuoteStyle=QuoteStyle.None])

What happens when you replace the original CSV file with a new CSV that you have exported from a source, and the new one has 13 columns? You are right, it won't import the new column! So you should manually remove the Columns=12 argument from the query above to make it independent of the number of columns:
= Csv.Document(File.Contents("C:\Anaconda\datacamp_facebook_live_titanic\data\train.csv"),[Delimiter=",", Encoding=1250, QuoteStyle=QuoteStyle.None])

Based on the segments the visual found, I created a calculated column that evaluates to 1 or 0 (true or false).

Check what happens if you don't put parentheses around the logical tests, such as HasCabin = true or [Age] <> null!
By the way, not ([Age] is null) is nicer code! Thanks for the info, Alex!
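For comparison, here is how such a segment-based guess could be sketched in Python. The rule itself is hypothetical (women, and passengers with a cabin, are guessed to survive); your own Key Influencers segments may differ:

```python
def survived_guess(passenger):
    """Return a 1/0 survival guess from a hypothetical segment-based rule:
    women, and passengers who have a cabin, are guessed to survive."""
    has_cabin = passenger.get("Cabin") is not None
    return 1 if passenger["Sex"] == "female" or has_cabin else 0

print(survived_guess({"Sex": "female", "Cabin": None}))  # -> 1
print(survived_guess({"Sex": "male", "Cabin": "C85"}))   # -> 1
print(survived_guess({"Sex": "male", "Cabin": None}))    # -> 0
```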

To create the Test query, I copied the Titanic Train query, which has the Survived column, to create a TitanicTest query/table that does not have the original Survived column, only the new SurvivedGuess calculated column. I can then send Kaggle a submission CSV containing two columns: PassengerId and Survived (containing my guesses).

I have used Imke Feldmann's Power Query script that uses Python to write a CSV.

By the way, I contributed a tiny bit to that script by adding a decimal separator argument, check it out! Read more about this great function here:

Here is how I called this function:

Sorry, I won't go into the details of how to use Python in Power BI, beyond saying that you should install Python regardless of whether you have Anaconda installed.

We can read the gender_submission CSV to see what a submission should look like! It has only 418 rows besides the header. You could also create a query, which I called CsvTestBeforeSubmit, that reads your CSV before submitting.
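If you ever want to build the submission file outside Power BI, a minimal Python sketch of writing the two-column CSV could look like this (the guesses and file name are placeholders; a real submission needs all 418 test passengers):

```python
import csv

# Hypothetical guesses: (PassengerId, Survived) pairs.
# Kaggle's test set starts at PassengerId 892; a real file has 418 data rows.
guesses = [(892, 0), (893, 1), (894, 0)]

with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["PassengerId", "Survived"])
    writer.writerows(guesses)
```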

Alright! The only thing left now is to enter the Kaggle competition and reach a high score.

Try to reach a higher score than mine! Did you know that if you predict that only the women survive, your score is 76.55%? Try to reach a higher number by predicting the clusters of women and men who would survive!

You can download my report from my Github repository!

Check out this amazing walkthrough of the Influencer visual by Leila Etaati!

I have asked my penpal, Alex Powers, to create a walkthrough video in which he builds upon this idea to make it even better. Check it out via the link below! And follow him on LinkedIn and Twitter!

Titanic: Machine Learning

Berci asked me to upload my version of Kaggle's Titanic competition. Together at our workshop we achieved around 78%, which was a good starting point.

Speaking of the workshop: in January 2019 a data science group called Data Revolution formed on Facebook:
Feel free to join.

Solving this task, I first started with a standard decision tree, without any tuning. Then I got into GridSearchCV and RandomizedSearchCV to find the best parameters. But even after tweaking the model with these searches, I still couldn't get higher than 79%. RandomForest didn't help either.

That's when I found XGBoost, a powerful model that is getting more and more attention in machine learning. With it, I could get over 80%.
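A sketch of the hyperparameter search step, assuming scikit-learn and a synthetic dataset in place of the engineered Titanic features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the engineered Titanic features.
X, y = make_classification(n_samples=400, random_state=42)

# Try every combination of these hyperparameters with 5-fold cross-validation.
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)

print("best params:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.2f}")
```

RandomizedSearchCV works the same way but samples parameter combinations instead of trying all of them, and an XGBoost classifier could be dropped in place of the decision tree with its own parameter grid.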

If you have any questions, or tips, you can find me on LinkedIn:

You can find the notebook on:

Hit Refresh

I recently finished Satya Nadella's book Hit Refresh. @Alex Powers recommended it on LinkedIn, so I ordered myself a paperback copy.

I liked reading about the new CEO’s career path and family. I like reading about successful people’s career paths, that’s why I am reading Arnold Schwarzenegger’s Total Recall for the second time.

The author talks about Microsoft’s company culture, emerging new technologies such as AI or quantum computers and the importance of empathy in a company.

I shouldn't have to try hard to convince you to read this book. If you work with data professionally, this book is for you.