I started my career as a MOLAP cube builder, which left room for my creativity to improve the model that was in the data warehouse. It was always a great feeling to create a tool that gave users the ability to create reports on their own.
The Crime Data Explorer (CDE) is the FBI’s Uniform Crime Reporting (UCR) Program’s dynamic solution to presenting crime data in a more immediate venue that reflects the constant change in the nation’s crime landscape. Its pages provide a view of estimated national and state data, reported agency-level crime statistics, and graphs of specific variables from the National Incident-Based Reporting System (NIBRS).
It was a very nice trip down memory lane to hear a talk about OLAP cubes. I spent a great part of my professional career building MOLAP cubes, and this talk made me want to build one again just for fun.
This was the best session for me, and for Balázs as well. Marcin talked about machine learning using the Titanic dataset as an example. Balázs and I have also spent a lot of time with that dataset and the Kaggle competition. It was great to hear about hyperparameters and grid search, because we had only discovered them about a week ago and tested them at the last workshop of our data science study group.
I can tell you that this session was also very informative, but I won’t claim that I understood every topic it covered.
Balázs, what do you think about the event? I saw that you mastered the ‘How to win’ part of the machine learning session.
How to win a Kaggle competition and get familiar with machine learning, part 2.
Hi there, Balázs here. I really liked Marcin’s part, because I have indeed spent a lot of time working on the Titanic dataset. We got some new tips to improve our accuracy, like the covariances and testing on the whole dataset. It was great to see that what he was doing with this competition is almost exactly what we did with it in one of our workshops.
The biggest surprise for me was that I won a mini drone from one of the sponsors, Quest. I’ve never won anything, and when I first heard my name as one of the lucky ones whose name got drawn, I didn’t believe it. It’s super fun, and pretty hard to control, but I’m on it.
The following tutorial describes my Power BI report, which uses the new Key Influencers visual to enter a machine learning competition on Kaggle using only Power BI.
The sinking of the RMS Titanic is one of the most infamous
shipwrecks in history. On April 15, 1912, during her maiden voyage, the
Titanic sank after colliding with an iceberg, killing 1502 out of 2224
passengers and crew.
The Key Influencers visual can find segments in your data and identify key influencers.
Short Detour about importing CSV files
When you import from a CSV file, you might see a hardcoded Columns argument in the query editor, like this one:
Columns=12, Encoding=1250, QuoteStyle=QuoteStyle.None])
What happens when you replace the original CSV file with a new one, exported from a source, that has 13 columns? You are right, the new column won’t be imported! So you should manually remove the Columns=12 part from the query above to make it independent of the number of columns, as in the sketch below.
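Here is a minimal before/after sketch of such a step; the file path and delimiter are placeholders for whatever your own query uses:

// Before: the import is pinned to exactly 12 columns
Source = Csv.Document(File.Contents("C:\Data\train.csv"), [Delimiter=",", Columns=12, Encoding=1250, QuoteStyle=QuoteStyle.None])

// After: without the Columns argument, the step picks up however many columns the file has
Source = Csv.Document(File.Contents("C:\Data\train.csv"), [Delimiter=",", Encoding=1250, QuoteStyle=QuoteStyle.None])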
Based on the segments the visual created, I created a calculated column that evaluates to 1 or 0 (true or false). Check what happens if you don’t put parentheses around the logical tests, such as HasCabin = true or [Age] <> null. not ([Age] is null) is nicer code, by the way! Thanks for the info, Alex!
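As a sketch, such a column could be added as a custom column in the query editor, along these lines. The conditions here are hypothetical; your real ones come from whatever segments the Key Influencers visual finds in your data:

// Hypothetical segment logic: 1 = predicted survivor, 0 = predicted victim
AddedGuess = Table.AddColumn(Source, "SurvivedGuess", each if [Sex] = "female" or ([HasCabin] = true and not ([Age] is null)) then 1 else 0, Int64.Type)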
To create the Test query, I copied the Titanic Train query (the one that has the Survived column) into a TitanicTest query/table that does not have the original Survived column, only the new SurvivedGuess calculated column. I can then send Kaggle a submission CSV containing two columns: PassengerId and Survived (containing my guesses).
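Shaping that table into the two-column format Kaggle expects could look like this (a sketch; the step name is made up):

// Keep only the id and the guess, and rename the guess to Survived
Submission = Table.RenameColumns(Table.SelectColumns(TitanicTest, {"PassengerId", "SurvivedGuess"}), {{"SurvivedGuess", "Survived"}})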
I used Imke Feldmann’s Power Query script that uses Python to write a CSV.
Sorry, I won’t go into the details of how to use Python in Power BI, except to say that you should install Python regardless of whether you have Anaconda installed or not.
We can read the gender_submission CSV to see what a submission should look like: it has only 418 rows besides the header. You could also create a query (I called mine CsvTestBeforeSubmit) that reads your CSV before submitting.
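A sketch of such a check query, again with a placeholder path; it should come back with 418 rows and just the PassengerId and Survived columns:

// Read back the file we are about to submit
CsvTestBeforeSubmit = Table.PromoteHeaders(Csv.Document(File.Contents("C:/Data/submission.csv"), [Delimiter=",", QuoteStyle=QuoteStyle.None]))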
Alright! The only thing left now is to enter the Kaggle
competition and reach a high score.
Try to reach a higher score than mine! Did you know that if you predict that only the women survived, your score would be 76.55%? Try to beat that number by predicting the clusters of women and men who would survive!
Solving this task, I first started with the standard Decision Tree, without any tuning. Then I got into GridSearchCV and RandomizedSearchCV to find the best parameters. But even after tweaking the model with these searches, I still couldn’t get higher than 79%. RandomForest didn’t help either.
That’s when I found XGBoost, a powerful model that is getting more and more attention in machine learning. With it, I could get over 80%.