
BudapestPy Workshops 104 (2019-10-09)

Our fourth workshop took place at CEU. Our hosts for the evening were András Vereckei and Arieda Muço. András showed us open source projects that facilitate free knowledge transfer across a wide range of collected and curated topics (see the links below).

https://datacarpentry.org/lessons
https://software-carpentry.org/lessons
https://librarycarpentry.org/lessons

This occasion was exceptional in several respects: it was our first workshop in English, it drew the largest number of attendees so far (thanks to the large seminar room provided by our generous host), and this time we had a single presenter for the whole evening.

Dóri walked us through a mini project from DrivenData and guided us from data cleaning all the way to submitting a solution to the competition.
We learnt about time-series data, visualization, gradient boosting for regression, train and test data sets, and how to evaluate our models’ performance. We mostly used the sklearn library for all of this.
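
To give a feel for what we did, here is a minimal, self-contained sketch of such a pipeline. The file name, column names and target are made up for illustration; the real features and evaluation metric are in the workshop notebooks.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Hypothetical time-series data: a "date" column and a numeric "target"
df = pd.read_csv("train.csv", parse_dates=["date"])

# Derive simple features from the timestamp
df["month"] = df["date"].dt.month
df["dayofweek"] = df["date"].dt.dayofweek

X = df[["month", "dayofweek"]]
y = df["target"]

# Hold out part of the data to evaluate the model's performance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = GradientBoostingRegressor(random_state=42)
model.fit(X_train, y_train)

print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```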

Thank you for the active participation, for helping each other sort out the technical difficulties, and especially for the group thinking; we enjoyed this session a lot!

Our next event is going to be at the Central European University (CEU), so check out the event page!

Thanks to everyone for showing up!
As always, the notebooks (the complete one and the one with the missing parts) are on our GitHub:
https://github.com/budapestpy-workshops

You can join us on our meetup page:
https://www.meetup.com/budapest-py/

The Team: Balogh Balázs, Rónai Bertalan, Szabó Dóra, Doma Miklós, Hackl Krisztián and Zsarnowszky Lóránt (last name, first name order)

Titanic: Machine Learning

Berci asked me to upload my version of Kaggle’s Titanic competition. At our workshop we achieved around 78% accuracy together, which was a good starting point.

Speaking of the workshop: in January 2019 a data science group called Data Revolution formed on Facebook:
https://www.facebook.com/groups/DatasRev/
Feel free to join.

When solving this task, I started with a standard decision tree without any tuning. Then I got into GridSearchCV and RandomizedSearchCV to search for the best parameters. But even after tweaking the model with these searches, I still couldn’t get above 79%. RandomForest didn’t help either.
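
For reference, here is a rough sketch of the grid search step. It assumes Kaggle’s train.csv and a handful of numeric features; the parameter grid is just an example, not the one from my submission.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Load Kaggle's training data and keep a few simple features for brevity
train = pd.read_csv("train.csv")
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
X = train[["Pclass", "Sex", "SibSp", "Parch", "Fare"]].fillna(0)
y = train["Survived"]

# Example parameter grid for the decision tree
param_grid = {
    "max_depth": [3, 5, 7, 10],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,                # 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```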

That’s when I found XGBoost, a powerful gradient boosting library that is getting more and more attention in machine learning. With it, I could get above 80%.
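
Swapping in XGBoost is a small change. Reusing X and y from the sketch above, it looks something like this (the parameters are illustrative, not my final ones):

```python
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

# Gradient boosted trees from the xgboost package
model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.05)

# Cross-validated accuracy instead of a single train/test split
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```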

If you have any questions, or tips, you can find me on LinkedIn:
https://www.linkedin.com/in/baloghbalazs88/

You can find the notebook at:
https://anaconda.org/bbalazs88/titanic/notebook