YouTube & MBTI

Producing, publishing and analysing YouTube & MBTI Datasets on Kaggle

Backstory

I currently have quite a few datasets on Kaggle. I enjoyed using web scraping to build datasets that others might find useful, so I went through a period of producing quite a few. Two have been particularly well received: one contains ~6 months' worth of YouTube trending data for 10 countries, and the other contains the MBTI type of, and some text written by, about 9000 users of an MBTI forum.

The MBTI dataset came 2nd in the September monthly dataset competition in 2017 and the YouTube dataset is currently #8 in the list of most upvoted datasets on Kaggle.

YouTube Trending Data

So, let’s talk briefly about why these datasets might be interesting. The YouTube trending page is an interesting place: it is dominated by content such as late night talk shows, which are not exactly the primary use case of YouTube for most people. Its efficacy has long been questioned. Regardless, it undeniably influences the popularity of different genres of content on YouTube. Some of the interesting analyses of this dataset look at which categories of video are most likely to reach trending, how long it takes a video to reach the trending page, and how trending pages vary between countries. An analysis by the YouTuber Coffee Break used the data to demonstrate heavy bias in how much effort it takes for different genres of video to reach the trending page. The Kaggle community also contributed several insightful kernels (Kaggle's name for a piece of code designed to analyse a dataset), for example:

Exploring Youtube Trending Statistics by Donyoe
Extensive US YouTube Exploration by Leonardo Ferreira
What is Trending on YouTube by Quan Nguyen
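To make the kind of analysis mentioned above a little more concrete, here is a minimal sketch of counting trending videos per category and measuring how long videos take to reach the trending page. It is not taken from any of the kernels listed here. The rows, the field names (`video_id`, `category`, `publish_date`, `trending_date`) and the ISO date format are all assumptions made up for illustration, and the real dataset's columns and formats may differ.

```python
from collections import Counter, defaultdict
from datetime import date

# Made-up rows for illustration only; field names and date format are assumptions.
rows = [
    {"video_id": "a1", "category": "Entertainment", "publish_date": "2018-01-01", "trending_date": "2018-01-02"},
    {"video_id": "a1", "category": "Entertainment", "publish_date": "2018-01-01", "trending_date": "2018-01-03"},
    {"video_id": "b2", "category": "Music", "publish_date": "2018-01-01", "trending_date": "2018-01-05"},
    {"video_id": "c3", "category": "Entertainment", "publish_date": "2018-01-02", "trending_date": "2018-01-05"},
]


def first_trending_rows(rows):
    """Keep only the earliest trending appearance of each video."""
    first = {}
    for row in rows:
        seen = first.get(row["video_id"])
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        if seen is None or row["trending_date"] < seen["trending_date"]:
            first[row["video_id"]] = row
    return list(first.values())


def days_to_trend(row):
    """Days between publication and the given trending appearance."""
    published = date.fromisoformat(row["publish_date"])
    trended = date.fromisoformat(row["trending_date"])
    return (trended - published).days


def summarise(rows):
    """Return (videos per category, mean days to first trend per category)."""
    counts = Counter()
    delays = defaultdict(list)
    for row in first_trending_rows(rows):
        counts[row["category"]] += 1
        delays[row["category"]].append(days_to_trend(row))
    mean_days = {cat: sum(d) / len(d) for cat, d in delays.items()}
    return counts, mean_days


counts, mean_days = summarise(rows)
print(counts.most_common())
print(mean_days)
```

A video can appear on the trending page on several days, so the sketch deduplicates on the first appearance before counting. Otherwise, long-running videos would inflate their category's total.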

If you wish to collect more targeted or up-to-date trending tab videos, you can find the link to the source code for the scraper at the top of the page.

MBTI Forum Data

The next dataset is on the MBTI (Myers-Briggs Type Indicator). The MBTI is a personality test which categorises people into 16 distinct personality types, defined by a preference on each of 4 axes:

Introverted/Extroverted
Intuitive/Sensing
Thinking/Feeling
Perceiving/Judging

So an example type could be ENTJ for someone who is Extroverted, Intuitive, Thinking and Judging. For some more detailed info, you can visit the main site for the test.
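The way a four-letter code maps onto the four axes can be sketched in a few lines of Python. This is only an illustration of the naming scheme described above; the names `AXES`, `decode_type` and `ALL_TYPES` are hypothetical and not part of the dataset or of any MBTI library.

```python
from itertools import product

# One dict per axis, in the order the letters appear in a type code.
AXES = [
    {"I": "Introverted", "E": "Extroverted"},
    {"N": "Intuitive", "S": "Sensing"},
    {"T": "Thinking", "F": "Feeling"},
    {"P": "Perceiving", "J": "Judging"},
]


def decode_type(code):
    """Translate a four-letter type code such as 'ENTJ' into its trait names."""
    code = code.upper()
    if len(code) != len(AXES):
        raise ValueError(f"expected a 4-letter type, got {code!r}")
    traits = []
    for letter, axis in zip(code, AXES):
        if letter not in axis:
            raise ValueError(f"unexpected letter {letter!r} in {code!r}")
        traits.append(axis[letter])
    return traits


# Two choices on each of four axes gives 2 ** 4 = 16 types.
ALL_TYPES = ["".join(letters) for letters in product(*AXES)]

print(decode_type("ENTJ"))
print(len(ALL_TYPES))
```

Validating each letter against its own axis means a malformed label (for example one with the letters in the wrong order) raises an error instead of being silently miscounted.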

Its validity has long been questioned, though most people who take the test, which can be done for free online, find it to be relatively accurate. It is therefore interesting to look at any data that might help us determine its accuracy. This dataset is by no means perfect: for example, the types are self-reported, and there is a heavy bias towards Introverted Intuitive types in the data. Perhaps this type of person is more attracted to the idea of discussing psychology on the internet, hence its over-representation on an online psychology forum. Still, it is interesting to see the kernels that have been constructed for the dataset, which can be found here.
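One simple way to check for the kind of imbalance described above is to measure what fraction of the self-reported types start with "IN". The sketch below uses a handful of made-up labels rather than the real data, and the function name is hypothetical.

```python
from collections import Counter


def introverted_intuitive_share(types):
    """Fraction of type labels that are both Introverted and Intuitive ('IN..')."""
    types = [t.upper() for t in types]
    if not types:
        return 0.0
    return sum(1 for t in types if t.startswith("IN")) / len(types)


# Made-up sample labels for illustration; not drawn from the dataset.
sample = ["INFP", "INTJ", "ENTP", "INFJ", "ISTJ"]

print(Counter(sample).most_common())
print(introverted_intuitive_share(sample))
```

If the 16 types were evenly represented, 4 of them (INTP, INTJ, INFP, INFJ) would account for a quarter of the users, so a share well above 0.25 would point to the over-representation mentioned above.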
