
paulwgraham


Exploratory Data Analysis Of The GiantBomb.com Userbase

Introduction

To me Giant Bomb feels like a small site. I mostly see the same people in livestream chat and posting on the forums. Yet intellectually I know that GiantBomb.com must get a fair amount of traffic and have a sizable number of users in order to pay the salaries of the crew, artists, and engineers that run the site.

So I've always wondered: How many users does Giant Bomb have? How many users are Premium? How many users actually post to the forums?

Fortunately, on GiantBomb.com user profiles are public by default and contain all the information I need to answer these questions.

Methodology

I wrote a web crawler that, over the last couple of months, has slowly scanned every Giant Bomb user profile it could find. I paid special attention to making sure the crawler respects the rules laid out in GiantBomb.com/robots.txt and is throttled to make only one request per second.
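As a rough illustration of the two safeguards described above, the following sketch checks URLs against robots.txt rules with Python's standard `urllib.robotparser` and enforces a minimum delay between requests. The crawler's actual code isn't shown in this post, so the `Throttle` class, the `my-crawler` user agent, and the sample rules are all hypothetical.

```python
import time
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file (fetched separately)."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


class Throttle:
    """Block until at least `min_interval` seconds separate two calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
        self.last_call = self.clock()


# Hypothetical rules, not the real contents of GiantBomb.com/robots.txt.
SAMPLE_ROBOTS_TXT = "User-agent: *\nDisallow: /admin/\n"
rules = load_rules(SAMPLE_ROBOTS_TXT)
throttle = Throttle(min_interval=1.0)  # at most one request per second


def may_fetch(url: str, user_agent: str = "my-crawler") -> bool:
    """Return True if the parsed rules allow `user_agent` to request `url`."""
    return rules.can_fetch(user_agent, url)
```

In a crawl loop, each iteration would call `may_fetch(url)` first and `throttle.wait()` immediately before the request itself.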

Disclaimer

Given how long the data took to collect, it is by its very nature somewhat out of date. The following analysis doesn't account for the normal churn of users.

It should also be noted that neither the manner in which the data was collected nor the methods used in the analyses have been vetted for correctness.

The Users

I was able to find 1,052,304 GiantBomb.com user accounts of which 988,483 have publicly available profiles.

The following charts show user sign-ups by ISO 8601 week.

[14 charts of user sign-ups by ISO week; images not available]

From the above charts it's clear that something strange started happening in the 29th week of 2020.

The average number of user sign-ups per ISO week was 642 in 2019, 1,902 in 2020, and 6,881 in 2021.
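For readers who want to reproduce this kind of weekly bucketing, here is a minimal sketch using `date.isocalendar()` from the standard library. The function names are hypothetical, and the average is taken only over weeks that have at least one sign-up, which may differ from how the figures above were computed.

```python
from collections import Counter
from datetime import date


def signups_per_iso_week(signup_dates):
    """Count sign-ups per (ISO year, ISO week) pair."""
    counts = Counter()
    for day in signup_dates:
        iso_year, iso_week, _weekday = day.isocalendar()
        counts[(iso_year, iso_week)] += 1
    return counts


def average_per_week(counts, iso_year):
    """Average weekly sign-ups over the weeks of `iso_year` present in `counts`."""
    weekly = [n for (year, _week), n in counts.items() if year == iso_year]
    return sum(weekly) / len(weekly) if weekly else 0.0


# Made-up sign-up dates for illustration only.
sample = [date(2021, 6, 28), date(2021, 6, 30), date(2021, 1, 3)]
counts = signups_per_iso_week(sample)
```

Note that the ISO year can differ from the calendar year near New Year: 3 January 2021 falls in week 53 of ISO year 2020.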

The following chart shows user sign-ups by year.

[Chart of user sign-ups by year; image not available]

Of the 988,483 accounts with public profiles, 413,673 signed up before 2020 and 574,810 signed up in 2020 or later.


So what started this massive influx of new users? It's hard to say for sure, but there are a few possibilities. In 2020 the Covid-19 pandemic kicked into full swing, and ISO week 11 was when Giant Bomb began primarily streaming on Twitch with their "Lockdown" streams. This no doubt had an effect on user sign-ups.

However, after examining the data I believe the biggest contributor to this ongoing influx of user sign-ups is spam accounts.

The Spam

Did you know Giant Bomb has a significant spam problem? I didn't. I mean I do occasionally see spam posts on the forums but those posts are quickly deleted and the spam user accounts used to create the posts are quickly banned.

What I was surprised to discover is that not only are there a ton of obvious spam user accounts on GiantBomb.com, but the spammers are sticking their spam messages in the "About Me" sections of their user account profiles.

In addition to the weird sex stuff and obvious scams that are typical for spam, the spam on Giant Bomb contains some interesting offers.

Do you want to buy some vegan deodorant? Do you live in Florida and need to get your Tesla modified? Do you need to get a wedding dress altered in the UK? Are you desperate to buy a Feng Shui plant from Vietnam?

If you answered "yes" to any of the above questions, the spam accounts on GiantBomb.com have you covered.

What's interesting is that the typical user account with "About Me" profile spam will never post on the Giant Bomb forums or comment on any videos. This means that there is practically no way an average real Giant Bomb user will ever see the spammer's profile, let alone the spam it contains.

One may ask: "If no one will ever see it, why do the spammers bother?" Well, I think the answer has to do with the fundamental nature of spam. The vast majority of all spam will never be seen or acted upon. However, placing spam on sites is essentially free for the spammer. Therefore it makes a sociopathic kind of sense for a spammer to place their spam messages in as many places as possible regardless of effectiveness, because any click-through they receive is pure upside. Thus, the only criterion that matters to the spammer is that it is possible to post spam on GiantBomb.com. It does not matter if it is an ineffective place to do so.

It should be noted that there appear to be a number of legitimate businesses being advertised by the spam user accounts. While it is possible that the owner of a florist in Fort Lauderdale, Florida is moonlighting as an operator of a spam-spreading botnet, I feel it's more likely that said florist innocently signed up for an online marketing service without realizing it was a shady spammer.

It's also interesting that "About Me" spam user accounts don't typically interact with GiantBomb.com in any way other than filling in their "About Me" sections. This means I can confidently identify these types of spam user accounts by looking for user accounts that have a filled-in "About Me" section, have never posted on the forums or commented on videos, have not contributed to the Giant Bomb wiki, and don't currently hold any kind of Premium user status.

There are 239,113 "About Me" spam user accounts. 17,934 of them signed up in 2020.

The following charts show the number of "About Me" spam user sign-ups in 2020, 2021, and 2022 versus all other sign-ups in those years.
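The test described above can be written as a small predicate. This is only a sketch: the `Profile` fields are hypothetical stand-ins for whatever the crawler actually stored, and treating "has wiki points" as "has contributed to the wiki" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    # Hypothetical fields; the crawler's real data model isn't shown in the post.
    username: str
    about_me: str = ""
    forum_posts: int = 0
    comments: int = 0
    wiki_points: int = 0
    is_premium: bool = False


def is_about_me_spam(profile: Profile) -> bool:
    """Filled-in "About Me", no public activity, and no Premium status."""
    has_about_me = bool(profile.about_me.strip())
    has_activity = (
        profile.forum_posts > 0
        or profile.comments > 0
        or profile.wiki_points > 0
    )
    return has_about_me and not has_activity and not profile.is_premium


# Made-up profiles for illustration only.
profiles = [
    Profile("florist123", about_me="Best flowers in town, visit our shop"),
    Profile("longtime_duder", about_me="I like Quick Looks", forum_posts=42),
    Profile("quiet_lurker"),
]
spam = [p.username for p in profiles if is_about_me_spam(p)]
```

A quiet real user who filled in "About Me" but never posted would also be flagged by this test, so the count it produces is best read as an estimate.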

[3 charts of "About Me" spam sign-ups versus all other sign-ups for 2020, 2021, and 2022; images not available]

The "About Me" spammers account for a sizable chunk of all user sign-ups in 2020, 2021, and 2022, but what about the rest of the sign-ups? Are they real users? For the majority of those remaining sign-ups I believe the answer is most likely no. For 2021-W26 there was an average of 4,042 sign-ups per day. Here are the usernames for some of the sign-ups that happened on 2021-W26-3:

"bernardc6w", "webstermvu", "lauryznq", "michale6mp", "bettieikh", "bruceq9n", "armandopnq", "mortonj0i", "garthqem", "macy1v0", "hayliemyc", "cristopherfil", "alvenat3a", "brodywxd", "g8udhyo060", "vivienxgn", "devanu5x", "wilmam5l", "angel27q", "grahamo0o", "baileyjw0", "eddieu_d", "hilario990", "brainhg4", "aidenapx", "anderaoyxe", "deon5wb", "kenyagmp", "aylinpwu", "marlenu_l", "cecilenzk", "jamisont4g", "cloyd8ec", "piercem70", "paxtonrnfv", "charitytbb", "everettempa", "triston6th", "cassidynah", "maiyarlq", "edwarde9j", "altalvj", "cecilianvg", "nigellie", "brandt5fv", "edwinaujr", "imaj6e", "edmund2gf"

These usernames seem like they were generated by spammers. They all seem sort of similar but also don't seem like the kind of usernames humans would choose.

Unlike for the "About Me" spam user accounts, there doesn't seem to be a simple test that can be used to identify this type of spam user account. These accounts don't seem to interact with GiantBomb.com in any measurable way. My guess is that these spam user accounts are meant to lie dormant until such time as they are used to post spam to the forums, after which they are quickly banned.

If spam user accounts can't be distinguished from real user accounts, then it is impossible to get a precise count of real user accounts. However, this doesn't mean that it is impossible to get a relative sense of how many real users signed up in a given year in comparison to some other year. To this end I find it helpful not to focus on identifying spam user accounts but to focus instead on finding the user accounts of real users.

User accounts of what are most likely real users can be found by looking for basically the opposite of what to look for when searching for spam user accounts. If a user has posted on the forums or commented on videos, has contributed to the Giant Bomb wiki, or holds any kind of Premium user status, then that user is most likely a real user.

The following charts show the sign-ups for accounts that belong to probable real users for the years of 2019 and 2020.

[2 charts of sign-ups for probable real users in 2019 and 2020; images not available]

Obviously, these charts don't include real user accounts that don't interact with GiantBomb.com in any public way, but if it is assumed that the proportion of active real users to inactive real users stays the same, then the relative increase or decrease of real user sign-ups can be calculated.

In 2019 there were 4,671 user account sign-ups for known real users. In 2020 there were 3,663. This represents a decrease in real user sign-ups of roughly 22%. This is a considerable difference from overall user sign-ups, which went from 33,588 in 2019 to 99,665 in 2020, an increase of 197%.
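The two percentages can be checked with the usual relative-change formula, (after - before) / before. The helper name below is just for illustration.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, expressed as a percentage."""
    return (after - before) / before * 100


# Sign-up counts quoted in the text.
real_user_change = percent_change(4671, 3663)     # known real users, 2019 to 2020
all_signups_change = percent_change(33588, 99665)  # all sign-ups, 2019 to 2020
```

Rounded to whole numbers, these come out to a 22% decrease and a 197% increase respectively.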

Active Users

So why does Giant Bomb feel like a small site despite having so many users? I believe the answer is that most users don't post. Of the 1,052,304 Giant Bomb user accounts, only 135,407 have ever posted, commented, or contributed to the wiki. Of those 135,407 active user accounts, only 2,690 have posted and/or commented in 2022.

Premium

There are 29,403 Giant Bomb user accounts with premium status. 24,276 are on yearly plans. 5,127 are on monthly plans.


edit: added missing chart.


Finding And Fixing Broken YouTube IDs On GiantBomb.com

A while ago I came across a video on giantbomb.com that couldn't play because of a bad YouTube video ID. I decided to dig deeper and ended up finding hundreds of broken YouTube IDs and some misplaced videos.

The Videos

Giant Bomb has over 14,000 videos on the site. When a user visits a video page the video will either be served by Giant Bomb and played in a custom video player or will be served by YouTube and played in an embedded YouTube player. Whether the video is served from Giant Bomb or from YouTube depends on a number of factors but generally speaking logged-out users tend to get the YouTube version.

The API

Giant Bomb has an API that any user can use to get the data associated with videos on the site.

Like any API, the Giant Bomb API has its own set of quirks. I started with the API by typing the endpoints and query parameters into Chrome and seeing what the API sent back. This seemed to work fine, but as soon as I moved on to Python and the popular Requests module I encountered the first API quirk and got nothing but errors. It turns out that the Giant Bomb API doesn't like the User-Agent that the Requests module sends by default. One custom User-Agent header later, I started getting useful data back from the API.
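To make the User-Agent fix concrete, here is a small sketch that builds a request with a custom header. It uses the standard library's `urllib.request` rather than Requests so that it has no dependencies; with Requests the equivalent is passing `headers={"User-Agent": ...}` to `requests.get`. The `api_key` and `format` query parameters, the `videos` resource name, and the User-Agent string are assumptions for illustration, not details taken from this post.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://www.giantbomb.com/api"
USER_AGENT = "my-video-audit-script/0.1"  # hypothetical; any descriptive value


def build_request(resource: str, api_key: str, **params) -> urllib.request.Request:
    """Build (but do not send) a GET request that carries a custom User-Agent."""
    # Assumed query parameters: `api_key` and `format`.
    query = urllib.parse.urlencode({"api_key": api_key, "format": "json", **params})
    url = f"{API_BASE}/{resource}/?{query}"
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_json(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON body."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


request = build_request("videos", api_key="YOUR_API_KEY", limit=10)
```

Calling `fetch_json(request)` would perform the actual network call; it is left out here so the sketch runs offline.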

The data that the API returns for an individual video looks something like the following (edited down to the relevant bits):

{
    "deck":"Managing a mercenary group of mechs takes a lot of patience, hard work, and jumping.",
    "guid":"2300-13060",
    "id":13060,
    "length_seconds":4508,
    "name":"Quick Look: BattleTech",
    "publish_date":"2018-05-13 06:00:00",
    "site_detail_url":"https:\/\/www.giantbomb.com\/videos\/quick-look-battletech\/2300-13060\/",
    "user":"ybbaaabby",
    "video_type":"Quick Looks",
    "video_show":{"id":3,"title":"Quick Looks"},
    "video_categories":[{"id":3, "name":"Quick Looks"}],
    "youtube_id":"5nC2PPLl3ec"
}

Notice that the data includes things like the video's guid, the video name, and the ID of the corresponding video on YouTube. Also notice that the data lists the video's type, the video's show, and the categories the video belongs to.

Misplaced Videos

Putting aside the broken YouTube embed I began to wonder about what other video data could cause errors that wouldn't be caught by the GB CMS.

Looking at the video's type, show, and categories it's obvious that these three fields have something to do with where a video is located on the site. It also stands to reason that values for these fields are manually chosen when a video is added.

Where there is a manual process there will be mistakes, so I set about trying to find some of them. Fortunately for my purposes, Giant Bomb videos tend to follow a standard naming convention for each show type. For example, Quick Look titles tend to begin with the words "Quick Look" and Bombcast video titles tend to contain the word "Bombcast". So for each show/category I created a regular expression that would match the expected title. I then ran a search that examined every video title, looking for videos that matched a naming convention but weren't in the expected show/category. I also looked for videos that were in a show/category but didn't have a title that matched the expected naming convention. I found a few possibly misplaced videos, including this one:

{
    'deck': "When you tell the squad that you can't come to the club because you have to defeat the Heartless.",
    'guid': '2300-12505',
    'id': 12505,
    'length_seconds': 4355,
    'name': 'Kingdom Heartache: Episode 6: Wind of the Forgotten Sorrow',
    'publish_date': '2017-09-10 06:00:00',
    'site_detail_url': 'https://www.giantbomb.com/videos/kingdom-heartache-episode-6-wind-of-the-forgotten-/2300-12505/',
    'user': 'benpack',
    'video_type': None,
    'video_show': None,
    'video_categories': [],
    'youtube_id': 'Jd5VAdPyttY'
}

which was surprising because I would have expected the CMS to warn users when they tried to add a video with no type, show, or category.
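The title-versus-placement search described above might look roughly like the following. The patterns are hypothetical (the real expressions aren't listed in this post), the "Bombcast" key is an assumed type name, and comparing against `video_type` alone is a simplification of checking type, show, and categories.

```python
import re

# Hypothetical patterns, one per show/type name.
TITLE_PATTERNS = {
    "Quick Looks": re.compile(r"^Quick Look\b"),
    "Bombcast": re.compile(r"Bombcast"),
}


def find_mismatches(videos):
    """Flag videos whose title and video_type disagree, in either direction."""
    mismatches = []
    for video in videos:
        for show, pattern in TITLE_PATTERNS.items():
            title_matches = bool(pattern.search(video["name"]))
            placed_in_show = video.get("video_type") == show
            if title_matches != placed_in_show:
                mismatches.append((video["guid"], show))
    return mismatches


def find_unplaced(videos):
    """Flag videos with no type, no show, and no categories at all."""
    return [
        video["guid"]
        for video in videos
        if not video.get("video_type")
        and not video.get("video_show")
        and not video.get("video_categories")
    ]


# Two real entries from this post plus two made-up ones ("x-1", "x-2").
videos = [
    {"guid": "2300-13060", "name": "Quick Look: BattleTech",
     "video_type": "Quick Looks", "video_show": {"id": 3},
     "video_categories": [{"id": 3}]},
    {"guid": "x-1", "name": "Quick Look: Example Game",
     "video_type": "Features", "video_show": None,
     "video_categories": [{"id": 8}]},
    {"guid": "x-2", "name": "An Untitled Feature",
     "video_type": "Quick Looks", "video_show": {"id": 3},
     "video_categories": [{"id": 3}]},
    {"guid": "2300-12505",
     "name": "Kingdom Heartache: Episode 6: Wind of the Forgotten Sorrow",
     "video_type": None, "video_show": None, "video_categories": []},
]
```

Everything either function returns is only a candidate for manual review, since some titles legitimately break the naming convention.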

Broken YouTube IDs

Back to the broken YouTube embed. I wanted to find other videos with broken YouTube IDs. For this I would need two things:

1. The stored YouTube ID for every Giant Bomb video.

2. A way to check if a given YouTube ID is valid.

It was easy enough to get the YouTube IDs using the Giant Bomb API. To check if a given YouTube ID is valid I turned to the YouTube Data API. The first thing I tried was to take the YouTube ID for every video on Giant Bomb and check its validity using the https://www.googleapis.com/youtube/v3/videos endpoint. But with over 14k videos on the site I quickly ran out of quota for YouTube API requests.

So I shifted my approach to searching. To save on quota I first downloaded a list of all videos on giantbomb.com using the Giant Bomb API. Then I downloaded a list of all YouTube IDs on the Giant Bomb YouTube channel using the https://www.googleapis.com/youtube/v3/playlistItems endpoint and the ID of the "uploads" playlist.

I then compared the two lists to find potentially bad YouTube IDs. Any YouTube ID on the list provided by the Giant Bomb API but not on the list grabbed from the YouTube API could be a bad YouTube ID.
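The comparison boils down to a set difference. A minimal sketch, assuming each site video is a dict with `guid` and `youtube_id` keys as in the API output shown earlier (the function name and sample data are made up):

```python
def candidate_bad_ids(site_videos, channel_ids):
    """Map guid -> youtube_id for site videos whose ID is absent from the channel."""
    on_channel = set(channel_ids)
    return {
        video["guid"]: video["youtube_id"]
        for video in site_videos
        if video.get("youtube_id") and video["youtube_id"] not in on_channel
    }


# Illustrative data: two IDs quoted in this post, one of them on the channel list.
site_videos = [
    {"guid": "2300-13060", "youtube_id": "5nC2PPLl3ec"},
    {"guid": "2300-11201", "youtube_id": "Y6BeQ4BOnY"},
    {"guid": "x-3", "youtube_id": None},  # made-up entry with no stored ID
]
channel_ids = ["5nC2PPLl3ec"]
candidates = candidate_bad_ids(site_videos, channel_ids)
```

Videos with no stored YouTube ID are skipped here; whether those should count as broken is a separate question.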

Interestingly, it's not sufficient for a YouTube ID to be missing from the YouTube channel for it to be declared bad. This is because on the Giant Bomb site there are videos such as the following that are associated with valid videos on other YouTube channels.

{
    'deck': 'Jeff answers your fantastic questions and is finally confronted by a horrible truth.',
    'guid': '2300-12968',
    'id': 12968,
    'length_seconds': 317,
    'name': 'Quick Question with Jeff Bakalar: Ep. 10 - Jeff is a Lip Smacker',
    'publish_date': '2018-04-10 07:59:00',
    'site_detail_url': 'https://www.giantbomb.com/videos/quick-question-with-jeff-bakalar-ep-10-jeff-is-a-l/2300-12968/',
    'user': 'vinny',
    'video_type': 'Features',
    'video_show': None,
    'video_categories': [{'api_detail_url': 'https://www.giantbomb.com/api/video_category/2320-8/',
                          'id': 8,
                          'name': 'Features',
                          'site_detail_url': 'https://www.giantbomb.com/videos/features/'}],
    'youtube_id': 'NIHmjbi6Dfc'
}

I then used the YouTube API to check every YouTube ID in the reduced pool of candidates. This worked well and in the end I found 205 broken YouTube IDs.
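One way to keep the number of YouTube API calls down during a check like this is to ask about several IDs per request. The sketch below assumes, as I understand the YouTube Data API, that the videos endpoint accepts a comma-separated `id` parameter of up to 50 IDs and simply omits unknown IDs from the `items` list in its response. It is not necessarily how the check in this post was done, and the helper names are hypothetical.

```python
import urllib.parse

VIDEOS_ENDPOINT = "https://www.googleapis.com/youtube/v3/videos"
MAX_IDS_PER_CALL = 50  # assumption: cap on IDs per request


def batches(ids, size=MAX_IDS_PER_CALL):
    """Yield successive lists of at most `size` IDs."""
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def build_url(batch, api_key):
    """Build the request URL for one batch of IDs (nothing is sent here)."""
    query = urllib.parse.urlencode(
        {"part": "id", "id": ",".join(batch), "key": api_key}
    )
    return f"{VIDEOS_ENDPOINT}?{query}"


def missing_ids(batch, response):
    """Return the requested IDs that do not appear in the decoded response."""
    found = {item["id"] for item in response.get("items", [])}
    return [video_id for video_id in batch if video_id not in found]


# Illustrative response shape with one of the two requested IDs present.
batch = ["5nC2PPLl3ec", "Y6BeQ4BOnY"]
fake_response = {"items": [{"id": "5nC2PPLl3ec"}]}
broken = missing_ids(batch, fake_response)
```

An ID can also be missing from the response because the video is private or deleted, so "missing" here means "not publicly retrievable" rather than strictly "mistyped".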

Interestingly, while looking for bad YouTube IDs I came across some videos like these:

{
    'deck': "It's time for a new generation to see the controversial and scandalous horrors that await in Night Trap.",
    'guid': '2300-12563',
    'id': 12563,
    'length_seconds': 1648,
    'name': 'Quick Look: Night Trap - 25th Anniversary Edition',
    'publish_date': '2017-10-05 06:00:00',
    'site_detail_url': 'https://www.giantbomb.com/videos/quick-look-night-trap-25th-anniversary-edition/2300-12563/',
    'user': 'ybbaaabby',
    'video_type': 'Quick Looks',
    'video_show': (removed for brevity),
    'video_categories': (removed for brevity),
    'youtube_id': 'Night Trap - 25th Anniversary Edition: Quick Look'
}

{
    'deck': "A few more Mario Party mini-game pitches for you to ponder: Block Jock, Boo's Cruise, POW WOW, Luigi Squeegee, Blooper Scooper.",
    'guid': '2300-11201',
    'id': 11201,
    'length_seconds': 537,
    'name': 'Best of Giant Bomb: 99 - Piranha Pajamas',
    'publish_date': '2016-05-28 06:00:00',
    'site_detail_url': 'https://www.giantbomb.com/videos/best-of-giant-bomb-99-piranha-pajamas/2300-11201/',
    'user': 'turboman',
    'video_type': 'Best of Giant Bomb',
    'video_show': (removed for brevity),
    'video_categories': (removed for brevity),
    'youtube_id': 'Y6BeQ4BOnY'
}

Notice the YouTube IDs. The first is obviously invalid. It should look something like 'n6WelVKtDgQ' and not a bunch of words. The second YouTube ID only has 10 characters in it. A typical YouTube ID has 11.

This suggests to me that at some point in Giant Bomb's history inputting a YouTube ID was a manual process and that the IDs weren't checked with a regular expression.
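A format check of the kind mentioned above could be as small as this. It assumes the commonly observed shape of YouTube video IDs (11 characters drawn from letters, digits, `-`, and `_`), which is a convention rather than something this post or a specification guarantees; passing the check says nothing about whether the video actually exists.

```python
import re

# Assumed shape of a YouTube video ID: 11 URL-safe base64 characters.
YOUTUBE_ID_RE = re.compile(r"[A-Za-z0-9_-]{11}")


def looks_like_youtube_id(value) -> bool:
    """Return True if `value` has the expected shape of a YouTube video ID."""
    return isinstance(value, str) and YOUTUBE_ID_RE.fullmatch(value) is not None


# IDs quoted in this post.
checks = {
    "n6WelVKtDgQ": looks_like_youtube_id("n6WelVKtDgQ"),
    "Y6BeQ4BOnY": looks_like_youtube_id("Y6BeQ4BOnY"),
    "Night Trap - 25th Anniversary Edition: Quick Look": looks_like_youtube_id(
        "Night Trap - 25th Anniversary Edition: Quick Look"
    ),
}
```

Both of the malformed IDs above fail the check, one for its length and one for its spaces and punctuation.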

Orphaned YouTube Videos

It occurred to me that the YouTube videos that were supposed to be pointed to by the broken YouTube IDs on giantbomb.com might actually still exist on YouTube but with different YouTube IDs.

I already had a list of all YouTube IDs stored on Giant Bomb. I also already had a list of all YouTube IDs from the Giant Bomb YouTube channel. So to search for videos that were on YouTube but not associated with a video on the Giant Bomb site I did the reverse of the search I did earlier. I looked for YouTube IDs that were on the YouTube channel but not on the list of YouTube IDs I gathered from the Giant Bomb API.

This new search yielded 837 orphaned videos. A few of the videos were clearly intended to be YouTube exclusives. However, a bunch of the videos returned clearly weren't intended to be YouTube exclusives including this one:

https://www.youtube.com/watch?v=enOmD9yL_Hg

This was the video that inspired me to dig into this stuff to begin with.

Possible Matches

I then turned my attention to investigating whether it was possible to automate or semi-automate the process of matching each video with a broken YouTube ID on giantbomb.com with the correct corresponding orphaned video on the Giant Bomb YouTube channel.

The first thought I had was to match the videos based on title. Unfortunately, the title that a video has on giantbomb.com doesn't necessarily correspond to the title the video has on YouTube. They are usually close but not exactly the same. It seems like it's up to the CMS users to name the YouTube versions as they see fit. For example, video titles often end up like this:

Quick Look: Dragon Ball Z: Kakarot [giantbomb.com]

Dragon Ball Z: Kakarot: Quick Look [YouTube]

So I couldn't rely on an exact title match to algorithmically pair giantbomb.com videos with their YouTube counterparts, but I could at least use the titles to narrow down the search.

By finding the Levenshtein distance between the title of an orphaned video and the title of a video on giantbomb.com I could measure how similar their titles were.

For example the Levenshtein distance between this title (found on giantbomb.com):

Quick Look: Harry Potter and the Deathly Hallows: Part 2

and this title (found on YouTube):

Quick Look: Harry Potter and the Deathly Hallows pt. 2

is five.

So I calculated the Levenshtein distance between every video title on giantbomb.com and the title of a given orphaned video. I was then able to rank every video title on giantbomb.com for how similar it is to that given orphaned YouTube video's title. By taking only the lowest scores from this ranked list I then had a pool of possible matches for that orphaned video.
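For reference, here is a compact sketch of the distance calculation and the ranking step. It is a textbook dynamic-programming Levenshtein implementation, not necessarily the library or code used for this project, and `closest_titles` with its `limit` parameter is a hypothetical helper.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def closest_titles(orphan_title, site_titles, limit=5):
    """Return the `limit` site titles with the smallest distance to `orphan_title`."""
    scored = sorted((levenshtein(orphan_title, title), title) for title in site_titles)
    return scored[:limit]


site_title = "Quick Look: Harry Potter and the Deathly Hallows: Part 2"
youtube_title = "Quick Look: Harry Potter and the Deathly Hallows pt. 2"
distance = levenshtein(site_title, youtube_title)
```

The comparison is case-sensitive, so "P" versus "p" counts as a substitution. A reordered title such as the Dragon Ball Z example scores poorly under this measure, which is one reason the ranking only yields a pool of candidates rather than a definite match.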

Now that I had a pool of possible matches for each orphaned video, I needed a way to determine which candidate video, if any, was the correct match. I googled around to see if I could find any software or techniques for video fingerprinting but couldn't find any. It then occurred to me that I could find matches based on audio alone, and that there are all kinds of services, like Shazam, that use audio fingerprinting to identify songs. Sure enough, with some more googling I was able to find an open source audio fingerprinting package. It was meant for music identification, but I decided to see if I could make it work for my purposes.

Fingerprinting Audio

This turned out to be the hardest part of the entire project. Not for any deep technical reason, but because the software package I chose to do the audio fingerprinting had a lot of broken dependencies and was prone to crashing.

Since this analysis would take a fair amount of time and bandwidth, the first thing I did was spin up a Digital Ocean droplet on which to do the work. Running in a tmux session, I started pulling down the data needed for comparison. For YouTube I was able to pull down just the audio portion of each orphaned video using youtube-dl. For the candidate videos on giantbomb.com, however, I had to download each entire video, which I then split using ffmpeg, keeping only the audio portion to save disk space. I then used the open source audio fingerprinting package to do the comparisons.
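The download and split steps could be scripted along these lines. The exact commands used for this project aren't given in the post, so the flags below are one plausible choice: `-f bestaudio` asks youtube-dl for an audio-only stream, and `-vn -acodec copy` tells ffmpeg to drop the video and copy the audio track unchanged. The output extension passed to ffmpeg has to suit the source audio codec (for example `.m4a` for AAC audio), which is an assumption about the downloaded files.

```python
import subprocess


def youtube_audio_command(youtube_id: str, out_dir: str = "audio") -> list:
    """Command that downloads only the audio stream of a YouTube video."""
    return [
        "youtube-dl",
        "-f", "bestaudio",
        "-o", f"{out_dir}/%(id)s.%(ext)s",  # youtube-dl output template
        f"https://www.youtube.com/watch?v={youtube_id}",
    ]


def strip_video_command(video_path: str, audio_path: str) -> list:
    """Command that copies the audio track out of a video file, dropping the video."""
    return ["ffmpeg", "-i", video_path, "-vn", "-acodec", "copy", audio_path]


def run(command: list) -> None:
    """Run a command, raising CalledProcessError if it exits with an error."""
    subprocess.run(command, check=True)


download = youtube_audio_command("enOmD9yL_Hg")
split = strip_video_command("2300-12563.mp4", "2300-12563.m4a")  # made-up file names
```

Passing `download` or `split` to `run` would execute them, provided youtube-dl and ffmpeg are installed on the machine.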

In the end after dealing with countless crashes and problems I was only able to generate match data for a small slice of the orphaned YouTube videos.

Oddities

During my investigation I came across some of the cruft that accumulates on a big site after it's been around for a while. Such oddities include:

Two Posts For The Same Video:

These two entries seem to share a video and comments section. I don't know why this happened, but my best guess is that this was a double post and they wanted to merge the comments sections.

https://www.giantbomb.com/videos/quick-look-puyo-puyo-tetris/2300-11999/

https://www.giantbomb.com/videos/quick-look-puyo-puyo-tetris/2300-9838/

The Shortest Video Title (according to the API):

Shirts!

The Longest Video Title (according to the API):

We Can't Even Remember if We're Supposed to be Working Today, So Here's a Week-Old Lightning Returns: Final Fantasy XIII

The Longest Video (according to the API):

Extra Life: 2017 - Alex Navarro

Number Of Seconds Of Video On The Site (according to the API):

28207774
