IMDb title.basics.tsv.gz dataset missing genres

  • 4
  • Problem
  • Updated 2 months ago
  • Solved
The IMDb title.basics.tsv.gz dataset is missing genres: the genres column shows only \N
https://www.imdb.com/interfaces/

JEAN LIU

  • 5 Posts
  • 5 Reply Likes

Posted 2 months ago

  • 4

ACT_1 ⚠️

  • 4813 Posts
  • 6450 Reply Likes
  
JEAN LIU
Joined community on July 12, 2020 - new today
- - -  
Data missing from published IMDB database
https://getsatisfaction.com/imdb/topics/data-missing-from-published-imdb-database
  
Please note that regarding the database files published here:
https://datasets.imdbws.com/
The genre data is missing from the title.basics.tsv.gz file.
  
Philip Persson
Joined on July 11, 2020
Posted July 11 2020
.
(Edited)

Philip Persson

  • 4 Posts
  • 4 Reply Likes
This reply was created from a merged topic originally titled Data missing from published IMDB database.

Please note that regarding the database files published here: https://datasets.imdbws.com/

The genre data is missing from the title.basics.tsv.gz file.

Joel, Official Rep

  • 1241 Posts
  • 1761 Reply Likes
Hi there,

Thanks for the post.

I've filed a ticket with the relevant team to look into this for you.

Thanks,

Joel 

Joel, Official Rep

  • 1241 Posts
  • 1761 Reply Likes
Hey again Jean,

Similar to the other conversation, I'd recommend downloading this file again and checking whether the problem is still present.

Sometimes there are problems with data generation, which may be what you are seeing.
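
For anyone who wants to check a fresh download, here is a minimal Python sketch that counts how many rows have \N in the genres column. The function names `count_missing_genres` and `count_in_file` are made up for this example, and the sample rows are taken from ones quoted elsewhere in this thread; it assumes the file is tab-separated with a header row whose last column is named genres.

```python
import csv
import gzip
import io

MISSING = r"\N"  # placeholder the dataset uses for an absent value

def count_missing_genres(lines):
    """Return (total_rows, rows_with_missing_genres) for a TSV stream with a header."""
    # QUOTE_NONE: treat quote characters in titles as ordinary text
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    total = missing = 0
    for row in reader:
        total += 1
        if row["genres"] == MISSING:
            missing += 1
    return total, missing

def count_in_file(path):
    """Same count, reading directly from a gzip-compressed file on disk."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as fh:
        return count_missing_genres(fh)

# In-memory sample: header plus three rows quoted in this thread
HEADER = ["tconst", "titleType", "primaryTitle", "originalTitle",
          "isAdult", "startYear", "endYear", "runtimeMinutes", "genres"]
ROWS = [
    ["tt10925128", "tvEpisode", "Episode dated 5 September 2019",
     "Episode dated 5 September 2019", "0", "2019", MISSING, MISSING,
     "News,Reality-TV,Talk-Show"],
    ["tt1092512", "tvEpisode", "Imêji", "Imêji", "0", "2007", MISSING, "24",
     "Animation,Comedy"],
    ["tt10925130", "tvEpisode", "Episode #1.10", "Episode #1.10", "0", "2014",
     MISSING, MISSING, MISSING],
]
SAMPLE = "\n".join("\t".join(r) for r in [HEADER] + ROWS) + "\n"

total, missing = count_missing_genres(io.StringIO(SAMPLE))
print(f"{missing} of {total} rows have no genres")
```

On a healthy file some rows will still show \N (not every title has a genre), so the thing to look for is a count equal or close to the total. To run it against a real download, call `count_in_file("title.basics.tsv.gz")` with the path to your local copy.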

Let me know!

Cheers,
Joel 

Phil G

  • 228 Posts
  • 500 Reply Likes
Joel, this issue has come up many times before, for example here (thanks to staff for completely ignoring my posts there, by the way), and I've seen it raised here repeatedly on an almost-regular basis every couple of months.

On all previous occasions, the "solution" seems to have been "ignore it and hope it sorts itself out the next time the datasets are rebuilt", without making any effort at all to identify and fix the underlying problem that causes this. Is that correct? Is there any point in us reporting it when this recurs in future?

Joel, Official Rep

  • 1241 Posts
  • 1761 Reply Likes
Hey Phil,

Thanks for the post, and sorry to see the other thread was closed without an answer. I'm not sure why that happened, so I'll follow up with the agent.

I can understand that the repetitive answer about these datasets sounds fairly dismissive. I've passed your feedback on to the dataset team to look into.

Cheers,
Joel 

JEAN LIU

  • 5 Posts
  • 5 Reply Likes
Hey Joel,

Thanks for the update; the new dataset looks good! From Phil's comments, though, it seems like a recurring problem that the IMDb team could perhaps pay more attention to.

Also, I hate to diverge from the topic, but while checking further into the data I noticed another interesting issue: many titles whose "tconst" should end in 9 appear to have the final "9" truncated.

Here are a few examples:
  • The correct tconst for Imeji should be tt10925129 not tt1092512, and
  • The correct tconst for Ikinari wa kawarenai should be tt10925119 not tt1092511

tt10925128 tvEpisode Episode dated 5 September 2019 Episode dated 5 September 2019 0 2019 \N \N News,Reality-TV,Talk-Show
tt1092512 tvEpisode Imêji Imêji 0 2007 \N 24 Animation,Comedy
tt10925130 tvEpisode Episode #1.10 Episode #1.10 0 2014 \N \N \N

tt10925118 tvEpisode Episode #1.7 Episode #1.7 0 2014 \N \N \N
tt1092511 tvEpisode Ikinari wa kawarenai Ikinari wa kawarenai 0 2007 \N 24 Animation,Comedy
tt10925120 tvEpisode Episode dated 12 May 2010 Episode dated 12 May 2010 0 2010 \N \N News


Thanks,
Jean

Joel, Official Rep

  • 1241 Posts
  • 1761 Reply Likes
Hey again Jean,

Thanks for getting back in touch.

I think this is a misunderstanding of how the data is presented.

If we look at Imeji on IMDb, you can see its tconst is actually tt1092512 and not tt10925129.

Similarly, if we look at Ikinari wa kawarenai on IMDb, the tconst is tt1092511 and not tt10925119.

I'm guessing this assumption comes (understandably) from the title's position in the file (i.e. between tt10925128 and tt10925130), but that's actually just because the consts are ordered as text, not as numbers.

I hope this helps!

Cheers,
Joel 

Dan Dassow, Champion

  • 17172 Posts
  • 19619 Reply Likes
Joel,

Put another way, the tconst values are character strings, not numbers. Character strings are sorted in alphabetic (i.e. lexicographic) order, not in numeric order.
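
To make the text-versus-number distinction concrete, here is a small Python sketch. It sorts three tconst values quoted in this thread, plus one made-up ID (tt1092513, chosen purely for illustration), first as text and then by numeric value. The helper name `numeric_key` is hypothetical.

```python
# tconst values from this thread, plus tt1092513 (made up for illustration only)
tconsts = ["tt10925130", "tt1092513", "tt10925128", "tt1092512"]

def numeric_key(tconst):
    """Hypothetical helper: drop the 'tt' prefix and compare the rest as an integer."""
    return int(tconst[2:])

text_order = sorted(tconsts)                      # character-by-character comparison
numeric_order = sorted(tconsts, key=numeric_key)  # comparison by numeric value

print(text_order)
print(numeric_order)
```

In text order the nine-character tt1092513 lands between the ten-character tt10925128 and tt10925130, because the comparison stops at the first differing character. In numeric order both nine-character IDs come before the ten-character ones.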

JEAN LIU

  • 5 Posts
  • 5 Reply Likes
Got it, that makes sense. Thanks Joel and Dan! :)

Vylmen

  • 95 Posts
  • 120 Reply Likes
Are you saying tt1092512 includes a trailing space that is significant in sorting? Otherwise, lexicographically, it doesn't belong between two longer strings.
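
One possible explanation, offered only as a guess since nothing in this thread says how the file is actually sorted: if whole rows are compared and the tab separator is ignored (as some locale-aware collations do with whitespace), the first letter of the next column takes part in the comparison. The Python sketch below imitates that with a hypothetical key function, `ignore_tabs`, applied to shortened versions of the rows quoted above.

```python
# Shortened rows (tconst and titleType only) based on the ones quoted in this thread
rows = [
    "tt10925130\ttvEpisode",
    "tt1092512\ttvEpisode",
    "tt10925128\ttvEpisode",
]

def ignore_tabs(row):
    """Hypothetical sort key: compare rows as if the tab separators were absent."""
    return row.replace("\t", "")

# Keep only the tconst of each row after sorting, for easier reading
plain_order = [r.split("\t")[0] for r in sorted(rows)]
tabless_order = [r.split("\t")[0] for r in sorted(rows, key=ignore_tabs)]

print(plain_order)    # plain character order: the short ID comes first
print(tabless_order)  # tabs ignored: the short ID falls between the two long ones
```

With tabs ignored, the "t" of tvEpisode is compared against the "8" of tt10925128 and sorts after it, which reproduces the position seen in the file. This is an assumption about the sort, not a confirmed description of how the dataset is generated.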

Philip Persson

  • 4 Posts
  • 4 Reply Likes
The genres are there now!  Thank you!  :-)