IMDb Data – Now available in Amazon S3

  • 5
  • Announcement
  • Updated 2 weeks ago
  • (Edited)
This is an announcement for customers of the IMDb bulk data available via FTP.

We are pleased to announce that, starting today, IMDb datasets are available in Amazon S3 via an HTTPS link. Using the new interface, customers can bulk-access IMDb title and name data.

For details on the S3 solution, file format and access guidelines, see www.imdb.com/interfaces.


In our continued effort to best serve our Contributors, we are streamlining the datasets and making them available in a more useful and structured format in S3. Notably:


  • Data refresh frequency is now daily (previously weekly).
  • IMDb title and name identifiers are included in all the files for ease of matching and linking back to IMDb.
  • The files are in tab separated values (TSV) format.
  • The sets of data we provide are updated to only include the essential ones that help with matching and linking to an IMDb title or name.
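
As a rough illustration of the TSV point above, here is a minimal Python sketch of reading such a file. The column names (`tconst`, `primaryTitle`, `startYear`), the `\N` missing-value marker, and the sample rows are assumptions made for the example; the actual layout is described at www.imdb.com/interfaces.

```python
import csv
import io


def read_tsv(stream):
    """Yield one dict per data row from a tab-separated stream with a header row."""
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Assumption: a literal backslash-N marks a missing value.
        yield {key: (None if value == r"\N" else value) for key, value in row.items()}


# Hypothetical sample data standing in for a downloaded file.
sample = io.StringIO(
    "tconst\tprimaryTitle\tstartYear\n"
    "tt0068646\tThe Godfather\t1972\n"
    "tt0000000\tHypothetical Title\t\\N\n"
)
rows = list(read_tsv(sample))
```

For a real file, the same function could be given an open file handle (for example from `gzip.open(path, "rt", encoding="utf-8")` if the download is compressed) instead of the in-memory sample.
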
As part of housekeeping the FTP site, the data files will no longer be updated. The list data files will continue to be available at two locations (see below) until February 28, 2017. We strongly encourage FTP site users to switch to the S3 solution as soon as possible to ensure their applications continue to work without interruption.

ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/frozendata

ftp://ftp.fu-berlin.de/pub/misc/movies/database/frozendata

 If you are not an IMDb Contributor and wish to obtain IMDb content for commercial use, we offer a content license.  The license grants you access to our content via an XML web service, plus the right to use the content in your product or service.  If that interests you, please email licensing@imdb.com.

 If you have any questions or concerns, please share your feedback in this thread.

 Thank you for your continued support.

sv, Official Rep

  • 31 Posts
  • 18 Reply Likes

Posted 1 year ago

  • 5

valen

  • 3 Posts
  • 8 Reply Likes

This is a braindead idea. From someone who doesn’t understand (or doesn’t want to understand) why IMDb came to be the number one site for movie & TV data.

I’ll explain.

Countless Contributors offered their time, for free (and time is more valuable than money, once it’s gone, it’s gone, one cannot recover it, unlike money), during thousands of hours, spanning decades to make sure IMDb’s data was correct and up-to-date. Imagine how many employees, man-hours, health benefits and all the rest that IMDb saved since the start... Just imagine...

Then, once the data is there -- good, verified, and perfect -- they sell it. Multiple times. Many, many times. More and more as time goes by. There are many ways they can profit from it (direct and indirect). Even when the access is free there are valuable tie-ins (you can buy the movie from Amazon for instance, movie theaters, merchandising) where money can be made. All of this is fine. Really. It’s a business. It is data, and so it can be “resold” (because it is better than the data other companies can offer) over and over, until infinity. It’s a good consequence of being just data and not something physical (that you cannot sell more than once).

Summarizing, until now: many, many people give away their time, for free, to make good data for IMDb, which in turn makes ample use of it, as many times as it wants, to make money.

Now picture this. You know those people, who give away their time to make something for free for you that you can make money from later, as many times as you want? Here’s an idea, why not make them pay too? Right? Neat idea, isn’t it? They gave their time away so they should give away their money as well, right? Right.

A lousy analogy would be an airline company making the pilots and crew buy tickets to be able to board the plane they are supposed to take to its destination. They make the same trip, right? Why shouldn’t they pay, right? It makes perfect sense, right? Right.

The point is: it’s not that they will lose money if they don’t charge Contributors: that can be done in many, many other ways. It’s not that they need to do this. It’s that they want to do this.

I want to leave here some rhetorical questions that boggle the mind.

1 – Why should a Contributor (any Contributor) keep contributing after this? Why should anyone want to? Why should they not contribute their time where it is really appreciated?

2 – Will you, from now on, start to pay the Contributors for their contribution? Because every coin has 2 sides. Because you cannot have it both ways, you cannot have your cake and eat it too: if you charge Contributors it means you say the information (which they provided in the first place) is valuable and so it should be remunerated. Or you don’t charge them for it, because you also got it from them for free. As I said, it can't be both. Because we all have this thing called little grey cells (borrowing from Poirot here)...
And if you retort: “It’s too difficult to pay, because it’s complicated accounting, because...” and so on and so on... I’ll give you back your own “solution”. The “unnatural” solution you are trying to force-feed to people right now, but this one actually makes perfect sense. Also a simple solution. For every, say, 10 contributions, give each Contributor an Amazon Gift Card with a token value of, say, 1 Dollar/Euro/Pound. There. No need to thank me.

3 – Why did you think you could do this and everyone would be fine with it? Why do you think it’s okay to ask for money from the ones that help you most and they would be fine with it and everything would be fine afterwards? And don’t give me the “added value” line because it doesn’t pass the smell test: you cannot add value if you take away loads and loads of data. Just can’t.

4 – Why is this so rushed and quiet and through the summer (I bet many aren’t even aware of this)?

As I said, rhetorical questions.

Braindead, I say.


(Edited)

Vincent Fournols

  • 759 Posts
  • 1025 Reply Likes
I wish that Amazon's answer to your questions (which I support, along with your position) is as effective and swift as their customer service... But this is probably the critical point: in a previous answer above, the Amazon "official" speaks of customers, when the people reacting here to this move are contributors, probably as thin as my little finger compared to a Saturday night crowd at the movies... It is all about balance of powers. Considering the weight IMDb has gained in the industry (at least on the US side), I am not sure that all the new data IMDb gets every week comes from disinterested contributors like us.

I have another question to Amazon: what would be the cost of maintaining the FTP files, say on a monthly basis instead of a weekly one? I would be ready to pay a (very reasonable) fee to keep access to this data. Especially if it is discounted when I contribute to the data feed and update, as proposed by Valen above. This would seem fair in your merchandization and monetization of what used to be a fantastic collaborative and open and free project, don't you think?

V.

valen

  • 3 Posts
  • 8 Reply Likes

Actually no. People do not understand that Contributors are the real "secret sauce" behind IMDb’s success. They (Contributors) really sell themselves short. Data will "rot" very easily, if it goes without constant verification and supervision. Think of it as a green lawn. Without proper and constant maintenance it would turn into a "jungle" of sorts in a short amount of time.

Somebody mentioned Wikipedia and it is a bit like that too. Without constant attention Wikipedia would quickly become unusable and worthless. And this change is even more braindead because now more than ever they will need to keep their Contributors happy because data input will grow more and more as time passes. And this is the time that they thought it would be a good time to show Contributors (which they'll need more and more of as time goes by) this (symbolical) huge middle finger. As plain common sense, one does not (should not) antagonize the ones that one depends on.

Another way of looking at it (which is also one of my favorites) is that IMDb is the longest running, most successful case of Crowd Funding ever (it's still going at it), anticipating that movement by many many years. And the reason it worked so well and so successfully, until now, is because of 2 things, IMO: 1. Instead of money (which people tend to have more difficulty parting with, curiously enough) it asked "only" for time (which people donate more easily even if it is more valuable); and 2. The "rewards" were also very simple and easily understandable, "you give us your time and we won't cut you off when you want to make use of the data, for your own private pleasure". There’s nothing simpler than that! And so, what do they want to do now? 1. "Give us your time AND your money too"; and 2. "Rewards? No. None of that. Here’s a reduced version of the data we chose for you IF you pay for it, like any other customer. But no rewards.". Genius!

You rightly pointed out that data is more and more coming from other interested parties. But that is only a small part of it. All data has errors, inconsistencies, typos, or it can be just plain wrong. "Dumping" data is just the first, small step. It's the eyes of the thousands (more than that) that little by little turn it into quality data. Elsewhere there was a comparison with bees but I see it more as an ant colony where each ant (Contributor) does just a little bit every time but that is essential for the success of the whole colony. And one ant is dispensable but if you take all the ants from the colony it'll disappear very soon. These thousands and thousands of micro-corrections forwarded by Contributors resulted in IMDb data today. And don't get me wrong, there are errors there now. There will always be errors there. Just not the same ones and their importance and scale will be smaller and smaller with time. Provided happy Contributors still do their work, as always.

If I can misquote from "Soylent Green": "IMDb is people!". In this case, Contributors. That is the reason it has the better (best) data. And you can quote me on that.

(Edited)

Terry Flynn

  • 4 Posts
  • 11 Reply Likes
I have no issue paying to download IMDb datasets for my own personal use.

What I do have a problem with is moving to a paid model that does not support the same 40+, frequently updated, datasets that have been provided for free for so many years. I think IMDb need to re-address this decision as it will affect people like myself. 

I download and insert all datasets into a MySQL database. This has allowed me to develop specific applications and just data mine (discovering new uses for the data). I've taught myself SQL from data-mining IMDb data.

Doc Magro

  • 1 Post
  • 2 Reply Likes
This matches what I do with it, except additionally I am moving toward mining the data for non-profit academic research. I agree the most disappointing thing is the extremely limited data this seems to be turning into.

On a curiosity note, I would love to connect with you and see what you are doing with the data.

Jeorj Euler

  • 3343 Posts
  • 3748 Reply Likes
With regard to use cases, a lot of things that cannot be achieved through the advanced search tools can be achieved by downloading files like certificates.list.gz, keywords.list.gz, and so on, and then processing the data found within them.

So, I'm now in the position of urging for a much more comprehensive data set to be made available or for the advanced search tools to be more advanced. In addition, basically, Luca Canali is right. Amazon/IMDb is taking things away from us. Whatever it is that we will be gaining in return remains to be seen.

Col Needham, Official Rep

  • 5956 Posts
  • 2918 Reply Likes
Official Response
Thanks for the feedback so far on this thread.  Please do continue to post and we will try to take as much as possible into account. This post answers some of the questions raised and there will be further updates based on the next round of feedback. 

On the S3 access issues, we now have a working prototype of a system which can make the same S3 data available to you via HTTP from IMDb directly without requiring any S3 registration and free from any possibility of AWS charges.  Please watch for an announcement as we convert this into production code. The only thing needed will be an ordinary IMDb user account attached to a valid email address.  We still intend to also make the data available via S3 for those people who find the AWS access tools more convenient and can stay within the free tier of AWS.    

On the general data availability, we are adding the AKA titles to the basic data set accessible to everyone.  Longer term, we are looking at the possibility of daily diff files for at least some of the data in the basic set. 

On the point about contributors, we are looking at extending the range of data available via the http solution based on your contribution history and volume. For top contributors and those people using the data to help us clean it via bulk corrections, this is likely to extend far beyond the current set of data even on the FTP site.  It is not our intention to deprive access to the data by those people who have genuinely helped to build it over the years and who want to continue to improve IMDb. We aim to also be able to grant specific permissions to specific customers for specific extra subsets of data as required on a case by case basis. This latter part may take some time to become a fully formed solution so please bear with us. 

The background to all of this is that there is a huge multi-year technology migration project which is nearing completion at IMDb. We have too many complicated old systems around which have been slowing the overall pace of development (I add a bit more detail to this on https://getsatisfaction.com/imdb/topics/why-doesnt-imdb-staff-ever-consult-with-the-contributor-base...).  The move to the new technology has been providing the opportunity to look at the way we operate different parts of the IMDb service.  One of the oldest software systems is the one which publishes the FTP data, and we will soon no longer even be able to generate the .list files once the final pieces of the old IMDb system are decommissioned; at least not without re-writing all of the publication software to connect to the new system and produce an extremely difficult to manipulate text file format which was designed 27 years ago and has not changed in 21 years. Instead, we decided that it would be better to publish the data via a modern system (S3 and soon over https) in a modern format which can be more easily parsed.  The other problem with FTP is that we have no idea how many people are using the data and for what purpose, nor do we know what additional things they may want from the data. From feedback over the years, we knew some of your requirements already, notably (a) access to the title and name constant data (b) an easier to parse format (c) information to help in matching other catalogs to IMDb (d) more frequent updates.  We found ourselves having to guess the remaining requirements until we decided the best way forward was to move the data to a new location within the FTP sites, post an announcement on Get Satisfaction (this thread) and then wait to gather feedback before replying and figuring out what steps to take next (this reply). 

We hope this helps.  We have plenty to be working upon in the meantime, and we will follow-up as we deliver parts of the above. 

Col
Founder & CEO, IMDb.com. 

Vincent Fournols

  • 774 Posts
  • 1045 Reply Likes
Thank you Mr Needham for this much welcome answer, which is somewhat reassuring.

V.

Jeorj Euler

  • 3343 Posts
  • 3748 Reply Likes
Thanks for your feedback, Col Needham. I'm pleased to know that you have big plans. I hope the things of the future wind up having more merits than demerits.

Gardner von Holt

  • 16 Posts
  • 28 Reply Likes
Thank you for responding; at least the timing and some of the motivation for this are clearer.
However, without the ability to answer the simple question "who is in this movie", the data is essentially useless to me.
Sad day for imdb, when it switches from "here is some awesome data, we are excited to see what you do with it" to "Convince us why we should let you see our data".
Reluctantly, I am switching to another data provider, themoviedb.org.

valen

  • 3 Posts
  • 8 Reply Likes

Sad day for imdb, when it switches from "here is some awesome data, we are excited to see what you do with it" to "Convince us why we should let you see our data"

 

You said it best, right there.

There are 2 main ways to "consume" IMDb data (files): 1 - The Static Mode: people already have a use for the data and go there and grab it for their usual needs; 2 - The Discovery/Explorer/Research Mode: people get some data, and then detect some patterns, get/try different data (files) and see some more patterns, get new ideas and connections, test theories, invent new uses for the data, detect errors/inconsistencies, get yet another batch of different files to retest other hypotheses...

While the first type of User is perfectly fine and has an important place (plus, those same Users could already have been, or have the potential to become, a type 2 User), the real "power" of (IMDb) data use is in the 2nd type.

IMDb wants to categorize/"lock" people in "typical", static/monolithic, use cases (which they can't, because there is no such thing when one gets access to such rich and diverse quality data; the richer the data, the more unlimited the possibilities) without understanding that the true power and possibility of the unimpeded access people enjoyed until now lies in exploration. This is also the mode where more errors and problems with the data are bound to be discovered and, thus, reported/corrected.

So, they are (re)enabling type 1 Users (which is good), but type 2 Users (where the true value, for both IMDb and the Users/Contributors, really lies) are shackled and constrained in typical use-case boxes that have little or no real use to them (because the most desirable "consumer" of data, for IMDb, is the one that has no [pre]set idea what she/he'll do tomorrow with that data; the ones that ask themselves "What if...?" and then go about checking that out). And these Type 2 Users are also the main Contributors (if not to IMDb directly, at least indirectly) to the Film/TV community. They make all of us appreciate and understand the interconnected nature of the art form. And that'll always return, in the end, to IMDb, in one form or another, because the more (quality) information and the bigger the community, the more people will turn to IMDb (because it is one of the best, most popular, places to know more). Curtailing, impeding, Type 2 users, in any way, is a substantial self-inflicted wound (the proverbial "shot in the foot").

In spite of building one of the most successful stories of data gathering/maintenance in history they seem to lack a basic understanding of how this was achieved, how the whole direct/indirect feedback loop works, how the ample availability of their (almost) raw data was like seed(s) for fertile ground(s) -- where they could/would reap the benefits, severalfold, later, down the line, directly or indirectly. This kind of decision seems to blissfully ignore how IMDb arrived at this point in time.

This is akin to the Captain of the Titanic failing to acknowledge the “invisible” 90% of the iceberg that lay below the waterline.

(Edited)

Nick

  • 3 Posts
  • 11 Reply Likes
I don't think I can or ever will understand or be able to wrap my head around this line of logic.

We created this awesome movie database, let's make an API.
We made this awesome API, let's give everyone access to all the data!
20 years pass.
Now that we've given everyone all the data for 20 years, and the API is old,
let's update it!
People can't be using ALL our data, right? No way that people are actually using all the data we provided them over the course of 20 years just because we can't see analytics for it. Not possible. Nope.
We'll just remove access to 9/10ths of the data then.
It's fine if you guys change the format and work to update the way it's parsed, remove redundancy etc. I am ALL for that, trust me. But plain and simple, not giving me something I already use is going to destroy so many projects and development ecosystems that I don't think we have a number that could go high enough to represent the amount you are killing by removing access to so much of the data. You're literally killing an un-ending amount of (infinity) projects.
(Edited)

Ron

  • 137 Posts
  • 63 Reply Likes
On the point about contributors, we are looking at extending the range of data available via the http solution based on your contribution history and volume.

We aim to also be able to grant specific permissions to specific customers for specific extra subsets of data as required on a case by case basis. This latter part may take some time to become a fully formed solution so please bear with us

Any update(s) on the above?

Manfred Polak

  • 5 Posts
  • 7 Reply Likes
That is a huge drawback for me. I'm not only a film buff, I also write articles about old and rare films for a non-commercial blog, and I do extensive research for that purpose. I've been using the IMDb list files for more than 15 years, and I download the updates almost every week (in the first years the diff files, then, since I had DSL, the list files). I wrote a script (which makes use of wget) that determines if new files are available, so the download is started automatically.
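
For readers wondering what such a "is there a new file?" check might look like, here is a minimal Python sketch. It is not the wget-based script described above; it assumes the server reports an HTTP-style modification date string, and the function name `needs_download` is made up for the example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def needs_download(remote_last_modified, local_mtime):
    """Return True if the remote file looks newer than the local copy.

    remote_last_modified: HTTP-style date string (e.g. a Last-Modified header value).
    local_mtime: POSIX timestamp of the local copy, or None if nothing was downloaded yet.
    """
    if local_mtime is None:
        return True
    remote = parsedate_to_datetime(remote_last_modified)
    local = datetime.fromtimestamp(local_mtime, tz=timezone.utc)
    return remote > local


# Hypothetical timestamps for illustration only.
older_copy = datetime(2017, 10, 25, tzinfo=timezone.utc).timestamp()
newer_copy = datetime(2017, 11, 5, tzinfo=timezone.utc).timestamp()
remote_date = "Wed, 01 Nov 2017 00:00:00 GMT"
```

In practice `local_mtime` could come from `os.path.getmtime(path)`, and the download itself would only be started when the function returns True.
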

And yes, I need ALL list files. Not every file every day, but every file from time to time. I'm using the program AMDbFront (don't look for it - it has disappeared from the Internet since its author didn't develop it any more) to convert the files into a MySQL database. AMDbFront is also the viewer for the data. I'm using it in GUI mode almost daily, but sometimes I make complex queries using SQL. One example: A few years ago, scientists from Northwestern University developed a method to determine automatically which are the most culturally significant films (the winner was THE WIZARD OF OZ), and they used the IMDb list files for that purpose (movie-links.list in particular). Here is their paper:

http://www.pnas.org/content/112/5/1281.full

I managed to reproduce their most important result (the long-gap citation count) with my local IMDb data, using a SQL query I wrote. The cited article only covers US films, but I used the method to create respective lists for many other countries, and I published my results in the above-mentioned blog.

I also wrote a script (in VBScript) which adds a table to the MySQL database that contains all films I have on DVD or Blu-ray. The table contains the title (exactly as it's in movies.list) and flags for seen/unseen, region code and short/long films. That information is taken from a text file I maintain for that purpose.  With appropriate SQL queries, I can answer questions like "how many short films from France from the 1930s do I have on DVD" or "who is the actor/actress with whom I have the most films on DVD"?
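
To make the kind of query described above concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than MySQL. The table names (`movies`, `owned`), their columns, and all sample rows are hypothetical and only stand in for the real database layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies (title TEXT PRIMARY KEY, year INTEGER, country TEXT);
CREATE TABLE owned  (title TEXT PRIMARY KEY, seen INTEGER, region INTEGER, is_short INTEGER);

-- Hypothetical sample rows; titles are keyed the same way in both tables.
INSERT INTO movies VALUES
    ('Hypothetical Short A (1932)',   1932, 'France'),
    ('Hypothetical Short B (1938)',   1938, 'France'),
    ('Hypothetical Feature C (1935)', 1935, 'France'),
    ('Hypothetical Short D (1951)',   1951, 'France');
INSERT INTO owned VALUES
    ('Hypothetical Short A (1932)',   1, 2, 1),
    ('Hypothetical Feature C (1935)', 0, 2, 0),
    ('Hypothetical Short D (1951)',   1, 2, 1);
""")


def count_owned_shorts(connection, country, decade_start):
    """Count owned short films from one country within a ten-year span."""
    row = connection.execute(
        """
        SELECT COUNT(*)
        FROM owned AS o
        JOIN movies AS m ON m.title = o.title
        WHERE o.is_short = 1
          AND m.country = ?
          AND m.year BETWEEN ? AND ?
        """,
        (country, decade_start, decade_start + 9),
    ).fetchone()
    return row[0]


result = count_owned_shorts(conn, "France", 1930)
```

With the sample rows above, only one owned title is both a short and a French film from the 1930s, so `result` is 1.
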

Well, all of this will become impossible with the new dataset format. I surely won't switch to it (even if it were free of cost), but I will freeze my installation at the current state. That's the lesser of two evils for me.

k247

  • 2 Posts
  • 2 Reply Likes
Thank you Mr Needham, this is indeed reassuring, as mentioned before.
The HTTP access should be a good alternative way to obtain the data, I guess.
And I'm sure that adding the AKA titles will help a lot of users, including myself.
Will the languages of the movies, by any chance, be included as well?
AKA titles are mainly important when dealing with non-English movies, but I think there is no way with the new data files to determine whether a movie is English or non-English.

Andre Paulino de Lima

  • 1 Post
  • 1 Reply Like
This reply was created from a merged topic originally titled Proper channel to request authorisation to extract and use information for resear....

I was wondering if there is a channel through which I could ask IMDb for authorisation to use IMDb movie synopsis data in my thesis.
I am aware of the 'IMDb Data – Now available in Amazon S3' announcement, but I was not able to find an interface that publishes movie synopses.

Your response is greatly appreciated. 

Best regards,

Normunds Kalnberzins

  • 1 Post
  • 2 Reply Likes
"In our continued effort to best serve our Contributors, we are streamlining the datasets and making them available in a more useful and structured format in S3" == "on a short notice we are destroying many projects and initiatives that were using our data, by withholding most of it and providing the rest in a new incompatible format"

sv, Official Rep

  • 31 Posts
  • 18 Reply Likes
Official Response
We are currently working on a solution for data access via HTTP endpoint as an alternative to the direct AWS S3 access. As part of this solution, we are also looking into a contributor exclusive solution, providing extended datasets based on a person’s contribution history and volume. Given these developments, we are postponing the shutdown of the IMDb FTP sites to November 7, 2017.

Please stay tuned for more updates.  Thanks!
(Edited)

Gardner von Holt

  • 16 Posts
  • 28 Reply Likes
My takeaway from Col Needham's recent post was that the primary issue standing between providing complete data and not was the significant development effort to make the data available from the new systems.

But now it sounds from this post like IMDB will have the logic in place to provide fuller datasets, and will provide them to contributors.

Now it appears to be purely that IMDB wishes simply to prevent the public from accessing the collected data for free.

This appears to no longer be about development efforts, rather it appears to be exclusively about cutting off free data. 

Frankly, IMDB needs to get its story straight.  If the data is in fact available, and  more complete data sets would be there for us ... if we provide IMDB with free (aka donated) labor, then the entire premise of Col's justification looks shaky .. at best.  

I hope you guys come to your senses, but I no longer believe we are being told the real story here, so I have now completed moving my data sources away from IMDB, and will donate both time and money elsewhere.

I've enjoyed using IMDB's data since well before it sold out to Amazon, and it's sad to see so many years of cooperation trampled by this ill-considered project.

Jeorj Euler

  • 3012 Posts
  • 3210 Reply Likes
Thanks for postponing the shutdown, IMDb staff. We can at least grant y'all that.
(Edited)

Jeorj Euler

  • 3012 Posts
  • 3210 Reply Likes
Due to my notification settings, I was informed: "Nobody liked your comment". That's kind of funny like a pun. Thanks, Nobody.

Gardner von Holt

  • 16 Posts
  • 28 Reply Likes
With the revised shutdown date now days away, I note that none of the promised updates have occurred.

Really a shame to see an organization lose its way.

Jeorj Euler

  • 3012 Posts
  • 3210 Reply Likes
I can foresee the IMDb contributor base potentially splintering into "patriots" and "loyalists". I may be no patriot, but the loyalists are really starting to disgust me.

TennisShoe

  • 1 Post
  • 1 Reply Like

I can understand why a transition needs to be made and that it's not easy to achieve parity in the new system. My use case is a little different from the others on this list so I thought I'd chime in.

I do economic research on the television industry, asking questions like how characteristics of production companies affect the quality of the show. To that end I use a bunch of data that's not included in the new subset on S3. AKA titles definitely, as mentioned by others; they are useful for matching across different datasets. The distributor and production company list files help me track which shows were on which networks as well as which were affiliated with each production company. Producer and writer lists let me, for example, connect a show with an Emmy-winning executive producer or creator. Lists like language and runtime help me screen out noise from the data, especially for shows that were not very popular and may be incorrectly labelled in other variables. And the full set of genres is important to capture all the shows in a category; if a show has more than 3 genres it may not be included if, for example, I try to understand what is relevant for ratings in comedies. For ratings, knowing the distribution of ratings is useful to understand how targeted a show was.

Anyway I hope that is useful. Happy to expand on this more if it would help with your triaging of features.

(Edited)

Brian Risselada

  • 28 Posts
  • 29 Reply Likes
I appreciate that Col and sv are stating that they want to hear from us, our reactions to this, and how we have been using the data, so that is why I am sharing this.

I just checked and I have submitted 1,014 updates to IMDB, which I suspect would appear to be a pittance compared to the top contributors, but I hope it is still evidence that I care about the completeness and accuracy of the IMDB database.

One of my favorite things on the IMDB site is the advanced search, but I wanted to do so much more and I wanted to integrate the data with my own custom algorithms and personal logs. I discovered the .list files provided by IMDB and this has been my basis for an exciting adventure. I am not a professional IT person or programmer. But to do what I wanted I taught myself Access and eventually SQL and Python and Django. I only download the files again about every year because I don't watch a lot of new films but I love adding and updating the data for older and more obscure films.

I'll admit that when I first started, even though I didn't have any database or programming experience, I found the .list files seemed antiquated, but I wrote my own programs to extract the data and place them into an SQL database to work the way I wanted it to. So I am kind of glad that IMDB is moving to some new structures that will hopefully be easier to use even though it will probably mean countless hours for me to rewrite a lot of my program for getting the updated data into my database.

The thing I am most concerned about though is making sure all of the data is still available. I use almost all of it, and what is currently in the S3 files would not be worth me using anymore. So I am glad that it has been indicated that there may be means for the rest of the data to be available still. I will list the information below that I use and how I use it.

What I use the most:
Movie (with stats like release year, number of votes, and rating)
AKA
Company (production and distribution)
Country of production
Genre
Keyword
Language spoken
Person (all persons of all roles, and all of their credits on all movies, but most of all directors)
Running time

What I also use often:
Aspect ratio
Certificate
Cinematographic process
Color
Film negative format
Printed film format
Location
Movie connections
Release date

What I also use but rarely:
Camera
Film length
Laboratory
Sound mix

Also I used ALL OF THE ATTRIBUTES of these items

All of this is strictly for personal use. It is primarily for me to log films I've seen and track data about them and to run interesting queries in the database to find interesting results and patterns within the database regarding connections between all of these stats listed above. However it also can result in me identifying information that needs to be updated in the IMDB database that I can then submit updates for. I have had thoughts before about some day creating something available for public use, but that may be a pipe dream, and if I did ever do it I would of course pay for the license to do so.

So I hope that through this transition all of this information will be made available in a complete and easy way for contributors like me who wish to have the information.

One other thing I'd like to request: the most useful thing you could add to the datasets would be, for both movies and people, the IMDB ID used in the IMDB URL. For instance, in the movie dataset the entry for "The Godfather" would list the ID tt0068646, which corresponds to its webpage at http://www.imdb.com/title/tt0068646/. Likewise, the person Alfred Hitchcock would have the ID nm0000033, which refers to his page at http://www.imdb.com/name/nm0000033/
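
To illustrate the linking described above, here is a minimal sketch that turns an identifier into its page URL. The function name `imdb_url` is made up for this example, and it assumes that title IDs always start with `tt` and name IDs with `nm`, as in the two examples given.

```python
def imdb_url(imdb_id: str) -> str:
    """Build the IMDb page URL for a title (tt...) or name (nm...) identifier."""
    # Assumption: only these two prefixes matter, as in the examples above.
    prefixes = {"tt": "title", "nm": "name"}
    kind = prefixes.get(imdb_id[:2])
    if kind is None or not imdb_id[2:].isdigit():
        raise ValueError(f"unrecognised IMDb identifier: {imdb_id!r}")
    return f"http://www.imdb.com/{kind}/{imdb_id}/"
```

With this, `imdb_url("tt0068646")` gives the page for "The Godfather" and `imdb_url("nm0000033")` the page for Alfred Hitchcock.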

Thank you for considering my situation. 
Photo of Jonathan Yoni Revah

Jonathan Yoni Revah

  • 4 Posts
  • 8 Reply Likes
The switch, in its current implementation, is horrible.

IMDb is a community-driven website that relies on the mass of users for nearly everything, from reviews to ratings, to episode and movie release dates, yet somehow most of those things are missing from the data dumps. You owe it to the community to give back, and 'complete' data dumps in the form of an S3 bucket where the devs pay for bandwidth are the least you could do.

The tens of thousands of users that left reviews or ratings didn't do so for the benefit of a corporation. We contribute information to large repositories like Wikipedia or IMDb because we want people to have access to it, and we do so hoping that the gatekeepers will do their best to keep it all out there and easily available...but instead you guys have gone the opposite way. Everything needs to be accessed through your interfaces or apps, what you do give back is anorexic in comparison to what you take, and yet you still rely on users to feed you information for your business model to even work...

I urge you to seriously reconsider this philosophy or at the very least have a moment of honesty with the developer community and explain yourselves better. There is no reason to have omitted all of this information and I'm starting to think that there is also no reason to contribute or rely on your website.

You guys have spent the past 30 years harvesting your users for data while providing decent dumps of your database, and now that we've all learned to rely on you guys, you're taking that away. Take a page from Google: "Don't be evil".
(Edited)
Photo of Chris H.

Chris H., Official Rep

  • 48 Posts
  • 65 Reply Likes

Hello,

Thank you for your continued feedback. As we review your feedback and work on the HTTP solutions described by Col and sv above, we are revising the shutdown date of the IMDb FTP sites to December 28, 2017.  

More updates will be provided closer to that time. Thank you.

Chris
Photo of Jeorj Euler

Jeorj Euler

  • 3304 Posts
  • 3701 Reply Likes
Sounds good, I guess... for now.
Photo of Jeorj Euler

Jeorj Euler

  • 3304 Posts
  • 3701 Reply Likes
WTF?! Time is running out!
Photo of Brian Risselada

Brian Risselada

  • 28 Posts
  • 29 Reply Likes
Yes, I'd also like to hear more updates. Thank you.
Photo of Vincent Fournols

Vincent Fournols

  • 759 Posts
  • 1026 Reply Likes
Count me in as well...
Photo of Chris H.

Chris H., Official Rep

  • 48 Posts
  • 65 Reply Likes
Hello,
     We are still on track for the FTP sites to stop being updated after December 28. Before that date I will share information about the datasets published on S3, addressing the feedback we have received about requiring an S3 account to access those datasets.
Thank you for your continued patience,
Chris.
Photo of Gardner von Holt

Gardner von Holt

  • 16 Posts
  • 28 Reply Likes
I have been watching these comments, and I was hoping you would address the failures with the actors file.  

You haven't updated the actors file on the ftp site(s) in almost three months now.  It still has a date of September 22nd.   Were you guys aware of that?

Will there be any opportunity at all to get a single consistent picture of the last data handoff before you go live? 

And are you seriously saying you are making significant changes to your operational systems between Christmas and New Year's day?  And your operations staff is on board with this plan?

What possible motive is there for not saying which of the next few days:  Dec 20, 21, 22, 26, or 27 you will be communicating?  Those are the only 5 days that are not holidays or weekends.

If you tried to schedule a riskier date in the entire year to convert, or one more likely to cause problems, I think you have chosen the second worst date, with the 29th also being a Friday before a 3 day holiday weekend during Christmas-New Years being slightly worse.
(Edited)
Photo of Jeorj Euler

Jeorj Euler

  • 3304 Posts
  • 3701 Reply Likes
It's going to be a busy next two weeks.
Photo of Gardner von Holt

Gardner von Holt

  • 16 Posts
  • 28 Reply Likes
Not so much for me. I've moved my data sources over to tmdb. As far as I can tell I mostly lose minor uncredited roles plus the MPAA rating; everything else I used I have via APIs rather than parsing a download file, so I'm happy.

I still harbor resentment that imdb made me do this, and is so uncommunicative.  It is clear that imdb has no respect whatsoever for people using these files, their unwillingness to communicate their plans speaks volumes for the disregard they have for users here.

You would think a company that built its data on the back of volunteers and public submission of data would see the value in giving back, but it appears they are in full amazon mode now.  Petty, overworked, and customers are $.

I wonder whether Col would have done it the same again, if he could do it over.  He's probably covered by a non-disparagement, so I doubt he can really respond as he might like to.

Anyway, failing to provide the promised data, and withholding the information about what is changing next Tuesday is pretty much par for the course for the new Amazon.
Photo of Jonathan Yoni Revah

Jonathan Yoni Revah

  • 4 Posts
  • 8 Reply Likes
So the data dumps in their current state...is that final too? It's missing a ton of information we've been relying on.

@Gardner von Holt
I've nearly moved over to tmdb but it's been a bit complicated trying to get data dumps from them (though possible). I was wondering, are you recreating their entire database or just querying as needed? I'm writing a script to download and update a database in-place for tmdb and thought I'd check if others are doing it too.
(Edited)
Photo of Gardner von Holt

Gardner von Holt

  • 16 Posts
  • 28 Reply Likes
No, the API actually models my use case far better: as I watch a movie I have not seen before, I make API calls and add a small subset of the new movie's data to my local database.
Photo of Vincent Fournols

Vincent Fournols

  • 759 Posts
  • 1026 Reply Likes
It sounds like a very interesting use case. Thanks for the feedback (I just need to learn to master APIs!)
Photo of Vincent Fournols

Vincent Fournols

  • 774 Posts
  • 1045 Reply Likes
This reply was created from a merged topic originally titled Flat files/data sets availability ?.

Hi,

Could some IMDb staff confirm when the last batch of flat files/data sets will be issued and made available on the FTP sites?

The Berlin one still mentions 2017-09-10, a Col Needham post mentions November 7, and the latest set available was issued on November 24...

Thank you in advance for some clarification of the roadmap!

V.
Photo of Vincent Fournols

Vincent Fournols

  • 774 Posts
  • 1045 Reply Likes
Unless mistaken, the FTP site at the Freie Universität Berlin is now dead (it prompts me for a username/password).
I received no answer to my question above (I was only told that the question was forwarded to the "FTP team").

That is the end of a 27-year-old story, which started at the University of Cardiff, building the biggest film-related database with contributions from all over the world, but in return, making the data available on those FTP servers for free.

Col N. repeated still recently that they are working on a http solution, on top of the S3 service proposed on Amazon infrastructure, which is rather complex, above all is not free and last but not least: partial.
And now FTP is closed.

Hey you guys, obviously, you have a vacant project manager position (I mean: a real IT project manager). I can make myself available if you wish.

Otherwise, sayonara, IMDb.
Photo of Gardner von Holt

Gardner von Holt

  • 16 Posts
  • 28 Reply Likes
Vincent, I am able to access both the German and Finland ftp sites still as of 8-Dec from the USA
Photo of Vincent Fournols

Vincent Fournols

  • 774 Posts
  • 1045 Reply Likes
Thank god I started my post with "Unless mistaken"!!
But I confirm, from France, I am prompted for a user/pwd on Berlin and the connection with the Finnish site keeps getting reset.
I will give it another try later.
(Edited)
Photo of Robert McGirr

Robert McGirr

  • 2 Posts
  • 0 Reply Likes
I have been accessing the Amazon S3 for a few months and am now getting "Access Denied" from the S3 bucket, started this week. I see no one upset here, is this just me? How do I resolve this?
Photo of Chris H.

Chris H., Official Rep

  • 47 Posts
  • 61 Reply Likes
Hello Robert,

    Is there specific data or tables that you are seeing this issue with?

Can you access earlier versions of the tables by using an earlier date in the path?

To help us track down if something has changed, can you tell us the exact date when you stopped being able to access the datasets on S3?

Many thanks,
Chris.
Photo of Robert McGirr

Robert McGirr

  • 2 Posts
  • 0 Reply Likes
I found the issue. It started last week intermittently. I was getting all of the object keys in a list and downloading those containing "current". There is a 1000 object limit on the ListObjectRequest and I suppose you recently went over that limit. I added a prefix to my request and now get a list of objects with "documents/v1/current/" as prefix.
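
For anyone hitting the same limit, here is a hedged sketch of listing with a prefix and following the continuation tokens, so that results beyond the first page are not silently dropped. It is not the code used above: it assumes a boto3-style S3 client (`list_objects_v2` with `Prefix`, `ContinuationToken` and `RequestPayer`), and the helper name `list_keys` is made up for this example.

```python
def list_keys(client, bucket, prefix):
    """Return every object key under `prefix`, one page at a time."""
    keys = []
    # RequestPayer is included on the assumption that the bucket is
    # 'Requester Pays', as discussed elsewhere in this thread.
    kwargs = {"Bucket": bucket, "Prefix": prefix, "RequestPayer": "requester"}
    while True:
        page = client.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if not page.get("IsTruncated"):
            return keys
        # More results remain: ask for the next page.
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
```

With boto3 installed, usage would look like `list_keys(boto3.client("s3"), "<bucket-name>", "documents/v1/current/")`, where the bucket name is a placeholder to fill in.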

Thank you for your response.

Photo of timon

timon

  • 8 Posts
  • 7 Reply Likes
"This more robust and reliable solution will replace the IMDb FTP sites, which will be retired on December 28, 2017."

How about actors.list.gz? It has not been updated since Sep 22, 2017.

-timon
Photo of Jeorj Euler

Jeorj Euler

  • 2977 Posts
  • 3114 Reply Likes
They're making off like bandits with a whole lot of customer-generated content.
Photo of timon

timon

  • 8 Posts
  • 7 Reply Likes
Who owns this content? I have not read any "small print" carefully (at all).
Photo of Chris H.

Chris H., Official Rep

  • 48 Posts
  • 65 Reply Likes
Official Response
Thank you for the continued feedback.

Earlier in this thread, Col referred to a prototype of a system which can make the same S3 data available to you via HTTP from IMDb directly without requiring any S3 registration and free from any possibility of AWS charges. This system will require an ordinary IMDb user account attached to a valid email address. However, this system is not yet quite ready for production, so to help address some of the concerns raised about the 'Requester Pays' access via S3, today we activated an https entry point to provide access to the basic datasets. This https location is here: https://datasets.imdbws.com/ and the page http://www.imdb.com/interfaces/ has been updated with this information.

We are finalizing the extended datasets and access model and I will post an update about that as soon as it is ready. 

The final build of the data that gets published to the FTP mirrors occurred yesterday so those mirrors contain the final FTP snapshot. While the data on the FTP servers will not be updated going forward, we will not remove the data for at least the next few weeks so people who need that data can still download it.
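
As a rough sketch of fetching a file from the https entry point with nothing but the Python standard library: this assumes that the dataset files sit directly beneath https://datasets.imdbws.com/ under names such as title.basics.tsv.gz (a file name mentioned later in this thread). The helper names `dataset_url` and `download` are made up for this example.

```python
import shutil
import urllib.request

BASE_URL = "https://datasets.imdbws.com/"


def dataset_url(name: str) -> str:
    """Return the assumed download URL for a dataset file name."""
    return BASE_URL + name


def download(name: str, dest: str) -> None:
    """Stream one dataset file to `dest` without loading it all into memory."""
    with urllib.request.urlopen(dataset_url(name)) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
```

For example, `download("title.basics.tsv.gz", "title.basics.tsv.gz")` would save that file into the current directory, provided the assumed URL layout holds.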
 
Photo of Vincent Fournols

Vincent Fournols

  • 688 Posts
  • 876 Reply Likes
A first quick look at title.basics.tsv.gz is very promising, and much more satisfactory than the former FTP offer.
Could you please state the charset used to publish the text files?
Photo of Chris H.

Chris H., Official Rep

  • 48 Posts
  • 65 Reply Likes
Hello Vincent,

    The files are in the UTF-8 character set. I have pushed out an update to the http://www.imdb.com/interfaces/ page to add that to the file details section.

Best regards,
Chris.
Photo of Owen Rees

Owen Rees

  • 157 Posts
  • 156 Reply Likes
My quick look at title.akas.tsv.gz suggests it is UTF-8 - tt0000008,5 is Cyrillic and looks sensible, emacs reports that the underlying file was UTF-8 and displays the characters correctly.

That Cyrillic aka title does not appear in the aka-titles.list.gz file on the FTP site.

It looks to me as if the new system is including data that the old system could not handle.
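
Given that the files are gzipped, tab-separated and UTF-8, a minimal reading sketch in Python could look like the following. The function name `read_tsv_gz` is made up for this example, and quoting is switched off on the assumption that quote characters in titles are literal text rather than CSV-style quoting.

```python
import csv
import gzip


def read_tsv_gz(path):
    """Yield one dict per row of a gzipped, UTF-8, tab-separated file."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: treat any double quote in a title as ordinary text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row
```

Each row is keyed by whatever the header line of the file contains, so the same function can be pointed at any of the .tsv.gz files.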
Photo of Manfred Polak

Manfred Polak

  • 5 Posts
  • 7 Reply Likes
The latest and final FTP snapshot is online now. But once again there is no updated actors.list! It's still from September 22. That's very annoying and embarrassing - 3 months should have been time enough to fix this problem. At least the final dump should contain recent files only.

By the way, on the German FTP server there is a new directory "frozendata". The files are now available both in
ftp://ftp.fu-berlin.de/pub/misc/movie... and
ftp://ftp.fu-berlin.de/pub/misc/movie..., but I guess that in the long run, only "frozendata" will remain.
Photo of lucacanali

lucacanali

  • 1 Post
  • 2 Reply Likes
Thanks for providing the HTTPS interface. It's really appreciated.

Something I don't understand is about the genre information. For all the movies that have more than 3 genres, only the first 3 in alphabetical order are reported.
For example for "Dunkirk" (tt5013056) the War and Thriller ones are omitted.
And Dunkirk really is a "War" film... I cannot get the reason for this limitation.

In fact, also imdb.com has this problem. The top bar lists only 3 genres, but then in the "Genres:" tag has all of them.

It would also be really great if you could add the "Country" and the "Keywords" information. This info is really important to categorize movies. Without it I have to resort to scraping your site, which is something I would like to avoid.

Thanks,
Luca
Photo of Terry Flynn

Terry Flynn

  • 4 Posts
  • 11 Reply Likes
This is my first look at the data IMDb are now providing and I'm extremely disappointed. Maybe I've just been spoiled for the last 20 years with the depth and quality of data IMDb have historically provided. Sad to say, I foresee lots of screen scraping in my future :-(
Photo of Col Needham

Col Needham, Official Rep

  • 5951 Posts
  • 2908 Reply Likes
From our 19 August update:

"On the point about contributors, we are looking at extending the range of data available via the http solution based on your contribution history and volume. For top contributors and those people using the data to help us clean it via bulk corrections, this is likely to extend far beyond the current set of data even on the FTP site.  It is not our intention to deprive access to the data by those people who have genuinely helped to build it over the years and who want to continue to improve IMDb. We aim to also be able to grant specific permissions to specific customers for specific extra subsets of data as required on a case by case basis. This latter part may take some time to become a fully formed solution so please bear with us. "
Photo of Jonathan Yoni Revah

Jonathan Yoni Revah

  • 4 Posts
  • 8 Reply Likes
This doesn't answer the real question that is, why is information being walled off? IMDB's role up until now is that of a repository built up by millions of contributors. Have you guys run out of ways to monetize the data?

It doesn't matter anyways to me as I've moved on and will be contributing data to other, more transparent organizations.
Photo of David Chappelle

David Chappelle

  • 6 Posts
  • 7 Reply Likes
I've also had my first look at the data that is available.  Getting the IMDb IDs is welcome, and the files are much easier to parse.  I'd already solved both these problems, but this will be good news for anyone looking at IMDb data for the first time.

However, there are some big gaps in the dataset.  There's no way to get the complete cast and crew for a title.  There is only the top, above-the-title, "principal" actors, and even then, the role and credit order is missing.  The actors file has "known for" data, which is fine, but limited.  For example, according to IMDb, Harrison Ford is known for Raiders of the Lost Ark, Witness, Air Force One and The Fugitive.  All of these are fine, notable films, but all the Star Wars films are missing, let alone minor works like American Graffiti.

There is no replacement for aka-names, certificates, plot, release-dates, running-times.  The crew credits don't qualify contributions the way the old data did.  You don't know if a writing credit is the credited screenwriter, or "story by", or the author of the original work for an adapted screenplay. 

I'm hopeful that IMDb will decide to fill in the gaps.  The alternatives are to scrape IMDb.com or to move to crowdsourced alternatives like tmdb.  I'd much rather stay with IMDb, even if it means paying for access.
Photo of Jonathan Yoni Revah

Jonathan Yoni Revah

  • 4 Posts
  • 8 Reply Likes
We've been going on about this for months, and their only response is that only top contributors will have access to a separate, more comprehensive dataset.

That and the fact that they want everyone's contact info and names for data dumps tells me that they're not happy about sharing that much anymore.

What they're completely forgetting is that the public shared with them first.
Photo of Gardner von Holt

Gardner von Holt

  • 16 Posts
  • 28 Reply Likes
David, please try to engage them, and if you are successful, please report back, if I could get a guaranteed source of data I might return as I would prefer the quality here.  I attempted to engage the licensing team months ago, and the licensing team returned my initial email, but never actually made any offer as to the pricing.  Like you, I was willing to pay, but no one ever named a price.
Photo of Terry Flynn

Terry Flynn

  • 4 Posts
  • 11 Reply Likes
In the old datasets the primary key was what I call the Fully Qualified IMDb Title (FQIT). My entire home library is based on FQIT, which is human readable. FQIT is still supported by IMDb in the .../title/tt9999999/reference page.

In the new datasets, the primary key is TCONST, which is not human readable. Like Hostnames and IP addresses, Hostnames are much easier to remember and more meaningful than the IP address.

I hope IMDb could update title.basics.tsv, or create a new dataset, to relate TCONST (new primary key) with FQIT (old primary key). This would go a long way to addressing my (and I assume others') issues.

I understand a few FQIT's change over time and have developed processes to address it. Not really an issue for me.
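
Until such a dataset exists, a rough approximation can be built locally. The sketch below indexes rows by a "Title (Year)" string and keeps every TCONST that shares it. It rests on several assumptions: the column names `tconst`, `primaryTitle` and `startYear` are assumed for title.basics.tsv, the key is only the simplest FQIT form (no suffixes for TV, video or same-year duplicates), and the function name is made up for this example.

```python
from collections import defaultdict


def index_by_title_year(rows):
    """Map 'Title (Year)' strings to the list of tconst values sharing them."""
    index = defaultdict(list)
    for row in rows:
        # Assumed column names; adjust to the header of the actual file.
        key = f"{row['primaryTitle']} ({row['startYear']})"
        index[key].append(row["tconst"])
    return dict(index)
```

Any key that maps to more than one TCONST is exactly the kind of ambiguity that an official FQIT-to-TCONST dataset would resolve.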
Photo of Vincent Fournols

Vincent Fournols

  • 759 Posts
  • 1026 Reply Likes
FQITs (and names) have proven not stable enough for my taste: the accumulated data sets over the years have led to unreconciled titles by the tens of thousands, and I am only concerned with feature films, as I discard all series and most TV films and videos.
So I am really glad to be able to peg the data model on TCONST and NCONST, and I am going to take this opportunity to create a major evolution of my database. But a dataset reconciling FQIT and TCONST would be welcome for sure!

Nevertheless, I am going to hold it for a while, as the name IDs, NCONST, come up to 95000000, so I guess that IMDB will soon roll out a longer lasting code/structure/syntax.

Among the other side wishes, pending the availability of extended datasets, I would appreciate that https://datasets.imdbws.com/ indicates the update dates of the available sets. But I am sure it will come in time.
Photo of J

J

  • 8 Posts
  • 7 Reply Likes
I'm working on the FQIT (title) and FQIN (name) stuff because I want to use the old data as long as possible with my IMDb application that is around for 18 years now (with several updates along the way). 

Currently it is (still) not clear which additional data will be available to match the old LIST file content. We've only heard some vague promises.

As for access to the extended dataset, I've only provided some minor corrections to the IMDb dataset itself, which in my eyes would not qualify me to get access to the full data. On the other hand I'm the developer of an application which has been around for years (18 years, as I said above) supporting the LIST file format. This should allow me to get full access.
We will see what happens in the future.

I hope to release a new version of my application soon (the new TSV file format is supported as in 'it is read' but not used on the query side yet).
Photo of Ron

Ron

  • 137 Posts
  • 63 Reply Likes
The final build of the data that gets published to the FTP mirrors occurred yesterday so those mirrors contain the final FTP snapshot.

Will that final build be pushed to the FTP site(s)? The dates on most of the files are from several days before that final build.  Thanks.
Photo of Vincent Fournols

Vincent Fournols

  • 685 Posts
  • 860 Reply Likes
The weekly builds are pushed on Fridays, so I guess we will get them by tomorrow. Hopefully the bug on the actor extraction will have been solved (stuck to September so far)
Photo of J

J

  • 8 Posts
  • 7 Reply Likes
I hope so too. I stumbled over another bug, in the business.list.gz file from 2017-12-22. It contains a duplicate entry which causes my application to fail due to an error (duplicate key) thrown by the database.

The error in the business.list.gz file is the title "Akemarropa" (2009) which can be found in line 2.264.972 and also in line 2.265.001.
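
One way to catch this kind of problem before the database does is to scan the keys first and report every position at which a repeated key occurs. This is only a sketch: how the keys are extracted from business.list.gz is left to the caller, and the function name is made up for this example.

```python
from collections import defaultdict


def find_duplicate_keys(keys):
    """Return {key: [positions]} for every key seen more than once (1-based)."""
    positions = defaultdict(list)
    for number, key in enumerate(keys, start=1):
        positions[key].append(number)
    return {key: where for key, where in positions.items() if len(where) > 1}
```

The caller can then decide whether to skip the later occurrences or merge them before inserting.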
Photo of Vincent Fournols

Vincent Fournols

  • 685 Posts
  • 860 Reply Likes
Thanks for sharing. The FTP files have never been faultless, there has always been a handful of bugs and mistakes. But I have always preferred to correct them manually compared with the millions of available data.

In case you missed it, just note that the Dec 22, 2017 publication is known as the last and final one.
(Edited)
Photo of David Chappelle

David Chappelle

  • 6 Posts
  • 7 Reply Likes
As I only updated my local cache of the IMDb data dumps twice per year, I am just now learning of the retirement of the FTP dumps.  I have been using these dumps for years for personal use, mostly for data mining and to create lists of movies that I want to track down on cable.

I've invested a tremendous amount of time in creating code to parse these dumps.  At my current bill rate, it's tens of thousands of dollars.  As a film enthusiast and software professional, these dumps were a great way to do both at the same time.

While this is very sad for my home, for-fun project, I think it is unfair to think that Amazon is doing this as a money grab.  Amazon has world class data infrastructure in AWS, and it's only natural that eventually they would want to move people onto AWS and away from legacy systems that were built in the 90s.  I don't know much about the internals of IMDb, but I would expect that the system that created the FTP dumps is at least a couple generations away from the system that feeds IMDb.com.

That said, are there any tutorials on specifically getting IMDb data through S3?  I am pretty comfortable with the programming involved, but have not worked with Amazon S3 before.  imdb.com/interfaces states the entry point is https://datasets.imdbws.com/, but I need details on how to construct the SOAP or REST calls.
Photo of Gardner von Holt

Gardner von Holt

  • 16 Posts
  • 28 Reply Likes
The S3 files are accessible using the Amazon AWS SDK; there are SDKs for most languages.  You can also access S3 files using a tool like Panic's Transmit.  Transmit is an FTP tool that also knows how to access files on S3 and the various cloud providers (Google, etc.)
Photo of Andrew Gallant

Andrew Gallant

  • 6 Posts
  • 4 Reply Likes
You don't need libraries and you certainly don't need to construct the raw HTTP requests yourself. You can use the s3cmd CLI tool: https://getsatisfaction.com/imdb/topics/imdb-data-now-available-in-amazon-s3?topic-reply-list[settin... 

If you do want to construct the HTTP requests directly, then you need to go read Amazon's S3 docs.
Photo of David Chappelle

David Chappelle

  • 6 Posts
  • 7 Reply Likes
@Andrew - thanks for the pointer.  s3cmd is straightforward to use and is probably the optimal solution for this problem.
Photo of k247

k247

  • 2 Posts
  • 2 Reply Likes
First of all, thanx alot for the http access and the added title.akas.tsv
i was so happy that this new file includes the language of the movie as well now
but either the languages of the movies are wrong in that file, or something is missing

if a movie does not have an akas title, is it in the English language by default ?

and why have movies with type = original and isOriginalTitle = 1 no language defined at all ?

thanx in advance for any help
this seems to go into the right direction
Photo of Shep

Shep

  • 5 Posts
  • 0 Reply Likes
This reply was created from a merged topic originally titled need fields added to exported csv file: country, release year not date, cast,-act....

I would like to download special customized fields to my watchlist download or customized list  that include Title, Release Year, Country of origin, type of title (movie, tv, miniseries, etc...), cast + character, director, description.  The watchlist download currently has title, full date, director, type of show, and numerous links & ratings (that I do not want).  WHO CAN HELP ME WITH THIS???  If I need IMDBpro, I will definitely get it.  By the way does Amazon own IMDB?  What about the software programs referenced -- do I need those like open source software like Linx, ApacheGNU and Linux utilities.  I am just sole proprietor helping an inmate with compiling movie data not a major corporation.  PLEASE HELP PLEASE HELP.  YOU CAN REACH ME AT wohlfop@outlook.comwohlfop@gmail.comtypingandinmate@gmail.com and/or 540 915 0683 
Photo of Vincent Fournols

Vincent Fournols

  • 774 Posts
  • 1045 Reply Likes
I am working on these files to interface with my personal database.

in ttprincipals.principalCast (and probably the other multivalued fields), I cannot figure out the sorting criteria: it is neither the one displayed on screen, nor the nn9999999 code itself, nor the resulting alphabetical order.

Please, could an IMDB rep clarify this?
Thanks in advance.
(Edited)
Photo of Vincent Fournols

Vincent Fournols

  • 774 Posts
  • 1045 Reply Likes
Trying to revive this question, which seems to have remained out of authoritative IMDb attentions.
Photo of Chris H.

Chris H., Official Rep

  • 48 Posts
  • 65 Reply Likes
Hello Vincent,

    Where it is available the cast order is the billing order in the end-credits. When we don't have a billing order from the credits then the list is normally ordered alphabetically or by the person's popularity on IMDb. Cast members that are flagged as uncredited are always listed alphabetically.

Best regards,
Chris.
Photo of Vincent Fournols

Vincent Fournols

  • 774 Posts
  • 1045 Reply Likes
Thanks Chris, but I think you are referring to the sorting criteria on the website, whereas I am asking about the TSV table.
Let me show you an example with "Day of the Outlaw" (1959) http://www.imdb.com/title/tt0052724/reference

On the website (/reference view), the cast is sorted as follows:


In the ttprincipals.tsv file, the record is this:


which amounts to have this sorting order:


As you may see, this order does not make any sense: it does not match the display on the website, it is not sorted by nconst (IMDb in the above image), and it is not sorted by alphabetical name or first name.
So I would just like to understand how the data is sorted in the principalCast field. And hopefully and ideally to have it sorted like on the website!

Thanks in advance.
Photo of Chris H.

Chris H., Official Rep

  • 48 Posts
  • 65 Reply Likes
Yes, you were right, I was referencing the logic for display order. When the data is extracted from the DB to create the datasets for publishing out, the cast ordering is not being maintained and has no particular ranking order.
Photo of timon

timon

  • 8 Posts
  • 7 Reply Likes
If you (and IMDB!) look at the full cast & crew listing you will see that
Tina Louise is an actress
André De Toth is the director
Lee E. Wells is a writer
Alexander Courage "made" the music
Russell Harlan handled cinematography
...

So, title.principals.tsv does NOT tell the cast (actors & actresses) of a movie but some bull shit info that is not useful for anyone. Hey IMBD! Do you have any professionals there?
Photo of timon

timon

  • 8 Posts
  • 7 Reply Likes
*IMDB
Photo of Vincent Fournols

Vincent Fournols

  • 774 Posts
  • 1045 Reply Likes
You are right, I only focused on Robert Ryan and Tina Louise who about head the cast. So I think we have a major issue here...
Photo of timon

timon

  • 8 Posts
  • 7 Reply Likes
Chris H. and IMDB, what is your logic for title.principals.tsv?
Photo of David Chappelle

David Chappelle

  • 6 Posts
  • 7 Reply Likes
I think the core of the problem is that while a list of the top actors and/or crew is useful for some cases, there really is no replacement for the credits that were available in the old FTP data.

So for example, if you wanted the cast list for this film, Day of the Outlaw, you would go to actors.list and actresses.list and build a list of anybody who had a record that started with "Day of the Outlaw (1959)" and then use the role and credit order to build the full list.  In this case, Tina Louise has an entry
Day of the Outlaw (1959)  [Helen Crane]  <3>
So we know that her character was "Helen Crane" and her credit order was 3rd.  This matches what you see on IMDb.com.  actors.list and actresses.list were really big files, so that was a fair bit of work if you wanted to discover the cast on a specific movie, but at least the data was there and not too hard to parse if you knew what to look for and minded the whitespace correctly.

What we are getting in title.principals seems to be a pseudo-random list of cast and crew with no differentiation between them, except (maybe) influenced by the popularity meter that IMDb maintains.  Maybe Alexander Courage, a crew member, is popular because of his association with Star Trek.

I know that IMDb has chosen to make most of its data unavailable to the public and we can't go back to the old days, but could we at least get the cast data from actors.list and actresses.list?  That way we can build an accurate list of performers, in the proper credit order.
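
For readers who never worked with the old files, here is a small sketch of parsing a credit entry of the form shown above. The regular expression only covers that simple shape (title with a year, optional `[role]`, optional `<order>`); real entries carried further markers that it does not handle, and the names `CREDIT_RE` and `parse_credit` are made up for this example.

```python
import re

CREDIT_RE = re.compile(
    r"^(?P<title>.+?\(\d{4}[^)]*\))"  # title ending in its "(year)" part
    r"(?:\s+\[(?P<role>[^\]]*)\])?"   # optional "[character name]"
    r"(?:\s+<(?P<order>\d+)>)?\s*$"   # optional "<credit order>"
)


def parse_credit(entry):
    """Split one credit entry into title, role and credit order."""
    match = CREDIT_RE.match(entry.strip())
    if match is None:
        return None  # a shape this sketch does not cover
    order = match.group("order")
    return {
        "title": match.group("title"),
        "role": match.group("role"),
        "order": int(order) if order else None,
    }
```

Sorting the parsed entries for one title by `order` then reproduces the billing order that the old files preserved.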

Manfred Polak

  • 5 Posts
  • 7 Reply Likes
David Chappelle wrote:

So for example, if you wanted the cast list for this film, Day of the Outlaw, you would go to actors.list and actresses.list and build a list of anybody who had a record that started with "Day of the Outlaw (1959)" and then use the role and credit order to build the full list. [...]
This matches what you see on IMDb.com.  actors.list and actresses.list were really big files, so that was a fair bit of work if you wanted to discover the cast on a specific movie, but at least the data was there and not too hard to parse if you knew what to look for and minded the whitespace correctly.
That's true if you do it manually, but it's no work at all if you are using proper software. I am using AMDbFront for parsing the FTP files, and here is a screenshot:


Michael

  • 1 Post
  • 2 Reply Likes
This reply was created from a merged topic originally titled IMDb Data Files Available for download.

According to your website (http://www.imdb.com/interfaces/): "The dataset files can be accessed and downloaded from https://datasets.imdbws.com/. The data is refreshed daily." I'm looking at the downloadable file name.basics.tsv.gz. According to that file (downloaded 12/30/2017), Victor Brooks (nm0003499) is not deceased, but if you look up nm0003499 on your website, he died in 1999. The same goes for Leslie Adams (nm0011145): he is alive according to name.basics.tsv.gz, but on your website he is listed as deceased as of 1993. Are these dataset files no longer updated? Thanks
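
For anyone who wants to repeat this check, here is a rough Python sketch. It assumes the column names documented at www.imdb.com/interfaces (`nconst`, `deathYear`) and that a missing value is written as `\N`; the helper name `death_year` and the sample rows are made up for illustration and are not real IMDb records.

```python
import csv
import io
from typing import Iterable, Optional


def death_year(tsv_lines: Iterable[str], nconst: str) -> Optional[int]:
    """Return deathYear for nconst, or None if the file marks it as missing."""
    reader = csv.DictReader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row["nconst"] == nconst:
            value = row["deathYear"]
            # Assumption: the datasets use the two characters \N for "no value".
            return None if value == r"\N" else int(value)
    raise KeyError(nconst)


# Made-up rows in the name.basics layout, purely to exercise the function.
SAMPLE = "\n".join([
    "\t".join(["nconst", "primaryName", "birthYear", "deathYear",
               "primaryProfession", "knownForTitles"]),
    "\t".join(["nm9999991", "Example Person A", "1900", "1980", "actor", "tt9999991"]),
    "\t".join(["nm9999992", "Example Person B", "1910", r"\N", "actor", "tt9999992"]),
])

print(death_year(io.StringIO(SAMPLE), "nm9999991"))
print(death_year(io.StringIO(SAMPLE), "nm9999992"))

# Against the real download, something like this should work:
# import gzip
# with gzip.open("name.basics.tsv.gz", "rt", encoding="utf-8", newline="") as f:
#     print(death_year(f, "nm0003499"))
```

A result of `None` only says the file has no death year for that person; it does not by itself mean the person is recorded as alive.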

Chris H., Official Rep

  • 48 Posts
  • 65 Reply Likes
Hello Michael, the dataset is refreshed daily, but the date of death is only included in the file if we have a year, month and day of death for the person. This is why those data items are missing for the people you refer to.

Terry Flynn

  • 4 Posts
  • 11 Reply Likes
Chris, since name.basics.tsv.gz only provides the year of death, and you have that year on your web page for Michael's examples, why not include it?

chuck.kahn

  • 14 Posts
  • 7 Reply Likes
I don't see an equivalent to the FTP interface's business.list. Will this be coming later to the S3 interface?

Brian Risselada

  • 28 Posts
  • 29 Reply Likes
Can we please get a response to this?