Support for Unicode.

  • 23
  • Idea
  • Updated 3 weeks ago
  • Under Consideration
Unicode is not fully supported on IMDb. For example, in Polish, you could change all references by searching for “milosc” and then changing them to “miłość”. And Jiří Hnídek is written without the r-háček (ř) in their first name. The same could be done for the ILM person Coşku Özdemır, a Turkish person listed on Cinefex.

taewong

  • 6 Posts
  • 0 Reply Likes

Posted 6 years ago


Randy

  • 2 Posts
  • 1 Reply Like
I agree. This affects titles, names, characters, discussion, and probably more.

Where I run into the problem most is in the discussion forums. If you paste non-ASCII characters copied from somewhere else (for example, to show a symbol that was in the film, or indeed to show the native-language title of the film), they just get turned into what appears to be the HTML text code for those characters, instead of the symbol itself.

It's 2013. This shouldn't be happening.

bluesmanSF, Champion

  • 10815 Posts
  • 6429 Reply Likes
Except that it's "Internet Movie Database," not "International." ;)

Dan Dassow, Champion

  • 13364 Posts
  • 13704 Reply Likes
This must be a Freudian slip. [wink]
Reminder to self: Don't post when tired.

bluesmanSF, Champion

  • 10815 Posts
  • 6429 Reply Likes
LOL. To be honest, though, it's almost like they want it to be known that way. Most mentions of the spelled-out name are gone. Kind of like Kentucky Fried Chicken is only KFC now. New visitors seem to be having a hard time figuring out what the site is...video streaming, file sharing?

Emperor, Champion

  • 6418 Posts
  • 3002 Reply Likes
Or after taking some random prescription medication you found lying in the meep.

taewong

  • 6 Posts
  • 0 Reply Likes
Yeah. Do not post nonsense. You have accidentally removed a comment (you need to dispute this removal).

taewong

  • 6 Posts
  • 0 Reply Likes
Since MobyGames supports Unicode, macrons in Japanese are OK for long vowels. Note that the title ends with a punctuation mark (full stop). Hungarian, Czech, Polish, Romanian, Slovak etc. require a bunch of accented letters.

Dan Dassow, Champion

  • 13454 Posts
  • 13797 Reply Likes
It is almost like Randall Munroe has been reading this forum.
http://xkcd.com/1209/

taewong

  • 6 Posts
  • 0 Reply Likes
You quote the comic: “The Skywriter we hired has terrible Unicode support.”

After correcting Miroslav Kure's surname to Miroslav Kuře (to match Czech support: the Danish/Faroese/Norwegian ø is rcaron) in Battle for Wesnoth 1.11.1 contribution community, you have many problems with the Internet Archive Wayback Machine this time. First the connection is too slow to load and you get the error message “The machine that serves this file is down. We're working on it.” twice. Unicode in their own forum affects subjects (titles) and more. Note that the thread has nonsense!

DavidAH_Ca, Champion

  • 3261 Posts
  • 2917 Reply Likes
This has been mentioned many times over the past few years. A bit of history may help here.

When IMDb first started, it was updated by an automated email system. This was at a time when some of the email routers still only handled 7-bit ASCII and special encoding was needed to ensure that 8-bit codes would not be trashed. Moreover, some characters (e.g. |, the 'pipe') were used internally (and in the email) as controls/delimiters. This is why you may sometimes see older contributors indicate a credit update as:

John Doe | 2nd Pirate | 22
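
To illustrate why a format like this is awkward to extend, here is a minimal Python sketch of reading such a line. The field names (name, role, order) and both helper functions are hypothetical guesses made for illustration; nothing here is taken from IMDb's actual systems.

```python
# Hypothetical helpers: the field names are guesses, not IMDb's real schema.

def parse_credit_line(line: str) -> dict:
    """Split a pipe-delimited credit line into three assumed fields."""
    parts = [part.strip() for part in line.split("|")]
    if len(parts) != 3:
        # A stray '|' inside a name or role would change the field count.
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    name, role, order = parts
    return {"name": name, "role": role, "order": int(order)}


def is_seven_bit_safe(text: str) -> bool:
    """True if every character fits in 7-bit ASCII (code points 0-127)."""
    return all(ord(ch) < 128 for ch in text)


credit = parse_credit_line("John Doe | 2nd Pirate | 22")
```

Under these assumptions, "John Doe" would pass a 7-bit mail route untouched, while a name such as "Jiří Hnídek" would need extra encoding, and any data containing a literal pipe would break the field split.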

By the time Unicode became standard, the system had grown quite complex. Before Unicode can be implemented, every part of the system needs to be checked and potentially modified to ensure that it will not be broken by any of the Unicode codes.

IMDb is currently in the process of moving the various lists (sections) to new internal systems. I hope and expect that they are designing these systems so that they will be able to support Unicode.

Once the moves have been completed, we may see support for Unicode, but don't expect it any time soon.

taewong

  • 6 Posts
  • 0 Reply Likes
You will need an answer. You have removed the first reply by accident. Where a name includes a suffix, we use a comma to separate it from the name. On game credits and indexes it is not treated as an integral part of the surname. Examples are:

Hernandez, Jonathan, Jr
Rowe, William A., Jr.
Tibbetts, Richard S., III

It thinks that the Get Satisfaction software uses Unicode. It supports different accented characters for Eastern European languages.

bluesmanSF, Champion

  • 10815 Posts
  • 6429 Reply Likes
The change log says you removed it...??? What the..??

taewong

  • 6 Posts
  • 0 Reply Likes
This reply was removed on 2013-03-25.

bluesmanSF, Champion

  • 10815 Posts
  • 6429 Reply Likes
Yep. And:


3 months ago
taewong, the poster:
Removed a reply in this topic
Reason: removed by the poster

BugMeNot

  • 0 Posts
  • 0 Reply Likes
Actually, it seems that after the message-board makeover, Unicode support is even worse! At least with the old ones you could enter most extended ASCII glyphs (assuming the proper code page is set). But now anything that is above 127 doesn’t work.

Šimon Falko

  • 1 Post
  • 1 Reply Like
It's the year 2014 and some Czech characters are still not supported.

Spyros

  • 8 Posts
  • 1 Reply Like
It's almost 2015 and Greek characters aren't supported AT ALL.

Sorin

  • 8 Posts
  • 0 Reply Likes
This reply was created from a merged topic originally titled
How many years will it take you to understand UNICODE?.


In 2009, in Contact #3034383 (http://www.imdb.com/helpdesk/thread?tid=3034383) you, the owner of IMDB, promised professional usage of UNICODE "in a little while". It is now five and a half years later and your web site is still crippled with no UNICODE implementation. Five and a half years??? Don't you fill embarrassed with your "professionalism"? Shall we wait another 5 years for IMDB to understand the word "international"?

(This post is addressed solely and specifically to IMDb staff.)


Sorin

  • 8 Posts
  • 0 Reply Likes
Correction: "don't you FEEL"

Murray Chapman, Employee

  • 108 Posts
  • 61 Reply Likes
We are making slow and steady progress on Unicode support.  Note that until every single part of a system supports Unicode, none of it works.  We have a lot of critical backend systems that need to be migrated.  Unfortunately, we don't have a timetable that we can share, but please be aware that we are working on it.

Note that in the last few weeks we've enabled full Unicode support in the message boards:

http://www.imdb.com/board/bd0000043/nest/235469052

We had a number of encoding issues that I believe we have fixed.

Murray Chapman, Employee

  • 108 Posts
  • 61 Reply Likes
Note that user reviews:

http://www.imdb.com/user/ur2278015/

...and lists:

http://www.imdb.com/list/ls001825868/

...also support Unicode.

Spyros

  • 8 Posts
  • 1 Reply Like
Yes, but no movie display titles...

Murray Chapman, Employee

  • 108 Posts
  • 61 Reply Likes
There is already limited support for this; see the Greek title here:

http://www.imdb.com/title/tt0015648/releaseinfo#akas

Our systems currently use a mixture of ISO-8859-1, UTF-8, and KOI8-R.  Untangling this mess while keeping things running is like changing the fan belt on an engine without switching it off.
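
As a rough illustration of why a mixture like that is hard to untangle, the Python sketch below encodes one title as UTF-8 and then reads the same bytes back with the other two encodings named above. The sample title is the Polish word from the start of this thread; the variable names are arbitrary, and this is only a sketch of the general failure mode, not a description of IMDb's pipeline.

```python
TITLE = "mi\u0142o\u015b\u0107"  # "miłość", written with escapes to avoid ambiguity

utf8_bytes = TITLE.encode("utf-8")

# Reading UTF-8 bytes as ISO-8859-1 never raises: every byte maps to some
# character, so the result is silently wrong rather than an error.
misread_latin1 = utf8_bytes.decode("iso-8859-1")

# Reading the same bytes as KOI8-R also "succeeds", with different garbage.
misread_koi8r = utf8_bytes.decode("koi8_r")

# Going the other way, ISO-8859-1 cannot represent the Polish letters at all.
try:
    TITLE.encode("iso-8859-1")
    latin1_can_store_title = True
except UnicodeEncodeError:
    latin1_can_store_title = False

# The damage is reversible only if you know which wrong decoder was applied.
repaired = misread_latin1.encode("iso-8859-1").decode("utf-8")
```

The point of the sketch is that nothing fails loudly when bytes are read with the wrong decoder, which is why each stored value needs a known encoding before it can be converted safely.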

Spyros

  • 8 Posts
  • 1 Reply Like
I tried to add a title to a movie but the system didn't let me. It reported an error on every letter I entered.

Murray Chapman, Employee

  • 108 Posts
  • 61 Reply Likes
Yup.  The submissions pipeline doesn't yet handle Unicode.

Spyros

  • 8 Posts
  • 1 Reply Like
So, the movie titles written with Greek characters are made by the people inside?

Is there a timeline when I will be able to contribute Greek titles?

Murray Chapman, Employee

  • 108 Posts
  • 61 Reply Likes
Yes, there were some cases added manually years ago.

We don't have a timeline yet, but we know people really want it.

Piotr Balwierz

  • 1 Post
  • 0 Reply Likes
3 years have passed and IMDb is still mentally in the pre-Unicode 1990s.

If you don't want to fix your database for Unicode support, then just write parsers and translate user input to HTML codes.
Moreover, some HTML codes are not supported, e.g. ń

NB. It is not possible to have a title with a non-basic-Latin character. Even if I fix a movie and input an HTML code, the form will change it to Unicode on the fly and report a problem (!!)
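
For what it's worth, the kind of translation suggested above can be sketched in a few lines of Python using only the standard library. The function names are made up for this example, and it is a sketch of the idea rather than a recommendation for how IMDb should store data.

```python
import html


def to_char_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference."""
    # 'xmlcharrefreplace' is a standard Python codec error handler.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


def from_char_refs(text: str) -> str:
    """Turn character references (numeric or named) back into characters."""
    return html.unescape(text)


stored = to_char_refs("mi\u0142o\u015b\u0107")  # input is "miłość"
restored = from_char_refs(stored)
```

One caveat with this approach: text that already contains an ampersand is passed through unchanged by `to_char_refs`, so a title containing a literal "&amp;" would not survive the round trip unless ampersands are escaped as well.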

Marco

  • 998 Posts
  • 1162 Reply Likes
The last update on this (at least in this thread) was two years ago, so can a staffer tell us what has happened these past two years regarding this issue?
(I note that in the message boards on IMDb, one could see exactly when a post was made; here I can only see that Murray responded two years ago, which is not very specific.)

gromit82, Champion

  • 7107 Posts
  • 8619 Reply Likes
Marco: In response to your latter comment, you can see the exact time of a post here, at least on the desktop version of GetSatisfaction. To do that, hover your mouse over the time designation of the post (such as "2 years ago"). So, for example, Murray's post that begins "Yes, there were some cases added manually years ago" was posted October 9, 2014 at 10:46:58 PM UTC.

I don't know whether or how it is possible to see the exact date and time on the mobile version of GetSatisfaction.

Marco

  • 998 Posts
  • 1162 Reply Likes
Thanks Gromit!
Is there also a way, which I haven't found, to reply to your post instead of to my own?

Murray Chapman, Employee

  • 108 Posts
  • 61 Reply Likes
Checking in to say that we're still working on it, but at this point can't commit to a timeline.

Marco

  • 941 Posts
  • 1064 Reply Likes
Thanks for letting us know you're still working on it.

Sorin

  • 8 Posts
  • 0 Reply Likes
Come on! If you haven't done much in 7 years, the timeline is clear: for ever! :)

Misjn

  • 6 Posts
  • 5 Reply Likes
How is this still a thing.

Jeorj Euler

  • 7165 Posts
  • 9277 Reply Likes

Owen Rees

  • 206 Posts
  • 313 Reply Likes
SGML/HTML/XML character references are no more useful in solving the underlying problem of representing and processing the full range of Unicode than any of a number of other encodings. They make sense if the data is represented and processed in XML - perhaps using technology such as XSLT - but even then they would appear only in externalised forms emitted as output or accepted as input. Since XML is, for preference, represented in UTF-8 in externalised forms, using character references does not give much benefit.

Using SGML character references in internal representations would cause all sorts of problems, especially with searching and matching.

Jeorj Euler

  • 7165 Posts
  • 9277 Reply Likes
I see that the IMDb staff has left this proposal in an "under consideration" state. Very interesting.

I shall opine that it is not so challenging for search algorithms to be made to account for strings encoded with standard character entity references, and it would be a shame if most of the libraries and engines behind most search tools deployed in any electronic database anywhere throughout the World-Wide Web lacked such a capability. But likewise, the same could be said of Unicode deployment, or that of Internet Protocol v6 for that matter.

Owen Rees

  • 206 Posts
  • 313 Reply Likes
According to https://en.wikipedia.org/wiki/SGML_entity#Character_entities - and I have no reason to doubt its accuracy -
HTML 4, for example, has 252 built-in character entities that don't have to be explicitly declared. XML has five. XHTML has the same five as XML, but if its DTDs are explicitly used, then it has 253 (&apos; being the extra entity beyond those in HTML 4).
This calls into doubt the concept of "standard character entity references" and also makes it clear that the sets of character entities that can be considered to be in common use do not cover the range of Unicode codepoints.

If we allow numeric character references - both decimal and hexadecimal - then each Unicode codepoint in the data can be represented in three or four ways in any system that can handle Unicode. The only rational way to deal with that complexity is to decode the data to strings of Unicode codepoints before applying normalisation and then using it in whatever processing is required. Having decoded the data to Unicode codepoints, the simplest and most widely supported encoding to use for any sort of I/O is UTF-8. Unless the data is being embedded in some SGML-like format such as XML, there is no reason to use character references and there is never a reason to use references for characters that do not have specific meanings in the markup if the underlying representation can support Unicode.
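
A minimal Python sketch of that decode-then-normalise step, using only the standard library, might look like the following. The function name `canonical` and the sample spellings are made up for illustration; the named references are HTML5 entity names that Python's `html.unescape` understands, and NFC is chosen here only as one common normalisation form.

```python
import html
import unicodedata

TARGET = "mi\u0142o\u015b\u0107"  # "miłość" with precomposed letters


def canonical(text: str) -> str:
    """Decode character references first, then normalise to NFC."""
    return unicodedata.normalize("NFC", html.unescape(text))


variants = [
    "mi\u0142o\u015b\u0107",        # literal Unicode characters
    "mi&#322;o&#347;&#263;",        # decimal character references
    "mi&#x142;o&#x15B;&#x107;",     # hexadecimal character references
    "mi&lstrok;o&sacute;&cacute;",  # named character references
    "mi\u0142os\u0301c\u0301",      # base letters plus combining acute accents
]

canonical_forms = {canonical(v) for v in variants}

# Once decoded and normalised, UTF-8 is used for anything leaving the program.
wire_bytes = canonical(variants[1]).encode("utf-8")
```

All five spellings collapse to a single string, which is what makes searching and matching tractable; comparing the raw variants directly would treat them as five different titles.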

The most fundamental requirement in handling character encodings is to be obsessively strict in tracking how each piece of data is encoded. In general, data may have multiple layers of encodings and it is essential to keep track of which have been applied to each piece of data. Each additional kind of encoding adds complexity, especially if it can be layered on other encodings, so the goal should always be to use as few encodings as possible.

I expect that some filmmaker will want to capture the essence of the World Wide Web and will decide to use a title such as "Markup: < &lt; & &amp; changed the world" and whatever encodings are used by IMDb had better be able to cope with that. (and I hope that this forum can handle it too!).
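
To make the layering point concrete, here is a small Python sketch that takes that hypothetical title through one layer of HTML escaping and back. It assumes the stored form is plain text and that escaping is applied exactly once on output; the variable names are arbitrary.

```python
import html

TITLE = "Markup: < &lt; & &amp; changed the world"

# One layer of escaping for embedding the plain-text title in markup.
escaped = html.escape(TITLE, quote=False)

# Undoing exactly that one layer gives the original title back.
round_tripped = html.unescape(escaped)

# Losing track of the layers is where things go wrong: unescaping text that
# was never escaped silently changes the title.
wrongly_decoded = html.unescape(TITLE)

# Escaping something that is already escaped adds a second layer.
double_escaped = html.escape(escaped, quote=False)
```

The title survives only when the number of decode steps matches the number of encode steps, which is the bookkeeping described above.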

leodevbro

  • 4 Posts
  • 6 Reply Likes
When the time comes, please don't forget Georgian language characters to be in the supported characters list.

Murray Chapman, Employee

  • 108 Posts
  • 61 Reply Likes
There's a Unicode block for Georgian characters, so they will be supported automatically. Whether or not the characters display properly in browsers will depend on whether people have a font installed locally.... but presumably those who are interested in Georgian characters will!
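
As a quick check of the "supported automatically" part, the Python sketch below looks up a sample Georgian word in the standard `unicodedata` module and round-trips it through UTF-8. The sample word and variable names are chosen only for illustration, and the block range U+10A0 to U+10FF is the main Georgian block; whether the glyphs actually render still depends on fonts, as noted above.

```python
import unicodedata

SAMPLE = "ქართული"  # the Georgian word for "Georgian"

GEORGIAN_BLOCK = range(0x10A0, 0x1100)  # U+10A0..U+10FF

# Every character has an assigned name and falls inside the Georgian block.
names = [unicodedata.name(ch) for ch in SAMPLE]
in_block = all(ord(ch) in GEORGIAN_BLOCK for ch in SAMPLE)

# No language-specific handling is needed to store or transmit it as UTF-8.
encoded = SAMPLE.encode("utf-8")
decoded = encoded.decode("utf-8")
```

Nothing in the sketch is specific to Georgian apart from the sample text, which is the sense in which support comes for free once a system handles Unicode end to end.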

Kaveh

  • 3 Posts
  • 5 Reply Likes
When will Unicode be fully supported in text fields on IMDb? If this website is really the Internet Movie Database, it is supposed to support non-English languages. How come your website still doesn't support Unicode in 2019? It's a shame.