How to fix subtitle problems in foreign languages

Subtitles in foreign languages can produce strange results. Here we show how to fix various problems, using a French WebVTT (.vtt) subtitle file embedded in JW Player as an example.
NOTE: JW Player 7.7 and above handle special characters correctly, while earlier versions and various other players still have problems.

Some languages pose no problems, but as soon as you have special characters, like é, à, ô and others, you can run into trouble if you do not convert those characters into HTML entities. So if you get the same type of display errors as shown in the image below, you have some tweaking to do:

[Image: french-subtitles-error]

Looks clearly out of order, doesn’t it? Each special character is replaced by a question mark in older JW Player versions and some other players. Luckily, this is quite easy to solve.

Encoding special characters in subtitle files

Let’s consider the following subtitle content in WebVTT format, which is the favorite subtitle/captions format for JW Player:

00:21.200 --> 00:27.500
La série montre les étapes, lieux et objets auxquels nous sommes 
tellement habitués et que nous ne voyons plus.

00:27.800 --> 00:31.000
Nous remarquons seulement quand ils disparaissent et c’est trop tard.

00:35.000 --> 00:46.000
La plupart des peintures ont une atmosphère désolée, renforcée 
par la gamme restreinte de couleurs, composée essentiellement
de gris teinté avec le noir profond du charbon de bois.

As you can see here, there are quite a few special characters that most video players do not handle well.
The quick solution is to use an online HTML entity converter to convert those special characters into their proper equivalents. For this, I found an excellent tool that does the job properly: https://mothereff.in/html-entities, made by @Matthias.
It consists of two boxes, Decoded and Encoded. In the Decoded box, you place the text to be converted into HTML entities. The Encoded box below shows the converted text in real time:

[Image: Foreign subtitle html entities translation]

Make sure you select both check boxes below the Encoded box; otherwise, you get the wrong kind of encoding:

[Image: Foreign subtitles encoding]

Needless to say, you only translate the text, not the timeline information. Paste the phrase in the Decoded box, then select the text in the Encoded box, copy it and paste it into your subtitle file, replacing the original phrase:

00:21.200 --> 00:27.500
La s&eacute;rie montre les &eacute;tapes, lieux et objets auxquels nous sommes
tellement habitu&eacute;s et que nous ne voyons plus.

00:27.800 --> 00:31.000
Nous remarquons seulement quand ils disparaissent et c’est trop tard.

00:35.000 --> 00:46.000
La plupart des peintures ont une atmosph&egrave;re d&eacute;sol&eacute;e, renforc&eacute;e
par la gamme restreinte de couleurs, compos&eacute;e essentiellement
de gris teint&eacute; avec le noir profond du charbon de bois.
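If you have many cues, the copy-and-paste routine can be scripted. The following is only a minimal sketch in Python, not the converter's own code: it assumes you want named HTML entities where the standard library knows one and a hexadecimal reference otherwise, and the function names `encode_cue_text` and `encode_vtt` are made up for this illustration. Timeline lines (those containing `-->`) are passed through untouched.

```python
from html.entities import codepoint2name


def encode_cue_text(line: str) -> str:
    """Replace every non-ASCII character with an HTML character reference."""
    out = []
    for ch in line:
        cp = ord(ch)
        if cp < 128:
            # Plain ASCII is left as it is.
            out.append(ch)
        elif cp in codepoint2name:
            # Named entity, e.g. é becomes &eacute;
            out.append(f"&{codepoint2name[cp]};")
        else:
            # No named entity known: fall back to a hexadecimal reference.
            out.append(f"&#x{cp:X};")
    return "".join(out)


def encode_vtt(text: str) -> str:
    """Encode cue text only; timeline lines (containing '-->') stay as they are."""
    lines = []
    for line in text.splitlines():
        lines.append(line if "-->" in line else encode_cue_text(line))
    return "\n".join(lines)
```

Note that this sketch also encodes curly apostrophes, so you may prefer to replace those with plain ones first.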

The second phrase appears to be fine at first sight, yet it will also display a question mark if you load it into your player as is:

00:27.800 --> 00:31.000
Nous remarquons seulement quand ils disparaissent et c’est trop tard.

The apostrophe is incorrect: you need to replace the curly apostrophe in c’est with a straight one, c'est.

It is possible to do this with the HTML entities encoder as well, but it creates an encoding that is unnecessary, like this:

Nous remarquons seulement quand ils disparaissent et c&rsquo;est trop tard.

Although this is not really a problem, in a big subtitle/captions file it adds up quickly, so it is best to fix this sort of error by hand.
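If fixing them by hand becomes tedious, a small script can swap the curly characters for plain ones. This is just a sketch in Python; the name `straighten_quotes` is hypothetical, and handling double quotes as well as apostrophes is my own assumption, since the example above only shows the apostrophe.

```python
# Map typographic quotes to their plain ASCII equivalents.
PLAIN_QUOTES = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (the curly apostrophe)
    "\u201C": '"',   # left double quotation mark
    "\u201D": '"',   # right double quotation mark
})


def straighten_quotes(text: str) -> str:
    """Return text with curly quotes replaced by straight ones."""
    return text.translate(PLAIN_QUOTES)
```

Run it over the cue text before any entity encoding, so the apostrophes never reach the encoder.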

How long should a phrase in subtitles be?

The ideal is 32 characters, although you may not always manage that. In any case, try to avoid lengthy phrases like this:

00:35.000 --> 00:46.000
La plupart des peintures ont une atmosphère désolée, renforcée 
par la gamme restreinte de couleurs, composée essentiellement
de gris teinté avec le noir profond du charbon de bois.

Instead, break them up into smaller parts. As you can see in this example, the phrase remains in view for 11 seconds, so you can easily break it up into 3 parts, like this (encoding already applied):

00:35.000 --> 00:38.500
La plupart des peintures ont une atmosph&egrave;re d&eacute;sol&eacute;e,

00:38.500 --> 00:40.500
renforc&eacute;e par la gamme restreinte de couleurs,

00:40.500 --> 00:46.000
compos&eacute;e essentiellement de gris teint&eacute; avec
le noir profond du charbon de bois.

This will give a much better result. Copying and pasting from the HTML entities encoder presents no problem as long as you use a text editor that doesn’t apply formatting, like Notepad, Notepad++, PSEditor and other code editors.
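Working out the new timelines for a split phrase can also be automated. The sketch below, in Python, divides the original time span in proportion to the number of characters in each part. That is only one possible heuristic (the timings in the example above were chosen by hand), and the names `format_ts` and `split_cue` are hypothetical. It assumes cues shorter than one hour, so the mm:ss.ttt timestamp form is enough.

```python
def format_ts(seconds: float) -> str:
    """Format seconds as a WebVTT mm:ss.ttt timestamp (under one hour)."""
    total_ms = round(seconds * 1000)
    minutes, ms = divmod(total_ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def split_cue(start: float, end: float, parts: list) -> list:
    """Give each part a share of the time span proportional to its length."""
    total_chars = sum(len(p) for p in parts)
    cues = []
    t = start
    used = 0
    for part in parts:
        used += len(part)
        # End time of this part: start plus its cumulative share of the span.
        t_next = start + (end - start) * used / total_chars
        cues.append(f"{format_ts(t)} --> {format_ts(t_next)}\n{part}")
        t = t_next
    return cues
```

You still decide where to break the phrase yourself; the script only spreads the time over the parts you give it.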

How to solve phrases glued together

Sometimes, subtitle files contain hidden characters that mess up the display. A common problem is incorrect carriage returns. You can’t see them, which is why the problem is hard to find.
Consider the following output:

[Image: subtitles-broken-up]

As you can see, the subtitles are messed up, showing two or more lines in one go, including timeline information.
The best way to solve this is to open the subtitle file and scroll down to the part that starts with the problematic phrase. In this example:

hors de vue, jetant....

At first sight, this looks normal (let’s forget the unencoded character in this case):

01:35.100 --> 01:40.000
hors de vue, jetant une ombre profond sur la voie.

01:48.000 --> 01:51.800
Ou juste un instant figé, les choses que vous marquerez en attendant...

But the player shows the second timeline as text because the carriage returns in this file are treated as ordinary characters, so the player thinks it is one big phrase. To solve this, place the cursor at the beginning of

01:48.000 --> 01:51.800

And press backspace until the timeline is glued to the previous phrase, like this:

01:35.100 --> 01:40.000
hors de vue, jetant une ombre profond sur la voie.01:48.000 --> 01:51.800
Ou juste un instant figé, les choses que vous marquerez en attendant...

It may require pressing the backspace several times.
Then press SHIFT+ENTER twice, so that you get this (which looks the same as the original, but is different in reality):

01:35.100 --> 01:40.000
hors de vue, jetant une ombre profond sur la voie.

01:48.000 --> 01:51.800
Ou juste un instant figé, les choses que vous marquerez en attendant...

We are not finished yet. You need to do the same thing with the phrase that starts with Ou juste un instant… because here, too, a “false” carriage return is interpreted as part of the phrase. So, we first get this:

01:48.000 --> 01:51.800Ou juste un instant figé, les choses que vous marquerez en attendant...

and then at the start of the phrase, press SHIFT+ENTER once this time:

01:48.000 --> 01:51.800
Ou juste un instant figé, les choses que vous marquerez en attendant...

When you save and upload the file again, the problem should be solved.
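If the file has many of these spots, retyping every line break gets tedious. Assuming the hidden characters really are stray carriage returns, as described above, a short Python sketch can rewrite all line endings in one go. The function names and the idea of writing to a separate output file are my own choices for illustration.

```python
def normalize_line_endings(data: bytes) -> bytes:
    """Turn CRLF (Windows) and lone CR (classic Mac) line endings into LF."""
    # Replace CRLF first, so the second step only sees lone CRs.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def fix_file(src: str, dst: str) -> None:
    """Read src as raw bytes and write a copy with uniform LF endings to dst."""
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(normalize_line_endings(data))
```

Working on raw bytes means the script never touches the text itself, only the line breaks. If the problem persists afterwards, the hidden character is something else and the manual method above is the safer route.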

How come false carriage returns slip into a subtitle file?

This sometimes happens when you upload a file to your server or when you move the file from Mac to Windows and vice versa. Switching from one code editor to another may trigger the problem as well. But at least you know now how to solve it. 🙂
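To check whether a file has picked up mixed line endings along the way, you can count them. A minimal Python sketch, with a hypothetical function name:

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, lone CRs and lone LFs in raw file content."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        # Every CRLF pair contains one CR and one LF, so subtract the pairs.
        "CR": data.count(b"\r") - crlf,
        "LF": data.count(b"\n") - crlf,
    }
```

Pass it the bytes read with `open(path, "rb").read()`. More than one non-zero count suggests the file has been through different systems or editors.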

6 thoughts on “How to fix subtitle problems in foreign languages”

  1. Hello,

    Thanks for the tutorial! It’s really detailed, but it’s also a bit painful.
    Recent subtitle websites now handle encodings, and do the conversion (when possible) for you. I personally use subtitle-index.org

    Regards,

  2. I wasted a ton of time fixing Vietnamese subtitle problems. It was like a nightmare. Your tutorial post is really helpful to me. Now I can handle this problem. Thanks for sharing

  3. Hi, do you mean to say that whatever is encoded will be displayed correctly in the video?
    Is this applicable to other languages also, or only to English?
    I will try.
    If the encoded text does not display the same as the decoded lines, what should I do?
    please post to my email address

    • Hi jraju,
      In theory, yes, but the display of special characters also depends on the headers of the page in which the player is embedded. If you have issues with special characters, try to put
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
      in the header of your page.
      Or perhaps you can provide an example page where this problem shows up?
