# Are markup characters part of a translation?
## Infinite state machine approach
A long time ago I read on StackOverflow that a good translation process is one that keeps markup characters out of the source and target languages. I partially agree with that statement, but it implies an ideal world in which we could remove all markup characters from the source file, send our translators a plain text file and, when the translation is returned to us, rebuild the markup characters in the target language. As long as there is no artificial intelligence capable of doing that reliably, we need to look for other ways to solve the problem.
Beyond that statement, would an artificial intelligence capable of solving the problem and rebuilding the target language file even be possible? I don't know, but consider the following simple example. Such a machine would need to be able to translate the Markdown string `**Flexible** replacement **again**` to `Reemplazamiento **flexible de nuevo**`, given as input the pair `Reemplazamiento flexible de nuevo` and `**Flexible** replacement **again**`. For this simple example it could perhaps learn the mapping without big problems, but think about idiomatic twists, language contexts and the growth of languages in an ever-changing world, all within the framework of a stable translation process and at the cost of expensive learning. Do you want to wait for a machine capable of doing such a large job to appear before having acceptable translations? Personally, I don't.
## Do the markup characters express meaning?
In one language, part of a sentence could be set in bold characters to express a certain meaning, and that meaning could be expressed in another way in another language. So I declare that a good translation process is one that allows markup characters, without forcing the translator to do mental juggling while working with them.
This implies a `.po` file translator capable of working with markup character templates, using an editor that allows creating markup the way visual editors do. But such a thing doesn't currently exist.
The solution proposed by this library is to extract from Markdown files only the text that needs to be translated, including markdown characters:
- `**Bold text**` is not changed and is dumped into msgids as `**Bold text**`.
- `*Italic text*` is not changed and is dumped into msgids as `*Italic text*`.
- `` `Code text` `` spans are unified to use the minimum possible number of backticks as start and end characters and are dumped into msgids as `` `Code text` ``.
- `[Link text](target)` is not changed if the link text is different from the target, and is dumped into msgids as is. If the link text and the target are equal, it is converted to an autolink and dumped into msgids as `<target>`.
- `![Image alternative text](/target.ext "Image title text")` images are not changed and are included as is.
- `~~Strikethrough text~~` is not changed and is dumped into msgids as `~~Strikethrough text~~`.
- `$$LaTeX maths displays$$` are not changed and are dumped into msgids as `$$LaTeX maths displays$$`.
- `_Underline text_` is unified to `__Underline text__` in msgids if underline mode is active; otherwise underscores are treated like bold text (with two characters, `__`, dumped as `**Underline text**`) or like italic text (with one character, `_`, dumped as `*Underline text*`).
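To make two of these rules concrete, here is a minimal sketch of code-span unification and autolink conversion. This is not the library's actual implementation; the function names and regular expressions are illustrative, assuming simple single-line inline Markdown:

```python
import re


def _unify_code_span(match: re.Match) -> str:
    """Rewrite one code span to use the minimum number of backtick delimiters."""
    content = match.group(2)
    # Lengths of backtick runs already present inside the span content
    runs = {len(run) for run in re.findall(r"`+", content)}
    n = 1
    while n in runs:
        n += 1
    fence = "`" * n
    # CommonMark requires a space pad when content starts or ends with a backtick
    pad = " " if content.startswith("`") or content.endswith("`") else ""
    return f"{fence}{pad}{content}{pad}{fence}"


def normalize_inline(text: str) -> str:
    """Apply two of the extraction rules to an inline Markdown string."""
    # Links whose text equals their target become autolinks: <target>
    text = re.sub(
        r"\[([^\]]+)\]\(([^)]+)\)",
        lambda m: f"<{m.group(2)}>" if m.group(1) == m.group(2) else m.group(0),
        text,
    )
    # Code spans are unified to the minimum possible backticks
    text = re.sub(r"(`+)(.+?)\1", _unify_code_span, text)
    return text


print(normalize_inline("``Code text`` and [https://example.com](https://example.com)"))
# → `Code text` and <https://example.com>
```

A link such as `[Link text](target)`, whose text differs from its target, passes through `normalize_inline` unchanged, matching the rule above.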
- Updates to source files are synchronized. A change in one string marks the old one as obsolete, so the translation can be updated quickly.
- Translators work with `.po` files directly, a standard in translations.
- Parts of the Markdown files that do not need to be translated, such as code blocks, are not included in the translation (by default), reducing the possibility of markup failures in translations.
- Message replacers need to be written against this specification.
- Translation editors need to be configured with this specification if they want to handle markup character templates properly.
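As an illustration of the workflow, a `.po` entry produced this way might look like the following. The msgid reuses the example strings from earlier in this document; the obsolete entry (marked with the standard gettext `#~` prefix) shows what remains after the source string changes, and its translation is invented for the example:

```po
msgid "**Flexible** replacement **again**"
msgstr "Reemplazamiento **flexible de nuevo**"

#~ msgid "**Flexible** replacement"
#~ msgstr "Reemplazamiento **flexible**"
```

Because the markup characters live inside the msgid and msgstr, the translator can move or reword the emphasized span freely in the target language.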