Creating a custom Machine Translation engine requires a great deal of training data. Knowing and using the correct data types ensures that your new engine will give your business the biggest Return On Investment.
Because when it comes to the training data you use for your Machine Translation (MT) engine, there’s one important rule to bear in mind:
The quality you put in is the quality you get out.
In this article, we’ll take a look at all of the data types used in Machine Translation engine creation. We’ll also highlight the kinds which will give you the best output.
How to get the highest quality Machine Translation results
The quality of machine translation results you see will always be directly linked to the quality of the data used to train your engine.
There are two models of how to choose and collect the data you use to train your engine:
The Dirty Data Model
In the dirty data model, data:
- Is gathered from as many sources as possible
- Can come from any domain or subject matter
- Does not need to be of high quality
- Is valued for quantity above all else
The idea is that good data will be more statistically relevant. Thus, it will be automatically identified and used by the engine.
Unfortunately, as the engine will be exposed to so much data – often of poor or unknown quality – it is very likely to learn “bad habits” which will dramatically affect the quality of the output your engine can create.
The Clean Data Model
In the clean data model, data:
- Is gathered from a small number of known high-quality sources
- Is from the same domain – on the same subject – which the engine will be used for
- Does not need to be available in large quantities
- Is valued for quality above all else
The unassailable logic in the clean data model is that if the system is never exposed to “bad habits”, it can’t learn any.
Ensuring that all of the data your machine learns from is “clean” means that it is learning from highly relevant data which is known to be correct and accurate.
It is possible to clean dirty data so that it is of a quality which can be used to train your engine. In fact, this is such an important process that serious Language Service Providers with in-house expertise in Machine Translation – Asian Absolute, for example – provide a specialist service which does exactly this.
Using higher quality data means less post-editing
Although it is possible to use limited amounts of poor quality data to train an engine, the result will always need to be heavily edited by a human translator in order to be usable. This will likely be to an extent which nullifies the advantages of using machine translation in the first place.
The higher the quality of the data used to train your system – most desirably a combination of high volumes of high-quality data, Translation Memories and rich glossaries – the less human post-editing will be required afterwards. The closer you can get to this goal, the more cost-effective your system will become.
Another strategy is to use only a small quantity of clean data to train your system and then use the post-editing feedback gained over time to steadily improve the quality of the output.
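To know whether post-editing feedback is really reducing the workload over time, you need some way to measure how much an editor changes. Here is a minimal Python sketch, assuming a simple word-level edit distance between the raw engine output and the post-edited version. The function names and the example sentences are our own, made up for illustration, and this is a simplification rather than a standard industry metric.

```python
def edit_distance(a, b):
    """Levenshtein distance between two lists of tokens."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete a word
                            curr[j - 1] + 1,      # insert a word
                            prev[j - 1] + cost))  # substitute or keep a word
        prev = curr
    return prev[-1]


def post_edit_effort(mt_output, post_edited):
    """Word edits per word of the final text (0.0 means the editor changed nothing)."""
    mt_tokens = mt_output.split()
    pe_tokens = post_edited.split()
    if not pe_tokens:
        return 0.0 if not mt_tokens else 1.0
    return edit_distance(mt_tokens, pe_tokens) / len(pe_tokens)


# Hypothetical example: raw engine output vs. the editor's corrected version.
mt = "The contract must signed by both party"
pe = "The contract must be signed by both parties"
print(post_edit_effort(mt, pe))  # 0.25
```

Tracked across batches of documents, a falling score would suggest that the engine is learning from the feedback it receives.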
The kind of data which is best to use to train your system is:
- Carefully chosen – because an engine learns from any data it is exposed to.
- Clean – data which has been “cleaned” will be much better source material for any machine.
- Consistently same-domain – a high volume of data on the same topic is much better, as repetition (with some variation) is good for machine learning systems.
- Contemporary – high-frequency Translation Memories and other old-fashioned data need to be updated before use.
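As a rough illustration of the "same-domain" criterion, the sketch below keeps only segments that contain at least one term from a domain glossary. The glossary, the segments and the `in_domain` helper are hypothetical, and the check only handles single-word terms. A real selection process would be more careful than this.

```python
import re


def in_domain(segment, glossary, min_hits=1):
    """True if the segment contains at least min_hits distinct glossary terms."""
    words = set(re.findall(r"\w+", segment.lower()))
    return len(words & glossary) >= min_hits


# Hypothetical single-word glossary for a wind energy client.
glossary = {"turbine", "rotor", "gearbox"}

segments = [
    "Inspect the rotor before restarting the turbine.",
    "Our office is closed on public holidays.",
    "The gearbox oil must be replaced annually.",
]

selected = [s for s in segments if in_domain(s, glossary)]
print(selected)
```

Here the first and third segments are kept, while the sentence about office hours is set aside as off-topic.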
The main types of data in Machine Translation engine creation
There are two main types of data used in machine translation engine creation:
Bilingual data (parallel data)
Bilingual data is also often referred to as parallel data because it consists of a text and its translation set beside each other. A large collection of bilingual data might be called a parallel corpus (body) or, in the plural, corpora.
When using bilingual data to create a machine translation engine, it is very important that the data is properly cleaned. The cleaning process eliminates “noise” such as imprecise translations, misaligned sentences, repetition, insertion or duplication, or issues relating to the co-occurrence or domain-specific usage of a term.
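To give a feel for the simplest part of this cleaning, here is a short Python sketch that drops empty, untranslated, badly mismatched and duplicated segment pairs. It assumes the data is a list of (source, target) strings. The `clean_parallel` function and the `max_ratio` threshold are our own illustrative choices, not recommended settings, and a character-length comparison is a crude check that suits some language pairs better than others.

```python
import unicodedata


def clean_parallel(pairs, max_ratio=3.0):
    """Drop obviously noisy (source, target) segment pairs."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        # Normalise the Unicode form and collapse runs of whitespace.
        src = " ".join(unicodedata.normalize("NFC", src).split())
        tgt = " ".join(unicodedata.normalize("NFC", tgt).split())
        # An empty side gives the engine nothing to learn from.
        if not src or not tgt:
            continue
        # Identical source and target usually means the segment was never translated.
        if src == tgt:
            continue
        # A large length mismatch often signals misaligned sentences.
        longer = max(len(src), len(tgt))
        shorter = min(len(src), len(tgt))
        if longer / shorter > max_ratio:
            continue
        # Skip exact duplicates, ignoring case.
        key = (src.lower(), tgt.lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append((src, tgt))
    return kept


# Hypothetical English-French sample containing typical noise.
sample = [
    ("The engine is ready.", "Le moteur est prêt."),
    ("The engine is ready.", "Le moteur est prêt."),  # duplicate
    ("Press the button.", ""),                        # missing translation
    ("OK", "Veuillez consulter le manuel d'utilisation avant de continuer."),  # likely misaligned
    ("Model X200", "Model X200"),                     # untranslated
]

cleaned = clean_parallel(sample)
print(len(cleaned))  # 1
```

Checks like these only catch mechanical noise. Imprecise translations and problems with the domain-specific usage of a term still need to be reviewed by a linguist.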
For this reason, high-quality bilingual data is harder to come across.
Monolingual data (language model)
Monolingual data is much easier to find than bilingual data. It consists of text in only one language.
High-quality monolingual data can play an important role in providing a language model for machine translation engines to learn from, improving their fluency and their representation of words.
Sub-types
Within the two main types of MT engine creation data, there are several sub-types which it’s important to be aware of:
What is Foundation Data?
Foundation data is data that has been cleaned, normalised and neutralised ready to be used to train a custom MT engine.
Foundation data tends to be non-domain-specific. This means it is commonly used to start training custom engines. The expectation is that additional subject matter-specific data will be used for additional training before the engine is put to real-world use.
The end result of this process will be a custom engine which produces high-quality output and which is highly effective when used on text in the chosen domain.
When an engine has been trained on foundation data and is awaiting full training on domain-specific data, it is sometimes referred to as a “foundation engine”.
Although the two terms may sound similar, a foundation engine has a key difference when compared with a baseline system…
What is a Baseline System?
A baseline system is sometimes referred to as a “generic” Machine Translation engine. These are systems which are trained to translate from one language to another for no specific domain. Unlike a foundation engine, a baseline system is generally trained on whatever data can be found with little-to-no selection being involved.
This means that – initially – a very good baseline system may be able to produce better results than a custom MT engine which has only been trained on foundation data. Of course, this is because a foundation engine is prepared as a base to build a system on rather than being intended for actual real-world use.
- A foundation engine is a structure based on carefully selected clean data. It is ready for a high-quality, custom engine to be constructed upon.
- A baseline system is a generic system trained on any available data.
Custom data
Custom data is data which is provided by the client or organisation which wants to create the Machine Translation engine for a specific purpose.
As such, it is almost always highly relevant to the domain and style which the client wants their MT engine to work in. However, as with all data which is to be used to train MT engines, it is a good idea to have your LSP (Language Service Provider) clean custom data prior to it being used for training.
Manufactured data
Manufactured data is data which has been gathered by crawling the web.
Even given the multiple tools which some purveyors of manufactured data use to try to distinguish pages automatically translated by Google Translate and the like, using manufactured data to train your custom MT engine isn’t a good move for a reason which should, by now, be clear:
If you don’t put quality training data into your engine, you’re not going to get quality out.
The importance of high-quality data in Machine Translation training
The quality of the output you get from your trained MT engine will be directly linked to the quality of the data you used to train it.
In turn, that quality will dramatically impact your Return On Investment.
If you are looking to create a high-quality custom Machine Translation engine, knowing and using the right data types is a vital first step.
Only by ensuring that you have a high volume of properly cleaned sets of data can you train an MT engine which will produce the kind of results your business will benefit from.
Do you need clean data in a specific domain to start training a custom MT engine for your business?
Asian Absolute regularly cleans and fixes data for Machine Translation training for companies in every industry operating on five continents.
Talk to us about your goals today. Get a free, no-obligation quote or more information without commitment at any time.