Building a Machine Translation System with Forte

Forte is an open-source toolkit for building NLP pipelines. Pipelines built with Forte are:

  • Easy to debug
  • Easy to adapt to changes
  • Low friction across environments
  • Easy to collaborate on

In this tutorial, you will learn:

  • How to create a simple NLP pipeline
  • How to maintain and store the input data
  • How to perform sentence segmentation
  • How to annotate and query the data
  • How to translate the input text with a pre-trained model
  • How to manage multiple data objects
  • How to handle structures like HTML data
  • How to select a single data object for processing
  • How to replace the translation model with remote translation services
  • How to save and load the pipeline
# It is recommended to run these installs from the command line
!pip install forte==0.2.0 forte.nltk requests
# In certain environments you may run into trouble installing transformers (e.g. it may require Rust);
# the workaround at https://github.com/huggingface/transformers/issues/2831#issuecomment-600141935 might help
!pip install transformers==4.16.2
# You may want to try a different PyTorch version depending on your platform;
# if you cannot install PyTorch, try locating your problem at https://github.com/pytorch/pytorch/issues
!pip install torch==1.11.0
# In certain environments the installation may fail;
# the workaround at https://github.com/google/sentencepiece/issues/378#issuecomment-1145399969 might help
!pip install sentencepiece

1 — How to Read Data from Source

  • What is a reader and why we need it
  • How to compose a simple pipeline with a pre-built reader
from forte import Pipeline
from forte.data.readers import TerminalReader

# Compose a pipeline with a reader that takes input from the terminal
pipeline: Pipeline = Pipeline()
pipeline.set_reader(TerminalReader())
pipeline.initialize()

# The reader yields one DataPack per prompt; print the text it stored
datapack = next(pipeline.process_dataset())
print(datapack.text)
  • What is a DataPack object and why we need it
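A DataPack bundles the raw text with every annotation made on it, so all processors downstream share one consistent view of the data. As a minimal standalone sketch (outside any pipeline, mirroring how a DataPack is created later in this tutorial):

from forte.data import DataPack

# A DataPack holds the text plus all annotations anchored on it
pack = DataPack()
pack.set_text("Forte makes NLP pipelines composable.")
print(pack.text)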

2 — How to Process Data in a Pipeline

  • What is a processor and why we need it
  • How to add a pre-built processor to the pipeline
from fortex.nltk.nltk_processors import NLTKSentenceSegmenter
pipeline.add(NLTKSentenceSegmenter())

from ft.onto.base_ontology import Sentence

pipeline.initialize()
for sent in next(pipeline.process_dataset()).get(Sentence):
    print(sent.text)
  • What is the ontology system and why we need it
  • How to write a customized ontology and how to use it
from dataclasses import dataclass
from typing import Optional

from forte.data.ontology.top import Annotation


@dataclass
class Article(Annotation):

    language: Optional[str]

    def __init__(self, pack, begin: int, end: int):
        super().__init__(pack, begin, end)
        self.language: Optional[str] = None
from forte.data import DataPack

sentences = [
    "Do you want to get better at making delicious BBQ?",
    "You will have the opportunity, put this on your calendar now.",
    "Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers."
]
datapack: DataPack = DataPack()

# Add sentences to the DataPack and annotate them
for sentence in sentences:
    datapack.set_text(datapack.text + sentence)
    datapack.add_entry(
        Sentence(datapack, len(datapack.text) - len(sentence), len(datapack.text))
    )

# Annotate the whole text with Article
article: Article = Article(datapack, 0, len(datapack.text))
article.language = "en"
datapack.add_entry(article)

for article in datapack.get(Article):
    print(f"Article (language - {article.language}):")
    for sentence in article.get(Sentence):
        print(sentence.text)
  • Article, as shown in the example above, inherits Annotation and carries a language field to distinguish English from German. In the single-DataPack example, an English Article covers the span of English text in the DataPack; likewise, a German Article covers the span of German text.
  • Sentence is used to break the article down, and the individual sentences are what we feed into the MT model.
  • Recording is an example subclass of AudioAnnotation; it adds a recording_class field denoting the classes the audio belongs to.
  • BoundingBox is an example subclass of ImageAnnotation. It has a deeper inheritance chain than the other ontology classes due to the nature of CV objects. The advantage of the Forte ontology is that it supports such complex inheritance: users can inherit from existing ontology classes and add new features for their needs, as the sketch after this list shows.
  • RelationLink is an example subclass of Link, and it has a class attribute specifying the relation type.
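To make the inheritance point concrete, here is a minimal sketch of extending an existing ontology class with a new field, following the same pattern as the Article class above. The class name TranslatedSentence and its source_language field are hypothetical, introduced purely for illustration:

from dataclasses import dataclass
from typing import Optional

from ft.onto.base_ontology import Sentence


@dataclass
class TranslatedSentence(Sentence):
    # Hypothetical extra field, for illustration only
    source_language: Optional[str]

    def __init__(self, pack, begin: int, end: int):
        super().__init__(pack, begin, end)
        self.source_language: Optional[str] = None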
  • The basics of the machine translation process
  • How to wrap a pre-trained machine translation model into a Forte processor
from forte.data import DataPack
from forte.data.readers import StringReader
from forte.processors.base import PackProcessor
from transformers import T5Tokenizer, T5ForConditionalGeneration


class MachineTranslationProcessor(PackProcessor):
    """
    Translate the input text and append the translation to the same DataPack.
    """

    def initialize(self, resources, configs):
        super().initialize(resources, configs)

        # Initialize the tokenizer and model
        model_name: str = self.configs.pretrained_model
        self.tokenizer = T5Tokenizer.from_pretrained(model_name)
        self.model = T5ForConditionalGeneration.from_pretrained(model_name)
        self.task_prefix = "translate English to German: "
        self.tokenizer.padding_side = "left"
        self.tokenizer.pad_token = self.tokenizer.eos_token

    def _process(self, input_pack: DataPack):
        # en2de machine translation
        inputs = self.tokenizer([
            self.task_prefix + sentence.text
            for sentence in input_pack.get(Sentence)
        ], return_tensors="pt", padding=True)

        output_sequences = self.model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            do_sample=False,
        )

        output = ''.join(self.tokenizer.batch_decode(
            output_sequences, skip_special_tokens=True
        ))

        # Annotate the source text, then append the translation and annotate it
        src_article: Article = Article(input_pack, 0, len(input_pack.text))
        src_article.language = "en"

        input_pack.set_text(input_pack.text + '\n\n' + output)
        tgt_article: Article = Article(input_pack, len(input_pack.text) - len(output), len(input_pack.text))
        tgt_article.language = "de"

    @classmethod
    def default_configs(cls):
        return {
            "pretrained_model": "t5-small"
        }
  • In initialize(), users need to set up all NLP components required for the inference task, such as the tokenizer and the model.
  • Users also need to specify all configurations in configs, a dictionary-like object that holds settings for all components, such as the model name.
  • After initialization, the processor has all the NLP components it needs. The _process() method then implements several MT behaviors on top of the Forte DataPack:
  • Retrieve text data from the DataPack (the reader has already loaded it from the data source).
  • Since T5 performs better when given a task prompt, prepend the prompt to the input data.
  • Tokenize, transforming the input text into sequences of tokens and token ids.
  • Generate output sequences from the model.
  • Decode the output token ids back into sentences using the tokenizer. The sketch below walks through these steps outside of Forte.
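As a minimal standalone sketch of those steps, using the same t5-small model and task prompt as the processor above, on a single hard-coded sentence:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Prepend the task prompt, then tokenize into token ids
inputs = tokenizer(
    ["translate English to German: Do you want to get better at making delicious BBQ?"],
    return_tensors="pt", padding=True,
)

# Generate output token ids, then decode them back into text
output_sequences = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    do_sample=False,
)
print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))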
input_string: str = ' '.join(sentences)
pipeline: Pipeline = Pipeline[DataPack]()
pipeline.set_reader(StringReader())
pipeline.add(NLTKSentenceSegmenter())
pipeline.add(MachineTranslationProcessor())
pipeline.initialize()
for datapack in pipeline.process_dataset([input_string]):
    for article in datapack.get(Article):
        print(f"\nArticle (language - {article.language}): {article.text}")
  • What is a MultiPack and why we need it
  • How to use a MultiPack
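A MultiPack groups several DataPacks under names, so related texts (here, the source and its translation) can be kept separate yet processed together. As a minimal sketch, assuming a MultiPack can be created directly the same way the DataPack was above:

from forte.data import MultiPack

mp = MultiPack()
# Each named pack inside the MultiPack is an ordinary DataPack
src = mp.add_pack("source")
src.set_text("Hello world.")
print(mp.get_pack("source").text)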
from forte.data import MultiPack
from forte.processors.base import MultiPackProcessor
from forte.data.caster import MultiPackBoxer


class MachineTranslationMPProcessor(MultiPackProcessor):
    """
    Translate the input text and write the translation into a target pack.
    """

    def initialize(self, resources, configs):
        super().initialize(resources, configs)

        # Initialize the tokenizer and model
        model_name: str = self.configs.pretrained_model
        self.tokenizer = T5Tokenizer.from_pretrained(model_name)
        self.model = T5ForConditionalGeneration.from_pretrained(model_name)
        self.task_prefix = "translate English to German: "
        self.tokenizer.padding_side = "left"
        self.tokenizer.pad_token = self.tokenizer.eos_token

    def _process(self, input_pack: MultiPack):
        source_pack: DataPack = input_pack.get_pack("source")
        target_pack: DataPack = input_pack.add_pack("target")

        # en2de machine translation
        inputs = self.tokenizer([
            self.task_prefix + sentence.text
            for sentence in source_pack.get(Sentence)
        ], return_tensors="pt", padding=True)

        output_sequences = self.model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            do_sample=False,
        )

        # Annotate the source article
        src_article: Article = Article(source_pack, 0, len(source_pack.text))
        src_article.language = "en"

        # Annotate each translated sentence in the target pack
        for output in self.tokenizer.batch_decode(
            output_sequences, skip_special_tokens=True
        ):
            target_pack.set_text(target_pack.text + output)
            text_length: int = len(target_pack.text)
            Sentence(target_pack, text_length - len(output), text_length)

        # Annotate the target article
        tgt_article: Article = Article(target_pack, 0, len(target_pack.text))
        tgt_article.language = "de"

    @classmethod
    def default_configs(cls):
        return {
            "pretrained_model": "t5-small",
        }
nlp: Pipeline = Pipeline[DataPack]()
nlp.set_reader(StringReader())
nlp.add(NLTKSentenceSegmenter())
nlp.add(MultiPackBoxer(), config={"pack_name": "source"})
nlp.add(MachineTranslationMPProcessor(), config={
    "pretrained_model": "t5-small"
})
nlp.initialize()
for multipack in nlp.process_dataset([input_string]):
    for pack_name in ("source", "target"):
        for article in multipack.get_pack(pack_name).get(Article):
            print(f"\nArticle (language - {article.language}): ")
            for sentence in article.get(Sentence):
                print(sentence.text)

3 — How to Handle New Practical Requests

  • How to build a translation management system
  • How to preserve the structure like HTML in machine translation
  • How to select a specific DataPack from MultiPack for processing
html_input: str = """
<!DOCTYPE html>
<html>
<head><title>Beginners BBQ Class.</title></head>
<body>
<p>Do you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers.</p>
</body>
</html>
"""
nlp.initialize()
for multipack in nlp.process_dataset([html_input]):
    print("Source Text: " + multipack.get_pack("source").text)
    print("\nTarget Text: " + multipack.get_pack("target").text)
from forte.data import NameMatchSelector
from forte.data.readers.html_reader import ForteHTMLParser


class HTMLTagCleaner(MultiPackProcessor):

    def initialize(self, resources, configs):
        super().initialize(resources, configs)
        self._parser = ForteHTMLParser()

    def _process(self, input_pack: MultiPack):
        raw_pack: DataPack = input_pack.get_pack("raw")
        source_pack: DataPack = input_pack.add_pack("source")

        # Strip every HTML tag span from the raw text
        self._parser.feed(raw_pack.text)
        cleaned_text: str = raw_pack.text
        for span, _ in self._parser.spans:
            cleaned_text = cleaned_text.replace(
                raw_pack.text[span.begin:span.end], ''
            )
        source_pack.set_text(cleaned_text)


class HTMLTagRecovery(MultiPackProcessor):

    def _process(self, input_pack: MultiPack):
        raw_pack: DataPack = input_pack.get_pack("raw")
        source_pack: DataPack = input_pack.get_pack("source")
        target_pack: DataPack = input_pack.get_pack("target")
        result_pack: DataPack = input_pack.add_pack("result")

        # Substitute each source sentence in the raw HTML with its translation
        result_text: str = raw_pack.text
        for sent_src, sent_tgt in zip(source_pack.get(Sentence), target_pack.get(Sentence)):
            result_text = result_text.replace(sent_src.text, sent_tgt.text)
        result_pack.set_text(result_text)
# Pipeline with HTML handling
pipeline: Pipeline = Pipeline[DataPack]()
pipeline.set_reader(StringReader())
pipeline.add(MultiPackBoxer(), config={"pack_name": "raw"})
pipeline.add(HTMLTagCleaner())
pipeline.add(
    NLTKSentenceSegmenter(),
    selector=NameMatchSelector(),
    selector_config={"select_name": "source"}
)
pipeline.add(MachineTranslationMPProcessor(), config={
    "pretrained_model": "t5-small"
})
pipeline.add(HTMLTagRecovery())
pipeline.initialize()
for multipack in pipeline.process_dataset([html_input]):
    print(multipack.get_pack("raw").text)
    print(multipack.get_pack("result").text)
  • How to use a different translation service
# You can get your own API key by following the instructions at https://docs.microsoft.com/en-us/azure/cognitive-services/translator/
api_key = input("Enter your API key here: ")
import requests
import uuid


class OnlineMachineTranslationMPProcessor(MultiPackProcessor):
    """
    Translate the input text using an online translation API.
    """

    def initialize(self, resources, configs):
        super().initialize(resources, configs)
        self.url = configs.endpoint + configs.path
        self.from_lang = configs.from_lang
        self.to_lang = configs.to_lang
        self.subscription_key = configs.subscription_key
        self.subscription_region = configs.subscription_region

    def _process(self, input_pack: MultiPack):
        source_pack: DataPack = input_pack.get_pack("source")
        target_pack: DataPack = input_pack.add_pack("target")

        # Use the configured language pair rather than hard-coded values
        params = {
            'api-version': '3.0',
            'from': self.from_lang,
            'to': [self.to_lang]
        }
        # Build request
        headers = {
            'Ocp-Apim-Subscription-Key': self.subscription_key,
            'Ocp-Apim-Subscription-Region': self.subscription_region,
            'Content-type': 'application/json',
            'X-ClientTraceId': str(uuid.uuid4())
        }
        # You can pass more than one object in body.
        body = [{
            'text': source_pack.text
        }]
        response = requests.post(self.url, params=params, headers=headers, json=body)

        result = response.json()
        target_pack.set_text("".join(
            [trans['text'] for trans in result[0]["translations"]]
        ))

    @classmethod
    def default_configs(cls):
        return {
            "from_lang": 'en',
            "to_lang": 'de',
            "endpoint": 'https://api.cognitive.microsofttranslator.com/',
            "path": '/translate',
            "subscription_key": None,
            "subscription_region": "westus2",
            'X-ClientTraceId': str(uuid.uuid4())
        }
nlp: Pipeline = Pipeline[DataPack]()
nlp.set_reader(StringReader())
nlp.add(NLTKSentenceSegmenter())
nlp.add(MultiPackBoxer(), config={"pack_name": "source"})
nlp.add(OnlineMachineTranslationMPProcessor(), config={
    "from_lang": 'en',
    "to_lang": 'de',
    "endpoint": 'https://api.cognitive.microsofttranslator.com/',
    "path": '/translate',
    "subscription_key": api_key,
    "subscription_region": "westus2",
    'X-ClientTraceId': str(uuid.uuid4())
})
nlp.initialize()
for multipack in nlp.process_dataset([input_string]):
    print("Source Text: " + multipack.get_pack("source").text)
    print("\nTarget Text: " + multipack.get_pack("target").text)
  • How to export and import a Forte pipeline
import os

save_path: str = os.path.join(os.path.dirname(os.path.abspath('')), "pipeline.yml")
nlp.save(save_path)

with open(save_path, 'r') as f:
    print(f.read())

new_nlp: Pipeline = Pipeline()
new_nlp.init_from_config_path(save_path)
new_nlp.initialize()
for multipack in new_nlp.process_dataset([input_string]):
    print("Source Text: " + multipack.get_pack("source").text)
    print("\nTarget Text: " + multipack.get_pack("target").text)
