Forest-to-String SMT for Asian Language Translation: NAIST at WAT 2014
Graham Neubig
Nara Institute of Science and Technology (NAIST)
2014-10-4
Features of ASPEC
● Translation between languages with different grammatical structures
  流動 プラズマ を 正確 に 測定 する ため に 画像 を 再 構成 した 。
  an image was reconstituted in order to measure flowing plasma correctly .
● We all know: Phrase-based MT is not enough
  for the accurate measurement of plasma flow image was reconstructed .
Solution?: 2-step Translation Process
● Pre-ordering [Weblio, SAS_MT, NII, TMU, NICT]
  我々 は 科学 論文 を 翻訳 する → 我々 翻訳 する 科学 論文 → we translate scientific papers
● RBMT+Statistical Post Editing [TOSHIBA, EIWA]
  我々 は 科学 論文 を 翻訳 する → we translate science thesis → we translate scientific papers
This is a lot of work... :(
● How do I make good Japanese-English preordering rules?!
● How do I make good Japanese-Chinese preordering rules?!
● What about error propagation?
● What if better preordering accuracy doesn't equal better translation accuracy?
Evidence
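Our Solution: Tree-to-String Translation [Liu+ 06]
[Figure: parse tree of 友達 と ご飯 を 食べ た with nodes VP0-5, PP0-1, VP2-5, PP2-3, VP4-5 and leaves N0 友達, P1 と, N2 ご飯, P3 を, V4 食べ, SUF5 た. The nodes are annotated with target-side rule strings "x1 with x0", "x1 x0", "ate", "a meal", and "my friend", which combine into the translation "ate a meal with my friend".]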
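The rule application shown in the figure can be sketched in a few lines of Python. This is only an illustration, not Travatar's implementation: the `Node` class, the `RULES` table, and the pairing of each node with a target-side template are assumptions reconstructed from the labels in the figure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str                                   # node label with span, e.g. "VP0-5"
    children: List["Node"] = field(default_factory=list)


# Hypothetical rule table: which template belongs to which node is
# inferred from the figure, not read from a trained model.
RULES = {
    "VP0-5": "x1 with x0",   # x0 = PP0-1, x1 = VP2-5
    "VP2-5": "x1 x0",        # x0 = PP2-3, x1 = VP4-5
    "PP0-1": "my friend",    # lexical rule covering 友達 と
    "PP2-3": "a meal",       # lexical rule covering ご飯 を
    "VP4-5": "ate",          # lexical rule covering 食べ た
}


def translate(node: Node) -> str:
    """Expand the rule for `node`, replacing xN with the translation of child N."""
    out = []
    for tok in RULES[node.label].split():
        if tok.startswith("x") and tok[1:].isdigit():
            out.append(translate(node.children[int(tok[1:])]))
        else:
            out.append(tok)
    return " ".join(out)


TREE = Node("VP0-5", [
    Node("PP0-1", [Node("N0"), Node("P1")]),
    Node("VP2-5", [
        Node("PP2-3", [Node("N2"), Node("P3")]),
        Node("VP4-5", [Node("V4"), Node("SUF5")]),
    ]),
])

print(translate(TREE))  # ate a meal with my friend
```

The reordering falls out of the templates themselves ("x1 with x0" puts the verb phrase before the postpositional phrase), so no separate preordering step is needed.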
Requirements for a Tree-to-String Model
● Source Sentence Parser
● Parallel Corpus
  これ は テスト です 。 / This is a test .
  データ を 使用 します 。 / It uses data .
● Alignments
● Rule Extraction
● Rule Scoring
● Optimization
→ Tree-to-String Model
Reducing our workload.
X How do I make good Japanese-English preordering rules?!
X How do I make good Japanese-Chinese preordering rules?!
What about error propagation?
X What if better preordering accuracy doesn't equal better translation accuracy?
Forest-to-string Translation [Mi+ 08]
[Figure: packed parse forest for "I saw a girl with a telescope", with nodes labeled by span: S 0,7; VP 1,7; NP 2,7; PP 4,7; NP 0,1; NP 2,4; NP 5,7; PRP 0,1; VBD 1,2; DT 2,3; NN 3,4; IN 4,5; DT 5,6; NN 6,7.]
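A packed forest like the one in the figure can be represented as a hypergraph in which a node may have more than one way of being built. The sketch below is an assumption-laden illustration: the node spans come from the figure, but the hyperedges in `FOREST` (in particular the two alternative expansions of VP 1,7) are reconstructed by hand, and `count_trees` is a hypothetical helper rather than part of any toolkit.

```python
from functools import lru_cache
from math import prod

# Each node is (label, start, end). Each node maps to a list of hyperedges,
# and each hyperedge is a tuple of child nodes. Nodes absent from the dict
# are preterminals sitting directly above a word.
# Assumption: these edges are reconstructed by hand from the figure's spans.
FOREST = {
    ("S", 0, 7): [(("NP", 0, 1), ("VP", 1, 7))],
    ("NP", 0, 1): [(("PRP", 0, 1),)],
    ("VP", 1, 7): [
        (("VBD", 1, 2), ("NP", 2, 7)),               # PP attaches to the noun phrase
        (("VBD", 1, 2), ("NP", 2, 4), ("PP", 4, 7)),  # PP attaches to the verb phrase
    ],
    ("NP", 2, 7): [(("NP", 2, 4), ("PP", 4, 7))],
    ("NP", 2, 4): [(("DT", 2, 3), ("NN", 3, 4))],
    ("PP", 4, 7): [(("IN", 4, 5), ("NP", 5, 7))],
    ("NP", 5, 7): [(("DT", 5, 6), ("NN", 6, 7))],
}


@lru_cache(maxsize=None)
def count_trees(node):
    """Number of distinct parse trees packed under `node`."""
    edges = FOREST.get(node)
    if edges is None:          # preterminal: exactly one (trivial) tree
        return 1
    return sum(prod(count_trees(child) for child in edge) for edge in edges)


print(count_trees(("S", 0, 7)))  # 2
```

Both attachments of "with a telescope" share the sub-nodes NP 2,4 and PP 4,7, which is what lets a forest-to-string decoder consider several parses without enumerating them one by one.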
Travatar Toolkit
● Forest-to-string translation toolkit
● Supports training, decoding
● Includes preprocessing scripts for parsing, etc.
● Many other features (optimization, Hiero, etc.)
Available open source!
http://phontron.com/travatar
NAIST WAT System
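WAT Results
First place in all tasks!
[Figure: two bar charts, BLEU and HUMAN evaluation, comparing NAIST with Other on en-ja, ja-en, zh-ja, and ja-zh. Gain labels shown on the bars: +13.0, +3.6, +2.2, +15.0, +3.8, +28.3, +1.8, +2.7. The assignment of each label to a task and metric is not recoverable from the extracted text.]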
System Elements
● Travatar! Same as [Neubig & Duh, ACL2014]
● Recurrent Neural Net Language Model
● Pre/post Processing (UNK splitting, transliteration)
● Dictionaries
Recurrent Neural Network LM
I can eat an apple </s>
● Vector representation → robustness
● Recurrent architecture → longer context
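To make the two bullets concrete, here is a minimal Elman-style recurrent language model in plain Python. It is a sketch under stated assumptions, not the model used in the NAIST system: the vocabulary, hidden size, and weights are invented for illustration, and the weights are random and untrained, so the probabilities are meaningless beyond showing the mechanics.

```python
import math
import random

# Assumption: toy vocabulary taken from the example sentence on the slide.
VOCAB = ["<s>", "I", "can", "eat", "an", "apple", "</s>"]
W2I = {w: i for i, w in enumerate(VOCAB)}
HIDDEN = 8                      # hypothetical hidden-layer size
_rng = random.Random(0)


def _rand_matrix(rows, cols):
    return [[_rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


E = _rand_matrix(len(VOCAB), HIDDEN)      # word vectors ("vector representation")
W_HH = _rand_matrix(HIDDEN, HIDDEN)       # recurrent weights ("longer context")
W_HY = _rand_matrix(len(VOCAB), HIDDEN)   # output projection


def step(h_prev, word_id):
    """One time step: h_t = tanh(E[w] + W_HH h_{t-1}); returns (h_t, softmax over vocab)."""
    h = [
        math.tanh(E[word_id][i] + sum(W_HH[i][j] * h_prev[j] for j in range(HIDDEN)))
        for i in range(HIDDEN)
    ]
    scores = [sum(W_HY[k][i] * h[i] for i in range(HIDDEN)) for k in range(len(VOCAB))]
    top = max(scores)                     # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return h, [e / total for e in exps]


def sentence_logprob(words):
    """Log-probability of `words` followed by </s>, starting from <s>."""
    h = [0.0] * HIDDEN
    prev = W2I["<s>"]
    logprob = 0.0
    for w in words + ["</s>"]:
        h, probs = step(h, prev)
        logprob += math.log(probs[W2I[w]])
        prev = W2I[w]
    return logprob


print(sentence_logprob("I can eat an apple".split()))
```

Because the hidden state `h` is carried from word to word, the prediction of `</s>` can in principle depend on every earlier word, unlike an n-gram model with a fixed window.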
Pre/post processing
● UNK segmentation (ja-en)
  試験 管立て → 試験 管 立て
  球内部 → 球 内部
● Kanji Normalization (ja-zh, zh-ja)
  イチョウ黄叶 → イチョウ黄葉
  臭気鉴定师 → 臭気鑑定師
● Transliteration (ja-en)
  Japan インテック → Japan Intekku
● Dictionary addition (ja-en)
  膿瘍 → apostema
  典型 → archetype
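Two of these steps, kanji normalization and dictionary addition, amount to table lookups and can be sketched as follows. The slides do not say how the actual system implements them, so everything here is an assumption: the function names are hypothetical, and both tables contain only the entries visible on the slide.

```python
# Assumption: a character-level table mapping simplified Chinese characters
# to the Japanese forms; only the three pairs shown on the slide are included.
SIMPLIFIED_TO_JAPANESE = str.maketrans({"叶": "葉", "鉴": "鑑", "师": "師"})

# Assumption: an extra bilingual dictionary consulted for words the
# translation model does not know; entries are the two from the slide.
EXTRA_DICT = {"膿瘍": "apostema", "典型": "archetype"}


def normalize_kanji(text: str) -> str:
    """Replace each simplified character with its Japanese counterpart."""
    return text.translate(SIMPLIFIED_TO_JAPANESE)


def translate_unknown(word: str) -> str:
    """Look an unknown word up in the extra dictionary; pass it through if absent."""
    return EXTRA_DICT.get(word, word)


print(normalize_kanji("イチョウ黄叶"))   # イチョウ黄葉
print(normalize_kanji("臭気鉴定师"))     # 臭気鑑定師
print(translate_unknown("膿瘍"))         # apostema
```

A real table would need to cover the full set of character variants shared between Japanese and Chinese; the three pairs above only reproduce the slide's examples.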
Conclusion
Future Work
LOSE at next year's WAT.
(Make Travatar so easy to use that others can use it to make really good MT systems for Asian languages.)
Starting soon! Training scripts to be available:
http://phontron.com/project/wat2014