# RALI Laboratory
In Montreal, Quebec, the RALI Laboratory (Laboratory of Applied Research in Computational Linguistics - Laboratoire de Recherche Appliquée en Linguistique Informatique) has worked on automatic text alignment, automatic text generation, automatic reaccentuation, language identification, and finite state transducers. RALI produces the "TransX family" of what it calls "a new generation" of translation support tools (TransType, TransTalk, TransCheck, and TransSearch), which are based on probabilistic translation models that automatically compute correspondences between the text produced by a translator and the original source-language text.
As explained on RALI's website in 1998: "(a) TransType speeds up the keying-in of a translation by anticipating a translator's choices and criticizing them when appropriate. In proposing its suggestions, TransType takes into account both the source text and the partial translation that the translator has already produced. (b) TransTalk is an automatic dictation system that makes use of a probabilistic translation model in order to improve the performance of its voice recognition model. (c) TransCheck automatically detects certain types of translation errors by verifying that the correspondences between the segments of a draft and the segments of the source text respect well-known properties of a good translation. (d) TransSearch allows translators to search databases of pre-existing translations in order to find ready-made solutions to all sorts of translation problems. In order to produce the required databases, the translations and the source language texts must first be aligned."
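The alignment step that TransSearch's databases require is often done with length-based dynamic programming, in the spirit of Gale and Church's method: sentences of similar length are paired, and occasional unmatched sentences are skipped. The sketch below is a toy illustration of that idea, not RALI's actual implementation; the character-length cost and the skip penalty are made-up assumptions.

```python
def align(src, tgt):
    """Toy length-based sentence aligner (Gale & Church in spirit):
    allows 1-1 matches plus skipping a sentence on either side."""
    SKIP = 25.0  # hypothetical penalty for leaving a sentence unmatched
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1-1 match: penalize length mismatch
                c = cost[i][j] + abs(len(src[i]) - len(tgt[j]))
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j, "match")
            if i < n:            # 1-0: source sentence left unmatched
                c = cost[i][j] + SKIP
                if c < cost[i + 1][j]:
                    cost[i + 1][j], back[i + 1][j] = c, (i, j, "skip")
            if j < m:            # 0-1: target sentence left unmatched
                c = cost[i][j] + SKIP
                if c < cost[i][j + 1]:
                    cost[i][j + 1], back[i][j + 1] = c, (i, j, "skip")
    # Backtrace the cheapest path, collecting the matched pairs
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj, op = back[i][j]
        if op == "match":
            pairs.append((src[pi], tgt[pj]))
        i, j = pi, pj
    return list(reversed(pairs))

src = ["Hello world.", "This is a much longer sentence about translation."]
tgt = ["Bonjour le monde.", "Ceci est une phrase beaucoup plus longue sur la traduction."]
pairs = align(src, tgt)
```

Real aligners refine this with length-ratio statistics and 2-1/1-2 merges, but the dynamic-programming core is the same.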
# Natural Language Group
The Natural Language Group (NLG) at the Information Sciences Institute (ISI) of the University of Southern California (USC) has been involved in various aspects of computational/natural language processing: machine translation, automated text summarization, multilingual verb access and text management, development of large concept taxonomies (ontologies), discourse and text generation, construction of large lexicons for various languages, and multimedia communication.
Eduard Hovy, head of the Natural Language Group, explained in August 1998: "People will write their own language for several reasons -- convenience, secrecy, and local applicability -- but that does not mean that other people are not interested in reading what they have to say!
This is especially true for companies involved in technology watch (say, a computer company that wants to know, daily, all the Japanese newspaper and other articles that pertain to what they make) or some Government Intelligence agencies (the people who provide the most up-to-date information for use by your government officials in making policy, etc.). One of the main problems faced by these kinds of people is the flood of information, so they tend to hire "weak" bilinguals who can rapidly scan incoming text and throw out what is not relevant, giving the relevant stuff to professional translators. Obviously, a combination of SUM (automated text summarization) and MT (machine translation) will help here; since MT is slow, it helps if you can do SUM in the foreign language, and then just do a quick and dirty MT on the result, allowing either a human or an automated IR-based text classifier to decide whether to keep or reject the article. For these kinds of reasons, the U.S. Government has over the past five years been funding research in MT, SUM, and IR (information retrieval), and is interested in starting a new program of research in Multilingual IR.
This way you will be able to one day open Netscape or Explorer or the like, type in your query in (say) English, and have the engine return texts in *all* the languages of the world. You will have them clustered by subarea, summarized by cluster, and the foreign summaries translated, all the kinds of things that you would like to have."
Eduard Hovy added in August 1999: "Over the past 12 months I have been contacted by a surprising number of new information technology (IT) companies and startups. Most of them plan to offer some variant of electronic commerce (online shopping, bartering, information gathering, etc.). Given the rather poor performance of current non-research level natural language processing technology (when is the last time you actually easily and accurately found a correct answer to a question on the web, without having to spend too much time sifting through irrelevant information?), this is a bit surprising. But I think everyone feels that the new developments in automated text summarization, question analysis, and so on, are going to make a significant difference. I hope so!--but the level of performance is not available yet.
It seems to me that we will not get a big breakthrough, but we will get a somewhat acceptable level of performance, and then see slow but sure incremental improvement. The reason is that it is very hard to make your computer really "understand" what you mean -- this requires us to build into the computer a network of "concepts" and their interrelationships that (at some level) mirror those in your own mind, at least in the subject areas of interest. The surface (word) level is not adequate -- when you type in "capital of Switzerland", current systems have no way of knowing whether you mean "capital city" or "financial capital". Yet the vast majority of people would choose the former reading, based on phrasing and on knowledge about what kinds of things one is likely to ask the web, and in what way. Several projects are now building, or proposing to build, such large "concept" networks.
This is not something one can do in two years, and not something that has a correct result. We have to develop both the network and the techniques for building it semi-automatically and self-adaptively. This is a big challenge."
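The "capital" ambiguity Hovy describes is the classic word sense disambiguation problem. One early surface-level technique, simplified Lesk, picks the sense whose dictionary gloss shares the most words with the query's context; it works only when the query supplies disambiguating context, which is exactly the limitation Hovy points to. The sense inventory and glosses below are invented for illustration:

```python
# Simplified Lesk-style word sense disambiguation: choose the sense
# whose gloss overlaps most with the words surrounding the ambiguous term.
# Toy sense inventory with hypothetical glosses (not from a real dictionary):
SENSES = {
    "capital city": "city seat of government of a country or region",
    "financial capital": "wealth money assets used to invest in a business",
}

def disambiguate(query, senses):
    """Return the sense whose gloss shares the most words with the query."""
    context = set(query.lower().split())
    # A real implementation would strip stopwords and stem; this toy does not.
    return max(senses, key=lambda s: len(context & set(senses[s].split())))

disambiguate("which city is the capital of Switzerland", SENSES)
# → "capital city"  (gloss overlap: "city", "of")
disambiguate("firms raise capital to invest in Zurich", SENSES)
# → "financial capital"  (gloss overlap: "to", "invest", "in")
```

A bare query like "capital of Switzerland" overlaps neither gloss, which is why, as Hovy argues, deeper concept networks are needed rather than word-level matching alone.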
Eduard Hovy added in September 2000: "I see a continued increase in small companies using language technology in one way or another: either to provide search, or translation, or reports, or some other communication function. The number of niches in which language technology can be applied continues to surprise me: from stock reports and updates to business-to-business communications to marketing...
With regard to research, the main breakthrough I see was led by a colleague at ISI (I am proud to say), Kevin Knight. A team of scientists and students last summer at Johns Hopkins University in Maryland developed a faster and otherwise improved version of a method originally developed (and kept proprietary) by IBM about 12 years ago.
This method allows one to create a machine translation (MT) system automatically, as long as one gives it enough bilingual text.
Essentially the method finds all correspondences in words and word positions across the two languages and then builds up large tables of rules for what gets translated to what, and how it is phrased.
Although the output quality is still low -- no-one would consider this a final product, and no-one would use the translated output as is -- the team built a (low-quality) Chinese-to-English MT system in 24 hours. That is a phenomenal feat -- this has never been done before.
(Of course, say the critics: you need something like 3 million sentence pairs, which you can only get from the parliaments of Canada, Hong Kong, or other bilingual countries; and of course, they say, the quality is low. But the fact is that more bilingual and semi-equivalent text is becoming available online every day, and the quality will keep improving to at least the current levels of MT engines built by hand. Of that I am certain.)
Other developments are less spectacular. There's a steady improvement in the performance of systems that can decide whether an ambiguous word such as "bat" means "flying mammal" or "sports tool" or "to hit"; there is solid work on cross-language information retrieval (which you will soon see in being able to find Chinese and French documents on the web even though you type in English-only queries), and there is some rather rapid development of systems that answer simple questions automatically (rather like the popular web system AskJeeves, but this time done by computers, not humans). These systems refer to a large collection of text to find "factoids" (not opinions or causes or chains of events) in response to questions such as "what is the capital of Uganda?" or "how old is President Clinton?" or "who invented the xerox process?", and they do so rather better than I had expected."
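The table-building method described above descends from IBM's statistical translation models of the late 1980s: given enough sentence pairs, expectation-maximization discovers which words translate which. The sketch below is a minimal estimator in the style of IBM Model 1, drastically simplified relative to anything Knight's team built; the tiny French-English bitext is invented for illustration.

```python
from collections import defaultdict

def train_ibm1(bitext, iterations=10):
    """Estimate word-translation probabilities t(e | f) with EM,
    in the style of IBM Model 1 (no word-order modeling)."""
    e_vocab = {e for _, es in bitext for e in es}
    # Start uniform: every source word equally likely to produce every e
    t = defaultdict(lambda: 1.0 / len(e_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(e, f)
        total = defaultdict(float)  # expected counts c(f)
        for fs, es in bitext:
            for e in es:
                # Share e's probability mass among the source words
                z = sum(t[(e, f)] for f in fs)
                for f in fs:
                    c = t[(e, f)] / z
                    count[(e, f)] += c
                    total[f] += c
        for (e, f), c in count.items():  # M-step: renormalize
            t[(e, f)] = c / total[f]
    return t

bitext = [
    (["la", "maison"], ["the", "house"]),
    (["la", "fleur"], ["the", "flower"]),
    (["maison"], ["house"]),
]
t = train_ibm1(bitext)
# Co-occurrence statistics pull t(("house", "maison")) toward 1,
# even though no dictionary was supplied.
```

Real systems add alignment positions, phrase tables, and language models on top of this core, but the principle is exactly what Hovy describes: correspondences emerge automatically from bilingual text.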
# ISSCO
In Geneva, Switzerland, ISSCO (Dalle Molle Institute for Semantic and Cognitive Studies - Institut Dalle Molle pour les Études Sémantiques et Cognitives) is a research laboratory conducting basic and applied research in computational linguistics (CL) and artificial intelligence (AI) for a number of Swiss and European research projects. The University of Geneva has provided administrative support and infrastructure. Research is funded through grants and contracts with public and private bodies.
Created by the Fondation Dalle Molle in 1972 to conduct research in cognition and semantics, ISSCO has come to specialize in natural language processing, including multilingual language processing, in a number of areas: machine translation, linguistic environments, multilingual generation, discourse processing, data collection, etc.
ISSCO is multi-disciplinary and multi-national. As explained on its website in 1998, "its staff and its visitors [are drawn] from the disciplines of computer science, linguistics, mathematics, psychology and philosophy. The long-term staff of the Institute is relatively small in number; with a much larger number of visitors coming for stays ranging from a month to two years. This ensures a continual exchange of ideas and encourages flexibility of approach amongst those associated with the Institute."
# UNDL Foundation
The UNL (universal networking language) project was launched in the mid-1990s as a main digital metalanguage project by the Institute of Advanced Studies (IAS) of the United Nations University (UNU) in Tokyo, Japan. As explained on the bilingual (English, Japanese) website in 1998: "UNL is a language that -- with its companion "enconverter" and "deconverter" software -- enables communication among peoples of differing native languages. It will reside, as a plug-in for popular web browsers, on the internet, and will be compatible with standard network servers. The technology will be shared among the member states of the United Nations. Any person with access to the internet will be able to "enconvert" text from any native language of a member state into UNL. Just as easily, any UNL text can be "deconverted" from UNL into native languages. United Nations University's UNL Center will work with its partners to create and promote the UNL software, which will be compatible with popular network servers and computing platforms."
In 2000, 120 researchers worldwide were working on the multilingual project in 16 languages (Arabic, Brazilian Portuguese, Chinese, English, French, German, Hindi, Indonesian, Italian, Japanese, Latvian, Mongolian, Russian, Spanish, Swahili, and Thai). The UNDL Foundation (UNDL: Universal Networking Digital Language) was founded in January 2001 to develop and promote the UNL project.
# Chronology
[Each line begins with the year or the year/month.]
1968: ASCII is the first character set encoding.
1971: Project Gutenberg is the first digital library.
1974: The internet takes off.
1990: The web is invented by Tim Berners-Lee.
1991/01: Unicode is a universal character set encoding for all languages.
1993/11: Mosaic is the first web browser.
1994/05: The Human-Languages Page is a catalog of language-related internet resources.
1994/10: The World Wide Web Consortium deals with internationalization and localization.
1994: Travland is dedicated to both travel and languages.
1995/12: The Kotoba Home Page deals with keyboard-related language issues.
1995: The Internet Dictionary Project works on creating free translating dictionaries.
1995: NetGlos is a multilingual glossary of internet terminology.
1995: Global Reach is a virtual consultancy stemming from Euro-Marketing Associates.
1995: LISA is the Localization Industry Standards Association.
1995: "The Ethnologue: Languages of the World" offers a free online version.
1996/04: OneLook Dictionaries is a fast finder in online dictionaries.
1997/01: UNL (universal networking language) is a digital metalanguage project.
1997/12: AltaVista launches AltaVista Translation, also called Babel Fish.
1997: The Logos Dictionary goes online for free.
1999/12: Britannica.com is the first main English-language online encyclopedia.
1999/12: WebEncyclo is the first main French-language online encyclopedia.
1999: WordReference.com offers free online bilingual translating dictionaries.
2000/02: yourDictionary.com is a major language portal.
2000/07: Non-English-speaking internet users reach 50%.
2001/01: Wikipedia is a main free multilingual cooperative encyclopedia.
2001/01: The UNDL Foundation develops UNL, a digital metalanguage project.
2001/04: The Human-Languages Page becomes the iLoveLanguages portal.
2004/01: Project Gutenberg Europe is launched as a multilingual project.
2007/03: IATE is the new terminological database of the European Union.
2009: "The Ethnologue" launches its 16th edition as an encyclopedic reference work.
# Websites
Alis Technologies
Aquarius.net: Directory of Localization Experts
ASCII Table
Asia-Pacific Association for Machine Translation (AAMT)
Association for Computational Linguistics (ACL)
Association for Machine Translation in the Americas (AMTA)
[email protected]
ELRA (European Language Resources Association)
ELSNET (European Network of Excellence in Human Language Technologies)
Encyclopaedia Britannica Online
Encyclopaedia Universalis
Ethnologue
Ethnologue: Endangered Languages
EUROCALL (European Association for Computer-Assisted Language Learning)
European Association for Machine Translation (EAMT)
European Bureau for Lesser-Used Languages (EBLUL)
European Commission: Languages of Europe
European Minority Languages (list of the Institute Sabhal Mòr Ostaig)
Google Translate
Grand Dictionnaire Terminologique (GDT)
IATE: InterActive Terminology for Europe
ILOTERM (ILO: International Labor Organization)
iLoveLanguages
International Committee on Computational Linguistics (ICCL)
Internet Dictionary Project (IDP)
Internet Society (ISOC)
Laboratoire CLIPS (Communication Langagière et Interaction Personne-Système)
Laboratoire CLIPS: GETA (Groupe d'étude pour la Traduction Automatique)
LINGUIST List (The)
Localization Industry Standards Association (LISA)
Logos: Multilingual Translation Portal
MAITS (Multilingual Application Interface for Telematic Services)
Merriam-Webster Online
Natural Language Group (NLG) at USC/ISI
Nuance
OneLook Dictionary Search
Oxford English Dictionary (OED)
Oxford Reference Online (ORO)
PAHOMTS (PAHO: Pan American Health Organization)
Palo Alto Research Center (PARC)
Palo Alto Research Center (PARC): Natural Language Processing
RALI (Recherche Appliquée en Linguistique Informatique)
Reverso: Free Online Translator
SDL
SDL: FreeTranslation.com
SDL Trados
Softissimo
SYSTRAN
SYSTRANet: Free Online Translator
TEI: Text Encoding Initiative
TERMITE (Terminology of Telecommunications)
*tmx Vokabeltrainer
Transparent Language
TransPerfect
Travlang
Travlang's Translating Dictionaries
UNDL (Universal Networking Digital Language) Foundation
Unicode
Yahoo! Babel Fish
YourDictionary.com
YourDictionary.com: Endangered Languages
W3C: World Wide Web Consortium
W3C Internationalization Activity
WELL (Web Enhanced Language Learning)
Wordfast
Xerox XRCE (Xerox Research Centre Europe)
Xerox XRCE: Cross-Language Technologies