§ CS276: Information Retrieval and Web Search
§ Pandu Nayak and Prabhakar Raghavan
§ Term Vocabulary & Postings lists
§ (Tokenization)
§ Previous lecture:
§ Structure of the inverted index:
§ Dictionary (Vocabulary) & Inverted List (Postings)
§ Vocabulary sorted by term (word)
§ To process a Boolean query:
§ Perform the intersection (merge) linearly
§ Topics for This Lecture
§ Steps in building the index
§ Preprocessing to form the vocabulary
§ Documents
§ Tokenization
§ Which words (terms) are included in the index
§ Inverted List (Postings)
§ Faster merging with skip lists
§ Phrase queries
§ Indexing Process Diagram
§ Parsing Documents
§ First, determine the document format
§ pdf/word/excel/html ?
§ What language is it written in?
§ What character set/encoding is used?
§ Complications: Format/Language
§ The documents to be indexed may be written in different languages
§ A single index may contain terms from several languages
§ A single document can itself contain multiple languages
§ Example: an email in English whose attachment is a document written in German
§ What is the unit of a document?
§ A file?
§ An email?
§ An email with 5 attachments?
§ A group of files (a PPT deck or HTML pages)?
§ TOKENS & TERMS
§ Tokenization
§ Input: “Friends, Romans, Countrymen”
§ Output: Tokens
§ Friends
§ Romans
§ Countrymen
§ So a token is a sequence of characters in a document
§ Each token is a candidate index entry, after preprocessing
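§ For concreteness, here is a minimal sketch of a tokenizer in Python; the regular expression is an illustrative assumption, not the rule used by any particular system:

    import re

    def tokenize(text):
        # Toy tokenizer: a token is a maximal run of letters, digits or
        # apostrophes. Real tokenizers must also handle the hyphen, number
        # and language issues discussed below.
        return re.findall(r"[A-Za-z0-9']+", text)

    print(tokenize("Friends, Romans, Countrymen"))
    # ['Friends', 'Romans', 'Countrymen']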
§ Some issues in tokenization:
§ Finland’s capital → Finland? Finlands? Finland’s?
§ Hewlett-Packard → Hewlett and Packard as two tokens or one?
§ state-of-the-art: break up hyphenated sequence
§ co-education
§ lowercase, lower-case, lower case?
§ San Francisco: one token or two?
§ How do we decide that SF is one token?
§ Numbers
§ 3/12/91 Mar. 12, 1991 12/3/91
§ No. B-52
§ Code: 324a3df234cb23e
§ Phone: (0651) 234-2333
§ Such numbers often have embedded spaces or punctuation
§ Older IR systems did not index numbers
§ But numbers matter: imagine searching an IR system for a program error code, or looking up a specific number
§ One solution is to index character n-grams
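§ A sketch of the character n-gram idea (the function name and the choice of n are assumptions for illustration): index every overlapping n-character substring of a token, so long codes remain findable from fragments:

    def char_ngrams(token, n=3):
        # All overlapping n-character substrings of the token.
        if len(token) <= n:
            return [token]
        return [token[i:i + n] for i in range(len(token) - n + 1)]

    print(char_ngrams("B-52"))             # ['B-5', '-52']
    print(char_ngrams("324a3df234cb23e"))  # ['324', '24a', '4a3', ...]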
§ Tokenization: language issues
§ French
§ L'ensemble → one token or two?
§ L ? L’ ? Le ?
§ Want l’ensemble to match with un ensemble
§ Until at least 2003, this did not work on Google
§ Internationalization!
§ German noun compounds are not segmented
§ Lebensversicherungsgesellschaftsangestellter
§ ‘life insurance company employee’
§ German retrieval systems benefit greatly from a compound splitter module
§ Can give a 15% performance boost for German
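§ A compound splitter can be sketched as recursive lookup against a vocabulary of known word parts; the tiny vocabulary below (which already folds in the German linking 's') is purely an illustrative assumption:

    def split_compound(word, vocab, min_len=3):
        # Greedily try the longest known prefix, then recurse on the rest.
        # Returns the parts, or None if the word cannot be fully split.
        word = word.lower()
        if word in vocab:
            return [word]
        for i in range(len(word) - min_len, min_len - 1, -1):
            if word[:i] in vocab:
                rest = split_compound(word[i:], vocab, min_len)
                if rest is not None:
                    return [word[:i]] + rest
        return None

    vocab = {"lebens", "versicherungs", "gesellschafts", "angestellter"}
    print(split_compound("Lebensversicherungsgesellschaftsangestellter", vocab))
    # ['lebens', 'versicherungs', 'gesellschafts', 'angestellter']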
§ Chinese and Japanese have no spaces between words:
§ 莎拉波娃现在居住在美国东南部的佛罗里达。
§ Not always guaranteed a unique tokenization
§ Further complicated in Japanese: Dates/amounts in multiple formats
§ Arabic is written right to left, but numbers within the text are read left to right
§ Words are separated, but letter forms within a word form complex ligatures
§ [Arabic example sentence: reading starts at the right and alternates direction around the embedded numbers]
§ ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’
§ Stop words
§ Using a stop list, very frequent (but semantically light) words can be excluded from the index:
§ They carry little semantic content: the, a, and, to, be
§ There are many of them: ~30% of all word occurrences in a corpus
§ The recent trend, however, has been to keep stop lists small or drop them entirely:
§ With good compression, including stop words adds little to index size
§ With good query optimization, they cost little at query time
§ You also need to handle queries such as:
§ Film title: “King of Denmark”
§ Song titles: “Let it be”, “To be or not to be”
§ Relational query: “flights to London”
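§ A minimal sketch of stop-list filtering, and of why these queries make it risky (the stop list here is a tiny illustrative subset, not any system's real list):

    STOP_WORDS = {"the", "a", "and", "to", "be", "of", "it", "or", "not"}

    def remove_stop_words(tokens):
        # Drop any token that appears in the stop list (case-insensitive).
        return [t for t in tokens if t.lower() not in STOP_WORDS]

    print(remove_stop_words(["to", "be", "or", "not", "to", "be"]))
    # [] -- the whole song title disappears
    print(remove_stop_words(["flights", "to", "London"]))
    # ['flights', 'London'] -- the relational sense of "to" is lost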
§ Normalizing words to terms
§ We need to normalize words in the indexed text, as well as query words, into the same form
§ We want to match U.S.A. and USA
§ Result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary
§ We most commonly implicitly define equivalence classes of terms by, e.g.,
§ deleting periods to form a term
§ U.S.A., USA → USA
§ deleting hyphens to form a term
§ anti-discriminatory, antidiscriminatory → antidiscriminatory
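§ A sketch of this implicit equivalence classing; the two rules below (delete periods, delete hyphens) are exactly the examples above, everything else is an assumption:

    def normalize(token):
        # Map a token to its term by deleting periods and hyphens.
        # (Case folding is treated separately on a later slide.)
        return token.replace(".", "").replace("-", "")

    print(normalize("U.S.A."))               # 'USA'
    print(normalize("anti-discriminatory"))  # 'antidiscriminatory'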
§ Normalization: other languages
§ Accents: e.g., French résumé vs. resume.
§ Umlauts: e.g., German: Tuebingen vs. Tübingen
§ Should be equivalent
§ Most important criterion:
§ How are your users likely to write their queries for these words?
§ Even in languages that standardly have accents, users often may not type them
§ Often best to normalize to a de-accented term
§ Tuebingen, Tübingen, Tubingen → Tubingen
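§ A common way to de-accent terms is Unicode decomposition followed by dropping the combining marks; a minimal sketch (note it does not map the spelled-out "Tuebingen" form, which needs an extra language-specific rule):

    import unicodedata

    def remove_accents(term):
        # NFD splits accented characters into base letter + combining mark,
        # then we keep only the non-combining characters.
        decomposed = unicodedata.normalize("NFD", term)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    print(remove_accents("Tübingen"))  # 'Tubingen'
    print(remove_accents("résumé"))    # 'resume'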
§ Normalization: other languages
§ Normalization of things like date forms
§ 7月30日 vs. 7/30
§ Japanese use of kana vs. Chinese characters
§ Tokenization and normalization may depend on the language and so are intertwined with language detection
§ Crucial: Need to “normalize” indexed text as well as query terms into the same form
§ Case folding
§ Reduce all letters to lower case
§ exception: upper case in mid-sentence?
§ e.g., General Motors
§ Fed vs. fed
§ SAIL vs. sail
§ Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization…
§ Google example:
§ Query C.A.T.
§ #1 result was for “cat” (well, Lolcats) not Caterpillar Inc.
§ Normalization to terms
§ An alternative to equivalence classing is to do asymmetric expansion
§ An example of where this may be useful
§ Enter: window Search: window, windows
§ Enter: windows Search: Windows, windows, window
§ Enter: Windows Search: Windows
§ Potentially more powerful, but less efficient
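§ Asymmetric expansion can be sketched as a lookup table keyed on the exact query form; the table below simply encodes the window/Windows example above and is illustrative only:

    # Query term -> set of index terms to search. Note the asymmetry:
    # the lowercase query expands broadly, the capitalized one does not.
    EXPANSIONS = {
        "window":  {"window", "windows"},
        "windows": {"Windows", "windows", "window"},
        "Windows": {"Windows"},
    }

    def expand_query_term(term):
        return EXPANSIONS.get(term, {term})

    print(expand_query_term("window"))   # {'window', 'windows'}
    print(expand_query_term("Windows"))  # {'Windows'}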
§ Thesauri and soundex
§ Do we handle synonyms and homonyms?
§ E.g., by hand-constructed equivalence classes
§ car = automobile color = colour
§ We can rewrite to form equivalence-class terms
§ When the document contains automobile, index it under car-automobile (and vice-versa)
§ Or we can expand a query
§ When the query contains automobile, look under car as well
§ What about spelling mistakes?
§ One approach is soundex, which forms equivalence classes of words based on phonetic heuristics
§ More in lectures 3 and 9
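§ For reference, a minimal sketch of the classic 4-character Soundex code (simplified: the special handling of 'h'/'w' between same-coded consonants is omitted, as in many textbook versions):

    def soundex(word):
        # Letter -> digit map; vowels, h, w, y get no digit.
        codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
                 **dict.fromkeys("dt", "3"), "l": "4",
                 **dict.fromkeys("mn", "5"), "r": "6"}
        word = word.lower()
        first = word[0].upper()
        digits = [codes.get(ch, "") for ch in word]
        result, prev = [], digits[0]
        for d in digits[1:]:
            if d and d != prev:          # skip repeats of the same code
                result.append(d)
            prev = d
        return (first + "".join(result) + "000")[:4]  # pad/truncate to 4

    print(soundex("Herman"))  # 'H655'
    print(soundex("Robert"))  # 'R163'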
§ Lemmatization
§ Reduce inflectional/variant forms to base form
§ E.g.,
§ am, are, is ® be
§ car, cars, car's, cars' ® car
§ the boy's cars are different colors ® the boy car be different color
§ Lemmatization implies doing “proper” reduction to dictionary headword form
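§ As an illustration, a dictionary-based lemmatizer such as NLTK's WordNet lemmatizer performs this kind of reduction (assuming NLTK and its WordNet data are installed; exact outputs depend on the installed WordNet version):

    # pip install nltk, then once: import nltk; nltk.download("wordnet")
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize("cars"))          # 'car'
    print(lemmatizer.lemmatize("are", pos="v"))  # 'be'
    print(lemmatizer.lemmatize("colors"))        # 'color'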
§ Stemming
§ Reduce terms to their “roots” before indexing
§ “Stemming” suggests crude affix chopping
§ language dependent
§ e.g., automate(s), automatic, automation all reduced to automat.
§ Porter’s algorithm
§ Commonest algorithm for stemming English
§ Results suggest it’s at least as good as other stemming options
§ Conventions + 5 phases of reductions
§ phases applied sequentially
§ each phase consists of a set of commands
§ sample convention: Of the rules in a compound command, select the one that applies to the longest suffix.
§ Typical rules in Porter
§ sses → ss
§ ies → i
§ ational → ate
§ tional → tion
§ Rules sensitive to the measure of words
§ (m>1) EMENT →
§ replacement → replac
§ cement → cement
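§ A sketch of how one compound command's rules are applied under the longest-suffix convention; the rule list is just the sample rules above, without the measure conditions of the full Porter algorithm:

    # (suffix, replacement) pairs from one compound command.
    RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

    def apply_rules(word, rules=RULES):
        # Convention: of the rules that match, apply the one whose
        # suffix is longest; if none match, leave the word unchanged.
        best = None
        for suffix, repl in rules:
            if word.endswith(suffix) and (best is None or len(suffix) > len(best[0])):
                best = (suffix, repl)
        if best is None:
            return word
        suffix, repl = best
        return word[:-len(suffix)] + repl

    print(apply_rules("caresses"))     # 'caress'
    print(apply_rules("ponies"))       # 'poni'
    print(apply_rules("relational"))   # 'relate'   ('ational' beats 'tional')
    print(apply_rules("conditional"))  # 'condition'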
§ Other stemmers
§ Other stemmers exist, e.g., Lovins stemmer
§ http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
§ Single-pass, longest suffix removal (about 250 rules)
§ Full morphological analysis – at most modest benefits for retrieval
§ Do stemming and other normalizations help?
§ English: very mixed results. Helps recall but harms precision
§ operative (dentistry) ⇒ oper
§ operational (research) ⇒ oper
§ operating (systems) ⇒ oper
§ Definitely useful for Spanish, German, Finnish, …
§ 30% performance gains for Finnish!
§ Recall basic merge
§ Walk through the two postings simultaneously, in time linear in the total number of postings entries
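§ A minimal sketch of that linear merge over two sorted docID lists (plain Python lists here; real postings use more compact structures and, next, skip pointers):

    def intersect(p1, p2):
        # Walk both sorted postings lists simultaneously:
        # O(len(p1) + len(p2)) comparisons.
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    print(intersect([1, 2, 4, 11, 31, 45, 173], [2, 31, 54, 101]))  # [2, 31]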