§ CS276: Information Retrieval and Web Search
§ Pandu Nayak and Prabhakar Raghavan
§ Term Vocabulary & Postings lists
§ (Tokenization)
§ Previous lecture:
§ Structure of the inverted index:
§ Dictionary (Vocabulary) & Inverted List (Postings)
§ Vocabulary sorted by term (word)
§ To process a Boolean query:
§ Perform the intersection (merge) linearly
§ Topics for This Lecture
§ Steps in building the index
§ Preprocessing to form the vocabulary
§ Documents
§ Tokenization
§ Which words (terms) are included in the index
§ Inverted List (Postings)
§ Faster merging with skip lists
§ Phrase queries
§ Indexing Process Diagram
§ Parsing Documents
§ First, determine the document format
§ pdf/word/excel/html ?
§ What language is it written in?
§ What character set/encoding is used?
§ Complications: Format/Language
§ The documents to be indexed may be written in different languages
§ A single index may contain terms from several languages
§ A single document can itself contain multiple languages
§ Example: an email in English whose attachment is a document written in German
§ What is the unit of a document?
§ A file?
§ An email?
§ An email with 5 attachments?
§ A group of files (a PPT deck or HTML pages)?
§ TOKENS & TERMS
§ Tokenization
§ Input: “Friends, Romans, Countrymen”
§ Output: Tokens
§ Friends
§ Romans
§ Countrymen
§ So a token is a sequence of characters in a document
§ Each token is a candidate index entry, after preprocessing
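§ For concreteness, here is a minimal sketch of a tokenizer in Python; the regular expression is an illustrative assumption, not the rule used by any particular system:

    import re

    def tokenize(text):
        # Toy tokenizer: a token is a maximal run of letters, digits or
        # apostrophes. Real tokenizers must also handle the hyphen, number
        # and language issues discussed below.
        return re.findall(r"[A-Za-z0-9']+", text)

    print(tokenize("Friends, Romans, Countrymen"))
    # ['Friends', 'Romans', 'Countrymen']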
§ Some issues in tokenization:
§ Finland’s capital → Finland? Finlands? Finland’s?
§ Hewlett-Packard → Hewlett and Packard as two tokens or one?
§ state-of-the-art: break up hyphenated sequence
§ co-education
§ lowercase, lower-case, lower case?
§ San Francisco: one token or two?
§ How do we decide that SF is one token?
§ Numbers
§ 3/12/91 Mar. 12, 1991 12/3/91
§ No. B-52
§ Code: 324a3df234cb23e
§ Phone: (0651) 234-2333
§ Such numbers often have embedded spaces or punctuation
§ Older IR systems did not index numbers
§ But numbers matter: imagine searching an IR system for a program error code, or looking up a specific number
§ One solution is to index character n-grams
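§ A sketch of the character n-gram idea (the function name and the choice of n are assumptions for illustration): index every overlapping n-character substring of a token, so long codes remain findable from fragments:

    def char_ngrams(token, n=3):
        # All overlapping n-character substrings of the token.
        if len(token) <= n:
            return [token]
        return [token[i:i + n] for i in range(len(token) - n + 1)]

    print(char_ngrams("B-52"))             # ['B-5', '-52']
    print(char_ngrams("324a3df234cb23e"))  # ['324', '24a', '4a3', ...]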
§ Tokenization: language issues
§ French
§ L'ensemble → one token or two?
§ L ? L’ ? Le ?
§ Want l’ensemble to match with un ensemble
§ Until at least 2003, this did not work on Google
§ Internationalization!
§ German noun compounds are not segmented
§ Lebensversicherungsgesellschaftsangestellter
§ ‘life insurance company employee’
§ German retrieval systems benefit greatly from a compound splitter module
§ Can give a 15% performance boost for German
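§ A compound splitter can be sketched as recursive lookup against a vocabulary of known word parts; the tiny vocabulary below (which already folds in the German linking 's') is purely an illustrative assumption:

    def split_compound(word, vocab, min_len=3):
        # Greedily try the longest known prefix, then recurse on the rest.
        # Returns the parts, or None if the word cannot be fully split.
        word = word.lower()
        if word in vocab:
            return [word]
        for i in range(len(word) - min_len, min_len - 1, -1):
            if word[:i] in vocab:
                rest = split_compound(word[i:], vocab, min_len)
                if rest is not None:
                    return [word[:i]] + rest
        return None

    vocab = {"lebens", "versicherungs", "gesellschafts", "angestellter"}
    print(split_compound("Lebensversicherungsgesellschaftsangestellter", vocab))
    # ['lebens', 'versicherungs', 'gesellschafts', 'angestellter']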
§ Chinese and Japanese have no spaces between words:
§ 莎拉波娃现在居住在美国东南部的佛罗里达。
§ Not always guaranteed a unique tokenization
§ Further complicated in Japanese: Dates/amounts in multiple formats
§ Arabic is written right to left, but numbers within the text are read left to right
§ Words are separated, but letter forms within a word form complex ligatures
§ [Arabic example sentence: reading starts at the right and alternates direction around the embedded numbers]
§ ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’
§ Stop words
§ Using a stop list, very frequent (but semantically light) words can be excluded from the index:
§ They carry little semantic content: the, a, and, to, be
§ There are many of them: ~30% of all word occurrences in a corpus
§ The recent trend, however, has been to keep stop lists small or drop them entirely:
§ With good compression, including stop words adds little to index size
§ With good query optimization, they cost little at query time
§ You also need to handle queries such as:
§ Film title: “King of Denmark”
§ Song titles: “Let it be”, “To be or not to be”
§ Relational query: “flights to London”
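§ A minimal sketch of stop-list filtering, and of why these queries make it risky (the stop list here is a tiny illustrative subset, not any system's real list):

    STOP_WORDS = {"the", "a", "and", "to", "be", "of", "it", "or", "not"}

    def remove_stop_words(tokens):
        # Drop any token that appears in the stop list (case-insensitive).
        return [t for t in tokens if t.lower() not in STOP_WORDS]

    print(remove_stop_words(["to", "be", "or", "not", "to", "be"]))
    # [] -- the whole song title disappears
    print(remove_stop_words(["flights", "to", "London"]))
    # ['flights', 'London'] -- the relational sense of "to" is lost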
§ Normalizing words to terms
§ We need to normalize words in the indexed text, as well as query words, into the same form
§ We want to match U.S.A. and USA
§ Result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary
§ We most commonly implicitly define equivalence classes of terms by, e.g.,
§ deleting periods to form a term
§ U.S.A., USA → USA
§ deleting hyphens to form a term
§ anti-discriminatory, antidiscriminatory → antidiscriminatory
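§ A sketch of this implicit equivalence classing; the two rules below (delete periods, delete hyphens) are exactly the examples above, everything else is an assumption:

    def normalize(token):
        # Map a token to its term by deleting periods and hyphens.
        # (Case folding is treated separately on a later slide.)
        return token.replace(".", "").replace("-", "")

    print(normalize("U.S.A."))               # 'USA'
    print(normalize("anti-discriminatory"))  # 'antidiscriminatory'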
§ Normalization: other languages
§ Accents: e.g., French résumé vs. resume.
§ Umlauts: e.g., German: Tuebingen vs. Tübingen
§ Should be equivalent
§ Most important criterion:
§ How are your users likely to write their queries for these words?
§ Even in languages that standardly have accents, users often may not type them
§ Often best to normalize to a de-accented term
§ Tuebingen, Tübingen, Tubingen → Tubingen
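§ A common way to de-accent terms is Unicode decomposition followed by dropping the combining marks; a minimal sketch (note it does not map the spelled-out "Tuebingen" form, which needs an extra language-specific rule):

    import unicodedata

    def remove_accents(term):
        # NFD splits accented characters into base letter + combining mark,
        # then we keep only the non-combining characters.
        decomposed = unicodedata.normalize("NFD", term)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    print(remove_accents("Tübingen"))  # 'Tubingen'
    print(remove_accents("résumé"))    # 'resume'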
§ Normalization: other languages
§ Normalization of things like date forms
§ 7月30日 vs. 7/30
§ Japanese use of kana vs. Chinese characters
§ Tokenization and normalization may depend on the language and so are intertwined with language detection
§ Crucial: Need to “normalize” indexed text as well as query terms into the same form
§ Case folding
§ Reduce all letters to lower case
§ exception: upper case in mid-sentence?
§ e.g., General Motors
§ Fed vs. fed
§ SAIL vs. sail
§ Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization…
§ Google example:
§ Query C.A.T.
§ #1 result was for “cat” (well, Lolcats) not Caterpillar Inc.
§ Normalization to terms
§ An alternative to equivalence classing is to do asymmetric expansion
§ An example of where this may be useful
§ Enter: window Search: window, windows
§ Enter: windows Search: Windows, windows, window
§ Enter: Windows Search: Windows
§ Potentially more powerful, but less efficient
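§ Asymmetric expansion can be sketched as a lookup table keyed on the exact query form; the table below simply encodes the window/Windows example above and is illustrative only:

    # Query term -> set of index terms to search. Note the asymmetry:
    # the lowercase query expands broadly, the capitalized one does not.
    EXPANSIONS = {
        "window":  {"window", "windows"},
        "windows": {"Windows", "windows", "window"},
        "Windows": {"Windows"},
    }

    def expand_query_term(term):
        return EXPANSIONS.get(term, {term})

    print(expand_query_term("window"))   # {'window', 'windows'}
    print(expand_query_term("Windows"))  # {'Windows'}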
§ Thesauri and soundex
§ Do we handle synonyms and homonyms?
§ E.g., by hand-constructed equivalence classes
§ car = automobile color = colour
§ We can rewrite to form equivalence-class terms
§ When the document contains automobile, index it under car-automobile (and vice-versa)
§ Or we can expand a query
§ When the query contains automobile, look under car as well
§ What about spelling mistakes?
§ One approach is soundex, which forms equivalence classes of words based on phonetic heuristics
§ More in lectures 3 and 9
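§ For reference, a minimal sketch of the classic 4-character Soundex code (simplified: the special handling of 'h'/'w' between same-coded consonants is omitted, as in many textbook versions):

    def soundex(word):
        # Letter -> digit map; vowels, h, w, y get no digit.
        codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
                 **dict.fromkeys("dt", "3"), "l": "4",
                 **dict.fromkeys("mn", "5"), "r": "6"}
        word = word.lower()
        first = word[0].upper()
        digits = [codes.get(ch, "") for ch in word]
        result, prev = [], digits[0]
        for d in digits[1:]:
            if d and d != prev:          # skip repeats of the same code
                result.append(d)
            prev = d
        return (first + "".join(result) + "000")[:4]  # pad/truncate to 4

    print(soundex("Herman"))  # 'H655'
    print(soundex("Robert"))  # 'R163'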
§ Lemmatization
§ Reduce inflectional/variant forms to base form
§ E.g.,
§ am, are, is ® be
§ car, cars, car's, cars' ® car
§ the boy's cars are different colors ® the boy car be different color
§ Lemmatization implies doing “proper” reduction to dictionary headword form
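§ As an illustration, a dictionary-based lemmatizer such as NLTK's WordNet lemmatizer performs this kind of reduction (assuming NLTK and its WordNet data are installed; exact outputs depend on the installed WordNet version):

    # pip install nltk, then once: import nltk; nltk.download("wordnet")
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize("cars"))          # 'car'
    print(lemmatizer.lemmatize("are", pos="v"))  # 'be'
    print(lemmatizer.lemmatize("colors"))        # 'color'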
§ Stemming
§ Reduce terms to their “roots” before indexing
§ “Stemming” suggests crude affix chopping
§ language dependent
§ e.g., automate(s), automatic, automation all reduced to automat.
§ Porter’s algorithm
§ Commonest algorithm for stemming English
§ Results suggest it’s at least as good as other stemming options
§ Conventions + 5 phases of reductions
§ phases applied sequentially
§ each phase consists of a set of commands
§ sample convention: Of the rules in a compound command, select the one that applies to the longest suffix.
§ Typical rules in Porter
§ sses → ss
§ ies → i
§ ational → ate
§ tional → tion
§ Rules sensitive to the measure of words
§ (m>1) EMENT →
§ replacement → replac
§ cement → cement
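§ A sketch of how one compound command's rules are applied under the longest-suffix convention; the rule list is just the sample rules above, without the measure conditions of the full Porter algorithm:

    # (suffix, replacement) pairs from one compound command.
    RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

    def apply_rules(word, rules=RULES):
        # Convention: of the rules that match, apply the one whose
        # suffix is longest; if none match, leave the word unchanged.
        best = None
        for suffix, repl in rules:
            if word.endswith(suffix) and (best is None or len(suffix) > len(best[0])):
                best = (suffix, repl)
        if best is None:
            return word
        suffix, repl = best
        return word[:-len(suffix)] + repl

    print(apply_rules("caresses"))     # 'caress'
    print(apply_rules("ponies"))       # 'poni'
    print(apply_rules("relational"))   # 'relate'   ('ational' beats 'tional')
    print(apply_rules("conditional"))  # 'condition'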
§ Other stemmers
§ Other stemmers exist, e.g., Lovins stemmer
§ http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
§ Single-pass, longest suffix removal (about 250 rules)
§ Full morphological analysis – at most modest benefits for retrieval
§ Do stemming and other normalizations help?
§ English: very mixed results. Helps recall but harms precision
§ operative (dentistry) ⇒ oper
§ operational (research) ⇒ oper
§ operating (systems) ⇒ oper
§ Definitely useful for Spanish, German, Finnish, …
§ 30% performance gains for Finnish!
§ Recall basic merge
§ Walk through the two postings simultaneously, in time linear in the total number of postings entries
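§ A minimal sketch of that linear merge over two sorted docID lists (plain Python lists here; real postings use more compact structures and, next, skip pointers):

    def intersect(p1, p2):
        # Walk both sorted postings lists simultaneously:
        # O(len(p1) + len(p2)) comparisons.
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    print(intersect([1, 2, 4, 11, 31, 45, 173], [2, 31, 54, 101]))  # [2, 31]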