Minggu, 23 Februari 2014

Materi Bahasa Inggris informatika



§   CS276: Information Retrieval and Web Search
§   Pandu Nayak and Prabhakar Raghavan
§   Term Vocabulary & Postings lists
§   (Tokenisasi)
§   Pertemuan sebelumnya:
§  Struktur dari Inverted Indeks:
§  Dictionary (Vocabulary) & Inverted List (Postings)


§  Vocabulary urut berdasarkan term (kata)
§  Untuk memproses Boolean Query:
§  Melakukan interseksi (merging) secara linear

§   Topik Pada Pertemuan Ini
Tahapan dalam Membangun Indeks
§  Preprocessing untuk membentuk vocabulary
§  Documen
§  Tokenisasi (tokenization)
§  Kata (terms) apa saja yang dimasukkan dalam indeks
§  Inverted List (Postings)
§  Cara merge secara lebih cepat (faster merge) dengan cara skip lists
§  Query dalam bentuk kaliman (phrase)
§   Diagram Proses Indexing
§   Parsing Dokumen
§  Perhatikan terlebih dahulu format dokumen
§  pdf/word/excel/html ?
§  Ditulis dalam bahasa apa?
§  Format character set yang digunakan
§  Complications: Format/Language
§   Dokumen yang akan diindeks dapat berupa dokumen yang ditulis dalam beberapa bahasa
§  Sebuah indeks dapat mengandung kata dari beberapa bahasa
§  Karena sebuah dokumen dapat ditulis dalam beberapa bahasa
§  Contoh: Email dalam bahasa Inggris tetapi attacment dari email adalah dokumen yang ditulis dalam bahasa Jerman
§   Apakah unit dari sebuah dokumen?
§  Sebuah file?
§  Sebuah email? 
§  Sebuah email dengan 5 attachments?
§  Sekumpulan files (PPT atau halaman HTML)?
§   TOKENS & TERMS (KATA)
§   Tokenisasi (Tokenization)
§   Input: Friends, Romans, Countrymen
§   Output: Tokens
§  Friends
§  Romans
§  Countrymen
§   Jadi token adalah sederetan karakter (a sequence of characters) dalam dokumen
§   Setiap token menjadi kandidat dari elemen dalam indeks, tentunya setelah preprocessing
§  Beberapa isu dalam tokenisasi:
§  Finlands capital ®
     Finland? Finlands? Finlands?
§  Hewlett-Packard Hewlett dan Packardsebagai dua token atau satu?
§  state-of-the-art: break up hyphenated sequence
§  co-education
§  lowercase, lower-case, lower case?
§  San Francisco: satu token atau dua? 
§  Bagaimana cara memutuskan bahwa SF adalah satu token?
§   Angka (Numbers)
§   3/12/91                          Mar. 12, 1991                                                12/3/91
§   No. B-52
§   Kode: 324a3df234cb23e
§   Telepon: (0651) 234-2333
§  Biasanya angka memiliki space diantaranya
§  Sistem IR yang lama tidak mengindeks angka
§  Tapi angka itu penting. Coba bayangkan bila ingin mencari baris dari error kode program melalui Sistem IR atau mencari nomor tertentu
§  Salah satu solusi adalah menggunakan mekanisme n-grams
§   Tokenisasi: Isu dalam bahasa
§   French
§  L'ensemble ® satu token atau dua?
§  L ? L ? Le ?
§  Want lensemble to match with un ensemble
§  Sampai tahun 2003, tidak berhasil bila dicari via Google
§  Internationalization!

§   German noun compounds are not segmented
§  Lebensversicherungsgesellschaftsangestellter
§  life insurance company employee
§  German retrieval systems benefit greatly from a compound splitter module
§  Can give a 15% performance boost for German
§   Chinese and Japanese:
§  莎拉波娃在居住在美国南部的佛里达。
§  Not always guaranteed a unique tokenization
§   Further complicated in Japanese: Dates/amounts in multiple formats
§   Tulisan Arab ditulis dari kanan ke kiri tetapi untuk  angka dibaca dari kiri ke kanan
§   Words are separated, but letter forms within a word form complex ligatures

§                                                                                  start
§   Algeria achieved its independence in 1962 after 132 years of French occupation.

§   Stop words
§   Menggunakan stop list, kata-kata yang sering muncul (tetapi kurang penting) dapat dikeluarkan dari indeks:
§  Secara semantic mereka tidak penting: the, a, and, to, be
§  Jumlahnya cukup banyak: ~30% dari semua kata dalam corpus
§   Trend: stopword tidak diikutkan:
§  Hemat indeks dan dapat memperkecil ukuran indeks walaupun dikompres
§  Query optimisasi menjadi lebih baik
§  Tapi perlu juga memperhatikan Query sbb:
§  Judul film: King of Denmark
§  Judul Lagu: Let it be, To be or not to be
§  Relational query: flights to London
§   Normalisasi Kata (terms)
§   Kata harus dinormalisasidalam in indexed text as well as query words into the same form
§  We want to match U.S.A. and USA
§   Result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary
§   We most commonly implicitly define equivalence classes of terms by, e.g.,
§  deleting periods to form a term
§  U.S.A., USA  è  USA
§  deleting hyphens to form a term
§  anti-discriminatory, antidiscriminatory  è  antidiscriminatory
§   Normalization: other languages
§   Accents: e.g., French résumé vs. resume.
§   Umlauts: e.g., German: Tuebingen vs. Tübingen
§  Should be equivalent
§   Most important criterion:
§  How are your users like to write their queries for these words?

§   Even in languages that standardly have accents, users often may not type them
§  Often best to normalize to a de-accented term
§  Tuebingen, Tübingen, Tubingen è Tubingen
§   Normalization: other languages
§   Normalization of things like date forms
§  730 vs. 7/30
§  Japanese use of kana vs. Chinese characters

§   Tokenization and normalization may depend on the language and so is intertwined with language detection

§   Crucial: Need to normalize indexed text as well as query terms into the same form
§   Case folding
§   Reduce all letters to lower case
§  exception: upper case in mid-sentence?
§  e.g., General Motors
§  Fed vs. fed
§  SAIL vs. sail
§  Often best to lower case everything, since users will use lowercase regardless of correct capitalization

§   Google example:
§  Query C.A.T. 
§  #1 result was for cat (well, Lolcats) not Caterpillar Inc.
§   Normalization to terms

§   An alternative to equivalence classing is to do asymmetric expansion
§   An example of where this may be useful
§  Enter: window                 Search: window, windows
§  Enter: windows   Search: Windows, windows, window
§  Enter: Windows   Search: Windows
§   Potentially more powerful, but less efficient
§   Thesauri and soundex
§   Do we handle synonyms and homonyms?
§  E.g., by hand-constructed equivalence classes
§  car = automobile  color = colour
§  We can rewrite to form equivalence-class terms
§  When the document contains automobile, index it under car-automobile (and vice-versa)
§  Or we can expand a query
§  When the query contains automobile, look under car as well
§   What about spelling mistakes?
§  One approach is soundex, which forms equivalence classes of words based on phonetic heuristics
§   More in lectures 3 and 9
§   Lemmatization
§   Reduce inflectional/variant forms to base form
§   E.g.,
§  am, are, is ® be
§  car, cars, car's, cars' ® car
§   the boy's cars are different colors ® the boy car be different color
§   Lemmatization implies doing proper reduction to dictionary headword form
§   Stemming
§   Reduce terms to their roots before indexing
§   Stemming suggest crude affix chopping
§  language dependent
§  e.g., automate(s), automatic, automation all reduced to automat.
§   Porters algorithm
§   Commonest algorithm for stemming English
§  Results suggest its at least as good as other stemming options
§   Conventions + 5 phases of reductions
§  phases applied sequentially
§  each phase consists of a set of commands
§  sample convention: Of the rules in a compound command, select the one that applies to the longest suffix.

§   Typical rules in Porter
§   sses ® ss
§   ies ® i
§   ational ® ate
§   tional ® tion

§   Rules sensitive to the measure of words
§      (m>1) EMENT
§  replacement replac
§  cement   cement

§   Other stemmers
§   Other stemmers exist, e.g., Lovins stemmer
§   http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
§  Single-pass, longest suffix removal (about 250 rules)

§   Full morphological analysis at most modest benefits for retrieval

§   Do stemming and other normalizations help?
§  English: very mixed results. Helps recall but harms precision
§  operative (dentistry) oper
§  operational (research) oper
§  operating (systems) oper
§  Definitely useful for Spanish, German, Finnish,
§  30% performance gains for Finnish!
§   Recall basic merge
§   Walk through the two postings simultaneously, in time linear in the total number of postings entries


0 komentar:

Posting Komentar

Posting Kami