User:OrenBochman/BetterSearch

Brainstorm Some Search Problems

Problem: Lucene search indexes the MediaWiki source text (wikitext), not the rendered HTML output.

  1. (Also) index the rendered HTML output?

Problem: The rendered HTML also contains CSS, markup, scripts, and comments.

  1. Solution:
    Either index these too, or run a filter to remove them. Some strategies are:
    1. Discard all markup.
      1. A markup_filter/tokenizer could be used to discard markup.
      2. The Apache Tika project (started as a Lucene subproject) can do this.
      3. Other ready-made solutions.
    2. Keep all markup.
      1. Write a markup-analyzer that would be used to compress the page to reduce storage requirements
        (interesting if one also wants to compress the output for integration into a DB or cache).
    3. Selective processing.
      1. A table_template_map extension could be used in a strategy to identify structured information for deeper indexing.
      2. This is the most promising strategy: it can detect and filter out unapproved markup (JavaScript, CSS, broken XHTML).
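
The "discard all markup" strategy can be illustrated with a minimal sketch. It uses only Python's standard `html.parser` module rather than Lucene or Tika, and the names `VisibleTextExtractor` and `strip_markup` are hypothetical, chosen for this example. It drops tags, comments, and the contents of script and style elements, keeping only the text a tokenizer would want to see.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes while skipping <script>/<style> bodies and comments."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments found so far

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach this method; they go to handle_comment,
        # which is a no-op by default.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join fragments with spaces, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_markup(html):
    """Return the visible text of an HTML string, with all markup discarded."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Joining fragments with a space keeps adjacent block elements from running together, at the cost of splitting a word that is only partly wrapped in an inline tag. A selective-processing filter would replace the fixed `SKIPPED` set with a list of approved elements.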

Problem: Indexing offline and online

  1. Real-time "only": slowly build the index in the background.
  2. Offline "only": use a dedicated machine/cloud to dump and index offline.
  3. Dual: each time the linguistic component becomes significantly better (or there is a bug fix) it would be desirable to upgrade search. How this would be done depends largely on the architecture of the analyzer. One possible approach would be:
    1. Production of linguistic/entity data or a new software milestone.
    2. Offline analysis from a dump (XML or HTML).
    3. Online processing of updates, newest to oldest, with a Poisson wait-time prediction model.
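
The notes do not say how the Poisson model would be applied, so the following is only one possible reading. It assumes that edits to a page arrive as a Poisson process whose rate is estimated from the page's edit history, processes pages newest to oldest, and defers pages that are likely to be edited again before the re-index window closes. All function names, the tuple layout, and the 0.5 threshold are assumptions made for this sketch.

```python
import math


def edit_rate(edit_count, observed_days):
    """Maximum-likelihood Poisson rate, in edits per day."""
    if observed_days <= 0:
        raise ValueError("observed_days must be positive")
    return edit_count / observed_days


def expected_wait_days(rate):
    """Mean waiting time until the next edit (infinite if the page is never edited)."""
    return math.inf if rate <= 0 else 1.0 / rate


def prob_edit_within(rate, days):
    """Probability of at least one edit within `days` under a Poisson process."""
    return 1.0 - math.exp(-rate * days)


def reindex_order(pages, window_days, threshold=0.5):
    """Split pages into (ready, deferred), each ordered newest to oldest.

    `pages` is a list of (title, last_edit_timestamp, edit_count, observed_days).
    A page is deferred when it will probably be edited again within the window,
    so re-analysing it now would likely be wasted work (an assumed policy).
    """
    ready, deferred = [], []
    for title, last_edit, edits, days in sorted(pages, key=lambda p: p[1], reverse=True):
        likely_stale = prob_edit_within(edit_rate(edits, days), window_days) >= threshold
        (deferred if likely_stale else ready).append(title)
    return ready, deferred
```

For example, a page edited 300 times in 100 days has a rate of 3 edits per day, so it is almost certain to change within a one-day window and would be deferred, while a page edited once in the same period would be processed immediately.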

Problem: Lucene's best analyzers are language-specific

  1. An N-gram analyzer is language-independent.
  2. A new multilingual analyzer with a language detector can be produced by
  3. extracting features from the query and checking them against a model prepared offline.
  4. The model would contain lexical features such as:
    1. alphabet
    2. bi/trigram distribution
    3. stop lists; collections of common word/POS/language sets (or lemma/language)
    4. normalized frequency statistics based on sampling full text from different languages.
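
As a rough illustration of the language detector described above, the sketch below uses only one of the listed features, the character trigram distribution, and compares a query against per-language profiles with cosine similarity. The function names and the choice of cosine similarity are assumptions for this example; a real model would be trained offline on far larger samples and would combine the other features (alphabet, stop lists, frequency statistics).

```python
import math
from collections import Counter


def trigram_profile(text):
    """Normalized character-trigram frequencies of `text`."""
    # Lower-case, collapse whitespace, and pad so word boundaries count as features.
    padded = f" {' '.join(text.lower().split())} "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(value * q.get(gram, 0.0) for gram, value in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def detect_language(query, models):
    """Return the language whose offline profile is closest to the query."""
    profile = trigram_profile(query)
    return max(models, key=lambda lang: cosine(profile, models[lang]))


# Offline step: build one profile per language from sample text (toy samples here).
models = {
    "en": trigram_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "de": trigram_profile("der schnelle braune fuchs springt über den faulen hund und die katze"),
}
```

At query time, `detect_language("the dog and the fox", models)` picks the English profile, and the selected language would then decide which language-specific analyzer handles the query.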

Problem: Search is not aware of morphological language variation

  1. In languages with rich morphology (e.g. Hebrew, Arabic) this will reduce the effectiveness of search.
  2. Index Wiktionary so as to produce data for a "lemma analyzer".
    1. dumb lemma (bag with a representative)
    2. smart lemma (list ordered by frequency)
    3. quantum lemma (organized by morphological state and frequency)
  3. Lemma-based indexing.
  4. Run a semantic disambiguation algorithm (tag) to disambiguate.
  • Other benefits:
  1. Lemma-based compression (arithmetic coding based on smart lemma).
    1. Indexing all lemmas.
  2. Smart resolution of disambiguation pages.
  3. An algorithm to translate English to Simple English.
  4. Excellent language detection for search.
  • Metrics:
  1. Extract the amount of information contributed by a user
    1. since inception.
    2. in the final version.
Plan

Resources

Developer/Admin Information

Search Options

highlights:

Notes

A quick review of the above is summarized as follows:

MediaWiki does not appear to have native search capabilities. It can be searched through external components (which index the content and then search it) by way of three extensions:

  1. Sphinx Search - for small sites (updated 2010)
  2. Lucene Search - Lucene search for large sites
  3. EzMwLucene - Easy Lucene search - an unadapted package from

MWSearch does not perform searches; rather, it provides integration with Lucene-search.


Potential Contact People

Commit-capable developers, irc:#mediawiki

Screened

Unscreened

Misc