Brainstorm Some Search Problems
Problem: Lucene search processes the wikitext source and not the rendered HTML output.
- (Also) index the rendered HTML output?
- Solution:
  Either index the markup too, or run a filter to remove it. Some strategies are:
- Discard all markup.
  - A markup filter/tokenizer could be used to discard markup (see the sketch after this list).
  - The Apache Tika project (originally a Lucene subproject) can do this.
  - Other ready-made solutions.
- Keep all markup.
  - Write a markup-aware analyzer that could also compress the page to reduce storage requirements
    (interesting if one also wants to compress output for integration into the DB or cache).
- Selective processing.
  - A table_template_map extension could be used in a strategy to identify structured information for deeper indexing.
  - This is the most promising option: it can detect and filter out unapproved markup (JavaScript, CSS, broken XHTML).
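As a minimal sketch of the "discard all markup" strategy, the following custom Analyzer strips wikitext markup with CharFilters before tokenization. It assumes a recent Lucene API; the class name is illustrative and the regular expressions are deliberately naive (they do not handle nested templates or tables).

```java
import java.io.Reader;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.pattern.PatternReplaceCharFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

/** Strips common wikitext markup with CharFilters so that markup
 *  tokens such as {{...}} and [[...]] never reach the index. */
public class WikiMarkupStrippingAnalyzer extends Analyzer {

  // Deliberately naive patterns: non-nested templates, link brackets,
  // and the quote runs used for bold/italic.
  private static final Pattern TEMPLATES = Pattern.compile("\\{\\{[^{}]*\\}\\}");
  private static final Pattern LINK_QUOTES = Pattern.compile("\\[\\[|\\]\\]|'{2,}");

  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    // CharFilters rewrite the character stream before the tokenizer runs.
    reader = new PatternReplaceCharFilter(TEMPLATES, " ", reader);
    return new PatternReplaceCharFilter(LINK_QUOTES, " ", reader);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    StandardTokenizer tokenizer = new StandardTokenizer();
    TokenStream stream = new LowerCaseFilter(tokenizer);
    return new TokenStreamComponents(tokenizer, stream);
  }
}
```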
Problem: Indexing offline and online
- real-time "only" - slowly build index in background
- offline "only" - used dedicated machine/cloud to dump and index offline.
- dua - each time the lingustic component becomes significantly better (or there is a bug fix) it would be desire able to upgrade search. How this would be done would depend much on the architecture of the analyzer. One possible aproach would be
- production of a linguistic/entity data or a new software milestone.
- offline analysis from dump (xml,or html)
- online processing newest to oldest updates with a (Poisson wait time prediction model)
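The note leaves the Poisson model unspecified; one plausible reading, sketched below, treats each page's edit history as a Poisson process and ranks pages for re-indexing by the probability that they will stay unedited over the re-index horizon, so indexing work is unlikely to be wasted. The class name and scoring rule are assumptions, not part of the original plan.

```java
/** Sketch: model page edits as a Poisson process with rate lambda
 *  (edits per day) and score pages by how likely they are to stay
 *  stable; stable pages are the safest to (re)index first. */
public final class PoissonReindexScorer {

  /** Estimated edits per day for a page (lambda of the Poisson process). */
  public static double editRatePerDay(int editsInWindow, double windowDays) {
    return editsInWindow / windowDays;
  }

  /** P(no edit within the next horizonDays) = exp(-lambda * horizon).
   *  A higher score means the freshly built index entry should last. */
  public static double stabilityScore(double lambda, double horizonDays) {
    return Math.exp(-lambda * horizonDays);
  }

  public static void main(String[] args) {
    double lambda = editRatePerDay(12, 30.0);  // 12 edits in the last 30 days
    System.out.printf("P(stable for 7 days) = %.3f%n",
        stabilityScore(lambda, 7.0));
  }
}
```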
Problem: Lucene's best analyzers are language-specific.
- The N-gram analyzer is language-independent.
- A new multilingual analyzer with a language detector could be produced by
  - extracting features from the query and checking them against a model prepared offline (see the sketch after this list).
  - the model would contain lexical features such as:
    - alphabet
    - bigram/trigram distribution
    - stop lists: collections of common word/POS/language sets (or lemma/language pairs)
    - normalized frequency statistics based on sampling full text from different languages.
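A minimal sketch of the trigram-distribution feature (the class name, scoring rule, and the toy frequencies in main are invented for illustration; real profiles would be trained offline from per-language full-text samples, as the note suggests):

```java
import java.util.HashMap;
import java.util.Map;

/** Detects the language of a short text by comparing its character
 *  trigrams against per-language frequency profiles. */
public final class TrigramLanguageDetector {

  private final Map<String, Map<String, Double>> profiles = new HashMap<>();

  /** Register a language profile: trigram -> relative frequency. */
  public void addProfile(String lang, Map<String, Double> trigramFreqs) {
    profiles.put(lang, trigramFreqs);
  }

  /** Score = sum of log-frequencies of the text's trigrams, with a
   *  small floor for unseen trigrams; the highest-scoring language wins. */
  public String detect(String text) {
    String padded = "  " + text.toLowerCase() + "  ";
    String best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (Map.Entry<String, Map<String, Double>> e : profiles.entrySet()) {
      double score = 0;
      for (int i = 0; i + 3 <= padded.length(); i++) {
        double f = e.getValue().getOrDefault(padded.substring(i, i + 3), 1e-8);
        score += Math.log(f);
      }
      if (score > bestScore) { bestScore = score; best = e.getKey(); }
    }
    return best;
  }

  public static void main(String[] args) {
    TrigramLanguageDetector d = new TrigramLanguageDetector();
    d.addProfile("en", Map.of(" th", 0.015, "the", 0.014, "he ", 0.012));
    d.addProfile("de", Map.of("er ", 0.012, "en ", 0.014, "der", 0.010));
    System.out.println(d.detect("the quick brown fox"));  // prints: en
  }
}
```

The other features from the list above (alphabet checks, stop lists) could be folded in as additional terms in the same log-score combination.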
Problem: Search is not aware of morphological language variation.
- in languages with rich morphology (e.g. Hebrew, Arabic) this reduces the effectiveness of search.
- index Wiktionary so as to produce data for a "lemma analyzer":
  - dumb lemma (a bag of forms with one representative)
  - smart lemma (a list ordered by frequency)
  - quantum lemma (organized by morphological state and frequency)
- lemma-based indexing (see the TokenFilter sketch after this list).
  - run a semantic disambiguation algorithm (tagging) to disambiguate.
- lemma-based compression (arithmetic coding based on the smart lemma).
- indexing all lemmas.
- smart resolution of disambiguation pages.
- an algorithm to translate English into Simple English.
- excellent language detection for search.
- extract the amount of information contributed by a user:
  - since inception.
  - in the final version.
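As a sketch of the "dumb lemma" flavor of lemma-based indexing, the following Lucene TokenFilter replaces each surface form with its single representative lemma from a lookup table (which would, per the note, be extracted from Wiktionary). The class name and table format are assumptions.

```java
import java.io.IOException;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/** "Dumb lemma" filter: every known surface form is replaced by one
 *  representative lemma, so all inflections of a word collapse to a
 *  single index term. Unknown forms pass through unchanged. */
public final class DumbLemmaFilter extends TokenFilter {

  private final Map<String, String> formToLemma;  // e.g. "ran" -> "run"
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public DumbLemmaFilter(TokenStream input, Map<String, String> formToLemma) {
    super(input);
    this.formToLemma = formToLemma;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) return false;
    String lemma = formToLemma.get(termAtt.toString());
    if (lemma != null) {
      termAtt.setEmpty().append(lemma);  // index the lemma, not the form
    }
    return true;
  }
}
```

The smart-lemma variant would instead emit the frequency-ordered list (e.g. as stacked tokens at the same position), at the cost of a larger index.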
Plan
Resources
Search Options
Highlights
Notes
A quick review of the above can be summarized as follows:
MediaWiki does not appear to have native search capabilities.
It can be searched via external components (which index the content and then serve searches) through three extensions:
- Sphinx Search - for small sites (updated 2010)
- Lucene Search - Lucene-based search for large sites
- EzMwLucene - Easy Lucene search - an unadapted package from
MWSearch does not perform searches itself; rather, it provides integration with Lucene-search.
Commit-capable developers: irc:#mediawiki
Screened
Unscreened
Misc