1. IndexWriter
IndexWriter is the central component of the indexing process. This class creates
a new index and adds documents to an existing index. You can think of IndexWriter
as an object that gives you write access to the index but doesn’t let you read
or search it.
Variables:
Directory directory - where the index is stored
Analyzer analyzer - how text is analyzed
Methods:
addDocument(Document, Analyzer) - adds a document to the index, analyzing it with the supplied Analyzer
addIndexes(Directory[] dirs) - merges other indexes into this one
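A minimal write-path sketch, assuming the Lucene 1.4-era API (the Field.Text factory and the IndexWriter(Directory, Analyzer, boolean create) constructor); the index path is hypothetical:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class IndexerSketch {
        public static void main(String[] args) throws Exception {
            // true = create a new index; false = open an existing one
            Directory directory = FSDirectory.getDirectory("/tmp/index", true);
            IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), true);

            Document doc = new Document();
            doc.add(Field.Text("contents", "IndexWriter gives you write access to the index"));
            writer.addDocument(doc);   // analyzed with the writer's Analyzer

            writer.optimize();         // optional: merge segments for faster searching
            writer.close();            // flush buffers and release the write lock
        }
    }

Note that nothing here reads or searches the index; that is the job of IndexReader and IndexSearcher.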
2. Directory
The Directory class represents the location of a Lucene index. It is an abstract class
that lets its subclasses store the index however they see fit; the two general-purpose
implementations, FSDirectory and RAMDirectory, ship with the Lucene core.
Five concrete implementations of this abstract class (core and contrib):
CompoundFileReader - for accessing a compound stream.
DbDirectory - a Berkeley DB 4.3-based implementation
FSDirectory - Straightforward implementation of Directory as a directory of files
JEDirectory - Port of Andi Vajda's DbDirectory to the Java Edition of Berkeley DB
RAMDirectory - A memory-resident Directory implementation.
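A short sketch contrasting the two general-purpose implementations, assuming the classic FSDirectory.getDirectory(path, create) factory; the path is hypothetical:

    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    public class DirectoryChoices {
        public static void main(String[] args) throws Exception {
            // On-disk index: persists across JVM restarts
            Directory fsDir = FSDirectory.getDirectory("/tmp/index", true);

            // Memory-resident index: fast but volatile; handy for unit tests
            Directory ramDir = new RAMDirectory();

            fsDir.close();
            ramDir.close();
        }
    }

Because every implementation honors the same abstract contract, IndexWriter and the search classes work identically against either one.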
3. Analyzer
The abstract class Analyzer is in charge of extracting tokens out of text to be indexed
and eliminating the rest. Analyzers are an important part of Lucene and can be used for
much more than simple input filtering.Lucene comes with several implementations of it.
BrazilianAnalyzer - Brazilian Portuguese (br)
ChineseAnalyzer - Chinese (cn)
CJKAnalyzer - Chinese, Japanese, and Korean (cjk)
CzechAnalyzer - Czech (cz)
DutchAnalyzer - Dutch (nl)
FrenchAnalyzer - French (fr)
GermanAnalyzer - German (de)
GreekAnalyzer - Greek (el)
RussianAnalyzer - Russian (ru)
ThaiAnalyzer - Thai (th)
KeywordAnalyzer - "Tokenizes" the entire stream as a single token.
PatternAnalyzer - tokenizes the input with a java.util.regex pattern
PerFieldAnalyzerWrapper - used to facilitate scenarios where different fields require
different analysis techniques.
SimpleAnalyzer - filters LetterTokenizer with LowerCaseFilter.
SnowballAnalyzer - Filters StandardTokenizer with StandardFilter -> LowerCaseFilter -> StopFilter -> SnowballFilter
StandardAnalyzer - Filters StandardTokenizer with StandardFilter, LowerCaseFilter, and StopFilter, using a list of English stop words
StopAnalyzer - Filters LetterTokenizer with LowerCaseFilter and StopFilter
WhitespaceAnalyzer - An Analyzer that uses WhitespaceTokenizer
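A small tokenization sketch, assuming the 1.4-era TokenStream API (Token.termText() and the null-terminated next() loop); the sample text is arbitrary:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class AnalyzerDemo {
        public static void main(String[] args) throws Exception {
            String text = "The quick brown fox";
            print(new SimpleAnalyzer(), text);    // [the] [quick] [brown] [fox]
            print(new StandardAnalyzer(), text);  // [quick] [brown] [fox] -- "the" is a stop word
        }

        static void print(Analyzer analyzer, String text) throws Exception {
            TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
            Token token;
            while ((token = stream.next()) != null) {
                System.out.print("[" + token.termText() + "] ");
            }
            System.out.println();
        }
    }

To mix techniques within one document, wrap a default analyzer in PerFieldAnalyzerWrapper and register per-field exceptions with addAnalyzer(fieldName, analyzer).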
4. Document
A Document is the unit of indexing and search; it represents a collection of fields.
The fields of a document carry the document’s content or metadata associated with it:
metadata such as the author, title, subject, and date modified is indexed and stored
separately as fields of the document.
Variables:
List fields;
float boost;
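A brief sketch of a Document as a field container, assuming the 1.4-era Field factory methods; the values are made up:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class DocumentSketch {
        public static void main(String[] args) {
            Document doc = new Document();
            doc.add(Field.Text("title", "Lucene in Action"));   // appended to the `fields' list
            doc.add(Field.Keyword("author", "E. Hatcher"));     // hypothetical author value
            doc.setBoost(1.5f);                                 // the `boost' variable above
            System.out.println(doc.get("title"));               // reads back a stored field
        }
    }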
5. Field
Each field corresponds to a piece of data that is either queried against or retrieved
from the index during search.
Lucene offers four different types of fields, illustrated in the sketch at the end of this section:
Keyword — Isn’t analyzed, but is indexed and stored in the index verbatim. This type
is suitable for fields whose original value should be preserved in its entirety, such
as URLs, file system paths, dates, personal names, Social Security numbers, telephone
numbers, and so on.
UnIndexed — Is neither analyzed nor indexed, but its value is stored in the index as
is. This type is suitable for fields that you need to display with search results, but
whose values you’ll never search directly.
UnStored — The opposite of UnIndexed. This field type is analyzed and indexed but isn’t
stored in the index. It’s suitable for indexing a large amount of text that doesn’t
need to be retrieved in its original form, such as bodies of web pages, or any other type
of text document.
Text — Is analyzed and indexed, which implies that fields of this type can be
searched against; but be cautious about the field size. If the data indexed is a String,
it’s also stored; if the data (as in our Indexer example) comes from a Reader, it isn’t
stored.
Finally, UnStored and Text fields can be used to create term vectors (an advanced topic,
covered in section 5.7).
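A sketch touching all four field types via the 1.4-era static factories; the field names, values, and file path are hypothetical:

    import java.io.FileReader;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class FieldTypes {
        public static void main(String[] args) throws Exception {
            Document doc = new Document();
            // Keyword: indexed and stored verbatim, not analyzed
            doc.add(Field.Keyword("url", "http://example.com/page.html"));
            // UnIndexed: stored only; shown with results, never searched
            doc.add(Field.UnIndexed("thumbnail", "/images/thumb.png"));
            // UnStored: analyzed and indexed, but not stored
            doc.add(Field.UnStored("body", "a large amount of text ..."));
            // Text(String): analyzed, indexed, and stored
            doc.add(Field.Text("title", "Lucene in Action"));
            // Text(Reader): analyzed and indexed, but NOT stored
            doc.add(Field.Text("contents", new FileReader("/tmp/doc.txt")));
        }
    }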