Class AlphanumericCjkAnalyzer
java.lang.Object
org.apache.lucene.analysis.Analyzer
org.apache.lucene.analysis.StopwordAnalyzerBase
com.apple.foundationdb.record.lucene.AlphanumericCjkAnalyzer
All Implemented Interfaces:
Closeable, AutoCloseable
public class AlphanumericCjkAnalyzer
extends org.apache.lucene.analysis.StopwordAnalyzerBase
A CJK analyzer that applies a minimum and maximum token length to non-CJK tokens. This is useful
for text that mixes CJK and non-CJK tokens: non-CJK tokens must adhere to the length limits,
while CJK tokens are exempt from them.
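The length rule described above can be illustrated without Lucene. The following is a minimal sketch using only the JDK, not the analyzer's actual implementation: the class `MixedLengthFilterSketch` and its methods are hypothetical names, the choice of scripts treated as CJK is an assumption, and over-long non-CJK tokens are assumed to be dropped rather than split.

```java
import java.util.ArrayList;
import java.util.List;

final class MixedLengthFilterSketch {
    // Assumption of this sketch: Han, Hiragana, Katakana and Hangul count as CJK.
    static boolean isCjk(int codePoint) {
        Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
        return script == Character.UnicodeScript.HAN
                || script == Character.UnicodeScript.HIRAGANA
                || script == Character.UnicodeScript.KATAKANA
                || script == Character.UnicodeScript.HANGUL;
    }

    // A token is exempt from the length limits if it contains any CJK character.
    static boolean containsCjk(String token) {
        return token.codePoints().anyMatch(MixedLengthFilterSketch::isCjk);
    }

    // Keeps every CJK token, and keeps a non-CJK token only if min <= length <= max.
    static List<String> filter(List<String> tokens, int min, int max) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            int length = token.codePointCount(0, token.length());
            if (containsCjk(token) || (length >= min && length <= max)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "\u4eba" is a single Han character; "\u6771\u4eac" is a two-character Han token.
        List<String> tokens =
                List.of("a", "lucene", "\u4eba", "\u6771\u4eac", "averyveryverylongtoken");
        System.out.println(filter(tokens, 3, 10));
    }
}
```

With limits of 3 and 10, the one-letter and 22-letter Latin tokens are removed, while both Han tokens survive regardless of their length.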
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.analysis.Analyzer
org.apache.lucene.analysis.Analyzer.ReuseStrategy, org.apache.lucene.analysis.Analyzer.TokenStreamComponents
-
Field Summary
Fields
DEFAULT_MIN_TOKEN_LENGTH, UNIQUE_IDENTIFIER
Fields inherited from class org.apache.lucene.analysis.StopwordAnalyzerBase
stopwords
Fields inherited from class org.apache.lucene.analysis.Analyzer
GLOBAL_REUSE_STRATEGY, PER_FIELD_REUSE_STRATEGY
-
Constructor Summary
Constructors
AlphanumericCjkAnalyzer(org.apache.lucene.analysis.CharArraySet stopWords)
AlphanumericCjkAnalyzer(org.apache.lucene.analysis.CharArraySet stopWords, boolean breakLongTokens)
AlphanumericCjkAnalyzer(org.apache.lucene.analysis.CharArraySet stopWords, int minTokenLength, int maxTokenLength, boolean breakLongTokens, String synonymName) - Create a new AlphanumericCjkAnalyzer.
AlphanumericCjkAnalyzer(org.apache.lucene.analysis.CharArraySet stopWords, int minTokenLength, int maxTokenLength, String synonymName)
AlphanumericCjkAnalyzer(org.apache.lucene.analysis.CharArraySet stopWords, String synonymName)
-
Method Summary
Methods
protected org.apache.lucene.analysis.Analyzer.TokenStreamComponents createComponents(String fieldName)
int getMaxTokenLength()
int getMinTokenLength()
protected org.apache.lucene.analysis.TokenStream normalize(String fieldName, org.apache.lucene.analysis.TokenStream in)
void setMaxTokenLength(int length)
void setMinTokenLength(int length)
Methods inherited from class org.apache.lucene.analysis.StopwordAnalyzerBase
getStopwordSet, loadStopwordSet, loadStopwordSet, loadStopwordSet
Methods inherited from class org.apache.lucene.analysis.Analyzer
attributeFactory, close, getOffsetGap, getPositionIncrementGap, getReuseStrategy, getVersion, initReader, initReaderForNormalization, normalize, setVersion, tokenStream, tokenStream
-
Field Details
-
DEFAULT_MIN_TOKEN_LENGTH
public static final int DEFAULT_MIN_TOKEN_LENGTH
-
UNIQUE_IDENTIFIER
-
-
Constructor Details
-
AlphanumericCjkAnalyzer
public AlphanumericCjkAnalyzer(@Nonnull org.apache.lucene.analysis.CharArraySet stopWords)
-
AlphanumericCjkAnalyzer
public AlphanumericCjkAnalyzer(@Nonnull org.apache.lucene.analysis.CharArraySet stopWords, boolean breakLongTokens)
-
AlphanumericCjkAnalyzer
public AlphanumericCjkAnalyzer(@Nonnull org.apache.lucene.analysis.CharArraySet stopWords, @Nullable String synonymName)
-
AlphanumericCjkAnalyzer
public AlphanumericCjkAnalyzer(@Nonnull org.apache.lucene.analysis.CharArraySet stopWords, int minTokenLength, int maxTokenLength, @Nullable String synonymName)
-
AlphanumericCjkAnalyzer
public AlphanumericCjkAnalyzer(@Nonnull org.apache.lucene.analysis.CharArraySet stopWords, int minTokenLength, int maxTokenLength, boolean breakLongTokens, @Nullable String synonymName)
Create a new AlphanumericCjkAnalyzer. This has an additional parameter that most analyzers don't have, the minAlphanumericTokenLength. It is used to filter out tokens that are too short and consist only of alphanumeric characters, so CJK unigrams below that length are not filtered out. For example, with a min token length of 1 and a min alphanumeric token length of 3, the string "5人" is tokenized into a single token, "人", but the string "500人" becomes two tokens, "500" and "人".
Parameters:
stopWords - the stop words to exclude
minTokenLength - the minimum length of any token
maxTokenLength - the maximum token length
breakLongTokens - if true, long tokens are broken up
synonymName - the name of the synonym map to use, or null if no synonyms are to be used
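The "5人" versus "500人" example can be reproduced with a small JDK-only sketch. It is an illustration under stated assumptions, not the analyzer's tokenizer: the class `MinAlphanumericLengthSketch` and its methods are hypothetical, text is assumed to split into runs of letters or digits plus single Han characters (unigrams), and only the minimum alphanumeric length is modelled.

```java
import java.util.ArrayList;
import java.util.List;

final class MinAlphanumericLengthSketch {
    // Assumption of this sketch: only Han characters are emitted as CJK unigrams.
    static boolean isHan(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN;
    }

    // Emits the pending alphanumeric run if it is long enough, then clears it.
    static void flush(StringBuilder run, List<String> tokens, int minAlphanumericLength) {
        if (run.length() > 0 && run.codePointCount(0, run.length()) >= minAlphanumericLength) {
            tokens.add(run.toString());
        }
        run.setLength(0);
    }

    // Splits text into alphanumeric runs and Han unigrams; the minimum length
    // applies to the alphanumeric runs only, so Han unigrams are always kept.
    static List<String> tokenize(String text, int minAlphanumericLength) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint);
            if (isHan(codePoint)) {
                flush(run, tokens, minAlphanumericLength);
                tokens.add(new String(Character.toChars(codePoint)));
            } else if (Character.isLetterOrDigit(codePoint)) {
                run.appendCodePoint(codePoint);
            } else {
                flush(run, tokens, minAlphanumericLength); // whitespace or punctuation ends a run
            }
        }
        flush(run, tokens, minAlphanumericLength);
        return tokens;
    }

    public static void main(String[] args) {
        // "\u4eba" is the Han character used in the example above.
        System.out.println(tokenize("5\u4eba", 3));   // the run "5" is too short and is dropped
        System.out.println(tokenize("500\u4eba", 3)); // the run "500" meets the minimum and is kept
    }
}
```

Checking the Han case before the letter-or-digit case matters here, because Han characters are themselves letters and would otherwise be absorbed into the alphanumeric run.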
-
-
Method Details
-
setMaxTokenLength
public void setMaxTokenLength(int length)
-
getMaxTokenLength
public int getMaxTokenLength()
-
setMinTokenLength
public void setMinTokenLength(int length)
-
getMinTokenLength
public int getMinTokenLength()
-
createComponents
protected org.apache.lucene.analysis.Analyzer.TokenStreamComponents createComponents(String fieldName)
Specified by:
createComponents in class org.apache.lucene.analysis.Analyzer
-
normalize
protected org.apache.lucene.analysis.TokenStream normalize(String fieldName, org.apache.lucene.analysis.TokenStream in)
Overrides:
normalize in class org.apache.lucene.analysis.Analyzer
-