Class AlphanumericCjkAnalyzer

java.lang.Object
  org.apache.lucene.analysis.Analyzer
    org.apache.lucene.analysis.StopwordAnalyzerBase
      com.apple.foundationdb.record.lucene.AlphanumericCjkAnalyzer
All Implemented Interfaces:
Closeable, AutoCloseable

public class AlphanumericCjkAnalyzer extends org.apache.lucene.analysis.StopwordAnalyzerBase
A CJK Analyzer which applies a minimum and maximum token length to non-CJK tokens. This is useful when anticipating text which has a mixture of CJK and non-CJK tokens in it, as it will ensure that non-CJK tokens will adhere to length limitations, but will ignore CJK tokens during that process.
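As with any Lucene Analyzer, the behavior described above can be observed by consuming a TokenStream. The following is a minimal sketch, not taken from this library's documentation; the field name "text" and the empty stop-word set are illustrative choices.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import com.apple.foundationdb.record.lucene.AlphanumericCjkAnalyzer;

public class TokenDump {
    /** Collects the terms an analyzer emits for the given text. */
    static List<String> analyze(Analyzer analyzer, String text) throws IOException {
        List<String> terms = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream("text", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                      // required before incrementToken()
            while (ts.incrementToken()) {
                terms.add(term.toString());
            }
            ts.end();
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        // Empty stop-word set is illustrative, not a library default.
        try (Analyzer analyzer = new AlphanumericCjkAnalyzer(CharArraySet.EMPTY_SET)) {
            System.out.println(analyze(analyzer, "database 数据库 systems"));
        }
    }
}
```

The reset/incrementToken/end/close sequence is the standard Lucene consumer workflow and applies to any Analyzer, not just this one.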
  • Nested Class Summary

    Nested classes/interfaces inherited from class org.apache.lucene.analysis.Analyzer

    org.apache.lucene.analysis.Analyzer.ReuseStrategy, org.apache.lucene.analysis.Analyzer.TokenStreamComponents
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final int
     
    static final String
     

    Fields inherited from class org.apache.lucene.analysis.StopwordAnalyzerBase

    stopwords

    Fields inherited from class org.apache.lucene.analysis.Analyzer

    GLOBAL_REUSE_STRATEGY, PER_FIELD_REUSE_STRATEGY
  • Constructor Summary

    Constructors
    Constructor
    Description
    AlphanumericCjkAnalyzer(org.apache.lucene.analysis.CharArraySet stopWords)
     
    AlphanumericCjkAnalyzer(org.apache.lucene.analysis.CharArraySet stopWords, boolean breakLongTokens)
     
    AlphanumericCjkAnalyzer(org.apache.lucene.analysis.CharArraySet stopWords, int minTokenLength, int maxTokenLength, boolean breakLongTokens, String synonymName)
    Create a new AlphanumericCjkAnalyzer.
    AlphanumericCjkAnalyzer(org.apache.lucene.analysis.CharArraySet stopWords, int minTokenLength, int maxTokenLength, String synonymName)
     
    AlphanumericCjkAnalyzer(org.apache.lucene.analysis.CharArraySet stopWords, String synonymName)
     
  • Method Summary

    Modifier and Type
    Method
    Description
    protected org.apache.lucene.analysis.Analyzer.TokenStreamComponents
    createComponents(String fieldName)
     
    int
    getMaxTokenLength()
     
    int
    getMinTokenLength()
     
    protected org.apache.lucene.analysis.TokenStream
    normalize(String fieldName, org.apache.lucene.analysis.TokenStream in)
     
    void
    setMaxTokenLength(int length)
     
    void
    setMinTokenLength(int length)
     

    Methods inherited from class org.apache.lucene.analysis.StopwordAnalyzerBase

    getStopwordSet, loadStopwordSet, loadStopwordSet, loadStopwordSet

    Methods inherited from class org.apache.lucene.analysis.Analyzer

    attributeFactory, close, getOffsetGap, getPositionIncrementGap, getReuseStrategy, getVersion, initReader, initReaderForNormalization, normalize, setVersion, tokenStream, tokenStream

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

  • Constructor Details

    • AlphanumericCjkAnalyzer

      public AlphanumericCjkAnalyzer(@Nonnull org.apache.lucene.analysis.CharArraySet stopWords)
    • AlphanumericCjkAnalyzer

      public AlphanumericCjkAnalyzer(@Nonnull org.apache.lucene.analysis.CharArraySet stopWords, boolean breakLongTokens)
    • AlphanumericCjkAnalyzer

      public AlphanumericCjkAnalyzer(@Nonnull org.apache.lucene.analysis.CharArraySet stopWords, @Nullable String synonymName)
    • AlphanumericCjkAnalyzer

      public AlphanumericCjkAnalyzer(@Nonnull org.apache.lucene.analysis.CharArraySet stopWords, int minTokenLength, int maxTokenLength, @Nullable String synonymName)
    • AlphanumericCjkAnalyzer

      public AlphanumericCjkAnalyzer(@Nonnull org.apache.lucene.analysis.CharArraySet stopWords, int minTokenLength, int maxTokenLength, boolean breakLongTokens, @Nullable String synonymName)
      Create a new AlphanumericCjkAnalyzer. This has an additional parameter that most analyzers don't have, the minAlphanumericTokenLength. This is used to filter out tokens that are too short and consist only of alphanumeric characters, so CJK unigrams below that length are not filtered out. For example, with a min token length of 1 and a min alphanumeric token length of 3, the string "5人" would get tokenized into a single token, "人", but the string "500人" becomes two tokens, "500" and "人".
      Parameters:
      stopWords - the stop words to exclude
      minTokenLength - the minimum length of any token
      maxTokenLength - the maximum token length
      breakLongTokens - if true, tokens longer than maxTokenLength are broken up into multiple tokens
      synonymName - the name of the synonym map to use, or null if no synonyms are to be used
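The worked example above can be sketched with the five-argument constructor listed here. This is an illustrative sketch, assuming (as the description states) that the minimum length applies only to alphanumeric tokens while CJK unigrams are exempt; the concrete lengths and the empty stop-word set are not library defaults.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import com.apple.foundationdb.record.lucene.AlphanumericCjkAnalyzer;

public class MinLengthExample {
    static List<String> analyze(Analyzer analyzer, String text) throws IOException {
        List<String> terms = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream("text", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                terms.add(term.toString());
            }
            ts.end();
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        try (Analyzer analyzer = new AlphanumericCjkAnalyzer(
                CharArraySet.EMPTY_SET, // no stop words (illustrative)
                3,                      // minimum token length
                255,                    // maximum token length (illustrative)
                false,                  // do not break long tokens
                null)) {                // no synonym map
            // Per the description: "5" (length 1 < 3) is dropped, while the
            // CJK unigram "人" survives because CJK tokens are exempt from
            // the alphanumeric minimum length.
            System.out.println(analyze(analyzer, "5人"));
            System.out.println(analyze(analyzer, "500人"));
        }
    }
}
```

Per the class documentation, the first call should yield only "人" and the second should yield "500" and "人".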
  • Method Details

    • setMaxTokenLength

      public void setMaxTokenLength(int length)
    • getMaxTokenLength

      public int getMaxTokenLength()
    • setMinTokenLength

      public void setMinTokenLength(int length)
    • getMinTokenLength

      public int getMinTokenLength()
    • createComponents

      protected org.apache.lucene.analysis.Analyzer.TokenStreamComponents createComponents(String fieldName)
      Specified by:
      createComponents in class org.apache.lucene.analysis.Analyzer
    • normalize

      protected org.apache.lucene.analysis.TokenStream normalize(String fieldName, org.apache.lucene.analysis.TokenStream in)
      Overrides:
      normalize in class org.apache.lucene.analysis.Analyzer