Class AlphanumericCjkAnalyzer

java.lang.Object
  org.apache.lucene.analysis.Analyzer
    org.apache.lucene.analysis.StopwordAnalyzerBase
      com.apple.foundationdb.record.lucene.AlphanumericCjkAnalyzer
All Implemented Interfaces:
Closeable, AutoCloseable

public class AlphanumericCjkAnalyzer extends org.apache.lucene.analysis.StopwordAnalyzerBase
A CJK Analyzer which applies a minimum and maximum token length to non-CJK tokens. This is useful when anticipating text which has a mixture of CJK and non-CJK tokens in it, as it will ensure that non-CJK tokens will adhere to length limitations, but will ignore CJK tokens during that process.
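As with any Lucene Analyzer, the behavior described above can be observed by consuming a TokenStream. The following is a minimal sketch, not taken from this library's documentation; the field name "text" and the empty stop-word set are illustrative choices.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import com.apple.foundationdb.record.lucene.AlphanumericCjkAnalyzer;

public class TokenDump {
    /** Collects the terms an analyzer emits for the given text. */
    static List<String> analyze(Analyzer analyzer, String text) throws IOException {
        List<String> terms = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream("text", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                      // required before incrementToken()
            while (ts.incrementToken()) {
                terms.add(term.toString());
            }
            ts.end();
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        // Empty stop-word set is illustrative, not a library default.
        try (Analyzer analyzer = new AlphanumericCjkAnalyzer(CharArraySet.EMPTY_SET)) {
            System.out.println(analyze(analyzer, "database 数据库 systems"));
        }
    }
}
```

The reset/incrementToken/end/close sequence is the standard Lucene consumer workflow and applies to any Analyzer, not just this one.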
  • Nested Class Summary

    Nested classes/interfaces inherited from class org.apache.lucene.analysis.Analyzer

    org.apache.lucene.analysis.Analyzer.ReuseStrategy, org.apache.lucene.analysis.Analyzer.TokenStreamComponents
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final int
     
    static final String
     

    Fields inherited from class org.apache.lucene.analysis.StopwordAnalyzerBase

    stopwords

    Fields inherited from class org.apache.lucene.analysis.Analyzer

    GLOBAL_REUSE_STRATEGY, PER_FIELD_REUSE_STRATEGY
  • Constructor Summary

    Constructors
    Constructor
    Description
    AlphanumericCjkAnalyzer(org.apache.lucene.analysis.CharArraySet stopWords)
     
    AlphanumericCjkAnalyzer(org.apache.lucene.analysis.CharArraySet stopWords, boolean breakLongTokens)
     
    AlphanumericCjkAnalyzer(org.apache.lucene.analysis.CharArraySet stopWords, int minTokenLength, int maxTokenLength, boolean breakLongTokens, String synonymName)
    Create a new AlphanumericCjkAnalyzer.
    AlphanumericCjkAnalyzer(org.apache.lucene.analysis.CharArraySet stopWords, int minTokenLength, int maxTokenLength, String synonymName)
     
    AlphanumericCjkAnalyzer(org.apache.lucene.analysis.CharArraySet stopWords, String synonymName)
     
  • Method Summary

    Modifier and Type
    Method
    Description
    protected org.apache.lucene.analysis.Analyzer.TokenStreamComponents
    createComponents(String fieldName)
     
    int
    getMaxTokenLength()
     
    int
    getMinTokenLength()
     
    protected org.apache.lucene.analysis.TokenStream
    normalize(String fieldName, org.apache.lucene.analysis.TokenStream in)
     
    void
    setMaxTokenLength(int length)
     
    void
    setMinTokenLength(int length)
     

    Methods inherited from class org.apache.lucene.analysis.StopwordAnalyzerBase

    getStopwordSet, loadStopwordSet, loadStopwordSet, loadStopwordSet

    Methods inherited from class org.apache.lucene.analysis.Analyzer

    attributeFactory, close, getOffsetGap, getPositionIncrementGap, getReuseStrategy, getVersion, initReader, initReaderForNormalization, normalize, setVersion, tokenStream, tokenStream

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

  • Constructor Details

    • AlphanumericCjkAnalyzer

      public AlphanumericCjkAnalyzer(@Nonnull org.apache.lucene.analysis.CharArraySet stopWords)
    • AlphanumericCjkAnalyzer

      public AlphanumericCjkAnalyzer(@Nonnull org.apache.lucene.analysis.CharArraySet stopWords, boolean breakLongTokens)
    • AlphanumericCjkAnalyzer

      public AlphanumericCjkAnalyzer(@Nonnull org.apache.lucene.analysis.CharArraySet stopWords, @Nullable String synonymName)
    • AlphanumericCjkAnalyzer

      public AlphanumericCjkAnalyzer(@Nonnull org.apache.lucene.analysis.CharArraySet stopWords, int minTokenLength, int maxTokenLength, @Nullable String synonymName)
    • AlphanumericCjkAnalyzer

      public AlphanumericCjkAnalyzer(@Nonnull org.apache.lucene.analysis.CharArraySet stopWords, int minTokenLength, int maxTokenLength, boolean breakLongTokens, @Nullable String synonymName)
      Create a new AlphanumericCjkAnalyzer. This has an additional parameter that most analyzers don't have, the minAlphanumericTokenLength. This is used to filter out tokens that are too short and consist only of alphanumeric characters, so CJK unigrams below that length are not filtered out. For example, with a min token length of 1 and a min alphanumeric token length of 3, the string "5人" would get tokenized into a single token, "人", but the string "500人" becomes two tokens, "500" and "人".
      Parameters:
      stopWords - the stop words to exclude
      minTokenLength - the minimum length of any token
      maxTokenLength - the maximum token length
      breakLongTokens - if true, tokens longer than maxTokenLength are broken up into multiple tokens
      synonymName - the name of the synonym map to use, or null if no synonyms are to be used
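The worked example above can be sketched with the five-argument constructor listed here. This is an illustrative sketch, assuming (as the description states) that the minimum length applies only to alphanumeric tokens while CJK unigrams are exempt; the concrete lengths and the empty stop-word set are not library defaults.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import com.apple.foundationdb.record.lucene.AlphanumericCjkAnalyzer;

public class MinLengthExample {
    static List<String> analyze(Analyzer analyzer, String text) throws IOException {
        List<String> terms = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream("text", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                terms.add(term.toString());
            }
            ts.end();
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        try (Analyzer analyzer = new AlphanumericCjkAnalyzer(
                CharArraySet.EMPTY_SET, // no stop words (illustrative)
                3,                      // minimum token length
                255,                    // maximum token length (illustrative)
                false,                  // do not break long tokens
                null)) {                // no synonym map
            // Per the description: "5" (length 1 < 3) is dropped, while the
            // CJK unigram "人" survives because CJK tokens are exempt from
            // the alphanumeric minimum length.
            System.out.println(analyze(analyzer, "5人"));
            System.out.println(analyze(analyzer, "500人"));
        }
    }
}
```

Per the class documentation, the first call should yield only "人" and the second should yield "500" and "人".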
  • Method Details

    • setMaxTokenLength

      public void setMaxTokenLength(int length)
    • getMaxTokenLength

      public int getMaxTokenLength()
    • setMinTokenLength

      public void setMinTokenLength(int length)
    • getMinTokenLength

      public int getMinTokenLength()
    • createComponents

      protected org.apache.lucene.analysis.Analyzer.TokenStreamComponents createComponents(String fieldName)
      Specified by:
      createComponents in class org.apache.lucene.analysis.Analyzer
    • normalize

      protected org.apache.lucene.analysis.TokenStream normalize(String fieldName, org.apache.lucene.analysis.TokenStream in)
      Overrides:
      normalize in class org.apache.lucene.analysis.Analyzer