Class com.phrasys.Sentencer
java.lang.Object
|
+----com.phrasys.Sentencer
- public class Sentencer
- extends java.lang.Object
- implements java.io.Serializable, LineListener
This is a sentence splitter for English-language texts. It reads
LineEvents and tries to identify sentence and paragraph boundaries
in the data. As these are found, SentenceEvents and ParagraphEvents
are sent to registered listeners. The recommended sequence is for
the Sentencer to be followed by the Tokeniser.
For input, the LineReader or any other emitter of LineEvents
can be used.
- Version:
- 1.0
- Author:
- Oliver Mason
DIRECTQUOTES
- directionalised quotes (``text'')
MIXEDQUOTES
- mixed quotes (`text")
PLAINQUOTES
- plain quotes ("text")
Sentencer()
- Constructor.
addParagraphListener(ParagraphListener)
- Add a paragraph listener.
addSentenceListener(SentenceListener)
- Add a sentence listener.
escapeSGML(String)
- Escape angle brackets.
getQuoteStyle()
- Retrieve the style in which quotes are being processed.
newLine(LineEvent)
- Process a line.
removeParagraphListener(ParagraphListener)
- Remove a paragraph listener.
removeSentenceListener(SentenceListener)
- Remove a sentence listener.
replaceAllSubstring(String, String, String)
- Replace all occurrences of a string.
replaceDirectionalisedQuotes(String)
- Replace directionalised quote marks by entity names.
replaceMixedQuotes(String)
- Replace mixed quote marks by entity names.
replacePlainQuotes(String)
- Replace plain quote marks by entity names.
replaceSubstring(String, String, String)
- Replace a substring.
setQuoteStyle(int)
- Set the way quotes are being processed.
PLAINQUOTES
public static final int PLAINQUOTES
plain quotes ("text")
MIXEDQUOTES
public static final int MIXEDQUOTES
mixed quotes (`text")
DIRECTQUOTES
public static final int DIRECTQUOTES
directionalised quotes (``text'')
Sentencer
public Sentencer()
Constructor.
addSentenceListener
public void addSentenceListener(SentenceListener listener)
Add a sentence listener.
After each sentence, a SentenceEvent is sent to all registered
listeners.
- Parameters:
listener - the listener to register.
addParagraphListener
public void addParagraphListener(ParagraphListener listener)
Add a paragraph listener.
After each paragraph, a ParagraphEvent is sent to all registered
listeners.
- Parameters:
listener - the listener to register.
removeSentenceListener
public void removeSentenceListener(SentenceListener listener)
Remove a sentence listener.
- Parameters:
listener - the listener to be removed
removeParagraphListener
public void removeParagraphListener(ParagraphListener listener)
Remove a paragraph listener.
- Parameters:
listener - the listener to be removed
getQuoteStyle
public int getQuoteStyle()
Retrieve the style in which quotes are being processed.
The quote style is one of PLAINQUOTES, MIXEDQUOTES and
DIRECTQUOTES. For examples of what these styles look like, see
the documentation entry for the respective constant.
- Returns:
- an integer identifying the current quote style.
- See Also:
- setQuoteStyle(int)
setQuoteStyle
public void setQuoteStyle(int style) throws java.lang.IllegalArgumentException
Set the way quotes are being processed.
- Parameters:
style - the new quote style to use.
- Throws:
- java.lang.IllegalArgumentException -
- See Also:
- getQuoteStyle()
replaceMixedQuotes
public static java.lang.String replaceMixedQuotes(java.lang.String line)
Replace mixed quote marks by entity names.
In the mixed style, quote marks are normalised so that a backquote
(`) stands for an opening quote, while a double quote (") stands for
a closing quote. This function replaces these quotes by the respective
entity names bquo and equo. They will also be
surrounded by spaces in order to make tokenisation easier.
- Parameters:
line - a line that might contain quote marks.
- Returns:
- the same line with quote marks replaced.
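The substitution described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the class and method names are hypothetical, and the entity reference form (&bquo; and &equo;) is an assumption, since this page only gives the entity names bquo and equo.

```java
// Hypothetical sketch of the mixed-quote substitution; not Sentencer's own code.
final class MixedQuotesSketch {

    /**
     * Replaces each backquote with an opening-quote entity and each double
     * quote with a closing-quote entity, padding both with spaces.
     * Assumption: entities are written as "&bquo;" and "&equo;".
     */
    static String replaceMixed(String line) {
        return line.replace("`", " &bquo; ").replace("\"", " &equo; ");
    }

    public static void main(String[] args) {
        System.out.println(replaceMixed("`hi\"")); // prints " &bquo; hi &equo; "
    }
}
```

The surrounding spaces mean a later whitespace-based tokeniser sees each entity as a token of its own.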
replaceDirectionalisedQuotes
public static java.lang.String replaceDirectionalisedQuotes(java.lang.String line)
Replace directionalised quote marks by entity names.
Directionalised quote marks are `` for an opening quote and '' for
a closing quote. This function replaces these quotes by the respective
entity names bquo and equo. They will also be
surrounded by spaces in order to make tokenisation easier.
- Parameters:
line - a line that might contain quote marks.
- Returns:
- the same line with quote marks replaced.
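For the directionalised style, the point worth illustrating is that only the two-character sequences are touched, so a single apostrophe survives. The sketch below uses hypothetical names and assumes the entities are written as &bquo; and &equo;. It is not taken from the library.

```java
// Hypothetical sketch of the directionalised-quote substitution.
final class DirectionalQuotesSketch {

    /**
     * Replaces `` with an opening-quote entity and '' with a closing-quote
     * entity, each padded with spaces. Single ` and ' are left alone.
     * Assumption: entities are written as "&bquo;" and "&equo;".
     */
    static String replaceDirectional(String line) {
        return line.replace("``", " &bquo; ").replace("''", " &equo; ");
    }

    public static void main(String[] args) {
        System.out.println(replaceDirectional("``hi''")); // prints " &bquo; hi &equo; "
        System.out.println(replaceDirectional("don't"));  // prints "don't"
    }
}
```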
replacePlainQuotes
public static java.lang.String replacePlainQuotes(java.lang.String line)
Replace plain quote marks by entity names.
The function tries to guess whether quotes are opening or closing, depending
on the position of blank spaces around them. Opening quotes are replaced by
bquo and closing ones by equo. Those which cannot be
identified are replaced by quo. Effectively, all double quotes
are removed from the input string. The entities will also be
surrounded by spaces in order to make tokenisation easier.
- Parameters:
line - a line that might contain quote marks.
- Returns:
- the same line with quote marks replaced.
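One way such a guess could work is sketched below. The exact rule used by the library is not given on this page, so the heuristic here is an assumption: a quote with whitespace (or the line start) before it and a non-space after it counts as opening, the mirror case counts as closing, and anything else is left undetermined. The names and the entity forms &bquo;, &equo; and &quo; are likewise hypothetical.

```java
// Hypothetical sketch of a whitespace-based open/close guess for plain quotes.
final class PlainQuotesSketch {

    static String replacePlain(String line) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c != '"') {
                out.append(c);
                continue;
            }
            // Line start and line end are treated like whitespace (assumption).
            boolean spaceBefore = i == 0 || Character.isWhitespace(line.charAt(i - 1));
            boolean spaceAfter = i == line.length() - 1
                    || Character.isWhitespace(line.charAt(i + 1));
            if (spaceBefore && !spaceAfter) {
                out.append(" &bquo; ");   // looks like an opening quote
            } else if (!spaceBefore && spaceAfter) {
                out.append(" &equo; ");   // looks like a closing quote
            } else {
                out.append(" &quo; ");    // cannot tell
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replacePlain("\"hi\"")); // prints " &bquo; hi &equo; "
        System.out.println(replacePlain("a\"b"));   // prints "a &quo; b"
    }
}
```

In this sketch a closing quote directly followed by punctuation, as in `hi".`, falls into the undetermined case. A fuller heuristic would need to look past punctuation.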
escapeSGML
public static java.lang.String escapeSGML(java.lang.String line)
Escape angle brackets.
If the input text is not marked up in XML, but will later get enriched
with tags, it is desirable to escape special characters used by XML.
The characters to be replaced are &, >, and <.
- Parameters:
line - a line that might contain special characters.
- Returns:
- the same line with characters replaced.
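A minimal sketch of this kind of escaping follows. The page does not say what the three characters are replaced with, so the standard entities &amp;, &lt; and &gt; are an assumption, and the class and method names are hypothetical.

```java
// Hypothetical sketch of escaping &, < and >; not Sentencer's own code.
final class EscapeSketch {

    /**
     * Escapes the ampersand first, so that the ampersands introduced by
     * "&lt;" and "&gt;" are not escaped a second time.
     */
    static String escapeMarkup(String line) {
        return line.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    public static void main(String[] args) {
        System.out.println(escapeMarkup("1 < 2 & 3 > 2")); // prints "1 &lt; 2 &amp; 3 &gt; 2"
    }
}
```

The order of the three replacements matters: handling & last would turn an already produced &lt; into &amp;lt;.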
replaceSubstring
public static java.lang.String replaceSubstring(java.lang.String fullString,
java.lang.String replace,
java.lang.String replacement)
Replace a substring.
This function replaces the first occurrence of the substring replace
in fullString with replacement.
- Parameters:
fullString - the string on which the substitution takes place.
replace - the substring that should be replaced.
replacement - the string to replace that substring.
- Returns:
- the new string.
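The behaviour described above can be sketched with String.indexOf, as below. The names are hypothetical, and returning the input unchanged when the substring is absent or empty is an assumption, since the page does not specify those cases.

```java
// Hypothetical sketch of a literal first-occurrence replacement.
final class ReplaceFirstSketch {

    static String replaceFirstLiteral(String text, String target, String replacement) {
        int at = text.indexOf(target);
        if (target.isEmpty() || at < 0) {
            return text; // assumption: nothing to replace means no change
        }
        return text.substring(0, at) + replacement + text.substring(at + target.length());
    }

    public static void main(String[] args) {
        System.out.println(replaceFirstLiteral("a.b.c", ".", "-")); // prints "a-b.c"
    }
}
```

Unlike String.replaceFirst, which interprets its first argument as a regular expression, this sketch treats the substring literally, so "." matches only a full stop.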
replaceAllSubstring
public static java.lang.String replaceAllSubstring(java.lang.String fullString,
java.lang.String replace,
java.lang.String replacement)
Replace all occurrences of a string.
This function replaces all occurrences of the substring replace
in fullString with replacement.
- Parameters:
fullString - the string on which the substitution takes place.
replace - the substring that should be replaced.
replacement - the string to replace that substring.
- Returns:
- the new string.
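A sketch of a replace-all loop is given below. It uses hypothetical names and makes two assumptions the page does not confirm: matches are found left to right without overlap, and inserted replacement text is never scanned again.

```java
// Hypothetical sketch of a literal replace-all; not Sentencer's own code.
final class ReplaceAllSketch {

    static String replaceAllLiteral(String text, String target, String replacement) {
        if (target.isEmpty()) {
            return text; // assumption: an empty search string changes nothing
        }
        StringBuilder out = new StringBuilder();
        int from = 0;
        int at;
        while ((at = text.indexOf(target, from)) >= 0) {
            out.append(text, from, at).append(replacement); // copy gap, then replacement
            from = at + target.length();                    // continue after the match
        }
        return out.append(text, from, text.length()).toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceAllLiteral("a.b.c", ".", "-")); // prints "a-b-c"
        System.out.println(replaceAllLiteral("aaa", "a", "aa"));  // prints "aaaaaa"
    }
}
```

Searching from the end of each match keeps the loop finite even when the replacement contains the search string, as the second call shows.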
newLine
public void newLine(LineEvent le)
Process a line.
This method is called each time the object receives a new LineEvent.
It sets off the processing of the line. The end of the input data is
marked by an empty LineEvent or a null parameter.
- Parameters:
le - the line event to process.
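The event flow described on this page (lines arrive one at a time, sentences are reported to registered listeners, and a null marks the end of input) can be illustrated with a toy stand-in. Everything in the sketch is hypothetical: the listener interface, the class names, and the naive boundary rule (., ! or ? followed by a space) are assumptions for illustration only and do not reflect the real Sentencer, LineEvent or SentenceListener types.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for the listener pattern; none of these types are from com.phrasys.
final class ListenerSketch {

    /** Hypothetical listener; the real SentenceListener methods are not shown here. */
    interface SentenceSink {
        void sentence(String text);
    }

    static final class TinySplitter {
        private final List<SentenceSink> sinks = new ArrayList<>();
        private final StringBuilder pending = new StringBuilder();

        void addSink(SentenceSink sink) { sinks.add(sink); }
        void removeSink(SentenceSink sink) { sinks.remove(sink); }

        /** Feeds one line; a null line marks the end of input and flushes the rest. */
        void line(String line) {
            if (line == null) {
                emit(pending.toString());
                pending.setLength(0);
                return;
            }
            if (pending.length() > 0) pending.append(' ');
            pending.append(line.trim());
            int cut;
            while ((cut = boundary(pending)) >= 0) {
                emit(pending.substring(0, cut + 1));
                pending.delete(0, cut + 1);
                while (pending.length() > 0 && pending.charAt(0) == ' ') {
                    pending.deleteCharAt(0);
                }
            }
        }

        /** Naive rule (assumption): ., ! or ? followed by a space ends a sentence. */
        private static int boundary(CharSequence s) {
            for (int i = 0; i + 1 < s.length(); i++) {
                char c = s.charAt(i);
                if ((c == '.' || c == '!' || c == '?') && s.charAt(i + 1) == ' ') return i;
            }
            return -1;
        }

        private void emit(String text) {
            String t = text.trim();
            if (t.isEmpty()) return;
            for (SentenceSink sink : sinks) sink.sentence(t);
        }
    }

    public static void main(String[] args) {
        TinySplitter splitter = new TinySplitter();
        splitter.addSink(System.out::println);
        splitter.line("One. Two is");
        splitter.line("split. Three");
        splitter.line(null); // end of input
    }
}
```

Because a sentence may span several lines, the sketch buffers text between calls and only reports the final fragment once the end-of-input marker arrives.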