Class NexusTokenizer
Comments
A simple token pull-parser for the NEXUS file format as specified in:
Maddison, D. R., Swofford, D. L., & Maddison, W. P. (1997). NEXUS: an extensible file format for systematic information. Systematic Biology, 46(4), pp. 590-621.
The parser is designed to break a NEXUS file into tokens which are read individually. Tokens come in four different types:
- Punctuation: any of the punctuation characters (see constants)
- Whitespace: sequences of characters composed of ' ' or '\t'. Whitespace is only returned if the option is set
- Word: any string of characters delimited by whitespace or punctuation
- Newline: '\r', '\n' or '\r\n'. The parser will return the character unless convertNL is set, in which case it will replace the token with the user-specified newline character
The parser has a set of options allowing tokens to be modified before they are returned (such as case modification or newline substitution).
Each read moves the parser forward in the stream. At present there is no support for unreading tokens or for moving bidirectionally through the stream.
NB: in this implementation, the token #NEXUS is considered special. When the parser reads it, it returns one token, '#NEXUS', not two ('#' and 'NEXUS').
Because this token has special meaning, it has its own token type (HEADER_TOKEN).
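The token categories described above can be illustrated with a small standalone sketch. It is not the NexusTokenizer implementation: the `TokenTypeSketch` class, its `Type` enum and the exact punctuation set are assumptions made for illustration (the punctuation characters are inferred from the constant names, and the header match is assumed to be case-sensitive).

```java
/** Illustrative sketch only; not the NexusTokenizer implementation. */
final class TokenTypeSketch {
    // Hypothetical labels standing in for the *_TOKEN constants.
    enum Type { WORD, PUNCTUATION, NEWLINE, WHITESPACE, HEADER }

    // Assumed punctuation set, inferred from the constant names (L_PARENTHESIS, COMMA, ...).
    static final String PUNCTUATION = "()[]{}/\\,;:=*'\"`+-<>#.";

    /** Classifies an already-separated token string. */
    static Type classify(String token) {
        if (token.equals("#NEXUS")) {
            return Type.HEADER; // one token, not '#' followed by 'NEXUS'
        }
        if (token.equals("\r") || token.equals("\n") || token.equals("\r\n")) {
            return Type.NEWLINE;
        }
        if (token.length() == 1 && PUNCTUATION.indexOf(token.charAt(0)) >= 0) {
            return Type.PUNCTUATION;
        }
        if (!token.isEmpty() && token.chars().allMatch(c -> c == ' ' || c == '\t')) {
            return Type.WHITESPACE;
        }
        return Type.WORD; // anything else delimited by whitespace or punctuation
    }

    public static void main(String[] args) {
        for (String t : new String[] {"#NEXUS", "\r\n", ";", " \t", "taxlabels"}) {
            System.out.println(classify(t));
        }
    }
}
```

The header check comes first so that '#NEXUS' is never split into a punctuation token and a word.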
Usage
import java.io.FileReader;
import java.io.PushbackReader;

NexusTokenizer ntp = new NexusTokenizer(new PushbackReader(new FileReader("afile")));
ntp.setReadWhiteSpace(false);                            // ignore whitespace
ntp.setIgnoreComments(true);                             // ignore comments
ntp.setWordModification(NexusTokenizer.WORD_UPPERCASE);  // all tokens in uppercase
String nToken = ntp.readToken();
while (nToken != null) {
    System.out.println("Token: " + nToken);
    System.out.println("Col: " + ntp.getCol());
    System.out.println("Row: " + ntp.getRow());
    nToken = ntp.readToken();  // advance to the next token; null signals EOF
}
-
Field Summary
Fields (Modifier and Type, Field, Description):
static final char ADDITION
static final char ASTERIX
static final char B_SLASH
static final char B_TICK
static final char C_RETURN
static final char COLON
static final char COMMA
static final char D_QUOTE
static final char DASH
static final char EQUALS
static final char F_SLASH
static final char G_THAN
static final char HASH
static final int HEADER_TOKEN: Flag indicating last token read was the header token #NEXUS
static final char L_BRACE
static final char L_BRACKET
static final char L_FEED
static final char L_PARENTHESIS
static final char L_THAN
static final int NEWLINE_TOKEN: Flag indicating last token read was a newline symbol/word
static final char PERIOD
static final int PUNCTUATION_TOKEN: Flag indicating last token read was a punctuation symbol
static final char R_BRACE
static final char R_BRACKET
static final char R_PARENTHESIS
static final char S_QUOTE
static final char SEMI_COLON
static final char SPACE
static final char TAB
static final int UNDEFINED_TOKEN: Flag indicating last token read was undefined
static final int WHITESPACE_TOKEN: Flag indicating last token read was whitespace
static final int WORD_LOWERCASE: Flag indicating words should be converted to lowercase
static final int WORD_TOKEN: Flag indicating last token read was a word
static final int WORD_UNMODIFIED: Flag indicating words should be untouched
static final int WORD_UPPERCASE: Flag indicating words should be converted to uppercase
-
Constructor Summary
Constructors (Constructor, Description):
NexusTokenizer(PushbackReader pr): Constructor for a NexusTokenParser
NexusTokenizer(String file): Constructor for a NexusTokenParser
-
Method Summary
Methods (Modifier and Type, Method, Description):
boolean convertNewLine(): Gets the flag indicating whether this parser instance should convert newline characters.
int getCol(): Gets the current column position of the cursor.
String getLastReadToken(): Returns the last read token.
int getLastTokenType(): Determine the type of the last read token.
int getRow(): Gets the current row position of the cursor.
int getWordModification(): Gets the word modification flag currently in use.
String readToken(): Reads a token in from the underlying stream.
boolean readWhiteSpace(): Get the flag indicating whether or not this parser object is reading (and returning) whitespace.
String seek(int tokenType): Seeks through the stream to find the next token of the specified type.
String seek(String token): Seeks through the stream to find the token argument.
void setConvertNewLine(boolean b): Sets the convertNL flag.
void setIgnoreComments(boolean b): Sets the ignoreComments flag.
void setNewLineChar(char nl): Sets the character that newline characters are converted into.
void setReadWhiteSpace(boolean b): Sets the readWS flag.
void setWordModification(int flag): Sets the flag value for word modification.
-
Field Details
-
L_PARENTHESIS
public static final char L_PARENTHESIS
-
R_PARENTHESIS
public static final char R_PARENTHESIS
-
L_BRACKET
public static final char L_BRACKET
-
R_BRACKET
public static final char R_BRACKET
-
L_BRACE
public static final char L_BRACE
-
R_BRACE
public static final char R_BRACE
-
F_SLASH
public static final char F_SLASH
-
B_SLASH
public static final char B_SLASH
-
COMMA
public static final char COMMA
-
SEMI_COLON
public static final char SEMI_COLON
-
COLON
public static final char COLON
-
EQUALS
public static final char EQUALS
-
ASTERIX
public static final char ASTERIX
-
S_QUOTE
public static final char S_QUOTE
-
D_QUOTE
public static final char D_QUOTE
-
B_TICK
public static final char B_TICK
-
ADDITION
public static final char ADDITION
-
DASH
public static final char DASH
-
L_THAN
public static final char L_THAN
-
G_THAN
public static final char G_THAN
-
HASH
public static final char HASH
-
PERIOD
public static final char PERIOD
-
L_FEED
public static final char L_FEED
-
C_RETURN
public static final char C_RETURN
-
TAB
public static final char TAB
-
SPACE
public static final char SPACE
-
WORD_UPPERCASE
public static final int WORD_UPPERCASE
Flag indicating words should be converted to uppercase
-
WORD_LOWERCASE
public static final int WORD_LOWERCASE
Flag indicating words should be converted to lowercase
-
WORD_UNMODIFIED
public static final int WORD_UNMODIFIED
Flag indicating words should be untouched
-
UNDEFINED_TOKEN
public static final int UNDEFINED_TOKEN
Flag indicating last token read was undefined
-
WORD_TOKEN
public static final int WORD_TOKEN
Flag indicating last token read was a word
-
PUNCTUATION_TOKEN
public static final int PUNCTUATION_TOKEN
Flag indicating last token read was a punctuation symbol
-
NEWLINE_TOKEN
public static final int NEWLINE_TOKEN
Flag indicating last token read was a newline symbol/word
-
WHITESPACE_TOKEN
public static final int WHITESPACE_TOKEN
Flag indicating last token read was whitespace
-
HEADER_TOKEN
public static final int HEADER_TOKEN
Flag indicating last token read was the header token #NEXUS
-
-
Constructor Details
-
NexusTokenizer
NexusTokenizer(String file) throws IOException
Constructor for a NexusTokenParser
- Parameters:
file - File name for the NEXUS file
- Throws:
IOException - I/O errors
-
NexusTokenizer
NexusTokenizer(PushbackReader pr) throws IOException
Constructor for a NexusTokenParser
- Parameters:
pr - PushbackReader
- Throws:
IOException - I/O errors
-
-
Method Details
-
readWhiteSpace
public boolean readWhiteSpace()
Get the flag indicating whether or not this parser object is reading (and returning) whitespace.
- Returns:
- the readWS flag
-
convertNewLine
public boolean convertNewLine()
Gets the flag indicating whether this parser instance should convert newline characters. As the specification says (see the reference in the class description above), newline characters may be '\r', '\n' or '\r\n'. To provide uniformity, the parser can convert all of these into a single specified character. By default, this feature is off.
- Returns:
- the convertNL flag
-
setReadWhiteSpace
public void setReadWhiteSpace(boolean b)
Sets the readWS flag. True means that the parser will return whitespace characters as a token (where whitespace = ' ' or '\t').
- Parameters:
b - flag value for readWS
-
setConvertNewLine
public void setConvertNewLine(boolean b)
Sets the convertNL flag. True means that the parser will convert newline characters ('\r', '\n' or '\r\n') into either the default ('\n' if setNewLineChar() is not called) or a user-specified newline character.
- Parameters:
b - flag value for convertNL
-
setIgnoreComments
public void setIgnoreComments(boolean b)
Sets the ignoreComments flag. True means that the tokenizer will ignore comments (i.e. sections of a NEXUS file delimited by '[...]'). When set to true, the tokenizer will return the first token available after a comment.
- Parameters:
b - flag value for ignoreComments
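As a rough illustration of what ignoring '[...]' comments involves, the following standalone sketch removes bracketed sections from a string. It is not how NexusTokenizer is implemented: `CommentSkipSketch` and `stripComments` are hypothetical names, nested comments are assumed to be allowed, and brackets inside quoted words are not treated specially.

```java
/** Illustrative sketch only; not the NexusTokenizer implementation. */
final class CommentSkipSketch {
    /** Returns the input with every '[...]' section removed (nesting assumed allowed). */
    static String stripComments(String text) {
        StringBuilder out = new StringBuilder();
        int depth = 0; // greater than zero while inside a comment
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '[') {
                depth++;                 // entering a (possibly nested) comment
            } else if (c == ']' && depth > 0) {
                depth--;                 // leaving a comment
            } else if (depth == 0) {
                out.append(c);           // ordinary character outside any comment
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripComments("begin [a comment] trees;"));
    }
}
```

With comments removed this way, the first token after a comment is simply the next one in the remaining text, which matches the behaviour described for the flag.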
-
setNewLineChar
public void setNewLineChar(char nl)
Sets the character that newline characters are converted into.
- Parameters:
nl - Replacement newline character
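The newline conversion controlled by setConvertNewLine() and setNewLineChar() can be pictured with the standalone sketch below. It is only an approximation under stated assumptions: `NewlineSketch` and `convertNewlines` are hypothetical names, and the sketch works on a whole string, whereas the tokenizer converts newline tokens as it reads them.

```java
/** Illustrative sketch only; not the NexusTokenizer implementation. */
final class NewlineSketch {
    /** Replaces each '\r', '\n' or '\r\n' with a single replacement character. */
    static String convertNewlines(String text, char replacement) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                // Treat "\r\n" as one newline by skipping the '\n' that follows.
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;
                }
                out.append(replacement);
            } else if (c == '\n') {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Uses '|' as the replacement so the result is visible when printed.
        System.out.println(convertNewlines("line1\r\nline2\rline3\nline4", '|'));
    }
}
```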
-
getCol
public int getCol()
Gets the current column position of the cursor. Updated after each read.
- Returns:
- Column number (zero indexed)
-
getRow
public int getRow()
Gets the current row position of the cursor. Updated after each read.
- Returns:
- Row number (zero indexed)
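One way to think about zero-indexed row and column positions is sketched below. How NexusTokenizer actually counts is not specified here, so this is an assumption: the hypothetical `CursorSketch` class advances the column for every character consumed and, on a newline, increments the row and resets the column to zero.

```java
/** Illustrative sketch only; the real cursor bookkeeping may differ. */
final class CursorSketch {
    private int row = 0; // zero indexed
    private int col = 0; // zero indexed

    /** Moves the cursor past the given consumed text. */
    void advance(String consumed) {
        for (int i = 0; i < consumed.length(); i++) {
            char c = consumed.charAt(i);
            if (c == '\r' || c == '\n') {
                // Count "\r\n" as a single line break.
                if (c == '\r' && i + 1 < consumed.length() && consumed.charAt(i + 1) == '\n') {
                    i++;
                }
                row++;
                col = 0;
            } else {
                col++;
            }
        }
    }

    int getRow() { return row; }

    int getCol() { return col; }

    public static void main(String[] args) {
        CursorSketch cursor = new CursorSketch();
        cursor.advance("#NEXUS\r\n");
        cursor.advance("begin");
        System.out.println("Row: " + cursor.getRow() + ", Col: " + cursor.getCol());
    }
}
```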
-
getWordModification
public int getWordModification()
Gets the word modification flag currently in use.
- Returns:
- Flag value for word modification
-
setWordModification
public void setWordModification(int flag)
Sets the flag value for word modification. The token case can be changed to lowercase or uppercase once it has been read from the stream (depending on the set flag). WORD_UNMODIFIED indicates that tokens should be returned in the case in which they are read from the stream. This value can be set at any time between token reads, and the next token read will be altered according to it. The default is WORD_UNMODIFIED.
- Parameters:
flag - Flag value, one of WORD_LOWERCASE, WORD_UPPERCASE or WORD_UNMODIFIED
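The effect of the three word modification flags can be sketched as follows. The constant values used here are hypothetical (the real values are not shown in this page), and `WordCaseSketch.apply` is an illustrative helper, not a method of NexusTokenizer.

```java
import java.util.Locale;

/** Illustrative sketch only; constant values are assumptions. */
final class WordCaseSketch {
    // Hypothetical values standing in for the NexusTokenizer constants.
    static final int WORD_UNMODIFIED = 0;
    static final int WORD_UPPERCASE = 1;
    static final int WORD_LOWERCASE = 2;

    /** Applies the case modification selected by the flag to a word token. */
    static String apply(String word, int flag) {
        switch (flag) {
            case WORD_UPPERCASE:
                return word.toUpperCase(Locale.ROOT);
            case WORD_LOWERCASE:
                return word.toLowerCase(Locale.ROOT);
            case WORD_UNMODIFIED:
                return word; // returned exactly as read
            default:
                throw new IllegalArgumentException("Unknown word modification flag: " + flag);
        }
    }

    public static void main(String[] args) {
        System.out.println(apply("TaxLabels", WORD_UPPERCASE));
        System.out.println(apply("TaxLabels", WORD_LOWERCASE));
        System.out.println(apply("TaxLabels", WORD_UNMODIFIED));
    }
}
```

Using Locale.ROOT keeps the conversion independent of the default locale, which is a design choice of this sketch rather than documented behaviour of the class.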
-
readToken
public String readToken() throws IOException, NexusParseException
Reads a token in from the underlying stream. Tokens are individual chunks read from the underlying stream. Each token is one of the four basic types:
- Word: any string of characters delimited by whitespace or punctuation
- Punctuation: any of the punctuation characters (see constants)
- Whitespace: sequences of characters composed of ' ' or '\t'. Whitespace is only returned if the option is set
- Newline: '\r', '\n' or '\r\n'. The parser will return the character unless convertNL is set, in which case it will replace the token with the user-specified newline character
- Returns:
- a String token, or null if EOF is reached (i.e. no more tokens to read)
- Throws:
IOException - I/O errors
NexusParseException - Parsing errors
-
getLastTokenType
public int getLastTokenType()
Determine the type of the last read token. After readToken() has been called, the type of token returned can be determined by calling getLastTokenType(). This returns one of six different constants:
- UNDEFINED_TOKEN: default before anything is read from the stream
- WORD_TOKEN: word token was read
- PUNCTUATION_TOKEN: punctuation token was read
- NEWLINE_TOKEN: newline token was read
- WHITESPACE_TOKEN: whitespace token was read (never returned unless whitespace is being returned)
- HEADER_TOKEN: last token was the special word #NEXUS
- Returns:
- Type of the last token read.
-
seek
public String seek(int tokenType) throws IOException, NexusParseException
Seeks through the stream to find the next token of the specified type. The type value can be one of:
- WORD_TOKEN
- PUNCTUATION_TOKEN
- NEWLINE_TOKEN
- WHITESPACE_TOKEN
- HEADER_TOKEN
- Returns:
- a String token, or null if EOF is reached (i.e. no more tokens to read)
- Throws:
IOException - I/O errors
NexusParseException - Thrown by parsing errors or if tokenType == WHITESPACE_TOKEN && readWhiteSpace() == false
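Conceptually, seeking is a forward-only loop that reads and discards tokens until one matches. The sketch below shows that idea over a plain iterator of strings. `SeekSketch` and its `seek` helper are hypothetical, the match condition is passed in as a predicate rather than a token type constant, and the exception for seeking whitespace while whitespace reading is disabled is not modelled.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Predicate;

/** Illustrative sketch only; not the NexusTokenizer implementation. */
final class SeekSketch {
    /**
     * Reads tokens until one satisfies the predicate.
     * Returns the matching token, or null if the input is exhausted.
     */
    static String seek(Iterator<String> tokens, Predicate<String> matches) {
        while (tokens.hasNext()) {
            String token = tokens.next(); // skipped tokens are consumed and cannot be re-read
            if (matches.test(token)) {
                return token;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Iterator<String> tokens = Arrays.asList("begin", "trees", ";", "end", ";").iterator();
        System.out.println(seek(tokens, t -> t.equals(";"))); // first ';'
        System.out.println(tokens.next());                     // token right after the match
    }
}
```

Because the stream only moves forward, every token skipped during a seek is lost to the caller, which is worth keeping in mind when mixing seek and readToken calls.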
-
seek
public String seek(String token) throws IOException, NexusParseException
Seeks through the stream to find the token argument.
- Returns:
- a String token, or null if the token is not found (i.e. EOF is reached)
- Throws:
IOException - I/O errors
NexusParseException - Thrown by parsing errors or if token is whitespace && readWhiteSpace() == false
-
getLastReadToken
public String getLastReadToken()
Returns the last read token. Each call to readToken() stores the returned token so that it can be retrieved again. Each subsequent readToken() call replaces the stored token with the new one.
- Returns:
- the last read token
-