A wrapper around the stdlib `tokenize` which roundtrips.
The stdlib `tokenize` module does not properly roundtrip. This wrapper
around the stdlib provides two additional tokens `ESCAPED_NL` and
`UNIMPORTANT_WS`, and a `Token` data type. Use `src_to_tokens` and
`tokens_to_src` to roundtrip.
This library is useful if you're writing a refactoring tool based on Python tokenization.
## Installation

```bash
pip install tokenize-rt
```
## Usage

### datastructures

#### `tokenize_rt.Offset(line=None, utf8_byte_offset=None)`

A token offset, useful as a key when cross referencing the `ast` and the
tokenized source.
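As a sketch of the cross-referencing use case (the `ast` walk and the set of
offsets here are illustrative, not part of the API):

```python
import ast

from tokenize_rt import Offset, src_to_tokens

src = 'x = 1\n'

# `ast` reports utf-8 byte column offsets, which is what
# `Offset.utf8_byte_offset` is designed to match
name_offsets = {
    Offset(node.lineno, node.col_offset)
    for node in ast.walk(ast.parse(src))
    if isinstance(node, ast.Name)
}

for token in src_to_tokens(src):
    if token.offset in name_offsets:
        print(token)  # the NAME token for `x`
```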
#### `tokenize_rt.Token(name, src, line=None, utf8_byte_offset=None)`

Construct a token

- `name`: one of the token names listed in `token.tok_name`, or `ESCAPED_NL`, or `UNIMPORTANT_WS`
- `src`: the token's source as text
- `line`: the line number that this token appears on
- `utf8_byte_offset`: the utf8 byte offset that this token appears on in the line

#### `tokenize_rt.Token.offset`

Retrieves an `Offset` for this token.
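For example (a minimal sketch of constructing a token by hand):

```python
from tokenize_rt import Offset, Token

token = Token('NAME', 'x', line=1, utf8_byte_offset=0)
assert token.offset == Offset(line=1, utf8_byte_offset=0)
```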
### converting to and from `Token` representations

#### `tokenize_rt.src_to_tokens(text: str) -> List[Token]`

#### `tokenize_rt.tokens_to_src(Iterable[Token]) -> str`
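These two functions roundtrip, so a parse-then-serialize pass reproduces the
source exactly, comments, whitespace, and escaped newlines included:

```python
from tokenize_rt import src_to_tokens, tokens_to_src

src = 'x = \\\n    5  # comment\n'
assert tokens_to_src(src_to_tokens(src)) == src
```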
### additional tokens added by `tokenize-rt`

#### `tokenize_rt.ESCAPED_NL`

#### `tokenize_rt.UNIMPORTANT_WS`
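To see where these tokens show up, print the stream for a source containing a
backslash continuation (a sketch; the surrounding tokens are elided in the
comment):

```python
from tokenize_rt import src_to_tokens

for token in src_to_tokens('x = \\\n    5\n'):
    print(token.name, repr(token.src))
# among others, this prints:
#   UNIMPORTANT_WS ' '     (the spaces around `=`)
#   ESCAPED_NL '\\\n'      (the backslash-escaped newline)
```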
### helpers

#### `tokenize_rt.NON_CODING_TOKENS`

A `frozenset` containing tokens which may appear between others while not
affecting control flow or code: `COMMENT` / `ESCAPED_NL` / `NL` /
`UNIMPORTANT_WS`
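This is handy for skipping over tokens that don't affect the code, e.g. (a
sketch; `coding` is just an illustrative name):

```python
from tokenize_rt import NON_CODING_TOKENS, src_to_tokens

tokens = src_to_tokens('x = 1  # comment\n')

# drop comments, escaped newlines, and whitespace, keeping only
# the tokens that affect the code itself
coding = [token for token in tokens if token.name not in NON_CODING_TOKENS]
```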
#### `tokenize_rt.parse_string_literal(text: str) -> Tuple[str, str]`

parse a string literal into its prefix and string content

```pycon
>>> parse_string_literal('f"foo"')
('f', '"foo"')
```
#### `tokenize_rt.reversed_enumerate(Sequence[Token]) -> Iterator[Tuple[int, Token]]`

yields `(index, token)` pairs. Useful for rewriting source.
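Iterating in reverse keeps earlier indices valid while the token list is
mutated. A sketch of a typical rewrite (the `u`-prefix removal is an
illustrative transformation, not part of the API):

```python
from tokenize_rt import reversed_enumerate, src_to_tokens, tokens_to_src

def remove_u_prefix(src):
    tokens = src_to_tokens(src)
    for i, token in reversed_enumerate(tokens):
        if token.name == 'STRING' and token.src.startswith(('u', 'U')):
            # Token is a namedtuple, so _replace builds a modified copy
            tokens[i] = token._replace(src=token.src[1:])
    return tokens_to_src(tokens)

assert remove_u_prefix("u'foo'\n") == "'foo'\n"
```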
#### `tokenize_rt.rfind_string_parts(Sequence[Token], i) -> Tuple[int, ...]`

find the indices of the string parts of a (joined) string literal

- `i` should start at the end of the string literal
- returns `()` (an empty tuple) for things which are not string literals

```pycon
>>> tokens = src_to_tokens('"foo" "bar".capitalize()')
>>> rfind_string_parts(tokens, 2)
(0, 2)
>>> tokens = src_to_tokens('("foo" "bar").capitalize()')
>>> rfind_string_parts(tokens, 4)
(1, 3)
```
## Differences from `tokenize`

- `tokenize-rt` adds `ESCAPED_NL` for a backslash-escaped newline "token"
- `tokenize-rt` adds `UNIMPORTANT_WS` for whitespace (discarded in
  `tokenize`) -- both additions are demonstrated in the sketch after this list
- `tokenize-rt` normalizes string prefixes, even if they are not parsed --
  for instance, this means you'll see `Token('STRING', "f'foo'", ...)` even in
  python 2.
- `tokenize-rt` normalizes python 2 long literals (`4l` / `4L`) and octal
  literals (`0755`) in python 3 (for easier rewriting of python 2 code while
  running python 3).
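A quick way to observe the first two differences is to compare against the
stdlib stream (a sketch; `tokenize.generate_tokens` is the stdlib entry
point):

```python
import io
import tokenize

from tokenize_rt import src_to_tokens, tokens_to_src

src = 'x = 1\n'

# the stdlib stream has no tokens for the spaces around `=` ...
stdlib_srcs = [
    tok.string for tok in tokenize.generate_tokens(io.StringIO(src).readline)
]
assert ''.join(stdlib_srcs) != src

# ... while tokenize-rt keeps them as UNIMPORTANT_WS, so the
# concatenated token sources reproduce the input exactly
tokens = src_to_tokens(src)
assert any(token.name == 'UNIMPORTANT_WS' for token in tokens)
assert tokens_to_src(tokens) == src
```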