Interactive online tutorial: Notebook

Parsing and Visiting

LibCST provides helpers to parse a source code string as a concrete syntax tree. To perform static analysis that identifies patterns in the tree, or to modify the tree programmatically, we can use the visitor pattern to traverse it. In this tutorial, we demonstrate a common three-step workflow for building an automated refactoring (codemod) application:

  1. Parse Source Code

  2. Build Visitor or Transformer

  3. Generate Source Code

Parse Source Code

LibCST provides several helpers to parse source code as a concrete syntax tree: parse_module(), parse_expression() and parse_statement() (see Parsing for more detail). The default CSTNode repr pretty-prints the tree so it is easy to read.

[2]:
import libcst as cst

cst.parse_expression("1 + 2")
[2]:
BinaryOperation(
    left=Integer(
        value='1',
        lpar=[],
        rpar=[],
    ),
    operator=Add(
        whitespace_before=SimpleWhitespace(
            value=' ',
        ),
        whitespace_after=SimpleWhitespace(
            value=' ',
        ),
    ),
    right=Integer(
        value='2',
        lpar=[],
        rpar=[],
    ),
    lpar=[],
    rpar=[],
)

Example: add typing annotation from pyi stub file to Python source

Type annotations (the typing module) were added in Python 3.5. Some Python applications keep type annotations in separate pyi stub files in order to support older Python versions. When an application drops support for those older versions, it is useful to automatically copy the type annotations from the pyi file into the source file. Here we demonstrate how to do that easily using LibCST. The first step is to parse the pyi stub and the source file as trees.

[3]:
py_source = '''
class PythonToken(Token):
    def __repr__(self):
        return ('TokenInfo(type=%s, string=%r, start_pos=%r, prefix=%r)' %
                self._replace(type=self.type.name))

def tokenize(code, version_info, start_pos=(1, 0)):
    """Generate tokens from a the source code (string)."""
    lines = split_lines(code, keepends=True)
    return tokenize_lines(lines, version_info, start_pos=start_pos)
'''

pyi_source = '''
class PythonToken(Token):
    def __repr__(self) -> str: ...

def tokenize(
    code: str, version_info: PythonVersionInfo, start_pos: Tuple[int, int] = (1, 0)
) -> Generator[PythonToken, None, None]: ...
'''

source_tree = cst.parse_module(py_source)
stub_tree = cst.parse_module(pyi_source)

Build Visitor or Transformer

For traversing and modifying the tree, LibCST provides Visitor and Transformer classes similar to those in the ast module. To implement a visitor (read-only) or transformer (read/write), subclass CSTVisitor or CSTTransformer (see Visitors for more detail). In the typing example, we need a visitor to collect the type annotations from the stub tree and a transformer to copy those annotations onto the function signatures. In the visitor, we implement visit_FunctionDef to collect annotations. Later, in the transformer, we implement leave_FunctionDef to add the collected annotations.

[4]:
from typing import List, Tuple, Dict, Optional


class TypingCollector(cst.CSTVisitor):
    def __init__(self):
        # stack for storing the canonical name of the current function
        self.stack: List[Tuple[str, ...]] = []
        # store the annotations
        self.annotations: Dict[
            Tuple[str, ...],  # key: tuple of canonical class/function name
            Tuple[cst.Parameters, Optional[cst.Annotation]],  # value: (params, returns)
        ] = {}

    def visit_ClassDef(self, node: cst.ClassDef) -> Optional[bool]:
        self.stack.append(node.name.value)

    def leave_ClassDef(self, node: cst.ClassDef) -> None:
        self.stack.pop()

    def visit_FunctionDef(self, node: cst.FunctionDef) -> Optional[bool]:
        self.stack.append(node.name.value)
        self.annotations[tuple(self.stack)] = (node.params, node.returns)
        # pyi files don't support inner functions; return False to stop the traversal
        return False

    def leave_FunctionDef(self, node: cst.FunctionDef) -> None:
        self.stack.pop()


class TypingTransformer(cst.CSTTransformer):
    def __init__(self, annotations):
        # stack for storing the canonical name of the current function
        self.stack: List[Tuple[str, ...]] = []
        # store the annotations
        self.annotations: Dict[
            Tuple[str, ...],  # key: tuple of canonical class/function name
            Tuple[cst.Parameters, Optional[cst.Annotation]],  # value: (params, returns)
        ] = annotations

    def visit_ClassDef(self, node: cst.ClassDef) -> Optional[bool]:
        self.stack.append(node.name.value)

    def leave_ClassDef(
        self, original_node: cst.ClassDef, updated_node: cst.ClassDef
    ) -> cst.CSTNode:
        self.stack.pop()
        return updated_node

    def visit_FunctionDef(self, node: cst.FunctionDef) -> Optional[bool]:
        self.stack.append(node.name.value)
        # pyi files don't support inner functions; return False to stop the traversal
        return False

    def leave_FunctionDef(
        self, original_node: cst.FunctionDef, updated_node: cst.FunctionDef
    ) -> cst.CSTNode:
        key = tuple(self.stack)
        self.stack.pop()
        if key in self.annotations:
            annotations = self.annotations[key]
            return updated_node.with_changes(
                params=annotations[0], returns=annotations[1]
            )
        return updated_node


visitor = TypingCollector()
stub_tree.visit(visitor)
transformer = TypingTransformer(visitor.annotations)
modified_tree = source_tree.visit(transformer)

Generate Source Code

Generating the source code from a CST tree is as easy as accessing the code attribute on Module. After code generation, we often run Black and isort to reformat the code and keep a consistent style.

[5]:
print(modified_tree.code)

class PythonToken(Token):
    def __repr__(self) -> str:
        return ('TokenInfo(type=%s, string=%r, start_pos=%r, prefix=%r)' %
                self._replace(type=self.type.name))

def tokenize(code: str, version_info: PythonVersionInfo, start_pos: Tuple[int, int] = (1, 0)
) -> Generator[PythonToken, None, None]:
    """Generate tokens from a the source code (string)."""
    lines = split_lines(code, keepends=True)
    return tokenize_lines(lines, version_info, start_pos=start_pos)

[6]:
# Use difflib to show the changes to verify type annotations were added as expected.
import difflib

print(
    "".join(
        difflib.unified_diff(py_source.splitlines(1), modified_tree.code.splitlines(1))
    )
)
---
+++
@@ -1,10 +1,11 @@

 class PythonToken(Token):
-    def __repr__(self):
+    def __repr__(self) -> str:
         return ('TokenInfo(type=%s, string=%r, start_pos=%r, prefix=%r)' %
                 self._replace(type=self.type.name))

-def tokenize(code, version_info, start_pos=(1, 0)):
+def tokenize(code: str, version_info: PythonVersionInfo, start_pos: Tuple[int, int] = (1, 0)
+) -> Generator[PythonToken, None, None]:
     """Generate tokens from a the source code (string)."""
     lines = split_lines(code, keepends=True)
     return tokenize_lines(lines, version_info, start_pos=start_pos)

For the sake of efficiency, we don’t want to rewrite the file when the transformer doesn’t change the source code. We can use deep_equals() to check whether two trees represent the same source code. Note that == compares tree objects by identity, not by representation.

[7]:
if not modified_tree.deep_equals(source_tree):
    ...  # write to file