graph_traversals.py 12.6 KB
Newer Older
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334
"""
This module contains generic generator functions for traversing tree
(and DAG) structures.  It is agnostic to the underlying data structure
and implementation of the tree object.  It does this through dependency
injection of the tree's accessor functions: get_parents and
get_children.

The following depth-first traversal methods are implemented:

* Pre-order: Parent yielded before children; child with multiple
parents is yielded when first encountered.

    Example use cases (when DAGs are *not* supported):

        1. User access. If computing a user's access to a node relies
            on the user's access to the node's parents, access to the
            parent has to be computed before access to the child can
            be determined. To support access chains, a user's access on
            a node is actually an accumulation of accesses down from the
            root node through the ancestor chain to the actual node.

        2. Field value percolated down. If a value for a field is
            dependent on a combination of the child's and the parent's
            value, the parent's value should be computed before that of
            the child's. Similar to "User access", the value would be
            percolated down through the entire ancestor chain.

            Example: Start Date is
                max(node's start date, start date of each ancestor)
            This takes the most restrictive value.

        3. Depth. When computing the depth of a tree, since a child's
            depth value is 1 + the parent's depth value, the parent's
            value should be computed before the child's.

        4. Fast Subtree Deletion. If the tree is to be pruned during
            traversal, an entire subtree can be deleted, without
            traversing the children, as soon as the parent is determined
            to be deleted.

* Topological: Parent yielded before children; child with multiple
parents yielded only after all its parents are visited.

    Example use cases (when DAGs *are* supported):

        1. User access. Similar to pre-order, except a user's access
            is now determined by taking a *union* of the percolated
            access value from each of the node's parents combined with
            its own access.

        2. Field value percolated down. Similar to pre-order, except the
            value for a node is calculated from the array of
            percolated values from each of its parents combined
            with its own.

            Example: Start Date is
                max(node's start date, min(max(ancestry of each parent))
            This takes the most permissive from all ancestry chains.

        3. Depth. Similar to pre-order, except the depth of a node will
            be 1 + the minimum (or the maximum depending on semantics)
            of the depth of all its parents.

        4. Deletion. Deletion of subtrees are not as fast as they are
            for pre-order since a node can be accessed through multiple
            parents.

* Post-order: Children yielded before its parents.

    Example use cases:

        1. Counting. When each node wants to count the number of nodes
            within its sub-structure, the count for each child has to be
            calculated before its parents, since a parent's value
            depends on its children.

        2. Map function (when order doesn't matter). If a function
            needs to be evaluated for each node in a DAG and the order
            that the nodes are iterated doesn't matter, then use
            post-order since it is faster than topological for DAGs.

        3. Field value percolated up. If a value for a field is based
            on the value from it's children, the children's values need
            to be computed before their parents.

            Example: Minimum Due Date of all nodes within the
            sub-structure.

Note: In-order traversal is not implemented as of yet.  We can do so
if/when needed.

Optimization once DAGs are not supported:
Supporting Directed Acyclic Graphs (DAGs) requires us to use
topological sort, which has the following negative performance
implications:

* For a simple tree, we can immediately skip over traversing
descendants, once it is determined that a parent is not to be yielded
(based on the return value from the 'filter_func' function). However,
since we support DAGs, we cannot simply skip over descendants since
they may still be accessible through a different ancestry chain and
need to be revisited once all their parents are visited.

* For topological sort, we need the get_parents accessor function in
order to determine whether all of a node's parents have been visited.
This means the underlying implementation of the graph needs to have
an efficient way to get a node's parents, perhaps with back pointers
to each node's parents.  This requires additional storage space, which
could be eliminated if DAGs are not supported.

"""
from collections import deque


def traverse_topologically(
        start_node,
        get_parents,
        get_children,
        filter_func=None,
        yield_descendants_of_unyielded=False,
):
    """
    Generator for yielding nodes of a tree (or directed acyclic graph)
    in a topological sort.  The tree is traversed using the
    get_parents and get_children accessors. The filter_func function is
    used to limit which nodes are actually yielded.

    Arguments:
        start_node (any hashable type) - The starting node for the
        traversal.

        get_parents (node->[node]) - Function that returns a list of
            parent nodes for a given node.

        get_children (node->[node]) - Function that returns a list of
            children nodes for a given node.

        filter_func (node->boolean) - Function that returns
            whether or not to yield the given node.
            If None, the True function is assumed.

        yield_descendants_of_unyielded (boolean) -
            If False, descendants of an unyielded node are not
                yielded.
            If True, descendants of an unyielded node are yielded even
                if none of their parents were yielded.

            Note: In case of a DAG, a descendant is yielded if *any* of
            its parents are yielded.

    Yields:
        node: The result of the next node in the traversal.
    """
    return _traverse_generic(
        start_node,
        get_parents=get_parents,
        get_children=get_children,
        filter_func=filter_func,
        yield_descendants_of_unyielded=yield_descendants_of_unyielded,
    )


def traverse_pre_order(start_node, get_children, filter_func=None):
    """
    Generator for yielding nodes of a tree (or directed acyclic graph)
    in a pre-order sort.

    Arguments:
        See description in traverse_topologically.
    """
    return _traverse_generic(
        start_node,
        # There's no need to get_parents
        get_parents=None,
        get_children=get_children,
        filter_func=filter_func,
    )


def traverse_post_order(start_node, get_children, filter_func=None):
    """
    Generator for yielding nodes of a tree (or directed acyclic graph)
    in a post-order sort.

    Arguments:
        See description in traverse_topologically.
    """
    class _Node(object):
        """
        Wrapper node class to keep an active children iterator.
        An active iterator is needed in order to determine when all
        children are visited and so the node itself should be visited.
        """
        def __init__(self, node, get_children):
            self.node = node
            self.children = iter(get_children(node))

    # If filter_func isn't provided, make it a no-op.
    filter_func = filter_func or (lambda __: True)

    # Use deque for the stack, which is O(1) for pop and append.
    # Use the _Node class to keep track of iterated children.
    stack = deque([_Node(start_node, get_children)])

    # Keep track of which nodes have been visited.
    visited = set()

    while stack:
        # Peek at the current node at the top of the stack.
        current = stack[-1]

        # Verify the node wasn't already visited and the node
        # satisfies the filter_func.
        if current.node in visited or not filter_func(current.node):
            # Since already visited or filtered out, remove from the
            # stack and continue with the next node.
            stack.pop()
            continue

        # See if there are any additional children for this node.
        try:
            next_child = current.children.next()

        except StopIteration:
            # Since there are no children left, visit the node and
            # remove it from the stack.
            yield current.node
            visited.add(current.node)
            stack.pop()

        else:
            # If so, add the child to the top of the stack.
            stack.append(_Node(next_child, get_children))


def _traverse_generic(
        start_node,
        get_parents,
        get_children,
        filter_func=None,
        yield_descendants_of_unyielded=False,
):
    """
    Helper function to avoid duplicating functionality between
    traverse_depth_first and traverse_topologically.

    If get_parents is None, do a pre-order traversal.
    Else, do a topological traversal.

    The topological traversal has a worse time complexity than
    pre-order does, as it needs to check whether each node's
    parents have been visited.

    Arguments:
        See description in traverse_topologically.
    """

    # If filter_func isn't provided, make it a no-op.
    filter_func = filter_func or (lambda __: True)

    # Use deque for the stack, which is O(1) for pop and append.
    stack = deque([start_node])

    # Keep track of which nodes have been visited and whether they
    # were in fact yielded.
    yield_results = {}  # dict(node:boolean)

    # While there are more nodes on the stack...
    while stack:

        # Take a node off the top of the stack.
        current_node = stack.pop()

        # If we're doing a topological traversal, then make sure all
        # the node's parents have been visited. If they haven't,
        # then skip the node for now; we'll encounter it again later
        # through another one of its parents.
        if get_parents and current_node != start_node:
            parents = get_parents(current_node)

            # If all of the parents have not yet been visited, continue.
            if not all(parent in yield_results for parent in parents):
                continue

            # If none of the parents have yielded, continue, unless
            # specified otherwise (via yield_descendants_of_unyielded).
            elif not yield_descendants_of_unyielded and not any(yield_results[parent] for parent in parents):
                continue

        # If the current node has already been visited, continue.
        if current_node not in yield_results:

            # For a topological sort, it's important that we visit
            # the children even if the parent isn't yielded, in case
            # a child has multiple parents and this is its last
            # parent.  So we add the children to the stack before
            # checking the yield results of the parent.
            #
            # This implementation can be further optimized specifically
            # for pre-order sort, since it doesn't have this
            # requirement.  But for now, we err on the side of reusing
            # this code for both, with room for optimization once
            # the use of pre-order sort is prevalent when DAGs are
            # not supported.
            # JIRA ticket for optimizing pre-order: MA-1560

            if get_parents:
                # For a topological sort, add all the children since
                # they would not have been visited.
                unvisited_children = list(get_children(current_node))

            else:
                # For a pre-order sort, filter out already visited
                # children.
                unvisited_children = list(
                    child
                    for child in get_children(current_node)
                    if child not in yield_results
                )

            # Add the node's unvisited children to the stack in reverse
            # order so they are traversed in their original order.
            unvisited_children.reverse()
            stack.extend(unvisited_children)

            # Yield the result of the node if the node satisfies the
            # filter_func.
            should_yield_node = filter_func(current_node)
            if should_yield_node:
                yield current_node

            # Keep track of whether or not the node was yielded so we
            # know whether or not to yield its children.
            yield_results[current_node] = should_yield_node