Efficient Grammar Fuzzing

post

software engineering

fuzzing book

How can we efficiently implement grammars for fuzzing?

Authors

Aidan Dyga

Molly Suppo

Titus Smith

Gregory M. Kapfhammer

Published

November 13, 2024

Overview

This post discusses the Efficient Grammar Fuzzing chapter from The Fuzzing Book. In this post, we will cover derivation trees and what they are, expansion and how it works, and cost functions and how they can be used for grammar fuzzing. We will also examine both the theoretical aspect of these concepts as well as more code-based examples explaining how it works in practice.

Summary

As we have learned in the past, grammars are essential tools for producing syntactically valid inputs while offering a structured approach to input generation. This chapter refines the previous string-based algorithm into a tree-based algorithm, allowing for faster grammars that have much more control over the production of fuzz inputs. In this blog, we will explore how to create a more efficient grammar fuzzer that is more applicable to real word use cases.

What is a Derivation Tree?

In the context of grammar fuzzing, derivation trees are used to efficiently manage and control the process of expanding grammar rules. A derivation tree represents the entire structure of a generated string, keeping track of which parts have been expanded and which still need to be. This tree-based structure makes it easier to visualize, manipulate, and compare different expansions. Each node in the tree corresponds to either a terminal (REMINDER: cannot be expanded further) or a non-terminal (REMINDER: can be expanded into other symbols). The tree is built step by step, starting from a root node and progressively expanding non-terminals until only terminal symbols remain.

Make-Up of a Derivation Tree

Below are the different components of a derivation tree:

Terminal Node(s): These are the “leaf” nodes and represent the actual symbols of the generated string (e.g., characters, words).
Non-terminal Node(s): These represent symbols that can be expanded into other symbols according to the grammar rules (e.g., <expr> or <term>).
Root Node: The root node represents the start symbol of the grammar.
Expansion: The tree is built by recursively expanding non-terminals according to the grammar’s production rules.

While the simple grammar fuzzer used previously was inefficient and lacked control, leading to potential infinite expansions, derivation trees solve these issues by providing a more structured and efficient approach. By expanding nodes in a derivation tree instead of manipulating strings directly, the algorithm can track progress, avoid infinite expansions, and maintain better control over the process

What is the purpose of using a derivation tree in grammar-based string generation, and how does it improve the process compared to generating a string directly?

Click to Expand for the Answer

The purpose of using a derivation tree is to provide a structured representation of the string generation process. A derivation tree captures the full history of how a string is derived from the start symbol by expanding non-terminal symbols according to grammar rules. This structure allows for better control, visualization, and manipulation of the generation process. Additionally, derivation trees allow for better visualization and comparing different derivations of the same string, allowing for clearer understanding and debugging of the generation process.

How Expansion Works

Expansion starts by searching for a non-terminal symbol without children. This is also referred to as a “leaf.” The expansion is then added as a child, or child node of the “leaf.” The process then continues as another non-terminal symbol is chosen to expand. This expansion process repeats until there are no non-terminal symbols left to expand.

When defining expansion from a coding standpoint, there are ways to indicate whether a symbol has children and whether it is able to be expanded. We use None to indicate a situation where the symbol has no children and it is non-terminal, meaning we can expand the node. We use [] to indicate a situation where the symbol has no children and it is terminal, meaning we cannot expand the node.

What are the three different kinds of expansion?

Click to Expand for the Answer

Maximum cost
Minimum cost
Random (previously discussed)

This relates to the next section regarding cost functions and how they are used to facilitate expansion in certain situations.

Cost Functions

The process of closing the expansion of a derivation tree involves ensuring that expansions do not inflate the tree size unnecessarily. To achieve this, we introduce cost functions that help determine the most efficient way to expand each symbol in the tree.

Determine the Cost of Expanding

symbol_cost: Computes the minimum cost of all possible expansions for a given symbol, by calling expansion_cost for each possible expansion.
expansion_cost: Calculates the total cost of expanding a symbol by summing up the costs of all its possible expansions. If a non-terminal symbol is revisited during expansion (leading to recursion), the cost is set to infinity to prevent infinite loops.

Expanding by Maximum Cost

class GrammarFuzzer(GrammarFuzzer):
    def expand_node_min_cost(self, node: DerivationTree) -> DerivationTree:
        if self.log:
            print("Expanding", all_terminals(node), "at minimum cost")

        return self.expand_node_by_cost(node, min)

Expanding by Minimum Cost

class GrammarFuzzer(GrammarFuzzer):
    def expand_node_max_cost(self, node: DerivationTree) -> DerivationTree:
        if self.log:
            print("Expanding", all_terminals(node), "at maximum cost")

        return self.expand_node_by_cost(node, max)

In what specific situations would someone prefer to use expanding by maximum versus minimum costs?

Click to Expand for the Answer

Minimum: This approach is preferred when generating small, efficient fuzz inputs to quickly explore edge cases without inflating the derivation tree.
Maximum: This method is useful for generating complex, deeply nested inputs to stress-test the system and uncover bugs in more intricate scenarios.

Key Takeaways

Derivation trees and cost functions play a critical role in improving the efficiency and control of grammar-based fuzzing by guiding the expansion of derivation trees. These functions help avoid unnecessary growth in the tree and prevent infinite loops, ensuring that the fuzzer produces valid and varied inputs without redundant or inefficient expansions. However, using these functions can present challenges, especially with large or highly recursive grammars, where calculating the cost of expansions can become computationally expensive and may require optimizations to maintain performance. Lastly, efficient grammar-fuzzing adds increased functionality to our fuzzing tests and allow for more accurate fuzzing.

Reference

Unless otherwise noted, the source code segments in this article are excerpted directly from the Fuzzing Book. The authors of this online book licensed the source code and its written content under the BY-NC-SA 4.0 Creative Commons License. More details about the license for the Fuzzing Book are available in the The Fuzzing Book License.

Return to Blog Post Listing