Saturday, June 9, 2007

Describing full binary trees with a context-free language


Today I describe a type of graph (full binary tree) using formal language theory, and in particular, a context-free language. The purpose of this exercise is to use the typical mathematical trick of moving from one area of mathematics to another in order to solve a problem. The problem in this case is how to describe a tree-based statistical model that is used in cognitive science. (But we will save talking about those models for another day.)

Let's recall some graph theory. A binary tree is a directed acyclic graph (DAG) where each node has no more than two children. A full (or proper) binary tree is a tree where each non-terminal node has two children. In other words, each node either has 0 (the terminals or leafs) or two children. The node which is not a child of any other node is referred to as the root. Above you can see an example of a full binary tree that was created in Matlab.

Another notion that we need that comes from graph theory (and like binary trees is used a lot in data structures in computer science) is the preorder traversal. A traversal is a particular way of visiting the nodes of a tree. And the preorder traversal is when one begins with the root of the tree and then moves downward from left-to-right. A sample implementation of this is:
preorder(node)
print node.value
if node.left ≠ null then preorder(node.left)
if node.right ≠ null then preorder(node.right)
But this can be clearly seen in the following figure.



And now let's recall some formal language theory. A formal language is set of strings (finite or infinite) of symbols with precise rules (often recursive) for generating those strings. A context-free language is a particular type of formal language. Formally, a context-free language (CFG) is a 4-tuple (V, Σ, R, S) where:
  1. V is a finite set called the variables ;
  2. Σ is a finite set, V ∩ Σ = ∅, called the terminals;
  3. R is a finite set of productions (or rules), with each production being composed of a variable and a string of variables and terminals, i.e. αA, where A ∈ (V∪Σ)*, where '*' is Kleene (or star) operation: for any set A, A* = {x1x2xk| k≥0 ∧ ∀ik xiA} ;
  4. SV is the start variable.
Now we can let the grammar for full binary trees defined by the preorder traversal be defined as follows:
GFBT={{α},{L, R}, R, α},
where the productions R are given by (where '|' represents the disjunction) and ε represents the empty string:
α → LαR|LRα|LαRα|ε.

Following Chomsky and Miller ["Introduction to the formal analysis of natural languages", Handbook of Mathematical Psychology: Volume II, 1963, pp. 290, 293] we can refer to the first production as left-recursive, the second as right-recursive, and the third as self-embedding. Naturally we can think of the 'L' symbol as representing left and the 'R' symbol representing right.

In order to see how this works, let's use this grammar's productions to generate the string the represents the above tree (the same one we used to demonstrate the preorder traversal). Note that since the language is context-free, we don't have to replace alpha with the same string each time:
α → LαRα →LLRRα →LLRRLαRα→LLRRLLRRLR.
And so now we can use this language to generate a string that represents any full binary tree with the preorder traversal.

At a later point I will tie this into a particular application in mathematical psychology, viz. with multinomial processing tree models.

0 Comments:

Post a Comment

Subscribe to Post Comments [Atom]

<< Home