# July 2021

# Language, Statistics, & Category Theory, Part 3

Welcome to the final installment of our mini-series on the new preprint "An Enriched Category Theory of Language," joint work with John Terilla and Yiannis Vlassopoulos. In Part 2 of this series, we discussed a way to assign sets to expressions in language (words like "red" or "blue"), which served as a first approximation to the *meanings* of those expressions. Motivated by elementary logic, we then found ways to represent combinations of expressions, such as "red **or** blue" and "red **and** blue" and "red **implies** blue," using basic constructions from category theory.

I like to think of Part 2 as a commercial advertising the benefits of a category theoretical approach to language, rather than a merely algebraic one. But as we observed in Part 1, algebraic structure is not all there is to language. There's also statistics! And far from being an afterthought, those statistics play an essential role as evidenced by today's large language models discussed in Part 0.

Happily, category theory already has an established set of tools that allow one to incorporate statistics in a way that's compatible with the considerations of logic discussed last time. In fact, the entire story outlined in Part 2 has a statistical analogue that can be repeated almost *verbatim*. In today's short post, I'll give a lightning-quick summary.

It all begins with a small, yet crucial, twist.

# Entropy + Algebra + Topology = ?

Today I'd like to share a bit of math involving ideas from information theory, algebra, and topology. It's all in a new paper I've recently uploaded to the arXiv, whose abstract you can see on the right. The paper is short — just 11 pages! Even so, I thought it'd be nice to stroll through some of the surrounding mathematics here.

To introduce those ideas, let's start by thinking about the function $d\colon[0,1]\to\mathbb{R}$ defined by $d(x)=-x\log x$ when $x>0$ and $d(x)=0$ when $x=0$. Perhaps after getting out pencil and paper, it's easy to check that this function satisfies an equation that looks a lot like the product rule from Calculus:

$$d(xy)=x\,d(y)+d(x)\,y.$$
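For anyone who'd rather skip the pencil and paper, here is one way the check might go, assuming $x,y>0$ (when either is zero, both sides vanish):

$$\begin{aligned}
d(xy) &= -xy\log(xy)\\
&= -xy\log x-xy\log y\\
&= (-x\log x)\,y+x\,(-y\log y)\\
&= d(x)\,y+x\,d(y).
\end{aligned}$$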

Functions that satisfy an equation reminiscent of the "Leibniz rule," like this one, are called **derivations**, which evokes the familiar idea of a *derivative*. The nonzero term $-x\log x$ above may also look familiar to some of you. It's an expression that appears in the *Shannon entropy* of a probability distribution. A probability distribution on a finite set $\{1,\ldots,n\}$ for $n\geq 1$ is a sequence $p=(p_1,\ldots,p_n)$ of nonnegative real numbers satisfying $\sum_{i=1}^np_i=1$, and the **Shannon entropy** of $p$ is defined to be

$$H(p)=-\sum_{i=1}^np_i\log p_i=\sum_{i=1}^nd(p_i).$$

Now it turns out that the function $d$ is nonlinear, which means we can't pull it out in front of the summation. In other words, $H(p)\neq d(\sum_ip_i)$ in general. Even so, curiosity might cause us to wonder about settings in which Shannon entropy *is itself* a derivation. One such setting is described in the paper above, which shows a correspondence between Shannon entropy and derivations of (wait for it...) *topological simplices*!
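As a quick numerical sanity check, here is a minimal Python sketch of $d$ and of entropy written as the sum $\sum_i d(p_i)$. It assumes the natural logarithm (the post doesn't fix a base), and the names `d` and `shannon_entropy` are illustrative choices rather than anything taken from the paper.

```python
import math


def d(x: float) -> float:
    """d(x) = -x log x for x > 0, and d(0) = 0 (natural log is an assumption)."""
    return -x * math.log(x) if x > 0 else 0.0


def shannon_entropy(p) -> float:
    """Entropy of a finite probability distribution, computed as sum_i d(p_i)."""
    if any(pi < 0 for pi in p) or not math.isclose(sum(p), 1.0):
        raise ValueError("p must be nonnegative and sum to 1")
    return sum(d(pi) for pi in p)


# The Leibniz-type rule d(xy) = x d(y) + d(x) y, checked at one sample point.
x, y = 0.3, 0.7
assert math.isclose(d(x * y), x * d(y) + d(x) * y)

# d is nonlinear: applying d to each p_i and summing is not the same as
# applying d to the sum, since the sum is 1 and d(1) = 0.
p = [0.5, 0.25, 0.25]
print(shannon_entropy(p))  # entropy as a sum of d(p_i)
print(d(sum(p)))           # d applied to the total probability
```

The second printed value is zero for every distribution, while the first is positive unless all of the probability sits on a single outcome.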

# Language, Statistics, & Category Theory, Part 2

Part 1 of this mini-series opened with the observation that language is an algebraic structure. But we also mentioned that thinking merely algebraically doesn't get us very far. The algebraic perspective, for instance, is not sufficient to describe the passage from probability distributions on corpora of text to syntactic and semantic information in language that we see in today's large language models. This motivated the category theoretical framework presented in a new paper I shared last time. But even before we bring statistics into the picture, there are some immediate advantages to using tools from category theory rather than algebra. One example comes from elementary considerations of logic, and that's where we'll pick up today.

Let's start with a brief recap.