July 2020

Language Modeling with Reduced Densities

July 9, 2020

Today I'd like to share with you a new paper on the arXiv, my latest project in collaboration with mathematician Yiannis Vlassopoulos (Tunnel, IHES). To whet your appetite, let me first set the stage. A few months ago I made a 10-minute video introducing my PhD thesis, which was an investigation into mathematical structure that is both algebraic and statistical. In the video, I noted that natural language is one place where such mathematical structure can be found.

Language is algebraic, since words can be concatenated to form longer expressions. Language is also statistical, since some expressions occur more frequently than others.
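To make those two ingredients concrete, here is a minimal Python sketch. The toy corpus and the function names are invented for illustration and are not taken from the paper; it only shows concatenation as the "algebraic" operation and relative frequency as the "statistical" one.

```python
from collections import Counter

# Hypothetical toy corpus of two-word phrases, made up for this illustration.
corpus = ["green apple", "green apple", "green apple", "green thought"]


def concatenate(left: str, right: str) -> str:
    """Algebraic side: stick two expressions together to form a longer one."""
    return f"{left} {right}"


# Statistical side: count how often each phrase occurs in the corpus.
counts = Counter(corpus)
total = sum(counts.values())


def relative_frequency(phrase: str) -> float:
    """Fraction of corpus entries equal to `phrase` (0.0 if it never occurs)."""
    return counts[phrase] / total


common = concatenate("green", "apple")
rare = concatenate("green", "thought")
print(common, relative_frequency(common))
print(rare, relative_frequency(rare))
```

Both phrases are equally easy to build, but they come with very different frequencies, and that difference is the extra structure of interest here.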

As a simple example, take the words "orange" and "fruit." We can stick them together to get a new phrase, "orange fruit." Or we could put "orange" together with "idea" to get "orange idea." That might sound silly to us, since the phrase "orange idea" occurs less frequently in English than "orange fruit." But that's the point. These frequencies contribute something to the meanings of these expressions.

So what is this kind of mathematical structure? As I mention in the video, it's helpful to have a set of tools to start exploring it, and basic ideas from quantum physics are one source of inspiration. I won't get into this now (you can watch the video or read the thesis!), but I do want to emphasize the following: in certain contexts, these tools provide a way to see that statistics can serve as a proxy for meaning. I didn't explain how in the video. I left it as a cliffhanger.

But I'll tell you the rest of the story now.