Anything2Vec: Mapping Reddit into Vector Spaces
Word2Vec is a powerful machine learning technique for embedding text corpora into vector spaces. While it is best known for NLP problems, this blog post shows how it can also be used to represent and better understand communities on Reddit. In collaboration with: @CSSLab

A common problem in ML, natural language processing (NLP), and AI at large is representing objects in a way computers can process. Since computers understand numbers, for which we have a common language of comparing, combining and manipulating, this generally means assigning objects numbers in some fashion. Think of taking something abstract but intuitive to humans, like the text of a book, and assigning each word in that book a unique number. That book could then be represented by the list, or vector, of numbers assigned to it. This is the process of embedding the book as a vector, and there is an increasingly rich literature of techniques for embedding objects as vectors.
While much of this literature focuses on representing words as vectors, which can aid in NLP problems, much of the logic transfers to embedding any arbitrary set of objects. Through my research at the University of Toronto and its computational social science lab, I've been applying embedding techniques to understand online forums like Reddit. This article is meant to serve as a starting point for breaking down the research being done at UofT. For more information on my research check out https://cameronraymond.me, and for the original paper this article is based on see Waller, I., & Anderson, A.
First, we'll take a look at what it means to embed something as a vector and what a good embedding entails. Then we'll take a common embedding technique, Word2Vec, and see how it is used to model words as vectors. After seeing why Word2Vec is so useful, we can start to generalize its principles and show its utility in mapping the different communities of Reddit.
What is an embedding?
While embedding techniques can get complex, at its core, to embed something is just to represent that thing as a vector of real numbers. This is useful because there's a common currency when talking about vectors of real numbers; namely, they are easy to add, subtract, compare and manipulate. To embed a set of objects, then, is just to represent each object with a unique vector of real numbers. Not all embedding techniques involve complex neural nets, and often simple embeddings are powerful enough for a given problem; however, there are benefits to the more nuanced techniques that we'll focus on.
A "dumb embedding" would be to one-hot encode each unique object as its own unit basis vector. This means that in a set of |V| objects, each object v is represented as a vector of size |V| that is all 0s, except for the v-th index, which is a 1.
![](https://cdn-images-1.medium.com/max/2000/1*UOjWvDziH86T2MmiDpp98Q.png)
Why might this not be a powerful enough embedding? Even though we have the tools to manipulate these vectors, doing so may not return intuitive results. This is because when objects are one-hot encoded, the embedding isn't tied back to the real world in any way. Specifically, there isn't a logical relationship between objects' representations that reflects their actual relationships; each vector is equally far from every other vector. In an ideal world, you might want the vector representing "red" ([red] = <1 0 0>) and the vector representing "yellow" ([yellow] = <0 1 0>), when added together, to return the vector representing "orange" ([orange] = <1 1 0>). One-hot encoding only lets you say what an item is by its vector; it doesn't tell you how the vectors relate to one another. With that said, one-hot encoding is often a good starting point.
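To make this concrete, here is a minimal sketch in Python (the colour vocabulary is just an illustration) of one-hot encoding a handful of objects and checking that every pair of vectors ends up equally far apart:

```python
import numpy as np

# A toy "vocabulary" of objects to embed.
vocab = ["red", "yellow", "orange", "blue"]

# One-hot encode: object v becomes a |V|-length vector of 0s
# with a single 1 at v's index.
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

print(one_hot["red"])     # [1. 0. 0. 0.]
print(one_hot["yellow"])  # [0. 1. 0. 0.]

# Every pair of distinct one-hot vectors is the same distance apart,
# so the encoding says nothing about how the objects actually relate.
for i, a in enumerate(vocab):
    for b in vocab[i + 1:]:
        print(a, b, np.linalg.norm(one_hot[a] - one_hot[b]))  # always ~1.414
```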
To understand how we can embed objects in a way that is tied back to the real world, we'll look at a more nuanced technique called Word2Vec. While generally used to embed words, it generalizes to arbitrary objects in certain cases as well. Word2Vec represents each object in a set as a dense vector of real numbers in a way that preserves the relations between different objects.
To get the intuition behind how Word2Vec works, we'll look at its most common use case: embedding words as vectors. Those already familiar with Word2Vec can skip the next section. From there, we'll see how Word2Vec can generalize to embed other objects; for this we'll embed Reddit's 10,000 most active communities. Finally, we'll show how this embedding aligns with our understanding of what these communities represent.
Word2Vec
The underlying intuition behind Word2Vec is that two words are similar if they are used in similar ways. For example, if you substitute the word "good" for the word "great" in a sentence, it will likely still make sense. This concept is well summarized by the linguist John Rupert Firth who, in 1957, said "you shall know a word by the company it keeps." While there are various implementations of Word2Vec, this article will focus on the Skip-gram model, which fits well with Firth's ideas.
"You shall know a word by the company it keeps." – J.R. Firth
The Skip-gram model, when applied to words, goes through each word in the text corpus and tries to predict the n words on either side of it. The n words surrounding the target word are its context. In the picture below, we see that the context words for "nasty" are ferocious, dog's, sharp and bite.

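As a rough sketch of how those (target, context) pairs could be extracted, here is a small Python function; the tokenized sentence and window size are illustrative choices, not the exact preprocessing used in practice:

```python
def skipgram_pairs(tokens, n=2):
    """Yield (target, context) pairs using a window of n words on each side."""
    for i, target in enumerate(tokens):
        window = tokens[max(0, i - n):i] + tokens[i + 1:i + 1 + n]
        for context in window:
            yield target, context

sentence = "the ferocious dog's nasty sharp bite".split()
for target, context in skipgram_pairs(sentence, n=2):
    print(target, "->", context)
# For the target "nasty", the contexts printed are: ferocious, dog's, sharp, bite.
```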
We start by one-hot encoding each word, then use a shallow neural network to predict the context vectors associated with each target word. In this way, words used in similar contexts will have similar output vectors. By taking the output of the hidden layer, before it is converted into the predicted context vectors at the output layer, we can represent each word as a dense vector of real numbers.

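In practice you rarely build this network by hand; a library like Gensim trains Skip-gram embeddings directly from tokenized sentences. A minimal sketch, assuming Gensim 4.x parameter names and a toy corpus:

```python
from gensim.models import Word2Vec

# A toy corpus: each document is a list of tokens.
corpus = [
    "the ferocious dog's nasty sharp bite".split(),
    "you shall know a word by the company it keeps".split(),
]

# sg=1 selects the Skip-gram architecture; window is the context size n.
model = Word2Vec(corpus, vector_size=100, window=2, sg=1, min_count=1)

# The dense embedding for a word is read off the hidden layer's weights.
print(model.wv["nasty"].shape)  # (100,)
```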
Through this training process, Word2Vec captures semantic as well as syntactic shifts in language. For example, the transformation from the vector representing the word "King" (denoted by [King]) to [Queen] is roughly the same as the transformation from [Man] to [Woman]. Therefore we can represent the analogy "Man is to Woman as King is to Queen" as [Man]-[Woman] = [King]-[Queen]. And if we didn't already know that Queen is the final component of the analogy, we could solve for it using the equation [Queen] = [King]-[Man]+[Woman].

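With a trained model, that arithmetic is just a nearest-neighbour lookup. In Gensim it looks roughly like the snippet below; the pretrained GoogleNews vectors stand in for a model trained on a large enough corpus (and are a sizeable download):

```python
import gensim.downloader as api

# Pretrained word vectors substitute for training on a large corpus ourselves.
wv = api.load("word2vec-google-news-300")

# Solve [Queen] ~ [King] - [Man] + [Woman] via nearest-neighbour search.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Expected top hit for a well-trained model: "queen".
```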
Anything2Vec
The Skip-gram model has been well explored when applied to words, as seen through the popularity of Word2Vec, but its utility doesn't stop at linguistic analogies. Here we'll show how Word2Vec generalizes to any situation where there's a logical target-context relation.
Subreddit Embeddings
Just as you can "know a word by the company it keeps," the same logic applies to Reddit and its variety of online communities, called subreddits. The less pithy analog in this case is that we can know a subreddit by the commenters it keeps. For the Skip-gram model, each subreddit represents a "word" and that subreddit's commenters act as the "context." So, as with Word2Vec, subreddits with similar commenters will have similar output vectors.

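One way this could be set up in code is to treat each commenter's history as a "sentence" of subreddit names and hand it to the same Skip-gram trainer. The data layout and hyperparameters below are illustrative assumptions, not the exact pipeline from the paper:

```python
from gensim.models import Word2Vec

# Assumed input: for each user, the list of subreddits they commented in.
user_histories = [
    ["hiphopheads", "popheads", "music"],
    ["hiphopheads", "nba", "torontoraptors"],
    ["nba", "boston", "bostonceltics"],
    ["nba", "chicago", "chicagobulls"],
]

# Treat each user's history as one "sentence": a large window means every
# subreddit a user comments in counts as context for every other one,
# so subreddits with similar commenters end up with similar vectors.
sub_model = Word2Vec(user_histories, vector_size=150, window=50, sg=1, min_count=1)

print(sub_model.wv.most_similar("hiphopheads", topn=3))
```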
While the output vectors are embedded in a high-dimensional vector space (often 150+ dimensions), and thus can't be visualized directly, principal component analysis can return a 3-dimensional approximation. Below is a visualization of such an approximation for all 10,000 subreddits. In this plot we've highlighted the hip-hop-oriented subreddit, /r/hiphopheads, and its 100 closest vectors. As we can see, the closest subreddits by cosine similarity are also hip-hop themed.

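The 3-dimensional approximation can be produced by running PCA over the full embedding matrix; a sketch with scikit-learn, reusing the hypothetical sub_model from above:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stack all subreddit vectors into one (n_subreddits, 150) matrix.
subreddits = list(sub_model.wv.index_to_key)
X = np.array([sub_model.wv[s] for s in subreddits])

# Project the high-dimensional embedding down to 3 components for plotting.
coords_3d = PCA(n_components=3).fit_transform(X)
print(coords_3d.shape)  # (n_subreddits, 3)
```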
Subreddit Analogies
With Word2Vec, the resulting embeddings can preserve relationships between words. This allows simple vector addition and subtraction to answer analogy problems. For example, to answer the analogy Berlin is to Germany as Ottawa is to x, we calculate [x] = [Germany]-[Berlin]+[Ottawa] and choose the closest vector to [x], which would be [Canada]. This property holds for our subreddit embedding as well. When posing the analogy /r/boston is to /r/chicago as /r/bostonceltics is to x, the closest vector to [/r/bostonceltics]-[/r/boston]+[/r/chicago] is the subreddit dedicated to the Chicago Bulls.

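The same nearest-neighbour arithmetic answers the subreddit analogy. A sketch, again using the hypothetical sub_model and assuming these subreddit names are in its vocabulary:

```python
# /r/boston is to /r/chicago as /r/bostonceltics is to x,
# so x ~ [bostonceltics] - [boston] + [chicago].
answer = sub_model.wv.most_similar(
    positive=["bostonceltics", "chicago"],
    negative=["boston"],
    topn=1,
)
print(answer)  # on the full 10,000-subreddit model, the top hit is the Bulls' subreddit
```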
On a testing set of ~1,500 similar analogy problems (city to sports team, university to university town, state to state capital), our embedding attained 81% accuracy.
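Scoring such a test set is just a loop over analogy problems. A sketch of how that accuracy could be computed; the helper and problem format here are hypothetical, and the single example reuses the toy sub_model from above:

```python
def analogy_accuracy(wv, problems):
    """problems: (a, b, c, expected) tuples meaning 'a is to b as c is to expected'."""
    correct = 0
    for a, b, c, expected in problems:
        guess = wv.most_similar(positive=[b, c], negative=[a], topn=1)[0][0]
        correct += guess == expected
    return correct / len(problems)

# Hypothetical single-problem example with the toy subreddit model:
print(analogy_accuracy(
    sub_model.wv,
    [("boston", "bostonceltics", "chicago", "chicagobulls")],
))
```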
When and When Not?
The core intuition behind Word2Vec, and its generalization, is that you can represent words, subreddits, Twitter users, etc. by the company they keep. Words used in similar contexts are likely similar; the same holds for subreddits with similar commenters and Twitter users with similar followers. However, if there isn't enough data, the embedding isn't likely to pick up on the different dimensions in which the entities can be similar or different. Any user on Reddit likely comments on a variety of subreddits, not all of which are related. Yet, from a macro point of view, over millions of comments, very nuanced relations begin to emerge.
By starting with a bare-bones notion of what an embedding can be, and then seeing how more nuanced embeddings help with NLP problems, this article showed how embedding techniques can derive interesting results when applied to arbitrary objects, like subreddits. If you have thoughts on how you'd like to see this work used, feel free to let me know below!