A Language Model Quantifies "Diversity"

Several political creeds over the past few decades have come to support the idea that diversity is valuable and desirable and that diverse societies may improve communication between people of different backgrounds and lifestyles, leading to greater understanding and peaceful coexistence. The usage of the term “diversity” has gained prominence as a result, at least in the Anglosphere. Figure 1 illustrates the growing popularity of the term diversity by displaying its frequency of usage over time in a large corpus of English books.

Figure 1: Usage frequency of the word “diversity” over time in a large corpus of English books (image generated with the Google Books Ngram Viewer).

Academic institutions have willingly embraced the concept of diversity and have put in place procedures to foster diverse faculty, administrative and student bodies by supporting the recruitment of individuals from historically excluded populations. The pro-diversity efforts have been justified in terms of how diverse academic communities provide educational benefit to students and on grounds of achieving social justice.

Given the well-documented benefits of viewpoint diversity (Duarte et al. 2014; Shi et al. 2019), especially for exploratory enterprises such as education and research, the degree of universities’ commitment to viewpoint diversity becomes a metric of paramount importance. Yet few scientific studies have examined universities’ attitudes towards viewpoint diversity.

This work analyses how 50 elite universities in the United States use the terms “diversity” and “diverse” in their online institutional domains. In particular, it quantifies the extent to which universities emphasize the demographic denotation of diversity over its intellectual denotation. The data analysed is a large corpus of text gathered from the institutional web domains of the studied universities. An automatic web crawler (spider) scraped the text (16 GB) by following links within each university domain and collecting all detected textual content except structural markup, code and CSS styling elements.
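The text-collection step can be illustrated with a minimal sketch. It is not the crawler used in the study: it assumes only that "textual content" means the text nodes of a page once the contents of script and style elements are dropped, and it uses Python's standard-library HTML parser. The names VisibleTextExtractor and extract_visible_text are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_visible_text(html: str) -> str:
    """Return the text of an HTML page with code and styling removed."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page, invented for illustration only.
sample_page = (
    "<html><head><style>p {color: red;}</style>"
    "<script>var x = 1;</script></head>"
    "<body><p>Diversity statement</p><a href='/about'>About us</a></body></html>"
)
page_text = extract_visible_text(sample_page)
```

A full crawler would also collect the href attributes of anchor tags and queue only those that stay within the same university domain; that part is omitted here.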

Method

In order to study the usage of the diversity concept by universities, distributional semantics theory is used, which postulates that linguistic items with similar distributions tend to have similar meanings (Firth 1957). That is, the meaning of a word can be approximated by the set of contexts in which it occurs.

Recent advances in machine learning, such as word embeddings for natural language processing (NLP), have given credence to the distributional hypothesis (Mikolov, Yih & Zweig 2013). A word embedding model derives from a large corpus of text a mapping of words to dense vector representations in a continuous high-dimensional space (see Figure 2). These vectors capture complex semantic and syntactic relations between words by leveraging the co-occurrence statistics of words and contexts in the corpus on which the model was trained.

A particular type of word embedding that has become very popular in the machine learning literature is the word2vec set of techniques (Mikolov et al. 2013). Word2vec uses a shallow neural network to learn a distributed representation of words based on the textual contexts in which they occur within a text corpus, thus leveraging the distributional hypothesis. After training word2vec on a text corpus, words that are used in similar contexts will end up with similar numerical vector representations. One of the most impressive capabilities of word2vec is its ability to draw together words that are used synonymously in similar contexts even if they never appear together in the training corpus. This feature is a key component of the ability of word2vec to generalize.
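The following sketch illustrates the distributional idea on a toy scale. It deliberately swaps in a simpler technique than word2vec: each word is represented by raw counts of the words that appear within a small window around it, and no neural network is trained. The corpus, the window size and the function names are assumptions made for illustration. Even so, "doctor" and "physician" end up with similar vectors although they never occur in the same sentence.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus, invented for illustration only.
toy_corpus = [
    "the doctor treated the patient",
    "the physician treated the patient",
    "the doctor examined the patient",
    "the physician examined the patient",
    "the student read the book",
]

vectors = cooccurrence_vectors(toy_corpus)
sim_synonyms = cosine(vectors["doctor"], vectors["physician"])
sim_unrelated = cosine(vectors["doctor"], vectors["book"])
```

In this toy corpus the two near-synonyms share all of their contexts, so their similarity is higher than that of "doctor" and "book". Word2vec learns dense vectors rather than raw counts, but it relies on the same kind of contextual evidence.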

Figure 2 illustrates the mapping of words to vector representations carried out by word2vec. A key property of word embedding models is the clustering of terms with similar semantic roles (see top right of Figure 2). Another is the existence of structure in vector space, such as regular offsets between pairs of words with a particular semantic association, that maps to culturally meaningful relationships such as gender (see bottom right of Figure 2). Word2vec brings words used in similar contexts, and thus semantically related according to the distributional hypothesis, to adjacent regions of the vector space. The context in which a word is used in a corpus can therefore serve as a reliable proxy for the semantic denotation with which the word is used in that corpus. Connotations of a word can be estimated by calculating the cosine similarity in vector space between the word vector of interest and the vector of any reference term.
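For reference, the cosine similarity between two word vectors is conventionally defined as below. The symbols u, v and n (the embedding dimension) are introduced here for illustration and are not notation taken from the paper.

```latex
\[
\operatorname{cos\_sim}(\mathbf{u}, \mathbf{v})
  = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}
  = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^{2}} \; \sqrt{\sum_{i=1}^{n} v_i^{2}}}
\]
```

The value ranges from -1 to 1. It depends only on the angle between the two vectors, not on their lengths, so a value near 1 means the vectors point in nearly the same direction.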

Figure 2: Illustration of words mapped to 7-dimensional numerical vectors, also known as word embeddings. The dimensions of the word embeddings codify semantic and syntactic features of the words. For visualization purposes, high-dimensional spaces can be mapped to low-dimensional spaces using techniques such as t-SNE that preserve the geometrical structure of the original space. Word embeddings possess the distinctive property that words with close semantic meanings are mapped to adjacent regions of the vector space (top right). The vector space also captures meaningful syntactic and semantic regularities: certain directions codify semantic relationships between words, such as gender, as shown by the dotted lines in the bottom right of the figure.

In order to evaluate the thematic space in which the concept of diversity is used in the corpus of text gathered from universities’ websites, three independent human raters were asked to identify words that capture the two overarching categories of diversity studied: demographic and intellectual diversity. Subsequently, the author collated the answers and sorted the gathered terms into diversity subtypes within each diversity construct. Each diversity construct is thus defined by its representative bag of words, as illustrated in Figure 3, where the red subtypes of diversity represent the demographic diversity construct and the blue subtypes represent the intellectual diversity construct.

Figure 3: The demographic diversity construct and the intellectual diversity construct are composed of several subtypes of diversity. Each of the two diversity constructs is operationalized with a bag of words. The demographic diversity construct is shown in red (rows above the dotted line). The intellectual diversity construct is shown in blue (rows below the dotted line).

In order to quantify the usage of the diversity concept in the studied corpus, the terms “diverse” and “diversity” are combined into a minimalistic bag of words that is used to operationalize and detect the presence of the diversity concept in the text. From now on, the pair [‘diversity’, ‘diverse’] will be referred to as the diversity terms.
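A comparison between the diversity terms and a construct could be sketched as follows. This summary does not state how a bag of words is reduced to a single vector, so the sketch assumes the simplest option: the mean of the member word vectors. The three-dimensional vectors are invented placeholders, not embeddings or results from the study, and the function names are hypothetical.

```python
from math import sqrt


def mean_vector(words, embeddings):
    """Average the vectors of the words that are present in `embeddings`."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    if not vecs:
        raise ValueError("none of the words has an embedding")
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Placeholder embeddings: made-up numbers used only so the code runs.
toy_embeddings = {
    "diversity": [0.9, 0.3, 0.1],
    "diverse": [0.8, 0.4, 0.1],
    "race": [0.9, 0.2, 0.0],
    "gender": [0.8, 0.3, 0.1],
    "ideas": [0.2, 0.9, 0.3],
    "opinions": [0.1, 0.8, 0.4],
}

diversity_terms = ["diversity", "diverse"]
demographic_construct = ["race", "gender"]      # small subset, for illustration
intellectual_construct = ["ideas", "opinions"]  # small subset, for illustration

target = mean_vector(diversity_terms, toy_embeddings)
demographic_similarity = cosine_similarity(
    target, mean_vector(demographic_construct, toy_embeddings)
)
intellectual_similarity = cosine_similarity(
    target, mean_vector(intellectual_construct, toy_embeddings)
)
```

With real embeddings trained on each university's corpus, the same two numbers would be computed per university and compared. Averaging pairwise similarities instead of averaging vectors is another plausible aggregation.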

Results and Discussion

A comparison of cosine similarity from the diversity terms to the demographic and intellectual diversity constructs revealed that in most universities in the sample, the word embeddings for the diversity terms were closer in vector space to the demographic diversity construct than to the intellectual diversity construct (Figure 4).

Figure 4: Cosine similarity between the diversity terms (“diversity” and “diverse”) and the demographic (red squares) and intellectual (blue circles) diversity constructs for 50 elite U.S. universities.

On the combined text corpus, formed by concatenating all the universities’ corpora, the cosine similarity between the diversity terms and the different diversity subtypes produced the results displayed in Figure 5. Each diversity subtype was operationalized using the corresponding bag of words shown in Figure 3.

Figure 5: Cosine similarity between the diversity terms (“diversity” and “diverse”) and several diversity subtypes. The subtypes of diversity making up the demographic diversity construct are displayed in red. The subtypes of diversity making up the intellectual diversity construct are displayed in blue.

The way the demographic and intellectual diversity constructs have been operationalized in this work is open to question. This work has probably not covered all possible connotations and subtypes of the diversity concept. The bag of words used to describe each construct could also be questioned, since it is difficult to find hard boundaries in semantic space. Nonetheless, Figure 5 shows a clear tendency of universities to use the term diversity in its demographic denotation, clustering around external appearance cues. Even if some relevant terms have been accidentally omitted from Figure 3, the nature of word embedding models ensures that terms used in similar contexts have similar vector representations, which are adjacent in vector space. Cosine similarity measures the proximity of word embeddings in vector space. Therefore, including or excluding related terms in the bags of words used to operationalize the constructs would not significantly change the cosine similarity metric used to quantify the usage of the term diversity.

To sum up, this work has uncovered a pattern of elite universities in the U.S. overwhelmingly identifying the concept of diversity with demographic subtypes of diversity, such as race, ethnicity or gender, over intellectual denotations of diversity, such as opinions, principles or ideas, in their online institutional presence. Thus, it appears that intellectual types of diversity, those that relate to a variety of mental phenomena such as viewpoints, thoughts, values or beliefs, take a diminished role in configuring the diversity construct as universities interpret and use the term.

It is also worth noting that, despite universities’ tendency to associate the concept of diversity mostly with demographic denotations of the word, some subtypes of demographic diversity, such as disability status or socioeconomic background, receive a conspicuously low degree of attention from universities (see Figure 5), as judged by how they use the term diversity in their online domains.

The key question is why the Academy sidelines viewpoint diversity in favor of demographic diversity. The themes and features contained in the institutional online profile of universities probably reflect the viewpoints of those doing the writing. Consequently, language usage can serve as a proxy for power dynamics created by the current composition of the Academy. Numerous studies have shown that universities across the U.S. have a severe lack of ideological diversity among their faculty, with the vast majority of professors leaning left-of-center (e.g. Cardiff & Klein 2005; Langbert, Quain & Klein 2016). It has been shown before that political orientation is a strong contributor to viewpoint variance (Emler, Renwick & Malone 1983).

Thus, in its current state, the Academy lacks viewpoint diversity. As a result, a homogeneous liberal academy could conceivably form a power structure unlikely to desire opening its circle to intellectual out-groups. Intellectual homogeneity has several negative consequences for the educational and research mission of the university. Therefore, it would be best for universities to stretch their understanding of the term diversity to encompass intellectual denotations of the word with the same vigor with which they already direct their diversity efforts to the demographic connotations of the word.

Full Paper:

David Rozado (2019). “Using Word Embeddings to Analyze how Universities Conceptualize ‘Diversity’ in their Online Institutional Presence.” Society. DOI: 10.1007/s12115-019-00362-9
