What 3.5 million books can reveal about the way we think of men and women

When Bill Gates says that if he were to start a business today, it would be a startup that lets machines read books through machine learning, you know it’s just a matter of time before someone makes it happen. Get to know the machine learning technology that was able to extract interesting patterns about men and women by reading 3.5 million books. Read on for this study on pattern recognition and machine learning.

Pattern recognition and machine learning

Machine learning has had my attention for a while now. One of my friends works in finance and was there at the very start of Regminer, a technology tool that helps financial institutions keep track of the ever-expanding universe of laws and regulations.

Machine learning can be a great solution to make complex data more accessible, relevant and easy to use. Pattern recognition and machine learning can help financial institutions keep up with the latest laws and regulations, but they can be used in many other fields too.

Men and women in literature

At the 2019 Annual Meeting of the Association for Computational Linguistics, the researchers presented a paper on a machine learning model that analyzed 3.5 million books and found that adjectives ascribed to women tend to describe physical appearance, whereas words that refer to behavior go to men. Additional coauthors of the study are from the University of Maryland, Google Research, Johns Hopkins University, the University of Massachusetts Amherst, and Microsoft Research. The dataset is based on the Google Ngram Corpus.

The researchers trawled through an enormous quantity of books in an effort to find out whether there is a difference between the types of words that describe men and women in literature. Using a new computer model, they analyzed a dataset of 3.5 million books, all published in English between 1900 and 2008. The books include a mix of fiction and non-fiction literature.

The researchers extracted adjectives and verbs associated with gender-specific nouns like “daughter” and “stewardess”, for example in combinations such as “sexy stewardess” or “girls gossiping.” They then analyzed whether the words had a positive, negative, or neutral sentiment, and sorted them into semantic categories such as “behavior,” “body,” “feeling,” and “mind.”
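To make that step a bit more concrete, here is a minimal Python sketch of the counting idea. It is not the researchers’ actual model. The word lists, the category labels, the example pairs and the function names are all made-up assumptions for illustration, and the sentiment step is left out entirely.

```python
from collections import Counter, defaultdict

# Hypothetical toy lexicons (assumptions, not the study's real word lists).
GENDERED_NOUNS = {
    "daughter": "female", "stewardess": "female", "girl": "female", "woman": "female",
    "son": "male", "steward": "male", "boy": "male", "man": "male",
}
CATEGORIES = {
    "beautiful": "body", "sexy": "body",
    "rational": "mind",
    "courageous": "behavior", "righteous": "behavior",
}


def count_categories(pairs):
    """Count descriptor categories per gender from (descriptor, noun) pairs."""
    counts = defaultdict(Counter)
    for descriptor, noun in pairs:
        gender = GENDERED_NOUNS.get(noun.lower())
        if gender is None:
            continue  # skip nouns that are not gender-specific
        category = CATEGORIES.get(descriptor.lower(), "other")
        counts[gender][category] += 1
    return counts


def share(counts, gender, category):
    """Fraction of a gender's descriptors that fall into one category."""
    total = sum(counts[gender].values())
    return counts[gender][category] / total if total else 0.0


# Invented example pairs; a real pipeline would extract these from parsed text.
pairs = [
    ("sexy", "stewardess"), ("beautiful", "daughter"), ("kind", "woman"),
    ("courageous", "son"), ("rational", "man"), ("beautiful", "boy"),
    ("tall", "tree"),
]
counts = count_categories(pairs)
print("female body share:", share(counts, "female", "body"))
print("male body share:", share(counts, "male", "body"))
```

Comparing the two shares gives a rough “how many times as often” figure for a category. On real data the pairs would come from a parsed corpus rather than a hand-written list.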

Body and appearance

Their analysis demonstrates that negative verbs associated with body and appearance appear five times as often for female figures as male ones. The analysis also demonstrates that positive and neutral adjectives relating to the body and appearance occur approximately twice as often in descriptions of female figures. Women are most frequently described with adjectives like “beautiful” and “sexy”.

On the flip side, male figures are most frequently described using adjectives that refer to their behavior and personal qualities. The most frequently used adjectives for women refer much more to appearance than those used for men. Commonly used descriptors for men include “righteous”, “rational”, and “courageous”.

Computer scientist and assistant professor Isabelle Augenstein points out: “We are clearly able to see that the words used for women refer much more to their appearances than the words used to describe men. Thus, we have been able to confirm a widespread perception, only now at a statistical level.”

In the past, linguists typically looked at the prevalence of gendered language and bias using smaller data sets. Now, computer scientists can deploy machine learning algorithms to analyze vast troves of data, in this case 11 billion words. Although many of the books were published several decades ago, they still play an active role, Augenstein points out. The algorithms used to create machines and applications that can understand human language are fed with data in the form of text material that is available online. This is the technology that allows smartphones to recognize our voices and enables Google to provide keyword suggestions.


“The algorithms work to identify patterns, and whenever one is observed, it is perceived that something is ‘true.’ If any of these patterns refer to biased language, the result will also be biased. The systems adopt, so to speak, the language that we people use, and thus, our gender stereotypes and prejudices,” says Augenstein.

She gives an example of where it may be important: “If the language we use to describe men and women differs in employee recommendations, for example, it will influence who is offered a job when companies use IT systems to sort through job applications.”

The researchers point out that the analysis has its limitations. For example, it does not take into account who wrote the individual passages. Nor does it account for differences in the degree of bias depending on whether the books were published earlier or later within the data set’s timeline. Furthermore, it does not distinguish between genres, e.g. between romance novels and non-fiction. The researchers are currently following up on several of these items.

Grounded conversations

In some ways the data makes the conversation about inequality between men and women more grounded. And I like that. I mean, it can still be challenging to explain certain experiences related to sexism to others. It can be hard to get your point across when the person in front of you has never dealt with those experiences personally. Most of the time it’s difficult to explain something the other person doesn’t recognize.

Because experiences are not completely shared, it’s more likely that this person will not see the problem, or will think it’s just a personal observation rather than a systemic one. This ignorance might not be on purpose or come from bad intent, but it can be a blind spot. Since sexism and misogyny are so deeply rooted in our culture, this is quite imaginable and understandable – but only up to a certain point. Something like mansplaining, for instance, is often a learned behavior. Many men might not even be aware they’re doing it.

When it comes to sexism, I feel like it’s our job as women to tell these experiences and stories over and over again. We need to inform our parents, our husbands, our neighbours and friends about what that widespread perception of men and women looks like. We need both men and women to understand the double standards that still exist in this world, and we need a real conversation to be able to make changes.

Aside from having real conversations, pattern recognition and machine learning might also help us break this system. Now that this machine has analyzed millions of books, it shows insightfully how the world of men and women is structured – only now from a statistical point of view. When statistics come in, the issue automatically seems to become more evident, as if only now it truly exists. I guess the world likes to rely so much on data and empirical knowledge…

Hopefully statistics like these further a more honest conversation too, so that both men and women can speak their own truths and share experiences – even when we do not always recognise ourselves in the stories of others.

Data about women

With more and more machine learning algorithms in our lives, I also foresee that gendered language and stereotypes will impact future algorithms. In the article “We Need to Close the Gender Data Gap By Including Women in Our Algorithms”, Caroline Criado Perez writes about the fact that we haven’t been collecting enough data on women. She points out that it is important to build a world that has been designed for both men and women.

The article, which she wrote for Time Magazine, is very useful. She aptly describes how this gender data gap might manifest in our lives. For example, we still often end up using men as the default in many clinical trials. The data gap and a male-biased curriculum already leave women 50% more likely to be misdiagnosed if they have a heart attack. It might be a grim example, but she mentions many more – also funny ones.

Artificial intelligence and language technology are becoming more prominent across society. Because of that, it becomes more important to be aware of gendered language. Augenstein points out: “We can try to take this into account when developing machine-learning models by either using less biased text or by forcing models to ignore or counteract bias. All three things are possible.”

So the more AI is introduced into different fields, the more we need diverse (female) minds. With a mixture of diverse minds, we can create a level playing field for everyone.


I hope the research on pattern recognition and machine learning by Augenstein will raise awareness of this topic. It’s great that she has been able to confirm a widespread perception, only now at a statistical level. I’m sure there will be more research on this topic to cover the gaps this study leaves open. And I’m excited to hear more about that.

Thanks for reading!

Lisanne Swart