Unreasonable

Machine Learning Paves the Path to a Readable Internet

The internet is such an integral part of our lives that the UN now considers it a human rights violation for governments to intentionally obstruct their citizens’ access to it.

In July 2016, the UN passed a resolution that affirmed what many of us already knew: it is critical that we increase access to the internet, as it “facilitates vast opportunities for affordable and inclusive education globally”.

Advocates of the internet believe that the web can significantly improve standards of living, with health and education services increasingly moving online. So far, however, governments have fallen short of delivering the internet to the masses.

In 2017, 3.6 billion people were online, which means around 53 percent of the world’s population remained without access to the internet. However, access alone is not the answer to unlocking universal education opportunities.

We need to start distinguishing between technical access and users’ ability to read and understand what is presented online. For many people, the amount of information can be overwhelming, and finding specific content within pages and pages of search results can be daunting.

Additionally, a large proportion of the available information is actually too hard to understand for almost half of the global population.

It is thought that as much as 70 percent of published written content is not understood by the majority of readers. At Wizenoze, we have data showing that more than 41 percent of users leave websites because the text is too difficult to read. This mismatch between the reading level of a visitor and the reading level of the text is known as the readability gap.

The readability gap is at its most acute for students, but it is not only students who are affected: many teachers, too, struggle with texts at the highest reading levels.

No Access, No Education

The Better Internet for Kids (BIK) Map was created to support EU Member States in implementing the recommendations of the European Strategy for a Better Internet for Children, set out by the European Commission in May 2012. A recent study highlighted that many EU Member States’ national policies cover all themes and pillars of the BIK strategy to some extent. The area of “positive content for children”, however, is covered far less well: ten EU countries report no national policy at all on quality online content for children.

As Anthony Lake, Executive Director of UNICEF said: “The Internet was designed for adults, but it is increasingly used by children and young people, and digital technology increasingly affects their lives and futures.”

The current online environment for children and students focuses on blocking inappropriate information through filters. Filtering is a good way to protect users from harmful information, but it does not help users gain access to information they can understand. The information that gets through the filters is very often irrelevant or commercially driven. More importantly, most of the information online is still too difficult to read and understand.

There is a need for better readability tools to help content writers match their audience’s reading levels.

At Wizenoze we have developed a state-of-the-art algorithm to accurately predict the reading level of a text on a five-level scale.

Measuring the readability of a text has a long history, mainly in the educational domain. The standard approach is to use hand-crafted readability formulas. For instance, the popular online tool Readability-Score uses formulas of this kind, such as the Flesch-Kincaid grade level score, to perform its readability assessment. The Flesch-Kincaid formula looks like this:

grade level = 0.39 × (total words / total sentences) + 11.8 × (total syllables / total words) − 15.59

At its core, the score is a combination of the average number of words in a sentence and the average number of syllables in a word.

Essentially all readability formulas look like this. They combine a few superficial textual properties such as the number of words and the number of sentences, in a relatively simple mathematical formula. This formula is then manually tuned using a small set of example documents on different reading levels.
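As an illustration, a formula of this kind takes only a few lines of Python. This is a generic sketch, not Wizenoze’s code; in particular, the syllable counter here is a crude vowel-group heuristic rather than a proper dictionary lookup.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels (at least one).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    # Split into sentences on terminal punctuation, and into words on letters.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # The Flesch-Kincaid grade level formula.
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

A very short sentence of one-syllable words, such as “The cat sat on the mat.”, scores below grade 0 — exactly the kind of superficial result the formula is tuned to produce.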

At Wizenoze, we think that such a simple model can never capture all the complexities involved in determining the readability of a text. Accurate readability analysis requires both more linguistic knowledge and a more complex model.

The Power of Machine Learning

Hand-crafting complex formulas (or models) is extremely difficult, especially in domains where it is hard to make human knowledge explicit. Readability is one such domain. That is why machine learning is so useful for finding good predictive models.

Learning predictive models with machine learning requires training data, which in our case means documents for which we already know the reading level. Wizenoze collected over 100,000 documents from all kinds of sources and reading levels: schoolbooks, news articles, web texts, et cetera. The labeling of these documents is very heterogeneous: sometimes very precise, sometimes very coarse. We therefore mapped all these different labels onto our own five-point reading-level scale.
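For illustration, such a label mapping might look like the sketch below, where hypothetical US grade labels and coarse “easy”/“hard” tags are collapsed onto a five-point scale. The label names and band boundaries are invented for this example, not Wizenoze’s actual mapping.

```python
# Hypothetical heterogeneous labels mapped onto a five-point reading-level scale.
# US grades 0-12 are grouped into bands of three; coarse tags map directly.
GRADE_TO_LEVEL = {grade: min(5, 1 + grade // 3) for grade in range(13)}
COARSE_TO_LEVEL = {"easy": 1, "intermediate": 3, "hard": 5}

def to_five_point(label):
    """Normalize a source label to a level from 1 (easiest) to 5 (hardest)."""
    if isinstance(label, int):
        return GRADE_TO_LEVEL[label]
    return COARSE_TO_LEVEL[label.lower()]
```

With a normalization step like this, precisely graded schoolbook text and coarsely tagged web text can sit in the same training set.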

For each document in our training data, Wizenoze extracted a large number of features using our Natural Language Processing pipeline, which we describe below. These features include the traditional readability formulas, like the Flesch-Kincaid score, but also many others. Our machine learning algorithms learn how each feature contributes to making an accurate readability prediction for a new document.
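To make the learning step concrete, here is a toy version: it “trains” a nearest-centroid model on invented two-feature vectors (say, a Flesch-Kincaid score and an average word length) and predicts the level of a new document. Both the algorithm and the numbers are purely illustrative; Wizenoze’s actual pipeline uses many more features and a more sophisticated learner.

```python
import math
from collections import defaultdict

# Invented training data: ([feature_1, feature_2], reading level).
train = [
    ([2.1, 4.0], 1), ([2.5, 4.2], 1),
    ([7.8, 5.1], 3), ([8.2, 5.3], 3),
    ([13.9, 6.2], 5), ([14.4, 6.5], 5),
]

def fit_centroids(data):
    # "Training": average the feature vectors of each reading level.
    groups = defaultdict(list)
    for features, level in data:
        groups[level].append(features)
    return {level: [sum(col) / len(col) for col in zip(*vecs)]
            for level, vecs in groups.items()}

def predict(centroids, features):
    # Predict the level whose centroid is closest to the new feature vector.
    return min(centroids, key=lambda lvl: math.dist(features, centroids[lvl]))

centroids = fit_centroids(train)
```

Even this toy learner shows the key difference from hand-crafted formulas: how the features combine into a prediction is derived from labeled data, not tuned by hand.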

The Power of Natural Language Processing

From a linguistic perspective, the readability of a text is determined by much more than a few superficial textual features. For example, does the reader know most of the words? Does the text contain complex grammatical structures? Are there enough connectives to explain the flow of the text? Is the text about a lot of different concepts?

We use modern Natural Language Processing (NLP) techniques to automatically extract a rich set of linguistic features that directly and indirectly relate to readability. As an example, consider the following sentence (taken and adapted from a New York Times article):

Months after Britain voted to leave the European Union, the first tangible victim of that decision is identified: “Marmite, a sludgy and odd-tasting breakfast spread.”

To compute the Flesch-Kincaid grade level score for this sentence, we only need the following information: the text has one sentence, the text has 30 words, and the text has 43 syllables.

The Wizenoze machine learning-based model, on the other hand, works with a much richer internal representation of the text.

Among many other things, we identify:
• grammatical structure (a passive construction is used),
• named entities (Britain is a location, Marmite is a product),
• and the part of speech for each word (months is a noun, voted a verb).

This means that NLP allows us to use far more linguistic knowledge in our readability analysis than traditional readability formulas.
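In a real pipeline, features like these come from a full NLP toolkit (a parser, a named-entity recognizer, a part-of-speech tagger). Purely as an illustration, here is a crude standard-library heuristic for just one of them, passive constructions: an auxiliary verb followed by a word ending in -ed. It misses irregular participles such as “written” and produces some false positives, which is exactly why proper NLP tools are needed.

```python
import re

# Forms of "to be" that can act as the auxiliary in a passive construction.
AUX = r"(?:is|are|was|were|be|been|being)"

def has_passive(sentence: str) -> bool:
    """Crude passive-voice check: an auxiliary immediately followed by a
    word ending in -ed (misses irregular participles like 'written')."""
    return re.search(rf"\b{AUX}\s+\w+ed\b", sentence.lower()) is not None
```

On the example above, the heuristic correctly flags “is identified” as a passive construction.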

Using both natural language processing and machine learning, we created a model for readability prediction that is far more accurate and insightful than standard readability formulas.

A Readable Web

In response to the global need for online content that users can find at their own reading level, Wizenoze has used its readability technology to create a readable web.

The Web for Classrooms is the largest curated safe-for-school collection of online content for students available in the world today. The Web for Classrooms provides rapid access for students and teachers to over 6 million pages of curriculum-supportive online material, curated by teachers and all searchable by reading level.

Early evaluation studies of the impact of the Web for Classrooms have shown that 91 percent of students progressed further towards their desired learning outcome than peers using alternative search engines. If these results hold, the Web for Classrooms should become the default search engine for education – in class, in libraries, and at home.

Try our technology for yourself at wizescan.com, or access the Web for Classrooms library for free.

Illustration by Brian Stauffer.