Lexical Demand Levels of Schoolbooks:  A Corpus

Text samples and statistics from school textbooks

Donald P. Hayes
Department of Sociology
Cornell University


Please note:  Donald P. Hayes passed away in October of 2006.  Please send queries concerning this corpus to Bruce Hayes at bhayes@humnet.ucla.edu.


This corpus contains nearly 2000 schoolbooks: primers and basal readers used in grades 1 through 8 by schools in the United States, Canada, France and Sweden. The most comprehensive analyses cover the 1930s to the late 1990s. The corpus contains the sample texts from each reader and statistics on the readers. The analyses used the QLEX 6 software package. No single schoolbook is reported on its own: after each text is edited to an international transcription standard, it is combined with other texts from the same grade, in the same decade, in the same nation, to yield a single estimate of their demand level. Graphs showing the major comparisons are given by nation, decade and grade.

Why was this corpus developed? The primary objective was to measure the relative level of demand that these schoolbooks made on the students' knowledge base (lexicon, concepts, domain, etc.) in these nations, in different decades of the twentieth century. There are grounds for suspecting that the demand levels of schoolbooks have been lowered. One policy implication is that verbal attainment may be raised by raising textbook demand levels in all grades.

For more information, read the guide to the corpus, entitled A Spectrum of Natural Texts.


Directory

Data

Papers by Donald P. Hayes and colleagues on lexical demand

The QLEX 6 analysis package (free):  download

Other analyzed text files (all genres):  the 2000 Cornell Corpus

This complete web site as a single .zip file:  download (18.6 MB)


Data

There are four tables below, one for each country.  The tables are organized in rows by year of publication and in columns by grade level.  In each cell, the "Text" link will bring up sample text passages (format:  ASCII text) for that particular country, grade level, and era of publication.  The "Statistics" link brings up a text file produced by the QLEX 6 software package, containing an extensive statistical analysis of the texts.  Lastly, the "Graph" link brings up a graph (in PNG format).

Here is what the graphs show.  Horizontal axis:  integer values, arranged on a log scale, corresponding to the frequency rank of English/Swedish words, as determined from a large corpus of newspaper text.  Common words are on the left, rare words on the right.  Vertical axis:  cumulative proportion of the total text constituted by words up to the indicated frequency rank.  At rank 25, the total is restarted at zero.  The rationale for this is that the first 24 words are (in both English and Swedish) dominated by grammatical words such as "the" or "and", which are uninformative concerning lexical difficulty.

The graph gives two series.  The series shown in red represents a large corpus of newspaper texts.  The series shown in green represents the text under analysis.  In general, the more difficult the text, the higher on the graph the green line will be.
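
The vertical-axis computation described above can be sketched in a few lines of Python. This is only an illustration, not the QLEX 6 implementation: the function name cumulative_curve, the rank_of lookup table, and the choice to divide by the count of all tokens in the text (including words absent from the reference list) are assumptions made for the example.

```python
from collections import Counter

def cumulative_curve(tokens, rank_of, restart_rank=25):
    """Return (rank, cumulative proportion) points for one text.

    tokens       -- list of word tokens from the text under analysis
    rank_of      -- dict mapping a word to its frequency rank in a
                    reference corpus (hypothetical lookup table)
    restart_rank -- rank at which the running total restarts at zero
    """
    total = len(tokens)
    if total == 0:
        return []

    # Count how many tokens fall at each reference rank.
    # Words missing from the reference list are skipped here but
    # still count toward the denominator (an assumption).
    counts = Counter(rank_of[t] for t in tokens if t in rank_of)

    curve = []
    running = 0
    restarted = False
    for rank in sorted(counts):
        # Restart the running total once, on reaching restart_rank,
        # so the commonest grammatical words do not dominate the curve.
        if rank >= restart_rank and not restarted:
            running = 0
            restarted = True
        running += counts[rank]
        curve.append((rank, running / total))
    return curve

# Toy data, invented for illustration only.
example_ranks = {"the": 1, "and": 2, "dog": 30, "ran": 40}
example_tokens = ["the", "dog", "and", "the", "ran", "zebra", "dog", "the"]
for rank, proportion in cumulative_curve(example_tokens, example_ranks):
    print(rank, proportion)
```

Plotting the resulting points with the rank on a logarithmic horizontal axis, once for the reference newspaper corpus and once for a schoolbook sample, would give two series of the kind the graphs display.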

United States

Summary file

Columns: Primer, 1st, 2nd, 3rd, 4th, 5th, 6th, 7th, 8th, Series

McGuffey readers 1896: Text, Statistics; Text, Statistics; Text, Statistics; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics; Text, Statistics
1920-45: Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Animated
1946-59: Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Animated
1960s: Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Animated
1970s: Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Animated
1980s: Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Animated
1990s: Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Animated
Time series: Animated; Animated; Animated; Animated; Animated; Animated; Animated; Animated; Animated

Canada

Summary file

Columns: 1st, 2nd, 3rd, 4th, 5th, 6th, 7th, 8th, Series

1930s: Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Animated
1950s: Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Animated
1960s: Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Animated
1970s: Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Animated
1980s: Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Animated
1990s: Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Animated
Time series: Animated; Animated; Animated; Animated; Animated; Animated; Animated; Animated

France

Summary file

Columns: 1st, 2nd, 3rd, 4th, 5th, 6th, 7th, 8th, Series

1930s: Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Animated
1940s: Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Animated
1950s: Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Animated
1960s: Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Animated
1970s: Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Animated
1980s: Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Animated
1990s: Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Animated
2000s: Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Animated
Time series: Animated; Animated; Animated; Animated; Animated; Animated; Animated; Animated

Sweden

Summary file

Columns: Grades 1-3, Grades 4-6, Grades 7-9, Series

1930s: Text, Statistics, Graph; Text, Statistics, Graph; Animated
1940s: Text, Statistics, Graph; Text, Statistics, Graph; Animated
1950s: Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Animated
1960s: Text, Statistics, Graph; Text, Statistics, Graph; Animated
1970s: Text, Statistics, Graph; Text, Statistics, Graph; Text, Statistics, Graph; Animated
1980s: Text, Statistics, Graph; Text, Statistics
1990s: Text, Statistics, Graph
Time series: Animated; Animated; Animated

