Donald Hayes - Lexical Analysis

This website describes D. P. Hayes’ continuing program of research on natural texts, which began in 1980. The focus has been on the relative ‘accessibility’ of any English-language text, as measured by the LEX statistic. Underlying LEX is a model of patterns of word choice: software (QANALYSIS) determines how far an edited text departs from the model’s lognormal statistical distribution. Several experimental and comparative validation studies are described. Samples representing the full spectrum of natural texts, from pre-primers and talk with animals to technical scientific reports, are contained in Cornell Corpus 2000 (N = 5000+ texts). Newspapers have remained close to LEX = 0.0 since 1665, making them a familiar reference level for interpreting any text’s accessibility. LEX meets the standards required of natural science measures: validity, reliability, robustness, stability over centuries, and precision. Empirically, natural texts vary from LEX -85 to +58. A number of substantive studies are briefly described, showing the range of LEX’s applications, both applied and theoretical.

READ ME FIRST - Understanding lexical accessibility
LEXGUIDE 2003 (PDF format) - Critical details of how to do lexical analysis of texts
Cornell Corpus 2000:
- View summary of 101 samples
- Download summary of 101 samples for Excel
- View the statistical summary of the entire Cornell Corpus
- Download the statistical summary table for Excel
- Browse the Cornell Corpus
- Download the entire Cornell Corpus in a single 18MB .tar.bz2 file.
  Decompress with the following commands in a DOS/command window:
  - bunzip2 CornellCorpus2000.tar.bz2
    bunzip2 will remove the .bz2 part of the file name, and the resulting file will be about 76 megabytes.
  - tar -xmf CornellCorpus2000.tar
    tar will unpack the corpus into a subdirectory full of files, consuming another 76 megabytes of space. You may wish to delete the CornellCorpus2000.tar file afterward to save space.
  If these commands are not recognized (and they will not be on most Windows computers), first download the two command programs named above (bunzip2 and tar) and save them to your Windows directory.
qanalysis lexical analysis software
replace-list - word replacement list (also provided with software)
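
For readers who have Python installed but not bunzip2 or tar, the two decompression steps given for the corpus archive above can probably be done in one pass with Python's standard tarfile module. This is only a sketch and is not part of the original instructions: the function name extract_corpus is hypothetical, and the archive is assumed to be saved under the file name used in the commands above.

```python
import tarfile
from pathlib import Path


def extract_corpus(archive, dest="."):
    """Unpack a .tar.bz2 archive into dest and return the member names.

    Hypothetical helper: it combines the bunzip2 and tar steps, so no
    intermediate .tar file is written to disk.
    """
    dest_path = Path(dest)
    dest_path.mkdir(parents=True, exist_ok=True)
    # "r:bz2" reads the tar stream through bzip2 decompression.
    with tarfile.open(archive, mode="r:bz2") as tar:
        names = tar.getnames()
        # Newer Python versions offer a safer "data" extraction filter;
        # use it when available and fall back to the default otherwise.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(path=dest_path, **kwargs)
    return names
```

Calling extract_corpus("CornellCorpus2000.tar.bz2") from the directory holding the download should unpack the corpus into the current directory. Because the intermediate .tar file is never written, the extra disk space it would occupy is not needed.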

This page was last updated Jun 26, 2006.