A public dataset of speeches recorded in Hansard. The dataset provides information on each speech of ten words or longer made in the House of Commons between 1980 and 2016, including the speaking MP, their party, gender and age at the time of the speech. It also includes all speeches of ten words or more made from 1936 to 1979. The dataset contains a total of 4,212,134 speeches and 773,585,770 words. It can be accessed through Zenodo, and is distributed under a Creative Commons 4.0 BY-SA licence. It is currently available as a CSV file; if you would like other formats, please get in touch.

Sentiment Classification

The speeches have been classified for sentiment using a total of five libraries from the R packages sentimentr, syuzhet and lexicon. The libraries are:

  1. The AFINN library by Finn Årup Nielsen, labelled afinn. The AFINN library was accessed through the syuzhet package.

  2. The Opinion Mining, Sentiment Analysis and Opinion Spam Detection dataset by Bing Liu, Minqing Hu and Junsheng Cheng, labelled bing. The Bing library was accessed through the syuzhet package.

  3. The NRC Word-Emotion Association Lexicon by Saif M. Mohammad, labelled nrc. The NRC library was accessed through the syuzhet package.

  4. The Sentiwords dataset, created by Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani, labelled sentiword. The Sentiwords library was accessed through the lexicon package.

  5. The Hu & Liu dataset, by Minqing Hu and Bing Liu, labelled hu. The Hu & Liu library was accessed through the sentimentr package.

Summary Statistics

I have produced summary statistics of mean sentiment and speech length, for each MP, and by party, party group and gender. Download all four tables in one XLSX workbook.
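For readers who want to recompute similar figures from the CSV, here is a minimal Python sketch of a grouped summary. It is not the code used to build the workbook: it takes an unweighted mean per speech, which may differ from how the published tables were calculated, and the three rows are invented purely for illustration. Only the column names (party, afinn_sentiment, word_count) come from the variable list under Dataset Variables.

```python
from collections import defaultdict
from statistics import mean

# Invented example rows; real rows would come from the dataset CSV.
rows = [
    {"party": "Labour", "afinn_sentiment": 0.25, "word_count": 100},
    {"party": "Labour", "afinn_sentiment": 0.75, "word_count": 300},
    {"party": "Conservative", "afinn_sentiment": -0.5, "word_count": 50},
]


def summarise(rows, group_key):
    """Return speech count, mean sentiment and mean length per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    return {
        group: {
            "speeches": len(members),
            # Unweighted mean: every speech counts equally (an assumption).
            "mean_afinn_sentiment": mean(r["afinn_sentiment"] for r in members),
            "mean_word_count": mean(r["word_count"] for r in members),
        }
        for group, members in groups.items()
    }


by_party = summarise(rows, "party")
```

Passing "gender" or "party_group" as the group key instead of "party" would give the other groupings, provided those columns are present in the rows.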

Dataset Variables

Variable | Description | Data Type (R)
id | ID for each speech, corresponding to the parlparse ID | character
speech | The actual text of the speech | character
word_count | The word count of the speech | numeric
speech_date | The date the speech was made | date
year | The year the speech was made | factor
afinn_sentiment | The afinn sentiment score | numeric
afinn_sd | The standard deviation of the afinn score | numeric
bing_sentiment | The bing sentiment score | numeric
bing_sd | The standard deviation of the bing score | numeric
nrc_sentiment | The nrc sentiment score | numeric
nrc_sd | The standard deviation of the nrc score | numeric
sentiword_sentiment | The sentiword sentiment score | numeric
sentiword_sd | The standard deviation of the sentiword score | numeric
hu_sentiment | The hu sentiment score | numeric
hu_sd | The standard deviation of the hu score | numeric
proper_id | The ID used by the Members’ Names Information Service (only applicable after 1980) | character
proper_name | The MP’s name (only applicable after 1980) | factor
house_start_date | The date the MP was first elected to the House of Commons (only applicable after 1980) | date
date_of_birth | MP’s date of birth (only applicable after 1980) | date
gender | One of Male or Female (only applicable after 1980) | factor
party | The political party the MP belonged to at time of speech (only applicable after 1980) | factor
age | Age at time of speech (only applicable after 1980) | numeric
party_group | (only applicable after 1980) | factor
ministry | Identifier for the government at time of speech | factor
government | An indicator of whether the MP is a member of the governing party (or parties), or in the opposition (only applicable after 1980) | factor
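The data types above are R types. As a rough sketch of how a few of these columns might be read outside R, the Python snippet below parses CSV text with the standard csv module and converts the values. Several things here are assumptions to check against the real file: that dates are written in ISO format (as the 1977-01-01 in the Notes suggests), that missing MP details appear as empty cells or as NA, and the sample rows themselves, which are invented.

```python
import csv
import io
from datetime import date

# Invented sample rows; only the column names come from the variable table.
SAMPLE = """id,word_count,speech_date,year,afinn_sentiment,gender,party,age
example-1,120,1985-03-04,1985,0.25,Female,Labour,43
example-2,45,1950-06-01,1950,-0.10,NA,,
"""

# Assumption: these strings mark a missing value in the CSV.
MISSING = {"", "NA"}


def optional(value, convert):
    """Convert a cell, or return None when the cell is marked missing."""
    return None if value in MISSING else convert(value)


def parse_row(row):
    """Convert the string fields of one CSV row to Python types."""
    return {
        "id": row["id"],
        "word_count": int(float(row["word_count"])),
        "speech_date": date.fromisoformat(row["speech_date"]),
        "year": int(row["year"]),
        "afinn_sentiment": float(row["afinn_sentiment"]),
        "gender": optional(row["gender"], str),
        "party": optional(row["party"], str),
        "age": optional(row["age"], float),
    }


def load_speeches(handle):
    """Read every row from an open CSV file or file-like object."""
    return [parse_row(row) for row in csv.DictReader(handle)]


speeches = load_speeches(io.StringIO(SAMPLE))
```

For the real file, an open file handle can be passed to load_speeches in place of the in-memory sample; the remaining columns can be added to parse_row in the same way.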

Methodology

The parlparse project provides scraped XML files of Hansard debates going back to 1936, and assigns an ID to each speaker. However, I could not find where the assigned IDs are linked to other information, such as constituencies or parties, or to the MNIS ID system used by Parliament. Long-serving MPs may also have dozens of these IDs assigned to them, and they are not consistently linked together. There are also substantial numbers of speeches where no ID is assigned to a speaker, and they are classified as ‘unknown’.

I created a table with every possible combination of name and ID, and matched the speakers in that table to their MNIS ID, using a mixture of exact string, approximate string and manual matching. The information in this table was then matched to the complete list of speech IDs. In the case of commonly used names (e.g. the two Labour MPs named John Smith who were both members of the House between 1989 and 1992) I manually identified which MP was actually speaking by locating adjacent Hansard records where their full name, constituency or ministerial title was used. In a handful of cases I had to use the content of their speech and any adjacent speeches to provide further clues to an MP’s identity.
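The general shape of that exact, then approximate, then manual matching can be sketched in Python as follows. This is an illustration only and not the code in the repository: Python's difflib stands in for whatever approximate matching was actually used, the 0.85 cutoff is an arbitrary choice, and the names and IDs in the lookup table are invented.

```python
import difflib

# Hypothetical lookup from MP name to MNIS ID; names and IDs are invented.
mnis_names = {
    "Jane Example": "9001",
    "Robert Sample": "9002",
}


def match_speaker(name, lookup, cutoff=0.85):
    """Return (mnis_id, method) for a scraped speaker name."""
    cleaned = " ".join(name.split())  # collapse stray whitespace
    if cleaned in lookup:
        return lookup[cleaned], "exact"
    close = difflib.get_close_matches(cleaned, list(lookup), n=2, cutoff=cutoff)
    if len(close) == 1:
        return lookup[close[0]], "approximate"
    # No candidate, or several plausible ones: leave for manual review.
    return None, "manual"


exact = match_speaker("Jane  Example", mnis_names)
approximate = match_speaker("Jane Exampel", mnis_names)
unresolved = match_speaker("Unknown", mnis_names)
```

Returning the method alongside the ID makes it easy to list the names that still need manual checking, which is where shared names such as the two John Smiths would end up.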

The code and matching data used to generate this dataset are available on GitHub. The GitHub repository also includes the code used to produce the sentiment scores included in the dataset.

Notes

Sarah Olney (4591) does not have a birth date listed in the Members’ Names Information Service, and I have been unable to locate her date of birth elsewhere, only the year of birth. As a consequence, her birth date is listed as 1977-01-01. This will be amended to the correct month and day when her biography is updated.

The data used to create this dataset was taken from the parlparse project operated by They Work For You and supported by mySociety.

The dataset is licensed under a Creative Commons Attribution 4.0 International License.

The code included in this repository is licensed under an MIT license.

Please contact me if you find any errors in the dataset. The integrity of the public Hansard record is questionable at times, and while I have improved it, the data is presented ‘as is’.

The DOI is 10.5281/zenodo.376839