Download from Zenodo

A public dataset of speeches from Hansard, stored as tibbles in RDS files for the R programming language.1 It includes the text and sentiment classifications for every speech made in the House of Commons between the 1979 general election and the end of 2017, with information on the speaking MP: their party, gender, birth date,2 starting and finishing dates (if applicable) as an MP, and age at the time of the speech. For pre-1979 election data, please see here. Documentation for previous versions of the Hansard Speeches and Sentiment dataset can be found here.

The hansard_senti_post_V250 dataset contains 2,196,175 speeches and 382,484,493 words. It can be accessed through Zenodo and is distributed under a Creative Commons 4.0 BY-SA licence. The latest version expands coverage to the end of 2017, corrects several thousand spelling errors and includes two new sentiment classification methods.

Sentiment Classification Methods

The speeches have been classified for sentiment using a total of six libraries from the R package lexicon by Tyler Rinker, and one by Ludovic Rheault, Kaspar Beelen, Christopher Cochrane and Graeme Hirst. All classification used the method implemented in Tyler Rinker’s sentimentr package (Rinker 2017). The libraries are:

  1. The AFINN library, labelled afinn. The AFINN library was accessed through Matthew Jockers’s syuzhet package (Nielsen 2011).

  2. A variant of the syuzhet library, included in the lexicon package, labelled jockers (Jockers 2015).

  3. The NRC Word-Emotion Association Lexicon, labelled nrc. The NRC library was accessed through the lexicon package (Mohammad and Turney 2013).

  4. The Opinion Mining, Sentiment Analysis and Opinion Spam Detection library, labelled huliu. The library was accessed through the lexicon package (Hu and Liu 2004).

  5. A modified version of the unnamed lexicon from this paper, labelled rheault (Rheault et al. 2016). As the method in sentimentr does not distinguish between the lexical categories of a word that can occupy more than one,3 I used the average polarity score assigned to such words.4

  6. A combination of the syuzhet and Opinion Mining, Sentiment Analysis and Opinion Spam Detection libraries created by Rinker, labelled jockers_rinker.

  7. The Augmented SenticNet 4 Polarity Table from lexicon, labelled senticnet (Cambria et al. 2016).
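The averaging step described for the rheault library can be illustrated with a minimal Python sketch. It assumes a lexicon stored as (word, lexical category, polarity) entries and collapses it to one score per word by taking the mean across categories. The entries and scores below are invented for illustration and are not taken from Rheault et al. (2016), and `collapse_by_word` is a hypothetical helper, not the code used to build this dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical lexicon entries: (word, lexical category, polarity score).
# The scores are made up for illustration only.
pos_tagged_lexicon = [
    ("bid", "noun", 0.2),
    ("bid", "verb", -0.4),
    ("praise", "noun", 0.8),
    ("praise", "verb", 0.6),
    ("crisis", "noun", -0.9),
]


def collapse_by_word(entries):
    """Return one polarity per word: the mean over its lexical categories."""
    scores = defaultdict(list)
    for word, _category, polarity in entries:
        scores[word].append(polarity)
    return {word: mean(values) for word, values in scores.items()}


word_polarity = collapse_by_word(pos_tagged_lexicon)
```

A word that appears in only one category, such as "crisis" here, keeps its original score; only words like "bid" that appear under several categories are affected.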

Summary Statistics

I have produced summary statistics for the hansard_senti_post_V250 dataset, giving the weighted (by speech length) and unweighted mean and standard deviation of sentiment scores, along with average and total speech word counts. These are available by:

Download all nine tables in one XLSX workbook. Each table contains the following variables:

| Variable | Description |
| --- | --- |
| afinn_sentiment_avg | Average sentiment, afinn library |
| afinn_sd_avg | Average sentiment standard deviation, afinn library |
| afinn_sentiment_wtd | Average sentiment, weighted by speech length, afinn library |
| afinn_sd_wtd | Average sentiment standard deviation, weighted by speech length, afinn library |
| jockers_sentiment_avg | Average sentiment, jockers library |
| jockers_sd_avg | Average sentiment standard deviation, jockers library |
| jockers_sentiment_wtd | Average sentiment, weighted by speech length, jockers library |
| jockers_sd_wtd | Average sentiment standard deviation, weighted by speech length, jockers library |
| nrc_sentiment_avg | Average sentiment, nrc library |
| nrc_sd_avg | Average sentiment standard deviation, nrc library |
| nrc_sentiment_wtd | Average sentiment, weighted by speech length, nrc library |
| nrc_sd_wtd | Average sentiment standard deviation, weighted by speech length, nrc library |
| huliu_sentiment_avg | Average sentiment, huliu library |
| huliu_sd_avg | Average sentiment standard deviation, huliu library |
| huliu_sentiment_wtd | Average sentiment, weighted by speech length, huliu library |
| huliu_sd_wtd | Average sentiment standard deviation, weighted by speech length, huliu library |
| rheault_sentiment_avg | Average sentiment, rheault library |
| rheault_sd_avg | Average sentiment standard deviation, rheault library |
| rheault_sentiment_wtd | Average sentiment, weighted by speech length, rheault library |
| rheault_sd_wtd | Average sentiment standard deviation, weighted by speech length, rheault library |
| jockers_rinker_sentiment_avg | Average sentiment, jockers_rinker library |
| jockers_rinker_sd_avg | Average sentiment standard deviation, jockers_rinker library |
| jockers_rinker_sentiment_wtd | Average sentiment, weighted by speech length, jockers_rinker library |
| jockers_rinker_sd_wtd | Average sentiment standard deviation, weighted by speech length, jockers_rinker library |
| senticnet_sentiment_avg | Average sentiment, senticnet library |
| senticnet_sd_avg | Average sentiment standard deviation, senticnet library |
| senticnet_sentiment_wtd | Average sentiment, weighted by speech length, senticnet library |
| senticnet_sd_wtd | Average sentiment standard deviation, weighted by speech length, senticnet library |
| tot_speeches | Total number of speeches |
| tot_words | Total number of words |
| avg_speech_length | Average number of words per speech |
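To make the difference between the `_avg` and `_wtd` columns concrete, here is a small Python sketch of how one summary row could be computed for a single library. It assumes that the weight for each speech is its word_count and that the `_sd_` columns are means of the per-speech standard deviations; both are my reading of the variable descriptions rather than a confirmed account of how the tables were produced. The three speeches are invented, and `weighted_mean` and `summarise` are hypothetical helper names.

```python
def weighted_mean(values, weights):
    """Mean of `values`, with each value weighted by the matching weight."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


def summarise(speeches, library):
    """Build one summary row for a single sentiment library."""
    sentiment = [s[f"{library}_sentiment"] for s in speeches]
    sd = [s[f"{library}_sd"] for s in speeches]
    words = [s["word_count"] for s in speeches]
    return {
        f"{library}_sentiment_avg": sum(sentiment) / len(sentiment),
        f"{library}_sd_avg": sum(sd) / len(sd),
        # Assumption: speech length (word_count) is used as the weight.
        f"{library}_sentiment_wtd": weighted_mean(sentiment, words),
        f"{library}_sd_wtd": weighted_mean(sd, words),
        "tot_speeches": len(speeches),
        "tot_words": sum(words),
        "avg_speech_length": sum(words) / len(speeches),
    }


# Invented example speeches; the values are not from the dataset.
speeches = [
    {"afinn_sentiment": 0.10, "afinn_sd": 0.20, "word_count": 100},
    {"afinn_sentiment": -0.30, "afinn_sd": 0.40, "word_count": 300},
    {"afinn_sentiment": 0.50, "afinn_sd": 0.10, "word_count": 600},
]

row = summarise(speeches, "afinn")
```

In this toy example the unweighted mean sentiment is about 0.10, while the weighted mean is about 0.22, because the longest speech is also the most positive one.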

Dataset Variables

The hansard_senti_post_V250 and hansard_senti_pre_V250 datasets have slightly different variables, because more information is available for post-1979 MPs, and that additional information is included in hansard_senti_post_V250.

hansard_senti_post_V250 Dataset Variables

| Variable | Description | Data Type |
| --- | --- | --- |
| pp_id | ID for each speech, corresponding to the parlparse ID | character |
| eo_id | ID number for each speech, as assigned by me, to accommodate situations where the same parlparse ID was assigned to distinct speeches | character |
| speech | The actual text of the speech | character |
| afinn_sentiment | The afinn sentiment score | numeric |
| afinn_sd | The standard deviation of the afinn score | numeric |
| jockers_sentiment | The jockers sentiment score | numeric |
| jockers_sd | The standard deviation of the jockers score | numeric |
| nrc_sentiment | The nrc sentiment score | numeric |
| nrc_sd | The standard deviation of the nrc score | numeric |
| huliu_sentiment | The huliu sentiment score | numeric |
| huliu_sd | The standard deviation of the huliu score | numeric |
| rheault_sentiment | The rheault sentiment score | numeric |
| rheault_sd | The standard deviation of the rheault score | numeric |
| jockers_rinker_sentiment | The jockers_rinker sentiment score | numeric |
| jockers_rinker_sd | The standard deviation of the jockers_rinker score | numeric |
| senticnet_sentiment | The senticnet sentiment score | numeric |
| senticnet_sd | The standard deviation of the senticnet score | numeric |
| word_count | The word count of the speech | numeric |
| speech_date | The date the speech was made | date |
| year | The year the speech was made | numeric |
| time | The time the speech was made (not consistently available), stored as a character vector; e.g. ‘16:24:00’ | character |
| url | The URL of the speech | character |
| as_speaker | Whether the speaker is the Speaker of the House | logical |
| speaker_id | One of three ID schemes used in the parlparse scraper | character |
| person_id | One of three ID schemes used in the parlparse scraper | character |
| hansard_membership_id | One of three ID schemes used in the parlparse scraper | character |
| mnis_id | The ID used by the Members’ Names Information Service. This ID remains constant, even if an MP changes parties, seats, etc. | character |
| dods_id | Dods Monitoring ID | integer |
| pims_id | Parliamentary Information Management Services ID | integer |
| proper_name | The MP’s name | character |
| party_group | Grouping of political parties. Labour and Labour Co-op MPs are listed as ‘Labour’, Conservative MPs as ‘Conservative’, Liberal Democrats, Social Democrats and Liberals are all listed as ‘Liberal Democrat’, and all other MPs are listed as ‘Other’. | factor |
| party | The political party the MP belonged to at time of speech | character |
| government | An indicator of whether the MP is a member of the governing party (or parties) or in the opposition | factor |
| age | Age at time of speech | integer |
| gender | One of Male or Female | factor |
| date_of_birth | MP’s date of birth | date |
| house_start_date | The date the MP was first elected to the House of Commons | date |
| house_end_date | The date the MP left the House of Commons | date |
| ministry | Identifier for the government at time of speech | character |
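Two of the variables above, party_group and age, are derived from other columns. The Python sketch below shows one way such derivations could work, under stated assumptions: the exact party strings in the party column may be spelled differently from the ones used here, and age is assumed to be counted in completed years between date_of_birth and speech_date. `party_group_for` and `age_at_speech` are hypothetical helper names, not functions from the dataset's code.

```python
from datetime import date

# Assumed spellings of values in the `party` column; the real strings may differ.
LABOUR = {"Labour", "Labour Co-op"}
LIBERAL_DEMOCRAT = {"Liberal Democrat", "Social Democrat", "Liberal"}


def party_group_for(party):
    """Map a party name onto the four party_group levels."""
    if party in LABOUR:
        return "Labour"
    if party == "Conservative":
        return "Conservative"
    if party in LIBERAL_DEMOCRAT:
        return "Liberal Democrat"
    return "Other"


def age_at_speech(date_of_birth, speech_date):
    """Completed years between date_of_birth and speech_date."""
    years = speech_date.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred in the speech year.
    if (speech_date.month, speech_date.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


group = party_group_for("Labour Co-op")
age = age_at_speech(date(1950, 6, 15), date(2000, 6, 14))
```

Note that for MPs whose birth date is recorded as the first day of a year or month (see the footnote on birth dates), an age computed this way can be off by one for part of the year.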

Methodology

The parlparse project provides scraped XML files of Hansard debates going back to 1936, and assigns an ID to each speaker. However, I could not find where the IDs assigned are linked to other information, such as constituencies or parties, or to the MNIS ID system used by Parliament. Long-serving MPs may also have dozens of these IDs assigned to them, and the IDs are not consistently linked together. There are also substantial numbers of speeches where no ID is assigned to the speaker, and these are classified as ‘unknown’.

I created a table with every possible combination of name and speaker_id, person_id and hansard_membership_id, and matched the speakers in that table to their MNIS ID, using a mixture of exact string matching, manually checked approximate string matching and manual matching/hand coding. The information in this table was then matched to the complete list of speech IDs. In the case of common names,6 I manually identified which MP was actually speaking by locating adjacent Hansard records where their full name, constituency or ministerial title was used. In a handful of cases I had to use the content of their speech and any adjacent speeches to provide further clues to an MP’s identity.
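The matching stages described above (exact string, approximate string with manual checking, and manual coding for shared names) can be sketched in Python. This is an illustrative outline only, not the code used to build the dataset: the names and IDs are invented, the 0.85 similarity cutoff is an arbitrary assumption, and `build_index` and `match_speaker` are hypothetical helpers built on the standard library's `difflib.get_close_matches`.

```python
import difflib
from collections import defaultdict

# Invented reference table of (mnis_id, name) pairs; not real MPs or IDs.
reference = [
    ("9001", "Alex Harcourt"),
    ("9002", "Priya Fenwick"),
    ("9003", "Sam Whitlow"),
    ("9004", "Sam Whitlow"),  # two people sharing one name
]


def build_index(pairs):
    """Map each normalised name to every MNIS ID that uses it."""
    index = defaultdict(list)
    for mnis_id, name in pairs:
        index[name.strip().lower()].append(mnis_id)
    return index


def match_speaker(raw_name, index, cutoff=0.85):
    """Try an exact match, then an approximate one; otherwise defer to manual coding."""
    key = raw_name.strip().lower()
    method = "exact"
    if key not in index:
        close = difflib.get_close_matches(key, list(index), n=1, cutoff=cutoff)
        if not close:
            return {"mnis_id": None, "method": "manual"}
        # Approximate matches would still need a manual check.
        key, method = close[0], "approximate"
    ids = index[key]
    if len(ids) > 1:
        # Shared names cannot be resolved from the string alone.
        return {"mnis_id": None, "method": "manual"}
    return {"mnis_id": ids[0], "method": method}


index = build_index(reference)
exact = match_speaker("Alex Harcourt", index)
approximate = match_speaker("Priya Fenwik", index)
shared = match_speaker("Sam Whitlow", index)
unknown = match_speaker("unknown", index)
```

Anything returned with the method "manual" corresponds to the cases that needed hand coding, such as shared names or speakers recorded as ‘unknown’.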

Licences and Code

The data used to create this dataset was taken from the parlparse project operated by They Work For You and supported by mySociety.

The dataset is licensed under a Creative Commons Attribution 4.0 International License.

The code used to generate this dataset is available on GitHub and is licensed under an MIT license.

Please contact me if you find any errors in the dataset. The integrity of the public Hansard record is questionable at times, and while I have improved it, the data is presented ‘as is’.

Citing this dataset

Please cite this dataset as:

Odell, Evan. (2018). “Hansard Speeches and Sentiment V2.5.0 [Dataset].” 10.5281/zenodo.1183893.

@article{odell2018,
  title = {Hansard {{Speeches}} and {{Sentiment V2}}.5.0 [Dataset]},
  url = {https://evanodell.com/projects/datasets/hansard-data/},
  doi = {10.5281/zenodo.1183893},
  date = {2018-02-24},
  keywords = {dataset},
  author = {Odell, Evan}
}

The DOI of V2.5.0 is 10.5281/zenodo.1183893. The DOI for all versions is 10.5281/zenodo.780985, and will always resolve to the latest version of the Hansard Speeches and Sentiment dataset.

References

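Cambria, Erik, Soujanya Poria, Rajiv Bajpai, and Björn Schuller. 2016. “SenticNet 4: A Semantic Resource for Sentiment Analysis Based on Conceptual Primitives.” In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 2666–77. http://sentic.net/senticnet-4.pdf.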

Hu, Minqing, and Bing Liu. 2004. “Mining and Summarizing Customer Reviews.” In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 168–77. ACM. http://dl.acm.org/citation.cfm?id=1014073.

Jockers, Matthew L. 2015. Syuzhet: Extract Sentiment and Plot Arcs from Text. https://github.com/mjockers/syuzhet.

Mohammad, Saif M., and Peter D. Turney. 2013. “Crowdsourcing a Word-Emotion Association Lexicon.” Computational Intelligence 29 (3): 436–65. http://onlinelibrary.wiley.com/doi/10.1111/j.1467-8640.2012.00460.x/full.

Nielsen, Finn Årup. 2011. “A New ANEW: Evaluation of a Word List for Sentiment Analysis in Microblogs.” CoRR abs/1103.2903. http://arxiv.org/abs/1103.2903.

Rheault, Ludovic, Kaspar Beelen, Christopher Cochrane, and Graeme Hirst. 2016. “Measuring Emotion in Parliamentary Debates with Automated Textual Analysis.” PLOS ONE 11 (12): e0168843. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0168843.

Rinker, Tyler W. 2017. Sentimentr: Calculate Text Polarity Sentiment. Buffalo, New York: University at Buffalo/SUNY. http://github.com/trinker/sentimentr.


  1. If you would like other formats please get in touch.

  2. Sarah Olney (mnis_id 4591) does not have a birth date listed in the Members’ Names Information Service, and I have been unable to locate her date of birth elsewhere, only the year of birth. Her birth date is, as a consequence, listed as 1977-01-01; this will be amended to the correct month and day if her biography is updated. Several members (Gillian Keegan, Laura Smith, Mike Hill) still do not have birth dates listed at all. I could not locate exact dates for Anna McMorrin, Bob Seely, Danielle Rowley, Fiona Onasanya or Kemi Badenoch, so I have listed their birth dates as the first day of the known year or month. This will be updated in future versions of this dataset.

  3. e.g. ‘bid’ can be both a noun, as in a bid submitted in response to a project tender, and a verb, as in to bid for an item at an auction.

  4. Rheault et al. (2016) have a more complex method of calculating polarity that accounts for lexical types. See their paper and the related repository for details.

  5. That is, a new identifier for each new Prime Minister and/or general election.

  6. e.g. the two Labour MPs named John Smith who were both members of the House between 1989 and 1992.