Download from Zenodo
A public dataset of speeches in Hansard, stored as a tibble class in RDS files for the R programming language.[^1] It covers speeches made in the House of Commons between the parliament returned from the 1979 general election and the parliamentary summer recess starting on 2017-07-20, with information on the speaking MP, their party, gender, birth date,[^2] starting and finishing dates as an MP, and age at the time of the speech. The dataset also includes all speeches made from 1936 to the dissolution of parliament for the 1979 general election. The post-1979 election dataset is labelled `hansard_senti_post_V241` and the pre-1979 election dataset is labelled `hansard_senti_pre_V241`. Both datasets are encoded as UTF-8.
Documentation for previous versions of the Hansard Speeches and Sentiment dataset can be found here.
The `hansard_senti_post_V241` dataset contains 2,169,348 speeches and 373,323,215 words. The `hansard_senti_pre_V241` dataset contains 2,977,461 speeches and 406,062,364 words. The dataset can be accessed through Zenodo, and is distributed under a Creative Commons 4.0 BY-SA license. The latest version, V2.4.1, corrects errors introduced by regex in V2.2, fixes encoding issues in speeches, corrects several spelling mistakes in the Hansard record, removes duplicate speeches, updates some MPs’ names, and includes speeches up to the summer recess starting on 2017-07-20. For details on how speech sentiments were classified, see below.
Changes in V2.4.1
- Henry McLeish was listed as a member of the Conservative party, instead of Labour. This has been corrected.
Sentiment Classification Methods
The speeches have been classified for sentiment using five libraries: three from the R package `lexicon` by Tyler Rinker, one from the `syuzhet` package by Matthew Jockers, and one by Ludovic Rheault, Kaspar Beelen, Christopher Cochrane and Graeme Hirst. All classification has used the method in Tyler Rinker’s `sentimentr` package (Rinker 2017). The libraries are:
- The AFINN library, labelled `afinn`. The AFINN library was accessed through Matthew Jockers’s `syuzhet` package (Nielsen 2011).
- A variant of the syuzhet library, included in the `lexicon` package, labelled `jockers` (Jockers 2015).
- The NRC Word-Emotion Association Lexicon, labelled `nrc`. The NRC library was accessed through the `lexicon` package (Mohammad and Turney 2013).
- The Opinion Mining, Sentiment Analysis and Opinion Spam Detection library, labelled `huliu`. The library was accessed through the `lexicon` package (Hu and Liu 2004).
- A modified version of the unnamed lexicon from Rheault et al. (2016), labelled `rheault`. As the method in `sentimentr` does not distinguish between uses of the same word in different lexical categories,[^3] I used the average polarity score assigned to such words.[^4]
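The averaging step for the `rheault` library can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the code used to build the dataset. The entries, category labels and polarity values below are invented, and `collapse_by_word` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical lexicon entries: (word, lexical category, polarity).
# The values are invented and are not taken from Rheault et al. (2016).
tagged_lexicon = [
    ("bid", "noun", 0.5),
    ("bid", "verb", -0.25),
    ("support", "noun", 0.75),
    ("support", "verb", 0.25),
    ("crisis", "noun", -1.0),
]


def collapse_by_word(entries):
    """Average polarity across lexical categories so each word has one score."""
    scores = defaultdict(list)
    for word, _category, polarity in entries:
        scores[word].append(polarity)
    return {word: mean(values) for word, values in scores.items()}


flat_lexicon = collapse_by_word(tagged_lexicon)
```

In this toy example, "bid" ends up with a single score halfway between its noun and verb polarities, while a word that appears in only one category keeps its original score.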
Summary Statistics
I have produced summary statistics of the `hansard_senti_post_V241` dataset, with the weighted (by speech length) and unweighted mean and standard deviation of sentiment scores, and average and total speech word counts. All nine tables can be downloaded in one XLSX workbook. Each table contains the following variables:
Variable | Description |
---|---|
`afinn_sentiment_avg` | Average sentiment, afinn library |
`afinn_sd_avg` | Average sentiment standard deviation, afinn library |
`afinn_sentiment_wtd` | Average sentiment, weighted by speech length, afinn library |
`afinn_sd_wtd` | Average sentiment standard deviation, weighted by speech length, afinn library |
`jockers_sentiment_avg` | Average sentiment, jockers library |
`jockers_sd_avg` | Average sentiment standard deviation, jockers library |
`jockers_sentiment_wtd` | Average sentiment, weighted by speech length, jockers library |
`jockers_sd_wtd` | Average sentiment standard deviation, weighted by speech length, jockers library |
`nrc_sentiment_avg` | Average sentiment, nrc library |
`nrc_sd_avg` | Average sentiment standard deviation, nrc library |
`nrc_sentiment_wtd` | Average sentiment, weighted by speech length, nrc library |
`nrc_sd_wtd` | Average sentiment standard deviation, weighted by speech length, nrc library |
`huliu_sentiment_avg` | Average sentiment, huliu library |
`huliu_sd_avg` | Average sentiment standard deviation, huliu library |
`huliu_sentiment_wtd` | Average sentiment, weighted by speech length, huliu library |
`huliu_sd_wtd` | Average sentiment standard deviation, weighted by speech length, huliu library |
`rheault_sentiment_avg` | Average sentiment, rheault library |
`rheault_sd_avg` | Average sentiment standard deviation, rheault library |
`rheault_sentiment_wtd` | Average sentiment, weighted by speech length, rheault library |
`rheault_sd_wtd` | Average sentiment standard deviation, weighted by speech length, rheault library |
`tot_speeches` | Total number of speeches |
`tot_words` | Total number of words |
`avg_speech_length` | Average number of words per speech |
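The difference between the `_avg` and `_wtd` columns can be illustrated with a short sketch, written in Python purely for illustration. It assumes that "weighted by speech length" means each speech is weighted by its word count; the rows are invented, and the code is not the procedure used to build the published tables.

```python
# Hypothetical per-speech rows: (sentiment score, score standard deviation, word count).
speeches = [
    (0.20, 0.30, 100),
    (-0.10, 0.50, 300),
    (0.40, 0.10, 600),
]


def unweighted_mean(values):
    """Plain arithmetic mean: every speech counts equally."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Mean in which each speech counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


sentiment = [s for s, _, _ in speeches]
sd = [d for _, d, _ in speeches]
words = [n for _, _, n in speeches]

summary = {
    "sentiment_avg": unweighted_mean(sentiment),
    "sd_avg": unweighted_mean(sd),
    "sentiment_wtd": weighted_mean(sentiment, words),  # assumed weight: word count
    "sd_wtd": weighted_mean(sd, words),
    "tot_speeches": len(speeches),
    "tot_words": sum(words),
    "avg_speech_length": sum(words) / len(speeches),
}
```

With these toy rows the weighted sentiment is pulled towards the longest speech, which is why the weighted and unweighted columns can differ noticeably when speech lengths vary.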
Dataset Variables
The `hansard_senti_post_V241` and `hansard_senti_pre_V241` datasets have slightly different variables, as there is more information available for all post-1979 MPs, and that is included in `hansard_senti_post_V241`.
`hansard_senti_post_V241` Dataset Variables
Variable | Description | Data Type |
---|---|---|
`pp_id` | ID for each speech, corresponding to the parlparse ID | character |
`eo_id` | ID number for each speech, as assigned by me, to accommodate situations where the same parlparse ID was assigned to distinct speeches | character |
`speech` | The actual text of the speech | character |
`afinn_sentiment` | The afinn sentiment score | numeric |
`afinn_sd` | The standard deviation of the afinn score | numeric |
`jockers_sentiment` | The jockers sentiment score | numeric |
`jockers_sd` | The standard deviation of the jockers score | numeric |
`nrc_sentiment` | The nrc sentiment score | numeric |
`nrc_sd` | The standard deviation of the nrc score | numeric |
`huliu_sentiment` | The huliu sentiment score | numeric |
`huliu_sd` | The standard deviation of the huliu score | numeric |
`rheault_sentiment` | The rheault sentiment score | numeric |
`rheault_sd` | The standard deviation of the rheault score | numeric |
`word_count` | The word count of the speech | numeric |
`speech_date` | The date the speech was made | date |
`year` | The year the speech was made | numeric |
`time` | The time the speech was made (not consistently available), stored as a character vector; e.g. ‘16:24:00’ | character |
`url` | The URL of the speech | character |
`as_speaker` | Whether the speaker is the Speaker of the House | logical |
`speaker_id` | One of three ID schemes used in the parlparse scraper | character |
`person_id` | One of three ID schemes used in the parlparse scraper | character |
`hansard_membership_id` | One of three ID schemes used in the parlparse scraper | character |
`mnis_id` | The ID used by the Members’ Names Information Service. This ID remains constant, even if an MP changes parties, seats, etc. | character |
`dods_id` | Dods Monitoring ID | integer |
`pims_id` | Parliamentary Information Management Services ID | integer |
`proper_name` | The MP’s name | character |
`party_group` | Grouping of political parties. Labour and Labour Co-op MPs are listed as ‘Labour’, Conservative MPs as ‘Conservative’, Liberal Democrats, Social Democrats and Liberals are all listed as ‘Liberal Democrat’, and all other MPs are listed as ‘Other’. | factor |
`party` | The political party the MP belonged to at time of speech | character |
`government` | An indicator of whether the MP is a member of the governing party (or parties), or in the opposition | factor |
`age` | Age at time of speech | integer |
`gender` | One of Male or Female | factor |
`date_of_birth` | MP’s date of birth | date |
`house_start_date` | The date the MP was first elected to the House of Commons | date |
`house_end_date` | The date the MP left the House of Commons | date |
`ministry` | Identifier for the government at time of speech[^5] | character |
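Two of the derived variables in this table, `party_group` and `age`, can be illustrated with a short sketch, written in Python purely for illustration. The party strings used as dictionary keys are assumptions about how parties are spelled in the `party` column, and treating `age` as completed years on the speech date is also an assumption; neither is confirmed by the dataset's build code.

```python
from datetime import date

# Assumed spellings of values in the `party` column; the real strings may differ.
PARTY_GROUPS = {
    "Labour": "Labour",
    "Labour Co-op": "Labour",
    "Conservative": "Conservative",
    "Liberal Democrat": "Liberal Democrat",
    "Social Democrat": "Liberal Democrat",
    "Liberal": "Liberal Democrat",
}


def party_group(party):
    """Map a party name to its group; anything unlisted becomes 'Other'."""
    return PARTY_GROUPS.get(party, "Other")


def age_at_speech(date_of_birth, speech_date):
    """Completed years between birth and speech (an assumed definition of `age`)."""
    had_birthday = (speech_date.month, speech_date.day) >= (
        date_of_birth.month,
        date_of_birth.day,
    )
    return speech_date.year - date_of_birth.year - (0 if had_birthday else 1)


example_group = party_group("Labour Co-op")
example_age = age_at_speech(date(1950, 6, 15), date(1990, 6, 14))
```

Here an MP born on 1950-06-15 who speaks on 1990-06-14 is counted as 39, because the birthday has not yet passed in that year.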
Notes on the `hansard_senti_pre_V241` Dataset
The historical Hansard record often uses inconsistent and confusing naming conventions for MPs. I have not matched pre-1979 election MPs to their MNIS IDs, as not all pre-1979 election MPs will have an MNIS ID to be matched to, and the naming conventions appear to be particularly confusing. In the long term, I hope to develop a convention for a unique ID code for MPs that can identify them, their party, their constituency and any office they held at the time, but that project has no timetable. If you want to contribute to it, please get in touch.
MPs’ MNIS IDs, names, birthdates, and start and end dates as an MP are available here.
`hansard_senti_pre_V241` Dataset Variables
Variable | Description | Data Type |
---|---|---|
`pp_id` | ID for each speech, corresponding to the parlparse ID | character |
`eo_id` | ID number for each speech, as assigned by me, to accommodate situations where the same parlparse ID was assigned to distinct speeches | character |
`speech` | The actual text of the speech | character |
`speaker_name` | The name of the speaker, as listed in the Hansard record | character |
`afinn_sentiment` | The afinn sentiment score | numeric |
`afinn_sd` | The standard deviation of the afinn score | numeric |
`jockers_sentiment` | The jockers sentiment score | numeric |
`jockers_sd` | The standard deviation of the jockers score | numeric |
`nrc_sentiment` | The nrc sentiment score | numeric |
`nrc_sd` | The standard deviation of the nrc score | numeric |
`huliu_sentiment` | The huliu sentiment score | numeric |
`huliu_sd` | The standard deviation of the huliu score | numeric |
`rheault_sentiment` | The rheault sentiment score | numeric |
`rheault_sd` | The standard deviation of the rheault score | numeric |
`word_count` | The word count of the speech | numeric |
`speech_date` | The date the speech was made | date |
`year` | The year the speech was made | numeric |
`time` | The time the speech was made (not consistently available), stored as a character vector; e.g. ‘16:24:00’ | character |
`url` | The URL of the speech | character |
`as_speaker` | Whether the speaker is the Speaker of the House | logical |
`speaker_id` | One of three ID schemes used in the parlparse scraper | character |
`person_id` | One of three ID schemes used in the parlparse scraper | character |
`hansard_membership_id` | One of three ID schemes used in the parlparse scraper | character |
Methodology
The parlparse project provides scraped XML files of Hansard debates going back to 1936, and assigns an ID to each speaker. However, I could not find where the IDs assigned are linked to other information, such as constituencies or parties, or the MNIS ID system used by parliament. Long-serving MPs may also have dozens of these IDs assigned to them, and they are not consistently linked together. There are also substantial numbers of speeches where no ID is assigned to a speaker, and they are classified as ‘unknown’. I created a table with every possible combination of name and `speaker_id`, `person_id` and `hansard_membership_id`, and matched the speakers in that table to their MNIS ID, using a mixture of exact string matching, manually checked approximate string matching, and manual matching/hand coding. The information in this table was then matched to the complete list of speech IDs. In the case of common names,[^6] I manually identified which MP was actually speaking by locating adjacent Hansard records where their full name, constituency or ministerial title was used. In a handful of cases I had to use the content of their speech and any adjacent speeches to provide further clues to an MP’s identity.
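The tiered matching described above (exact matches first, approximate matches flagged for manual checking, everything else left for hand coding) can be sketched as follows. This is an illustration in Python using the standard library's `difflib`, not the code used to build the dataset; the names, IDs, the `match_speaker` helper and the 0.85 similarity cutoff are all hypothetical.

```python
from difflib import get_close_matches

# Hypothetical reference table of names keyed to made-up MNIS IDs.
mnis_names = {"Alice Example": 101, "Robert Sample": 102}


def match_speaker(name, reference, cutoff=0.85):
    """Return (mnis_id, status), where status is 'exact', 'review' or 'manual'."""
    if name in reference:
        return reference[name], "exact"
    # Approximate match: keep the single closest candidate above the cutoff,
    # and flag it so a person checks it before it is accepted.
    candidates = get_close_matches(name, list(reference), n=1, cutoff=cutoff)
    if candidates:
        return reference[candidates[0]], "review"
    # Nothing close enough: leave for manual matching/hand coding.
    return None, "manual"


exact = match_speaker("Alice Example", mnis_names)
fuzzy = match_speaker("Alice Exampl", mnis_names)
unmatched = match_speaker("Unknown", mnis_names)
```

Keeping a separate status for approximate matches means no fuzzy match is accepted silently, which mirrors the manual checking step described in the text.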
Licences and Code
The data used to create this dataset was taken from the parlparse project operated by They Work For You and supported by mySociety.
The dataset is licensed under a Creative Commons Attribution 4.0 International License.
The code used to create this dataset is available on GitHub and is licensed under an MIT license.
Please contact me if you find any errors in the dataset. The integrity of the public Hansard record is questionable at times, and while I have improved it, the data is presented ‘as is’.
Citing this dataset
Please cite this dataset as:
Odell, Evan. (2017). “Hansard Speeches and Sentiment V2.4.1 [Dataset].” 10.5281/zenodo.841009.
The DOI of V2.4.1 is 10.5281/zenodo.841009. The DOI for all versions is 10.5281/zenodo.780985, and will always resolve to the latest version of the Hansard Speeches and Sentiment dataset.
References
Hu, Minqing, and Bing Liu. 2004. “Mining and Summarizing Customer Reviews.” In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 168–77. ACM. http://dl.acm.org/citation.cfm?id=1014073.
Jockers, Matthew L. 2015. Syuzhet: Extract Sentiment and Plot Arcs from Text. https://github.com/mjockers/syuzhet.
Mohammad, Saif M., and Peter D. Turney. 2013. “Crowdsourcing a Word-Emotion Association Lexicon.” Computational Intelligence 29 (3): 436–65. http://onlinelibrary.wiley.com/doi/10.1111/j.1467-8640.2012.00460.x/full.
Nielsen, Finn Årup. 2011. “A New ANEW: Evaluation of a Word List for Sentiment Analysis in Microblogs.” CoRR abs/1103.2903. http://arxiv.org/abs/1103.2903.
Rheault, Ludovic, Kaspar Beelen, Christopher Cochrane, and Graeme Hirst. 2016. “Measuring Emotion in Parliamentary Debates with Automated Textual Analysis.” PLOS ONE 11 (12): e0168843. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0168843.
Rinker, Tyler W. 2017. Sentimentr: Calculate Text Polarity Sentiment. Buffalo, New York: University at Buffalo/SUNY. http://github.com/trinker/sentimentr.
[^1]: If you would like other formats, please get in touch.
[^2]: Sarah Olney (mnis_id 4591) does not have a birth date listed in the Members’ Names Information Service, and I have been unable to locate her date of birth elsewhere, only the year of birth. Her birthdate is, as a consequence, listed as 1977-01-01; this will be amended to the correct month and day if her biography is updated. Most of the 2017 general election intake do not have listed birthdates, and their age is not listed as a result. This will be updated in future versions of this dataset.
[^3]: e.g. ‘bid’ can be both a noun, as in a bid submitted in response to a project tender, and a verb, as in to bid for an item at an auction.
[^4]: Rheault et al. (2016) have a more complex method of calculating polarity that accounts for lexical types. See their paper and the related repository for details.
[^5]: As in, a new identifier for each new Prime Minister and/or general election.
[^6]: e.g. the two Labour MPs named John Smith who were both members of the House between 1989 and 1992.