Hansard Speeches and Sentiment V2

A public dataset of speeches in Hansard. The dataset provides information on every speech made in the House of Commons between the return of parliament after the 1979 general election and the dissolution of parliament for the 2017 general election, with information on the speaking MP, their party, gender and age at the time of the speech. The dataset also includes all speeches made from 1936 to the dissolution of parliament for the 1979 general election. The post-1979 election dataset is labelled senti_post_v2 and the pre-1979 election dataset is labelled senti_pre_v2.

The senti_post_v2 dataset contains 2,234,229 speeches and 404,589,163 words. The senti_pre_v2 dataset contains 2,977,498 speeches and 413,046,298 words. Both datasets can be accessed through Zenodo, and are distributed under a Creative Commons 4.0 BY-SA licence. They are currently available as CSV files; if you would like other formats, please get in touch. The latest version, V2.0, improves the consistency of the sentiment calculations, using five different libraries with the same method of calculation for each library, and corrects several misidentified speeches. It also includes all speeches up to the dissolution of parliament for the 2017 general election.

Note that these files are UTF-8 encoded. When I’ve opened them on a Windows computer, some characters have not rendered correctly.
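One way to avoid rendering problems of this kind is to state the encoding explicitly when reading the file, rather than relying on the platform default. The following is a minimal sketch using Python's standard library. The helper name `read_speeches` and the sample file are made up for illustration and are not part of the dataset or its code.

```python
import csv
import os
import tempfile

def read_speeches(path):
    """Read a CSV file as UTF-8 and return its rows as dictionaries."""
    # The encoding is set explicitly so the platform default
    # (often cp1252 on Windows) is never used.
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))

# Tiny stand-in file with made-up content (not taken from the dataset).
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(sample_path, "w", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    writer.writerow(["id", "speech"])
    # \u00e9 is an accented "e", a character that breaks under a wrong encoding.
    writer.writerow(["example-1", "The caf\u00e9 was mentioned."])

rows = read_speeches(sample_path)
```

The same idea applies in other tools: tell the reader that the file is UTF-8 instead of letting it guess.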

Sentiment Classification Methods

The speeches have been classified for sentiment using four libraries from the R package lexicon by Tyler Rinker, and one from the syuzhet package by Matthew Jockers. All classification used the method in Tyler Rinker’s sentimentr package. The libraries are:

The AFINN library by Finn Årup Nielsen, labelled afinn. The AFINN library was accessed through Matthew Jockers’s syuzhet package.
A variant of the syuzhet library by Matthew Jockers, included in the lexicon package, labelled jockers.
The NRC Word-Emotion Association Lexicon library by Saif M. Mohammad and Peter D. Turney, labelled nrc. The NRC library was accessed through the lexicon package.
The Sentiwords library, created by Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani, labelled sentiwords. The library was accessed through the lexicon package.
The Opinion Mining, Sentiment Analysis and Opinion Spam Detection dataset by Bing Liu, Minqing Hu and Junsheng Cheng, labelled hu. The library was accessed through the lexicon package.
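To give a rough sense of how a lexicon-based score and its standard deviation can be produced for a speech, here is a simplified sketch in Python. It is not the sentimentr algorithm, which also accounts for valence shifters such as negators and amplifiers. The toy lexicon, the function names, and the reading of the `_sentiment` and `_sd` columns as the mean and standard deviation of sentence-level scores are all assumptions made for illustration.

```python
import math
import re
import statistics

# Hypothetical toy lexicon; the real lexicons contain thousands of entries.
TOY_LEXICON = {"good": 1.0, "welcome": 0.5, "bad": -1.0, "failure": -0.75}

def sentence_score(sentence, lexicon):
    """Sum the polarity of known words, scaled by sqrt(word count)."""
    words = re.findall(r"[a-z']+", sentence.lower())
    if not words:
        return 0.0
    total = sum(lexicon.get(word, 0.0) for word in words)
    return total / math.sqrt(len(words))

def speech_sentiment(speech, lexicon):
    """Return (mean, standard deviation) of sentence scores for one speech."""
    sentences = [s for s in re.split(r"[.!?]+", speech) if s.strip()]
    scores = [sentence_score(s, lexicon) for s in sentences]
    mean = statistics.mean(scores) if scores else 0.0
    # A sample standard deviation needs at least two sentences.
    sd = statistics.stdev(scores) if len(scores) > 1 else None
    return mean, sd

mean_score, sd_score = speech_sentiment(
    "This is a good and welcome bill. The delay was a failure.", TOY_LEXICON
)
```

Swapping in a different lexicon changes only the dictionary passed to `speech_sentiment`, which mirrors the idea of applying one method of calculation across all five libraries.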

Summary Statistics

I have produced summary statistics of mean sentiment and speech length, for each MP, and by party, party group, government or opposition status and gender. Download all five tables in one XLSX workbook.
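If you want to produce similar summaries from the CSV rows yourself, a grouped mean is enough for most of them. The sketch below uses Python's standard library. The function name `mean_by` and the example rows are made up, and treating `NA` as a missing value is an assumption about how the CSV files were written.

```python
from collections import defaultdict
from statistics import mean

def mean_by(rows, group_key, value_key):
    """Average a numeric column within each level of a grouping column."""
    groups = defaultdict(list)
    for row in rows:
        value = row.get(value_key)
        if value in (None, "", "NA"):
            continue  # skip missing values, e.g. R's NA written to CSV
        groups[row[group_key]].append(float(value))
    return {group: mean(values) for group, values in groups.items()}

# Made-up rows for illustration; the values are not from the dataset.
rows = [
    {"party": "Party A", "afinn_sentiment": "0.20", "word_count": "100"},
    {"party": "Party A", "afinn_sentiment": "0.40", "word_count": "300"},
    {"party": "Party B", "afinn_sentiment": "-0.10", "word_count": "50"},
    {"party": "Party B", "afinn_sentiment": "NA", "word_count": "70"},
]
sentiment_by_party = mean_by(rows, "party", "afinn_sentiment")
length_by_party = mean_by(rows, "party", "word_count")
```

Passing `gender`, `party_group` or `government` as the grouping column gives the other breakdowns in the same way.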

Dataset Variables

The senti_post_v2 and senti_pre_v2 datasets have slightly different variables, as there is more information available for all post-1979 MPs, and that is included in senti_post_v2.

`senti_post_v2` Dataset Variables

Variable	Description	Data Type (R)
`id`	ID for each speech, corresponding to the parlparse ID	character
`speech`	The actual text of the speech	character
`afinn_sentiment`	The `afinn` sentiment score	numeric
`afinn_sd`	The standard deviation of the `afinn` score	numeric
`jockers_sentiment`	The `jockers` sentiment score	numeric
`jockers_sd`	The standard deviation of the `jockers` score	numeric
`nrc_sentiment`	The `nrc` sentiment score	numeric
`nrc_sd`	The standard deviation of the `nrc` score	numeric
`sentiword_sentiment`	The `sentiword` sentiment score	numeric
`sentiword_sd`	The standard deviation of the `sentiword` score	numeric
`hu_sentiment`	The `hu` sentiment score	numeric
`hu_sd`	The standard deviation of the `hu` score	numeric
`word_count`	The word count of the speech	numeric
`speech_date`	The date the speech was made	date
`time`	The time the speech was made (not consistently available), stored as a character vector	character
`url`	The URL of the speech	character
`as_speaker`	Whether the speaker is the Speaker of the House	logical
`speakerid`	One of three ID schemes used in the `parlparse` scraper	character
`person_id`	One of three ID schemes used in the `parlparse` scraper	character
`hansard_membership_id`	One of three ID schemes used in the `parlparse` scraper	character
`mnis_id`	The ID used by the Members’ Names Information Service	character
`age`	Age at time of speech	integer
`party_group`	Grouping of political parties	factor
`ministry`	Identifier for the government at time of speech	factor
`government`	An indicator of whether the MP is a member of the governing party (or parties), or in the opposition	factor
`proper_name`	The MP’s name	character
`house_start_date`	The date the MP was first elected to the House of Commons	date
`date_of_birth`	MP’s date of birth	date
`house_end_date`	The date the MP left the House of Commons	date
`gender`	One of Male or Female	factor
`party`	The political party the MP belonged to at time of speech	character
`dods_id`	Dods Monitoring ID	integer
`pims_id`	Parliamentary Information Management Services ID	integer
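The `age` variable can be checked against `date_of_birth` and `speech_date`. The sketch below, in Python, assumes that dates are stored as ISO-format strings (YYYY-MM-DD) and that `age` means completed years at the time of the speech. Both are assumptions, and `age_at` is a hypothetical helper rather than the code used to build the dataset.

```python
from datetime import date

def age_at(date_of_birth, speech_date):
    """Completed years between two ISO-format (YYYY-MM-DD) date strings."""
    born = date.fromisoformat(date_of_birth)
    spoke = date.fromisoformat(speech_date)
    # Subtract one if the birthday has not yet occurred in the speech year.
    before_birthday = (spoke.month, spoke.day) < (born.month, born.day)
    return spoke.year - born.year - int(before_birthday)

# Made-up dates for illustration.
day_before_birthday = age_at("1950-06-15", "1990-06-14")
on_birthday = age_at("1950-06-15", "1990-06-15")
```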

`senti_pre_v2` Dataset Variables

Variable	Description	Data Type (R)
`speech`	The actual text of the speech	character
`id`	ID for each speech, corresponding to the parlparse ID	character
`hansard_membership_id`	One of three ID schemes used in the `parlparse` scraper	character
`speech_date`	The date the speech was made	date
`year`	The year a speech was made	integer
`speakerid`	One of three ID schemes used in the `parlparse` scraper	character
`person_id`	One of three ID schemes used in the `parlparse` scraper	character
`colnum`	The column number the speech appears in Hansard publications	integer
`time`	The time the speech was made (not consistently available), stored as a character vector	character
`url`	The URL of the speech	character
`as_speaker`	Whether the speaker is the Speaker of the House	logical
`afinn_sentiment`	The `afinn` sentiment score	numeric
`afinn_sd`	The standard deviation of the `afinn` score	numeric
`jockers_sentiment`	The `jockers` sentiment score	numeric
`jockers_sd`	The standard deviation of the `jockers` score	numeric
`nrc_sentiment`	The `nrc` sentiment score	numeric
`nrc_sd`	The standard deviation of the `nrc` score	numeric
`sentiword_sentiment`	The `sentiword` sentiment score	numeric
`sentiword_sd`	The standard deviation of the `sentiword` score	numeric
`hu_sentiment`	The `hu` sentiment score	numeric
`hu_sd`	The standard deviation of the `hu` score	numeric
`word_count`	The word count of the speech	numeric
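Because the two files do not have identical columns, combining them into one series means restricting each row to the variables they share. The following Python sketch works this out from the two variable lists above. The helper names `shared_columns` and `project` are made up for illustration.

```python
# Column names copied from the two variable tables.
SENTIMENT_COLUMNS = [
    "afinn_sentiment", "afinn_sd", "jockers_sentiment", "jockers_sd",
    "nrc_sentiment", "nrc_sd", "sentiword_sentiment", "sentiword_sd",
    "hu_sentiment", "hu_sd", "word_count",
]
POST_COLUMNS = ["id", "speech"] + SENTIMENT_COLUMNS + [
    "speech_date", "time", "url", "as_speaker", "speakerid", "person_id",
    "hansard_membership_id", "mnis_id", "age", "party_group", "ministry",
    "government", "proper_name", "house_start_date", "date_of_birth",
    "house_end_date", "gender", "party", "dods_id", "pims_id",
]
PRE_COLUMNS = [
    "speech", "id", "hansard_membership_id", "speech_date", "year",
    "speakerid", "person_id", "colnum", "time", "url", "as_speaker",
] + SENTIMENT_COLUMNS

def shared_columns(first, second):
    """Columns present in both lists, in the order of the first list."""
    second_set = set(second)
    return [name for name in first if name in second_set]

def project(row, columns):
    """Keep only the given columns of a row (a dict keyed by column name)."""
    return {name: row[name] for name in columns}

shared = shared_columns(POST_COLUMNS, PRE_COLUMNS)
only_pre = [name for name in PRE_COLUMNS if name not in set(POST_COLUMNS)]
only_post = [name for name in POST_COLUMNS if name not in set(PRE_COLUMNS)]
```

Applying `project(row, shared)` to rows from either file gives records with the same keys, which can then be concatenated.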

Methodology

The parlparse project provides scraped XML files of Hansard debates going back to 1936, and assigns an ID to each speaker. However, I could not find where the IDs assigned are linked to other information, such as constituencies or parties, or the MNIS ID system used by parliament. Long-serving MPs may also have dozens of these IDs assigned to them, and they are not consistently linked together. There are also substantial numbers of speeches where no ID is assigned to a speaker, and they are classified as ‘unknown’.

I created a table with every possible combination of name and ID, and matched the speakers in that table to their MNIS ID, using a mixture of exact string, approximate string and manual matching. The information in this table was then matched to the complete list of speech IDs. In the case of commonly used names (e.g. the two Labour MPs named John Smith who were both members of the house between 1989 and 1992) I manually identified which MP was actually speaking by locating adjacent Hansard records where their full name, constituency or ministerial title was used. In a handful of cases I had to use the content of their speech and any adjacent speeches to provide further clues to an MP’s identity.
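The exact-then-approximate matching step can be sketched as follows. This is an illustration in Python using the standard library's `difflib`, not the code used to build the dataset (which is linked below). The names, IDs, the `match_speaker` helper and the 0.85 similarity cutoff are all made up.

```python
import difflib

# Hypothetical lookup from canonical MP name to MNIS ID (made-up values).
KNOWN_MPS = {"Jane Example": "9001", "Robert Sample": "9002"}

def match_speaker(raw_name, known, cutoff=0.85):
    """Return (mnis_id, method) using exact, then approximate, matching."""
    name = " ".join(raw_name.split())  # normalise whitespace
    if name in known:
        return known[name], "exact"
    close = difflib.get_close_matches(name, list(known), n=2, cutoff=cutoff)
    if len(close) == 1:
        return known[close[0]], "approximate"
    # No candidate, or more than one: leave for manual review.
    return None, "manual"

exact = match_speaker("Jane  Example", KNOWN_MPS)
approximate = match_speaker("Jane Exampel", KNOWN_MPS)
unresolved = match_speaker("Unknown Person", KNOWN_MPS)
```

A lookup keyed on name alone cannot separate two MPs who share a name, which is why cases like the two John Smiths still need manual identification from the surrounding records.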

The code and matching data used to generate this dataset are available on GitHub.

Notes

Sarah Olney (mnis_id 4591) does not have a birth date listed in the Members’ Names Information Service, and I have been unable to locate her date of birth elsewhere, only the year of birth. As a consequence, her birthdate is listed as 1977-01-01. This will be amended to the correct month and day when her biography is updated.

The data used to create this dataset was taken from the parlparse project operated by They Work For You and supported by mySociety.

The dataset is licensed under a Creative Commons Attribution 4.0 International License.

The code included in this repository is licensed under an MIT license.

Please contact me if you find any errors in the dataset. The integrity of the public Hansard record is questionable at times, and while I have improved it, the data is presented ‘as is’.

Citing this dataset

Please cite this dataset as:

Odell, Evan. (2017). ‘Hansard Speeches and Sentiment V2.0 [Dataset].’ http://doi.org/10.5281/zenodo.579712.

The DOI is 10.5281/zenodo.579712

