Emmy Awards 2018: can data predict Best Drama?

Emmy Awards 2018 - The Handmaid's Tale

Data can do anything.

On July 12th, the Emmy Awards 2018 nominees for Outstanding Drama were announced, celebrating the year's most binge-worthy TV. The nominations were as varied as ever, with entries like the retro-stalgic and family-friendly Stranger Things contending with post-watershed heavyweights such as HBO's Game of Thrones.

Topical analysts that we are - and following our World Cup sticker calculations - we thought we'd have a go at predicting the winners.

Despite the variety in the Emmy Awards 2018 lineup, best drama nominees must - by definition - share common attributes: namely, quality and critical acclaim.

But how does one measure quality, precisely?

If you were asked to rank your own favourite shows this year, you might not struggle, but defining how this subjective ranking system works often proves challenging.

If your favourite drama last year featured dragons, predicting that Game of Thrones would top your list in 2018 would be fairly straightforward, right?

Not exactly.

If the only thing a show had to do to be crowned your personal #1 was to feature a dragon, you'd (probably) be in the minority.

What if two shows fit the bill?

You'd need a second variable. Perhaps you prefer shows with larger casts over more intimate dragon-centric ensembles, or maybe shows with the most profanity always earn your favour.

If more than one show also meets the criteria, or if a second judge enters the equation - as in Emmy-reality - guessing the favourite becomes increasingly complex.

Emmy Awards 2018: How to (try and) guess the winner

In attempting to predict this year's best drama winner, and to cope with the number of variables at play, we used a logistic regression model. Based on the data available, this process works by first assessing whether previous winners share similarities, and then by assessing how closely this year's nominees fit the criteria shared by past victors.

Still following?

Basically, it comes down to a question of which show looks most like a winner.

The criteria used to characterise nominees had to be applied to every entry in the dataset, so each criterion was expressed as a numeric descriptor. For example, entries featuring a female lead received a 1 in the Female Lead column, while shows without one received a 0. We also inputted:

Aggregated review scores from various sites
Genre flags
Whether nominated seasons featured the demise of a main cast member
Their production network
The number of nominations they had received in other categories
Several other variables that could be applied across all entries.
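
As a rough illustration of this kind of encoding, here is a minimal Python sketch. It is not the actual pipeline: the column names, the list of networks and the example record are all hypothetical, chosen only to show how yes/no attributes and categories can be turned into numbers.

```python
# Hypothetical sketch: turning one nominee's description into numeric columns.
# Column names, the network list and the example values are illustrative only.

NETWORKS = ["HBO", "AMC", "Netflix", "Hulu"]  # assumed category list


def encode_entry(entry):
    """Return a dict of numeric descriptors for one nominated season."""
    row = {
        "female_lead": 1 if entry["female_lead"] else 0,
        "main_cast_death": 1 if entry["main_cast_death"] else 0,
        "review_score": float(entry["review_score"]),
        "other_nominations": int(entry["other_nominations"]),
    }
    # One 0/1 column per network, so a category becomes several flags.
    for network in NETWORKS:
        row["network_" + network] = 1 if entry["network"] == network else 0
    return row


example = {
    "female_lead": True,
    "main_cast_death": False,
    "review_score": 91.0,     # made-up value
    "other_nominations": 12,  # made-up value
    "network": "Hulu",
}
encoded = encode_entry(example)
print(encoded)
```

Every entry ends up with the same set of columns, which is what allows past winners and this year's nominees to be compared like for like.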

Thirty-nine past entries, each described by a total of fifty variables, were used to create the model. A flag of whether each entry had won in its respective year was also inputted to identify these records as our target. The seven 2018 nominees were then included in the dataset, differentiated by a second flag to note that their composition was not to be considered a factor, but that they were to be scored.
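
The setup described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the model itself: it uses a hand-rolled logistic regression fitted by gradient descent rather than whatever tooling was actually used, and the two features, the eight rows and the `score_only` flag name are invented for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def propensity(weights, x):
    """Predicted probability of winning; weights[0] is the intercept."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))


def fit_logistic(rows, labels, lr=0.5, epochs=5000):
    """Fit logistic regression weights by batch gradient descent on log-loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = propensity(weights, x) - y
            grads[0] += err
            for j, v in enumerate(x):
                grads[j + 1] += err * v
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grads)]
    return weights


# Invented toy data: features are [female_lead, scaled_review_score].
# "won" is the target flag; "score_only" marks this year's nominees,
# which are scored by the model but never used to fit it.
dataset = [
    {"features": [1, 0.90], "won": 1, "score_only": False},
    {"features": [0, 0.40], "won": 0, "score_only": False},
    {"features": [1, 0.80], "won": 1, "score_only": False},
    {"features": [0, 0.60], "won": 0, "score_only": False},
    {"features": [1, 0.30], "won": 0, "score_only": False},
    {"features": [0, 0.70], "won": 0, "score_only": False},
    {"features": [1, 0.85], "won": None, "score_only": True},
    {"features": [0, 0.50], "won": None, "score_only": True},
]

train = [r for r in dataset if not r["score_only"]]
weights = fit_logistic([r["features"] for r in train], [r["won"] for r in train])

scores = [propensity(weights, r["features"]) for r in dataset if r["score_only"]]
print(scores)
```

The nominee that most resembles the past winners in this toy data receives the higher propensity, which is the "looks most like a winner" idea in miniature.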

The model also assessed which of the variables used to describe our entries were most important, to avoid drawing false conclusions. An example of this is The Handmaid’s Tale, featuring Elisabeth Moss in the titular role, which won last year’s Outstanding Drama award. Whilst featuring the star is unlikely to damage any show’s chances of critical acclaim, the nominated series she’s not in don’t become ineligible for the award by default - which the model would hypothetically account for by establishing her impact over all entries assessed.

And the winner is…

The model outputs a propensity for each show to have won in its respective year of entry. This determines a predicted winner for each year, as shown:

Emmy awards 2018 predictions

Overall, the model correctly predicted 5 out of the past 6 winners – although this isn’t as sterling an endorsement of its capacity to predict this year’s winner as it might first appear.
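
To make the step from propensities to a predicted winner concrete, here is a small Python sketch: within each year, the nominee with the highest propensity is the prediction, and the hit rate is the share of years where that prediction matches the actual winner. The years, show names and propensity values below are placeholders invented for the example, not the model's real output.

```python
# Each tuple: (year, show, propensity, actually_won). All values are invented.
scored = [
    (1, "Show A", 0.62, True),
    (1, "Show B", 0.30, False),
    (2, "Show C", 0.41, False),
    (2, "Show D", 0.35, True),
]


def predicted_winners(rows):
    """Pick the highest-propensity nominee in each year."""
    best = {}
    for year, show, p, _ in rows:
        if year not in best or p > best[year][1]:
            best[year] = (show, p)
    return {year: show for year, (show, _) in best.items()}


def hit_rate(rows):
    """Return (years predicted correctly, years with a known winner)."""
    predictions = predicted_winners(rows)
    actual = {year: show for year, show, _, won in rows if won}
    hits = sum(1 for year, show in actual.items() if predictions.get(year) == show)
    return hits, len(actual)


print(predicted_winners(scored))
print(hit_rate(scored))
```

In this toy data the second year is a miss: the actual winner had a lower propensity than a rival, the same pattern as the 2012 result discussed next.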

In 2012 for example, both Breaking Bad and Game of Thrones were predicted as more likely winners than Homeland was. Yet Homeland won:

Emmy awards 2018 predictions

OK...

This is because Homeland has not gone on to win Outstanding Drama since 2012, whilst Breaking Bad and Game of Thrones have both won twice since 2012. As far as the model is concerned, in 2012, both Breaking Bad and Game of Thrones looked like winners, because they have both since become winners.

In 2018, the frontrunner for Outstanding Drama according to our model is currently The Handmaid’s Tale, now in its second season, followed closely by Game of Thrones, then Stranger Things:

Emmy awards 2018 predictions

While The Handmaid’s Tale has only won once previously (given that it’s currently only in its second season), it’s also the only current nominee to have a previous victory under its belt, aside from Game of Thrones (as summarised above). Due to the limited size of the dataset, The Handmaid’s Tale in 2018 is deemed to be the most likely winner largely because it most closely fits the description of the show with the highest historical ratio of wins to nominations: The Handmaid’s Tale in 2017.

Similarly, Game of Thrones is the only current entrant to have won twice previously. While it has also lost three times, the model still identifies the series as possessing winning characteristics.

Despite the impact of prior wins on the propensity to win in future, several other variables have also been identified as significant in terms of what you need to take home the gold:

Whether or not a show is produced by an on-demand service
Reviewer reactions to a show’s first season
Whether or not a show is in its inaugural season or has many behind it.

So where do we go from here?

Emmy Awards 2018 predictions... Take Two

Ultimately, regression models were designed to be used with thousands or millions of records - not thirty-nine. Future iterations of the model could be improved - and better validated - if more years of data were collated. There are several reasons for this.

Firstly, less data makes for less accurate variable assessment and weighting. Take network, for example: ten of our nominated seasons were produced by HBO, of which two have won to date, and the same is true of AMC. Broadly, this would suggest that being produced by HBO or AMC makes no difference to the odds of success. However, this ratio would be likely to change if we increased our available data points, meaning that the network would most likely gain significance in predicting a winner.
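
One way to see how little two wins from ten seasons pins down is to put an interval around the win rate. The sketch below uses a Wilson score interval, which is our choice for illustration and not part of the model described above; the larger sample of 100 seasons is hypothetical.

```python
import math


def wilson_interval(wins, n, z=1.96):
    """Approximate 95% Wilson score interval for a win rate of wins / n."""
    p = wins / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Two wins from ten nominated seasons, as for HBO and AMC above.
low, high = wilson_interval(2, 10)

# Same 20% win rate over a hypothetical 100 seasons, for comparison.
low_big, high_big = wilson_interval(20, 100)

print(f"10 seasons:  {low:.2f} to {high:.2f}")
print(f"100 seasons: {low_big:.2f} to {high_big:.2f}")
```

With only ten seasons, the plausible win rate spans roughly 6% to 51%, so an apparent tie between two networks says very little; the interval narrows considerably as the number of seasons grows.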

Furthermore, despite the vast quantities of TV watching this approach would likely necessitate, including variables which only describe specific seasons of a show - not the show in general - would improve the validation process and model. If The Handmaid’s Tale in 2018 looks largely dissimilar to The Handmaid’s Tale in 2017 according to the data, e.g. due to variables relating to specific scenes in each season, 2018’s season will only be predicted as the winner if it shares similarities to other winning seasons of other shows - not just to its victorious self.

The winners of the Emmy Awards 2018 will be announced on September 17th.

(We plan to build a second model before then, and to make a new prediction using our updated approach, so watch this space.)

Who do you think will take the trophy at the Emmy Awards 2018? Share your predictions below.
