Recency Bias in Phish.net Show Ratings

Thursday 04/23/2020 by phishnet


[we'd like to thank Prof. Paul Jakus, @paulj, for yet another thought-provoking statistical analysis of Phish.net data - ed.]

Phish.net show ratings are meant to convey Phish fans’ collective perception of how good a show was, but these ratings are subject to a number of biases. For example, .net ratings do not come from a random sample (sampling bias), and people tend to rate the shows they’ve attended quite highly (attendance bias).

Another possible bias, which the .net Cognoscenti have termed “Recency Bias”, stems from the tendency to rate a show within the first few days after the performance, if not immediately afterward. The belief is that ratings posted in the immediate aftermath of a concert reflect the warm glow of that experience: people have not yet taken the time to reflect on the quality of that show relative to the performances immediately before or after, or within the context of an entire Phish tour. Recency bias implies that a show’s rating will decline as its warm glow dissipates.

It occurred to me that I could estimate the magnitude of recency bias using a Phish show database I’ve periodically updated since Summer 2018. We’ll look solely at the 21-show Summer 2018 tour, which started at Lake Tahoe on July 17 and ended at Dick’s on September 2. For each show, we can use snapshots of .net ratings taken on October 2, 2018, on May 5, 2019, and on April 2, 2020. Thus, we have ratings taken one month after the conclusion of tour, 8 months after tour, and 19 months after tour.

Here are the ratings time paths of three Summer 2018 shows [Gorge Night 3 (7/22/18), Bill Graham Civic Auditorium Night 2 (7/25/18), and The Forum Night 1 (7/27/18)]:

Gorge3 and Forum1 both show slightly declining ratings over time, while BGCA2 shows a slight uptick. Gorge3 fell by 0.118 points, as 95 new ratings came in between October 2018 and April 2020. In contrast, over this same time period, ratings by 34 new people pulled the BGCA2 rating up by 0.058 points—so immediate ratings might not always be “too high”.

On average, Phish.net Summer 2018 show ratings were about 0.051 points lower in April 2020 than they had been in October 2018. This sort of observation—declining ratings over time—is why people were thinking about recency bias. However, this simple difference doesn’t measure the bias because the April 2020 rating includes the contributions of both those who rated while still in the warm glow of tour (rating while “hot”) and the “cooler heads”—those who waited until well after the tour had concluded.

Fortunately, we can use some simple algebra to extract the implicit average show rating for the cooler heads: multiply the mean rating by the number of raters for April 2020 and again for October 2018, take the difference, and then divide by the number of new raters. (This approach assumes that no one who rated a show before October 2018 went back and changed their rating.) The “cool” ratings are based on anywhere from 29 to 164 new raters for a given show (mean=80) so the sample sizes are reasonable for this calculation. This is what we get:

Tahoe1, Gorge3, and Forum1, in particular, did not fare as well when cooler heads prevailed, with drops of 0.5 points or more. However, BGCA2 was more than 0.5 points higher than it was in October 2018.
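
The cool-head calculation can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis: the function names are made up here, and the first example uses invented numbers. The second function rearranges the same algebra so that the gap between cool and hot ratings can be recovered from the change in a show's rating and the rater counts quoted in this post, under the same assumption that no early rater went back and changed a rating.

```python
def cool_mean(mean_oct, n_oct, mean_apr, n_apr):
    """Implied mean rating of the raters who arrived between two snapshots.

    Assumes nobody who rated before the first snapshot later changed a rating.
    """
    n_new = n_apr - n_oct
    if n_new <= 0:
        raise ValueError("the later snapshot must contain new raters")
    # Total rating points at each snapshot = mean * number of raters.
    return (mean_apr * n_apr - mean_oct * n_oct) / n_new


def cool_minus_hot(change, n_oct, n_new):
    """Cool mean minus hot mean, given the change in the overall mean."""
    if n_new <= 0:
        raise ValueError("need at least one new rater")
    return change * (n_oct + n_new) / n_new


# Invented numbers, for illustration only: 100 early raters averaging 4.0,
# and 125 raters in total averaging 3.9 at the later snapshot.
print(round(cool_mean(4.0, 100, 3.9, 125), 3))

# Figures quoted in the post: Gorge3 fell 0.118 with 470 early / 95 new raters;
# BGCA2 rose 0.058 with 295 early / 34 new raters.
print(round(cool_minus_hot(-0.118, 470, 95), 3))
print(round(cool_minus_hot(0.058, 295, 34), 3))
```

With the quoted figures, the gaps come out at roughly -0.70 for Gorge3 and +0.56 for BGCA2, in line with the drop of 0.5 points or more and the gain of more than 0.5 points described above. Because the published changes are rounded, these values are approximate.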

The average Summer 2018 show rating, as measured in October 2018, was 4.065; when measured using only the cooler heads, the average show rating is 3.768. This implies a mean recency bias of almost 0.3 points (7.3%). Some 57% of Summer 2018 shows were rated at 4.0 or greater using the hot ratings whereas only 38% of shows exceeded 4.0 using the cool ratings.
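
The arithmetic behind these summary figures can be checked directly. In this small sketch, the show counts of 12 and 8 are not stated in the post; they are inferred from the reported percentages and the 21-show tour, and the variable names are illustrative.

```python
hot_avg = 4.065   # average show rating, October 2018 snapshot
cool_avg = 3.768  # average implied rating from the later raters
n_shows = 21      # shows on the Summer 2018 tour

bias_points = hot_avg - cool_avg
bias_pct = 100 * bias_points / hot_avg
print(f"{bias_points:.3f} points, {bias_pct:.1f}% of the hot average")

# Assumed counts: 12 of 21 and 8 of 21 shows match the reported 57% and 38%.
print(round(100 * 12 / n_shows), round(100 * 8 / n_shows))
```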

The comparison of the hot versus cool show ratings shakes up the Top Five Summer 2018 shows rather thoroughly:

Ranking | Hot Ratings  | October 2018 Raters/New Raters | Cool Ratings | October 2018 Raters/New Raters
1       | Dick’s 1     | 521/164                        | Alpharetta 3 | 397/95
2       | MPP 2        | 515/153                        | Alpharetta 2 | 367/78
3       | Alpharetta 1 | 505/147                        | BGCA 1       | 315/48
4       | Gorge 3      | 470/95                         | BGCA 2       | 295/34
5       | Alpharetta 3 | 397/95                         | Alpharetta 1 | 505/147

“New Raters” = # of raters between October 2018 and April 2020

It would have been nice if measured bias had been relatively constant (small variance) because then it could be ignored. The graph below, which simply repackages the data used in the bar graph above, shows that recency bias is not particularly stable.

Why do we observe a large bias for some shows and a smaller bias for other shows? Well, I tried running a few statistical models to explain the difference—controlling for free webcasts, the number of new raters, the initial rating, etc.—but nothing was obvious. If you have any suggestions as to why we see this pattern, let me know…


14 comments - Link: http://phi.sh/b/5e9ff635

Comments

2020-04-23 9:11 am, comment by FunkDog

My own bias is that Phïsh is playing better. They started out great and never stopped. The ratings are based on numbers. So larger audience leads to larger numbers in the polls for more recent shows. When shows have a lower rating from 1993 than 2013 or 2019, it’s because the way they are ranked is based on number count.

So maybe go by percentage of votes for each year. This will reflect the percentage of votes of people who were fans at the time. So if Phïsh just keeps going and the audience grows and grows, going by percents of votes, instead of number of votes, year by year, can accommodate the growing audience and provide ratings that are reflective of the audience size voting. This will also offer some clarity for past years as well.

Score: -1

2020-04-23 9:34 am, comment by Tenaciousdnj

I would like to see the impact that setlists have upon the ratings. It is likely easier for a show to get higher ratings when song selection is strong. Initial excitement over setlist may cloud judgement on how good the playing/jams were. On the flip side, a show may not gain as much favor or excitement initially when the setlist is not as strong. Figuring out if this is the main factor would likely be tough. You would first have to figure out some kind of rankings for how popular each song is among the fanbase. To then have to go through every show and determine whether it has a setlist that would be strong on paper would probably be difficult. However, using the examples you used one could definitely argue that Gorge 3 and Forum 1 have much stronger setlists than the BGCA 2 show. Those two showed a rating decline over time and the BGCA show showed a rating increase over time.
My theory would be that setlists are the biggest contributing factor to bias in the initial ratings. Some shows with strong setlists may also just be all around great shows, in which case their rating may not change much. You see this with the Alpharetta 2018 shows for example. This theory may not be entirely possible to test without being somewhat subjective, but based on my own personal experience I think setlist can have a huge impact in the way a show is viewed upon first impression.

Score: 1

2020-04-23 11:34 am, comment by AntelopeFreeway

Thanks for the article. I'm interested in a similar topic, which is balancing the bias between shows in different eras. Or, how to determine what year was rated as Peak Phish, according to the show ratings database. Enjoyed your work!

Score: 1

2020-04-23 11:48 am, comment by BigJibbooty

What @tenaciousdnj says definitely seems to make sense, if you look at the shows (on consecutive nights) that went up and down by the most - BGCA 2 and Forum 1. Forum 1 looks far better on paper, while BGCA 2 features the on paper less than inspiring run to end Set 1 of Ocelot, Waking Up Dead, Number Line and More, followed by the at that time not yet renowned jam vehicle SYSF starting off Set 2. But Ocelot is a rare jam chart version, and SYSF is a monster that "made" that song. Anyone looking at these two shows on paper wouldn't know that....but on relisten would be like "wow".

That all being said, I have no idea why Alpha 1 dropped....that show kicked ass on paper and on relisten. Though it is still the top ranked show of 2018 summer tour, which it should be!

Score: 1

2020-04-23 12:01 pm, comment by Svenzhenz

There is something to be said about song selection and song length that contributes (rightly or not) to a show's rating. BGCA 2 is a good example of a show with its centerpiece being a new song (SYSF) with a great jam, and how that can move ratings over time. The "vintage" of SYSF (2nd performance ever) initially hurt this show's rating as most fans are jaded a-holes and don't want songs written after 1996 in their shows. Yet, because Phish fans are also fickle bastards, they soon realized that this version of SYSF (with a VERY long jam) does indeed kick ass, and should be rated higher than it was initially rated.

Score: 1

2020-04-23 12:20 pm, comment by Capt_Tweezerpants

Love this stuff, great job.

I just want to point out that I do go back and change ratings sometimes, but maybe I am an outlier.

In particular, after the Bakers Dozen, I reexamined how I rate shows, resulting in a stricter system that required lowering ratings on 2016 shows.

Score: 0

2020-04-23 4:09 pm, comment by ekstewie1441

I prefer recency bias. My instinct is that a higher percentage of people voting shortly after the show actually heard the whole show - either attended, streamed, LivePhish, or other.

Somebody that listens to, say, a couple songs from MPP Tweezerfest and thinks, "Meh, doesn't have a 28+ minute jam and don't care much for Waiting All Night" or someone who looks at Charlotte 2019 setlist and song lengths and thinks "sucks they close first sets now with new songs" - to me - is less informed and useful than someone who attended.

Score: 3

2020-04-23 5:48 pm, comment by justinofhudson

This is a great article! Thanks

Score: 1

2020-04-24 10:00 am, comment by SpinningPiper

There should be two ratings shown for every show: Attendees and Non-Attendees. If the user has the "I was there" checked for a given show, their vote goes into the Attendee rating. If the user does not, their vote goes to the Non-Attendee rating. I don't think this would be that difficult to set up.

Score: 0

2020-04-24 2:08 pm, comment by BCADrummer

Do you include couch tour as attendance? Not fighting, just curious.

@SpinningPiper said:

There should be two ratings shown for every show: Attendees and Non-Attendees. If the user has the "I was there" checked for a given show, their vote goes into the Attendee rating. If the user does not, their vote goes to the Non-Attendee rating. I don't think this would be that difficult to set up.

Score: 0

2020-04-24 9:28 pm, comment by Lemuria

This is super yummy - thank you, Paul!!

Score: 0

2020-04-24 10:22 pm, comment by lysergic

This is excellent.

Score: 0

2020-04-26 9:40 am, comment by ColForbin

Very interesting analysis – nice work!

Score: 0

2020-04-26 5:38 pm, comment by HuckZ

It's so very different from one person to another. I have a friend who refuses to own Phish studio recordings, and I'm not crazy about half the live stuff he flips out over. There's a little Phish for everyone.

Check out Phish's first jump into Virtual Reality gaming:

https://youtu.be/a0PAHW48NaM

Score: 0