I recently joined the popular social photo sharing website 500px.com. As a photography enthusiast, I’m very interested in which photos resonate with people, and the idea of seeing which of my photos drew the most likes and favorites was quite appealing. So I signed up and uploaded a few photos, and within a few minutes, sure enough, I had amassed a few likes and favorites. Great. I also saw that 500px takes these likes and favorites (and possibly many other things) and uses them to compute what they call the “Pulse” score. Basically, the more likes and favorites you get, the higher your Pulse score becomes, and the better your photo is ranked on the site.

Now that I had a couple of photos uploaded, I realized that I had absolutely no idea what the Pulse score actually means. One photo got an 85. That sounds pretty good, I guess? I mean, out of 100, that’s like a B, right? Did that mean my photo was more popular than 85% of the other photos? Well, no. Did it have 85 likes? Wrong again. Since 500px doesn’t publish the algorithm they use to calculate Pulse, we really don’t know what it means. In fact, we don’t even know what the inputs are. We can watch the Pulse score change as we get likes and favorites, so we can say pretty confidently that they play a role, but that’s about it.

Now that I was thinking about it, I had quite a few questions. How popular were my photos compared to the others? What percentage of photos made it above 80 on the Pulse score, and thus earned the “Popular” designation?

So I set out on a fact finding mission. I figured that there had to be some information about Pulse available online since the site is so popular, but I wasn’t able to uncover anything. No amount of Google searching answered my questions. I saw a few posts doing some analysis on which cameras people used the most and which categories were most popular, but still nothing on the Pulse score. It did, however, spark the idea that if I were able to download some metadata about the photos, I could just look at what the scores were and put my own photos in some sort of context.

Cue The 500px API

Alright folks, techie stuff starts here, so if statistics and computer code make you squeamish, you had better skip down to the results section.

As many websites do these days, 500px exposes an API as a web service, basically allowing us to download snippets of their data. I decided that I would download a list of all of the photos uploaded in a day and look at their scores. From there, I could calculate the average, plot the distribution, and find out what percentage of the photos made it to the Popular section. One day’s worth of photos turned out to be about 28,000, a large enough sample to be reasonably indicative of the broader population.

Since I knew that I would be doing some statistical analysis on the data I downloaded, I chose to access the API through R.  I wrote up a simple R script, which you can see below, to download a day’s worth of photo metadata, and then calculate percentiles, and plot the distribution on a histogram.  My R coding ability is fairly feeble, but the script below seems to get the job done.  It takes a while to run though, just because of the number of calls it has to make to the API in order to download sufficient data.

A couple notes about the R script:

  • In order to connect to the API, you need a consumer key, which you can obtain by registering on their website.  Their documentation spends a lot of time talking about OAuth. Since we aren’t building an app that integrates with 500px, we don’t need to worry about that.  All we need here is public data, which is possible to obtain just by passing the consumer key to the API web service in the query string.
  • I chose to download photo metadata from the “Fresh Yesterday” population.  This gives the photos a full 24 hours to reach their max pulse score.  Scores are docked 10 points after the first day, and a smaller amount each day thereafter, so in the vast majority of cases, photos will reach their highest score during the first day.  I don’t expect this simplification to affect the results in a meaningful way.
  • The data is returned in JSON format, hence my use of the RJSONIO library. The response tells you how many pages it contains, which in my case was 273. I hard coded this value in the script in the interest of time, even though it would be better to pull it from the response.
  • I only downloaded every 5th page.  Rather than downloading the entire day’s worth of data, I only looked at 20%.  For the statistically inclined, the mean and standard deviation of our sample gives us a 99% confidence interval of about 1.  So we know that the mean we calculate will have a 99% chance of being within 1 point of the true mean of all photos uploaded that day.  Close enough for me.  I did however make sure to take every 5th page, rather than the first 20% to give me a good cross section from the entire day, so as not to overemphasize certain times, and thus time zones and regions where certain types of users might be more active.
  • Note that we are only talking about photos uploaded on one particular day here. We aren’t trying to extrapolate this to include every photo on the site, for instance. In fact, we can say pretty safely that the same results will not hold site-wide, since the Pulse algorithm has been rewritten at least once since 500px went live.
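The confidence-interval figure in the sampling note above is easy to reproduce with the usual normal approximation. Here is a quick sketch; the sample standard deviation of 25 is illustrative, not the actual statistic from my download:

```r
# 99% confidence-interval half-width for a sample mean,
# using the normal approximation: z * s / sqrt(n).
ci_half_width <- function(s, n, conf = 0.99) {
  z <- qnorm(1 - (1 - conf) / 2)  # ~2.576 for 99%
  z * s / sqrt(n)
}

# With an illustrative sd of 25 and ~5,500 photos (20% of 28,000):
ci_half_width(25, 5500)  # about 0.87, i.e. under 1 point
```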

The first thing I did was connect to the API and execute a query on the web service to download the metadata. The API call we will be taking advantage of is the photo listing endpoint (/v1/photos).

For this I wrote two functions: getLine() and getPage(). Since the JSON data that comes from 500px is paginated, getPage() selects a stream of content and supplies a page number to get the relevant set of data. It then removes the header information and returns only the array of photos. getLine() simply extracts the fields we want from each photo record and vectorizes them, so that they are easy to bind into a data frame.


library("RJSONIO")
library("RCurl")

# Extract the fields we care about from photo record i as a vector
getLine <- function(i, p) {
  return(c(p[[i]]$id, p[[i]]$highest_rating))
}

# Download one page of photo metadata and return it as a data frame
getPage <- function(pnum) {
  base_url <- "https://api.500px.com/v1/photos"
  rpp <- "100"
  feature <- "fresh_yesterday"
  sort <- "created_at"
  page <- pnum
  include_states <- "voted"
  consumer_key <- "Your consumer key here"
  reqURL <- paste0(base_url, "?page=", page, "&rpp=", rpp, "&feature=", feature,
                   "&sort=", sort, "&include_states=", include_states,
                   "&consumer_key=", consumer_key)
  responseStr <- getURLContent(reqURL)
  data <- fromJSON(responseStr, simplify = FALSE)
  photos <- data[["photos"]]
  df <- as.data.frame(do.call(rbind, lapply(seq_along(photos), getLine, p = photos)))
  return(df)
}
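As noted above, hard coding the page count is a shortcut; the count is reported in the response itself, so it can be parsed out instead. Here is a sketch, assuming the field is named total_pages (worth verifying against an actual response):

```r
library("RJSONIO")

# Pull the page count out of a JSON response instead of hard coding it.
# Assumes the response carries a "total_pages" field.
pageCount <- function(responseStr) {
  data <- fromJSON(responseStr, simplify = FALSE)
  data$total_pages
}
```

With this, the hard coded npage value below could be replaced by a call against the first page’s response.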

Now that we’ve got our helper functions set up, we actually need to download the data.  As I mentioned, we’ll be downloading every 5th page, and binding the results together into a nice data frame.


npage <- 273                        # pages reported by the API for this day
myseq <- seq(1, npage, by = 5)      # every 5th page, i.e. 20% of the data
percent <- seq(.01, .99, by = .01)  # percentiles to compute later
pdata <- do.call(rbind, lapply(myseq, function(x) getPage(toString(x))))

The Results

The first thing I wanted to do was calculate a table of percentiles. I used R’s built-in quantile function and then set up a graph of the results.


percentile <- do.call(rbind, as.list(quantile(pdata[,2], percent)))

plot(c(0, 100), c(0, 100), type = "n", xlab = "Percentile", ylab = "Pulse Score",
     main = "Pulse Score Percentile")
lines(percentile, type = "l")

What we get is the following chart, showing percentile on the x-axis, and the Pulse score required to achieve that percentile on the y-axis.

[Figure: Pulse score by percentile]

What we see is that there are a few photos that don’t get any likes at all, followed by a steep increase in the pulse scores of lower ranked images.  From there, the relationship appears to be largely linear.  Obviously, the most interesting thing about this graph is the sharp upward curve at a pulse score of about 70.  This corresponds to the huge increase in visibility that a photo gets from being included in the Upcoming section, which happens once a photo reaches a pulse score of 70.  We can see that there are very few photos that reach a score of 70, but don’t make it any further.

We can also view the same information in tabular format:

[Table: Pulse score by percentile]

So already, we have answered two of my original questions.  We now know that the median pulse score is about a 60.  So if your photo has less than a 60, it falls into the bottom half of all photos and vice versa.  Also, looking at the table, a pulse score of 80 is about the 69th percentile, meaning that about 31% of the photos uploaded make it to the Popular section.
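Going the other direction, R’s ecdf() turns the sample into a lookup: feed it a score, get back the fraction of photos at or below it. A sketch with a handful of made-up scores standing in for the real pdata[,2] column:

```r
# Empirical CDF: what fraction of sampled photos score at or below x?
# Substitute pdata[,2] for the illustrative scores below.
scores <- c(0, 21, 45, 60, 72, 81, 88, 93)
pct <- ecdf(scores)
pct(80)  # 5 of the 8 scores are <= 80, so 0.625
```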

The percentiles allow us to see how our photos are doing relative to other users, but I wanted to better visualize the distribution of all photos, so I created a histogram.


hist(pdata[,2], breaks = 50, freq = FALSE, main = "Pulse Score Distribution")
curve(dnorm(x, mean = mean(pdata[,2]), sd = sd(pdata[,2])), add = TRUE,
      col = "darkblue", lwd = 2)

[Figure: histogram of all Pulse scores with normal curve overlay]

What I can see from the histogram above is that there are really two different worlds within 500px: the photos that make it to the Popular section, and those that do not. Each segment of the population displays a significantly different distribution of scores. Because of this, the blue curve, a normal distribution with the mean and standard deviation of our sample, fits neither the right nor the left side of the graph well.
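The poor fit can be quantified rather than just eyeballed. Base R’s shapiro.test() (which caps out at 5,000 observations, so a sub-sample would be needed for the full data) firmly rejects normality for bimodal data of this shape. A sketch on simulated two-world scores:

```r
# A bimodal sample, like the two-population score distribution,
# fails a Shapiro-Wilk normality test.
set.seed(1)
scores <- c(rnorm(500, mean = 40, sd = 15),  # the sub-70 world
            rnorm(500, mean = 85, sd = 6))   # the Popular world
shapiro.test(scores)$p.value  # far below 0.05: not normal
```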

Segmenting The Data

I decided that since the photos above and below 70 display such different characteristics, I would divide the population there and see if this allowed any further insight into either portion.


# Split the sample at a Pulse score of 70
lower <- pdata[pdata$V2 < 70, ]
upper <- pdata[pdata$V2 >= 70, ]

hist(lower$V2, breaks = 14, freq = FALSE)
curve(dnorm(x, mean = mean(lower$V2), sd = sd(lower$V2)), add = TRUE,
      col = "darkblue", lwd = 2)

hist(upper$V2, breaks = 30, freq = FALSE)
curve(dnorm(x, mean = mean(upper$V2), sd = sd(upper$V2)), add = TRUE,
      col = "darkblue", lwd = 2)

Let’s start with the lower segment, from 0 to 70 Pulse. What we can see here is that the distribution is quite jagged. This is because at lower Pulse levels, individual likes have a much larger effect on the score. For instance, I’ve noticed that if a photo receives a like shortly after being uploaded, it gains 20.7 points. Many photos will therefore skip the 0 to 20 Pulse range completely, which is partly why we see so few photos there. Similarly, you can see spikes at other scores that represent commonly occurring circumstances, such as one or two likes.

[Figure: histogram of Pulse scores below 70]
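Those spikes are easy to pick out directly by tabulating the scores and sorting by frequency. A sketch with made-up values standing in for lower$V2:

```r
# The most common score values in the lower segment reveal the spikes.
# Substitute lower$V2 for the illustrative vector below.
scores <- c(20.7, 20.7, 35.2, 20.7, 41.4, 35.2, 0, 20.7)
head(sort(table(scores), decreasing = TRUE), 3)  # 20.7 dominates here
```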

You’ll see that the normal distribution still doesn’t fit the lower portion of the sample very well.  This shouldn’t be that surprising simply because the underlying data is not normally distributed.  It is the product of some algorithm, with a density function that is not normal.

Let’s take a look at the upper portion of the distribution, from 70 to 100. Here we see a much more familiar bell-shaped curve. This is for several reasons. In the upper portion of the distribution, likes are worth small fractions of a point, which produces a much smoother curve rather than the 10 or 20 point jumps we saw in the lower half. Also, the vast majority of photos in this range are displayed in the Popular section of the site, so they get a lot more views. Hey, we all like pretty things to look at, right?

[Figure: histogram of Pulse scores from 70 to 100]

So What Does This All Mean?

My takeaway from all of this is that 500px is all about exposure. There are really two components to this. Since photos are displayed as pages of thumbnails, the first challenge is to get viewers to the page where your thumbnail resides. In reality, this means the Upcoming and Popular pages. We can see strong evidence of this in the almost complete lack of photos that end up with a rating in the 70s. As soon as photos crest 70, they are displayed on the Upcoming page, where they get a lot more eye traffic than the Fresh page. This increased viewership brings a lot more likes and favorites, and quickly shoots the photos up into the 80s and 90s.

The second challenge, once you have the viewer on your page, is to actually get them to click on your image. My theory is that this has a lot to do with how good your photo is in comparison to the other photos on the same page. My gut tells me that most people will click on one or two images per page before moving on. This is also consistent with the distributions we have observed. Since Upcoming is sorted by the time each photo reached a score of 70, great photos have a good chance of being the only great photo on the page, so they quickly gain Pulse. However, once they reach Popular, photos are sorted by rating. In other words, you start at the back and have to work your way forward. With each page, the photos get better and better, making a given photo less and less likely to attract viewers. This helps to explain the decline in the number of photos that reach the upper 90s.

Wrapping it Up

For me this was an exercise motivated purely by curiosity. I wanted to know how good my photos were in comparison to the others on 500px. And I answered my questions: about 31% of photos achieve the Pulse score of 80 they need to be included in the Popular section, and the median score is about 60.

I didn’t discover any secret recipe for getting 99.9 ratings, and I’m glad that I didn’t.  In fact, quite the opposite happened.  I confirmed my suspicions that much of 500px is about getting eyes on your photos, and how do you do that?  You take better photos.  You get better at post processing your photos.  You build followers.  All of this takes time and hard work, but constantly improving is what motivates us to go out and shoot in the first place, right?  And with that in mind, I wish you happy shooting!


10 thoughts on “What does the 500px pulse score really mean?”

  1. Great article. It looks like 500px recently changed the pulse algorithm. At the very least, it seems to scale slower than before (First vote is worth 12.9 instead of 20.7). I’d be curious to see how those percentiles breakdown now. I’d imagine the percentile of photos reaching Popular has declined since this change.


  2. “You take better photos.”
    Of course, this should be. However, one could also say “You upload photos displaying something, people like”, such as sunrise, sunset, landscape, tiddies, models, pretty women,… and then quality is secondary, which is understandable but sometimes really disappointing.


  3. Thanks for the interesting article. I have a question though. If a person uploads a photo at say 11:00am GMT why does the initial viewing rate not carry on through all the other time zones as they become more active? After all not many people will be viewing that photo at UTC +11. Like you, I am just interested.


  4. An interesting analysis. I wonder how 500px accounts for the time zones. I seem to get a number of votes within an hour or so of posting, yet if the photos are consistently appealing then the rate of voting should be maintained across different time zones as people log in and view them. This does not happen. Why do you think that a posting at say, 8:00am EST, gets a number of votes within an hour but then tails off, why is this consistency not maintained as the other time zones become more active? I am just interested, as you were.


    • Hi Tony, I think this is simply because their main user base is in North America and Europe. It’s all about how many eyes you get on the page where your photo is listed, and there just aren’t enough users in the other regions. After an hour, you’ve probably just fallen too far off of the fresh and upcoming pages to get the views.

