* Is There An Olympic Event for Data Geeks?

[fa icon="clock-o"] 8/22/16 4:00 PM [fa icon="user"] Jon Thompson [fa icon="folder-open'] Power BI

If you're like us, you're sad to see it go.  The only truly global event in existence.  Yes, the summer Olympics are over.  From synchronized swimming to steeplechase, there’s no better way to eat potato chips and judge world-class athletes. The Blue Margin team has been enjoying the Olympics and, in true data-geek fashion, wanted to bring the numbers to life in Power BI.

 

The catch? Obtaining data for the current Olympics in a usable format is challenging and/or expensive. However, the data is available in pieces on various websites, so we decided to take a crack at scraping those sites using Power BI. Although scraping data can be easy, the task presented a few interesting challenges. We started by adding a web data source using the following URL:https://www.rio2016.com/en/medal-count-sports. The data on the website looks beautiful.

 

 

 download (17).png

But the scraped data . . . not so much.

 download (18).png

 

 

We then combined the tables from the web page and got to a better place, but still not great. Take note of the cells with more than one value. Those are “ties”, and they make this process a little more challenging (not Olympic challenging, but harder than eating potato chips).

 

 

We analyzed the actual contents and formatting of the “tie” cells. It wasn’t pretty. For the country, the ties were separated by a line feed, 3 tabs, another line feed, and another 3 tabs. Yikes. Same issue for athlete ties, except with 4 tabs. Last but definitely not least, we found one cell that separated athlete ties by a line feed, a bunch of non-breaking spaces, another line feed, etc. (We imagined a web developer saying “Hmmm, I think I’ll add a line feed here, and a tab there, and one more space…perfect! They’ll never figure it out” as they formatted the web page and laughed maniacally.) After finding the magic combination of line feeds and tabs, we replaced all line feeds with a pipe character (i.e., one of these – “|” - looks like a capital “I”, but is taller and can usually be found above the “\” key).

 

download (19).png

 

We also replaced all tabs with a pipe. This left us with a bunch of pipes follow by spaces. So, we replaced all instances of pipes followed by spaces with a single pipe, then replaced all repeated pipes with a single pipe.

Download Now: Guidelines for Data Validation Whitepaper

Why all the pipes? Basically, we chose a character that was unlikely to show up in the source data. We used that character to help eliminate the white space associated with ties and to put as a separator between the country and athlete so that we could break them into columns after the unpivot process. Whew!

Anyway, lots of pipes and various transformations later, we ended up with 30 columns of alternating countries and athletes (i.e. GoldCountry1, GoldAthlete1, GoldCountry2, GoldAthlete2, … BronzeCountry5, BronzeAthlete5). This was definitely overkill as it allowed for 5 ties, but hey, anything is possible at the Olympics, right?

Then, we did a few more transformations to show each medal winner as a "country | athlete" pair, then un-pivoted the table (which is always a little bit magical).

 

 

We eliminated any single-pipe (empty) rows, split the value column into two (country and athlete), and voila! Usable data! This is the kind of stuff that gets nerds up in the morning. That and coffee. There is probably a more elegant way to achieve the same result using M-code, but we were looking to get it done quick-and-dirty just for the fun of playing with the data right away.

 

olympic+6.png

 

So what next? First, we grabbed some historical summer Olympics data online and added it to our data model so that we could do some year-over-year comparisons. It turns out that report-ready data for summer Olympics 1896-2008 is readily available from many different sources. We downloaded an Excel file of medal-winners and pulled it into our data model along with a scrape from Wikipedia of all the different countries that have participated in the summer Olympics (https://en.wikipedia.org/wiki/List_of_participating_nations_at_the_Summer_Olympic_Games). Then we added a few dimension tables, created relationships, clapped our hands and ta-da! It’s dashboard time!

 

 

Pretty cool, huh? From this dashboard we can easily tell how many different athletes have medaled in the summer Olympics, how many different sports have been played, the trend of total participating countries and athletes throughout the years, top athletes, and more. You could spend hours creating cool measures, building interesting visuals, and learning more about the Olympics than your average potato-chip-eating Joe will ever know.

Pulling the data wasn’t the simplest process, but it allowed us to get what we wanted without doing a full data discovery or writing complex code. For a quick afternoon dashboarding session (one of our favorite past-times), this worked perfectly.

Here's a link to the .pbix file if you'd like to check it out.

 
guidelines-data-validation-blue-margin-power-bi

Your Colorado Microbrew Recommendation:


New Belgium Brewing - Fort Collins:  Allagash Fat Funk (ABV 5.6%)

Does watching the Olympics make you feel slightly out of shape?  Don't let those world record-setting athletes get you down!  Grab yourself an Allagash Fat Funk - an Olympic-style collaboration between New Belgium Brewing and Allagash Brewing from Portland, Maine.  Allagash took New Belgium back to its Belgian roots with a De Dolle yeast with hints of pear and banana (that's healthy, right?) with a spicy finish.  Get your fat funk on!

Jon Thompson

Written by Jon Thompson