Analyzing Go code with BigQuery

Recently my colleague Felipe Hoffa told me about a new public dataset in BigQuery: ALL THE PUBLIC GITHUB CODE!

Counting Go files

As a gopher, my first reaction was to check how many Go files are in that dataset. My SQL is not amazing, but I’m able to do that!

SELECT COUNT(*)
FROM [bigquery-public-data:github_repos.files]
WHERE RIGHT(path, 3) = '.go'

Running that query, I see that there are more than 12 million files with a .go extension in the dataset. That’s a lot! But wait … I just ran that query on TWO BILLION ROWS and it finished in 6 seconds? Wow! 😮

Counting all the files with a .go extension on GitHub

Ok, so that’s awesome! But I also processed 105GB, and since the cost of a query is proportional to the amount of data it scans (even though the first TB per month is free), it’s probably a good idea to create a new dataset and a new table containing just the files with a .go extension, to minimize the cost.
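A minimal sketch of that extraction query, assuming you point its result at a destination table of your own (here a hypothetical mydataset.go_files, set via the query options in the BigQuery UI):

SELECT *
FROM [bigquery-public-data:github_repos.files]
WHERE RIGHT(path, 3) = '.go'

From then on, queries against that table only scan the Go files instead of the full two billion rows.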
