Top Selling Books

Project Proposal

The goal of this project was to collect data from a variety of sources, prepare the data, merge and store the datasets into a database, and create visualizations of the data.


Data

I obtained data from the following sources:

  • Goodreads-books (Kaggle) - This CSV file is a comprehensive list of all books listed in Goodreads. It contains 11,127 books with the following 12 variables for each book:
    bookID - unique identification number
    title - title of the book
    authors - name of book authors
    average_rating - average rating on Goodreads
    isbn - ten-digit unique identifier
    isbn13 - thirteen-digit unique identifier
    language_code - primary language
    num_pages - total number of pages
    rating_count - total number of ratings received
    text_reviews_count - total number of reviews written
    publication_date - date of book publication
    publisher - name of book publisher

  • List of best-selling books (Wikipedia) - The list of best-selling books consists of the following variables:
    book - book title
    authors - name of book authors
    original language - language in which the book was originally written
    first published - year in which the book was first published
    approximate sales - approximate number of copies sold, in millions
    genre - the genre of the book

  • Open Library API - The Open Library API provides information on over 20 million book editions. I used the isbn13 number from the Goodreads CSV to obtain additional information about the books. The data extracted included title, genres, languages, publish_country, and other fields.
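The ISBN-13 lookup described for the Open Library API can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes the public `https://openlibrary.org/isbn/<isbn>.json` edition endpoint, and the helper names `build_isbn_url`, `extract_fields`, and `fetch_edition` are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint: edition lookup by ISBN, returning one JSON record.
OPEN_LIBRARY_ISBN_URL = "https://openlibrary.org/isbn/{isbn}.json"


def build_isbn_url(isbn13):
    """Return the Open Library lookup URL for one ISBN-13."""
    return OPEN_LIBRARY_ISBN_URL.format(isbn=str(isbn13).strip())


def extract_fields(record):
    """Pick out the fields of interest; missing keys become None or []."""
    return {
        "title": record.get("title"),
        "genres": record.get("genres", []),
        # Languages arrive as [{"key": "/languages/eng"}]; keep only the code.
        "languages": [
            lang["key"].rsplit("/", 1)[-1]
            for lang in record.get("languages", [])
        ],
        "publish_country": record.get("publish_country"),
    }


def fetch_edition(isbn13, timeout=10):
    """Download and parse one edition record (requires network access)."""
    url = build_isbn_url(isbn13)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return extract_fields(json.load(resp))
```

Separating the parsing step from the network call keeps `extract_fields` easy to check against a saved sample record, since many editions omit fields such as genres or publish_country.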

Data Preparation

I applied data transformation and cleaning techniques to each of these datasets, as seen in the following:

Goodreads CSV
List of best-selling books
Open Library API
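As a rough illustration of the kind of cleaning involved, the sketch below tidies a few Goodreads columns using only the standard library. The specific steps are assumptions rather than the ones used in the project: it assumes header names may carry stray spaces, that publication_date uses a month/day/year format, and that malformed rows can simply be dropped. The function name `clean_goodreads_rows` is hypothetical.

```python
import csv
import io
from datetime import datetime


def clean_goodreads_rows(csv_text):
    """Yield cleaned rows; skip rows whose fields fail to parse."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Normalize header names in case they are padded with spaces.
    reader.fieldnames = [name.strip() for name in (reader.fieldnames or [])]
    for row in reader:
        try:
            yield {
                "bookID": int(row["bookID"]),
                "title": row["title"].strip(),
                "average_rating": float(row["average_rating"]),
                "isbn13": row["isbn13"].strip(),
                "num_pages": int(row["num_pages"]),
                # Assumed date format: month/day/year, e.g. 9/16/2006.
                "publication_date": datetime.strptime(
                    row["publication_date"].strip(), "%m/%d/%Y"
                ).date(),
            }
        except (KeyError, ValueError, TypeError, AttributeError):
            continue  # malformed row: drop it rather than guess a value
```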

SQL Database and Visualizations

Once each dataset was prepared, I merged and stored them in a SQLite database and created visualizations of my findings, as seen in the following:

Code for SQL and Visuals
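The merge-and-store step can be sketched with Python's built-in `sqlite3` module. This is a simplified example under stated assumptions: the table layouts, the `build_database` helper, and the choice to join on a case-insensitive title match are all hypothetical, and the linked code may merge the datasets differently.

```python
import sqlite3


def build_database(goodreads_rows, bestseller_rows, path=":memory:"):
    """Load both datasets into SQLite and create a merged table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE goodreads (title TEXT, authors TEXT, average_rating REAL)"
    )
    conn.execute(
        "CREATE TABLE bestsellers (book TEXT, approximate_sales REAL, genre TEXT)"
    )
    conn.executemany("INSERT INTO goodreads VALUES (?, ?, ?)", goodreads_rows)
    conn.executemany("INSERT INTO bestsellers VALUES (?, ?, ?)", bestseller_rows)
    # Assumed join key: titles compared after trimming and lowercasing.
    conn.execute(
        """
        CREATE TABLE merged AS
        SELECT g.title, g.authors, g.average_rating,
               b.approximate_sales, b.genre
        FROM goodreads AS g
        JOIN bestsellers AS b
          ON lower(trim(g.title)) = lower(trim(b.book))
        """
    )
    conn.commit()
    return conn
```

Passing a file path instead of the default `":memory:"` would persist the database to disk, and the `merged` table can then be queried to feed the visualizations.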

Conclusion

This project provided many learning opportunities. As a whole, it took me through the data wrangling process from start to finish. As a hands-on learner, I feel it gave me a solid understanding of data wrangling.