This case study was done using Advanced SQL Analytics queries.

AntaraChat/SQL---IMDb-Movie-Analysis

IMDb-Movie-Analysis using SQL

  • This case study was analysed using advanced SQL.
  • It also contains a separate PDF file, "Executive Summary", that summarises the entire analysis.
  • It was done as part of the Executive PG Diploma Data Science program of UpGrad in collaboration with IIIT Bangalore.
  • Language used: Advanced SQL
  • Software used: SQL Workbench

Steps to follow to use the analysis SQL file:

  • Import the SQL text file "IMDB+dataset+import".
  • Next, import the SQL file "IMDB+question".
  • Continue with your own additional analysis (optional).

Use SQL on a Movie Database to Decide What to Watch

Table of Contents

  • Completing the SQL Movie Database Download
  • SQL Exercises on a Movie Database
  • Finding All the Movies for a Given Director
  • Using SQL on a Large Existing Movie Database

We’ll demonstrate how to use SQL to parse large datasets and gain valuable insights, in this case, to help you choose what movie to watch next using an IMDb dataset.

Not sure what to watch tonight? Are you browsing Netflix endlessly? Decide what to watch using the power of SQL! In this article, we’ll download a dataset directly from IMDb and load it into a SQL database. We’ll then analyze the data in different ways, such as sorting movies by their rating, by which actors star in them, or by other similar criteria.

As mentioned in this blog post on how to practice SQL , the best way to practice SQL is by gaining hands-on experience in solving real-world problems, which is exactly what we’ll be doing.

If you have a basic knowledge of SQL, you should be able to follow this article easily. If you have no IT experience whatsoever, consider starting with this SQL A to Z Learning Track designed for people who have no experience in IT and want to start their adventure with SQL.

Let’s get started by learning how to get the movie data into our SQL database.

Let’s walk through the process of downloading our data and loading it into a database management system (DBMS), step by step. Common DBMSs include MySQL, Oracle DB, PostgreSQL, and SQL Server.

Although this article focuses on movie data, you can choose an entirely different dataset. Check out this list of free online datasets you can use and find the one you are interested in. The import of these datasets will be similar regardless of what dataset you use.

Open whatever variety of SQL you are using. For this example, I’ll be using SQL Server Management Studio, but the steps should be similar for all of the other varieties of SQL out there. Let’s get started:

  • Download the dataset files from https://datasets.imdbws.com/ . The data is refreshed daily. The files are:
      • name.basics.tsv.gz
      • title.basics.tsv.gz
      • title.akas.tsv.gz
      • title.crew.tsv.gz
      • title.episode.tsv.gz
      • title.principals.tsv.gz
      • title.ratings.tsv.gz
  • Extract the downloaded gzip (.gz) archives. The end result will be a TSV (tab-separated values) file for each table.
  • Open each file in a spreadsheet application like Google Sheets or Microsoft Excel.
  • Find and replace all occurrences of “\N” (IMDb's marker for a missing value) with an empty cell.
  • Save the file as a CSV file. This will make it easier to import into the DBMS of your choice.
  • Open your DBMS.
  • Create a new database by right-clicking on the left pane and selecting “New Database.” I’ve named my new database “imdb.”

  • Set valid data types for each column you are importing. I recommend using nvarchar(MAX) for string columns, since you do not know how long the strings will be for each field. You can change the column datatype later if required.

  • Repeat this process for each of the files you have downloaded.

After completing these steps, your SQL movie database will be in place! You are now ready to start analyzing and querying the data.

Thankfully, this dataset came with some descriptive documentation . To get an even better idea of the data, you can quickly select the top 1000 rows from each table.

Let’s start looking for our first movie. Imagine you want to watch a horror movie. How can we isolate only the horror movies? Fortunately, this task is frighteningly simple.
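One way to write such a filter is sketched below. To keep it runnable without a database server, the SQL is wrapped in Python's built-in sqlite3 module; the table and column names (title_basics, primaryTitle, genres) follow the IMDb documentation, while the three rows and their tconst values are made up for illustration. The SELECT statement itself should carry over to other DBMSs with little or no change.

```python
import sqlite3

# In-memory stand-in for the imported title_basics table (rows are invented).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE title_basics ("
    "tconst TEXT, titleType TEXT, primaryTitle TEXT, startYear INTEGER, genres TEXT)"
)
conn.executemany(
    "INSERT INTO title_basics VALUES (?, ?, ?, ?, ?)",
    [
        ("tt0000001", "movie", "Night Terror", 1995, "Horror,Thriller"),
        ("tt0000002", "movie", "Sunny Days", 2001, "Comedy"),
        ("tt0000003", "videoGame", "Haunted Halls", 2005, "Adventure,Horror"),
    ],
)

# genres holds a comma-separated list, so look for Horror anywhere in the string.
horror_query = """
SELECT primaryTitle
FROM title_basics
WHERE genres LIKE '%Horror%'
ORDER BY primaryTitle
"""
horror_titles = [row[0] for row in conn.execute(horror_query)]
print(horror_titles)
```

Note that this filter matches every title type, not only movies, because nothing restricts titleType yet.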

If this query causes any confusion, open this SQL cheat sheet to refresh your knowledge. Have this cheat sheet open for the rest of the tutorial to help you along!

What if we wanted to refine this horror movie list further? We could restrict the results to horror movies created after 1990, with an average rating above 9.0 and at least 10,000 votes.

This will involve getting data from multiple tables. Opening each table and taking a look at the column headers, we can see the following tables will be involved:

  • title_basics : handles the genre of movie and the release year (represented by the column startYear ).
  • title_ratings : handles the rating ( averageRating ) and votes ( numVotes ).

The two tables can be joined on the shared column, tconst . As explained in the IMDb documentation here , tconst is an alphanumeric unique identifier of the title. Let’s write our query:
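A minimal sketch of such a join follows, again using Python's sqlite3 as a stand-in for the real DBMS. The column names come from the IMDb documentation; the sample rows and tconst values are invented, and were chosen so that the only match happens to be a video game.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE title_basics (
    tconst TEXT PRIMARY KEY, titleType TEXT, primaryTitle TEXT,
    startYear INTEGER, genres TEXT);
CREATE TABLE title_ratings (
    tconst TEXT PRIMARY KEY, averageRating REAL, numVotes INTEGER);
-- Invented sample rows, not real IMDb data.
INSERT INTO title_basics VALUES
    ('tt0000001', 'movie', 'Night Terror', 1995, 'Horror,Thriller'),
    ('tt0000002', 'videoGame', 'Haunted Halls', 2005, 'Adventure,Horror'),
    ('tt0000003', 'movie', 'Old Ghost', 1985, 'Horror');
INSERT INTO title_ratings VALUES
    ('tt0000001', 7.5, 20000),
    ('tt0000002', 9.3, 15000),
    ('tt0000003', 9.4, 50000);
""")

# Join on tconst, then apply the year, rating, and vote thresholds.
query = """
SELECT b.primaryTitle, b.titleType, r.averageRating, r.numVotes
FROM title_basics AS b
JOIN title_ratings AS r ON r.tconst = b.tconst
WHERE b.genres LIKE '%Horror%'
  AND b.startYear > 1990
  AND r.averageRating > 9.0
  AND r.numVotes >= 10000
"""
matches = conn.execute(query).fetchall()
print(matches)
```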

Executing this query returns a single result, but not the result we want! On closer inspection, we can see that this title is a video game, not a movie. Let’s alter our query to include only movies, and expand the search by reducing the minimum number of votes required to 1,000 and the minimum rating required to 8.0.
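The adjusted query might look like the sketch below: it adds a titleType = 'movie' condition and relaxes the thresholds to a rating above 8.0 and at least 1,000 votes. As before, sqlite3 stands in for the real DBMS and the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE title_basics (
    tconst TEXT PRIMARY KEY, titleType TEXT, primaryTitle TEXT,
    startYear INTEGER, genres TEXT);
CREATE TABLE title_ratings (
    tconst TEXT PRIMARY KEY, averageRating REAL, numVotes INTEGER);
-- Invented sample rows, not real IMDb data.
INSERT INTO title_basics VALUES
    ('tt0000001', 'movie', 'Night Terror', 1995, 'Horror,Thriller'),
    ('tt0000002', 'videoGame', 'Haunted Halls', 2005, 'Adventure,Horror'),
    ('tt0000003', 'movie', 'Cheap Scare', 2010, 'Horror');
INSERT INTO title_ratings VALUES
    ('tt0000001', 8.4, 1500),
    ('tt0000002', 9.3, 15000),
    ('tt0000003', 8.9, 200);
""")

# Same join as before, restricted to movies and with relaxed thresholds.
query = """
SELECT b.primaryTitle, r.averageRating
FROM title_basics AS b
JOIN title_ratings AS r ON r.tconst = b.tconst
WHERE b.titleType = 'movie'
  AND b.genres LIKE '%Horror%'
  AND b.startYear > 1990
  AND r.averageRating > 8.0
  AND r.numVotes >= 1000
"""
movie_matches = conn.execute(query).fetchall()
print(movie_matches)
```

Here the video game is excluded by its title type and the third title by its low vote count.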

Executing this query also yields a single result! Looks like we won’t have to decide what to watch anymore, since there’s only one option that fits our criteria!

Let’s run through another scenario. What if we want to see all of the movies Steven Spielberg has directed? How would this work?

By looking through the tables, we can determine the following:

  • name_basics : It contains the names of all actors, writers, directors, and others involved in the creation of film and TV titles.
  • title_crew : It acts as a linking table for titles, directors, and writers. We’ll use this table to connect Steven Spielberg to the titles he’s involved with.
  • title_basics : We have already used this table. It contains title information like the name, title type, release year, and genres.

Let’s get to work! Let’s write a query for the name_basics table to try and find the famous director Steven Spielberg.
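A lookup along these lines is sketched below with Python's sqlite3 standing in for the real DBMS. The column names follow the IMDb documentation; the two rows are hand-entered sample data (the second person is fictional), so treat the identifiers as illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE name_basics ("
    "nconst TEXT PRIMARY KEY, primaryName TEXT, primaryProfession TEXT)"
)
# Hand-entered sample rows; the second person is fictional.
conn.executemany(
    "INSERT INTO name_basics VALUES (?, ?, ?)",
    [
        ("nm0000229", "Steven Spielberg", "producer,writer,director"),
        ("nm9999999", "Jane Example", "actress"),
    ],
)

# Look the person up by name to obtain their nconst identifier.
person = conn.execute(
    "SELECT nconst, primaryName FROM name_basics WHERE primaryName = ?",
    ("Steven Spielberg",),
).fetchone()
print(person)
```

In the full dataset several people can share a name, so it is worth checking primaryProfession or knownForTitles before trusting a match.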

Executing this query yields a single result:

This gives us the important value of nconst . From the documentation, we know that nconst is the alphanumeric unique identifier of the name/person.

We can feed this value into the title_crew table, which contains the director and writer information for all the titles in IMDb, and match Steven Spielberg to all the titles he’s involved with.

Executing this query results in a list of 45 titles. You can see from the value of the directors column that Steven Spielberg was the director of them all.

We need a way of using this list of titles alongside the title_basics table to get the name of the movies instead of just the tconst. Let’s use a subquery for this!
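A minimal sketch of that subquery pattern follows: the inner query collects the tconst values from title_crew, and the outer query turns them into titles. It runs on Python's sqlite3 with a handful of hand-entered rows, so the identifiers should be read as sample data rather than a verified extract.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE title_basics (tconst TEXT PRIMARY KEY, primaryTitle TEXT, startYear INTEGER);
CREATE TABLE title_crew (tconst TEXT PRIMARY KEY, directors TEXT, writers TEXT);
-- Hand-entered sample rows for illustration.
INSERT INTO title_basics VALUES
    ('tt0073195', 'Jaws', 1975),
    ('tt0107290', 'Jurassic Park', 1993),
    ('tt0111161', 'The Shawshank Redemption', 1994);
INSERT INTO title_crew VALUES
    ('tt0073195', 'nm0000229', NULL),
    ('tt0107290', 'nm0000229', NULL),
    ('tt0111161', 'nm0001104', 'nm0001104');
""")

# directors can hold several comma-separated nconst values, hence LIKE.
query = """
SELECT primaryTitle
FROM title_basics
WHERE tconst IN (
    SELECT tconst FROM title_crew WHERE directors LIKE '%nm0000229%'
)
ORDER BY startYear
"""
spielberg_titles = [row[0] for row in conn.execute(query)]
print(spielberg_titles)
```

The same result could be produced with a JOIN between the two tables; the subquery form simply mirrors the step-by-step approach taken here.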

Execute this query to see the result:

There we have it, all of the Steven Spielberg movie titles from our database!

Don’t stop here! Write your own custom queries to extract more insights from this large dataset. There are many ways to practice SQL. If you feel like you’ve had enough of working with this dataset, check out this post on 12 Ways to Learn SQL Online for more excellent learning resources.

You have learned how to import a large existing dataset into the DBMS of your choice and how to use SQL to analyze a movie database. This is a powerful tool in your SQL arsenal. Not to mention, you’ll never have to worry about not being able to choose a movie to watch again! Completing SQL exercises on movie databases is a helpful way to learn, but if you would like more structure, check out this SQL Practice Set from LearnSQL.com.

IMDB-Data-Analysis-in-SQL

This project was carried out to answer a set of analytical questions and advise a movie production house on which set of actors, directors, and production houses would be the best fit for a super-hit commercial movie.

Table of Contents (TOC)

  • Database Creation for the Project
  • Table Creation
  • Data Insertion
  • Data Analysis
  • Executive Summary and Recommendations

1. Overview

This analysis is carried out to support RSVP Movies with a well-analyzed list of global stars to plan a movie for the global audience in 2022.

With this, we will be able to answer a set of analytical questions and advise RSVP Movies on which set of actors, directors, and production houses would be the best fit for a super-hit commercial movie.

IMDB Data Analysis in MySQL

RSVP Movies is an Indian film production company that has produced many super-hit movies. They have usually released movies for the Indian audience but for their next project, they are planning to release a movie for the global audience in 2022.

Why this Analysis?

The production company wants to plan its every move analytically, based on data, and has approached us for help with this new project.

We have been provided with the data of the movies that have been released in the past three years. Let’s analyze the data set and draw meaningful insights that can help them start their new project.

We will use SQL to analyze the given data and give recommendations to RSVP Movies based on the insights.

We will carry out the entire analytics process in four segments, where each segment leads to significant insights from different combinations of tables.

2. Database Creation for the Project

a. Check the List of Databases

  • The very first step of any MySQL analysis is to access the database and check if related data is available or not.
  • Use show databases; to access the list of databases:

b. Create Database

  • Create a new database for this project.
  • Use Create database IMDB;
  • Use show databases; to confirm the list of databases:

c. Use Database

  • Instruct the system to use *IMDB Database* by running use imdb;

3. Table Creation

Steps to follow before creating the tables:

  • Download the IMDb dataset and try to understand every table and its importance.
  • Understand the ERD and the table details. Study them carefully and understand the relationships between the tables.

  • Inspect each table given in the subsequent tabs and understand the features associated with each of them.
  • Draft your tables with the correct data types and constraints on paper or in a note file.
  • Open your MySQL Workbench and start writing the DDL and DML commands to create the database.

Create Table

For this project we need a total of 6 tables:

a. Create Table Movie

b. Create Table genre

c. Create Table director_mapping

| Table Name: director_mapping | Column Description |
| --- | --- |
| movie_id | Movie ID of the movie directed by a director |
| name_id | Name ID of the director |

d. Create Table role_mapping

e. Create Table names

f. Create Table ratings

Now, run show tables; to ensure that all six tables have been created.
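As a rough sketch of the six-table layout, the snippet below creates the tables in an in-memory SQLite database through Python's sqlite3 and then lists them, which plays the role of show tables; in MySQL. Only the director_mapping columns are documented above; every other column name, data type, and key here is an assumption made for illustration, and MySQL DDL would use types such as VARCHAR and DATE instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Column lists other than director_mapping are assumed for illustration.
CREATE TABLE movie (
    id TEXT PRIMARY KEY, title TEXT, year INTEGER, date_published TEXT,
    duration INTEGER, country TEXT, worldwide_gross_income TEXT,
    languages TEXT, production_company TEXT);
CREATE TABLE names (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE genre (
    movie_id TEXT REFERENCES movie(id), genre TEXT,
    PRIMARY KEY (movie_id, genre));
CREATE TABLE director_mapping (
    movie_id TEXT REFERENCES movie(id), name_id TEXT REFERENCES names(id),
    PRIMARY KEY (movie_id, name_id));
CREATE TABLE role_mapping (
    movie_id TEXT REFERENCES movie(id), name_id TEXT REFERENCES names(id),
    category TEXT, PRIMARY KEY (movie_id, name_id));
CREATE TABLE ratings (
    movie_id TEXT PRIMARY KEY REFERENCES movie(id),
    avg_rating REAL, total_votes INTEGER, median_rating INTEGER);
""")

# SQLite's counterpart to MySQL's SHOW TABLES.
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(tables)
```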

4. Data Insertion

In the previous steps, we created six tables. Now, we will insert the data into these tables. Here, we show the syntax for inserting five rows into each table. (The complete data insertion script is available in the repository.)

a. Inserting data into Movie Table

b. Inserting data into Genre Table

c. Inserting data into Director_Mapping Table

d. Inserting data into Role_Mapping Table

e. Inserting data into Names Table

f. Inserting data into Ratings Table

Checking tables for inserted values:

Select * from Movie;

Select * from Genre;

Select * from Director_Mapping;

Select * from Role_Mapping;

Select * from Names;

Select * from Ratings;

All the sample data inserted looks good, so we can go ahead with inserting the complete data. For the insertion to work smoothly, let's first remove all rows from the tables using TRUNCATE:

Insert Complete data

Run the command to insert complete data: IMDB File 3 Insert all data

1. Find the total number of rows in each table of the schema.

Alternative 1:

Number of Rows after ignoring the Null Rows

Alternative 2:

Rows count inclusive of Null Rows:

| TABLE_NAME | Tables_in_imdb |
| --- | --- |
| director_mapping | 3867 |
| genre | 14662 |
| movie | 8519 |
| names | 23714 |
| ratings | 8230 |
| role_mapping | 15173 |

2. Which columns in the movie table have null values?

| id_null | title_null | year_null | date_null | duration_null | country_null | world_null | language_null | production_null |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | 20 | 3724 | 194 | 528 |
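A common way to produce per-column null counts like these is SUM over a CASE expression, sketched below with Python's sqlite3 on a trimmed-down movie table. The column names and the three rows are assumptions for illustration, not the project's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movie (id TEXT, title TEXT, country TEXT, production_company TEXT)"
)
# Invented rows: one missing country, two missing production companies.
conn.executemany(
    "INSERT INTO movie VALUES (?, ?, ?, ?)",
    [
        ("m1", "First", "India", "Studio A"),
        ("m2", "Second", None, None),
        ("m3", "Third", "USA", None),
    ],
)

# Each CASE yields 1 for a NULL and 0 otherwise; SUM adds them up per column.
query = """
SELECT
    SUM(CASE WHEN id IS NULL THEN 1 ELSE 0 END) AS id_null,
    SUM(CASE WHEN country IS NULL THEN 1 ELSE 0 END) AS country_null,
    SUM(CASE WHEN production_company IS NULL THEN 1 ELSE 0 END) AS production_null
FROM movie
"""
null_counts = conn.execute(query).fetchone()
print(null_counts)
```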

3.1. Find the total number of movies released each year.

Movies per year:

3.2. Find the total number of movies released each month.

Movies per month:

4.1 Find the count of Indian movies.

4.2 Find the count of movies from the USA.

4.3 Find the count of movies that are from either India or the USA.

4.4 Find the count of movies that are from either India or the USA and were released in 2019.

5. Find the unique list of the genres present in the data set.

6.1 Find the movie count for each genre.

6.2 Find the genre with the maximum number of movies.

6.3 Find the genre with the minimum number of movies.

6.4 Find the top three genres with the maximum number of movies.

6.4 Find the movie count for the Action genre.

6.5 Find the genre count for each movie.

6.6 Find the list of Indian movies that belong to three genres.

6.7 Find the longest Indian movie tagged with three genres.

‘tt6200656’, ‘Kammara Sambhavam’, ‘182’, ‘3’

6.8 Which genres are tagged with ‘Kammara Sambhavam’ movie.

| genre |
| --- |
| Action |
| Comedy |
| Drama |

7.1. How many movies belong to only one genre?

Create a list of Movies with a genre count
Restrict the list to Genre count = 1
Count the total number of rows
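The three steps above can be sketched with a common table expression (CTE). The example runs on Python's sqlite3; the genre table layout (movie_id, genre) and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genre (movie_id TEXT, genre TEXT)")
# Invented rows: m1 and m3 have one genre each, m2 has two, m4 has three.
conn.executemany(
    "INSERT INTO genre VALUES (?, ?)",
    [
        ("m1", "Drama"),
        ("m2", "Drama"), ("m2", "Comedy"),
        ("m3", "Horror"),
        ("m4", "Action"), ("m4", "Comedy"), ("m4", "Drama"),
    ],
)

query = """
WITH genre_counts AS (            -- step 1: genre count per movie
    SELECT movie_id, COUNT(genre) AS genre_count
    FROM genre
    GROUP BY movie_id
)
SELECT COUNT(*)                   -- step 3: count the remaining rows
FROM genre_counts
WHERE genre_count = 1             -- step 2: keep single-genre movies
"""
single_genre_movies = conn.execute(query).fetchone()[0]
print(single_genre_movies)
```

Changing the filter to genre_count = 2 or genre_count = 3 answers the two-genre and three-genre variants of the question.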

7.2. How many movies belong to two genres?

7.3. How many movies belong to three genres?

8.1. What is the average duration of movies in each genre?

8.2. Rank the genres by the average duration of movies in each genre.

9. What is the rank of the ‘Thriller’ genre among all the genres in terms of the number of movies produced?

10. Find the minimum and maximum values in each column of the ratings table except the movie_id column.

11. Which are the top 10 movies based on average rating?

12. Summarize the ratings table based on the movie counts by median ratings.

13. Which production house has produced the most hit movies (average rating > 8)?

Create list of production house with count of movies where average rating > 8 and Ranked over “Movies count”
Applied CTE to pull the production house with Rank = 1
NOTE: applied (production_company IS NOT NULL) as there are few movies without production house name
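A sketch of that approach is shown below: one CTE counts hit movies per production house, a second ranks the counts with RANK(), and the final SELECT keeps rank 1. It uses Python's sqlite3 (window functions need SQLite 3.25 or later); the column names and rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movie (id TEXT PRIMARY KEY, production_company TEXT);
CREATE TABLE ratings (movie_id TEXT PRIMARY KEY, avg_rating REAL);
-- Invented sample rows.
INSERT INTO movie VALUES
    ('m1', 'Alpha Films'), ('m2', 'Alpha Films'), ('m3', 'Beta Pictures'),
    ('m4', 'Beta Pictures'), ('m5', NULL), ('m6', 'Gamma Studio');
INSERT INTO ratings VALUES
    ('m1', 8.5), ('m2', 8.2), ('m3', 8.9), ('m4', 7.0), ('m5', 9.0), ('m6', 6.0);
""")

query = """
WITH hit_counts AS (
    SELECT m.production_company, COUNT(*) AS movie_count
    FROM movie AS m
    JOIN ratings AS r ON r.movie_id = m.id
    WHERE r.avg_rating > 8
      AND m.production_company IS NOT NULL   -- skip movies with no house name
    GROUP BY m.production_company
),
ranked AS (
    SELECT production_company, movie_count,
           RANK() OVER (ORDER BY movie_count DESC) AS prod_rank
    FROM hit_counts
)
SELECT production_company, movie_count
FROM ranked
WHERE prod_rank = 1
"""
top_houses = conn.execute(query).fetchall()
print(top_houses)
```

Because RANK() gives tied rows the same rank, the final SELECT returns every production house that shares the top count.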

14. How many movies released in each genre during March 2017 in the USA had more than 1,000 votes?

15. Find movies of each genre that start with the word ‘The’ and which have an average rating > 8.

16. Of the movies released between 1 April 2018 and 1 April 2019, how many were given a median rating of 8?

17. Do German movies get more votes than Italian movies?

18. Which columns in the names table have null values?

19. Who are the top three directors in the top three genres whose movies have an average rating > 8?

Pull the Top three Genre by Movie count where avg_rating > 8

Pull the Directors with Movie count where avg_rating > 8

Keeping “top_3_genres” as CTE, restrict the 2nd code to avg_rating > 8 and directors of top_3_genre

Trying Row_Number() function:

20. Who are the top two actors whose movies have a median rating >= 8?

21. Which are the top three production houses based on the number of votes received by their movies?

22. Rank actors with movies released in India based on their average ratings. Which actor is at the top of the list?

– Note: The actor should have acted in at least five Indian movies.

Alternative 1 (using the RANK window function):

Alternative 2 (using CTE):

23. Find out the top five actresses in Hindi movies released in India based on their average ratings.

– Note: The actresses should have acted in at least three Indian movies.

24. Select thriller movies as per avg rating and classify them in the following category:

Rating > 8: Superhit movies
Rating between 7 and 8: Hit movies
Rating between 5 and 7: One-time-watch movies
Rating < 5: Flop movies
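This kind of bucketing is typically done with a CASE expression, sketched below on Python's sqlite3. The table layout and rows are assumptions for illustration. The stated ranges overlap at exactly 7, so this sketch assumes a rating of 7 counts as a hit and a rating of exactly 8 also falls in the hit band, since only ratings above 8 are superhits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movie (id TEXT PRIMARY KEY, title TEXT);
CREATE TABLE genre (movie_id TEXT, genre TEXT);
CREATE TABLE ratings (movie_id TEXT PRIMARY KEY, avg_rating REAL);
-- Invented sample rows.
INSERT INTO movie VALUES
    ('m1', 'Edge'), ('m2', 'Border'), ('m3', 'Line'), ('m4', 'Low'), ('m5', 'Laughs');
INSERT INTO genre VALUES
    ('m1', 'Thriller'), ('m2', 'Thriller'), ('m3', 'Thriller'),
    ('m4', 'Thriller'), ('m5', 'Comedy');
INSERT INTO ratings VALUES
    ('m1', 8.6), ('m2', 8.0), ('m3', 5.0), ('m4', 4.9), ('m5', 9.0);
""")

# CASE branches are evaluated top to bottom, so each rating lands in one band.
query = """
SELECT m.title,
       CASE
           WHEN r.avg_rating > 8 THEN 'Superhit movies'
           WHEN r.avg_rating >= 7 THEN 'Hit movies'
           WHEN r.avg_rating >= 5 THEN 'One-time-watch movies'
           ELSE 'Flop movies'
       END AS category
FROM movie AS m
JOIN genre AS g ON g.movie_id = m.id
JOIN ratings AS r ON r.movie_id = m.id
WHERE g.genre = 'Thriller'
ORDER BY r.avg_rating DESC
"""
thriller_categories = conn.execute(query).fetchall()
print(thriller_categories)
```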

EXECUTIVE SUMMARY AND RECOMMENDATIONS

1. Insights

Based on 7,997 movies released and recorded on IMDb between 2017 and 2019, a summary of audience interest and recommendations follows:

  • Average duration: 103.89 minutes
  • Total number of actors: 12611 (7445 actors and 5166 actresses)

1. Year and Month wise Movie Release Pattern:

  • A year-wise record shows the number of movies falling by roughly a third, from 3052 in 2017 to 2001 in 2019.
  • The maximum number of movies were released in March, followed by September, October, and January. More interestingly, the fewest movies were released in the mid-year and end-of-year months, possibly because more people prefer vacation and family time in those periods.

2. Geographical Region Distribution

  • The USA and India together produced 1059 movies in 2019 alone, just over half of the 2001 movies released that year.

3. Genre Popularity

  • Movies were tagged with genre tags as Drama, Fantasy, Thriller, Comedy, Horror, Family, Romance, Adventure, Action, Sci-Fi, Crime, and Mystery.
  • Drama is the most popular genre, with 4285 tags across the three years, followed by Comedy and Thriller.
  • There were 3289 movies with only one genre tag, while the remaining movies were tagged with multiple genres.

4. The average duration of movies is around 103.89 minutes, and even the genre-wise averages revolve around the same figure.

5. Top Production Houses

  • Marvel Studios leads the production houses by the number of votes received by their movies, with 551245 votes, followed by Syncopy and New Line Cinema.
  • Star Cinema and Twentieth Century Fox are the top two multilingual production houses based on the number of superhit movies.

6. Top Director

  • James Mangold has directed the most superhit movies, followed by Soubin Shahir, Joe Russo, and Anthony Russo.
  • A.L. Vijay, Andrew Jones, and Chris Stokes are the top directors based on the number of movies.

7. Top Actors and Actresses

  • Mammootty, with 8 superhit movies, is the most successful actor, followed by Mohanlal with 5 superhits.
  • Quite a few actors have 4 superhit movies to their name, including Amrinder Gill, Amit Sadh, Johnny Yong Bosch, Tovino Thomas, Dulquer Salmaan, Siddique, Rajkummar Rao, Fahadh Faasil, Pankaj Tripathi, Dileesh Pothan, Joju George, and Ayushmann Khurrana.
  • Vijay Sethupathi, Fahadh Faasil, and Yogi Babu are the top three Indian actors who have acted in at least five movies.
  • Taapsee Pannu, Divya Dutta, and Kriti Kharbanda are the top three Hindi-speaking actresses who have acted in at least three movies.
  • Parvathy Thiruvothu, Susan Brown, and Amanda Lawrence are the best-rated actresses in the Drama genre.

8. Top-10 movies based on average rating are: Kirket, Love in Kilnerry, Gini Helida Kathe, Runam, Fan, Android Kunjappan Version 5.25, Yeh Suhaagraat Impossible, Safe, The Brighton Miracle, and Shibu

  • Based on median rating counts, most of the movies are rated between 5 and 8 and fall under the hit movie category.

9. Top Grossing Movies

The highest-grossing movies of each year are:

i. Thank You for Your Service, a comedy movie released in 2017

ii. The Villain, a thriller movie released in 2018

iii. Joker, a drama movie released in 2019

2. Recommendation:

Based on these insights, the recommendations for RSVP are as follows:

  • Concentrate on multi-genre drama-comedy movies with a pinch of thriller, keeping an average duration of around 104 minutes.
  • Plan to release the movie between January and March. Focus on multilingual movies that can be launched in India and the USA as the preferred audience markets.
  • Rope in either Star Cinema or Twentieth Century Fox as the production house, under the direction of James Mangold with the assistance of A.L. Vijay.
  • Mammootty and Mohanlal can be the lead actors, supported by other side actors. Including Vijay Sethupathi would add star power to the movie's promotion.
  • Parvathy Thiruvothu is one of the best-rated drama actresses and should be brought in.

IMDB Subset Database: SQL Queries and Analysis

Introduction

Welcome to our new blog post! We're really excited to share something special with you today. In this post, we'll be introducing a fascinating new project requirement: the 'IMDB Subset Database: SQL Queries and Analysis.' We'll walk you through the project requirements and discuss our approach to tackling them. Plus, we'll give you a sneak peek at some of the outputs with screenshots. So, let's jump right in and explore together!

Project Requirement : 

This document describes a database on movies, movie-stars, etc. and asks a range of questions on this database. You must work out how to produce answers to these questions using SQL or PLpgSQL (as specified in the question).

In this assignment, you will work with a very small subset of the Internet Movie Database (aka IMDB). This database has information about movies, TV series, actors, directors, etc. The database for the assignment only contains highly-rated movies, and the people associated with them. All of the data about "lesser" movies, TV series, and other IMDB content has been removed to keep the size manageable; the actual database is over 50GB.

Some of the terminology IMDB uses may require some explanation:

The database deals with a wide variety of humans, animals and animated characters that appear in movies. The term "people" isn't broad enough, but we use it anyway since most of the references are to people.

Movies (and other media) are released in different forms in different regions of the world. Some versions are cut, to fit with local laws. Others are dubbed or subtitled, to fit the local language. The title is also often changed, and to a phrase with quite a different meaning to the original. The various versions of a movie are called "Aliases".

People have the following ER model

Movies have the following ER model

Aliases for movies have the following ER model

The above entities are linked together as follows

In the queries, references to title mean Movies.title

In most cases, the order of results doesn't matter; the testing code will use order by to force a specific order

Queries should not take more than 3 seconds to run; queries that take longer to run will be penalised

Sample Outputs and Views

To give you an idea of what you're aiming for, there are sample outputs in each question. Note that these assume that you are creating a view for the question and then invoking that view. You are not required to create views, but you will probably find it convenient. 

If you create views, use

If you decide to change the view appearance later, you will also need to include

before creating the view.
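The statements themselves are not reproduced here, so the following is only a sketch of the usual pattern: drop the view if it exists, then create it again. It runs on Python's sqlite3 as a stand-in; the view name Q1 and the Movies(id, title, rating) layout are hypothetical. In PostgreSQL, which the PLpgSQL requirement suggests is the target, CREATE OR REPLACE VIEW is also available for redefining a view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical table layout and invented rows.
CREATE TABLE Movies (id INTEGER PRIMARY KEY, title TEXT, rating REAL);
INSERT INTO Movies (title, rating) VALUES ('Alpha', 9.2), ('Beta', 9.2), ('Gamma', 8.0);
""")

def define_q1(select_sql):
    """(Re)create the hypothetical view Q1 from the given SELECT statement."""
    conn.execute("DROP VIEW IF EXISTS Q1")   # needed when the definition changes
    conn.execute("CREATE VIEW Q1(title) AS " + select_sql)

highest_rated = "SELECT title FROM Movies WHERE rating = (SELECT MAX(rating) FROM Movies)"
define_q1(highest_rated)
define_q1(highest_rated)  # running it again is safe because of the DROP

top_titles = [row[0] for row in conn.execute("SELECT title FROM Q1 ORDER BY title")]
print(top_titles)
```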

Questions : 

"Find the titles of movies that have the highest rating."

"Find the titles of movies that have the highest rating, ordered by rating (highest first), then by title (alphabetically)."

"Find the titles of movies that have the highest rating, ordered by title."

"Find the titles of all the movies in which both Johnny Depp and Helena Bonham Carter have appeared."

"Find the titles of all the movies in which either Johnny Depp or Helena Bonham Carter or both have appeared."

"Find the titles of the three highest-rated movies."

"Find the titles of the ten lowest-rated movies."

"Find the name of the director who directed the most number of movies."

"Find the name of the director who directed the highest-rated movie."

"Find the titles of all the movies that have aliases."

"Find the names of all the people who have acted in a movie that also has aliases."

"Find the title of the highest-rated movie that has aliases."

Solution Approach : 

In addressing the requirements of the IMDB Subset Database project, we employed a systematic approach leveraging SQL queries and analysis techniques. Below is a breakdown of the methods and techniques utilized in solving each of the provided questions:

Querying Movies with Highest Rating:

We employed a SQL query to identify movies with the highest rating. By selecting movies based on their rating, we ensured a focused approach to retrieve the most critically acclaimed films from the database.

Sorting Highest-Rated Movies:

To fulfill this requirement, we utilized SQL's ORDER BY clause to arrange movies first by their rating (highest first) and then alphabetically by title. This ensured a structured presentation of the highest-rated movies.

Ordering Movies by Title with Highest Rating:

Similar to the previous question, we utilized SQL's ORDER BY clause to organize movies by title while maintaining the criterion of highest rating. This provided an alternative view of the highest-rated movies.

Identifying Movies with Specific Actors:

Leveraging SQL's JOIN operation, we crafted a query to find movies in which both Johnny Depp and Helena Bonham Carter have appeared. By joining the Movies and People tables based on the respective roles of the actors, we retrieved the desired movie titles.

Searching Movies by Actor:

Utilizing SQL's OR logical operator, we constructed a query to identify movies featuring either Johnny Depp, Helena Bonham Carter, or both. This approach facilitated a comprehensive search for movies involving the specified actors.
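To make the difference between the "both" and "either" questions concrete, here is a minimal sketch on Python's sqlite3. The schema is not shown in this post, so the People, Movies, and Acts tables and their columns are hypothetical, and the rows are hand-entered samples. The "both" query groups by movie and requires two distinct matching names; the "either" query only needs one match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical schema: Acts links people to the movies they appear in.
CREATE TABLE People (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Movies (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE Acts (movie_id INTEGER, person_id INTEGER);
INSERT INTO People VALUES (1, 'Johnny Depp'), (2, 'Helena Bonham Carter'), (3, 'Someone Else');
INSERT INTO Movies VALUES
    (1, 'Sweeney Todd'), (2, 'Pirates of the Caribbean'), (3, 'Fight Club'), (4, 'Other Film');
INSERT INTO Acts VALUES (1, 1), (1, 2), (2, 1), (3, 2), (4, 3);
""")

names = ("Johnny Depp", "Helena Bonham Carter")

both_query = """
SELECT m.title
FROM Movies AS m
JOIN Acts AS a ON a.movie_id = m.id
JOIN People AS p ON p.id = a.person_id
WHERE p.name IN (?, ?)
GROUP BY m.id, m.title
HAVING COUNT(DISTINCT p.name) = 2      -- both names must be present
ORDER BY m.title
"""
either_query = """
SELECT DISTINCT m.title
FROM Movies AS m
JOIN Acts AS a ON a.movie_id = m.id
JOIN People AS p ON p.id = a.person_id
WHERE p.name IN (?, ?)                 -- one match is enough
ORDER BY m.title
"""
both_titles = [row[0] for row in conn.execute(both_query, names)]
either_titles = [row[0] for row in conn.execute(either_query, names)]
print(both_titles, either_titles)
```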

Top Three Highest-Rated Movies:

We employed SQL's LIMIT clause to select the top three movies based on their rating. This straightforward approach allowed us to extract the titles of the three highest-rated films efficiently.

Bottom Ten Lowest-Rated Movies:

Similar to the previous question, we utilized SQL's LIMIT clause, coupled with appropriate ordering, to identify the ten movies with the lowest ratings. This approach ensured a concise presentation of the least-rated movies.

Finding Director with Most Movies:

Through SQL's GROUP BY and COUNT functions, we determined the director who directed the most number of movies. By aggregating the data on directors and counting the occurrences, we pinpointed the director with the highest movie count.

Locating Director of Highest-Rated Movie:

Employing SQL's MAX function along with appropriate joins, we identified the director responsible for directing the highest-rated movie. By selecting the director associated with the maximum rating value, we isolated the director of the top-rated film.

Movies with Aliases:

By utilizing SQL's JOIN operation with the Aliases table, we identified movies that have aliases. This allowed us to capture movies with multiple versions or titles.

Actors in Movies with Aliases:

Leveraging SQL's JOIN operation between People and Aliases tables, we extracted the names of individuals who have acted in movies with aliases. This approach facilitated the identification of actors involved in movies with alternate titles.

Highest-Rated Movie with Aliases:

Through SQL's JOIN operation and appropriate selection criteria, we pinpointed the title of the highest-rated movie that has aliases. By combining data from multiple tables, we precisely identified the desired movie title.

At CodersArts, we're excited to unveil our latest project, the "IMDB Subset Database: SQL Queries and Analysis." Delving into the world of cinema data, our team is poised to demonstrate the power of SQL queries in unraveling insightful trends and patterns within the IMDB dataset. With a keen focus on efficient data retrieval and analysis, we aim to provide comprehensive solutions to a range of queries, offering a deeper understanding of movie ratings, actors' roles, and directorial contributions.

From inception to implementation, CodersArts guides you through the intricacies of the IMDB Subset Database project. We meticulously dissect the dataset, identifying crucial entities and relationships that underpin the world of movies and entertainment. Through strategic SQL query formulation and execution, we uncover hidden gems within the data, allowing users to extract valuable insights with ease. Our commitment to efficiency ensures that queries run seamlessly, providing prompt and accurate results to meet your analytical needs.

But our journey doesn't stop at data retrieval. CodersArts is dedicated to empowering users with actionable insights derived from SQL analysis. By leveraging the power of structured query language, we equip you with the tools to make informed decisions in the realm of movie analytics. Whether it's identifying top-rated movies, analyzing actor collaborations, or spotlighting directorial achievements, our expertise in SQL queries and analysis elevates your understanding of the IMDB dataset, paving the way for informed decision-making and enriched cinematic experiences.

If you require any assistance with the project discussed in this blog, or need similar support for other projects, please don't hesitate to reach out to our team by email.

SOLVING QUERIES ON IMDB DATASET USING SQL

select
  (select count(*) from directors) as directors_rows_count,
  (select count(*) from directors_genres) as director_genres_rows_count,
  (select count(*) from movies) as movies_rows_count,
  (select count(*) from movies_directors) as movie_directors_rows_count,
  (select count(*) from roles) as roles_rows_count;

select * from directors;

-- Count the NULL values in each column: every CASE yields 1 for a NULL and 0 otherwise, and sum() adds them up.
select
  sum(case when id is null then 1 else 0 end) as id,
  sum(case when name is null then 1 else 0 end) as name,
  sum(case when year is null then 1 else 0 end) as year,
  sum(case when rankscore is null then 1 else 0 end) as rankscore
from movies;

select name as movie, year from movies order by year;

select year,count(name) from movies group by year order by year;

select year,count(name) as movies_count from movies group by year order by movies_count desc limit 10;

select movies.name, movies.rankscore
from movies
inner join (select max(rankscore) as maxrank from movies) movies2
  on movies.rankscore = movies2.maxrank;

create temporary table table1 (select md.director_id, md.movie_id , dr.director_name ,mv.movie_name from movies_directors as md inner join   (select id as director_id,concat(first_name," ",last_name) as director_name from directors) dr   on md.director_id=dr.director_id   inner join (select id as movie_id, name as movie_name from movies) mv on md.movie_id = mv.movie_id);   select * from table1;

select director_name , count(movie_name) as movies_count from table1 group by director_name order by movies_count desc limit 20 ;

select genre, count(movie_id) as movies_count from movies_genres group by genre order by movies_count desc;

with directors_with_most_genres as (
  select director_id, count(genre) as genre_count from directors_genres group by director_id
)
select concat(directors.first_name, ' ', directors.last_name) as director_name, directors.id as directors_id, directors_with_most_genres.genre_count
from directors
inner join directors_with_most_genres on directors.id = directors_with_most_genres.director_id
order by directors_with_most_genres.genre_count desc;

with rolescount as   (select movie_id,count(role) as number_of_roles from roles group by movie_id )   select movies.name, rolescount.number_of_roles from movies inner join rolescount   on movies.id=rolescount.movie_id order by number_of_roles desc;

select * from movies where name like 'An%' and rankscore > 9;

select * from movies where name like 'Fig%ub' and char_length(name) = 10 and year = 1999;     -- Note: a space (' ') also counts as a character.

select * from movies where year between 1800 and 2000 and rankscore > 9.5 order by year;
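
Row-level conditions such as the year range and rankscore threshold above are conventionally placed in WHERE, while HAVING is meant for conditions on grouped or aggregated results. The sketch below illustrates the distinction. It uses sqlite3 as a stand-in for MySQL, and both the sample rows and the "at least two movies" threshold are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (id INTEGER, name TEXT, year INTEGER, rankscore REAL)")
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?, ?)",
    [
        (1, "Movie A", 1950, 9.6),
        (2, "Movie B", 1950, 9.7),
        (3, "Movie C", 1999, 9.8),
        (4, "Movie D", 1999, 6.0),
        (5, "Movie E", 2005, 9.9),
    ],
)

# WHERE filters individual rows before any grouping takes place.
high_rated = conn.execute(
    """
    SELECT name, year
    FROM movies
    WHERE year BETWEEN 1800 AND 2000 AND rankscore > 9.5
    ORDER BY year, name
    """
).fetchall()

# HAVING filters whole groups after aggregation: years with at least two such movies.
busy_years = conn.execute(
    """
    SELECT year, COUNT(*) AS movies_count
    FROM movies
    WHERE rankscore > 9.5
    GROUP BY year
    HAVING COUNT(*) >= 2
    ORDER BY year
    """
).fetchall()

print(high_rated)
print(busy_years)
conn.close()
```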

select lower(name), rankscore, year from movies where lower(name) in ('top gun', 'blade runner', 'border');

IMDb 4: Querying the IMDb MySQL database and visualising its data

This post is the fourth and final post in this series. Here we will query the database in a couple of different ways and perform some visualisation of the data.

SQL Queries using MySQL

After creating the database and loading the data into it, we can now query it. Queries can be posed by typing the SQL commands directly into the MySQL terminal or by writing them in a file, which is then run from the terminal using the SOURCE command. In the file SQL_Queries_1.sql we consider many questions and answer them by querying the IMDb database. This section is an ongoing piece of work and we intend to add more queries to the repository in the future.

For each query in the file SQL_Queries_1.sql we create a view with a CREATE VIEW statement.

The result of the query, which is stored in the view, can then be seen by selecting from the view.

To delete the view, use a DROP VIEW statement.
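
The create, inspect and drop steps for a view can be sketched as follows. This is an illustration, not the contents of SQL_Queries_1.sql: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table `titles`, its columns, and the view name `titles_per_year` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and rows, used only to have something to query.
conn.execute("CREATE TABLE titles (title_id TEXT, primary_title TEXT, start_year INTEGER)")
conn.executemany(
    "INSERT INTO titles VALUES (?, ?, ?)",
    [
        ("t1", "Title One", 1995),
        ("t2", "Title Two", 1995),
        ("t3", "Title Three", 2001),
    ],
)

# 1. Store a query as a view.
conn.execute(
    """
    CREATE VIEW titles_per_year AS
    SELECT start_year, COUNT(*) AS titles_count
    FROM titles
    GROUP BY start_year
    """
)

# 2. Inspect the result; LIMIT keeps the output short on a large database.
rows = conn.execute("SELECT * FROM titles_per_year ORDER BY start_year LIMIT 5").fetchall()
print(rows)

# 3. Delete the view again.
conn.execute("DROP VIEW titles_per_year")
remaining_views = conn.execute("SELECT name FROM sqlite_master WHERE type = 'view'").fetchall()
conn.close()
```

A view stores the query rather than its result, so selecting from it always reflects the current contents of the underlying tables.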

The database is quite large, so for illustration purposes we will quite often limit ourselves to the first few entries only.

A few example queries

We will consider a few queries for illustration purposes.

  • Query 9: Who are the actors who played James Bond in a movie? How many times did they play the role of James Bond?

To see the results of this query:

[Screenshot: results of Query 9]

  • Query 10: How many actors played James Bond?

[Screenshot: result of Query 10]

  • Query 11: I don’t recognise some of the names shown above, so let’s look at them more closely!

[Screenshot: results of Query 11]

Clearly, a few of these movies contain the character James Bond, but are not the James Bond movies we have in mind. In particular, the appearance of the movie Deadly Hands of Kung Fu is quite interesting, as it looks to be a 1970s kung fu flick. Its IMDb page can be found here. From this page we quote its synopsis:

“ It’s one of the “Bruceploitation” films that were made to cash in on Bruce Lee after his death. The story follows Bruce Lee after he dies and ends up in Hell. Once there, he does the logical thing and opens a gym. After fending off the advances of the King Of Hell’s naked wives, he discovers that the most evil people in Hell are attempting a takeover, so Bruce sets out to stop it. As if it wasn’t weird enough, the evil people are: Zatoichi (the blind swordsman hero of Japanese film), James Bond , The Godfather, The Exorcist, Emmanuelle (the “heroine” of many European softcore porn films), Dracula, and, of course, Clint Eastwood (played by a Chinese guy). Aiding Bruce is The One-Armed Swordsman (hero of kung-fu films), Kain from the U.S. tv series, Kung-Fu (actually played by a Chinese guy this time), and Popeye the Sailor Man! Yes, Popeye the Sailor Man. He eats spinach and helps Bruce fight some mummies.”

Wow! I certainly was not expecting that. I need to see this movie! In the script imdb_scraper.py we provide a couple of functions that can be used to extract the URL of a movie poster from its IMDb webpage using its title_id.

The functions used to scrape the movie poster URL make use of BeautifulSoup and urllib.request. To see how these functions can be used in practice, see the screenshot below.

[Screenshot: example use of the poster-scraping functions from imdb_scraper.py]

SQL Queries using python and data visualisation

In the notebook MySQL_IMDb_visualisation.ipynb we query the IMDb database to explore and visualise the IMDb dataset using pandas and matplotlib. This notebook is by no means a thorough exploration of the IMDb dataset. Its purpose is to practice querying a database using python, then to process and visualise the retrieved data with the pandas package. In particular, we consider the following questions:

  • What are the average ratings for the TV show ‘The X-Files’?
  • What genres are there?
  • How many movies are there in each genre?
  • How many movies are made in each genre each year?
  • How do the average ages of leading actors and actresses compare in each genre?
  • What is a typical runtime for movies in each genre?

This section is also an ongoing piece of work, which will be added to in the future.

To connect to the MySQL IMDb database we use the following code. Of course you will need to use your own password.

We then create a cursor.

To execute a query we simply use the cursor’s execute and fetchall methods. For example to show all tables in the IMDb database

To print the contents of the tables variable we can simply do the following
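
The connect, cursor, execute and fetchall steps described above follow Python's DB-API pattern. The sketch below shows that pattern under stated assumptions: it uses the built-in sqlite3 module as a stand-in for the MySQL connection (so no password is needed), the two table names are hypothetical, and the catalogue query on `sqlite_master` replaces MySQL's `SHOW TABLES`.

```python
import sqlite3

# Stand-in for the MySQL connection; an in-memory SQLite database needs no credentials.
connection = sqlite3.connect(":memory:")

# Hypothetical tables so that the listing below has something to show.
connection.execute("CREATE TABLE titles (title_id TEXT)")
connection.execute("CREATE TABLE ratings (title_id TEXT, average_rating REAL)")

# Create a cursor, run a query, then fetch every row of the result.
cursor = connection.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
tables = cursor.fetchall()

# Each row is a tuple, even when the query returns a single column.
for (table_name,) in tables:
    print(table_name)

connection.close()
```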

Since we have run the SQL script containing many queries, we have many tables in our database.

Visualising the ratings of the tv show “The X-Files”

What is the average rating of each episode of The X-Files?

How many episodes were there in The X-Files per season? And what was the average of the average episode ratings for each season?

We will plot the results of the above two queries to illustrate the average ratings of The X-files episodes. Please see the notebook for the details of how to produce this figure.

[Figure: average ratings of The X-Files episodes]

Let’s start with something fairly simple. What genres are there? How many movies are there in each genre?

We will visualise this data using a horizontal bar chart.

[Figure: horizontal bar chart of the number of movies in each genre]

How many movies are made in each genre each year? We only consider up to and including 2019.

We will visualise this data using a line plot.

[Figure: line plot of the number of movies made in each genre each year]

We limit ourselves to movies between 1919 and 2019. To determine the age of an actor or actress when the movie was being made, the title start year and the birth year of the person must both be non-NULL. If either of these is NULL, that entry is ignored.
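
The age calculation and the NULL rule can be sketched in plain Python. The records below are hypothetical, `None` plays the role of SQL NULL, and treating the 1919 to 2019 range as inclusive at both ends is an assumption.

```python
# Hypothetical (title start year, birth year) pairs; None stands for SQL NULL.
records = [
    (1995, 1960),
    (2001, None),   # birth year missing, so the entry is skipped
    (None, 1975),   # start year missing, so the entry is skipped
    (2010, 1982),
    (1900, 1870),   # outside the 1919-2019 window
]

FIRST_YEAR, LAST_YEAR = 1919, 2019  # assumed to be inclusive bounds


def age_at_release(start_year, birth_year):
    """Return the age in years, or None when either input is missing."""
    if start_year is None or birth_year is None:
        return None
    return start_year - birth_year


ages = []
for start_year, birth_year in records:
    age = age_at_release(start_year, birth_year)
    if age is None:
        continue  # incomplete entries are ignored
    if FIRST_YEAR <= start_year <= LAST_YEAR:
        ages.append(age)

print(ages)
```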

First we create an intermediate table Leading_people with ‘year’, ‘title_id’, ‘job_category’, and ‘ordering’ data. We then query this table to combine this with ‘age’ and ‘genre’ information.

Let’s group by year, job_category (actor/actress, i.e., gender) and also genre using pandas’ groupby function

We define two functions which will be passed to the aggregate function. These functions will be used to calculate the first and third quartiles.
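
The grouping and the two quartile functions might look roughly like the sketch below. It is not the notebook's code: plain Python and the standard `statistics` module are swapped in for pandas so the example has no third-party dependencies, and the sample rows are hypothetical. `method="inclusive"` interpolates linearly between data points.

```python
from collections import defaultdict
from statistics import mean, quantiles


def first_quartile(values):
    """25th percentile, interpolating linearly between data points."""
    return quantiles(values, n=4, method="inclusive")[0]


def third_quartile(values):
    """75th percentile, interpolating linearly between data points."""
    return quantiles(values, n=4, method="inclusive")[2]


# Hypothetical rows: (year, job_category, genre, age).
rows = [
    (2000, "actor", "Drama", 40),
    (2000, "actor", "Drama", 50),
    (2000, "actor", "Drama", 30),
    (2000, "actor", "Drama", 60),
    (2000, "actor", "Drama", 20),
    (2000, "actress", "Drama", 25),
    (2000, "actress", "Drama", 35),
]

# Group the ages by (year, job_category, genre), as a groupby would.
groups = defaultdict(list)
for year, job_category, genre, age in rows:
    groups[(year, job_category, genre)].append(age)

# Aggregate each group with the mean and the two quartile functions.
summary = {
    key: {
        "mean": mean(ages),
        "q1": first_quartile(ages),
        "q3": third_quartile(ages),
    }
    for key, ages in groups.items()
}

for key, stats in sorted(summary.items()):
    print(key, stats)
```

The mean gives the line in the plots that follow, and the two quartiles give the lower and upper edges of the band.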

Let’s visualise this data for just the Drama movies, which is the top genre. The lines are the mean values and the bands are given by the first and third quartiles.

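[Figure: mean age of leading actors and actresses in Drama movies by year, with bands spanning the first and third quartiles]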

Now let’s visualise for the top 9 genres.

[Figure: the same plot for the top 9 genres]

We see a very clear trend in nearly all of the genres shown: the leading men are typically older than the leading ladies, although this trend is not as clear-cut for documentaries.

Some titles have extremely large and unrealistic values. We choose to ignore these by introducing a cutoff of 300 minutes for the runtime minutes. We visualise this data for 6 genres as a histogram plot.
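
The runtime cutoff can be sketched as a simple filter followed by binning. The sample values and the 30-minute bin width are hypothetical, and whether a runtime of exactly 300 minutes is kept is an assumption (here it is).

```python
from collections import Counter

# Hypothetical runtimes in minutes; None stands for a missing value.
runtimes = [90, 105, 120, 45, 300, 301, 5000, None, 88]

CUTOFF_MINUTES = 300  # cutoff from the text, assumed to be inclusive
BIN_WIDTH = 30        # hypothetical bin width for the histogram

# Drop missing values and unrealistically long runtimes.
kept = [r for r in runtimes if r is not None and r <= CUTOFF_MINUTES]

# Count how many runtimes fall into each bin, keyed by the bin's lower edge.
histogram = Counter((r // BIN_WIDTH) * BIN_WIDTH for r in kept)

for lower_edge in sorted(histogram):
    print(f"{lower_edge:>3}-{lower_edge + BIN_WIDTH - 1:>3} min: {histogram[lower_edge]}")
```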

[Figure: histograms of runtime in minutes for 6 genres]

We finish by closing the connection to the database.

In this post we looked at a few different ways of querying the IMDb database that we created in previous posts. We then visualised the retrieved data. This post was only meant to scratch the surface of what could be done with this data. At a later date we will likely return to this dataset to try out other ETL tools, such as Microsoft’s SQL Server Integration Services (SSIS). We may also investigate trends further by performing statistical analyses, and possibly even use some machine learning algorithms. This concludes the series of posts on building and querying a MySQL database for the IMDb dataset.

IMDb Movie Assignment

You have the data for the 100 top-rated movies from the past decade along with various pieces of information about the movie, its actors, and the voters who have rated these movies online. In this assignment, you will try to find some interesting insights into these movies and their voters, using Python.

Task 1: Reading the data

Subtask 1.1: Read the movies data

Read the movies data file provided and store it in a dataframe named movies.
