We are going to build a list of the films that won the Academy Award (Oscar) for Best Picture. We'll scrape the data from Wikipedia with BeautifulSoup.
Source: Academy Award for Best Picture
1 Building the model
– Import libraries:
from bs4 import BeautifulSoup as bs
import requests
from pandasgui import show
1.1 Get the film infobox and store it in a Python dictionary
– Load and print HTML:
r = requests.get("https://en.wikipedia.org/wiki/Wings_(1927_film)")

# Convert to a BeautifulSoup object
soup = bs(r.content)

# Print out the HTML
contents = soup.prettify()
print(contents)
<!DOCTYPE html> <html class="client-nojs" dir="ltr" lang="en"> <head> <meta charset="utf-8"/> <title> Wings (1927 film) - Wikipedia </title> <script> document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"YAPnXQpAAEIAAH0lzFsAAACS","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Wings_(1927_film)","wgTitle":"Wings (1927 film)","wgCurRevisionId":999345930,"wgRevisionId":999345930,"wgArticleId":61046,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is different from Wikidata","Good articles","Use mdy dates from October 2020","Template film date with 2 release dates","CS1: long volume value","Commons category link is on Wikidata","Articles with Internet Archive links", ........................................................................................................................ ........................................................................................................................ ........................................................................................................................ 
</script> <script type="application/ld+json"> {"@context":"https:\/\/schema.org","@type":"Article","name":"Wings (1927 film)","url":"https:\/\/en.wikipedia.org\/wiki\/Wings_(1927_film)","sameAs":"http:\/\/www.wikidata.org\/entity\/Q272036","mainEntity":"http:\/\/www.wikidata.org\/entity\/Q272036","author":{"@type":"Organization","name":"Contributors to Wikimedia projects"},"publisher":{"@type":"Organization","name":"Wikimedia Foundation, Inc.","logo":{"@type":"ImageObject","url":"https:\/\/www.wikimedia.org\/static\/images\/wmf-hor-googpub.png"}},"datePublished":"2002-07-08T19:01:08Z","dateModified":"2021-01-09T18:39:25Z","image":"https:\/\/upload.wikimedia.org\/wikipedia\/commons\/6\/67\/Wings_poster.jpg","headline":"1927 film by William A. Wellman, Harry d\u2019Abbadie d\u2019Arrast"} </script> <script> (RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgBackendResponseTime":154,"wgHostname":"mw1271"});}); </script> </body> </html>
info_box = soup.find(class_="infobox vevent")
info_rows = info_box.find_all("tr")
for row in info_rows:
    print(row.prettify())
<tr> <th class="summary" colspan="2" style="text-align:center;font-size:125%;font-weight:bold;font-size:110%;font-style:italic;"> Wings </th> </tr> <tr> <td colspan="2" style="text-align:center"> <a class="image" href="/wiki/File:Wings_poster.jpg"> <img alt="Wings poster.jpg" class="thumbborder" data-file-height="1500" data-file-width="990" decoding="async" height="333" src="//upload.wikimedia.org/wikipedia/commons/thumb/6/67/Wings_poster.jpg/220px-Wings_poster.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/6/67/Wings_poster.jpg/330px-Wings_poster.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/6/67/Wings_poster.jpg/440px-Wings_poster.jpg 2x" width="220"/> </a> <div style="font-size:95%;padding:0.35em 0.35em 0.25em;line-height:1.25em;"> Film poster </div> </td> </tr> <tr> <th scope="row" style="white-space:nowrap;padding-right:0.65em;"> Directed by </th> <td> <a href="/wiki/William_A._Wellman" title="William A. Wellman"> William A. Wellman </a> </td> </tr> <tr> <th scope="row" style="white-space:nowrap;padding-right:0.65em;"> Produced by </th> <td> <a href="/wiki/Lucien_Hubbard" title="Lucien Hubbard"> Lucien Hubbard </a> <br/> <a href="/wiki/Adolph_Zukor" title="Adolph Zukor"> Adolph Zukor </a> <br/> <a href="/wiki/Jesse_L._Lasky" title="Jesse L. Lasky"> Jesse L. Lasky </a> <br/> <a class="mw-redirect" href="/wiki/B.P._Schulberg" title="B.P. Schulberg"> B.P. 
Schulberg </a> <br/> <a href="/wiki/Otto_Hermann_Kahn" title="Otto Hermann Kahn"> Otto Hermann Kahn </a> <br/> <span style="font-size:85%;"> ( <i> uncredited </i> ) </span> <sup class="reference" id="cite_ref-1"> <a href="#cite_note-1"> [1] </a> </sup> <sup class="reference" id="cite_ref-2"> <a href="#cite_note-2"> [a] </a> </sup> </td> </tr> <tr> <th scope="row" style="white-space:nowrap;padding-right:0.65em;"> Written by </th> <td> <b> Titles: </b> <br/> Julian Johnson </td> </tr> <tr> <th scope="row" style="white-space:nowrap;padding-right:0.65em;"> Screenplay by </th> <td> <a href="/wiki/Hope_Loring" title="Hope Loring"> Hope Loring </a> <br/> <a href="/wiki/Louis_D._Lighton" title="Louis D. Lighton"> Louis D. Lighton </a> </td> </tr> <tr> <th scope="row" style="white-space:nowrap;padding-right:0.65em;"> Story by </th> <td> <a href="/wiki/John_Monk_Saunders" title="John Monk Saunders"> John Monk Saunders </a> </td> </tr> <tr> <th scope="row" style="white-space:nowrap;padding-right:0.65em;"> Starring </th> <td> <a href="/wiki/Clara_Bow" title="Clara Bow"> Clara Bow </a> <br/> <a class="mw-redirect" href="/wiki/Charles_(Buddy)_Rogers" title="Charles (Buddy) Rogers"> Charles (Buddy) Rogers </a> <br/> <a href="/wiki/Richard_Arlen" title="Richard Arlen"> Richard Arlen </a> <br/> <a href="/wiki/Gary_Cooper" title="Gary Cooper"> Gary Cooper </a> </td> </tr> <tr> <th scope="row" style="white-space:nowrap;padding-right:0.65em;"> Music by </th> <td> <a class="mw-redirect" href="/wiki/J.S._Zamecnik" title="J.S. Zamecnik"> J.S. Zamecnik </a> <span style="font-size:85%;"> ( <i> uncredited </i> ) </span> </td> </tr> <tr> <th scope="row" style="white-space:nowrap;padding-right:0.65em;"> Cinematography </th> <td> <a href="/wiki/Harry_Perry_(cinematographer)" title="Harry Perry (cinematographer)"> Harry Perry </a> </td> </tr> <tr> <th scope="row" style="white-space:nowrap;padding-right:0.65em;"> Edited by </th> <td> <a href="/wiki/E._Lloyd_Sheldon" title="E. Lloyd Sheldon"> E. 
Lloyd Sheldon </a> <br/> Lucien Hubbard <span style="font-size:85%;"> ( <i> uncredited </i> ) </span> </td> </tr> <tr> <th scope="row" style="white-space:nowrap;padding-right:0.65em;"> <div style="display:inline-block; padding:0.1em 0;line-height:1.2em;"> Production <br/> company </div> </th> <td> <div style="vertical-align:middle;"> <a class="mw-redirect" href="/wiki/Paramount_Famous_Lasky_Corporation" title="Paramount Famous Lasky Corporation"> Paramount Famous Lasky Corporation </a> </div> </td> </tr> <tr> <th scope="row" style="white-space:nowrap;padding-right:0.65em;"> Distributed by </th> <td> <a href="/wiki/Paramount_Pictures" title="Paramount Pictures"> Paramount Pictures </a> </td> </tr> <tr> <th scope="row" style="white-space:nowrap;padding-right:0.65em;"> <div style="display:inline-block; padding:0.1em 0;line-height:1.2em;white-space:normal;"> Release date </div> </th> <td> <div class="plainlist"> <ul> <li> August 12, 1927 <span style="display:none"> ( <span class="bday dtstart published updated"> 1927-08-12 </span> ) </span> (New York City, premiere) </li> <li> January 15, 1928 <span style="display:none"> ( <span class="bday dtstart published updated"> 1928-01-15 </span> ) </span> (Los Angeles) </li> </ul> </div> </td> </tr> <tr> <th scope="row" style="white-space:nowrap;padding-right:0.65em;"> <div style="display:inline-block; padding:0.1em 0;line-height:1.2em;white-space:normal;"> Running time </div> </th> <td> <b> Original release: </b> <br/> 111 minutes <sup class="reference" id="cite_ref-3"> <a href="#cite_note-3"> [2] </a> </sup> <br/> <b> Restoration: </b> <br/> 144 minutes <sup class="reference" id="cite_ref-4"> <a href="#cite_note-4"> [3] </a> </sup> </td> </tr> <tr> <th scope="row" style="white-space:nowrap;padding-right:0.65em;"> Country </th> <td> United States </td> </tr> <tr> <th scope="row" style="white-space:nowrap;padding-right:0.65em;"> Language </th> <td> <a href="/wiki/Silent_film" title="Silent film"> Silent </a> (English <a 
href="/wiki/Intertitle" title="Intertitle"> intertitles </a> ) </td> </tr> <tr> <th scope="row" style="white-space:nowrap;padding-right:0.65em;"> Budget </th> <td> US$ 2 million ($28,850,173 adjusted for inflation) <sup class="reference" id="cite_ref-Silent_Era_5-0"> <a href="#cite_note-Silent_Era-5"> [4] </a> </sup> </td> </tr> <tr> <th scope="row" style="white-space:nowrap;padding-right:0.65em;"> Box office </th> <td> $3,600,000 (worldwide rentals) <sup class="reference" id="cite_ref-Variety_6-0"> <a href="#cite_note-Variety-6"> [5] </a> </sup> </td> </tr>
– Get content of infobox and put it in a dictionary:
def get_content_value(row_data):
    # Rows that contain a bulleted list become a Python list, one item per <li>
    if row_data.find("li"):
        return [li.get_text(" ", strip=True).replace("\xa0", " ") for li in row_data.find_all("li")]
    else:
        return row_data.get_text(" ", strip=True).replace("\xa0", " ")

movie_info = {}
for index, row in enumerate(info_rows):
    if index == 0:
        # Title
        movie_info['title'] = row.find("th").get_text(" ", strip=True)
    elif index == 1:
        # It's a picture
        continue
    else:
        content_key = row.find("th").get_text(" ", strip=True)
        content_value = get_content_value(row.find("td"))
        movie_info[content_key] = content_value
– Print out the content:
for key, value in movie_info.items():
    print(key, ":", value)
title : Wings
Directed by : William A. Wellman
Produced by : Lucien Hubbard Adolph Zukor Jesse L. Lasky B.P. Schulberg Otto Hermann Kahn ( uncredited ) [1] [a]
Written by : Titles: Julian Johnson
Screenplay by : Hope Loring Louis D. Lighton
Story by : John Monk Saunders
Starring : Clara Bow Charles (Buddy) Rogers Richard Arlen Gary Cooper
Music by : J.S. Zamecnik ( uncredited )
Cinematography : Harry Perry
Edited by : E. Lloyd Sheldon Lucien Hubbard ( uncredited )
Production company : Paramount Famous Lasky Corporation
Distributed by : Paramount Pictures
Release date : ['August 12, 1927 ( 1927-08-12 ) (New York City, premiere)', 'January 15, 1928 ( 1928-01-15 ) (Los Angeles)']
Running time : Original release: 111 minutes [2] Restoration: 144 minutes [3]
Country : United States
Language : Silent (English intertitles )
Budget : US$ 2 million ($28,850,173 adjusted for inflation) [4]
Box office : $3,600,000 (worldwide rentals) [5]
1.2 Get info box for all movies
r = requests.get("https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture")

# Convert to a BeautifulSoup object
soup = bs(r.content)

# Print out the HTML
contents = soup.prettify()
print(contents)
<!DOCTYPE html> <html class="client-nojs" dir="ltr" lang="en"> <head> <meta charset="utf-8"/> <title> Academy Award for Best Picture - Wikipedia </title> <script> document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"YAMl5wpAAL0AAnAcKxwAAACE","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Academy_Award_for_Best_Picture","wgTitle":"Academy Award for Best Picture","wgCurRevisionId":1000537225,"wgRevisionId":1000537225,"wgArticleId":61702,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles containing Latin-language text","Webarchive template wayback links","Articles with short description","Short description is different from Wikidata","Articles containing potentially dated statements from 2020","All articles containing potentially dated statements", "Articles containing potentially dated statements from 2014","Articles contradicting other articles","Academy Awards","Awards for best film","Best Picture Academy Award ....................................................................................................................... ....................................................................................................................... ....................................................................................................................... 
</script> <script type="application/ld+json"> {"@context":"https:\/\/schema.org","@type":"Article","name":"Academy Award for Best Picture","url":"https:\/\/en.wikipedia.org\/wiki\/Academy_Award_for_Best_Picture","sameAs":"http:\/\/www.wikidata.org\/entity\/Q102427","mainEntity":"http:\/\/www.wikidata.org\/entity\/Q102427","author":{"@type":"Organization","name":"Contributors to Wikimedia projects"},"publisher":{"@type":"Organization","name":"Wikimedia Foundation, Inc.","logo":{"@type":"ImageObject","url":"https:\/\/www.wikimedia.org\/static\/images\/wmf-hor-googpub.png"}},"datePublished":"2001-08-21T23:02:36Z","dateModified":"2021-01-15T14:46:55Z","headline":"annual award from the Academy of Motion Picture Arts and Sciences (AMPAS)"} </script> <script> (RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgBackendResponseTime":140,"wgHostname":"mw1385"});}); </script> </body> </html>
info_box = soup.find(class_="mw-parser-output")
# Winning films are the table rows highlighted with this background colour
info_rows = info_box.find_all("tr", {"style": "background:#FAEB86"})
for row in info_rows:
    print(row.prettify())
<tr style="background:#FAEB86"> <td> <a href="/wiki/Wings_(1927_film)" title="Wings (1927 film)"> <i> <b> Wings </b> </i> </a> </td> <td> <b> <a href="/wiki/Famous_Players-Lasky" title="Famous Players-Lasky"> Famous Players-Lasky </a> </b> </td> </tr> <tr style="background:#FAEB86"> <td> <i> <b> <a href="/wiki/The_Broadway_Melody" title="The Broadway Melody"> The Broadway Melody </a> </b> </i> </td> <td> <b> <a href="/wiki/Metro-Goldwyn-Mayer" title="Metro-Goldwyn-Mayer"> Metro-Goldwyn-Mayer </a> </b> </td> </tr> ......................................................................................... ......................................................................................... ......................................................................................... <tr style="background:#FAEB86"> <td> <i> <b> <a href="/wiki/Parasite_(2019_film)" title="Parasite (2019 film)"> Parasite </a> </b> </i> </td> <td> <b> <a href="/wiki/Kwak_Sin-ae" title="Kwak Sin-ae"> Kwak Sin-ae </a> and <a href="/wiki/Bong_Joon-ho" title="Bong Joon-ho"> Bong Joon-ho </a> </b> </td> </tr>
– Put it in a list:
movies = soup.select('tr[style$="FAEB86"]')
movies[0:10]
print(movies[1].a['href'])
print(movies[-1].a['href'])

# Keep the first link of each highlighted row: it points to the film's page
movies2 = []
for movie in movies:
    movies2.append(movie.a)
movies2[-5:]
len(movies2)
92
2 The Model
2.1 Get the infobox for one film and clean tags
def get_content_value(row_data):
    if row_data.find("li"):
        return [li.get_text(" ", strip=True).replace("\xa0", " ") for li in row_data.find_all("li")]
    elif row_data.find("br"):
        return [text for text in row_data.stripped_strings]
    else:
        return row_data.get_text(" ", strip=True).replace("\xa0", " ")

def clean_tags(soup):
    # Clean up references
    for tag in soup.find_all(["sup", "span"]):
        tag.decompose()

def get_info_box(url):
    r = requests.get(url)
    soup = bs(r.content)
    info_box = soup.find(class_="infobox vevent")
    info_rows = info_box.find_all("tr")
    clean_tags(soup)
    movie_info = {}
    for index, row in enumerate(info_rows):
        if index == 0:
            movie_info['title'] = row.find("th").get_text(" ", strip=True)
        else:
            # Rows without a header cell (for example the poster) are skipped
            header = row.find('th')
            if header:
                content_key = header.get_text(" ", strip=True)
                content_value = get_content_value(row.find("td"))
                movie_info[content_key] = content_value
    return movie_info

get_info_box("https://en.wikipedia.org/wiki/Wings_(1927_film)")
{'title': 'Wings', 'Directed by': 'William A. Wellman', 'Produced by': ['Lucien Hubbard', 'Adolph Zukor', 'Jesse L. Lasky', 'B.P. Schulberg', 'Otto Hermann Kahn'], 'Written by': ['Titles:', 'Julian Johnson'], 'Screenplay by': ['Hope Loring', 'Louis D. Lighton'], 'Story by': 'John Monk Saunders', 'Starring': ['Clara Bow', 'Charles (Buddy) Rogers', 'Richard Arlen', 'Gary Cooper'], 'Music by': 'J.S. Zamecnik', 'Cinematography': 'Harry Perry', 'Edited by': ['E. Lloyd Sheldon', 'Lucien Hubbard'], 'Production company': 'Paramount Famous Lasky Corporation', 'Distributed by': 'Paramount Pictures', 'Release date': ['August 12, 1927 (New York City, premiere)', 'January 15, 1928 (Los Angeles)'], 'Running time': ['Original release:', '111 minutes', 'Restoration:', '144 minutes'], 'Country': 'United States', 'Language': 'Silent (English intertitles )', 'Budget': 'US$ 2 million ($28,850,173 adjusted for inflation)', 'Box office': '$3,600,000 (worldwide rentals)'}
2.2 Get all films and put them in a list.
r = requests.get("https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture")
soup = bs(r.content)
contents = soup.prettify()
print(contents)

movies = soup.select('tr[style$="FAEB86"]')
movies[0:10]
print(movies[1].a['href'])

movies2 = []
for movie in movies:
    movies2.append(movie.a)
movies2[-5:]
len(movies2)
/wiki/The_Broadway_Melody
92
2.3 Last step
# No trailing slash here: each href already starts with "/wiki/..."
base_path = "https://en.wikipedia.org"
movie_info_list = []
for movie in movies2:
    try:
        relative_path = movie['href']
        full_path = base_path + relative_path
        movie_info_list.append(get_info_box(full_path))
    except Exception as e:
        # Report the film that failed and keep going
        print(movie.get_text())
        print(e)

movie_info_list[0]
len(movie_info_list)
92
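Joining the base URL and the relative `href` by plain concatenation produces a double slash whenever the base ends with `/` and the path starts with `/`. As a minimal sketch, the standard library's `urllib.parse.urljoin` avoids that; the helper name `build_full_url` is hypothetical and not part of the code above.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org/"

def build_full_url(relative_path, base_url=BASE_URL):
    """Join a site-relative href onto the base URL without doubling slashes."""
    return urljoin(base_url, relative_path)

full_url = build_full_url("/wiki/Wings_(1927_film)")
print(full_url)
```

The result can be passed to `get_info_box` in place of `full_path`.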
3 Save and load Movie Data
import json

def save_data(title, data):
    with open(title, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

def load_data(title):
    with open(title, encoding="utf-8") as f:
        return json.load(f)
save_data("oscars_movies.json", movie_info_list)
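To illustrate what `ensure_ascii=False` does, here is a small round-trip sketch on made-up sample data (not the scraped list), written to a temporary directory so it does not touch `oscars_movies.json`. Accented names such as "José" are stored as readable text instead of `\u00e9` escapes, and the data comes back unchanged.

```python
import json
import os
import tempfile

# Hypothetical sample record, used only for this illustration
sample = [{"title": "Lawrence of Arabia", "Starring": ["José Ferrer", "Peter O'Toole"]}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_movies.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False, indent=2)
    with open(path, encoding="utf-8") as f:
        raw_text = f.read()

restored = json.loads(raw_text)
print("José" in raw_text)   # the accented character is stored as-is
print(restored == sample)   # the round trip preserves the data
```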
4 Clean the data
4.1 Convert running time into an integer
movie_info_list = load_data("oscars_movies.json")
len([movie.get('Running time', 'N/A') for movie in movie_info_list])
print([movie.get('Running time', 'N/A') for movie in movie_info_list])
[['Original release:', '111 minutes', 'Restoration:', '144 minutes'], '100 minutes', ['152 minutes', '133 minutes (restored)'], '124 minutes', '112 minutes', '112 minutes', '105 minutes', '132 minutes', ['177 minutes', '185 minutes (roadshow)'], '116 minutes', '126 minutes', ['221 minutes', "234–238 minutes (with overture , intermission , entr'acte , and exit music)"], '130 minutes', '118 minutes', '133 minutes', '102 minutes', '126 minutes', '101 minutes', '172 minutes', '118 minutes', '155 minutes', '110 minutes', '138 minutes', '113 minutes', '152 minutes', '118 minutes', '108 minutes', '90 minutes', '182 minutes', '161 minutes', '115 minutes', '212 minutes', '125 minutes', '152 minutes', '227 minutes', '128 minutes', '170 minutes', '174 minutes', '120 minutes', '109 minutes', '153 minutes', '113 minutes', '170 minutes', '104 minutes', '177 minutes', '129 minutes', '200 minutes', '133 minutes', '119 minutes', '93 minutes', '184 minutes', '105 minutes', '124 minutes', '124 minutes', '191 minutes', '132 minutes', '161 minutes', '161 minutes', '120 minutes', '163 minutes', '134 minutes', '99 minutes', '181 minutes', '118 minutes', '131 minutes', '195 minutes', '142 minutes', '178 minutes', '162 minutes', '195 minutes', '123 minutes', '122 minutes', '155 minutes', '135 minutes', '113 minutes', '201 minutes', '132 minutes', '112 minutes', '151 minutes', '122 minutes', '120 minutes', '131 minutes', '119 minutes', '100 minutes', '120 minutes', '134 minutes', '119 minutes', '129 minutes', '111 minutes', '123 minutes', '130 minutes', '132 minutes']
# Convert a running-time string (or a list of strings) into an integer number of minutes
def minutes_to_integer(running_time):
    if running_time == "N/A":
        return None
    if isinstance(running_time, list):
        # Lists mix labels and values: return the first entry that ends with "minutes"
        for time in running_time:
            if time.endswith("minutes"):
                return int(time.split(" ")[0])
        return None
    # A plain string such as "100 minutes"
    return int(running_time.split(" ")[0])

# Create a new feature holding the running time as an integer
for movie in movie_info_list:
    movie['Running time (min)'] = minutes_to_integer(movie.get('Running time', "N/A"))
movie_info_list[0]

# The integers are now stored under a new key called "Running time (min)"
print([movie.get('Running time (min)', 'N/A') for movie in movie_info_list])
[111, 100, 152, 124, 112, 112, 105, 132, 177, 116, 126, 221, 130, 118, 133, 102, 126, 101, 172, 118, 155, 110, 138, 113, 152, 118, 108, 90, 182, 161, 115, 212, 125, 152, 227, 128, 170, 174, 120, 109, 153, 113, 170, 104, 177, 129, 200, 133, 119, 93, 184, 105, 124, 124, 191, 132, 161, 161, 120, 163, 134, 99, 181, 118, 131, 195, 142, 178, 162, 195, 123, 122, 155, 135, 113, 201, 132, 112, 151, 122, 120, 131, 119, 100, 120, 134, 119, 129, 111, 123, 130, 132]
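Splitting on the first space works for the values seen here, but it depends on each entry starting with the number. As an alternative sketch (the name `first_minutes` and the pattern are assumptions, not part of the code above), a regular expression can pick out the first "N minutes" value and also cope with ranges such as "234–238 minutes", keeping the lower bound.

```python
import re

# First number directly followed by "minutes", optionally as the start of a range
MINUTES_RE = re.compile(r"(\d+)(?:\s*[–-]\s*\d+)?\s*minutes")

def first_minutes(running_time):
    """Return the first 'N minutes' value as an int, or None if there is none."""
    if running_time is None or running_time == "N/A":
        return None
    items = running_time if isinstance(running_time, list) else [running_time]
    for item in items:
        match = MINUTES_RE.search(item)
        if match:
            return int(match.group(1))
    return None

print(first_minutes(['Original release:', '111 minutes', 'Restoration:', '144 minutes']))
print(first_minutes("234–238 minutes (with overture)"))
```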
4.2 Convert Budget & Box office to numbers
print([movie.get('Budget', 'N/A') for movie in movie_info_list])
['US$ 2 million ($28,850,173 adjusted for inflation)', '$379,000', '$1.2 million', '$1,433,000', '$750,000', '$1,180,280', '$325,000', '$1,950,000', '$2.183 million', 'N/A', 'US$1,644,736 (est.)', '$3.85 million', '$1.29 million', '$800,000', '$1.34 million', '$878,000 (equivalent to $13,738,644 in 2019) –$1 million', 'N/A', '$1.25 million', '$2.1 million or $3 million', '$1,985,000', '£527,530', 'N/A', '$1.4 million', '$2.7 million', '$4 million', '$1.7–2.5 million', '$910,000', '$350,000', '$6 million', '$2.8 million', '$3.3 million', '$15.2 million', '$3 million', '$6.75 million', '$15 million', '$1 million ( )', '$17 million', '$8.2 million', '$2 million', '$2 million', '$10 million', '$3.2 million', '$12.6 million', '$1.8 million', '$6–7.2 million', '$5.5 million', '$13 million', '$3–4.4 million', ['$960,000', '(equivalent to $4.31\xa0million in 2019)'], '$4 million', '$15 million', '$8 million', '$6.2 million', '$5.5 million (£3 million)', '$22 million', '$8 million', '$18 million', '$28 million', '$6 million', '$23.8 million', '$25 million', '$7.5 million', '$22 million', '$19 million', '$14.4 million', '$22 million', '$55 million', '$65–70 million', '$27–31 million', '$200 million', '$25 million', '$15 million', '$103 million', '$58 million', '$45 million', '$94 million', '$30 million', '$6.5 million', '$90 million', '$25 million', '$15 million', '$15 million', '$15 million', '$15 million', '$44.5 million', '$20–22 million', '$18 million', '$20 million', '$1.5-4 million', '$19.5–20 million', '$23 million', ['₩17.0 billion']]
import re

amounts = r"thousand|million|billion"
number = r"\d+(,\d{3})*\.*\d*"

# "$1.2 million", "$65–70 million", "$3 to 4 million"
word_re = rf"\${number}(-|\sto\s|–)?({number})?\s({amounts})"
# "$379,000"
value_re = rf"\${number}"

def word_to_value(word):
    value_dict = {"thousand": 1000, "million": 1000000, "billion": 1000000000}
    return value_dict[word]

def parse_word_syntax(string):
    # The first number found is used, so a range keeps its lower bound
    value_string = re.search(number, string).group()
    value = float(value_string.replace(",", ""))
    word = re.search(amounts, string, flags=re.I).group().lower()
    word_value = word_to_value(word)
    return value*word_value

def parse_value_syntax(string):
    value_string = re.search(number, string).group()
    value = float(value_string.replace(",", ""))
    return value

def money_conversion(money):
    if money == "N/A":
        return None
    if isinstance(money, list):
        money = money[0]
    word_syntax = re.search(word_re, money, flags=re.I)
    value_syntax = re.search(value_re, money)
    if word_syntax:
        return parse_word_syntax(word_syntax.group())
    elif value_syntax:
        return parse_value_syntax(value_syntax.group())
    else:
        # Amounts without a dollar sign (for example £ or ₩) are not converted
        return None
for movie in movie_info_list:
    movie['Budget ($)'] = money_conversion(movie.get('Budget', "N/A"))
    movie['Box office ($)'] = money_conversion(movie.get('Box office', "N/A"))
money_conversion(str(movie_info_list[-40]["Budget"]))
6200000.0
# The key must match the one created above, 'Budget ($)'.
# With a misspelled key such as 'Budget ()', .get() falls back to 'N/A' for every movie.
print([movie.get('Budget ($)', 'N/A') for movie in movie_info_list])
movie_info_list[34]
{'title': 'Lawrence of Arabia', 'Directed by': 'David Lean', 'Produced by': 'Sam Spiegel', 'Screenplay by': ['Robert Bolt', 'Michael Wilson'], 'Based on': ['Seven Pillars of Wisdom', 'by', 'T. E. Lawrence'], 'Starring': ['Alec Guinness', 'Anthony Quinn', 'Jack Hawkins', 'José Ferrer', 'Anthony Quayle', 'Claude Rains', 'Arthur Kennedy', 'Omar Sharif', "Peter O'Toole"], 'Music by': 'Maurice Jarre', 'Cinematography': 'F.A. Young', 'Edited by': 'Anne V. Coates', 'Production company': 'Horizon Pictures', 'Distributed by': 'Columbia Pictures', 'Release date': ['10 December 1962'], 'Running time': '227 minutes', 'Country': ['United Kingdom'], 'Language': 'English', 'Budget': '$15 million', 'Box office': '$70 million', 'Running time (min)': 227, 'Budget ($)': 15000000.0, 'Box office ($)': 70000000.0}
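The conversion rules can be checked on a few of the budget strings listed earlier. The sketch below is a simplified, self-contained variant written for illustration (the name `parse_dollars` and the single named-group pattern are assumptions, not the code above). It follows the same conventions: a range keeps its lower bound, and amounts without a dollar sign return `None`.

```python
import re

MULTIPLIERS = {"thousand": 1_000, "million": 1_000_000, "billion": 1_000_000_000}

MONEY_RE = re.compile(
    r"\$\s*(?P<low>\d+(?:,\d{3})*(?:\.\d+)?)"            # first amount after the dollar sign
    r"(?:\s*(?:-|–|to)\s*\d+(?:,\d{3})*(?:\.\d+)?)?"     # optional upper bound of a range
    r"(?:\s*(?P<word>thousand|million|billion))?",       # optional scale word
    flags=re.IGNORECASE,
)

def parse_dollars(text):
    """Return the first dollar amount in text as a float, or None if there is none."""
    match = MONEY_RE.search(text)
    if match is None:
        return None
    value = float(match.group("low").replace(",", ""))
    word = match.group("word")
    if word:
        value *= MULTIPLIERS[word.lower()]
    return value

for sample in ["$1.2 million", "$379,000", "$65–70 million", "£527,530"]:
    print(sample, "->", parse_dollars(sample))
```

Converting other currencies would need exchange rates for the relevant year, which is outside what either version attempts.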
4.3 Convert dates into datetime object
from datetime import datetime

dates = [movie.get('Release date', 'N/A') for movie in movie_info_list]

def clean_date(date):
    # Drop anything in parentheses, e.g. "(New York City, premiere)"
    return date.split("(")[0].strip()

def date_conversion(date):
    if isinstance(date, list):
        date = date[0]
    if date == "N/A":
        return None
    date_str = clean_date(date)
    # "August 12, 1927" and "10 December 1962"
    fmts = ["%B %d, %Y", "%d %B %Y"]
    for fmt in fmts:
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError:
            # This format did not match; try the next one
            pass
    return None
for movie in movie_info_list:
    movie['Release date (datetime)'] = date_conversion(movie.get('Release date', 'N/A'))

print([movie.get('Release date (datetime)', 'N/A') for movie in movie_info_list])
[datetime.datetime(1927, 8, 12, 0, 0), datetime.datetime(1929, 2, 1, 0, 0), datetime.datetime(1930, 4, 21, 0, 0), datetime.datetime(1931, 1, 26, 0, 0), datetime.datetime(1932, 4, 12, 0, 0), datetime.datetime(1933, 4, 15, 0, 0), datetime.datetime(1934, 2, 22, 0, 0), datetime.datetime(1935, 11, 8, 0, 0), datetime.datetime(1936, 3, 22, 0, 0), datetime.datetime(1937, 8, 11, 0, 0), datetime.datetime(1938, 8, 23, 0, 0), datetime.datetime(1939, 12, 15, 0, 0), datetime.datetime(1940, 3, 21, 0, 0), datetime.datetime(1941, 10, 28, 0, 0), datetime.datetime(1942, 6, 4, 0, 0), datetime.datetime(1942, 11, 26, 0, 0), datetime.datetime(1944, 5, 3, 0, 0), datetime.datetime(1945, 11, 29, 0, 0), datetime.datetime(1946, 11, 21, 0, 0), datetime.datetime(1947, 11, 11, 0, 0), datetime.datetime(1948, 5, 4, 0, 0), datetime.datetime(1949, 11, 8, 0, 0), datetime.datetime(1950, 10, 13, 0, 0), datetime.datetime(1951, 10, 4, 0, 0), datetime.datetime(1952, 1, 10, 0, 0), datetime.datetime(1953, 8, 5, 0, 0), datetime.datetime(1954, 7, 28, 0, 0), datetime.datetime(1955, 4, 11, 0, 0), datetime.datetime(1956, 10, 17, 0, 0), datetime.datetime(1957, 10, 2, 0, 0), datetime.datetime(1958, 5, 15, 0, 0), datetime.datetime(1959, 11, 18, 0, 0), datetime.datetime(1960, 6, 30, 0, 0), datetime.datetime(1961, 10, 18, 0, 0), datetime.datetime(1962, 12, 10, 0, 0), datetime.datetime(1963, 9, 29, 0, 0), datetime.datetime(1964, 10, 21, 0, 0), datetime.datetime(1965, 3, 2, 0, 0), datetime.datetime(1966, 12, 12, 0, 0), datetime.datetime(1967, 8, 2, 0, 0), datetime.datetime(1968, 9, 26, 0, 0), datetime.datetime(1969, 5, 25, 0, 0), datetime.datetime(1969, 12, 4, 0, 0), datetime.datetime(1971, 10, 7, 0, 0), datetime.datetime(1972, 3, 14, 0, 0), datetime.datetime(1973, 12, 25, 0, 0), datetime.datetime(1974, 12, 12, 0, 0), datetime.datetime(1975, 11, 19, 0, 0), datetime.datetime(1976, 11, 21, 0, 0), datetime.datetime(1977, 4, 20, 0, 0), datetime.datetime(1978, 12, 8, 0, 0), datetime.datetime(1979, 12, 19, 0, 0), 
datetime.datetime(1980, 9, 19, 0, 0), datetime.datetime(1981, 3, 30, 0, 0), datetime.datetime(1982, 11, 30, 0, 0), datetime.datetime(1983, 11, 23, 0, 0), datetime.datetime(1984, 9, 6, 0, 0), datetime.datetime(1985, 12, 18, 0, 0), datetime.datetime(1986, 12, 19, 0, 0), datetime.datetime(1987, 10, 4, 0, 0), datetime.datetime(1988, 12, 16, 0, 0), datetime.datetime(1989, 12, 15, 0, 0), datetime.datetime(1990, 10, 19, 0, 0), datetime.datetime(1991, 1, 30, 0, 0), datetime.datetime(1992, 8, 3, 0, 0), datetime.datetime(1993, 11, 30, 0, 0), datetime.datetime(1994, 6, 23, 0, 0), datetime.datetime(1995, 5, 18, 0, 0), datetime.datetime(1996, 11, 15, 0, 0), datetime.datetime(1997, 11, 1, 0, 0), datetime.datetime(1998, 12, 11, 0, 0), datetime.datetime(1999, 9, 8, 0, 0), datetime.datetime(2000, 5, 1, 0, 0), datetime.datetime(2001, 12, 13, 0, 0), datetime.datetime(2002, 12, 27, 0, 0), datetime.datetime(2003, 12, 1, 0, 0), datetime.datetime(2004, 12, 15, 0, 0), datetime.datetime(2004, 9, 10, 0, 0), datetime.datetime(2006, 9, 26, 0, 0), datetime.datetime(2007, 5, 19, 0, 0), datetime.datetime(2008, 8, 30, 0, 0), datetime.datetime(2008, 9, 4, 0, 0), datetime.datetime(2010, 9, 6, 0, 0), datetime.datetime(2011, 5, 15, 0, 0), datetime.datetime(2012, 8, 31, 0, 0), datetime.datetime(2013, 8, 30, 0, 0), datetime.datetime(2014, 8, 27, 0, 0), datetime.datetime(2015, 9, 3, 0, 0), datetime.datetime(2016, 9, 2, 0, 0), datetime.datetime(2017, 8, 31, 0, 0), datetime.datetime(2018, 9, 11, 0, 0), datetime.datetime(2019, 5, 21, 0, 0)]
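A short standalone check makes the two-format approach concrete. The helper name `parse_release_date` is hypothetical; it mirrors the steps described above (strip the parenthetical, then try each format in turn) on strings taken from the scraped data, plus one that matches neither format.

```python
from datetime import datetime

# US style ("August 12, 1927") and day-first style ("10 December 1962")
DATE_FORMATS = ("%B %d, %Y", "%d %B %Y")

def parse_release_date(text):
    """Parse a release-date string, ignoring any trailing parenthetical note."""
    cleaned = text.split("(")[0].strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # try the next format
    return None

print(parse_release_date("August 12, 1927 (New York City, premiere)"))
print(parse_release_date("10 December 1962"))
print(parse_release_date("1962"))
```

A value that matches neither format comes back as `None`, so it is worth counting the `None` entries after conversion to see whether another format is needed.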
5 Save/load data in a binary file with pickle
import pickle

def save_data_pickle(name, data):
    with open(name, 'wb') as f:  # write a binary file
        pickle.dump(data, f)

def load_data_pickle(name):
    with open(name, 'rb') as f:  # read a binary file
        return pickle.load(f)
save_data_pickle("oscars_winners_data_cleaned.pickle", movie_info_list)
a = load_data_pickle("oscars_winners_data_cleaned.pickle")
a == movie_info_list
True
6 Attach IMDB/Rotten Tomatoes/Metascore scores
movie_info_list = load_data_pickle('oscars_winners_data_cleaned.pickle')
movie_info_list[-41]
{'title': 'Kramer vs. Kramer', 'Directed by': 'Robert Benton', 'Produced by': ['Richard Fischoff', 'Stanley R. Jaffe'], 'Screenplay by': 'Robert Benton', 'Based on': ['Kramer Versus Kramer', 'by', 'Avery Corman'], 'Starring': ['Dustin Hoffman', 'Meryl Streep', 'Jane Alexander'], 'Music by': ['Paul Gemignani', 'Herb Harris', 'John Kander', 'Erma E. Levin', 'Roy B. Yokelson', 'Antonio Vivaldi'], 'Cinematography': 'Néstor Almendros', 'Edited by': 'Gerald B. Greenberg', 'Distributed by': 'Columbia Pictures', 'Release date': ['December 19, 1979'], 'Running time': '105 minutes', 'Country': 'United States', 'Language': 'English', 'Budget': '$8 million', 'Box office': '$173 million', 'Running time (min)': 105, 'Budget ($)': 8000000.0, 'Box office ($)': 173000000.0, 'Release date (datetime)': datetime.datetime(1979, 12, 19, 0, 0)}
– Let's fetch each film's info from www.omdbapi.com:
import os
import urllib.parse

import requests

def get_omdb_info(title):
    base_url = "http://www.omdbapi.com/?"
    # Put the API key in .bashrc, e.g. export OMDB_API_KEY='######'
    parameters = {"apikey": os.environ['OMDB_API_KEY'], 't': title}
    # or we can also do it this way
    # parameters = {"apikey": 'f8185070', 't': title}
    params_encoded = urllib.parse.urlencode(parameters)
    full_url = base_url + params_encoded
    return requests.get(full_url).json()

def get_rotten_tomato_score(omdb_info):
    # Look for the Rotten Tomatoes entry in the 'Ratings' list
    ratings = omdb_info.get('Ratings', [])
    for rating in ratings:
        if rating['Source'] == 'Rotten Tomatoes':
            return rating['Value']
    return None

get_omdb_info("Kramer vs. Kramer")
{'Title': 'Kramer vs. Kramer', 'Year': '1979', 'Rated': 'PG', 'Released': '19 Dec 1979', 'Runtime': '105 min', 'Genre': 'Drama', 'Director': 'Robert Benton', 'Writer': 'Avery Corman (from the novel by), Robert Benton (written for the screen by)', 'Actors': 'Dustin Hoffman, Meryl Streep, Jane Alexander, Justin Henry', 'Plot': "Ted Kramer's wife leaves him, allowing for a lost bond to be rediscovered between Ted and his son, Billy. But a heated custody battle ensues over the divorced couple's son, deepening the wounds left by the separation.", 'Language': 'English', 'Country': 'USA', 'Awards': 'Won 5 Oscars. Another 34 wins & 25 nominations.', 'Poster': 'https://m.media-amazon.com/images/M/MV5BNDM3YjNlYmMtOGY3NS00MmRjLWIyY2UtNDA0MWM3OTNlZTY2XkEyXkFqcGdeQXVyMTQxNzMzNDI@._V1_SX300.jpg', 'Ratings': [{'Source': 'Internet Movie Database', 'Value': '7.8/10'}, {'Source': 'Rotten Tomatoes', 'Value': '88%'}, {'Source': 'Metacritic', 'Value': '77/100'}], 'Metascore': '77', 'imdbRating': '7.8', 'imdbVotes': '132,176', 'imdbID': 'tt0079417', 'Type': 'movie', 'DVD': 'N/A', 'BoxOffice': '$106,260,000', 'Production': 'Columbia Pictures Corporation', 'Website': 'N/A', 'Response': 'True'}
– Extract imdb, metascore and rotten_tomatoes:
for movie in movie_info_list:
    title = movie['title']
    omdb_info = get_omdb_info(title)
    movie['imdb'] = omdb_info.get('imdbRating', None)
    movie['metascore'] = omdb_info.get('Metascore', None)
    movie['rotten_tomatoes'] = get_rotten_tomato_score(omdb_info)

movie_info_list[-32]
{'title': 'Rain Man', 'Directed by': 'Barry Levinson', 'Produced by': 'Mark Johnson', 'Screenplay by': ['Barry Morrow', 'Ronald Bass'], 'Story by': 'Barry Morrow', 'Starring': ['Dustin Hoffman', 'Tom Cruise', 'Valeria Golino'], 'Music by': 'Hans Zimmer', 'Cinematography': 'John Seale', 'Edited by': 'Stu Linder', 'Production company': ['Guber-Peters Company', 'Star Partners II, Ltd.'], 'Distributed by': 'United Artists', 'Release date': ['December 16, 1988'], 'Running time': '134 minutes', 'Country': 'United States', 'Language': 'English', 'Budget': '$25 million', 'Box office': '$354.8 million', 'Running time (min)': 134, 'Budget ($)': 25000000.0, 'Box office ($)': 354800000.0, 'Release date (datetime)': datetime.datetime(1988, 12, 16, 0, 0), 'imdb': '8.0', 'metascore': '65', 'rotten_tomatoes': '89%'}
save_data_pickle('oscars_winners_data_final.pickle', movie_info_list)
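The three scores are stored as strings ('8.0', '65', '89%'), and OMDb uses 'N/A' when a score is missing, so they cannot be sorted or averaged directly. A minimal sketch of one possible conversion follows; the helper name `parse_score` is hypothetical and not part of the original notebook.

```python
def parse_score(value):
    """Convert '8.0', '65', '89%' or '77/100' to a float; None for 'N/A' or missing."""
    if value is None or value == "N/A":
        return None
    # Drop a trailing percent sign and keep the part before any '/'
    text = value.strip().rstrip("%").split("/")[0]
    try:
        return float(text)
    except ValueError:
        return None

# Score values copied from the 'Rain Man' record shown above
example = {"imdb": "8.0", "metascore": "65", "rotten_tomatoes": "89%"}
numeric = {key: parse_score(val) for key, val in example.items()}
print(numeric)
```

The scores use different scales (IMDb out of 10, the other two out of 100), so they would still need rescaling before being compared with each other.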
7 Save data as .json and .csv
movie_info_copy = [movie.copy() for movie in movie_info_list]
– JSON:
for movie in movie_info_copy:
    current_date = movie['Release date (datetime)']
    if current_date:
        movie['Release date (datetime)'] = current_date.strftime("%B %d, %Y")
    else:
        movie['Release date (datetime)'] = None

save_data("disney_data_final.json", movie_info_copy)
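The dates are converted to strings because the standard json module cannot serialize datetime objects. As an alternative to rewriting each copied record, `json.dumps` and `json.dump` accept a `default` callable that is invoked for any object they cannot serialize. A minimal sketch follows; the helper name `encode_datetime` is hypothetical, and the sample record is a trimmed version of the 'Rain Man' entry.

```python
import json
from datetime import datetime

def encode_datetime(obj):
    """Fallback for json: format datetimes the same way as the loop above."""
    if isinstance(obj, datetime):
        return obj.strftime("%B %d, %Y")
    raise TypeError(f"Cannot serialize {type(obj).__name__}")

record = {"title": "Rain Man",
          "Release date (datetime)": datetime(1988, 12, 16)}
encoded = json.dumps(record, default=encode_datetime)
print(encoded)
```

With this approach the original list keeps its datetime objects, and None values are written as JSON null without a special case.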
– CSV:
import pandas as pd

df = pd.DataFrame(movie_info_list)
df.head()
| | title | Directed by | Produced by | Written by | Screenplay by | Story by | Starring | Music by | Cinematography | Edited by | ... | metascore | rotten_tomatoes | Color process | Based on | Production companies | Narrated by | Suggested by | Hangul | Revised Romanization | McCune–Reischauer |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Wings | William A. Wellman | [Lucien Hubbard, Adolph Zukor, Jesse L. Lasky,... | [Titles:, Julian Johnson] | [Hope Loring, Louis D. Lighton] | John Monk Saunders | [Clara Bow, Charles (Buddy) Rogers, Richard Ar... | J.S. Zamecnik | Harry Perry | [E. Lloyd Sheldon, Lucien Hubbard] | ... | N/A | 93% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | The Broadway Melody | Harry Beaumont | [Irving Thalberg, Lawrence Weingarten] | [Sarah Y. Mason, (continuity), Norman Houston,... | NaN | Edmund Goulding | [Charles King, Anita Page, Bessie Love, Jed Pr... | (see article ) | John Arnold | [Sam S. Zimbalist, Uncredited:, William LeVanw... | ... | N/A | 33% | Technicolor | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | All Quiet on the Western Front | Lewis Milestone | Carl Laemmle Jr. | [Maxwell Anderson, George Abbott, Del Andrews,... | NaN | NaN | [Lew Ayres, Louis Wolheim] | David Broekman | Arthur Edeson | [Edgar Adams, Milton Carruth, (silent version,... | ... | 91 | 98% | NaN | [All Quiet on the Western Front, by, Erich Mar... | NaN | NaN | NaN | NaN | NaN | NaN |
3 | Cimarron | Wesley Ruggles | [William LeBaron, Louis Sarecky, (assoc.)] | NaN | [Howard Estabrook, Louis Sarecky] | NaN | [Richard Dix, Irene Dunne] | Max Steiner | Edward Cronjager | William Hamilton | ... | 70 | 50% | NaN | [Cimarron, 1929 novel, by, Edna Ferber] | NaN | NaN | NaN | NaN | NaN | NaN |
4 | Grand Hotel | Edmund Goulding | Irving Thalberg | William A. Drake | NaN | NaN | [Greta Garbo, John Barrymore, Joan Crawford, W... | [William Axt, Charles Maxwell] | William H. Daniels | Blanche Sewell | ... | N/A | 86% | NaN | [Grand Hotel, (play) 1930, by William A. Drake... | NaN | NaN | NaN | NaN | NaN | NaN |
df.to_csv("disney_movie_data_final.csv")