Oscar Awards

We are going to build a list of the films that have won the Academy Award (Oscar) for Best Picture. We’ll scrape the data from Wikipedia with BeautifulSoup.

Source: Academy Award for Best Picture

1 Building the model

– Import libraries:

from bs4 import BeautifulSoup as bs
import requests
from pandasgui import show

1.1 Get the film info box and store it in a Python dictionary

– Load and print HTML:

r = requests.get("https://en.wikipedia.org/wiki/Wings_(1927_film)")

# Convert to a beautiful soup object
soup = bs(r.content, "html.parser")  # name the parser explicitly to avoid the bs4 warning

# Print out the HTML
contents = soup.prettify()
print(contents)
    <!DOCTYPE html>
    <html class="client-nojs" dir="ltr" lang="en">
     <head>
      <meta charset="utf-8"/>
      <title>
       Wings (1927 film) - Wikipedia
      </title>
      <script>
       document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"YAPnXQpAAEIAAH0lzFsAAACS","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Wings_(1927_film)","wgTitle":"Wings (1927 film)","wgCurRevisionId":999345930,"wgRevisionId":999345930,"wgArticleId":61046,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is different from Wikidata","Good articles","Use mdy dates from October 2020","Template film date with 2 release dates","CS1: long volume value","Commons category link is on Wikidata","Articles with Internet Archive links",
........................................................................................................................
........................................................................................................................
........................................................................................................................
      </script>
      <script type="application/ld+json">
       {"@context":"https:\/\/schema.org","@type":"Article","name":"Wings (1927 film)","url":"https:\/\/en.wikipedia.org\/wiki\/Wings_(1927_film)","sameAs":"http:\/\/www.wikidata.org\/entity\/Q272036","mainEntity":"http:\/\/www.wikidata.org\/entity\/Q272036","author":{"@type":"Organization","name":"Contributors to Wikimedia projects"},"publisher":{"@type":"Organization","name":"Wikimedia Foundation, Inc.","logo":{"@type":"ImageObject","url":"https:\/\/www.wikimedia.org\/static\/images\/wmf-hor-googpub.png"}},"datePublished":"2002-07-08T19:01:08Z","dateModified":"2021-01-09T18:39:25Z","image":"https:\/\/upload.wikimedia.org\/wikipedia\/commons\/6\/67\/Wings_poster.jpg","headline":"1927 film by William A. Wellman, Harry d\u2019Abbadie d\u2019Arrast"}
      </script>
      <script>
       (RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgBackendResponseTime":154,"wgHostname":"mw1271"});});
      </script>
     </body>
    </html>
info_box = soup.find(class_="infobox vevent")
info_rows = info_box.find_all("tr")
for row in info_rows:
    print(row.prettify())
<tr>
     <th class="summary" colspan="2" style="text-align:center;font-size:125%;font-weight:bold;font-size:110%;font-style:italic;">
      Wings
     </th>
    </tr>
    
    <tr>
     <td colspan="2" style="text-align:center">
      <a class="image" href="/wiki/File:Wings_poster.jpg">
       <img alt="Wings poster.jpg" class="thumbborder" data-file-height="1500" data-file-width="990" decoding="async" height="333" src="//upload.wikimedia.org/wikipedia/commons/thumb/6/67/Wings_poster.jpg/220px-Wings_poster.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/6/67/Wings_poster.jpg/330px-Wings_poster.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/6/67/Wings_poster.jpg/440px-Wings_poster.jpg 2x" width="220"/>
      </a>
      <div style="font-size:95%;padding:0.35em 0.35em 0.25em;line-height:1.25em;">
       Film poster
      </div>
     </td>
    </tr>
    
    <tr>
     <th scope="row" style="white-space:nowrap;padding-right:0.65em;">
      Directed by
     </th>
     <td>
      <a href="/wiki/William_A._Wellman" title="William A. Wellman">
       William A. Wellman
      </a>
     </td>
    </tr>
    
    <tr>
     <th scope="row" style="white-space:nowrap;padding-right:0.65em;">
      Produced by
     </th>
     <td>
      <a href="/wiki/Lucien_Hubbard" title="Lucien Hubbard">
       Lucien Hubbard
      </a>
      <br/>
      <a href="/wiki/Adolph_Zukor" title="Adolph Zukor">
       Adolph Zukor
      </a>
      <br/>
      <a href="/wiki/Jesse_L._Lasky" title="Jesse L. Lasky">
       Jesse L. Lasky
      </a>
      <br/>
      <a class="mw-redirect" href="/wiki/B.P._Schulberg" title="B.P. Schulberg">
       B.P. Schulberg
      </a>
      <br/>
      <a href="/wiki/Otto_Hermann_Kahn" title="Otto Hermann Kahn">
       Otto Hermann Kahn
      </a>
      <br/>
      <span style="font-size:85%;">
       (
       <i>
        uncredited
       </i>
       )
      </span>
      <sup class="reference" id="cite_ref-1">
       <a href="#cite_note-1">
        [1]
       </a>
      </sup>
      <sup class="reference" id="cite_ref-2">
       <a href="#cite_note-2">
        [a]
       </a>
      </sup>
     </td>
    </tr>
    
    <tr>
     <th scope="row" style="white-space:nowrap;padding-right:0.65em;">
      Written by
     </th>
     <td>
      <b>
       Titles:
      </b>
      <br/>
      Julian Johnson
     </td>
    </tr>
    
    <tr>
     <th scope="row" style="white-space:nowrap;padding-right:0.65em;">
      Screenplay by
     </th>
     <td>
      <a href="/wiki/Hope_Loring" title="Hope Loring">
       Hope Loring
      </a>
      <br/>
      <a href="/wiki/Louis_D._Lighton" title="Louis D. Lighton">
       Louis D. Lighton
      </a>
     </td>
    </tr>
    
    <tr>
     <th scope="row" style="white-space:nowrap;padding-right:0.65em;">
      Story by
     </th>
     <td>
      <a href="/wiki/John_Monk_Saunders" title="John Monk Saunders">
       John Monk Saunders
      </a>
     </td>
    </tr>
    
    <tr>
     <th scope="row" style="white-space:nowrap;padding-right:0.65em;">
      Starring
     </th>
     <td>
      <a href="/wiki/Clara_Bow" title="Clara Bow">
       Clara Bow
      </a>
      <br/>
      <a class="mw-redirect" href="/wiki/Charles_(Buddy)_Rogers" title="Charles (Buddy) Rogers">
       Charles (Buddy) Rogers
      </a>
      <br/>
      <a href="/wiki/Richard_Arlen" title="Richard Arlen">
       Richard Arlen
      </a>
      <br/>
      <a href="/wiki/Gary_Cooper" title="Gary Cooper">
       Gary Cooper
      </a>
     </td>
    </tr>
    
    <tr>
     <th scope="row" style="white-space:nowrap;padding-right:0.65em;">
      Music by
     </th>
     <td>
      <a class="mw-redirect" href="/wiki/J.S._Zamecnik" title="J.S. Zamecnik">
       J.S. Zamecnik
      </a>
      <span style="font-size:85%;">
       (
       <i>
        uncredited
       </i>
       )
      </span>
     </td>
    </tr>
    
    <tr>
     <th scope="row" style="white-space:nowrap;padding-right:0.65em;">
      Cinematography
     </th>
     <td>
      <a href="/wiki/Harry_Perry_(cinematographer)" title="Harry Perry (cinematographer)">
       Harry Perry
      </a>
     </td>
    </tr>
    
    <tr>
     <th scope="row" style="white-space:nowrap;padding-right:0.65em;">
      Edited by
     </th>
     <td>
      <a href="/wiki/E._Lloyd_Sheldon" title="E. Lloyd Sheldon">
       E. Lloyd Sheldon
      </a>
      <br/>
      Lucien Hubbard
      <span style="font-size:85%;">
       (
       <i>
        uncredited
       </i>
       )
      </span>
     </td>
    </tr>
    
    <tr>
     <th scope="row" style="white-space:nowrap;padding-right:0.65em;">
      <div style="display:inline-block; padding:0.1em 0;line-height:1.2em;">
       Production
       <br/>
       company
      </div>
     </th>
     <td>
      <div style="vertical-align:middle;">
       <a class="mw-redirect" href="/wiki/Paramount_Famous_Lasky_Corporation" title="Paramount Famous Lasky Corporation">
        Paramount Famous Lasky Corporation
       </a>
      </div>
     </td>
    </tr>
    
    <tr>
     <th scope="row" style="white-space:nowrap;padding-right:0.65em;">
      Distributed by
     </th>
     <td>
      <a href="/wiki/Paramount_Pictures" title="Paramount Pictures">
       Paramount Pictures
      </a>
     </td>
    </tr>
    
    <tr>
     <th scope="row" style="white-space:nowrap;padding-right:0.65em;">
      <div style="display:inline-block; padding:0.1em 0;line-height:1.2em;white-space:normal;">
       Release date
      </div>
     </th>
     <td>
      <div class="plainlist">
       <ul>
        <li>
         August 12, 1927
         <span style="display:none">
          (
          <span class="bday dtstart published updated">
           1927-08-12
          </span>
          )
         </span>
         (New York City, premiere)
        </li>
        <li>
         January 15, 1928
         <span style="display:none">
          (
          <span class="bday dtstart published updated">
           1928-01-15
          </span>
          )
         </span>
         (Los Angeles)
        </li>
       </ul>
      </div>
     </td>
    </tr>
    
    <tr>
     <th scope="row" style="white-space:nowrap;padding-right:0.65em;">
      <div style="display:inline-block; padding:0.1em 0;line-height:1.2em;white-space:normal;">
       Running time
      </div>
     </th>
     <td>
      <b>
       Original release:
      </b>
      <br/>
      111 minutes
      <sup class="reference" id="cite_ref-3">
       <a href="#cite_note-3">
        [2]
       </a>
      </sup>
      <br/>
      <b>
       Restoration:
      </b>
      <br/>
      144 minutes
      <sup class="reference" id="cite_ref-4">
       <a href="#cite_note-4">
        [3]
       </a>
      </sup>
     </td>
    </tr>
    
    <tr>
     <th scope="row" style="white-space:nowrap;padding-right:0.65em;">
      Country
     </th>
     <td>
      United States
     </td>
    </tr>
    
    <tr>
     <th scope="row" style="white-space:nowrap;padding-right:0.65em;">
      Language
     </th>
     <td>
      <a href="/wiki/Silent_film" title="Silent film">
       Silent
      </a>
      (English
      <a href="/wiki/Intertitle" title="Intertitle">
       intertitles
      </a>
      )
     </td>
    </tr>
    
    <tr>
     <th scope="row" style="white-space:nowrap;padding-right:0.65em;">
      Budget
     </th>
     <td>
      US$ 2 million ($28,850,173 adjusted for inflation)
      <sup class="reference" id="cite_ref-Silent_Era_5-0">
       <a href="#cite_note-Silent_Era-5">
        [4]
       </a>
      </sup>
     </td>
    </tr>
    
    <tr>
     <th scope="row" style="white-space:nowrap;padding-right:0.65em;">
      Box office
     </th>
     <td>
      $3,600,000 (worldwide rentals)
      <sup class="reference" id="cite_ref-Variety_6-0">
       <a href="#cite_note-Variety-6">
        [5]
       </a>
      </sup>
     </td>
    </tr>

– Get content of infobox and put it in a dictionary:

def get_content_value(row_data):
    if row_data.find("li"):
        return [li.get_text(" ", strip=True).replace("\xa0", " ") for li in row_data.find_all("li")]
    else:
        return row_data.get_text(" ", strip=True).replace("\xa0", " ")

movie_info = {}
for index, row in enumerate(info_rows):
    if index == 0: #Title
        movie_info['title'] = row.find("th").get_text(" ", strip=True)
    elif index == 1: #It's a picture
        continue
    else:
        content_key = row.find("th").get_text(" ", strip=True)
        content_value = get_content_value(row.find("td"))
        movie_info[content_key] = content_value

– Print out the content:

for key, value in movie_info.items():
    print(key,":", value)
    title : Wings
    Directed by : William A. Wellman
    Produced by : Lucien Hubbard Adolph Zukor Jesse L. Lasky B.P. Schulberg Otto Hermann Kahn ( uncredited ) [1] [a]
    Written by : Titles: Julian Johnson
    Screenplay by : Hope Loring Louis D. Lighton
    Story by : John Monk Saunders
    Starring : Clara Bow Charles (Buddy) Rogers Richard Arlen Gary Cooper
    Music by : J.S. Zamecnik ( uncredited )
    Cinematography : Harry Perry
    Edited by : E. Lloyd Sheldon Lucien Hubbard ( uncredited )
    Production company : Paramount Famous Lasky Corporation
    Distributed by : Paramount Pictures
    Release date : ['August 12, 1927 ( 1927-08-12 ) (New York City, premiere)', 'January 15, 1928 ( 1928-01-15 ) (Los Angeles)']
    Running time : Original release: 111 minutes [2] Restoration: 144 minutes [3]
    Country : United States
    Language : Silent (English intertitles )
    Budget : US$ 2 million ($28,850,173 adjusted for inflation) [4]
    Box office : $3,600,000 (worldwide rentals) [5]
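The row-to-dictionary logic above can be tried offline on a small hand-written table. This is only a sketch: the HTML below is a made-up stand-in for the real Wikipedia info box, and `cell_value` is a hypothetical helper name, but the BeautifulSoup calls are the same ones used in this section.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a Wikipedia info box (not the real page)
html = """
<table class="infobox vevent">
  <tr><th colspan="2">Wings</th></tr>
  <tr><td colspan="2"><img alt="poster"/></td></tr>
  <tr><th>Directed by</th><td><a>William A. Wellman</a></td></tr>
  <tr><th>Release date</th><td><ul><li>August 12, 1927</li><li>January 15, 1928</li></ul></td></tr>
  <tr><th>Budget</th><td>US$&nbsp;2 million</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find(class_="infobox vevent").find_all("tr")

def cell_value(td):
    """Return a list for <li> cells, otherwise the cell text with plain spaces."""
    items = td.find_all("li")
    if items:
        return [li.get_text(" ", strip=True).replace("\xa0", " ") for li in items]
    return td.get_text(" ", strip=True).replace("\xa0", " ")

info = {"title": rows[0].th.get_text(" ", strip=True)}
for row in rows[1:]:
    if row.th is None or row.td is None:  # e.g. the poster row has no header
        continue
    info[row.th.get_text(" ", strip=True)] = cell_value(row.td)

for key, value in info.items():
    print(key, ":", value)
```

Checking for a missing header instead of relying on the row index means the poster row is skipped wherever it appears.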

1.2 Get info box for all movies

r = requests.get("https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture")

# Convert to a beautiful soup object
soup = bs(r.content, "html.parser")  # name the parser explicitly to avoid the bs4 warning

# Print out the HTML
contents = soup.prettify()
print(contents)
<!DOCTYPE html>
    <html class="client-nojs" dir="ltr" lang="en">
     <head>
      <meta charset="utf-8"/>
      <title>
       Academy Award for Best Picture - Wikipedia
      </title>
      <script>
       document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"YAMl5wpAAL0AAnAcKxwAAACE","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Academy_Award_for_Best_Picture","wgTitle":"Academy Award for Best Picture","wgCurRevisionId":1000537225,"wgRevisionId":1000537225,"wgArticleId":61702,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles containing Latin-language text","Webarchive template wayback links","Articles with short description","Short description is different from Wikidata","Articles containing potentially dated statements from 2020","All articles containing potentially dated statements",
    "Articles containing potentially dated statements from 2014","Articles contradicting other articles","Academy Awards","Awards for best film","Best Picture Academy Award
.......................................................................................................................
.......................................................................................................................
.......................................................................................................................
      </script>
      <script type="application/ld+json">
       {"@context":"https:\/\/schema.org","@type":"Article","name":"Academy Award for Best Picture","url":"https:\/\/en.wikipedia.org\/wiki\/Academy_Award_for_Best_Picture","sameAs":"http:\/\/www.wikidata.org\/entity\/Q102427","mainEntity":"http:\/\/www.wikidata.org\/entity\/Q102427","author":{"@type":"Organization","name":"Contributors to Wikimedia projects"},"publisher":{"@type":"Organization","name":"Wikimedia Foundation, Inc.","logo":{"@type":"ImageObject","url":"https:\/\/www.wikimedia.org\/static\/images\/wmf-hor-googpub.png"}},"datePublished":"2001-08-21T23:02:36Z","dateModified":"2021-01-15T14:46:55Z","headline":"annual award from the Academy of Motion Picture Arts and Sciences (AMPAS)"}
      </script>
      <script>
       (RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgBackendResponseTime":140,"wgHostname":"mw1385"});});
      </script>
     </body>
    </html>
info_box = soup.find(class_="mw-parser-output")
info_rows = info_box.find_all("tr", {"style": "background:#FAEB86"})  # winner rows are highlighted in gold
for row in info_rows:
    print(row.prettify())
    <tr style="background:#FAEB86">
     <td>
      <a href="/wiki/Wings_(1927_film)" title="Wings (1927 film)">
       <i>
        <b>
         Wings
        </b>
       </i>
      </a>
     </td>
     <td>
      <b>
       <a href="/wiki/Famous_Players-Lasky" title="Famous Players-Lasky">
        Famous Players-Lasky
       </a>
      </b>
     </td>
    </tr>
    
    <tr style="background:#FAEB86">
     <td>
      <i>
       <b>
        <a href="/wiki/The_Broadway_Melody" title="The Broadway Melody">
         The Broadway Melody
        </a>
       </b>
      </i>
     </td>
     <td>
      <b>
       <a href="/wiki/Metro-Goldwyn-Mayer" title="Metro-Goldwyn-Mayer">
        Metro-Goldwyn-Mayer
       </a>
      </b>
     </td>
    </tr>
.........................................................................................
.........................................................................................
.........................................................................................
    <tr style="background:#FAEB86">
     <td>
      <i>
       <b>
        <a href="/wiki/Parasite_(2019_film)" title="Parasite (2019 film)">
         Parasite
        </a>
       </b>
      </i>
     </td>
     <td>
      <b>
       <a href="/wiki/Kwak_Sin-ae" title="Kwak Sin-ae">
        Kwak Sin-ae
       </a>
       and
       <a href="/wiki/Bong_Joon-ho" title="Bong Joon-ho">
        Bong Joon-ho
       </a>
      </b>
     </td>
    </tr>

– Put it in a list:

movies = soup.select('tr[style$="FAEB86"]')
movies[0:10]
print(movies[1].a['href'])
print(movies[-1].a['href'])

# Keep the first link of each winner row (the link to the film's own article)
movies2 = [movie.a for movie in movies]

movies2[-5:]
len(movies2)
92
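The selector `tr[style$="FAEB86"]` picks every row whose `style` attribute ends with the gold colour code. A minimal offline sketch of the same idea, assuming a tiny hand-written table in place of the Wikipedia page (the variable names are illustrative only):

```python
from bs4 import BeautifulSoup

# Hand-written table: two highlighted winner rows and one plain nominee row
html = """
<table>
  <tr style="background:#FAEB86"><td><a href="/wiki/Wings_(1927_film)">Wings</a></td></tr>
  <tr><td><a href="/wiki/The_Racket_(1928_film)">The Racket</a></td></tr>
  <tr style="background:#FAEB86"><td><a href="/wiki/The_Broadway_Melody">The Broadway Melody</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# [style$="..."] matches attributes that END with the given text
winner_rows = soup.select('tr[style$="FAEB86"]')

# row.a is the first <a> inside the row; skip rows without any link
winner_links = [row.a for row in winner_rows if row.a is not None]
hrefs = [a["href"] for a in winner_links]
print(hrefs)
```

Only the two highlighted rows are selected, so the nominee row never reaches the list.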

2 The Model

2.1 Get the info box of one film and clean up the tags

def get_content_value(row_data):
    if row_data.find("li"):
        return [li.get_text(" ", strip=True).replace("\xa0", " ") for li in row_data.find_all("li")]
    elif row_data.find("br"):
        return [text for text in row_data.stripped_strings]
    else:
        return row_data.get_text(" ", strip=True).replace("\xa0", " ")

def clean_tags(soup): #clean up references
    for tag in soup.find_all(["sup", "span"]):
        tag.decompose()

def get_info_box(url):
    r = requests.get(url)
    soup = bs(r.content, "html.parser")
    info_box = soup.find(class_="infobox vevent")
    clean_tags(info_box)  # strip references before reading any text
    info_rows = info_box.find_all("tr")
    movie_info = {}
    for index, row in enumerate(info_rows):
        if index == 0:  # the first row holds the title
            movie_info['title'] = row.find("th").get_text(" ", strip=True)
        else:
            header = row.find("th")
            data = row.find("td")
            if header and data:  # skip rows without a header/value pair, such as the poster
                content_key = header.get_text(" ", strip=True)
                movie_info[content_key] = get_content_value(data)
    return movie_info


get_info_box("https://en.wikipedia.org/wiki/Wings_(1927_film)")
    {'title': 'Wings',
     'Directed by': 'William A. Wellman',
     'Produced by': ['Lucien Hubbard',
      'Adolph Zukor',
      'Jesse L. Lasky',
      'B.P. Schulberg',
      'Otto Hermann Kahn'],
     'Written by': ['Titles:', 'Julian Johnson'],
     'Screenplay by': ['Hope Loring', 'Louis D. Lighton'],
     'Story by': 'John Monk Saunders',
     'Starring': ['Clara Bow',
      'Charles (Buddy) Rogers',
      'Richard Arlen',
      'Gary Cooper'],
     'Music by': 'J.S. Zamecnik',
     'Cinematography': 'Harry Perry',
     'Edited by': ['E. Lloyd Sheldon', 'Lucien Hubbard'],
     'Production company': 'Paramount Famous Lasky Corporation',
     'Distributed by': 'Paramount Pictures',
     'Release date': ['August 12, 1927 (New York City, premiere)',
      'January 15, 1928 (Los Angeles)'],
     'Running time': ['Original release:',
      '111 minutes',
      'Restoration:',
      '144 minutes'],
     'Country': 'United States',
     'Language': 'Silent (English intertitles )',
     'Budget': 'US$ 2 million ($28,850,173 adjusted for inflation)',
     'Box office': '$3,600,000 (worldwide rentals)'}
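To see why the `br` branch and `clean_tags` change the result, here is a small offline sketch. It assumes a two-row hand-written table rather than a real info box, and `cell_value` is a hypothetical name for a cut-down version of `get_content_value`.

```python
from bs4 import BeautifulSoup

# Hand-written rows: one cell split by <br/>, one cell carrying a reference mark
html = """
<table>
  <tr><th>Starring</th><td><a>Clara Bow</a><br/><a>Richard Arlen</a></td></tr>
  <tr><th>Budget</th><td>$2 million<sup class="reference"><a>[4]</a></sup></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Same idea as clean_tags: drop reference marks and hidden spans from the tree
for tag in soup.find_all(["sup", "span"]):
    tag.decompose()

def cell_value(td):
    """Split <br/>-separated cells into a list, otherwise return plain text."""
    if td.find("br"):
        return list(td.stripped_strings)
    return td.get_text(" ", strip=True).replace("\xa0", " ")

info = {row.th.get_text(" ", strip=True): cell_value(row.td)
        for row in soup.find_all("tr")}
print(info)
```

Without the `decompose()` loop the budget would come out as "$2 million [4]", which is the kind of noise visible in the first version of the dictionary.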

2.2 Get all films and put them in a list.

r = requests.get("https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture")

soup = bs(r.content, "html.parser")  # name the parser explicitly to avoid the bs4 warning
contents = soup.prettify()
print(contents)


movies = soup.select('tr[style$="FAEB86"]')
movies[0:10]
print(movies[1].a['href'])

# Keep the first link of each winner row (the link to the film's own article)
movies2 = [movie.a for movie in movies]

movies2[-5:]
len(movies2)
    /wiki/The_Broadway_Melody
    92

2.3 Last step: get the info box of every film

base_path = "https://en.wikipedia.org"  # no trailing slash: every href already starts with "/"

movie_info_list = []
for movie in movies2:
    try:
        relative_path = movie['href']
        full_path = base_path + relative_path
        movie_info_list.append(get_info_box(full_path))
    except Exception as e:
        # Report the film that failed and carry on with the next one
        print(movie.get_text())
        print(e)

movie_info_list[0]
len(movie_info_list)
    92
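Building the full address by string concatenation depends on whether the base ends with a slash. As an alternative sketch (not what the loop above does), the standard library's `urljoin` handles the slash for us; the variable names here are illustrative.

```python
from urllib.parse import urljoin

base_url = "https://en.wikipedia.org/"
relative_path = "/wiki/Wings_(1927_film)"

# Plain concatenation keeps both slashes
naive = base_url + relative_path

# urljoin resolves the path against the base, like a browser would
joined = urljoin(base_url, relative_path)

print(naive)
print(joined)
```

`urljoin` gives the same result whether or not `base_url` ends with a slash, because a path starting with "/" replaces the base's path.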

3 Save and load Movie Data

import json

def save_data(title, data):
    with open(title, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)


def load_data(title):
    with open(title, encoding="utf-8") as f:
        return json.load(f)

save_data("oscars_movies.json", movie_info_list)
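A quick way to check that saving and loading are lossless is a round trip on a small sample. This sketch writes to a temporary folder so it does not touch `oscars_movies.json`; the sample record and file name are made up for the test.

```python
import json
import os
import tempfile

# Tiny made-up sample with a non-ASCII character in it
sample = [{"title": "Lawrence of Arabia", "Starring": ["José Ferrer", "Peter O'Toole"]}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_movies.json")

    # Same options as save_data: UTF-8 file, no ASCII escaping, pretty-printed
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False, indent=2)

    with open(path, encoding="utf-8") as f:
        raw_text = f.read()

restored = json.loads(raw_text)
print(restored == sample)
```

With `ensure_ascii=False` the accented name is stored as readable text in the file instead of a `\u00e9` escape, which is why the file is opened with `encoding="utf-8"` on both sides.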

4 Cleaning the data

4.1 Convert running time into an integer

movie_info_list = load_data("oscars_movies.json")

len([movie.get('Running time', 'N/A') for movie in movie_info_list])
print([movie.get('Running time', 'N/A') for movie in movie_info_list])
    [['Original release:', '111 minutes', 'Restoration:', '144 minutes'], '100 minutes', ['152 minutes', '133 minutes (restored)'], '124 minutes', '112 minutes', '112 minutes', '105 minutes', '132 minutes', ['177 minutes', '185 minutes (roadshow)'], '116 minutes', '126 minutes', ['221 minutes', "234–238 minutes (with overture , intermission , entr'acte , and exit music)"], '130 minutes', '118 minutes', '133 minutes', '102 minutes', '126 minutes', '101 minutes', '172 minutes', '118 minutes', '155 minutes', '110 minutes', '138 minutes', '113 minutes', '152 minutes', '118 minutes', '108 minutes', '90 minutes', '182 minutes', '161 minutes', '115 minutes', '212 minutes', '125 minutes', '152 minutes', '227 minutes', '128 minutes', '170 minutes', '174 minutes', '120 minutes', '109 minutes', '153 minutes', '113 minutes', '170 minutes', '104 minutes', '177 minutes', '129 minutes', '200 minutes', '133 minutes', '119 minutes', '93 minutes', '184 minutes', '105 minutes', '124 minutes', '124 minutes', '191 minutes', '132 minutes', '161 minutes', '161 minutes', '120 minutes', '163 minutes', '134 minutes', '99 minutes', '181 minutes', '118 minutes', '131 minutes', '195 minutes', '142 minutes', '178 minutes', '162 minutes', '195 minutes', '123 minutes', '122 minutes', '155 minutes', '135 minutes', '113 minutes', '201 minutes', '132 minutes', '112 minutes', '151 minutes', '122 minutes', '120 minutes', '131 minutes', '119 minutes', '100 minutes', '120 minutes', '134 minutes', '119 minutes', '129 minutes', '111 minutes', '123 minutes', '130 minutes', '132 minutes']
import re

# Convert a running time such as "111 minutes" to an integer number of minutes
def minutes_to_integer(running_time):
    if running_time == "N/A":
        return None
    if isinstance(running_time, list):  # several versions: take the first plain "... minutes" entry
        for time in running_time:
            if time.endswith("minutes"):
                return int(time.split(" ")[0])
        return None  # no entry ended with "minutes"
    return int(running_time.split(" ")[0])  # a single string

# Store the running time as an integer under a new key
for movie in movie_info_list:
    movie['Running time (min)'] = minutes_to_integer(movie.get('Running time', "N/A"))

movie_info_list[0]

# The integers are now stored under the new key "Running time (min)"
print([movie.get('Running time (min)', 'N/A') for movie in movie_info_list])
    [111, 100, 152, 124, 112, 112, 105, 132, 177, 116, 126, 221, 130, 118, 133, 102, 126, 101, 172, 118, 155, 110, 138, 113, 152, 118, 108, 90, 182, 161, 115, 212, 125, 152, 227, 128, 170, 174, 120, 109, 153, 113, 170, 104, 177, 129, 200, 133, 119, 93, 184, 105, 124, 124, 191, 132, 161, 161, 120, 163, 134, 99, 181, 118, 131, 195, 142, 178, 162, 195, 123, 122, 155, 135, 113, 201, 132, 112, 151, 122, 120, 131, 119, 100, 120, 134, 119, 129, 111, 123, 130, 132]
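Another way to read the minutes, shown here only as a sketch and not as the method used above, is a regular expression that looks for a number directly followed by "minutes". The function name `first_minutes` is hypothetical.

```python
import re

MINUTES_RE = re.compile(r"(\d+)\s*minutes")

def first_minutes(running_time):
    """Return the first 'N minutes' value found, or None if there is none."""
    items = running_time if isinstance(running_time, list) else [running_time]
    for item in items:
        match = MINUTES_RE.search(item)
        if match:
            return int(match.group(1))
    return None

print(first_minutes(['Original release:', '111 minutes', 'Restoration:', '144 minutes']))
print(first_minutes('100 minutes'))
print(first_minutes('N/A'))
```

Because the pattern searches inside the text, it also reads entries like '133 minutes (restored)' that do not end with the word "minutes".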

4.2 Convert Budget & Box office to numbers

print([movie.get('Budget', 'N/A') for movie in movie_info_list])
    ['US$ 2 million ($28,850,173 adjusted for inflation)', '$379,000', '$1.2 million', '$1,433,000', '$750,000', '$1,180,280', '$325,000', '$1,950,000', '$2.183 million', 'N/A', 'US$1,644,736 (est.)', '$3.85 million', '$1.29 million', '$800,000', '$1.34 million', '$878,000 (equivalent to $13,738,644 in 2019) –$1 million', 'N/A', '$1.25 million', '$2.1 million or $3 million', '$1,985,000', '£527,530', 'N/A', '$1.4 million', '$2.7 million', '$4 million', '$1.7–2.5 million', '$910,000', '$350,000', '$6 million', '$2.8 million', '$3.3 million', '$15.2 million', '$3 million', '$6.75 million', '$15 million', '$1 million ( )', '$17 million', '$8.2 million', '$2 million', '$2 million', '$10 million', '$3.2 million', '$12.6 million', '$1.8 million', '$6–7.2 million', '$5.5 million', '$13 million', '$3–4.4 million', ['$960,000', '(equivalent to $4.31\xa0million in 2019)'], '$4 million', '$15 million', '$8 million', '$6.2 million', '$5.5 million (£3 million)', '$22 million', '$8 million', '$18 million', '$28 million', '$6 million', '$23.8 million', '$25 million', '$7.5 million', '$22 million', '$19 million', '$14.4 million', '$22 million', '$55 million', '$65–70 million', '$27–31 million', '$200 million', '$25 million', '$15 million', '$103 million', '$58 million', '$45 million', '$94 million', '$30 million', '$6.5 million', '$90 million', '$25 million', '$15 million', '$15 million', '$15 million', '$15 million', '$44.5 million', '$20–22 million', '$18 million', '$20 million', '$1.5-4 million', '$19.5–20 million', '$23 million', ['₩17.0 billion']]
import re

amounts = r"thousand|million|billion"
number = r"\d+(,\d{3})*\.*\d*"

word_re = rf"\${number}(-|\sto\s|–)?({number})?\s({amounts})"
value_re = rf"\${number}"

def word_to_value(word):
    value_dict = {"thousand": 1000, "million": 1000000, "billion": 1000000000}
    return value_dict[word]

def parse_word_syntax(string):
    value_string = re.search(number, string).group()
    value = float(value_string.replace(",", ""))
    word = re.search(amounts, string, flags=re.I).group().lower()
    word_value = word_to_value(word)
    return value*word_value

def parse_value_syntax(string):
    value_string = re.search(number, string).group()
    value = float(value_string.replace(",", ""))
    return value

def money_conversion(money):
    if money == "N/A":
        return None
    if isinstance(money, list):
        money = money[0]
    word_syntax = re.search(word_re, money, flags=re.I)
    value_syntax = re.search(value_re, money)
    if word_syntax:
        return parse_word_syntax(word_syntax.group())
    elif value_syntax:
        return parse_value_syntax(value_syntax.group())
    else:
        return None
for movie in movie_info_list:
    movie['Budget ($)'] = money_conversion(movie.get('Budget', "N/A"))
    movie['Box office ($)'] = money_conversion(movie.get('Box office', "N/A"))
money_conversion(str(movie_info_list[-40]["Budget"]))
    6200000.0
# The key must match 'Budget ($)' exactly; a mistyped key such as 'Budget ()' makes .get() return 'N/A' for every movie
print([movie.get('Budget ($)', 'N/A') for movie in movie_info_list])
movie_info_list[34]
    {'title': 'Lawrence of Arabia',
     'Directed by': 'David Lean',
     'Produced by': 'Sam Spiegel',
     'Screenplay by': ['Robert Bolt', 'Michael Wilson'],
     'Based on': ['Seven Pillars of Wisdom', 'by', 'T. E. Lawrence'],
     'Starring': ['Alec Guinness',
      'Anthony Quinn',
      'Jack Hawkins',
      'José Ferrer',
      'Anthony Quayle',
      'Claude Rains',
      'Arthur Kennedy',
      'Omar Sharif',
      "Peter O'Toole"],
     'Music by': 'Maurice Jarre',
     'Cinematography': 'F.A. Young',
     'Edited by': 'Anne V. Coates',
     'Production company': 'Horizon Pictures',
     'Distributed by': 'Columbia Pictures',
     'Release date': ['10 December 1962'],
     'Running time': '227 minutes',
     'Country': ['United Kingdom'],
     'Language': 'English',
     'Budget': '$15 million',
     'Box office': '$70 million',
     'Running time (min)': 227,
     'Budget ($)': 15000000.0,
     'Box office ($)': 70000000.0}
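Some budget strings have a space after the dollar sign ('US$ 2 million ...'). The patterns above expect a digit right after `$`, so for that string they match the inflation-adjusted figure in the parentheses instead. Below is a hedged variant that allows optional whitespace. It is a sketch with hypothetical names (`parse_dollars`, `MONEY_RE`), and like the code above it keeps the lower bound of a range.

```python
import re

NUMBER = r"\d+(?:,\d{3})*(?:\.\d+)?"
SCALE = {"thousand": 1_000, "million": 1_000_000, "billion": 1_000_000_000}

# "$", optional spaces, a number, an optional "-70" / "–70" / "to 70" range,
# and an optional scale word
MONEY_RE = re.compile(
    rf"\$\s*({NUMBER})(?:\s*(?:-|–|to)\s*{NUMBER})?(?:\s*(thousand|million|billion))?",
    flags=re.I,
)

def parse_dollars(text):
    """Return the first dollar amount in text as a float, or None."""
    match = MONEY_RE.search(text)
    if match is None:
        return None
    value = float(match.group(1).replace(",", ""))
    word = match.group(2)
    return value * SCALE[word.lower()] if word else value

print(parse_dollars("US$ 2 million ($28,850,173 adjusted for inflation)"))
print(parse_dollars("$65–70 million"))
print(parse_dollars("£527,530"))
```

Amounts in other currencies (£, ₩) are still returned as None, since converting them would need an exchange rate.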

4.3 Convert dates into datetime object

from datetime import datetime

dates = [movie.get('Release date', 'N/A') for movie in movie_info_list]

def clean_date(date):
    return date.split("(")[0].strip()

def date_conversion(date):
    if isinstance(date, list):
        date = date[0]
    if date == "N/A":
        return None
    date_str = clean_date(date)
    fmts = ["%B %d, %Y", "%d %B %Y"]
    for fmt in fmts:
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError:  # format did not match; try the next one
            pass
    return None

for movie in movie_info_list:
    movie['Release date (datetime)'] = date_conversion(movie.get('Release date', 'N/A'))


print([movie.get('Release date (datetime)', 'N/A') for movie in movie_info_list])
    [datetime.datetime(1927, 8, 12, 0, 0), datetime.datetime(1929, 2, 1, 0, 0), datetime.datetime(1930, 4, 21, 0, 0), datetime.datetime(1931, 1, 26, 0, 0), datetime.datetime(1932, 4, 12, 0, 0), datetime.datetime(1933, 4, 15, 0, 0), datetime.datetime(1934, 2, 22, 0, 0), datetime.datetime(1935, 11, 8, 0, 0), datetime.datetime(1936, 3, 22, 0, 0), datetime.datetime(1937, 8, 11, 0, 0), datetime.datetime(1938, 8, 23, 0, 0), datetime.datetime(1939, 12, 15, 0, 0), datetime.datetime(1940, 3, 21, 0, 0), datetime.datetime(1941, 10, 28, 0, 0), datetime.datetime(1942, 6, 4, 0, 0), datetime.datetime(1942, 11, 26, 0, 0), datetime.datetime(1944, 5, 3, 0, 0), datetime.datetime(1945, 11, 29, 0, 0), datetime.datetime(1946, 11, 21, 0, 0), datetime.datetime(1947, 11, 11, 0, 0), datetime.datetime(1948, 5, 4, 0, 0), datetime.datetime(1949, 11, 8, 0, 0), datetime.datetime(1950, 10, 13, 0, 0), datetime.datetime(1951, 10, 4, 0, 0), datetime.datetime(1952, 1, 10, 0, 0), datetime.datetime(1953, 8, 5, 0, 0), datetime.datetime(1954, 7, 28, 0, 0), datetime.datetime(1955, 4, 11, 0, 0), datetime.datetime(1956, 10, 17, 0, 0), datetime.datetime(1957, 10, 2, 0, 0), datetime.datetime(1958, 5, 15, 0, 0), datetime.datetime(1959, 11, 18, 0, 0), datetime.datetime(1960, 6, 30, 0, 0), datetime.datetime(1961, 10, 18, 0, 0), datetime.datetime(1962, 12, 10, 0, 0), datetime.datetime(1963, 9, 29, 0, 0), datetime.datetime(1964, 10, 21, 0, 0), datetime.datetime(1965, 3, 2, 0, 0), datetime.datetime(1966, 12, 12, 0, 0), datetime.datetime(1967, 8, 2, 0, 0), datetime.datetime(1968, 9, 26, 0, 0), datetime.datetime(1969, 5, 25, 0, 0), datetime.datetime(1969, 12, 4, 0, 0), datetime.datetime(1971, 10, 7, 0, 0), datetime.datetime(1972, 3, 14, 0, 0), datetime.datetime(1973, 12, 25, 0, 0), datetime.datetime(1974, 12, 12, 0, 0), datetime.datetime(1975, 11, 19, 0, 0), datetime.datetime(1976, 11, 21, 0, 0), datetime.datetime(1977, 4, 20, 0, 0), datetime.datetime(1978, 12, 8, 0, 0), datetime.datetime(1979, 12, 19, 0, 0), 
datetime.datetime(1980, 9, 19, 0, 0), datetime.datetime(1981, 3, 30, 0, 0), datetime.datetime(1982, 11, 30, 0, 0), datetime.datetime(1983, 11, 23, 0, 0), datetime.datetime(1984, 9, 6, 0, 0), datetime.datetime(1985, 12, 18, 0, 0), datetime.datetime(1986, 12, 19, 0, 0), datetime.datetime(1987, 10, 4, 0, 0), datetime.datetime(1988, 12, 16, 0, 0), datetime.datetime(1989, 12, 15, 0, 0), datetime.datetime(1990, 10, 19, 0, 0), datetime.datetime(1991, 1, 30, 0, 0), datetime.datetime(1992, 8, 3, 0, 0), datetime.datetime(1993, 11, 30, 0, 0), datetime.datetime(1994, 6, 23, 0, 0), datetime.datetime(1995, 5, 18, 0, 0), datetime.datetime(1996, 11, 15, 0, 0), datetime.datetime(1997, 11, 1, 0, 0), datetime.datetime(1998, 12, 11, 0, 0), datetime.datetime(1999, 9, 8, 0, 0), datetime.datetime(2000, 5, 1, 0, 0), datetime.datetime(2001, 12, 13, 0, 0), datetime.datetime(2002, 12, 27, 0, 0), datetime.datetime(2003, 12, 1, 0, 0), datetime.datetime(2004, 12, 15, 0, 0), datetime.datetime(2004, 9, 10, 0, 0), datetime.datetime(2006, 9, 26, 0, 0), datetime.datetime(2007, 5, 19, 0, 0), datetime.datetime(2008, 8, 30, 0, 0), datetime.datetime(2008, 9, 4, 0, 0), datetime.datetime(2010, 9, 6, 0, 0), datetime.datetime(2011, 5, 15, 0, 0), datetime.datetime(2012, 8, 31, 0, 0), datetime.datetime(2013, 8, 30, 0, 0), datetime.datetime(2014, 8, 27, 0, 0), datetime.datetime(2015, 9, 3, 0, 0), datetime.datetime(2016, 9, 2, 0, 0), datetime.datetime(2017, 8, 31, 0, 0), datetime.datetime(2018, 9, 11, 0, 0), datetime.datetime(2019, 5, 21, 0, 0)]
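
As a quick check of the two date layouts handled above, here is a minimal, self-contained sketch. The helper name `parse_release_date` and the sample strings are illustrative assumptions, not values taken from the scraped data; the logic mirrors `clean_date` and `date_conversion`.

```python
from datetime import datetime

# Same two layouts that date_conversion tries, in the same order
FORMATS = ["%B %d, %Y", "%d %B %Y"]

def parse_release_date(text):
    """Hypothetical helper: strip a trailing '(...)' note, then try each format."""
    cleaned = text.split("(")[0].strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # layout did not match; try the next one
    return None

print(parse_release_date("December 19, 1979"))
print(parse_release_date("10 December 1962 (London)"))
print(parse_release_date("1962"))  # matches neither layout
```

A string that matches neither layout comes back as `None`, so any `None` in the printed list points to a release date that would need another format added.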

5 Save/load data in a binary file with pickle

import pickle

def save_data_pickle(name, data):
    with open(name, 'wb') as f:  # write a binary file
        pickle.dump(data, f)


def load_data_pickle(name):
    with open(name, 'rb') as f:  # read a binary file
        return pickle.load(f)

save_data_pickle("oscars_winners_data_cleaned.pickle", movie_info_list)
a = load_data_pickle("oscars_winners_data_cleaned.pickle")
a == movie_info_list
    True

6 Attach IMDB/Rotten Tomatoes/Metascore scores

movie_info_list = load_data_pickle('oscars_winners_data_cleaned.pickle')

movie_info_list[-41]
    {'title': 'Kramer vs. Kramer',
     'Directed by': 'Robert Benton',
     'Produced by': ['Richard Fischoff', 'Stanley R. Jaffe'],
     'Screenplay by': 'Robert Benton',
     'Based on': ['Kramer Versus Kramer', 'by', 'Avery Corman'],
     'Starring': ['Dustin Hoffman', 'Meryl Streep', 'Jane Alexander'],
     'Music by': ['Paul Gemignani',
      'Herb Harris',
      'John Kander',
      'Erma E. Levin',
      'Roy B. Yokelson',
      'Antonio Vivaldi'],
     'Cinematography': 'Néstor Almendros',
     'Edited by': 'Gerald B. Greenberg',
     'Distributed by': 'Columbia Pictures',
     'Release date': ['December 19, 1979'],
     'Running time': '105 minutes',
     'Country': 'United States',
     'Language': 'English',
     'Budget': '$8 million',
     'Box office': '$173 million',
     'Running time (min)': 105,
     'Budget ($)': 8000000.0,
     'Box office ($)': 173000000.0,
     'Release date (datetime)': datetime.datetime(1979, 12, 19, 0, 0)}

– Let’s extract film info from www.omdbapi.com:

import requests
import urllib.parse
import os

def get_omdb_info(title):
    base_url = "http://www.omdbapi.com/?"

    # Read the API key from the environment, e.g. set OMDB_API_KEY='######' in .bashrc
    parameters = {"apikey": os.environ['OMDB_API_KEY'], 't': title}
    # Alternatively, pass the key directly:
    # parameters = {"apikey": 'f8185070', 't': title}
    params_encoded = urllib.parse.urlencode(parameters)
    full_url = base_url + params_encoded
    return requests.get(full_url).json()

def get_rotten_tomato_score(omdb_info):
    ratings = omdb_info.get('Ratings', [])
    for rating in ratings:
        if rating['Source'] == 'Rotten Tomatoes':
            return rating['Value']
    return None

get_omdb_info("Kramer vs. Kramer")
    {'Title': 'Kramer vs. Kramer',
     'Year': '1979',
     'Rated': 'PG',
     'Released': '19 Dec 1979',
     'Runtime': '105 min',
     'Genre': 'Drama',
     'Director': 'Robert Benton',
     'Writer': 'Avery Corman (from the novel by), Robert Benton (written for the screen by)',
     'Actors': 'Dustin Hoffman, Meryl Streep, Jane Alexander, Justin Henry',
     'Plot': "Ted Kramer's wife leaves him, allowing for a lost bond to be rediscovered between Ted and his son, Billy. But a heated custody battle ensues over the divorced couple's son, deepening the wounds left by the separation.",
     'Language': 'English',
     'Country': 'USA',
     'Awards': 'Won 5 Oscars. Another 34 wins & 25 nominations.',
     'Poster': 'https://m.media-amazon.com/images/M/MV5BNDM3YjNlYmMtOGY3NS00MmRjLWIyY2UtNDA0MWM3OTNlZTY2XkEyXkFqcGdeQXVyMTQxNzMzNDI@._V1_SX300.jpg',
     'Ratings': [{'Source': 'Internet Movie Database', 'Value': '7.8/10'},
      {'Source': 'Rotten Tomatoes', 'Value': '88%'},
      {'Source': 'Metacritic', 'Value': '77/100'}],
     'Metascore': '77',
     'imdbRating': '7.8',
     'imdbVotes': '132,176',
     'imdbID': 'tt0079417',
     'Type': 'movie',
     'DVD': 'N/A',
     'BoxOffice': '$106,260,000',
     'Production': 'Columbia Pictures Corporation',
     'Website': 'N/A',
     'Response': 'True'}
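
To see what `get_omdb_info` actually requests, the URL construction can be isolated from the network call. This is a small sketch; `build_omdb_url` and the key value `"demo"` are placeholders introduced here for illustration.

```python
from urllib.parse import urlencode

def build_omdb_url(title, api_key):
    """Hypothetical helper: build the query URL without sending a request."""
    base_url = "http://www.omdbapi.com/?"
    # urlencode escapes spaces and other special characters in the title
    return base_url + urlencode({"apikey": api_key, "t": title})

print(build_omdb_url("Kramer vs. Kramer", "demo"))
# http://www.omdbapi.com/?apikey=demo&t=Kramer+vs.+Kramer
```

`requests.get` also accepts a `params` dictionary and performs this encoding itself, which would make the manual `urlencode` step optional.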

– Extract imdb, metascore and rotten_tomatoes:

for movie in movie_info_list:
    title = movie['title']
    omdb_info = get_omdb_info(title)
    movie['imdb'] = omdb_info.get('imdbRating', None)
    movie['metascore'] = omdb_info.get('Metascore', None)
    movie['rotten_tomatoes'] = get_rotten_tomato_score(omdb_info)


movie_info_list[-32]
    {'title': 'Rain Man',
     'Directed by': 'Barry Levinson',
     'Produced by': 'Mark Johnson',
     'Screenplay by': ['Barry Morrow', 'Ronald Bass'],
     'Story by': 'Barry Morrow',
     'Starring': ['Dustin Hoffman', 'Tom Cruise', 'Valeria Golino'],
     'Music by': 'Hans Zimmer',
     'Cinematography': 'John Seale',
     'Edited by': 'Stu Linder',
     'Production company': ['Guber-Peters Company', 'Star Partners II, Ltd.'],
     'Distributed by': 'United Artists',
     'Release date': ['December 16, 1988'],
     'Running time': '134 minutes',
     'Country': 'United States',
     'Language': 'English',
     'Budget': '$25 million',
     'Box office': '$354.8 million',
     'Running time (min)': 134,
     'Budget ($)': 25000000.0,
     'Box office ($)': 354800000.0,
     'Release date (datetime)': datetime.datetime(1988, 12, 16, 0, 0),
     'imdb': '8.0',
     'metascore': '65',
     'rotten_tomatoes': '89%'}

save_data_pickle('oscars_winners_data_final.pickle', movie_info_list)
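
Note that the three scores are stored as strings (`'8.0'`, `'65'`, `'89%'`), and OMDb uses `'N/A'` where it has no value. If numeric values are wanted later for sorting or plotting, a conversion along these lines could be applied; `score_to_float` is a hypothetical helper, not part of the original pipeline.

```python
def score_to_float(value):
    """Hypothetical helper: turn '8.0', '65' or '89%' into a float, else None."""
    if value is None or value == "N/A":
        return None
    try:
        # Drop a trailing percent sign before converting
        return float(str(value).rstrip("%"))
    except ValueError:
        return None

print(score_to_float("8.0"))   # 8.0
print(score_to_float("89%"))   # 89.0
print(score_to_float("N/A"))   # None
```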

7 Save data as .json and .csv

movie_info_copy = [movie.copy() for movie in movie_info_list]

– JSON:

for movie in movie_info_copy:
    current_date = movie['Release date (datetime)']
    if current_date:
        movie['Release date (datetime)'] = current_date.strftime("%B %d, %Y")
    else:
        movie['Release date (datetime)'] = None

save_data("oscars_winners_data_final.json", movie_info_copy)
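
The manual `strftime` step is needed because the `json` module cannot serialize `datetime` objects directly. As an alternative sketch, `json.dumps` accepts a `default` callable that is invoked for such values; the `encode_extra` function and the sample record below are illustrative assumptions.

```python
import json
from datetime import datetime

def encode_extra(obj):
    """Called by json only for objects it cannot serialize itself."""
    if isinstance(obj, datetime):
        return obj.strftime("%B %d, %Y")
    raise TypeError(f"Cannot serialize {type(obj).__name__}")

# Sample record with the same key used in movie_info_list
record = {"title": "Kramer vs. Kramer",
          "Release date (datetime)": datetime(1979, 12, 19)}

text = json.dumps(record, default=encode_extra)
print(text)
```

With this approach the original list would not need to be copied and modified before saving, at the cost of one extra function.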

– CSV:

import pandas as pd

df = pd.DataFrame(movie_info_list)
df.head()
    |   | title | Directed by | Produced by | Written by | Screenplay by | Story by | Starring | Music by | Cinematography | Edited by | ... | metascore | rotten_tomatoes | Color process | Based on | Production companies | Narrated by | Suggested by | Hangul | Revised Romanization | McCune–Reischauer |
    | 0 | Wings | William A. Wellman | [Lucien Hubbard, Adolph Zukor, Jesse L. Lasky,... | [Titles:, Julian Johnson] | [Hope Loring, Louis D. Lighton] | John Monk Saunders | [Clara Bow, Charles (Buddy) Rogers, Richard Ar... | J.S. Zamecnik | Harry Perry | [E. Lloyd Sheldon, Lucien Hubbard] | ... | N/A | 93% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
    | 1 | The Broadway Melody | Harry Beaumont | [Irving Thalberg, Lawrence Weingarten] | [Sarah Y. Mason, (continuity), Norman Houston,... | NaN | Edmund Goulding | [Charles King, Anita Page, Bessie Love, Jed Pr... | (see article ) | John Arnold | [Sam S. Zimbalist, Uncredited:, William LeVanw... | ... | N/A | 33% | Technicolor | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
    | 2 | All Quiet on the Western Front | Lewis Milestone | Carl Laemmle Jr. | [Maxwell Anderson, George Abbott, Del Andrews,... | NaN | NaN | [Lew Ayres, Louis Wolheim] | David Broekman | Arthur Edeson | [Edgar Adams, Milton Carruth, (silent version,... | ... | 91 | 98% | NaN | [All Quiet on the Western Front, by, Erich Mar... | NaN | NaN | NaN | NaN | NaN | NaN |
    | 3 | Cimarron | Wesley Ruggles | [William LeBaron, Louis Sarecky, (assoc.)] | NaN | [Howard Estabrook, Louis Sarecky] | NaN | [Richard Dix, Irene Dunne] | Max Steiner | Edward Cronjager | William Hamilton | ... | 70 | 50% | NaN | [Cimarron, 1929 novel, by, Edna Ferber] | NaN | NaN | NaN | NaN | NaN | NaN |
    | 4 | Grand Hotel | Edmund Goulding | Irving Thalberg | William A. Drake | NaN | NaN | [Greta Garbo, John Barrymore, Joan Crawford, W... | [William Axt, Charles Maxwell] | William H. Daniels | Blanche Sewell | ... | N/A | 86% | NaN | [Grand Hotel, (play) 1930, by William A. Drake... | NaN | NaN | NaN | NaN | NaN | NaN |

df.to_csv("oscars_winners_data_final.csv")
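
One caveat with the CSV export: columns such as `Starring` hold Python lists, and CSV stores every cell as text, so a list is written as its string representation. The sketch below uses only the standard `csv` module (not pandas) with a made-up two-column row to show the effect and one way to recover the list with `ast.literal_eval`.

```python
import ast
import csv
import io

# Sample row for illustration only
row = {"title": "Rain Man", "Starring": ["Dustin Hoffman", "Tom Cruise"]}

# Write one row to an in-memory CSV; the list is stored via str()
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "Starring"])
writer.writeheader()
writer.writerow(row)

# Read it back: every cell is now a plain string
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
print(type(loaded["Starring"]))

# literal_eval safely parses the string back into a list
starring = ast.literal_eval(loaded["Starring"])
print(starring)
```

The JSON file keeps lists as lists, so it is probably the more convenient format for reloading the data, while the CSV is handy for viewing it in a spreadsheet.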