S3 datasets

IMDb distributes part of its data as non-commercial downloadable datasets. Cinemagoer can import this data into a local database and then make it accessible through its standard API.

For this, you will first need to install SQLAlchemy and a database driver for the database engine you choose.

Cinemagoer supports all databases supported by SQLAlchemy. In this documentation we use SQLite in examples because it is the simplest setup.

Then, follow these steps:

1. Download the files from https://datasets.imdbws.com/ and put all of them in the same directory.

   Alternatively, download all IMDb dataset files with the installed download-from-s3 command:
   ```
   download-from-s3
   ```
   This creates a timestamped directory, named like imdb-dataset-YYYY-MM-DD, containing the *.tsv.gz files.
2. Create or choose a database.
3. Import the data using the s32cinemagoer.py script:
```
s32cinemagoer.py /path/to/the/tsv.gz/files/ URI
```
URI is the SQLAlchemy connection string used to access the database.

For SQLite, use a URI like:
```
s32cinemagoer.py ~/Downloads/imdb-datasets/ sqlite:///cinemagoer.db
```
You can also use other SQLAlchemy-supported database URIs.
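
The SQLite form of the URI depends on whether the database path is relative or absolute: SQLAlchemy expects three slashes before a relative path and four before an absolute POSIX path. The helper below is a minimal sketch of that rule. `sqlite_uri` is a hypothetical name, not part of Cinemagoer, and the PostgreSQL string in the comment uses placeholder credentials.

```python
from pathlib import Path


def sqlite_uri(db_path):
    """Build a SQLAlchemy-style SQLite URI for db_path (hypothetical helper)."""
    path = Path(db_path).expanduser()
    # "sqlite:///" followed by the path: a relative path gives three slashes,
    # an absolute POSIX path (which itself starts with "/") gives four.
    return "sqlite:///" + path.as_posix()


# Other engines follow the general SQLAlchemy pattern, for example
# "postgresql://user:password@localhost/cinemagoer" (placeholder values).
print(sqlite_uri("cinemagoer.db"))
```

The resulting string can be passed as the URI argument of s32cinemagoer.py and, later, as the `uri` parameter when creating the Cinemagoer instance.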

For faster local iteration while developing, you can create reduced dataset files with s3-reduce:
```
s3-reduce /path/to/the/tsv.gz/files/
```
This writes smaller files under partials/ in that directory.
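
To check what a full or reduced file contains before importing it, the standard library is enough. The sketch below assumes the format IMDb documents for these datasets: gzip-compressed, UTF-8, tab-separated, with a header row and `\N` marking missing values. It also assumes fields are not quoted. `preview_tsv_gz` is an illustrative name, not a Cinemagoer function.

```python
import csv
import gzip
from itertools import islice


def preview_tsv_gz(path, limit=5):
    """Return up to `limit` rows of a gzipped TSV file as dictionaries."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps any double quotes inside titles as plain data.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        rows = []
        for row in islice(reader, limit):
            # Map the "\N" placeholder to None so missing values are explicit.
            rows.append({key: (None if value == r"\N" else value)
                         for key, value in row.items()})
        return rows
```

Calling `preview_tsv_gz` on one of the downloaded *.tsv.gz files returns its first rows keyed by the column names from the header line.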

Once the import is finished, which should take an hour or less on a modern system, you will have a database with all the information and can use the normal Cinemagoer API:

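```python
from imdb import Cinemagoer

# Use the 's3' access system with the database created by the import.
ia = Cinemagoer('s3', uri='sqlite:///cinemagoer.db')

# Search by title and list every match with its ID.
results = ia.search_movie('the matrix')
for result in results:
    print(result.movieID, result)

# Load the remaining information for the first match and list its keys.
matrix = results[0]
ia.update(matrix)
print(matrix.keys())
```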
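
If a search finds nothing, indexing `results[0]` would raise an IndexError. A minimal sketch of a guarded lookup follows, using only the `search_movie` and `update` calls shown above. `first_match` is a hypothetical helper, and `ia` is assumed to be a Cinemagoer instance created with the 's3' access system.

```python
def first_match(ia, title):
    """Return the first search result for `title`, updated, or None."""
    results = ia.search_movie(title)
    if not results:
        # No match: let the caller decide what to do instead of raising.
        return None
    movie = results[0]
    ia.update(movie)  # load the remaining information for this movie
    return movie
```

With the instance from the example, `first_match(ia, 'the matrix')` would return the same object as `matrix`, while a title with no matches yields `None`.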

Note: Running the script again will drop the current tables and import the data again.
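
Because a re-run replaces the tables, it can be useful to check what the database holds after an import. The sketch below assumes the SQLite database from the examples and uses only the standard `sqlite3` module. It makes no assumption about the table names Cinemagoer creates, and `table_row_counts` is an illustrative name.

```python
import sqlite3


def table_row_counts(db_file):
    """Return {table_name: row_count} for every table in a SQLite file."""
    # Note: sqlite3.connect creates an empty file if db_file does not exist.
    connection = sqlite3.connect(db_file)
    try:
        names = [row[0] for row in connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        counts = {}
        for name in names:
            # Names come from sqlite_master; quote them as identifiers.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = connection.execute(
                f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
        return counts
    finally:
        connection.close()
```

For example, `table_row_counts('cinemagoer.db')` lists each imported table with its number of rows. Counting rows in the largest tables may take a while.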

Note: If tqdm is installed, running the script with the --verbose argument shows a progress bar while the database is populated.