Searching Movies Dataset with CloudSearch

Takahiro Iwasa

CloudSearch, built on Apache Solr, offers full-text search. In this post, we will explore how to search a movies dataset using CloudSearch.

CloudSearch Domain

Creating CloudSearch Domain

Create a CloudSearch domain with the following command.

aws cloudsearch create-domain --domain-name searching-movies-data
{
    "DomainStatus": {
        "DomainId": "123456789012/searching-movies-data",
        "DomainName": "searching-movies-data",
        "ARN": "arn:aws:cloudsearch:ap-northeast-1:123456789012:domain/searching-movies-data",
        "Created": true,
        "Deleted": false,
        "DocService": {},
        "SearchService": {},
        "RequiresIndexDocuments": false,
        "Processing": false,
        "SearchPartitionCount": 0,
        "SearchInstanceCount": 0
    }
}

A new domain is not usable right away. According to the official documentation, "It takes about ten minutes to create endpoints for a new domain."

Run the following command and confirm that the domain status shows Processing: false and that the DocService and SearchService endpoints are populated, which means the domain is ready.

aws cloudsearch describe-domains --domain-names searching-movies-data
{
    "DomainStatusList": [
        {
            "DomainId": "123456789012/searching-movies-data",
            "DomainName": "searching-movies-data",
            "ARN": "arn:aws:cloudsearch:ap-northeast-1:123456789012:domain/searching-movies-data",
            "Created": true,
            "Deleted": false,
            "DocService": {
                "Endpoint": "doc-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com"
            },
            "SearchService": {
                "Endpoint": "search-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com"
            },
            "RequiresIndexDocuments": false,
            "Processing": false,
            "SearchInstanceType": "search.small",
            "SearchPartitionCount": 1,
            "SearchInstanceCount": 1,
            "Limits": {
                "MaximumReplicationCount": 5,
                "MaximumPartitionCount": 10
            }
        }
    ]
}
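Instead of rerunning describe-domains by hand, the check can be scripted. The following is a minimal sketch, not an official tool: wait_for_domain, its default interval and its retry limit are hypothetical names and values chosen for illustration. Because the create-domain output above already reports Processing: false while the endpoints are still empty, the sketch also waits until a search endpoint is present.

```shell
# wait_for_domain is a hypothetical helper, not an AWS CLI command.
# It polls describe-domains until Processing is false and a search endpoint exists.
wait_for_domain() {
  domain="$1"
  interval="${2:-30}"   # seconds between checks (assumed default)
  attempts="${3:-40}"   # number of checks before giving up (assumed default)
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # --query extracts a single value; with --output json a boolean prints as true/false
    processing=$(aws cloudsearch describe-domains --domain-names "$domain" \
      --query 'DomainStatusList[0].Processing' --output json)
    # A missing endpoint is printed as null
    endpoint=$(aws cloudsearch describe-domains --domain-names "$domain" \
      --query 'DomainStatusList[0].SearchService.Endpoint' --output json)
    if [ "$processing" = "false" ] && [ -n "$endpoint" ] && [ "$endpoint" != "null" ]; then
      echo "Domain $domain is ready."
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "Timed out waiting for $domain." >&2
  return 1
}
```

Calling wait_for_domain searching-movies-data would then block until the domain is usable or the retry limit is reached.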

Updating Access Policy

For security reasons, update the domain access policy to allow access only from your IP address. Specify your IP address in CIDR notation in aws:SourceIp.

aws cloudsearch update-service-access-policies \
  --domain-name searching-movies-data \
  --access-policies '{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": ["cloudsearch:*"],
      "Condition": {"IpAddress": {"aws:SourceIp": "xxx.xxx.xxx.xxx/32"}}
    }
  ]
}'
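If you do not want to paste your IP address by hand, the policy can be generated. This is a sketch under assumptions: build_access_policy and restrict_to_current_ip are hypothetical helper names, and it assumes that https://checkip.amazonaws.com returns the same public IPv4 address that CloudSearch will see for your requests, which may not hold behind a proxy or VPN.

```shell
# build_access_policy is a hypothetical helper: it prints the same policy as above
# with the given IPv4 address substituted as a /32 CIDR block.
build_access_policy() {
  ip="$1"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": ["cloudsearch:*"],
      "Condition": {"IpAddress": {"aws:SourceIp": "${ip}/32"}}
    }
  ]
}
EOF
}

# restrict_to_current_ip is also hypothetical: it looks up the caller's public IP
# and applies the generated policy to the given domain.
restrict_to_current_ip() {
  domain="$1"
  my_ip=$(curl --silent https://checkip.amazonaws.com)
  aws cloudsearch update-service-access-policies \
    --domain-name "$domain" \
    --access-policies "$(build_access_policy "$my_ip")"
}
```

For this post's domain the call would be restrict_to_current_ip searching-movies-data.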

Defining Index Fields

Define the index fields with the following commands. This post uses The Movies Dataset from Kaggle (CC0: Public Domain), whose fields are indexed with the types listed below.

Field                  Type
adult                  text
belongs_to_collection  text
budget                 double
genres                 text
homepage               text
id                     int
imdb_id                text
original_language      text
original_title         text
overview               text
popularity             double
poster_path            text
production_companies   text
production_countries   text
release_date           text
revenue                int
runtime                double
spoken_languages       text
status                 text
tagline                text
title                  text
video                  text
vote_average           double
vote_count             int

aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name adult --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name belongs_to_collection --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name budget --type double
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name genres --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name homepage --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name id --type int
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name imdb_id --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name original_language --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name original_title --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name overview --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name popularity --type double
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name poster_path --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name production_companies --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name production_countries --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name release_date --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name revenue --type int
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name runtime --type double
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name spoken_languages --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name status --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name tagline --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name title --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name video --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name vote_average --type double
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name vote_count --type int

After defining the fields, run index-documents so that the new indexing options take effect.

aws cloudsearch index-documents --domain-name searching-movies-data
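The 24 define-index-field calls differ only in the field name and type, so they can be driven from a list. The following sketch assumes a hypothetical helper named define_movie_fields; the field list is the one from the table above.

```shell
# define_movie_fields is a hypothetical helper that defines every field in the
# list below and then rebuilds the index once at the end.
define_movie_fields() {
  domain="$1"
  while read -r name type; do
    [ -n "$name" ] || continue
    # </dev/null keeps the command from consuming the field list on stdin
    aws cloudsearch define-index-field \
      --domain-name "$domain" --name "$name" --type "$type" </dev/null || return 1
  done <<'EOF'
adult text
belongs_to_collection text
budget double
genres text
homepage text
id int
imdb_id text
original_language text
original_title text
overview text
popularity double
poster_path text
production_companies text
production_countries text
release_date text
revenue int
runtime double
spoken_languages text
status text
tagline text
title text
video text
vote_average double
vote_count int
EOF
  aws cloudsearch index-documents --domain-name "$domain"
}
```

Running define_movie_fields searching-movies-data issues the same commands as above, one per line of the list.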

Indexing Movie Dataset

Download the movies dataset from The Movies Dataset on Kaggle. The AWS CLI command aws cloudsearchdomain upload-documents only accepts document batches in JSON or XML. The CloudSearch console, however, accepts CSV files, so you can index the movies dataset without converting it. This post uses only the first 1,000 lines of the file, extracted with the following command.

head -1000 movies_metadata.csv > sample.csv

In the CloudSearch console, select Upload documents in the Actions menu.

Select the CSV.

Review and upload the CSV.

The console then shows the number of documents added. Because the CSV includes two rows as the header, a count of 998 is correct.
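For comparison, uploading through the AWS CLI requires a document batch in JSON (or XML). The sketch below only illustrates the batch shape and the upload call; it is not a CSV converter. The helper names write_sample_batch and upload_batch are hypothetical, and the document ID and field values are made up for illustration rather than taken from the dataset.

```shell
# write_sample_batch is a hypothetical helper that writes a one-document JSON batch.
# The ID and field values are invented; only the structure matters here.
write_sample_batch() {
  cat > "$1" <<'EOF'
[
  {
    "type": "add",
    "id": "example_1",
    "fields": {
      "title": "An Example Movie",
      "overview": "A made-up overview used only to illustrate the batch format.",
      "vote_average": 7.5,
      "vote_count": 10
    }
  }
]
EOF
}

# upload_batch is a hypothetical wrapper: it sends a batch file to the domain's
# document endpoint (the DocService endpoint shown by describe-domains).
upload_batch() {
  doc_endpoint="$1"
  batch_file="$2"
  aws cloudsearchdomain upload-documents \
    --endpoint-url "https://${doc_endpoint}" \
    --content-type application/json \
    --documents "$batch_file"
}
```

A call would look like write_sample_batch batch.json followed by upload_batch doc-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com batch.json.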

Searching Movies Dataset

Searching for Text

Run the following command to search the title and overview fields for the keyword house. Matching is case-insensitive.

curl --location -g --request GET 'https://search-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com/2013-01-01/search?q=house&q.options={fields:["title","overview"]}&return=title,overview' | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  4325    0  4325    0     0  48310      0 --:--:-- --:--:-- --:--:-- 49712

You should see the following response.

{
  "status": {
    "rid": "8fDJv8swsgEK1DyD",
    "time-ms": 1
  },
  "hits": {
    "found": 26,
    "start": 0,
    "hit": [
      {
        "id": "local_file_466",
        "fields": {
          "overview": "Hip Hop duo Kid & Play return in the second follow-up to their 1990 screen debut House Party. Kid (Christopher \"Kid\" Reid) is taking the plunge and marrying his girlfriend Veda (Angela Means), while his friend Play (Christopher Martin) is dipping his toes into the music business, managing a roughneck female rap act called Sex as a Weapon. Play books the ladies for a concert with heavy-hitting pr",
          "title": "House Party 3"
        }
      },
      {
        "id": "local_file_271",
        "fields": {
          "overview": "Ben Archer is not happy. His mother, Sandy, has just met a man, and it looks like things are pretty serious. Driven by a fear of abandonment, Ben tries anything and everything to ruin the \"love bubble\" which surrounds his mom. However, after Ben and Jack's experiences in the Indian Guides, the two become much closer.",
          "title": "Man of the House"
        }
      },
      {
        "id": "local_file_465",
        "fields": {
          "overview": "A rancher, his clairvoyant wife and their family face turbulent years in South America.",
          "title": "The House of the Spirits"
        }
      },
      {
        "id": "local_file_817",
        "fields": {
          "overview": "High-schooler Grover Beindorf and his younger sister Stacy decide that their parents, Janet and Ned, are acting childishly when they decide to divorce after 18 years of marriage, so they lock them up in the basement until they'll sort out their problems. Their school friends also decide to do the same with their parents to solve their respective problems",
          "title": "House Arrest"
        }
      },
      {
        "id": "local_file_932",
        "fields": {
          "overview": "After one member of their group is murdered, the performers at a burlesque house must work together to find out who the killer is before they strike again.",
          "title": "Lady of Burlesque"
        }
      },
      {
        "id": "local_file_969",
        "fields": {
          "overview": "A middle-aged couple has a drifter enter their lives. The fish-store owners find that the mysterious young man awakens the couple in ways they didn't expect. Things get tense when the drifter begins an affair with the woman of the house.",
          "title": "Caught"
        }
      },
      {
        "id": "local_file_532",
        "fields": {
          "overview": "When Michael McCann is thrown over by the woman he loves, he becomes something of a misanthrope and a miser, spending all of his spare money on collectible gold coins. Living in the same town is an affluent family with two sons: John and Tanny. Tanny's a wild boy, whom John cannot control, and one night he breaks into McCann's house, and steals the gold and disappears, which nearly confirms McCann's distrust of mankind. But then, a mysterious young woman dies in the snow outside McCann's house, and her small daughter makes her way to McCann's house and into McCann's life and heart. He names her Matilda, and raises her, finding companionship and a new joy in life with his adopted daughter. But the secret of Matilda's birth may tear them apart.",
          "title": "A Simple Twist of Fate"
        }
      },
      {
        "id": "local_file_880",
        "fields": {
          "overview": "In the late 19th century, Paula Alquist is studying music in Italy, but ends up abandoning her classes because she's fallen in love with the gallant Gregory Anton. The couple marries and moves to England to live in a home inherited by Paula from her aunt, herself a famous singer, who was mysteriously murdered in the house ten years before. Though Paula is certain that she sees the house's gaslights dim every evening and that there are strange noises coming from the attic, Gregory convinces Paula that she's imagining things. Meanwhile, a Scotland Yard inspector, Brian Cameron, becomes sympathetic to Paula's plight.",
          "title": "Gaslight"
        }
      },
      {
        "id": "local_file_103",
        "fields": {
          "overview": "Failed hockey player-turned-golf whiz Happy Gilmore -- whose unconventional approach and antics on the grass courts the ire of rival Shooter McGavin -- is determined to win a PGA tournament so he can save his granny's house with the prize money. Meanwhile, an attractive tour publicist tries to soften Happy's image.",
          "title": "Happy Gilmore"
        }
      },
      {
        "id": "local_file_289",
        "fields": {
          "overview": "A deadly airborne virus finds its way into the USA and starts killing off people at an epidemic rate. Col Sam Daniels' job is to stop the virus spreading from a small town, which must be quarantined, and to prevent an over reaction by the White House.",
          "title": "Outbreak"
        }
      }
    ]
  }
}
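The command above relies on curl -g to pass the braces and brackets in q.options through unencoded. An alternative is to let curl URL-encode each parameter, which also makes keywords with spaces safe. This is a sketch only; search_text is a hypothetical helper name, and the field list is the one used in this post.

```shell
# search_text is a hypothetical helper: it searches the title and overview fields
# for a keyword and lets curl percent-encode every query parameter.
search_text() {
  endpoint="$1"
  keyword="$2"
  # --get turns the --data-urlencode pairs into a query string on a GET request
  curl --silent --get "https://${endpoint}/2013-01-01/search" \
    --data-urlencode "q=${keyword}" \
    --data-urlencode 'q.options={fields:["title","overview"]}' \
    --data-urlencode 'return=title,overview'
}
```

Piping the result through jq -r '.hits.hit[].fields.title' would print only the matching titles, for example search_text search-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com house | jq -r '.hits.hit[].fields.title'.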

Searching for Numbers

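Run the following command to search for documents whose vote_average field is exactly 5.0. The structured query parser (q.parser=structured) is required to match against a specific field.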

curl --location --request GET 'https://search-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com/2013-01-01/search?q.parser=structured&q=vote_average:5.0&return=title,overview' | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  4246    0  4246    0     0  54304      0 --:--:-- --:--:-- --:--:-- 55868

You should see the following response.

{
  "status": {
    "rid": "w+Xgv8swwQEK1DyD",
    "time-ms": 0
  },
  "hits": {
    "found": 35,
    "start": 0,
    "hit": [
      {
        "id": "local_file_144",
        "fields": {
          "overview": "Far from home in the lush bamboo forests of China, ten-year-old Ryan Tyler, with the help of a young girl, goes on a wonderful journey to rescue a baby panda taken by poachers.",
          "title": "The Amazing Panda Adventure"
        }
      },
      {
        "id": "local_file_158",
        "fields": {
          "overview": "Eight people embark on an expedition into the Congo, a mysterious expanse of unexplored Africa where human greed and the laws of nature have gone berserk. When the thrill-seekers -- some with ulterior motives -- stumble across a race of killer apes.",
          "title": "Congo"
        }
      },
      {
        "id": "local_file_237",
        "fields": {
          "overview": "Hatch Harrison, his wife, Lindsey, and their daughter, Regina, are enjoying a pleasant drive when a car crash leaves wife and daughter unharmed but kills Hatch. However, an ingenious doctor, Jonas Nyebern, manages to revive Hatch after two lifeless hours. But Hatch does not come back unchanged. He begins to suffer horrible visions of murder -- only to find out the visions are the sights of a serial killer.",
          "title": "Hideaway"
        }
      },
      {
        "id": "local_file_241",
        "fields": {
          "overview": "The band is back together! Gumby reunites with The Clayboys to perform at a concert benefiting local farmers. But things take an unexpected turn when Gumby s dog, Lowbelly, reacts to the music by crying tears of real pearl! Fortune turns into disaster as Gumby s archenemies, the Blockheads, devise an elaborate scheme to dognap Lowbelly and harvest her pearls for themselves. When the Blockheads initial plan fails, they kidnap The Clayboys as well...and replace them with clones! The battle between Clayboys and clones is filled with trains and planes, knights and fights, thrills and spills. True to classic Gumby adventures, Gumby: The Movie takes viewers in and out of books, to Toyland, Camelot, outer space and beyond!",
          "title": "Gumby: The Movie"
        }
      },
      {
        "id": "local_file_309",
        "fields": {
          "overview": "Stuart Smalley, the Saturday Night Live character, comes to the big screen. Stuart, the disciple of the 12 step program, is challenged by lifes injustices.",
          "title": "Stuart Saves His Family"
        }
      },
      {
        "id": "local_file_351",
        "fields": {
          "overview": "Modern Stone Age family the Flintstones hit the big screen in this live-action version of the classic cartoon. Fred helps Barney adopt a child. Barney sees an opportunity to repay him when Slate Mining tests its employees to find a new executive. But no good deed goes unpunished.",
          "title": "The Flintstones"
        }
      },
      {
        "id": "local_file_399",
        "fields": {
          "overview": "Greed and playing into the hand of providence provides the focus of this Mexican comedy adapted from a novel by Jorge Ibarguengoitia. Marcos, an architect, has just returned to the home of his wealthy uncle Ramon after squandering his money in Mexico City and subsequently finding himself falsely accused of a crime. Although he is flat-broke, he conceals this from Ramon, telling him that he has returned home to buy a local gold mine. Marcos finds the lies come easily as begins trying to induce his uncle to fund his endeavor. Irascible Ramon, who likes Marcos for his similar love of drinking and smoking is duped, but Ramon's sons are not fooled by Marcos. To them he is a threat, and they fear he will be placed in the will. Soon all of them are trying to out-manipulate each other. Even Ramon, who is not as innocent as he appears is involved in the mayhem.",
          "title": "Dos Crímenes"
        }
      },
      {
        "id": "local_file_402",
        "fields": {
          "overview": "In Providence's Italian neighborhood, Federal Hill, five young are immersed in drugs, crime and violence. Everything changes when one of the guys in the band know love.",
          "title": "Federal Hill"
        }
      },
      {
        "id": "local_file_414",
        "fields": {
          "overview": "One man must learn the meaning of courage across four lifetimes centuries apart.",
          "title": "Being Human"
        }
      },
      {
        "id": "local_file_415",
        "fields": {
          "overview": "Jed Clampett and kin move from Arkansas to Beverly Hills when he becomes a billionaire, after an oil strike. The country folk are very naive with regard to life in the big city, so when Jed starts a search for a new wife there are inevitably plenty of takers and con artists ready to make a fast buck",
          "title": "The Beverly Hillbillies"
        }
      }
    ]
  }
}
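The response reports 35 matches ("found": 35) but contains only 10 hits, because the search API returns 10 hits per request by default. The size and start request parameters control the page size and offset. The sketch below shows one way to walk through all pages; page_starts and fetch_all_pages are hypothetical helper names, and the total is passed in by the caller rather than read from the first response.

```shell
# page_starts is a hypothetical helper: it prints the start offset of each page
# needed to cover `found` hits with `size` hits per page.
page_starts() {
  found="$1"
  size="$2"
  start=0
  while [ "$start" -lt "$found" ]; do
    echo "$start"
    start=$((start + size))
  done
}

# fetch_all_pages is also hypothetical: it runs one structured query per page.
fetch_all_pages() {
  endpoint="$1"
  query="$2"
  total="$3"
  page_size="$4"
  for offset in $(page_starts "$total" "$page_size"); do
    curl --silent --get "https://${endpoint}/2013-01-01/search" \
      --data-urlencode 'q.parser=structured' \
      --data-urlencode "q=${query}" \
      --data-urlencode 'return=title' \
      --data-urlencode "size=${page_size}" \
      --data-urlencode "start=${offset}"
  done
}
```

For the query above, fetch_all_pages with the search endpoint, the query vote_average:5.0, a total of 35 and a page size of 10 would issue four requests, starting at offsets 0, 10, 20 and 30.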

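Searching for a Range

Run the following command to search for documents whose vote_average field is 7.0 or greater. The square bracket makes the lower bound inclusive, and the omitted upper bound leaves the range open-ended.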

curl --location -g --request GET 'https://search-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com/2013-01-01/search?q.parser=structured&q=vote_average:[7.0,}&return=title,overview' | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2975    0  2975    0     0  38172      0 --:--:-- --:--:-- --:--:-- 39666

You should see the following response.

{
  "status": {
    "rid": "vJ/vv8swyAEK1DyD",
    "time-ms": 2
  },
  "hits": {
    "found": 254,
    "start": 0,
    "hit": [
      {
        "id": "local_file_1",
        "fields": {
          "overview": "Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.",
          "title": "Toy Story"
        }
      },
      {
        "id": "local_file_6",
        "fields": {
          "overview": "Obsessive master thief, Neil McCauley leads a top-notch crew on various insane heists throughout Los Angeles while a mentally unstable detective, Vincent Hanna pursues him without rest. Each man recognizes and respects the ability and the dedication of the other even though they are aware their cat-and-mouse game may end in violence.",
          "title": "Heat"
        }
      },
      {
        "id": "local_file_13",
        "fields": {
          "overview": "An outcast half-wolf risks his life to prevent a deadly epidemic from ravaging Nome, Alaska.",
          "title": "Balto"
        }
      },
      {
        "id": "local_file_14",
        "fields": {
          "overview": "An all-star cast powers this epic look at American President Richard M. Nixon, a man carrying the fate of the world on his shoulders while battling the self-destructive demands within. Spanning his troubled boyhood in California to the shocking Watergate scandal that would end his presidency.",
          "title": "Nixon"
        }
      },
      {
        "id": "local_file_16",
        "fields": {
          "overview": "The life of the gambling paradise – Las Vegas – and its dark mafia underbelly.",
          "title": "Casino"
        }
      },
      {
        "id": "local_file_17",
        "fields": {
          "overview": "Rich Mr. Dashwood dies, leaving his second wife and her daughters poor by the rules of inheritance. Two daughters are the titular opposites.",
          "title": "Sense and Sensibility"
        }
      },
      {
        "id": "local_file_25",
        "fields": {
          "overview": "Ben Sanderson, an alcoholic Hollywood screenwriter who lost everything because of his drinking, arrives in Las Vegas to drink himself to death. There, he meets and forms an uneasy friendship and non-interference pact with prostitute Sera.",
          "title": "Leaving Las Vegas"
        }
      },
      {
        "id": "local_file_26",
        "fields": {
          "overview": "The evil Iago pretends to be friend of Othello in order to manipulate him to serve his own end in the film version of this Shakespeare classic.",
          "title": "Othello"
        }
      },
      {
        "id": "local_file_28",
        "fields": {
          "overview": "This film adaptation of Jane Austen's last novel follows Anne Elliot, the daughter of a financially troubled aristocratic family, who is persuaded to break her engagement to Frederick Wentworth, a young sea captain of meager means. Years later, money troubles force Anne's father to rent out the family estate to Admiral Croft, and Anne is again thrown into company with Frederick -- who is now rich, successful and perhaps still in love with Anne.",
          "title": "Persuasion"
        }
      },
      {
        "id": "local_file_29",
        "fields": {
          "overview": "A scientist in a surrealist society kidnaps children to steal their dreams, hoping that they slow his aging process.",
          "title": "The City of Lost Children"
        }
      }
    ]
  }
}
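In the range syntax, a square bracket includes the bound, a curly brace excludes it, and an omitted bound must be written with a curly brace. The following sketch builds such expressions; range_query and its argument order are hypothetical and exist only to illustrate the notation.

```shell
# range_query is a hypothetical helper that prints a structured range expression.
# Arguments: field, lower bound, upper bound, mode (inclusive or exclusive).
# Pass an empty string for a bound to leave that side open-ended.
range_query() {
  field="$1"
  lower="$2"
  upper="$3"
  mode="${4:-inclusive}"
  if [ "$mode" = "inclusive" ]; then
    open='['
    close=']'
  else
    open='{'
    close='}'
  fi
  # An omitted bound is always written with a curly brace
  [ -n "$lower" ] || open='{'
  [ -n "$upper" ] || close='}'
  printf '%s:%s%s,%s%s\n' "$field" "$open" "$lower" "$upper" "$close"
}
```

For example, range_query vote_average 7.0 "" reproduces the expression used above, range_query vote_average 5.0 "" exclusive yields a strict "greater than 5.0" range, and range_query vote_average 5.0 7.0 yields a closed range.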

Cleaning Up

To avoid further charges, delete the CloudSearch domain with the following command.

aws cloudsearch delete-domain --domain-name searching-movies-data

Takahiro Iwasa

Software Developer at KAKEHASHI Inc.
Involved in the requirements definition, design, and development of cloud-native applications using AWS. Now, building a new prescription data collection platform at KAKEHASHI Inc. Japan AWS Top Engineers 2020-2023.