
FastCampus Cashback Challenge Day 04 - Recommender System [Part2]

힘히힘 2022. 4. 21. 16:32

👉🏻 Course taken: 딥러닝을 활용한 추천시스템 구현 올인원 패키지 Online (an all-in-one package on implementing recommender systems with deep learning).
The daily mission runs for 66 days. Please, let me actually finish it this time 🙏🏻 Plz

 

Lecture I took today

[Part2] 03-06. What is TF-IDF --

 

What I learned today

 

:  Vector Representation

- The number of distinct words appearing across all review documents is n

- Assume there are m review documents

- Combining the two gives an m x n matrix (documents x words)

- Each entry is filled in using the word's frequency (count).
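
To make the m x n idea concrete, here is a minimal sketch with three made-up reviews (the data and variable names are illustrative assumptions, not from the lecture). Each row is one review and each column is one vocabulary word.

```python
from collections import Counter

# Toy reviews standing in for the m review documents (made-up data).
reviews = [
    "great movie great cast",
    "boring movie",
    "great soundtrack",
]

# Vocabulary: the n distinct words across all reviews, in sorted order.
vocab = sorted({word for review in reviews for word in review.split()})

# m x n matrix: row i holds the count of each vocabulary word in review i.
count_matrix = []
for review in reviews:
    counts = Counter(review.split())
    count_matrix.append([counts[word] for word in vocab])

print(vocab)
for row in count_matrix:
    print(row)
```

Here m = 3 and n = 5, and the first row has a 2 in the "great" column because that word appears twice in the first review.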

 

: TF-IDF
- Words that appear everywhere are treated as unimportant words

- A concept widely used in Information Retrieval
- Assigns a weight to each word; used for tasks such as keyword extraction
- Which particular words appear in a document can be used to express how related documents are to each other.


- TF : how many times a word appears in a document
- DF : the number of documents in which the word appears

- N : the total number of documents

If a word appears in every document, it is considered common and judged to carry no information.
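
Written out with the symbols above, one common form of the weighting looks like this. The base-10 logarithm is an assumption chosen to match the np.log10 calls in the practice code further down; other bases and smoothed variants are also in use.

```latex
\mathrm{idf}(t) = \log_{10}\frac{N}{\mathrm{DF}(t)},
\qquad
\text{tf-idf}(t, d) = \mathrm{TF}(t, d) \cdot \mathrm{idf}(t)
```

When a word t appears in every document, DF(t) = N, so idf(t) = log10(1) = 0 and its weight vanishes no matter how large TF is.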

 

- A word with a large DF carries little information.

- A word contained in every document of the corpus is a commonly occurring word and does not mean much.

- Such common words are also excluded from similarity calculations.

- These are mostly words with a grammatical role, such as particles and articles.

 

- TF checks how often the word appears in the given document

- IDF checks how distinctive the word is relative to the other documents
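
A small hand-rolled sketch of the two checks, assuming raw counts for TF and log10(N / DF) for IDF (the toy sentences and the `tf_idf` helper name are made up for illustration):

```python
import math
from collections import Counter

# Three tiny tokenized documents (made-up data).
docs = [
    "the movie was great".split(),
    "the movie was boring".split(),
    "the soundtrack was great".split(),
]
n_docs = len(docs)

# DF: number of documents that contain each word (each document counted once).
df = Counter(word for doc in docs for word in set(doc))

# IDF: log10(N / DF); it is 0 for a word that occurs in every document.
idf = {word: math.log10(n_docs / count) for word, count in df.items()}

def tf_idf(doc):
    """Return {word: TF * IDF} for one tokenized document, with TF as the raw count."""
    tf = Counter(doc)
    return {word: tf[word] * idf[word] for word in tf}

weights = tf_idf(docs[1])
print(weights)
```

In the second document, "the" and "was" get weight 0 because they occur in all three documents, while "boring", which occurs in only one, gets the largest weight.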

 

 

 

 

 

 

 


--
I went through the hands-on practice, and wow, it suddenly got a lot harder.
Maybe it's because my Python syntax is a bit weak?

kss : install the Korean sentence-splitting module
KoNLPy : morpheme-based tokenizing
Install mecab

- Create a function that preprocesses special characters
- Load the news data
- Use a `with ... as` object
- Extract only the nouns
- Split the text into sentences before using it
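
A rough sketch of the first three steps, using only the standard library. The names `clean_text` and `load_articles` and the regular expression are hypothetical choices, not the lecture's actual helpers. Sentence splitting with kss and noun extraction with KoNLPy's Mecab are left out because they need separate installs.

```python
import re

def clean_text(text):
    """Replace special characters with spaces, then collapse repeated whitespace."""
    # Keep word characters (letters, digits, underscore, Hangul) and whitespace only.
    no_special = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", no_special).strip()

def load_articles(path):
    """Read a UTF-8 text file with one article per line and clean each non-empty line."""
    # The `with ... as` block closes the file automatically, even if an error occurs.
    with open(path, encoding="utf-8") as f:
        return [clean_text(line) for line in f if line.strip()]

print(clean_text("Hello, world!! (test)"))
```

In Python 3, `\w` matches Unicode word characters by default, so Korean text survives the cleaning while punctuation is removed.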

- TF-IDF using scikit-learn
- TfidfVectorizer
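
A minimal sketch of TfidfVectorizer on made-up English sentences (the data is an assumption; the lecture used Korean news text). With its default settings, scikit-learn computes a smoothed natural-log IDF, ln((1 + N) / (1 + DF)) + 1, and L2-normalizes each row, so the numbers differ from a plain log10(N / DF).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents for illustration.
docs = [
    "the movie was great",
    "the movie was boring",
    "the soundtrack was great",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# Map each vocabulary word to its learned IDF value.
idf_by_word = {
    word: vectorizer.idf_[col] for word, col in vectorizer.vocabulary_.items()
}

print(tfidf.shape)
print(sorted(idf_by_word.items(), key=lambda item: item[1]))
```

Words that occur in every document ("the", "was") end up with the smallest IDF, and words that occur in only one document with the largest. The default tokenizer drops single-character tokens, which matters for Korean text where a pre-tokenized noun list is usually passed in instead.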

 

import os
import pandas as pd
import numpy as np
from tqdm import tqdm

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

path = '/content/drive/MyDrive/data/movielens'

ratings_df = pd.read_csv(os.path.join(path, 'ratings.csv'), encoding = 'utf-8')
movies_df = pd.read_csv(os.path.join(path, 'movies.csv'), index_col ='movieId', encoding= 'utf-8')
tags_df = pd.read_csv(os.path.join(path, 'tags.csv'), encoding='utf-8')

total_count = len(movies_df.index)  # N: total number of movies
# All distinct genres; each movie's 'genres' field is a '|'-separated string
total_genres = list(set(genre for genres in movies_df['genres'] for genre in genres.split('|')))

# key: genre, value: number of movies with that genre (DF)
genre_count = dict.fromkeys(total_genres, 0)

for each_genre_list in movies_df['genres']:
  for genre in each_genre_list.split('|'):
    genre_count[genre] += 1

genre_count


# Overwrite each count with its IDF, log10(N / DF), which serves as the genre weight.
# A genre that appears often, such as Drama, ends up with a lower weight.
for each_genre in genre_count:
  genre_count[each_genre] = np.log10(total_count/genre_count[each_genre])

genre_count



# Compute IDF for tag
total_movie_count = len(set(tags_df['movieId']))
# unique_tags is not defined in the cells shown here. The next line is an assumed
# reconstruction: every distinct tag after splitting on ',' and stripping whitespace.
unique_tags = set(tag.strip() for tags in tags_df['tag'] for tag in tags.split(','))
# key: tag, value: number of movies with such tag
tag_count_dict = dict.fromkeys(unique_tags, 0)

for each_movie_tag_list in tags_df['tag']:
  for tag in each_movie_tag_list.split(","):
    tag_count_dict[tag.strip()] += 1

tag_idf = dict()
for each_tag in tag_count_dict:
  # Dividing by tag_count_dict[each_tag] gives frequently used tags a lower weight.
  tag_idf[each_tag] = np.log10(total_movie_count / tag_count_dict[each_tag])

tag_idf
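
One thing to watch in the tag block above: its comment describes the value as the number of movies with a tag, but the loop adds one per row of tags.csv, so a tag given to the same movie by several users is counted more than once. Below is a minimal sketch of counting distinct movies per tag instead, on made-up rows (the data and the names `tag_rows`, `movies_by_tag` and `tag_idf_toy` are illustrative assumptions).

```python
import math
from collections import defaultdict

# Made-up (movieId, tag) rows; movie 1 gets "funny" from two different users.
tag_rows = [
    (1, "funny"),
    (1, "funny"),
    (1, "pixar"),
    (2, "funny"),
    (3, "dark, twist ending"),
]

# tag -> set of movieIds, so each movie is counted at most once per tag.
movies_by_tag = defaultdict(set)
for movie_id, raw_tags in tag_rows:
    for tag in raw_tags.split(","):
        movies_by_tag[tag.strip()].add(movie_id)

total_movies = len({movie_id for movie_id, _ in tag_rows})

# IDF per tag: log10(number of tagged movies / number of movies carrying the tag).
tag_idf_toy = {
    tag: math.log10(total_movies / len(movie_ids))
    for tag, movie_ids in movies_by_tag.items()
}
print(tag_idf_toy)
```

Because DF can never exceed the number of tagged movies this way, the IDF stays non-negative. With row counting, a very popular tag could push the ratio below 1 and the logarithm below 0.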

 

 

 

 

 

Resolution for tomorrow

Tomorrow looks like it will be a bit hard. T_T Let's listen to just one lecture and write it up properly.





https://bit.ly/3L3avNW

This post was written to take part in the FastCampus Refund Challenge.