CountVectorizer, the library that turns string data into numbers, and its analyzer parameter / Finish several steps in a single run with a pipeline / fit with CountVectorizer

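# Once meaningless punctuation and meaningless words (stopwords) have been removed,

# the remaining words have to be converted into numbers.

# The tool that converts words into numbers is called a VECTORIZER.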

Once the strings are ready to be converted into numbers, import CountVectorizer from scikit-learn.

sample_data = ['This is the first document', 'I loved them', 'This document is the second document', 
'I am loving you' , 'And this is the third one']

from sklearn.feature_extraction.text import CountVectorizer   # import CountVectorizer

vec = CountVectorizer()

X = vec.fit_transform(sample_data)

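The result is a document-term matrix. Each sentence is split into words, each row is one sentence, each column is one word of the learned vocabulary, and each cell holds how many times that word occurs in that sentence. The value is a count, not just 1: 'document' appears twice in the third sentence, so that cell is 2. With the default settings the text is also lowercased and one-character tokens such as 'I' are dropped.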
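To make the shape of that matrix concrete, here is a minimal sketch that inspects the fitted vectorizer from the example above. It only uses the `vocabulary_` attribute and `toarray()`; the variable names `vocab`, `counts`, and `doc_col` are mine, not from the original post.

```python
from sklearn.feature_extraction.text import CountVectorizer

sample_data = ['This is the first document', 'I loved them',
               'This document is the second document',
               'I am loving you', 'And this is the third one']

vec = CountVectorizer()
X = vec.fit_transform(sample_data)   # sparse document-term matrix

# vocabulary_ maps each learned word to its column index
vocab = vec.vocabulary_
counts = X.toarray()                 # dense copy, only sensible for small data

print(sorted(vocab))                 # the learned words, lowercased
print(counts.shape)                  # (5, 14): 5 sentences, 14 distinct words

# 'document' occurs twice in the third sentence, so its cell holds 2
doc_col = vocab['document']
print(counts[2, doc_col])            # 2
```

Note that 'I' never shows up in the vocabulary, because the default token pattern only keeps tokens of two or more characters.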

As another example, let's bundle punctuation removal and stopword removal together into one pipeline.

# Now let's take real email data and combine everything we have done so far.

# 1. Remove punctuation
# 2. Remove stopwords

# We bundle these two steps into a single function, i.e. we pipeline them.

import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
my_stopwords = stopwords.words('english')

def message_cleaning(sentence):
    # 1. Remove punctuation, character by character
    Test_punc_removed = [char for char in sentence if char not in string.punctuation]
    # 2. Join the remaining characters back into one string
    Test_punc_removed_join = ''.join(Test_punc_removed)
    # 3. Split into words and drop every word that is a stopword
    Test_punc_removed_join_clean = [word for word in Test_punc_removed_join.split() if word.lower() not in my_stopwords]
    # 4. Return only the words that remain
    return Test_punc_removed_join_clean
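As a quick check of what the function returns, here is a small sketch of the same cleaning steps applied to one sentence. To keep it runnable without downloading anything, it assumes a tiny hand-written stopword set (`demo_stopwords`, a hypothetical stand-in) instead of the NLTK English list, so the real output on your data will differ.

```python
import string

# Hypothetical stand-in for stopwords.words('english'); the NLTK list is much longer
demo_stopwords = {'i', 'am', 'so', 'are', 'you'}

def message_cleaning(sentence):
    # 1. Remove punctuation, character by character
    no_punc = ''.join(char for char in sentence if char not in string.punctuation)
    # 2. Split into words and drop stopwords (compared in lowercase)
    return [word for word in no_punc.split() if word.lower() not in demo_stopwords]

cleaned = message_cleaning('Hello!!! I am so happy, are you?')
print(cleaned)   # ['Hello', 'happy']
```

The function returns a list of tokens rather than a string, which is exactly the form CountVectorizer expects from a custom analyzer.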

Pass the message_cleaning function defined above to the analyzer parameter of CountVectorizer.

When a function is passed as analyzer, the vectorizer first runs that function on each document

and then counts the tokens the function returns.

vec = CountVectorizer(analyzer=message_cleaning)

X = vec.fit_transform(spam_df['text'])

As in the lines above, a CountVectorizer configured with the analyzer parameter is stored in a variable,

and fit_transform then applies it to all of the data in spam_df['text'].

Inspecting X confirms that both the string cleaning and the vectorizing were applied.
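The following is a small sketch of how a callable analyzer behaves, using two toy sentences in place of spam_df['text'] (which is not available here). The helper `simple_cleaning` and its one-word stopword set are hypothetical. One thing worth seeing: a callable analyzer replaces the built-in preprocessing, so the vectorizer no longer lowercases the text for you.

```python
import string
from sklearn.feature_extraction.text import CountVectorizer

def simple_cleaning(sentence):
    # Hypothetical analyzer: strip punctuation, split, drop the stopword 'the'
    no_punc = ''.join(ch for ch in sentence if ch not in string.punctuation)
    return [word for word in no_punc.split() if word.lower() != 'the']

docs = ['Free money!!!', 'free the money']

vec = CountVectorizer(analyzer=simple_cleaning)
X = vec.fit_transform(docs)

# 'Free' and 'free' are separate columns because the callable does not lowercase
print(sorted(vec.vocabulary_))   # ['Free', 'free', 'money']
print(X.toarray())
```

If case should not matter, lowercase the words inside the analyzer function itself before returning them.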

fit_transform is used only the first time, when the vectorizer learns its vocabulary from the training data. New data is handled as follows.

testing_sample = ['Free money!!!', "Hi Kim, Please let me know if you need any further information. Thanks"]

import numpy as np

new_data = np.array(testing_sample)

new_data = vec.transform(new_data)                   ### vectorizing step

new_data = new_data.toarray()

classifier1.predict(new_data)                        ### prediction step (classifier1 must already be trained on X)

When new data arrives, do not call fit_transform again. The vocabulary learned during fitting is already stored in vec, so calling transform is enough: it maps the new text onto the same columns.
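A minimal sketch of that difference, on made-up sentences: the vectorizer is fitted once, and transform then reuses the stored vocabulary. Words that were never seen during fitting (here 'lunch') have no column and are simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ['free money now', 'meeting at noon']   # toy training data
new_texts = ['free lunch money']                       # arrives later

vec = CountVectorizer()
X_train = vec.fit_transform(train_texts)   # learns the vocabulary AND counts
X_new = vec.transform(new_texts)           # reuses the vocabulary, learns nothing

print(X_train.shape)        # (2, 6)
print(X_new.shape)          # (1, 6): same columns as the training matrix
print(X_new.toarray().sum())  # 2: 'free' and 'money' counted, 'lunch' ignored
```

Calling fit_transform on the new data instead would build a different vocabulary, and the columns would no longer line up with what the classifier was trained on.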

To summarize: use fit_transform only the first time, when the model is given its training data.

When new data arrives later and has to be processed again, use transform.
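Putting the whole flow together, here is an end-to-end sketch on a handful of invented messages. The post does not show how classifier1 was created, so this assumes a MultinomialNB classifier purely for illustration; the training texts and labels (1 = spam, 0 = ham) are also made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented training data: 1 = spam, 0 = ham
train_texts = ['Free money now', 'Win free cash prize',
               'Please see the meeting notes',
               'Let me know if you need information']
train_labels = [1, 1, 0, 0]

# First time: fit_transform learns the vocabulary and vectorizes
vec = CountVectorizer()
X = vec.fit_transform(train_texts)

# Assumed classifier; the original post's classifier1 may be something else
classifier1 = MultinomialNB()
classifier1.fit(X, train_labels)

# Later: new data only needs transform, then predict
testing_sample = ['Free money!!!',
                  'Hi Kim, Please let me know if you need any further information. Thanks']
new_data = vec.transform(testing_sample)
pred = classifier1.predict(new_data)
print(pred)   # [1 0]
```

MultinomialNB accepts the sparse matrix directly, so the toarray() call is optional here.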
