Splitting Data with train_test_split

Haribo- 2022. 11. 22. 17:09

Splitting the Data

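Before data can be used as input to a machine learning model, it has to be split. Suppose we want to train a model on the titanic data that predicts Survived, the column indicating whether a passenger survived. Let's split the data for that task.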

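In this exercise, we start from the data whose outliers were handled in [Exercise 7] and separate it into feature data and label data. We then split the result into training data and evaluation data.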
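As a minimal sketch of the feature/label separation described above, the snippet below uses a tiny hand-made DataFrame instead of the real titanic file. The name `toy` and its values are made up purely for illustration; only the `Survived` column name comes from the exercise.

```python
import pandas as pd

# Hypothetical toy data standing in for the preprocessed titanic table.
toy = pd.DataFrame({
    'Pclass': [3, 1, 3, 1],
    'Age': [22.0, 38.0, 26.0, 35.0],
    'Survived': [0, 1, 1, 1],
})

# Features: every column except the label.
X = toy.drop(columns=['Survived'])
# Label: only the Survived column.
y = toy['Survived']

print(list(X.columns))
print(len(X), len(y))
```

Note that `drop` returns a new DataFrame here, so `toy` itself still contains the `Survived` column.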

The training/evaluation split can be done with train_test_split from the sklearn library.

X_train, X_test, y_train, y_test = train_test_split(<feature data>,
                                                    <label data>,
                                                    test_size=<value between 0 and 1>,
                                                    random_state=<random seed>)
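To make the call pattern above concrete, here is a small sketch on synthetic NumPy arrays (the data is hypothetical and unrelated to titanic). It also illustrates why random_state is set: calling the function twice with the same seed should give the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, plus a binary label.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 30% of the samples for evaluation; fix the seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# A second call with the same random_state reproduces the same split.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_test, X_test2))
```

Without a fixed random_state, the rows that land in each set can differ from run to run, which makes results harder to compare.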

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Read the data.
titanic = pd.read_csv('./data/titanic.csv')

# Drop the Cabin variable.
titanic_1 = titanic.drop(columns=['Cabin'])

# Drop samples that contain missing values.
titanic_2 = titanic_1.dropna()

# Handle outliers: keep only rows whose Age is a whole number.
titanic_3 = titanic_2[titanic_2['Age'] - np.floor(titanic_2['Age']) == 0]

print('Total number of samples: %d' % (len(titanic_3)))

"""
1. Separate the feature data and the label data.
"""
X = titanic_3.drop(columns=['Survived'])  # Drop the Survived variable and store the rest in X.
y = titanic_3['Survived']  # Store the Survived variable in y.

print('Number of X samples: %d' % (len(X)))
print('Number of y samples: %d' % (len(y)))

"""
2. Split X and y into training and evaluation data.
"""
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Print the number of samples in each split.
print('Number of training samples: %d' % (len(X_train)))
print('Number of evaluation samples: %d' % (len(X_test)))

Output:

Number of X samples: 687
Number of y samples: 687
Number of training samples: 480
Number of evaluation samples: 207
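The 480/207 split in the output above can be checked without the titanic file. The sketch below uses placeholder arrays with the same number of rows (687); it assumes that only the row count matters for the split sizes. Since 687 * 0.3 = 206.1, the sizes are consistent with the evaluation count being rounded up to 207 and the remaining 480 rows going to training.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 687  # sample count reported in the output above
test_size = 0.3

# Placeholder data: only the number of rows matters for this check.
X = np.zeros((n_samples, 1))
y = np.zeros(n_samples)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_size, random_state=42)

# 687 * 0.3 = 206.1, rounded up to get the evaluation count.
expected_test = math.ceil(n_samples * test_size)

print(len(X_train), len(X_test), expected_test)
```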
