Korean Datasets

NLP/AI/Statistics

Danbi Cho 2020. 9. 22. 20:48

This post introduces datasets and large-scale corpora that are used across many natural language processing tasks.

1) NSMC

2) Wikipedia

3) KorQuAD

4) AI-Hub

5) Sejong Corpus

6) KCC news data

7) Other open datasets

1) NSMC: Naver movie review data (positive/negative) - binary classification task

github.com/e9t/nsmc
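
As a minimal sketch of how the review data might be read for the binary classification task: this assumes the tab-separated layout used by the repository's ratings files (a header row with `id`, `document`, `label`, where the label is 0 for negative and 1 for positive). The function name `read_nsmc` and the sample rows are made up for illustration.

```python
import csv
import io

def read_nsmc(lines):
    """Parse NSMC-style TSV rows into (text, label) pairs."""
    # QUOTE_NONE keeps quotation marks inside reviews as ordinary characters.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    examples = []
    for row in reader:
        text = (row["document"] or "").strip()
        if not text:
            # Skip rows whose review text is empty.
            continue
        examples.append((text, int(row["label"])))
    return examples

# Tiny inline sample in the assumed layout (made-up rows, not real NSMC data).
sample = io.StringIO(
    "id\tdocument\tlabel\n"
    "1\t정말 재미있는 영화\t1\n"
    "2\t시간이 아까웠다\t0\n"
    "3\t\t1\n"
)
pairs = read_nsmc(sample)
print(pairs)
```

For a downloaded file, the same function could be called on `open(path, encoding="utf-8", newline="")`, where `path` points at one of the ratings files.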

2) Wikipedia: document data from Korean Wikipedia

ko.wikipedia.org/wiki/%EC%9C%84%ED%82%A4%EB%B0%B1%EA%B3%BC:%EB%8D%B0%EC%9D%B4%ED%84%B0%EB%B2%A0%EC%9D%B4%EC%8A%A4_%EB%8B%A4%EC%9A%B4%EB%A1%9C%EB%93%9C
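
The database dumps are distributed as large MediaWiki XML exports, so streaming is usually preferable to loading a whole file. The sketch below assumes the standard export layout (`<page>` elements containing `<title>` and `<revision><text>`); the helper name `iter_pages` and the inline sample are illustrative only, and the extracted text is still wiki markup rather than plain text.

```python
import io
import xml.etree.ElementTree as ET

def iter_pages(xml_stream):
    """Yield (title, text) for each <page> in a MediaWiki export stream."""
    title, text = None, None
    for _, elem in ET.iterparse(xml_stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # release the finished page's subtree

# Made-up two-page sample in the assumed export layout.
sample_xml = (
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    "<page><title>한국어</title><revision><text>한국어는 ...</text></revision></page>"
    "<page><title>한글</title><revision><text>한글은 ...</text></revision></page>"
    "</mediawiki>"
)
pages = list(iter_pages(io.BytesIO(sample_xml.encode("utf-8"))))
print(pages)
```

For a real dump, the stream could be opened with `bz2.open(path, "rb")` from the standard library, where `path` is the downloaded compressed file.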

3) KorQuAD: a Korean QA dataset

korquad.github.io/
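
A rough sketch of flattening a QA file into (question, context, answer) rows is shown below. It assumes the SQuAD v1.1-style JSON layout (`data` > `paragraphs` > `qas` > `answers`), which is what KorQuAD 1.0 is generally understood to follow; KorQuAD 2.0 is structured differently and would need its own handling. The function name `flatten_squad` and the sample record are made up for illustration.

```python
import json

def flatten_squad(dataset):
    """Flatten a SQuAD-style dict into one row per (question, answer) pair."""
    rows = []
    for article in dataset["data"]:
        for para in article["paragraphs"]:
            context = para["context"]
            for qa in para["qas"]:
                for ans in qa["answers"]:
                    rows.append({
                        "id": qa["id"],
                        "question": qa["question"],
                        "context": context,
                        "answer": ans["text"],
                        "answer_start": ans["answer_start"],
                    })
    return rows

# Made-up record in the assumed layout (not taken from KorQuAD).
sample_json = """
{"data": [{"title": "예시", "paragraphs": [{
    "context": "세종대왕은 한글을 창제하였다.",
    "qas": [{"id": "q1", "question": "세종대왕이 창제한 것은?",
             "answers": [{"text": "한글", "answer_start": 6}]}]}]}]}
"""
rows = flatten_squad(json.loads(sample_json))
row = rows[0]
# The answer span can be recovered from the character offset.
span = row["context"][row["answer_start"]:row["answer_start"] + len(row["answer"])]
print(row["question"], "->", span)
```

Checking that each offset actually recovers the answer text, as in the last lines, is a cheap sanity test before training an extractive QA model.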

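4) AI-Hub: task-specific datasets (Korean-English translation corpus, Korean dialogue datasets, speech datasets, machine reading comprehension, etc.)

> In addition to text, datasets for images, video, and other modalities are also published.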

www.aihub.or.kr/aidata/87

5) Sejong Corpus: a large-scale Korean corpus provided by the National Institute of Korean Language (국립국어원)

ithub.korean.go.kr/user/guide/corpus/guide1.do

It can be downloaded easily from the GitHub repository below.

github.com/coolengineer/sejong-corpus

6) KCC: a large-scale Korean corpus of news articles

> Depending on data size and format, the KCC150, KCCq28, KCC940, and KCC460 datasets are publicly available.

nlp.kookmin.ac.kr/kcc/

7) Other open datasets

> The GitHub repository below lists a wide variety of Korean datasets.

github.com/songys/AwesomeKorean_Data
