Active learning strategies for extracting phrase-level topics from scientific literature

Tao Yue; Yu Li; Zhang Runjie

doi:10.11925/infotech.2096-3467.2020.0281

Tao Yue^*, Yu Li, Zhang Runjie

^*Corresponding author for this work

Research output: Contribution to journal › Article › Peer-reviewed

2 Citations (Scopus)

Abstract

[Objective] This paper explores methods for extracting information from scientific literature using active learning strategies, aiming to address the shortage of annotated corpora. [Methods] We built a new model that combines three representative active learning strategies (MARGIN, NSE, MNLP) and one novel strategy, LWP, with a neural network model (CNN-BiLSTM-CRF). We then extracted task- and method-related information from texts using far fewer annotations. [Results] We evaluated the model on scientific articles in which only 10% to 30% of the texts were selectively annotated. The proposed model yielded the same results as models using 100% annotated texts, significantly reducing the labor cost of corpus construction. [Limitations] The sample corpus contained a small number of scientific articles, which led to low precision. [Conclusions] The proposed model significantly reduces reliance on the scale of the annotated corpus. Compared with the other existing active learning strategies, MNLP yielded better results, and its normalization by sentence length improves the model's stability. MARGIN performs well in the initial iteration at identifying low-value instances, while LWP is suitable for datasets with more semantic labels.
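The abstract's remark that MNLP normalizes by sentence length can be illustrated with a small sketch. It is not the authors' implementation: it assumes MNLP as it is commonly defined (the average per-token log-probability of the predicted label sequence, with the lowest-scoring sentences sent for annotation), and it stands in for the CNN-BiLSTM-CRF output with hand-written per-token probabilities. The names `mnlp_score`, `select_for_annotation`, and `pool` are hypothetical.

```python
import math
from typing import List, Sequence


def mnlp_score(token_probs: Sequence[float]) -> float:
    """Average log-probability of the predicted labels (length-normalized).

    Assumption: token_probs[i] is the model's probability for the label it
    predicted at token i. A CRF would supply a sequence-level probability
    instead; per-token values keep the sketch self-contained.
    """
    return sum(math.log(p) for p in token_probs) / len(token_probs)


def select_for_annotation(pool: List[Sequence[float]], budget: int) -> List[int]:
    """Return indices of the `budget` sentences with the lowest MNLP score."""
    ranked = sorted(range(len(pool)), key=lambda i: mnlp_score(pool[i]))
    return ranked[:budget]


# Hypothetical unlabeled pool: one long confident sentence, one short
# uncertain sentence, one short confident sentence.
pool = [
    [0.9] * 20,
    [0.6, 0.5],
    [0.99, 0.98, 0.97],
]

picked = select_for_annotation(pool, budget=1)
print(picked)
```

Without the division by length, the 20-token sentence would have the lowest summed log-probability simply because it is long; with normalization the short, genuinely uncertain sentence is selected instead.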

Original language	English
Pages (from-to)	134-143
Number of pages	10
Journal	Data Analysis and Knowledge Discovery
Volume	4
Issue number	10
DOI	http://doi.org/10.11925/infotech.2096-3467.2020.0281
Publication status	Published - 2020
Externally published	Yes

Cite this

@article{cbb4224aaf6a4102b5cb86fcfd1598ea,
  title = "Active learning strategies for extracting phrase-level topics from scientific literature",
  abstract = "[Objective] This paper explores methods of extracting information from scientific literature with the help of active learning strategies, aiming to address the issue of lacking annotated corpus. [Methods] We constructed our new model based on three representative active learning strategies (MARGIN, NSE, MNLP) and one novel LWP strategy, as well as the neural network model (namely CNN-BiLSTM-CRF). Then, we extracted the task and method related information from texts with much fewer annotations. [Results] We examined our model with scientific articles with 10\%\textasciitilde{}30\% selectively annotated texts. The proposed model yielded the same results as those of models with 100\% annotated texts. It significantly reduced the labor costs of corpus construction. [Limitations] The number of scientific articles in our sample corpus was small, which led to low precision issues. [Conclusions] The proposed model significantly reduces its reliance on the scale of annotated corpus. Compared with the existing active learning strategies, the MNLP yielded better results and normalizes the sentence length to improve the model{\textquoteright}s stability. In the meantime, MARGIN performs well in the initial iteration to identify the low-value instances, while LWP is suitable for dataset with more semantic labels.",
  keywords = "Active Learning, Information Extraction, Neural Network",
  author = "Tao Yue and Yu Li and Zhang Runjie",
  year = "2020",
  doi = "10.11925/infotech.2096-3467.2020.0281",
  language = "English",
  volume = "4",
  pages = "134--143",
  journal = "Data Analysis and Knowledge Discovery",
  issn = "2096-3467",
  publisher = "Chinese Academy of Sciences",
  number = "10",
}

TY - JOUR
T1 - Active learning strategies for extracting phrase-level topics from scientific literature
AU - Yue, Tao
AU - Li, Yu
AU - Runjie, Zhang
PY - 2020
Y1 - 2020

N2 - [Objective] This paper explores methods of extracting information from scientific literature with the help of active learning strategies, aiming to address the issue of lacking annotated corpus. [Methods] We constructed our new model based on three representative active learning strategies (MARGIN, NSE, MNLP) and one novel LWP strategy, as well as the neural network model (namely CNN-BiLSTM-CRF). Then, we extracted the task and method related information from texts with much fewer annotations. [Results] We examined our model with scientific articles with 10%~30% selectively annotated texts. The proposed model yielded the same results as those of models with 100% annotated texts. It significantly reduced the labor costs of corpus construction. [Limitations] The number of scientific articles in our sample corpus was small, which led to low precision issues. [Conclusions] The proposed model significantly reduces its reliance on the scale of annotated corpus. Compared with the existing active learning strategies, the MNLP yielded better results and normalizes the sentence length to improve the model’s stability. In the meantime, MARGIN performs well in the initial iteration to identify the low-value instances, while LWP is suitable for dataset with more semantic labels.

AB - [Objective] This paper explores methods of extracting information from scientific literature with the help of active learning strategies, aiming to address the issue of lacking annotated corpus. [Methods] We constructed our new model based on three representative active learning strategies (MARGIN, NSE, MNLP) and one novel LWP strategy, as well as the neural network model (namely CNN-BiLSTM-CRF). Then, we extracted the task and method related information from texts with much fewer annotations. [Results] We examined our model with scientific articles with 10%~30% selectively annotated texts. The proposed model yielded the same results as those of models with 100% annotated texts. It significantly reduced the labor costs of corpus construction. [Limitations] The number of scientific articles in our sample corpus was small, which led to low precision issues. [Conclusions] The proposed model significantly reduces its reliance on the scale of annotated corpus. Compared with the existing active learning strategies, the MNLP yielded better results and normalizes the sentence length to improve the model’s stability. In the meantime, MARGIN performs well in the initial iteration to identify the low-value instances, while LWP is suitable for dataset with more semantic labels.

KW - Active Learning
KW - Information Extraction
KW - Neural Network
UR - http://www.scopus.com/pages/publications/85101630135
U2 - 10.11925/infotech.2096-3467.2020.0281
DO - 10.11925/infotech.2096-3467.2020.0281
M3 - Article
AN - SCOPUS:85101630135
SN - 2096-3467
VL - 4
SP - 134
EP - 143
JO - Data Analysis and Knowledge Discovery
JF - Data Analysis and Knowledge Discovery
IS - 10
ER -
