Recent years have seen a rapid increase in the number of publications, scientific articles, reports, medical records, etc. that are available and readily accessible in electronic format. This has led to an increased need for text mining in the biomedical field. In order to transform unstructured collections of text into structured information and data, information extraction systems must accurately identify different biological and medical entities, such as chemical ingredients, genes, proteins, medications, diseases, symptoms, etc. Figure 1 shows a medical text that contains seven named diseases (highlighted in red) and four anatomical entities (highlighted in yellow).
The identification of entities like these allows search engines to index, organize, and link medical documents, as well as mine relations and extract associations from medical research literature. This then allows users to gather information from many disparate pieces of text and construct accurate and thorough medical knowledge graphs. We refer to this task of identification and tagging of entities in text as members of predefined categories (such as diseases, chemicals, genes, etc.) as Named Entity Recognition (NER).
NER is a widely studied task in natural language processing (NLP), and a number of works have applied machine learning (ML) approaches to NER in the medical domain. Building NER systems with high precision and high recall for this domain is challenging because quality labeled data is scarce and the language in that data varies widely, with frequent abbreviations, non-standardized descriptions, and lengthy names of biomedical entities.
An NER system can be devised as a supervised ML task in which the training data consists of labels for each token in a text. A typical approach to an NER task is to first extract word-level features and then train a linear model for word-level classification. Early approaches to NER systems in the biomedical domain include ABNER, BANNER, and GIMLI, which utilized a variety of hand-engineered features as input for a linear model.
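To make "labels for each token" concrete, here is a minimal sketch that assumes the common BIO tagging scheme (B- opens an entity, I- continues it, O is outside any entity). The scheme, the example sentence, and the helper name `bio_to_spans` are illustrative assumptions, not details taken from our datasets.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity text, entity type) pairs from BIO-tagged tokens."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def close_span() -> None:
        # Store the entity collected so far, if there is one.
        if current_tokens and current_type is not None:
            spans.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity.
            close_span()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # An I- tag of the same type extends the open entity.
            current_tokens.append(token)
        else:
            # "O" (or an I- tag with nothing to continue) ends any open entity.
            close_span()
            current_tokens, current_type = [], None
    close_span()
    return spans


# Invented example sentence: one label per token.
tokens = ["Patient", "has", "type", "2", "diabetes", "and", "liver", "damage"]
tags = ["O", "O", "B-Disease", "I-Disease", "I-Disease", "O", "B-Anatomy", "O"]

entities = bio_to_spans(tokens, tags)
print(entities)  # [('type 2 diabetes', 'Disease'), ('liver', 'Anatomy')]
```

A word-level classifier predicts one such tag per token, and the entity mentions are then read off the tag sequence as above.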
Recently, there has been a growing interest in using neural network (NN)-based methods for NER tasks as they can be trained end-to-end on the data that is available and learn the relevant features directly. As shown in Figure 2, a typical, state-of-the-art NN model for NER consists of the following layers:
● A character-level Convolutional Neural Network (CNN) layer whose filters of different widths extract character-level features. These features can encode morphological and lexical information observed in characters.
● A pre-trained word embedding layer that encodes the semantic relationships between words. These word embeddings are typically trained on a large corpus of medical texts such as PubMed abstracts.
● A word-level bidirectional LSTM layer to model the long-range dependence structure in medical texts. A BiLSTM computes two hidden states for every word in a sequence which are later concatenated.
● A decoder layer that projects the BiLSTM hidden states through an affine transformation, followed by a linear-chain Conditional Random Field (CRF) layer that models the sequence-level data likelihood.
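The CRF layer in the last bullet scores a whole tag sequence rather than each token separately. The sketch below shows how a sequence-level negative log-likelihood can be computed in plain Python with the forward algorithm. It is a simplified illustration, not our implementation: it assumes `emissions[t][j]` is the decoder's score for tag `j` at position `t`, `transitions[i][j]` is the score for moving from tag `i` to tag `j`, and there are no special start or stop transitions. All numbers are made up.

```python
import math
from typing import List


def log_sum_exp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(v) for v in values))."""
    highest = max(values)
    return highest + math.log(sum(math.exp(v - highest) for v in values))


def sequence_score(emissions, transitions, tags) -> float:
    """Unnormalized score of one tag sequence: emission plus transition scores."""
    score = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score


def log_partition(emissions, transitions) -> float:
    """Log of the summed exp-scores of all tag sequences (forward algorithm)."""
    num_tags = len(transitions)
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            log_sum_exp([alpha[i] + transitions[i][j] for i in range(num_tags)])
            + emissions[t][j]
            for j in range(num_tags)
        ]
    return log_sum_exp(alpha)


def crf_nll(emissions, transitions, tags) -> float:
    """Negative log-likelihood of the gold tag sequence under the CRF."""
    return log_partition(emissions, transitions) - sequence_score(
        emissions, transitions, tags
    )


# Toy input: 3 tokens, 2 tags (made-up scores).
emissions = [[1.0, 0.2], [0.3, 0.8], [0.5, 0.5]]
transitions = [[0.6, -0.4], [-0.2, 0.9]]
gold_tags = [0, 1, 1]

loss = crf_nll(emissions, transitions, gold_tags)
print(f"NLL of gold sequence: {loss:.4f}")
print(f"Probability of gold sequence: {math.exp(-loss):.4f}")
```

The forward recursion keeps the cost linear in sequence length instead of enumerating every possible tag sequence.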
We train the NER model with stochastic gradient descent by minimizing the negative log-likelihood of each sequence. One issue with this fully supervised approach is that it relies on high-quality labeled data, which is expensive to obtain. To address this, we investigated whether the parameters learned during a language modeling (LM) task could be used to improve the performance of an NN-based NER model. Current approaches to LM use an LSTM layer to model the context of previous words in order to predict the next target word in the sequence.
To pre-train our NER model, which includes a BiLSTM, we performed forward and backward language modeling with shared embedding and decoder parameters; we refer to this as BiLM (Figure 3). We then transferred the weights of the BiLM to an NER model with the same architecture. This gave the NER model a better parameter initialization, which helped prevent overfitting and improved model training and convergence speed.
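As a rough illustration of the two steps above, the sketch below first builds the prediction targets for the forward and backward language models, then copies pre-trained weights into an NER model by parameter name. Everything here is a simplifying assumption: parameters are flat lists of floats, the layer names are hypothetical, and layers whose sizes differ are simply left at their initial values. Exactly which layers are transferred in practice is described in the paper, not in this sketch.

```python
from typing import Dict, List, Tuple

Pairs = List[Tuple[List[str], str]]


def bilm_targets(tokens: List[str]) -> Tuple[Pairs, Pairs]:
    """Build (context, target) pairs for both language modeling directions."""
    # Forward LM: predict the next word from the words seen so far.
    forward = [(tokens[: i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]
    # Backward LM: predict the previous word from the words that follow it.
    backward = [(tokens[i:], tokens[i - 1]) for i in range(1, len(tokens))]
    return forward, backward


def transfer_weights(
    bilm_params: Dict[str, List[float]], ner_params: Dict[str, List[float]]
) -> List[str]:
    """Copy each BiLM parameter that has a same-named, same-sized NER parameter."""
    transferred = []
    for name, values in bilm_params.items():
        if name in ner_params and len(ner_params[name]) == len(values):
            ner_params[name] = list(values)  # copy, so the models stay independent
            transferred.append(name)
    return transferred


forward_pairs, backward_pairs = bilm_targets(["chronic", "kidney", "disease"])
print(forward_pairs)
print(backward_pairs)

# Hypothetical parameter names and toy values.
bilm_params = {
    "word_embedding": [0.1, 0.2, 0.3, 0.4],
    "bilstm": [0.5, 0.6],
    "decoder": [0.7, 0.8, 0.9],  # sized for the vocabulary in this toy setup
}
ner_params = {
    "word_embedding": [0.0, 0.0, 0.0, 0.0],
    "bilstm": [0.0, 0.0],
    "decoder": [0.0, 0.0],  # sized for the tag set, so it is not overwritten
    "crf_transitions": [0.0, 0.0, 0.0, 0.0],  # exists only in the NER model
}

transferred = transfer_weights(bilm_params, ner_params)
print(transferred)  # ['word_embedding', 'bilstm']
```

Language modeling needs no manual labels, since the targets come from the raw text itself. This is why pre-training can draw on unlabeled medical text.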
We evaluated our approach on four biomedical NER datasets covering diverse medical entity types. Table 1 reports the precision, recall, and F1 score. BiLM pre-training leads to a clear improvement in F1 score over the current state-of-the-art approaches on all four datasets.
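For readers less familiar with these metrics, the sketch below computes precision, recall, and F1 over entity spans. It assumes exact-match scoring, where a predicted entity counts only if its boundaries and type both match a gold entity. That matching rule, the `Span` representation, and the toy spans are assumptions for illustration and are not the numbers behind Table 1.

```python
from typing import Set, Tuple

# (index of first token, index of last token, entity type)
Span = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Entity-level scores under exact matching of boundaries and type."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: 3 gold entities, 4 predictions, 2 exact matches.
gold = {(2, 4, "Disease"), (6, 6, "Anatomy"), (9, 10, "Disease")}
predicted = {
    (2, 4, "Disease"),
    (6, 6, "Disease"),    # right boundaries, wrong type: no match
    (9, 10, "Disease"),
    (12, 12, "Disease"),  # spurious prediction
}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.500 recall=0.667 f1=0.571
```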
Furthermore, from our experiments, we observe that BiLM weight transfer leads to faster model training and also requires fewer training examples to achieve a particular F1 score.
Our work demonstrates that applying BiLM pre-training to medical NER can lead to much improved F1 scores. This means that transfer learning can be a very cost-effective method for improving the performance of NER models in the medical domain. Weight transfer leads to better model convergence during training than the commonly used random initialization of weights. Going forward, we believe this weight initialization strategy will make training NER models faster while also mitigating the effects of having limited labeled data.
If you’re interested in more details, please take a look at the preliminary version of our paper here: https://arxiv.org/abs/1711.07908