Illustration by Chenyu Wang

Automated ICD Coding Using Deep Learning

Petuum, Inc.
Jan 18, 2018


By Haoran Shi, Pengtao Xie, Zhiting Hu, Ming Zhang, Eric P. Xing

If you haven’t already read the first post on our work in AI for healthcare, Predicting Discharge Medications at Admission Time using Deep Learning, go check it out! This post will take a look at another aspect of our healthcare-specific machine learning (ML) platform — how it helps hospitals assign the correct codes to each patient visit.

The International Classification of Diseases (ICD) is a healthcare classification system maintained by the World Health Organization. Diseases and health statuses are classified according to fixed rules and uniquely identified by character codes. The ICD dates back to 1893, when a French doctor named Jacques Bertillon named 179 categories of causes of death. It has been revised periodically since then, roughly once a decade for most of its history, and has become an important standard for information exchange between hospitals. Its application has expanded to various facets of healthcare such as insurance payment, administration management, and research.

The efficiency of ICD coding has begun to receive more attention because of how crucial it is for making clinical and financial decisions. Hospitals with better coding quality see many benefits, including more accurate classification and retrieval of medical records, and better communication with other hospitals to jointly promote healthcare quality and facilitate research (e.g. knowledge graphs, disease-related models, intelligent diagnosis, etc.).

Typically, ICD coding is performed by a professional coder who follows strict guidelines and chooses the appropriate codes according to a doctor’s diagnosis and the patient’s electronic medical record (EMR, or health information system (HIS) in China). This coding process is complex and highly error-prone: doctors often use abbreviations in their diagnoses, which makes the match to ICD codes ambiguous and imprecise. Additionally, many diagnoses don’t map exactly to a single ICD code. Often, two closely linked diagnoses are encoded in one combination ICD code, and in some cases a doctor writes one diagnosis for a disease that should correspond to multiple ICD codes. The coding process therefore requires a comprehensive view of each patient’s health condition. However, very few medical practitioners can take over the process, since they lack training in professional coding.

To address this industry-wide problem with ICD coding, we propose a new attention-driven deep learning model that automatically translates doctors’ diagnoses into the corresponding ICD codes. The model uses separate recurrent neural networks for ICD definitions and written diagnoses, which lets it distinguish the two kinds of text and capture their hidden semantic information. To handle mismatches between ICD codes and written diagnoses, the model also introduces an attention mechanism that assigns a different weight to each diagnostic description a doctor writes.

The overall architecture of our model is shown in the following figure. Our experimental data comes from MIMIC-III, a freely available database for research use that contains nearly 60,000 inpatient records from 2001 to 2012 from the Beth Israel Deaconess Medical Center.

To train our model, we first extracted written diagnosis descriptions from discharge summaries and discarded the records that did not contain descriptions. This gave us nearly 11,000 valid records including nearly 60,000 diagnosis sentences.
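As a rough illustration of this extraction step, the sketch below pulls diagnosis lines out of a free-text discharge summary. It assumes a hypothetical note layout in which diagnoses sit under a "Discharge Diagnosis:" header and end at the next blank line; the pattern and the function name `extract_diagnoses` are illustrative only and are not the rules used in our pipeline.

```python
import re

# Assumed layout: a "Discharge Diagnosis:" (or "Diagnoses:") header,
# followed by one diagnosis per line, terminated by a blank line or end of note.
SECTION = re.compile(
    r"discharge diagnos[ie]s:\s*\n(.*?)(?:\n\s*\n|\Z)",
    re.IGNORECASE | re.DOTALL,
)


def extract_diagnoses(note):
    """Return the diagnosis lines found in a note, or [] if there is no section."""
    match = SECTION.search(note)
    if match is None:
        return []  # records without descriptions are discarded upstream
    diagnoses = []
    for raw in match.group(1).splitlines():
        # Drop leading numbering ("1.", "2)") or bullet characters.
        text = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", raw).strip()
        if text:
            diagnoses.append(text)
    return diagnoses


example_note = (
    "Brief Hospital Course:\nStable.\n\n"
    "Discharge Diagnosis:\n1. Congestive heart failure\n2. HTN\n\n"
    "Discharge Condition:\nGood\n"
)
print(extract_diagnoses(example_note))
```

Real notes are much noisier than this toy example, which is one reason many records yield no usable descriptions.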

From there, we used two independent neural networks to learn two different kinds of text: written diagnoses and ICD code definitions. Each network combined character-level and word-level recurrent neural networks to obtain hidden semantic representations of the text. When checking each ICD code, every diagnosis sentence was assigned a different weight based on the hidden semantic representations of the ICD code and of the diagnosis sentences. The features of the diagnosis sentences were then combined in a weighted average and passed through a fully connected layer to get a confidence score. After normalization, the model predicted the ICD code together with the probability that it should be assigned. We chose the 50 most frequently occurring ICD codes as coding targets.
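The attention step can be sketched in a few lines of plain Python. This is a minimal illustration, not our implementation: the encoders are replaced by fixed toy vectors, and the dot-product similarity, the softmax over sentences, and the sigmoid at the end are assumptions chosen for simplicity. All names (`score_icd_code`, `fc_weights`, `fc_bias`) are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def score_icd_code(code_vec, sentence_vecs, fc_weights, fc_bias):
    """Attention-pool the diagnosis sentences for one ICD code and score it."""
    # 1. One attention weight per diagnosis sentence (assumed: dot-product similarity).
    weights = softmax([dot(code_vec, s) for s in sentence_vecs])
    # 2. Weighted average of the sentence features.
    dim = len(code_vec)
    pooled = [sum(w * s[i] for w, s in zip(weights, sentence_vecs)) for i in range(dim)]
    # 3. Fully connected layer (a single output unit here) gives a confidence score.
    logit = dot(fc_weights, pooled) + fc_bias
    # 4. Squash to a probability (assumed: sigmoid).
    prob = 1.0 / (1.0 + math.exp(-logit))
    return weights, prob


# Toy inputs standing in for the encoder outputs.
code_vec = [1.0, 0.0]
sentence_vecs = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
weights, prob = score_icd_code(code_vec, sentence_vecs, fc_weights=[1.0, -1.0], fc_bias=0.0)
print(weights, prob)
```

In this toy run the first sentence is the most similar to the code vector, so it receives the largest weight and dominates the pooled representation. Repeating the same scoring once per target code yields one probability per ICD code.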

In experiments on real hospital data, our ICD encoder achieved an F1 score of 0.53 and an AUC of 0.90, significantly better than the same coding model without the attention mechanism. The F1 score is the harmonic mean of precision and recall and is widely used to evaluate binary classifiers on imbalanced data. The AUC_ROC score is the area under the ROC curve, which plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. Intuitively, the AUC_ROC score measures the probability that the model assigns a higher score to a positive instance than to a negative one; a score of 0.5 corresponds to random guessing. The performance of our full model and the ablation models is shown below in Table 2.
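To make the two metrics concrete, here is a small self-contained sketch that computes both for a single ICD code. The labels and scores are made-up toy values, not our experimental data, and the AUC is computed directly from its pairwise-ranking interpretation rather than by integrating the ROC curve.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for binary labels (1 = code assigned)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc_roc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # Ties count as half a win, matching the area under the ROC curve.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: five visits, two of which truly carry the code.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(f1_score(y_true, y_pred))  # 0.5
print(auc_roc(y_true, scores))   # 5 of 6 positive/negative pairs are ranked correctly
```

Note that F1 depends on the chosen threshold (0.5 here), while AUC depends only on how the scores rank the examples.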

To verify the reliability of the model, we also analyzed its hidden semantic representations and its attention allocation. We found that the character-level long short-term memory network (LSTM) word encoder handles various typos and morphological variants that appear in written diagnosis descriptions by generating similar representations for them. When distributing attention, the model also effectively distinguishes the importance of each diagnosis description for each ICD code being checked, which greatly improves coding accuracy.

Furthermore, our model outputs probabilities between 0 and 1, so we can adjust the decision threshold to trade off precision against sensitivity according to specific requirements. For example, we can choose a lower threshold to make the model more sensitive and hand its ICD code output over to professional coders for secondary screening. Typically, a coder needs to select the appropriate code from among tens of thousands of ICD codes, but after initial coding with our model, they can select from a small range of probable codes.
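The threshold trade-off can be illustrated with a short sketch. The per-code probabilities and the set of true codes below are invented for the example, and the helper names `shortlist` and `precision_recall` are hypothetical.

```python
def shortlist(code_probs, threshold):
    """Codes whose probability clears the threshold, most probable first."""
    kept = [code for code, p in code_probs.items() if p >= threshold]
    return sorted(kept, key=lambda code: code_probs[code], reverse=True)


def precision_recall(candidates, true_codes):
    """Precision and recall of a candidate list against the true code set."""
    hits = len(set(candidates) & set(true_codes))
    precision = hits / len(candidates) if candidates else 0.0
    recall = hits / len(true_codes) if true_codes else 0.0
    return precision, recall


# Made-up model output for one visit (keys are ICD-9 style code strings).
code_probs = {"4019": 0.82, "5849": 0.40, "4280": 0.35, "25000": 0.12}
true_codes = {"4019", "4280"}

for threshold in (0.5, 0.3):
    candidates = shortlist(code_probs, threshold)
    print(threshold, candidates, precision_recall(candidates, true_codes))
```

With the stricter threshold of 0.5 the shortlist contains only one code: it is correct, but the second true code is missed. Lowering the threshold to 0.3 recovers both true codes at the cost of one extra candidate for the human coder to reject.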

Although the diagnosis descriptions that we could extract from the discharge records are incomplete because of the noisy data format, our model still achieves strong coding performance. This demonstrates the feasibility of automatic ICD coding based on physicians’ diagnostic texts. With more well-formed medical records, we firmly believe that the efficacy of the model can be improved further.

If you’re interested in more details and can’t wait for our next post, take a look at our paper:


