


NLP Data Scientist (Consultant)



Health Catalyst



Boston, MA, US


About Health Catalyst

Health Catalyst provides data and analytics software and services that help providers and risk-bearing entities unleash their data and operate in a data-informed manner, driving improvements in their clinical and financial operations. Health Catalyst was named one of the 30 Best Workplaces in Technology by Fortune Magazine and the 11th best place to work by Glassdoor. Health Catalyst’s platform and applications are used at leading health systems, including John Muir Health, UPMC, MultiCare Health System, Partners HealthCare, Banner Health, Stanford Hospital & Clinics, Texas Children’s Hospital, and over 40 others, enabling the Company to analyze the healthcare records of over 100 million patients. Our team lives the cultural attributes of Smart, Hardworking and Humble. Learn more about working at Health Catalyst here:


Job Summary

The NLP Data Scientist will be responsible for developing and implementing methods that derive clinical and biomedical data elements from unstructured text. The NLP Data Scientist will partner with the rest of the Analytics and Data Science team to define problems and create solutions using advanced statistics, testing methods, data analysis, data mining, predictive modeling, optimization, machine learning and deep learning. They will also collaborate with ETL teams across Health Catalyst to ensure that free-text clinical notes and structured data are available as training data when needed. The incumbent will use regular expressions, supervised and unsupervised learning, third-party tools, and other approaches as necessary to extract and validate NLP data and to productionize NLP and machine learning pipelines that complement Health Catalyst’s rich structured data offerings.

The ideal candidate will have experience not only in NLP but also in applying those skills in a healthcare setting. Candidates with exposure to or interest in bioinformatics and oncology data are strongly preferred.

This role will begin on a consultancy basis, with the goal of converting to a full-time Health Catalyst position.


Duties & Responsibilities

  1. Work with colleagues to prioritize key data elements needed for Life Sciences. Research and recommend the most appropriate method for generating each data element on a case-by-case basis, weighing factors such as the feasibility of basic approaches like regular expressions against the availability of labeled data for machine learning.
  2. Deploy data science and technology-based algorithmic solutions to address business needs, favoring methods that reduce “black box” output as much as possible by linking to provenance and any available information that might have informed predictions.
  3. Collaborate with domain experts and perform independent research to understand nuances of target data elements in the clinical and biomedical space, ensuring the highest sensitivity and specificity possible.
  4. Validate methods by performing manual and automated quality control, communicating results and bringing in others as appropriate to ensure that methods are robust.
  5. Extract data elements, including genomics and biomarker data from pathology notes, disease-specific data in the cancer space, and more as projects dictate.
  6. Implement and/or collaborate with data engineers to productionize NLP pipelines.
  7. Propose and develop an extensible schema that captures extracted data elements, links to appropriate data provenance, and provides statistical confidence in data extracted.
  8. Document methods developed.


Required Skills

  • Proficient in Python, R, SQL, and deep learning tools such as PyTorch, TensorFlow and Keras
  • Strong understanding and experience applying machine learning and deep learning algorithms to NLP
  • Familiarity with electronic health records (EHR) and other real-world data (RWD) and understanding of ontologies and other tools required for rule-based NLP
  • Ability to communicate clearly and write technical documentation


Education & Relevant Experience

  • M.S. or PhD in Computer Science, Computational Linguistics, Data Science, or related field
  • 5+ years of experience applying data science and technology approaches to RWD
  • 5+ years in NLP or closely related field
  • 3+ years in healthcare or biomedical informatics strongly preferred
  • Experience working in an agile environment



The above statements describe the general nature and level of work being performed in this job function.  They are not intended to be an exhaustive list of all duties, and indeed additional responsibilities may be assigned by Health Catalyst.


