
Given the potential issues with the credibility of content generated by DotaGPT, we are committed to strictly regulating the model’s use to prevent misuse. Our datasets, DoctorFLAN and DotaBench, have been released under terms that uphold the highest ethical standards. This commitment ensures that while advancing the capabilities of large language models in healthcare, we also safeguard sensitive medical data.
Recent advancements in medical LLMs such as PMC-LLaMA, Med-PaLM, Med-PaLM2, and HuatuoGPT-II have significantly enhanced the domain-specific knowledge of these models and supported subsequent applications of medical LLMs. Leveraging these advancements, several popular medical application models are trained on extensive patient-doctor dialogues with the goal of functioning as autonomous virtual doctors, providing medical consultations directly to patients.
Despite these advancements, the accuracy of such models in generating expert-level medical advice remains insufficient. Providing their responses directly to patients, who typically lack medical training, poses significant risks, as patients may be unable to identify errors. For instance, a patient with suspected appendicitis presenting with abdominal pain and fever may receive an incomplete recommendation from the model, potentially delaying critical intervention.
In contrast, healthcare professionals, equipped with specialized medical knowledge, are capable of identifying such errors. This highlights the potential of developing large medical language models designed to assist doctors in addition to direct patient consultation. Recent efforts have developed medical LLMs as assistants that support doctors in specific scenarios, such as MedDM for differential diagnosis and treatment recommendations and Dia-LLaMA for CT report generation. However, these works typically address only isolated tasks, leaving a significant gap in the development of LLMs capable of comprehensively supporting the full spectrum of tasks within a doctor's workflow.
Developing a medical LLM capable of assisting across the entire clinical workflow requires a dataset that comprehensively covers all relevant tasks while providing detailed and accurate responses. Furthermore, a practical benchmark is essential to evaluate whether the model can generate outputs that effectively support doctors in real-world scenarios.
As shown in Table 6, existing datasets for online medical consultation dialogues, such as Huatuo-26M and MedDialog, primarily provide responses for pre-diagnosis scenarios. However, these datasets cover only a limited portion of medical scenarios, making them unsuitable for comprehensive, end-to-end medical workflows. Conversely, structured resources such as knowledge graphs (e.g., CMeKG) and multiple-choice question answering datasets (e.g., MedMCQA and CMExam) cover a broader range of clinical scenarios but are limited in their ability to support knowledge-intensive, context-rich responses. Thus, there is an urgent need for a comprehensive dataset that not only encompasses the entire spectrum of a doctor's workflow but also provides detailed and context-rich answers. Such a dataset is crucial for effectively training and deploying LLMs in clinical settings.
Furthermore, existing benchmarks are insufficient for effectively evaluating models as medical assistants due to their lack of alignment with practical, real-world scenarios. Common benchmarks, such as PubMedQA, MedQA, MultiMedQA, MedMCQA, CMExam, and CMB, primarily focus on assessing knowledge accuracy through multiple-choice questions. However, real-world medical tasks are rarely limited to answering multiple-choice questions. Instead, they often require more nuanced decision-making accompanied by detailed analysis and explanations. Similarly, benchmarks like PromptCBLUE, which evaluate isolated skills such as Named Entity Recognition in medical NLP tasks, fail to capture the integrated and contextually rich requirements of doctor-assistant applications. While open-ended benchmarks like HealthSearchQA offer broader evaluations, they still fall short of covering the full spectrum of tasks encountered in a doctor’s workflow. Thus, there is a clear need for more realistic and comprehensive benchmarks that accurately simulate diverse medical practice scenarios. These benchmarks should be designed to evaluate the ability of LLMs to function as effective doctor assistants, providing contextually aware, detailed, and practical responses that align with real-world requirements.
Prior clinical NLP systems, such as cTAKES, have primarily focused on retrospective information extraction, aiming to standardize clinical notes through rule-based processing for tasks like concept normalization and coding. We shifted the focus from retrospective extraction to prospective generation, designing workflow-aligned, open-ended tasks that reflect real-world clinical needs. To support workflow-aligned generation, we first defined a set of 22 representative tasks spanning the entire clinical workflow. These tasks were derived through expert interviews and validated via a large-scale survey of licensed physicians to ensure their practical relevance and generalizability. Building on this task framework, we constructed two complementary datasets: DoctorFLAN, which covers single-turn Q&A aligned with each task, and DotaBench, which extends the task design into multi-turn dialogue settings.
To ensure that the tasks identified aligned closely with the practical needs of medical professionals, we organized a symposium with 16 medical experts to discuss key tasks in the medical workflow. To avoid omissions, the experts categorized the workflow into four phases: Pre-diagnosis, Diagnosis, Treatment, and Post-treatment. In each phase, the experts identified and outlined the specific tasks that doctors typically perform in daily practice.
Pre-diagnosis tasks are actions that doctors perform before the diagnostic process. The tasks identified in this phase include Triage, as outlined in Table 7. Compared to the diagnostic and treatment tasks, the pre-diagnosis tasks generally involve fewer complex medical decisions. However, the introduction of LLMs has the potential to enhance workflow efficiency by automating the generation of simple decision-making outcomes.
Diagnosis tasks encompass all activities performed by doctors during the diagnostic process that contribute to formulating the final diagnosis. The tasks are summarized in Table 7. Given the complexity of medical decision-making in this phase, LLMs have significant potential to assist doctors in improving decision quality. For example, in the questioning prompts task, LLMs can generate questions based on the patient's condition, encouraging doctors to conduct more comprehensive and thorough inquiries. In clinical practice, less experienced doctors may overlook critical diagnostic considerations and fail to take a complete medical history; LLMs can mitigate this by providing additional prompts that guide thorough questioning. For instance, when evaluating a patient with abdominal pain, some doctors may focus solely on the location and intensity of the pain, while an LLM might prompt the doctor to inquire about changes in bowel habits, potentially revealing diagnostic clues such as irritable bowel syndrome or inflammatory bowel disease. Additionally, for tasks such as Case Summary, LLMs can automatically generate medical case summaries, saving doctors time and effort.
Treatment tasks refer to all actions performed by doctors after diagnosis and before patient discharge. These tasks include outpatient tasks such as Medication Advice and inpatient tasks such as Surgical Plan, with a complete task definition provided in Table 7. LLMs have the potential to assist doctors in these tasks by providing advice, thereby improving decision accuracy and consistency.
Post-treatment tasks are those that occur after a patient has completed their primary treatment and is transitioning to long-term recovery or ongoing management. The tasks in this phase primarily involve Health Guidance and Follow-up Plan, as detailed in Table 7. While long-term management tasks involve fewer complex decisions, they still require considerable time and effort from doctors. LLMs can help by quickly generating suggestions, improving workflow efficiency in this phase.
To further validate the universality of the tasks defined in the focus group discussions and to gain deeper insight into doctors' needs for medical LLM assistance, we conducted a survey of doctors from 13 tertiary hospitals. To ensure respondent qualifications, we distributed the survey exclusively within verified professional groups composed of licensed, practicing physicians with relevant clinical experience. The survey collected no personally identifiable information, in order to respect respondent privacy and encourage candid feedback. We first listed all 22 predefined tasks and asked participants to rate each task on a scale from 1 to 5, where 5 indicates that LLM assistance is crucial for improving work efficiency and 1 signifies no impact on task efficiency. We also invited the doctors to propose new tasks, beyond the predefined ones, across the four phases of their workflow. Finally, we asked about the challenges they encounter when using medical LLMs in practice, providing valuable feedback for the development of future medical assistant models.

We initially received 82 completed questionnaires. To ensure the validity of the responses, we applied two criteria (see the code sketch below): (1) the completion time must exceed one-third of the average duration (191.82 seconds) observed across all submissions, since faster submissions indicate a potential lack of thoughtful consideration; and (2) responses must not exhibit marked uniformity (e.g., repetitive selection of the same answer option), which suggests insufficient engagement with the content. After applying these criteria, we retained 71 valid responses for analysis.

The results revealed that most of the 22 predefined tasks received high ratings, with scores exceeding 4, indicating that LLM assistance is perceived as highly effective for these tasks. As shown in Fig. 3, tasks such as Triage, Case Summary, Medication Inquiry, and Preoperative Education were rated particularly highly. Doctors found medical LLM assistance especially valuable for these tasks due to their repetitive nature (e.g., Case Summary, Preoperative Education), relatively low medical risk (e.g., Triage), and high information demands (e.g., Medication Inquiry). No newly proposed task was suggested by more than five respondents, reinforcing that the final set of 22 tasks is widely applicable and relevant across the surveyed doctors. Among the participants, 46.5% reported using LLMs to assist with their clinical work. When asked about the limitations of current medical LLM capabilities, respondents showed strong consensus on several issues: 42.2% of doctors identified problems with noncompliance to instructions, 48.5% reported instances of incorrect answers, and 39.4% expressed concerns about the inability to provide accurate references. Doctors also emphasized the necessity of continuously updating the LLM's knowledge base and incorporating self-correction mechanisms to improve the reliability and accuracy of model outputs.
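The two validity filters admit a compact expression; the following is a minimal pandas sketch, assuming the survey export is a DataFrame with a hypothetical duration_sec column and per-task rating columns task_1 through task_22 (the column names are illustrative):

```python
import pandas as pd

def filter_valid_responses(df: pd.DataFrame) -> pd.DataFrame:
    rating_cols = [f"task_{i}" for i in range(1, 23)]

    # Criterion 1: completion time must exceed one-third of the mean duration.
    too_fast = df["duration_sec"] <= df["duration_sec"].mean() / 3

    # Criterion 2: reject marked uniformity, i.e. the same rating for all tasks.
    uniform = df[rating_cols].nunique(axis=1) == 1

    return df[~too_fast & ~uniform]
```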
We further compared the tasks defined in our framework with those in typical medical datasets, using KUAKE-QIC as a representative example. While some overlap exists between the two, our framework introduces 17 tasks not covered by KUAKE-QIC, highlighting its broader scope and versatility, as illustrated in Fig. 4.
To create a comprehensive dataset covering the entire clinical workflow, we constructed the single-turn DoctorFLAN dataset based on the 22 predefined tasks. First, we collected raw medical data from a variety of sources; we then heuristically filtered the data and mapped it to the relevant tasks. The dataset was refined in two stages: instruction alignment and response polishing. Following the initial construction, medical experts manually verified a subset of the data to ensure its quality, as shown in Fig. 5.
We used three primary data sources: medical multiple-choice questions (MCQs) (e.g., https://www.medtiku.com/), medical encyclopedia entries (e.g., https://m.120ask.com/), and high-quality existing medical datasets such as PromptCBLUE. MCQs were chosen for their ability to simulate a broad range of clinical scenarios, making them highly relevant to real-world practice. The medical encyclopedia, which contains detailed information on topics such as drugs and symptoms, provides a comprehensive and reliable reference, especially for tasks like Medication Inquiry. Additionally, we included task-relevant subsets of existing resources, such as the Case Summary subset of PromptCBLUE.
After collecting the raw data, we performed deduplication using Jaccard similarity (threshold = 0.8) to eliminate near-duplicate entries and improve data quality. We then categorized the data into the 22 predefined task types using a carefully designed set of task-specific regular expressions. Each task was associated with multiple regex patterns, which were iteratively refined based on expert feedback. In each iteration, we sampled 50 examples for manual annotation by a senior physician to assess classification quality. The refinement process continued until the regex-based categorization achieved over 95% agreement with expert labels, ensuring high precision and consistency. A description of the regex process and an example of task classification are provided in Section A of the Supplementary Information.
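A minimal Python sketch of this pipeline is shown below, assuming whitespace tokenization for the Jaccard computation and a small first-match pattern table standing in for the iteratively refined, task-specific regexes (both simplifications are illustrative, not the exact implementation):

```python
import re

def jaccard(a: str, b: str) -> float:
    # Word-level Jaccard similarity; the real tokenizer may differ.
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def deduplicate(samples: list[str], threshold: float = 0.8) -> list[str]:
    # Keep a sample only if it is not a near-duplicate of any kept sample.
    kept: list[str] = []
    for s in samples:
        if all(jaccard(s, k) < threshold for k in kept):
            kept.append(s)
    return kept

# Illustrative patterns only; each real task used multiple expert-refined regexes.
TASK_PATTERNS = {
    "Triage": [re.compile(r"which department", re.I)],
    "Medication Inquiry": [re.compile(r"dosage|adverse reaction", re.I)],
    # ... patterns for the remaining 20 tasks
}

def classify(sample: str) -> str | None:
    for task, patterns in TASK_PATTERNS.items():
        if any(p.search(sample) for p in patterns):
            return task
    return None  # unmatched samples are routed to manual review
```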
Although we gathered data for all 22 tasks, the initial dataset contained issues such as poorly worded instructions and overly brief responses. To address these problems, we implemented a two-step refinement process: instruction alignment and response polishing. In the instruction alignment phase, medical professionals manually drafted task-specific instructions for each of the 22 tasks, ensuring that the instructions accurately reflected real-world clinical scenarios and aligned with the intended task. In the response polishing phase, we prompted GPT-4 to generate more comprehensive responses by referencing the original data, enhancing their quality. The final dataset contains 92,350 samples, divided into a training set (DoctorFLAN-train) and a test set (DoctorFLAN-test). The test set includes 25 randomly sampled entries from each task, for a total of 550 samples.
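As an illustration of the response-polishing step, the sketch below issues a single GPT-4 call through the OpenAI Python client; the prompt wording is hypothetical and not the exact polishing prompt used:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def polish_response(instruction: str, original_answer: str) -> str:
    # Hypothetical polishing prompt: keep the original facts, improve depth.
    prompt = (
        "You are a medical expert. Rewrite the answer below so that it is "
        "comprehensive, clinically accurate, and well structured, keeping "
        "every fact from the original answer.\n\n"
        f"Question: {instruction}\n"
        f"Original answer: {original_answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```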
To ensure that the responses generated by GPT-4 were factually accurate and realistic, we used a structured review process in which a sample of 1050 responses (50 samples per item across 22 tasks) was reviewed by three medical professionals, each reviewing 350 items. The verification process was overseen by a senior expert with a high-level title, who dedicated 10 hours to ensure a thorough assessment. Each model response was reviewed alongside its corresponding reference answer, and the reviewers were instructed to revise or refine the outputs as needed based on that reference. Rather than conducting blind, independent annotation, this process was designed as a reference-grounded refinement task aimed at improving factual correctness and clinical appropriateness. This approach balances thoroughness with practical limitations, ensuring credible verification within the available resources. The verification criteria were Correctness, under which a response is considered correct if it contains no factual errors, and Practicality, under which a response is deemed practical if it is more effective than the original answer. The results showed 100% correctness and 99.9% practicality, underscoring the robustness of DoctorFLAN. In a detailed examination of the verification stage, we identified one instance where a doctor noted a lack of practicality, commenting on the "lack of specific details", as shown in Table 8. Such feedback suggests that GPT-4-refined responses can sometimes fall short in complex practical medical contexts, highlighting an area for future improvement.
Extending the single-turn dataset DoctorFLAN, we introduced multi-turn DotaBench to evaluate multi-turn dialogues involving medical assistants. This extension is motivated by the need to assess an LLM’s ability to operate in realistic clinical settings, where conversations often span multiple turns and involve a sequence of logically connected questions. While DoctorFLAN captures isolated queries, DotaBench focuses on multi-turn interactions in which each question is designed to build upon the previous one, simulating the stepwise inquiry process commonly used by physicians in real-world consultations.
To ensure clinical authenticity, we selected CMB-Clin as the source corpus. CMB-Clin is a multi-round question-answering dataset derived from real medical records. However, its original format consists of 2-4 standalone Q&A pairs that lack contextual continuity, making it unsuitable for dialogue-based evaluation in its raw form.
To address this limitation, we worked with licensed physicians to manually restructure the data into coherent three-turn dialogues. Specifically, we extracted key clinical elements from each case, such as chief complaints, physical findings, and diagnostic test results, and asked physicians to reformulate them into contextually connected questions that reflect realistic consultation workflows. The original answers from CMB-Clin were retained as reference responses, which were later used to support reference-based evaluation under the LLM-as-a-judge framework. A representative example illustrating this transformation is included in Supplementary Tables 1 and 2. Unlike DoctorFLAN, which directly involves LLMs in data generation, DotaBench was crafted without LLM intervention, eliminating the need for subsequent data verification and ensuring controlled evaluation conditions.
The statistical analysis of the DoctorFLAN and DotaBench datasets is presented in Table 9. The DoctorFLAN dataset comprises 92,326 instances across 22 distinct tasks, involving 27 medical specialties in total, as detailed in Table 10, demonstrating the comprehensive coverage of DoctorFLAN in real clinical scenarios. In addition, we extracted a subset of 25 instances per task, referred to as DoctorFLAN-test, for evaluation; the training and test sets were created via a random split. The DotaBench dataset includes 74 three-turn conversations.
We fine-tuned two open-source backbone models, Yi-6B and Baichuan2-7B-Base, using a standard supervised fine-tuning (SFT) framework with an autoregressive, decoder-only architecture. To ensure the model captures both domain-specific expertise and general ability, we constructed a mixed training corpus comprising 92k task-aligned medical samples from DoctorFLAN, 101k general-purpose instruction samples from datasets such as Evol-instruct and ShareGPT, and 51k additional medical QA pairs from CMExam.
All models were trained on 4 NVIDIA A100 GPUs. We set the maximum input sequence length to 4096 tokens and used a per-GPU batch size of 4, training for 3 epochs with a learning rate of 5 × 10⁻⁵. Optimization used the AdamW optimizer with decoupled weight decay, and gradient checkpointing was enabled to reduce memory consumption. Mixed-precision training was performed in fp16 format to accelerate computation.
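For concreteness, the hyperparameters above map onto a standard training configuration as in the sketch below, which uses the Hugging Face TrainingArguments as one possible realization; the paper's exact training framework is not specified, and the output directory name is illustrative:

```python
from transformers import TrainingArguments

# The 4096-token context limit is enforced at tokenization time; the Trainer
# and dataset wiring are omitted for brevity.
args = TrainingArguments(
    output_dir="doctorflan-sft",
    per_device_train_batch_size=4,  # batch size 4 on each of the 4 A100 GPUs
    num_train_epochs=3,
    learning_rate=5e-5,
    optim="adamw_torch",            # AdamW with decoupled weight decay
    gradient_checkpointing=True,    # trades recomputation for activation memory
    fp16=True,                      # mixed-precision training
)
```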
The objective function is the negative log-likelihood (NLL) of the target response given the prompt, encouraging the model to generate accurate and fluent outputs aligned with medical task instructions. Specifically, the loss is defined as:

$$\mathcal{L} = -\sum_{t=1}^{T} \log P_\theta\left(y_t \mid x, y_{<t}\right),$$

where $x$ denotes the input prompt, $y_t$ the target token at time step $t$, and $T$ the length of the target response. The final model checkpoint was selected after three training epochs based on manual review and preliminary validation performance, without using early stopping or automated selection heuristics.
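A minimal PyTorch sketch of this objective, assuming the conventional setup in which prompt positions are masked with the label -100 so that the loss is computed only over response tokens:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits: (B, T, V); labels: (B, T), prompt positions set to -100."""
    # Shift so that the logits at position t-1 predict the token at position t.
    logits = logits[:, :-1, :].contiguous()
    labels = labels[:, 1:].contiguous()
    # Cross-entropy over response tokens equals the NLL in the equation above.
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,  # skip positions belonging to the prompt x
    )
```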
To comprehensively evaluate the performance of medical-specific models trained on various backbones and datasets, we assessed a wide range of Chinese medical LLMs on DoctorFLAN-test and DotaBench.
Among the domain-specific models, we included BianQue-2, a medical model fine-tuned from ChatGLM-6B on patient-doctor dialogues; DISC-MedLLM, a model based on the Baichuan-13B-Base architecture designed for deep medical interactions; HuatuoGPT-7B, fine-tuned from Baichuan-7B for Chinese medical consultation; and HuatuoGPT-II-7B, a state-of-the-art medical LLM built on Baichuan2-7B with extensive medical knowledge.
We also evaluated general-purpose models to provide a performance baseline. These included Qwen-1.8B-Chat, fine-tuned with SFT and reinforcement learning from human feedback; Baichuan-13B-Chat, which shares the same backbone as DISC-MedLLM and demonstrates strong general performance; and the Baichuan2 models Baichuan2-7B-Chat and Baichuan2-13B-Chat. We further included Yi-6B-Chat and Yi-34B-Chat, which represent two scales of the Yi series, comparable to Qwen and Baichuan.
To broaden the comparison, we additionally report results from proprietary models such as GPT-3.5, GPT-4, and Claude-3.
All models were evaluated using the same decoding hyperparameters: max_new_tokens = 1024, top_p = 0.7, temperature = 0.5, and repetition_penalty = 1.1. We adopted Chain-of-Thought (CoT) prompting, without any additional augmentation techniques.
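As one possible realization, these settings correspond to the following Hugging Face generate() call; the model identifier and prompt are placeholders, and sampling is enabled since top_p and temperature are specified:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("01-ai/Yi-6B-Chat")
tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-6B-Chat")

prompt = "A patient reports three days of abdominal pain; suggest questions to ask."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,          # required for top_p/temperature to take effect
    top_p=0.7,
    temperature=0.5,
    repetition_penalty=1.1,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```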
Considering accuracy, reliability, and cost, our evaluation methodology incorporates both automatic and human evaluation.
Our task involves open-ended answer generation in medical contexts, where multiple correct and clinically valid responses may exist. In such settings, traditional metrics such as BLEU and ROUGE, which rely on n-gram overlap with reference answers, are often inadequate: they fail to capture semantic consistency when answers are phrased differently yet medically equivalent, and they are highly sensitive to variations in response length. To address these limitations, we employed GPT-4 (gpt-4-0125-preview) for automatic evaluation, a method shown to be highly effective in previous research. To ensure evaluation accuracy, we adopted a reference-based evaluation approach, in which the judge model refers to the provided reference and scores responses against predefined criteria: Accuracy (the correctness and reliability of the information), Coherence (the clarity and logical flow of the response), Relevance (how closely the response addresses the prompt), and Thoroughness (the depth and completeness with which the response covers the topic). During evaluation, we applied CoT prompting both in response generation and in the LLM-as-a-judge scoring process, without any external augmentation techniques such as retrieved rationales or tool-assisted reasoning. The evaluation was performed via the official OpenAI API with default inference settings. To support reproducibility, we provide the full evaluation prompt in Supplementary Figs. 1 and 2.
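A hedged sketch of the reference-based judging call is shown below, using the OpenAI Python client; the rubric text here is a condensed illustration, while the full evaluation prompt is given in Supplementary Figs. 1 and 2:

```python
from openai import OpenAI

client = OpenAI()

# Condensed, illustrative rubric; the actual prompt appears in the Supplementary Figs.
JUDGE_TEMPLATE = (
    "Compare the candidate answer with the reference answer. Reason step by "
    "step, then score the candidate from 1 to 10 on Accuracy, Coherence, "
    "Relevance, and Thoroughness.\n\n"
    "Question: {question}\nReference answer: {reference}\n"
    "Candidate answer: {candidate}"
)

def judge(question: str, reference: str, candidate: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4-0125-preview",
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(
                question=question, reference=reference, candidate=candidate
            ),
        }],
    )
    return resp.choices[0].message.content
```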
To balance accuracy and resource constraints, we conducted human evaluation on a subset of models. For DoctorFLAN-test, which contains 550 questions in total, we divided the questions into six roughly equal parts, with 91 or 92 questions per evaluator. Each evaluator rated the responses of all six models for every question in their assigned set, ensuring a fair and consistent evaluation across models. The evaluation team consisted of six healthcare professionals with varying levels of experience: three mid-level professionals with 5-6 years of experience, two associate senior professionals with 12 years of experience, and one senior professional with 26 years of experience. Evaluators were compensated according to their professional seniority, with senior professionals receiving 250 RMB per hour and mid-level professionals 165 RMB per hour. For DotaBench, we invited three doctors to participate in the evaluation, each spending an average of 3 hours reviewing the data.

