
Automating German Discharge Summaries with Large Language Models

Summary

Clinical Documentation Types

Medical documentation is divided into two main styles: structured data—organized into categories for easy analysis—and narrative text, which gives detailed, case-specific information. Structured formats support research and decision-making systems, while the narrative discharge summary helps guide post-hospital care and ensures doctors communicate important details to each other.

The Burden of Manual Summary Writing

Physicians spend a significant portion of their day managing electronic health records, with documentation taking up about 44% of their working hours. Writing discharge summaries is especially time-sensitive and critical; delays can lead to higher risks for rehospitalization and errors with medications, making automation an appealing solution.

Progress in Automated Summarization

Efforts to automate discharge summaries began with rule-based systems that used medical vocabularies but struggled to scale. Now, advanced methods—like transformer-based models (BERT, BART, GPT)—are capable of summarizing clinical data into concise documents. These models are usually trained on English medical data, making adaptation to other languages, like German, challenging.

Study Approach and Data Sources

This study used data from 25 pancreatic surgery patients at a German hospital. The input came from four sources: patient self-reports, physician admission notes, intraoperative records, and discharge treatment documentation. Most of the structured data had to be manually abstracted from free-text entries in these sources.

Structuring Data for the Model

To improve output quality, the dataset was organized into four sections ("General Information," "Before Surgery," "During Surgery," "Inpatient Stay") and followed precise inclusion rules—ensuring relevant details like abnormal lab results, lifestyle factors, and specific clinical findings were captured only under appropriate conditions.
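The sectioning and inclusion rules can be pictured with a small Python sketch. The four section names come from the text; the `Entry` fields and the specific rule (keep a lab value only when it is flagged abnormal, drop empty values) are illustrative assumptions, not the study's actual rule set.

```python
from dataclasses import dataclass

# Section names as listed in the text.
SECTIONS = ("General Information", "Before Surgery", "During Surgery", "Inpatient Stay")


@dataclass
class Entry:
    section: str
    label: str
    value: str
    is_lab: bool = False    # hypothetical flag: entry is a lab result
    abnormal: bool = False  # hypothetical flag: lab result outside reference range


def should_include(entry: Entry) -> bool:
    """Assumed inclusion rule: skip empty values; keep lab results only if abnormal."""
    if not entry.value.strip():
        return False
    if entry.is_lab:
        return entry.abnormal
    return True


def build_sections(entries: list[Entry]) -> dict[str, list[str]]:
    """Group the entries that pass the inclusion rule under the four fixed sections."""
    sections: dict[str, list[str]] = {name: [] for name in SECTIONS}
    for entry in entries:
        if entry.section not in sections:
            raise ValueError(f"unknown section: {entry.section}")
        if should_include(entry):
            sections[entry.section].append(f"{entry.label}: {entry.value}")
    return sections
```

Keeping the rules in one function makes it easy to audit what the model was and was not shown when a detail is missing from a generated summary.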

Optimizing Summaries through Prompt Engineering

Careful crafting of instructions (prompts) for the language model improved results. Using templates gave summaries a consistent structure, while assigning the model a "role" ensured professional tone and accurate German language use. Prompt chaining—where one prompt's output becomes the next prompt's input—helped break down complex writing tasks, although it made extracting the final summary more complicated.
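A minimal sketch of how a role, a template, and prompt chaining might fit together is shown here. The role wording, the template text, and the `generate` callable (a stand-in for whatever function sends a prompt to the model) are all hypothetical; the study's actual prompts are not reproduced in this summary.

```python
from typing import Callable

# Hypothetical role instruction; the study's exact wording is not given here.
ROLE = "You are an experienced surgeon writing a discharge summary in formal German."

# Hypothetical template that gives every step the same structure.
TEMPLATE = (
    "{role}\n\n"
    "Section: {section}\n"
    "Data:\n{data}\n\n"
    "Text written so far:\n{previous}\n\n"
    "Write this section of the discharge summary."
)


def chain_sections(sections: dict[str, list[str]], generate: Callable[[str], str]) -> str:
    """Run one prompt per section, feeding each output into the next prompt."""
    previous = ""
    parts: list[str] = []
    for name, facts in sections.items():
        prompt = TEMPLATE.format(
            role=ROLE,
            section=name,
            data="\n".join(facts),
            previous=previous or "(none)",
        )
        text = generate(prompt).strip()
        parts.append(text)
        previous = text  # output of this step becomes input to the next
    return "\n\n".join(parts)
```

Collecting each step's output in `parts` is one way to keep the final summary easy to assemble, which is the part of chaining the text describes as more complicated.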

In-Context Learning Limitations

Adding example summaries to prompts (in-context learning) caused the model to copy sentences verbatim, sometimes out of context, and did not reliably reduce errors. The method also struggled due to hardware limits: only a couple of examples could be used at once.
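The hardware limit amounts to a context budget: examples are added to the prompt until no more fit. The sketch below illustrates that idea under a loud simplification, counting whitespace-separated words instead of real model tokens; the function name and the budget value are assumptions for illustration.

```python
def select_examples(examples: list[str], budget_words: int) -> list[str]:
    """Take examples in order until the next one would exceed the word budget.

    Word counts are a rough stand-in for tokens; a real setup would count
    tokens with the model's own tokenizer.
    """
    chosen: list[str] = []
    used = 0
    for example in examples:
        size = len(example.split())
        if used + size > budget_words:
            break
        chosen.append(example)
        used += size
    return chosen
```

With full discharge summaries as examples, a budget like this is exhausted after only one or two entries, which matches the limitation described above.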

Model Performance and Error Types

Summaries generated by the LLaMA3 model averaged about three mistakes each. Frequent issues included incorrect age or date calculations, misclassification of symptoms, literal copying leading to gender mistakes, grammatical errors, and occasional "hallucinations" (creation of false information). Content completeness was inconsistent—key details like family history or lifestyle habits were often omitted even when available.
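The age and date arithmetic that the model got wrong is simple to state exactly. As an illustration only (the study is not described here as doing this), such values could be computed in code and passed to the model as finished facts; the function names below are hypothetical.

```python
from datetime import date


def age_on(birth: date, on: date) -> int:
    """Age in completed years on a given day."""
    years = on.year - birth.year
    # Subtract one year if the birthday has not yet occurred in that year.
    if (on.month, on.day) < (birth.month, birth.day):
        years -= 1
    return years


def length_of_stay(admission: date, discharge: date) -> int:
    """Number of days between admission and discharge."""
    return (discharge - admission).days
```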

Quality Evaluation Metrics

To assess how closely AI-generated summaries matched physician-written ones, the researchers used ROUGE scores (overlap of words and word sequences) and BERTScore (semantic similarity based on contextual embeddings). The LLM-generated summaries achieved moderate alignment, with about a quarter of the content matching and a BERTScore of 0.64. In qualitative surveys the model scored highest on correctness and fluency, but only 60% of summaries were rated "comprehensive."
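To make the overlap idea concrete, here is a minimal from-scratch sketch of ROUGE-N. It assumes lowercased whitespace tokenization and no stemming, so its numbers will not match those of established ROUGE packages or the study's reported scores.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: str, reference: str, n: int = 1) -> tuple[float, float, float]:
    """Return (precision, recall, F1) of clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipping).
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For the candidate "der patient wurde entlassen" against the reference "der patient wurde am montag entlassen", all four candidate words appear in the six-word reference, giving a ROUGE-1 precision of 1.0 and a recall of 4/6.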

Challenges and Future Directions

Limitations included the small data sample, missing details in structured inputs, and the model's uneven grasp of German medical language. For higher-quality outputs, future work should use larger datasets, better integration of unstructured clinical notes, more advanced model fine-tuning, and error feedback from clinicians. Incorporating retrieval models and human oversight could further minimize mistakes and omissions, making AI-generated summaries more valuable for everyday clinical use.

Source & Reference: Automated generation of discharge summaries: leveraging large language models with clinical data | Scientific Reports