Extracción automatizada de información en español de texto libre de informes de patología oncológica.

Bibliographic Details
Title: Extracción automatizada de información en español de texto libre de informes de patología oncológica.
Alternate Title: Automated extraction of information from free text of Spanish oncology pathology reports.
Authors: Marcela Mendoza-Urbano, Diana (1); Felipe Garcia, Johan (2); Sebastian Moreno, Juan (2,3); Carlos Bravo-Ocaña, Juan (4); José Riascos, Alvaro (2,3,4); Harvey, Angela Zambrano (5); Prada, Sergio I. (6,7), sergio.prada@fvl.org.co
Source: Colombia Medica. 2023, Vol. 54 Issue 1, p1-12. 12p.
Subjects: NATURAL language processing, HUMAN services programs, CANCER patients, SOFTWARE architecture, INFORMATION retrieval, AUTOMATION, MEDICAL records, ELECTRONIC health records, ALGORITHMS, ONCOLOGY, WORLD Wide Web
Geographic Terms: SPAIN
Abstract (English): Background: Pathology reports are stored as unstructured, ungrammatical, fragmented, and abbreviated free text, with linguistic variability among pathologists. For this reason, extracting tumor information requires significant human effort. Recording data in an efficient, high-quality format is essential for implementing and establishing a hospital-based cancer registry. Objective: This study aimed to describe the implementation of a natural language processing algorithm for oncology pathology reports. Methods: An algorithm was developed to process oncology pathology reports in Spanish and extract 20 medical descriptors. The approach is based on the successive matching of regular expressions. Results: Validation was performed with 140 pathology reports. Topography was identified in all reports by both the human reviewers and the algorithm. Morphology was identified by the human reviewers in 138 reports and by the algorithm in 137. The average fuzzy matching score was 68.3 for topography and 89.5 for morphology. Conclusion: A preliminary validation of the algorithm against human extraction was performed on a small set of reports, with satisfactory results. This shows that a regular-expression approach can accurately and precisely extract multiple specimen attributes from free-text Spanish pathology reports. Additionally, we developed a website to facilitate collaborative validation at a larger scale, which may be helpful for future research on the subject. [ABSTRACT FROM AUTHOR]
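The Methods statement above describes matching regular expressions in succession. A minimal Python sketch of that general idea follows. The rule table `RULES`, the helpers `normalize` and `extract`, and every pattern in it are hypothetical illustrations; none of them come from the article, which extracts 20 descriptors with rules not reproduced in this record.

```python
import re
import unicodedata

# Hypothetical patterns for two descriptors; illustrative only, not the study's rules.
# For each descriptor the patterns are ordered from most to least specific.
RULES = {
    "topography": [
        re.compile(r"\b(?:biopsia|reseccion|muestra) de (?P<value>[a-z ]+?)(?:[.,:;]|$)"),
        re.compile(r"\b(?P<value>mama|colon|prostata|estomago|pulmon)\b"),
    ],
    "morphology": [
        re.compile(r"\b(?P<value>(?:adeno)?carcinoma(?: [a-z]+){0,2})"),
        re.compile(r"\b(?P<value>linfoma|melanoma|sarcoma)\b"),
    ],
}


def normalize(text):
    """Lowercase, strip diacritics (so 'ñ' becomes 'n'), and collapse whitespace."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    no_marks = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return re.sub(r"\s+", " ", no_marks).strip()


def extract(report):
    """Try each descriptor's patterns in order; the first pattern that matches wins."""
    text = normalize(report)
    result = {}
    for field, patterns in RULES.items():
        result[field] = None  # stays None when no pattern matches
        for pattern in patterns:
            match = pattern.search(text)
            if match:
                result[field] = match.group("value").strip()
                break
    return result


if __name__ == "__main__":
    print(extract("Biopsia de mama derecha: CARCINOMA ductal infiltrante."))
```

Ordering the patterns from specific to generic lets a contextual rule take precedence over a bare keyword, and a descriptor that no rule matches is reported as missing rather than guessed.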
Abstract (Spanish): Introducción: Los reportes de patología están almacenados como texto libre sin estructura, gramática, fragmentados o abreviados, con variabilidad lingüística entre patólogos. Por esta razón, la extracción de información de tumores requiere un esfuerzo humano significativo. Almacenar información en un formato eficiente y de alta calidad es esencial para implementar y establecer un registro hospitalario de cáncer. Objetivo: Este estudio busca describir la implementación de un algoritmo de Procesamiento de Lenguaje Natural para reportes de patología oncológica. Métodos: Desarrollamos un algoritmo para procesar reportes de patología oncológica en Español, con el objetivo de extraer 20 descriptores médicos. El abordaje se basa en la coincidencia sucesiva de expresiones regulares. Resultados: La validación se hizo con 140 reportes de patología. La identificación topográfica se realizó por humanos y por el algoritmo en todos los reportes. La morfología fue identificada por humanos en 138 reportes y por el algoritmo en 137. El valor de coincidencias parciales (fuzzy matches) promedio fue de 68.3 para Topografía y 89.5 para Morfología. Conclusión: Se hizo una validación preliminar del algoritmo contra extracción humana sobre un pequeño grupo de reportes, con resultados satisfactorios. Esto muestra que múltiples atributos del espécimen pueden ser extraídos de manera precisa de texto libre de reportes de patología en Español, usando un abordaje de expresiones regulares. Adicionalmente, desarrollamos una página web para facilitar la validación colaborativa a gran escala, lo que puede ser beneficioso para futuras investigaciones en el tema. [ABSTRACT FROM AUTHOR]
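The abstracts report an average fuzzy matching score per descriptor, but this record does not say how the score is computed. As a rough illustration only, the sketch below compares a human-extracted value with an algorithm-extracted value using `difflib.SequenceMatcher` from the Python standard library, scaled to 0-100. Both the metric and the scale are assumptions, and the names `fuzzy_score` and `average_score` are hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_score(human, algorithm):
    """Similarity on an assumed 0-100 scale; 100 means identical after normalization."""
    a = " ".join(human.lower().split())
    b = " ".join(algorithm.lower().split())
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def average_score(pairs):
    """Mean fuzzy score over (human, algorithm) pairs for one descriptor."""
    if not pairs:
        raise ValueError("at least one pair is required")
    scores = [fuzzy_score(h, a) for h, a in pairs]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Made-up example pairs, not data from the study.
    pairs = [("mama derecha", "mama"), ("colon sigmoide", "colon sigmoide")]
    print(round(average_score(pairs), 1))
```

A character-level ratio like this gives partial credit when the two extractions differ only in wording or detail, which is one plausible reason a descriptor could score well below 100 even when it was identified in every report.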
Copyright of Colombia Medica is the property of Universidad del Valle and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: MedicLatina
Header DbId: lth
DbLabel: MedicLatina
An: 164146652
AccessLevel: 6
PubType: Academic Journal
PubTypeId: academicJournal
PLink https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=lth&AN=164146652
RecordInfo BibRecord:
  BibEntity:
    Identifiers:
      – Type: doi
        Value: 10.25100/cm.v54i1.5300
    Languages:
      – Code: spa
        Text: Spanish
    PhysicalDescription:
      Pagination:
        PageCount: 12
        StartPage: 1
  BibRelationships:
    IsPartOfRelationships:
      – BibEntity:
          Dates:
            – D: 01
              M: 01
              Text: 2023
              Type: published
              Y: 2023
          Identifiers:
            – Type: issn-print
              Value: 01208322
          Numbering:
            – Type: volume
              Value: 54
            – Type: issue
              Value: 1
          Titles:
            – TitleFull: Colombia Medica
              Type: main