Web data integration

From Wikipedia, the free encyclopedia

Web data integration (WDI) is the process of aggregating and managing data from different websites into a single, homogeneous workflow. This process includes data access, transformation, mapping, quality assurance and fusion of data. Data that is sourced and structured from websites is referred to as "web data". WDI is an extension and specialization of data integration that views the web as a collection of heterogeneous databases.
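The mapping, transformation, quality assurance and fusion steps named above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two source records, the site names, the target schema (`name`, `price`, `brand`) and the "first non-missing value wins" fusion rule are hypothetical and are not taken from any particular WDI product.

```python
# Hypothetical records describing the same product on two different websites.
SOURCE_A = {"title": "Example widget", "price_usd": "19.99"}
SOURCE_B = {"name": "Example Widget", "cost": "19.99", "brand": "Acme"}

# Mapping: source-specific field names -> shared target schema (assumed).
MAPPINGS = {
    "site_a": {"title": "name", "price_usd": "price"},
    "site_b": {"name": "name", "cost": "price", "brand": "brand"},
}


def map_record(record, mapping):
    """Rename the fields of one source record to the target schema."""
    return {target: record[source]
            for source, target in mapping.items() if source in record}


def transform(record):
    """Normalize value types; here only the price string becomes a float."""
    out = dict(record)
    if "price" in out:
        out["price"] = float(out["price"])
    return out


def is_valid(record):
    """Quality check: a record needs a name and a positive price."""
    return bool(record.get("name")) and record.get("price", 0) > 0


def fuse(records):
    """Fusion: for each field, keep the first value seen across records."""
    fused = {}
    for record in records:
        for field, value in record.items():
            fused.setdefault(field, value)
    return fused


mapped = [
    transform(map_record(SOURCE_A, MAPPINGS["site_a"])),
    transform(map_record(SOURCE_B, MAPPINGS["site_b"])),
]
product = fuse([r for r in mapped if is_valid(r)])
```

In this sketch `product` combines the name and price from the first site with the brand that only the second site provides. Real systems typically use more elaborate conflict resolution than "first value wins".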

Data integration techniques applied to the web form the foundation for businesses that take advantage of data available on the ever-increasing number of publicly accessible websites.[1] Corporate spending in this area amounted to about USD 2.5bn in 2017, with the market expected to reach almost USD 7bn by 2020.[2]

WDI treats the web as a collection of database views accessible over web protocols, including, but not limited to:[3]

  • Open data catalogs
  • Government data catalogs
  • Web applications and sites
  • The semantic web (SPARQL)
  • HTML embedded structured data
  • HTML data tables
  • Spreadsheets
  • PDFs
  • Online encyclopedias
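As an illustration of one source type in the list above, HTML embedded structured data, the following sketch pulls JSON-LD blocks out of a page using only the Python standard library. The sample page, the class name `JsonLdExtractor` and the helper `extract_json_ld` are hypothetical; pages may also embed structured data in other ways (for example Microdata or RDFa), which this sketch does not cover.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self.items = []       # successfully parsed JSON-LD documents
        self._buffer = None   # text of the script being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the page
            self._buffer = None


def extract_json_ld(html):
    """Return every JSON-LD document embedded in the given HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical page used only for demonstration.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Example widget"}
</script>
</head><body><p>Example widget</p></body></html>
"""

items = extract_json_ld(SAMPLE_PAGE)
```

Here `items` holds one dictionary per embedded block, which downstream mapping steps can treat like a row returned from a database query.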

Data access and transformation

WDI faces technical challenges distinct from those of conventional data integration: web data sources are often unstructured or semi-structured and lack a standard query mechanism, which complicates both data access and transformation.
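To make the absence of a query mechanism concrete, the sketch below turns an HTML data table into a list of records by walking the markup directly. It assumes a simple table whose first row holds the column names; the sample markup and the names `TableExtractor` and `table_to_records` are hypothetical, and real pages often need extra handling for nested tables, merged cells or missing headers.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every <tr> in an HTML document."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # row currently being read, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # discard any cell left open by bad markup


def table_to_records(html):
    """Use the first row as field names and return the rest as dicts."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical table used only for demonstration.
SAMPLE_HTML = """
<table>
  <tr><th>city</th><th>population</th></tr>
  <tr><td>Springfield</td><td>30,000</td></tr>
  <tr><td>Shelbyville</td><td>12,500</td></tr>
</table>
"""

records = table_to_records(SAMPLE_HTML)
```

The values stay as strings (for example `"30,000"`), so a later transformation step would still need to normalize them into numbers before the data can be compared or fused with other sources.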

Data quality

Applications

References
