What happens to data enriched by crowdsourcing, machine learning/AI, or a combination of methods? We ran a survey from March 20 to April 30 2024 to help find out. We were interested in the barriers and successes for projects trying to incorporate enriched data into collections systems (such as catalogues or discovery platforms).
We're analysing the results over the summer and hope to present them at conferences later in the year. In the meantime we wanted to share some early results, as they're already looking quite interesting – and might provide some tips for people embarking on workflows with enriched data.
We received 63 responses. 20 responses were from the UK, 13 from the USA, 4 from multiple European countries, 3 each from China, Sweden and the Netherlands. 2 or fewer responses were received from Australia, Belgium, Denmark, Finland, France, Hong Kong, Ireland, Italy, Spain and Switzerland.
(We devoted a lot of our outreach efforts to reaching people outside our existing networks, mostly through individual emails to people associated with projects, or to people we knew in a region who might pass our messages on. Even so, we didn't receive any responses from Latin America, South Asia or Africa. We'd love to find funding to properly collaborate with other groups, and translate and localise our surveys.)
43% of responses were about crowdsourcing projects; 24% about machine learning / AI projects, and 22% were about projects that combined crowdsourcing and machine learning.
The largest share of responses came from libraries (35%), followed by museums and archives (9.5% each); other projects were based in universities, non-profit organisations and combined services. More large organisations responded than small ones – 33% of responses came from organisations with more than 500 paid employees; 30% with 100–499 employees; 18% with 5–19 employees and 11% with 1–4 employees.
The largest group of respondents were able to ingest enriched data to some extent – 20% could ingest both new and updated records; 8% could only ingest new records; another 8% were partially able to ingest records (for a range of reasons); and 6% could only ingest updated records. 22% of respondents were not able to ingest enriched data, and 21% were still planning or hoping to complete ingest.
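For readers who want to see how the ingest figures above combine, here is a minimal tally in Python. It uses only the rounded percentages quoted in the paragraph; the variable names and the "not reported here" remainder are our own labels for illustration, not categories from the survey.

```python
# Rounded percentages as quoted in the paragraph above.
# The dictionary keys are illustrative labels, not the survey's wording.
ingest_outcomes = {
    "new and updated records": 20,
    "new records only": 8,
    "partial ingest": 8,
    "updated records only": 6,
}
NOT_ABLE = 22
STILL_PLANNING = 21


def summarise(outcomes, not_able, still_planning):
    """Combine the reported percentages into headline groups."""
    some_ingest = sum(outcomes.values())
    accounted = some_ingest + not_able + still_planning
    return {
        "some ingest": some_ingest,
        "not able": not_able,
        "still planning": still_planning,
        # Whatever is left over is not broken down in the text above.
        "not reported here": 100 - accounted,
    }


summary = summarise(ingest_outcomes, NOT_ABLE, STILL_PLANNING)
for label, pct in summary.items():
    print(f"{label}: {pct}%")
```

Because the inputs are rounded, the totals are approximate; the remainder simply reflects responses not broken down in this early summary.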
Barriers to ingesting enriched data include lack of technical skills, the restrictions of formats such as MARC, and the inability to ingest 'third party' metadata or transcriptions. Issues reported included gatekeeping and institutional politics, lack of staff time (e.g. for technical processes, quality control and data cleaning), erroneous or incomplete data not meeting required standards, and data replication problems.
Responses about factors important for successful ingest often began with the word 'agreement', including collaboration between departments, organisations, and with volunteers, on topics such as conditions for data re-use, specifications and standards, and the distribution of work between teams. Initial analysis suggests that the use of APIs, standard formats, data standards and controlled vocabularies contribute to success by reducing the overhead of creating a pipeline of import/exports across platforms and tools.
At least 63% of projects had some manual or automated quality assurance processes for enriched data. 13% had no process and some projects are still deciding on their processes. 87% of respondents provided more information on what 'data quality' meant for their project; analysis is ongoing. 29% of projects using machine learning (17 respondents) reported that corrections to the data help improve the model. The ability of systems and workflows to display information on crowdsourcing- or machine learning-enhanced records to staff or the public was very mixed; analysis is ongoing.
When we began this research, we thought that the affordances of the collections systems used by GLAMs (galleries, libraries, archives and museums) would be a significant factor in the successful integration of enriched data into these systems. However, our results so far indicate that skills, resources, and inter-personal and institutional relationships are also significant.