Preprocessing genealogical documents is essential to accurate, smooth indexing. Continue reading to learn more about the role of preprocessing in genealogical indexing.

Indexing genealogical documents involves considerable work because of the age of the source material and the volume of data involved. If the indexing process is not handled carefully, errors can result. Many historical documents are centuries old, which makes extracting information from them a serious challenge. Indexing professionals often have difficulty deciphering the writing on these documents and are frequently overwhelmed by the volume of records they must index before a family tree can be built.

A genealogist uses historical records, genetic tests, oral interviews, and other methods to learn more about a family and determine its relationship to others. The first step in this research is to collect family papers from various sources. The historical documents are then carefully reviewed and analyzed, transcribed, and indexed in a searchable digital database.

However, before a document collection can be indexed, the documents must go through several preprocessing steps.

What Is Preprocessing?

Preprocessing is a data mining technique that transforms raw data into an understandable format. Real-world data is often incomplete, inconsistent, and full of errors, and in that state it cannot be processed further. Preprocessing addresses this problem.

Preprocessing is critical to the indexing process. It involves many steps, including data cleaning, data transformation, and feature selection. As a result of preprocessing, the data is presented in a more meaningful form for subsequent analysis.

This technique transforms raw data gathered from various sources into information that is ready for analysis. It involves taking all of the available information, organizing and sorting it, and combining it.

Raw data can be incomplete or inconsistent and can contain a lot of redundant information. There are three main types of problems with raw data:

  • Missing or inaccurate data leaves gaps that can affect the final analysis. Data often goes missing because of a problem during the collection phase, such as a system outage or a mistake made when entering the information.
  • Noisy data includes erroneous values and outliers in the data set that carry no relevant information. Noise enters during data gathering through human mistakes, rare exceptions, and other errors.
  • Inconsistent data arises when data is kept in files of different formats. Duplicated formats or mistakes in name codes often produce inaccurate data and deviations that need to be fixed before analysis.
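
The three problem types above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not a description of any production workflow: the field names (surname, given_name, event_date), the accepted date formats, and the plausible year range are all hypothetical.

```python
from datetime import datetime

# Hypothetical date formats that might appear across different source files.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %B %Y")
# Hypothetical set of fields every transcribed record is expected to carry.
REQUIRED_FIELDS = ("surname", "given_name", "event_date")


def parse_date(text):
    """Return a date for the first format that matches, or None if none do."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def check_record(record, min_year=1500, max_year=2025):
    """Return a list of issue labels found in one transcribed record."""
    issues = []
    # Missing data: a required field is absent or blank.
    for field in REQUIRED_FIELDS:
        if not (record.get(field) or "").strip():
            issues.append(f"missing:{field}")
    raw_date = (record.get("event_date") or "").strip()
    if raw_date:
        parsed = parse_date(raw_date)
        if parsed is None:
            # Inconsistent data: a date is present but in an unknown format.
            issues.append("inconsistent:event_date")
        elif not (min_year <= parsed.year <= max_year):
            # Noisy data: the date parses but is implausible (assumed range).
            issues.append("noisy:event_date")
    return issues


def normalize_date(text):
    """Convert any accepted date format to ISO 8601, or None if unparseable."""
    parsed = parse_date(text)
    return parsed.isoformat() if parsed else None
```

For example, check_record({"surname": "", "given_name": "Ann", "event_date": "1871-03-12"}) reports a missing surname, while normalize_date("12/03/1871") brings a day-first date into the same format as the rest of the data set. Flagging issues rather than silently correcting them leaves the final decision to a human reviewer.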

If these issues are not corrected, the final analysis will be unreliable and can lead to faulty conclusions. In genealogy, errors can make a huge difference, and even the smallest error can change the scope of a family tree. Hence, preprocessing data to make sure it is consistent is vital before indexing.

SBL’s Expertise in Preprocessing

As part of our genealogical indexing process, we start with a detailed preprocessing of the data. We examine all the data that is needed to determine budgets, prepare keying instructions, and maximize return on investment.

Our preprocessing methodology:

  1. Understand the project requirements, which helps us match the data with its context
  2. Classify the images by event type and prepare the batch-wise image count
  3. Generate a detailed in-scope and out-of-scope report based on the client inventory/metadata
  4. Prepare the project details, including year range and estimation, according to archive and client
  5. Analyze the input images for readability (legibility), missing pieces, format differences, discrepancies, etc.
  6. Check that the keying instructions are complete and clear, and consult the end user for any clarification
  7. Further differentiate the images by complexity and event
  8. Estimate the record count for each event and complexity level
  9. Prepare a report based on the basic elements and customer requirements
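
As a rough illustration of the classification, scoping, and estimation steps in the list above, the following Python sketch groups image metadata by event type and complexity and totals the estimated record counts. The field names (event, complexity, year, estimated_records) and the function name summarize_batch are hypothetical and chosen only for this example; real inventories will differ.

```python
from collections import defaultdict


def summarize_batch(images, year_range):
    """Split images into in-scope and out-of-scope, then count in-scope
    images and estimated records per (event, complexity) group."""
    first_year, last_year = year_range
    summary = defaultdict(lambda: {"images": 0, "estimated_records": 0})
    out_of_scope = []
    for image in images:
        # Images outside the agreed year range are reported, not counted.
        if not (first_year <= image["year"] <= last_year):
            out_of_scope.append(image["id"])
            continue
        key = (image["event"], image["complexity"])
        summary[key]["images"] += 1
        summary[key]["estimated_records"] += image["estimated_records"]
    return dict(summary), out_of_scope


# Hypothetical inventory for a small batch.
batch = [
    {"id": "img001", "event": "baptism", "complexity": "low", "year": 1850, "estimated_records": 20},
    {"id": "img002", "event": "baptism", "complexity": "low", "year": 1851, "estimated_records": 30},
    {"id": "img003", "event": "marriage", "complexity": "high", "year": 1860, "estimated_records": 8},
    {"id": "img004", "event": "burial", "complexity": "low", "year": 1790, "estimated_records": 15},
]
summary, out_of_scope = summarize_batch(batch, year_range=(1800, 1900))
```

Here the two baptism images fall into one group with a combined estimate of 50 records, and img004 is listed as out of scope because its year is outside the assumed 1800 to 1900 range. A summary of this kind is one possible input to the budget and effort estimates mentioned earlier.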

The SBL Advantage in Preprocessing

We follow a comprehensive and well-defined preprocessing method that ensures the accuracy and reliability of information. Using preprocessed data reduces the chance of making errors during genealogical indexing.
Our cost-effective services ensure that your raw data is transformed into structured data before indexing. Our indexing professionals then work meticulously on your data so that you get the most accurate results.
Contact the Genealogy Division of SBL to kick-start your indexing project. Our expert team is always ready to assist you.