Data cleaning, the work of transforming messy, inconsistent, or incomplete data into something reliable and analysis-ready, is the focus of Course 4 and runs through several other courses in the GDA certificate. It's also one of the most underestimated skills on the job: a commonly cited estimate is that analysts spend as much as 80% of their time cleaning and preparing data.
Types of Dirty Data
The certificate introduces several categories of data quality issues you'll need to recognize and address:
Missing data: NULL values, blank cells, or placeholder text like 'N/A' or '-'.
Duplicate data: Repeated rows from merging datasets or import errors.
Inconsistent formatting: Dates in different formats, mixed capitalization ('new york' vs 'New York'), inconsistent units.
Inaccurate data: Values that are technically non-null but clearly wrong (e.g., age = 500).
Irrelevant data: Columns or rows that don't serve the analysis at hand.
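As a rough illustration of how the first four categories might be flagged programmatically, here is a minimal Python sketch. The sample rows, the column names, the placeholder list, and the age cutoff of 120 are all assumptions made up for this example; irrelevant data is left out because deciding relevance is a judgment call rather than a rule.

```python
# Hypothetical sample rows; names and values are illustrative only.
rows = [
    {"name": "Ana", "city": "New York", "age": "34"},
    {"name": "Ben", "city": "new york", "age": "N/A"},
    {"name": "Ana", "city": "New York", "age": "34"},   # exact copy of row 0
    {"name": "Cy",  "city": "Boston",   "age": "500"},  # non-null but implausible
]

# Assumed markers for "missing"; adjust to whatever your data actually uses.
PLACEHOLDERS = {"", "n/a", "-", "null"}


def is_missing(value):
    """True for None, blank strings, or placeholder text."""
    return value is None or str(value).strip().lower() in PLACEHOLDERS


def find_issues(rows, max_age=120):
    """Return (row_index, issue) pairs for a few common data quality problems."""
    issues = []
    seen = set()
    for i, row in enumerate(rows):
        # Duplicate data: the same set of column/value pairs seen earlier.
        key = tuple(sorted(row.items()))
        if key in seen:
            issues.append((i, "duplicate"))
        seen.add(key)

        # Missing data: None, blanks, or placeholders in any column.
        for col, value in row.items():
            if is_missing(value):
                issues.append((i, f"missing {col}"))

        # Inaccurate data: present, numeric, but outside a plausible range.
        age = row.get("age")
        if not is_missing(age) and str(age).strip().isdigit() and int(age) > max_age:
            issues.append((i, "implausible age"))

        # Inconsistent formatting: city not written in title case.
        city = row.get("city")
        if not is_missing(city) and city != city.title():
            issues.append((i, "inconsistent case in city"))
    return issues


issues = find_issues(rows)
```

With the sample rows above, row 1 is flagged for a missing age and inconsistent case, row 2 as a duplicate, and row 3 for an implausible age.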
Data Cleaning in Spreadsheets
Key spreadsheet operations for cleaning data: Remove Duplicates (Data menu), TRIM() to strip extra whitespace, PROPER()/UPPER()/LOWER() to normalize text case, Text to Columns to split combined fields, Find & Replace for bulk changes, and conditional formatting to spot outliers visually.
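If it helps to see what these operations do to the underlying text, the sketch below gives rough Python approximations of TRIM(), PROPER(), Text to Columns, and Remove Duplicates. The function names are hypothetical, and the behavior is only an approximation of the spreadsheet features (for example, str.title() and PROPER() may treat some edge cases differently).

```python
def trim(text):
    """Approximate TRIM(): drop leading/trailing whitespace, collapse internal runs."""
    return " ".join(text.split())


def proper(text):
    """Approximate PROPER(): capitalize the first letter of each word."""
    return text.title()


def split_field(text, delimiter=","):
    """Approximate Text to Columns: split on a delimiter and tidy each part."""
    return [trim(part) for part in text.split(delimiter)]


def remove_duplicates(values):
    """Approximate Remove Duplicates: keep the first occurrence, preserve order."""
    return list(dict.fromkeys(values))


raw_cities = ["  new   york ", "New York", "BOSTON", "boston "]
clean_cities = remove_duplicates(proper(trim(c)) for c in raw_cities)
# clean_cities is ["New York", "Boston"]
```

Note that the duplicates only become visible after trimming and case normalization, which is one reason to standardize text before removing duplicates.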
Data Cleaning in SQL
-- Find NULL values
SELECT * FROM customers WHERE email IS NULL;
-- Find customer_id values that appear more than once (possible duplicate rows)
SELECT customer_id, COUNT(*) AS row_count
FROM customers
GROUP BY customer_id
HAVING COUNT(*) > 1;
-- Trim whitespace in SQL
SELECT TRIM(customer_name) FROM customers;
-- Standardize case
SELECT LOWER(email) FROM customers;
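To try queries like these without a database server, one option is Python's built-in sqlite3 module. The sketch below is an assumption-laden example, not part of the certificate material: the sample rows are invented, and the 'N/A' placeholder is just one marker a real dataset might use. It also shows two refinements: treating blanks and placeholders as missing (IS NULL alone would not catch them), and normalizing values before counting duplicates.

```python
import sqlite3

# In-memory database with a hypothetical customers table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER, customer_name TEXT, email TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "  Ana Ruiz ", "Ana@Example.com"),
        (2, "Ben Lee", None),                  # true NULL
        (3, "Cy Young", "N/A"),                # placeholder text, not NULL
        (4, "Ana Ruiz", "ana@example.com "),   # same email, different formatting
    ],
)

# Missing data: NULLIF turns blanks and the assumed 'N/A' placeholder into NULL,
# so a single IS NULL check covers all three cases.
missing = conn.execute(
    """
    SELECT customer_id
    FROM customers
    WHERE NULLIF(NULLIF(TRIM(email), ''), 'N/A') IS NULL
    ORDER BY customer_id
    """
).fetchall()

# Duplicates: trim and lowercase first so differently formatted copies group together.
duplicates = conn.execute(
    """
    SELECT LOWER(TRIM(email)) AS clean_email, COUNT(*) AS row_count
    FROM customers
    WHERE email IS NOT NULL
    GROUP BY LOWER(TRIM(email))
    HAVING COUNT(*) > 1
    """
).fetchall()

conn.close()
```

Here missing contains customer IDs 2 and 3, and duplicates contains a single row for 'ana@example.com' with a count of 2.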
Documenting Your Cleaning Process
The certificate emphasizes documenting what you did to your data and why. This matters both for the exam (you'll be asked about best practices) and for real-world work (your colleagues and future-you need to know what changed). Keep a change log noting each transformation, what the original data looked like, and what decision rule you applied.
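A change log does not need special tooling; a spreadsheet tab or text file works. For scripted cleaning, one possible approach is to record an entry every time a transformation runs. The sketch below is a minimal illustration under that assumption; the apply_and_log helper and the log fields are hypothetical, not a format prescribed by the certificate.

```python
from datetime import date


def apply_and_log(values, column, rule, transform, log):
    """Apply `transform` to each value and append one change-log entry to `log`."""
    cleaned = [transform(v) for v in values]
    # Pair up old and new values so the log can show what actually changed.
    changed = [(old, new) for old, new in zip(values, cleaned) if old != new]
    log.append(
        {
            "date": date.today().isoformat(),
            "column": column,
            "rule": rule,  # the decision rule, in plain words
            "rows_changed": len(changed),
            "example_before": changed[0][0] if changed else None,
            "example_after": changed[0][1] if changed else None,
        }
    )
    return cleaned


change_log = []
cities = ["new york", "New York", " boston "]
cities = apply_and_log(
    cities,
    column="city",
    rule="Trim whitespace and apply title case",
    transform=lambda s: s.strip().title(),
    log=change_log,
)
```

After this runs, change_log holds one entry recording that two rows in the city column changed, with 'new york' becoming 'New York' as the before-and-after example.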