Data cleaning and management course using Stata

All 3 days only R8000.00 (Excl. VAT) per delegate

Wed 29th – Fri 31st Aug (3 days) | Wed 05th – Fri 07th Dec (3 days)

Johannesburg, South Africa

About this course

Every data analyst who is keen on producing accurate, high-quality statistical outputs spends ample time exploring, “feeling” and cleaning their data before embarking on statistical analysis. In fact, data cleaning and management are thought to take more time than the statistical analysis itself. There are two reasons for this. First, the validity of data analysis outputs depends on the quality of the data used. Second, a reviewer or reader of a report is more likely to spot data analysis errors than data cleaning and management errors, so cleaning errors can go undetected unless the analyst catches them. Despite the extreme importance of data cleaning and management for programme and research data, the concepts and procedures are rarely taught in postgraduate schools, and there are scarcely any short courses on the continent that cover them. This leaves many data analysts without a systematic approach to data cleaning and management.

After 15 years of data analysis experience in diverse projects and training data analysts, CESAR has drawn from its rich experience to put together a comprehensive data cleaning and management short course. The course will be taught using Stata, and participants will be required to have some experience with Stata or similar software.

The course will teach participants the following:

Day 1

  • Definitions and dimensions of data quality
  • Data management

Combining data
This section will cover why and how to combine datasets. The different procedures for merging and appending data will be covered, including how to use merging to troubleshoot data errors.
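As a rough illustration of merging used for troubleshooting, the sketch below uses two small made-up datasets (the variable names and values are hypothetical, not course material). It shows only a one-to-one merge; appending is not shown.

```stata
* Hypothetical participant register, saved to a temporary file
clear
input id str10 name
1 "Thabo"
2 "Naledi"
3 "Sipho"
end
tempfile participants
save `participants'

* Hypothetical visit records
clear
input id visit_weight
1 70.5
2 64.0
4 81.2
end

* One-to-one merge on id; Stata adds _merge to record where each row came from
merge 1:1 id using `participants'

* Rows found in only one file (_merge != 3) often point to data errors,
* such as a mistyped identifier
list id name visit_weight _merge if _merge != 3
```

Because the block uses local macros for the temporary file, it should be run in one go from a do-file.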

Learn to fix dates
Understanding dates is central to longitudinal studies (including intervention studies) and is also important for cross-sectional surveys. The course will teach participants how Stata stores dates, how to convert dates from different formats to Stata format, how to format dates, how to subtract dates to get age and duration, and how to extract parts of a date (e.g. the month). Participants will also learn the numerous Stata commands for dealing with dates, including nuisance dates.
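A minimal sketch of these date operations, assuming dates arrive as day/month/year strings (the data and variable names are invented for illustration):

```stata
clear
input str10 dob_str str10 visit_str
"15/03/1990" "01/08/2018"
"02/11/1985" "29/08/2018"
end

* Convert day/month/year strings to Stata dates (days since 01jan1960)
gen dob   = date(dob_str, "DMY")
gen visit = date(visit_str, "DMY")

* Display the numeric dates in a readable form
format dob visit %td

* Age in completed years (approximate: divides elapsed days by 365.25)
gen age = floor((visit - dob) / 365.25)

* Extract parts of a date
gen visit_month = month(visit)
gen visit_year  = year(visit)

list
```

Dividing by 365.25 is a common approximation and can be off by a day around birthdays; it is used here only to keep the example short.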

Handling strings – inlist, strmatch, inrange, trim and substr
Database managers are trained to avoid or reduce the use of string fields in their research data, and with good reason. However, strings are not entirely avoidable in datasets. For example, the following types of fields are usually strings: unique identifiers, open-ended questions, and questions that ask participants to specify their own option.
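The functions named in the heading could be used along these lines. This is only a sketch on invented identifiers; the site codes and the age range are assumptions made for the example.

```stata
clear
input str12 pid age
"  JHB-001 " 34
"CPT-002"    17
"DBN-003  "  52
end

* Remove leading and trailing blanks from the identifier
replace pid = trim(pid)

* Take the first three characters as a (hypothetical) site code
gen site = substr(pid, 1, 3)

* inlist() checks membership in a list of values
gen byte coastal = inlist(site, "CPT", "DBN")

* strmatch() supports wildcards: * for any characters, ? for one character
gen byte is_jhb = strmatch(pid, "JHB-*")

* inrange() checks whether a number lies within a closed interval
gen byte adult = inrange(age, 18, 64)

list
```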

Day 2

What is special about Stata's egen command?
The generate command is one of the most commonly used commands in Stata. The course will show applications of generate beyond the basic ones, including conditional generate. Beyond the generate command, there is an extended-generate (egen) command, which has numerous applications for diverse types of data management, including risk score creation. There is a time to use generate and a time to use egen. Both will be covered.
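To make the distinction concrete, here is a small sketch on made-up clinic data (variable names, values and the blood pressure cut-off are assumptions for illustration). generate works row by row, while egen can summarise across variables or across observations.

```stata
clear
input id smoker hypertension diabetes clinic sbp
1 1 0 1 1 142
2 0 0 0 1 118
3 1 1 1 2 160
4 0 1 . 2 135
end

* Conditional generate: flag raised systolic blood pressure
gen byte high_sbp = sbp >= 140 if !missing(sbp)

* egen across variables: a simple additive risk score
* (rowtotal() treats missing values as zero)
egen risk_score = rowtotal(smoker hypertension diabetes)

* egen across observations: mean blood pressure within each clinic
bysort clinic: egen mean_sbp = mean(sbp)

list
```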

Learn to aggregate the data

Sometimes analysts have to convert their data from individual-level data to aggregate-level data. This is important for longitudinal studies, multi-site M&E data and certain forms of data analysis. Two approaches to aggregating data will be taught: the intuitive step-by-step approach and Stata's built-in collapse command. Participants will also learn the difference between _n and _N for such analysis.
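The two approaches might look roughly like this on a small invented patient-level dataset (names and values are hypothetical). Within a by-group, _n is the position of the current row and _N is the number of rows in the group.

```stata
clear
input site patient_id visits
1 101 3
1 102 5
2 201 2
2 202 4
2 203 6
end

* Step-by-step approach, done on a temporary copy of the data
preserve
bysort site: gen seq = _n                       // row number within site
bysort site: gen n_patients = _N                // number of rows in site
bysort site: egen total_visits = total(visits)  // site total
keep if seq == 1                                // keep one row per site
keep site n_patients total_visits
list
restore

* collapse approach: one command produces the same site-level data
collapse (count) n_patients = patient_id (sum) total_visits = visits, by(site)
list
```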

Reshaping data
Stata allows data management and statistical analysis to be carried out in wide and long formats, but some analysts may be more comfortable with a particular format. Stata can reshape data from wide to long format and vice-versa using the reshape command.
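As a brief sketch, assuming repeated weight measurements stored as weight1 to weight3 (an invented example):

```stata
clear
input id weight1 weight2 weight3
1 70 71 73
2 64 63 65
end

* Wide to long: one row per participant per visit
reshape long weight, i(id) j(visit)
list

* Long back to wide: one row per participant
reshape wide weight, i(id) j(visit)
list
```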

Foreach and forvalues

foreach and forvalues are Stata loop commands that speed up analysis by applying the same procedure to many variables, without the need to write separate code for each variable. The commands and their applications will be covered.
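A minimal sketch of both loop commands, assuming three questionnaire items in which 99 is a missing-value code (the data, the code 99 and the cut-off of 4 are all hypothetical):

```stata
clear
input id q1 q2 q3
1 1 99 3
2 2 4 99
3 99 5 1
end

* foreach: recode the assumed missing-value code 99 to Stata missing
foreach v of varlist q1 q2 q3 {
    replace `v' = . if `v' == 99
}

* forvalues: loop over the item numbers to create an indicator per item
forvalues i = 1/3 {
    gen byte q`i'_high = q`i' >= 4 if !missing(q`i')
}

list
```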

Day 3

Automatic outputting of results
Manually copying results from the Stata results window remains the most common way for analysts to transfer results from Stata to Word documents or Excel reports. Yet this can be both time-consuming and error-prone. This course will teach participants how to use Stata's tabout command to output Stata results automatically. Other similar commands will also be mentioned.
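For orientation only, a call to tabout might look like the sketch below. tabout is a user-written command that has to be installed first (here from SSC, which needs an internet connection); the example uses the auto dataset shipped with Stata, and the output file name table1.txt is just a placeholder.

```stata
* Install the user-written tabout command (only needed once)
ssc install tabout, replace

* Example dataset that ships with Stata
sysuse auto, clear

* Two-way table of repair record by car origin, with frequencies and
* column percentages, written to a tab-delimited text file
tabout rep78 foreign using table1.txt, cells(freq col) format(0 1) replace
```

The resulting file can then be opened in Excel or pasted into Word as a table.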

Data cleaning

The course will cover the philosophies, commands and practice of exploring data for the following types of errors. Steps on how to correct them will also be covered.

  • Duplicate records
  • Illogical sequence
  • Missing data reports
  • Illegal values
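
The four checks listed above could be sketched as follows. The dataset is invented, and the valid ranges (age 0 to 120, sex coded 1 or 2) are assumptions chosen for the example rather than rules from the course.

```stata
clear
input id str10 enrol_str str10 exit_str age sex
1 "01/02/2018" "15/06/2018" 34 1
2 "10/03/2018" "01/03/2018" 29 2
2 "10/03/2018" "01/03/2018" 29 2
3 "05/04/2018" "20/07/2018" 210 3
4 "12/05/2018" ""           .  1
end
gen enrol     = date(enrol_str, "DMY")
gen exit_date = date(exit_str, "DMY")
format enrol exit_date %td

* Duplicate records: report them, tag them, then drop exact duplicates
duplicates report id
duplicates tag id, gen(dup)
duplicates drop

* Illogical sequence: exit recorded before enrolment
list id enrol exit_date if exit_date < enrol & !missing(exit_date)

* Missing data report
misstable summarize age exit_date

* Illegal values: outside the assumed valid ranges or codes
list id age if !inrange(age, 0, 120) & !missing(age)
list id sex if !inlist(sex, 1, 2)
```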

Timing and procedures for data cleaning and management

  • Before data collection – data collection planning
  • During data collection – data collection execution
  • After data collection – cleaning data for analysis

Course Content

This course is for those who want to deepen their knowledge of data management and analysis and improve the quality of their outputs. Priority will be given to participants who have some experience with data analysis using Stata or similar software, as well as those who have attended CESAR’s Stata course in the past.

Who Should Attend
  • Researchers
  • PhD and master’s students
  • M&E specialists
  • Data analysts
  • Statisticians and biostatisticians

For more details about our services contact:

Rose More
Tel: +27 11 403 1411 / +27 72 509 1861

Price Includes
  • Course attendance
  • Full refreshments: lunch
  • Welcome tea
  • Two breaks for tea including pastries
  • Course lecture notes and training manual
  • Complimentary parking
  • Certificate of attendance