ALSWH sampling scheme
Selection of the sample
The study sample was selected by Medicare Australia (previously known as the Health Insurance Commission) from three zones - urban, rural and remote. The age groups sampled from the Medicare database in April 1996 were 18-22 years, 45-49 years and 70-74 years. By the time the invitations to participate were mailed later in 1996, some women at the upper limit of the age groups had had their birthday and were a year older. Hence some women recruited were 23, 50 and 75 years old and so the cohort age ranges in the study are: 18-23; 45-50 and 70-75 years (although you will note that there are relatively fewer women in the oldest year of each cohort). The cohorts are now officially referred to by their years of birth but some ALSWH material may refer to them as ‘Young’, ‘Mid-aged’ and ‘Older’ and data sets use ‘y’, ‘m’, and ‘o’ (further information below).
Sampling from the population was random within each age group, except that women from rural and remote areas were selected in twice the proportions of the Australian population living in these areas. Women from capital cities and other metropolitan areas made up the balance of the samples.
A small number of women who were sent an invitation to participate had ages outside the cohort age ranges (by a year or two), probably because of errors in the date of birth recorded in the Medicare database. The survey data for these women have been retained. We recommend that, when using the data, these women are either excluded or have their age set to the nearest valid age.
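The recommendation above can be sketched in Python as follows. The age ranges are those stated in the text; the function name, the cohort letters as dictionary keys and the use of None to mark an excluded record are illustrative assumptions, not part of the ALSWH datasets.

```python
# Cohort age ranges at Survey 1, as stated in the text above.
COHORT_AGE_RANGES = {"y": (18, 23), "m": (45, 50), "o": (70, 75)}


def clean_age(cohort, age, exclude=False):
    """Return a valid age for the cohort, or None if the record is excluded.

    Hypothetical helper: an out-of-range age is either dropped
    (exclude=True) or set to the nearest valid age (exclude=False).
    """
    low, high = COHORT_AGE_RANGES[cohort]
    if low <= age <= high:
        return age
    if exclude:
        return None
    # Clamp to the nearest boundary of the valid range.
    return min(max(age, low), high)
```

Whichever option is chosen, it is worth applying it consistently across all analyses of the same cohort.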
Calculation of the sample weights
The women were selected based on their postcode recorded by Medicare. The first three digits of their Study ID number reflect the selection (age group code, state code, area code). The variable in the datasets called ‘inarea’ reflects the area from which the women were sampled (urban, rural, remote). However, by the time the survey was mailed, some women, particularly in the younger age group, had moved. The variable ‘y1area’ reflects their actual area of residence when completing the survey.
The number of respondents who lived in urban, rural and remote areas at the time of completing the first survey (wave 1 area) was used to create the sample weights for each age group for each area (urban, rural, remote), by comparing these numbers of respondents to the most recent census figures (1991). The sample weights appear in the datasets and are labeled y1wtarea, m1wtarea, o1wtarea.
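As a rough illustration of how area weights of this kind are commonly derived, the sketch below assumes each weight is the population share of an area divided by the respondent share of that area. This formula is an assumption, the counts are invented for the example, and the exact ALSWH calculation may differ.

```python
def area_weights(population_counts, respondent_counts):
    """Weight per area = population share / respondent share (assumed formula)."""
    pop_total = sum(population_counts.values())
    resp_total = sum(respondent_counts.values())
    weights = {}
    for area, pop in population_counts.items():
        pop_share = pop / pop_total
        resp_share = respondent_counts[area] / resp_total
        weights[area] = pop_share / resp_share
    return weights


# Hypothetical counts for one age group, NOT real census or ALSWH figures.
population = {"urban": 7000, "rural": 2500, "remote": 500}
respondents = {"urban": 400, "rural": 500, "remote": 100}
weights = area_weights(population, respondents)
```

With these made-up counts, rural and remote respondents are over-represented by a factor of two, so they receive a weight below 1 and urban respondents a weight above 1.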
Representativeness and attrition
The International Journal of Epidemiology paper is the best reference for current retention rates and representativeness (Lee C, Dobson AJ, Brown WJ, Bryson L, Byles J, Warner-Smith P, Young AF. (2005) Cohort Profile: The Australian Longitudinal Study on Women’s Health. International Journal of Epidemiology; 34: 987-991.)
Annual updated information can also be found on the ALSWH website under Project / Sample.
Longitudinal analysis
When doing longitudinal analyses, remember to weight for area of residence at Survey 1 (y1wtarea, m1wtarea, o1wtarea) in all crosstabs, frequencies and analyses to adjust for the initial deliberate oversampling in rural and remote areas. Weighting is not required when running models that include area of residence.
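A minimal sketch of a weighted frequency table, written in plain Python for illustration. The weight variable y1wtarea is named in the text; the analysis variable "smoker", the toy records and the function name are hypothetical.

```python
from collections import defaultdict


def weighted_frequencies(records, variable, weight):
    """Return weighted percentages for each non-missing value of `variable`."""
    totals = defaultdict(float)
    for row in records:
        value = row[variable]
        if value is None:  # skip missing responses
            continue
        totals[value] += row[weight]
    grand_total = sum(totals.values())
    return {value: 100 * total / grand_total for value, total in totals.items()}


# Toy records for the 1973-78 cohort (values are invented).
records = [
    {"smoker": "yes", "y1wtarea": 1.75},
    {"smoker": "no", "y1wtarea": 0.5},
    {"smoker": "no", "y1wtarea": 0.5},
    {"smoker": None, "y1wtarea": 0.5},
]
result = weighted_frequencies(records, "smoker", "y1wtarea")
```

In a statistical package the same effect is normally obtained by naming the weight variable in the procedure's weight option rather than by hand.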
Missing data
Some participants completed a short survey instead of the full survey, accounting for some missing data. The type of survey completed is identified with variables such as y2survey for Survey 2 of the 1973-78 cohort.
Survey 2 of the 1946-51 cohort Q70 on income is missing the first category ($1-$119). There are large amounts of missing data in some income questions.
Surveys 2, 3 and 4 of the 1946-51 cohort are missing the question about being admitted to hospital.
Survey 2 of the 1973-78 cohort is missing the question about ability to manage on income.
Survey 2 of the 1946-51 cohort Q67 is unreliable as the instruction was incorrectly stated as “mark one only” rather than “mark all that apply”. Many participants realised that this was an error and answered the question as it should have been. Others may not have done so.
Notes about data files
The quantitative survey data are available as SAS data files, SPSS data files, or as tab delimited text files. Each file includes almost all survey items as well as all derived and calculated variables.
As well as the survey datasets, there are some supplementary datasets that have been created. Some of these are provided routinely with the datasets (such as the anthropometric variables) and some require a written request for the data (geo-coding variables, FFQ data). Information about dates of deaths and withdrawal of participants is available in the participant status file.
The qualitative data recorded on the back page are also available for analyses. For further information refer to the Qualitative processing protocols at www.alswh.org.au/accessingdata.html
For more information about using ALSWH data and applying to the Publications, Substudies and Analyses Committee for access to the data please refer to the website: www.alswh.org.au/accessingdata.html
Extra resources to support data analysis
Check the data map, the data dictionary and Data Dictionary Supplement for further information about survey items and derived variables. They are available by following this link.
The Data Dictionary is a Microsoft Access XP database that gives a detailed description of the questions used in the survey, their source and how they are used, as well as information on the derived and calculated variables. The Data Dictionary is constantly updated and is available at: http://www.alswh.org.au/InfoData/datadict.html . The table is over 1,000 pages long so don’t try to print it.
The Data Dictionary Supplement is a series of Microsoft Word documents that accompanies the Data Dictionary. The Data Dictionary Supplement contains information about scales and other measures used in the ALSWH surveys. Before using any summary or scale score included in an ALSWH dataset, the appropriate section of the Data Dictionary Supplement should be reviewed. The Data Dictionary and Data Dictionary Supplement are available on the website: www.alswh.org.au/infodata.html
Check the survey databooks if unsure about response frequencies. Electronic copies of the surveys and databooks are available at the following link.
Several reports are available via the web that may be useful. For example, Changes Report 1: “Transitions in Selected Variables, Surveys 1, 2 and 3” (December 2004) and Changes Report 2: “Examples from the Australian Longitudinal Study on Women’s Health for Analysing Longitudinal Data” (June 2005). See the reports page.
See the Data Dictionary Supplement for information on cleaning and coding of anthropometric variables (heights, weights, body mass index). These variables are provided in a separate dataset to the survey data. In 2008 the anthropometric data were included in all the survey data sets.
More information about quantitative survey data files
There are different naming conventions for survey items and derived items. The first variable in each file is the study ID. This is a unique participant identifier and is different to the ID which is kept only at the ALSWH by the Data Manager Cohorts and is a direct link to the identity of participants. All files are linked by the ID which must be used to merge data files. The survey questions and method used in the calculation of the derived variables are listed in the Data Dictionary. A few survey items at survey 1 (birth date, country of birth, language spoken at home) were removed or aggregated into groups as these were considered potentially able to make participants ‘identifiable’.
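Merging on the study ID can be sketched as below. The field name "id" and the toy rows are hypothetical (check the Data Dictionary for the actual name of the ID variable), and the helper performs a simple left join that keeps every row of the first file.

```python
def merge_on_id(left_rows, right_rows, id_field="id"):
    """Left-join two lists of dict rows on the study ID.

    `id_field` is a hypothetical column name used for illustration.
    """
    right_by_id = {row[id_field]: row for row in right_rows}
    merged = []
    for row in left_rows:
        combined = dict(row)
        # Add the matching row from the second file, if there is one.
        combined.update(right_by_id.get(row[id_field], {}))
        merged.append(combined)
    return merged


# Toy rows: participant 102 did not return Survey 2.
survey1 = [{"id": 101, "y1q1": 2}, {"id": 102, "y1q1": 1}]
survey2 = [{"id": 101, "y2q1": 3}]
merged = merge_on_id(survey1, survey2)
```

Participants without a matching row simply lack the second file's variables, which makes non-response at a later survey easy to spot.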
Missing data have been replaced with plausible values wherever possible. It is not recommended to arbitrarily replace missing values with the null value or any other value.
Some consistency checks between data items on subsequent surveys have been done but many have not. Where appropriate, data have been revised.
In general, skip codes have been applied where necessary (i.e. where some participants have been instructed to skip some items, a dummy code has been entered for these items). The skip value is obtained from the Data Dictionary or from the relevant text format file.
Questions involving “mark all that apply” responses have been coded to 0 (no response) or 1 (yes response). In general, a “none of the above” response option was offered at the end of each set of “mark all that apply” questions. If responses to all sections of a specific question were missing, including the null option (“none of the above”), all responses were set to missing.
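The coding rule for “mark all that apply” questions can be sketched as follows. The option names are invented and the function is an illustration of the rule described above, not the code used to prepare the datasets.

```python
def code_mark_all(responses):
    """Code one 'mark all that apply' question, including 'none of the above'.

    `responses` maps each option to True (marked) or False/None (not marked).
    If no option at all was marked, every option is set to missing (None);
    otherwise marked options become 1 and unmarked options become 0.
    """
    if not any(responses.values()):
        return {option: None for option in responses}
    return {option: 1 if marked else 0 for option, marked in responses.items()}


# Hypothetical option names.
answered = code_mark_all({"asthma": True, "diabetes": False, "none": False})
unanswered = code_mark_all({"asthma": False, "diabetes": False, "none": False})
```

The point of the rule is that a 0 means "answered the question but did not mark this option", which is different from not answering at all.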
It is up to the analyst to become familiar with and carefully examine all data before proceeding with data analysis.
Naming conventions for datasets
The datasets are named whanaaaB.txt, where:
n = survey number;
aaa = agegroup – yng, mid or old (yng = 1973-78 cohort; mid = 1946-51 cohort; old = 1921-26 cohort);
B = level B data, with identifying information removed.
Eg wha1yngB.txt is text data for survey 1 of the 1973-78 cohort.
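The dataset naming convention above can be decoded with a short helper. This is a sketch only: the function name is hypothetical, and the pattern covers just the level B text files described here.

```python
import re

# Age group codes and the cohorts they refer to, as listed above.
COHORTS = {"yng": "1973-78", "mid": "1946-51", "old": "1921-26"}
FILENAME_PATTERN = re.compile(r"^wha(\d+)(yng|mid|old)B\.txt$")


def parse_dataset_name(filename):
    """Return (survey number, cohort birth years) for a name like wha1yngB.txt."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not a level B text dataset name: {filename!r}")
    survey, agegroup = match.groups()
    return int(survey), COHORTS[agegroup]
```

For example, parse_dataset_name("wha1yngB.txt") identifies survey 1 of the 1973-78 cohort.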
Naming conventions for variables
anvarname where:
a = agegroup – y, m or o (y = 1973-78 cohort; m = 1946-51 cohort; o = 1921-26 cohort);
n = survey number;
survey variables
varname is the question number in the survey eg: q1, q34a.
derived and calculated variables
varname is a descriptive name eg pcode, cesd10.
Prefixes are as follows:
m1 = survey 1 of the (mid) 1946-51 cohort
o1 = survey 1 of the (older) 1921-26 cohort
y1 = survey 1 of the (younger) 1973-78 cohort
m2 = survey 2 of the (mid) 1946-51 cohort
o2 = survey 2 of the (older) 1921-26 cohort
y2 = survey 2 of the (younger) 1973-78 cohort
m3 = survey 3 of the (mid) 1946-51 cohort
o3 = survey 3 of the (older) 1921-26 cohort
y3 = survey 3 of the (younger) 1973-78 cohort
m4 = survey 4 of the (mid) 1946-51 cohort
o4 = survey 4 of the (older) 1921-26 cohort
y4 = survey 4 of the (younger) 1973-78 cohort
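Variable names can be split into their prefix and base name in the same way. The helper below is an illustrative sketch; it assumes every prefixed variable starts with a cohort letter followed by the survey number, and it rejects names that do not follow that pattern.

```python
import re

COHORT_LETTERS = {"y": "1973-78", "m": "1946-51", "o": "1921-26"}
VARIABLE_PATTERN = re.compile(r"^([ymo])(\d+)(.+)$")


def split_variable_name(name):
    """Return (cohort birth years, survey number, varname) for a prefixed name."""
    match = VARIABLE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"no cohort/survey prefix in {name!r}")
    letter, survey, varname = match.groups()
    return COHORT_LETTERS[letter], int(survey), varname
```

For example, m2q34a splits into the 1946-51 cohort, survey 2 and question q34a, while a name without a prefix such as inarea raises an error.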
Associated Documentation files
Label files allocate meanings to variables.
Eg: m1q1=‘How is your health now?’
Format files allocate meanings to the values of variables.
Eg: 1=very good, 2=good etc.
Notes about specific variables
Menopause - The menopause status variable is recalculated as each new dataset becomes available for the 1946-51 cohort. Make sure you get the most recent menopause status dataset.
Child data set - The fourth survey for the 1973-78 cohort included a set of questions relating to childbirth. These questions have been put on a Child data set.
Items that form part of a scale – Be careful that you do not inappropriately analyse single items from a scale. For example, the 36 items in the SF-36 should not be considered as separate items, other than the first self-rated health item. The Data Dictionary Supplement has details about which scales have been included in the surveys.
Measure of depressive symptomatology - the 10-item CES-D scale has an extra item at the end (“I felt terrific”) which is not included in the calculation of the CES-D score. The CES-D score is available in the datasets.
Counting symptoms - when looking at symptoms, the general rule is to count the number of women who had the symptom “sometimes” or “often”.
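The counting rule can be sketched as below. The labels “sometimes” and “often” come from the text; the other response labels and the use of text labels rather than numeric codes are assumptions made for the example, so check the format files for the actual coding.

```python
def count_with_symptom(responses):
    """Count women who reported the symptom 'sometimes' or 'often'."""
    return sum(1 for response in responses if response in ("sometimes", "often"))


# Hypothetical responses to one symptom item; None marks a missing answer.
headaches = ["never", "rarely", "sometimes", "often", "sometimes", None]
n_with_headaches = count_with_symptom(headaches)
```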
Measures of exercise - the exercise questions were changed after Survey 1. The new exercise measures from Survey 2 are not comparable to Survey 1 in longitudinal analysis. Refer to the Data Dictionary Supplement for more information.
Summary variables - there are a few “standard” ways to collapse some of the main categorical variables we collect. For example, education (highest qualification) can be dichotomised as “school only” and “post school”, or split into three categories: “no formal qualifications”, “school qualifications” and “trade/tertiary qualifications”, and so on. Several variables have been created to summarise sets of items in the surveys (eg the illicit drug use items) and it is important that data analysts become familiar with these new variables (see the Data Dictionary Supplement).
Area of residence - the recommended measure is ARIA+. This is an index of accessibility/remoteness based on the distance to the nearest service centre. The scores range from 0 to 15 and the ABS has defined 5 categories for remoteness: major cities of Australia, inner regional Australia, outer regional Australia, remote, and very remote. Only a few of the study’s women live in very remote areas, so the fourth and fifth categories are often grouped together. ARIA+ is recommended over the previously used RRMA area classification.
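Grouping the fourth and fifth remoteness categories can be sketched as follows. The category labels are written as in the text, but the label of the combined group and the function name are hypothetical, and the datasets may store these categories as numeric codes.

```python
# The five ABS remoteness categories named in the text, in order.
ARIA_CATEGORIES = [
    "major cities of Australia",
    "inner regional Australia",
    "outer regional Australia",
    "remote",
    "very remote",
]


def collapse_remote(category):
    """Merge 'remote' and 'very remote' into one group (hypothetical label)."""
    if category not in ARIA_CATEGORIES:
        raise ValueError(f"unknown remoteness category: {category!r}")
    if category in ("remote", "very remote"):
        return "remote/very remote"
    return category


collapsed = sorted({collapse_remote(c) for c in ARIA_CATEGORIES})
```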
Use of general practitioners - in Survey 2 and Survey 3 of the 1973-78 cohort there are two items about frequency of use of GPs (for “Pap tests, contraception, routine pregnancy tests” and for “all other reasons”). Responses to these two items have been combined into a single measure of GP use. Refer to the Data Dictionary Supplement for further details.
ATSI status - asked at Survey 1 in all age groups. This variable can be used in statistical models but results should not be reported separately by ATSI status in any papers (as we do not have a representative sample in the study). Guidelines for ethical conduct in Aboriginal and Torres Strait Islander health research are available at www.nhmrc.gov.au
Coding issues - for some variables, the category coded as 1 (reference category) is not the first of the ordered categories. For example, the reference category for alcohol risk is ‘low risk’. Similarly, the reference BMI category is “acceptable weight”. In the question about how much would you like to weigh now, the response option “Happy as I am” generally appears as the first option except in Young 1 where it is the third option.