Running My First Program

  • Course: Data Management and Visualization by Wesleyan University
  • Assignment: Run your first program (Week 2)
  • Working Title: Trust, Optimism and Civic Participation in a Cross-Sectional Study

1) Program code


# -*- coding: utf-8 -*-
"""
Created for: Data Management and Visualization
https://www.coursera.org/learn/data-visualization/
Week 2 Assignment: Running Your First Program
Author: Mariano Junge
"""

import pandas as pd

# read dataset into dataframe object
data = pd.read_csv('ool_pds.csv', low_memory=False)

# number of observations (rows)
print("total number of observations in the dataset:")
print(len(data))
# number of variables (columns)
print("total number of variables in the dataset:")
print(len(data.columns))

# Prepare the variables for the frequency distributions below:
# convert the columns from object (string) to numeric type.
# convert_objects() is deprecated; since pandas 0.17 the preferred
# way to convert strings to numeric values is pd.to_numeric():
data["W2_QN1"] = pd.to_numeric(data["W2_QN1"], errors='coerce')
data["W2_QE1"] = pd.to_numeric(data["W2_QE1"], errors='coerce')
data["W2_QB3"] = pd.to_numeric(data["W2_QB3"], errors='coerce')

# Only respondents for whom the variable W2_QFLAG has the value 1
# participated in wave 2 of the survey, which is the part relevant for me.
# W2_QFLAG: DATA ONLY: Qualification Flag
print("W2_QFLAG: Participation in wave 2 of the survey")
print("NaN= did not participate, 1=participated in wave 2 of the survey")
print("*Frequency*")
W2_QFLAG = data["W2_QFLAG"].value_counts(sort=False, dropna=False)
print(W2_QFLAG)
print("*Percent*")
W2_QFLAG = data["W2_QFLAG"].value_counts(sort=False, normalize=True, dropna=False)
print(W2_QFLAG)

# Variable No. 1
## W2_QN1: Generally speaking, would you say that most people can be trusted,
## or that you can't be too careful in dealing with people?
print("W2_QN1: Trust of participants in survey wave 2")
print("1=Most people can be trusted, 2=Can't be too careful, -1=REFUSED")
print("*Frequency*")
W2_QN1 = data["W2_QN1"].value_counts(sort=False, dropna=False)
print(W2_QN1)
print("*Percent*")
W2_QN1 = data["W2_QN1"].value_counts(sort=False, normalize=True, dropna=False)
print(W2_QN1)

# Variable No. 2
## W2_QE1: When you think about your future, are you generally optimistic,
## pessimistic, or neither optimistic nor pessimistic?
print("W2_QE1: Optimism of participants in survey wave 2")
print("1=Optimistic, 2=Pessimistic, 3=Neither, -1=REFUSED")
print("*Frequency*")
W2_QE1 = data["W2_QE1"].value_counts(sort=False, dropna=False)
print(W2_QE1)
print("*Percent*")
W2_QE1 = data["W2_QE1"].value_counts(sort=False, normalize=True, dropna=False)
print(W2_QE1)

# Variable No. 3
## W2_QB3: How often would you say you vote?
print("W2_QB3: Voting Behavior of participants in survey wave 2")
print("1=Always, 2=Nearly Always, 3=Part Of The Time, 4=Seldom, -1=REFUSED")
print("*Frequency*")
W2_QB3 = data["W2_QB3"].value_counts(sort=False, dropna=False)
print(W2_QB3)
print("*Percent*")
W2_QB3 = data["W2_QB3"].value_counts(sort=False, normalize=True, dropna=False)
print(W2_QB3)

2) Program output as frequency tables

W2_QFLAG: Participation in wave 2 of the survey

Value  Frequency  Ratio
NaN          693  0.302092
1           1601  0.697908

Legend: 1=participated in wave 2 of the survey

W2_QN1: Trust of participants in survey wave 2

Value  Frequency  Ratio
NaN          693  0.302092
1            598  0.260680
2            947  0.412816
-1            56  0.024412

Legend: 1=Most people can be trusted, 2=Can’t be too careful, -1=REFUSED

W2_QE1: Optimism of participants in survey wave 2

Value  Frequency  Ratio
NaN          693  0.302092
1            880  0.383609
2            230  0.100262
3            460  0.200523
-1            31  0.013514

Legend: 1=Optimistic, 2=Pessimistic, 3=Neither, -1=REFUSED

W2_QB3: Voting Behavior of participants in survey wave 2

Value  Frequency  Ratio
NaN          693  0.302092
1            851  0.370968
2            418  0.182214
3             89  0.038797
4            200  0.087184
-1            43  0.018745

Legend: 1=Always, 2=Nearly Always, 3=Part Of The Time, 4=Seldom, -1=REFUSED
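
The combined Frequency/Ratio tables above can also be produced in a single step instead of printing two separate series per variable. The following is only a minimal sketch: the helper name frequency_table and the toy series are made up for illustration and are not part of the original program.

```python
import pandas as pd


def frequency_table(series):
    """Return one table with counts and ratios, keeping missing values."""
    # Count every distinct value, including NaN
    counts = series.value_counts(sort=False, dropna=False)
    table = counts.to_frame("Frequency")
    # Ratio = count divided by the total number of observations
    table["Ratio"] = table["Frequency"] / table["Frequency"].sum()
    return table


# Toy stand-in for one survey column (made-up values, not the real data)
toy = pd.Series([1, 2, 2, -1, None, 2, 1, None], name="W2_QN1")
table = frequency_table(toy)
print(table)
```

With the real data the call would be frequency_table(data["W2_QN1"]), and likewise for the other variables.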

3) Summary

Wave 2 of the “Outlook On Life” survey covers only about 70% of all participants (1,601 of 2,294). For this reason, all variables used in this analysis have more than 30% missing values (which are coded as empty spaces in the original data file, so Python does not report them as NaN until the values are converted to numeric type). Fortunately, the number of participants who refused to answer a question is low (average: 43.33 or 1.89%, standard deviation: 12.5 or 0.55%). So far I have failed to create a working subset of the data that includes only a) participants of wave 2 of the survey who b) did not refuse to answer any of the questions. I will keep working on this.
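
One possible way to build such a subset is with boolean masks. This is only a sketch on a made-up toy DataFrame: it assumes the columns have already been converted to numeric type as in the program above, and the names questions, in_wave2, answered_all and sub are illustrative.

```python
import pandas as pd

# Toy stand-in for the survey file (made-up values, not the real data)
data = pd.DataFrame({
    "W2_QFLAG": [1, 1, None, 1, 1, None],
    "W2_QN1":   [1, 2, None, -1, 2, None],
    "W2_QE1":   [1, 3, None, 2, -1, None],
    "W2_QB3":   [1, 4, None, 2, 3, None],
})

questions = ["W2_QN1", "W2_QE1", "W2_QB3"]

# a) rows that belong to wave 2 of the survey
in_wave2 = data["W2_QFLAG"] == 1

# b) rows where every question has an answer that is not REFUSED (-1)
answered_all = (data[questions].notna() & (data[questions] != -1)).all(axis=1)

# Combine both conditions; copy() makes the subset independent of `data`
sub = data[in_wave2 & answered_all].copy()
print(len(sub))
print(sub)
```

In the toy data only the first two rows satisfy both conditions. The copy() call is there so that later changes to the subset do not trigger pandas warnings about modifying a view of the original DataFrame.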

When asked “Generally speaking, would you say that most people can be trusted, or that you can’t be too careful in dealing with people?”, only 598 of all participants (26.1%) agreed with the statement “Most people can be trusted”, while 947 (41.3%) answered “Can’t be too careful”.

When asked to think about their future, 880 (38.4%) said they were “generally optimistic”, while 230 (10.0%) had a pessimistic outlook on life, and 460 (20.1%) were neither.

When asked how often they vote, 851 (37.1%) claimed to vote “always”, 418 (18.2%) “nearly always”, 89 (3.9%) “part of the time”, and 200 (8.7%) “seldom”.

These numbers look pretty high to me. Taking into account only those participants who were questioned in the second wave of the survey (i.e. ignoring the missing values), and assuming that “always” can be quantified as 100%, “nearly always” as 66%, and “part of the time” as 33%, this would correspond to a voter turnout of roughly 72%. Compare this to an actual turnout of about 45% of the total voting age population for parliamentary elections, and about 55% for presidential elections.
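
The 72% estimate can be retraced as a weighted average over the W2_QB3 frequencies from section 2. This is a sketch under one extra assumption that the text does not state: “seldom” and REFUSED are counted as 0% here, because that is the weighting that reproduces the figure with all 1,601 wave 2 participants as the base.

```python
# Frequencies for W2_QB3, taken from the table in section 2
counts = {
    "always": 851,
    "nearly always": 418,
    "part of the time": 89,
    "seldom": 200,
    "refused": 43,
}

# Assumed share of elections each group votes in.
# "seldom" and "refused" are set to 0 (an assumption, see above).
weights = {
    "always": 1.00,
    "nearly always": 0.66,
    "part of the time": 0.33,
    "seldom": 0.0,
    "refused": 0.0,
}

wave2_total = sum(counts.values())  # all wave 2 participants
expected_votes = sum(counts[k] * weights[k] for k in counts)
turnout = expected_votes / wave2_total

print(round(turnout * 100, 1))  # roughly 72
```

Giving “seldom” a small positive weight, or dropping the refusals from the base, would push the estimate slightly higher, so the comparison with actual turnout does not depend on these choices.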

In my opinion it is bad practice to use vague semantic quantifiers like “nearly” and “part of”, and it is definitely bad practice not to have included the answer “never” as an option in the survey. Social pressure might be skewing the numbers here. Or the sample might actually be more politically engaged than average.

Based on this data I have done a quick and dirty line chart (with Calc and GIMP):

[Figure: Trust, Optimism and Voting Behavior. Line chart depicting values for the three variables.]

One problem that I will be facing is the different scales used to quantify the variables W2_QN1, W2_QE1 and W2_QB3. Furthermore, I am still struggling with the definition of subsets, and I am looking for help with handling different data types in Python (and converting from one data type to another).
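
As a starting point for the data type question, here is a small sketch of a typical conversion chain in pandas: string to numeric, the REFUSED code to missing, and numeric codes to labelled categories. The toy column and the names raw, numeric, cleaned, labels and optimism are made up for illustration; the labels follow the W2_QE1 legend.

```python
import numpy as np
import pandas as pd

# Toy column as it might arrive from the CSV: strings, blank = missing
# (made-up values, not the real data)
raw = pd.Series(["1", "2", " ", "-1", "3", "1"], name="W2_QE1")

# string -> float; values that cannot be parsed (the blank) become NaN
numeric = pd.to_numeric(raw, errors="coerce")

# treat the REFUSED code (-1) as missing as well
cleaned = numeric.replace(-1, np.nan)

# numeric codes -> readable labels, stored as a categorical column
labels = {1: "Optimistic", 2: "Pessimistic", 3: "Neither"}
optimism = cleaned.map(labels).astype("category")

print(numeric.dtype)
print(optimism.value_counts(dropna=False))
```

Storing the answers as categories also sidesteps part of the scale problem: W2_QN1 and W2_QE1 are unordered answer codes, so comparing their numeric values with the ordered W2_QB3 scale is not meaningful anyway.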
