EC309 Applied Econometrics Assignment 2, 2026 | Maynooth University (MU)

EC309 Assignment 2

This assignment is due on Thursday April 23 at 12.00 noon. Getting an assignment in on time is part of the exercise. Please submit in plenty of time and do not leave the upload until the last minute. You will get a 15-minute grace period to allow for upload; Moodle will shut down 15 minutes after the deadline. After that time, you will have to email your assignment to me. All assignments that I receive by email will automatically have 25% deducted (before grading) if received within 60 minutes of the 12.00 noon deadline. If received more than 60 minutes but no more than 120 minutes after the 12.00 noon deadline, an assignment will have 50% deducted (before grading). Any assignment received by email more than 120 minutes after the deadline will not be graded. Please remember that the official deadline is 12.00 noon: if you are emailing after Moodle has shut down, you are already at least 15 minutes late and will incur the penalty described above.

Question 1 uses R and Questions 2 and 3 use STATA.

1. Short Exercise in R.

For this question, use the same dataset that you used for Assignment 1; for this assignment it is called EC309_assign2_q1_data.dta. Save all your commands (see below) in a command file (R file) and save your regression results (see parts d and e) in a text file. Upload both of them to Moodle with the usual naming convention: assign2_q1_command_surname_first_name (for the command file) and assign2_q1_output_surname_first_name (for the output/results file).

In the data set, price is the daily price in US$, distance is the distance in miles of the hotel from the city centre, and stars is the star rating, ranging from 1 to 5. The variable accommodation_type is a string variable denoting the accommodation type, e.g. Hotel, Apartment, etc. All parts (a to e) carry equal marks.

a. Import the dataset from STATA format into R.

b. Restrict the sample to observations where accommodation_type is either Hotel or Apartment. Remember that this is a string variable.

c. Generate a dummy variable which is equal to 1 if accommodation type is an Apartment, zero otherwise. Make up a name for this yourself.

d. Run the regression

price = β₁ + β₂ distance + β₃ stars + β₄ [apartment_dummy]

where you insert your own variable name in place of what I called [apartment_dummy] (you will have called it something else).

e. Send your output to a text file and upload this together with your command file. Make sure to have the question name and your own name at the end of each of the two filenames, and include comments in your command file so it is clear what you are doing.

Question 2

Suppose we have records of breathalyser tests measuring blood alcohol levels for the period 2015 to 2025 in two states, State1 and State2. In an effort to crack down on drink driving, State1 introduced new legislation in 2019 that stiffened the penalties for drink driving and remained in force until the end of the period. State2 did not make any changes to its legislation. The data set EC309_assign2_q2_data.dta contains the number of drivers who failed the breathalyser test in each year from 2015 to 2025 for the two states. The variable fail is the number of failed tests in a given year for a given state, state is a string variable which is either state1 or state2, and year ranges from 2015 to 2025. Set up a do file with proper comments and create a log file to store all your results, and submit them as assign2_q2_surname_first_name.do and .log respectively (enter your own name!). You can include comments and any additional answers in your log file.

a. Using the command list, list all the variables. [5 points]

b. Using data for State1 only and OLS, show how you would estimate the effect of the legislation on the number of failed breathalyser tests. What is the estimate? [15 points]

c. Using mean values, find the Difference in Difference (DiD) estimate of the effect of the legislation. Comment on your results. Are they different from the results in b)? Explain why or why not. Can you provide any reason why they might be different? Which is the better estimator to use and why? It often helps to graph the data, so graph the data and use this graph when discussing your results. (I have included a command below which will graph the data by state and year.) [35 points]

d. Re-do the DiD exercise done earlier using regression analysis. You will need to create dummy variables here. You should observe that both approaches yield the same results in this case, where there are no other control variables. [20 points]

e. What assumptions are needed for this DiD estimator to yield a true causal effect? Be specific, using this particular example. In practice, what else might you do to validate your results? You don’t have to do it; just explain. You can provide the answer to this question in the log file. [25 points]
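For reference, the two versions of the DiD estimator asked for in parts c and d can be written as follows. This is only a sketch in assumed notation that does not appear in the assignment: s indexes the state, t indexes the year, treat_s is a dummy equal to 1 for State1, and post_t is a dummy equal to 1 from 2019 onward (treating 2019 itself as a post-legislation year is an assumption you should check against the data). The coefficients are numbered from β₁, as in the Question 1 regression.

```latex
% Part c: DiD from the four group means (1 = State1, 2 = State2)
\[
\hat{\delta}_{\text{DiD}}
  = \bigl(\overline{\text{fail}}_{1,\text{post}} - \overline{\text{fail}}_{1,\text{pre}}\bigr)
  - \bigl(\overline{\text{fail}}_{2,\text{post}} - \overline{\text{fail}}_{2,\text{pre}}\bigr)
\]

% Part d: the same estimate as the coefficient on the interaction term
\[
\text{fail}_{st} = \beta_1 + \beta_2\,\text{treat}_s + \beta_3\,\text{post}_t
  + \beta_4\,(\text{treat}_s \times \text{post}_t) + u_{st},
\qquad \hat{\delta}_{\text{DiD}} = \hat{\beta}_4
\]
```

With no other control variables, the OLS estimate of β₄ reproduces the difference of mean differences exactly, which is why parts c and d should agree.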

twoway (scatter fail year if state=="state1",connect(l) xline(2019) xlabel(2015(2)2025) msymbol(t)) (scatter fail year if state=="state2",connect(l) msymbol(d)),legend(label(1 "State1") label(2 "State2") )

Be careful: this command has to be on one line in your do file, or you can put /// at the end of a line to continue the command on the next line.
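As a sketch of the /// continuation, the graph command above could be split over several lines like this. The options are unchanged; only the line breaks and spacing differ. Note that the quotes must be typed as straight double quotes, since curly quotes pasted from a document will cause an error.

```stata
* Same graph command, broken over four lines with /// at each line end
twoway (scatter fail year if state=="state1", connect(l) msymbol(t)     ///
            xline(2019) xlabel(2015(2)2025))                            ///
       (scatter fail year if state=="state2", connect(l) msymbol(d)),   ///
       legend(label(1 "State1") label(2 "State2"))
```

The /// continuation works in a do file but not when typing interactively in the Command window.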

Question 3

Please read the Notes for Question 3 below before you attempt the question. This question will use a random subsample from the original EUSILC data. The EUSILC is a survey that the Central Statistics Office (CSO) has undertaken every year since 2004, and it focuses in particular on the income and living conditions of the survey participants. It’s part of an EU-wide programme which allows policymakers to make comparisons across member states. You will use a random subsample of full-time workers aged between 18 and 64 from the 2024 EUSILC survey. I have also included the codebook, which may be of help. The data set is called EC309_assign2_q3_data.dta on Moodle. Sometimes you will get large data sets and a codebook. You may not need all of the variables or all of the observations, and you may have to create new variables. This question provides some practice at this. It may help to go back over the Data Management material in the Tutorial 1 document.

Set up a do file with proper comments and create a log file to store all your results, and submit them as assign2_q3_surname_first_name.do and .log respectively (enter your own name!). You can include comments and any other answers in your log file.
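A minimal do-file skeleton for this setup might look like the sketch below. The file names are placeholders to be replaced with your own surname and first name, and the use command assumes the data file is in your current working directory.

```stata
* assign2_q3_surname_first_name.do  (placeholder name: substitute your own)
capture log close                        // close any log left open by an earlier run
log using "assign2_q3_surname_first_name.log", replace text   // plain-text log
use "EC309_assign2_q3_data.dta", clear   // assumes the file is in the working directory

* ... commands for parts a to c go here, each preceded by a comment ...

log close
```

The text option writes the log as plain text rather than Stata's default SMCL format, which makes it readable in any editor.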

a. Use the STATA command to describe your data set. [10 points]

b. Generate the variables: (see notes below for guidance on what I need for b))

i. Generate new variables/label variables/label values/rename variables as per the instructions below for the education, earnings, gender, region and age variables. See Notes for Question 3.
ii. Generate the variable trimearn (document in your log file how you decided to trim your earnings, and show how many people you have dropped).
iii. Provide some summary statistics for the lnearn, age, gender, education and region variables. You can use the sum or tabulate command, whichever is more appropriate for the type of variable. Discuss your results. [45 points]

c. Run a regression where the dependent variable is lnearn and the independent variables are controls for age, gender, education and region as defined below, but only for the trimmed earnings sample (i.e. if trimearn==1). Interpret your results. Is this what you would expect? [45 points]

Notes for Question 3

You will need to consult the EUSILC codebook which is also up on your Moodle Assignment 2 tab.

I have asked you to label your variable values, so make sure to look up how to do this using the label define and label values commands in EC309 STATA Tutorial 1, or you can just type help label and scroll down to find how to label the values.
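As a minimal sketch of the two-step pattern, here it is applied to Stata's built-in auto data set rather than the assignment data. The label name rep_lbl and the label texts are made up for illustration only.

```stata
* Illustration on Stata's shipped example data, not the EUSILC data
sysuse auto, clear

* Step 1: define a value label (a mapping from numbers to text)
label define rep_lbl 1 "Poor" 2 "Fair" 3 "Average" 4 "Good" 5 "Excellent"

* Step 2: attach that value label to a variable
label values rep78 rep_lbl

* The text labels now appear in place of the numeric codes
tabulate rep78
```

The same value label can be attached to more than one variable, which is why defining and attaching are separate commands.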

Education:

The education variable in the data set is a categorical variable called pe041, with values ranging from 0 to 80 (see the codebook for how these values are coded). First, create a new variable which groups these into five categories. So, create a new variable education which ranges from 1 to 5 (remember where to use = and ==).

education = 1 if less than upper secondary (0 ≤ pe041 ≤ 20)
education = 2 if highest education is upper secondary (30 ≤ pe041 ≤ 39)
education = 3 if highest education is some college but less than a bachelor’s degree (40 ≤ pe041 ≤ 59)
education = 4 if bachelor’s degree (60 ≤ pe041 ≤ 69)
education = 5 if postgraduate (70 ≤ pe041 ≤ 80)
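One possible way to code this grouping is sketched below on a small made-up data set standing in for pe041 (the seven values, including one missing, are invented for illustration). It also shows the difference between = (assignment) and == (testing equality).

```stata
* Toy data standing in for pe041 (made-up values, one of them missing)
clear
input pe041
0
20
34
45
60
80
.
end

gen education = .                                   // = assigns a value
replace education = 1 if inrange(pe041, 0, 20)      // inrange() tests a <= x <= b
replace education = 2 if inrange(pe041, 30, 39)
replace education = 3 if pe041 >= 40 & pe041 <= 59  // same kind of test written out
replace education = 4 if inrange(pe041, 60, 69)
replace education = 5 if inrange(pe041, 70, 80)

* == tests equality: count the observations placed in category 1
count if education == 1

* Cross-tab including missing values, to check the grouping
tabulate pe041 education, missing
```

Starting from gen education = . means any value of pe041 that is missing or falls outside the listed ranges stays missing in education instead of being placed in a category by accident.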

Now label the education values 1 to 5 using the label values command in STATA, choosing your own labels. This makes it easier to interpret the regression coefficients.

Note that there may be missing values for this variable, and these should not be assigned to any category. Check this:

tab education if pe041==.

Now perform a cross tab to make sure you did it correctly.

tab pe041 education

Earnings:

We are going to use the gross annual earnings variable py010g. When we run earnings regressions, we usually use log earnings, so generate the log using the following command.

gen lnearn=ln(py010g)

Region:

The variable region is a string variable that has unusual labels. You can check the codebook to see what these refer to.

If we want to use these as dummy variables with the i.varname prefix, we have to convert the string variable to a numeric variable, so we want to get rid of the "IE" in the variable values. One way to do this is to use the destring command. We create a new variable region2 using the command below.

destring region, generate(region2) ignore("IE")

Now label the values 4 to 6 using the EUSILC codebook so that the results are easier to read in the regression output. Again, you can use your own labels.

Now tabulate region region2 to see what you get.

Gender:

The gender variable is pb150.

i. Rename this variable so that the name makes more sense and is easier to read in regression output.

ii. Check the EUSILC codebook to see what the values refer to, and use the value label commands in STATA to label the values so that they are more meaningful in the regression output.

Age:

The age variable aggp3 is categorical. In your data set it only has 3 categories, as I have already omitted individuals who are < 18 or > 65. First, rename the variable so it is easy to know what it is. Also label the values so that they make more sense when you read them in regression output or the output of the tab command.

Checking for Outliers: Now that you have created the variables, we might want to look at earnings data for outliers (unusually high or low values).

sum py010g, detail

Do you see any outliers, i.e. very high or very low earnings? Decide whether you are going to exclude any outliers (i.e. set an upper and a lower limit), and explain your reasons for doing so. For the lower limit, you might base your decision on what someone on the 18+ minimum wage would have earned in 2023.

Generate a variable trimearn=1 if earnings fall between your lower and upper limits.
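A sketch of how such an indicator could be built is shown below on made-up earnings values. The two limits are arbitrary placeholders, not recommended cut-offs: you must choose and justify your own.

```stata
* Toy data standing in for py010g (made-up values, one of them missing)
clear
input py010g
500
25000
48000
900000
.
end

* Placeholder limits for illustration only: replace with your own justified bounds
local lower = 20000
local upper = 200000

* trimearn = 1 if lower <= py010g <= upper, and 0 otherwise (including missing earnings)
gen trimearn = inrange(py010g, `lower', `upper')

* How many observations fall outside the limits
count if trimearn == 0
tabulate trimearn
```

Keeping the limits in local macros means they are stated once at the top, so the log file documents exactly which bounds were used. Run the whole block together, since local macros do not persist between separately executed selections of a do file.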
