The assignment consists of several questions to be solved with SPSS. Only questions 3-7 need to be answered in the report. When writing the report, imagine that you are writing for someone with the same knowledge of statistics as a student taking Business Statistics 1. It is also important that the “Instruction for lab reports in statistics” (available on Canvas) is followed. The deadline is the 22nd of October. The lab reports should be uploaded to Canvas on the “Assignments” page.

A movie enthusiast has collected information on movies from four streaming platforms (Netflix, Hulu, PrimeVideo and Disney+) in a dataset called Movies.sav.
1. Open the data in SPSS.
To open the dataset in SPSS, simply double-click on the file or follow the instructions below.
1. Start SPSS from the start menu.
2. Then use “File” → “Open” → “Data” to open the Movies.sav file.
3. Save the file, for example on g:.
2. Labelling variables and values.
The names of the variables for the release year, the IMDB-score and whether the movie is available on Disney+ have not yet been specified; the variables are just called var001, var002 and var003. To specify proper names for these variables, click on “Variable View” in the lower left corner. To rename the variable “var001”, click in the “var001” cell and write e.g. “Year”. Now rename “var002” and “var003” to e.g. “IMDB” and “Disney”, respectively.

The variable named “Age” is the age recommendation of the movie. It includes 5 different age recommendations, which can be seen in the table below.

Numeric code    Age recommendation
1               all
2               7+
3               13+
4               16+
5               18+

At this moment only the numeric codes are available in the data. To make the data easier to understand, we can use so-called value labels to indicate that the numeric value 1 means that the movie is suitable for all ages, that the numeric value 2 means that the movie has an age recommendation of 7+, etc. To do this, click the “Values” cell for the variable “Age”. In the box that appears on the screen, write 1 in the “Value” window and “all” in the “Value Label” window, then click “Add”. Similarly, give the label “7+” to the value 2, “13+” to the value 3, “16+” to the value 4, and “18+” to the value 5.
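For readers who find it easier to see the idea outside SPSS, the value labels above amount to a lookup from numeric code to text. The following is only a minimal Python sketch of that idea; the names AGE_LABELS and label_age and the example codes are made up for illustration and are not part of SPSS or of Movies.sav.

```python
# Numeric codes for "Age" and their value labels, as listed in the table above.
AGE_LABELS = {1: "all", 2: "7+", 3: "13+", 4: "16+", 5: "18+"}


def label_age(code):
    """Return the value label for a numeric Age code, or None for an unknown code."""
    return AGE_LABELS.get(code)


# Made-up example codes (an assumption for illustration, not taken from Movies.sav).
codes = [1, 3, 5, 2, 4, 3]
labels = [label_age(c) for c in codes]
print(labels)
```

The underlying data still hold the numeric codes; the labels only change how the values are displayed.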
3. Graphical analysis.
The first step in a data analysis is usually to perform a graphical analysis of the data. The graphical functions are available in the “Graphs” menu under “Legacy Dialogs”. First, we will investigate the proportion of movies available on Netflix in the dataset by using a pie chart. Select “Pie” and then, in the box that pops up, “Summaries for groups of cases”. Finally, define slices by “Netflix”. The graph is a visual presentation of the “Netflix” variable. After the graph is generated, double-click on the graph to enter the Chart Editor. Go to “Elements” → “Show Data Labels” and display percentages.

Next, we wish to check the distribution of “Age” for the movies available on Netflix. Bar charts are useful for this purpose. From the “Graphs” menu, choose “Legacy Dialogs” and then “Bar”. In the box that appears on the screen, choose “Clustered” and “Summaries for groups of cases”. Use “Age” on the category axis and define clusters by “Netflix”. Copy the graphs into a Word document by selecting “Copy” from the “Edit” menu in the output window, and paste them into the Word document with the “Paste” function.

Questions:
a. What is the proportion of movies available on Netflix in the dataset?
b. What is the mode for the “Netflix” variable?
c. Is the mode a good measure of central tendency for the variable “Netflix”?
d. What is the mode of the “Age” variable? Is the mode the same for movies available and movies not available on Netflix in the dataset?
When answering questions a and d, you should refer to the diagrams.
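The pie chart displays the share of each category, and the mode is the category with the highest count. The following is a minimal Python sketch of those two quantities, not the SPSS procedure itself. The data values and the helper names proportions and modes are assumptions made for illustration; they are not taken from Movies.sav.

```python
from collections import Counter

# Made-up 0/1 indicators (1 = available on Netflix); not taken from Movies.sav.
netflix = [1, 0, 1, 1, 0, 1, 0, 1]


def proportions(values):
    """Share of observations in each category (what the pie chart displays)."""
    counts = Counter(values)
    n = len(values)
    return {value: count / n for value, count in counts.items()}


def modes(values):
    """All categories that share the highest count, sorted."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)


netflix_share = proportions(netflix)
netflix_mode = modes(netflix)
print(netflix_share, netflix_mode)
```

Returning a list of modes rather than a single value makes ties visible, which matters when two categories are equally common.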
4. Measures of location and spread.
The graphs do not provide all relevant measures. We would like to have several different measures of location and spread, for comparison. Use “Analyze” → “Descriptive Statistics” → “Frequencies”. Choose the following variables: Age, Disney and Runtime. Use the “Statistics” button to calculate the mean, median, mode, variance, standard deviation, minimum, maximum and range for the three variables. As we can see from the output, frequency tables are a good way to display the distribution of variables with few outcomes (often used for categorical variables), but when a variable has many outcomes, like “Runtime”, the frequency table is too long and messy to be useful. Instead, the distributions of such variables are often displayed in histograms. To remove the frequency table for “Runtime”, simply mark the table (click on it in the output window) and press delete. To create a histogram for the variable “Runtime”, go to “Graphs” → “Legacy Dialogs” → “Histogram” and select “Runtime”.

Questions:
Include the descriptive table, the frequency tables for “Age” and “Disney”, and the histogram for the variable “Runtime” in your report and answer the following questions.
a. On what scale of measurement (i.e. ratio, interval, ordinal or nominal scale) are the variables measured? Discuss the three variables “Age”, “Disney” and “Runtime”.
b. Are all three measures of central tendency (mode, median and mean) relevant for all three variables? Discuss each variable (“Age”, “Disney” and “Runtime”) and each measure of central tendency, for instance: Age: the mean can/cannot be used here because…, the median can/cannot be used because…
c. Are the measures of variability (variance, standard deviation and range) relevant for all three variables? Discuss each variable (“Age”, “Disney” and “Runtime”) and each measure of variability.
d. Are there missing data? If so, how many, and for which variables?
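To make the definitions behind the descriptive table concrete, here is a minimal Python sketch that computes the same kinds of measures for one variable. The runtime values are made up (None marks a missing value) and are not taken from Movies.sav. The variance and standard deviation here use the sample formulas with divisor n − 1.

```python
import statistics

# Made-up runtimes in minutes; None marks a missing value. Not from Movies.sav.
runtime = [95, 102, 88, 120, 95, 110, None, 131]

# Missing values are excluded before any measure is computed.
observed = [x for x in runtime if x is not None]

summary = {
    "n_valid": len(observed),
    "n_missing": len(runtime) - len(observed),
    "mean": statistics.mean(observed),
    "median": statistics.median(observed),
    "mode": statistics.mode(observed),
    "variance": statistics.variance(observed),  # sample variance, divisor n - 1
    "std_dev": statistics.stdev(observed),      # square root of the sample variance
    "min": min(observed),
    "max": max(observed),
    "range": max(observed) - min(observed),
}

for name, value in summary.items():
    print(name, value)
```

Whether each of these numbers is meaningful depends on the scale of measurement of the variable, which is what questions b and c ask you to discuss.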
5. Confidence interval and hypothesis test.
The movie enthusiast likes older movies and believes that movies made before the year 2000 are better than movies produced in the year 2000 or later. Conduct a t-test to see whether the IMDB-score supports the enthusiast’s belief. To do this, we need to create a new variable indicating whether the movie was released before 2000 or not. Go to “Transform” → “Compute variable”. In the cell “Target Variable”, specify the name of the variable to create, for example “Pre2000”; in the cell “Numeric Expression”, specify what the variable should be equal to. In the first step, set the variable equal to 1 and press OK. Now we have created a variable that is equal to 1 for all observations. However, we want the variable to be equal to 1 if the movie was released before 2000 and 0 otherwise. To make this change, go to “Transform” → “Compute variable” again. Let “Target Variable” be the name of the variable you created (Pre2000 if you followed the suggestion). In the “Numeric Expression” cell, write 0 this time and press the “If…” button. Click the “Include if cases…” button and write “Year>1999”. This will ensure that we change the variable to 0 only for the movies released in the year 2000 or later. Press Continue, OK and OK.

To conduct an independent samples t-test, use “Analyze” → “Compare means” → “Independent samples T test”. The test variable should be “IMDB” and the group variable should be the variable you created above (Pre2000). Click “Define Groups” and let the movies released before 2000 be group 1 and the movies released in 2000 or later be group 2 by writing the value 1 for group 1 and the value 0 for group 2 (which group you define as 1 and 2 does not really matter).

Questions:
a. Conduct an independent samples t-test using a 5% significance level to decide if the IMDB-score is higher for the movies released before 2000. You should clearly state your hypotheses and use both the p-value approach and the critical value approach. Also, your conclusion should be clearly written.

The movie enthusiast is thinking of subscribing to Netflix but is only willing to do so if more than 30% of all available movies are good. The enthusiast regards a movie as good if it has an IMDB-score of at least 7. To begin the analysis, we need to create a variable indicating whether a movie has an IMDB-score of at least 7. Go to “Transform” → “Compute Variable” and press the “Reset” button to clear everything from the previous computation. Now, call the variable “GoodMovie” and let it be equal to 0 for all observations in the first step. In the second step, go again to “Transform” → “Compute Variable” and let “Target Variable” be “GoodMovie”. In the “Numeric Expression” cell, write 1 this time and press the “If…” button. Click the “Include if cases…” button and write “IMDB>=7”. Press Continue, OK and OK. The variable “GoodMovie” is now equal to 1 if the movie has an IMDB-score of at least 7 and 0 otherwise. To find the proportion of good movies on Netflix in the sample, go to “Analyze” → “Descriptive Statistics” → “Crosstabs”. Define the rows to be “Netflix” and the columns to be “GoodMovie”.
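The two-step recoding used for “Pre2000” and “GoodMovie” (first set a default for every case, then overwrite it where a condition holds) can be sketched in Python as below. This is only an illustration of the logic, not SPSS syntax. The five records and the helper row_share_good are made up and are not taken from Movies.sav.

```python
# Made-up records for illustration; not taken from Movies.sav.
movies = [
    {"Year": 1994, "IMDB": 8.1, "Netflix": 1},
    {"Year": 2005, "IMDB": 6.4, "Netflix": 1},
    {"Year": 1987, "IMDB": 7.0, "Netflix": 0},
    {"Year": 2016, "IMDB": 5.2, "Netflix": 1},
    {"Year": 2000, "IMDB": 7.6, "Netflix": 0},
]

for movie in movies:
    # Step 1: default value for every case.
    movie["Pre2000"] = 1
    # Step 2: overwrite only where the condition "Year>1999" holds.
    if movie["Year"] > 1999:
        movie["Pre2000"] = 0

    # Same two-step pattern with default 0 and condition "IMDB>=7".
    movie["GoodMovie"] = 0
    if movie["IMDB"] >= 7:
        movie["GoodMovie"] = 1


def row_share_good(records, netflix_value):
    """Share of good movies within one row (one Netflix value) of the crosstab."""
    row = [m for m in records if m["Netflix"] == netflix_value]
    return sum(m["GoodMovie"] for m in row) / len(row)


print([m["Pre2000"] for m in movies])
print([m["GoodMovie"] for m in movies])
print(row_share_good(movies, netflix_value=1))
```

Note that a movie released exactly in 2000 gets Pre2000 = 0, and a movie with an IMDB-score of exactly 7 counts as good, matching the conditions “Year>1999” and “IMDB>=7”.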
From the output you can calculate the proportion of good movies on Netflix. If you want SPSS to do it for you, go to “Analyze” → “Descriptive Statistics” → “Crosstabs”, click the “Cells” button and tick the box for “Row” in the percentages part of the window. Tables of this type are called contingency tables.

Questions:
b. Based on the output in the contingency table, calculate by hand a 95% confidence interval for the share of good movies on Netflix. Show your calculations and interpret the confidence interval.
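As a way of checking a hand calculation, the usual large-sample interval for a proportion, p̂ ± z·√(p̂(1 − p̂)/n), can be sketched in Python as follows. The function name proportion_ci and the counts (120 good movies out of 400) are assumptions made for illustration; they are not results from Movies.sav. The sketch assumes the normal approximation is appropriate.

```python
import math
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """Large-sample (normal approximation) confidence interval for a proportion."""
    p_hat = successes / n
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    std_error = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * std_error, p_hat + z * std_error


# Hypothetical counts: 120 good movies among 400 Netflix movies.
lower, upper = proportion_ci(successes=120, n=400)
print(round(lower, 4), round(upper, 4))
```

In the report the calculation should still be shown by hand, using the counts from your own contingency table.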
c. By hand, conduct a hypothesis test at the 5% significance level (i.e. α = 0.05) to test the null hypothesis that the share of good movies on Netflix is at most 30% vs. the alternative hypothesis that the share of good movies on Netflix is greater than 30%. State the hypotheses, significance level, etc. and show your calculations. Use both the p-value approach and the critical value approach. State your conclusions based on the hypothesis test.
d. Based on the result, would you advise the enthusiast to subscribe to Netflix?
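The one-sided test in question c can be sketched as a large-sample z-test, with the standard error computed under the null value p0 = 0.30. The Python code below is a sketch under that assumption, not a prescribed solution. The function name and the counts (140 good movies out of 400) are made up for illustration.

```python
import math
from statistics import NormalDist


def proportion_z_test_greater(successes, n, p0, alpha=0.05):
    """One-sided z-test of H0: p <= p0 against H1: p > p0 (normal approximation)."""
    p_hat = successes / n
    # The standard error uses p0, because the test statistic is computed under H0.
    std_error = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / std_error
    p_value = 1 - NormalDist().cdf(z)          # upper-tail probability
    z_crit = NormalDist().inv_cdf(1 - alpha)   # upper-tail critical value
    return {"z": z, "p_value": p_value, "z_crit": z_crit, "reject_h0": z > z_crit}


# Hypothetical counts: 140 good movies among 400 Netflix movies.
result = proportion_z_test_greater(successes=140, n=400, p0=0.30)
print(result)
```

The two approaches always agree: z exceeds the critical value exactly when the p-value is below α.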
6. Repeated hypothesis test by splitting the data.
A friend tells the enthusiast that the average IMDB-score of a movie is 6.25. The enthusiast wants to test, using a one-sample t-test, whether the average IMDB-score of each streaming platform is equal to 6.25 or not. The nominal variable “Platform” identifies which streaming platform each movie is available on. This task can be done by splitting the data based on this variable.

To split the data, choose “Data” → “Split File” → select “Compare groups” → select “Platform” → click “OK”. Note: you will only get a line of code in the output window when you do this. Now, when you do the t-test (see below), SPSS will do a t-test for each streaming platform automatically. In fact, anything you now do will be repeated for each streaming platform, which saves time if you want to repeat the same thing for multiple groups.

To begin the one-sample t-test, from the menus choose “Analyze” → “Compare Means” → “One-Sample t-test”. Select “IMDB” as the test variable. → Type 6.25 as the test value. → Click “Options”. → Type 95 as the confidence interval percentage. → Click “Continue”. → Click “OK” in the One-Sample t-test dialog box.

The SPSS output reports (for each streaming platform) the t-statistic, the p-value (named “Sig. (2-tailed)” in SPSS), the mean difference x̄ − μ0, the square root of the sampling variance √(s²/n) (named “Std. Error Mean” in SPSS) and the upper and lower bounds of the confidence interval. Use the output to solve the following problems for the streaming platforms individually:

Questions:
a. Calculate the t-statistic by hand for one streaming platform and check that SPSS provided the correct value of the t-statistic. Use the values available in the descriptive table provided in SPSS.
b. Test H0: μ = 6.25 against H1: μ ≠ 6.25 at the 5% level (i.e. α = 0.05) by comparing the t-statistic to the critical value for all platforms.
The critical values are found in the distribution table uploaded on Canvas; it is not a part of the SPSS output.
c. Test H0: μ = 6.25 against H1: μ ≠ 6.25 at the 10% level (i.e. α = 0.1) by comparing the p-value (given in the table by SPSS) to α. No calculations required.
d. Comment briefly on the result for each streaming platform. What can you say?
When answering the questions, you should refer to the tables.
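The hand calculation in question a uses t = (x̄ − μ0) / (s / √n), and the two decision rules in questions b and c are simple comparisons. The following is a minimal Python sketch of these steps. The summary values (mean 6.40, standard deviation 1.20, n = 144), the critical value 1.98 and the p-value 0.136 are all made up for illustration. In the assignment they come from the SPSS output and the distribution table on Canvas.

```python
import math


def one_sample_t(mean, std_dev, n, mu0):
    """t-statistic: (sample mean - hypothesized mean) / standard error of the mean."""
    std_error = std_dev / math.sqrt(n)
    return (mean - mu0) / std_error


def reject_by_critical_value(t_stat, t_crit):
    """Two-sided test: reject H0 when |t| exceeds the table's critical value."""
    return abs(t_stat) > t_crit


def reject_by_p_value(p_value, alpha):
    """Reject H0 when the reported p-value is smaller than alpha."""
    return p_value < alpha


# Hypothetical summary values for one platform (not from the SPSS output).
t_stat = one_sample_t(mean=6.40, std_dev=1.20, n=144, mu0=6.25)

# Illustrative critical value and p-value; look up the real ones yourself.
print(t_stat, reject_by_critical_value(t_stat, t_crit=1.98))
print(reject_by_p_value(p_value=0.136, alpha=0.10))
```

Because the file is split by “Platform”, the same calculation is repeated once per platform with that platform's own mean, standard deviation and n.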
7. ANOVA.
Lastly, out of curiosity, the movie enthusiast wants to know whether the average runtime of the movies differs depending on the age recommendation. This can be tested using an ANOVA test (F-test). Recall that the variable “Age” identifies the age recommendation of the movie. First, we want to produce a descriptive table of the runtime of the movies for each age recommendation. To do this, we split the data based on the variable “Age”. Go to “Data” → “Split File” → select “Compare groups” → select “Age” → click “OK”. Then go to “Analyze” → “Descriptive Statistics” → “Descriptives”, select the variable “Runtime”, press the “Options” button and tick the boxes for mean, std. deviation, minimum and maximum. Press Continue and OK.

Before we perform the F-test, we must ensure that the file is no longer split, by clicking “Data” → “Split File” and then “Reset” and “OK”.

To perform the test, go to “Analyze” → “Compare Means” → “One-Way ANOVA”. Now specify the variable “Runtime” in “Dependent list”, choose the variable “Age” for “Factor”, and then click the “OK” button. An ANOVA table will now appear in the output window. Include both the descriptive table and the ANOVA table in your report.

Questions:
a. Based on the sum of squares between groups and the sum of squares within groups found in the output, show how to calculate the mean square treatment (between groups) and the mean square error (within groups), and finally how to calculate the F-statistic.
b. Use the output to test H0: μ1 = μ2 = μ3 = μ4 = μ5 against H1: not all μi (i = 1, 2, 3, 4, 5) are equal at the 1% level (i.e. α = 0.01) by comparing the F-statistic to the critical value. You can use df = 1000 for the denominator. State your conclusion of the test.
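The arithmetic behind an ANOVA table follows the usual one-way relations: mean square between = SS between / (k − 1), mean square within = SS within / (n − k), and F = mean square between / mean square within, where k is the number of groups and n the total number of observations. The following is a minimal Python sketch of these relations. The function name anova_f and the sums of squares and sample size are made up for illustration and are not the values from the SPSS output.

```python
def anova_f(ss_between, ss_within, k, n):
    """Mean squares and F-statistic for a one-way ANOVA with k groups and n observations."""
    df_between = k - 1
    df_within = n - k
    ms_between = ss_between / df_between   # mean square treatment
    ms_within = ss_within / df_within      # mean square error
    return {
        "df_between": df_between,
        "df_within": df_within,
        "ms_between": ms_between,
        "ms_within": ms_within,
        "F": ms_between / ms_within,
    }


# Hypothetical sums of squares for 5 age groups and 1000 movies.
table = anova_f(ss_between=2000.0, ss_within=99500.0, k=5, n=1000)
print(table)
```

The resulting F is then compared with the critical value from the F table for the chosen α and the two degrees of freedom.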