Assessment 1 Information
| Subject Code: | DATA4200 |
| Subject Name: | Data Acquisition and Management |
| Assessment Title: | Sampling and data mining project |
| Assessment Type: | Report |
| Word Count: | 1600 words (+/-10%) |
| Weighting: | 30% |
| Total Marks: | 30 |
| Submission: | via MyKBS and Turnitin |
| Due Date: | By Tuesday Week 5 (Report) 23:55 AEST |
Your Task
Read the Assessment Instructions and complete sections (a)–(e).
Consider the rubric at the end of the assignment for guidance on structure and content.
- LO3: Create analysis-ready data sets by applying and exploring basic validation, preprocessing, filtering and cleaning techniques
- LO4: Evaluate and apply data mining software
Submit your written report (in Word) and your software file (e.g. Excel, Power BI) via MyKBS by Tuesday 23:55 AEST Week 5.
Assessment Description
Business Problem: Airbnb is a U.S. company that provides an online marketplace for short-term and/or holiday accommodation. Airbnb collects large volumes of data to gain insight into its clients and associated customers, such as review scores, host acceptance rates, ‘superhosts’, popular accommodation types and the density of listings in particular locations.
Data sets: We have obtained data on Airbnb listings in Melbourne with a variety of variables. Sampled datasets, the original data and data dictionary will be available from Week 4. See sections below.
Assessment Instructions
Analysis and Report (30 marks)
Use Microsoft Excel or Power BI or Tableau.
Recall the sampling methods below that you have learnt about in lectures.
A data dictionary file and four datasets (as .csv files) containing sample data generated using quota, systematic, simple random, and stratified sampling will be available from Week 4; see section (c) below. You will also need to access the original population dataset cleansed_listings_dec_18.csv from the source; see sections (a) and (e) below.
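As a revision aid, the four sampling methods named above can be sketched in Python. This is a minimal illustration only: the sampled datasets for this assessment are already provided, and the analysis itself must be done in Excel, Power BI or Tableau. The records, the field names (`id`, `room_type`, `price`) and the category values are synthetic and hypothetical; they are not taken from the Airbnb file.

```python
import random
from collections import defaultdict

def simple_random_sample(rows, n, seed=0):
    """Every row has the same chance of selection (assumes n <= len(rows))."""
    return random.Random(seed).sample(rows, n)

def systematic_sample(rows, n, seed=0):
    """Take every k-th row after a random start within the first interval."""
    k = len(rows) // n
    start = random.Random(seed).randrange(k)
    return rows[start::k][:n]

def stratified_sample(rows, n, key, seed=0):
    """Sample randomly within each stratum, in proportion to its size.

    Rounding the per-stratum share means the total can differ slightly from n.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for group in strata.values():
        share = round(n * len(group) / len(rows))
        sample.extend(rng.sample(group, min(share, len(group))))
    return sample

def quota_sample(rows, quotas, key):
    """Non-random: keep the first rows encountered until each quota is filled."""
    remaining = dict(quotas)
    sample = []
    for row in rows:
        if remaining.get(row[key], 0) > 0:
            sample.append(row)
            remaining[row[key]] -= 1
    return sample

# Synthetic population: 1000 listings, 60% type "A" and 40% type "B" (made up).
rng = random.Random(42)
listings = [
    {"id": i, "room_type": "A" if i % 5 < 3 else "B", "price": rng.randint(40, 400)}
    for i in range(1000)
]

srs = simple_random_sample(listings, 100)
systematic = systematic_sample(listings, 100)
stratified = stratified_sample(listings, 100, key="room_type")
quota = quota_sample(listings, {"A": 50, "B": 50}, key="room_type")
```

Note how the quota sample fixes the A/B split at 50/50 and draws only from the top of the file, while the stratified sample preserves the population's 60/40 split. Differences of this kind are what sections (b) and (d) ask you to discuss.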
Create a report and include your response to the following questions:
- (a) Access the data file cleansed_listings_dec_18.csv by going to the link provided on MyKBS under the Assessment 1 tab. You will initially download a zip folder from the Melbourne Airbnb Open Data project on Kaggle. Extract all the files within the folder and then choose the file cleansed_listings_dec_18.csv. Browse the columns and comment on which variables appear to be the most useful in terms of insights into current listings. Document that in your report. (150 words, 2 marks)
- (b) List an advantage, a possible disadvantage and the limitations of each of the sampling methods. (150 words, 2 marks)
- (c) Access the sampled datasets on MyKBS. Choose a number of different variables, as in part (a), then for each of the sampled datasets create summary statistics for each of those variables. Make sure that the selected variables are the same for each of the four datasets and document them in your report. (300 words, 6 marks)
- (d) Interpret and compare the results of the summary stats across all four sample datasets. What conclusions can you draw from the comparison? Document your findings in your report. (500 words, 10 marks)
- (e) Repeat the above for the original dataset cleansed_listings_dec_18.csv. Explain with statistical examples which sampling method's summary stats (across all chosen variables) were nearest in value to the original dataset's summary stats. Explain the variations in your report and include the supporting data. Explain possible ethical issues that could occur from the use of sampled data. Briefly evaluate the software that you have used to produce the summaries. (500 words, 10 marks)
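The comparison of sample statistics against the original dataset can be reasoned about as follows. This is a hedged Python sketch of the logic only, not a required or accepted deliverable, since the report must use Excel, Power BI or Tableau. The price values, the sample names and the `mean_relative_gap` measure are assumptions made for illustration; other closeness measures are equally defensible.

```python
import statistics

def summarise(values):
    """Basic summary statistics for one numeric variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }

def mean_relative_gap(sample_stats, population_stats):
    """Average of |sample - population| / |population| over all statistics.

    Assumes no population statistic is zero.
    """
    gaps = [
        abs(sample_stats[name] - population_stats[name]) / abs(population_stats[name])
        for name in population_stats
    ]
    return sum(gaps) / len(gaps)

# Made-up nightly prices standing in for one variable of the original dataset.
population = [100, 120, 150, 90, 300, 80, 110, 95, 130, 200]

# Made-up samples; in the assessment these come from the four provided .csv files.
samples = {
    "simple_random": [100, 150, 300, 95],
    "quota": [80, 90, 95, 100],
}

population_stats = summarise(population)
gaps = {
    name: mean_relative_gap(summarise(values), population_stats)
    for name, values in samples.items()
}

# The sample whose statistics sit closest to the population's, on this measure.
nearest = min(gaps, key=gaps.get)
print(nearest, gaps)
```

A relative gap is used so that statistics on different scales (for example a mean price and a maximum price) contribute comparably. In a spreadsheet, the same idea is a column of absolute percentage differences per statistic, averaged per sample.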
Important Study Information
Academic Integrity Policy
KBS values academic integrity. All students must understand the meaning and consequences of cheating, plagiarism and other academic offences under the Academic Integrity and Conduct Policy.
What is academic integrity and misconduct? What are the penalties for academic misconduct? What are the late penalties?
How can I appeal my grade?
Click here for answers to these questions: http://www.kbs.edu.au/current-students/student-policies/.
Word Limits for Written Assessments
Submissions that exceed the word limit by more than 10% will cease to be marked from the point at which that limit is exceeded.
Study Assistance
Students may seek study assistance from their local Academic Learning Advisor or refer to the resources on the MyKBS Academic Success Centre page. Further details can be accessed at https://elearning.kbs.edu.au/course/view.php?id=1481
Generative AI Traffic Lights
Please see the level of Generative AI that this assessment has been designed to accept:
| Traffic Light | Amount of Generative Artificial Intelligence (AI) usage | Evidence Required | This assessment (✓) |
| Level 1 | This assessment fully integrates Generative AI, encouraging you to harness the technology's full potential in collaboration with your own expertise. It will highlight your ability to demonstrate how effectively you can work alongside AI to achieve sophisticated outcomes, blending human intellect and artificial intelligence. | Your collaboration with AI must be clearly referenced and documented in the appendix of your submission, including all prompts and responses used for the assessment. | |
| Level 2 | This assessment invites you to engage with Generative AI as a means of expanding your creativity and idea generation. It will highlight your ability to complement your original thinking with the capabilities of AI. For example, through brainstorming and preliminary concept development. | Your collaboration with AI must be clearly referenced and documented in the appendix of your submission, including all prompts and responses used for the assessment. | ✓ |
| Level 3 | This assessment showcases your individual knowledge and skills in the absence of Generative AI support. It will highlight your personal abilities. For example, to analyse, synthesise, and create based on your own understanding and learning. | Use of generative AI is prohibited and may potentially result in penalties for academic misconduct, including but not limited to a mark of zero for the assessment. | |
Assessment Marking Guide
| Section | Criteria | NN (Fail) 0-0.5 mark | P (Pass) 50%-64% | CR (Credit) 65%-74% | DN (Distinction) 75%-84% | HD (High Distinction) 85%-100% |
| (a) | Comments on the usefulness of at least 4 variables in relation to insights (2 marks) | No comments | Comments on one selected variable | Comments on two selected variables | Comments on three selected variables | Comments on at least 4 selected variables |
| (b) | State at least 3 advantages/disadvantages and limitations (2 marks) | Not stated | One advantage/disadvantage and one limitation stated | Two advantages/disadvantages and two limitations stated | Any three advantages/disadvantages and fewer than 3 limitations | At least 3 advantages/disadvantages and limitations stated |
| (c) | Summary statistics for each sample across the four selected variables (6 marks) | One sample and one selected variable | Two samples and two selected variables | 2-3 samples and 3 variables | Any three advantages/disadvantages and fewer than 3 limitations | At least 3 advantages/disadvantages and limitations stated |
| (d) | Comparisons made of results generated above and conclusions drawn and documented (10 marks) | No or limited comparison/conclusions drawn | Results compared to 2 samples and 2 selected variables with limited conclusions | Results compared to 2 samples and 2 selected variables with limited conclusions | 3-4 samples and 3 variables used in comparison of results with meaningful conclusions | 4 samples and at least 4 variables used in comparison of results with meaningful conclusions |
| (e) | Explained with statistical examples which sampling method summary stats across all selected variables were nearest the main dataset, and variations were explained. Explain ethical issues and evaluate the software. (10 marks) | No, or very limited, explanation of the comparative variations across 0-1 selected variables. Ethics not considered. Evaluation of software not mentioned. | Comparison of summary stats across one sample and just two simple variables. Ethics considered in a very general way. Evaluation of software very general. | Comparison of summary stats across at least two samples and two unrelated variables. Ethics considered in a more relevant way, but may not be practical. Evaluation of software relevant. | Comparison of summary stats across all samples and three variables. Ethics considered in a very relevant, practical and realistic way. Evaluation of software relevant and specific to this project. | Comparison of summary stats across at least four samples and at least four variables. Diverse variable choices and originality shown. Report engaging, novel and well integrated. Ethics considered in a very relevant, novel and practical way. Evaluation of software detailed, relevant and specific to this project. |