Name:
Instructor:
Subject:
Date of submission:
Data Visualisation Data Project
Introduction
Data visualization is an integral skill for researchers. Scholars should therefore equip themselves with it, because it can help them gain useful insights from data sets (Ginde, n.p.). R is open-source software that supports this objective. For this reason, the researcher used R to investigate a data set retrieved from Michigan State University. This paper examines the Taxi05 data set, whose only variables are distance, fare, call, minutes, and payment. Fare, distance, and minutes are quantitative variables, while payment and call are qualitative variables.
The researcher used three R packages to examine the data: data.table, ggplot2, and lattice. The data.table package was used to read the data directly from the source, while the lattice and ggplot2 packages were used to create the visualizations. For instance, the researcher created a bar graph of the mode of payment recorded in the data set (see figure 1). The graph shows that card payments were slightly more frequent than cash payments.
Figure 1. A bar graph displaying the mode of payment for the Taxi05 data set
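The loading and bar-graph steps described above might look roughly like the following sketch. It is not the researcher's code: the rows are invented placeholders rather than Taxi05 records, and the column names simply follow the variable names listed in the introduction. In practice, fread() would be pointed at the actual file path or URL of the data set, which the paper does not give.

```r
library(data.table)
library(ggplot2)

# Invented placeholder rows, NOT real Taxi05 records; only the column
# names follow the variables named in the paper.
csv_lines <- c(
  "distance,fare,call,minutes,payment",
  "1.2,7.5,Dispatch,6,Card",
  "3.4,14.0,Street_Hall,12,Cash",
  "2.1,9.8,Street_Hall,9,Card",
  "5.6,21.3,Dispatch,18,Cash",
  "0.9,6.2,Street_Hall,5,Card"
)

# fread() also accepts a file path or URL; text = keeps the sketch self-contained.
taxi <- fread(text = csv_lines)

# Store the qualitative payment variable as a factor.
taxi[, payment := factor(payment)]

# Bar graph counting trips per mode of payment.
payment_plot <- ggplot(taxi, aes(x = payment)) +
  geom_bar() +
  labs(x = "Mode of payment", y = "Number of trips")
print(payment_plot)
```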
A histogram for the distance variable was generated (see figure 2). The histogram indicates that distance is skewed to the right and centered close to 2.5.
Figure 2. A frequency histogram displaying the distribution of distance for the Taxi05 data set
A relative frequency histogram for the distribution of minutes was created with the lattice package (see figure 3). The graph shows that the distribution of minutes is skewed to the left and centered at around ten minutes.
Figure 3. A relative frequency histogram for the distribution of minutes in the Taxi05 data set
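A relative frequency histogram of this kind can be sketched with lattice as follows. The durations are invented placeholders, not Taxi05 values, and the axis labels are assumptions. Only the call pattern is meant to be illustrative.

```r
library(lattice)

# Invented placeholder durations in minutes, NOT real Taxi05 values.
trips <- data.frame(minutes = c(4, 6, 7, 9, 10, 10, 11, 12, 15, 22))

# type = "percent" puts the share of trips on the vertical axis;
# type = "count" would give a frequency histogram instead.
minutes_hist <- histogram(~ minutes, data = trips, type = "percent",
                          xlab = "Minutes", ylab = "Percent of trips")
print(minutes_hist)
```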
Summary statistics for fare in the Taxi05 data set were also examined, and the results are displayed in figure 4. They indicate a sample size of n = 500, mean = 14.26, median = 11.30, standard deviation = 9.01, variance = 81.21, range = 65, first quartile = 8.30, third quartile = 17.25, and interquartile range = 8.95.
Figure 4. An image displaying summary statistics for the Taxi05 data set
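The reported figures can be checked against one another with a little arithmetic. The sketch below uses only the values quoted in the text; the variable names are hypothetical. With the fare column itself available, base R functions such as mean(), median(), sd(), var(), quantile(), and IQR() would presumably have produced these statistics.

```r
# Summary figures for fare exactly as reported in the text.
reported <- c(mean = 14.26, median = 11.30, sd = 9.01, variance = 81.21,
              q1 = 8.30, q3 = 17.25, iqr = 8.95)

# The interquartile range should equal Q3 - Q1.
iqr_from_quartiles <- reported[["q3"]] - reported[["q1"]]

# The standard deviation should be the square root of the variance.
sd_from_variance <- sqrt(reported[["variance"]])

# A mean above the median is what a right-skewed variable produces.
mean_exceeds_median <- reported[["mean"]] > reported[["median"]]

print(round(iqr_from_quartiles, 2))  # 8.95
print(round(sd_from_variance, 2))    # 9.01
print(mean_exceeds_median)           # TRUE
```

The quartiles, interquartile range, standard deviation, and variance agree to two decimal places, and the mean exceeding the median is consistent with the positive skew reported for fare.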
A box plot of fare was also produced using the ggplot2 package (see figure 5). It indicates that fare is positively skewed, with fares above about 30 appearing as outliers.
Figure 5. A box plot showing the distribution of fare from the Taxi05 data set
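The outlier threshold of about 30 can be compared with the quartiles reported in the summary statistics. The sketch below assumes the box plot used the default rule of ggplot2's geom_boxplot(), under which points lying more than 1.5 times the interquartile range beyond the box are drawn as outliers. The variable names are hypothetical.

```r
# Quartiles of fare as reported in the summary statistics.
q1 <- 8.30
q3 <- 17.25
iqr <- q3 - q1

# Points more than 1.5 * IQR beyond the box are drawn as outliers.
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr

print(upper_fence)  # 30.675
print(lower_fence)  # -5.125
```

An upper fence of roughly 30.7 matches the observation that fares above about 30 are flagged. The lower fence is negative, so no fare can fall below it, which fits outliers appearing only on the high side.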
A side-by-side box plot of fare separated by call was also created (see figure 6). The figure shows that the average fare for Dispatch calls is higher than that for Street_Hall calls. It also shows that the Street_Hall fares are positively skewed with several outliers, while the Dispatch fares are approximately normally distributed with few outliers.
Figure 6. A box plot for the distribution of fare grouped by call in the Taxi05 data set
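A grouped box plot like the one in figure 6 might be drawn along the following lines. The fares are invented placeholders, not Taxi05 values; the level names follow the labels used in the text, and the axis labels are assumptions.

```r
library(ggplot2)

# Invented placeholder fares, NOT real Taxi05 values.
trips <- data.frame(
  call = factor(c("Dispatch", "Dispatch", "Dispatch", "Dispatch",
                  "Street_Hall", "Street_Hall", "Street_Hall", "Street_Hall")),
  fare = c(12.5, 15.0, 17.5, 20.0, 6.0, 8.5, 9.0, 31.0)
)

# Mapping call to x draws one box per call type on a shared fare axis.
fare_by_call <- ggplot(trips, aes(x = call, y = fare)) +
  geom_boxplot() +
  labs(x = "Call type", y = "Fare")
print(fare_by_call)

# Group medians, which the centre line of each box marks.
group_medians <- tapply(trips$fare, trips$call, median)
print(group_medians)
```

Note that the line inside each box marks the group median rather than the mean, so comparing the heights of those lines compares medians.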
Finally, a scatter plot of distance against fare was generated (see figure 7). The scatter plot indicates a positive linear relationship between distance and fare.
Figure 7. A scatter plot of distance and fare from the Taxi05 data set
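A scatter plot of this kind, together with a numerical check of the direction of the relationship, can be sketched as follows. The trips are invented placeholders, not Taxi05 values. Placing distance on the horizontal axis and adding a fitted line are assumptions, since the paper does not state how the axes were assigned.

```r
library(ggplot2)

# Invented placeholder trips, NOT real Taxi05 values.
trips <- data.frame(
  distance = c(0.9, 1.2, 2.1, 2.5, 3.4, 5.6),
  fare     = c(6.2, 7.5, 9.8, 11.0, 14.0, 21.3)
)

# Points plus a least-squares line to make any linear trend visible.
distance_fare_plot <- ggplot(trips, aes(x = distance, y = fare)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Distance", y = "Fare")
print(distance_fare_plot)

# Pearson correlation: a positive value indicates a positive linear association.
r <- cor(trips$distance, trips$fare)
print(r > 0)  # TRUE
```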
Conclusion
In conclusion, this exercise demonstrates the value of data visualization through the insights it produced. For instance, the researcher was able to examine the difference in the distribution of fare grouped by call. The researcher was also able to establish a positive linear relationship between distance and fare. The code used to produce the reported output is displayed in the appendix below. In short, data visualization can help researchers draw clear and concise insights from data sets (Rahlf, p. 1).
Works Cited
Ginde, Gouri. Visualisation of Massive Data from Scholarly Article and Journal Database: A Novel Scheme. Department of Computer Science and Engineering, PESIT Bangalore South Campus, Bangalore, India, n.d., https://arxiv.org/ftp/arxiv/papers/1611/1611.01152.pdf. Accessed 7 February 2018.
Rahlf, Thomas. “Data Visualisation with R: 100 Examples.” Journal of Statistical Software, vol. 81, no. 2, 2017, pp. 1-5.
Appendix
Code used to generate the output.