
Overview

Over the course of completing Assignment #5, you will conduct a small project on a topic that is useful or meaningful for you. You will set your own question or objective, and you will adapt and build upon skills and concepts learned in class to solve new problems. Completion of this project involves synthesizing concepts and adapting skills you have learned throughout the semester. Please see file #1 for candidate project ideas.

Where relevant, part of your assignment may involve adapting class example scripts (or concepts), but it is also important to build onto that foundation by using a new package, analysis method, or visualization type. It is also important to apply good programming practices (such as creating functions and avoiding repetitive code). You are also welcome to propose your own topic, but you must speak with me for project approval at least three weeks before the assignment due date. To ensure your success and minimize stress, start on this project right away!

This project is meant to be completed over 3-4 weeks of part-time but regular effort. Assignment #5 serves as both a learning opportunity and an assessment tool. This assignment also provides training, serving as a valuable stepping stone towards completing larger projects (e.g. BINF*6999 or your thesis and beyond). Completing this project contributes towards achieving all five of our course-level learning outcomes:

1. obtain data from key databases relevant for bioinformatics and understand the sources and limitations of these data
2. filter, manipulate, analyze, and visualize bioinformatics data
3. conduct reproducible analyses and use software tools for version control and collaboration
4. understand and apply selected algorithms commonly used in bioinformatics, including for sequence alignment and clustering
5. adapt the above skills to learn new tools and conduct new analyses not explicitly covered in class

1 General Guidelines

For most class members, your project should be written entirely in R. It is required that either all or the majority of your assignment be written in R. If suitable for your project (e.g. see file 1 with candidate project topics), you may use a software tool outside of R as part of your work. It is essential that you explain clearly what you did in the other tool. You must also provide the output from that tool as an input for your R script. You will submit both a PDF and an R script file or an RMarkdown file for your Assignment #5, using separate Dropbox folders on CourseLink. You should also include any needed input data files.

You should show your code, not only the figures generated. Please take care with the total length of your PDF. The expected assignment length is 8-12 pages (maximum length 18 pages). This longer maximum page limit compared to Assignments 1 and 2 is there to accommodate the additional text sections and reference list.

Long code and lengthy outputs are NOT expected. For example, instead of lengthy data outputs to screen (and to your PDF), use head() to limit the scope of what is printed. As well, some functions have quietly = TRUE (or similar) as an optional argument, which you may consider for functions that produce long outputs to screen (e.g. outputting progress during sequence alignment). You can also write your own functions to reduce redundancy in your code.

Organize your project using the numbered headings below, in order. Include the numbered headings in your assignment. You can add a more detailed sub-section title if you wish, e.g. “Introduction: Impact of Normalization Methods in Gene Expression Analysis”.

You should use GitHub to manage your own work. You may keep your repository private. Include your GitHub repository link at the top of your assignment.

You may draw heavily from a Bioconductor vignette for your project, if suitable. Of course, you must cite any such sources used. However, running an existing vignette end to end with the provided data would not be a sufficient project.
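The advice above on keeping printed output short and avoiding repeated code can be sketched roughly as follows. This is only an illustration on simulated data: `big`, `summarise_column()`, and `noisy_step()` are hypothetical names invented for the example, and `suppressMessages()` is shown as one base R option for functions that offer no quiet-style argument.

```r
# Simulated data standing in for a large data set (illustration only).
set.seed(1)
big <- data.frame(sample = paste0("S", 1:1000), value = rnorm(1000))

# Print only the first few rows instead of all 1000.
head(big, n = 5)

# Wrap a repeated calculation in a function instead of copying the code.
# summarise_column() is a hypothetical helper, not a course function.
summarise_column <- function(df, column) {
  x <- df[[column]]
  c(n = length(x), mean = mean(x), sd = sd(x))
}
stats_value <- summarise_column(big, "value")

# Silence progress messages from a chatty function.
# noisy_step() is a stand-in for any function that reports progress.
noisy_step <- function(x) {
  message("processing...")
  x * 2
}
result <- suppressMessages(noisy_step(1:3))
```

The same helper can then be called for each variable of interest, which keeps both the script and the PDF shorter than repeating the three calculations by hand.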

You must build upon any vignettes or tutorials you draw from and include a novel component for your project. Many examples are provided in file 1. Examples include: comparing the results from using two packages, testing a biological hypothesis using a different data set compared to what is used in the vignette, combining two software tools for a novel project (e.g. combining DADA2 output with a phylogenetic community analysis, such as using phyloseq or picante), trying a range of parameter settings for selected functions and exploring the impact of methodological choices upon the results, or re-creating all or part of an interesting analysis or figure from a publication using your own code.

2 Detailed Project Requirements and Sections

1. Introduction (short written section of 2-3 paragraphs) (10%)

Pose a question or objective that interests you, and set up your project. Using one of the themes outlined in file 1, or your own project idea (pending approval), narrow in on a specific project objective and data set. Your project may involve building a tool (e.g. classifier), exploring an idea using visualizations, testing a specific biological hypothesis by applying analytical methods, or exploring the impact of analytical choices upon results.

Be clear in your introduction: Which of these types of projects are you conducting? And what is the objective of your project? Aim to cite 2-4 suitable references in your introduction. You may use any of the references uploaded to CourseLink or other literature you find on your topic.

Guidelines:

This length guideline for written paragraphs applies to all of the written sections in Assignment #5.

Tips: The following is one possible way to organize your introduction into a nice flow.

o Paragraph 1: What is the overarching topic you are interested in? Why is this scientifically important and interesting and/or socially relevant?
o Paragraph 2: You can outline what is an important sub-area of research or a gap in knowledge. (As an example, perhaps your broader theme is gene expression analysis, and in paragraph 2 you could narrow in to introduce the importance of statistical choices.)
o Paragraph 3: What is the specific objective of your study? What kind of study are you performing (e.g. Exploratory? Software comparison? Hypothesis testing? Classifier creation? Learning about tools to re-create a published figure?) What will you do?

2. Description of Data Set (1 paragraph) (5%)

Provide a short written description of your data set. What data set have you chosen to analyze to address your question? Describe the data set. Where are the data from, e.g. from the literature? from a public database? On what date did you obtain the data? How did you obtain the data?

What are the properties of the data set? How many samples are there, and what are the key variables you are interested in? If you are comparing groups (e.g. treatments, organisms, cell types, genes), describe the nature of the groups that you are comparing. Cite the source of the data set.

Depending upon the size of the data set, you may consider analyzing only a subset here, to keep the project scope manageable. Data should be analyzable on a desktop computer for this assignment. As well, please note that you will need to provide a data file to enable your code to run. If unpublished data are included with your assignment,

3. Code Section 1 – Data Acquisition, Exploration, Filtering, and Quality Control (25%)

Whenever possible, include code for data acquisition for your project. Also, include all data files needed for analysis with your submission. After data acquisition, you will next conduct data exploration, filtering, and quality control.
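One possible pattern for scripted data acquisition is to fetch the data once, keep a local copy for submission, and record the acquisition date for the data description. The sketch below is a minimal illustration under stated assumptions: `acquire_once()` and `demo_fetch()` are hypothetical helpers written for this example, and the stand-in fetch function writes a tiny CSV locally rather than contacting a real database. In a real project, `fetch_fun` might wrap `download.file()` or a database package call.

```r
# Hypothetical helper: fetch a data file once, then reuse the local copy.
# fetch_fun is any function that writes the raw data to the path it is given.
acquire_once <- function(dest, fetch_fun) {
  log_file <- paste0(dest, ".acquired.txt")
  if (!file.exists(dest)) {
    fetch_fun(dest)
    # Record the acquisition date so it can be reported in the write-up.
    writeLines(format(Sys.Date(), "%Y-%m-%d"), log_file)
  }
  acquired <- if (file.exists(log_file)) readLines(log_file) else NA_character_
  list(path = dest, acquired = acquired)
}

# Stand-in for a real download: writes a small made-up table.
demo_fetch <- function(path) {
  toy <- data.frame(id = 1:3, length_mm = c(10.2, 11.5, 9.8))
  write.csv(toy, path, row.names = FALSE)
}

demo_path <- file.path(tempdir(), "demo_data.csv")
info <- acquire_once(demo_path, demo_fetch)

# Read the local copy; later runs skip the fetch step.
dat <- read.csv(info$path)
head(dat)
info$acquired
```

Keeping the fetch step separate from the analysis means the submitted data file and the script stay in sync even if the online source changes later.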

What is suitable in this section will depend upon your project. Here are a few examples:

o For key quantitative variables (e.g. organism length, mass), calculate data summaries, such as using the summary() function. Also, prepare a suitable plot, such as a boxplot or histogram or violin plot, showing the distribution of the data. If suitable for your project, you may use multiple panels in a single figure if you have multiple variables of interest. Check on outliers. Are these likely real data points, or could they be data entry errors? Note that we are looking for errors, not looking to exclude data that disagree with a specific hypothesis. You may decide, for example, that even if there is an outlier that it is likely real data and you might perform a data transformation so that the observation doesn’t have extreme influence upon your results. Or, you may perform your analysis with vs. without extreme data points for comparison.

o For DNA sequence data, you may consider doing a check for sequence lengths and exclude sequences having very short length compared to the majority of your data, for example.

The suitable choice would depend upon what gene and gene region you are analyzing. For example, for COI data for animals, there is only a very small amount of natural length variability (and thus very short sequences would be due either to poor quality or to metabarcoding projects which may involve sequencing short gene fragments). By contrast, in markers such as 18S, there can be a large amount of natural sequence length variability to consider. You may also consider excluding sequences with a large proportion of Ns. A suitable visualization for DNA sequence data could be a histogram of sequence lengths.
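The length and N-content checks described above could look roughly like the following base R sketch. The sequences are made up, and the two cut-offs (`min_length` at 80% of the median length, `max_prop_n` at 5%) are hypothetical values chosen only for illustration; suitable thresholds depend on the marker. A real project might instead hold sequences in a Biostrings `DNAStringSet`.

```r
# Toy sequences (assumption: a named character vector of DNA strings).
seqs <- c(
  s1 = "ATGCGTACGTTAGCATGCGTACGTTAGC",
  s2 = "ATGCGTACGTTAGCATGCGTACGTTAGG",
  s3 = "ATGCGTNNNNNNNNNNGCGTACGTTAGC",
  s4 = "ATGCGT",
  s5 = "ATGCGTACGTTAGCATGCGAACGTTAGC"
)

# Sequence length and proportion of N characters per sequence.
seq_length <- nchar(seqs)
n_count <- seq_length - nchar(gsub("N", "", seqs, fixed = TRUE))
prop_n <- n_count / seq_length

# Visualize the length distribution before filtering.
hist(seq_length, main = "Sequence lengths", xlab = "Length (bp)")

# Hypothetical thresholds, for illustration only.
min_length <- 0.8 * median(seq_length)
max_prop_n <- 0.05

# Keep sequences that pass both checks.
keep <- seq_length >= min_length & prop_n <= max_prop_n
filtered <- seqs[keep]
names(filtered)
```

Here the very short sequence `s4` and the N-rich sequence `s3` are dropped, while the three full-length sequences are retained.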

If suitable for your project, you may investigate outliers based upon a pairwise distance matrix built either from k-mer frequencies or aligned sequences (i.e. are there sequences very different from others in the dataset?). Another helpful quality control step could be to build a visualization (such as a dendrogram, including scale bar of branch lengths) to look for sequences very different from others.
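As a minimal sketch of the k-mer option, the base R code below computes 3-mer frequencies, builds a Euclidean distance matrix, and draws a dendrogram by average-linkage hierarchical clustering. These are assumptions made for the example: `kmer_freq()` is a hypothetical helper, the sequences are invented, and the choices of k, distance measure, and clustering method are illustrative rather than recommended. The height axis of the plot plays the role of the scale bar.

```r
# Hypothetical helper: relative frequencies of all DNA k-mers in one sequence.
# Assumes the sequence is at least k characters long; k-mers containing
# characters other than A, C, G, T are ignored.
kmer_freq <- function(dna, k = 3) {
  bases <- c("A", "C", "G", "T")
  grid <- expand.grid(rep(list(bases), k), stringsAsFactors = FALSE)
  all_kmers <- apply(grid, 1, paste, collapse = "")
  starts <- seq_len(nchar(dna) - k + 1)
  observed <- substring(dna, starts, starts + k - 1)
  counts <- table(factor(observed, levels = all_kmers))
  as.numeric(counts) / sum(counts)
}

# Toy sequences: three similar ones and one deliberately unusual one.
seqs <- c(
  a   = "ATGCGTACGTTAGCATGCGTACGTTAGC",
  b   = "ATGCGTACGTTAGCATGCGTACGTTAGG",
  c   = "ATGCGTACGTTAGCATGCGAACGTTAGC",
  odd = "GGGGCCCCGGGGCCCCGGGGCCCCGGGG"
)

# One row of k-mer frequencies per sequence.
freqs <- t(sapply(seqs, kmer_freq, k = 3))

# Pairwise Euclidean distances and a dendrogram for visual inspection.
d <- dist(freqs)
tree <- hclust(d, method = "average")
plot(tree, main = "3-mer distance dendrogram", xlab = "", sub = "",
     ylab = "Distance (height)")

# Flag the sequence with the largest mean distance to all others.
mean_dist <- rowMeans(as.matrix(d))
most_divergent <- names(which.max(mean_dist))
most_divergent
```

A sequence flagged this way is a candidate for closer inspection, not automatic removal.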

You could then BLAST unusual sequences (using the NCBI tool) and check the top few hits to see if there could be a misidentification in the dataset (or another serious issue, such as the wrong gene, a different gene region being sequenced, wrong sequence orientation, wrong organism/contamination, etc.).

o For RNA-Seq data, you should examine library size among samples and make a suitable analytical choice that takes into account variability in library size. A suitable plot for that type of analysis could be to plot gene expression levels before vs. after normalization (e.g. see the gene expression vignette we ran together in class). You should also explicitly consider how you will choose to treat lowly-expressed genes for your project.
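The RNA-Seq checks above can be illustrated with a rough base R sketch on simulated counts. Several things here are assumptions for the example: the count matrix is simulated, counts per million (CPM) is used as a simple library-size scaling and is not necessarily the normalization used in the class vignette, and the filter thresholds `min_cpm` and `min_samples` are hypothetical. Packages such as edgeR or DESeq2 provide more robust normalization and filtering.

```r
# Simulated count matrix: 2000 genes x 6 samples (illustration only).
set.seed(42)
n_genes <- 2000
gene_means <- rep(c(0.5, 20, 200, 2000), each = n_genes / 4)
counts <- matrix(
  rnbinom(n_genes * 6, mu = gene_means, size = 2),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)), paste0("sample", 1:6))
)
# Make one sample much more deeply sequenced than the others.
counts[, "sample6"] <- counts[, "sample6"] * 3

# Library size per sample.
lib_size <- colSums(counts)
lib_size

# Counts per million: divide each column by its library size.
cpm <- sweep(counts, 2, lib_size, "/") * 1e6

# Hypothetical low-expression filter: CPM of at least 1 in at least 3 samples.
min_cpm <- 1
min_samples <- 3
keep <- rowSums(cpm >= min_cpm) >= min_samples
cpm_filtered <- cpm[keep, ]

# Expression distributions before vs. after scaling and filtering.
op <- par(mfrow = c(1, 2))
boxplot(log2(counts + 1), main = "Before", las = 2, ylab = "log2(count + 1)")
boxplot(log2(cpm_filtered + 1), main = "After", las = 2, ylab = "log2(CPM + 1)")
par(op)
```

In the "Before" panel the deeply sequenced sample sits higher than the rest; after scaling by library size the distributions become comparable across samples.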
