Corrected title of what filetypes should we use

feb6a97a · Wigal, Jacob (CIV) · 744a43e5 · feb6a97a
Commit feb6a97a authored 5 years ago by Wigal, Jacob (CIV)
--- a/why_we_use_git_and_gitlab.ipynb
+++ b/why_we_use_git_and_gitlab.ipynb
@@ -124,7 +124,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "<h2>What text-based files should we use?</h2>\n",
+    "<h2>What text-based filetypes should we use?</h2>\n",
    "<h3>Storing and Sharing Code</h3>\n",
    "<p>\n",
    "<font size=\"3\">\n",

 %% Cell type:markdown id: tags:
 <h1>Why we use git and GitLab</h1>
 %% Cell type:markdown id: tags:
 <h2 align="left"> As a team, we need:</h2>
 <ul>
    <font size="3">
  1. a way to store and share data and code;<br>
  2. a collaborative space for testing new ideas; and, <br>
  3. precise version control to avoid losing past data/code.<br>
    </font>
 </ul>
 %% Cell type:markdown id: tags:
 <font size="3">
 <ul>
 Using <b>git</b> and <b>GitLab</b> is one way to do these three things. Together, they can:<br>
  1. store and share data and code;<br>
  2. enable users to see, edit, and comment on others' work without overwriting files; and,<br>
  3. enable distributed version control and password protection of key files.<br>
    <br>
 <b>But first... some quick definitions:</b> [Git](https://git-scm.com/) is a version control system (VCS) that enables people to use simple commands to keep track of files and file changes. When you edit a file, git helps you keep track of exactly what changed, who changed it, and why. [GitLab](https://about.gitlab.com/) is a web platform that hosts git repositories on a server and provides a user interface showing files and tracked changes. It allows multiple users to easily view edits tracked by git in the cloud. It also adds many more features like password protection and wikis. [GitHub](https://github.com/), which you may have heard of before, is a website and online platform similar to GitLab.
 </font>
 %% Cell type:markdown id: tags:
 <h2>Why we are using git</h2>
 <font size="3">
 We make changes to files all the time. Most changes we make (even mistakes) are not a big issue when a single person is in control of the file and its management. For example, when you write a report in Microsoft Word, you can pick up where you last saved the document. You can edit the file, save it, and if you see yourself make a mistake, you can "undo". You are fully aware of how your Word document has changed over time.
    <br>
    <br>
 Moving to a distributed, simultaneous file-sharing system with multiple users and organizations turns this workflow upside-down. In a distributed setting, there are a lot of different people who may want to edit your file. They may not know what changes you made in the past and may make changes that you do not notice. It is also easy for multiple people to work on the same file and make conflicting changes that do not work well together (e.g., simultaneous edits to the same sentence or paragraph in a report). In worst-case situations, one person may overwrite a file and lose important information without knowing and with no possibility of an "undo". Even if changes are tracked, files may have long version histories that are nearly impossible to make sense of.
    <br>
    <br>
    <i>This</i> is where git comes in.
    <br>
    <br>
   In general, git forces users to be deliberate about how they change files and tracks all changes people make to files. A change to a file in git is called a <i>commit</i>. With every commit, git also requires you to write a message, called a <i>commit message</i>, about the change you are making. Every commit and commit message helps explain what happened between each version of the same file. Everyone with access to the file can see every commit and commit message, as well as download or revert to old versions of the file at any time. Git also enables users to <i>pull</i> other people's changes into their files, or update files with newer versions.
   <br>
    <h3>Diffs</h3>
  Git also automatically produces an output called a <i>diff</i> that shows the line-by-line edits made in each commit. Diffs pinpoint the exact differences between file versions and provide a powerful way to understand version details and commit messages left by users.
   <br>
 Below is an example of a diff showing changes someone made to a text file called index.md. This diff shows us the updating_the_geo_nodes.md line was <font color='red'> deleted </font> and the version_specific_updates.md line was <font color='green'> added </font>. Notice how this diff identifies only the exact part of the line that changed, as the "<b>ing</b>" stays the same between commits!
    </font>
 %% Cell type:markdown id: tags:
 ![Clone Repo GitLab](https://docs.gitlab.com/ee/user/project/merge_requests/img/merge_request_diff_v12_2.png "GitLab diff")
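
To get a feel for what a diff contains, here is a minimal sketch that builds one with Python's standard `difflib` module. The three-line file contents below are made up for illustration (they are not the real index.md), and `difflib` is a stand-in for git here: it produces the same general unified-diff layout, but it is not git itself.

```python
import difflib

# Hypothetical "before" and "after" contents of index.md (illustrative only).
old_lines = [
    "installation.md\n",
    "updating_the_geo_nodes.md\n",
    "troubleshooting.md\n",
]
new_lines = [
    "installation.md\n",
    "version_specific_updates.md\n",
    "troubleshooting.md\n",
]


def make_diff(old, new, name="index.md"):
    """Return a unified diff of two lists of lines as a single string."""
    return "".join(
        difflib.unified_diff(old, new, fromfile=f"a/{name}", tofile=f"b/{name}")
    )


diff_text = make_diff(old_lines, new_lines)
print(diff_text)
```

Lines starting with `-` were removed, lines starting with `+` were added, and lines starting with a space are unchanged context.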
 %% Cell type:markdown id: tags:
 <font size="3">
 Diffs are only visible with git for certain filetypes. Git was originally designed to deal with code for software development, so filetypes associated with most programming languages work great in git. For example, git will track commits and diffs for files like...<br>
    <ul>
  <li>Matlab files (.m)</li>
  <li>R scripts (.r)</li>
  <li>Text files (.txt)</li>
  <li>Python programs (.py)</li>
  <li>HTML files (.html)</li>
 <li>Tabular data files (.csv)</li>
  <li>Non-Tabular, text-based data (e.g., .json)</li>
  <li>and many more... </li>
 </ul>
 </font>
 %% Cell type:markdown id: tags:
 <font size="3">
    <h align="left"> Git works really well with those files because they are <i><b>flat, text-based</b></i> formats that git easily reads when someone makes a commit and can analyze when generating diffs. </h>
    However, git <i><b>can't</b></i> read the contents of <i><b>non-text-based</b></i> files. In other words, when someone makes changes to files that are <i>not</i> text-based, git doesn't understand the diff. These filetypes include...
    <ul>
  <li>Rendered Documents (.pdf)</li>
  <li>Binary images (.jpg)</li>
  <li>Formatted word docs (.doc)</li>
  <li>Tabular data with multiple sheets (.xlsx)</li>
  <li>Binary data formats (e.g., .dbf)</li>
  <li>and many more... </li>
 </ul>
 <i>If we make a commit with these files, we won't be able to see the diff!</i>
    <br><br>
 Below is an example of a diff generated after committing files associated with a binary datatype called a <i>shapefile</i>. Here, we made changes to three binary files associated with a shapefile &hyphen; the database file (.dbf), the main file (.shp), and the index file (.shx). Each file can have commits made and the user can move freely through this commit history. However, <b>the diff cannot show what changed</b>. The user is stuck with figuring out the differences between the original file and this new version. The only information provided is the commit message, "Added point and changed date". For large files, this limited information may not be useful enough to track changes. It is often difficult to manage large, distributed projects without seeing the diffs.
    </font>
 %% Cell type:markdown id: tags:
 ![Clone Repo GitLab](https://elevationusvistttest.s3-us-west-1.amazonaws.com/shp+diff.png "GitLab diff shp")
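
One rough way to tell whether a file is likely to produce a readable diff is to look for NUL bytes near the start of its contents. The sketch below uses that heuristic. The function name, the 8000-byte sample size, and the sample byte strings are our own assumptions for illustration, and this is not a copy of git's implementation.

```python
def looks_binary(data: bytes, sample_size: int = 8000) -> bool:
    """Guess that data is binary if a NUL byte appears in the first sample_size bytes."""
    return b"\x00" in data[:sample_size]


# A tiny GeoJSON geometry stored as plain text.
geojson_bytes = b'{"type": "Point", "coordinates": [-77.0, 38.9]}'

# Made-up bytes standing in for the start of a .shp file; not a valid shapefile.
fake_shp_bytes = b"\x00\x00\x27\x0a" + b"\x00" * 20

print(looks_binary(geojson_bytes))
print(looks_binary(fake_shp_bytes))
```

Text formats such as GeoJSON normally contain no NUL bytes, while binary formats usually do, which is why this simple check separates the two samples.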
 %% Cell type:markdown id: tags:
-<h2>What text-based files should we use?</h2>
+<h2>What text-based filetypes should we use?</h2>
 <h3>Storing and Sharing Code</h3>
 <p>
 <font size="3">
 In general, <b>git will work with any computer code we write in any format.</b> Git was designed for software development, and works great with essentially any program we write from scratch. It doesn't matter if you write your program in Python, C, C++, C#, Java, JavaScript, R, MATLAB, or any other language. All programs are essentially text-based. For special files that may be required to run something, git has built-in functionality to help manage these issues.
 </font>
 </p>
 <h3>Storing and Sharing Data</h3>
 <p>
 <font size="3">
 The two biggest constraints when using git to manage data are that git requires limited file size and that only text-based filetypes generate diffs.
 <br>
 <b>Data File Size:</b> When storing code and data on GitLab, the files are stored in <i>repositories</i>. Because git and GitLab were designed for managing text-based code (which often has very small file size), git repositories tend to have limitations on their total storage. However, there is generally no limitation on the number of repositories one can have. So, this issue is usually circumvented by keeping each repository as small as possible. When using git, it's important to know the maximum storage capacity of a repository, and to try to segment projects and data into separate repositories when possible.
 <br>
 <b>Data Filetype:</b> While software development is done almost exclusively with flat, text-based files, data is often stored in non-flat, non-text-based files. A common example of a non-flat file would be an Excel file with functions and data stored across multiple tabs. A common example of non-text-based data is the binary shapefiles shown above.
 <br>
        <ul>
 <b>To utilize git for data management, we have to use flat, text-based data filetypes whenever possible.</b> In general, this just means choosing an appropriate filetype for a particular kind of data and being deliberate about using that filetype for all commits. To demonstrate what we mean, we will focus on an important type of data we want to share that comes in many filetypes &ndash; geospatial data.
      </ul>      </p>
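
Before pushing data, it can help to check how much space a project folder takes up. The sketch below adds up file sizes in a folder (skipping git's own `.git` folder) and compares the total to a limit. The function names and the 1024-byte limit in the demo are placeholders we made up; the real limit depends on how your GitLab instance is configured.

```python
import os
import tempfile


def directory_size_bytes(root: str) -> int:
    """Add up the sizes of all regular files under root, ignoring any .git folder."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place tells os.walk not to descend into .git.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                total += os.path.getsize(path)
    return total


def fits_in_repo(root: str, limit_bytes: int) -> bool:
    """Return True if the folder's total size is at or under limit_bytes."""
    return directory_size_bytes(root) <= limit_bytes


# Demo on a throwaway folder holding a single 10-byte file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "points.geojson"), "wb") as f:
        f.write(b"x" * 10)
    demo_size = directory_size_bytes(tmp)
    demo_fits = fits_in_repo(tmp, limit_bytes=1024)  # placeholder limit

print(demo_size, demo_fits)
```

If a folder does not fit, that is a hint to split the project or its data across separate repositories.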
 %% Cell type:markdown id: tags:
 <h2>Geospatial data</h2>
 <p>Geospatial data describes the position of points, lines, polygons, and/or images on the earth as well as their coordinate-linked attributes. Geospatial data typically includes geometric features, coordinates of those features, the coordinate reference system and projection those coordinates should be read in, and lists of coordinate-linked data visible in an attributes table. <br><br>Geospatial data has text-based and non-text-based formats:
   <table><font color="red"> <tr> <th>Text-based</th> <th>Non-text-based</th> </tr> <tr> <td><b>.geojson</b></td> <td>.shp</td> </tr> <tr> <td>.wkt</td> <td>.gpkg</td> </tr><tr> <td>.gml</td> <td>.kmz</td> </tr><tr> <td>.kml</td> <td></td> </tr> </font> </table>
 <br>
 <p>
 Notice that there are binary filetypes like shapefiles, geopackages, and kmz (a zipped kml file) <i>and</i> text-based filetypes like geojson, wkt, gml, and kml. When we use git, if we commit and share geospatial data as one of these text-based filetypes, we will get useful diffs. We highlight GeoJSON because this is the flat, text-based filetype we chose from this list to use. We chose GeoJSON for a number of reasons.
 </p>
 <h3>GeoJSON</h3>
 <font size="3">
 The GeoJSON format is the standard for text-based geospatial data formats. It is an open format, meaning it will work on most operating systems and GIS software. It is built on top of the already well-established JSON format, and has a number of helpful supporting packages that can be employed in programming languages such as Python and R.
    </font>
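
As a small illustration of why GeoJSON suits git, the sketch below builds a one-point GeoJSON file with Python's standard `json` module. The coordinates and property values are made up, and `to_geojson_text` is a helper name of our own. Writing with indentation and sorted keys is our suggestion rather than a GeoJSON requirement: it puts each value on its own line, so a later edit shows up as a small, readable diff.

```python
import json

# A made-up FeatureCollection with a single point (longitude first, then latitude).
feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [-77.0365, 38.8977]},
            "properties": {"name": "Example point", "date": "2020-01-01"},
        }
    ],
}


def to_geojson_text(obj) -> str:
    """Serialize with a stable key order and one value per line."""
    return json.dumps(obj, indent=2, sort_keys=True) + "\n"


text = to_geojson_text(feature_collection)
print(text)
```

Saving `text` to a file ending in `.geojson` gives a single plain-text file that git can track line by line.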
 %% Cell type:markdown id: tags:
 <h1>In conclusion</h1>
    <font size="3">
 We are using git and GitLab with the text-based geospatial data format GeoJSON.
    <br>
    <h3>Frequently Asked Questions:</h3><br>
    <i>"But I have always used shapefiles. Is there some sort of disadvantage to switching?"</i><br>
 <ul>
 No. You should convert to GeoJSON. Besides, you can always convert back if you ever need to!
 <br>
 Also, unlike shapefiles, which have dependent files (.shx, .dbf, .prj) at risk of getting misplaced or left untracked, GeoJSON has only one file to track (.geojson). GeoJSON files are just text, so we can see any edits made to the files in GitLab as they are made. We can also easily manipulate the files using programming languages like Python or R! These are a few reasons we ask that any geospatial data you have be converted to GeoJSON before pushing it to the repository. Conversion is painless, and there is a Jupyter Notebook in this folder to help.
 </font>
 %% Cell type:markdown id: tags:
 <h3>Helpful Links:</h3>
 <h4>Git desktop client</h4>
 Free: https://www.sourcetreeapp.com
 Free demo/free for academics: https://www.git-tower.com
 <br>
 <br>
 <h4>GeoJSON Web Visualization</h4>
 GitHub has native visualization: https://github.com/paultag/dc/blob/master/coffee.geojson
 GitHub Gists (shareable links): https://gist.github.com/cageyjames/2dc545127f04b93858bd
 Editor/Visualizer: http://geojson.io/