Commit f5800cd9 authored by Wigal, Jacob (CIV)'s avatar Wigal, Jacob (CIV) :sparkles:

update branch

parent 8a61ceee
%% Cell type:markdown id: tags:
<h1><font size="10">Why we use git and gitlab...</font></h1>
%% Cell type:markdown id: tags:
<h2 align="left"> As a team, we need:</h2><br>
<font size="3">
<ul>
&nbsp;1. a way to store and share data and code;<br>
&nbsp;2. a collaborative space for testing new ideas; and,<br>
&nbsp;3. precise version control to avoid losing past code/data.<br>
</ul>
</font>
%% Cell type:markdown id: tags:
<font size="3">
Using <b>Git</b> and <b>GitLab</b> is one way to do these three things. Together, they can:<br>
<ul>
&nbsp;1. store and share data and code;<br>
&nbsp;2. enable users to see, edit, and comment on others' work without overwriting files; and,<br>
&nbsp;3. enable distributed version control and password protection of key files.<br>
<br>
</ul>
<b>But first... some quick definitions:</b> [Git](https://git-scm.com/) is a version control system (VCS) that lets people use simple commands to keep track of files and file changes. When you edit a file, git helps you keep track of exactly what changed, who changed it, and why. [GitLab](https://about.gitlab.com/) is a web platform that hosts git repositories on a server and provides a user interface showing files and tracked changes. It allows multiple users to easily view edits tracked by git. It also adds many more features like password protection and wikis. [GitHub](https://github.com/) is a website and online platform similar to GitLab, but its tools, integrations, privacy settings, and user controls differ.
</font>
%% Cell type:markdown id: tags:
<h2>Why we are using git</h2><br>
<font size="3">
We make changes to files all the time. Most changes we make (even mistakes) are not a big issue when a single person is in control of the file and its management. For example, when you write a report in Microsoft Word, you can pick up where you last saved the document. You can edit the file, save it, and if you see yourself make a mistake, you can "undo". You are fully aware of how your Word document has changed over time.<br>
Moving to a distributed, simultaneous file-sharing system with multiple users and organizations turns this workflow upside-down. In a distributed setting, there are a lot of different people who may want to edit your file. They may not know what changes you made in the past and may make changes that you do not notice. It is also easy for multiple people to work on the same file and make conflicting changes that do not work well together (e.g., edits to the same sentence or paragraph in a report). In worst-case situations, one person may overwrite a file and lose important information without knowing and with no possibility of an "undo". Even if changes are tracked, files may have long version histories that are nearly impossible to make sense of.
<i>This</i> is where git comes in.
In general, git forces users to be deliberate about how they change files and how new versions are tracked. Git tracks all changes people make to files. A recorded change to one or more files in git is called a <i>commit</i>. With every commit, git also requires you to write a message about the change you are making, called a <i>commit message</i>. The combination of commit and commit message helps explain what happened between each version of the same file. Everyone with access to the file can see every commit and commit message, as well as download or <i>revert</i> to old versions of the file at any time. Git also enables users to <i>pull</i> other people's changes into their files to get the most recent updates and to <i>jump forward</i> files that are out of date to newer versions.
In addition to forcing users to be more deliberate about file updates, every time a commit is made, git can produce an output called a <i>diff</i> that shows what edits were made. Diffs pinpoint the exact differences between file versions and provide a powerful way to understand version details and commit messages left by users.
Below is an example of a diff showing changes to a text file called index.md. This diff shows that changes to index.md were made on line 318. Here, text deletions are shown in <b><font color='red'>red</font></b> and text additions are shown in <b><font color='green'>green</font></b>. Specifically, this commit changed the word <b><font color='red'>upgrad</font>&hyphen;ing</b> to <b><font color='green'>updat</font>&hyphen;ing</b>. The diff is so precise that it identifies only the part of the word that changed, as the <b>ing</b> stays the same between commits!
</font>
%% Cell type:markdown id: tags:
![Clone Repo GitLab](https://docs.gitlab.com/ee/user/project/merge_requests/img/merge_request_diff_v12_2.png "GitLab diff")
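
The idea behind a diff can be sketched with Python's standard <code>difflib</code> module. This is only an illustration under our own assumptions: the sample sentence and the helper names <code>line_diff</code> and <code>changed_fragments</code> are hypothetical, and GitLab's own highlighting may use a different algorithm and split the word differently.

```python
import difflib


def line_diff(old_text, new_text, filename="index.md"):
    """Return a unified diff (line level) between two versions of a text file."""
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(
            old_lines,
            new_lines,
            fromfile=f"a/{filename}",
            tofile=f"b/{filename}",
        )
    )


def changed_fragments(old_word, new_word):
    """Return (removed, added) character fragments that differ between two words."""
    matcher = difflib.SequenceMatcher(None, old_word, new_word)
    fragments = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only the parts that were replaced, deleted, or inserted
            fragments.append((old_word[i1:i2], new_word[j1:j2]))
    return fragments


# Hypothetical sentence standing in for line 318 of index.md
old = "We recommend upgrading the package.\n"
new = "We recommend updating the package.\n"

print(line_diff(old, new))
print(changed_fragments("upgrading", "updating"))
```

The line-level output marks the old line with <code>-</code> and the new line with <code>+</code>, while the character-level pass leaves the shared <b>ing</b> ending out of the reported changes.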
%% Cell type:markdown id: tags:
<font size="3">
While the combination of deliberate and visual version control that git provides is important for file management across distributed teams, it comes with a cost. Git was originally designed to deal with code for software development, so filetypes associated with most programming languages work great in git. For example, git will track commits and diffs for files like...<br>
<ul>
<li>Matlab files (.m)</li>
<li>R scripts (.r)</li>
<li>Text files (.txt)</li>
<li>Python programs (.py)</li>
<li>HTML files (.html)</li>
<li>Tabular data files (.csv)</li>
<li>Non-Tabular, text-based data (e.g., .json)</li>
<li>and many more... </li>
</ul>
</font>
%% Cell type:markdown id: tags:
<font size="3">
<h align="left"> Git works really well with those files because they are <i><b>flat, text-based</b></i> formats that git reads in when someone makes a commit and can analyze when generating diffs. </h>
However, git <i><b>can't</b></i> read the contents of <i><b>non-text-based</b></i> files. In other words, when someone makes changes to files that <i>are not</i> text-based, git doesn't understand the diff. These filetypes include...
<ul>
<li>Rendered Documents (.pdf)</li>
<li>Binary images (e.g., .jpg)</li>
<li>Formatted word docs (.doc)</li>
<li>Tabular data with multiple sheets (.xlsx)</li>
<li>Binary data formats (e.g., .dbf)</li>
<li>and many more... </li>
</ul>
<i>If we make a commit with these files, we won't be able to see the diff!</i>
Below is an example of a diff not showing when committing files associated with a binary datatype called a <i>shapefile</i>. Here, we made changes to three binary files associated with a shapefile &hyphen; the database file (.dbf), the main file (.shp), and the index file (.shx). While the commit and commit message still work when using git to manage these binary files, <b>the diff cannot show what changed</b>. This means that there is still version control, but now the user is stuck trying to figure out the differences between the original file and this new version. The only information provided is the commit message, "Added point and changed date". For large files, this limited information may not be useful enough to track changes. It is often difficult to manage large, distributed projects without seeing the diffs.
</font>
%% Cell type:markdown id: tags:
![Clone Repo GitLab](https://elevationusvistttest.s3-us-west-1.amazonaws.com/shp+diff.png "GitLab diff shp")
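
One rough way to tell whether a file is likely to produce a readable diff is to look for NUL bytes near its start, which is similar in spirit to the heuristic git is commonly described as using. The sketch below is an assumption-laden illustration, not git's actual implementation: the function name <code>looks_binary</code>, the 8000-byte sample size, and the sample byte strings are all ours.

```python
def looks_binary(data: bytes, sample_size: int = 8000) -> bool:
    """Heuristic: treat content as binary if a NUL byte appears in the first sample_size bytes."""
    return b"\x00" in data[:sample_size]


# A small GeoJSON-style snippet: plain text, no NUL bytes
text_example = b'{"type": "Point", "coordinates": [-121.9, 36.6]}\n'

# Illustrative start of a .shp header: file code 9994 as a big-endian integer,
# followed by unused bytes that are all zero
binary_example = (9994).to_bytes(4, "big") + bytes(20)

print(looks_binary(text_example))
print(looks_binary(binary_example))
```

To check a real file, the same function can be applied to the bytes read with <code>open(path, "rb").read(8000)</code>.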
%% Cell type:markdown id: tags:
<h2>To generate diffs, we should only use text-based files...</h2>
<h2>...but, what does this mean for us?</h2><br>
%% Cell type:markdown id: tags:
<h2>Storing and Sharing Code</h2>
<p>
<font size="3">
In general, <b>git will work with any computer code we write in any format.</b> Git was designed for software development, and works great with essentially any program we write from scratch. It doesn't matter if you write your program in Python, C, C++, C#, Java, JavaScript, R, MATLAB, or any other language. Source code is essentially always text-based. For special files that may be required to run something, git has built-in functionality to help manage these issues.
</font>
</p>
%% Cell type:markdown id: tags:
<h2>Storing and Sharing Data</h2>
<p>
<font size="3">
In contrast to code, <b>git is not designed for sharing data. However, if you follow some simple rules, git still works great.</b> The two biggest concerns when using git to manage data are limiting data file size and choosing filetypes that generate diffs.
<b>File Size:</b> When storing code and data on GitLab or GitHub, the files are stored in <i>repositories</i>. Because git, GitLab, and GitHub were designed for managing text-based code (which often has very small file size), git repositories tend to have limitations on their total storage. However, there is generally no limitation on the number of repositories one can have. So, this issue is usually circumvented by keeping each repository as small as possible. When using git, it's important to know the maximum storage capacity of a repository, and to try to segment projects and data into separate repositories when possible.
<br><br>
<b>Data File Type:</b> While software development is done almost exclusively with flat, text-based files, data is often stored in non-flat, non-text-based files. A common example of a non-flat file would be an Excel file with functions and data stored across multiple tabs. A common example of non-text-based data is the set of binary shapefile components shown above.
<b>To utilize git for data management, we have to use flat, text-based data filetypes whenever possible.</b> In general, this just means choosing an appropriate filetype for a particular kind of data and being deliberate about using that filetype for all commits. To demonstrate what we mean, we will focus on an important type of data we want to share that comes in many filetypes &ndash; geospatial data.
</p>
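
For the file size concern described above, a quick check before committing can flag files that might push a repository toward its storage limit. This is a minimal sketch: the helper name <code>find_large_files</code> is hypothetical, and the threshold is whatever limit applies to your repository, which is not stated here.

```python
import os


def find_large_files(root, max_bytes):
    """Return (path, size) pairs for files under root larger than max_bytes, largest first."""
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git's internal storage directory so only working files are reported
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # e.g. a broken symbolic link; ignore it
            if size > max_bytes:
                large.append((path, size))
    return sorted(large, key=lambda item: item[1], reverse=True)
```

For example, <code>find_large_files(".", 10 * 1024 * 1024)</code> would list files larger than 10 MiB in the current folder, where 10 MiB is only an example threshold.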
<h3>Geospatial data has text-based formats and non-text-based formats:</h3>
<font color='red'>NOTE TO JAKE: Please add a paragraph describing geospatial data (e.g., what it includes, attributes, shapes, calculations, crs / projection information, etc.). Then introduce the table below. </font>
We list several possible filetypes for geospatial data in the table below:</p>
<table><font color="red">
<tr> <th>Text-based</th> <th>Non-text-based</th> </tr>
<tr> <td><b>.geojson</b></td> <td>.shp</td> </tr>
<tr> <td>.wkt</td> <td>.gpkg</td> </tr>
<tr> <td>.gml</td> <td>.kmz</td> </tr>
<tr> <td>.kml</td> <td></td> </tr>
</font> </table>
<p>
Notice that there are binary filetypes like shapefiles, geopackages, and kmz (a zipped kml) <i>and</i> flat, text-based filetypes like geojson, wkt, gml, and kml. When we use git, if we commit and share geospatial data as one of these text-based filetypes, we will get useful diffs. We highlight geojson because this is the flat, text-based filetype we chose from this list to use. We promote the use of geojsons for a number of reasons.
</p>
</font>
%% Cell type:markdown id: tags:
<h2>GeoJSON</h2><br>
<font size="3">
We choose to use GeoJSON as our preferred file format for the following reasons: The GeoJSON format is a widely used standard for text-based geospatial data. It is an open format (published as RFC 7946), so it is not tied to a particular operating system or vendor and is supported by most GIS software. It is built on top of the already well-established JSON format, and has a number of helpful supporting packages that can be employed in programming languages such as Python and R.
</font>
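
Because GeoJSON is plain JSON, a small file can be built and written with Python's standard <code>json</code> module alone. The sketch below is illustrative only: the helper names, coordinates, and property values are made up. Writing with a fixed indent and sorted keys is one way to keep the text stable between commits so that diffs show only real changes.

```python
import json


def point_feature(lon, lat, properties):
    """Build a GeoJSON Feature holding a single point (coordinates are longitude, latitude)."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": properties,
    }


def to_geojson_text(features):
    """Serialize features as a FeatureCollection with stable, diff-friendly formatting."""
    collection = {"type": "FeatureCollection", "features": features}
    return json.dumps(collection, indent=2, sort_keys=True) + "\n"


# Hypothetical example data
features = [point_feature(-121.9, 36.6, {"name": "Site A", "date": "2020-01-15"})]
geojson_text = to_geojson_text(features)
print(geojson_text)
```

Saving <code>geojson_text</code> to a file ending in <code>.geojson</code> gives a single text file that git can diff line by line.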
%% Cell type:markdown id: tags:
<h1>In conclusion</h1><br>
<font size="3">
We are using git and GitLab with the text-based geospatial data format GeoJSON.
<br>
<br>
<h3>Frequently Asked Questions:</h3><br>
<i>"But I have always used shapefiles. Is there some sort of disadvantage?"</i><br>
<ul>
No, and you should convert to GeoJSON. Besides, you can always convert back if you ever need to!
<br>
Also, unlike shapefiles, which have dependent files (.shx, .dbf, .prj) at risk of getting misplaced or left untracked, with GeoJSON there is only one file to track (.geojson). GeoJSON files are just text, so we can see any edits made to the files in GitLab as they are made. We can also easily manipulate the files using programming languages like Python or R! These are a few reasons we ask that any geospatial data you have be converted to GeoJSON before pushing it to the repository. Conversion is painless, and there is a Jupyter Notebook in this folder to help.
</font>
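
To illustrate the dependent-file problem mentioned above, the sketch below checks a list of file names for the companion files of a shapefile. The function name <code>missing_sidecars</code> and the example file names are hypothetical, and the extension list only covers the three companions named in this document.

```python
from pathlib import PurePath

# Companion files named in this document; a real shapefile may have others as well
SIDECAR_EXTENSIONS = (".shx", ".dbf", ".prj")


def missing_sidecars(shp_name, tracked_names):
    """Return the companion extensions of shp_name that are absent from tracked_names."""
    stem = str(PurePath(shp_name).with_suffix(""))
    # Normalize names the same way on both sides and compare case-insensitively
    tracked = {str(PurePath(name)).lower() for name in tracked_names}
    return [ext for ext in SIDECAR_EXTENSIONS if (stem + ext).lower() not in tracked]


# Hypothetical list of files tracked in a repository
tracked = ["data/roads.shp", "data/roads.dbf", "data/sites.geojson"]
print(missing_sidecars("data/roads.shp", tracked))
```

With a single <code>.geojson</code> file there is nothing equivalent to check.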
%% Cell type:markdown id: tags:
<h3>Helpful Links:</h3>
<br>
<h4>Git desktop client</h4>
Free: https://www.sourcetreeapp.com
Free demo/free for academics: https://www.git-tower.com
<br>
<br>
<h4>GeoJSON Web Visualization</h4>
GitHub has native visualization: https://github.com/paultag/dc/blob/master/coffee.geojson
GitHub Gists (shareable links): https://gist.github.com/cageyjames/2dc545127f04b93858bd
Editor/Visualizer: http://geojson.io/