
Commit 909e9098 authored by Wigal, Jacob (CIV)

added geospatial data and text-based file sections

parent 831b3a2b
......@@ -78,13 +78,15 @@
"metadata": {},
"source": [
"<font size=\"3\">\n",
"Git was originally designed to deal with code for software development, so filetypes associated with most programming languages work great in git. For example, git will track commits and diffs for files like...<br>\n",
"Diffs are only visible with git for certain filetypes. Git was originally designed to deal with code for software development, so filetypes associated with most programming languages work great in git. For example, git will track commits and diffs for files like...<br>\n",
" <ul>\n",
" <li>Matlab files (.m)</li>\n",
" <li>R scripts (.r)</li>\n",
" <li>Text files (.txt)</li>\n",
" <li>Python programs (.py)</li>\n",
" <li>HTML files (.html)</li>\n",
" <li>Tabular data files (.csv)</li>\n",
" <li>Non-Tabular, text-based data (e.g., .json)</li>\n",
" <li>and many more... </li>\n",
"</ul> \n",
"</font>"
......@@ -95,25 +97,21 @@
"metadata": {},
"source": [
"<font size=\"3\">\n",
" <h align=\"left\"> Git works really well with those files because they are <i><b>flat, text-based</b></i> formats that git easily reads in when someone makes a commit and understands when analyzing diffs. </h>\n",
" However, git <i><b>can't</b></i> read the contents of <i><b>non-text-based</b></i> files. In other words, when someone makes changes to files that <i>are not</i> text-based, git doesn't understand the diff. These filetypes include...\n",
" <h align=\"left\"> Git works really well with those files because they are <i><b>flat, text-based</b></i> formats that git easily reads when someone makes a commit and can analyze when generating diffs. </h>\n",
" However, git <i><b>can't</b></i> read the contents of <i><b>non-text-based</b></i> files. In other words, when someone makes changes to files that are <i>not</i> text-based, git doesn't understand the diff. These filetypes include...\n",
" <ul>\n",
" <li>Rendered Documents (.pdf)</li>\n",
" <li>Binary images (.jpg)</li>\n",
" <li>Formatted word docs (.doc)</li>\n",
" <li>Tabular data with multiple sheets (.xlsx)</li>\n",
" <li>Binary data formats (e.g., .dbf)</li>\n",
" <li>and many more... </li>\n",
"</ul> \n",
"\n",
"<i>If we make a commit with these files, we won't be able to see the diff!</i>\n",
" <br>\n",
" <br>\n",
"Geospatial data has text-based formats and non-text-based formats.\n",
" \n",
" <table><font color=\"red\"> <tr> <th>Text-based</th> <th>Non-text-based</th> </tr> <tr> <td><b>.geojson</b></td> <td>.shp</td> </tr> <tr> <td>.wkt</td> <td>.gpkg</td> </tr><tr> <td>.gml</td> <td>.kml/.kmz</td> </tr> </font> </table>\n",
" \n",
"\n",
" <br>\n",
"Below is an example of a diff showing changes someone made to a shape file, a non-text-based format. Notice we can see the commit message, but there is no way to know what point was added, or what date was changed without doing some serious searching in QGIS or ArcGIS.\n",
"Below is an example of a diff generated after committing files associated with a binary datatype called a <i>shapefile</i>. Here, we made changes to three binary files associated with a shapefile &ndash; the database file (.dbf), the main file (.shp), and the index file (.shx). Each file can be committed, and the user can move freely through the commit history. However, <b>the diff cannot show what changed</b>. The user is left to work out the differences between the original file and the new version. The only information provided is the commit message, \"Added point and changed date\". For large files, this limited information may not be enough to track changes. It is often difficult to manage large, distributed projects without seeing the diffs.\n",
" </font>"
]
},
......@@ -128,7 +126,52 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>GeoJSON</h2>\n",
"<h2>What text-based files should we use?</h2>\n",
"<br>\n",
"<h3>Storing and Sharing Code</h3>\n",
"<p>\n",
"<font size=\"3\">\n",
"In general, <b>git will work with any computer code we write in any format.</b> Git was designed for software development, and works great with essentially any program we write from scratch. It doesn't matter if you write your program in Python, C, C++, C#, Java, JavaScript, R, MATLAB, or any other language. All programs are essentially text-based. For special files that may be required to run something, git has built-in functionality to help manage these issues.\n",
"</font>\n",
"</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Storing and Sharing Data</h3>\n",
"\n",
"<p>\n",
"<font size=\"3\">\n",
"The two biggest constraints when using git to manage data are file size and filetype: repositories have limited storage, and only text-based filetypes generate diffs.\n",
"<br><br>\n",
"\n",
"<b>Data File Size:</b> When storing code and data on GitLab, the files are stored in <i>repositories</i>. Because git and GitLab were designed for managing text-based code (which often has very small file size), git repositories tend to have limitations on their total storage. However, there is generally no limitation on the number of repositories one can have. So, this issue is usually circumvented by keeping each repository as small as possible. When using git, it's important to know the maximum storage capacity of a repository, and to try to segment projects and data into separate repositories when possible.\n",
"<br><br>\n",
"\n",
"<b>Data Filetype:</b> While software development is done almost exclusively with flat, text-based files, data is often stored in non-flat, non-text-based files. A common example of a non-flat file is an Excel file with functions and data stored across multiple tabs. A common example of non-text-based data is the set of binary shapefiles shown above. \n",
"<br>\n",
" <ul>\n",
"<b>To utilize git for data management, we have to use flat, text-based data filetypes whenever possible.</b> In general, this just means choosing an appropriate filetype for a particular kind of data and being deliberate about using that filetype for all commits. To demonstrate what we mean, we will focus on an important type of data we want to share that comes in many filetypes &ndash; geospatial data.\n",
" </ul> </p>\n",
"<br>\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>Geospatial data</h2>\n",
"<p>Geospatial data describes the position of points, lines, polygons, and/or images on the earth as well as their coordinate-linked attributes. Geospatial data typically includes geometric features, coordinates of those features, the coordinate reference system and projection those coordinates should be read in, and lists of coordinate-linked data visible in an attributes table. <br><br>Geospatial data has text-based and non-text-based formats:\n",
" <table><font color=\"red\"> <tr> <th>Text-based</th> <th>Non-text-based</th> </tr> <tr> <td><b>.geojson</b></td> <td>.shp</td> </tr> <tr> <td>.wkt</td> <td>.gpkg</td> </tr><tr> <td>.gml</td> <td>.kml/.kmz</td> </tr> </font> </table>\n",
" <br>\n",
" <p>\n",
"Notice that there are binary filetypes like shapefiles, geopackages, and kml/kmz <i>and</i> flat, text-based filetypes like geojson, wkt, and gml. When we use git, if we commit and share geospatial data as one of these text-based filetypes, we will get useful diffs. We highlight GeoJSON because it is the flat, text-based filetype we chose to use from this list. We chose GeoJSON for a number of reasons.\n",
"</p>\n",
"<br>\n",
"<h3>GeoJSON</h3>\n",
"<font size=\"3\">\n",
"The GeoJSON format is the standard for text-based geospatial data formats. It is an open format, meaning it will work on most operating systems and GIS software. It is built on top of the already well-established JSON format, and has a number of helpful supporting packages that can be employed in programming languages such as Python and R.\n",
" </font>"
......@@ -144,7 +187,8 @@
"We are using git and GitLab with the text-based geospatial data format GeoJSON.\n",
" <br>\n",
" <br>\n",
" <i>\"But I have always used shape files. Is there some sort of disadvantage?\"</i><br>\n",
" <h3>Frequently Asked Questions:</h3><br> \n",
" <i>\"But I have always used shapefiles. Is there some sort of disadvantage?\"</i><br>\n",
"\n",
"<ul>\n",
"No. You should convert to GeoJSON. Besides, you can always convert back if you ever need to!\n",
......
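
The notebook's claim that text-based geospatial files produce readable diffs can be illustrated with a small sketch. It is not part of the commit. It uses only the Python standard library, and the file name `points.geojson`, the coordinates, the dates, and the helper names are all hypothetical. `difflib` stands in for git's own diff here, so the output only approximates what `git diff` would print.

```python
import difflib
import json


def feature(lon, lat, date):
    """Build a GeoJSON Point feature (hypothetical example record)."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"date": date},
    }


def to_text(features):
    """Serialize features as an indented GeoJSON FeatureCollection.

    Stable key order and one value per line keep line-based diffs small.
    """
    collection = {"type": "FeatureCollection", "features": features}
    return json.dumps(collection, indent=2, sort_keys=True)


def text_diff(old, new, name="points.geojson"):
    """Return a unified diff of two text versions, similar to `git diff`."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="a/" + name,
            tofile="b/" + name,
            lineterm="",
        )
    )


# Version 1: a single point. Version 2: the date is changed and a point is added.
before = to_text([feature(-121.9, 36.6, "2020-01-01")])
after = to_text(
    [feature(-121.9, 36.6, "2020-02-01"), feature(-122.0, 36.7, "2020-02-01")]
)

diff_lines = text_diff(before, after)
print("\n".join(diff_lines))
```

Unlike the shapefile diff described above, the printed diff marks the changed date and the new coordinates line by line, so a reviewer can see what "Added point and changed date" actually touched.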