Why we use git and GitLab

As a team, we need:

  1. a way to store and share data and code;
  2. a collaborative space for testing new ideas; and,
  3. precise version control to avoid losing past data/code.

Using git and GitLab is one way to do these three things. Together, they can:

  1. store and share data and code;
  2. enable users to see, edit, and comment on each other’s work without overwriting files; and,
  3. enable distributed version control and password protection of key files.

But first… some quick definitions: Git is a version control system (VCS) that enables people to use simple commands to keep track of files and file changes. When you edit a file, git helps you keep track of exactly what changed, who changed it, and why. GitLab is a web platform that hosts git repositories on a server and provides a user interface for viewing files and tracked changes. GitLab also has many more features, like password protection and wikis. (GitHub, which you may have heard of before, is another online platform similar to GitLab.)

Why we are using git

We make changes to files all the time. Most changes we make (even mistakes) are not a big issue when a single person is in control of the file and its management. For example, when you write a report in Microsoft Word, you can pick up where you last saved the document. You can edit the file, save it, and if you see yourself make a mistake, you can “undo”. You are fully aware of how your Word document has changed over time.

Moving to a file-sharing system with multiple users and organizations reveals some key problems with the familiar workflow presented above. In a distributed, simultaneous setting, there are a lot of different people who may want to edit your file. They may not know what edits you made in the past and they may make significant edits to your file that you may not notice or understand. It is also likely multiple people could work on the same file and make conflicting changes that do not work well together (e.g., simultaneous edits to the same sentence or paragraph in a report). In worst-case situations, one person may overwrite a file and lose important information without knowing and with no power to “undo”.

This is where git comes in.

In general, git forces users to be deliberate about the file editing process by tracking all changes people make to files. Git tracks files by creating a snapshot of a file (or multiple files), called a commit, every time a user runs a specific command. With every commit, git also requires the user to write a short description of the change, called a commit message. Together, the commits and commit messages explain what happened between each version of the same file. Everyone with access to the file can see every commit and commit message, as well as download or revert to old versions of the file at any time. Git also enables users to pull other people’s changes into their files (i.e., update their files with newer versions).
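To make the idea of commits concrete, here is a minimal Python sketch of a history made of snapshots and messages. It is a toy model for illustration only, not how git is implemented; the class names, author names, and file contents are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Commit:
    """One snapshot of the tracked files, plus who made it and why."""
    author: str
    message: str
    snapshot: Dict[str, str]  # file name -> file contents


@dataclass
class ToyRepository:
    """Hypothetical, simplified stand-in for a git repository."""
    history: List[Commit] = field(default_factory=list)

    def commit(self, author: str, message: str, files: Dict[str, str]) -> Commit:
        # Refuse to record a snapshot that comes without an explanation.
        if not message.strip():
            raise ValueError("a commit message is required")
        new_commit = Commit(author, message, dict(files))
        self.history.append(new_commit)
        return new_commit

    def checkout(self, index: int) -> Dict[str, str]:
        # Return a copy of the files exactly as they were at that commit.
        return dict(self.history[index].snapshot)


repo = ToyRepository()
repo.commit("Ana", "Draft the report", {"report.txt": "First draft.\n"})
repo.commit("Ben", "Rewrite the opening", {"report.txt": "Second draft.\n"})

# Nothing is lost: the first version is still available after the second edit.
original = repo.checkout(0)
log = [(c.author, c.message) for c in repo.history]
print(original)
print(log)
```

The point of the sketch is that a second person's edit adds a new snapshot instead of replacing the old one, so earlier versions and the reasons behind each change stay available.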

Diffs

Git also automatically produces an output called a diff that shows the line-by-line edits made in each commit. Diffs pinpoint the exact differences between file versions and provide a powerful way to understand version details and commit messages left by users. Below is an example of a diff showing changes someone made to a text file called index.md. This diff shows us the updating_the_geo_nodes.md line was deleted and the version_specific_updates.md line was added. Notice how this diff identifies only the exact part of the line that changed, as the “ing” stays the same between commits!

[Figure: GitLab diff]
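A similar line-by-line comparison can be reproduced with Python's standard difflib module. This is only a sketch of what a diff looks like, not git's own diff code, and the file contents below are hypothetical stand-ins for index.md.

```python
import difflib

# Hypothetical before/after contents of index.md; the real file is not shown here.
old_version = [
    "Update guides\n",
    "updating_the_geo_nodes.md\n",
    "upgrading_from_source.md\n",
]
new_version = [
    "Update guides\n",
    "version_specific_updates.md\n",
    "upgrading_from_source.md\n",
]

# unified_diff yields header lines, then each line of the files with a prefix.
diff_lines = list(
    difflib.unified_diff(
        old_version,
        new_version,
        fromfile="a/index.md",
        tofile="b/index.md",
    )
)
print("".join(diff_lines), end="")
```

In the printed result, lines starting with "-" exist only in the old version, lines starting with "+" exist only in the new version, and lines starting with a space are unchanged context.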

Diffs are only visible with git for certain filetypes. Git was originally designed to deal with code for software development, so filetypes associated with most programming languages work great in git. For example, git will track commits and diffs for files like…

  • MATLAB files (.m)
  • R scripts (.r)
  • Text files (.txt)
  • Python programs (.py)
  • HTML files (.html)
  • Tabular data files (.csv)
  • Non-Tabular, text-based data (e.g., .json)
  • and many more…

Git works really well with those files because they are flat, text-based formats that git easily reads when someone makes a commit and can analyze when generating diffs. However, git can’t read the contents of non-text-based files. In other words, when someone makes changes to files that are not text-based, git doesn’t understand the diff. These filetypes include…

  • Rendered Documents (.pdf)
  • Binary images (.jpg)
  • Formatted word docs (.doc)
  • Tabular data with multiple sheets (.xlsx)
  • Binary data formats (e.g., .dbf)
  • and many more…

If we make a commit with these files, we won’t be able to see the diff! Below is an example of a diff generated after committing files associated with a binary datatype called a shapefile. Here, we made changes to three binary files associated with a shapefile: the database file (.dbf), the main file (.shp), and the index file (.shx). Each file can still be committed, and the user can move freely through the commit history. However, the diff cannot show what changed, so the user is left to figure out the differences between the original file and the new version. The only information provided is the commit message, “Added point and changed date”. For large files, this limited information may not be enough to track changes. It is often difficult to manage large, distributed projects without seeing the diffs.

[Figure: GitLab diff for the shapefile (.shp) commit]
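One rough way to tell in advance whether a file will produce a readable diff is to check whether it looks like text. The Python sketch below flags a file as binary when a NUL byte appears near its start. Git's own check is similar in spirit, but the 8000-byte window, the function name, and the sample files here are assumptions made for illustration, and the fake .dbf bytes are not a valid database file.

```python
from pathlib import Path
import tempfile

SAMPLE_BYTES = 8000  # how much of the file to inspect; an assumption of this sketch


def looks_like_text(path: Path) -> bool:
    """Guess whether a file is text by checking its first bytes for a NUL byte."""
    with path.open("rb") as handle:
        sample = handle.read(SAMPLE_BYTES)
    return b"\x00" not in sample


with tempfile.TemporaryDirectory() as folder:
    # A small, hypothetical CSV file: plain text, so it should diff well.
    csv_file = Path(folder) / "sites.csv"
    csv_file.write_text("site,lat,lon\nA,45.1,-122.3\n", encoding="utf-8")

    # Made-up binary content standing in for a .dbf file.
    binary_file = Path(folder) / "sites.dbf"
    binary_file.write_bytes(b"\x03\x00\x00\x01" + bytes(28))

    csv_is_text = looks_like_text(csv_file)
    dbf_is_text = looks_like_text(binary_file)

print(csv_is_text, dbf_is_text)
```

A check like this is only a heuristic: a file can pass it and still be awkward to diff, for example a very long single-line text file.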

What text-based filetypes should we use?

Storing and Sharing Code

In general, git will work with any computer code we write in any format. Git was designed for software development, and works great with essentially any program we write from scratch. It doesn’t matter if you write your program in Python, C, C++, C#, Java, JavaScript, R, MATLAB, or any other language. All programs are essentially text-based. For special files that may be required to run something, git has built-in functionality to help manage these issues.

Storing and Sharing Data

The two biggest constraints when using git to manage data are that repositories have limited storage and that only text-based filetypes generate diffs.

Data File Size: When storing code and data on GitLab, the files are stored in repositories. Because git and GitLab were designed for managing text-based code (which often has very small file size), git repositories tend to have limitations on their total storage. However, there is generally no limitation on the number of repositories one can have. So, this issue is usually circumvented by keeping each repository as small as possible. When using git, it’s important to know the maximum storage capacity of a repository, and to try to segment projects and data into separate repositories when possible.
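A quick way to keep an eye on repository size is to add up the files in the project folder before committing. The Python sketch below does that; the 10 GiB limit and the function names are hypothetical, so check your own GitLab instance for the real storage limit. It also measures only the current files, not the stored history, so the real repository can be larger.

```python
from pathlib import Path
import tempfile

# Hypothetical limit; check your GitLab instance for the real value.
MAX_REPO_BYTES = 10 * 1024**3


def tracked_size_bytes(repo_root: Path) -> int:
    """Sum the size of every file under repo_root, skipping git's own .git folder."""
    total = 0
    for path in repo_root.rglob("*"):
        if ".git" in path.relative_to(repo_root).parts:
            continue
        if path.is_file():
            total += path.stat().st_size
    return total


def fits_in_repository(repo_root: Path, limit: int = MAX_REPO_BYTES) -> bool:
    """Return True when the working files are within the storage limit."""
    return tracked_size_bytes(repo_root) <= limit


# Small demonstration on a throwaway folder with made-up files.
with tempfile.TemporaryDirectory() as folder:
    root = Path(folder)
    (root / "data.csv").write_bytes(b"x" * 1000)
    (root / ".git").mkdir()
    (root / ".git" / "index").write_bytes(b"y" * 500)

    demo_size = tracked_size_bytes(root)
    demo_fits = fits_in_repository(root, limit=2000)

print(demo_size, demo_fits)
```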

Data Filetype: While software development is done almost exclusively with flat, text-based files, data is often stored in non-flat, non-text-based files. A common example of a non-flat file is an Excel file with functions and data stored across multiple tabs. A common example of non-text-based data is the binary shapefile shown above.

To use git for data management, we have to use flat, text-based data filetypes whenever possible. In general, this just means choosing an appropriate filetype for a particular kind of data and being deliberate about using that filetype for all commits. To demonstrate what we mean, we will focus on an important type of data we want to share that comes in many filetypes: geospatial data.

Geospatial data

Geospatial data describes the position of points, lines, polygons, and/or images on the earth as well as their coordinate-linked attributes. Geospatial data typically includes geometric features, coordinates of those features, the coordinate reference system and projection those coordinates should be read in, and lists of coordinate-linked data visible in an attributes table. Geospatial data has text-based and non-text-based formats:

==========  ==============
Text-based  Non-text-based
==========  ==============
.geojson    .shp
.wkt        .gpkg
.gml        .kml/.kmz
==========  ==============

Notice that there are binary filetypes like shapefiles, geopackages, and kml/kmz, and flat, text-based filetypes like geojson, wkt, and gml. (Strictly speaking, a plain .kml file is XML text, but it is often shared as a zip-compressed .kmz file, which is binary.) When we use git, if we commit and share geospatial data as one of these text-based filetypes, we will get useful diffs. We highlight GeoJSON because it is the flat, text-based filetype we chose to use from this list. We chose GeoJSON for a number of reasons.

GeoJSON

GeoJSON is a widely used open standard (RFC 7946) for text-based geospatial data. GeoJSON files are just text, so we can see any edits made to the files in GitLab as they are made. It is an open format, meaning it works on all operating systems and with most GIS software. It is built on top of the already well-established JSON format, and has a number of helpful supporting packages in programming languages such as Python and R. An important distinction between GeoJSON and other geospatial data formats is that the coordinate reference system (CRS) of every GeoJSON file is WGS 84 (EPSG:4326). This means any file you convert to GeoJSON should have its coordinates reprojected to EPSG:4326 (most conversion tools can do this for you), which gives everyone a common CRS to work in. These are some of the reasons we ask that any geospatial data you have be converted to GeoJSON before pushing it to GitLab. It is easy to convert geospatial data through either QGIS or the CID utility GeoConvert.
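To illustrate why GeoJSON works well with diffs, the Python sketch below writes a small GeoJSON document with the standard json module, changes one attribute, and lists the lines that differ. The site name, coordinates, dates, and the helper function are hypothetical; writing one value per line with sorted keys is a formatting choice of this sketch, not something GeoJSON requires.

```python
import json

# Hypothetical monitoring site; GeoJSON positions are [longitude, latitude].
feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [-122.68, 45.52]},
            "properties": {"name": "Site A", "visited": "2020-01-15"},
        }
    ],
}


def to_diff_friendly_geojson(data: dict) -> str:
    """Serialize with one value per line and a stable key order."""
    return json.dumps(data, indent=2, sort_keys=True) + "\n"


before = to_diff_friendly_geojson(feature_collection)

# Change a single attribute, then serialize again.
feature_collection["features"][0]["properties"]["visited"] = "2020-02-01"
after = to_diff_friendly_geojson(feature_collection)

# Both versions have the same number of lines, so compare them pairwise.
changed_lines = [
    (old.strip(), new.strip())
    for old, new in zip(before.splitlines(), after.splitlines())
    if old != new
]
print(changed_lines)
```

Because each value sits on its own line and the keys always appear in the same order, a one-attribute edit shows up as a one-line change, which is exactly what makes the diff in GitLab easy to read.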