Analytics tools for Windows

"Orange Electric" - orange clouds in a Winston Salem sunset
"Orange Electric" (c) 2014 Francois Dion

abstract: in this article, we will cover some of the better analytics tools for Windows. If you use Windows as your desktop operating system, you might have read about a lot of different tools that people are using on Mac or Linux, and you'd like to run them on Windows. Or perhaps you transitioned from a company using Apple or Linux (or BSD/*nix) laptops to one using Windows laptops, and you want to get up to speed with some of the tools and find out what works. Either way, you've come to the right place.

From the cloud...


Back in 2016, Microsoft Azure's CTO stated that 25% of all deployments on Azure ran Linux. That was not surprising, since Linux was THE operating system for the other cloud vendors. In 2018, this increased to 50%, and by 2019 Linux was used in the clear majority of deployments on Azure (2020 has continued that trend).

But what about the end users of those systems? Especially power users, developers, analysts, data scientists? The landscape is a bit more varied.

Operating system of my readers

After getting a few requests through my LinkedIn account to talk about Windows analytics tools, I got curious about the breakdown of my blog's readership by operating system. How many were Linux? Where was Windows? macOS? Mobile?

I loaded the data for the past year (12 months) and this is the breakdown I got:


Not as smooth a distribution as I was expecting, and definitely not the distribution of the whole population. But quite interesting, nonetheless. Let's combine Linux and Unix (which also includes BSD) as one (*nix), but keep Apple macOS separate:
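The regrouping described above can be sketched in a few lines of Python. This is only an illustration: the label names, the mapping and the visit counts are placeholders I made up for the example, not the actual traffic data of this blog.

```python
from collections import Counter

# Hypothetical mapping from raw OS labels to the combined group used in the text.
GROUPS = {"Linux": "*nix", "Unix": "*nix"}


def combine_os_counts(counts):
    """Merge Linux and Unix into '*nix'; leave every other label unchanged."""
    combined = Counter()
    for os_name, visits in counts.items():
        combined[GROUPS.get(os_name, os_name)] += visits
    return dict(combined)


def as_percentages(counts):
    """Convert raw counts to percentages of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Placeholder numbers for illustration only, NOT the real readership data.
sample = {"Windows": 30, "Linux": 20, "Unix": 10, "Macintosh": 40}
combined = combine_os_counts(sample)
print(as_percentages(combined))
```

Keeping the mapping in a separate dictionary makes it easy to try other groupings, for example folding macOS into the Unix-like group as well.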



Windows and Unix-like operating systems (Apple macOS excluded) are neck and neck, with about a 1% difference, while macOS has about 10% more. If you are one of those Windows users, keep on reading.


To the Desktop


I won't cover the usual suspects for analytics on Windows, such as the typical SQL query tools or Excel. Everybody uses Excel in some way or another (or a similar app like LibreOffice). But there are some Microsoft applications that do help quite a bit, are free and, in a few cases, are not available under Linux (unless you run something like VirtualBox).

One such application is Microsoft Power BI Desktop (install, documentation).

Microsoft Power BI Desktop

This is the equivalent of Excel for data visualizations, dashboards and reporting. It is free and fairly easy to use but, like Excel, it is not the best path to automation. Still, I've seen how anybody, even with no programming background, can make some decent explorations and reports with it. We have used it over the past several years with some good results. Recommended.

Knime

Another similar option is Knime. It actually does a bit more, as it is aimed at data scientists for end-to-end work (see also Orange, in the Python section below, and in my companion brochure to my SELF 2016 talk: The Hitchhiker's Guide to the Open Source Data Science Galaxy). Having said that, scripted pipelines are pretty much the way to go for any advanced research, analytics, extensive charts and reports, or data science work.

Whereas shell scripting on Mac or Linux means something like bash or xonsh, these are not available by default on Windows. But if you use Windows 10, here is something you might not know: you have some Linux services available to you. This makes a lot of sense if you are using Linux-based servers, since you gain experience with the same environment on your own computer. Of course, you could stick with just PowerShell, but do check out the following:

Windows Superpowers

(see also part III of my "ex-libris" series)

(If you are still on Windows 7, my condolences: this feature is not available, but you can still get bash for Windows by installing Git for Windows, which includes Git Bash.) In Windows 10, click on the Start menu, click Control Panel, then Programs, and you get the following screen:

Control Panel -> Programs


Under Programs and Features, there is a "Turn Windows features on or off" link. Click on it, then check "Windows Subsystem for Linux" (also check "Virtual Machine Platform" if you are planning to upgrade to WSL2) and click OK:



After rebooting, open the Microsoft Store and search for a Linux distribution you'd like to use to provide services like bash. Select it and click Get. For example, if I search for ubuntu, I get 3 different versions. Select the first one (it defaults to the latest) and click Get:


This is a download that is less than 500MB. Once downloaded, click on Launch. A window will open, and after a while, you will be asked for a username and password:


This window will provide a shell prompt. You can also open a new one by opening a regular Windows command prompt (cmd) and typing bash. Neat, huh?
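Once bash answers from the command prompt, it can also be driven from a script. Here is a minimal Python sketch, assuming bash is on the PATH (on Windows that could be the WSL launcher or Git Bash, whichever is found first). The helper name run_bash is mine, purely for illustration.

```python
import shutil
import subprocess


def run_bash(command):
    """Run a command string through bash and return its trimmed output.

    Returns None when no bash executable is found on the PATH.
    """
    bash = shutil.which("bash")
    if bash is None:
        return None
    result = subprocess.run(
        [bash, "-c", command],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


# Ask the Linux side (or whichever bash was found) to identify itself.
kernel = run_bash("uname -s")
print(kernel if kernel is not None else "bash is not available on this machine")
```

Returning None instead of raising keeps the same script usable on a machine where WSL has not been enabled yet.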

If you would like to use xonsh instead of bash, simply type at the bash prompt:

sudo apt update
sudo apt install xonsh -y

R and the Tidyverse

(see also part IV of my "ex-libris" series, under the section R)

Now that we've enabled shell scripting and expanded our choices from PowerShell to any shell that's available for Linux, we will look at the R programming language. Going from point-and-click tools, to SQL, to R is a natural evolution for analytics. R is made for data analysis and statistical computing (that's what it was invented for). If you've heard of Shiny apps or ggplot, you've heard of R.

In the "beginning" was the S programming language. Then R appeared, as an open source alternative to S. This version is available from CRAN. Later, Revolution Analytics offered their own version and were acquired by Microsoft. This led to several projects at Microsoft. One was Open R, along with MRAN (like CRAN, a repository, but with snapshot versioning). Finally, the latest evolution is R Client, a superset of Open R (technically a version of MRS, Microsoft R Server, running locally). On the server side, this is now called Microsoft Machine Learning Server but, as I mentioned, you can run it completely locally. R is also available on Microsoft SQL Server.

So, to recap, R comes as a command line interface, with the CRAN version and the Open R version staying mostly compatible, while the R Client version (recommended) extends this CLI with parallel versions of certain packages. If the command line is your thing, you can also get a fancier REPL instead of the plain R CLI, radian:

This is not a complete environment, however, and if you prefer a GUI, you'll most likely want a graphical IDE. The most popular and well known is RStudio.

RStudio

As can be seen in the above screenshot, RStudio is automatically using the Microsoft R Client that I installed (it is using multiple CPU cores). We install the most popular packages (through the meta package tidyverse) by typing in the R console window:

install.packages('tidyverse')

You are now ready to develop statistical applications, reports with R Markdown, visualizations with ggplot2, or with one of its extensions:



Another option for an R IDE is using Microsoft VS Code and installing the R Tools extension:

R Tools

One advantage of Microsoft VS Code is that it also supports WSL (Windows Subsystem for Linux), Docker, Azure and Python through different plugins, so you stay within the same IDE for all of these.

Jupyter, Python


Python is the most popular language for analytics and data science, very popular for web and backend work (see Django and Flask) and data engineering (see Airflow and Prefect), and the second most popular language overall after JavaScript (according to some surveys in 2020, while the TIOBE index puts it at #3). There is no avoiding it. And there are good reasons for its popularity: it is extremely easy to learn, has an enormous ecosystem (see pypi.org), runs on different processors, scales from tiny embedded systems to large scale clusters, and can be used to build small quick prototypes or large enterprise grade applications. Dion Research has developed some projects in the hundreds of thousands of lines of code, not counting documentation, and they are fairly easy to maintain.

Python has been around since the early 1990s. The official Python distribution for Windows can be downloaded as a CPython binary implementation from the official python.org website. Other Python implementations also exist to run on the JVM, on .NET, etc. But, on Windows, the most popular distribution for Python is Anaconda (recommended, choose the Python 3.7 64 bit graphical installer).

Once installed, you have Python 3.7 and conda to manage your packages. You'll also have a few start menu options:



The first menu option is the Anaconda Navigator (this doesn't show up in the menu on Linux but it can be started from the command line as anaconda-navigator). This is a great starting point. In fact, it's pretty much a one stop shop:

Anaconda Navigator

It has pretty much all the different items we've talked about so far (RStudio, VS Code), except for Microsoft Power BI Desktop, but Orange3 (click install, then launch) has similar features (and quite a bit more):

Orange 3

There is a launcher for Spyder. There is also a launcher for Microsoft VS Code, which we installed earlier, but we also need to add the Python support:

Python support for VS Code

Having said that, just like RStudio is the go-to IDE for R, PyCharm is the go-to IDE for Python. I use the professional version myself, but there is also a community edition, with fewer features. If you are learning, the community edition is fine. If you make a living in this field, buy the commercial version.

PyCharm

The nice thing with PyCharm is that it works with conda environments, virtualenv environments, or even Docker environments (see configuration). We will get back to this, but first, let's talk about two more launchers on the Anaconda Navigator: Jupyter Lab and Jupyter Notebooks (and Voila).

Jupyter notebook (using Python and stemgraphic)

Jupyter Notebook: "The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more." It supports R, Python and 40+ other languages. There are millions of notebooks available publicly, covering all fields of science and technology.

Jupyter Lab, like Notebook, but with multiple tabs and windows


Jupyter Lab: "JupyterLab is a web-based interactive development environment for Jupyter notebooks, code, and data. JupyterLab is flexible: configure and arrange the user interface to support a wide range of workflows in data science, scientific computing, and machine learning. JupyterLab is extensible and modular: write plugins that add new components and integrate with existing ones." In short, a mix between the Jupyter Notebook and an IDE.

Managing Environments


Anaconda Navigator also includes a tab to manage various environments. After the install, there is a base environment which includes many Python packages, at a specific version. However, it is best to create an environment for each new project, and add only the packages we need into it. This helps in testing, in deploying and allows going back to a working application from a code base, even many years later.

Clicking on the environment tab on the left, then (+) Create at the bottom, we are presented with the following popup dialog:

Create environment

You would type the name of your environment (use underscores instead of spaces) and select which version of Python you want to use. But wait, what's this R selection? That means you can create a conda environment that supports Python, R or even both! This is pretty convenient, and shows once more that the Anaconda distribution can be your one stop shop for all your analytics needs under Windows, just as it is under macOS or Linux.
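The same environment can also be created from a script instead of the dialog. The sketch below only builds the conda command line. The helper name and the example environment name are hypothetical, and I am assuming R is pulled in through the r-base conda package, which may depend on the channels you have configured.

```python
def conda_create_command(name, python_version=None, with_r=False, packages=()):
    """Build the argument list for a 'conda create' call.

    Spaces in the environment name are replaced by underscores,
    mirroring the advice given for the Navigator dialog.
    """
    cmd = ["conda", "create", "--yes", "--name", name.replace(" ", "_")]
    if python_version:
        cmd.append(f"python={python_version}")
    if with_r:
        # Assumption: R is provided by the 'r-base' conda package.
        cmd.append("r-base")
    cmd.extend(packages)
    return cmd


# Hypothetical project environment with both Python and R.
cmd = conda_create_command("sales analysis", python_version="3.7",
                           with_r=True, packages=["pandas"])
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run from an Anaconda prompt, or the printed line can simply be pasted into it.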

Once an environment is created, you can see what packages are installed, which ones are not installed, search for new packages, add them to your environment and they will be installed without having to compile anything:


Back to the cloud


Well, almost back to the cloud. With WSL installed on Windows, you can use all the Linux tools for the cloud, be it for Google Cloud Platform, Amazon Web Services, Microsoft Azure, etc.

For example, the Linux Azure CLI can be installed by opening a command prompt (Windows key + R, cmd, Enter), typing bash (or wsl) and then the following:

curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash

You can then easily integrate all of this from a collection of Python scripts, and even support multiple vendors.
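As a rough idea of what such a script could look like, here is a minimal Python sketch that shells out to the az command and parses its JSON output. It assumes az is on the PATH and that you are already logged in. The helper names az_json and group_names are mine, not part of the Azure CLI.

```python
import json
import shutil
import subprocess


def az_json(*args):
    """Run an az subcommand with JSON output and return the parsed result.

    Returns None if az is not installed or the command fails
    (for example when not logged in).
    """
    az = shutil.which("az")
    if az is None:
        return None
    result = subprocess.run(
        [az, *args, "--output", "json"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return json.loads(result.stdout)


def group_names(groups):
    """Extract the sorted names from a list of resource group records."""
    return sorted(group["name"] for group in groups)


groups = az_json("group", "list")
if groups is not None:
    print(group_names(groups))
```

A similar thin wrapper around another vendor's command line tool would let one script address several clouds.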

Of course, if you prefer running the Windows version through PowerShell, you can download and install it from here. Another interesting tool to have if you are an Azure customer is the Azure Cosmos emulator. It lets you run the Azure Cosmos DB service locally, without an Azure subscription and without incurring any fees. It is not 100% full-fledged, but it is a valuable development tool.

In conclusion, I'll leave you with a link to this interesting story: how Microsoft built their cross platform Azure CLI with Python.


Francois Dion
