Wednesday, March 23, 2016

The Time Machine: Baremetal Management is the new Backup Tape Changing

Before graduating high school, I’d been a paperboy, a bagboy, a dishwasher, facilities engineer for a ski resort (garbageboy), then moved up to rentals at that same ski resort. One of the primary reasons I picked the college I did was that it had a full-time job that I could take advantage of right away to earn some money, and more importantly, get experience in their computer department.

That first “professional” experience was helping connect the dorms to a brand-new Ethernet network, the first network connectivity the dorms ever had. Before that, all they had were VAX OpenVMS terminals. Speaking of the VAX, the bulk of my time was spent managing the VAX and its backups, as well as two brand-new DEC Alphas. At the time, making sure those backups were legit was my most important job. Those tapes fed a tape drive that was already an antique when I had to start watching over it. There were dailies and weeklies and monthly fulls; 100% of this was manual, and I did it from a line printer, not a monitor.



I could get called at any time of any day, and I had to restore any files accidentally deleted or corrupted on the old Winchester hard drives attached to the VAX. Imagine an irate professor working late into the night on a weekend to finish research and losing a file for whatever reason. I was the person who needed to respond and fix it immediately, without complaint. On one occasion the files were corrupted, and I spent an entire weekend unable to recover a professor’s culmination of a semester’s worth of chemistry research. I almost lost my job, almost lost the income to pay for tuition, almost lost any credibility in the CompSci department; pretty much almost lost everything I had worked for up until that weekend. This was one of the lowest points of my life and one of the turning points of my career, despite it basically being “undifferentiated heavy lifting”: I was still making minimum wage, had a timecard, and was at the bottom of the ladder career-wise. I bet that professor still remembers me, if only to wonder whether I've been run over by a truck.

Luckily, I didn’t lose my job and got a second chance. I picked up an additional job at a thermography company writing printer drivers for an AS/400. Again, this was basically the lowest rung of the ladder, taken to get experience, but this company’s core business was printing very elaborate wedding invitations, graduation announcements, and the like. Again, it was a job with the lowest pay and the highest responsibility, because if the AS/400 couldn’t print, the entire business was at a standstill. From those early days on, I knew, quite viscerally, that I could never get comfortable where I was.

Fast-forward to my career at Nutanix today. I talk to customers on a daily basis about running HPC, Big Data, and container workloads on baremetal. Don’t get me wrong, baremetal is worthy competition, as there is nothing so self-service as your own dedicated, brand-new hardware. Earlier in my career, I worked for Argonne National Laboratory, and there is no one in the world with a longer, more respectable track record for managing baremetal at scale than the Department of Energy labs. With a batch scheduler, or even a multi-framework distributed scheduling system like Mesos, baremetal becomes a distributed pool of compute. With HDFS or Elasticsearch or Cassandra, for example, baremetal becomes a distributed pool of persistent storage.

So why not just use baremetal for these workloads? Well, Hadoop, for example, is great at distributed resiliency; however, it does not manage the hardware for you. Sure, a drive can fail, nodes can fail, top-of-rack switches can fail, but does Hadoop recover failed hardware? Brand-new baremetal is great, but how long is it expected to last? What is the amortization and depreciation schedule? Just like driving a new car off the lot, hardware innovation driven by silicon and server vendors is still an ever-escalating competition, so by the time that fancy brand-new hardware is installed, it's already depreciating and may be rendered obsolete relatively quickly. The advent of “software-defined” has not slowed down that death march.

There have been many tools created to alleviate these concerns and make hardware management easier. Cobbler, Razor, and now RackHD, for example, are stabs in the right direction. Web-scale companies that maintain public clouds, or just a ton of infrastructure and services, like Facebook and Twitter, have built the tooling needed to scale their own hardware management efforts, but how is any of that composable or consumable outside of their respective platforms? Not to mention, there’s simplifying hardware compatibility, and then there’s trying to accommodate any hardware at all, where the rows and columns of the interoperability matrix represent an exponentially growing opportunity for issues. Where Nutanix really shines is the infrastructure, the tooling, and the team behind making this the platform for simplifying hardware management for myriad applications at scale.

These baremetal clusters running Hadoop or Mesos are truly responsible for the life-blood of the business, from its data to its second-by-second operation. But if you’re running on baremetal, to borrow from my early experiences changing tapes and tweaking printer drivers, you are still stuck spending time on the most menial part of the infrastructure. More value is derived from the systems and data built on the hardware than from the hardware itself, which should be no surprise, so why not spend more time there? To borrow from the H. G. Wells novel, you are dependent on the Morlocks, those tape-changing, baremetal-replacing denizens of the datacenter, to keep up their ceaseless yet thankless duties. Where I see customers able to take advantage of Nutanix is in shifting that time toward more fruitful pursuits, expanding their intelligence and their careers. Besides the full-time Morlocks, plenty of people get trapped into doing this part-time, beholden to esoteric troubleshooting of the nuances of hardware.

“There is no intelligence where there is no need of change.” - H.G. Wells, The Time Machine

I can imagine that if I had not tried to advance my career beyond changing tapes, I easily would not be where I am today. If I had been content swapping tapes and performing on-demand restores 24x7, I would have been miserable until I was obsolete. If I had been content configuring ‘bin’ files and configuring Symmetrix nights and weekends, I would have been miserable until I was obsolete. And so on with just building VMs and workflows for managing VMs and hypervisors.

“An animal perfectly in harmony with its environment is a perfect mechanism. Nature never appeals to intelligence until habit and instinct are useless. There is no intelligence where there is no change and no need of change. Only those animals partake of intelligence that have a huge variety of needs and dangers.”  - H.G. Wells, The Time Machine

Of course, this is nothing new from what AWS or other public clouds accomplish for their customers. How much hardware management do I have to do for my AWS usage? Absolutely zero. It has always been zero, and I expect it to always be zero. One of AWS’s secrets to success, in my opinion, is that it emulates the feeling of getting brand-new hardware all the time. If I want a new instance, it’s just like brand-new and only an API call away, cost-permitting of course.

Why turn very smart, very ambitious people into Morlocks by making your admins spend their critical career-time on provisioning, managing, and troubleshooting hardware? Instead, help them focus on the next generation of applications or analytics or programming frameworks that make them grow. Help them be heroes to their partners, or their teams, or maybe most importantly, to themselves.

“We should strive to welcome change and challenges, because they are what help us grow. Without them we grow weak like the Eloi in comfort and security. We need to constantly be challenging ourselves in order to strengthen our character and increase our intelligence.” - H.G. Wells, The Time Machine



Monday, January 11, 2016

Stay out of my way Nutanix

After six months as a specialist at Nutanix, the difference in the way I spend my time is significant. I spend more time focused on application platforms, and I’m able to dive into a Hadoop distro or Elasticsearch and their associated tools with customers any time I want. I don’t spend much time in Prism or working with storage. In fact, I spend as little time as possible touching any Nutanix settings, except to spin up new batches of apps.

Prism and AHV stay out of my way. What little time I do spend in Prism goes to cloning from a couple of templates, and then I’m done. I spend a little bit more time in Chef (it’s where I cut my teeth) and Saltstack (personal preference and speed). Having more time for this means I deploy platforms fast and can switch up my environment mix quickly and relatively easily. When it’s not easy, it means I’m learning something new about the application environment, which is great.

Besides things like Hadoop and Elasticsearch, I spend time working on platforms like Cloud Foundry and Kubernetes and Mesos. The nuanced differences, as well as the similarities, among these are really fascinating to watch and to work through with customers. This is what I want to spend my time on. With all the time I spend speaking with customers, this is where customer admins would rather spend their time as well. These are the platforms their developers and line-of-business owners are staking their companies on when they say, “We need to do something with all of this data” or “We need to change our apps faster.” It’s integral to their jobs that they understand what’s going on here and how these platforms are evolving.

I don’t spend a lot of time worrying about VM-centric management interfaces. The granularity is wrong for what I work on now, and I don’t need to account for any features in a virtualization layer. Of course, if you string enough artificial management and automation layers together, you can build apps. I know that. I’ve done that, but it takes away time I could be spending directly in my platform workflow.

I don’t worry about storage provisioning or allocation like LUNs or RAID groups, or arbitrary constructs like vSphere clusters or resource pools. I do, however, think about storage performance a lot differently. I can focus on scaling and sharding. I can focus on using intrinsic performance tools that help me have a performance dialogue with customers rather than just performance confrontations. I can ask more intelligent questions about the workload, the data, and how the data transformation pipeline works.

For example, one argument I’ve heard a lot is whether separating compute and storage so that they can be scaled independently is beneficial. One problem there is that it usually means you are always dealing with one or the other being a bottleneck, since it’s exceedingly rare that we can do capacity planning without any constraints. Also, when was the last time a workload didn’t need to actually pull or push any IO? The rate and variability of that IO are key to differentiating how data flows through any useful system. I would not have been as familiar with this in my specific areas of focus had I not been working for Nutanix. I can spend time in the application stack, looking at scaling, looking at the working set, and better understanding where exactly any given IO is landing, because I have that time. I’m not worrying about the virtualization layer or virtual infrastructure management that doesn’t help me learn more about the app platforms customers really care about.


In all, I feel well-rewarded working with Nutanix and the customers I speak with every day, around the world. I am reminded of working on AWS instances, since AWS couldn’t care less about forcing you to understand its virtualization layer, if you even know it has one. I use AWS or Nutanix and focus on what I need to build and what I want to learn about today, instead of saying, “I’d love to learn more about something like Spark machine learning or Kubernetes 1.1, but only after I am done getting all of this virtual infrastructure patched up properly.” I can also trace how this works and contrast something I did in AWS with a Nutanix cluster, because the management approach of both is very similar. Don’t bog me down with hardware management or virtual machine management. Let me build and learn, and stay out of my way.

Wednesday, October 21, 2015

Running Tutum across Nutanix Acropolis and AWS for Hybrid Cloud PaaS

Docker acquired Tutum today, and it’s something that I’ve been working with as I look at different PaaS models around the container ecosystem. I’ve linked my Tutum account (which is actually my Docker Hub account) to a Tutum-auth user inside my AWS account; you can set this up under "Account Settings" after you register:

Did you know it is also possible to use Tutum on Nutanix to quickly enable a hybrid cloud deployment? Once you log in to your Tutum account, go to the Nodes tab and click on "Bring your own node" to get the deployment string to use in the following steps.

  1. Let’s clone a few nodes (in this case, three to start) from our base Linux template; directions are in my blog here. If the clones also have a salt-minion deployed, we can use it to issue the install command to all of the nodes from our salt-master: salt 'tutumnode*' cmd.run 'curl -Ls https://get.tutum.co/ | sudo -H sh -s 9a...'
  2. Or, of course, we could log in to each node interactively and run the prep command: curl -Ls https://get.tutum.co/ | sudo -H sh -s 9a...



You should be able to see all of your nodes grab the Tutum agent and be recognized (as long as they are internet accessible) within the “Nodes” tab of your Tutum dashboard.
Now, if you’re new to Tutum, we can deploy our first “stack”, or collection of Dockerized services, to our nodes. The example given here is a Redis, web, and load-balancer stack: https://tutum.freshdesk.com/support/solutions/articles/5000583471
lb:
  image: tutum/haproxy
  links:
    - "web:web"
  ports:
    - "80:80"
  roles:
    - global
web:
  image: tutum/quickstart-python
  links:
    - "redis:redis"
  target_num_containers: 4
redis:
  image: tutum/redis
  environment:
    - REDIS_PASS=password
The collection of services will start on your Nutanix nodes, and you can seamlessly develop collections of services, or stacks, that can be deployed either to your on-prem Nutanix cluster or simultaneously and identically to AWS or your public cloud vendor. By default, the deployment strategy is emptiest_node, but you can deploy to all nodes, and perhaps in the future we could see specific availability-zone deployment strategies. Tell them! :-)
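As a sketch of overriding that default, my understanding is that the stackfile accepts a per-service deployment_strategy key; the key name and the exact value strings below are assumptions, so check Tutum's stackfile reference before relying on them:

```yaml
# Sketch: pin the web service's scheduling behavior per-service.
# 'deployment_strategy' and the value 'high_availability' are assumed
# from Tutum's docs -- verify against the current stackfile reference.
web:
  image: tutum/quickstart-python
  deployment_strategy: high_availability   # spread containers across nodes
  target_num_containers: 4
```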




With a Nutanix cluster and Acropolis/AHV, you can quickly spin up nodes for Tutum consumption and build a hybrid-cloud PaaS for development in just a few minutes. While running on Nutanix, these nodes will benefit from the same shared and accelerated pool of compute and storage, as well as infrastructure analytics and durable API accessibility, as any other platform projects like Hadoop, MongoDB, or an ELK stack from my blog here or from my colleague Ray here. More on Nutanix benefits for next-gen apps here. As always, if you have questions or recommendations on more integration, feel free to reach out on Twitter @vmwnelson.

Saltstack Setup for Nutanix on Acropolis

In the spirit of my recent posts around config management and orchestration tools, I’ve also seen several customers using Saltstack, and I want to show how straightforward it is to set up and use with Nutanix and the Acropolis Hypervisor (AHV). Saltstack is a powerful tool for deploying 'states', or idempotent (repeatably identical) sets of expected configuration criteria, to your VMs. Internally, Acropolis also uses Saltstack for our own security and config management. You can find help for creating a master image in my post here: http://virtual-hiking.blogspot.com/2015/10/acropolis-image-and-cloning-primer-for.html. With your baseline gold image, let’s first install our salt-master server:
  1. Create a clone from your gold image and set the master hostname and a static IP address. I’ll be using Ubuntu 14.04, but for other OS images, please use the relevant package manager.
  2. Make sure to register the salt master in DNS so that all of the worker nodes will be able to resolve it correctly. By default, the master expects to use the name ‘salt’ but this can be customized.
  3. Add the salt repo: add-apt-repository ppa:saltstack/salt
  4. Install the salt-master package: apt-get install salt-master -y
  5. Ensure you have the current hostname and salt-master key fingerprint ready to insert into your /etc/salt/minion file by running this command on the master and copying the output: salt-key -F master
Now we can prep a new worker template with the salt-minion pre-installed:
  1. Create a clone from your gold image; I’ll be using Ubuntu again, but for other OS images, please use the relevant package manager.
  2. Add the salt repo: add-apt-repository ppa:saltstack/salt
  3. Install the salt-minion package: apt-get install salt-minion -y
  4. Depending on whether you customized the salt-master hostname, either uncomment or replace the salt-master hostname and IP in the /etc/salt/minion config file:
  5. Add the salt-master key to the /etc/salt/minion config file:
  6. With the salt-minion pre-installed, make sure to remove /etc/salt/minion_id and any other minion identification files: rm /etc/salt/minion_id; rm /etc/salt/minion.*
  7. Shut down the salt-minion template.
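For steps 4 and 5 above, the relevant lines in /etc/salt/minion look roughly like the following sketch; the hostname and fingerprint are placeholders, so substitute your own master's name and the output you copied from salt-key -F master:

```yaml
# /etc/salt/minion -- sketch only; both values below are placeholders
master: saltmaster.example.com      # or 'salt' if you kept the default name
master_finger: 'a1:b2:c3:...'       # fingerprint copied from `salt-key -F master`
```

The master_finger option lets each minion verify it is talking to the right master before exchanging keys.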
Now, after cloning (recommendations here), the VMs will power on, grab their hostname from DNS/DHCP, and create a new minion-id that registers with the salt-master. You can accept the new salt-minions en masse from the salt-master with salt-key -A -y, and then they will be ready to apply formulas. Other options for bootstrapping minions include preseeding the keys on the master: https://docs.saltstack.com/en/latest/topics/tutorials/preseed_key.html
Also, you have the option of disabling the authentication step, with the necessary "only do this if you know what you're doing" caveats, by editing /etc/salt/master:
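As a sketch, the line to change in /etc/salt/master would look something like this (auto_accept is the master option I have in mind, but confirm against the Salt docs for your version before turning it on):

```yaml
# /etc/salt/master -- only do this if you know what you're doing
auto_accept: True    # new minion keys are accepted without running `salt-key -A`
```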
Finally, you also have the option of just using SSH via the salt-ssh package for an agentless (Ansible-like?) deployment: https://docs.saltstack.com/en/latest/topics/ssh/. For this to work, you will need to enable passwordless SSH, which I described preparing for here.

For next steps, you could use Salt to deploy some sample workloads like vim or nginx:
https://docs.saltstack.com/en/latest/topics/tutorials/walkthrough.html#the-first-sls-formula
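To make that concrete, a first formula can be as small as a single init.sls. The sketch below (under an assumed path of /srv/salt/nginx/init.sls) installs the nginx package and keeps its service running:

```yaml
# /srv/salt/nginx/init.sls -- minimal sketch of a first formula
nginx:
  pkg.installed: []
  service.running:
    - require:
      - pkg: nginx
```

Apply it to a minion with: salt 'vm_name' state.sls nginx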

...


And you can find more example formulas on GitHub to work with and modify to suit your intended environment:

If you want a quick ELK stack deployment on a single host:

  1. Clone the example on the salt-master server: git clone https://github.com/saltstack-formulas/elasticsearch-logstash-kibana-formula.git
  2. Move the state files to the salt repo directory: mv elasticsearch-logstash-kibana-formula/kibana /srv/salt/
  3. Apply to one of your guest VMs: salt vm_name state.sls kibana
     ...
     ...
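If you would rather assign states through a top file than call state.sls directly, a sketch of /srv/salt/top.sls might look like this (the 'vm_name*' target glob is a placeholder for your own minion naming scheme):

```yaml
# /srv/salt/top.sls -- sketch; adjust the target glob to your minion names
base:
  'vm_name*':
    - kibana
```

Then run salt '*' state.highstate to have every matched minion apply its assigned states.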