Show notes
Phil Zito 0:00
This is the Smart Buildings Academy Podcast with Phil Zito, Episode 234. Hey folks, Phil Zito here, and welcome to Episode 234 of the Smart Buildings Academy Podcast. In this episode we are going to be diving into troubleshooting processes. This episode is sponsored by our Troubleshooting Fundamentals webinar. This completely free webinar will be occurring at 12pm Central Daylight Time on Wednesday, January 13. So if you are interested in learning about troubleshooting, and you've found these podcasts, or maybe these videos if you're watching this on YouTube, very helpful, then I encourage you to go to podcast.smartbuildingsacademy.com/234. Once again, that is podcast.smartbuildingsacademy.com/234. This free webinar will help you create and apply your own troubleshooting methodology. You'll learn ours, but then you'll learn how to create your own, and you'll be able to apply these methodologies to your service and operational teams. Or, if you're in service yourself, you will find a tremendous amount of value in this webinar. Plus, we'll look into a variety of different troubleshooting scenarios and work through those. So to recap real quick, if you haven't been following the podcast or any of our content around troubleshooting: the troubleshooting process consists of three steps. It's a relatively simple process in and of itself, but even if you follow it, you will still find that things can get pretty complex pretty fast. That's just the nature of troubleshooting, and that's why the process in and of itself is very simple. The first step in the troubleshooting process is to figure out the desired state. Oftentimes we show up at customer sites in response to hot calls, cold calls, etc., and the information we get is not very clear. We don't know exactly what the issue is.
So one of our primary goals, in order to troubleshoot a system for proper operation, is to determine that system's desired state. We need to understand how the system should be functioning: what would be going on if the system was, and I'm making air quotes here, "operating correctly." So we want to figure that out. We want to understand that if the system was operating correctly, it would be in this desired state. And then we need to determine the actual state. We need to look at the system and figure out exactly what is actually happening, what the state of the system is. Determining these two things is going to be critical for us to do what's called a root cause analysis, which is step three. The root cause analysis is where you're going to spend the majority of your time. The reality is, if you were to simply look at the desired state and the actual state and look at the delta between those, which is the difference between those, that oftentimes may get you to the troubleshooting resolution, but there are many times it will not. So for example, say you got an issue with a space and you started troubleshooting the terminal unit. Maybe the terminal unit space temp was too high or too low, and you're trying to figure it out. And you're saying, okay, based on the deviation between desired and actual state, the root cause is that there's an issue with the terminal unit. Now, if you didn't factor into all of this that multiple terminal units were having the exact same issue, you wouldn't figure out that, hey, maybe I should go troubleshoot the air handler. Or maybe every air handler is actually having an issue as well, so maybe I need to go check the hydronic plant. So root cause can be a nested scenario.
And what I mean by nested scenario is that as you dive into root cause, as you start to analyze root cause to try to figure out what it is and what is causing the issue, you can actually find yourself in: oh, it's a terminal unit, but it's actually all terminal units, so it's the air handler. Oh, but it's actually all air handlers, so it's the hydronic plant. So going and finding root cause is typically going to be the most complex aspect of any troubleshooting. To that end, let's talk about a couple of different common troubleshooting scenarios. Being that we are in the winter season at the time of this recording, I'm sure a lot of you are getting no-heat calls. You're getting issues from customers where they're like, it's just too cold in here, it's not warm enough in here, we're having issues with heat in the space. So the desired state may be that the zone temp is 72, but the actual state may be that the zone temp itself is 68 degrees.
Now, these terminal unit calls are what I call downstream troubleshooting calls. Downstream, meaning that they're not a primary system, they're a tertiary system, right? Our hydronic plant is usually our primary system. That's where heat originates in the form of combustion, and we're creating BTUs that can then be transferred from the hot water loop into the air handler airstream and then into the terminal units. That's one way, right? The other way is we're going directly from the hydronic plant to hot water coils in the reheat boxes. And then there's a third way, which is actually using electricity. So the first thing we'd want to figure out is: is this happening at all terminal units? And what kind of heat are we using in these terminal units? If we find that we're using electric heat, then we are pretty certain that, short of no airflow at the unit, we are going to be troubleshooting the terminal unit itself. Now, suppose we find out that the heat is supplied hydronically. Maybe we've got a constant volume air handler and we're controlling discharge, or maybe we've got a variable air volume air handler and we're controlling the hot water at the terminal units. Whichever one it is, maybe we need to go and troubleshoot the hydronic plant. So we're starting to work this out.
So what I like to do is get my troubleshooting notebook. I keep a dedicated notebook for troubleshooting, and funny enough, even not being in the field anymore, I still use it. Whether I'm working on AV issues, maybe for our film studio, or on network issues, or on automation issues for our SharePoint site, I still use my troubleshooting notebook. And the reason why is I diagram out what the physical system looks like. What is sourcing the physical system? How am I getting heat to the system? How am I getting air to the system, right? And in the case of my world, obviously, I would be looking at how I'm getting data flowing into my SharePoint site, how my Power Automate flow sequences are working, how our Asana sequences are working, etc. Nonetheless, I draft out how things are operating, how things fundamentally are functioning. And then I start to look at the sources that could be the root cause. Obviously, if it's a terminal unit, our primary mechanism for adding heat to a space or removing heat from a space is going to be airflow. Because of that, our primary focus when troubleshooting spaces and terminal units is first going to be establishing that we have proper airflow. So my initial root cause analysis will be focused on airflow. I will check that we have airflow coming into the supply side of the box and that we have airflow coming out of the discharge side of the box. Then, once I've validated flow, I will validate control of flow. So I will go and make sure that my damper is modulating flow and that my airflow sensor in the terminal unit is reflecting that modulation. Another point I check is airflow coming out of the diffusers, to make sure no balancing dampers are shut. Once I've validated airflow and airflow control, then I need to go and ensure that I am getting the proper temperature to the airflow, because that is how I'm going to be transferring BTUs into the space for heating of the space.
So now I'm going to be looking at my different heat sources. If I have electric heat, I'm going to validate that the electric heater is working. Ideally, I would have a discharge air temp sensor on any reheat boxes, which would enable me to very quickly determine if the heating is working from the terminal box's heat source. If I have hot water, and I have a hot water coil with maybe a floating actuator or a proportional actuator, I'm going to drive that actuator to 100% and validate that the discharge air temperature changes. If the discharge air temperature does not change, then I'm going to check valve operation and then work upstream to the hydronic plant. Now, how do you determine if it's an air handler issue, a hydronic plant issue, or a terminal unit issue? My general rule of thumb is that if the issue is only occurring at a single terminal unit, then it is typically not from an upstream unit. So if I've got a terminal unit that has no air, and it's the only terminal unit that has no air, then it's typically not the air handler. Maybe there's an isolation damper blocking the supply side; that could be possible, but most likely not. If I have a hydronic issue and it's only with a single terminal unit, then it's most likely not the hydronic plant. I may have an isolation valve closed before the heating coil, but it is most likely not a hydronic plant issue. Now, if it's occurring in all the units, then I shift gears and I start to troubleshoot the actual hydronic plant or the air handler. So I hope you see how that works. Having that thought process is going to save you a significant amount of troubleshooting time and really keep you focused on that root cause.
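The rule of thumb above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not a tool from the episode: the function name `suggest_starting_point`, the unit names, and the input layout (air handler name mapped to its terminal units and whether each has the complaint) are all assumptions made for the example, and real sites will have partial failures that this simple rule does not cover.

```python
def suggest_starting_point(faults_by_ahu):
    """Suggest where to start root cause analysis.

    faults_by_ahu is a hypothetical layout: air handler name ->
    {terminal unit name: True if that unit has the complaint}.
    """
    total_faulted = sum(sum(units.values()) for units in faults_by_ahu.values())
    if total_faulted == 0:
        return "no fault reported"

    # An air handler is suspect only when every one of its (multiple) units is affected.
    affected_ahus = [
        ahu for ahu, units in faults_by_ahu.items()
        if len(units) > 1 and all(units.values())
    ]

    # Every air handler affected: work upstream to the hydronic plant.
    if len(faults_by_ahu) > 1 and len(affected_ahus) == len(faults_by_ahu):
        return "hydronic plant"
    # All units on at least one air handler affected: start at the air handler.
    if affected_ahus:
        return "air handler"
    # Otherwise the issue is isolated, so start at the terminal unit itself.
    return "terminal unit"


site = {
    "AHU-1": {"VAV-1": True, "VAV-2": True},
    "AHU-2": {"VAV-3": False, "VAV-4": False},
}
print(suggest_starting_point(site))  # air handler
```

The point of writing it down this way is the order of the questions: check how widespread the complaint is before opening up any single box.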
Moving on to another scenario: no server access. Oftentimes we show up to a site and we have no access to a server. Maybe we are unable to access the BAS server, or we can't get to the user interface, and so on. What we want to understand is desired state and actual state. The next two scenarios are going to be logical troubleshooting scenarios; they're going to be focused on server and BACnet communication. In the case of a server, we're dealing with physical devices, but we're also dealing with logical devices and settings. You can't physically touch a route, you can't physically touch a firewall. Well, I guess you could touch a firewall if it's an appliance, but in most cases it's software. So you can't physically touch these things. It's not like a pressure switch or a low-temp switch, which I can go physically validate. So troubleshooting IT and technology issues can be quite difficult.
What we want to focus in on here, though, is once again our desired state and our actual state, and then identifying a root cause. So if our desired state is being able to access the server, but our actual state is that we're able to get to the login screen but we're not able to log in, or at least the customer is not, that is a completely different root cause and troubleshooting path than if our desired state is to access the server and our actual state is that when we enter the server's IP address, we get nothing. Those are two totally different troubleshooting scenarios, and you should approach them completely differently. Let's talk about the first and then we'll move on to the second. So in the first scenario, we're able to get to the login screen, but we're not able to log in, or at least the customer is not. This is oftentimes a simple fix, with either permissioning issues or users just fat-fingering their passwords and/or their usernames. So if we enter the administrative user login and password, which we know has appropriate permissions, and we're able to get in the system, then we can real quickly validate that the root cause actually has something to do with the user's credentials. If we are not able to log in with the administrative account, but we are able to get to the login splash screen, then that could be an indication of different web server issues. Now, going to the case where we're not able to even hit the IP address of the server, we can start to troubleshoot things in a different manner. Whenever I am not able to access the system, actually, whenever I'm dealing with technology architectures in general, I always want to map out the physical topology. I want to lay out how things are connected, what IP addresses exist, which ports are connected, where routing should be occurring, where subnet boundaries should be occurring, etc.
Let's say that we had the server on subnet one and the device trying to access the server on subnet two. That would be a completely different troubleshooting scenario than if the device trying to access the server was on the same subnet as the server. Because if it's on the same subnet as the server, I'm most likely not dealing with routing issues, and most likely not dealing with too many IP issues, maybe an IP address mismatch. So if I'm on a separate subnet, then I'm going to see: can I ping the server? If I can ping it, then that tells me that I do not have a routing issue. That is the quickest way to determine intercommunication between devices. Next, I'm going to put a device on the same subnet as the server and try to connect to it. If I can connect to it, and that ping from the previous subnet worked, then that most likely tells me there's a firewall or some sort of port filtering mechanism at the router blocking things. If I'm still not able to connect to the server, then that tells me it may be a different issue. Now, if I'm on the same subnet as the server and I can't connect to that server, but I can ping it, then I can start to look into things like: what port am I trying to connect on? Am I trying to use port 80 when it's only port 443? Are there additional ports potentially stood up? Is there a duplicate web server? There are so many different root cause scenarios that we could dive through, but you start to see a couple of concepts that I hope you picked up in that brief troubleshooting scenario. First off, I physically drew out the topology. Just like I drew out the physical architecture for the terminal unit, I physically drew out the topology for the server. By doing this, I can now start to check things off. I can say, okay, the ports are up, the Ethernet medium and the wiring are good, our routes are up, our routers are good.
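Here is a rough Python sketch of that subnet-based reasoning. It uses the standard library `ipaddress` module to decide whether two addresses share a subnet; the function names `same_subnet` and `next_check`, the example addresses, and the wording of the suggestions are assumptions for illustration. The "cannot ping across subnets" branch is an inference from the walkthrough rather than something stated in it.

```python
import ipaddress


def same_subnet(client_ip, server_ip, prefix_len):
    """Return True if both addresses fall inside the client's subnet."""
    network = ipaddress.ip_network(f"{client_ip}/{prefix_len}", strict=False)
    return ipaddress.ip_address(server_ip) in network


def next_check(on_same_subnet, ping_ok, connect_ok):
    """Map what we observed to the next thing worth checking."""
    if connect_ok:
        return "connection works from this location"
    if not on_same_subnet:
        if ping_ok:
            return ("routing looks fine; retry from the server's subnet to test "
                    "for a firewall or port filter at the router")
        # Inference: no ping across subnets makes routing the first suspect.
        return "check routes between the subnets"
    if ping_ok:
        return "check the port (for example 80 versus 443) and the web server"
    return "check IP addressing, cabling and port status"


client, server = "10.0.2.20", "10.0.1.5"
on_same = same_subnet(client, server, 24)
print(on_same)                                                # False
print(next_check(on_same, ping_ok=True, connect_ok=False))
```

In practice `ping_ok` and `connect_ok` would come from what you observe on site; they are plain booleans here so the decision logic stays visible.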
And we can start working through what I like to call, I guess it would be, segmentation-based troubleshooting. So what I do first is I say: are there two subnets or one subnet? How's the topology? If it's two subnets, then I start to rule out things like routes, I start to rule out things like port filtering. If it's a single subnet, then I don't have to worry about routes, and I don't have to worry about port filtering. Now I can start to look at different issues. So by segmenting and slicing out your topology, you're able to really focus in on the issues that are most likely going to be the root cause of the scenario.
Moving on to BACnet devices being offline. BACnet MS/TP devices being offline is a very common thing that almost everyone is going to encounter at some point in their career. So when we're dealing with BACnet devices, specifically BACnet MS/TP devices, being offline, there is something we have to understand first, and this becomes a core underlying expectation whenever you're doing troubleshooting. It's why, in our troubleshooting bootcamp course, it is expected that you have basic IT and building automation knowledge in order to go through that course. Because in order to troubleshoot, you need to understand how a lot of core concepts work. For example, in serial RS-485 communication, you can have controllers that are offline and still have communication pass through. So if you were to go to a scenario where a customer said a controller is offline, or several controllers are offline, the desired state is that the controllers are online and the actual state is that the controllers are offline, and there are several ways you can start to slice this out. The first thing I like to do, once again, is go and get a topology. And really, I like to know what transformers are connected to the devices as well, because this is a very, very valuable piece of information for ruling out several different potential system faults.
I know that a lot of people like to talk about using oscilloscopes and things like that when troubleshooting BACnet MS/TP devices. Maybe I'm just the odd man out, but I have almost never had to use those, and I've troubleshot hundreds, if not thousands, of BACnet MS/TP devices in my pretty long service career. So I will tell you that when you start to approach troubleshooting from a systematic perspective, and you start to truly understand how systems function and what makes them work, you will find that resolution of system faults can be relatively easy. Now, you will always run into people who will point out that one once-in-a-lifetime scenario of "we ran an MS/TP trunk next to a piece of machinery, and that caused interference that we were only able to find with an oscilloscope." The reality is, though, that running into these scenarios is just not that common. All right, so let's talk about BACnet devices being offline. When we think about devices being offline, there's only a handful of reasons why they're going to be offline, and each offline type shows very specific indications of why that error is occurring. So for example, there's power, that's one cause, then end of line, duplicate MAC addresses or device IDs, baud rate mismatches, and polarity reversals. Obviously, there are also things like grounds and breaks and shorts. But those are the big reasons, and each one of those has direct signs of that failure. So as we start to do root cause analysis, if we notice a set of controllers is offline, and that set of controllers corresponds to a specific transformer, then we know it's a power issue. If we notice that after a specific point the controllers are no longer communicating, then that is most likely an end-of-line issue. A lot of folks will say that that could potentially be a comm bus issue.
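Those first two signs lend themselves to a quick sketch. The Python below is a simplified illustration, not a diagnostic tool: the function name `classify_offline`, the controller and transformer names, and the tuple layout (name, transformer, online flag, listed in trunk wiring order) are all made up for the example, and it only covers the two patterns just described.

```python
def classify_offline(trunk):
    """Classify an offline pattern on a trunk.

    trunk is a hypothetical list of (name, transformer, online) tuples
    in the order the controllers are wired on the trunk.
    """
    offline = [c for c in trunk if not c[2]]
    if not offline:
        return "all online"

    # Sign 1: the offline set matches everything fed by one transformer.
    offline_transformers = {c[1] for c in offline}
    if len(offline_transformers) == 1:
        transformer = next(iter(offline_transformers))
        if all(not c[2] for c in trunk if c[1] == transformer):
            return "power"

    # Sign 2: everything after a specific point on the trunk is offline.
    first_offline = next(i for i, c in enumerate(trunk) if not c[2])
    if all(not c[2] for c in trunk[first_offline:]):
        return "offline after a point on the trunk"

    return "inconclusive"


trunk = [
    ("VAV-1", "XFMR-A", True),
    ("VAV-2", "XFMR-B", False),
    ("VAV-3", "XFMR-A", True),
    ("VAV-4", "XFMR-B", False),
]
print(classify_offline(trunk))  # power
```

This is why having the topology and the transformer assignments written down matters: without them, neither pattern can be recognized.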
But usually, if you have reversed polarity or a break in your comm bus, that'll create enough noise to bring down the entire bus. Now, polarity reversals are also a pretty easy detection, right? The controllers stop working after a certain point, and it does not correlate to transformers. Duplicate MAC addresses or device IDs could show up as points or controllers going up, down, up, down. And a baud rate mismatch could result in slow communications or could bring down the trunk. So there are various signs. Obviously, in a single podcast we don't have the time to go through each one of those scenarios, but there are multiple different scenarios. So we would approach that by looking at the key indicators, the key signs that show us what the root cause is.
And then finally, the final scenario we'll talk through here is an air handling unit being down. We all know air handling units being down is also a common reason we get pulled out for service calls and things like that. Now we've got to think about how an air handler could go down. What does that mean? What would the desired state be, right? Maybe the problem shows up as all our spaces are hot, maybe it shows up as this space is not getting airflow, and that depends on: is this a VAV unit, or is this an air handler that's serving a single zone? So depending on the architecture and layout of the physical system, that could be indicating our desired versus our actual state, right? And the actual state may be that we're not getting airflow in the space, all of our zones are hot, etc. So now we have to work through our root cause. Whenever we're dealing with root cause with air handlers, or any air-moving system being down, we're going to have one of three scenarios. First, we're going to have an environmental variable, like temperature, humidity, or CO2, an issue with those, and each one of those points to a specific part of the air handler. Second, we're going to have an issue with airflow, which can show up as temperature, humidity, etc., but is also easily measured by determining there's no airflow. And then our third is going to be pressure, and this usually indicates an error in the sequencing of systems. Because if everything is flowing airflow-wise and we're getting a building static pressure issue, then most likely we've either got a sensor issue or an issue with sequencing our exhaust and supply systems properly. So once we determine our desired state, maybe it's to keep an auditorium cool or warm, and we're not, then we can go and determine our actual state, which is that it's not cool or warm. And we start to get to root cause.
Knowing that air is the primary mechanism for heat transfer and heat absorption, we're going to go and make sure proper airflow is being achieved. Once we've moved on from that, we can start to look at the sources of heat, as well as the sources of cooling, depending on our desired state, and we can start to determine if they are functioning. And that's pretty easy, right? We can look at our cooling coils or heating coils and determine that they are indeed producing a measurable delta T that corresponds with the valve position and flow rates. So for example, if we have design conditions, and we have a valve that's open 100%, and it was designed for a 12 degree delta T, we should get that unless we're having some fouling of the coil or some issues with the hydronic system. So we can start to work through these things in a logical manner. And that's what I want to impart on you once again: working through things in a logical manner. So let's recap. The three-step troubleshooting process is to figure out first the desired state, then to figure out the actual state, and then to determine the root cause. Oftentimes the delta between the desired state and the actual state points to the root cause; oftentimes it does not. We want to look at things from a systematic perspective. We want to really make sure that we document the system layout, the system topology. This is why it's so critical for you building owners who are listening to this to demand that you get proper redlines and proper as-builts. Make sure that your systems are accurately represented in the material that you're receiving at the end of the project. Also, once you are documenting and diagramming your systems as part of the troubleshooting process, you want to make sure that you're logging the actions you're taking.
That way you can refer back, especially when you're doing protocol, programmatic, or API troubleshooting, things that are really logical and very kind of obtuse. You start to troubleshoot these systems and things aren't really clear. You really want to be able to look back: did I check the baud rate? Did I check the device IDs? Did I go and check the endpoints? Did I make sure the ports are open? With all these logical things that you can't physically see, you really want to make sure you're checking them and logging them. Not only is this going to be helpful if you have to engage higher level support, because you can point to all the things you've already done, but the other benefit is that if you ever have to revert anything back that is a logical setting, you know what you changed, so you can revert it back.
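A paper notebook does this job well, but for anyone who prefers to keep the log electronically, here is a minimal Python sketch of the same idea. The class and field names (`TroubleshootingLog`, `LogEntry`, `check`, `change`, `changes_to_revert`) and the example values are all hypothetical; the only point is that every check is written down and every change carries the value it replaced.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class LogEntry:
    action: str                            # what was checked or changed
    result: str                            # what was observed
    setting: Optional[str] = None          # set only when something was changed
    original_value: Optional[str] = None   # value before the change
    new_value: Optional[str] = None        # value after the change
    timestamp: datetime = field(default_factory=datetime.now)


class TroubleshootingLog:
    def __init__(self) -> None:
        self.entries: List[LogEntry] = []

    def check(self, action: str, result: str) -> None:
        """Record a check that changed nothing."""
        self.entries.append(LogEntry(action, result))

    def change(self, setting: str, original_value: str, new_value: str) -> None:
        """Record a change together with the value it replaced."""
        self.entries.append(
            LogEntry(f"Changed {setting}", "changed", setting, original_value, new_value)
        )

    def changes_to_revert(self) -> List[Tuple[str, str]]:
        """List (setting, original value) pairs, most recent change first."""
        return [
            (e.setting, e.original_value)
            for e in reversed(self.entries)
            if e.setting is not None
        ]


log = TroubleshootingLog()
log.check("Checked baud rate", "matches on all devices")
log.change("Server IP address", "10.0.1.5", "10.0.1.50")
print(log.changes_to_revert())  # [('Server IP address', '10.0.1.5')]
```

Listing changes most recent first mirrors how you would normally back them out: undo the last change before the ones that came earlier.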
So to piggyback on that real quick before we end the episode: when you do change a state, put the original state, the original process variable, or, sorry, the original setting in the notes. Especially if you're changing anything like K factors, IP addresses, set points, or PID loop settings, make sure you have the original settings noted. Do not rely on the controller, the building automation system's logs, the audits, or anything like that to keep track of that, because it does not necessarily work. All right folks, thanks a ton for listening to this episode. I look forward to talking to you and engaging with you in the comments and the discussion sections. Be sure to go to podcast.smartbuildingsacademy.com/234. Once again, that's podcast.smartbuildingsacademy.com/234. If you really enjoyed this podcast episode, I encourage you to sign up for our troubleshooting webinar, which once again will be at 12pm Central Daylight Time on Wednesday, January 13. You'll find a sign-up link at podcast.smartbuildingsacademy.com/234. Thanks a ton. Have a great day. Talk to you in the next episode.
Transcribed by https://otter.ai