Pojačalo Next Silicon EP1: Engineering the Future of Computing I Eyal Nagar

You can watch the Pojačalo podcast on YouTube and Facebook, and listen to it on SoundCloud, Spotify, and Apple and Google Podcasts.


Transcript of the conversation:

Ivan Minić: Welcome to the first episode of the special we are producing with Next Silicon, in which, over the next four episodes, we will describe what they actually do, how they came to be, who the key people are, and why Belgrade is so important to this company. A few months ago you had the chance to listen to one of the key people, especially for the Belgrade office, Aleksandar Berić. And today my guest is Mr. Eyal Nagar. He is VP of R&D and co-founder of Next Silicon. And to start this conversation, let’s go with an easy question. What do you actually do? What is your job within the company?

Eyal Nagar: My role within the company is to make the vision of the CEO happen, but really to get the execution going: to build the organization, to deliver on customer needs and meet the business requirements, to keep things focused on delivery, and to get the team to deliver efficiently. In one highlight, I would say that I need to build an execution machine that delivers everything that the market needs.

Ivan Minić: Next Silicon is a really interesting company from the perspective of our market, because it’s very unusual to have a startup, a hardware startup, where the founding team and the key people have really serious experience, decades of experience in really important positions at key players on the global market. Just to illustrate this, can you take a little trip down memory lane and sum up your career up until you decided to be part of the founding team?

Eyal Nagar: Okay. So I started my career in ’92, after I graduated. I spent six years in the Israeli Air Force developing radars and systems, which are basically high-tech stuff. And then after six years I decided to go and join the industry, the startup industry in Israel. It was in ’98. I started in a startup company that did cable modems before we had cable modems. It was acquired by Texas Instruments. And then a switching company, an optical switching company. And then at some point I got a call from my ex-CEO saying, “Come join Intel, we are doing a startup within Intel”. It was a WiMAX technology to deliver mobile cellular to the phones. And since then I found myself 14 years in Intel doing innovative stuff, fulfilling missions that were considered impossible. We did WiMAX, I worked on a networking solution, an offload networking solution, Atom for mobile devices, 3D cameras, 2D cameras, what we call today RealSense, and doing a CPU for HPC. So I did a lot of multidisciplinary stuff, running chip design and firmware and software and user experience, and dealing with acquiring companies and driving business. So I had the luxury, or the luck, to do many interesting things in Intel. And basically the way I phrase it, I built all of my career for Next Silicon. Whatever I learned in the past, these are assets that really made me ready for Next Silicon. And for me Next Silicon was something that I did in the past. Okay, there was a lot of technology, but from the chip design perspective, and knowing how servers work and how networking works, etcetera, I came prepared for that. I don’t think we could have done Next Silicon without the experience of the founders and without the different angles that everyone brings in. 
The CEO is coming with a vision and deep technology understanding all the way from the application layer to the silicon. Even if he says he doesn’t understand silicon, he does understand, and he’s an autodidact who can turn an idea into a practical solution. This is something you cannot learn: looking at so many things that happen in the market and identifying what your product is, what your architecture is, and what things need to look like in the end goal. That’s an amazing capability of a CEO. And then we have the VP of architecture, Ilan, the co-founder, who can turn this multidisciplinary problem into a simple definition. And I’m bringing the execution skills and wide system understanding, and the ability to connect software and hardware together, because we could not have done that without the tight coupling of hardware and software engineers who understand each other.

Ivan Minić: I think it’s a very unique situation that you have such a team of founders who each bring extensive knowledge and experience, but also an understanding of what the industry needs and what the, let’s say, bottlenecks of the currently available solutions are. And you can only do that by being involved in cutting-edge stuff.

Eyal Nagar: Yes.

Ivan Minić: So, you know, when you worked in Intel, Intel was amazing and it was far bigger than any other competitor on that market. But what was the motivation, what was the reason for starting Next Silicon, the idea behind it?

Eyal Nagar: When I worked in Intel, I saw this disruption that’s happening around the world, from the mobile Arm ecosystem and the GPU ecosystem, and even in Intel I tried to push for looking from the application and system level and deriving what the chip needs. I always give this example: when we developed the Atom chip, we needed to support two gigabytes of memory. And then we saw Apple do it with 256 megabytes to support the tablet solution that they had. It was clear that when you control the operating system and you understand the needs of the software, you can build optimized hardware. It was tough to do that in Intel because it was based on general purposeness and supporting all kinds of software ecosystems. So for me it was clear that we cannot do incremental improvements. We have to come up with architectural innovation. So, meeting with Elad through a common friend, he stated this vision of, hey, there is a crazy idea that we can learn an application, understand the workload, and automatically adapt the hardware to support that workload. That by itself is the right vision in my mind, that the software, or the workload, dictates how the hardware works. By itself it resonated, and then, you always apply your knowledge to judge something. So I came with my deep knowledge in networking, where I worked on networking solutions, and I applied this architecture to networking. One of the bottlenecks in networking is that you get packets into the networking processor, and when you do software-defined networking, your packet rate is slow because it takes time to process the packet. 
When I applied it to data flow, you can literally do packet processing in a data flow manner: a packet comes in, you do the processing, and then another packet. The packet rate is amazing, and you’ve got the flexibility to write it in software, high-level software. For me that was the validation of the architecture, in addition to the HPC and compute stuff.
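
A note on the packet-rate argument in this answer: the gain from a data flow (pipelined) arrangement comes from overlapping processing stages, not from making any single packet faster. The following is a minimal Python sketch under stated assumptions. The stage names and timings are hypothetical and are not Next Silicon measurements, and the model ignores queueing and memory effects.

```python
# Hypothetical per-stage processing times for one packet, in nanoseconds.
# The three stages (parse, lookup, rewrite) are illustrative assumptions.
STAGE_TIMES_NS = [40, 100, 60]


def run_to_completion_rate(stage_times_ns):
    """Packets per second when one packet is fully processed before the next starts."""
    return 1e9 / sum(stage_times_ns)


def pipelined_rate(stage_times_ns):
    """Packets per second when every stage works on a different packet at once.

    In steady state the slowest stage sets the pace.
    """
    return 1e9 / max(stage_times_ns)


def packet_latency_ns(stage_times_ns):
    """Time for a single packet to traverse all stages (the same in both models)."""
    return sum(stage_times_ns)
```

With these made-up timings the pipelined rate is twice the run-to-completion rate, because the slowest stage (100 ns) rather than the sum of all stages (200 ns) limits throughput, while the latency of an individual packet is unchanged.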

Ivan Minić: If we take a look at how computational power developed over the past 50-plus years, it’s been mostly one way. It was a CPU, it was general purpose, it worked well, but it didn’t perform extremely well in almost any conditions. Then, as time progressed and the needs changed, the GPU became the big topic. Again, really good for some stuff, really bad for some other stuff. Then there were many kinds of specialized chips, let’s say, for specialized needs. Also, if it’s built for that particular purpose, it works very well in that particular case, and in any other it’s unusable. So these are the three common ways things were working up until now. What’s the idea behind this? You already said it can configure itself based on the workload, but let’s talk a little bit more about how it’s made, how it’s built, and what came, let’s say, prior to Maverick 2, which was unveiled a few weeks ago?

Eyal Nagar: Okay. So first of all, the philosophy that general purpose wins in the long run is something you can learn from history. There were offload engines for networking, for TCP/IP, for iSCSI, whatever. Now you see only software-based solutions with general purpose CPUs. There were hardware video decoders. You see general purpose video decoders now, or whatever. So an ASIC is a good solution, high performance for a certain period of time. But in the long run you will see general purpose. The main value of what we bring is the general purpose experience. You get the programming experience of a CPU, but the performance of the most efficient hardware architecture for parallel compute. Data flow is not something we invented. There are data flow solutions. Actually, we did data flow when we did networking chips, with a reconfigurable array to accelerate some portion of header processing. But the problem with data flow is to make it easy to program and easy to use. The benefit of data flow is clear: it’s the most efficient architecture in terms of power and in terms of parallel compute performance. When you use data flow there’s no overhead of dealing with how to manage instructions, how to store them. There are no instructions. There are actually compute graphs running on the data flow itself, on the grid of data itself. So it’s clearly more efficient in power, because all you have is the compute elements that you need for the compute. It’s much higher performance because you can parallelize tens of thousands of threads in the same context of the machine, on the same area of memory. It’s a parallelism beast. The problem was how to feed it and how to program it efficiently. And that’s what we did in Next Silicon. We invested a lot of software effort to program the data flow, to run the important portions of the application, while the user doesn’t care. 
He doesn’t need to bother with how to map this specific loop in the code or how to do the memory migration. That was the claim to fame, and we actually delivered it in Maverick 2. It’s not a vision. It is working in Maverick 2. You can take stuff off the shelf. You can take benchmarks like, for example, HPCG. HPCG is, let’s say, the most representative benchmark for the high performance computing world. We take HPCG from GitHub, compile it, run it, and we get amazing performance that surpasses CPUs and GPUs. And we get it at half the power, and we get scalability at the rack level without a lot of effort to modify even the MPI framework. We simply use MPI as is and we can do that. So since we are building on the existing software ecosystem and we are not modifying every line of code to manually tune it, it gets us to scale, to run applications out of the box with hardware today.
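
To make the idea of "compute graphs instead of instructions" from this answer concrete, here is a small Python sketch of data flow execution in general. It is a toy model under stated assumptions, not a description of how Maverick 2 works: the graph, the node names, and the `run_dataflow` helper are all hypothetical. Each node fires as soon as its inputs are available, so independent nodes can fire in the same step.

```python
import operator

# Hypothetical compute graph for y = (a + b) * (a - b).
# Each entry maps a node name to (operation, names of its inputs).
# There is no instruction stream or program counter in this model.
GRAPH = {
    "sum": (operator.add, ("a", "b")),
    "diff": (operator.sub, ("a", "b")),
    "y": (operator.mul, ("sum", "diff")),
}


def run_dataflow(graph, inputs):
    """Fire every node whose inputs are ready, step by step.

    Returns the final values and, for each step, the nodes that fired together.
    """
    values = dict(inputs)
    pending = dict(graph)
    steps = []
    while pending:
        ready = [
            name
            for name, (_, deps) in pending.items()
            if all(dep in values for dep in deps)
        ]
        if not ready:
            raise ValueError("graph has a cycle or a missing input")
        # All ready nodes read only values produced in earlier steps,
        # so they are independent and could run in parallel.
        results = {
            name: pending[name][0](*(values[dep] for dep in pending[name][1]))
            for name in ready
        }
        values.update(results)
        for name in ready:
            del pending[name]
        steps.append(sorted(ready))
    return values, steps
```

Calling `run_dataflow(GRAPH, {"a": 5, "b": 3})` fires `sum` and `diff` together in the first step and `y` in the second, which is the kind of parallelism that falls out of the graph structure without any scheduling written by the programmer.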

Ivan Minić: I think the crucial thing you’ve been talking about is the fact that basically you can run any code written in anything and, yes, the first time it will be inefficient. But already the second and third time it will get better, and really soon, usually in a matter of nano- or microseconds, it will be optimized to work as well as possible. And that “as well as possible” will outperform pretty much anything by, you know, 10 times or six times or whatever. And the other thing is the efficiency itself in terms of power. Power has been a big topic for the past couple of years, because every GPU has been using an insane amount of power all the time, and the biggest issue now in building a data center is not where or how we’re going to cool it down, but where we will find the electricity. You’ve been working on this for roughly eight years. This is not the first thing that came out of Next Silicon, not the first thing that people could see. What was the progress up until Maverick 2?

Eyal Nagar: How did we go from the beginning to Maverick 2?

Ivan Minić: Yeah.

Eyal Nagar: So when we started, you always convince yourself that, I’ll just do some MVP and that’s all, it will be okay. Although I did it a few times before and I knew that to do a startup, you usually need to go all the way. But it’s like giving birth. After three years you forget and you have another baby. So I forgot, and when we talked I knew that it’s not only software, we need to do chips, we need to do probably boards and DRAMs, etcetera, but I said, “Okay, we’ll just do the next thing”. You don’t look at all the way. You start with an idea. This idea sounds amazing, although it’s a big idea, with big work to do until you get to the core value of it. There are a lot of unnecessary things you need to build in order to bring this value of, hey, all I want to do is run on data flow, identify the hotspot, get telemetry, and then I need to build so much stuff to do that. Again, tons of things. So we started with a simulator. Okay, let’s run floating point to start with. And there’s a joke that when we first ran floating point on the simulator, it took us 14,000 cycles to run one floating-point operation. If you compare to Intel, it’s three cycles. And I, you know, grabbed my head: “Hey, did I leave Intel for that?” But we quickly fixed that, etcetera, and you see that the idea makes sense. So you take benchmarks and you run them on simulators and you show the value to customers. And you work in parallel. On one hand you need to drive business value and deliver something for customers to see and experience and give you feedback. On the other hand, you cannot ignore the product that you need to build. So on one hand we built a real chip, ran it on FPGA first, etcetera. On the other hand, we built the software simulator to demonstrate the claim to fame of the technology. 
And then I remember Supercomputing 19, where we started to meet with those supercomputing and national labs, etcetera. And I remember the jaw drop of some of the people when they eventually figured out what we are doing. There were ideas about building chips per workload, some ideas in the US and stuff like that. And when they met with us and they saw the idea, they said, “Okay, basically you are building a chip per workload, you’re just doing it smartly by reconfiguring the hardware”. And then we got some POs and some business deals. And we focused the product more and we taped out our first product. And the first chip came in, and we were like 15 hardware engineers who did that chip. It’s a rollercoaster to do a startup. You have a crisis every week that looks to you like the end of the world, and the week after you have another crisis and you forget the previous one. So it was not easy to get to the point that we taped out, but I had experience with startups and I always remember that ritual of, I have a crisis, after a week it’s done, it’s a new world, I forget it. So we all had the resiliency when we had this problem. You have to come up with resiliency, you have to remember your end goal and vision, and you have to execute efficiently for the next milestone. You don’t think about all the way, you think about your next three months with a big picture of the rest of the problem. So we got the chip and it worked in like two days. No one believed that. Again, talking about birth, getting a chip from the fab is like giving birth. You know surprises happen. And you actually need to be lucky to have a working chip. So many billions of transistors, what are the chances that you won’t have a problem? But we got it to work after two days. 
It was a crazy achievement. And we got to run this full, I don’t know, crazy concept: you run an application, you learn it as it runs, you have telemetry from the application itself, you understand what the hotspot of the application is. Now you decide to offload that hotspot. You need to compile it to the chip, to the data flow, online while it’s running, and then hijack the execution so that when you call this function, you stop it while it runs on the host, you put it on the chip, it runs on the chip correctly, and it goes back to the host when it finishes this hotspot. And all of that worked for us in the first silicon. So that by itself was science fiction for us, that we can do all of that flow and it works correctly, and we got some benchmarks to run at high performance. So we got a proof of the correctness and some proof of the performance of that one. And then we did the second generation chip, where we had to fix some architectural issues that we found. For example, running threads out of order instead of blocking each other, etcetera. So we learned from the mistakes and gen two came in. Again, a lot of concerns that it won’t work, and it worked after two or three days, and we got it to run the software and we got performance on real stuff. And we built special board designs with water cooling and things that big companies are building, and we did it with 300 million dollars. That’s what we did, and we did another chip in the middle, and we developed a RISC-V core because we needed a RISC-V next to our data flow. Not to confuse it with other technologies: it’s not an array of RISC-V CPUs or something like that. It’s a new architecture where, between those hotspots that are running on the data flow, there are fork points and sync points, and you need the RISC-V to stay next to the data flow, not to go back to the host and move the memory back. 
So we developed all of these things with three chips, all of them working in A-step, with so many layers of software that we developed across the globe. Compilers, runtime software, research on mapping compute to data flow, firmware that runs on the chip, the hardware itself, the chip itself, the board, the cooling, the system, all of that was done in those eight years with fewer than 400 people. On average it was 200 people over eight years. And now we have a working chip with working software in customers’ hands, running with performance at scale. A pretty amazing dream to fulfill.
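
The flow described in this answer (collect telemetry while the application runs, identify the hotspot, then redirect calls to an offloaded version that must return the same result) can be mimicked in a few lines of Python. This is only a software analogy under stated assumptions: the `offload_when_hot` decorator, the call-count threshold, and `dot_accelerated` are hypothetical stand-ins, and nothing here compiles to or runs on real hardware.

```python
import functools
import operator


def offload_when_hot(threshold, accelerated):
    """Count calls (telemetry) and redirect to `accelerated` once the function is hot.

    `accelerated` is a hypothetical stand-in for a version compiled for an
    accelerator; it must compute exactly the same result as the original.
    """

    def decorate(func):
        state = {"calls": 0, "offloaded": False}

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            state["calls"] += 1  # telemetry: how often is this function called?
            if not state["offloaded"] and state["calls"] > threshold:
                # Hotspot identified: from this call on, use the offloaded version.
                state["offloaded"] = True
            target = accelerated if state["offloaded"] else func
            return target(*args, **kwargs)

        wrapper.state = state  # exposed so the switch can be observed
        return wrapper

    return decorate


def dot_accelerated(xs, ys):
    # Stand-in for the offloaded implementation of the same computation.
    return sum(map(operator.mul, xs, ys))


@offload_when_hot(threshold=3, accelerated=dot_accelerated)
def dot(xs, ys):
    # Original "host" implementation: a plain loop.
    total = 0
    for x, y in zip(xs, ys):
        total += x * y
    return total
```

The caller keeps calling `dot` and never changes its code; the first three calls run the original loop and later calls are transparently redirected. The real difficulty the answer describes, compiling the hotspot for the data flow hardware while the program is running and moving memory, is exactly the part this analogy leaves out.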

Ivan Minić: You know what they say about startup founders: they sleep like babies, wake up every two hours and cry.

Eyal Nagar: Exactly.

Ivan Minić: So when I watched the launch, there were some really cool materials, but one of the things that was most interesting to me was a video where you take different pieces of code, run them, and put them in a Next Silicon profiler, and it visualizes everything for you. I mean, for you guys who made it, it’s probably boring; for me it’s colorful and nice and it explains how things are mapped, so to say, and how things are working. That also means that you had to develop a ton of these additional things, not just to support your development, but to support your users in using the whole thing. So once you’d done the things that were necessary to build a working chip, how was it with the customers?

Eyal Nagar: You explained it very well. Are you free for a job offer?

Ivan Minić: We can talk.

Eyal Nagar: Yeah, that’s a good point, because we have built something that no one has done before. Okay, data flow, general purpose compute, how do I debug that? How do I do breakpoints? How do I visualize the code? So we not only needed to invent the technology itself, we needed to invent things that are a necessary evil. What we call the profiler, the UI that you are talking about, is redefining how we think about debug and how we think about compute, and what happened from that experience is that customers started to look at this profiler, not even the core technology, as an amazing tool that tells them about their software. I mean, you tell me what my hotspot is, you tell me where my code is, what it looks like in terms of compute. So, all in all, the feedback from customers was: wow, that technology is cool. I can do so much innovation with it. I can relook at my code, I can vectorize things that I could not imagine. I can do memory optimizations I could not think of. The graph view of things is a killer capability that I can use to rewrite my code and innovate more. So it’s not only that we deliver something to run your code now, we deliver an infrastructure that lets our users and the researchers rethink how they do compute and how they use this infrastructure to run their stuff. So the feedback that we got from customers was so amazing, with such big excitement, and we’ve seen that at every supercomputing show that we go to right now. You know, we’re building stuff for high performance computing, and later it will go for AI. At every supercomputing show we feel like the idol in the room. I mean, so much excitement, and they’ve been waiting so much time to get Maverick 2 to run in their labs and to really deliver performance. That gave us the energy to overcome all the problems and simply be persistent in delivery.

Ivan Minić: A friend of mine, who has a PhD in economics, once said that if you really want to calculate something serious as an economist, you hire a physicist to do that.

Eyal Nagar: Yeah.

Ivan Minić: And many of the applications are that deep science, whether it’s climate simulation or stuff like that. It was so interesting to me to, you know, research these topics. The number of things you take into account in order to properly try to simulate what’s happening on this planet is just ridiculous. And that’s just one application you can use something like this for. So one of the things that’s also part of this big launch is the fact that you have one more, a bit simpler, so to say, chip. What’s the story behind that?

Eyal Nagar: Okay. I mentioned it a bit, but we needed some low-end processor, not so low-end, because it needs some parallelism capabilities or vectorization capabilities, etcetera. Basically, we divide the type of code that we have in an application into likely code, that’s the parallel section that happens many times, and unlikely code, that is a serial section that happens in the beginning. You have probably a lot of lines of code, but it’s not happening so much. And then there is what we call semi-likely, between likely and unlikely. So the unlikely runs on the host CPU. The likely runs on the data flow. In between you have some semi-likely code that you don’t want to go back to the CPU to run, you want to keep it on the chip. So we had to develop a RISC-V solution, or whatever control solution, that takes care of these semi-likely flows. We decided on RISC-V because it’s an open platform and you can innovate and you can add some special instructions to, for example, inject directly to the data flow. Once we did that, we hired a top-talent team to do it. That’s the key belief in the company: we want to hire the best talent that can do the stuff we are doing. So we developed a core that is configurable to run as a low-end core and to run high performance compute. And then we said, “Okay, so we will also look further into the future”. Again, as I mentioned, we have a vision to accelerate everything that is compute intensive. HPC, AI, networking, whatever comes in the future, we want to accelerate that. We want to give a better solution for that. So we knew that we need to have a CPU and an accelerator in order to have a better system solution, where you can run stuff that needs a single thread on a CPU and you can easily migrate it to run on the accelerator, because sometimes the bottlenecks come from Amdahl’s law, from moving serial to parallel, parallel to serial. 
So we decided that if we develop a RISC-V, we will develop a performance core, and we will make it configurable to be an efficiency core or a performance core. So during the development of Maverick 2, we also developed this high-end RISC-V CPU that can actually power a server-grade solution. So we have partners and customers that are looking at that next generation chip, which will be based on this test chip that we built, which has two RISC-V cores and a fully coherent system. So the next generation will be a CPU and an accelerator, or even potentially a standalone CPU that can run server workloads. And we’ve got everything to do that. I think we have the most advanced RISC-V core for high-end workloads, for running server-class stuff.
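
The reference to Amdahl’s law in this answer explains why the serial and semi-likely parts of an application matter even when the parallel part is accelerated heavily. The standard formula is speedup = 1 / ((1 - p) + p / s), where p is the fraction of run time that is accelerated and s is the acceleration factor for that fraction. A short Python sketch follows; the example fractions mentioned after it are illustrative assumptions, not figures from the conversation.

```python
def amdahl_speedup(parallel_fraction, accel_factor):
    """Overall speedup when `parallel_fraction` of the run time is sped up by `accel_factor`.

    Amdahl's law: 1 / ((1 - p) + p / s). The remaining (1 - p) still runs
    at the original speed and bounds the achievable speedup at 1 / (1 - p).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if accel_factor <= 0:
        raise ValueError("accel_factor must be positive")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / accel_factor)
```

For example, if 90% of the run time is accelerated 10 times, the overall speedup is 1 / (0.1 + 0.09), a little over 5, and no acceleration factor can push it past 10 while the remaining 10% stays serial. That is the general argument for keeping the in-between code close to the accelerator instead of paying for a round trip to the host.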

Ivan Minić: This is all really interesting, and I love talking about stuff like this because this is why many of us decided to be engineers. Some of us were not good enough to become part of a team like this, but we are sitting in Belgrade, we are talking in your new Belgrade offices. It’s a really big, really nice office, with really nice chairs, tables, monitors and everything, but the most important thing is you have 80 people working in Belgrade, and I know some of these guys, they’re really cool. And the most amazing thing about them and their work is that they are actually a really big and important part of building this thing. They are not support staff, they are frontline players. How did this happen?

Eyal Nagar: Believe it or not, when we started this site of Next Silicon in Serbia, this is how we imagined it, a hundred people. Actually not exactly that. It went beyond our expectations. We thought we would focus on hardware and low-level software; eventually we have all the disciplines here. I’ll talk about it in a second. But when we started with Alex Berić, first of all, the introduction came from our investors, Playground, from Sasha, who is a Serbian guy who knows what’s going on here, and he connected us to Alex Berić. So he sent the name to us and I looked at the name and said, “Okay, I know this guy”. We worked together in Intel on the camera that we developed. We actually acquired a company called Silicon Hive and we jointly developed a 2D camera for Skylake, for the Intel chip. So I know this guy, I trust him, he’s talented. Let’s start. Okay, we have a leader, we can build around him and we can build a meaningful site. When we talked, I always told Alex, it’s not about finding an outsourcing or low-cost solution, it’s about building an extension to our team. Whatever we do in Israel, we do in Serbia. It doesn’t matter where you find the talent, and actually that’s our motto when we look for talent. When we find the talent in some geography, yes, there is some inefficiency of remote work, but we appreciate the talent, and we have talent in Zurich, in Switzerland, we have talent in Berlin. And here we started with, “Okay, let’s do RTL and front-end design and maybe some firmware design, etcetera”. It’s not like outsourced work. I mean, everything that we do in Israel we do here. These guys here in Serbia can now actually do a chip. You have the skills; again, that was the vision. We build the skills to do a product here in Serbia. It’s not some job that we throw over, that we don’t want to do. 
It’s the same exciting job that we do in the headquarters that we do here in Serbia. We have research guys that are doing the most advanced research that we do in the company. We have compilers, we have performance modeling and architectural modeling. We have, and I need to be careful not to forget, device runtime, we have drivers, we have firmware, we have RTL front-end design, we have RTL verification. So you have the full-blown skills to build a product here in Serbia. That was the vision, and, you know, we always try to copy from things that we know. The model that the founders and Alex discussed was, let’s copy from Israel. You build the product capabilities and you build the full pride of a local team that can deliver, not just do tasks. They can really deliver, and we are so happy that this site is so meaningful for the success of Next Silicon. The second generation, this RISC-V chip, and the third generation that is coming, we could not have done without the Serbia site. We are really the same team, different geographies.

Ivan Minić: One of the things that never happened to me before in my life, and I’ve been through many things: when I first discovered what you guys have been working on, and met Mr. Berić and we talked, I was like, “I have a couple of friends that are really interested in these topics. Maybe you can work together, maybe you can figure out something”. And he explained it to me, and from what I later learned about the company and the product, you guys are basically looking for the top 1% of the top 1% of people, and the idea is not to have 5,000 people working on something, the idea is to have 400 amazing people working on amazing things. So, and we will talk more about this in the next episode with Alex with us, but from your perspective, how come it’s possible to find 70 people in Belgrade to work on this?

Eyal Nagar: Actually, this was also not easy. It was a rollercoaster. Alex had some disappointments when we declined to accept some people. But it paid off. The patience to bring in A-plus players paid off. And once you bring in A-plus players, they bring A players, and you keep the talent and you can do 3x with the same team. So, yes, I’m surprised that we were able to get to 80. I think we can get to 160. I think there is enough talent here. Some of them may be young but they are super talented. Again, there are people here who are at the top of the company, not just locally top players. And it pays off, because then you don’t have inefficiencies and you don’t need to spend your time on low-quality stuff. Everything is 3x value. You just deliver. And it’s better to do more with less than to do less with more.

Ivan Minić: When I was teaching project management in college, digital project management, a friend of mine, who is a crazy guy like some of the guys you have here, told me, “Well, project management was developed so that incapable, incompetent people could perform a task, usually inefficiently, but through the tools and measures and everything, in the end you end up with something that can pass”. But amazing people do not need micromanagement.

Eyal Nagar: Exactly.

Ivan Minić: They have drive, they have passion, and even if, you know, they are not big talkers, it’s too important for them to keep their mouths shut. And that’s something that I see every time I come here. People are really excited about working on this. And it looks very nice. It’s colorful. It’s complicated.

Eyal Nagar: Yes.

Ivan Minić: You’ve done an amazing job. Thank you so much, and in the next episode you will be joining me and Alex, and we will talk more in English.

Eyal Nagar: Thank you very much. I enjoyed being here and I’m really really proud of what we’ve achieved here.

Ivan Minić: Thank you.

Eyal Nagar: Thank you.

Ivan Minić: Thank you guys for listening to us. See you next week.
